ATM

Message boards : News : ATM
Ian&Steve C.
Message 60638 - Posted: 6 Aug 2023, 12:16:04 UTC - in response to Message 60636.  
Last modified: 6 Aug 2023, 12:18:50 UTC

If it hurts when you <do that> then the most obvious solution is to not <do that>. This applies to most things in life.

It’s well known that these tasks do not like to be interrupted. If your power grid is that unstable then it’s probably best for you to crunch something else, or invest in a battery backup to keep the computer running during power outages.

These are still classified as Beta, after all, and that comes with the implication that things will not always work; you need to accept whatever compromises come with that. If you don't, your other option is to disable Beta processing in your profile and wait for ACEMD3 work.
ID: 60638

Quico (Project scientist)
Message 60651 - Posted: 16 Aug 2023, 10:44:28 UTC

Ok, back from holidays.
I've seen that in the last batch of jobs many Energy NaN errors were happening, which was completely unexpected.

I am testing some different settings internally to see if that overcomes this issue. If it is successful, new jobs should be sent out by tomorrow/Friday.

This might be more time-consuming, and I would not like to split the jobs into even more chunks (I have a suspicion that this gives wonky results at some point), but if people see that they take too much time/space, please let me know.
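
For context, the failure volunteers see is the job's own sanity check tripping when the simulation goes numerically unstable. Conceptually (a minimal Python sketch, not the actual ATM code) the guard looks like this:

```python
import math

def check_energy(potential_energy_kj_mol: float) -> None:
    """Sketch of the kind of guard that produces the 'Energy is NaN' abort.

    Illustrative only -- the real ATM/OpenMM workflow queries the Context
    for its potential energy each cycle. A NaN here usually means the
    integration blew up (poor equilibration, clashing starting
    coordinates, too large a timestep, and so on).
    """
    if math.isnan(potential_energy_kj_mol):
        raise ValueError("Energy is NaN.")
```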
ID: 60651

Quico (Project scientist)
Message 60652 - Posted: 16 Aug 2023, 10:50:32 UTC - in response to Message 60633.  

It does not make any difference to Quico. The task will be completed on one or another computer and his science is done. It is our very expensive energy that is wasted, but as Quico himself said, the science gets done, so who cares about wasted energy?

I'm pretty sure I never said that about wasted energy. What I might have mentioned is that completed jobs come back to me, and since I don't check what happens to every WU manually, these crashes might go under my radar.

As Ian&Steve C. said, this app is in "beta"/not-ideal conditions. Sadly I don't have the knowledge to fix it, otherwise I would. Errors on my end can be that I forgot to upload some files (has happened) or that I sent jobs without equilibrated systems (also happened). By trial and error I ended up with a workflow that should avoid these issues 99% of the time. Any other kinds of errors I can pass along to the devs, but I can't promise much more than that.
I'm here testing the science.
ID: 60652

ServicEnginIC
Message 60653 - Posted: 16 Aug 2023, 12:57:20 UTC - in response to Message 60651.  

This might be more time-consuming, and I would not like to split the jobs into even more chunks (I have a suspicion that this gives wonky results at some point), but if people see that they take too much time/space, please let me know.

I think that the most harmful risk is that excessively heavy tasks generate result files bigger than 512 MB.
The GPUGRID server can't handle them, and they won't upload...
ID: 60653

Keith Myers
Message 60654 - Posted: 16 Aug 2023, 18:07:55 UTC - in response to Message 60653.  

Absolutely, this is the biggest risk. A shame that you can't get Gianni or Toni to reconfigure the website's upload size limit to 1 GB.
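
If that cap is the standard BOINC per-file limit rather than a web-server setting, it is set per output file by <max_nbytes> in the workunit's result template on the server side. A sketch of the relevant fragment, with an illustrative 1 GB value (this is an assumption about how GPUGRID is configured, not something visible to volunteers):

```xml
<file_info>
    <name><OUTFILE_0/></name>
    <generated_locally/>
    <upload_when_present/>
    <!-- 1 GB cap; 536870912 would be the 512 MB limit discussed above -->
    <max_nbytes>1073741824</max_nbytes>
    <url><UPLOAD_URL/></url>
</file_info>
```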
ID: 60654

bluestang
Message 60655 - Posted: 17 Aug 2023, 14:27:48 UTC

Can you at least fix the "daily quota" limit or whatever it is that prevents a machine from getting more WUs?

8/17/2023 10:25:12 AM | GPUGRID | This computer has finished a daily quota of 14 tasks


After all, it is your WUs that are erroring out by the hundreds and causing this "daily quota" to kick in. This batch seems even worse than the previous one.
ID: 60655

Richard Haselgrove
Message 60656 - Posted: 17 Aug 2023, 17:05:27 UTC - in response to Message 60655.  

The tasks I received this morning are running fine, and have reached 92%. Any problems will be the result of the combination of their tasks and your computer. The quota is there to protect their science from your computer. Trying to increase the quota without sorting out the cause of the underlying problem would be counter-productive.
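
For what it's worth, the documented BOINC quota behaviour is roughly the following (a paraphrase in Python, not scheduler source): each error drags a host's daily allowance toward the floor, and only valid results grow it back, so raising the cap for an erroring host would change nothing.

```python
def update_max_jobs_per_day(current: int, project_cap: int,
                            result_valid: bool) -> int:
    """Rough paraphrase of BOINC's per-host daily-quota adjustment."""
    if result_valid:
        # Recovery is fast: the allowance doubles, up to the project cap.
        return min(current * 2, project_cap)
    # Each bad result shrinks the allowance, but never below one task/day.
    return max(current - 1, 1)
```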
ID: 60656

Ian&Steve C.
Message 60657 - Posted: 17 Aug 2023, 18:21:49 UTC

Windows hosts in general seem to have a lot more problems than Linux hosts.
ID: 60657

Richard Haselgrove
Message 60658 - Posted: 17 Aug 2023, 18:46:06 UTC - in response to Message 60657.  

Windows hosts in general seem to have a lot more problems than Linux hosts.

Depends what the error is. I looked at host 553738:

Coprocessors [4] NVIDIA NVIDIA GeForce RTX 2080 Ti (11263MB) driver: 528.2

and tasks with

openmm.OpenMMException: Illegal value for DeviceIndex: 1

BOINC has a well-known design flaw: it reports multiple GPUs as if they were identical, even when in reality they're different. And this project's apps are very picky about running on exactly the sort of GPU they've been told about. So, if Device_0 is an RTX 2080 Ti and Device_1 isn't, you'll get an error like that.

The machine has completed other tasks today, presumably on Device 0, although the project doesn't report that for a successful task.
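
A client-side workaround, assuming the odd card out is BOINC device 1 (check the Event Log startup messages for the real numbering), is to exclude it for this project in cc_config.xml so work only lands on the card the server was actually told about:

```xml
<cc_config>
    <options>
        <exclude_gpu>
            <url>https://www.gpugrid.net/</url>
            <device_num>1</device_num>
            <type>NVIDIA</type>
        </exclude_gpu>
    </options>
</cc_config>
```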
ID: 60658

Ian&Steve C.
Message 60659 - Posted: 17 Aug 2023, 18:48:58 UTC

Just anecdotally, most of the reported issues seem to be coming from Windows users with ATMbeta.
ID: 60659

roundup
Message 60661 - Posted: 18 Aug 2023, 4:18:08 UTC - in response to Message 60659.  

Most of the errors so far have occurred on the otherwise reliable Linux machine.
Since the app version is still 1.09, I had little hope that the 4080 would work on Windows this time - and rightly so.
When an OS/hardware combination works flawlessly on all other BOINC projects, it is not far-fetched to suspect the app is buggy when errors occur on GPUGrid.
I hope that the developers can look deeply into it again.
ID: 60661

Freewill
Message 60662 - Posted: 18 Aug 2023, 10:47:18 UTC

I have 11 valid and 3 errored tasks so far from this batch. I am getting the raise ValueError('Energy is NaN.') error. This is on Ubuntu 22.04, on single-GPU and dual identical-GPU systems. What really hurts is that the tasks run for a long time before the error comes.
ID: 60662

mmonnin
Message 60663 - Posted: 18 Aug 2023, 16:08:20 UTC

I completed my first of these longer tasks on Win10. Sometimes this PC completes tasks that failed elsewhere, and the opposite also happens.

This one failed 3x for the same person on different PCs, then a 4th time for someone else. All 4 PCs have over 100 failures at 1-2 minutes each. It seems these users are getting plenty past 14 per day.

https://www.gpugrid.net/workunit.php?wuid=27541862
ID: 60663

Richard Haselgrove
Message 60664 - Posted: 19 Aug 2023, 9:17:49 UTC

This is a new one on me: task 33579127

openmm.OpenMMException: Called setPositions() on a Context with the wrong number of positions
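
That one is easy to reproduce outside BOINC: OpenMM raises it whenever the length of the positions array doesn't match the particle count of the System behind the Context. A minimal Python illustration (not the ATM workflow itself; in a workunit it presumably means an input or restart file doesn't match the system being built):

```python
import openmm as mm
import openmm.unit as unit

system = mm.System()
for _ in range(3):                 # a System with 3 particles
    system.addParticle(1.0)
context = mm.Context(system, mm.VerletIntegrator(1.0 * unit.femtoseconds))

positions = [(0, 0, 0), (1, 0, 0)] * unit.nanometers   # only 2 positions
context.setPositions(positions)
# openmm.OpenMMException: Called setPositions() on a Context with the
# wrong number of positions
```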
ID: 60664

Erich56
Message 60665 - Posted: 19 Aug 2023, 11:58:50 UTC

One strange thing I noticed when I restarted downloading and crunching tasks on 2 PCs yesterday evening:
one of the two PCs got a new 6-digit host number in the GPUGRID system.
It's now 610779; before, it was 600874. No big deal, but I am wondering how that came about.
Does anyone have a logical explanation?
ID: 60665

Keith Myers
Message 60666 - Posted: 19 Aug 2023, 16:29:42 UTC - in response to Message 60665.  
Last modified: 19 Aug 2023, 16:31:17 UTC

BOINC thought it was a new host.

There could be many reasons: scheduler database issues, a corrupted or missing client_state.xml file, a change in hardware or OS, and so on.

All it takes is for the server to lose track of how many times the host has contacted the project, and it will issue a new host ID.
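
In outline, the scheduler-side logic is roughly this (a paraphrase of the documented behaviour, not actual BOINC source): the client presents its host ID and an RPC sequence number, and anything that makes those look inconsistent gets a fresh ID.

```python
def resolve_host(request, db):
    """Rough paraphrase of how a scheduler request maps to a host record."""
    host = db.find_host(request.hostid)
    if host is None:
        return db.create_host(request)   # unknown ID: create a new record
    if request.rpc_seqno < host.rpc_seqno:
        # The client is "behind" the server, e.g. a restored or rebuilt
        # client_state.xml; it may be a copy, so issue a fresh host ID.
        return db.create_host(request)
    return host
```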
ID: 60666

ServicEnginIC
Message 60667 - Posted: 20 Aug 2023, 11:45:45 UTC - in response to Message 60653.  

I think that the most harmful risk is that excessively heavy tasks generate result files bigger than 512 MB.
The GPUGRID server can't handle them, and they won't upload...

I see that one of the issues Quico seems to have addressed is just this one.
On this host with two GTX 1650 GPUs, I was processing these two tasks:
syk_m39_m14_1-QUICO_ATM_Mck_Sage_2fs-1-5-RND8946_0
syk_m31_m03_2-QUICO_ATM_Mck_GAFF2_2fs-1-5-RND5226_0
To test the sizes of the generated result files, I suspended network activity in BOINC Manager.
Now there are only two result files per task, both much lighter than in previous batches.
The problem of excessively heavy result files is solved, and, by the way, there is less stress on server storage.
ID: 60667

Richard Haselgrove
Message 60668 - Posted: 20 Aug 2023, 14:11:29 UTC - in response to Message 60667.  

Yes, I saw that, and it looks like good news. But I would say it's unproven as yet. This new batch has varying run times, according to which gene or protein is being worked on - the first part of the task name, anyway.

The longest ones seem to be shp2_, but even those finish in under 12 hours on my cards. I think the really massive 500 MB+ tasks ran over 24 hours on the same cards, so we haven't really tested the limits yet.
ID: 60668

wujj123456
Message 60671 - Posted: 21 Aug 2023, 2:41:07 UTC

I am a bit curious what's going on with this 1660 Super host. I occasionally check the errored-out units to make sure it's not just my hosts always failing, and I have noticed more than once that some host finishes them super fast. It has happened to some of my own valid WUs too. While this sometimes happens, what I hadn't seen until now is a host that either errors out or finishes super fast: https://www.gpugrid.net/results.php?hostid=610334

What makes a WU finish super fast when other computers error out? Are these fast completions actually valid results?
ID: 60671

bluestang
Message 60672 - Posted: 22 Aug 2023, 17:18:26 UTC
Last modified: 22 Aug 2023, 17:41:13 UTC

What's with the new error/issue...

8/22/2023 12:49:47 PM | GPUGRID | Output file syk_m12_m14_3-QUICO_ATM_Mck_Sage_v3-1-5-RND3920_0_0 for task syk_m12_m14_3-QUICO_ATM_Mck_Sage_v3-1-5-RND3920_0 absent


Of course, it may not be new... I just saw it in my log after having a string of errors.
ID: 60672