Message boards : Graphics cards (GPUs) : NOELIA WUs getting "stuck"
ETQuestor (Joined: 11 Jul 09, Posts: 27, Credit: 1,000,618,568)

I am seeing about 10% of my NOELIA WUs getting "stuck" - the "fraction done" output stops moving. This seems to happen most often with "run4" WUs, but I have also seen it with the other run numbers. If I restart BOINC, the WU starts over from 0.00000. Sometimes it will freeze again at another spot, sometimes the WU will finish successfully after the restart, and sometimes it errors out.

Link to the machine: http://www.gpugrid.net/show_host_detail.php?hostid=111125
Linux x86_64 (Fedora 17 3.4.6-2.fc17.x86_64)
NVIDIA UNIX x86_64 Kernel Module 304.37
GeForce GTX 560 Ti

(Joined: 16 Jul 12, Posts: 98, Credit: 386,043,752)

Hmm, I haven't encountered this issue, but I run Windows, so that might be why. For me, tasks sometimes restart when BOINC does, but not every time. Sorry I can't help you.

K1atOdessa (Joined: 25 Feb 08, Posts: 249, Credit: 444,646,963)

This has happened to me twice so far (I am running Windows). The first time, it was two days before I noticed, so I just aborted the task (utilization was practically 0% and progress was not moving). The second time, I closed and reopened BOINC, after which the WU errored out. Kind of a pain, but I'm just keeping an eye out at this point.

Retvari Zoltan (Joined: 20 Jan 09, Posts: 2380, Credit: 16,897,957,044)

One of my hosts is crunching such a workunit right now. It has been running for 21h26m and is at 78.320%, progressing very slowly. This type of workunit usually takes less than 7 hours to complete. I've tried pausing and restarting the task, and I moved it to another GPU in the same host, but there is no change in its speed. I've double-checked that none of this host's GPUs is downclocked.

skgiven (Joined: 23 Apr 09, Posts: 3968, Credit: 1,995,359,260)

The same workunit was aborted on another system with these verbosely challenged details:

Stderr output
<core_client_version>6.12.34</core_client_version>

Retvari Zoltan (Joined: 20 Jan 09, Posts: 2380, Credit: 16,897,957,044)

It finished after 27 hours... Stderr output:

<core_client_version>6.10.60</core_client_version>
<![CDATA[
<stderr_txt>
MDIO: cannot open file "restart.coor"
No heartbeat from core client for 30 sec - exiting
# Time per step (avg over 545000 steps): 19.800 ms
# Approximate elapsed time for entire WU: 99000.143 s
called boinc_finish
</stderr_txt>
]]>

Since then my host has finished a couple of workunits without any problems, and without any restart.

ETQuestor (Joined: 11 Jul 09, Posts: 27, Credit: 1,000,618,568)

I was also seeing this issue with the much shorter "trypsin_lig" runs. When these ran successfully, they ran very quickly (under an hour). The CPU has been under 20% utilization and the GPU temperature and fan readings are normal. I have been running other ACEMD and long runs on this same machine without issue for almost a year.

ETQuestor (Joined: 11 Jul 09, Posts: 27, Credit: 1,000,618,568)

Here are some examples of failed WUs - three errored out and two I aborted after they got stuck:

http://www.gpugrid.net/result.php?resultid=5726005
http://www.gpugrid.net/result.php?resultid=5724879
http://www.gpugrid.net/result.php?resultid=5723761
http://www.gpugrid.net/result.php?resultid=5719313
http://www.gpugrid.net/result.php?resultid=5713069

skgiven (Joined: 23 Apr 09, Posts: 3968, Credit: 1,995,359,260)

We are seeing three different errors here. Zoltan's system had a "No heartbeat from core client for 30 sec" error, and ETQuestor had two different errors (three tasks were aborted). One error was a SIGABRT and the other was an "energies have become nan" error:

SWAN : FATAL : Cuda driver error 700 in file 'swanlibnv2.cpp' in line 1574.
acemd.2562.x64.cuda42: swanlibnv2.cpp:59: void swan_assert(int): Assertion `a' failed.
SIGABRT: abort called
Stack trace (15 frames):
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(boinc_catch_signal+0x4d)[0x551f6d]
/lib64/libc.so.6(+0x359a0)[0x7fdff7ad39a0]
/lib64/libc.so.6(gsignal+0x35)[0x7fdff7ad3925]
/lib64/libc.so.6(abort+0x148)[0x7fdff7ad50d8]
/lib64/libc.so.6(+0x2e6a2)[0x7fdff7acc6a2]
/lib64/libc.so.6(+0x2e752)[0x7fdff7acc752]
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x482916]
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x4848da]
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x44d4bd]
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x44e54c]
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x41ec14]
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(sin+0xb6c)[0x407d6c]
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(sin+0x256)[0x407456]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fdff7abf735]
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(sinh+0x49)[0x4072f9]
Exiting...

-- ERROR: file deven.cpp line 1106: # Energies have become nan

Each of these errors can be caused by more than one problem, and there have been suggestions about them in the past. The "No heartbeat" error can simply be caused by not having enough access to the CPU, but it is often due to not being able to write to the hard drive because other tasks are thrashing the drive (I think any process that reads from or writes to the drive is prioritised over anything BOINC does). Things like automatic disk defrags, Windows search indexing and AV scans can trigger this. I have also experienced this error when a SATA cable became loose (the sub-standard red ones)! I suspect opening or leaving the Event Log window open might contribute as well (if you have lots of cc_config flags set).

The SIGABRT (an abort-task signal) and the Not-a-Number errors could well be task related, but could also be Linux setup/driver/library issues. In the past, similar errors were supposedly caused by BOINC running CPU benchmarks, amongst other things. A libc.so.6 "double free or corruption" was reported back in January when crunching a TONI task, though there are no suggestions in that thread. Soneageman also reported this error; again, I don't see any specific helpful info there.

If the problem continues, look at the hardware (temps/fan speed/noise), the driver, and the BOINC client (config/updates).

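Two of the client-side factors named above - a chatty Event Log driven by many cc_config debug flags, and the periodic CPU benchmarks - can be dialled back in cc_config.xml. A minimal sketch, not taken from this thread and using only option names I believe exist in the BOINC 7.x client (verify against the BOINC client configuration documentation for your version):

```xml
<!-- cc_config.xml (hypothetical example): keep only the default log flags
     and skip the client's CPU benchmarks, which have been blamed for
     similar errors in the past. Option names assumed from the BOINC 7.x
     client; verify before use. -->
<cc_config>
  <log_flags>
    <!-- the three flags BOINC enables by default -->
    <task>1</task>
    <file_xfer>1</file_xfer>
    <sched_ops>1</sched_ops>
    <!-- leave the verbose *_debug flags off so the Event Log stays quiet -->
    <cpu_sched_debug>0</cpu_sched_debug>
    <mem_usage_debug>0</mem_usage_debug>
  </log_flags>
  <options>
    <!-- skip the client's CPU benchmark runs entirely -->
    <skip_cpu_benchmarks>1</skip_cpu_benchmarks>
  </options>
</cc_config>
```

The file goes in the BOINC data directory; re-read the config files from the Manager, or restart the client, for it to take effect.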
Retvari Zoltan (Joined: 20 Jan 09, Posts: 2380, Credit: 16,897,957,044)

"We are seeing three different errors here."
While this is true, it's not the source of the slowness of this workunit: it was slow right from the start. My rosetta@home tasks were going wild, using 400 to 850MB, so when I started Skype, the BOINC manager shut down every task because the memory used by BOINC applications exceeded the threshold of maximum usable physical memory (90%). Then it restarted the tasks one by one, but rosetta@home tasks read 1.3GB (and write 130MB) at startup, and since I don't have an SSD in this PC, this could overwhelm the file system, causing tasks to start and stop several times (because of the "No heartbeat from core client" error) and rendering my PC unusable for a couple of minutes. Since then I've doubled the RAM in this host (it now has 12GB).

The other 3 tasks running at the same time experienced this "No heartbeat" error, but they didn't slow down. This error makes the BOINC manager stop the task and restart it from the last checkpoint. Here is a list of my workunits which experienced this error but didn't slow down:

5743222 (2 times)
5743031 (2 times)
5742951 (2 times)
5742426 (4 times)
5741987 (2 times)
5741772 (2 times)
5741561 (2 times)
5741537 (2 times)
5740471
5740130
5739827
5739821

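The 90% figure Zoltan hits is the client's "use at most X% of memory" computing preference. If large CPU tasks such as the rosetta@home ones above keep pushing BOINC over that limit, the threshold can be raised locally. A hedged sketch of a global_prefs_override.xml, using the tag names as I understand them and purely illustrative percentages (double-check both against the BOINC preferences documentation):

```xml
<!-- global_prefs_override.xml (hypothetical example): raise the memory
     thresholds that trigger BOINC's "exceeded memory limit" suspension.
     Values are percentages of physical RAM; 90/95 are example numbers,
     not a recommendation. Tag names assumed; verify for your client. -->
<global_preferences>
  <!-- when the computer is in use, let BOINC apps use up to 90% of RAM -->
  <ram_max_used_busy_pct>90</ram_max_used_busy_pct>
  <!-- when the computer is idle, up to 95% of RAM -->
  <ram_max_used_idle_pct>95</ram_max_used_idle_pct>
</global_preferences>
```

The more robust fix is the one Zoltan actually applied (more RAM): raising the percentages only postpones the point where every task gets suspended at once.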
skgiven (Joined: 23 Apr 09, Posts: 3968, Credit: 1,995,359,260)

Sort of an aside, as the (no heartbeat) 30-second no-response/stop-task issue is clouding the problem here, but I think a delay needs to be introduced during task startup/restart. This should really be done at the BOINC level rather than in the app or by the users; that said, a script to do so would be a good workaround, similar to the Linux startup delay, but on a task-by-task basis (allow a few seconds for each task to load).

If possible, using a secondary hard drive should help avoid this issue. That said, for normal usage you would want an SSD as the system drive to boost overall performance, especially startup/shutdown, rather than a drive used just for BOINC. Of course I'm guilty of buying an SSD just to support some of the more challenging projects, but then I like a challenge.

I really see this as a problem with Rosetta and BOINC. Basically, I think BOINC should always prioritise GPU projects over CPU projects, even if it means using delayed writes or suspending the CPU project. Frankly, I never want any CPU project to interfere with a GPUGrid long run for any reason (HDD, CPU, RAM...). If CPU projects could ascertain how much of the resources were available to them after the GPU project starts as a priority, this sort of problem should never happen.

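For what it's worth, a client-level version of the "Linux startup delay" skgiven mentions already exists as a cc_config.xml option: it delays all science applications at client startup rather than staggering them per task, so it only covers part of what he is asking for. A minimal sketch, assuming the <start_delay> option name is correct for your client version (merge it into any existing cc_config.xml rather than keeping a second copy):

```xml
<!-- cc_config.xml (hypothetical example): wait 60 seconds after the client
     starts before launching any science applications, giving the disk time
     to settle after a BOINC restart. This is a client-wide delay, not the
     per-task stagger suggested above; verify the option name for your
     BOINC version. -->
<cc_config>
  <options>
    <start_delay>60</start_delay>
  </options>
</cc_config>
```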
Raptures Riot (Joined: 30 Apr 11, Posts: 6, Credit: 220,588,795)

I am getting a lot of 'Energies have become nan' errors, the same as many others. This usually occurs 3 or 4 hours into the calculation. I'm presuming 'nan' means 'indeterminate', which is a legitimate conclusion for the model, so I do not understand why the calculation ends in an error. Please, if this result is useful info, can a 'completed successfully' be awarded? There seems to be some disenchantment in the forums over this topic. I know everyone here is dedicated and respectful and we try really hard for results. Let's understand why 'nan' can be a good 3-letter word.

Retvari Zoltan (Joined: 20 Jan 09, Posts: 2380, Credit: 16,897,957,044)

The nan error has its own thread.
