Units failing

Author	Message
Ryan Munro Joined: 6 Mar 18 Posts: 38 Credit: 1,405,292,080 RAC: 593	Message 58929 - Posted: 15 Jun 2022, 9:14:29 UTC
Can someone have a quick look and let me know what the problem is here? A few units computed fine but most errored out.
https://www.gpugrid.net/results.php?userid=524374&offset=0&show_names=0&state=5&appid=
Also, I have just started the project again on another machine with an Nvidia card. Most of the time the card is idle, with spikes of a second or so every now and again. Is that normal?
Thanks

Ian&Steve C. Joined: 21 Feb 20 Posts: 1117 Credit: 40,876,970,595 RAC: 0	Message 58930 - Posted: 15 Jun 2022, 12:08:50 UTC - in response to Message 58929.
Likely failing because your GT1030 doesn't have enough GPU memory. These Python tasks use a lot of VRAM, and the GT1030 is probably too weak to run these kinds of tasks, unfortunately.
And yes, it's normal to see that behavior with your RTX 3090. The app has intermittent GPU use.

Ryan Munro Joined: 6 Mar 18 Posts: 38 Credit: 1,405,292,080 RAC: 593	Message 58931 - Posted: 15 Jun 2022, 13:32:26 UTC - in response to Message 58930.
Thanks. Some other odd behaviour I see on the 3090 machine: it seems to start the WU at 2%, and if I pause BOINC and restart later, the unit's elapsed time resets to 0 and the percentage goes back to 2%?

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 58932 - Posted: 15 Jun 2022, 14:04:24 UTC - in response to Message 58931.
The program goes through several stages. The first and second 1% stages are unpacking files from an archive, and don't need to be repeated - progress will move to 2% instantly.
The rest of the run involves the serious work, and the app doesn't work out exactly how far it has progressed immediately. If you wait a few seconds or minutes (depending on the speed of the rest of the machine), it should jump back up to where it was before the restart, and continue from there in 0.98% increments.
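As a rough illustration of the progress arithmetic described above: if the first 2% is unpacking and the remainder advances in 0.98% steps, that would imply about 100 steps for the main work (98 / 0.98). The step count and the names below are assumptions inferred from those two figures, not something read from the app itself.

```python
UNPACK_PERCENT = 2.0   # two 1% unpacking stages, per the post above
STEP_PERCENT = 0.98    # increment quoted in the post above

# Assumed number of work steps: whatever takes the remaining 98% up to 100%.
TOTAL_STEPS = round((100.0 - UNPACK_PERCENT) / STEP_PERCENT)


def progress_after(steps_done: int) -> float:
    """Reported progress (percent) after a number of completed work steps."""
    if steps_done < 0:
        raise ValueError("steps_done must be non-negative")
    return min(100.0, UNPACK_PERCENT + STEP_PERCENT * steps_done)


if __name__ == "__main__":
    for steps in (0, 1, 50, TOTAL_STEPS):
        print(f"{steps:3d} steps -> {progress_after(steps):.2f}%")
```

Under this reading, a task that shows 2% right after a restart has simply not yet re-read how many steps it had completed.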

Jurgen Joined: 7 Nov 14 Posts: 10 Credit: 109,979,264 RAC: 0	Message 59412 - Posted: 7 Oct 2022, 14:57:43 UTC Last modified: 7 Oct 2022, 14:58:42 UTC
I have the same problem. I have had 109 units error out (zero completed) at between 2% and 4% progress. What is going on?
GT 1030 Nvidia card

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 59413 - Posted: 7 Oct 2022, 15:35:19 UTC - in response to Message 59412.
A GT 1030 with 2047 MB of video RAM will be below the minimum specification to run these tasks. Sorry about that.
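For anyone wanting to check a card before attaching it, here is a minimal sketch that reads total GPU memory from `nvidia-smi` and flags cards under a threshold. The 4000 MiB threshold is an assumption (this thread only reports that 2047 MB is too little and that about 4 GB seems to work from experience), not an official project figure, and the function names are made up for illustration.

```python
import subprocess

MIN_VRAM_MIB = 4000  # assumed threshold, not an official project figure


def parse_gpu_memory(csv_text: str) -> list[tuple[str, int]]:
    """Parse 'name, memory.total' rows as printed by nvidia-smi's CSV output."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Split on the last comma so commas inside a GPU name cannot break parsing.
        name, _, mem = line.rpartition(",")
        gpus.append((name.strip(), int(mem.strip())))
    return gpus


def undersized_gpus(gpus, min_mib=MIN_VRAM_MIB):
    """Return the GPUs whose total memory is below the assumed minimum."""
    return [(name, mem) for name, mem in gpus if mem < min_mib]


def query_gpus() -> list[tuple[str, int]]:
    """Ask nvidia-smi for GPU names and total memory in MiB (needs an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_memory(out)
```

Calling `undersized_gpus(query_gpus())` on a host with a 2 GB GT 1030 would list that card; an empty list means every card clears the assumed threshold.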

JohnMD Joined: 4 Dec 10 Posts: 5 Credit: 26,860,106 RAC: 0	Message 59433 - Posted: 10 Oct 2022, 23:29:33 UTC - in response to Message 59413.
There are so many GPUs out there with ONLY 2GB memory - it is inconceivable that you are unable to harness this energy source.

[CSF] Aleksey Belkov Joined: 26 Dec 13 Posts: 87 Credit: 1,292,358,731 RAC: 0	Message 59434 - Posted: 11 Oct 2022, 0:11:24 UTC - in response to Message 59433. Last modified: 11 Oct 2022, 0:12:44 UTC
"There are so many GPU's out there with ONLY 2GB memory"
Alas, at the moment this is true.
If you are interested in helping projects in the field of medicine, then you should pay attention to Folding@home. While this project is outside the BOINC ecosystem, it is undoubtedly worthy of attention. Its hardware requirements are quite modest and there are always tasks to crunch.

Ryan Munro Joined: 6 Mar 18 Posts: 38 Credit: 1,405,292,080 RAC: 593	Message 59616 - Posted: 6 Dec 2022, 20:36:10 UTC
Just had a unit fail on my other machine, W11 with the following:
5950x
32 GB memory
3090
I have 24 cores assigned to the units and the page file is automatically set to:
Virtual Memory: Max Size: 61,366 MB
Virtual Memory: Available: 41,259 MB
Virtual Memory: In Use: 20,107 MB
I have seen higher values set though, over 70 GB. I think a second unit has crashed as well but the site has not updated yet.
Any thoughts?
https://www.gpugrid.net/result.php?resultid=33158308

Ryan Munro Joined: 6 Mar 18 Posts: 38 Credit: 1,405,292,080 RAC: 593	Message 59617 - Posted: 8 Dec 2022, 19:00:03 UTC
Outcome: Computation error
Client state: Compute error
Exit status: 195 (0xc3) EXIT_CHILD_FAILED
Any idea what this means?

jjch Joined: 10 Nov 13 Posts: 101 Credit: 15,776,211,122 RAC: 0	Message 59618 - Posted: 11 Dec 2022, 5:46:05 UTC - in response to Message 59617.
This simply means the task failed. Some GPUgrid tasks will fail; it is somewhat inherent in the type of computation they are doing. Errors will occur with some jobs for other reasons.
What you need to look at are the details of the stderr file, to see if there is anything from your system causing the problem.
In this example https://www.gpugrid.net/result.php?resultid=33158308 I can see that the task restarted 5 times. While GPUgrid tasks have the ability to restart, you should leave them alone and let them run. Sometimes they will fail after too many restarts.

mikey Joined: 2 Jan 09 Posts: 303 Credit: 7,387,800,090 RAC: 635	Message 59619 - Posted: 13 Dec 2022, 4:11:57 UTC - in response to Message 59616.
Are you running these on your cpu?

Keith Myers Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 0	Message 59620 - Posted: 13 Dec 2022, 17:30:15 UTC - in response to Message 59619.
The Python on GPU tasks ALWAYS run on the CPU.

KAMasud Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0	Message 59621 - Posted: 14 Dec 2022, 15:54:51 UTC - in response to Message 59619.
"Are you running these on your cpu?"
Not just on one CPU but on all your cores plus the GPU. You need to set your swap file to at least 50GB. It is memory hungry.
On the message boards click on News. Only the latest two or three threads concern Python and everything is being discussed on those threads. The rest are ACEMD threads.
Mikey, enjoy.

Pop Piasa Joined: 8 Aug 19 Posts: 252 Credit: 458,054,251 RAC: 0	Message 59622 - Posted: 16 Dec 2022, 18:17:42 UTC - in response to Message 59620.
Keith noted: "The Python on GPU tasks ALWAYS run on the cpu."
I see that happening with the Windows version too. My last task completed in 14:45:35 but it shows 304,952.2 seconds (84.7 hrs) of run time, as well as the same CPU time. That is a bit confusing to me but the fun is in the challenges.
As for the 37 errors that preceded my 3 successful runs, they were caused by a lack of page file size according to the STDERR output:
: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\ProgramData\BOINC\slots\1\lib\site-packages\torch\lib\shm.dll" or one of its dependencies
That host only has a 256GB M.2 drive in it and I had to free up 30GB of space in order for it to expand the virtual memory enough for the unpacked files to reside completely. My combined commit charge is usually around 53GB while running these Python apps.
I let windoze create a second swap file on a SATA HDD in this host and now I have ample (Windows-managed) 44GB swap files on both drives. I haven't noticed any drop in performance, but haven't benchmarked it to know for sure.
Anyone running multiple GPUs will need tons of swap file space, I would speculate.
I'm going to try to run these on other hosts but I'm having problems joining those hosts to GPUgrid.
"Together we crunch To check out a hunch And wish all our credit Could just buy us lunch" Piasa Tribe - Illini Nation

Pop Piasa Joined: 8 Aug 19 Posts: 252 Credit: 458,054,251 RAC: 0	Message 59623 - Posted: 16 Dec 2022, 22:28:51 UTC - in response to Message 59617.
"Outcome Computation error Client state Compute error Exit status 195 (0xc3) EXIT_CHILD_FAILED Any idea what this means?"
That only tells you that the WU failed and exited. To find out what actually caused the error you need to scroll down to the stderr output section. Look for events immediately above the line which reads "called BOINC finish". Those are usually the fatal errors.
From current experience these WUs require a 4GB graphics card as a minimum, although I will try to run one on a GTX1060 3GB if I can. From observation they use about 2.8 GB of graphics memory.
Be sure to give BOINC access to a large percentage of virtual memory, too. Python apps appear to me to run almost completely in virtual memory, as my 16GB of RAM is only half used.
The CPU appears to use the GPU as a slave and claim the co-processing as its own CPU time. It appears to run a worker scenario where the GPU is intermittently called on to provide the math required for the scenario laid out by the wrapper program.
Define rollouts storage
Define scheme
Created CWorker with worker_index 0
Created GWorker with worker_index 0
Created UWorker with worker_index 0
Created training scheme.
Define learner
Created Learner.
(From stderr output of successful WU.)
Looks like machine learning research. Cool.
Someone please tell me if I'm assuming something wrong.
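The advice above (look just above the "called BOINC finish" line) can be sketched as a small helper that pulls the preceding lines out of a saved copy of the stderr text. This is only an illustration: the function name and the default of 10 context lines are assumptions, and when the marker appears more than once (for example after restarts) the sketch arbitrarily uses the last occurrence.

```python
def lines_before_finish(stderr_text: str, context: int = 10,
                        marker: str = "called BOINC finish") -> list[str]:
    """Return up to `context` lines immediately above the last marker line.

    If the marker is not present, fall back to the last `context` lines.
    """
    if context <= 0:
        return []
    lines = stderr_text.splitlines()
    # Walk backwards so the last occurrence of the marker wins.
    for i in range(len(lines) - 1, -1, -1):
        if marker in lines[i]:
            return lines[max(0, i - context):i]
    return lines[-context:]


if __name__ == "__main__":
    import sys
    # Usage (hypothetical file name): python find_error.py stderr.txt
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        for line in lines_before_finish(f.read()):
            print(line)
```

Pasting the stderr section of a result page into a text file and running the helper on it should surface messages such as the WinError 1455 line quoted earlier in the thread, if they are what preceded the exit.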

kotenok2000 Joined: 18 Jul 13 Posts: 79 Credit: 241,278,292 RAC: 236	Message 59625 - Posted: 18 Dec 2022, 17:08:26 UTC - in response to Message 59623.
You can use tail -F from mingw64 to read the wrapper_run.out file.
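If mingw64 is not installed, a few lines of Python can do roughly the same job as `tail -F`. This is a minimal polling sketch rather than a full replacement: the helper names and the one-second interval are assumptions, and the path to wrapper_run.out inside the BOINC slot directory has to be supplied by the reader.

```python
import os
import time


def read_new_text(path: str, offset: int) -> tuple[str, int]:
    """Return text appended to `path` since byte `offset`, plus the new offset."""
    size = os.path.getsize(path)
    if size < offset:
        # The file was truncated or replaced (e.g. a new task): start over.
        offset = 0
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    return data.decode("utf-8", errors="replace"), offset + len(data)


def follow(path: str, interval: float = 1.0) -> None:
    """Print new output as it arrives, roughly like `tail -F` (stop with Ctrl+C)."""
    offset = 0
    while True:
        if os.path.exists(path):
            text, offset = read_new_text(path, offset)
            if text:
                print(text, end="", flush=True)
        time.sleep(interval)
```

Calling `follow()` with the full path to wrapper_run.out keeps printing whatever the wrapper writes until interrupted.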

KAMasud Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0	Message 59626 - Posted: 18 Dec 2022, 17:57:09 UTC
https://www.gpugrid.net/forum_thread.php?id=5233

Igor Misic Joined: 12 Apr 11 Posts: 4 Credit: 2,196,296,835 RAC: 362	Message 59641 - Posted: 22 Dec 2022, 11:19:32 UTC
I've started to see ACEMD 3 tasks failing for me while Python GPU tasks run properly.
I run Ubuntu, RTX 3080
Driver Version: 525.60.11 (Cuda 12.0)
Boinc reports Cuda1131
Any hint?
Additional data from log (Computation error):
Thu 22 Dec 2022 12:17:59 PM CET | GPUGRID | Output file 5efh-ADRIA_KDeepMD_100ns_5076-0-1-RND7448_5_9 for task 5efh-ADRIA_KDeepMD_100ns_5076-0-1-RND7448_5 absent

Ian&Steve C. Joined: 21 Feb 20 Posts: 1117 Credit: 40,876,970,595 RAC: 0	Message 59642 - Posted: 22 Dec 2022, 13:18:04 UTC - in response to Message 59641.
"I run Ubuntu, RTX 3080 Driver Version: 525.60.11 (Cuda 12.0) Boinc reports Cuda1131"
Your driver supports CUDA 12, but the application is CUDA 11.3.1
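One way to read the reply above is that the two numbers describe different things: the driver reports the newest CUDA version it supports, while the label BOINC shows names the CUDA version the application was built with. NVIDIA drivers generally run applications built against older CUDA versions, so the sketch below checks only that ordering. The label format (two-digit major, then one-digit minor and patch) and both function names are assumptions made for illustration, and the check ignores finer points such as minor-version compatibility.

```python
import re


def parse_plan_class(name: str) -> tuple[int, ...]:
    """Split a label such as 'cuda1131' into (11, 3, 1).

    Assumes a two-digit major version followed by one-digit minor and patch.
    """
    m = re.fullmatch(r"cuda(\d{2})(\d)(\d)", name.strip().lower())
    if m is None:
        raise ValueError(f"unrecognised label: {name!r}")
    return tuple(int(g) for g in m.groups())


def driver_covers(driver_cuda: str, app_version: tuple[int, ...]) -> bool:
    """True if the driver's reported CUDA version is at least the app's version."""
    driver = tuple(int(p) for p in driver_cuda.split("."))
    width = max(len(driver), len(app_version))
    # Pad the shorter version with zeros so (12, 0) compares as (12, 0, 0).
    pad = lambda v: v + (0,) * (width - len(v))
    return pad(driver) >= pad(app_version)


if __name__ == "__main__":
    app = parse_plan_class("Cuda1131")
    print(app, driver_covers("12.0", app))
```

With the figures from the post, a driver reporting CUDA 12.0 covers an application built for 11.3.1, so the version labels alone would not explain the failing ACEMD 3 tasks.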