Units failing
| Author | Message |
|---|---|
|
Joined: 6 Mar 18 Posts: 38 Credit: 1,340,042,080 RAC: 33
|
Can someone have a quick look and let me know what the problem is here? A few tasks computed fine, but most errored out. https://www.gpugrid.net/results.php?userid=524374&offset=0&show_names=0&state=5&appid= Also, I have just started the project again on another machine with an Nvidia card. Most of the time the card is idle, with spikes lasting a second or so every now and again. Is that normal? Thanks |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
"Can someone have a quick look and let me know the problem here, a few computed fine but most errored out." Likely failing because your GT 1030 doesn't have enough GPU memory. These Python tasks use a lot of VRAM, and the GT 1030 is probably too weak to run these kinds of tasks, unfortunately. And yes, it's normal to see that behavior with your RTX 3090: the app uses the GPU only intermittently.
|
|
Joined: 6 Mar 18 Posts: 38 Credit: 1,340,042,080 RAC: 33
|
Thanks. Some other odd behaviour I see on the 3090 machine: it seems to start the WU at 2%, and if I pause BOINC and restart later, the unit's elapsed time resets to 0 and the percentage goes back to 2%. |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
The program goes through several stages. The first and second 1% stages unpack files from an archive and don't need to be repeated, so progress will move to 2% instantly. The rest of the run involves the serious work, and the app doesn't immediately work out exactly how far it has progressed. If you wait a few seconds or minutes (depending on the speed of the rest of the machine), it should jump back up to where it was before the restart, and continue from there in 0.98% increments. |
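The arithmetic behind those figures can be sketched as follows. This is only an illustration of the numbers quoted above (2% for unpacking, then 0.98% per increment); the function names are made up for this example and are not part of the GPUGRID app.

```python
def reported_progress(steps_done, unpack_percent=2.0, step_percent=0.98):
    """Percentage shown after `steps_done` work increments, capped at 100.

    The defaults are the figures quoted in this thread, not values read
    from the application itself.
    """
    if steps_done < 0:
        raise ValueError("steps_done must be non-negative")
    return min(100.0, round(unpack_percent + steps_done * step_percent, 2))


def steps_to_finish(unpack_percent=2.0, step_percent=0.98):
    """Number of work increments needed to go from the unpack mark to 100%."""
    return round((100.0 - unpack_percent) / step_percent)


if __name__ == "__main__":
    print(steps_to_finish())       # increments implied by 2% + n * 0.98% = 100%
    print(reported_progress(50))   # progress shown halfway through the work
```

If the figures are right, the work phase is split into 100 equal increments, which is why a restarted task that briefly shows 2% has not actually lost its progress.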
|
Joined: 7 Nov 14 Posts: 8 Credit: 109,979,264 RAC: 0
|
I have the same problem. I have had 109 units error out (zero completed) at between 2% and 4% progress. What is going on? The card is an Nvidia GT 1030. |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
A GT 1030 with 2047 MB of video RAM will be below the minimum specification to run these tasks. Sorry about that. |
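For anyone who wants to check a card before attaching it, here is a minimal sketch. It assumes `nvidia-smi` is on the PATH, and it uses a 4096 MiB threshold taken from the "4GB minimum from current experience" figure reported later in this thread, which is a user observation and not an official project requirement. The function names are hypothetical.

```python
import subprocess
from typing import List


def parse_memory_total_mib(nvidia_smi_output: str) -> List[int]:
    """Parse one integer (MiB) per GPU from nvidia-smi's CSV output."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def has_enough_vram(total_mib: int, required_mib: int = 4096) -> bool:
    """True if the card meets the assumed threshold (4096 MiB is a guess from forum reports)."""
    return total_mib >= required_mib


def query_gpu_memory_mib() -> List[int]:
    """Ask nvidia-smi for the total memory of every GPU in the host."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_total_mib(result.stdout)


if __name__ == "__main__":
    for index, mib in enumerate(query_gpu_memory_mib()):
        verdict = "ok" if has_enough_vram(mib) else "probably too small"
        print(f"GPU {index}: {mib} MiB ({verdict})")
```

With the 2047 MB reported for the GT 1030 the check fails, while a 24 GB card such as the RTX 3090 passes easily.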
JohnMD Joined: 4 Dec 10 Posts: 5 Credit: 26,860,106 RAC: 0
|
There are so many GPUs out there with ONLY 2 GB of memory. It is inconceivable that you are unable to harness this energy source. |
|
Joined: 26 Dec 13 Posts: 86 Credit: 1,292,358,731 RAC: 0
|
"There are so many GPU's out there with ONLY 2GB memory" Alas, at the moment this is true. If you are interested in helping projects in the field of medicine, then you should take a look at Folding@home. While that project is outside the BOINC ecosystem, it is undoubtedly worthy of attention. Its hardware requirements are quite modest and there are always tasks to crunch. |
|
Joined: 6 Mar 18 Posts: 38 Credit: 1,340,042,080 RAC: 33
|
Just had a unit fail on my other machine, running Windows 11 with the following: 5950X, 32 GB memory, 3090. I have 24 cores assigned to the units, and the page file is automatically set to: Virtual Memory: Max Size: 61,366 MB; Virtual Memory: Available: 41,259 MB; Virtual Memory: In Use: 20,107 MB. I have seen higher values set, though, over 70 GB. I think a second unit has crashed as well, but the site has not updated yet. Any thoughts? https://www.gpugrid.net/result.php?resultid=33158308 |
|
Joined: 6 Mar 18 Posts: 38 Credit: 1,340,042,080 RAC: 33
|
Outcome: Computation error. Client state: Compute error. Exit status: 195 (0xc3) EXIT_CHILD_FAILED. Any idea what this means? |
|
Joined: 10 Nov 13 Posts: 101 Credit: 15,773,211,122 RAC: 0
|
This simply means the task failed. Some GPUgrid tasks will fail; it is somewhat inherent in the type of computation they are doing. Errors will occur with some jobs for other reasons. What you need to do is look at the details of the stderr file and see if anything from your system is causing the problem. In this example, https://www.gpugrid.net/result.php?resultid=33158308, I can see that the task restarted 5 times. While GPUgrid tasks have the ability to restart, you should leave them alone and let them run. Sometimes they will fail after too many restarts. |
|
Joined: 2 Jan 09 Posts: 303 Credit: 7,321,800,090 RAC: 330
|
"Just had a unit fail on my other machine, W11 with the following:" Are you running these on your CPU? |
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891
|
The Python on GPU tasks ALWAYS run on the cpu. |
|
Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0
|
"Just had a unit fail on my other machine, W11 with the following:" It runs not just on one CPU but on all your cores, plus the GPU. You need to set your swap file to at least 50 GB; it is memory hungry. On the message boards, click on News. Only the latest two or three threads concern Python, and everything is being discussed in those threads. The rest are ACEMD threads. Mikey, enjoy. |
|
Joined: 8 Aug 19 Posts: 252 Credit: 458,054,251 RAC: 0
|
Keith noted: "The Python on GPU tasks ALWAYS run on the cpu." I see that happening with the Windows version too. My last task completed in 14:45:35, but it shows 304,952.2 seconds (84.7 hrs), as well as the same CPU time. That is a bit confusing to me, but the fun is in the challenges. As for the 37 errors that preceded my 3 successful runs, they were caused by an undersized page file, according to the STDERR output: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\ProgramData\BOINC\slots\1\lib\site-packages\torch\lib\shm.dll" or one of its dependencies. That host only has a 256 GB M.2 drive in it, and I had to free up 30 GB of space so that it could expand the virtual memory enough for the unpacked files to reside completely. My combined commit charge is usually around 53 GB while running these Python apps. I let windoze create a second swap file on a SATA HDD in this host, and now I have ample (Windows-managed) 44 GB swap files on both drives. I haven't noticed any drop in performance, but I haven't benchmarked it to know for sure. I would speculate that anyone running multiple GPUs will need tons of swap file space. I'm going to try to run these on other hosts, but I'm having problems joining those hosts to GPUgrid. "Together we crunch To check out a hunch And wish all our credit Could just buy us lunch" Piasa Tribe - Illini Nation |
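As a rough way to size the page file from the numbers above: on Windows the commit limit is approximately physical RAM plus the total page file size, so the page file has to cover whatever part of the commit charge does not fit in RAM. The sketch below uses the 53 GB commit charge from the post and a 16 GB RAM figure that is only an assumption for illustration; the function name is hypothetical.

```python
def min_page_file_gb(commit_charge_gb, ram_gb, headroom_gb=0.0):
    """Lower bound on page file size so that RAM + page file >= commit charge.

    This is a back-of-the-envelope estimate. Windows-managed sizing and
    other running programs will change the real requirement.
    """
    if commit_charge_gb < 0 or ram_gb <= 0 or headroom_gb < 0:
        raise ValueError("sizes must be positive")
    return max(0.0, float(commit_charge_gb + headroom_gb - ram_gb))


if __name__ == "__main__":
    # 53 GB commit charge is from the post; 16 GB of RAM is an assumed example value.
    print(min_page_file_gb(53, 16))
    # Same host with 8 GB of extra headroom for other programs (assumed).
    print(min_page_file_gb(53, 16, headroom_gb=8))
```

On those assumed numbers the page file needs at least 37 GB, which is in line with the "at least 50GB" advice given earlier in the thread once some headroom is added.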
|
Joined: 8 Aug 19 Posts: 252 Credit: 458,054,251 RAC: 0
|
"Outcome Computation error" That only tells you that the WU failed and exited. To find out what actually caused the error, you need to scroll down to the stderr output section. Look for the events immediately above the line that reads "called BOINC finish"; those are usually the fatal errors. From current experience, these WUs require a graphics card with at least 4 GB, although I will try to run one on a GTX 1060 3GB if I can. From observation, they use about 2.8 GB of graphics memory. Be sure to give BOINC access to a large percentage of virtual memory, too. Python apps appear to me to run almost completely in virtual memory, as my 16 GB of RAM is only half used. The CPU appears to use the GPU as a slave and claim the co-processing as its own CPU time. It appears to run a worker scenario where the GPU is intermittently called on to provide the math required for the scenario laid out by the wrapper program. "Define rollouts storage" (from the stderr output of a successful WU) looks like machine learning research. Cool. Someone please tell me if I'm assuming something wrong. |
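The "look just above called BOINC finish" tip can be sketched as a small helper. It assumes you have pasted or saved the stderr output section as plain text; the sample text in the usage part is made up for illustration, apart from the WinError 1455 message quoted earlier in this thread, and the function name is hypothetical.

```python
def lines_before_marker(stderr_text, marker="called BOINC finish", context=5):
    """Return up to `context` lines immediately above the last line containing `marker`.

    Returns an empty list if the marker never appears.
    """
    lines = stderr_text.splitlines()
    # Search from the end so that the final exit is used if a task restarted.
    for index in range(len(lines) - 1, -1, -1):
        if marker in lines[index]:
            return lines[max(0, index - context):index]
    return []


if __name__ == "__main__":
    # Hypothetical excerpt; real stderr output will look different.
    sample = (
        "Starting task\n"
        "[WinError 1455] The paging file is too small for this operation to complete.\n"
        "called BOINC finish\n"
    )
    for line in lines_before_marker(sample, context=2):
        print(line)
```

A larger `context` is worth trying when the fatal error is a multi-line Python traceback.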
|
Joined: 18 Jul 13 Posts: 79 Credit: 210,528,292 RAC: 0
|
You can use tail -F from mingw64 to read the wrapper_run.out file. |
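If mingw64 is not installed, a rough Python stand-in for following the file might look like this. It is a simplified sketch: unlike `tail -F`, it does not reopen the file if it is replaced, and the path to `wrapper_run.out` is left as a parameter because its location depends on the BOINC slot in use. The function names are hypothetical.

```python
import time


def read_appended_lines(handle):
    """Return the complete lines appended to an open text file since the last call."""
    lines = []
    while True:
        position = handle.tell()
        line = handle.readline()
        if not line:
            break  # nothing new yet
        if not line.endswith("\n"):
            handle.seek(position)  # partial line: rewind and retry on the next call
            break
        lines.append(line.rstrip("\n"))
    return lines


def follow(path, poll_seconds=1.0):
    """Print new lines as they are written to `path`; stop with Ctrl+C."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        while True:
            for line in read_appended_lines(handle):
                print(line)
            time.sleep(poll_seconds)
```

Calling `follow(...)` with the full path to `wrapper_run.out` in the task's slot directory prints the wrapper output as the task runs.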
|
Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0
|
https://www.gpugrid.net/forum_thread.php?id=5233 |
|
Joined: 12 Apr 11 Posts: 4 Credit: 2,158,796,835 RAC: 88
|
I've started to see ACEMD 3 tasks failing for me, while Python GPU tasks run properly. I run Ubuntu with an RTX 3080, driver version 525.60.11 (CUDA 12.0). BOINC reports Cuda1131. Any hint? Additional data from the log (Computation error): Thu 22 Dec 2022 12:17:59 PM CET | GPUGRID | Output file 5efh-ADRIA_KDeepMD_100ns_5076-0-1-RND7448_5_9 for task 5efh-ADRIA_KDeepMD_100ns_5076-0-1-RND7448_5 absent |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
Your driver supports CUDA 12, but the application is CUDA 11.3.1
|
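To make the two version numbers concrete, here is a small sketch that reads "Cuda1131" as CUDA 11.3.1 (as in the reply above) and compares it with the CUDA version the driver reports. The digit layout (two-digit major, one-digit minor, remaining digits as patch) is an assumption based on that one example, and the comparison is simplified: NVIDIA drivers are generally able to run applications built against older CUDA toolkits, so a CUDA 12 driver with a CUDA 11.3.1 app is not, on its own, evidence of a fault. The function names are hypothetical.

```python
def parse_cuda_plan_class(plan_class):
    """Turn a label such as 'Cuda1131' into a (major, minor, patch) tuple.

    Assumes a two-digit major version, which fits 'cuda1131' -> 11.3.1 but
    has not been checked against other labels.
    """
    text = plan_class.strip().lower()
    if not text.startswith("cuda"):
        raise ValueError("not a CUDA label: %r" % plan_class)
    digits = text[len("cuda"):]
    if len(digits) < 3 or not digits.isdigit():
        raise ValueError("unexpected version digits: %r" % digits)
    patch = int(digits[3:]) if len(digits) > 3 else 0
    return (int(digits[:2]), int(digits[2]), patch)


def driver_is_new_enough(driver_cuda, app_cuda):
    """Simplified check: the driver's (major, minor) is at least the app's."""
    return tuple(driver_cuda[:2]) >= tuple(app_cuda[:2])


if __name__ == "__main__":
    app = parse_cuda_plan_class("Cuda1131")
    print(app, driver_is_new_enough((12, 0), app))
```

If the check passes, as it does for the setup described above, the failing ACEMD 3 tasks probably need to be diagnosed from their stderr output instead.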
©2025 Universitat Pompeu Fabra