Message boards : Graphics cards (GPUs) : Units failing

Ryan Munro
Joined: 6 Mar 18
Posts: 33
Credit: 1,233,570,527
RAC: 5,504,644
Level
Met
Message 58929 - Posted: 15 Jun 2022 | 9:14:29 UTC

Can someone have a quick look and let me know what the problem is here? A few units computed fine, but most errored out.

https://www.gpugrid.net/results.php?userid=524374&offset=0&show_names=0&state=5&appid=

Also, I have just started the project again on another machine with an Nvidia card. Most of the time the card is idle, with spikes of a second or so every now and again. Is that normal?

Thanks

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 27
Level
Trp
Message 58930 - Posted: 15 Jun 2022 | 12:08:50 UTC - in response to Message 58929.

Can someone have a quick look and let me know what the problem is here? A few units computed fine, but most errored out.

https://www.gpugrid.net/results.php?userid=524374&offset=0&show_names=0&state=5&appid=

Also, I have just started the project again on another machine with an Nvidia card. Most of the time the card is idle, with spikes of a second or so every now and again. Is that normal?

Thanks


Likely failing because your GT1030 doesn't have enough GPU memory. These Python tasks use a lot of VRAM. The GT1030 is probably too weak to run these kinds of tasks, unfortunately.

And yes, it's normal to see that behavior with your RTX 3090. The app has intermittent GPU use.

Ryan Munro
Joined: 6 Mar 18
Posts: 33
Credit: 1,233,570,527
RAC: 5,504,644
Level
Met
Message 58931 - Posted: 15 Jun 2022 | 13:32:26 UTC - in response to Message 58930.

Thanks. Some other odd behaviour I see on the 3090 machine: it seems to start the WU at 2%, and if I pause BOINC and restart later, the unit's elapsed time resets to 0 and the percentage goes back to 2%. Is that expected?

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1626
Credit: 9,362,966,723
RAC: 18,909,192
Level
Tyr
Message 58932 - Posted: 15 Jun 2022 | 14:04:24 UTC - in response to Message 58931.

The program goes through several stages. The first and second 1% stages are unpacking files from an archive, and don't need to be repeated - progress will move to 2% instantly.

The rest of the run involves the serious work, and the app doesn't work out exactly how far it has progressed immediately. If you wait a few seconds or minutes (depending on the speed of the rest of the machine), it should jump back up to where it was before the restart, and continue from there in 0.98% increments.
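
As a rough illustration of that arithmetic, here is a minimal sketch. The 100-step split is only an assumption, inferred from 2% + 100 x 0.98% = 100%, and `progress_percent` is a made-up name, not something taken from the app itself.

```python
UNPACK_PERCENT = 2.0   # the two 1% unpacking stages described above
STEP_PERCENT = 0.98    # increment reported during the main phase
ASSUMED_STEPS = 100    # assumption: 100 steps, so 2 + 100 * 0.98 = 100


def progress_percent(steps_done: int) -> float:
    """Return the progress figure after `steps_done` main-phase steps."""
    if not 0 <= steps_done <= ASSUMED_STEPS:
        raise ValueError("steps_done must be between 0 and ASSUMED_STEPS")
    return round(UNPACK_PERCENT + steps_done * STEP_PERCENT, 2)


if __name__ == "__main__":
    for steps in (0, 1, 50, 100):
        print(steps, progress_percent(steps))
```

Under that assumption, a freshly restarted task showing 2% simply has not yet reported how many steps it had already finished.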

Jurgen
Joined: 7 Nov 14
Posts: 5
Credit: 85,591,905
RAC: 617,161
Level
Thr
Message 59412 - Posted: 7 Oct 2022 | 14:57:43 UTC
Last modified: 7 Oct 2022 | 14:58:42 UTC

I have the same problem. I have had 109 units error out (zero completed) at between 2% and 4% progress. What is going on?

GT 1030 Nvidia card

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1626
Credit: 9,362,966,723
RAC: 18,909,192
Level
Tyr
Message 59413 - Posted: 7 Oct 2022 | 15:35:19 UTC - in response to Message 59412.

A GT 1030 with 2047 MB of video RAM will be below the minimum specification to run these tasks. Sorry about that.
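
For anyone wanting to check a card before attaching it, a minimal sketch follows. It assumes the roughly 4 GB practical minimum reported from experience later in this thread (not an official project figure) and that `nvidia-smi` is on the PATH. The function names are made up for illustration.

```python
import subprocess

# Assumption: ~4 GB practical minimum, as reported from experience in this thread.
MIN_VRAM_MIB = 4096


def parse_total_mib(smi_output: str) -> list:
    """Parse one memory.total value (in MiB) per line of nvidia-smi CSV output."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def gpus_below_minimum(totals_mib, minimum_mib=MIN_VRAM_MIB) -> list:
    """Return the indexes of GPUs whose total memory is under the minimum."""
    return [i for i, total in enumerate(totals_mib) if total < minimum_mib]


def query_total_mib() -> list:
    """Ask nvidia-smi for the total memory of each GPU, in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_total_mib(result.stdout)


if __name__ == "__main__":
    print("GPUs below the assumed minimum:", gpus_below_minimum(query_total_mib()))
```

A card reporting 2047 MiB, like the GT 1030 above, would be flagged by this check.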

JohnMD
Joined: 4 Dec 10
Posts: 5
Credit: 26,860,106
RAC: 0
Level
Val
Message 59433 - Posted: 10 Oct 2022 | 23:29:33 UTC - in response to Message 59413.

There are so many GPUs out there with ONLY 2GB memory - it is inconceivable you are unable to harness this energy source.

[CSF] Aleksey Belkov
Joined: 26 Dec 13
Posts: 86
Credit: 1,280,951,766
RAC: 346,681
Level
Met
Message 59434 - Posted: 11 Oct 2022 | 0:11:24 UTC - in response to Message 59433.
Last modified: 11 Oct 2022 | 0:12:44 UTC

There are so many GPUs out there with ONLY 2GB memory

Alas, at the moment this is true.
If you are interested in helping projects in the field of medicine, then you should pay attention to Folding@home.
While this project is outside the BOINC ecosystem, it is undoubtedly worthy of attention.
Its hardware requirements are quite modest and there are always tasks to crunch.

Ryan Munro
Joined: 6 Mar 18
Posts: 33
Credit: 1,233,570,527
RAC: 5,504,644
Level
Met
Message 59616 - Posted: 6 Dec 2022 | 20:36:10 UTC

Just had a unit fail on my other machine, W11 with the following:

5950x
32gb memory
3090

I have 24 cores assigned to the units and the page file is automatically set to:

Virtual Memory: Max Size: 61,366 MB
Virtual Memory: Available: 41,259 MB
Virtual Memory: In Use: 20,107 MB

I have seen higher values, over 70gb set though.

I think a second unit has crashed as well but the site has not updated yet.

Any thoughts?

https://www.gpugrid.net/result.php?resultid=33158308

Ryan Munro
Joined: 6 Mar 18
Posts: 33
Credit: 1,233,570,527
RAC: 5,504,644
Level
Met
Message 59617 - Posted: 8 Dec 2022 | 19:00:03 UTC

Outcome Computation error
Client state Compute error
Exit status 195 (0xc3) EXIT_CHILD_FAILED

Any idea what this means?

jjch
Joined: 10 Nov 13
Posts: 101
Credit: 15,694,370,517
RAC: 3,382,072
Level
Trp
Message 59618 - Posted: 11 Dec 2022 | 5:46:05 UTC - in response to Message 59617.

This simply means the task failed. Some GPUgrid tasks will fail. It is somewhat inherent in the type of computation they are doing. Errors will occur with some jobs for other reasons.

What you need to look at are the details of the stderr file, to see if there is anything from your system causing the problem. In this example, https://www.gpugrid.net/result.php?resultid=33158308, I can see that the task restarted 5 times. While GPUgrid tasks have the ability to restart, you should leave them alone and let them run. Sometimes they will fail after too many restarts.

mikey
Joined: 2 Jan 09
Posts: 298
Credit: 6,644,908,968
RAC: 15,138,204
Level
Tyr
Message 59619 - Posted: 13 Dec 2022 | 4:11:57 UTC - in response to Message 59616.

Just had a unit fail on my other machine, W11 with the following:

5950x
32gb memory
3090

I have 24 cores assigned to the units and the page file is automatically set to:

Virtual Memory: Max Size: 61,366 MB
Virtual Memory: Available: 41,259 MB
Virtual Memory: In Use: 20,107 MB

I have seen higher values, over 70gb set though.

I think a second unit has crashed as well but the site has not updated yet.

Any thoughts?

https://www.gpugrid.net/result.php?resultid=33158308


Are you running these on your cpu?

Keith Myers
Joined: 13 Dec 17
Posts: 1358
Credit: 7,894,103,302
RAC: 7,266,669
Level
Tyr
Message 59620 - Posted: 13 Dec 2022 | 17:30:15 UTC - in response to Message 59619.

The Python on GPU tasks ALWAYS run on the cpu.

KAMasud
Joined: 27 Jul 11
Posts: 137
Credit: 524,093,545
RAC: 15,739
Level
Lys
Message 59621 - Posted: 14 Dec 2022 | 15:54:51 UTC - in response to Message 59619.

Just had a unit fail on my other machine, W11 with the following:

5950x
32gb memory
3090

I have 24 cores assigned to the units and the page file is automatically set to:

Virtual Memory: Max Size: 61,366 MB
Virtual Memory: Available: 41,259 MB
Virtual Memory: In Use: 20,107 MB

I have seen higher values, over 70gb set though.

I think a second unit has crashed as well but the site has not updated yet.

Any thoughts?

https://www.gpugrid.net/result.php?resultid=33158308


Are you running these on your cpu?

____________
Not just one CPU but also on all your cores plus GPU. Need to set your swap file to at least 50GB. It is memory hungry.
On the message boards, click on News. Only the latest two or three threads concern Python and everything is being discussed on those threads. The rest are ACEMD threads. Mikey, enjoy.
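
To compare the virtual-memory figures quoted above with the "at least 50GB" suggestion, a minimal sketch follows. It assumes the three "Virtual Memory:" lines are pasted in exactly as shown in the quoted post and treats 50GB as 50 x 1024 MB. The function names are hypothetical. Note that these figures may include physical RAM as well as the page file, so this is only a rough check.

```python
import re

# Assumption: "at least 50GB" from the reply above, taken as 50 * 1024 MB.
RECOMMENDED_MB = 50 * 1024

# Matches lines such as "Virtual Memory: Max Size: 61,366 MB".
_VM_LINE = re.compile(r"Virtual Memory:\s*(Max Size|Available|In Use):\s*([\d,]+)\s*MB")


def parse_virtual_memory(text: str) -> dict:
    """Return the MB values keyed by 'Max Size', 'Available' and 'In Use'."""
    return {
        match.group(1): int(match.group(2).replace(",", ""))
        for match in _VM_LINE.finditer(text)
    }


def meets_recommendation(text: str, recommended_mb: int = RECOMMENDED_MB) -> bool:
    """True if the reported maximum size reaches the recommended amount."""
    return parse_virtual_memory(text)["Max Size"] >= recommended_mb


if __name__ == "__main__":
    quoted = (
        "Virtual Memory: Max Size: 61,366 MB\n"
        "Virtual Memory: Available: 41,259 MB\n"
        "Virtual Memory: In Use: 20,107 MB\n"
    )
    print(parse_virtual_memory(quoted))
    print(meets_recommendation(quoted))
```

On the quoted numbers the maximum size already exceeds the suggested amount, so the page file size by itself may not explain that particular failure.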

Pop Piasa
Joined: 8 Aug 19
Posts: 252
Credit: 458,054,251
RAC: 0
Level
Gln
Message 59622 - Posted: 16 Dec 2022 | 18:17:42 UTC - in response to Message 59620.

Keith noted:

The Python on GPU tasks ALWAYS run on the cpu.



I see that happening with the Windows version too. My last task completed in 14:45:35, but it shows 304,952.2 seconds (84.7 hrs) as the run time, and the same for CPU time.

That is a bit confusing to me but the fun is in the challenges.

As for the 37 errors that preceded my 3 successful runs, they were caused by a lack of page file size according to the STDERR output:

: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\ProgramData\BOINC\slots\1\lib\site-packages\torch\lib\shm.dll" or one of its dependencies


That host only has a 256GB m.2 drive in it and I had to free up 30GB of space in order for it to expand the virtual memory enough for the unpacked files to reside completely. My combined commit charge is usually around 53GB while running these Python apps.

I let windoze create a second swap file on a SATA HDD in this host and now I have ample (windows managed) 44GB swap files on both drives. Haven't noticed any drop in performance, but haven't benchmarked it to know for sure. Anyone running multiple GPUs will need tons of swap file space I would speculate.

I'm going to try to run these on other hosts but I'm having problems joining those hosts to GPUgrid.


____________
"Together we crunch
To check out a hunch
And wish all our credit
Could just buy us lunch"


Piasa Tribe - Illini Nation

Pop Piasa
Joined: 8 Aug 19
Posts: 252
Credit: 458,054,251
RAC: 0
Level
Gln
Message 59623 - Posted: 16 Dec 2022 | 22:28:51 UTC - in response to Message 59617.

Outcome Computation error
Client state Compute error
Exit status 195 (0xc3) EXIT_CHILD_FAILED

Any idea what this means?


That only tells you that the WU failed and exited.

To find out what actually caused the error you need to scroll down to the stderr output section. Look for events immediately above the line which reads "called BOINC finish". Those are usually the fatal errors.
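
If you save the stderr text from the result page to a file, that search can be automated. This is only a sketch: the marker string is the wording used above and may differ from what the real log prints, so it is a parameter, and `lines_before_finish` is a made-up helper name.

```python
def lines_before_finish(stderr_text: str, context: int = 10,
                        marker: str = "called BOINC finish") -> list:
    """Return up to `context` lines found just above the first marker line.

    The match ignores case. An empty list means the marker was not found.
    """
    lines = stderr_text.splitlines()
    for index, line in enumerate(lines):
        if marker.lower() in line.lower():
            return lines[max(0, index - context):index]
    return []


if __name__ == "__main__":
    # Made-up sample text, only to show the call; not a real task log.
    sample = "step one\nstep two\nsome fatal error\ncalled BOINC finish\n"
    for line in lines_before_finish(sample, context=2):
        print(line)
```

Pass a different `marker` if the finish line in your own stderr output is worded differently.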

These WUs require a 4GB graphics card as a minimum from current experience, although I will try to run one on a GTX1060 3GB if I can. They use about 2.8 GB graphics mem from observation.

Be sure to give BOINC access to a large percentage of virtual memory, too.

Python apps appear to me to run almost completely in virtual memory, as my 16 GB of RAM is only half used.

The CPU appears to use the GPU as a slave and claim the co-processing as its own CPU time. It appears to run a worker scenario where the GPU is intermittently called on to provide the math required for the scenario laid out by the wrapper program.

Define rollouts storage
Define scheme
Created CWorker with worker_index 0
Created GWorker with worker_index 0
Created UWorker with worker_index 0
Created training scheme.
Define learner
Created Learner.

(From stderr output of successful WU.)

Looks like machine learning research.
Cool.

Someone please tell me if I'm assuming something wrong.
____________
"Together we crunch
To check out a hunch
And wish all our credit
Could just buy us lunch"


Piasa Tribe - Illini Nation

kotenok2000
Joined: 18 Jul 13
Posts: 78
Credit: 141,134,050
RAC: 546,082
Level
Cys
Message 59625 - Posted: 18 Dec 2022 | 17:08:26 UTC - in response to Message 59623.

You can use tail -F from mingw64 to follow the wrapper_run.out file as it is written.
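
If mingw64 is not installed, something similar can be sketched in Python. This is a simplified stand-in for tail -F, not an equivalent: it starts from the top of the file and does not reopen the file if it is replaced. Where wrapper_run.out lives (presumably the task's BOINC slot directory) is an assumption, so the path is left to the caller. `follow` is a made-up name.

```python
import time
from typing import Iterator, Optional


def follow(path: str, poll_seconds: float = 1.0,
           max_idle_polls: Optional[int] = None) -> Iterator[str]:
    """Yield lines from `path` as they appear, polling for new content.

    With max_idle_polls=None this runs until interrupted; otherwise it
    stops after that many consecutive polls that found nothing new.
    """
    with open(path, "r", errors="replace") as handle:
        idle = 0
        while True:
            line = handle.readline()
            if line:
                idle = 0
                yield line.rstrip("\n")
            else:
                idle += 1
                if max_idle_polls is not None and idle >= max_idle_polls:
                    return
                time.sleep(poll_seconds)


if __name__ == "__main__":
    import sys
    # Usage: python follow.py <path to wrapper_run.out>
    for text in follow(sys.argv[1]):
        print(text)
```

Stop it with Ctrl+C when you have seen enough.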

KAMasud
Joined: 27 Jul 11
Posts: 137
Credit: 524,093,545
RAC: 15,739
Level
Lys
Message 59626 - Posted: 18 Dec 2022 | 17:57:09 UTC

https://www.gpugrid.net/forum_thread.php?id=5233

Igor Misic
Joined: 12 Apr 11
Posts: 4
Credit: 1,769,264,335
RAC: 9,694,046
Level
His
Message 59641 - Posted: 22 Dec 2022 | 11:19:32 UTC

I've started to see ACEMD 3 tasks failing for me, while Python GPU tasks run properly.

I run Ubuntu, RTX 3080 Driver Version: 525.60.11 (Cuda 12.0) Boinc reports Cuda1131

Any hint?

Additional data from log (Computation error):
Thu 22 Dec 2022 12:17:59 PM CET | GPUGRID | Output file 5efh-ADRIA_KDeepMD_100ns_5076-0-1-RND7448_5_9 for task 5efh-ADRIA_KDeepMD_100ns_5076-0-1-RND7448_5 absent

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 27
Level
Trp
Message 59642 - Posted: 22 Dec 2022 | 13:18:04 UTC - in response to Message 59641.


I run Ubuntu, RTX 3080 Driver Version: 525.60.11 (Cuda 12.0) Boinc reports Cuda1131



Your driver supports CUDA 12, but the application is CUDA 11.3.1

mikey
Joined: 2 Jan 09
Posts: 298
Credit: 6,644,908,968
RAC: 15,138,204
Level
Tyr
Message 59644 - Posted: 22 Dec 2022 | 14:36:51 UTC - in response to Message 59621.

Just had a unit fail on my other machine, W11 with the following:

5950x
32gb memory
3090

I have 24 cores assigned to the units and the page file is automatically set to:

Virtual Memory: Max Size: 61,366 MB
Virtual Memory: Available: 41,259 MB
Virtual Memory: In Use: 20,107 MB

I have seen higher values, over 70gb set though.

I think a second unit has crashed as well but the site has not updated yet.

Any thoughts?

https://www.gpugrid.net/result.php?resultid=33158308


Are you running these on your cpu?

____________
Not just one CPU but also on all your cores plus GPU. Need to set your swap file to at least 50GB. It is memory hungry.
On the message boards, click on News. Only the latest two or three threads concern Python and everything is being discussed on those threads. The rest are ACEMD threads. Mikey, enjoy.


Thank you very much. I think I will try these after the New Year.

Magiceye04
Joined: 1 Apr 09
Posts: 24
Credit: 67,905,687
RAC: 0
Level
Thr
Message 59663 - Posted: 27 Dec 2022 | 12:10:50 UTC - in response to Message 59433.

There are so many GPUs out there with ONLY 2GB memory - it is inconceivable you are unable to harness this energy source.


There are so many projects out there which will run fine on the Geforce 1030. :)

Igor Misic
Joined: 12 Apr 11
Posts: 4
Credit: 1,769,264,335
RAC: 9,694,046
Level
His
Message 59664 - Posted: 27 Dec 2022 | 16:18:59 UTC - in response to Message 59642.


I run Ubuntu, RTX 3080 Driver Version: 525.60.11 (Cuda 12.0) Boinc reports Cuda1131



Your driver supports CUDA 12, but the application is CUDA 11.3.1


Thx for the reply. Does this mean I should downgrade my driver to one that supports CUDA 11 to make ACEMD 3 App compatible? Python App runs properly with the current driver so this is confusing me a bit.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 27
Level
Trp
Message 59665 - Posted: 27 Dec 2022 | 16:23:36 UTC - in response to Message 59664.


I run Ubuntu, RTX 3080 Driver Version: 525.60.11 (Cuda 12.0) Boinc reports Cuda1131



Your driver supports CUDA 12, but the application is CUDA 11.3.1


Thx for the reply. Does this mean I should downgrade my driver to one that supports CUDA 11 to make ACEMD 3 App compatible? Python App runs properly with the current driver so this is confusing me a bit.

No, you don't need to do anything. CUDA drivers are backwards compatible, so a driver that supports CUDA 12 can still run an application built for CUDA 11.3.1.
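
The rule of thumb can be written down as a small check. This is a simplified sketch: it compares only the major and minor CUDA versions and ignores finer points such as per-release minimum driver versions. `driver_can_run` is a made-up name, not part of BOINC or CUDA.

```python
def driver_can_run(driver_cuda: tuple, app_cuda: tuple) -> bool:
    """True if the CUDA version supported by the driver is at least the
    major.minor version the application was built for."""
    return tuple(driver_cuda[:2]) >= tuple(app_cuda[:2])


if __name__ == "__main__":
    # Driver 525.60.11 reports CUDA 12.0; the app is built for CUDA 11.3.1.
    print(driver_can_run((12, 0), (11, 3, 1)))
```

For the setup in this thread the check passes, which matches the advice that no driver downgrade is needed.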
