PYSCFbeta: Quantum chemistry calculations on GPU
Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423
"Ian, are you saying that even after you've set DCF to a low value in the client_state file, it is still escalating?"

My DCF was set to about 0.01, and my tasks were estimating that they would take 27 hours each to complete. I changed the DCF to 0.0001, and that changed the estimate to about 16 minutes each. Then, after a short time, I noticed that the time-to-completion estimate was going up again, eventually reaching 27 hours. I checked DCF and it's back to 0.01.
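For reference, the value being edited lives in the <project> block of the BOINC client_state.xml. A minimal excerpt as a sketch (surrounding elements elided, the 0.01 value illustrative):

```xml
<project>
    <master_url>https://www.gpugrid.net/</master_url>
    ...
    <duration_correction_factor>0.010000</duration_correction_factor>
</project>
```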
ServicEnginIC · Joined: 24 Sep 10 · Posts: 592 · Credit: 11,972,186,510 · RAC: 1,447
"First, it’s well known at this point that these tasks require a lot of VRAM, so some failures are to be expected from that. The VRAM utilization is not constant, but spikes up and down. From the tasks running on my systems, loading up to 5-6 GB and staying around that amount is pretty normal, with intermittent spikes into the 9-12 GB+ range. Just by looking at the failure rate of different GPUs, I'm estimating that most tasks need more than 8 GB (>70%), a small share need more than 12 GB (~5%), and a very small number need even more than 16 GB (<1%). A teammate of mine is running on a couple of 2080 Tis (11 GB) and has had some failures but mostly success."

As you suggested in a previous post, VRAM utilization seems to be tied to each particular model of graphics card / GPU. GPUs with fewer CUDA cores seem to use less VRAM. My GTX 1650 GPUs have 896 CUDA cores and 4 GB VRAM. My GTX 1650 SUPER GPU has 1280 CUDA cores and 4 GB VRAM. My GTX 1660 Ti GPU has 1536 CUDA cores and 6 GB VRAM. These cards are currently achieving an overall success rate of 44% on PYSCFbeta (676 valid versus 856 errored tasks at the time of writing). Not all the errors were due to memory overflows; some were due to non-viable WUs or other causes, but digging into that would take too much time... Processing ATMbeta tasks, success was pretty close to 100%.
Joined: 13 Jul 09 · Posts: 64 · Credit: 2,922,790,120 · RAC: 98
"well i did switch all my computers to linux. even personal ones. the only windows system I have is my work provided laptop. but i could do everything i need on a linux laptop. WINE runs a lot of things these days."

I'm trying to suppress my grinning at this upside-down world... having retired my last Windoze box some time ago.

On a more germane note... Between this:

CUDA Error of GINTint2e_jk_kernel: out of memory
https://www.gpugrid.net/result.php?resultid=33956113

and this:

cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 535,127,040 bytes (allocated so far: 4,278,332,416 bytes, limit set to: 9,443,495,116 bytes).
https://www.gpugrid.net/result.php?resultid=33955488

Does the fact that the reported error (my most common error on RTX 3080 12G cards) seems to say it never actually reached the imposed limit, yet still failed, mean anything to anyone? I am ASSUMING that this refers to the memory on the 12G video card?

And for what it's worth, best I can tell I'm getting a lower error percentage on my RTX 3070 8GB cards since I backed off the sclk/mclk clocks.

Skip - da shu @ HeliOS, "A child's exposure to technology should never be predicated on an ability to afford it."
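The "limit set to:" figure in that traceback looks like a CuPy memory-pool cap. A minimal sketch of how such a cap interacts with actual device memory (byte value copied from the posted error; this is not the project's actual code):

```python
import cupy

# A CuPy pool limit only bounds this process's allocations; an allocation
# can still fail well below the cap if the device itself has no free memory
# left (other processes, fragmentation).
pool = cupy.get_default_memory_pool()
pool.set_limit(size=9_443_495_116)          # the "limit set to" figure (~8.8 GiB)

print("pool limit bytes:", pool.get_limit())
print("pool used bytes :", pool.used_bytes())

free, total = cupy.cuda.runtime.memGetInfo()  # what the device really has left
print(f"device free/total: {free} / {total} bytes")
```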
Joined: 8 Oct 16 · Posts: 27 · Credit: 4,153,801,869 · RAC: 0
Seems to me that your 3080 is the 10G version instead of 12G?
Joined: 27 May 21 · Posts: 54 · Credit: 1,004,151,720 · RAC: 0
I'm clearly in the presence of passionate Linux believers here... :-)

"Between this: CUDA Error of GINTint2e_jk_kernel: out of memory ..."

It does refer to video memory, but the limit each WU sets possibly doesn't take into account other processes allocating video memory. That would especially be an issue, I think, if you run multiple WUs in parallel. Try executing nvidia-smi to see which processes allocate how much video memory:
svennemans@PCSLLINUX01:~$ nvidia-smi
Sun Feb 11 17:29:48 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05 Driver Version: 535.154.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce GTX 1080 Ti Off | 00000000:01:00.0 On | N/A |
| 47% 71C P2 179W / 275W | 6449MiB / 11264MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1611 G /usr/lib/xorg/Xorg 534MiB |
| 0 N/A N/A 1801 G /usr/bin/gnome-shell 75MiB |
| 0 N/A N/A 9616 G boincmgr 2MiB |
| 0 N/A N/A 9665 G ...gnu/webkit2gtk-4.0/WebKitWebProcess 12MiB |
| 0 N/A N/A 27480 G ...38,262144 --variations-seed-version 125MiB |
| 0 N/A N/A 46332 G gnome-control-center 2MiB |
| 0 N/A N/A 47110 C python 5562MiB |
+---------------------------------------------------------------------------------------+
My one running WU has allocated 5.5 GB, but with the other running processes the total allocated is 6.4 GB. It depends on the implementation whether the limit is calculated from the total CUDA memory or from the actually free CUDA memory, and whether that limit is set only once at the start or updated multiple times. |
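To catch the short spikes a glance at nvidia-smi can miss, a polling sketch like the following could help (uses the nvidia-ml-py / pynvml bindings; GPU index and poll interval are arbitrary choices, and a 1-second poll can still miss sub-second spikes):

```python
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumes the card is GPU 0

peak = {}  # pid -> highest VRAM usage seen, in bytes
try:
    while True:
        for p in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
            if p.usedGpuMemory is not None:    # can be None on some drivers
                peak[p.pid] = max(peak.get(p.pid, 0), p.usedGpuMemory)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"total used: {mem.used / 2**20:.0f} / {mem.total / 2**20:.0f} MiB;",
              "peaks:", {pid: f"{b / 2**20:.0f} MiB" for pid, b in peak.items()})
        time.sleep(1)
except KeyboardInterrupt:
    pynvml.nvmlShutdown()
```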
Joined: 13 Jul 09 · Posts: 64 · Credit: 2,922,790,120 · RAC: 98
Good point about the other stuff on the card... right this minute it's taking a break from GPUGRID to do a Meerkat Burp7... I usually have "watch -t -n 8 nvidia-smi" running on this box if I'm poking around. I'll capture a shot of it as soon as GPUGRID comes back up if any of the listed usage changes significantly; I don't think it will. While the 'cuda_1222' task is running, I see a total of ~286 MB of 'other stuff', if my 'ciphering' is right.

Skip

PS: GPUGRID WUs are all 1x here.
PPS: Yes, it's the 10G version!
PPPS: Also, my ad hoc perception of error rates was wrong... working on that.
Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 891
I believe the lowest value that DCF can take in the client_state file is 0.01. I found that in the code someplace, sometime.
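If that's right, the snap-back described above would be expected: a hand-edited value below the floor gets pulled back up as soon as the client recalculates. A toy sketch of that kind of update (illustrative only, with a made-up smoothing factor; not the actual BOINC client code):

```python
def update_dcf(dcf: float, actual_runtime: float, estimated_runtime: float) -> float:
    """Illustrative DCF update: drift toward the observed runtime ratio,
    then clamp to a fixed range. Not the verbatim BOINC implementation."""
    ratio = actual_runtime / estimated_runtime  # >1 means tasks run long
    dcf += 0.1 * (ratio - dcf)                  # made-up smoothing factor
    return min(max(dcf, 0.01), 100.0)           # 0.01 floor the posts mention

# A value edited to 0.0001 climbs back above 0.01 after a single update:
print(update_dcf(0.0001, actual_runtime=27 * 3600, estimated_runtime=16 * 60))
```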
Joined: 15 Jul 20 · Posts: 95 · Credit: 2,550,803,412 · RAC: 248
Hello, apparently it now works on my two GPUs, a GTX 1650 and an RTX 4060. I haven't had any computation errors.
Joined: 21 Dec 23 · Posts: 51 · Credit: 0 · RAC: 0
Hello,

Yes, I would not expect the app to work on WSL. There are many Linux-specific libraries in the packaged Python environment that is the "app".

Thank you for the feedback regarding the failure rate. As I mentioned, different WUs require different amounts of memory, which is hard to check before they start crunching. From my viewpoint the failure rates are low enough that all WUs seem to succeed within a few retries. This is still a "Beta" app.

We definitely want a Windows app and it is in the pipeline. However, as I mentioned before, developing it is time consuming. Several of the underlying code bases are Linux-only at the moment, so a Windows app requires a Windows port of some code.
Joined: 27 May 21 · Posts: 54 · Credit: 1,004,151,720 · RAC: 0
"Yes, I would not expect the app to work on WSL. There are many Linux-specific libraries in the packaged Python environment that is the 'app'."

Actually, it *should* work, since WSL2 is sold as a native Linux kernel running in a virtual environment with full system-call compatibility. So one could reasonably expect any native Linux libraries to work as expected. However, there are obviously still a few issues to iron out. Not by GPUGRID, to be clear - by Microsoft.
Joined: 27 May 21 · Posts: 54 · Credit: 1,004,151,720 · RAC: 0
I'm seeing a bunch of checksum errors during unzip, anyone else have this problem?

https://www.gpugrid.net/results.php?hostid=617834&offset=0&show_names=0&state=5&appid=

Stderr output:

<core_client_version>7.20.5</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>
11:26:18 (177385): wrapper (7.7.26016): starting
lib/libcufft.so.10.9.0.58 bad CRC e458474a (should be 0a867ac2)
boinc_unzip() error: 2
</stderr_txt>
]]>

The workunits seem to all run fine on a subsequent host.
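For anyone who wants to reproduce that check locally, Python's standard zipfile module can re-verify every member's CRC; the archive path below is hypothetical:

```python
import zipfile

# testzip() re-reads every member and returns the name of the first one whose
# CRC fails, or None if the archive is intact. File name is hypothetical.
with zipfile.ZipFile("gpugrid_app_payload.zip") as zf:
    bad = zf.testzip()
    print("all CRCs OK" if bad is None else f"bad CRC in member: {bad}")
```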
Joined: 15 Jul 20 · Posts: 95 · Credit: 2,550,803,412 · RAC: 248
Good afternoon, when will Windows tasks be ready for testing? Frankly, Linux is rotten: after an update, LHC@home no longer works. I'm staying on Linux for you, but I can't wait to get back to good old Windows. Thanks
Joined: 27 Aug 21 · Posts: 38 · Credit: 7,254,068,306 · RAC: 0
Maybe try a different version of Linux. I have always used Windows (and still do on some systems) but use Linux Mint on others. It's really user friendly and has a very similar feel to Windows.
Joined: 12 Jul 17 · Posts: 404 · Credit: 17,408,899,587 · RAC: 0
"Between this: CUDA Error of GINTint2e_jk_kernel: out of memory ..."

Sometimes I get the same error on my 3080 10 GB card. E.g., https://www.gpugrid.net/result.php?resultid=33960422

Headless computer with a single 3080, running 1C + 1N.
Joined: 12 Jul 17 · Posts: 404 · Credit: 17,408,899,587 · RAC: 0
"I believe the lowest value that DCF can be in the client_state file is 0.01."

Zoltan posted long ago that BOINC does not understand zero, and 0.01 is as close as it can get. I wonder if that was someone's approach to fixing a division-by-zero problem in antiquity.
Joined: 13 Jul 09 · Posts: 64 · Credit: 2,922,790,120 · RAC: 98
After logging error rates for a few days across 5 boxes with Nvidia cards (all RTX 30x0, all Linux Mint 2x.3), trying to stay aware of what I was doing on the main desktop while 'python' was running, and cutting back the sclk/mclk clocks, the average error rate is dropping. The last cut shows it at 23.44% across the 5 boxes, averaged over 28 hours.

No longer any segfault (0x8b) errors; all are 0x1 now. The last one was on the most troublesome of the 3070 cards.

https://www.gpugrid.net/result.php?resultid=33950656

Anything I can do to help with this type of error?

Skip
Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423
It's still an out-of-memory error. A little further up, the error log shows this:

"CUDA Error of GINTint2e_jk_kernel: out of memory"

So it's probably just running out of memory at a different stage of the task, producing a slightly different error, but it's still an issue of not enough memory.
Joined: 13 Jul 09 · Posts: 64 · Credit: 2,922,790,120 · RAC: 98
"I'm seeing a bunch of checksum errors during unzip, anyone else have this problem?"

I didn't find any of these in the 10 GB 3080 errors that occurred so far today. Will check the 3070 cards shortly.

Skip
Joined: 13 Jul 09 · Posts: 64 · Credit: 2,922,790,120 · RAC: 98
Thanks... as I suspected, and this is my most common error now. Along with these, which I'm thinking are also memory-related, just from a different point in the process... same situation without having reached the cap limit shown.

https://www.gpugrid.net/result.php?resultid=33962293

Skip
Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423
Between your systems and mine, looking at the error rates:

~23% of tasks need more than 8 GB
~17% of tasks need more than 10 GB
~4% of tasks need more than 12 GB
<1% of tasks need more than 16 GB

Me personally, I wouldn't run these (as they are now) with less than 12 GB of VRAM.
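Taking those rough tiers at face value, a quick back-of-the-envelope sketch (percentages copied from the post above; treating them as cumulative is my reading, not project data):

```python
# Estimated fraction of tasks needing MORE than each VRAM size (from the post).
need_more = {8: 0.23, 10: 0.17, 12: 0.04, 16: 0.01}

for vram_gb, frac in sorted(need_more.items()):
    # A card with vram_gb of memory should fit everything except that fraction.
    print(f"{vram_gb:>2} GB card -> roughly {1 - frac:.0%} of tasks expected to fit")
```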