PYSCFbeta: Quantum chemistry calculations on GPU
Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423
"Ian, are you saying that even after you've set DCF to a low value in the client_state file, it is still escalating?"

My DCF was set to about 0.01, and my tasks were estimating that they would take 27 hours each to complete. I changed the DCF to 0.0001, and that changed the estimate to about 16 minutes each. Then, after a short time, I noticed that the time-to-completion estimate was going up again, eventually reaching 27 hours. I checked DCF and it's back to 0.01.
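For reference, the value being edited lives in the <project> block of the BOINC client_state.xml. A minimal excerpt as a sketch (surrounding elements elided, the 0.01 value illustrative):

```xml
<project>
    <master_url>https://www.gpugrid.net/</master_url>
    ...
    <duration_correction_factor>0.010000</duration_correction_factor>
</project>
```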
ServicEnginIC · Joined: 24 Sep 10 · Posts: 592 · Credit: 11,972,186,510 · RAC: 1,447
"First, it’s well known at this point that these tasks require a lot of VRAM, so some failures are to be expected from that. The VRAM utilization is not constant, but spikes up and down. From the tasks running on my systems, loading up to 5-6 GB and staying around that amount is pretty normal, with intermittent spikes into the 9-12 GB+ range. Just by looking at the failure rate of different GPUs, I'm estimating that most tasks need more than 8 GB (>70%), a small share need more than 12 GB (~5%), and a very small number need even more than 16 GB (<1%). A teammate of mine is running on a couple of 2080 Tis (11 GB) and has had some failures but mostly success."

As you suggested in a previous post, VRAM utilization seems to be tied to each particular model of graphics card / GPU. GPUs with fewer CUDA cores seem to use less VRAM. My GTX 1650 GPUs have 896 CUDA cores and 4 GB VRAM. My GTX 1650 SUPER GPU has 1280 CUDA cores and 4 GB VRAM. My GTX 1660 Ti GPU has 1536 CUDA cores and 6 GB VRAM. These cards are currently achieving an overall success rate of 44% on PYSCFbeta (676 valid versus 856 errored tasks at the time of writing). Not all the errors were due to memory overflows; some were due to non-viable WUs or other causes, but digging into that would take too much time... Processing ATMbeta tasks, success was pretty close to 100%.
Joined: 13 Jul 09 · Posts: 64 · Credit: 2,922,790,120 · RAC: 98
"well i did switch all my computers to linux. even personal ones. the only windows system I have is my work provided laptop. but i could do everything i need on a linux laptop. WINE runs a lot of things these days."

I'm trying to suppress my grinning at this upside-down world... having retired my last Windoze box some time ago.

On a more germane note... Between this:

CUDA Error of GINTint2e_jk_kernel: out of memory
https://www.gpugrid.net/result.php?resultid=33956113

and this:

cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 535,127,040 bytes (allocated so far: 4,278,332,416 bytes, limit set to: 9,443,495,116 bytes).
https://www.gpugrid.net/result.php?resultid=33955488

Does the fact that the reported error (my most common error on RTX 3080 12G cards) seems to say it never actually reached the imposed limit, yet still failed, mean anything to anyone? I am ASSUMING that this refers to the memory on the 12G video card?

And for what it's worth, best I can tell I'm getting a lower error percentage on my RTX 3070 8GB cards since I backed off the sclk/mclk clocks.

Skip - da shu @ HeliOS, "A child's exposure to technology should never be predicated on an ability to afford it."
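The "limit set to:" figure in that traceback looks like a CuPy memory-pool cap. A minimal sketch of how such a cap interacts with actual device memory (byte value copied from the posted error; this is not the project's actual code):

```python
import cupy

# A CuPy pool limit only bounds this process's allocations; an allocation
# can still fail well below the cap if the device itself has no free memory
# left (other processes, fragmentation).
pool = cupy.get_default_memory_pool()
pool.set_limit(size=9_443_495_116)          # the "limit set to" figure (~8.8 GiB)

print("pool limit bytes:", pool.get_limit())
print("pool used bytes :", pool.used_bytes())

free, total = cupy.cuda.runtime.memGetInfo()  # what the device really has left
print(f"device free/total: {free} / {total} bytes")
```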
Joined: 8 Oct 16 · Posts: 27 · Credit: 4,153,801,869 · RAC: 0
Seems to me that your 3080 is the 10G version instead of 12G?
Joined: 27 May 21 · Posts: 54 · Credit: 1,004,151,720 · RAC: 0
I'm clearly in the presence of passionate Linux believers here... :-)

"Between this: CUDA Error of GINTint2e_jk_kernel: out of memory ..."

It does refer to video memory, but the limit each WU sets possibly doesn't take into account other processes allocating video memory. That would especially be an issue, I think, if you run multiple WUs in parallel. Try executing nvidia-smi to see which processes allocate how much video memory:
svennemans@PCSLLINUX01:~$ nvidia-smi
Sun Feb 11 17:29:48 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05 Driver Version: 535.154.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce GTX 1080 Ti Off | 00000000:01:00.0 On | N/A |
| 47% 71C P2 179W / 275W | 6449MiB / 11264MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1611 G /usr/lib/xorg/Xorg 534MiB |
| 0 N/A N/A 1801 G /usr/bin/gnome-shell 75MiB |
| 0 N/A N/A 9616 G boincmgr 2MiB |
| 0 N/A N/A 9665 G ...gnu/webkit2gtk-4.0/WebKitWebProcess 12MiB |
| 0 N/A N/A 27480 G ...38,262144 --variations-seed-version 125MiB |
| 0 N/A N/A 46332 G gnome-control-center 2MiB |
| 0 N/A N/A 47110 C python 5562MiB |
+---------------------------------------------------------------------------------------+
My one running WU has allocated 5.5 GB, but with the other running processes the total allocated is 6.4 GB. It depends on the implementation whether the limit is calculated from the total CUDA memory or from the actually free CUDA memory, and whether that limit is set only once at the start or updated multiple times. |
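To catch the short spikes a glance at nvidia-smi can miss, a polling sketch like the following could help (uses the nvidia-ml-py / pynvml bindings; GPU index and poll interval are arbitrary choices, and a 1-second poll can still miss sub-second spikes):

```python
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumes the card is GPU 0

peak = {}  # pid -> highest VRAM usage seen, in bytes
try:
    while True:
        for p in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
            if p.usedGpuMemory is not None:    # can be None on some drivers
                peak[p.pid] = max(peak.get(p.pid, 0), p.usedGpuMemory)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"total used: {mem.used / 2**20:.0f} / {mem.total / 2**20:.0f} MiB;",
              "peaks:", {pid: f"{b / 2**20:.0f} MiB" for pid, b in peak.items()})
        time.sleep(1)
except KeyboardInterrupt:
    pynvml.nvmlShutdown()
```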
Joined: 13 Jul 09 · Posts: 64 · Credit: 2,922,790,120 · RAC: 98
Good point about the other stuff on the card... right this minute it's taking a break from GPUGRID to do a Meerkat Burp7... I usually have "watch -t -n 8 nvidia-smi" running on this box if I'm poking around. I'll capture a shot of it as soon as GPUGRID comes back up if any of the listed usage changes significantly; I don't think it will. While the 'cuda_1222' task is running, I see a total of ~286 MB of 'other stuff', if my 'ciphering' is right.

Skip

PS: GPUGRID WUs are all 1x here.
PPS: Yes, it's the 10G version!
PPPS: Also, my ad hoc perception of error rates was wrong... working on that.
Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 891
I believe the lowest value that DCF can take in the client_state file is 0.01. I found that in the code someplace, sometime.
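If that's right, the snap-back described above would be expected: a hand-edited value below the floor gets pulled back up as soon as the client recalculates. A toy sketch of that kind of update (illustrative only, with a made-up smoothing factor; not the actual BOINC client code):

```python
def update_dcf(dcf: float, actual_runtime: float, estimated_runtime: float) -> float:
    """Illustrative DCF update: drift toward the observed runtime ratio,
    then clamp to a fixed range. Not the verbatim BOINC implementation."""
    ratio = actual_runtime / estimated_runtime  # >1 means tasks run long
    dcf += 0.1 * (ratio - dcf)                  # made-up smoothing factor
    return min(max(dcf, 0.01), 100.0)           # 0.01 floor the posts mention

# A value edited to 0.0001 climbs back above 0.01 after a single update:
print(update_dcf(0.0001, actual_runtime=27 * 3600, estimated_runtime=16 * 60))
```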
Joined: 15 Jul 20 · Posts: 95 · Credit: 2,550,803,412 · RAC: 248
Hello, apparently it now works on my two GPUs, a GTX 1650 and an RTX 4060. I haven't had any computation errors.
Joined: 21 Dec 23 · Posts: 51 · Credit: 0 · RAC: 0
Hello,

Yes, I would not expect the app to work on WSL. There are many Linux-specific libraries in the packaged Python environment that is the "app".

Thank you for the feedback regarding the failure rate. As I mentioned, different WUs require different amounts of memory, which is hard to check before they start crunching. From my viewpoint the failure rates are low enough that all WUs seem to succeed within a few retries. This is still a "Beta" app.

We definitely want a Windows app and it is in the pipeline. However, as I mentioned before, developing it is time consuming. Several of the underlying code bases are Linux-only at the moment, so a Windows app requires a Windows port of some code.
Joined: 27 May 21 · Posts: 54 · Credit: 1,004,151,720 · RAC: 0
"Yes, I would not expect the app to work on WSL. There are many Linux-specific libraries in the packaged Python environment that is the 'app'."

Actually, it *should* work, since WSL2 is sold as a native Linux kernel running in a virtual environment with full system-call compatibility. So one could reasonably expect any native Linux libraries to work as expected. However, there are obviously still a few issues to iron out. Not by GPUGRID, to be clear - by Microsoft.
Joined: 27 May 21 · Posts: 54 · Credit: 1,004,151,720 · RAC: 0
I'm seeing a bunch of checksum errors during unzip, anyone else have this problem?

https://www.gpugrid.net/results.php?hostid=617834&offset=0&show_names=0&state=5&appid=

Stderr output:

<core_client_version>7.20.5</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>
11:26:18 (177385): wrapper (7.7.26016): starting
lib/libcufft.so.10.9.0.58 bad CRC e458474a (should be 0a867ac2)
boinc_unzip() error: 2
</stderr_txt>
]]>

The workunits seem to all run fine on a subsequent host.
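For anyone who wants to reproduce that check locally, Python's standard zipfile module can re-verify every member's CRC; the archive path below is hypothetical:

```python
import zipfile

# testzip() re-reads every member and returns the name of the first one whose
# CRC fails, or None if the archive is intact. File name is hypothetical.
with zipfile.ZipFile("gpugrid_app_payload.zip") as zf:
    bad = zf.testzip()
    print("all CRCs OK" if bad is None else f"bad CRC in member: {bad}")
```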
Joined: 15 Jul 20 · Posts: 95 · Credit: 2,550,803,412 · RAC: 248
Good afternoon, when will Windows tasks be ready for testing? Frankly, Linux is rotten: after an update, LHC@home no longer works. I'm staying on Linux for you, but I can't wait to get back to good old Windows. Thanks
Joined: 27 Aug 21 · Posts: 38 · Credit: 7,254,068,306 · RAC: 0
Maybe try a different version of Linux. I have always used Windows (and still do on some systems) but use Linux Mint on others. It's really user friendly and has a very similar feel to Windows.
Joined: 12 Jul 17 · Posts: 404 · Credit: 17,408,899,587 · RAC: 0
"Between this: CUDA Error of GINTint2e_jk_kernel: out of memory ..."

Sometimes I get the same error on my 3080 10 GB card. E.g., https://www.gpugrid.net/result.php?resultid=33960422

Headless computer with a single 3080, running 1C + 1N.
Joined: 12 Jul 17 · Posts: 404 · Credit: 17,408,899,587 · RAC: 0
"I believe the lowest value that DCF can be in the client_state file is 0.01."

Zoltan posted long ago that BOINC does not understand zero, and 0.01 is as close as it can get. I wonder if that was someone's approach to fixing a division-by-zero problem in antiquity.
Joined: 13 Jul 09 · Posts: 64 · Credit: 2,922,790,120 · RAC: 98
After logging error rates for a few days across 5 boxes with Nvidia cards (all RTX 30x0, all Linux Mint 2x.3), trying to stay aware of what I was doing on the main desktop while 'python' was running, and cutting back the sclk/mclk clocks, the average error rate is dropping. The last cut shows it at 23.44% across the 5 boxes, averaged over 28 hours.

No longer any segfault (0x8b) errors; all are 0x1 now. The last one was on the most troublesome of the 3070 cards.

https://www.gpugrid.net/result.php?resultid=33950656

Anything I can do to help with this type of error?

Skip
Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423
It's still an out-of-memory error. A little further up, the error log shows this:

"CUDA Error of GINTint2e_jk_kernel: out of memory"

So it's probably just running out of memory at a different stage of the task, producing a slightly different error, but it's still an issue of not enough memory.
Joined: 13 Jul 09 · Posts: 64 · Credit: 2,922,790,120 · RAC: 98
"I'm seeing a bunch of checksum errors during unzip, anyone else have this problem?"

I didn't find any of these in the 10 GB 3080 errors that occurred so far today. Will check the 3070 cards shortly.

Skip
Joined: 13 Jul 09 · Posts: 64 · Credit: 2,922,790,120 · RAC: 98
Thanks... as I suspected, and this is my most common error now. Along with these, which I'm thinking are also memory-related, just from a different point in the process... same situation without having reached the cap limit shown.

https://www.gpugrid.net/result.php?resultid=33962293

Skip
Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423
Between your systems and mine, looking at the error rates:

~23% of tasks need more than 8 GB
~17% of tasks need more than 10 GB
~4% of tasks need more than 12 GB
<1% of tasks need more than 16 GB

Me personally, I wouldn't run these (as they are now) with less than 12 GB of VRAM.
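Taking those rough tiers at face value, a quick back-of-the-envelope sketch (percentages copied from the post above; treating them as cumulative is my reading, not project data):

```python
# Estimated fraction of tasks needing MORE than each VRAM size (from the post).
need_more = {8: 0.23, 10: 0.17, 12: 0.04, 16: 0.01}

for vram_gb, frac in sorted(need_more.items()):
    # A card with vram_gb of memory should fit everything except that fraction.
    print(f"{vram_gb:>2} GB card -> roughly {1 - frac:.0%} of tasks expected to fit")
```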