PYSCFbeta: Quantum chemistry calculations on GPU

Ian&Steve C.

Message 61237 - Posted: 11 Feb 2024, 3:30:21 UTC - in response to Message 61236.  

Ian, are you saying that even after you've set DCF to a low value in the client_state file that it is still escalating?

I set mine to 0.02 a month ago and it is still hanging around there now that I looked at the hosts here.


My DCF was set to about 0.01, and my tasks were estimating that they would take 27 hrs each to complete.

I changed the DCF to 0.0001, and that changed the estimate to about 16 mins each.

Then after a short time I noticed that the time-to-completion estimate was climbing again, eventually reaching 27 hrs. I checked DCF and it's back to 0.01.
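
For anyone who wants to see what the client is actually carrying, the value lives in the per-project duration_correction_factor element of client_state.xml. A minimal Python sketch to print it, assuming the stock Linux data directory (adjust the path for your install); the 0.01 floor is the one discussed further down the thread:

import xml.etree.ElementTree as ET

# Assumed default BOINC data directory on Linux; adjust for your install.
STATE_FILE = "/var/lib/boinc-client/client_state.xml"

tree = ET.parse(STATE_FILE)
for project in tree.getroot().iter("project"):
    name = project.findtext("project_name", default="?")
    dcf = float(project.findtext("duration_correction_factor", default="1.0"))
    # Runtime estimates scale roughly with DCF, and the client reportedly
    # clamps it to a floor of 0.01, so anything written below that creeps back up.
    print(f"{name}: DCF = {dcf:g}, effective floor = {max(dcf, 0.01):g}")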

ServicEnginIC
Message 61238 - Posted: 11 Feb 2024, 15:35:28 UTC - in response to Message 61223.  
Last modified: 11 Feb 2024, 15:46:18 UTC

First, it’s well known at this point that these tasks require a lot of VRAM. So some failures are to be expected from that. The VRAM utilization is not constant, but spikes up and down. From the tasks running on my systems, loading up to 5-6GB and staying around that amount is pretty normal, with intermittent spikes to the 9-12GB+ range occasionally. Just by looking at the failure rate of different GPUs, I’m estimating that most tasks need more than 8GB (>70%), a small amount of tasks need more than 12GB (~5%), and a very small number of them need even more than 16GB (<1%). A teammate of mine is running on a couple 2080Tis (11GB) and has had some failures but mostly success.

As you suggested in a previous post, VRAM utilization seems to depend on the particular model of graphics card / GPU.
GPUs with fewer CUDA cores available seem to use a smaller amount of VRAM.
My GTX 1650 GPUs have 896 CUDA cores and 4 GB VRAM.
My GTX 1650 SUPER GPU has 1280 CUDA cores and 4 GB VRAM.
My GTX 1660 Ti GPU has 1536 CUDA cores and 6 GB VRAM.
These cards are currently achieving an overall success rate of 44% on PYSCFbeta (676 valid versus 856 errored tasks at the time of writing).
Not all the errors were due to memory overflows; some were due to non-viable WUs or other causes, but digging deeper into this would take too much time...
Processing ATMbeta tasks, success was pretty close to 100%.

Skip Da Shu
Message 61239 - Posted: 11 Feb 2024, 15:58:36 UTC - in response to Message 61235.  
Last modified: 11 Feb 2024, 16:04:45 UTC

well i did switch all my computers to linux. even personal ones. the only windows system I have is my work provided laptop. but i could do everything i need on a linux laptop. WINE runs a lot of things these days.

yeah i don't know what's wrong with DCF. mine goes crazy shortly after i fix it also. says my tasks will take like 27 days even though most are done in 5-10 mins.


I'm trying to suppress my grinning at this upside down world... having retired my last Windoze box some time ago.

On a more germane note...

Between this
CUDA Error of GINTint2e_jk_kernel: out of memory

https://www.gpugrid.net/result.php?resultid=33956113

and this...

Does the fact that the reported error (my most common error on RTX 3080 12G cards) seems to say it never actually reached the imposed limit, yet still failed, mean anything to anyone?

I am ASSUMING that this is referring to the memory on the 12G vid card?

cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 535,127,040 bytes (allocated so far: 4,278,332,416 bytes, limit set to: 9,443,495,116 bytes).


https://www.gpugrid.net/result.php?resultid=33955488


And for what it's worth, best I can tell I'm getting a lower error % on my RTX 3070 8GB cards since I backed off the sclk/mclk clocks.

Skip
- da shu @ HeliOS,
"A child's exposure to technology should never be predicated on an ability to afford it."

pututu
Message 61240 - Posted: 11 Feb 2024, 16:23:18 UTC - in response to Message 61239.  

well i did switch all my computers to linux. even personal ones. the only windows system I have is my work provided laptop. but i could do everything i need on a linux laptop. WINE runs a lot of things these days.

yeah i don't know what's wrong with DCF. mine goes crazy shortly after i fix it also. says my tasks will take like 27 days even though most are done in 5-10 mins.


I'm trying to suppress my grinning at this upside down world... having retired my last Windoze box some time ago.

On a more germane note...

Between this
CUDA Error of GINTint2e_jk_kernel: out of memory

https://www.gpugrid.net/result.php?resultid=33956113

and this...

Does the fact that the reported error (my most common error on RTX 3080 12G cards) seems to say it never actually reached the imposed limit, yet still failed, mean anything to anyone?

I am ASSUMING that this is referring to the memory on the 12G vid card?

cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 535,127,040 bytes (allocated so far: 4,278,332,416 bytes, limit set to: 9,443,495,116 bytes).


https://www.gpugrid.net/result.php?resultid=33955488


And for what it's worth best I can tell I'm getting a lower error % on my RTX3070 8GB cards once I backed off the sclk/mclk clocks.

Skip

Seems to me that your 3080 is the 10G version instead of 12G?

[BAT] Svennemans
Message 61241 - Posted: 11 Feb 2024, 16:52:25 UTC - in response to Message 61239.  

well i did switch all my computers to linux. even personal ones. the only windows system I have is my work provided laptop. but i could do everything i need on a linux laptop. WINE runs a lot of things these days.


I'm trying to suppress my grinning at this upside down world... having retired my last Windoze box some time ago.


I'm clearly in the presence of passionate Linux believers here... :-)


Between this
CUDA Error of GINTint2e_jk_kernel: out of memory

https://www.gpugrid.net/result.php?resultid=33956113

and this...

Does the fact that the reported error (my most common error on RTX 3080 12G cards) seems to say it never actually reached the imposed limit, yet still failed, mean anything to anyone?

I am ASSUMING that this is referring to the memory on the 12G vid card?

cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 535,127,040 bytes (allocated so far: 4,278,332,416 bytes, limit set to: 9,443,495,116 bytes).


https://www.gpugrid.net/result.php?resultid=33955488



And for what it's worth best I can tell I'm getting a lower error % on my RTX3070 8GB cards once I backed off the sclk/mclk clocks.

Skip


It does refer to video memory, but the limit each WU sets possibly doesn't take into account other processes that are also allocating video memory. That would especially be an issue, I think, if you run multiple WUs in parallel.
Try executing nvidia-smi to see which processes allocate how much video memory:

svennemans@PCSLLINUX01:~$ nvidia-smi
Sun Feb 11 17:29:48 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1080 Ti     Off | 00000000:01:00.0  On |                  N/A |
| 47%   71C    P2             179W / 275W |   6449MiB / 11264MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1611      G   /usr/lib/xorg/Xorg                          534MiB |
|    0   N/A  N/A      1801      G   /usr/bin/gnome-shell                         75MiB |
|    0   N/A  N/A      9616      G   boincmgr                                      2MiB |
|    0   N/A  N/A      9665      G   ...gnu/webkit2gtk-4.0/WebKitWebProcess       12MiB |
|    0   N/A  N/A     27480      G   ...38,262144 --variations-seed-version      125MiB |
|    0   N/A  N/A     46332      G   gnome-control-center                          2MiB |
|    0   N/A  N/A     47110      C   python                                     5562MiB |
+---------------------------------------------------------------------------------------+


My one running WU has allocated 5.5 GB, but with the other running processes the total allocated is 6.4 GB.
It depends on the implementation whether the limit is calculated from the total CUDA memory or from the memory actually free, and whether that limit is set only once at the start or updated several times during the run.
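
For what it's worth, CuPy exposes both numbers, so you can see on your own card how a pool limit would compare against total versus currently-free memory. A minimal sketch using CuPy's public API; whether the GPUGRID app derives its limit this way (or what fraction it uses) is an assumption, not something visible in the task logs:

import cupy

free_b, total_b = cupy.cuda.runtime.memGetInfo()  # bytes free / total on the current device
pool = cupy.get_default_memory_pool()

# Hypothetical policy: cap the pool at 90% of either baseline.
print(f"total = {total_b/2**30:.1f} GiB, free = {free_b/2**30:.1f} GiB")
print(f"90% of total: {int(0.9 * total_b):,} bytes")
print(f"90% of free : {int(0.9 * free_b):,} bytes")

# When a pool limit is set and exceeded, CuPy raises OutOfMemoryError with a
# "limit set to" figure like the one quoted above.
pool.set_limit(size=int(0.9 * free_b))
print("pool limit now:", pool.get_limit())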


Skip Da Shu
Message 61242 - Posted: 11 Feb 2024, 17:38:59 UTC - in response to Message 61241.  
Last modified: 11 Feb 2024, 17:59:48 UTC

Good point about the other stuff on the card... right this minute it's taking a break from GPUGRID to do a Meerkat Burp7...

I usually have "watch -t -n 8 nvidia-smi" running on this box if I'm poking around. I'll capture a shot of it as soon as GPUGRID comes back up, if anything listed below changes significantly. I don't think it will.

While the 'cuda_1222' is running I see a total ~286MB of 'other stuff' if my 'ciphering' is right:


/usr/lib/xorg/Xorg                        153MiB
cinnamon                                   18MiB
...gnu/webkit2gtk-4.0/WebKitWebProcess     12MiB  (Boincmgr)
/usr/lib/firefox/firefox                  103MiB  (because I'm reading/posting)
...inary_x86_64-pc-linux-gnu__cuda1222    776MiB  (the only compute task)


Skip

PS: GPUGRID WUs are all 1x here.
PPS: Yes, it's the 10G version!
PPPS: Also, my ad hoc perception of error rates was wrong... working on that.
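
The manual 'ciphering' above can be automated. A minimal sketch, assuming the nvidia-ml-py (pynvml) bindings are installed; it splits per-process VRAM on GPU 0 into compute tasks versus graphics "other stuff", the same numbers nvidia-smi reports:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0; adjust the index as needed

def total_mib(procs):
    # usedGpuMemory can be None on some driver / permission combinations
    return sum(p.usedGpuMemory or 0 for p in procs) / 2**20

compute = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
graphics = pynvml.nvmlDeviceGetGraphicsRunningProcesses(handle)
print(f"compute tasks   : {total_mib(compute):.0f} MiB across {len(compute)} processes")
print(f"other (graphics): {total_mib(graphics):.0f} MiB across {len(graphics)} processes")

pynvml.nvmlShutdown()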

Keith Myers
Message 61243 - Posted: 11 Feb 2024, 19:14:30 UTC - in response to Message 61237.  

I believe the lowest value that DCF can take in the client_state file is 0.01.

I found that in the code someplace, sometime.

Pascal
Message 61244 - Posted: 12 Feb 2024, 9:39:40 UTC

Hello, apparently it now works on both of my GPUs, a GTX 1650 and an RTX 4060.
I have not had any computation errors.

Steve
Volunteer moderator · Project administrator · Project developer · Project tester · Volunteer developer · Volunteer tester · Project scientist
Message 61245 - Posted: 12 Feb 2024, 11:02:45 UTC - in response to Message 61244.  

Hello,

Yes, I would not expect the app to work on WSL. There are many Linux-specific libraries in the packaged Python environment that is the "app".

Thank you for the feedback regarding the failure rate. As I mentioned, different WUs require different amounts of memory, which is hard to check before they start crunching. From my viewpoint the failure rates are low enough that all WUs seem to succeed within a few retries. This is still a "Beta" app.

We definitely want a Windows app and it is in the pipeline. However, as I mentioned before, its development is time consuming. Several of the underlying code bases are Linux-only at the moment, so a Windows app requires a Windows port of some code.

[BAT] Svennemans
Message 61246 - Posted: 12 Feb 2024, 12:25:22 UTC - in response to Message 61245.  

Yes, I would not expect the app to work on WSL. There are many Linux-specific libraries in the packaged Python environment that is the "app".



Actually, it *should* work, since WSL2 is sold as a native Linux kernel running in a virtual environment with full system-call compatibility.
So one could reasonably expect any native Linux libraries to work as expected.

However, there are obviously still a few issues to iron out.

Not by GPUGRID, to be clear, but by Microsoft.

[BAT] Svennemans
Message 61252 - Posted: 13 Feb 2024, 11:23:35 UTC

I'm seeing a bunch of checksum errors during unzip; is anyone else having this problem?

https://www.gpugrid.net/results.php?hostid=617834&offset=0&show_names=0&state=5&appid=

Stderr output
<core_client_version>7.20.5</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)</message>
<stderr_txt>
11:26:18 (177385): wrapper (7.7.26016): starting
lib/libcufft.so.10.9.0.58  bad CRC e458474a  (should be 0a867ac2)
boinc_unzip() error: 2

</stderr_txt>
]]>


The workunits all seem to run fine on a subsequent host.
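
If anyone wants to check whether the file on disk really is corrupt, the CRC-32 that unzip complains about is easy to recompute. A minimal sketch; the path below is a hypothetical example, so point it at wherever lib/libcufft.so.10.9.0.58 actually landed in your BOINC project or slot directory:

import zlib

# Hypothetical example path; adjust to your BOINC project/slot directory.
PATH = "/var/lib/boinc-client/projects/www.gpugrid.net/lib/libcufft.so.10.9.0.58"
EXPECTED = 0x0A867AC2  # the "should be" value from the stderr output above

crc = 0
with open(PATH, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        crc = zlib.crc32(chunk, crc)

print(f"computed CRC32 {crc:08x}, expected {EXPECTED:08x}")
print("match" if crc == EXPECTED else "mismatch: file is corrupt or truncated")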

Pascal
Message 61260 - Posted: 13 Feb 2024, 14:13:19 UTC

Good afternoon,
when will the Windows tasks be ready for testing?
Frankly, Linux is rotten.
After an update, LHC@home no longer works. I'm staying on Linux for you, but I can't wait to get back to good old Windows.
Thanks

Boca Raton Community HS
Message 61261 - Posted: 13 Feb 2024, 16:15:31 UTC - in response to Message 61260.  

Good afternoon,
when will the Windows tasks be ready for testing?
Frankly, Linux is rotten.
After an update, LHC@home no longer works. I'm staying on Linux for you, but I can't wait to get back to good old Windows.
Thanks


Maybe try a different distribution. I have always used Windows (and still do on some systems) but use Linux Mint on others. It's really user friendly, with a very similar feel to Windows.

Aurum
Message 61269 - Posted: 14 Feb 2024, 15:59:06 UTC - in response to Message 61239.  
Last modified: 14 Feb 2024, 16:00:55 UTC

Between this
CUDA Error of GINTint2e_jk_kernel: out of memory

https://www.gpugrid.net/result.php?resultid=33956113

and this...

Does the fact that the reported error (my most common error on RTX 3080 12G cards) seems to say it never actually reached the imposed limit, yet still failed, mean anything to anyone?

I am ASSUMING that this is referring to the memory on the 12G vid card?

cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 535,127,040 bytes (allocated so far: 4,278,332,416 bytes, limit set to: 9,443,495,116 bytes).


https://www.gpugrid.net/result.php?resultid=33955488

Sometimes I get the same error on my 3080 10 GB Card. E.g., https://www.gpugrid.net/result.php?resultid=33960422
Headless computer with a single 3080 running 1C + 1N.

Aurum
Message 61270 - Posted: 14 Feb 2024, 16:04:55 UTC - in response to Message 61243.  

I believe the lowest value that DCF can be in the client_state file is 0.01

Found that in the code someplace, sometime

Zoltan posted long ago that BOINC does not understand zero, and 0.01 is as close as it can get. I wonder if that was someone's approach to fixing a division-by-zero problem in antiquity.

Skip Da Shu
Message 61272 - Posted: 14 Feb 2024, 16:20:02 UTC - in response to Message 61242.  
Last modified: 14 Feb 2024, 16:21:50 UTC

...
Skip

PS: GPUGRID WUs are all 1x here.
PPS: Yes, it's the 10G version!
PPPS: Also my adhoc perception of error rates was wrong... working on that.


After logging error rates for a few days across 5 boxes with Nvidia cards (all RTX 30x0, all Linux Mint v2x.3), being more careful about what I was doing on the main desktop while 'python' was running, and cutting back the sclk/mclk clocks a bit, the average error rate is dropping. The last cut shows it at 23.44% across the 5 boxes, averaged over 28 hours.

No longer any segfault 0x8b errors, all 0x1. The last one was on the most troublesome of the 3070 cards.

https://www.gpugrid.net/result.php?resultid=33950656

Anything I can do to help with this type of error?

Skip

Ian&Steve C.
Message 61273 - Posted: 14 Feb 2024, 16:29:14 UTC - in response to Message 61272.  

...
Skip

PS: GPUGRID WUs are all 1x here.
PPS: Yes, it's the 10G version!
PPPS: Also my adhoc perception of error rates was wrong... working on that.


After logging error rates for a few days across 5 boxes with Nvidia cards (all RTX 30x0, all Linux Mint v2x.3), being more careful about what I was doing on the main desktop while 'python' was running, and cutting back the sclk/mclk clocks a bit, the average error rate is dropping. The last cut shows it at 23.44% across the 5 boxes, averaged over 28 hours.

No longer any segfault 0x8b errors, all 0x1. The last one was on the most troublesome of the 3070 cards.

https://www.gpugrid.net/result.php?resultid=33950656

Anything I can do to help with this type of error?

Skip


It's still an out-of-memory error. A little further up, the error log shows this:
"CUDA Error of GINTint2e_jk_kernel: out of memory"

So it's probably just running out of memory at a different stage of the task, producing a slightly different error, but it is still a case of not enough memory.


Skip Da Shu
Message 61274 - Posted: 14 Feb 2024, 16:44:55 UTC - in response to Message 61252.  

I'm seeing a bunch of checksum errors during unzip, anyone else have this problem?

https://www.gpugrid.net/results.php?hostid=617834&offset=0&show_names=0&state=5&appid=

Stderr output
<core_client_version>7.20.5</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)</message>
<stderr_txt>
11:26:18 (177385): wrapper (7.7.26016): starting
lib/libcufft.so.10.9.0.58  bad CRC e458474a  (should be 0a867ac2)
boinc_unzip() error: 2

</stderr_txt>
]]>


The workunits seem to all run fine on a subsequent host.


I didn't find any of these in the 10GB 3080 errors that occurred so far today. Will check the 3070 cards shortly.

Skip

Skip Da Shu
Message 61275 - Posted: 14 Feb 2024, 16:53:42 UTC - in response to Message 61273.  


It's still an out-of-memory error. A little further up, the error log shows this:
"CUDA Error of GINTint2e_jk_kernel: out of memory"

So it's probably just running out of memory at a different stage of the task, producing a slightly different error, but it is still a case of not enough memory.


Thanx... as I suspected, and this is my most common error now.

Along with these, which I'm thinking are also memory related, just from a different point in the process... same situation, without having reached the cap limit shown.

https://www.gpugrid.net/result.php?resultid=33962293

Skip

Ian&Steve C.
Message 61276 - Posted: 14 Feb 2024, 16:58:25 UTC - in response to Message 61275.  
Last modified: 14 Feb 2024, 16:58:55 UTC

Between your systems and mine, looking at the error rates:

~23% of tasks need more than 8GB
~17% of tasks need more than 10GB
~4% of tasks need more than 12GB
<1% of tasks need more than 16GB

Personally, I wouldn't run these (as they are now) with less than 12 GB of VRAM.
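
Read the other way around, those fractions give a rough expected failure rate per card size. A minimal sketch of that reading; the numbers are just the eyeballed estimates from this post, not measured per-task requirements:

# Estimated fraction of tasks that need more than the given amount of VRAM (GB),
# taken from the observations above.
NEEDS_MORE_THAN = {8: 0.23, 10: 0.17, 12: 0.04, 16: 0.01}

def expected_failure_rate(vram_gb):
    """Rough fraction of tasks expected to hit out-of-memory on a card of this size."""
    cleared = [frac for gb, frac in NEEDS_MORE_THAN.items() if vram_gb >= gb]
    return min(cleared) if cleared else 1.0  # below the smallest bucket: no data, assume worst

for gb in (8, 10, 11, 12, 16, 24):
    print(f"{gb:>2} GB card: ~{expected_failure_rate(gb):.0%} of tasks expected to fail")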