PYSCFbeta: Quantum chemistry calculations on GPU

Message boards : News : PYSCFbeta: Quantum chemistry calculations on GPU

pututu
Joined: 8 Oct 16 · Posts: 27 · Credit: 4,153,801,869 · RAC: 0
Message 61195 - Posted: 5 Feb 2024, 15:09:41 UTC

There are some tasks that spike over 10 GB. It seems nvidia-smi doesn't allow a logging interval shorter than 1 s. Does anyone have a workaround? The momentary spikes are likely even higher than the 10 GB recorded. (One possible NVML-based approach is sketched after the log below.)

2024/02/05 07:06:39.675, 88 %, 1328 MHz, 5147 MiB, 115.28 W, 65
2024/02/05 07:06:40.678, 96 %, 1278 MHz, 5147 MiB, 117.58 W, 65
2024/02/05 07:06:41.688, 100 %, 1328 MHz, 5177 MiB, 111.94 W, 65
2024/02/05 07:06:42.691, 100 %, 1328 MHz, 6647 MiB, 70.23 W, 64
2024/02/05 07:06:43.694, 30 %, 1328 MHz, 8475 MiB, 69.65 W, 64
2024/02/05 07:06:44.697, 100 %, 1328 MHz, 9015 MiB, 81.81 W, 64
2024/02/05 07:06:45.700, 100 %, 1328 MHz, 9007 MiB, 46.32 W, 63
2024/02/05 07:06:46.705, 98 %, 1278 MHz, 9941 MiB, 46.08 W, 63
2024/02/05 07:06:47.708, 99 %, 1328 MHz, 10251 MiB, 57.06 W, 63
2024/02/05 07:06:48.711, 97 %, 1088 MHz, 4553 MiB, 133.72 W, 65
2024/02/05 07:06:49.714, 95 %, 1075 MHz, 4553 MiB, 132.99 W, 65
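
One possible workaround is to poll NVML directly from a small script rather than looping nvidia-smi. Below is a minimal sketch using the pynvml bindings (pip install nvidia-ml-py); it assumes GPU index 0 and a 100 ms sampling interval, so adjust both as needed.

# Sample GPU memory use more often than nvidia-smi's 1 s loop by calling NVML directly.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU index 0; change for another card

peak_mib = 0
try:
    while True:
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)   # mem.used is in bytes
        used_mib = mem.used // (1024 * 1024)
        peak_mib = max(peak_mib, used_mib)
        print(f"{time.strftime('%H:%M:%S')}  used={used_mib} MiB  peak={peak_mib} MiB")
        time.sleep(0.1)  # 100 ms between samples
except KeyboardInterrupt:
    print(f"Peak VRAM observed: {peak_mib} MiB")
    pynvml.nvmlShutdown()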
ID: 61195
pututu
Joined: 8 Oct 16 · Posts: 27 · Credit: 4,153,801,869 · RAC: 0
Message 61196 - Posted: 5 Feb 2024, 16:21:57 UTC - in response to Message 61195.  

Got a biggie. This one hit 14.6 GB. I'm running a 16 GB card, one task per GPU.

2024/02/05 08:20:03.043, 100 %, 1328 MHz, 9604 MiB, 107.19 W, 71
2024/02/05 08:20:04.046, 94 %, 1328 MHz, 11970 MiB, 97.69 W, 71
2024/02/05 08:20:05.049, 99 %, 1328 MHz, 12130 MiB, 123.24 W, 70
2024/02/05 08:20:06.052, 100 %, 1316 MHz, 12130 MiB, 122.21 W, 71
2024/02/05 08:20:07.055, 100 %, 1328 MHz, 12130 MiB, 121.26 W, 71
2024/02/05 08:20:08.058, 100 %, 1328 MHz, 12130 MiB, 118.64 W, 71
2024/02/05 08:20:09.061, 17 %, 1328 MHz, 12116 MiB, 56.48 W, 70
2024/02/05 08:20:10.064, 95 %, 1189 MHz, 14646 MiB, 73.99 W, 71
2024/02/05 08:20:11.071, 99 %, 1139 MHz, 14646 MiB, 194.84 W, 71
2024/02/05 08:20:12.078, 96 %, 1316 MHz, 14650 MiB, 65.82 W, 70
2024/02/05 08:20:13.081, 85 %, 1328 MHz, 8952 MiB, 84.32 W, 70
2024/02/05 08:20:14.084, 100 %, 1075 MHz, 8952 MiB, 130.53 W, 71
ID: 61196
Ian&Steve C.
Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423
Message 61197 - Posted: 5 Feb 2024, 16:35:36 UTC - in response to Message 61196.  
Last modified: 5 Feb 2024, 16:36:34 UTC

Yeah, I think you'll only ever see the spike if you actually have the VRAM for it. If you don't have enough, the task will error out before reaching it and you'll never see the peak.

I'm just going to deal with the errors; cost of doing business, lol. I have my system set for a 70% active thread percentage (ATP) through MPS.

QChem gpu_usage set to 0.55
ATMbeta gpu_usage set to 0.44

This way, when both task types are available, a GPU will run either ATMbeta+ATMbeta or ATMbeta+QChem, but never 2x QChem on the same GPU. I do this because ATMbeta uses a really small amount of GPU VRAM and can soak up some of the spare compute cycles without hurting QChem's VRAM headroom much. When only QChem is available and running 1x, the GPU isn't using quite all the compute it could (only 70%), so it may be a little slower, but Titan Vs are fast enough anyway: most tasks finish in about 6 minutes, with some outliers around 18 minutes.
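
For reference, a toy check of the arithmetic behind that pairing; as I understand it, BOINC only co-schedules GPU tasks while their gpu_usage fractions sum to 1.0 or less, and the fractions below are just the ones quoted above.

# Toy check of which task pairs fit on one GPU, given the gpu_usage fractions above.
from itertools import combinations_with_replacement

gpu_usage = {"QChem": 0.55, "ATMbeta": 0.44}

for a, b in combinations_with_replacement(gpu_usage, 2):
    total = gpu_usage[a] + gpu_usage[b]
    verdict = "fits" if total <= 1.0 else "does not fit"
    print(f"{a} + {b}: {total:.2f} -> {verdict}")
# QChem + QChem: 1.10 -> does not fit
# QChem + ATMbeta: 0.99 -> fits
# ATMbeta + ATMbeta: 0.88 -> fits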
ID: 61197
Boca Raton Community HS
Joined: 27 Aug 21 · Posts: 38 · Credit: 7,254,068,306 · RAC: 0
Message 61198 - Posted: 5 Feb 2024, 16:37:51 UTC - in response to Message 61196.  

pututu, have you had any failed tasks? Ian&Steve C. reports a ~10% failure rate with 12 GB, so I am curious about 16 GB. I am guessing that is about the minimum for error-free processing of the current work (as far as memory limitations go).
ID: 61198
Boca Raton Community HS
Joined: 27 Aug 21 · Posts: 38 · Credit: 7,254,068,306 · RAC: 0
Message 61199 - Posted: 5 Feb 2024, 16:53:51 UTC - in response to Message 61197.  



QChem gpu_usage set to 0.55
ATMbeta gpu_usage set to 0.44

We did this as well this morning for the 4090 GPUs, since they have 24 GB, but paired with E@H work: too little VRAM to run QChem at 2x, but too much compute left on the table running them at 1x.
ID: 61199
pututu
Joined: 8 Oct 16 · Posts: 27 · Credit: 4,153,801,869 · RAC: 0
Message 61200 - Posted: 5 Feb 2024, 17:02:55 UTC - in response to Message 61198.  
Last modified: 5 Feb 2024, 17:20:10 UTC

pututu, have you had any failed tasks? Ian&Steve C. reports a ~10% failure rate with 12 GB, so I am curious about 16 GB. I am guessing that is about the minimum for error-free processing of the current work (as far as memory limitations go).


Zero failures after 19 completed tasks on one P100 with 16 GB.

So far, 14.6 GB is the highest I've seen with 1-second-interval monitoring.

More than half of the tasks momentarily hit 8 GB or more. I didn't record any actual data, just watched nvidia-smi from time to time.

Edit: another task with more than 12 GB, but with an ominous 6666 MiB, lol
2024/02/05 09:17:58.869, 99 %, 1328 MHz, 10712 MiB, 131.69 W, 70
2024/02/05 09:17:59.872, 100 %, 1328 MHz, 10712 MiB, 101.87 W, 70
2024/02/05 09:18:00.877, 100 %, 1328 MHz, 10700 MiB, 50.15 W, 69
2024/02/05 09:18:01.880, 92 %, 1240 MHz, 11790 MiB, 54.34 W, 69
2024/02/05 09:18:02.883, 95 %, 1240 MHz, 12364 MiB, 53.20 W, 69
2024/02/05 09:18:03.886, 83 %, 1126 MHz, 6666 MiB, 137.77 W, 70
2024/02/05 09:18:04.889, 100 %, 1075 MHz, 6666 MiB, 130.53 W, 71
2024/02/05 09:18:05.892, 92 %, 1164 MHz, 6666 MiB, 129.84 W, 71
2024/02/05 09:18:06.902, 100 %, 1063 MHz, 6666 MiB, 129.82 W, 71
ID: 61200
Ian&Steve C.
Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423
Message 61201 - Posted: 6 Feb 2024, 2:51:01 UTC - in response to Message 61198.  

pututu, have you had any failed tasks? Ian&Steve C. reports a ~10% failure rate with 12 GB, so I am curious about 16 GB. I am guessing that is about the minimum for error-free processing of the current work (as far as memory limitations go).


I've been running all day across my 18x Titan Vs. The effective error rate is right around 5%, so 5% of the tasks needed more than 12 GB, running only 1 task per GPU.

I rented an A100 40GB for the day. Running 3x on this GPU with MPS set to 40%, it's done about 300 tasks and only 1 has failed from out of memory. The highest spike I saw was 39 GB, but it usually stays around 20 GB utilized.
ID: 61201
Boca Raton Community HS
Joined: 27 Aug 21 · Posts: 38 · Credit: 7,254,068,306 · RAC: 0
Message 61202 - Posted: 6 Feb 2024, 5:02:10 UTC - in response to Message 61201.  

pututu, have you had any failed tasks? Ian&Steve C. reports a ~10% failure rate with 12 GB, so I am curious about 16 GB. I am guessing that is about the minimum for error-free processing of the current work (as far as memory limitations go).

I've been running all day across my 18x Titan Vs. The effective error rate is right around 5%, so 5% of the tasks needed more than 12 GB, running only 1 task per GPU.

I rented an A100 40GB for the day. Running 3x on this GPU with MPS set to 40%, it's done about 300 tasks and only 1 has failed from out of memory. The highest spike I saw was 39 GB, but it usually stays around 20 GB utilized.



Wow, the A100 is powerful. I can't believe how fast it can chew through these (well, I can believe it, but it's still amazing). I am somewhat new to MPS and I understand the general concept, but what do you mean when you say it is set to 40%?
ID: 61202
Pascal
Joined: 15 Jul 20 · Posts: 95 · Credit: 2,550,803,412 · RAC: 248
Message 61203 - Posted: 6 Feb 2024, 8:44:37 UTC

Well, I've given up; too many errors.
ID: 61203
Ian&Steve C.
Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423
Message 61204 - Posted: 6 Feb 2024, 11:39:54 UTC - in response to Message 61202.  

I am somewhat new to MPS and I understand the general concept, but what do you mean when you say it is set to 40%?


CUDA MPS has a setting called active thread percentage. It basically limits how many SMs of the GPU get used for each process. Without MPS, each process will call for all available SMs all the time, in separate contexts (MPS also shares a single context). I set that to 40%, so each task is only using 40% of the available SMs. With 3x running that’s slightly over provisioning the GPU, but it usually works well and runs faster than 3x without MPS. It also has the benefit of reducing VRAM use most of the time, but it doesn’t seem to limit these tasks much. The only caveat is that when you run low on work, the remaining one or two tasks won’t use all the GPU, instead using only the 40% and none of the rest of the idle GPU.
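
For anyone wanting to try this, here is a rough sketch of launching one client process with a per-client limit. CUDA_MPS_ACTIVE_THREAD_PERCENTAGE is the documented MPS knob; "./gpu_task" is only a stand-in for the real application binary, and the MPS control daemon is assumed to be running already.

# Rough sketch: launch one GPU process under MPS with a 40% active thread percentage.
# Assumes the MPS control daemon is already running (nvidia-cuda-mps-control -d) and
# that "./gpu_task" is a placeholder for the actual application binary.
import os
import subprocess

env = os.environ.copy()
# Per-client limit; a daemon-wide default can instead be set by piping
# "set_default_active_thread_percentage 40" into nvidia-cuda-mps-control.
env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = "40"

subprocess.run(["./gpu_task"], env=env, check=True)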
ID: 61204
Aurum
Joined: 12 Jul 17 · Posts: 404 · Credit: 17,408,899,587 · RAC: 0
Message 61205 - Posted: 6 Feb 2024, 13:11:48 UTC - in response to Message 61195.  

It seems nvidia-smi doesn't allow a logging interval shorter than 1 s. Does anyone have a workaround?

Have you tried NVITOP?
https://github.com/XuehaiPan/nvitop
ID: 61205
pututu
Joined: 8 Oct 16 · Posts: 27 · Credit: 4,153,801,869 · RAC: 0
Message 61206 - Posted: 6 Feb 2024, 17:00:06 UTC - in response to Message 61205.  
Last modified: 6 Feb 2024, 17:00:33 UTC

It seems nvidia-smi doesn't allow a logging interval shorter than 1 s. Does anyone have a workaround?

Have you tried NVITOP?
https://github.com/XuehaiPan/nvitop


No. A quick search suggests it uses the nvidia-smi command, so it likely has a similar limitation.

Anyway, after a day of running (more than 100 tasks) I didn't see any failures on the 16 GB card, so I'm good, at least for now.
ID: 61206
Boca Raton Community HS
Joined: 27 Aug 21 · Posts: 38 · Credit: 7,254,068,306 · RAC: 0
Message 61207 - Posted: 7 Feb 2024, 15:04:51 UTC - in response to Message 61204.  

I am somewhat new to MPS and I understand the general concept, but what do you mean when you say it is set to 40%?


CUDA MPS has a setting called active thread percentage. It basically limits how many SMs of the GPU get used for each process. Without MPS, each process will call for all available SMs all the time, in separate contexts (MPS also shares a single context). I set that to 40%, so each task is only using 40% of the available SMs. With 3x running that’s slightly over provisioning the GPU, but it usually works well and runs faster than 3x without MPS. It also has the benefit of reducing VRAM use most of the time, but it doesn’t seem to limit these tasks much. The only caveat is that when you run low on work, the remaining one or two tasks won’t use all the GPU, instead using only the 40% and none of the rest of the idle GPU.



Thank you for the explanation!
ID: 61207
Pascal
Joined: 15 Jul 20 · Posts: 95 · Credit: 2,550,803,412 · RAC: 248
Message 61208 - Posted: 7 Feb 2024, 20:38:28 UTC

Good evening,
Are there any Windows work units to compute, or do I have to switch back to Linux?
Thanks
ID: 61208
Ian&Steve C.
Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423
Message 61210 - Posted: 7 Feb 2024, 21:10:58 UTC - in response to Message 61208.  

Good evening,
Are there any Windows work units to compute, or do I have to switch back to Linux?
Thanks


Still Linux only.
ID: 61210
Erich56
Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1
Message 61212 - Posted: 8 Feb 2024, 13:05:33 UTC - in response to Message 61210.  

Good evening,
Are there any Windows work units to compute, or do I have to switch back to Linux?
Thanks


Still Linux only.

:-( :-( :-(

ID: 61212
Pascal
Joined: 15 Jul 20 · Posts: 95 · Credit: 2,550,803,412 · RAC: 248
Message 61213 - Posted: 8 Feb 2024, 16:12:49 UTC

I've just switched back to Linux and it's up and running again. Bye bye, Windows 10.
ID: 61213
Boca Raton Community HS
Joined: 27 Aug 21 · Posts: 38 · Credit: 7,254,068,306 · RAC: 0
Message 61214 - Posted: 8 Feb 2024, 16:43:11 UTC

We have definitely noticed a sharp decrease in "errors" with these tasks. Steve (or anyone), can you offer some insight into the filenames? For example:


inputs_v3_ace_pch_ms_gc_filt_af05_index_263591_to_263591-SFARR_PYSCF_ace_pch_ms_gc_filt_af05_v4-0-1-RND5514_2

Are there two different references to version? I see a "_v3_" and then a "_v4-0-1".

Then, the app version: v1.04

I thought that "_v4-0-1" would equate to the app version, but it doesn't look like it does.

Thanks!
ID: 61214
Ian&Steve C.
Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423
Message 61215 - Posted: 8 Feb 2024, 17:35:30 UTC - in response to Message 61214.  
Last modified: 8 Feb 2024, 17:41:34 UTC

The “0-1” notation on all GPUGRID tasks seems to indicate which segment you are on and how many total segments there are.

So here, 0 = which segment you are on, and 1 = how many segments there are in total. The segment index always seems to be zero-based.

We see/saw the same behavior with ATM, where you get tasks like 0-5, 1-5, 2-5, etc. that stop at 4-5; there was also a batch with ten segments, 0-10 through 9-10.

They likely have some kind of process on the server side that stitches the results together based on these (and other) numbers.
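
As a rough illustration of reading those trailing fields (assuming the name always ends in <segment>-<total>-RND…_<issue>, like the example quoted above; real names may vary):

# Rough illustration: pull the segment index and segment count out of a task name,
# assuming it ends with "-<segment>-<total>-RND<digits>_<issue>" as in the example above.
import re

name = ("inputs_v3_ace_pch_ms_gc_filt_af05_index_263591_to_263591"
        "-SFARR_PYSCF_ace_pch_ms_gc_filt_af05_v4-0-1-RND5514_2")

match = re.search(r"-(\d+)-(\d+)-RND\d+_\d+$", name)
if match:
    segment, total = int(match.group(1)), int(match.group(2))
    print(f"segment {segment} of {total} (zero-based index)")  # segment 0 of 1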
ID: 61215
Keith Myers
Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 891
Message 61216 - Posted: 8 Feb 2024, 17:36:34 UTC - in response to Message 61214.  

Looks like they transitioned from v3-0-1 on Feb 2 to a test result on Feb 3, and then started the v4-0-1 run on Feb 5.

That was from looking back through 360 validated tasks.

I had two errors on the v4-0-1 tasks right at their beginning; they have all validated since then.

All run on two 2080 Ti cards.
ID: 61216