Message boards : News : PYSCFbeta: Quantum chemistry calculations on GPU
---

Joined: 8 Oct 16 · Posts: 27 · Credit: 4,153,801,869 · RAC: 0
There are some tasks that spike over 10 GB. It seems nvidia-smi doesn't allow a logging interval shorter than 1 s; does anyone have a workaround? The momentary spike is likely even higher than the 10 GB recorded.

```
2024/02/05 07:06:39.675, 88 %, 1328 MHz, 5147 MiB, 115.28 W, 65
2024/02/05 07:06:40.678, 96 %, 1278 MHz, 5147 MiB, 117.58 W, 65
2024/02/05 07:06:41.688, 100 %, 1328 MHz, 5177 MiB, 111.94 W, 65
2024/02/05 07:06:42.691, 100 %, 1328 MHz, 6647 MiB, 70.23 W, 64
2024/02/05 07:06:43.694, 30 %, 1328 MHz, 8475 MiB, 69.65 W, 64
2024/02/05 07:06:44.697, 100 %, 1328 MHz, 9015 MiB, 81.81 W, 64
2024/02/05 07:06:45.700, 100 %, 1328 MHz, 9007 MiB, 46.32 W, 63
2024/02/05 07:06:46.705, 98 %, 1278 MHz, 9941 MiB, 46.08 W, 63
2024/02/05 07:06:47.708, 99 %, 1328 MHz, 10251 MiB, 57.06 W, 63
2024/02/05 07:06:48.711, 97 %, 1088 MHz, 4553 MiB, 133.72 W, 65
2024/02/05 07:06:49.714, 95 %, 1075 MHz, 4553 MiB, 132.99 W, 65
```
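One possible workaround, not mentioned in the thread, is to query NVML directly instead of shelling out to nvidia-smi. A minimal sketch, assuming the nvidia-ml-py package (imported as `pynvml`) and GPU index 0; note that NVML itself may still sample some counters more coarsely than the polling interval:

```python
# Sketch: poll NVML every 100 ms for utilization, SM clock, memory, power, temp.
# Assumes the nvidia-ml-py package (import name: pynvml) and GPU index 0.
import time
from datetime import datetime

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    while True:
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)      # percent
        clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)              # bytes
        watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0   # mW -> W
        temp = pynvml.nvmlDeviceGetTemperature(
            handle, pynvml.NVML_TEMPERATURE_GPU)
        print(f"{datetime.now():%Y/%m/%d %H:%M:%S.%f}, {util.gpu} %, "
              f"{clock} MHz, {mem.used // 2**20} MiB, {watts:.2f} W, {temp}")
        time.sleep(0.1)  # 100 ms interval, well below nvidia-smi's 1 s floor
finally:
    pynvml.nvmlShutdown()
```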
---

Joined: 8 Oct 16 · Posts: 27 · Credit: 4,153,801,869 · RAC: 0
Got a biggie: this one is 14.6 GB. I'm running a 16 GB card, one task per GPU.

```
2024/02/05 08:20:03.043, 100 %, 1328 MHz, 9604 MiB, 107.19 W, 71
2024/02/05 08:20:04.046, 94 %, 1328 MHz, 11970 MiB, 97.69 W, 71
2024/02/05 08:20:05.049, 99 %, 1328 MHz, 12130 MiB, 123.24 W, 70
2024/02/05 08:20:06.052, 100 %, 1316 MHz, 12130 MiB, 122.21 W, 71
2024/02/05 08:20:07.055, 100 %, 1328 MHz, 12130 MiB, 121.26 W, 71
2024/02/05 08:20:08.058, 100 %, 1328 MHz, 12130 MiB, 118.64 W, 71
2024/02/05 08:20:09.061, 17 %, 1328 MHz, 12116 MiB, 56.48 W, 70
2024/02/05 08:20:10.064, 95 %, 1189 MHz, 14646 MiB, 73.99 W, 71
2024/02/05 08:20:11.071, 99 %, 1139 MHz, 14646 MiB, 194.84 W, 71
2024/02/05 08:20:12.078, 96 %, 1316 MHz, 14650 MiB, 65.82 W, 70
2024/02/05 08:20:13.081, 85 %, 1328 MHz, 8952 MiB, 84.32 W, 70
2024/02/05 08:20:14.084, 100 %, 1075 MHz, 8952 MiB, 130.53 W, 71
```
---

Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423
Yeah, I think you'll only ever see the spike if you actually have the VRAM for it; if you don't have enough, the task will error out before hitting it and you'll never see it. I'm just going to deal with the errors. Cost of doing business, lol.

I have my system set to a 70% active thread percentage (ATP) through MPS, with QChem gpu_usage set to 0.55 and ATMbeta gpu_usage set to 0.44. This way, when both kinds of tasks are available, a GPU will run either ATMbeta+ATMbeta or ATMbeta+QChem, but never two QChem tasks at once. I do this because ATMbeta uses a really small amount of GPU VRAM and can soak up some of the spare compute cycles without adding much to QChem's VRAM pressure.

When a card is running only a single QChem task, it isn't using quite all the compute it could (only 70%), so it may be a little slower, but Titan Vs are fast enough anyway: most tasks finish in about 6 minutes, with some outliers around 18 minutes.
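For reference, per-app GPU fractions like these live in BOINC's app_config.xml in the project directory. The sketch below shows how the 0.55/0.44 split might look; it is an assumption, not the poster's actual file, and the `<name>` values in particular must match the app names in the project's client_state.xml:

```xml
<!-- Sketch of an app_config.xml implementing the 0.55 / 0.44 split above.
     App names are assumptions; check client_state.xml for the real ones. -->
<app_config>
  <app>
    <name>PYSCFbeta</name>          <!-- QChem -->
    <gpu_versions>
      <gpu_usage>0.55</gpu_usage>   <!-- 2 x 0.55 exceeds 1: never two per GPU -->
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>ATMbeta</name>
    <gpu_versions>
      <gpu_usage>0.44</gpu_usage>   <!-- 0.44 + 0.55 fits in 1: can pair with QChem -->
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```

BOINC only starts a GPU task if the running tasks' gpu_usage fractions sum to at most 1, which is what makes 0.44 + 0.55 and 0.44 + 0.44 legal while 0.55 + 0.55 is not.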
---

Joined: 27 Aug 21 · Posts: 38 · Credit: 7,254,068,306 · RAC: 0
pututu, have you had any failed tasks? Ian&Steve C. reports a ~10% failure rate with 12 GB, so I'm curious about 16 GB. I'm guessing that's about the minimum for error-free processing (as far as memory limits go) of the current work.
---

Joined: 27 Aug 21 · Posts: 38 · Credit: 7,254,068,306 · RAC: 0
We did the same thing this morning on the 4090 GPUs, since they have 24 GB, but paired with E@H work: too little VRAM to run QChem at 2x, yet too much compute power left on the table running it at 1x.
---

Joined: 8 Oct 16 · Posts: 27 · Credit: 4,153,801,869 · RAC: 0
> pututu, have you had any failed tasks? Ian&Steve C. reports a ~10% failure rate with 12 GB, so I'm curious about 16 GB.

0 failures after 19 completed tasks on one P100 with 16 GB. So far 14.6 GB is the highest I've seen with 1-second-interval monitoring. More than half of the tasks processed momentarily hit 8 GB or more; I didn't record any actual data, just watched nvidia-smi from time to time.

Edit: another task with more than 12 GB, but with an ominous 6666 MiB, lol:

```
2024/02/05 09:17:58.869, 99 %, 1328 MHz, 10712 MiB, 131.69 W, 70
2024/02/05 09:17:59.872, 100 %, 1328 MHz, 10712 MiB, 101.87 W, 70
2024/02/05 09:18:00.877, 100 %, 1328 MHz, 10700 MiB, 50.15 W, 69
2024/02/05 09:18:01.880, 92 %, 1240 MHz, 11790 MiB, 54.34 W, 69
2024/02/05 09:18:02.883, 95 %, 1240 MHz, 12364 MiB, 53.20 W, 69
2024/02/05 09:18:03.886, 83 %, 1126 MHz, 6666 MiB, 137.77 W, 70
2024/02/05 09:18:04.889, 100 %, 1075 MHz, 6666 MiB, 130.53 W, 71
2024/02/05 09:18:05.892, 92 %, 1164 MHz, 6666 MiB, 129.84 W, 71
2024/02/05 09:18:06.902, 100 %, 1063 MHz, 6666 MiB, 129.82 W, 71
```
---

Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423
> pututu, have you had any failed tasks? Ian&Steve C. reports a ~10% failure rate with 12 GB, so I'm curious about 16 GB.

I've been running all day across my 18x Titan Vs and the effective error rate is right around 5%, so 5% of the tasks needed more than 12 GB. Running only one task per GPU.

I also rented an A100 40GB for the day. Running 3x on that GPU with MPS set to 40%, it's done about 300 tasks and only one failed from out of memory. The highest spike I saw was 39 GB, but it usually stays around 20 GB utilized.
---

Joined: 27 Aug 21 · Posts: 38 · Credit: 7,254,068,306 · RAC: 0
Wow, the A100 is powerful. I can't believe how fast it can chew through these (well, I can believe it, but it's still amazing).

I'm somewhat new to MPS. I understand the general concept, but what do you mean when you say it is set to 40%?
---

Joined: 15 Jul 20 · Posts: 95 · Credit: 2,550,803,412 · RAC: 248
Well, I've given up; too many errors.
---

Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423
> I'm somewhat new to MPS. I understand the general concept, but what do you mean when you say it is set to 40%?

CUDA MPS has a setting called the active thread percentage. It basically limits how many of the GPU's SMs each process gets. Without MPS, every process claims all available SMs all the time, each in its own context (MPS also shares a single context between processes). I set that percentage to 40, so each task only uses 40% of the available SMs. With 3x running, that slightly over-provisions the GPU, but it usually works well and runs faster than 3x without MPS. It also tends to reduce VRAM use, though it doesn't seem to limit these tasks much.

The only caveat is that when you run low on work, the remaining one or two tasks won't expand to fill the GPU; each still uses only its 40% and leaves the rest idle.
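For anyone wanting to try this, the percentage is set through the MPS control daemon (nvidia-cuda-mps-control). Below is a minimal sketch in Python wrapping that tool; the pipe/log directories are assumptions for illustration, and the default percentage must be set before client processes start:

```python
# Sketch: start the CUDA MPS control daemon and cap each client's SM usage
# at 40%. The pipe/log directories below are assumptions; adjust as needed.
# Clients must see the same CUDA_MPS_PIPE_DIRECTORY in their environment.
import os
import subprocess

env = os.environ.copy()
env["CUDA_MPS_PIPE_DIRECTORY"] = "/tmp/nvidia-mps"
env["CUDA_MPS_LOG_DIRECTORY"] = "/tmp/nvidia-mps-log"

# Start the daemon (-d), then feed it the active-thread-percentage command.
subprocess.run(["nvidia-cuda-mps-control", "-d"], env=env, check=True)
subprocess.run(
    ["nvidia-cuda-mps-control"],
    input="set_default_active_thread_percentage 40\n",
    text=True,
    env=env,
    check=True,
)
# To stop the daemon later: pipe "quit" into nvidia-cuda-mps-control.
```

Already-running clients keep the percentage they launched with; only new clients pick up the new default.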
---

Joined: 12 Jul 17 · Posts: 404 · Credit: 17,408,899,587 · RAC: 0
> It seems nvidia-smi doesn't allow a logging interval shorter than 1 s; does anyone have a workaround?

Have you tried nvitop? https://github.com/XuehaiPan/nvitop
---

Joined: 8 Oct 16 · Posts: 27 · Credit: 4,153,801,869 · RAC: 0
> Have you tried nvitop? https://github.com/XuehaiPan/nvitop

No. A quick search suggests it uses the nvidia-smi command underneath, so it likely has a similar limitation. Anyway, after a day of running (100+ tasks) I didn't see any failures on the 16 GB card, so I'm good, at least for now.
---

Joined: 27 Aug 21 · Posts: 38 · Credit: 7,254,068,306 · RAC: 0
> I'm somewhat new to MPS. I understand the general concept, but what do you mean when you say it is set to 40%?

Thank you for the explanation!
---

Joined: 15 Jul 20 · Posts: 95 · Credit: 2,550,803,412 · RAC: 248
Good evening. Are there Windows work units to compute, or do I have to switch back to Linux? Thanks.
---

Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423
> Good evening. Are there Windows work units to compute, or do I have to switch back to Linux?

Only Linux still.
---

Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1
> Only Linux still.

:-( :-( :-(
---

Joined: 15 Jul 20 · Posts: 95 · Credit: 2,550,803,412 · RAC: 248
I just switched back to Linux and it's up and running again. Bye bye, Windows 10.
---

Joined: 27 Aug 21 · Posts: 38 · Credit: 7,254,068,306 · RAC: 0
We have definitely noticed a sharp decrease in "errors" with these tasks.

Steve (or anyone), can you offer some insight into the filenames? For example:

inputs_v3_ace_pch_ms_gc_filt_af05_index_263591_to_263591-SFARR_PYSCF_ace_pch_ms_gc_filt_af05_v4-0-1-RND5514_2

Are there two different references to a version? I see a "_v3_" and then a "_v4-0-1". Then there's the app version, v1.04. I thought "_v4-0-1" would equate to the app version, but it doesn't look like it does. Thanks!
---

Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423
The "0-1" notation in all GPUGRID task names seems to indicate which segment you are on and how many segments there are in total. So here, 0 = the segment you are on and 1 = the total number of segments. The segment index always seems to be zero-based.

We see/saw the same behavior with ATM, where you'd get tasks like 0-5, 1-5, 2-5, and so on, stopping at 4-5; one batch had ten segments, 0-10 through 9-10. They likely have some process on the server side that stitches the results together based on these (and other) numbers.
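If that reading is right, the fields can be pulled out of a task name mechanically. A hypothetical sketch; the pattern and field meanings follow the interpretation above, not any published naming spec:

```python
# Hypothetical sketch: extract the batch version and segment fields from a
# GPUGRID task name, assuming the "..._v<batch>-<segment>-<total>-RND..."
# interpretation described above.
import re

name = ("inputs_v3_ace_pch_ms_gc_filt_af05_index_263591_to_263591-"
        "SFARR_PYSCF_ace_pch_ms_gc_filt_af05_v4-0-1-RND5514_2")

m = re.search(r"_v(\d+)-(\d+)-(\d+)-RND", name)
if m:
    batch_version, segment, total = map(int, m.groups())
    print(f"batch v{batch_version}: segment {segment} of {total}")
    # -> batch v4: segment 0 of 1
```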
---

Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 891
Looks like they transitioned from v3-0-1 on Feb 2 to a test result on Feb 3, and then started the v4-0-1 run on Feb 5. That's from looking back through 360 validated tasks. I had two errors on the v4-0-1 tasks right at their beginning; everything has validated since then. All were run on two 2080 Ti cards.