PYSCFbeta: Quantum chemistry calculations on GPU

Message boards : News : PYSCFbeta: Quantum chemistry calculations on GPU
Previous · 1 . . . 4 · 5 · 6 · 7 · 8 · 9 · 10 . . . 14 · Next
Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Message 61154 - Posted: 2 Feb 2024, 10:54:03 UTC
Last modified: 2 Feb 2024, 11:32:19 UTC

I've disabled getting new GPUGrid tasks on my host with a "small" amount (below 24 GB) of GPU memory.
This gigantic memory requirement is ridiculous in my opinion.
This is not a user error; if the workunits can't be changed, then the project should not send these tasks to hosts that have less than ~20 GB of GPU memory.
Alternatively, the workunits could allocate memory in a less careless way.
I've started a task on my RTX 4090 (it has 24 GiB of RAM), and I've monitored the memory usage:
           idle:   305 MiB
  task starting:   895 MiB
GPU usage rises:  6115 MiB
GPU usage drops:  7105 MiB
 GPU usage 100%:  7205 MiB
GPU usage drops:  8495 MiB
GPU usage rises:  9961 MiB
GPU usage drops: 14327 MiB (it would have failed on my GTX 1080 Ti at this point)
GPU usage rises:  6323 MiB
GPU usage drops: 15945 MiB
 GPU usage 100%:  6205 MiB
...and so on
So the memory usage briefly doubles at some points during processing, and this causes the workunits to fail on GPUs that have a "small" amount of memory. If this behaviour could be eliminated, many more hosts could process these workunits.
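The doubling pattern can be checked mechanically from a trace like the one above. Below is a minimal Python sketch (my own illustration, not project code) that parses nvidia-smi-style CSV log lines, finds the peak VRAM figure, and asks whether that peak, plus a driver/desktop reserve, would have fit on a given card. The 500 MiB reserve and the sample lines are assumptions for the example.

```python
# Parse lines in the shape produced by:
#   nvidia-smi --query-gpu=timestamp,temperature.gpu,utilization.gpu,\
#              utilization.memory,memory.used --format=csv,noheader -l 1

def peak_vram_mib(log_lines):
    """Return the largest 'memory.used' value (MiB) seen in the log."""
    peak = 0
    for line in log_lines:
        fields = [f.strip() for f in line.split(",")]
        if not fields or not fields[-1].endswith("MiB"):
            continue  # skip malformed lines
        used = int(fields[-1].removesuffix("MiB").strip())
        peak = max(peak, used)
    return peak

def fits_on_card(log_lines, card_mib, reserve_mib=500):
    """True if the observed peak plus a desktop/driver reserve fits in VRAM."""
    return peak_vram_mib(log_lines) + reserve_mib <= card_mib

sample = [
    "2024/02/02 08:07:10.775, 71, 22 %, 1 %, 8989 MiB",
    "2024/02/02 08:07:11.775, 70, 96 %, 2 %, 10209 MiB",
    "2024/02/02 08:07:12.775, 71, 98 %, 7 %, 10721 MiB",
]
print(peak_vram_mib(sample))        # 10721
print(fits_on_card(sample, 11264))  # True (an 11 GiB card would just cope here)
```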
ID: 61154

Profile ServicEnginIC
Joined: 24 Sep 10
Posts: 592
Credit: 11,972,186,510
RAC: 1,447
Message 61155 - Posted: 2 Feb 2024, 11:59:29 UTC - in response to Message 61154.  

Nothing to do at this time for my GPUs currently working on PYSCFbeta tasks:
5× GTX 1650 4GB, 1× GTX 1650 SUPER 4GB, 1× GTX 1660 Ti 6GB.
100% errors with the current PYSCFbeta tasks; now I realize why...
I've disabled "Quantum chemistry on GPU (beta)" in my project preferences while waiting for a fix, if any.
Conversely, they are performing fine with ATMbeta tasks.
ID: 61155

Freewill
Joined: 18 Mar 10
Posts: 28
Credit: 41,810,583,419
RAC: 13,276
Message 61156 - Posted: 2 Feb 2024, 12:09:55 UTC

I agree, it does seem these tasks have a spike in memory usage. I "rented" an RTX A5000 GPU, which also has 24 GB of memory, and ran 1 task at a time; at least the first task completed:
https://www.gpugrid.net/workunit.php?wuid=27678500
I will try a few more.
ID: 61156

roundup
Joined: 11 May 10
Posts: 68
Credit: 12,293,491,875
RAC: 3,176
Message 61157 - Posted: 2 Feb 2024, 12:16:07 UTC - in response to Message 61155.  
Last modified: 2 Feb 2024, 12:17:30 UTC


I've disabled Quantum chemistry on GPU (beta) at my project preferences in the wait for a correction, if any.
Conversely, they are performing right with ATMbeta tasks.

Exactly the same here. After 29 consecutive errors on an RTX 4070 Ti, I have disabled 'Quantum chemistry on GPU (beta)'.
ID: 61157

gemini8
Joined: 3 Jul 16
Posts: 31
Credit: 2,248,809,169
RAC: 0
Message 61158 - Posted: 2 Feb 2024, 12:25:43 UTC

I have one machine still taking on GPUGrid tasks.
The others are using their GPUs only for the Tour de Primes over at PrimeGrid.
If there really is a driver issue with this machine (see earlier post and answers), I'd like to know which one, as its GPU is running fine on other BOINC projects apart from SRBase. Not being able to run SRBase is related to libc, not the GPU driver.
- - - - - - - - - -
Greetings, Jens
ID: 61158

Pascal
Joined: 15 Jul 20
Posts: 95
Credit: 2,550,803,412
RAC: 248
Message 61159 - Posted: 2 Feb 2024, 12:37:34 UTC
Last modified: 2 Feb 2024, 12:38:16 UTC

Hello,
Is there a way under Linux to simulate VRAM for the GPU using RAM or an SSD?
That would avoid the computation errors.
I increased the swap file to 50 GB, as under Windows, but it does not work.
Thanks
ID: 61159

Boca Raton Community HS
Joined: 27 Aug 21
Posts: 38
Credit: 7,254,068,306
RAC: 0
Message 61160 - Posted: 2 Feb 2024, 13:36:47 UTC - in response to Message 61151.  
Last modified: 2 Feb 2024, 13:42:31 UTC

Boca,

How much VRAM do you see actually being used on some of these tasks? Mind watching a few? You'll have to run a watch command to see continuous output of VRAM utilization, since the usage isn't constant; it spikes up and down. I'm just curious how much is actually needed. Most of the tasks I was running would spike up to about 8GB, but I assume the tasks that needed more just failed instead, so I can't know how much they were trying to use. Even though these Titan Vs are great DP performers, they only have 12GB of VRAM. Even most of the 16GB cards like the V100 and P100 are seeing very high error rates.

MPS helps. But not enough with this current batch. I was getting good throughput with running 3x tasks at once on the batches last week.


This was wild...

For a single work unit:

Hovers around 3-4GB.
Rises to 8-9GB.
Spikes to ~11GB regularly.

Highest spike (seen): 12.5GB
Highest spike (estimated, based on Psensor): ~20GB. Additionally, Psensor caught a peak memory usage of 76% of the RTX A6000's 48GB for one workunit, but I did not see when this happened, or whether it happened at all.
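For anyone wanting to capture this kind of trace themselves, nvidia-smi can log the relevant fields continuously in its built-in loop mode (this is the stock CLI; the one-second interval and the output filename are just example choices):

```shell
# Log timestamp, temperature, GPU/memory utilisation and VRAM in use, once per second
nvidia-smi --query-gpu=timestamp,temperature.gpu,utilization.gpu,utilization.memory,memory.used \
           --format=csv,noheader -l 1 | tee vram.log
```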

I graphically captured the VRAM usage for one workunit. I have no idea how to embed images here, so here is a Google Doc:

https://docs.google.com/document/d/1xpOpNJ93finciJQW7U07dMHOycSVlbYq9G6h0Xg7GtA/edit?usp=sharing

EDIT: I think they just purged these work units from the server?
ID: 61160

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Message 61161 - Posted: 2 Feb 2024, 14:02:10 UTC - in response to Message 61160.  
Last modified: 2 Feb 2024, 14:06:34 UTC

Thanks, that's kind of what I expected was happening.

And yeah, they must have seen the problems and just abandoned the remainder of this run to reassess how to tweak them.

It seems like they tweaked the input files to give the assertion error instead of just hanging like the earlier ones (index numbers below ~1000). The early tasks would hang with the fallback-to-CPU issue; after that, it changed to the assertion error if a task ran out of VRAM. That was better behaviour for the user, since a quick failure is better than hanging for hours on end doing nothing. But they were probably getting back a majority of errors as the VRAM requirements grew beyond what most people have available in hardware.
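The fail-fast behaviour described here can be sketched as a preflight check: estimate the peak (steady-state use times a spike factor) and abort immediately if it exceeds free VRAM, rather than hanging for hours. This is a hypothetical Python illustration; the 2x factor and the numbers are assumptions drawn from observations in this thread, not the app's actual logic.

```python
class InsufficientVRAM(RuntimeError):
    """Raised when the expected peak VRAM demand exceeds what is free."""

def preflight(free_mib, steady_mib, spike_factor=2.0):
    """Fail fast: abort before work starts if the projected spike won't fit."""
    needed = steady_mib * spike_factor
    if needed > free_mib:
        raise InsufficientVRAM(
            f"need ~{needed:.0f} MiB at peak, only {free_mib} MiB free")
    return needed

# A 12 GB card with ~10.7 GiB free vs. a task idling around 7.2 GiB:
try:
    preflight(free_mib=10700, steady_mib=7205)
except InsufficientVRAM as exc:
    print("aborting early:", exc)
```

The point of the pattern is simply that the error surfaces in seconds instead of after hours of a hung task, which matches the improvement described above.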
ID: 61161

Boca Raton Community HS
Joined: 27 Aug 21
Posts: 38
Credit: 7,254,068,306
RAC: 0
Message 61162 - Posted: 2 Feb 2024, 15:30:46 UTC

A new batch just came through; I'm seeing the same VRAM spikes and patterns.
ID: 61162

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Message 61163 - Posted: 2 Feb 2024, 15:32:14 UTC - in response to Message 61162.  
Last modified: 2 Feb 2024, 15:39:40 UTC

I'm seeing the same spikes, but so far so good. The biggest spike I saw was ~9GB.

No errors... yet.

Spoke too soon; I did get one failure:

https://gpugrid.net/result.php?resultid=33801391
ID: 61163

Steve
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 21 Dec 23
Posts: 51
Credit: 0
RAC: 0
Message 61164 - Posted: 2 Feb 2024, 15:37:15 UTC - in response to Message 61163.  

Hi. I have been tweaking settings. All WUs I have tried now work on my 1080 (8GB).

Sending a new batch of smaller WUs out now. On our end, we will need to see how to assign WUs based on GPU memory. (Previous apps have been compute-bound rather than GPU-memory-bound and have only been assigned based on driver version.)
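The memory-based assignment Steve describes could look something like the following sketch: keep a per-WU peak-memory estimate and only schedule it to hosts whose reported GPU memory covers it with some headroom. This is purely illustrative Python; the host names, the numbers, and the 1.2 safety margin are made-up assumptions, not GPUGrid's actual scheduler.

```python
def eligible_hosts(hosts_mib, wu_peak_mib, margin=1.2):
    """hosts_mib: {host name: GPU memory in MiB}.
    Return the hosts whose VRAM covers the WU's peak with headroom."""
    needed = wu_peak_mib * margin
    return sorted(name for name, mem in hosts_mib.items() if mem >= needed)

hosts = {"rtx4090": 24564, "titan_v": 12288, "gtx1650": 4096}
print(eligible_hosts(hosts, wu_peak_mib=16000))  # ['rtx4090']
print(eligible_hosts(hosts, wu_peak_mib=9000))   # ['rtx4090', 'titan_v']
```

The margin exists because of the transient spikes reported in this thread: a WU whose steady usage fits can still fail at its peak.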
ID: 61164

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Message 61165 - Posted: 2 Feb 2024, 16:07:04 UTC - in response to Message 61164.  

I'm seeing some errors on the Titan V (12GB); not a huge amount, but certainly a noteworthy one. Maybe you can correlate these specific WUs and see why this kind (number of atoms or molecules?) might be requesting more VRAM than the ones you tried on your 1080.

Most of the ones I've observed running hover around ~3-4GB of constant VRAM use, with spikes into the 8-11GB range.

https://gpugrid.net/result.php?resultid=33802055
https://gpugrid.net/result.php?resultid=33801492
https://gpugrid.net/result.php?resultid=33801447
https://gpugrid.net/result.php?resultid=33801391
https://gpugrid.net/result.php?resultid=33801238
ID: 61165

pututu
Joined: 8 Oct 16
Posts: 27
Credit: 4,153,801,869
RAC: 0
Message 61166 - Posted: 2 Feb 2024, 16:08:36 UTC

Still seeing VRAM spikes above 8GB:

2024/02/02 08:07:08.774, 71, 100 %, 40 %, 8997 MiB
2024/02/02 08:07:09.774, 71, 100 %, 34 %, 8999 MiB
2024/02/02 08:07:10.775, 71, 22 %, 1 %, 8989 MiB
2024/02/02 08:07:11.775, 70, 96 %, 2 %, 10209 MiB
2024/02/02 08:07:12.775, 71, 98 %, 7 %, 10721 MiB
2024/02/02 08:07:13.775, 71, 93 %, 8 %, 5023 MiB
2024/02/02 08:07:14.775, 72, 96 %, 24 %, 5019 MiB
2024/02/02 08:07:15.776, 72, 100 %, 0 %, 5019 MiB
2024/02/02 08:07:16.776, 72, 100 %, 0 %, 5019 MiB

Seems like credit has gone down from 150K to 15K.
ID: 61166

Boca Raton Community HS
Joined: 27 Aug 21
Posts: 38
Credit: 7,254,068,306
RAC: 0
Message 61167 - Posted: 2 Feb 2024, 16:20:20 UTC - in response to Message 61166.  

Agreed; it seems that there are fewer spikes and most of them are in the 8-9GB range, with a few higher, though seemingly less frequent. It's difficult to quantify an actual difference, since the workunits can be so different. Is there a real difference in VRAM usage, or do these particular workunits just happen to need less VRAM?
ID: 61167

Pascal
Joined: 15 Jul 20
Posts: 95
Credit: 2,550,803,412
RAC: 248
Message 61168 - Posted: 2 Feb 2024, 16:40:21 UTC

Seems like credit has gone down from 150K to 15K.
ID: 61168

pututu
Joined: 8 Oct 16
Posts: 27
Credit: 4,153,801,869
RAC: 0
Message 61169 - Posted: 2 Feb 2024, 17:33:47 UTC
Last modified: 2 Feb 2024, 17:34:29 UTC

Occasionally 8GB of VRAM is not sufficient; I'm still seeing errors on these cards.

Example: two of the hosts below have 8GB of VRAM, while the one that returned successfully has 16GB.
http://gpugrid.net/workunit.php?wuid=27683202
ID: 61169

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Message 61171 - Posted: 2 Feb 2024, 17:55:00 UTC - in response to Message 61169.  

Even that 16GB GPU had one failure with the new v3 batch

http://gpugrid.net/result.php?resultid=33802340
ID: 61171

Boca Raton Community HS
Joined: 27 Aug 21
Posts: 38
Credit: 7,254,068,306
RAC: 0
Message 61172 - Posted: 2 Feb 2024, 18:47:46 UTC - in response to Message 61171.  

Even that 16GB GPU had one failure with the new v3 batch

http://gpugrid.net/result.php?resultid=33802340



Based on the task times, it looks like those were running at 1x?

ID: 61172

Pascal
Joined: 15 Jul 20
Posts: 95
Credit: 2,550,803,412
RAC: 248
Message 61173 - Posted: 2 Feb 2024, 18:52:03 UTC
Last modified: 2 Feb 2024, 18:55:03 UTC

Good evening. At my place it works well now.
I just finished 5 workunits without a problem with my GTX 1650 and my RTX 4060.
Let's hope this continues.
I reformatted my PC today and reinstalled Linux Mint 21.3, once again.

https://www.gpugrid.net/results.php?userid=563937
ID: 61173

roundup
Joined: 11 May 10
Posts: 68
Credit: 12,293,491,875
RAC: 3,176
Message 61174 - Posted: 2 Feb 2024, 19:00:05 UTC - in response to Message 61168.  

14 tasks of the latest batch completed successfully, without any errors.
Great progress!

Seems like credit has gone down from 150K to 15K.

Perhaps 150k was a little too generous, but 15k is not on par with other GPU projects. I expect there will be fairer credit again soon, perhaps with the next batch?
ID: 61174

©2025 Universitat Pompeu Fabra