Message boards : News : PYSCFbeta: Quantum chemistry calculations on GPU
| Author | Message |
|---|---|
|
Joined: 27 Aug 21 Posts: 38 Credit: 7,254,068,306 RAC: 0
|
> 14 tasks of the latest batch completed successfully without any error.

Are you running them at 1x, and with how much VRAM? I'm trying to get a feel for what the actual "cutoff" is for these tasks right now. I still feel 24 GB of VRAM is needed for success at 1x, and double that for 2x. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
|
Sometimes more than 12 GB: about 4% (16 out of 372) of my tasks failed, all on GPUs with 12 GB and all running at 1x only, for the v3 batch. Not sure how much VRAM is needed to be 100% successful. I did have one success that was a resend of one of your errors from a 4090 with 24 GB, so I'm guessing you were running that one at 2x and got unlucky with two big tasks at the same time.
|
|
Joined: 27 Aug 21 Posts: 38 Credit: 7,254,068,306 RAC: 0
|
> Sometimes more than 12 GB: about 4% (16 out of 372) of my tasks failed, all on GPUs with 12 GB and all running at 1x only, for the v3 batch. Not sure how much VRAM is needed to be 100% successful. I did have one success that was a resend of one of your errors from a 4090 with 24 GB, so I'm guessing you were running that one at 2x and got unlucky with two big tasks at the same time.

Correct: I was playing around with the two 4090 systems running these to make some comparisons. And you are also correct: it seems that even with 24 GB, running 2x is still not really ideal. Those random, huge spikes seem to find each other when running 2x. |
|
Joined: 11 May 10 Posts: 68 Credit: 12,293,491,875 RAC: 2,606
|
> Are you running them at 1x, and with how much VRAM? I'm trying to get a feel for what the actual "cutoff" is for these tasks right now. I still feel 24 GB of VRAM is needed for success at 1x, and double that for 2x.

The GPU is an MSI 4070 Ti GAMING X SLIM with 12 GB GDDR6X, run at 1x. Obviously sufficient for the latest batch to run flawlessly. |
|
Joined: 8 Oct 16 Posts: 27 Credit: 4,153,801,869 RAC: 0
|
> 14 tasks of the latest batch completed successfully without any error.

For someone with a 3080 Ti card, it would be better to run ATMbeta tasks first and only fall back to Quantum chemistry when the former has no available tasks, if the credit granted is an important factor. For me, I have a 3080 Ti and a P100, so I will likely run ATMbeta on the 3080 Ti and Quantum chemistry on the P100 if both kinds of tasks are available. |
|
Joined: 27 Aug 21 Posts: 38 Credit: 7,254,068,306 RAC: 0
|
> Are you running them at 1x, and with how much VRAM? I'm trying to get a feel for what the actual "cutoff" is for these tasks right now. I still feel 24 GB of VRAM is needed for success at 1x, and double that for 2x.

Thanks for the info. If you don't mind me asking: how many ran (in a row) without any errors? |
|
Joined: 6 Jan 21 Posts: 2 Credit: 56,925,024 RAC: 0
|
I have a rig with nine P106 cards, which are slightly modified GTX 1060 6 GB cards that were used for Ethereum mining back in the day. I can run only two GPUGrid tasks at once (the host CPU is only a dual-core Celeron), but so far I have had one error and several tasks finish and validate. Hoping for good results for the rest! |
|
Joined: 11 May 10 Posts: 68 Credit: 12,293,491,875 RAC: 2,606
|
> Are you running them at 1x, and with how much VRAM? I'm trying to get a feel for what the actual "cutoff" is for these tasks right now. I still feel 24 GB of VRAM is needed for success at 1x, and double that for 2x.

14 consecutive tasks without any error. |
|
Joined: 6 Jan 21 Posts: 2 Credit: 56,925,024 RAC: 0
|
> I have a rig with nine P106 cards, which are slightly modified GTX 1060 6 GB cards that were used for Ethereum mining back in the day. I can run only two GPUGrid tasks at once (the host CPU is only a dual-core Celeron), but so far I have had one error and several tasks finish and validate. Hoping for good results for the rest!

So I managed to get 11 tasks, of which 9 passed and validated and 2 failed some time into the process. |
ServicEnginIC
Joined: 24 Sep 10 Posts: 592 Credit: 11,972,186,510 RAC: 1,187
|
> ...From our end we will need to see how to assign WU's based on GPU memory. (Previous apps have been compute bound rather than GPU memory bound and have only been assigned based on driver version)

Probably (I don't know whether it is viable) a better solution would be to include some code to limit peak VRAM according to the actual device assigned. The reason, based on an example: my host #482132 is shown by BOINC as

[2] NVIDIA NVIDIA GeForce GTX 1660 Ti (5928MB) driver: 550.40

This is true for Device 0:

NVIDIA NVIDIA GeForce GTX 1660 Ti (5928MB) driver: 550.40

But Device 1 in this host should be shown as:

NVIDIA NVIDIA GeForce GTX 1650 SUPER (3895MB) driver: 550.40

Tasks sent according to Device 0's VRAM (6 GB) would likely run out of memory when landing on Device 1 (4 GB VRAM). |
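The kind of per-device guard suggested above could, in principle, look something like the following minimal sketch. It assumes the app's own Python/CuPy environment, and the `required_bytes` threshold is purely a placeholder; the real per-task requirement varies with the molecule and is not published.

```python
import sys
import cupy as cp

# Hypothetical threshold, for illustration only.
required_bytes = 6 * 1024**3

# The run script exports CUDA_VISIBLE_DEVICES, so device 0 as seen by CuPy
# is the GPU that BOINC actually assigned to this task.
free_bytes, total_bytes = cp.cuda.runtime.memGetInfo()
print(f"Assigned GPU: {total_bytes / 1024**3:.1f} GiB total, "
      f"{free_bytes / 1024**3:.1f} GiB free")

if free_bytes < required_bytes:
    # Fail fast with a clear message instead of hitting an out-of-memory
    # error deep inside the SCF calculation.
    sys.exit("Not enough free VRAM on the assigned GPU for this work unit.")
```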
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
|
> ...From our end we will need to see how to assign WU's based on GPU memory. (Previous apps have been compute bound rather than GPU memory bound and have only been assigned based on driver version)

The only caveat with this is that the application or project doesn't have any ability to select which GPUs you have or which GPU will run the task. In your example, if a task was sent that required >4 GB, the project has no idea that GPU 1 only has 4 GB. The project can only see the "first/best" GPU in the system; that is what your BOINC client communicates, and the BOINC client is the one that selects which tasks go to which GPU. The science application is called after the GPU selection has already been made. Similarly, BOINC has no mechanism to assign tasks based on GPU VRAM use.

You will have to manage things yourself after observing behavior. If you notice one GPU consistently has too little VRAM, you can exclude that GPU from running the QChem project by setting the <exclude_gpu> statement in the cc_config.xml file:

<options>
    <exclude_gpu>
        <url>https://www.gpugrid.net/</url>
        <app>PYSCFbeta</app>
        <device_num>1</device_num>
    </exclude_gpu>
</options>
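For reference, that options block sits inside the <cc_config> root element of cc_config.xml in the BOINC data directory, and the client has to re-read its configuration files (or be restarted) before the exclusion takes effect. A complete minimal file would look roughly like this:

```xml
<cc_config>
  <options>
    <exclude_gpu>
      <url>https://www.gpugrid.net/</url>
      <app>PYSCFbeta</app>
      <device_num>1</device_num>
    </exclude_gpu>
  </options>
</cc_config>
```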
|
ServicEnginIC
Joined: 24 Sep 10 Posts: 592 Credit: 11,972,186,510 RAC: 1,187
|
> You will have to manage things yourself after observing behavior.

Certainly. Your advice is always much appreciated. An update of the minimum requirements when the PYSCF tasks reach the production stage would be welcome, as a help for excluding hosts / GPUs that don't meet them. |
|
Joined: 8 Oct 16 Posts: 27 Credit: 4,153,801,869 RAC: 0
|
> You will have to manage things yourself after observing behavior.

I would imagine something like what WCG posted may be useful, showing system requirements such as memory, disk space, one-time download file size, etc.: https://www.worldcommunitygrid.org/help/topic.s?shortName=minimumreq
Setting aside WCG not running smoothly since the IBM migration, I notice that the WCG system requirements are outdated; I guess it takes effort to maintain such information and keep it up to date.

So far, this is my limited knowledge about the quantum chemistry tasks, as I'm still learning. Anyone is welcome to chime in on the system requirements.

1) The one-time download file is about 2 GB. Be prepared to wait for hours if you have a very slow internet connection.
2) The more GPU VRAM the better. It seems like cards with 24 GB or more perform the best.
3) GPUs with faster memory bandwidth and faster FP64 have an advantage in run time. Typically these are datacenter/server/workstation cards. |
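A quick way to check your own cards against point 2 is to list each GPU and its total VRAM; a small sketch using the nvidia-ml-py bindings (an assumption here, not something shipped with the project app):

```python
# pip install nvidia-ml-py
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)
    if isinstance(name, bytes):          # older bindings return bytes
        name = name.decode()
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU {i}: {name}, {mem.total / 1024**3:.1f} GiB VRAM")
pynvml.nvmlShutdown()
```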
|
Joined: 3 Jul 16 Posts: 31 Credit: 2,248,809,169 RAC: 0
|
Implementing the possibility of choosing work with certain hardware demands through the project preferences would be nice as well. After lots of problems with the ECM subproject claiming too much system memory, yoyo@home divided the subproject into smaller and bigger tasks, which can both be ticked (or left unticked) in the project preferences. So my suggestion is to hand out work that comes in 4, 6, 8, 12, 16 and 24 GB flavours, which the user can choose from. As the machine's system also claims GPU memory, it should naturally be considered to leave about half a gigabyte untouched by the GPUGrid tasks.
- - - - - - - - - -
Greetings, Jens |
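On the point about leaving some VRAM untouched: CuPy, which the app uses for its GPU arrays, can cap its own allocations below the physical card size. A minimal sketch of the idea, with the half-gigabyte headroom purely as an example value (how the project actually sets its limits is not documented here):

```python
import cupy as cp

headroom = 512 * 1024**2                 # example: keep ~0.5 GiB free for the desktop
free_bytes, total_bytes = cp.cuda.runtime.memGetInfo()

# Cap CuPy's default memory pool; allocations beyond the limit raise
# OutOfMemoryError instead of exhausting the whole card.
cp.get_default_memory_pool().set_limit(size=total_bytes - headroom)

# The same cap can be applied without code changes through the
# CUPY_GPU_MEMORY_LIMIT environment variable (absolute bytes or a percentage).
```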
|
Joined: 21 Dec 23 Posts: 51 Credit: 0 RAC: 0 |
OK, so it seems like things have improved with the latest settings. I am keeping the WUs short (10 molecule configurations per WU) to minimize the effect of the errors. I am going to send out some batches of WUs to get through a large dataset we have.

I think this:

> After lots of problems with the ECM subproject claiming too much system memory, yoyo@home divided the subproject into smaller and bigger tasks, which can both be ticked (or left unticked) in the project preferences.

might be the most workable solution for the future once the current batch of work is done. The memory use is mainly determined by the size of the molecule and the number of heavy elements, so before WUs are sent out we can make a rough estimate of the memory use. There is an element of randomness that comes from high memory use for specific physical configurations that are harder to converge; we cannot estimate this before sending, and it will only show up during the calculation. |
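As a rough illustration of that kind of pre-submission estimate (only a sketch: the water molecule and the basis set are placeholders, and it only counts the density-fitting tensor, not the other intermediates the GPU code allocates):

```python
from pyscf import gto, df

# Placeholder molecule; a real estimate would loop over the dataset's molecules.
mol = gto.M(atom="O 0 0 0; H 0 0.76 0.59; H 0 -0.76 0.59", basis="def2-tzvpp")
auxmol = df.addons.make_auxmol(mol)      # let PySCF pick a default auxiliary basis

nao, naux = mol.nao, auxmol.nao
# The Cholesky-decomposed three-center integrals hold roughly
# naux * nao*(nao+1)/2 float64 values; more (and heavier) atoms mean more
# basis functions, so this term grows quickly with molecule size.
cderi_gib = naux * nao * (nao + 1) / 2 * 8 / 1024**3
print(f"nao = {nao}, naux = {naux}, ~{cderi_gib:.2f} GiB for the 3-center tensor")
```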
|
Joined: 15 Jul 20 Posts: 95 Credit: 2,550,803,412 RAC: 203
|
Seems like credit has gone down from 150K to 15K? |
|
Joined: 18 Mar 10 Posts: 28 Credit: 41,810,583,419 RAC: 10,891
|
> Seems like credit has gone down from 150K to 15K?

Yes, and the memory use this morning seems to require running one task at a time on GPUs with less than 16 GB, which hurts performance even more. Steve, what determines the point value for a task? |
|
Joined: 15 Jul 20 Posts: 95 Credit: 2,550,803,412 RAC: 203
|
For the moment it's not going too badly with the new work units: one error out of four.

Name: inputs_v3_ace_pch_ms_gc_filt_af05_index_64000_to_64000-SFARR_PYSCF_ace_pch_ms_gc_filt_af05_v4-0-1-RND0521_0
Work unit: 27684102
Created: 5 Feb 2024 | 10:40:37 UTC
Sent: 5 Feb 2024 | 10:47:37 UTC
Received: 5 Feb 2024 | 10:49:50 UTC
Server state: Over
Outcome: Computation error
Client state: Computation error
Exit status: 195 (0xc3) EXIT_CHILD_FAILED
Computer ID: 617458
Report deadline: 10 Feb 2024 | 10:47:37 UTC
Run time: 45.93
CPU time: 9.59
Validate state: Invalid
Credit: 0.00
Application version: Quantum chemistry calculations on GPU v1.04 (cuda1121)

Stderr output:
<core_client_version>7.20.5</core_client_version> <![CDATA[ <message> process exited with code 195 (0xc3, -61)</message> <stderr_txt> 11:47:47 (5931): wrapper (7.7.26016): starting 11:48:16 (5931): wrapper (7.7.26016): starting 11:48:16 (5931): wrapper: running bin/python (bin/conda-unpack) 11:48:17 (5931): bin/python exited; CPU time 0.157053 11:48:17 (5931): wrapper: running bin/tar (xjvf input.tar.bz2) 11:48:18 (5931): bin/tar exited; CPU time 0.002953 11:48:18 (5931): wrapper: running bin/bash (run.sh) + echo 'Setup environment' + source bin/activate ++ _conda_pack_activate ++ local _CONDA_SHELL_FLAVOR ++ '[' -n x ']' ++ _CONDA_SHELL_FLAVOR=bash ++ local script_dir ++ case "$_CONDA_SHELL_FLAVOR" in +++ dirname bin/activate ++ script_dir=bin +++ cd bin +++ pwd ++ local full_path_script_dir=/home/pascal/slots/3/bin +++ dirname /home/pascal/slots/3/bin ++ local full_path_env=/home/pascal/slots/3 +++ basename /home/pascal/slots/3 ++ local env_name=3 ++ '[' -n '' ']' ++ export CONDA_PREFIX=/home/pascal/slots/3 ++ CONDA_PREFIX=/home/pascal/slots/3 ++ export _CONDA_PACK_OLD_PS1= ++ _CONDA_PACK_OLD_PS1= ++ PATH=/home/pascal/slots/3/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:. ++ PS1='(3) ' ++ case "$_CONDA_SHELL_FLAVOR" in ++ hash -r ++ local _script_dir=/home/pascal/slots/3/etc/conda/activate.d ++ '[' -d /home/pascal/slots/3/etc/conda/activate.d ']' + export PATH=/home/pascal/slots/3:/home/pascal/slots/3/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:. + PATH=/home/pascal/slots/3:/home/pascal/slots/3/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:. + echo 'Create a temporary directory' + export TMP=/home/pascal/slots/3/tmp + TMP=/home/pascal/slots/3/tmp + mkdir -p /home/pascal/slots/3/tmp + export OMP_NUM_THREADS=1 + OMP_NUM_THREADS=1 + export CUDA_VISIBLE_DEVICES=1 + CUDA_VISIBLE_DEVICES=1 + export CUPY_CUDA_LIB_PATH=/home/pascal/slots/3/cupy + CUPY_CUDA_LIB_PATH=/home/pascal/slots/3/cupy + echo 'Running PySCF' + python compute_dft.py /home/pascal/slots/3/lib/python3.11/site-packages/gpu4pyscf/lib/cutensor.py:174: UserWarning: using cupy as the tensor contraction engine. warnings.warn(f'using {contract_engine} as the tensor contraction engine.') /home/pascal/slots/3/lib/python3.11/site-packages/pyscf/dft/libxc.py:771: UserWarning: Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, corresponding to the original definition by Stephens et al. (issue 1480) and the same as the B3LYP functional in Gaussian.
To restore the VWN5 definition, you can put the setting "B3LYP_WITH_VWN5 = True" in pyscf_conf.py warnings.warn('Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, ' nao = 570 /home/pascal/slots/3/lib/python3.11/site-packages/pyscf/gto/mole.py:1280: UserWarning: Function mol.dumps drops attribute charge because it is not JSON-serializable warnings.warn(msg) Traceback (most recent call last): File "/home/pascal/slots/3/lib/python3.11/site-packages/pyscf/lib/misc.py", line 1094, in __exit__ handler.result() File "/home/pascal/slots/3/lib/python3.11/concurrent/futures/_base.py", line 456, in result return self.__get_result() ^^^^^^^^^^^^^^^^^^^ File "/home/pascal/slots/3/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result raise self._exception File "/home/pascal/slots/3/lib/python3.11/concurrent/futures/thread.py", line 58, in run result = self.fn(*self.args, **self.kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/pascal/slots/3/lib/python3.11/site-packages/gpu4pyscf/df/df_jk.py", line 52, in build_df rsh_df.build(omega=omega) File "/home/pascal/slots/3/lib/python3.11/site-packages/gpu4pyscf/df/df.py", line 102, in build self._cderi = cholesky_eri_gpu(intopt, mol, auxmol, self.cd_low, omega=omega) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/pascal/slots/3/lib/python3.11/site-packages/gpu4pyscf/df/df.py", line 256, in cholesky_eri_gpu if lj>1: ints_slices = cart2sph(ints_slices, axis=1, ang=lj) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/pascal/slots/3/lib/python3.11/site-packages/gpu4pyscf/lib/cupy_helper.py", line 333, in cart2sph t_sph = contract('min,ip->mpn', t_cart, c2s, out=out) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/pascal/slots/3/lib/python3.11/site-packages/gpu4pyscf/lib/cutensor.py", line 177, in contract return cupy.asarray(einsum(pattern, a, b), order='C') ^^^^^^^^^^^^^^^^^^^^^ File "/home/pascal/slots/3/lib/python3.11/site-packages/cupy/linalg/_einsum.py", line 676, in einsum arr_out, sub_out = reduced_binary_einsum( ^^^^^^^^^^^^^^^^^^^^^^ File "/home/pascal/slots/3/lib/python3.11/site-packages/cupy/linalg/_einsum.py", line 418, in reduced_binary_einsum tmp1, shapes1 = _flatten_transpose(arr1, [bs1, cs1, ts1]) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/pascal/slots/3/lib/python3.11/site-packages/cupy/linalg/_einsum.py", line 298, in _flatten_transpose a.transpose(transpose_axes).reshape( File "cupy/_core/core.pyx", line 752, in cupy._core.core._ndarray_base.reshape File "cupy/_core/_routines_manipulation.pyx", line 81, in cupy._core._routines_manipulation._ndarray_reshape File "cupy/_core/_routines_manipulation.pyx", line 357, in cupy._core._routines_manipulation._reshape File "cupy/_core/core.pyx", line 611, in cupy._core.core._ndarray_base.copy File "cupy/_core/core.pyx", line 570, in cupy._core.core._ndarray_base.astype File "cupy/_core/core.pyx", line 132, in cupy._core.core.ndarray.__new__ File "cupy/_core/core.pyx", line 220, in cupy._core.core._ndarray_base._init File "cupy/cuda/memory.pyx", line 740, in cupy.cuda.memory.alloc File "cupy/cuda/memory.pyx", line 1426, in cupy.cuda.memory.MemoryPool.malloc File "cupy/cuda/memory.pyx", line 1447, in cupy.cuda.memory.MemoryPool.malloc File "cupy/cuda/memory.pyx", line 1118, in cupy.cuda.memory.SingleDeviceMemoryPool.malloc File "cupy/cuda/memory.pyx", line 1139, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc File "cupy/cuda/memory.pyx", line 1346, in cupy.cuda.memory.SingleDeviceMemoryPool._try_malloc File 
"cupy/cuda/memory.pyx", line 1358, in cupy.cuda.memory.SingleDeviceMemoryPool._try_malloc cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 595,413,504 bytes (allocated so far: 3,207,694,336 bytes, limit set to: 3,684,158,668 bytes). During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/home/pascal/slots/3/compute_dft.py", line 121, in <module> e,f,dip,q = compute_gpu(mol) ^^^^^^^^^^^^^^^^ File "/home/pascal/slots/3/compute_dft.py", line 24, in compute_gpu e_dft = mf.kernel() # compute total energy ^^^^^^^^^^^ File "<string>", line 2, in kernel File "/home/pascal/slots/3/lib/python3.11/site-packages/gpu4pyscf/scf/hf.py", line 586, in scf _kernel(self, self.conv_tol, self.conv_tol_grad, File "/home/pascal/slots/3/lib/python3.11/site-packages/gpu4pyscf/scf/hf.py", line 393, in _kernel mf.init_workflow(dm0=dm) File "/home/pascal/slots/3/lib/python3.11/site-packages/gpu4pyscf/df/df_jk.py", line 56, in init_workflow with lib.call_in_background(build_df) as build: File "/home/pascal/slots/3/lib/python3.11/site-packages/pyscf/lib/misc.py", line 1096, in __exit__ raise ThreadRuntimeError('Error on thread %s:\n%s' % (self, e)) pyscf.lib.misc.ThreadRuntimeError: Error on thread <pyscf.lib.misc.call_in_background object at 0x7fec06934850>: Out of memory allocating 595,413,504 bytes (allocated so far: 3,207,694,336 bytes, limit set to: 3,684,158,668 bytes). 11:48:31 (5931): bin/bash exited; CPU time 11.139443 11:48:31 (5931): app exit status: 0x1 11:48:31 (5931): called boinc_finish(195) </stderr_txt> ]]> |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
|
I'm seeing about a 10% failure rate with 12 GB cards.
|
|
Joined: 21 Dec 23 Posts: 51 Credit: 0 RAC: 0 |
Credits should now be at 75k for the rest of the batch. They should be consistent with the other apps, based on comparisons of runtime on our test machines, but this is complicated with this new memory-intensive app. I will investigate before sending the next batch. |