Message boards : News : PYSCFbeta: Quantum chemistry calculations on GPU
| Author | Message |
|---|---|
|
Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 5,269
|
Thanks for the info, Steve. In general, I don't have much of a problem with tasks using a large amount of VRAM if that's what your science goals require. Personally, I just want expectations set so that I can configure my hosts accordingly. If VRAM use is low, I can set a host to run multiple tasks at a time for better efficiency; if VRAM use is high, I'll need to cut back to only one or two tasks per GPU, which hurts overall efficiency on my end and requires some reconfiguration, but that's fine if that's how the tasks will be. I just prefer to know which way it will go so that I don't leave a host in a bad configuration and cause errors.

The bigger problem for me (and maybe many others) was yesterday's batch with very high system memory use per task. When system RAM filled up, it would crash the system, which requires more manual intervention to get things running again. Anyone with multiple GPUs would be at risk there. Just something to consider.

As for overall VRAM use: again, you can require whatever your science goals need, but you might consider keeping tasks under 8 GB. Many people on GPUGRID these days have a GPU with at least 8 GB (all of mine have 12 GB), and fewer have 16 GB or more. If you can keep tasks below 8 GB, I think you'll maintain a large pool of users instead of having tasks run out of memory and be resent multiple times before they land on a host with enough VRAM.
|
|
Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 5,269
|
"I have Task 33765246 running on an RTX 3060 Ti under Linux Mint 21.3." I'm getting several of these also. This is a problem too; you can always tell when the task basically stalls with almost no progress.
|
|
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 351
|
My CPU fallback task has now completed and validated, in not much longer than is usual for tasks on that host. I assume it was a shortened test task, running on a slower device? I have now completed what looks like a similar task, with similarly large jumps in progress percentage, but much more quickly: Task 33765553 |
|
Joined: 21 Dec 23 · Posts: 51 · Credit: 0 · RAC: 0 |
This is still very much a beta app. We will continue to explore different WU sizes and application settings (with better local testing on our internal hardware before sending them out). This app is the first time it has been possible to run these QM calculations on GPUs. The underlying software was primarily designed for the latest generation of professional cards, e.g. the A100s used in HPC centres. It is proving challenging for us to port the code to GPUGRID's consumer hardware. We are also looking into how a Windows port can be done. |
|
Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 5,269
|
No problem, Steve. I definitely understand the beta aspect of this and the need to test things; I'm just giving honest feedback from my POV. Sometimes it's hard to tell whether a radical change in behavior is intended or a sign of some problem or misconfiguration. Maybe it's not possible for all the various molecules you want to test, but I feel the size of last week's large batch was very appropriate: moderate VRAM use and consistent sizes/runtimes. Those worked well with the consumer hardware. Oh, if everyone had A100s with 40-80 GB of VRAM, life would be nice, LOL.
|
|
Joined: 27 Aug 21 · Posts: 38 · Credit: 7,254,068,306 · RAC: 0
|
I had an odd work unit come through (and just abandoned it). I have not had any issues with these work units otherwise, so I thought I would mention this one specifically: https://www.gpugrid.net/result.php?resultid=33764946 I think there was a memory error with it, but I am not very skilled at reading the results. It hung at ~75%, but I let it work for 5 hours (honestly, I just didn't notice that it was hung up...). Looking at the properties of the work unit: Virtual memory: 56 GB; Working set size: 3.59 GB. I thought this was an odd one, so I thought I would post. |
|
Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 5,269
|
"I had an odd work unit come through (and just abandoned)." Yeah, you can see several out-of-memory errors. Are you running more than one at a time? I've had many like this, and many that seem to just fall back to CPU for no apparent reason and get stuck for a long time. I've been aborting them when I notice, but it is troublesome :(
|
|
Joined: 27 Aug 21 · Posts: 38 · Credit: 7,254,068,306 · RAC: 0
|
I have been running 2x for these (I can't get them to run 3x or 4x via the app_config file, but it doesn't look like there are any queued tasks waiting to start). Good to know that others have seen this too! I have seen a MASSIVE reduction in the time these tasks take today. |
|
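For anyone else experimenting with task multiples, the usual mechanism is BOINC's per-project app_config.xml. Below is a minimal sketch only, and it makes assumptions: a Debian/Ubuntu-style boinc-client data directory, and `pyscfbeta` as a placeholder for the app's short name (check client_state.xml or the project's Applications page for the real name before using it).

```bash
# Sketch: write an app_config.xml that runs 2 of these tasks per GPU.
# Assumptions: /var/lib/boinc-client data dir and the placeholder app name "pyscfbeta".
sudo tee /var/lib/boinc-client/projects/www.gpugrid.net/app_config.xml >/dev/null <<'EOF'
<app_config>
  <app>
    <name>pyscfbeta</name>        <!-- placeholder: use the real short name from client_state.xml -->
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>  <!-- 0.5 = 2 tasks per GPU; 0.33 would request 3 per GPU -->
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
EOF

# Ask the running client to re-read config files
# (or use BOINC Manager: Options -> Read config files, or restart the client).
boinccmd --read_cc_config
```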
Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 5,269
|
I’m now getting a 3rd type of error across all of my hosts. “AssertionError” https://www.gpugrid.net/result.php?resultid=33766654
|
|
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 351
|
I've had a few of those too, mainly of the form:
File "/hdd/boinc-client/slots/6/lib/python3.11/site-packages/gpu4pyscf/df/grad/rhf.py", line 163, in get_jk
    assert k1-k0 <= block_size |
Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0
|
150,000 credits for a few hundred seconds? I'm in! ;) https://www.gpugrid.net/result.php?resultid=33771102 https://www.gpugrid.net/result.php?resultid=33771333 https://www.gpugrid.net/result.php?resultid=33771431 https://www.gpugrid.net/result.php?resultid=33771446 https://www.gpugrid.net/result.php?resultid=33771539 |
|
Joined: 3 Jul 16 · Posts: 31 · Credit: 2,248,809,169 · RAC: 0
|
"I have Task 33765246 running on an RTX 3060 Ti under Linux Mint 21.3. I'm getting several of these also. This is a problem too; you can always tell when the task basically stalls with almost no progress." I had only those on one of my machines. Apparently it had lost sight of the GPU for crunching; rebooting brought the Nvidia driver back to the BOINC client. Apart from this, I found out that I can't run these tasks alongside Private GFN Server tasks on a 6 GB GPU, so I called off the PYSCFbeta tasks for that machine, as I often have to wait for tasks to download from GPUGrid and I don't want my GPUs to run idle. - - - - - - - - - - Greetings, Jens |
|
Joined: 3 Jul 16 · Posts: 31 · Credit: 2,248,809,169 · RAC: 0
|
Did we encounter this one already?

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
14:21:03 (24335): wrapper (7.7.26016): starting
14:21:36 (24335): wrapper (7.7.26016): starting
14:21:36 (24335): wrapper: running bin/python (bin/conda-unpack)
14:21:38 (24335): bin/python exited; CPU time 0.223114
14:21:38 (24335): wrapper: running bin/tar (xjvf input.tar.bz2)
14:21:39 (24335): bin/tar exited; CPU time 0.005282
14:21:39 (24335): wrapper: running bin/bash (run.sh)
+ echo 'Setup environment'
+ source bin/activate
++ _conda_pack_activate
++ local _CONDA_SHELL_FLAVOR
++ '[' -n x ']'
++ _CONDA_SHELL_FLAVOR=bash
++ local script_dir
++ case "$_CONDA_SHELL_FLAVOR" in
+++ dirname bin/activate
++ script_dir=bin
+++ cd bin
+++ pwd
++ local full_path_script_dir=/var/lib/boinc-client/slots/4/bin
+++ dirname /var/lib/boinc-client/slots/4/bin
++ local full_path_env=/var/lib/boinc-client/slots/4
+++ basename /var/lib/boinc-client/slots/4
++ local env_name=4
++ '[' -n '' ']'
++ export CONDA_PREFIX=/var/lib/boinc-client/slots/4
++ CONDA_PREFIX=/var/lib/boinc-client/slots/4
++ export _CONDA_PACK_OLD_PS1=
++ _CONDA_PACK_OLD_PS1=
++ PATH=/var/lib/boinc-client/slots/4/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
++ PS1='(4) '
++ case "$_CONDA_SHELL_FLAVOR" in
++ hash -r
++ local _script_dir=/var/lib/boinc-client/slots/4/etc/conda/activate.d
++ '[' -d /var/lib/boinc-client/slots/4/etc/conda/activate.d ']'
+ export PATH=/var/lib/boinc-client/slots/4:/var/lib/boinc-client/slots/4/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
+ PATH=/var/lib/boinc-client/slots/4:/var/lib/boinc-client/slots/4/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
+ echo 'Create a temporary directory'
+ export TMP=/var/lib/boinc-client/slots/4/tmp
+ TMP=/var/lib/boinc-client/slots/4/tmp
+ mkdir -p /var/lib/boinc-client/slots/4/tmp
+ export OMP_NUM_THREADS=1
+ OMP_NUM_THREADS=1
+ export CUDA_VISIBLE_DEVICES=0
+ CUDA_VISIBLE_DEVICES=0
+ export CUPY_CUDA_LIB_PATH=/var/lib/boinc-client/slots/4/cupy
+ CUPY_CUDA_LIB_PATH=/var/lib/boinc-client/slots/4/cupy
+ echo 'Running PySCF'
+ python compute_dft.py
/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/gpu4pyscf/lib/cutensor.py:174: UserWarning: using cupy as the tensor contraction engine.
warnings.warn(f'using {contract_engine} as the tensor contraction engine.')
/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/pyscf/dft/libxc.py:771: UserWarning: Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, corresponding to the original definition by Stephens et al. (issue 1480) and the same as the B3LYP functional in Gaussian. To restore the VWN5 definition, you can put the setting "B3LYP_WITH_VWN5 = True" in pyscf_conf.py
warnings.warn('Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, '
/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/pyscf/gto/mole.py:1280: UserWarning: Function mol.dumps drops attribute charge because it is not JSON-serializable
warnings.warn(msg)
Traceback (most recent call last):
File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/cupy/cuda/compiler.py", line 253, in _jitify_prep
name, options, headers, include_names = jitify.jitify(source, options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "cupy/cuda/jitify.pyx", line 63, in cupy.cuda.jitify.jitify
File "cupy/cuda/jitify.pyx", line 88, in cupy.cuda.jitify.jitify
RuntimeError: Runtime compilation failed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/var/lib/boinc-client/slots/4/compute_dft.py", line 125, in <module>
e,f,dip,q = compute_gpu(mol)
^^^^^^^^^^^^^^^^
File "/var/lib/boinc-client/slots/4/compute_dft.py", line 32, in compute_gpu
e_dft = mf.kernel() # compute total energy
^^^^^^^^^^^
File "<string>", line 2, in kernel
File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/gpu4pyscf/scf/hf.py", line 586, in scf
_kernel(self, self.conv_tol, self.conv_tol_grad,
File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/gpu4pyscf/scf/hf.py", line 393, in _kernel
mf.init_workflow(dm0=dm)
File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/gpu4pyscf/df/df_jk.py", line 63, in init_workflow
rks.initialize_grids(mf, mf.mol, dm0)
File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/gpu4pyscf/dft/rks.py", line 80, in initialize_grids
ks.grids = prune_small_rho_grids_(ks, ks.mol, dm, ks.grids)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/gpu4pyscf/dft/rks.py", line 49, in prune_small_rho_grids_
logger.debug(grids, 'Drop grids %d', grids.weights.size - cupy.count_nonzero(idx))
^^^^^^^^^^^^^^^^^^^^^^^
File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/cupy/_sorting/count.py", line 24, in count_nonzero
return _count_nonzero(a, axis=axis)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "cupy/_core/_reduction.pyx", line 608, in cupy._core._reduction._SimpleReductionKernel.__call__
File "cupy/_core/_reduction.pyx", line 364, in cupy._core._reduction._AbstractReductionKernel._call
File "cupy/_core/_cub_reduction.pyx", line 701, in cupy._core._cub_reduction._try_to_call_cub_reduction
File "cupy/_core/_cub_reduction.pyx", line 538, in cupy._core._cub_reduction._launch_cub
File "cupy/_core/_cub_reduction.pyx", line 473, in cupy._core._cub_reduction._cub_two_pass_launch
File "cupy/_util.pyx", line 64, in cupy._util.memoize.decorator.ret
File "cupy/_core/_cub_reduction.pyx", line 246, in cupy._core._cub_reduction._SimpleCubReductionKernel_get_cached_function
File "cupy/_core/_cub_reduction.pyx", line 231, in cupy._core._cub_reduction._create_cub_reduction_function
File "cupy/_core/core.pyx", line 2251, in cupy._core.core.compile_with_cache
File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/cupy/cuda/compiler.py", line 496, in _compile_module_with_cache
return _compile_with_cache_cuda(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/cupy/cuda/compiler.py", line 574, in _compile_with_cache_cuda
ptx, mapping = compile_using_nvrtc(
^^^^^^^^^^^^^^^^^^^^
File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/cupy/cuda/compiler.py", line 322, in compile_using_nvrtc
return _compile(source, options, cu_path,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/cupy/cuda/compiler.py", line 287, in _compile
options, headers, include_names = _jitify_prep(
^^^^^^^^^^^^^
File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/cupy/cuda/compiler.py", line 260, in _jitify_prep
raise JitifyException(str(cex))
cupy.cuda.compiler.JitifyException: Runtime compilation failed
14:23:34 (24335): bin/bash exited; CPU time 14.043607
14:23:34 (24335): app exit status: 0x1
14:23:34 (24335): called boinc_finish(195)
</stderr_txt>
]]>
- - - - - - - - - -
Greetings, Jens |
|
Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 5,269
|
That looks like a driver issue. But something else I noticed is that these tasks, for the most part, are having a very high failure rate: 30-50% on most hosts. There are a few hosts with few or no errors, however, and all of them have 24-48 GB of VRAM. So it seems something like 30-50% of the tasks require more than 12-16 GB. I'm sure the project has a very large error percentage to sort through, as there aren't enough 24-48 GB GPUs to catch all the resends.
|
Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0
|
The present batch has a far worse failure ratio than the previous one. |
|
Joined: 27 Aug 21 · Posts: 38 · Credit: 7,254,068,306 · RAC: 0
|
"That looks like a driver issue." This is 100% correct. Our system with 2x RTX A6000 (48 GB of VRAM) has had 500 valid results and no errors. They are running tasks at 2x and seem to run really well (https://www.gpugrid.net/results.php?hostid=616410). On one of our systems with 3x RTX A4500 GPUs (20 GB), as soon as I changed from running these tasks at 2x to 1x, the error rate greatly improved (https://www.gpugrid.net/results.php?hostid=616409); since making the change I have had 14 tasks in a row without errors. When I am back in the classroom, I think I will change anything with 24 GB or less to run only one task in order to improve the valid rate. Has anyone tried running MPS with these tasks, and would it make a difference in the allocation of resources to successfully run 2x? Just curious about thoughts. |
|
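For reference, a minimal sketch of what enabling NVIDIA MPS on a Linux host usually looks like. The device index and the step of restarting the BOINC client afterwards are assumptions; adjust for multi-GPU machines.

```bash
# Sketch: start the MPS control daemon so concurrent tasks share a GPU more efficiently.
# Assumes a single GPU at index 0; restart the BOINC client afterwards so new
# science processes attach to the MPS server.
export CUDA_VISIBLE_DEVICES=0
sudo nvidia-smi -i 0 -c EXCLUSIVE_PROCESS   # optional, but commonly recommended with MPS
nvidia-cuda-mps-control -d                  # launch the MPS daemon in the background

# ...run tasks...

echo quit | nvidia-cuda-mps-control         # stop the daemon when finished
sudo nvidia-smi -i 0 -c DEFAULT             # restore the default compute mode
```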
Joined: 28 Mar 09 · Posts: 490 · Credit: 11,731,645,728 · RAC: 57
|
Last week, I had a 100% success rate. This week, it's a different story. Maybe it's time to step back and dial it down a bit: you have to work with the resources that you have, not the ones that you wish you had. |
|
Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 5,269
|
Boca, how much VRAM do you see actually being used on some of these tasks? Mind watching a few? You'll have to run a watch command to see continuous output of VRAM utilization, since the usage isn't constant; it spikes up and down. I'm just curious how much is actually needed. Most of the tasks I was running would spike up to about 8 GB, but I assume the tasks that needed more just failed instead, so I can't know how much they were trying to use. Even though these Titan Vs are great DP performers, they only have 12 GB of VRAM, and even most of the 16 GB cards like the V100 and P100 are seeing very high error rates. MPS helps, but not enough with this current batch. I was getting good throughput running 3x tasks at once on last week's batches.
|
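For anyone wanting to do the same spot check, a couple of standard nvidia-smi invocations cover it; the one-second interval and the log file name are just suggestions.

```bash
# Refresh the full nvidia-smi view every second; the Memory-Usage column shows VRAM in use.
watch -n 1 nvidia-smi

# Or log only the memory counters so the spikes can be reviewed later.
nvidia-smi --query-gpu=timestamp,index,memory.used,memory.total \
           --format=csv -l 1 >> vram_usage.log
```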
|
Joined: 3 Jul 16 · Posts: 31 · Credit: 2,248,809,169 · RAC: 0
|
"That looks like a driver issue." That's what Pascal (?) wrote in the Q&A as well. I had three tasks on that host, and two of them failed. - - - - - - - - - - Greetings, Jens |
|
Joined: 15 Jul 20 · Posts: 95 · Credit: 2,550,803,412 · RAC: 203
|
Not everyone has a 5,000-euro graphics card with 24 GB of VRAM or more; you should think of the more modest among us. I have an RTX 4060 and a GTX 1650, but all I get are errors, for example. I think most of the people who compute for GPUGRID and eagerly await work for their GPUs are like me. I keep thinking the problem is my system installation, so I reformat and do a clean install, hoping it will work properly, but in vain, because the problem comes from your failing work units. |