PYSCFbeta: Quantum chemistry calculations on GPU

Message boards : News : PYSCFbeta: Quantum chemistry calculations on GPU

Ian&Steve C.

Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Message 61130 - Posted: 31 Jan 2024, 13:33:51 UTC - in response to Message 61129.  

Thanks for the info Steve.

In general, I don't have much of a problem with the tasks using a large amount of VRAM, if that's what you require for your science goals. Personally, I just wish to have expectations set so that I can set up my hosts accordingly. If VRAM use is low, I can set my hosts to run multiple tasks at a time for better efficiency. If VRAM use is high, I'll need to cut back to only 2 or 1 tasks per GPU, which hurts overall efficiency on my end and requires me to reconfigure some things, but that's fine if that's how they will be. I just prefer to know which way it will be so that I don't leave a host in a bad configuration and cause errors.

The bigger problem for me (and maybe many others) was the batch yesterday with very high system memory use per task. When system RAM filled up, it would crash the system, which requires more manual intervention to get it running again. Anyone with multiple GPUs would be at risk there. Just something to consider.

As for overall VRAM use: again, you can require whatever you need for your science goals, but you might consider making sure you can at least keep tasks under 8 GB. I'd say many people on GPUGRID these days have a GPU with at least 8 GB; all of mine have 12 GB, and fewer still have 16 GB or more. If you can keep tasks below 8 GB, I think you'll be able to maintain a large pool of users rather than dealing with tasks running out of memory and having to be resent multiple times to land on a host with enough VRAM.
ID: 61130

Ian&Steve C.

Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Message 61131 - Posted: 31 Jan 2024, 13:58:20 UTC - in response to Message 61126.  

I have Task 33765246 running on a RTX 3060 Ti under Linux Mint 21.3

It's running incredibly slowly, and with zero GPU usage. I've found this in stderr.txt:

+ python compute_dft.py
/hdd/boinc-client/slots/5/lib/python3.11/site-packages/pyscf/dft/libxc.py:771: UserWarning: Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, corresponding to the original definition by Stephens et al. (issue 1480) and the same as the B3LYP functional in Gaussian. To restore the VWN5 definition, you can put the setting "B3LYP_WITH_VWN5 = True" in pyscf_conf.py
warnings.warn('Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, '
/hdd/boinc-client/slots/5/lib/python3.11/site-packages/gpu4pyscf/lib/cutensor.py:174: UserWarning: using cupy as the tensor contraction engine.
warnings.warn(f'using {contract_engine} as the tensor contraction engine.')
/hdd/boinc-client/slots/5/lib/python3.11/site-packages/pyscf/gto/mole.py:1280: UserWarning: Function mol.dumps drops attribute charge because it is not JSON-serializable
warnings.warn(msg)
Exception:
Fallback to CPU
Exception:
Fallback to CPU


I'm getting several of these also. This is a problem too. You can always tell when the task basically stalls with almost no progress.


ID: 61131

Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Message 61132 - Posted: 31 Jan 2024, 14:15:17 UTC

My CPU-fallback task has now completed and validated, in not much longer than is usual for tasks on that host. I assume it was a shortened test task, running on a slower device? I have now just completed what looks like a similar task, with similarly large jumps in progress percentage, but much more quickly. Task 33765553
ID: 61132

Steve
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist

Joined: 21 Dec 23
Posts: 51
Credit: 0
RAC: 0
Message 61133 - Posted: 31 Jan 2024, 14:44:43 UTC - in response to Message 61132.  

This is still very much a beta app.

We will continue to explore different WU sizes and application settings (with better local testing on our internal hardware before sending them out).

This app is the first time it has been possible for us to run QM calculations on GPUs. The underlying software was primarily designed for the latest generation of professional cards, e.g. the A100s used in HPC centres. It is proving challenging for us to port the code to GPUGRID's consumer hardware. We are also looking into how a Windows port can be done.
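
For reference, the kind of calculation the app runs looks roughly like the gpu4pyscf snippet below (gpu4pyscf is the library the stderr tracebacks in this thread come from). This is only a minimal sketch: the water molecule, basis set and functional are placeholder assumptions, not the actual inputs of the project's compute_dft.py.

# Minimal sketch of a density-fitted DFT calculation with gpu4pyscf.
# Molecule, basis and functional here are illustrative assumptions only.
import pyscf
from gpu4pyscf.dft import rks

mol = pyscf.M(
    atom='O 0 0 0; H 0 0.757 0.587; H 0 -0.757 0.587',
    basis='def2-tzvpp',
)

mf = rks.RKS(mol, xc='B3LYP').density_fit()   # heavy tensor work runs on the GPU via CuPy
e_dft = mf.kernel()                           # SCF total energy
grad = mf.nuc_grad_method().kernel()          # nuclear gradient (forces)
print('E =', e_dft)

The density-fitting intermediates grow quickly with molecule and basis size, which is why larger inputs can exceed the VRAM of consumer cards.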

ID: 61133

Ian&Steve C.

Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Message 61134 - Posted: 31 Jan 2024, 14:51:32 UTC - in response to Message 61133.  

No problem Steve. I definitely understand the beta aspect of this and the need to test things. I’m just giving honest feedback from my POV. Sometimes it’s hard to tell if a radical change in behavior is intended or a sign of some problem or misconfiguration.

Maybe it’s not possible for all the various molecules you want to test, but the size of the previous large batch last week I feel was very appropriate. Moderate VRAM use and consistent size/runtimes. Those worked well with the consumer hardware.

Oh if everyone had A100s with 40-80GB of VRAM life would be nice LOL.
ID: 61134

Boca Raton Community HS

Joined: 27 Aug 21
Posts: 38
Credit: 7,254,068,306
RAC: 0
Message 61135 - Posted: 31 Jan 2024, 16:33:38 UTC

I had an odd work unit come through (which I just abandoned). I have not had any issues with these work units otherwise, so I thought I would mention this one specifically.

https://www.gpugrid.net/result.php?resultid=33764946

I think there was a memory error with it, but I am not very skilled at reading the results. It hung at ~75%, but I let it work for 5 hours (honestly, I just didn't notice that it was hung up...).

When looking at the properties of the work unit:
Virtual memory: 56 GB
Working set size: 3.59 GB

I thought this was an odd one so thought I would post.
ID: 61135

Ian&Steve C.

Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Message 61136 - Posted: 31 Jan 2024, 17:03:21 UTC - in response to Message 61135.  

I had an odd work unit come through (and just abandoned). I have not had any issues with these work units so thought I would mention this one specifically.

https://www.gpugrid.net/result.php?resultid=33764946

I think there was a memory error with it but I am not very skilled at reading the results. It hung at ~75% but I let it work for 5 hours (honestly, I just didn't notice that it was hung up...).

When looking at properties of the work unit:
Virtual memory: 56GB
Working Size Set: 3.59GB

I thought this was an odd one so thought I would post.


Yeah, you can see several out-of-memory errors. Are you running more than one at a time?

I’ve had many like this. And many that seem to just fall back to CPU without any reason and get stuck for a long time. I’ve been aborting them when I notice. But it is troublesome :(
ID: 61136

Boca Raton Community HS

Joined: 27 Aug 21
Posts: 38
Credit: 7,254,068,306
RAC: 0
Message 61137 - Posted: 31 Jan 2024, 17:19:14 UTC - in response to Message 61136.  



Yeah you can see several out of memory errors. Are you running more than one at a time?

I’ve had many like this. And many that seem to just fall back to CPU without any reason and get stuck for a long time. I’ve been aborting them when I notice. But it is troublesome :(



I have been running these at 2x (I can't get them to run at 3x or 4x via an app_config file, but it doesn't look like there are any queued tasks waiting to start anyway).
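
For anyone wanting to try the same thing, task multiplicity is set with an app_config.xml in the GPUGRID project directory; a minimal sketch is below. The app name shown is a guess and must match the app's short name exactly as it appears in client_state.xml, and the file is picked up after "Read config files" in the BOINC Manager (or a client restart).

<!-- Sketch of an app_config.xml for the GPUGRID project directory.
     gpu_usage 0.5 schedules two tasks per GPU (0.33 for three).
     The app name below is a guess; use the short name from client_state.xml. -->
<app_config>
  <app>
    <name>PYSCFbeta</name>
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>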

Good to know that others have seen this too! I have seen a MASSIVE reduction in the time these tasks take today.
ID: 61137

Ian&Steve C.

Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Message 61138 - Posted: 31 Jan 2024, 18:15:03 UTC

I’m now getting a 3rd type of error across all of my hosts.

“AssertionError”

https://www.gpugrid.net/result.php?resultid=33766654
ID: 61138

Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Message 61139 - Posted: 31 Jan 2024, 18:25:02 UTC - in response to Message 61138.  

I've had a few of those too, mainly of the form

File "/hdd/boinc-client/slots/6/lib/python3.11/site-packages/gpu4pyscf/df/grad/rhf.py", line 163, in get_jk
assert k1-k0 <= block_size
ID: 61139

Retvari Zoltan

Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Message 61140 - Posted: 1 Feb 2024, 4:17:42 UTC

ID: 61140

gemini8

Joined: 3 Jul 16
Posts: 31
Credit: 2,248,809,169
RAC: 0
Message 61141 - Posted: 1 Feb 2024, 7:48:55 UTC - in response to Message 61131.  

I have Task 33765246 running on a RTX 3060 Ti under Linux Mint 21.3

It's running incredibly slowly, and with zero GPU usage. I've found this in stderr.txt:

+ python compute_dft.py
/hdd/boinc-client/slots/5/lib/python3.11/site-packages/pyscf/dft/libxc.py:771: UserWarning: Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, corresponding to the original definition by Stephens et al. (issue 1480) and the same as the B3LYP functional in Gaussian. To restore the VWN5 definition, you can put the setting "B3LYP_WITH_VWN5 = True" in pyscf_conf.py
warnings.warn('Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, '
/hdd/boinc-client/slots/5/lib/python3.11/site-packages/gpu4pyscf/lib/cutensor.py:174: UserWarning: using cupy as the tensor contraction engine.
warnings.warn(f'using {contract_engine} as the tensor contraction engine.')
/hdd/boinc-client/slots/5/lib/python3.11/site-packages/pyscf/gto/mole.py:1280: UserWarning: Function mol.dumps drops attribute charge because it is not JSON-serializable
warnings.warn(msg)
Exception:
Fallback to CPU
Exception:
Fallback to CPU


I'm getting several of these also. this is a problem too. you can always tell when the task basically stalls with almost no progress.

I had only those on one of my machines.
Apparently it had lost track of the GPU for crunching.
Rebooting brought the Nvidia driver back to the BOINC client.

Apart from this, I found out that I can't run these tasks alongside Private GFN Server tasks on a 6 GB GPU, so I turned the PYSCFbeta tasks off for this machine; I often have to wait for tasks to download from GPUGRID, and I don't want my GPUs to run idle.
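
(As an aside, the same effect can also be achieved client-side with a GPU exclusion in cc_config.xml instead of deselecting the app in the web preferences. A sketch follows; the app short name is a guess, the project URL should match what the client shows, and the device_num line can be omitted to exclude the app from every GPU on the host.)

<!-- Sketch: cc_config.xml (in the BOINC data directory) excluding one app
     from a specific GPU. App name and URL are assumptions to adjust. -->
<cc_config>
  <options>
    <exclude_gpu>
      <url>https://www.gpugrid.net/</url>
      <device_num>0</device_num>
      <app>PYSCFbeta</app>
    </exclude_gpu>
  </options>
</cc_config>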
- - - - - - - - - -
Greetings, Jens
ID: 61141

gemini8

Joined: 3 Jul 16
Posts: 31
Credit: 2,248,809,169
RAC: 0
Message 61145 - Posted: 1 Feb 2024, 21:47:40 UTC

Did we encounter this one already?
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
14:21:03 (24335): wrapper (7.7.26016): starting
14:21:36 (24335): wrapper (7.7.26016): starting
14:21:36 (24335): wrapper: running bin/python (bin/conda-unpack)
14:21:38 (24335): bin/python exited; CPU time 0.223114
14:21:38 (24335): wrapper: running bin/tar (xjvf input.tar.bz2)
14:21:39 (24335): bin/tar exited; CPU time 0.005282
14:21:39 (24335): wrapper: running bin/bash (run.sh)
+ echo 'Setup environment'
+ source bin/activate
++ _conda_pack_activate
++ local _CONDA_SHELL_FLAVOR
++ '[' -n x ']'
++ _CONDA_SHELL_FLAVOR=bash
++ local script_dir
++ case "$_CONDA_SHELL_FLAVOR" in
+++ dirname bin/activate
++ script_dir=bin
+++ cd bin
+++ pwd
++ local full_path_script_dir=/var/lib/boinc-client/slots/4/bin
+++ dirname /var/lib/boinc-client/slots/4/bin
++ local full_path_env=/var/lib/boinc-client/slots/4
+++ basename /var/lib/boinc-client/slots/4
++ local env_name=4
++ '[' -n '' ']'
++ export CONDA_PREFIX=/var/lib/boinc-client/slots/4
++ CONDA_PREFIX=/var/lib/boinc-client/slots/4
++ export _CONDA_PACK_OLD_PS1=
++ _CONDA_PACK_OLD_PS1=
++ PATH=/var/lib/boinc-client/slots/4/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
++ PS1='(4) '
++ case "$_CONDA_SHELL_FLAVOR" in
++ hash -r
++ local _script_dir=/var/lib/boinc-client/slots/4/etc/conda/activate.d
++ '[' -d /var/lib/boinc-client/slots/4/etc/conda/activate.d ']'
+ export PATH=/var/lib/boinc-client/slots/4:/var/lib/boinc-client/slots/4/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
+ PATH=/var/lib/boinc-client/slots/4:/var/lib/boinc-client/slots/4/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
+ echo 'Create a temporary directory'
+ export TMP=/var/lib/boinc-client/slots/4/tmp
+ TMP=/var/lib/boinc-client/slots/4/tmp
+ mkdir -p /var/lib/boinc-client/slots/4/tmp
+ export OMP_NUM_THREADS=1
+ OMP_NUM_THREADS=1
+ export CUDA_VISIBLE_DEVICES=0
+ CUDA_VISIBLE_DEVICES=0
+ export CUPY_CUDA_LIB_PATH=/var/lib/boinc-client/slots/4/cupy
+ CUPY_CUDA_LIB_PATH=/var/lib/boinc-client/slots/4/cupy
+ echo 'Running PySCF'
+ python compute_dft.py
/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/gpu4pyscf/lib/cutensor.py:174: UserWarning: using cupy as the tensor contraction engine.
  warnings.warn(f'using {contract_engine} as the tensor contraction engine.')
/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/pyscf/dft/libxc.py:771: UserWarning: Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, corresponding to the original definition by Stephens et al. (issue 1480) and the same as the B3LYP functional in Gaussian. To restore the VWN5 definition, you can put the setting "B3LYP_WITH_VWN5 = True" in pyscf_conf.py
  warnings.warn('Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, '
/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/pyscf/gto/mole.py:1280: UserWarning: Function mol.dumps drops attribute charge because it is not JSON-serializable
  warnings.warn(msg)
Traceback (most recent call last):
  File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/cupy/cuda/compiler.py", line 253, in _jitify_prep
    name, options, headers, include_names = jitify.jitify(source, options)
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "cupy/cuda/jitify.pyx", line 63, in cupy.cuda.jitify.jitify
  File "cupy/cuda/jitify.pyx", line 88, in cupy.cuda.jitify.jitify
RuntimeError: Runtime compilation failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/boinc-client/slots/4/compute_dft.py", line 125, in <module>
    e,f,dip,q = compute_gpu(mol)
                ^^^^^^^^^^^^^^^^
  File "/var/lib/boinc-client/slots/4/compute_dft.py", line 32, in compute_gpu
    e_dft = mf.kernel()  # compute total energy
            ^^^^^^^^^^^
  File "<string>", line 2, in kernel
  File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/gpu4pyscf/scf/hf.py", line 586, in scf
    _kernel(self, self.conv_tol, self.conv_tol_grad,
  File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/gpu4pyscf/scf/hf.py", line 393, in _kernel
    mf.init_workflow(dm0=dm)
  File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/gpu4pyscf/df/df_jk.py", line 63, in init_workflow
    rks.initialize_grids(mf, mf.mol, dm0)
  File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/gpu4pyscf/dft/rks.py", line 80, in initialize_grids
    ks.grids = prune_small_rho_grids_(ks, ks.mol, dm, ks.grids)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/gpu4pyscf/dft/rks.py", line 49, in prune_small_rho_grids_
    logger.debug(grids, 'Drop grids %d', grids.weights.size - cupy.count_nonzero(idx))
                                                              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/cupy/_sorting/count.py", line 24, in count_nonzero
    return _count_nonzero(a, axis=axis)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "cupy/_core/_reduction.pyx", line 608, in cupy._core._reduction._SimpleReductionKernel.__call__
  File "cupy/_core/_reduction.pyx", line 364, in cupy._core._reduction._AbstractReductionKernel._call
  File "cupy/_core/_cub_reduction.pyx", line 701, in cupy._core._cub_reduction._try_to_call_cub_reduction
  File "cupy/_core/_cub_reduction.pyx", line 538, in cupy._core._cub_reduction._launch_cub
  File "cupy/_core/_cub_reduction.pyx", line 473, in cupy._core._cub_reduction._cub_two_pass_launch
  File "cupy/_util.pyx", line 64, in cupy._util.memoize.decorator.ret
  File "cupy/_core/_cub_reduction.pyx", line 246, in cupy._core._cub_reduction._SimpleCubReductionKernel_get_cached_function
  File "cupy/_core/_cub_reduction.pyx", line 231, in cupy._core._cub_reduction._create_cub_reduction_function
  File "cupy/_core/core.pyx", line 2251, in cupy._core.core.compile_with_cache
  File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/cupy/cuda/compiler.py", line 496, in _compile_module_with_cache
    return _compile_with_cache_cuda(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/cupy/cuda/compiler.py", line 574, in _compile_with_cache_cuda
    ptx, mapping = compile_using_nvrtc(
                   ^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/cupy/cuda/compiler.py", line 322, in compile_using_nvrtc
    return _compile(source, options, cu_path,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/cupy/cuda/compiler.py", line 287, in _compile
    options, headers, include_names = _jitify_prep(
                                      ^^^^^^^^^^^^^
  File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/cupy/cuda/compiler.py", line 260, in _jitify_prep
    raise JitifyException(str(cex))
cupy.cuda.compiler.JitifyException: Runtime compilation failed
14:23:34 (24335): bin/bash exited; CPU time 14.043607
14:23:34 (24335): app exit status: 0x1
14:23:34 (24335): called boinc_finish(195)

</stderr_txt>
]]>

- - - - - - - - - -
Greetings, Jens
ID: 61145

Ian&Steve C.

Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Message 61146 - Posted: 1 Feb 2024, 21:54:29 UTC - in response to Message 61145.  

That looks like a driver issue.

But something else I noticed is that these tasks, for the most part, are having a very high failure rate: 30-50% on most hosts.

There are a few hosts with few or no errors, however, and all of them are hosts with 24-48 GB of VRAM. So it seems something like 30-50% of the tasks require more than 12-16 GB.

I'm sure the project has a very large error percentage to sort through, as there aren't enough 24-48 GB GPUs to catch all the resends.
ID: 61146

Retvari Zoltan

Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Message 61147 - Posted: 1 Feb 2024, 22:22:48 UTC
Last modified: 1 Feb 2024, 22:26:22 UTC

The present batch has a far worse failure ratio than the previous one.
ID: 61147

Boca Raton Community HS

Joined: 27 Aug 21
Posts: 38
Credit: 7,254,068,306
RAC: 0
Message 61149 - Posted: 2 Feb 2024, 2:21:12 UTC - in response to Message 61146.  
Last modified: 2 Feb 2024, 2:21:54 UTC

that looks like a driver issue.

but something else I noticed is that these tasks for the most part are having a very high failure rate. 30-50% on most hosts.

there are a few hosts that have few or no errors however, and all of them are hosts with 24-48GB of VRAM. so it seems something like 30-50% of the tasks require more than 12-16GB.

I'm sure the project has a very large error percentage to sort through, as there arent enough 24-48GB GPUs to catch all the resends


This is 100% correct.

Our system with 2x RTX A6000s (48 GB of VRAM each) has had 500 valid results and no errors. They are running tasks at 2x and seem to run really well (https://www.gpugrid.net/results.php?hostid=616410).

In one of our systems with 3x RTX A4500 GPUs (20 GB), as soon as I changed from running these tasks at 2x to 1x, the error rate greatly improved (https://www.gpugrid.net/results.php?hostid=616409). I made the change and have had 14 tasks in a row without errors.

When I am back in the classroom, I think I will be changing anything with 24 GB or less to run only one task in order to improve the valid rate.

Has anyone tried running MPS with these tasks, and would it make a difference in the allocation of resources to successfully run 2x? Just curious about thoughts.
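
(For context: on Linux, the NVIDIA MPS control daemon is started roughly as sketched below. This is generic NVIDIA usage, not project-specific guidance; MPS improves how concurrent tasks share the GPU's compute scheduler, but it does not reduce the VRAM each task needs.)

# Generic sketch: start the NVIDIA MPS control daemon for GPU 0.
export CUDA_VISIBLE_DEVICES=0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps   # default locations, shown explicitly
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
nvidia-cuda-mps-control -d                       # start the daemon

# Stop it again later:
echo quit | nvidia-cuda-mps-control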
ID: 61149

Bedrich Hajek

Joined: 28 Mar 09
Posts: 490
Credit: 11,731,645,728
RAC: 69
Message 61150 - Posted: 2 Feb 2024, 2:57:47 UTC

Last week, I had a 100% success rate. This week, it's a different story. Maybe it's time to step back and dial it down a bit. You have to work with the resources that you have, not the ones that you wish you had.


ID: 61150

Ian&Steve C.

Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Message 61151 - Posted: 2 Feb 2024, 4:07:19 UTC - in response to Message 61149.  

Boca,

How much VRAM do you see actually being used on some of these tasks? Mind watching a few? You'll have to run a watch command to see continuous output of VRAM utilization, since the usage isn't constant; it spikes up and down. I'm just curious how much is actually needed. Most of the tasks I was running would spike up to about 8 GB, but I assume the tasks that needed more just failed instead, so I can't know how much they were trying to use. Even though these Titan Vs are great DP performers, they only have 12 GB of VRAM. Even most of the 16 GB cards like the V100 and P100 are seeing very high error rates.
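
For example, one way to do that with the driver's own tools (usage spikes, so poll frequently):

watch -n 1 nvidia-smi
# or log just the numbers, once per second:
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 1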

MPS helps, but not enough with this current batch. I was getting good throughput running tasks at 3x on the batches last week.
ID: 61151

gemini8

Joined: 3 Jul 16
Posts: 31
Credit: 2,248,809,169
RAC: 0
Message 61152 - Posted: 2 Feb 2024, 6:43:20 UTC - in response to Message 61146.  

that looks like a driver issue.

That's what Pascal (?) wrote in the Q&A as well.

Had three tasks on that host, and two of them failed.
- - - - - - - - - -
Greetings, Jens
ID: 61152

Pascal

Joined: 15 Jul 20
Posts: 95
Credit: 2,550,803,412
RAC: 248
Message 61153 - Posted: 2 Feb 2024, 8:33:53 UTC

Not everyone has a 5,000-euro graphics card with 24 GB of VRAM or more; you should think of the more modest among us.
I have an RTX 4060 and a GTX 1650, for example, but all I get is errors.
I think most of the people who compute for GPUGRID and eagerly wait for work for their GPUs are like me.

I keep thinking the problem is my system installation, so I reformat and do a clean install, hoping it will work properly. In vain, because the problem comes from your faulty work units.
ID: 61153