PYSCFbeta: Quantum chemistry calculations on GPU
| Author | Message |
|---|---|
|
Joined: 27 May 21 Posts: 54 Credit: 1,004,151,720 RAC: 0
|
Why do I always get a segmentation fault? I'm getting the same issues running through WSL2: an immediate segmentation fault. https://www.gpugrid.net/result.php?resultid=33853832 https://www.gpugrid.net/result.php?resultid=33853734 The environment and drivers should be fine, since this machine runs other projects' GPU tasks without problems - unless GPUGRID has some specific prerequisites? Working task from another project: https://moowrap.net/result.php?resultid=201144661 Installing a native Linux OS is simply not an option for most regular users who don't have dedicated compute farms... |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
|
Then I guess you'll just have to wait for the native Windows app. It seems apparent that something doesn't work with these tasks under WSL, so there is indeed some kind of problem or incompatibility related to WSL. The fact that some other app works isn't really relevant. A key difference is probably in how these apps are distributed: Moo Wrapper uses a compiled binary, while the QChem work is supplied as an entire Python environment designed to work on a native Linux install (it sets up a lot of things, such as environment variables, which might not be correct for WSL). These tasks also use CuPy, which might not be well supported under WSL, or the way CuPy is being called might not be right for WSL. Either way, I don't think there's going to be a solution for WSL. Switch to Linux, or wait for the Windows version.
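One quick way to narrow this down might be a small CuPy sanity check run with the Python interpreter bundled in the project directory under WSL. This is only a diagnostic sketch, not part of the GPUGRID app, and assumes CuPy is importable from that environment:

```python
# Hypothetical diagnostic: run with the Python interpreter bundled in the
# GPUGRID project directory (the exact path will differ per host).
import cupy

print("CuPy version:", cupy.__version__)
print("CUDA devices visible:", cupy.cuda.runtime.getDeviceCount())

free_b, total_b = cupy.cuda.Device(0).mem_info      # (free, total) in bytes
print(f"VRAM free/total: {free_b/2**30:.1f} / {total_b/2**30:.1f} GiB")

x = cupy.random.rand(1024, 1024)                    # force a real kernel launch
print("Small GEMM on GPU:", float((x @ x).sum()))
```

If something like this already crashes under WSL but works on bare-metal Linux, that would point at the CuPy/CUDA stack under WSL rather than at the tasks themselves.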
|
|
Joined: 15 Jul 20 Posts: 95 Credit: 2,550,803,412 RAC: 203
|
Hello, I noticed that you are losing users. Not many, but the number of GPUGRID users is decreasing. Maybe the hardware requirements are too high, and the credits are no longer what they used to be. |
|
Joined: 27 May 21 Posts: 54 Credit: 1,004,151,720 RAC: 0
|
Hello. That's hardly surprising given this stat: https://www.boincstats.com/stats/45/host/breakdown/os/ - 2,500+ Windows hosts versus 688 Linux hosts. Yet Windows hosts are not getting any work, so they are not given the opportunity to contribute to research or to beta testing, even if they're prepared to go the extra mile to get experimental apps working. It's only logical that people start leaving - certainly the set-it-and-forget-it crowd. |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
Yet Windows hosts are not getting any work, so they are not given the opportunity to contribute to research or to beta testing, even if they're prepared to go the extra mile to get experimental apps working. When I joined GPUGRID about 9 years ago, all subprojects were available for both Linux and Windows. At that time, and even several years later, my hosts were working for GPUGRID almost 365 days a year. Somehow it makes me sad that I am less and less able to contribute to this valuable project. Recently, someone here explained the reason: scientific projects are primarily run on Linux, not on Windows. Why so, all of a sudden? |
|
Joined: 27 May 21 Posts: 54 Credit: 1,004,151,720 RAC: 0
|
Then I guess you'll just have to wait for the native Windows app. It seems apparent that something doesn't work with these tasks under WSL. [...] Switch to Linux, or wait for the Windows version.
It could be that, yes. But it could also be a memory overflow. I'm running a GTX 1080 Ti with 11 GB of VRAM. Running from the command line with nvidia-smi logging, I see memory climbing to about 8 GB allocated, then a segmentation fault - which could be caused by a block allocation pushing past the 11 GB limit? Monitoring output (a scripted logging sketch follows the table below):
# gpu pwr gtemp mtemp sm mem enc dec jpg ofa mclk pclk pviol tviol fb bar1 ccpm sbecc dbecc pci rxpci txpci
# Idx W C C % % % % % % MHz MHz % bool MB MB MB errs errs errs MB/s MB/s
0 15 30 - 2 8 0 0 - - 405 607 0 0 1915 2 - - - 0 0 0
0 17 30 - 2 8 0 0 - - 405 607 0 0 1915 2 - - - 0 0 0
0 74 33 - 2 1 0 0 - - 5005 1569 0 0 2179 2 - - - 0 0 0
0 133 39 - 77 5 0 0 - - 5005 1987 0 0 4797 2 - - - 0 0 0
0 167 49 - 63 16 0 0 - - 5005 1974 0 0 6393 2 - - - 0 0 0
0 119 54 - 74 4 0 0 - - 5005 1974 0 0 8329 2 - - - 0 0 0
0 87 47 - 0 0 0 0 - - 5508 1974 0 0 1915 2 - - - 0 0 0
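For reference, a minimal sketch of how this kind of VRAM logging could be scripted alongside a task; it assumes nvidia-smi is on the PATH and simply polls the query interface once per second (the output file name is arbitrary):

```python
# Minimal VRAM polling sketch (assumes nvidia-smi is on the PATH).
# Appends one CSV line per second with used/total framebuffer memory in MiB.
import subprocess, time

QUERY = ["nvidia-smi", "--query-gpu=timestamp,memory.used,memory.total",
         "--format=csv,noheader,nounits"]

with open("vram_log.csv", "a") as log:          # arbitrary output file name
    while True:
        line = subprocess.run(QUERY, capture_output=True, text=True).stdout.strip()
        log.write(line + "\n")
        log.flush()
        time.sleep(1)
```

If the last value logged before the crash stays well below the card's 11 GB, that would support the idea that the segfault is not a simple out-of-memory condition.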
Command-line run output:
/var/lib/boinc/projects/www.gpugrid.net/bck/lib/python3.11/site-packages/gpu4pyscf/lib/cutensor.py:174: UserWarning: using cupy as the tensor contraction engine.
warnings.warn(f'using {contract_engine} as the tensor contraction engine.')
/var/lib/boinc/projects/www.gpugrid.net/bck/lib/python3.11/site-packages/pyscf/dft/libxc.py:771: UserWarning: Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, corresponding to the original definition by Stephens et al. (issue 1480) and the same as the B3LYP functional in Gaussian. To restore the VWN5 definition, you can put the setting "B3LYP_WITH_VWN5 = True" in pyscf_conf.py
warnings.warn('Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, '
nao = 590
reading molecules in current dir
mol_130305284_conf_0.xyz
mol_130305284_conf_1.xyz
mol_130305284_conf_2.xyz
mol_130305284_conf_3.xyz
mol_130305284_conf_4.xyz
mol_130305284_conf_5.xyz
mol_130305284_conf_6.xyz
mol_130305284_conf_7.xyz
mol_130305284_conf_8.xyz
mol_130305284_conf_9.xyz
['mol_130305284_conf_0.xyz', 'mol_130305284_conf_1.xyz', 'mol_130305284_conf_2.xyz', 'mol_130305284_conf_3.xyz', 'mol_130305284_conf_4.xyz', 'mol_130305284_conf_5.xyz', 'mol_130305284_conf_6.xyz', 'mol_130305284_conf_7.xyz', 'mol_130305284_conf_8.xyz', 'mol_130305284_conf_9.xyz']
Computing energy and forces for molecule 1 of 10
charge = 0
Structure:
('I', [-9.750986802755719, 0.9391938839088357, 0.1768783652592898])
('C', [-5.895945508642993, 0.12453295160883758, 0.05083363275080016])
('C', [-4.856596140132209, -2.2109795657411224, -0.2513335745671532])
('C', [-2.2109795657411224, -2.0220069532846163, -0.24377467006889297])
('O', [-0.304245906054975, -3.7227604653931716, -0.46865207889213534])
('C', [1.8519316020737606, -2.3621576557063273, -0.3080253583041051])
('C', [4.440856392727896, -2.9668700155671472, -0.4006219384077931])
('C', [5.839253724906041, -0.8163616858121067, -0.1379500070932495])
('I', [9.769884064001369, -0.6368377039784259, -0.13889487015553204])
('S', [4.100705690306184, 1.9464179083020137, 0.22298768269867728])
('C', [1.3587130835622794, 0.22298768269867728, 0.02022006953284616])
('C', [-1.2925726692025024, 0.43463700864996424, 0.06254993472310354])
('S', [-3.7227604653931716, 2.5700275294084842, 0.3477096069199714])
('H', [-5.914842769888644, -3.9306303390953286, -0.46298290051844015])
('H', [5.19674684255392, -4.818801617640907, -0.640617156227556])
******** <class 'gpu4pyscf.df.df_jk.DFRKS'> ********
method = DFRKS
initial guess = minao
damping factor = 0
level_shift factor = 0
DIIS = <class 'gpu4pyscf.scf.diis.CDIIS'>
diis_start_cycle = 1
diis_space = 8
SCF conv_tol = 1e-09
SCF conv_tol_grad = None
SCF max_cycles = 50
direct_scf = False
chkfile to save SCF result = /var/lib/boinc/projects/www.gpugrid.net/bck/tmp/tmpd03fogee
max_memory 4000 MB (current use 345 MB)
XC library pyscf.dft.libxc version 6.2.2
unable to decode the reference due to https://github.com/NVIDIA/cuda-python/issues/29
XC functionals = wB97M-V
N. Mardirossian and M. Head-Gordon., J. Chem. Phys. 144, 214110 (2016)
radial grids:
Treutler-Ahlrichs [JCP 102, 346 (1995); DOI:10.1063/1.469408] (M4) radial grids
becke partition: Becke, JCP 88, 2547 (1988); DOI:10.1063/1.454033
pruning grids: <function nwchem_prune at 0x7f29529356c0>
grids dens level: 3
symmetrized grids: False
atomic radii adjust function: <function treutler_atomic_radii_adjust at 0x7f2952935580>
** Following is NLC and NLC Grids **
NLC functional = wB97M-V
radial grids:
Treutler-Ahlrichs [JCP 102, 346 (1995); DOI:10.1063/1.469408] (M4) radial grids
becke partition: Becke, JCP 88, 2547 (1988); DOI:10.1063/1.454033
pruning grids: <function nwchem_prune at 0x7f29529356c0>
grids dens level: 3
symmetrized grids: False
atomic radii adjust function: <function treutler_atomic_radii_adjust at 0x7f2952935580>
small_rho_cutoff = 1e-07
Set gradient conv threshold to 3.16228e-05
Initial guess from minao.
Default auxbasis def2-tzvpp-jkfit is used for H def2-tzvppd
Default auxbasis def2-tzvpp-jkfit is used for C def2-tzvppd
Default auxbasis def2-tzvpp-jkfit is used for S def2-tzvppd
Default auxbasis def2-tzvpp-jkfit is used for O def2-tzvppd
Default auxbasis def2-tzvpp-jkfit is used for I def2-tzvppd
/var/lib/boinc/projects/www.gpugrid.net/bck/lib/python3.11/site-packages/pyscf/gto/mole.py:1280: UserWarning: Function mol.dumps drops attribute charge because it is not JSON-serializable
warnings.warn(msg)
tot grids = 225920
tot grids = 225920
segmentation fault
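For context, the log above corresponds roughly to the following kind of gpu4pyscf density-fitted RKS single point. This is only a sketch reconstructed from the settings printed in the log (functional, basis, auxiliary basis, convergence threshold); the project's actual wrapper script is not shown here and may differ:

```python
# Rough sketch of the per-molecule calculation implied by the log above.
# wB97M-V / def2-tzvppd / def2-tzvpp-jkfit are taken from the log output;
# the real GPUGRID wrapper script may set things up differently.
from pyscf import gto
from gpu4pyscf.dft import rks

mol = gto.M(atom="mol_130305284_conf_0.xyz",    # one of the .xyz files listed above
            basis="def2-tzvppd", charge=0, verbose=4)

mf = rks.RKS(mol, xc="wB97m-v").density_fit(auxbasis="def2-tzvpp-jkfit")
mf.conv_tol = 1e-9                              # matches "SCF conv_tol = 1e-09"
energy = mf.kernel()                            # SCF energy computed on the GPU

gradients = mf.nuc_grad_method().kernel()       # nuclear gradients (forces = -gradients)
print(energy, gradients.shape)
```

Running a single conformer in isolation like this, on bare-metal Linux and then under WSL, would also be a way to reproduce the segfault outside of BOINC.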
|
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
|
First, it's well known at this point that these tasks require a lot of VRAM, so some failures are to be expected from that. The VRAM utilization is not constant but spikes up and down. From the tasks running on my systems, loading up to 5-6 GB and staying around that amount is pretty normal, with intermittent spikes into the 9-12 GB+ range. Just by looking at the failure rate of different GPUs, I'm estimating that most tasks need more than 8 GB (>70%), a small fraction need more than 12 GB (~5%), and a very small number need even more than 16 GB (<1%). A teammate of mine is running a couple of 2080 Tis (11 GB) and has had some failures but mostly successes. When tasks hit memory limits they fail, but not with a segfault - you always get some kind of memory-allocation error printed in the stderr. With an 11 GB GPU you should be seeing a majority of successes. Since your tasks all fail in the same way, with a segfault, that tells me it's not a memory-allocation problem but something else. And now, with two people having the same problem and both using WSL, it's clear that WSL is the root of the problem. The tasks were not set up to run in that environment.
|
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 731
|
Yet Windows hosts are not getting any work, so they are not given the opportunity to contribute to research or to beta testing, even if they're prepared to go the extra mile to get experimental apps working. I posed the question "how long have most scientific research projects used Linux compared to Windows?" to Google, and its AI engine came up with this response: Linux is a popular choice for research organisations because it offers flexibility, security, stability, and cost-effectiveness. Linux is also used in technical disciplines at universities and research centers because it's free and includes a large amount of free and open-source software. |
|
Joined: 15 Jul 20 Posts: 95 Credit: 2,550,803,412 RAC: 203
|
One thing is certain: there will not be enough users to compute everything. There are 50,462 tasks for 106 computers at the time I'm writing this, and they are arriving faster than tasks are being completed. I think GPUGRID is heading straight into a wall if they do nothing. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
|
We are processing about 12,000 tasks per day, so the current queue is a little more than 4 days' worth of work (≈ 50,462 / 12,000 ≈ 4.2 days), but the amount of available work is still climbing.
|
|
Joined: 27 May 21 Posts: 54 Credit: 1,004,151,720 RAC: 0
|
The choice of Linux as a research OS in an academic context is clear, but it really has no bearing on the choice of which platforms to support as a BOINC project. BOINC as a platform was always a 'supercomputer for peanuts' proposition: you invest a fraction of what a real supercomputer costs but get similar processing power, which is exactly what many low-budget academic research groups were looking for. Part of that investment is the choice of which platforms to support, and it is primarily driven by the amount of processing power needed, with the match to your native development OS only a secondary consideration. As I already said in my previous post, it all depends on what type of project you want to be:
1) You need all the power and/or turnaround you can get? Support all the platforms you can handle, with Windows as your #1 priority, because that's where the majority of the FLOPS are.
2) You don't really need that much power, and your focus is more on developing/researching algorithms? Stay on your native OS.
3) You need some of both? Prioritise your native OS for your beta apps, but keep a steady stream of stable work flowing on #1 Windows and #2 Linux to keep your supercomputer 'providers' engaged. That is the last part of the 'small investment' needed for your FLOPS: keeping your users happy and engaged.
So I see no issue at all with new betas being on Linux first, but I am also concerned, and saddened, that lately there has been only beta/Linux work, as opposed to the earlier days of GPUGRID. Unless, of course, the decision has been made to go full-on as a type 2) project? |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
|
There has been a bunch of ATM work intermittently, which works on Windows. They had to fix the Windows and Linux versions of that application at different times, so there were periods when Linux worked and Windows didn't, and periods when Windows worked and Linux didn't; for the most recent batch I believe both applications were working. This is still classified as "beta" for both Linux and Windows. The project admins/researchers have already mentioned a few times that a Windows app is in the pipeline, but it takes time. They obviously don't have a lot of expertise with Windows and are more comfortable in the Linux environment, so it makes sense that it will take more time and effort for them to get up to speed and get the Windows version working. They likely also need to sort out other parts of their workflow on the backend (work generation, task sizes, task configurations, batch sizes, etc.), and Linux users are the guinea pigs for that. They had many weeks of "false starts" with this QChem project, where they generated a bunch of work that was causing errors and ended up having to cancel the whole batch and try again the following week. It's a lot easier for the researchers to iron out these problems with one version of the code rather than juggling two versions with different code changes in each, and then, once most issues are sorted, port it to Windows. I think they are still figuring out what configurations work best for them on the backend and on the hardware available on GPUGRID. Steve had previously mentioned that he originally based things on high-end datacenter GPUs like the A100 with lots of VRAM, but changes are necessary to get the same results from our pool of users with much lower-end GPUs. When the Windows app comes, I imagine it will still be "beta" in the BOINC sense, but it will be a more polished setup than what Linux started with.
|
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 731
|
The researcher earlier stated there were NO Windows computers in the lab. Are you going to buy some for them or fund them? How many of you have actually donated monetarily to the project? |
|
Joined: 10 Feb 24 Posts: 1 Credit: 0 RAC: 0
|
I'm using Docker for Windows, which uses WSL2 as its backend, and I'm having the same problems - another hint that WSL is the problem. Other projects that use my NVIDIA card work fine, though. For now I've disabled "Quantum chemistry on GPU (beta)" and "If no work for selected applications is available, accept work from other applications" in my project settings to avoid this. Currently there's no other work available for me, but I'll keep an eye on the other tasks coming in. |
|
Joined: 28 Mar 09 Posts: 490 Credit: 11,731,645,728 RAC: 57
|
There is an obvious solution, which no one has mentioned, for Windows users who wish to contribute to this project, and at the risk of starting a proverbial firestorm, I will mention it: you could install Linux on your machine(s). I did it last year and it has worked out fine for me. I did the installation on a separate SSD, leaving the Windows disk intact. The default boot is set to Linux, with the option to boot into Windows when the need arises. The process of installing Linux itself was not difficult, but I did have an issue attaching my existing project accounts to BOINC; some of the Linux users crunching here helped me solve it - thank you again for the help. It is an option you might want to consider. |
|
Joined: 27 May 21 Posts: 54 Credit: 1,004,151,720 RAC: 0
|
Sure, that's an option, and no need to fear a firestorm - at least not from me; I've worked with Linux and other Unix flavours a lot over the years, both professionally and personally. And besides, I hate forum flame wars and tech-solution holy wars of any kind. ;-) The problem with that solution, for me and for many Windows users like me, is that it's an either/or solution: you either boot Linux or Windows. I have a single computer that I need for both work and personal stuff, and it requires Windows because the software stack is Microsoft-based - and not all of it has Linux alternatives that I have the time, patience or skills to explore. I also run BOINC on that machine at 50% CPU + 100% GPU, 24/7. When participating in Linux-only projects, I just spin up a VMware VM with 25% of the CPU and let that run in parallel - or, recently, WSL. I did just install Linux bare-metal on a partition of my data drive to confirm that WSL is the issue rather than the system, but for the reasons mentioned above I cannot let that run 24/7. FYI, Ian&Steve: you're right. PYSCFbeta on bare-metal Linux runs just fine, so it must indeed be some incompatibility with WSL. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
|
Why not run Linux as your primary OS, and then virtualise (or maybe even run under WINE) your Windows-only software?
|
|
Joined: 27 May 21 Posts: 54 Credit: 1,004,151,720 RAC: 0
|
Why not run Linux as your primary OS, and then virtualise (or maybe even run under WINE) your Windows-only software? Because I need Windows all the time, whereas in the last 15 years this is the only time I couldn't get something to work through a virtual Linux. And BOINC is just a hobby, after all... Would you switch your primary OS in such a case? On another note: DCF is going crazy again. Average runtimes are consistently around 30 minutes, yet DCF keeps climbing - the estimated runtime of new WUs is now at 76 days! On a positive note: not a single failure yet! |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
|
Well, I did switch all my computers to Linux, even my personal ones. The only Windows system I have is my work-provided laptop, but I could do everything I need on a Linux laptop; WINE runs a lot of things these days. Yeah, I don't know what's wrong with DCF. Mine goes crazy shortly after I fix it as well - it says my tasks will take something like 27 days even though most are done in 5-10 minutes.
|
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 731
|
Ian, are you saying that even after you've set DCF to a low value in the client_state file, it is still escalating? I set mine to 0.02 a month ago and, now that I've looked at my hosts here, it is still hanging around there. |
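For anyone wanting to try the same workaround, here is a minimal sketch of resetting the GPUGRID DCF. It assumes the BOINC client is stopped first, that the per-project duration_correction_factor element sits under the matching project block in client_state.xml, and that the data directory is in the common Linux location (both are assumptions that may differ on your host):

```python
# Sketch: reset GPUGRID's duration correction factor in client_state.xml.
# Stop the BOINC client before editing, then restart it afterwards.
# The path below is the usual Linux data directory and is an assumption.
import xml.etree.ElementTree as ET

STATE_FILE = "/var/lib/boinc/client_state.xml"
NEW_DCF = "0.02"

tree = ET.parse(STATE_FILE)
for project in tree.getroot().iter("project"):
    url = project.findtext("master_url", default="")
    if "gpugrid" in url:
        dcf = project.find("duration_correction_factor")
        if dcf is not None:
            print(f"{url}: DCF {dcf.text} -> {NEW_DCF}")
            dcf.text = NEW_DCF

tree.write(STATE_FILE)
```

Note that the client recalculates DCF as results complete, so a manually set value can drift again, which is consistent with what Ian&Steve describes above.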