python tasks get to 2.00% and hang
| Author | Message |
|---|---|
|
Joined: 23 Nov 09 Posts: 5 Credit: 382,298,193 RAC: 0
|
The remaining estimate is over 45 days on an 8th-gen Intel system with an NVIDIA 1060 GPU. I also see some crash popups mentioning Python; the task in BOINC still shows Active but never seems to progress. I have suspended it and even rebooted, but the task is still stuck at 2%. Perhaps my mix of software (Android development/emulators, etc. that use VT-d modes) is causing problems? I guess it would be nice if you could roll everything up into an .EXE without needing to run a VirtualBox VM. |
|
Joined: 9 May 13 Posts: 171 Credit: 4,739,796,466 RAC: 334,273
|
AFAIK, GPUGRID does not use a VirtualBox VM. At least it doesn't on either of my machines. The remaining estimate always starts out really high until several tasks have completed successfully; then it becomes more accurate. A Python task will use about 65% of available CPU threads. For example, on my 6 core 12 thread CPU, Python tasks will use 7-8 threads. All of your tasks that I checked show an error message of "OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\ProgramData\BOINC\slots\13\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll" or one of its dependencies. Traceback (most recent call last):" I suggest that you increase the size of your paging file, run a Python task by itself while watching system usage, and let it run through to completion without interruption. Let us know if that helps. |
|
Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 8
|
The problem is that Windows honors every request to reserve memory without question, and the Python tasks request a ton of memory, more than the automatically sized paging file can handle. The solution is to deselect automatic sizing and either set a system managed size or choose Custom size and set a very large value, on the order of tens of gigabytes. A good GitHub reply on the problem sums up the issue concisely.
The Python on GPU tasks spawn 32 workers, so the constraint is 32 * MemoryPerProcess < RAM + PageFileSize, and many of the PyTorch DLLs each request a couple of GB of memory allocation. This is why it is difficult to run the tasks on GPUs: system memory + pagefile size is most often inadequate. The pagefile size needs to be greatly increased, which means you need a large piece of storage real estate for the pagefile. |
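For anyone who wants to plug in their own numbers, here is a minimal sketch of the inequality above. The 32 workers come from the post; the 2 GB per process and 16 GB of RAM are made-up illustrative values, and `min_pagefile_gb` is a hypothetical helper name, not anything from GPUGRID or BOINC.

```python
def min_pagefile_gb(workers: int, mem_per_process_gb: float, ram_gb: float) -> float:
    """Page file size (GB) that must be exceeded so that
    workers * mem_per_process_gb < ram_gb + pagefile_gb."""
    commit_demand = workers * mem_per_process_gb
    # If RAM alone already covers the demand, no page file is required.
    return max(0.0, commit_demand - ram_gb)


# Assumed values for illustration only: 2 GB per worker, 16 GB of RAM.
needed = min_pagefile_gb(workers=32, mem_per_process_gb=2.0, ram_gb=16.0)
print(f"Page file must be larger than {needed:.0f} GB")
```

The real per-process figure varies by task and CUDA version, so treat the result as a rough lower bound and watch actual commit usage while a task runs.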
|
Joined: 14 Oct 11 Posts: 31 Credit: 81,420,504 RAC: 0
|
Hi & thanks for the clear info. I was just rejoining to crunch for the winter season, and my first thought (after frantically increasing swap size) was indeed: "is something assuming Linux-style overcommit?" Out of curiosity I also ran VMMap, which led me to this SO answer by someone who did the same: https://stackoverflow.com/a/69489193/932359 The frustrating bit is that it seems to be just an incorrect flag set by NVIDIA for their embedded "fat binaries": setting copy-on-write means Windows' rigorous memory accounting needs to commit space for each instance. If it's not space for _data_ then it should be read-only so it can be shared (memory-mapped) between all processes. It reads to me that this probably includes the binary code for all the various GPUs we don't own! :-D Notable quote: edit 2022-01-20: Per NVIDIA: "We have gone ahead and marked the nv_fatb section as read-only, this change will be targeting next major CUDA release 11.7. We are not changing the ASLR, as that is considered a safety feature." I take it, just from the filenames, that GPUGrid is using CUDA 10.x? (Their DLLs don't seem to embed version info.) (Problem for me is I'm trying to use an old 64GB SSD as a scratch disk for all swap & temp files. But not to worry, I'll see if I can scrape by with 48 GB & will add another swapfile on another disk if need be.) |
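To illustrate why the copy-on-write flag matters, here is a deliberately simplified model of commit charge, assuming a copy-on-write section is charged to every process that loads the DLL while a read-only section is backed by the DLL file and charged to none. The sizes are invented for the example and `total_commit_gb` is a hypothetical name; real accounting in Windows has more moving parts.

```python
def total_commit_gb(processes: int, private_gb: float,
                    fatbin_gb: float, copy_on_write: bool) -> float:
    """Simplified commit-charge model for N processes loading the same DLLs.

    A copy-on-write section counts against commit once per process;
    a read-only section is treated as file-backed and not counted.
    """
    per_process = private_gb + (fatbin_gb if copy_on_write else 0.0)
    return processes * per_process


# Assumed values: 32 workers, 0.5 GB private data each,
# 1.5 GB of nv_fatb sections across the loaded DLLs.
before = total_commit_gb(32, 0.5, 1.5, copy_on_write=True)
after = total_commit_gb(32, 0.5, 1.5, copy_on_write=False)
print(f"copy-on-write: {before:.0f} GB, read-only: {after:.0f} GB")
```

Under these assumed numbers most of the commit demand comes from the section that every worker has to reserve but never modifies.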
|
Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 8
|
The Nvidia drivers are already up to CUDA 11.7 in the 515 series. So maybe you can ping the developer abouh and see whether he can drop the lower-compatibility CUDA 10.2 and 11.3 versions he is compiling the Windows apps with and move to the CUDA 11.7 SDK, so that the nv_fatb sections in the DLLs will be marked read-only. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 2
|
"I take it, just from the filenames, that GPUGrid is using CUDA 10.x? (Their DLLs don't seem to embed version info.)" The app is CUDA 11.3.1 (11.3 Update 1) or CUDA 10.2. Which version you get probably depends on which drivers you have. Since your 512 drivers support 11.3+, you got the cuda1131 app. You can see it here in your list of tasks: http://www.gpugrid.net/results.php?hostid=470942
|
|
Joined: 14 Oct 11 Posts: 31 Credit: 81,420,504 RAC: 0
|
"The Nvidia drivers are already up to CUDA 11.7 in the 515 series. (...)" Ah, right. I didn't realise this would also put a minimum version constraint on the local graphics drivers. Mine was at 512.15 from Mar-2022 via Windows automatic updates. (I'm currently downloading the latest to install manually.) I guess this catches users between "a rock" (runs with larger than expected memory consumption) and "a hard place" (doesn't run at all due to driver incompatibility).
|
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 2
|
"The Nvidia drivers are already up to CUDA 11.7 in the 515 series. (...)" Since your 1060 supports old drivers, you could roll back to older drivers that support CUDA 10.2 but not 11.3 and get the 10.2 app. Maybe this will use less memory, since it has no CUDA 11+ code and no binaries for CUDA 11 cards.
|
©2026 Universitat Pompeu Fabra