Experimental Python tasks (beta) - task description

Richard Haselgrove

Message 58269 - Posted: 10 Jan 2022, 21:31:54 UTC - in response to Message 58268.  

You need to look at the creation time of the master WU, not of the individual tasks (which will vary, even within a WU, let alone a batch of WUs).
abouh

Message 58270 - Posted: 11 Jan 2022, 8:11:13 UTC - in response to Message 58265.  
Last modified: 11 Jan 2022, 8:11:37 UTC

I have seen this error a few times.

concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.


Do you think it could be due to a lack of resources? I think Linux starts killing processes if the system is over capacity.
Keith Myers
Message 58271 - Posted: 12 Jan 2022, 1:15:57 UTC

Might be the OOM-Killer kicking in. You would need to
grep -i kill /var/log/messages*

to check if processes were killed by the OOM-Killer.

If that is the case, you would have to configure /etc/sysctl.conf to make the system less sensitive to brief out-of-memory conditions.
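
On distros that log to the systemd journal rather than /var/log/messages, the same check can be scripted. A minimal sketch in Python, assuming journalctl is available and kernel messages are readable without root (this varies by distro):

import subprocess

# Dump kernel messages from the systemd journal and flag OOM-killer activity.
log = subprocess.run(
    ["journalctl", "-k", "--no-pager"],
    capture_output=True, text=True, check=False,
).stdout

for line in log.splitlines():
    if "oom" in line.lower() or "Killed process" in line:
        print(line)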
Richard Haselgrove

Message 58272 - Posted: 12 Jan 2022, 8:56:21 UTC

I Googled the error message, and came up with this Stack Overflow thread.

The problem seems to be specific to Python, and arises when running code that uses Python's concurrent-execution modules. There's a quote from the Python manual:

"The main module must be importable by worker subprocesses. This means that ProcessPoolExecutor will not work in the interactive interpreter. Calling Executor or Future methods from a callable submitted to a ProcessPoolExecutor will result in deadlock."

Other search results may provide further clues.
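
For context, the guard the manual is describing looks like this. A minimal sketch (not the project's actual code) of the pattern ProcessPoolExecutor requires:

from concurrent.futures import ProcessPoolExecutor

def square(x):
    return x * x

# Worker processes re-import this module, so the pool must only be created
# under the __main__ guard; without it, workers can fail at startup and the
# pending futures surface as BrokenProcessPool.
if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        print(list(pool.map(square, range(8))))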
abouh

Message 58273 - Posted: 12 Jan 2022, 15:11:50 UTC - in response to Message 58272.  
Last modified: 12 Jan 2022, 15:24:12 UTC

Thanks! Out of the possible explanations for the error listed in the thread, I suspect the OS could be killing the worker processes due to a lack of resources. It could be not enough RAM, or maybe Python raises this error when the ratio of processes to cores is too high (I have seen some machines with 4 CPUs where the task spawns 32 reinforcement learning environments).

All tasks run the same code, and on the majority of GPUGrid machines this error does not occur. Also, I have reviewed the failed jobs, and this error always occurs on the same hosts, so it is something specific to those machines. I will check whether I can find a common pattern among the hosts that get this error.
Keith Myers
Message 58274 - Posted: 12 Jan 2022, 16:46:57 UTC
Last modified: 12 Jan 2022, 16:55:04 UTC

What version of Python are the hosts that have the errors running?

Mine for example is:

python3 --version
Python 3.8.10

What kernel and OS?

Linux 5.11.0-46-generic x86_64
Ubuntu 20.04.3 LTS

I've had the errors on hosts with 32GB and 128GB. I would assume the hosts with 128GB to be in the clear, with no memory pressure.
ServicEnginIC
Message 58275 - Posted: 12 Jan 2022, 20:47:57 UTC

What version of Python are the hosts that have the errors running?

Mine for example is:

python3 --version
Python 3.8.10

Same Python version as my current one.

In case of doubt about conflicting Python versions, I published the solution that I applied to my hosts in Message #57833.
It worked for my Ubuntu 20.04.3 LTS Linux distribution, but user mmonnin replied that it didn't work for him.
mmonnin kindly published an alternative approach in his Message #57840.
mmonnin

Message 58276 - Posted: 13 Jan 2022, 2:31:57 UTC

I saw the prior post and was about to mention the same thing. Not sure which one worked, as the PC has been able to run tasks.

The recent tasks are taking a really long time:
2d13h 62.2% 1070 and 1080 GPU system
2d15h 60.4% 1070 and 1080 GPU system

2x concurrently on 3080Ti:
2d12h 61.3%
2d14h 60.4%
abouh

Message 58277 - Posted: 13 Jan 2022, 10:45:46 UTC - in response to Message 58274.  

All jobs should use the same Python version (3.8.10); I define it in the requirements.txt file of the conda environment.

Here are the specs from 3 hosts that failed with the BrokenProcessPool error:

OS:
Linux Debian Debian GNU/Linux 11 (bullseye) [5.10.0-10-amd64|libc 2.31 (Debian GLIBC 2.31-13+deb11u2)]
Linux Ubuntu Ubuntu 20.04.3 LTS [5.4.0-94-generic|libc 2.31 (Ubuntu GLIBC 2.31-0ubuntu9.3)]
Linux Linuxmint Linux Mint 20.2 [5.4.0-91-generic|libc 2.31 (Ubuntu GLIBC 2.31-0ubuntu9.2)]

Memory:
32081.92 MB
32092.04 MB
9954.41 MB

Keith Myers
Message 58278 - Posted: 13 Jan 2022, 19:55:11 UTC

I have a failed task today involving pickle.

magic_number = pickle_module.load(f, **pickle_load_args)
EOFError: Ran out of input

When I was investigating the BrokenProcessPool error, I saw posts that involved the word pickle and fixes for that error.

https://www.gpugrid.net/result.php?resultid=32733573
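
For reference, that EOFError is what pickle raises when it reaches the end of a truncated or empty file, which would fit a corrupted input file. A minimal sketch that reproduces it (the file name is just for illustration):

import pickle

# A zero-byte file stands in for a truncated or corrupted checkpoint download.
with open("checkpoint.pkl", "wb"):
    pass

try:
    with open("checkpoint.pkl", "rb") as f:
        pickle.load(f)
except EOFError as err:
    print("EOFError:", err)  # prints: EOFError: Ran out of input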
SuperNanoCat

Message 58279 - Posted: 13 Jan 2022, 21:18:41 UTC

The tasks run on my Tesla K20 for a while, but then fail when they need to use PyTorch, which requires a higher CUDA compute capability. Oh well. Guess I'll stick to the ACEMD tasks. The error output doesn't list the requirements properly, but from a little Googling, PyTorch was updated to require compute capability 3.7 within the past couple of years. The only Kepler card with 3.7 is the Tesla K80.

From this task:

[W NNPACK.cpp:79] Could not initialize NNPACK! Reason: Unsupported hardware.
/var/lib/boinc-client/slots/2/gpugridpy/lib/python3.8/site-packages/torch/cuda/__init__.py:120: UserWarning: 
    Found GPU%d %s which is of cuda capability %d.%d.
    PyTorch no longer supports this GPU because it is too old.
    The minimum cuda capability supported by this library is %d.%d.
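
For reference, a card's compute capability can be checked locally before committing to these tasks. A minimal sketch, assuming a working PyTorch install:

import torch

# Recent PyTorch binaries require compute capability >= 3.7, so a Kepler
# K20 (3.5) is rejected while a K80 (3.7) still qualifies.
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"GPU 0 compute capability: {major}.{minor}")
else:
    print("No CUDA device visible to PyTorch")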


While I'm here, is there any way to force the project to update my hardware configuration? It thinks my host has two Quadro K620s instead of one of those and the Tesla.
Ian&Steve C.

Message 58280 - Posted: 13 Jan 2022, 21:51:08 UTC - in response to Message 58279.  

While I'm here, is there any way to force the project to update my hardware configuration? It thinks my host has two Quadro K620s instead of one of those and the Tesla.


This is a problem (feature?) of BOINC, not the project. The project only knows what hardware you have based on what BOINC communicates to it.

With cards from the same vendor (NVIDIA/AMD/Intel), BOINC only lists the "best" card and then appends a number for how many total devices you have from that vendor. It will only list different models if they are from different vendors.

Within the NVIDIA vendor group, BOINC determines the "best" device by checking the compute capability first, then memory capacity, then some third metric that I can't remember right now. BOINC deems the K620 "best" because it has a higher compute capability (5.0) than the Tesla K20 (3.5), even though the K20 is arguably the better card, with more/faster memory and more cores.

All in all, this has nothing to do with the project and everything to do with BOINC's GPU ranking code.
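
Illustratively, the ranking described above behaves like a sort on a tuple of those metrics. A toy sketch of that logic (not BOINC's actual code; the third metric is omitted):

# Toy model of the GPU ranking described above (not BOINC's actual code).
gpus = [
    {"model": "Quadro K620", "cc": (5, 0), "vram_gb": 2},
    {"model": "Tesla K20",   "cc": (3, 5), "vram_gb": 5},
]

# Compute capability is compared first, then memory capacity.
best = max(gpus, key=lambda g: (g["cc"], g["vram_gb"]))
print(best["model"])  # Quadro K620, despite the K20's larger, faster memory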
mmonnin

Message 58281 - Posted: 13 Jan 2022, 22:58:05 UTC - in response to Message 58280.  

While I'm here, is there any way to force the project to update my hardware configuration? It thinks my host has two Quadro K620s instead of one of those and the Tesla.


This is a problem (feature?) of BOINC, not the project. The project only knows what hardware you have based on what BOINC communicates to it.

With cards from the same vendor (NVIDIA/AMD/Intel), BOINC only lists the "best" card and then appends a number for how many total devices you have from that vendor. It will only list different models if they are from different vendors.

Within the NVIDIA vendor group, BOINC determines the "best" device by checking the compute capability first, then memory capacity, then some third metric that I can't remember right now. BOINC deems the K620 "best" because it has a higher compute capability (5.0) than the Tesla K20 (3.5), even though the K20 is arguably the better card, with more/faster memory and more cores.

All in all, this has nothing to do with the project and everything to do with BOINC's GPU ranking code.


It's often said to be the "best" card, but it's just the first one listed.
https://www.gpugrid.net/show_host_detail.php?hostid=475308

This host has a 1070 and a 1080 but just shows 2x 1070s, as the 1070 is in the first slot. Any check for the "best" card would come up with the 1080, or the 1070 Ti that used to be there with the 1070.
Ian&Steve C.

Message 58282 - Posted: 13 Jan 2022, 23:23:11 UTC - in response to Message 58281.  
Last modified: 13 Jan 2022, 23:23:48 UTC



It's often said to be the "best" card, but it's just the first one listed.
https://www.gpugrid.net/show_host_detail.php?hostid=475308

This host has a 1070 and a 1080 but just shows 2x 1070s, as the 1070 is in the first slot. Any check for the "best" card would come up with the 1080, or the 1070 Ti that used to be there with the 1070.


In your case, the metrics that BOINC is looking at are identical between the two cards (actually all three of the 1070, 1070 Ti, and 1080 have identical specs as far as BOINC ranking is concerned). All have the same amount of VRAM and the same compute capability, so the tie goes to device number, I guess. If you were to swap the 1080 for even a weaker card with a better CC (like a GTX 1650), that would get picked up instead, even when not in the first slot.
SuperNanoCat

Message 58283 - Posted: 14 Jan 2022, 2:21:35 UTC - in response to Message 58280.  

Ah, I get it. I thought it was just stuck, because it did have two K620s before. I didn't realize BOINC was just incapable of acknowledging different cards from the same vendor. Does this affect project statistics? The Milkyway@home folks are gonna have real inflated opinions of the K620 next time they check the numbers haha
abouh

Message 58284 - Posted: 14 Jan 2022, 9:41:19 UTC - in response to Message 58278.  

Interesting. I had seen this error once before locally, and I assumed it was due to a corrupted input file.

I have reviewed the task; it was eventually solved by another host, but only after multiple failed attempts with this pickle error.

Thank you for bringing it up! I will review the code to see if I can find any bug related to that.
Keith Myers
Message 58285 - Posted: 14 Jan 2022, 20:12:28 UTC - in response to Message 58284.  

This is the document I had found about fixing the BrokenProcessPool error.

https://stackoverflow.com/questions/57031253/how-to-fix-brokenprocesspool-error-for-concurrent-futures-processpoolexecutor

I was reading it and stumbled upon the word "pickle" and the adjective "picklable", and thought it funny; I had never heard that word associated with computing before.

When the latest failed task mentioned pickle in the output, that tied it right back to all the previous BrokenProcessPool errors.
klepel

Message 58286 - Posted: 14 Jan 2022, 20:25:49 UTC

@abouh: Thank you for PMing me twice!
The Experimental Python tasks (beta) now succeed miraculously on my two Linux computers (which previously produced only errors), after several restarts of the GPUGRID.net project and the latest distro update this week.
ServicEnginIC
Message 58288 - Posted: 15 Jan 2022, 22:24:17 UTC - in response to Message 58225.  

Also, I happened to catch two simultaneous Python tasks on my triple GTX 1650 GPU host.
I then urgently suspended requesting GPUGRID tasks in BOINC Manager... Why?
This host's system RAM size was 32 GB.
When the second Python task started, free system RAM decreased to 1% (!).

After upgrading system RAM from 32 GB to 64 GB on the above-mentioned host, it has successfully processed three concurrent ABOU Python GPU tasks:
e2a43-ABOU_rnd_ppod_baseline_rnn-0-1-RND6933_3 - Link: https://www.gpugrid.net/result.php?resultid=32733458
e2a21-ABOU_rnd_ppod_baseline_rnn-0-1-RND3351_3 - Link: https://www.gpugrid.net/result.php?resultid=32733477
e2a27-ABOU_rnd_ppod_baseline_rnn-0-1-RND5112_1 - Link: https://www.gpugrid.net/result.php?resultid=32733441

More details in Message #58287.
abouh

Message 58289 - Posted: 17 Jan 2022, 8:36:42 UTC

Hello everyone,

I have seen a new error in some jobs:


Traceback (most recent call last):
  File "run.py", line 444, in <module>
    main()
  File "run.py", line 62, in main
    wandb.login(key=str(args.wandb_key))
  File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/sdk/wandb_login.py", line 65, in login
    configured = _login(**kwargs)
  File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/sdk/wandb_login.py", line 268, in _login
    wlogin.configure_api_key(key)
  File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/sdk/wandb_login.py", line 154, in configure_api_key
    apikey.write_key(self._settings, key)
  File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/sdk/lib/apikey.py", line 223, in write_key
    api.clear_setting("anonymous", globally=True, persist=True)
  File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/apis/internal.py", line 75, in clear_setting
    return self.api.clear_setting(*args, **kwargs)
  File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/apis/internal.py", line 19, in api
    self._api = InternalApi(*self._api_args, **self._api_kwargs)
  File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/sdk/internal/internal_api.py", line 78, in __init__
    self._settings = Settings(
  File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/old/settings.py", line 23, in __init__
    self._global_settings.read([Settings._global_path()])
  File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/old/settings.py", line 110, in _global_path
    util.mkdir_exists_ok(config_dir)
  File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/util.py", line 793, in mkdir_exists_ok
    os.makedirs(path)
  File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/os.py", line 213, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/os.py", line 213, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/os.py", line 223, in makedirs
    mkdir(name, mode)
OSError: [Errno 30] Read-only file system: '/var/lib/boinc-client'
18:56:50 (54609): ./gpugridpy/bin/python exited; CPU time 42.541031
18:56:50 (54609): app exit status: 0x1
18:56:50 (54609): called boinc_finish(195)

It seems like the task is not allowed to create new directories inside its working directory. Just wondering if it could be some kind of configuration problem, like the "INTERNAL ERROR: cannot create temporary directory!" error for which a solution was already shared.
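
If the root cause is wandb trying to create its global config directory under a non-writable path, one possible workaround is to point wandb at a writable location before login. A minimal sketch, assuming the WANDB_CONFIG_DIR environment variable honored by recent wandb releases (the key value is a placeholder):

import os

# Redirect wandb's global config directory to a writable location (here, the
# task's working directory) before wandb is imported, so the mkdir no longer
# targets the read-only path.
os.environ["WANDB_CONFIG_DIR"] = os.path.join(os.getcwd(), ".config", "wandb")

import wandb
wandb.login(key="XXXX")  # placeholder API key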