Experimental Python tasks (beta) - task description

Richard Haselgrove

Message 58269 - Posted: 10 Jan 2022, 21:31:54 UTC - in response to Message 58268.  

You need to look at the creation time of the master WU, not of the individual tasks (which will vary, even within a WU, let alone a batch of WUs).
abouh

Message 58270 - Posted: 11 Jan 2022, 8:11:13 UTC - in response to Message 58265.  
Last modified: 11 Jan 2022, 8:11:37 UTC

I have seen this error a few times.

concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.


Do you think it could be due to a lack of resources? I think Linux starts killing processes if the system is over capacity.
Keith Myers
Message 58271 - Posted: 12 Jan 2022, 1:15:57 UTC

Might be the OOM-Killer kicking in. You would need to
grep -i kill /var/log/messages*

to check if processes were killed by the OOM-Killer.

If that is the case, you would have to configure /etc/sysctl.conf to make the system less sensitive to brief out-of-memory conditions.
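
On distros that log to the systemd journal rather than /var/log/messages, the same check can be scripted. A minimal sketch in Python, assuming journalctl is available and kernel messages are readable without root (this varies by distro):

import subprocess

# Dump kernel messages from the systemd journal and flag OOM-killer activity.
log = subprocess.run(
    ["journalctl", "-k", "--no-pager"],
    capture_output=True, text=True, check=False,
).stdout

for line in log.splitlines():
    if "oom" in line.lower() or "Killed process" in line:
        print(line)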
Richard Haselgrove

Message 58272 - Posted: 12 Jan 2022, 8:56:21 UTC

I Googled the error message, and came up with this Stack Overflow thread.

The problem seems to be specific to Python, and arises when running code that uses Python's concurrent-execution modules. There's a quote from the Python manual:

"The main module must be importable by worker subprocesses. This means that ProcessPoolExecutor will not work in the interactive interpreter. Calling Executor or Future methods from a callable submitted to a ProcessPoolExecutor will result in deadlock."

Other search results may provide further clues.
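
For context, the guard the manual is describing looks like this. A minimal sketch (not the project's actual code) of the pattern ProcessPoolExecutor requires:

from concurrent.futures import ProcessPoolExecutor

def square(x):
    return x * x

# Worker processes re-import this module, so the pool must only be created
# under the __main__ guard; without it, workers can fail at startup and the
# pending futures surface as BrokenProcessPool.
if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        print(list(pool.map(square, range(8))))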
abouh

Message 58273 - Posted: 12 Jan 2022, 15:11:50 UTC - in response to Message 58272.  
Last modified: 12 Jan 2022, 15:24:12 UTC

Thanks! Out of the possible explanations for the error listed in the thread, I suspect the OS could be killing the worker processes due to a lack of resources. It could be not enough RAM, or maybe Python raises this error when the ratio of processes to cores is too high (I have seen some machines with 4 CPUs where the task spawns 32 reinforcement learning environments).

All tasks run the same code, and on the majority of GPUGrid machines this error does not occur. Also, I have reviewed the failed jobs, and this error always occurs on the same hosts, so it is something specific to those machines. I will check whether I can find a common pattern among the hosts that get this error.
Keith Myers
Message 58274 - Posted: 12 Jan 2022, 16:46:57 UTC
Last modified: 12 Jan 2022, 16:55:04 UTC

What version of Python are the hosts that have the errors running?

Mine for example is:

python3 --version
Python 3.8.10

What kernel and OS?

Linux 5.11.0-46-generic x86_64
Ubuntu 20.04.3 LTS

I've had the errors on hosts with 32GB and 128GB. I would assume the hosts with 128GB to be in the clear, with no memory pressure.
ServicEnginIC
Message 58275 - Posted: 12 Jan 2022, 20:47:57 UTC

What version of Python are the hosts that have the errors running?

Mine for example is:

python3 --version
Python 3.8.10

Same Python version as my current one.

In case of doubt about conflicting Python versions, I published the solution that I applied to my hosts in Message #57833.
It worked for my Ubuntu 20.04.3 LTS Linux distribution, but user mmonnin replied that it didn't work for him.
mmonnin kindly published an alternative approach in his Message #57840.
mmonnin

Message 58276 - Posted: 13 Jan 2022, 2:31:57 UTC

I saw the prior post and was about to mention the same thing. Not sure which one worked, as the PC has been able to run tasks.

The recent tasks are taking a really long time:
2d13h 62.2% 1070 and 1080 GPU system
2d15h 60.4% 1070 and 1080 GPU system

2x concurrently on 3080Ti:
2d12h 61.3%
2d14h 60.4%
abouh

Message 58277 - Posted: 13 Jan 2022, 10:45:46 UTC - in response to Message 58274.  

All jobs should use the same Python version (3.8.10); I define it in the requirements.txt file of the conda environment.

Here are the specs from 3 hosts that failed with the BrokenProcessPool error:

OS:
Linux Debian Debian GNU/Linux 11 (bullseye) [5.10.0-10-amd64|libc 2.31 (Debian GLIBC 2.31-13+deb11u2)]
Linux Ubuntu Ubuntu 20.04.3 LTS [5.4.0-94-generic|libc 2.31 (Ubuntu GLIBC 2.31-0ubuntu9.3)]
Linux Linuxmint Linux Mint 20.2 [5.4.0-91-generic|libc 2.31 (Ubuntu GLIBC 2.31-0ubuntu9.2)]

Memory:
32081.92 MB
32092.04 MB
9954.41 MB

Keith Myers
Message 58278 - Posted: 13 Jan 2022, 19:55:11 UTC

I have a failed task today involving pickle.

magic_number = pickle_module.load(f, **pickle_load_args)
EOFError: Ran out of input

When I was investigating the BrokenProcessPool error, I saw posts that involved the word pickle and fixes for that error.

https://www.gpugrid.net/result.php?resultid=32733573
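
For reference, that EOFError is what pickle raises when it reaches the end of a truncated or empty file, which would fit a corrupted input file. A minimal sketch that reproduces it (the file name is just for illustration):

import pickle

# A zero-byte file stands in for a truncated or corrupted checkpoint download.
with open("checkpoint.pkl", "wb"):
    pass

try:
    with open("checkpoint.pkl", "rb") as f:
        pickle.load(f)
except EOFError as err:
    print("EOFError:", err)  # prints: EOFError: Ran out of input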
SuperNanoCat

Message 58279 - Posted: 13 Jan 2022, 21:18:41 UTC

The tasks run on my Tesla K20 for a while, but then fail when they need to use PyTorch, which requires a higher CUDA compute capability. Oh well. Guess I'll stick to the ACEMD tasks. The error output doesn't list the requirements properly, but from a little Googling, PyTorch was updated to require compute capability 3.7 within the past couple of years. The only Kepler card with 3.7 is the Tesla K80.

From this task:

[W NNPACK.cpp:79] Could not initialize NNPACK! Reason: Unsupported hardware.
/var/lib/boinc-client/slots/2/gpugridpy/lib/python3.8/site-packages/torch/cuda/__init__.py:120: UserWarning: 
    Found GPU%d %s which is of cuda capability %d.%d.
    PyTorch no longer supports this GPU because it is too old.
    The minimum cuda capability supported by this library is %d.%d.
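
For reference, a card's compute capability can be checked locally before committing to these tasks. A minimal sketch, assuming a working PyTorch install:

import torch

# Recent PyTorch binaries require compute capability >= 3.7, so a Kepler
# K20 (3.5) is rejected while a K80 (3.7) still qualifies.
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"GPU 0 compute capability: {major}.{minor}")
else:
    print("No CUDA device visible to PyTorch")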


While I'm here, is there any way to force the project to update my hardware configuration? It thinks my host has two Quadro K620s instead of one of those and the Tesla.
Ian&Steve C.

Message 58280 - Posted: 13 Jan 2022, 21:51:08 UTC - in response to Message 58279.  

While I'm here, is there any way to force the project to update my hardware configuration? It thinks my host has two Quadro K620s instead of one of those and the Tesla.


This is a problem (feature?) of BOINC, not the project. The project only knows what hardware you have based on what BOINC communicates to it.

With cards from the same vendor (NVIDIA/AMD/Intel), BOINC only lists the "best" card and then appends a number for how many total devices you have from that vendor. It will only list different models if they are from different vendors.

Within the NVIDIA vendor group, BOINC determines the "best" device by checking the compute capability first, then memory capacity, then some third metric that I can't remember right now. BOINC deems the K620 "best" because it has a higher compute capability (5.0) than the Tesla K20 (3.5), even though the K20 is arguably the better card, with more/faster memory and more cores.

All in all, this has nothing to do with the project and everything to do with BOINC's GPU ranking code.
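
Illustratively, the ranking described above behaves like a sort on a tuple of those metrics. A toy sketch of that logic (not BOINC's actual code; the third metric is omitted):

# Toy model of the GPU ranking described above (not BOINC's actual code).
gpus = [
    {"model": "Quadro K620", "cc": (5, 0), "vram_gb": 2},
    {"model": "Tesla K20",   "cc": (3, 5), "vram_gb": 5},
]

# Compute capability is compared first, then memory capacity.
best = max(gpus, key=lambda g: (g["cc"], g["vram_gb"]))
print(best["model"])  # Quadro K620, despite the K20's larger, faster memory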
mmonnin

Message 58281 - Posted: 13 Jan 2022, 22:58:05 UTC - in response to Message 58280.  

While I'm here, is there any way to force the project to update my hardware configuration? It thinks my host has two Quadro K620s instead of one of those and the Tesla.


This is a problem (feature?) of BOINC, not the project. The project only knows what hardware you have based on what BOINC communicates to it.

With cards from the same vendor (NVIDIA/AMD/Intel), BOINC only lists the "best" card and then appends a number for how many total devices you have from that vendor. It will only list different models if they are from different vendors.

Within the NVIDIA vendor group, BOINC determines the "best" device by checking the compute capability first, then memory capacity, then some third metric that I can't remember right now. BOINC deems the K620 "best" because it has a higher compute capability (5.0) than the Tesla K20 (3.5), even though the K20 is arguably the better card, with more/faster memory and more cores.

All in all, this has nothing to do with the project and everything to do with BOINC's GPU ranking code.


It's often said to be the "best" card, but it's just the first one listed.
https://www.gpugrid.net/show_host_detail.php?hostid=475308

This host has a 1070 and a 1080 but just shows 2x 1070s, as the 1070 is in the first slot. Any check for the "best" card would come up with the 1080, or the 1070 Ti that used to be there with the 1070.
Ian&Steve C.

Message 58282 - Posted: 13 Jan 2022, 23:23:11 UTC - in response to Message 58281.  
Last modified: 13 Jan 2022, 23:23:48 UTC



It's often said to be the "best" card, but it's just the first one listed.
https://www.gpugrid.net/show_host_detail.php?hostid=475308

This host has a 1070 and a 1080 but just shows 2x 1070s, as the 1070 is in the first slot. Any check for the "best" card would come up with the 1080, or the 1070 Ti that used to be there with the 1070.


In your case, the metrics that BOINC is looking at are identical between the two cards (actually all three of the 1070, 1070 Ti, and 1080 have identical specs as far as BOINC ranking is concerned). All have the same amount of VRAM and the same compute capability, so the tie goes to device number, I guess. If you were to swap the 1080 for even a weaker card with a better CC (like a GTX 1650), that would get picked up instead, even when not in the first slot.
SuperNanoCat

Message 58283 - Posted: 14 Jan 2022, 2:21:35 UTC - in response to Message 58280.  

Ah, I get it. I thought it was just stuck, because it did have two K620s before. I didn't realize BOINC was just incapable of acknowledging different cards from the same vendor. Does this affect project statistics? The Milkyway@home folks are gonna have real inflated opinions of the K620 next time they check the numbers haha
abouh

Message 58284 - Posted: 14 Jan 2022, 9:41:19 UTC - in response to Message 58278.  

Interesting. I had seen this error once before locally, and I assumed it was due to a corrupted input file.

I have reviewed the task; it was eventually solved by another host, but only after multiple failed attempts with this pickle error.

Thank you for bringing it up! I will review the code to see if I can find any bug related to that.
Keith Myers
Message 58285 - Posted: 14 Jan 2022, 20:12:28 UTC - in response to Message 58284.  

This is the document I had found about fixing the BrokenProcessPool error.

https://stackoverflow.com/questions/57031253/how-to-fix-brokenprocesspool-error-for-concurrent-futures-processpoolexecutor

I was reading it and stumbled upon the word "pickle" and the adjective "picklable", and thought it funny; I had never heard that word associated with computing before.

When the latest failed task mentioned pickle in the output, that tied it right back to all the previous BrokenProcessPool errors.
klepel

Message 58286 - Posted: 14 Jan 2022, 20:25:49 UTC

@abouh: Thank you for PMing me twice!
The Experimental Python tasks (beta) now succeed miraculously on my two Linux computers (which previously produced only errors), after several restarts of the GPUGRID.net project and the latest distro update this week.
ServicEnginIC
Message 58288 - Posted: 15 Jan 2022, 22:24:17 UTC - in response to Message 58225.  

Also, I happened to catch two simultaneous Python tasks on my triple GTX 1650 GPU host.
I then urgently suspended requesting GPUGRID tasks in BOINC Manager... Why?
This host's system RAM size was 32 GB.
When the second Python task started, free system RAM decreased to 1% (!).

After upgrading system RAM from 32 GB to 64 GB on the above-mentioned host, it has successfully processed three concurrent ABOU Python GPU tasks:
e2a43-ABOU_rnd_ppod_baseline_rnn-0-1-RND6933_3 - Link: https://www.gpugrid.net/result.php?resultid=32733458
e2a21-ABOU_rnd_ppod_baseline_rnn-0-1-RND3351_3 - Link: https://www.gpugrid.net/result.php?resultid=32733477
e2a27-ABOU_rnd_ppod_baseline_rnn-0-1-RND5112_1 - Link: https://www.gpugrid.net/result.php?resultid=32733441

More details in Message #58287.
abouh

Message 58289 - Posted: 17 Jan 2022, 8:36:42 UTC

Hello everyone,

I have seen a new error in some jobs:


Traceback (most recent call last):
  File "run.py", line 444, in <module>
    main()
  File "run.py", line 62, in main
    wandb.login(key=str(args.wandb_key))
  File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/sdk/wandb_login.py", line 65, in login
    configured = _login(**kwargs)
  File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/sdk/wandb_login.py", line 268, in _login
    wlogin.configure_api_key(key)
  File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/sdk/wandb_login.py", line 154, in configure_api_key
    apikey.write_key(self._settings, key)
  File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/sdk/lib/apikey.py", line 223, in write_key
    api.clear_setting("anonymous", globally=True, persist=True)
  File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/apis/internal.py", line 75, in clear_setting
    return self.api.clear_setting(*args, **kwargs)
  File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/apis/internal.py", line 19, in api
    self._api = InternalApi(*self._api_args, **self._api_kwargs)
  File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/sdk/internal/internal_api.py", line 78, in __init__
    self._settings = Settings(
  File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/old/settings.py", line 23, in __init__
    self._global_settings.read([Settings._global_path()])
  File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/old/settings.py", line 110, in _global_path
    util.mkdir_exists_ok(config_dir)
  File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/util.py", line 793, in mkdir_exists_ok
    os.makedirs(path)
  File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/os.py", line 213, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/os.py", line 213, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/os.py", line 223, in makedirs
    mkdir(name, mode)
OSError: [Errno 30] Read-only file system: '/var/lib/boinc-client'
18:56:50 (54609): ./gpugridpy/bin/python exited; CPU time 42.541031
18:56:50 (54609): app exit status: 0x1
18:56:50 (54609): called boinc_finish(195)

It seems like the task is not allowed to create new directories inside its working directory. Just wondering if it could be some kind of configuration problem, like the "INTERNAL ERROR: cannot create temporary directory!" error for which a solution was already shared.
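
If the root cause is wandb trying to create its global config directory under a non-writable path, one possible workaround is to point wandb at a writable location before login. A minimal sketch, assuming the WANDB_CONFIG_DIR environment variable honored by recent wandb releases (the key value is a placeholder):

import os

# Redirect wandb's global config directory to a writable location (here, the
# task's working directory) before wandb is imported, so the mkdir no longer
# targets the read-only path.
os.environ["WANDB_CONFIG_DIR"] = os.path.join(os.getcwd(), ".config", "wandb")

import wandb
wandb.login(key="XXXX")  # placeholder API key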