Experimental Python tasks (beta) - task description

Message boards : News : Experimental Python tasks (beta) - task description
Message board moderation

To post messages, you must log in.

Previous · 1 . . . 3 · 4 · 5 · 6 · 7 · 8 · 9 . . . 50 · Next

AuthorMessage
Richard Haselgrove

Send message
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 326,008
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58221 - Posted: 29 Dec 2021, 10:38:21 UTC

Some new (to me) errors in https://www.gpugrid.net/result.php?resultid=32732017

"During handling of the above exception, another exception occurred:"

"ValueError: probabilities are not non-negative"
ID: 58221 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Ian&Steve C.

Send message
Joined: 21 Feb 20
Posts: 1114
Credit: 40,838,348,595
RAC: 4,765,598
Level
Trp
Scientific publications
wat
Message 58222 - Posted: 29 Dec 2021, 16:57:53 UTC

it seems checkpointing still isnt working correctly.

despite BOINC "claiming" that it's checkpointing X number of seconds ago, stopping BOINC and re-starting shows that it's not restarting from the checkpoint.

The task I currently have in progress was ~20% completed. stopped BOINC, and restarted and it retained the time (elapsed and CPU time) but progress reset to 10%.
ID: 58222 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Keith Myers
Avatar

Send message
Joined: 13 Dec 17
Posts: 1416
Credit: 9,119,446,190
RAC: 678,713
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58223 - Posted: 29 Dec 2021, 17:40:37 UTC - in response to Message 58222.  

I saw the same issue on my last task which was checkpointed past 20% yet reset to 10% upon restart.
ID: 58223 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile ServicEnginIC
Avatar

Send message
Joined: 24 Sep 10
Posts: 592
Credit: 11,972,186,510
RAC: 1,102,898
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58225 - Posted: 29 Dec 2021, 23:05:12 UTC

- GPU dedicated RAM usage seems to have been reduced, but I don't know if enough for running at 4 GB RAM GPUs (?)

Two of my hosts with 4 GB dedicated RAM GPUs have succeeded their latest Python GPU tasks so far.
If it is planned to be kept GPU RAM requirements this way, it widens the app to a quite greater number of hosts.

Also I happened to catch two simultaneous Python tasks at my triple GTX 1650 GPU host.
I then urgently suspended requesting for Gpugrid tasks at BOINC Manager... Why?
This host system RAM size is 32 GB.
When the second Python task started, free system RAM decreased to 1% (!).
I grossly estimate that environment for each Python task takes about 16 GB system RAM.
I guess that an eventual third concurrent task might have crashed itself, or even crashed the whole three Python tasks due to lack of system RAM.
I was watching to Psensor readings when the first of the two Python tasks finished, and then the free system memory drastically increased again from 1% to 38%.

I also took a nvidia-smi screenshot, where can be seen that each Python task was respectively running at GPU 0 and GPU 1, while GPU 2 was processing a PrimeGrid CUDA GPU task.

ID: 58225 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Ian&Steve C.

Send message
Joined: 21 Feb 20
Posts: 1114
Credit: 40,838,348,595
RAC: 4,765,598
Level
Trp
Scientific publications
wat
Message 58226 - Posted: 29 Dec 2021, 23:24:23 UTC - in response to Message 58225.  

now that I've upgraded my single 3080Ti host from a 5950X w/16GB ram to a 7402P/128GB ram, I want to see if I can even run 2x GPUGRID tasks on the same GPU. I see about 5GB VRAM use on the tasks I've processed so far. so with so much extra system ram and 12GB VRAM, it might work lol.
ID: 58226 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
abouh

Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58227 - Posted: 30 Dec 2021, 14:40:09 UTC - in response to Message 58222.  

Regarding the checkpointing problem, the approach I follow is to check the progress file (if exists) at the beginning of the python script and then continue the job from there.


I have tested locally to stop the task and execute again the python script and it continues from the same point where it stopped. So the script seems correct.


However, I think that right after setting up the conda environment, the progress is set automatically to 10% before running my script, so I am guessing this is what is causing the problem. I have modified my code not to rely only on the progress file, since it might be overwritten after every conda setup to be at 10%.
ID: 58227 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
mmonnin

Send message
Joined: 2 Jul 16
Posts: 338
Credit: 7,987,341,558
RAC: 197,587
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58228 - Posted: 30 Dec 2021, 22:35:23 UTC - in response to Message 58226.  

now that I've upgraded my single 3080Ti host from a 5950X w/16GB ram to a 7402P/128GB ram, I want to see if I can even run 2x GPUGRID tasks on the same GPU. I see about 5GB VRAM use on the tasks I've processed so far. so with so much extra system ram and 12GB VRAM, it might work lol.


The last two tasks on my system with a 3080Ti ran concurrently and completed successfully.
https://www.gpugrid.net/results.php?hostid=477247
ID: 58228 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Richard Haselgrove

Send message
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 326,008
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58248 - Posted: 6 Jan 2022, 9:01:57 UTC

Errors in e6a12-ABOU_rnd_ppod_15-0-1-RND6167_2 (created today):

"wandb: Waiting for W&B process to finish, PID 334655... (failed 1). Press ctrl-c to abort syncing."

"ValueError: demo dir contains more than ´total_buffer_demo_capacity´"
ID: 58248 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
abouh

Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58249 - Posted: 6 Jan 2022, 10:01:11 UTC
Last modified: 6 Jan 2022, 10:20:07 UTC

One user mentioned that could not solve the error

INTERNAL ERROR: cannot create temporary directory!


This is the configuration he is using:

### Editing /etc/systemd/system/boinc-client.service.d/override.conf
### Anything between here and the comment below will become the new
contents of the file

PrivateTmp=true

### Lines below this comment will be discarded

### /lib/systemd/system/boinc-client.service
# [Unit]
# Description=Berkeley Open Infrastructure Network Computing Client
# Documentation=man:boinc(1)
# After=network-online.target
#
# [Service]
# Type=simple
# ProtectHome=true
# ProtectSystem=strict
# ProtectControlGroups=true
# ReadWritePaths=-/var/lib/boinc -/etc/boinc-client
# Nice=10
# User=boinc
# WorkingDirectory=/var/lib/boinc
# ExecStart=/usr/bin/boinc
# ExecStop=/usr/bin/boinccmd --quit
# ExecReload=/usr/bin/boinccmd --read_cc_config
# ExecStopPost=/bin/rm -f lockfile
# IOSchedulingClass=idle
# # The following options prevent setuid root as they imply
NoNewPrivileges=true
# # Since Atlas requires setuid root, they break Atlas
# # In order to improve security, if you're not using Atlas,
# # Add these options to the [Service] section of an override file using
# # sudo systemctl edit boinc-client.service
# #NoNewPrivileges=true
# #ProtectKernelModules=true
# #ProtectKernelTunables=true
# #RestrictRealtime=true
# #RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
# #RestrictNamespaces=true
# #PrivateUsers=true
# #CapabilityBoundingSet=
# #MemoryDenyWriteExecute=true
# #PrivateTmp=true #Block X11 idle detection
#
# [Install]
# WantedBy=multi-user.target


I was just wondering if there is any possible reason why it should not work
ID: 58249 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Richard Haselgrove

Send message
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 326,008
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58250 - Posted: 6 Jan 2022, 12:01:13 UTC - in response to Message 58249.  

I am using a systemd file generated from a PPA maintained by Gianfranco Costamagna. It's automatically generated from Debian sources, and kept up-to-date with new releases automatically. It's currently supplying a BOINC suite labelled v7.16.17

The full, unmodified, contents of the file are

[Unit]
Description=Berkeley Open Infrastructure Network Computing Client
Documentation=man:boinc(1)
After=network-online.target

[Service]
Type=simple
ProtectHome=true
PrivateTmp=true
ProtectSystem=strict
ProtectControlGroups=true
ReadWritePaths=-/var/lib/boinc -/etc/boinc-client
Nice=10
User=boinc
WorkingDirectory=/var/lib/boinc
ExecStart=/usr/bin/boinc
ExecStop=/usr/bin/boinccmd --quit
ExecReload=/usr/bin/boinccmd --read_cc_config
ExecStopPost=/bin/rm -f lockfile
IOSchedulingClass=idle
# The following options prevent setuid root as they imply NoNewPrivileges=true
# Since Atlas requires setuid root, they break Atlas
# In order to improve security, if you're not using Atlas,
# Add these options to the [Service] section of an override file using
# sudo systemctl edit boinc-client.service
#NoNewPrivileges=true
#ProtectKernelModules=true
#ProtectKernelTunables=true
#RestrictRealtime=true
#RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
#RestrictNamespaces=true
#PrivateUsers=true
#CapabilityBoundingSet=
#MemoryDenyWriteExecute=true

[Install]
WantedBy=multi-user.target

That has the 'PrivateTmp=true' line in the [Service] section of the file, rather than isolated at the top as in your example. I don't know Linux well enough to know how critical the positioning is.

We had long discussions in the BOINC development community a couple of years ago, when it was discovered that the 'PrivateTmp=true' setting blocked access to BOINC's X-server based idle detection. The default setting was reversed for a while, until it was discovered that the reverse 'PrivateTmp=false' setting caused the problem creating temporary directories that we observe here. I think that the default setting was reverted to true, but the discussion moved into the darker reaches of the Linux package maintenance managers, and the BOINC development cycle became somewhat disjointed. I'm no longer fully up-to-date with the state of play.
ID: 58250 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Richard Haselgrove

Send message
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 326,008
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58251 - Posted: 6 Jan 2022, 12:08:17 UTC - in response to Message 58249.  

A simpler answer might be

### Lines below this comment will be discarded

so the file as posted won't do anything at all - in particular, it won't run BOINC!
ID: 58251 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
abouh

Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58253 - Posted: 7 Jan 2022, 10:27:24 UTC - in response to Message 58248.  

Thank you! I reviewed the code and detected the source of the error. I am currently working to solve it.

I will do local tests and then send a small batch of short tasks to GPUGrid to test the fixed version of the scripts before sending the next big batch.
ID: 58253 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Richard Haselgrove

Send message
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 326,008
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58254 - Posted: 7 Jan 2022, 18:13:15 UTC

Everybody seems to be getting the same error in today's tasks:

"AttributeError: 'PPODBuffer' object has no attribute 'num_loaded_agent_demos'"
ID: 58254 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Keith Myers
Avatar

Send message
Joined: 13 Dec 17
Posts: 1416
Credit: 9,119,446,190
RAC: 678,713
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58255 - Posted: 7 Jan 2022, 19:48:11 UTC

I believe I got one of the test, fixed tasks this morning based on the short crunch time and valid report.

No sign of the previous error.

https://www.gpugrid.net/result.php?resultid=32732671
ID: 58255 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Richard Haselgrove

Send message
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 326,008
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58256 - Posted: 7 Jan 2022, 19:56:15 UTC - in response to Message 58255.  

Yes, your workunit was "created 7 Jan 2022 | 17:50:07 UTC" - that's a couple of hours after the ones I saw.
ID: 58256 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
abouh

Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58263 - Posted: 10 Jan 2022, 10:26:02 UTC
Last modified: 10 Jan 2022, 10:28:12 UTC

I just sent a batch that seems to fail with

File "/var/lib/boinc-client/slots/30/python_dependencies/ppod_buffer_v2.py", line 325, in before_gradients
if self.iter % self.save_demos_every == 0:
TypeError: unsupported operand type(s) for %: 'int' and 'NoneType'


For some reason it did not crash locally. "Fortunately" it will crash after only a few minutes, and it is easy to solve. I am very sorry for the inconvenience...

I will send also a corrected batch with tasks of normal duration. I have tried to reduce the GPU memory requirements a bit in the new tasks.
ID: 58263 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Richard Haselgrove

Send message
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 326,008
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58264 - Posted: 10 Jan 2022, 10:38:35 UTC - in response to Message 58263.  
Last modified: 10 Jan 2022, 10:58:56 UTC

Got one of those - failed as you describe.

Also has the error message "AttributeError: 'GWorker' object has no attribute 'batches'".

Edit - had a couple more of the broken ones, but one created at 10:40:34 UTC seems to be running OK. We'll know later!
ID: 58264 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
FritzB

Send message
Joined: 7 Apr 15
Posts: 17
Credit: 2,978,057,945
RAC: 55,974
Level
Phe
Scientific publications
wat
Message 58265 - Posted: 10 Jan 2022, 14:09:55 UTC - in response to Message 58264.  

I got 20 bad WU's today on this host: https://www.gpugrid.net/results.php?hostid=520456

Stderr Ausgabe

<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
13:25:53 (6392): wrapper (7.7.26016): starting
13:25:53 (6392): wrapper (7.7.26016): starting
13:25:53 (6392): wrapper: running /usr/bin/flock (/var/lib/boinc-client/projects/www.gpugrid.net/miniconda.lock -c "/bin/bash ./miniconda-installer.sh -b -u -p /var/lib/boinc-client/projects/www.gpugrid.net/miniconda &&
                      /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/conda install -m -y -p gpugridpy --file requirements.txt ")

  0%|          | 0/45 [00:00<?, ?it/s]
                                      
concurrent.futures.process._RemoteTraceback: 
'''
Traceback (most recent call last):
  File "concurrent/futures/process.py", line 368, in _queue_management_worker
  File "multiprocessing/connection.py", line 251, in recv
TypeError: __init__() missing 1 required positional argument: 'msg'
'''

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "entry_point.py", line 69, in <module>
  File "concurrent/futures/process.py", line 484, in _chain_from_iterable_of_lists
  File "concurrent/futures/_base.py", line 611, in result_iterator
  File "concurrent/futures/_base.py", line 439, in result
  File "concurrent/futures/_base.py", line 388, in __get_result
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
[6689] Failed to execute script entry_point
13:25:58 (6392): /usr/bin/flock exited; CPU time 3.906269
13:25:58 (6392): app exit status: 0x1
13:25:58 (6392): called boinc_finish(195)

</stderr_txt>
]]>
ID: 58265 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Keith Myers
Avatar

Send message
Joined: 13 Dec 17
Posts: 1416
Credit: 9,119,446,190
RAC: 678,713
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58266 - Posted: 10 Jan 2022, 16:33:22 UTC - in response to Message 58264.  

I errored out 12 tasks created from 10:09:55 to 10:40:06.

Those all have the batch error.

But have 3 tasks created from 10:41:01 to 11:01:56 still running normally
ID: 58266 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Keith Myers
Avatar

Send message
Joined: 13 Dec 17
Posts: 1416
Credit: 9,119,446,190
RAC: 678,713
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58268 - Posted: 10 Jan 2022, 19:39:01 UTC

And two of those were the batch error resends that now have failed.

Only 1 still processing that I assume is of the fixed variety. 8 hours elapsed currently.

https://www.gpugrid.net/result.php?resultid=32732855
ID: 58268 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Previous · 1 . . . 3 · 4 · 5 · 6 · 7 · 8 · 9 . . . 50 · Next

Message boards : News : Experimental Python tasks (beta) - task description

©2025 Universitat Pompeu Fabra