Experimental Python tasks (beta) - task description

Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 351
Message 58541 - Posted: 21 Mar 2022, 12:39:20 UTC
Last modified: 21 Mar 2022, 13:03:38 UTC

Got a new one - the other Linux machine, but very similar. Looks like you've put some debug text into stderr.txt:

12:28:16 (482274): wrapper (7.7.26016): starting
12:28:17 (482274): wrapper (7.7.26016): starting
12:28:17 (482274): wrapper: running /bin/tar (xf pythongpu_x86_64-pc-linux-gnu__cuda1131.tar.bz2)
12:31:39 (482274): /bin/tar exited; CPU time 192.149659
12:31:39 (482274): wrapper: running bin/python (run.py)
Starting!!
Finished imports!!
Sanity check, make sure that logging matches execution
Check if this is a restarted job
Define Train Vector of Envs
Define RL training algorithm
Look for available model checkpoint in log_dir - node failure case
Define RL Policy
Define rollouts storage
Define scheme

but nothing new has been added in the last five minutes. Showing 50% progress, no GPU activity. I'll give it another ten minutes or so, then try stop-start and abort if nothing new.

Edit - no, no progress. Same result on two further tasks. All the quoted lines are written within about 5 seconds, then nothing. I'll let the machine do something else while I go shopping...

Tasks for host 132158
abouh

Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58549 - Posted: 21 Mar 2022, 14:51:37 UTC - in response to Message 58541.  
Last modified: 21 Mar 2022, 15:45:30 UTC

Ok so I have seen 3 main errors in the last batches:


1. The one reported by Bedrich Hajek ("Disk usage limit exceeded"). We have now increased the amount of disk space allotted by BOINC to each task and I believe, based on the last batch I sent, that this error is gone now.


2. The "older" Windows machines do not have the tar.exe application and therefore cannot unpack the conda environment. I know Richard did some research into that, but he had to download 7-Zip. Ideally I would like the app to be self-contained. Maybe we can send the 7-Zip program with the app; I will have to research whether that is possible.

3. The job getting stuck at 50%. I did add some debug messages in the last batches and I believe I know more or less where in the code the script gets stuck. I am still looking into it. I will also check recent results to see if there is any pattern when this error happens. Note that there is no checkpoint because it is a short task that gets stuck: since the training is not progressing, new checkpoints are not getting saved.
abouh

Message 58550 - Posted: 24 Mar 2022, 10:05:39 UTC - in response to Message 58549.  
Last modified: 24 Mar 2022, 10:09:32 UTC

We have updated to a new app version for Windows that solves the following error:

application C:\Windows\System32\tar.exe missing


Now we send the 7z.exe (576 KB) file with the app, which allows the other files to be unpacked without relying on the host machine having tar.exe (which is only in Windows 11 and the latest builds of Windows 10).

I just sent a small batch of short tasks this morning to test and so far it seems to work.
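
The choice between a system tar and a bundled 7z.exe can be sketched as below. This is only an illustration, not the project's actual wrapper configuration: the function name `unpack_command`, the archive name and the output directory are all hypothetical. The tar form mirrors the `xf` call visible in the Linux log quoted earlier; the 7-Zip form uses its standard `x`, `-o` and `-y` switches.

```python
import shutil
from typing import List, Optional


def unpack_command(archive: str, out_dir: str, tar_path: Optional[str],
                   sevenzip_path: str = "7z.exe") -> List[str]:
    """Build the command line that unpacks `archive` into `out_dir`.

    `tar_path` is whatever shutil.which("tar") returned on the host, or
    None when the host has no tar. `sevenzip_path` is assumed to point at
    a 7z.exe shipped alongside the app (hypothetical location).
    """
    if tar_path:
        # Same style of call as in the wrapper log: tar xf <archive>
        return [tar_path, "xf", archive, "-C", out_dir]
    # 7-Zip: x = extract with full paths, -o<dir> = output dir, -y = assume yes.
    # Note: a compressed tarball (.tar.bz2, .tar.gz) needs a second 7z pass
    # on the resulting .tar file.
    return [sevenzip_path, "x", archive, f"-o{out_dir}", "-y"]


if __name__ == "__main__":
    # "env.tar" and "env" are placeholder names for illustration only.
    print(unpack_command("env.tar", "env", shutil.which("tar")))
```

Passing the result of `shutil.which("tar")` in from outside keeps the decision easy to test on a machine that does have tar.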
Richard Haselgrove

Message 58551 - Posted: 24 Mar 2022, 10:14:38 UTC

Task 32868822 (Linux Mint GPU beta)

Still seems to be stalling at 50%, after "Define scheme". bin/python run.py is using 100% CPU, plus over 30 threads from multiprocessing.spawn with too little CPU usage to monitor (shows as 0.0%). No GPU app listed by nvidia-smi.
abouh

Message 58552 - Posted: 24 Mar 2022, 10:24:01 UTC - in response to Message 58551.  
Last modified: 24 Mar 2022, 10:26:18 UTC

Do you know by chance if this same machine works fine with PythonGPU tasks even if it fails in the PythonGPUBeta ones?
Richard Haselgrove

Message 58553 - Posted: 24 Mar 2022, 11:01:02 UTC - in response to Message 58552.  
Last modified: 24 Mar 2022, 11:26:25 UTC

Yes, it does. Most recent was:

e1a5-ABOU_rnd_ppod_avoid_cnn13-0-1-RND6436_3

Three failed before me, but mine was OK.

Edit: In relation to that successful task, BOINC only returns the last 64 KB of stderr.txt - so that result starts in the middle of the file (that's the bit that's most likely to contain debug information after a crash). I'll try to capture the initial part of the file next time I run one of those tasks, for reference.
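
For anyone who wants to see locally what the server will end up showing, a minimal sketch of "keep only the last 64 KB" follows. It assumes nothing about BOINC's own implementation; `tail_bytes` is a hypothetical helper and the 64 KB limit is simply the figure mentioned above.

```python
import os

STDERR_LIMIT = 64 * 1024  # the 64 KB figure mentioned above


def tail_bytes(path: str, limit: int = STDERR_LIMIT) -> bytes:
    """Return at most the last `limit` bytes of the file at `path`."""
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        # Skip everything except the final `limit` bytes (or nothing,
        # if the file is shorter than the limit).
        fh.seek(max(0, size - limit))
        return fh.read()
```

Copying the whole stderr.txt out of the slot directory before the task finishes is the only way to keep the part in front of that window.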
abouh

Message 58561 - Posted: 25 Mar 2022, 8:33:38 UTC
Last modified: 25 Mar 2022, 8:34:20 UTC

I have also changed the approach a bit.

I have just sent a batch of short tasks much more similar to those in PythonGPU. If these work fine, I will slowly introduce changes to see what the problem was.
Richard Haselgrove

Message 58562 - Posted: 25 Mar 2022, 9:03:09 UTC - in response to Message 58561.  

I've grabbed one. Will run within the hour.
abouh

Message 58563 - Posted: 25 Mar 2022, 9:20:46 UTC - in response to Message 58561.  
Last modified: 25 Mar 2022, 9:27:47 UTC

I sent 2 batches,

ABOU_rnd_ppod_avoid_cnn_testing

and

ABOU_rnd_ppod_avoid_cnn_testing2

Unfortunately the first batch will crash. I detected one bug already, which I have fixed in the second one. Seems like you got at least one in the second batch (e1a18-ABOU_rnd_ppod_avoid_cnn_testing2). Running it will give us the info we need.

On the bright side, the fix with 7z.exe seems to work on all machines so far.
Richard Haselgrove

Message 58564 - Posted: 25 Mar 2022, 9:52:52 UTC - in response to Message 58563.  

Yes, I got the testing2. It's been running for about 23 minutes now, but I'm seeing the same as yesterday - nothing written to stderr.txt since:

09:29:18 (51456): wrapper (7.7.26016): starting
09:29:18 (51456): wrapper (7.7.26016): starting
09:29:18 (51456): wrapper: running /bin/tar (xf pythongpu_x86_64-pc-linux-gnu__cuda1131.tar.bz2)
09:32:39 (51456): /bin/tar exited; CPU time 192.380796
09:32:39 (51456): wrapper: running bin/python (run.py)
Starting!!
Finished imports!!
Define rollouts storage
Define scheme

and machine usage is shown in the full-screen screenshot at https://i.imgur.com/Ly9Aabd.png

I've preserved the control information for that task, and I'll try to re-run it interactively in terminal later today - you can sometimes catch additional error messages that way.
abouh

Message 58565 - Posted: 25 Mar 2022, 10:06:50 UTC - in response to Message 58564.  

OK, thanks a lot. Maybe then it is not the Python script but some of the dependencies.
Richard Haselgrove

Message 58566 - Posted: 25 Mar 2022, 10:27:08 UTC - in response to Message 58565.  

OK, I've aborted that task to get my GPU back. I'll see what I can pick out of the preserved entrails, and let you know.
Richard Haselgrove

Message 58568 - Posted: 25 Mar 2022, 18:13:58 UTC

Sorry, ebcak. I copied all the files, but when I came to work on them, several turned out to be BOINC softlinks back to the project directory, where the original file had been deleted. So the fine detail had gone.

Memo to self - don't try to operate dangerous machinery too early in the morning.
mmonnin

Joined: 2 Jul 16
Posts: 338
Credit: 7,987,341,558
RAC: 213
Message 58569 - Posted: 27 Mar 2022, 15:49:31 UTC

The past several tasks have gotten stuck at 50% for me as well. Today one has made it past that, to 57.7% now, in 8 hours. 1-2% GPU utilization on a 3070 Ti. 2.5 CPU threads per BOINCTasks. 3063 MB memory per nvidia-smi and 4.4 GB per BOINCTasks.
abouh

Message 58571 - Posted: 28 Mar 2022, 16:09:03 UTC
Last modified: 28 Mar 2022, 17:15:17 UTC

I updated the app. I tested it locally and it works fine on Linux.

I sent a batch of test jobs (ABOU_rnd_ppod_avoid_cnn_testing3), which I have seen executed successfully on at least one Linux machine so far.

One way to check whether the job is actually progressing is to look for a directory called "monitor_logs/train" in the BOINC directory where the job is being executed. If logs are being written to the files inside this folder, the task is progressing.
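
That check can be automated with a few lines of Python. The sketch below is only one possible way to do it: the helper names, the slot path in the usage line and the 10-minute idle threshold are assumptions, not anything the app itself defines. It looks at file modification times under the log directory and reports whether anything was written recently.

```python
import time
from pathlib import Path
from typing import Optional


def seconds_since_last_write(log_dir: Path) -> Optional[float]:
    """Age in seconds of the newest file under `log_dir`.

    Returns None if the directory is missing or contains no files.
    """
    if not log_dir.is_dir():
        return None
    mtimes = [p.stat().st_mtime for p in log_dir.rglob("*") if p.is_file()]
    if not mtimes:
        return None
    return time.time() - max(mtimes)


def is_progressing(log_dir: Path, max_idle: float = 600.0) -> bool:
    """True if some file under `log_dir` was modified in the last `max_idle` seconds.

    The 600 s default is an arbitrary choice for illustration.
    """
    age = seconds_since_last_write(log_dir)
    return age is not None and age <= max_idle


if __name__ == "__main__":
    # Hypothetical slot path; substitute the slot directory of the running task.
    print(is_progressing(Path("slots/0/monitor_logs/train")))
```

A task whose logs have not been touched for longer than the threshold is a candidate for the "stuck at 50%" case discussed in this thread, though a slow host could also trip a threshold that is set too low.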
Richard Haselgrove

Message 58572 - Posted: 28 Mar 2022, 17:20:54 UTC - in response to Message 58571.  

Got a couple on one of my Windows 7 machines. The first - task 32875836 - completed successfully, the second is running now.
abouh

Message 58573 - Posted: 28 Mar 2022, 18:01:06 UTC - in response to Message 58572.  

Nice to hear it! Let's see what happens on Linux... so weird if it only works on some machines and gets stuck on others...
Richard Haselgrove

Message 58574 - Posted: 28 Mar 2022, 18:55:31 UTC - in response to Message 58573.  

Nice to hear it! Let's see what happens on Linux... so weird if it only works on some machines and gets stuck on others...

Worse is to follow, I'm afraid. Task 32875988 started immediately after the first one (same machine, but a different slot directory), but seems to have got stuck.

I now seem to have two separate slot directories:

Slot 0, where the original task ran. It has 31 items (3 folders, 28 files) at the top level, but the folder properties say the total (presumably expanding the site-packages) is 49 folders, 257 files, 3.62 GB

Slot 5, allocated to the new task. It has 93 items at the top level (12 folders, including monitor_logs, and the rest files). This one looks the same as the first one did, while it was actively running the first task. This one has 14 files in the train directory - I think the first only had 4. This slot also has a stderr file, which ends with multiple repetitions of

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "D:\BOINCdata\slots\5\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "D:\BOINCdata\slots\5\lib\multiprocessing\spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "D:\BOINCdata\slots\5\lib\site-packages\pytorchrl\agent\env\__init__.py", line 1, in <module>
    from pytorchrl.agent.env.vec_env import VecEnv
  File "D:\BOINCdata\slots\5\lib\site-packages\pytorchrl\agent\env\vec_env.py", line 1, in <module>
    import torch
  File "D:\BOINCdata\slots\5\lib\site-packages\torch\__init__.py", line 126, in <module>
    raise err
OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "D:\BOINCdata\slots\5\lib\site-packages\torch\lib\shm.dll" or one of its dependencies.

I'm going to try variations on a theme of
- clear the old slot manually
- pause and restart the task
- stop and restart BOINC
- stop and restart Windows

I'll report back what works and what doesn't.
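
When a stderr file ends in many repetitions of the same traceback, a quick tally of the final exception lines makes it easier to see how many worker processes failed and whether they all failed the same way. The following is a rough sketch; `summarize_exceptions` is a hypothetical helper, and it only recognises exception names ending in "Error" or "Exception".

```python
import re
from collections import Counter

# Final line of a Python traceback, e.g. "OSError: [WinError 1455] ..."
_EXC_LINE = re.compile(r"^([\w.]+(?:Error|Exception)): ")


def summarize_exceptions(stderr_text: str) -> Counter:
    """Count traceback-ending lines in `stderr_text`, keyed by exception type."""
    counts: Counter = Counter()
    for line in stderr_text.splitlines():
        match = _EXC_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Called on the contents of the slot's stderr file, this would return something like a single `OSError` entry with the number of repetitions, if every spawned process hit the same paging-file error.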
Richard Haselgrove

Message 58575 - Posted: 28 Mar 2022, 19:36:32 UTC

Well, that was interesting. The files in slot 0 couldn't be deleted - they were locked by a running app 'python' - which is presumably why BOINC hadn't cleaned the folder when the first task finished.

So I stopped the second task, and used Windows Task Manager to see what was running. Sure enough, there was still a Python image, and I still couldn't delete the old files. So I force-stopped that python image, and then I could - and did - delete them.

I restarted the second task, but nothing much happened. The wrapper app posted in stderr that it was restarting python, but nothing else.

So then I restarted BOINC, and all hell broke loose. In quick succession, I got



Then Windows crashed a browser tab and two Einstein@Home tasks on the other GPU.

When I'd closed the Python app from the Windows error box, the BOINC task closed cleanly, uploaded some files, and reported a successful finish. It even validated!

Things all seem to be running quietly now, so I think I'll leave this machine alone for a while and think. At the moment, the take-home theory is that the whole sequence was triggered by the failure of the python app to close at the end of the first task's run. That might be the next thing to look at.
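
If the theory holds, one general defence is for the parent script to make sure none of its child processes outlive it. The sketch below shows the usual terminate-then-kill pattern. It is not taken from the app: the traceback earlier in the thread shows the real workers come from multiprocessing, whereas this example uses `subprocess.Popen` objects as stand-ins, and `shutdown_workers` and the 5-second grace period are assumptions.

```python
import subprocess
from typing import Iterable


def shutdown_workers(workers: Iterable[subprocess.Popen], grace: float = 5.0) -> None:
    """Stop every worker that is still running.

    Each live worker is first asked to terminate; any that has not exited
    after `grace` seconds is killed outright.
    """
    workers = list(workers)
    for worker in workers:
        if worker.poll() is None:  # still running
            worker.terminate()
    for worker in workers:
        try:
            worker.wait(timeout=grace)
        except subprocess.TimeoutExpired:
            worker.kill()
            worker.wait()
```

Calling something like this from a `finally:` block around the main training loop would run the cleanup on normal exit and on most errors, which is the situation where a leftover python image could keep the slot files locked.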
STARBASEn

Joined: 17 Feb 09
Posts: 91
Credit: 1,603,303,394
RAC: 0
Message 58576 - Posted: 28 Mar 2022, 20:38:27 UTC

Well this beta WU was a weird one:

https://www.gpugrid.net/workunit.php?wuid=27211744

It ran to 50% completion and hung there for 3.5 days, so I aborted it. BOINC properties showed it running in slot 10, except slot 10 was empty. top (Fedora 35) showed no activity from any GPUGrid WU. Some wrapper or something must have been kept alive and running in the background when the WU quit, because the ET counter was incrementing time normally.

©2025 Universitat Pompeu Fabra