Experimental Python tasks (beta) - task description

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 58384 - Posted: 27 Feb 2022, 15:28:01 UTC - in response to Message 58383.  

I am running a _4 now. After 18 minutes it is OK, but the CPU usage is still trending down to a single core after starting out high.

It stopped making progress after running for a day and reaching 26% complete, so I aborted it. I will wait until they fix things before jumping in again. But my results were different than the others, so maybe it will do them some good.

abouh
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58417 - Posted: 3 Mar 2022, 16:29:10 UTC - in response to Message 58382.  

Hello everyone! I am sorry for the late reply.

Now that most of my jobs seem to complete successfully, we decided to remove the "beta" flag from the app. I would like to thank you all for your help during the past months to reach this point. Obviously I will try to solve any further problems that are detected. In the future we will try to extend it to Windows, but we are not there yet.

Regarding the app requirements, from now on they will be similar to those in my last batches. In reinforcement learning, in general there is no way around the mixed CPU/GPU usage. Most reinforcement learning environments are powered by CPU, but the machine learning algorithms to teach agents to solve the environments use GPU.

RAMIS was experimenting with a different application. But the idea is that another beta app will be created for this purpose.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 351
Message 58464 - Posted: 8 Mar 2022, 18:06:03 UTC
Last modified: 8 Mar 2022, 18:53:42 UTC

Is this a record?



Initial runtime estimate for:

e1a4-ABOU_pythonGPU_beta2_test-0-1-RND8371_5
Python apps for GPU hosts beta v1.00 (cuda1131) for Windows

Task 32766826

Time to lie back and enjoy the popcorn for ... 11½ years ??!!

Edit - 36 minutes to download 2.52 GB, less than a minute to crash. Ah well, back to the drawing board.

08/03/2022 17:57:22 | GPUGRID | Started download of windows_x86_64__cuda1131.tar.bz2.e9a2e4346c92bfb71fae05c8321e9325
08/03/2022 18:35:03 | GPUGRID | Finished download of windows_x86_64__cuda1131.tar.bz2.e9a2e4346c92bfb71fae05c8321e9325
08/03/2022 18:35:26 | GPUGRID | Starting task e1a4-ABOU_pythonGPU_beta2_test-0-1-RND8371_5
08/03/2022 18:36:21 | GPUGRID | [sched_op] Reason: Unrecoverable error for task e1a4-ABOU_pythonGPU_beta2_test-0-1-RND8371_5
08/03/2022 18:36:21 | GPUGRID | Computation for task e1a4-ABOU_pythonGPU_beta2_test-0-1-RND8371_5 finished


Edit 2 - "application C:\Windows\System32\tar.exe missing". I can deal with that.

Download from https://sourceforge.net/projects/gnuwin32/files/tar/

NO - that wasn't what it said it was. Looking again.

Richard Haselgrove
Message 58465 - Posted: 8 Mar 2022, 19:37:16 UTC

No, this isn't working. Apparently, tar.exe is included in Windows 10 - but I'm still running Windows 7/64, and a copy from a W10 machine won't run ("Not a valid Win32 application"). Giving up for tonight - I've got too much waiting to juggle. I'll try again with a clearer head tomorrow.

mmonnin
Joined: 2 Jul 16
Posts: 338
Credit: 7,987,341,558
RAC: 213
Message 58466 - Posted: 8 Mar 2022, 23:51:15 UTC
Last modified: 8 Mar 2022, 23:55:06 UTC

Yeah, the estimates must have been astronomical, as I am at over 2 months "Time left" at 3/4 completion on 2 tasks.

11:37 hr:min 79.3% 61d2h
10:04 hr:min 73.9% 77d2h

Reaching 74.8% on the 2nd task dropped it down to 74d10h. Around 215d initial ETA?
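
For scale, here is a rough sanity check on those two tasks. It is only a naive linear extrapolation (elapsed time divided by fraction done), not how BOINC derives its "Time left" figure, and the function names are made up for the illustration.

```python
def naive_total_minutes(elapsed_minutes: float, fraction_done: float) -> float:
    """Extrapolate total runtime, assuming progress is linear in time."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_minutes / fraction_done


def fmt(minutes: float) -> str:
    """Format a number of minutes as e.g. '3h05m'."""
    hours, mins = divmod(round(minutes), 60)
    return f"{hours}h{mins:02d}m"


# The two tasks quoted above: (elapsed hours, elapsed minutes), percent done
tasks = [((11, 37), 79.3), ((10, 4), 73.9)]

for (hours, mins), percent in tasks:
    elapsed = hours * 60 + mins
    total = naive_total_minutes(elapsed, percent / 100)
    print(f"{percent}% after {fmt(elapsed)}: "
          f"about {fmt(total)} total, {fmt(total - elapsed)} left")
```

On that basis both tasks would have a few hours left rather than months, so the problem looks like the estimate rather than the tasks themselves.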

abouh
Message 58467 - Posted: 9 Mar 2022, 8:54:44 UTC
Last modified: 9 Mar 2022, 9:22:43 UTC

No need to go back to the drawing board, in principle. Here is what is happening:

1. The PythonGPU app should be stable now and is only available for Linux (as it has been until now). Jobs are being sent there and should work normally.

2. A new app, called PythonGPUbeta, has been deployed for both Linux and Windows. The idea is now to test the Python jobs on Windows, so this app should be the main source of bugs to solve from here on... Ultimately the idea is to have a common PythonGPU app for both OSes.

3. While PythonGPUbeta accepts Linux and Windows, I expect most errors to come from the Windows part.

Please let me know if any of the above is not correct.

abouh
Message 58468 - Posted: 9 Mar 2022, 9:02:10 UTC - in response to Message 58464.  
Last modified: 9 Mar 2022, 9:28:51 UTC

In this new version of the app, we send the whole conda environment in a compressed file ONLY ONCE, and unpack it on the machine. The conda environment is what weighs around 2.5 GB (depending on whether the machine has cuda10 or cuda11). However, as long as the environment remains the same there will be no need to re-download it for every job. This is how the acemd app works.

We are testing which compression format is best for our purpose. We tested first with a tar.bz2 file. On Linux there was no problem decompressing it.

For Windows, I tested locally on a Windows 10 laptop. I could decompress it successfully with tar.exe.

I am not sure what is happening with the estimates, but the estimation is obviously wrong. The test jobs should download the conda environment only in the first job, decompress it and finally run a short Python program using CPU and GPU. Are the Linux estimates also so exaggerated?

abouh
Message 58469 - Posted: 9 Mar 2022, 9:07:09 UTC
Last modified: 9 Mar 2022, 9:32:49 UTC

One problem we are facing, as Richard mentioned, is that before W10 there is no tar.exe.

Also, I have seen some jobs with the following error:

tar.exe: Error opening archive: Can't initialize filter; unable to run program "bzip2 -d"


In theory tar.exe is able to handle bzip2 files. We suspect it could be a problem with the PATH environment variable (which we will test). Also, tar.gz could be a more compatible format for Windows.
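
For illustration only, this is how the same unpacking could be done without any external tar.exe or bzip2, using Python's standard tarfile module. It is a sketch, not what the app does: it assumes a Python interpreter is already available on the host before the environment is unpacked, which may well not be true here since Python itself ships inside the conda environment. The function name unpack_environment is hypothetical.

```python
import tarfile
from pathlib import Path


def unpack_environment(archive: str, dest: str = ".") -> int:
    """Unpack a .tar.gz or .tar.bz2 archive into dest; return the member count."""
    dest_path = Path(dest)
    dest_path.mkdir(parents=True, exist_ok=True)

    # Newer Python versions offer extraction filters that refuse unsafe
    # members (absolute paths, "..", etc.); use one when it is available.
    extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}

    count = 0
    # Mode "r:*" lets tarfile detect gzip or bzip2 compression by itself,
    # so no separate "bzip2 -d" helper program has to be found on PATH.
    with tarfile.open(archive, mode="r:*") as tar:
        for member in tar:
            tar.extract(member, path=dest_path, **extra)
            count += 1
    return count
```

Called as unpack_environment("windows_x86_64__cuda1131.tar.bz2"), this would unpack into the current directory. The same call works unchanged on a tar.gz file, so it does not depend on which compression format is finally chosen.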

Richard Haselgrove
Message 58470 - Posted: 9 Mar 2022, 9:35:49 UTC

Don't worry, it's only my own personal drawing board that I'm going back to!

Microsoft has form in this area. I remember buying a commercial copy of WinZip for use with Windows 3 - it arrived by post, on a single floppy disk. Later, they bought the company and incorporated it into Windows. Microsoft tend to do this very late in the day - hence my problems yesterday. I'll have a proper look round later today, and see if I can find a version which handles the bzip2 problem too.

abouh
Message 58471 - Posted: 9 Mar 2022, 9:52:48 UTC - in response to Message 58470.  

Thank you very much! I will send a small batch of test jobs as soon as I can to check whether the bzip2 error on Windows 10 is caused by an erroneous PATH variable. The next step will be trying tar.gz, as mentioned.

mmonnin
Message 58472 - Posted: 9 Mar 2022, 11:45:54 UTC

How about some checkpoints? I had a Python task that was nearly completed when an ACEMD4 task downloaded with like 8 billion days ETA. It interrupted the Python task. 14 hours of work and it went back to 10%. I only have a 0.05 day work queue on that client, so the Python task was at least 95% complete.

abouh
Message 58473 - Posted: 9 Mar 2022, 14:20:01 UTC - in response to Message 58472.  
Last modified: 9 Mar 2022, 14:43:41 UTC

Was it a PythonGPU task for Linux, mmonnin? I have checked your recent jobs, and they seem to have been successful.

PythonGPU task checkpointing was working before. It was discussed previously in the forum. I tested it locally back then and it worked fine. Has checkpointing failed for anyone else? Please let me know if so.


I have sent a small batch of tasks for PythonGPUbeta, to test if some errors on Windows are now solved. Will keep iterating in small batches for the beta app.

Richard Haselgrove
Message 58474 - Posted: 9 Mar 2022, 15:29:20 UTC - in response to Message 58473.  
Last modified: 9 Mar 2022, 15:42:54 UTC

I have a python task for Linux running, recently started.

It's reported that it's checkpointing properly:

CPU time 00:33:10
CPU time since checkpoint 00:01:33
Elapsed time 00:33:27

but that isn't the acid test: the question is whether it can read back the checkpoint data files when restarted.

I'll pause it after a checkpoint, let the machine finish the last 20 minutes of the task it booted aside, and see what happens on restart. Sometimes BOINC takes a little while to update progress after a pause - you have to watch it, not just take the first figure you see.

Results will be reported in task 32773760 overnight, but I'll post here before that.

Edit - looks good so far: restart.chk, progress, run.log all present with a good timestamp.

abouh
Message 58475 - Posted: 9 Mar 2022, 15:37:59 UTC - in response to Message 58474.  
Last modified: 9 Mar 2022, 15:40:32 UTC

Perfect, thanks! It can happen that progress takes a little while to update after a pause.

Progress of PythonGPU tasks is defined by a target number of interactions between the AI agent and the environment in which it is trained, generally 25M interactions per job. I generate checkpoints regularly and create a progress file that tracks how many of these interactions have already been executed.

After resuming, the script looks for these progress and checkpoint files to continue counting from there.
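
A minimal sketch of that resume logic, for illustration. It is not the actual task script: the file name progress.json, its JSON layout and all function names are assumptions, and the training step is replaced by a simple counter.

```python
import json
import os

TARGET_INTERACTIONS = 25_000_000  # "generally 25M interactions per job"
PROGRESS_FILE = "progress.json"   # hypothetical file name and format


def load_progress(path: str = PROGRESS_FILE) -> int:
    """Return the number of interactions already done, or 0 on a fresh start."""
    try:
        with open(path) as f:
            return int(json.load(f)["interactions"])
    except (FileNotFoundError, ValueError, KeyError, TypeError):
        return 0  # missing or unreadable progress file: start from scratch


def save_progress(done: int, path: str = PROGRESS_FILE) -> None:
    """Write the progress file via a temp file so a crash cannot truncate it."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"interactions": done}, f)
    os.replace(tmp, path)


def fraction_done(done: int, target: int = TARGET_INTERACTIONS) -> float:
    """Fraction to report as task progress."""
    return min(done / target, 1.0)


def run(steps_per_checkpoint: int, max_checkpoints: int,
        path: str = PROGRESS_FILE, target: int = TARGET_INTERACTIONS) -> int:
    """Continue counting from the last checkpoint; return interactions done."""
    done = load_progress(path)
    for _ in range(max_checkpoints):
        if done >= target:
            break
        # Stand-in for the real agent/environment interactions.
        done = min(done + steps_per_checkpoint, target)
        save_progress(done, path)
    return done
```

Calling run() a second time picks up from whatever the previous call wrote, which is the behaviour a suspend and restart should show. A real task would also have to save and reload the model state alongside the counter.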

However, Richard, note that the result you linked is not PythonGPU but ACEMD 4. I am not sure how these do the checkpointing.

Richard Haselgrove
Message 58476 - Posted: 9 Mar 2022, 16:10:04 UTC - in response to Message 58475.  

However, Richard note that the result you linked is not PythonGPU but ACEMD 4. I am not sure how these do the checkpointing.

Well, it was the only one I had in a suitable state for testing.
And it's a good thing we checked. It appears that ACEMD4 in its current state (v1.03) does NOT handle checkpointing correctly. I suspended it manually at just after 10% complete: on restart, it wound back to 1% and started counting again from there. It's reached 2.980% as I type - four increments of 0.495.

The run.log file (which we don't normally get a chance to see) has the ominous line

# WARNING: removed an old file: output.xtc

after a second set of startup details. Perhaps you could pass a message to the appropriate team?

abouh
Message 58477 - Posted: 9 Mar 2022, 16:18:28 UTC - in response to Message 58476.  

I will. Thanks a lot for the feedback.

mmonnin
Message 58478 - Posted: 9 Mar 2022, 23:18:59 UTC - in response to Message 58475.  

Perfect thanks! That it takes a little while to update progress after a pause, can happen.

The pythonGPU tasks progress is defined by a target number of interactions between the AI agent and the environment in which it is trained. Generally 25M interactions per job. I generate checkpoints regularly and create a progress file that tracks how many of these interactions have been already executed.

After resuming, the script looks for these progress and checkpoint files to continue counting from there.

However, Richard note that the result you linked is not PythonGPU but ACEMD 4. I am not sure how these do the checkpointing.


Yes, it was Linux.
The % complete I saw was 100%, then a bit later 10% per BOINCTasks.
Looking at the history on that PC it finished in 14:14 run time, just 11 minutes after the ACEMD4 tasks so it looks like it resumed properly. Thanks for checking.

Richard Haselgrove
Message 58479 - Posted: 10 Mar 2022, 10:41:18 UTC

OK, back on topic. Another of my Windows 7 machines has been allocated a genuine ABOU_pythonGPU_beta2 task (task 32779476), and I was able to suspend it before it even tried to run. I've been able to copy all the downloaded files into a sandbox to play with.

The first task is:

    <task>
        <application>C:\Windows\System32\tar.exe</application>
        <command_line>-xvf windows_x86_64__cuda1131.tar.gz</command_line>
        <setenv>PATH=C:\Windows\system32;C:\Windows</setenv>
    </task>

You don't need both a path statement and a hard-coded executable location. The hard-coded location may fail on a machine with non-standard drive assignments.
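
To illustrate the difference, this sketch resolves the unpacker through PATH instead of assuming a fixed drive and folder. It says nothing about how the BOINC wrapper itself resolves the application entry; find_unpacker and its default candidate list are made up for the example.

```python
import shutil


def find_unpacker(candidates=("tar", "7z"), path=None):
    """Return (name, full_path) for the first candidate found on PATH, else None.

    path is an optional search path string; None means the PATH environment
    variable of the current process.
    """
    for name in candidates:
        found = shutil.which(name, path=path)
        if found:
            return name, found
    return None


result = find_unpacker()
if result is None:
    print("no tar or 7z found on PATH")
else:
    print(f"would use {result[0]} at {result[1]}")
```

On Windows, shutil.which also tries the extensions listed in PATHEXT, so "tar" can match tar.exe. Note that 7-Zip may be installed without its folder being on PATH, so a real fallback might still need to look in the usual install locations.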

It will certainly fail on this machine, because I still haven't been able to locate a viable tar.exe for Windows 7 (the Windows 10 executable won't run under Windows 7 - at least, I haven't found a way to make it run yet).

I (and many other volunteers here) do have a freeware application called 7-Zip, and I've seen a suggestion that this may be able to handle the required decompression. I'll test that offline first, and if it works, I'll try to modify the job.xml file to use that instead. That's not a complete solution, of course, but it might give a pointer to the way forward.

Richard Haselgrove
Message 58480 - Posted: 10 Mar 2022, 10:54:35 UTC

OK, that works in principle. The 2.48 GB gz download decompresses to a single 4.91 GB tar file, and that in turn unpacks to 13,449 files in 632 folders. 7-Zip can handle both operations.

ToDo: go find the command line I saw yesterday for doing that in a script.
Check the disk usage limits to ensure all that can happen in the slot directory.

Richard Haselgrove
Message 58481 - Posted: 10 Mar 2022, 11:23:07 UTC

And it's worth a try. I'm going to split that task into two:

    <task>
        <application>"C:\Program Files\7-Zip\7z"</application>
        <command_line>x windows_x86_64__cuda1131.tar.gz</command_line>
        <setenv>PATH=C:\Windows\system32;C:\Windows</setenv>
    </task>

    <task>
        <application>"C:\Program Files\7-Zip\7z"</application>
        <command_line>x windows_x86_64__cuda1131.tar</command_line>
        <setenv>PATH=C:\Windows\system32;C:\Windows</setenv>
    </task>

I could have piped them, but - baby steps!

I'm going to need to increase the disk allowance: 10 (decimal) GB isn't enough.
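
A back-of-envelope version of that disk budget, using the sizes reported earlier in the thread (2.48 GB for the gz download, 4.91 GB for the tar). Two things here are assumptions rather than measurements: the unpacked files are taken to occupy about as much as the tar, and all sizes are treated as being in the same units as the 10 GB limit. Whether the downloaded .gz counts against the limit at all depends on where BOINC keeps it, which this sketch does not try to answer.

```python
GZ_GB = 2.48           # compressed download, as reported
TAR_GB = 4.91          # intermediate .tar, as reported
UNPACKED_GB = TAR_GB   # assumption: unpacked tree is about the size of the tar
LIMIT_GB = 10.0        # disk allowance mentioned above


def peak_usage_gb(keep_gz: bool = True, via_intermediate_tar: bool = True) -> float:
    """Estimate peak disk use at the moment unpacking finishes."""
    usage = UNPACKED_GB
    if keep_gz:
        usage += GZ_GB
    if via_intermediate_tar:
        usage += TAR_GB
    return usage


scenarios = {
    "two steps, keep everything": peak_usage_gb(True, True),
    "two steps, delete .gz before step 2": peak_usage_gb(False, True),
    "piped, no intermediate .tar": peak_usage_gb(True, False),
}

for label, gb in scenarios.items():
    verdict = "over" if gb > LIMIT_GB else "under"
    print(f"{label}: {gb:.2f} GB ({verdict} the {LIMIT_GB:.0f} GB limit)")
```

Under those assumptions the two-step route peaks at roughly 12.3 GB, while piping the two operations together would avoid the intermediate tar and stay nearer 7.4 GB.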

©2025 Universitat Pompeu Fabra