Message boards :
News :
Experimental Python tasks (beta) - task description
| Author | Message |
|---|---|
|
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0
|
I am running a _4 now. After 18 minutes it is OK, but the CPU usage is still trending down to a single core after starting out high. It stopped making progress after running for a day and reaching 26% complete, so I aborted it. I will wait until they fix things before jumping in again. But my results were different than the others, so maybe it will do them some good. |
|
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0
|
Hello everyone! I am sorry for the late reply. Now that most of my jobs seem to complete successfully, we decided to remove the "beta" flag from the app. I would like to thank you all for your help during the past months to reach this point. Obviously I will try to solve any further problem detected. In the future we will try to extend it for Windows, but we are not there yet. Regarding the app requirements, from now on they will be similar to those in my last batches. In reinforcement learning, in general there is no way around the mixed CPU/GPU usage. Most reinforcement learning environments are powered by CPU, but the machine learning algorithms to teach agents to solve the environments use GPU. RAMIS was experimenting with a different application. But the idea is that another beta app will be created for this purpose. |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
Is this a record? Initial runtime estimate for:
e1a4-ABOU_pythonGPU_beta2_test-0-1-RND8371_5
Python apps for GPU hosts beta v1.00 (cuda1131) for Windows
Task 32766826
Time to lie back and enjoy the popcorn for ... 11½ years ??!!
Edit - 36 minutes to download 2.52 GB, less than a minute to crash. Ah well, back to the drawing board.
08/03/2022 17:57:22 | GPUGRID | Started download of windows_x86_64__cuda1131.tar.bz2.e9a2e4346c92bfb71fae05c8321e9325
08/03/2022 18:35:03 | GPUGRID | Finished download of windows_x86_64__cuda1131.tar.bz2.e9a2e4346c92bfb71fae05c8321e9325
08/03/2022 18:35:26 | GPUGRID | Starting task e1a4-ABOU_pythonGPU_beta2_test-0-1-RND8371_5
08/03/2022 18:36:21 | GPUGRID | [sched_op] Reason: Unrecoverable error for task e1a4-ABOU_pythonGPU_beta2_test-0-1-RND8371_5
08/03/2022 18:36:21 | GPUGRID | Computation for task e1a4-ABOU_pythonGPU_beta2_test-0-1-RND8371_5 finished
Edit 2 - "application C:\Windows\System32\tar.exe missing". I can deal with that. Download from https://sourceforge.net/projects/gnuwin32/files/tar/
NO - that wasn't what it said it was. Looking again. |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
No, this isn't working. Apparently, tar.exe is included in Windows 10 - but I'm still running Windows 7/64, and a copy from a W10 machine won't run ("Not a valid Win32 application"). Giving up for tonight - I've got too much waiting to juggle. I'll try again with a clearer head tomorrow. |
|
Joined: 2 Jul 16 Posts: 338 Credit: 7,987,341,558 RAC: 213
|
Yeah, the estimates must be astronomical, as I am at over 2 months of time left at 3/4 completion on 2 tasks:
11:37 hr:min, 79.3%, 61d2h left
10:04 hr:min, 73.9%, 77d2h left
Reaching 74.8% on the 2nd task dropped it down to 74d10h. Around 215d initial ETA? |
|
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0
|
No need to go back to the drawing board, in principle. Here is what is happening:
1. The PythonGPU app should be stable now and is only available for Linux (as it has been until now). Jobs are being sent there and should work normally.
2. A new app, called PythonGPUbeta, has been deployed for both Linux and Windows. The idea is now to test the Python jobs on Windows, so this app should be the main source of bugs to solve from here on. Ultimately the idea is to have a common PythonGPU app for both operating systems.
3. While PythonGPUbeta accepts Linux and Windows, I expect most errors to come from the Windows part.
Please let me know if any of the above is not correct. |
|
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0
|
In this new version of the app, we send the whole conda environment in a compressed file ONLY ONCE and unpack it on the machine. The conda environment weighs around 2.5 GB (depending on whether the machine has cuda10 or cuda11). However, as long as the environment remains the same, there will be no need to re-download it for every job. This is how the acemd app works. We are testing which compression format is best for our purpose. We tested first with a tar.bz2 file. On Linux there was no problem decompressing it. For Windows, I tested locally on a Windows 10 laptop and could decompress it successfully with tar.exe. I am not sure what is happening with the estimates, but they are obviously wrong. The test jobs should download the conda environment only in the first job, decompress it and finally run a short Python program using CPU and GPU. Are the Linux estimates also so exaggerated? |
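For readers who want to picture the "download and unpack only once" idea, here is a minimal Python sketch. It is not the project's actual code: the function name `ensure_environment`, the `.unpacked` marker file, and the use of the archive file name as the environment's identity are all assumptions made for illustration.

```python
import tarfile
from pathlib import Path


def ensure_environment(archive: Path, target_dir: Path) -> bool:
    """Unpack `archive` into `target_dir` unless an earlier job already did.

    Returns True if the archive was unpacked now, False if it was reused.
    The marker file and its name are hypothetical, not taken from the real app.
    """
    marker = target_dir / ".unpacked"
    if marker.exists() and marker.read_text() == archive.name:
        return False  # same environment as last time: nothing to do

    target_dir.mkdir(parents=True, exist_ok=True)
    # "r:*" lets tarfile detect gzip or bzip2 compression on its own.
    with tarfile.open(archive, "r:*") as tar:
        tar.extractall(target_dir)
    # Write the marker last, so an interrupted unpack is retried next time.
    marker.write_text(archive.name)
    return True
```

Writing the marker only after extraction finishes means a job that crashes halfway through unpacking will start the unpack again rather than run against a partial environment.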
|
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0
|
One problem we are facing is, as Richard mentioned, that before Windows 10 there is no tar.exe. Also, I have seen some jobs with the following error:
tar.exe: Error opening archive: Can't initialize filter; unable to run program "bzip2 -d"
In theory tar.exe is able to handle bzip2 files. We suspect it could be a problem with the PATH environment variable (which we will test). Also, tar.gz could be a more compatible format for Windows. |
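The error text suggests that tar.exe tried to launch an external bzip2 program and could not find one. One way to check the PATH theory is to ask which helper programs resolve on a given search path. The sketch below uses Python's standard `shutil.which`; the helper names `find_helper` and `diagnose` are made up for this example, and it assumes a Python interpreter is available on the host, which may not be true before the conda environment is unpacked.

```python
import shutil
from typing import Dict, Optional


def find_helper(program: str, path: str) -> Optional[str]:
    """Return the full location of `program` on the search path `path`, or None."""
    return shutil.which(program, path=path)


def diagnose(path: str) -> Dict[str, Optional[str]]:
    """Report which of the unpacking tools can be found on `path`."""
    return {name: find_helper(name, path) for name in ("tar", "bzip2", "gzip")}
```

Calling `diagnose(r"C:\Windows\system32;C:\Windows")` on an affected Windows 10 host would show whether a bzip2 executable is reachable with the PATH set in the job description; a `None` entry for `bzip2` would be consistent with the error above.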
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
Don't worry, it's only my own personal drawing board that I'm going back to! Microsoft has form in this area. I remember buying a commercial copy of WinZip for use with Windows 3 - it arrived by post, on a single floppy disk. Later, they bought the company and incorporated it into Windows. Microsoft tend to do this very late in the day - hence my problems yesterday. I'll have a proper look round later today, and see if I can find a version which handles the bzip2 problem too. |
|
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0
|
Thank you very much! I will send a small batch of test jobs as soon as I can to check if for windows 10 the bzip2 error is caused by an erroneous PATH variable. And the next step will be trying with tar.gz as mentioned. |
|
Joined: 2 Jul 16 Posts: 338 Credit: 7,987,341,558 RAC: 213
|
How about some checkpoints? I had a Python task that was nearly complete when an ACEMD4 task downloaded with something like an 8 billion day ETA. It interrupted the Python task: 14 hours of work, and it went back to 10%. I only have a 0.05 day work queue on that client, so the Python task was at least 95% complete. |
|
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0
|
Was it a PythonGPU task for Linux, mmonnin? I have checked your recent jobs, and they seem to have been successful. PythonGPU task checkpointing was working before; it was discussed previously in the forum. I tested it locally back then and it worked fine. Has checkpointing failed for anyone else? Please let me know if so.
I have sent a small batch of tasks for PythonGPUbeta, to test whether some errors on Windows are now solved. I will keep iterating in small batches for the beta app. |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
I have a python task for Linux running, recently started. It's reported that it's checkpointing properly:
CPU time 00:33:10
CPU time since checkpoint 00:01:33
Elapsed time 00:33:27
but that isn't the acid test: the question is whether it can read back the checkpoint data files when restarted. I'll pause it after a checkpoint, let the machine finish the last 20 minutes of the task it booted aside, and see what happens on restart. Sometimes BOINC takes a little while to update progress after a pause - you have to watch it, not just take the first figure you see. Results will be reported in task 32773760 overnight, but I'll post here before that.
Edit - looks good so far: restart.chk, progress, run.log all present with a good timestamp. |
|
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0
|
Perfect, thanks! It can happen that progress takes a little while to update after a pause. The progress of PythonGPU tasks is defined by a target number of interactions between the AI agent and the environment in which it is trained, generally 25M interactions per job. I generate checkpoints regularly and create a progress file that tracks how many of these interactions have already been executed. After resuming, the script looks for these progress and checkpoint files to continue counting from there. However, Richard, note that the result you linked is not PythonGPU but ACEMD 4. I am not sure how these do the checkpointing. |
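The progress-file scheme described in this post could look roughly like the Python sketch below. It is only an illustration under assumptions: the JSON file layout, the function names, and the `batch`, `checkpoint_every` and `stop_after` parameters are hypothetical, and a simple counter stands in for the real agent/environment interactions and model checkpoint.

```python
import json
from pathlib import Path
from typing import Optional

TARGET_INTERACTIONS = 25_000_000  # per-job target mentioned in the post


def load_progress(progress_file: Path) -> int:
    """Return the number of interactions recorded so far (0 on a fresh start)."""
    if progress_file.exists():
        return json.loads(progress_file.read_text())["interactions"]
    return 0


def save_progress(progress_file: Path, interactions: int) -> None:
    """Write the counter to a temporary file, then rename it into place."""
    tmp = progress_file.with_suffix(".tmp")
    tmp.write_text(json.dumps({"interactions": interactions}))
    tmp.replace(progress_file)


def run(progress_file: Path, target: int = TARGET_INTERACTIONS,
        batch: int = 10_000, checkpoint_every: int = 100,
        stop_after: Optional[int] = None) -> float:
    """Resume from the progress file and return the fraction done on exit.

    `stop_after` simulates the task being suspended after that many batches.
    """
    done = load_progress(progress_file)
    steps = 0
    while done < target:
        if stop_after is not None and steps >= stop_after:
            break
        done = min(done + batch, target)  # stand-in for real interactions
        steps += 1
        if steps % checkpoint_every == 0 or done == target:
            save_progress(progress_file, done)
    return done / target
```

With a scheme like this, work done since the last checkpoint is lost on a restart, so the reported progress can step back slightly after a pause before it starts climbing again.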
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
"However, Richard note that the result you linked is not PythonGPU but ACEMD 4. I am not sure how these do the checkpointing."
Well, it was the only one I had in a suitable state for testing. And it's a good thing we checked. It appears that ACEMD4 in its current state (v1.03) does NOT handle checkpointing correctly. I suspended it manually at just after 10% complete: on restart, it wound back to 1% and started counting again from there. It's reached 2.980% as I type - four increments of 0.495. The run.log file (which we don't normally get a chance to see) has the ominous line
# WARNING: removed an old file: output.xtc
after a second set of startup details. Perhaps you could pass a message to the appropriate team? |
|
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0
|
I will. Thanks a lot for the feedback. |
|
Joined: 2 Jul 16 Posts: 338 Credit: 7,987,341,558 RAC: 213
|
"Perfect thanks! That it takes a little while to update progress after a pause, can happen."
Yes, it was Linux. The % complete I saw was 100%, then a bit later 10%, per BOINCTasks. Looking at the history on that PC, it finished in 14:14 run time, just 11 minutes after the ACEMD4 tasks, so it looks like it resumed properly. Thanks for checking. |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
OK, back on topic. Another of my Windows 7 machines has been allocated a genuine ABOU_pythonGPU_beta2 task (task 32779476), and I was able to suspend it before it even tried to run. I've been able to copy all the downloaded files into a sandbox to play with. The first task is:
<task>
<application>C:\Windows\System32\tar.exe</application>
<command_line>-xvf windows_x86_64__cuda1131.tar.gz</command_line>
<setenv>PATH=C:\Windows\system32;C:\Windows</setenv>
</task>
You don't need both a path statement and a hard-coded executable location. That may fail on a machine with non-standard drive assignments. It will certainly fail on this machine, because I still haven't been able to locate a viable tar.exe for Windows 7 (the Windows 10 executable won't run under Windows 7 - at least, I haven't found a way to make it run yet). I (and many other volunteers here) do have a freeware application called 7-Zip, and I've seen a suggestion that this may be able to handle the required decompression. I'll test that offline first, and if it works, I'll try to modify the job.xml file to use that instead. That's not a complete solution, of course, but it might give a pointer to the way forward. |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
OK, that works in principle. The 2.48 GB gz download decompresses to a single 4.91 GB tar file, and that in turn unpacks to 13,449 files in 632 folders. 7-Zip can handle both operations. ToDo: go find the command line I saw yesterday for doing that in a script. Check the disk usage limits to ensure all that can happen in the slot directory. |
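A rough way to reason about the disk question for a two-step unpack is sketched below in Python. The 2.48 GB and 4.91 GB figures come from the post above; the size of the unpacked tree is not given there, so treating it as roughly equal to the tar file is an assumption, as are the function and parameter names.

```python
def peak_disk_gb(archive_gb: float, tar_gb: float, unpacked_gb: float,
                 keep_archive: bool = True) -> float:
    """Peak space needed when unpacking .tar.gz -> .tar -> files in one directory.

    All sizes must use the same unit. `keep_archive` says whether the
    compressed download is still on disk during the second step.
    """
    # Step 1: the compressed archive and the .tar exist side by side.
    step1 = archive_gb + tar_gb
    # Step 2: the .tar and the unpacked tree coexist (plus the archive if kept).
    step2 = tar_gb + unpacked_gb + (archive_gb if keep_archive else 0.0)
    return max(step1, step2)


# Figures from the post; the unpacked size is ASSUMED to be about the tar size.
needed = peak_disk_gb(2.48, 4.91, 4.91)
```

Under that assumption the peak is about 12.3 GB if nothing is deleted along the way, and about 9.8 GB if the compressed archive is removed before the second step, so the slot directory's disk limit needs checking against whichever case applies.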
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
And it's worth a try. I'm going to split that task into two:
<task>
<application>"C:\Program Files\7-Zip\7z"</application>
<command_line>x windows_x86_64__cuda1131.tar.gz</command_line>
<setenv>PATH=C:\Windows\system32;C:\Windows</setenv>
</task>
<task>
<application>"C:\Program Files\7-Zip\7z"</application>
<command_line>x windows_x86_64__cuda1131.tar</command_line>
<setenv>PATH=C:\Windows\system32;C:\Windows</setenv>
</task>
I could have piped them, but - baby steps! I'm going to need to increase the disk allowance: 10 (decimal) GB isn't enough. |
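To illustrate what "piping" the two steps buys, here is a Python sketch that decompresses and unpacks in a single pass using the standard `tarfile` module in stream mode. This is a swapped-in technique for illustration, not the 7-Zip command line from the post, and it assumes a Python interpreter is available on the host; the function name `extract_streaming` is made up.

```python
import tarfile
from pathlib import Path


def extract_streaming(archive: Path, target_dir: Path) -> int:
    """Unpack a .tar.gz in one pass and return the number of members extracted.

    Mode "r|gz" reads the gzip stream sequentially, so the intermediate
    .tar file is never written to disk.
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    with tarfile.open(archive, mode="r|gz") as tar:
        for member in tar:
            # In stream mode each member must be extracted as it is read.
            tar.extract(member, target_dir)
            count += 1
    return count
```

Because the intermediate tar file never exists, the peak disk use is the compressed archive plus the unpacked tree, which is the main practical difference from the two-task version above.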
©2025 Universitat Pompeu Fabra