Experimental Python tasks (beta) - task description

Ian&Steve C.

Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 5,269
Level
Trp
Message 58607 - Posted: 6 Apr 2022, 0:41:00 UTC - in response to Message 58606.  

I've had no problems with their CUDA ACEMD3 app. It's been very stable across many data sets. All of the issues raised in this thread relate to the Python app, which is still in testing/beta, so problems are to be expected.

CUDA outperforms OpenCL. Even with identical code (as much as it can be), there is always the added overhead of needing to compile the OpenCL code at runtime, whereas CUDA runs natively on Nvidia. Most projects run OpenCL because it lets them port the code to different devices more easily, expanding their user base at the expense of some performance overhead.

There have been many problems with the 500+ series drivers, though. If you still have issues with the older drivers, then something else is wrong with your setup. If you didn't totally purge the old drivers with DDU from Safe Mode and reinstall from a fresh Nvidia package, that's a good first step. Sometimes driver corruption can linger across many driver removals and upgrades, and it needs to be removed more forcefully.
ID: 58607 · Rating: 0
Erich56

Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Level
Trp
Message 58608 - Posted: 6 Apr 2022, 5:23:39 UTC - in response to Message 58602.  

bcavnaugh wrote:
... For now I am waiting for new 3 & 4 on two of my hosts; it is a real bummer that our hosts have to sit for days on end without getting any tasks.

you say it, indeed :-(
Obviously, ACEMD has very low priority at GPUGRID these days :-(
ID: 58608 · Rating: 0
Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 731
Level
Tyr
Message 58609 - Posted: 7 Apr 2022, 19:23:48 UTC

Beta is still having issues with establishing the correct Python environment.

Threw away around 27 tasks today with errors because of:

TypeError: object of type 'int' has no len()

ID: 58609 · Rating: 0
abouh

Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58613 - Posted: 8 Apr 2022, 9:51:42 UTC - in response to Message 58609.  

thanks, this is solved now. A new batch is running without this issue.
ID: 58613 · Rating: 0
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 351
Level
Trp
Message 58614 - Posted: 8 Apr 2022, 14:43:17 UTC

There are still a few old tasks around. I got the _9 (and hopefully final) issue of WU 27184379 from 19 March. It registered the 51% mark but hasn't moved on in over 3 hours: I'm afraid it's going the same way as all previous attempts.
ID: 58614 · Rating: 0
Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 731
Level
Tyr
Message 58615 - Posted: 8 Apr 2022, 17:04:36 UTC

Yes, I am still getting the bad work unit resends.

Too bad they couldn't be purged before hitting the _9 timeout.
ID: 58615 · Rating: 0
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 351
Level
Trp
Message 58616 - Posted: 11 Apr 2022, 10:26:59 UTC

New tasks today.

But: "ModuleNotFoundError: No module named 'yaml'"
ID: 58616 · Rating: 0
Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 731
Level
Tyr
Message 58617 - Posted: 11 Apr 2022, 16:01:42 UTC

Same here today.
ID: 58617 · Rating: 0
Azmodes

Joined: 7 Jan 17
Posts: 34
Credit: 1,371,429,518
RAC: 0
Level
Met
Message 58618 - Posted: 11 Apr 2022, 19:04:45 UTC

Same.
ID: 58618 · Rating: 0
abouh

Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58619 - Posted: 12 Apr 2022, 8:59:28 UTC
Last modified: 12 Apr 2022, 9:01:22 UTC

Thanks for the feedback. I will look into it today.

In which OS?
ID: 58619 · Rating: 0
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 351
Level
Trp
Message 58621 - Posted: 12 Apr 2022, 9:08:46 UTC - in response to Message 58619.  

In which OS?

These were "Python apps for GPU hosts v4.01 (cuda1121)", which is Linux only.
ID: 58621 · Rating: 0
abouh

Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58622 - Posted: 12 Apr 2022, 9:35:11 UTC - in response to Message 58621.  
Last modified: 12 Apr 2022, 9:36:30 UTC

Right, I just saw it while browsing through the failed jobs. It seems the problem is in the PythonGPU app, not in PythonGPUBeta.

This is what I think happened: since in PythonGPU the conda environment is created every time, it could be that the dependencies of one or more required packages have changed recently. As a result, the yaml package was not installed in the environments and was missing during execution.
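
A missing dependency like this could be caught before the task script starts. The following is only a minimal sketch of such a pre-flight check, not the project's actual code: the list of required modules is hypothetical (only `yaml`, which is provided by the PyYAML package, is named in the error reports above).

```python
import importlib.util

# Hypothetical list of top-level modules the task script needs.
# Only 'yaml' is known from the error message; extend as required.
REQUIRED_MODULES = ["yaml"]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

Failing early with a clear message would make this kind of environment drift easier to spot than a ModuleNotFoundError partway through a task.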

This is one more reason to switch to the new approach (currently beta). The conda environment is created, packed and sent to the volunteer machine when executing the first job. There, the environment is simply unpacked, and there is no need to send a new one unless some fix is required.
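
As a rough illustration of the unpack step on the volunteer side, here is a minimal sketch using only the Python standard library. It assumes the packed environment is an ordinary compressed tar archive (which is what a tool such as conda-pack produces; the posts do not name the tool), and the file and directory names in the usage note are hypothetical.

```python
import tarfile
from pathlib import Path


def unpack_environment(archive, target):
    """Unpack a packed environment archive into target; return the entry count."""
    target = Path(target)
    target.mkdir(parents=True, exist_ok=True)
    # "r:*" lets tarfile detect the compression (gzip, bzip2, ...) by itself.
    with tarfile.open(archive, "r:*") as tar:
        members = tar.getmembers()
        tar.extractall(target)
    return len(members)
```

A call might look like `unpack_environment("pythongpu_env.tar.gz", "env")`. Because the archive already contains every installed package, nothing is resolved or downloaded at run time, which is what avoids the dependency drift described above.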

We will move the PythonGPUBeta app to PythonGPU. PythonGPUBeta is now quite stable, and its approach avoids this kind of problem. I expect we can do it today, but I will post to confirm it.
ID: 58622 · Rating: 0
abouh

Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58624 - Posted: 12 Apr 2022, 14:41:23 UTC
Last modified: 12 Apr 2022, 15:49:53 UTC

The current version of PythonGPUBeta has been copied to PythonGPU.

It seems the task DISK_LIMIT needs to be increased: I have seen some EXIT_DISK_LIMIT_EXCEEDED errors. We will adjust it.
ID: 58624 · Rating: 0
Greg _BE

Joined: 30 Jun 14
Posts: 153
Credit: 129,654,684
RAC: 0
Level
Cys
Message 58625 - Posted: 12 Apr 2022, 16:48:14 UTC

Well this is interesting to read.
Over at RAH they are using Python (CPU) tasks, and they are memory and disk space hogs.
I suggest that once you get your GPU tasks working, you make a FAQ on the minimum memory and disk space needed to run these tasks.

One CPU task uses 7.8 GB compressed and 8.4 GB of actual space on the drive.
Memory-wise, it uses 2861 MB of physical RAM and 55 to 58 MB of virtual.
If your GPU tasks are anything like these... well, we will need a bit of free space.

Looking forward to reading about your success getting python running on GPU.
ID: 58625 · Rating: 0
abouh

Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58634 - Posted: 13 Apr 2022, 7:23:26 UTC - in response to Message 58625.  
Last modified: 13 Apr 2022, 7:42:40 UTC

The sizes of all the app files (including the compressed environment) are:

2.0G for Windows with cuda102
2.7G for Windows with cuda1131
1.8G for Linux with cuda102
2.6G for Linux with cuda1131

The additional task-specific data ranges from a few KB to a few MB. I did not expect 7.8G compressed (not even after unpacking the environment). Is that the case for all PythonGPU tasks now?

Regarding CPU/GPU usage, this app actually uses a combination of both due to the nature of the problem we are tackling (training an AI agent to develop intelligent behaviour in a simulated environment with reinforcement learning techniques). Interactions with the agent's environment happen on the CPU, while learning happens on the GPU.
ID: 58634 · Rating: 0
abouh

Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58635 - Posted: 13 Apr 2022, 9:10:08 UTC

Also, the PythonGPU app version used in the new jobs should be 402 (or 4.02).

If it is not, there is probably some problem. The new version should be used automatically, but if it isn't, resetting the app should help.
ID: 58635 · Rating: 0
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 351
Level
Trp
Message 58636 - Posted: 13 Apr 2022, 9:46:58 UTC - in response to Message 58635.  
Last modified: 13 Apr 2022, 9:47:41 UTC

I have e1a46-ABOU_rnd_ppod_avoid_cnn_3-0-1-RND3588_4 running under Linux. I can confirm that my task (and its four predecessors) are running with the v4.02 app.

Small point: can you apply a "weight" to the sub-tasks in job.xml, please? At the moment, the 'decompress' stage is estimated to take 50% of the runtime under Linux, and 66% under Windows. That throws out the estimate for the rest of the run.

Under Linux, my slot directory is occupying 9.8 GB, against an allowed limit of 10,000,000,000 bytes: that's tight, especially when you consider the divergence of binary and decimal representations for bigger files.
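
The binary/decimal point can be made concrete with a little arithmetic. This is a sketch only: whether the reported "9.8 GB" is decimal (10^9 bytes) or binary (2^30 bytes, GiB) depends on the tool that produced the figure, which the post does not state, so both readings are shown.

```python
GB = 10**9    # decimal gigabyte
GIB = 2**30   # binary gibibyte

LIMIT_BYTES = 10_000_000_000  # disk limit quoted above


def headroom(used, unit, limit=LIMIT_BYTES):
    """Bytes left under the limit if `used` is expressed in `unit`."""
    return limit - used * unit


print(f"limit: {LIMIT_BYTES / GB:.2f} GB = {LIMIT_BYTES / GIB:.2f} GiB")
print(f"9.8 GB  leaves {headroom(9.8, GB):,.0f} bytes")
print(f"9.8 GiB leaves {headroom(9.8, GIB):,.0f} bytes")
```

Read as decimal gigabytes, 9.8 GB leaves about 200 MB of headroom; read as binary gibibytes, the same figure would already be roughly 0.5 GB over the limit.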

All my predecessors for this workunit were running Windows. Three failed on disk limits, and one on memory limits. If every Windows version is using the 7-zip decompressor, there's the extra 'de-archived, but still compressed' step to allow for in the disk limit.

Still awaiting the final hurdle - the upload file size limit. In about 4 hours' time, I reckon - currently at 85% after 10 hours.
ID: 58636 · Rating: 0
abouh

Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58637 - Posted: 13 Apr 2022, 10:09:20 UTC - in response to Message 58636.  
Last modified: 13 Apr 2022, 15:03:42 UTC

Thanks a lot for the info Richard!

You are right, I should adjust the weights of the subtasks in job.xml to 10% for 'decompress' and 90% for executing the Python script. That may also explain why jobs were getting stuck at 50% when Python was not closed properly between jobs: the new job could decompress the environment (50%), but the Python script could not be executed.
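
The effect of the weights on the reported progress can be sketched as follows. This assumes that overall progress is the weight-proportional sum of each subtask's own progress; the function name is illustrative and not part of any BOINC API.

```python
def fraction_done(weights, progress):
    """Overall fraction done for a sequence of weighted subtasks.

    weights  -- relative weight of each subtask
    progress -- fraction (0.0 to 1.0) completed within each subtask
    """
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, progress)) / total


# State: 'decompress' has finished, the Python script has not started.
state = [1.0, 0.0]
for weights in ([1, 1], [10, 90], [1, 99]):
    print(weights, f"{fraction_done(weights, state):.0%}")
```

With equal weights, a task that has only decompressed the environment already reports 50%, which matches the stuck-at-50% symptom. With 10:90 it would report 10%, and with 1:99 it would report 1%. Under the same assumption, three equally weighted subtasks with two decompression steps done would report about two thirds, which would be consistent with the 66% figure reported for Windows.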

I have increased the allowed limit to 30,000,000,000 bytes. This should affect all new jobs (to be confirmed) and should solve the DISK LIMIT problems.

Finally, I was also thinking about sending the compressed environment as a tar.bz2 file instead of a tar.gz to make it smaller. But I have to test that 7-zip handles it correctly.
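
One way to check the size difference before changing formats is to pack the same directory both ways and compare. This is a sketch with Python's standard tarfile module; the sample data is a stand-in, and the real saving depends entirely on the contents of the environment. It does not test 7-zip compatibility, which would still need to be checked separately.

```python
import os
import tarfile
import tempfile


def pack(source_dir, archive_path, mode):
    """Pack source_dir into archive_path; mode is 'w:gz' or 'w:bz2'."""
    with tarfile.open(archive_path, mode) as tar:
        tar.add(source_dir, arcname="env")
    return os.path.getsize(archive_path)


def compare_formats(source_dir, work_dir):
    """Return archive sizes in bytes for the same input, keyed by format."""
    return {
        "tar.gz": pack(source_dir, os.path.join(work_dir, "env.tar.gz"), "w:gz"),
        "tar.bz2": pack(source_dir, os.path.join(work_dir, "env.tar.bz2"), "w:bz2"),
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as out:
        with open(os.path.join(src, "sample.txt"), "w") as f:
            f.write("import numpy\n" * 10_000)  # stand-in data
        for fmt, size in compare_formats(src, out).items():
            print(f"{fmt}: {size} bytes")
```

bzip2 often produces smaller archives than gzip but is usually slower to decompress, so the trade-off between download size and the 'decompress' stage would need to be measured on real environments.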

I will probably deploy these changes in PythonGPUBeta first; that is what it is for.
ID: 58637 · Rating: 0
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 351
Level
Trp
Message 58638 - Posted: 13 Apr 2022, 11:25:32 UTC - in response to Message 58637.  

I'd say 1%::99%, but thanks.
ID: 58638 · Rating: 0
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 351
Level
Trp
Message 58639 - Posted: 13 Apr 2022, 13:59:28 UTC

Uploaded and reported with no problem at all.
ID: 58639 · Rating: 0

©2025 Universitat Pompeu Fabra