Experimental Python tasks (beta)

Author	Message
Ian&Steve C. Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 9,834	Message 58607 - Posted: 6 Apr 2022, 0:41:00 UTC - in response to Message 58606. I've had no problems with their CUDA ACEMD3 app. It's been very stable across many data sets. All of the issues raised in this thread concern the Python app, which is still in testing/beta, so problems are to be expected. CUDA outperforms OpenCL. Even with identical code (as much as it can be), there is always the added overhead of compiling the OpenCL code at runtime, whereas CUDA runs natively on Nvidia. Most projects run OpenCL because it lets them port the code to different devices more easily, expanding their user base at the expense of some performance overhead. There have been many problems with the 500+ series drivers, though. If you still have issues with the older drivers, then something else is wrong with your setup. If you didn't totally purge the old drivers with DDU from Safe Mode and reinstall from a fresh Nvidia package, that's a good first step. Sometimes driver corruption can linger across many driver removals and upgrades, and it needs to be removed more forcefully.

Erich56 Joined: 1 Jan 15 Posts: 1168 Credit: 12,317,898,501 RAC: 91,654	Message 58608 - Posted: 6 Apr 2022, 5:23:39 UTC - in response to Message 58602. bcavnaugh wrote: ... For now I am waiting for new 3 & 4 on two of my hosts, it is a real bummer that our hosts have to sit for days on end without getting any tasks. You said it, indeed :-( Obviously, ACEMD has very low priority at GPUGRID these days :-(

Keith Myers Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 42,316	Message 58609 - Posted: 7 Apr 2022, 19:23:48 UTC Beta is still having issues with establishing the correct Python environment. Threw away around 27 tasks today with errors because of: TypeError: object of type 'int' has no len()

abouh Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0	Message 58613 - Posted: 8 Apr 2022, 9:51:42 UTC - in response to Message 58609. Thanks, this is solved now. A new batch is running without this issue.

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 58614 - Posted: 8 Apr 2022, 14:43:17 UTC There are still a few old tasks around. I got the _9 (and hopefully final) issue of WU 27184379 from 19 March. It registered the 51% mark but hasn't moved on in over 3 hours: I'm afraid it's going the same way as all previous attempts.

Keith Myers Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 42,316	Message 58615 - Posted: 8 Apr 2022, 17:04:36 UTC Yes, I am still getting the bad work unit resends. Too bad they couldn't be purged before hitting the _9 timeout.

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 58616 - Posted: 11 Apr 2022, 10:26:59 UTC New tasks today. But: "ModuleNotFoundError: No module named 'yaml'"
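
One way to check for this kind of failure on a host is to ask Python whether a module can be located before the task script tries to import it. The sketch below is illustrative only: the helper name `missing_modules` and the module list are assumptions and are not part of the GPUGRID app.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # 'yaml' is the module named in the error above; 'json' ships with Python.
    print(missing_modules(["json", "yaml"]))
```

An empty list means every listed module was found in the active environment.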

Keith Myers Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 42,316	Message 58617 - Posted: 11 Apr 2022, 16:01:42 UTC Same here today.

Azmodes Joined: 7 Jan 17 Posts: 34 Credit: 1,371,429,518 RAC: 0	Message 58618 - Posted: 11 Apr 2022, 19:04:45 UTC Same.

abouh Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0	Message 58619 - Posted: 12 Apr 2022, 8:59:28 UTC Last modified: 12 Apr 2022, 9:01:22 UTC Thanks for the feedback. I will look into it today. In which OS?

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 58621 - Posted: 12 Apr 2022, 9:08:46 UTC - in response to Message 58619. In which OS? These were "Python apps for GPU hosts v4.01 (cuda1121)", which is Linux only.

abouh Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0	Message 58622 - Posted: 12 Apr 2022, 9:35:11 UTC - in response to Message 58621. Last modified: 12 Apr 2022, 9:36:30 UTC Right, I just saw it while browsing through the failed jobs. It seems the problem is in the PythonGPU app, not in PythonGPUBeta. This is what I think happened: since in PythonGPU the conda environment is created every time, it could be that the dependencies of one or more required packages have changed recently. As a result, the yaml package was not installed in the environments and was missing during execution. This is one more reason to switch to the new approach (currently beta). The conda environment is created, packed and sent to the volunteer machine when executing the first job. There, the environment is simply unpacked, and there is no need to send a new one unless some fix is required. We will move the PythonGPUBeta app to PythonGPU. PythonGPUBeta is now quite stable, and its approach avoids this kind of problem. I expect we can do it today, but I will post to confirm it.
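
The "unpack once, reuse afterwards" idea described above can be sketched with the standard library. This is a minimal illustration, not the project's actual code: the function name `unpack_environment`, the `.unpacked` marker file and the paths are all hypothetical.

```python
import tarfile
from pathlib import Path


def unpack_environment(archive, target_dir):
    """Unpack a packed environment archive unless it was already unpacked.

    Returns True if the archive was extracted, False if it was skipped.
    """
    target = Path(target_dir)
    marker = target / ".unpacked"  # hypothetical marker left by an earlier job
    if marker.exists():
        return False
    target.mkdir(parents=True, exist_ok=True)
    # Newer Python versions provide extraction filters; use the safe one if present.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive, "r:*") as tar:  # "r:*" auto-detects gz or bz2
        tar.extractall(target, **kwargs)
    marker.touch()
    return True
```

A later job that finds the marker can skip extraction entirely, which is the saving the new approach is after.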

abouh Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0	Message 58624 - Posted: 12 Apr 2022, 14:41:23 UTC Last modified: 12 Apr 2022, 15:49:53 UTC The current version of PythonGPUBeta has been copied to PythonGPU. It seems the task DISK_LIMIT needs to be increased; I have seen some EXIT_DISK_LIMIT_EXCEEDED errors. We will adjust it.

Greg _BE Joined: 30 Jun 14 Posts: 153 Credit: 129,654,684 RAC: 0	Message 58625 - Posted: 12 Apr 2022, 16:48:14 UTC Well, this is interesting to read. Over at RAH they are using Python (cpu) and the tasks are memory and disk space hogs. I suggest that once you get your GPU tasks working you make a FAQ on the minimum memory and disk space needed to run these tasks. One task in CPU uses 7.8 compressed to 8.4GB actual space on the drive. Memory-wise it uses 2861MB of physical RAM and 55 to 58 MB of virtual. If your tasks for GPU are anything like these... well, we will need a bit of free space. Looking forward to reading about your success getting python running on GPU.

abouh Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0	Message 58634 - Posted: 13 Apr 2022, 7:23:26 UTC - in response to Message 58625. Last modified: 13 Apr 2022, 7:42:40 UTC The sizes of all the app files (including the compressed environment) are:
2.0G for windows with cuda102
2.7G for windows with cuda1131
1.8G for linux with cuda102
2.6G for linux with cuda1131
The additional task-specific data goes from a few KB to a few MB. I did not expect 7.8G compressed (not even after unpacking the environment). Is that the case for all PythonGPU tasks now? Regarding CPU/GPU usage, this app actually uses a combination of both due to the nature of the problem we are tackling (training an AI agent to develop intelligent behaviour in a simulated environment with reinforcement learning techniques). Interactions with the agent environment happen on the CPU, learning happens on the GPU.

abouh Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0	Message 58635 - Posted: 13 Apr 2022, 9:10:08 UTC Also, the PythonGPU app version used in the new jobs should be 402 (or 4.02). If that is not the case, there is probably some problem. It should be automatically used, but if that is not the case resetting the app should help.

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 58636 - Posted: 13 Apr 2022, 9:46:58 UTC - in response to Message 58635. Last modified: 13 Apr 2022, 9:47:41 UTC I have e1a46-ABOU_rnd_ppod_avoid_cnn_3-0-1-RND3588_4 running under Linux. I can confirm that my task (and its four predecessors) are running with the v4.02 app. Small point: can you apply a "weight" to the sub-tasks in job.xml, please? At the moment, the 'decompress' stage is estimated to take 50% of the runtime under Linux, and 66% under Windows. That throws out the estimate for the rest of the run. Under Linux, my slot directory is occupying 9.8 GB, against an allowed limit of 10,000,000,000 bytes: that's tight, especially when you consider the divergence of binary and decimal representations for bigger files. All my predecessors for this workunit were running Windows. Three failed on disk limits, and one on memory limits. If every Windows version is using the 7-zip decompressor, there's the extra 'de-archived, but still compressed' step to allow for in the disk limit. Still awaiting the final hurdle - the upload file size limit. In about 4 hours' time, I reckon - currently at 85% after 10 hours.
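
The point about binary and decimal units can be made concrete with a little arithmetic. The post does not say which unit the reported "9.8 GB" uses, so the sketch below simply evaluates both readings against the quoted limit; the helper name `to_bytes` is an assumption for illustration.

```python
LIMIT_BYTES = 10_000_000_000  # allowed limit quoted in the post above


def to_bytes(value, binary):
    """Convert a size given in 'GB' to bytes, using 2**30 or 10**9 per unit."""
    return value * (2**30 if binary else 10**9)


if __name__ == "__main__":
    reported = 9.8  # slot directory size as reported, unit interpretation unknown
    for binary in (False, True):
        size = to_bytes(reported, binary)
        unit = "binary (2**30)" if binary else "decimal (10**9)"
        print(f"{unit}: {size:,.0f} bytes, over limit: {size > LIMIT_BYTES}")
```

Read as decimal gigabytes, 9.8 GB sits just under the limit; read as binary units it is roughly 10.5 billion bytes and would exceed it.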

abouh Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0	Message 58637 - Posted: 13 Apr 2022, 10:09:20 UTC - in response to Message 58636. Last modified: 13 Apr 2022, 15:03:42 UTC Thanks a lot for the info, Richard! You are right, I should adjust the weights of the subtasks in job.xml to 10% for 'decompress' and 90% for executing the python script. That may also explain why jobs were getting stuck at 50% when python was not closed properly between jobs: the new job could decompress the environment (50%), but the python script could not be executed. I have increased the allowed limit to 30,000,000,000 bytes. This should affect all new jobs (to be confirmed) and should solve the DISK LIMIT problems. Finally, I was also thinking about sending the compressed environment as a tar.bz2 file instead of a tar.gz to make it smaller, but I have to test that 7-zip handles it correctly. I will probably deploy these changes first in PythonGPUBeta; that is what it is for.
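
The effect of sub-task weights on the reported progress can be sketched as a weighted sum. This is only an illustration of proportional weighting, not the BOINC wrapper's actual implementation; the function name `fraction_done` is an assumption, as is the guess that Windows runs three equally weighted sub-tasks (which would match the 66% figure mentioned earlier).

```python
def fraction_done(weights, current_index, current_fraction):
    """Overall progress when sub-tasks before current_index are finished.

    weights: relative weight of each sub-task, in execution order.
    current_fraction: progress (0.0 to 1.0) within the current sub-task.
    """
    total = sum(weights)
    completed = sum(weights[:current_index])
    return (completed + weights[current_index] * current_fraction) / total


if __name__ == "__main__":
    # Progress shown at the moment 'decompress' ends and the script starts.
    print(fraction_done([1, 1], 1, 0.0))     # equal weights, two sub-tasks
    print(fraction_done([1, 1, 1], 2, 0.0))  # equal weights, three sub-tasks (assumed)
    print(fraction_done([10, 90], 1, 0.0))   # 10% / 90% split
    print(fraction_done([1, 99], 1, 0.0))    # 1% / 99% split
```

With equal weights the bar jumps to one half (or two thirds) as soon as decompression ends, which is why a short decompress step distorts the estimate for the rest of the run.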

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 58638 - Posted: 13 Apr 2022, 11:25:32 UTC - in response to Message 58637. I'd say 1%::99%, but thanks.

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 58639 - Posted: 13 Apr 2022, 13:59:28 UTC Uploaded and reported with no problem at all.
