Experimental Python tasks (beta) - task description

Message boards : News : Experimental Python tasks (beta) - task description

Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 731
Message 59695 - Posted: 8 Jan 2023, 16:55:28 UTC

Thanks for continuing to dig into this high CPU usage bug, Ian. I guess I missed the latest conversations on your other thread at STH.

Hope abouh can implement a proper fix. That should increase the return rate dramatically, I think.
ID: 59695
abouh

Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 59697 - Posted: 9 Jan 2023, 8:10:53 UTC - in response to Message 59683.  
Last modified: 9 Jan 2023, 10:56:29 UTC

Hello Ian,

learner.step()


is the line of code the task spends most of its time on. This function first handles the collection of data (CPU intensive) and then takes one learning step, updating the weights of the agent's neural networks (GPU intensive).
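As a toy illustration of that collect-then-learn structure (purely a sketch; the class and method names here are hypothetical and not the actual PyTorchRL API), a step() might look like:

```python
import random

random.seed(0)  # reproducible toy run

class ToyLearner:
    """Illustrative only: mimics the collect-data-then-update shape of a step."""

    def __init__(self, lr=0.1):
        self.weight = 0.0  # stand-in for the agent's neural network weights
        self.lr = lr

    def collect_rollouts(self, n=32):
        # Data-collection phase (CPU intensive in the real task): gather
        # experience; here the "environment" just emits (x, 2x) pairs.
        return [(x, 2.0 * x) for x in (random.uniform(-1, 1) for _ in range(n))]

    def step(self):
        batch = self.collect_rollouts()
        # Learning phase (GPU intensive in the real task): one gradient
        # update on a mean-squared-error objective.
        grad = sum(2 * (self.weight * x - y) * x for x, y in batch) / len(batch)
        self.weight -= self.lr * grad

learner = ToyLearner()
for _ in range(200):
    learner.step()  # the weight converges toward 2.0
```

The interleaving is the point: each call alternates a CPU-bound collection phase with a GPU-bound update phase, which is why profiling shows the task "stuck" on this one line.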

Regarding your findings with respect to wandb, I could remove the wandb dependency and simply make a run.py script that does not use it. It is nice to have a way to log extra training information, but not at the cost of reduced task efficiency, and I get part of that information anyway when the task comes back. Do I understand correctly that simply getting rid of wandb would be the best solution?

Thanks a lot for your help!

If that is the best solution, I will work on a run.py without wandb. I can start using it as soon as the current batch (~10,736 tasks now) is processed.
ID: 59697
Ian&Steve C.

Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 5,269
Message 59698 - Posted: 9 Jan 2023, 13:39:07 UTC - in response to Message 59697.  

Hello Ian,

learner.step()


is the line of code the task spends most of its time on. This function first handles the collection of data (CPU intensive) and then takes one learning step, updating the weights of the agent's neural networks (GPU intensive).

Regarding your findings with respect to wandb, I could remove the wandb dependency and simply make a run.py script that does not use it. It is nice to have a way to log extra training information, but not at the cost of reduced task efficiency, and I get part of that information anyway when the task comes back. Do I understand correctly that simply getting rid of wandb would be the best solution?

Thanks a lot for your help!

If that is the best solution, I will work on a run.py without wandb. I can start using it as soon as the current batch (~10,736 tasks now) is processed.


Removing wandb could be a start, but it's also possible that it's not the sole cause of the problem.

Are you able to see any soft errors in the logs from reported tasks?

Do you have any higher core count (32+ cores) systems in your lab or available to test on?

ID: 59698
abouh

Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 59699 - Posted: 10 Jan 2023, 6:06:09 UTC - in response to Message 59693.  
Last modified: 10 Jan 2023, 6:07:30 UTC

OK, in that case I will start by removing wandb in the next batch of tasks. Let's see if that improves performance. I will make a post announcing the submission once it is done; it will probably still take a few days, since the latest batch is still being processed.

I have access to machines with up to 32 cores for testing. I will also try setting the same environment flags to see what happens.



import os

NUM_THREADS = "8"
os.environ["OMP_NUM_THREADS"] = NUM_THREADS
os.environ["OPENBLAS_NUM_THREADS"] = NUM_THREADS
os.environ["MKL_NUM_THREADS"] = NUM_THREADS
os.environ["CNN_NUM_THREADS"] = NUM_THREADS
os.environ["VECLIB_MAXIMUM_THREADS"] = NUM_THREADS
os.environ["NUMEXPR_NUM_THREADS"] = NUM_THREADS
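One caveat, hedged because it depends on the libraries involved: these variables are generally read once at import time, so they need to be in the environment before numpy/torch are imported or they may be silently ignored. A small sketch of ordering it safely:

```python
import os

# Thread caps must be in the environment before numpy/torch are imported;
# setting them afterwards has no effect on thread pools already spawned.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS",
            "VECLIB_MAXIMUM_THREADS", "NUMEXPR_NUM_THREADS"):
    os.environ.setdefault(var, "8")

# Only now is it safe to `import numpy` / `import torch`. Afterwards the
# effective cap can be confirmed with torch.get_num_threads() or the
# threadpoolctl package.
```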


Unfortunately the error logs I get do not say much… at least I don't see any soft errors. Is there any information that could be printed from the run.py script that would help?
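One option for surfacing more information, sketched under the assumption that BOINC returns whatever the task writes to stderr (stderr.txt): install a global exception hook and make warnings loud, so soft errors come back with the reported task:

```python
import sys
import traceback
import warnings

def log_unhandled(exc_type, exc, tb):
    # Anything printed to stderr should come back with the reported task,
    # giving at least a traceback when a worker dies unexpectedly.
    print("UNHANDLED EXCEPTION:", exc_type.__name__, exc, file=sys.stderr)
    traceback.print_tb(tb, file=sys.stderr)
    sys.stderr.flush()

sys.excepthook = log_unhandled
warnings.simplefilter("always")  # surface soft warnings instead of hiding repeats
```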

Regarding full access to the job, the Python package we use to train the AI agents is public and mostly based on PyTorch, in case anyone is interested (https://github.com/PyTorchRL/pytorchrl).
ID: 59699
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 351
Message 59701 - Posted: 10 Jan 2023, 11:58:55 UTC

I think I may have encountered the Linux version of the Windows virtual memory problem.

I have been concentrating on another project, where a new application is generating vast amounts of uploadable result data. They deployed a new upload server to handle this data, but it crashed almost immediately - on Christmas Eve. Another new upload server may come online tonight, but in the meantime, my hard disk has been filling up something rotten.

It's now down to below 30 GB free for BOINC, so I thought it was wise to stop that project, and do something else until the disk starts to empty. So I tried a couple of python tasks on host 132158: both failed with "OSError: [Errno 28] No space left on device", and BOINC crashed at the same time.

I'm doing some less data-intensive work at the moment, and handling the machine with kid gloves. Timeshift is implicated in a third crash, so I've been able to move that to a different drive - let's see how that goes. I'll re-test GPUGrid when things have settled down a bit, to try and confirm that virtual memory theory.
ID: 59701
kotenok2000

Joined: 18 Jul 13
Posts: 79
Credit: 210,528,292
RAC: 0
Message 59702 - Posted: 10 Jan 2023, 12:26:44 UTC - in response to Message 59701.  

There are programs that can display which files use the most space on disk, for example K4DirStat.
ID: 59702
Ian&Steve C.

Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 5,269
Message 59703 - Posted: 10 Jan 2023, 13:11:55 UTC - in response to Message 59699.  

OK, in that case I will start by removing wandb in the next batch of tasks. Let's see if that improves performance. I will make a post announcing the submission once it is done; it will probably still take a few days, since the latest batch is still being processed.

I have access to machines with up to 32 cores for testing. I will also try setting the same environment flags to see what happens.


import os

NUM_THREADS = "8"
os.environ["OMP_NUM_THREADS"] = NUM_THREADS
os.environ["OPENBLAS_NUM_THREADS"] = NUM_THREADS
os.environ["MKL_NUM_THREADS"] = NUM_THREADS
os.environ["CNN_NUM_THREADS"] = NUM_THREADS
os.environ["VECLIB_MAXIMUM_THREADS"] = NUM_THREADS
os.environ["NUMEXPR_NUM_THREADS"] = NUM_THREADS


Unfortunately the error logs I get do not say much… at least I don't see any soft errors. Is there any information that could be printed from the run.py script that would help?

Regarding full access to the job, the Python package we use to train the AI agents is public and mostly based on PyTorch, in case anyone is interested (https://github.com/PyTorchRL/pytorchrl).


I'm sure if you set those same env flags, you'll get the same result I did: less CPU use and fewer Python threads per task, based on the NUM_THREADS you set. I'm testing "4" now and it doesn't seem slower either; I'll need to run it a while longer to be sure.

Let me get back to you about what could usefully be printed from within the run.py script.

And yeah, no worries about waiting for the batch to finish up. Still over 9,000 tasks to go.
ID: 59703
Ian&Steve C.

Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 5,269
Message 59704 - Posted: 10 Jan 2023, 13:31:49 UTC - in response to Message 59701.  

I think I may have encountered the Linux version of the Windows virtual memory problem.

I have been concentrating on another project, where a new application is generating vast amounts of uploadable result data. They deployed a new upload server to handle this data, but it crashed almost immediately - on Christmas Eve. Another new upload server may come online tonight, but in the meantime, my hard disk has been filling up something rotten.

It's now down to below 30 GB free for BOINC, so I thought it was wise to stop that project, and do something else until the disk starts to empty. So I tried a couple of python tasks on host 132158: both failed with "OSError: [Errno 28] No space left on device", and BOINC crashed at the same time.

I'm doing some less data-intensive work at the moment, and handling the machine with kid gloves. Timeshift is implicated in a third crash, so I've been able to move that to a different drive - let's see how that goes. I'll re-test GPUGrid when things have settled down a bit, to try and confirm that virtual memory theory.


Probably need some more context about the system.

How much disk space does it have?
How much of that space have you allowed BOINC to use?
How many Python tasks are you running?
Do you have any other projects running that cause high disk use?

Each expanded and running GPUGRID_Python slot looks to take up about 9 GB (the 2.7 GB archive gets copied there and expanded to ~6.x GB, and the archive remains in place). So that's 9 GB per running task, plus ~5 GB for the GPUGRID project folder, depending on whether you've cleaned up old apps/archives. If your project folder is carrying lots of old apps, a project reset might be in order to clean it out.
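Those numbers suggest a rough pre-flight check. A sketch; the 9 GB-per-slot and 5 GB project-folder figures are the observations above, not official requirements, and the example path is hypothetical:

```python
import shutil

GB = 1024 ** 3
PER_SLOT_GB = 9   # observed: copied archive + expanded files per running task
PROJECT_GB = 5    # observed: GPUGRID project folder with apps/archives

def enough_disk_for(n_tasks, path="."):
    """Return True if free space at `path` covers n_tasks Python task slots."""
    free = shutil.disk_usage(path).free
    return free >= (n_tasks * PER_SLOT_GB + PROJECT_GB) * GB
```

For example, `enough_disk_for(2, "/var/lib/boinc-client")` before allowing two concurrent Python tasks.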


ID: 59704
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 351
Message 59706 - Posted: 10 Jan 2023, 14:35:35 UTC - in response to Message 59704.  

How much disk space does it have?
How much of that space have you allowed BOINC to use?
How many Python tasks are you running?
Do you have any other projects running that cause high disk use?

This is what BOINC sees:

[screenshot of the BOINC Manager Disk tab]
It's running on a single 512 GB M.2 SSD. Much of that 200 GB is used by the errant project, and is dormant until they get their new upload server fettled.
One Python task - the other GPU is excluded by cc_config.
Some Einstein GPU tasks are just finishing. Apart from that, just NumberFields (lightweight integer maths).

Within the next half hour, the Einstein tasks will vacate the machine. I'll try one Python, solo, as an experiment, and report back.
ID: 59706
Ian&Steve C.

Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 5,269
Message 59707 - Posted: 10 Jan 2023, 15:23:53 UTC - in response to Message 59706.  

So it looks like you've set BOINC to be allowed to use the whole drive, or only 50%?

The 234 GB "used by other programs" seems odd. Are you using this system to store a large amount of personal files too? Do you know what is taking up nearly half of the drive that's not BOINC related?

If you're not aware of what's taking up that space, check /var/log/. I've had large numbers of errors fill up the syslog and kern.log files and fill the disk.
ID: 59707
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 351
Message 59709 - Posted: 10 Jan 2023, 15:57:46 UTC - in response to Message 59707.  
Last modified: 10 Jan 2023, 16:20:03 UTC

The machine is primarily a BOINC cruncher, so yes - BOINC is allowed to use what it wants. I'm suspicious about those 'other programs', too - especially as my other Linux machine shows a much lower figure. The main difference between them is that I did an in-situ upgrade from Mint 20.3 to 21 not long ago, and the other machine is still at 20.3 - I suspect there may be a lot of rollback files kept 'just in case'.

And yes, I'm suspicious of the logs too - especially BOINC writing to the systemd journal, and that upgrade. That's the next venue for exploration.

I've been watching the disk tab in my peripheral vision, as the test task started. 'Free space for BOINC' fell in steps through 26, 24, 22, 21, 20 as it started, and has stayed there. Now at around 10% progress / 1 hour elapsed.

Should have mentioned - machine has 64 GB of physical RAM, in anticipation of some humongous multi-threaded tasks to come.

Edit - new upload server won't be certified as 'fit for use' until tomorrow, so I've started Einstein again.
ID: 59709
kotenok2000

Joined: 18 Jul 13
Posts: 79
Credit: 210,528,292
RAC: 0
Message 59711 - Posted: 10 Jan 2023, 22:37:51 UTC - in response to Message 59709.  

What other project?
ID: 59711
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 351
Message 59712 - Posted: 10 Jan 2023, 22:50:16 UTC - in response to Message 59711.  

What other project?

Name redacted to save the blushes of the guilty!
ID: 59712
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 351
Message 59713 - Posted: 11 Jan 2023, 8:04:22 UTC

Looks like this was a false alarm - the probe task finished successfully, and I've started another. Must have been Timeshift all along.

The nameless project is still hors de combat. The new server is alive and ready, but can't be accessed by BOINC.
ID: 59713
kotenok2000

Joined: 18 Jul 13
Posts: 79
Credit: 210,528,292
RAC: 0
Message 59714 - Posted: 11 Jan 2023, 8:52:11 UTC - in response to Message 59713.  

You mean iThena?
ID: 59714
Ian&Steve C.

Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 5,269
Message 59737 - Posted: 18 Jan 2023, 16:57:24 UTC - in response to Message 59703.  

OK, in that case I will start by removing wandb in the next batch of tasks. Let's see if that improves performance. I will make a post announcing the submission once it is done; it will probably still take a few days, since the latest batch is still being processed.

I have access to machines with up to 32 cores for testing. I will also try setting the same environment flags to see what happens.


import os

NUM_THREADS = "8"
os.environ["OMP_NUM_THREADS"] = NUM_THREADS
os.environ["OPENBLAS_NUM_THREADS"] = NUM_THREADS
os.environ["MKL_NUM_THREADS"] = NUM_THREADS
os.environ["CNN_NUM_THREADS"] = NUM_THREADS
os.environ["VECLIB_MAXIMUM_THREADS"] = NUM_THREADS
os.environ["NUMEXPR_NUM_THREADS"] = NUM_THREADS


Unfortunately the error logs I get do not say much… at least I don't see any soft errors. Is there any information that could be printed from the run.py script that would help?

Regarding full access to the job, the Python package we use to train the AI agents is public and mostly based on PyTorch, in case anyone is interested (https://github.com/PyTorchRL/pytorchrl).


I'm sure if you set those same env flags, you'll get the same result I did: less CPU use and fewer Python threads per task, based on the NUM_THREADS you set. I'm testing "4" now and it doesn't seem slower either; I'll need to run it a while longer to be sure.

Let me get back to you about what could usefully be printed from within the run.py script.

And yeah, no worries about waiting for the batch to finish up. Still over 9,000 tasks to go.


4 seems to be working fine.

abouh, if removing wandb doesn't fix the problem, then adding the env variables listed above with NUM_THREADS = "4" will probably be a suitable workaround for everyone. There probably aren't many hosts with fewer than 4 threads these days.
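If 4 does become the default, a defensive variant (hypothetical code, not the project's actual script) could clamp to the host's core count, so the rare machine with fewer than 4 threads isn't oversubscribed:

```python
import os

# Never request more threads than the machine actually has.
threads = str(min(4, os.cpu_count() or 1))
for var in (
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "VECLIB_MAXIMUM_THREADS",
    "NUMEXPR_NUM_THREADS",
):
    os.environ[var] = threads
```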
ID: 59737
Ryan Munro

Joined: 6 Mar 18
Posts: 38
Credit: 1,340,042,080
RAC: 27
Message 59757 - Posted: 18 Jan 2023, 23:18:39 UTC - in response to Message 59737.  

Excuse the dumb question, but would that mean the app would only spin up 4 threads?
On Windows, I have manually capped the app at 24 threads and it uses all of them; my Linux box, capped at 6 threads, has half its threads idling.
Both seem to take about the same time, though. What is the Windows app doing with all the threads that the Linux app does not need?
ID: 59757
abouh

Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 59761 - Posted: 19 Jan 2023, 8:54:22 UTC - in response to Message 59737.  
Last modified: 19 Jan 2023, 9:13:41 UTC

I have been testing the new script without wandb and the proposed environment configuration, and it works fine. On my machine performance is similar, but I'm looking forward to receiving feedback from other users.

I also need to update the PyTorchRL library (our main dependency), so my plan is to follow these steps:

1. Wait for the current batch to finish (currently 3,726 tasks).
2. Update the PyTorchRL library.
3. Send a small batch (20-50 tasks) to PythonGPUBeta with the new code to make sure everything works fine (I have tested locally, but in my opinion it is always worth sending a test batch to Beta).
4. Send another big batch with the new code to PythonGPU.

The app will be short of tasks for a brief period, but even though the new version of PyTorchRL does not have huge changes, I don't want to risk updating it while 3,000+ tasks are still in the queue.

I will make a post once I submit the Beta tasks.
ID: 59761
kotenok2000

Joined: 18 Jul 13
Posts: 79
Credit: 210,528,292
RAC: 0
Message 59762 - Posted: 19 Jan 2023, 12:57:01 UTC - in response to Message 59761.  

Can a link to http://www.gpugrid.net/apps.php be put next to the Server status link?
ID: 59762
Ian&Steve C.

Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 5,269
Message 59764 - Posted: 19 Jan 2023, 13:48:56 UTC - in response to Message 59761.  

I have been testing the new script without wandb and the proposed environment configuration, and it works fine. On my machine performance is similar, but I'm looking forward to receiving feedback from other users.

I also need to update the PyTorchRL library (our main dependency), so my plan is to follow these steps:

1. Wait for the current batch to finish (currently 3,726 tasks).
2. Update the PyTorchRL library.
3. Send a small batch (20-50 tasks) to PythonGPUBeta with the new code to make sure everything works fine (I have tested locally, but in my opinion it is always worth sending a test batch to Beta).
4. Send another big batch with the new code to PythonGPU.

The app will be short of tasks for a brief period, but even though the new version of PyTorchRL does not have huge changes, I don't want to risk updating it while 3,000+ tasks are still in the queue.

I will make a post once I submit the Beta tasks.


Thanks abouh! Looking forward to testing out the new batch.


ID: 59764

©2025 Universitat Pompeu Fabra