Experimental Python tasks (beta) - task description
| Author | Message |
|---|---|
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 731
|
Thanks for continuing to dig into this high CPU usage bug, Ian. I guess I missed the latest conversations on your other thread at STH. I hope abouh can implement a proper fix; that should increase the return rate dramatically, I think. |
|
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0
|
Hello Ian, learner.step() is the line of code the task spends the most time on. This function first handles the collection of data (CPU intensive) and then takes one learning step (updating the weights of the agent's neural networks, GPU intensive). Regarding your findings about wandb, I could remove the wandb dependency: I can simply make a run.py script that does not use wandb. It is nice to have a way to log extra training information, but not at the cost of reducing task efficiency, and I get part of that information anyway when the task comes back. Am I right in understanding that simply getting rid of wandb would be the best solution? Thanks a lot for your help! If that is the best solution, I will work on a run.py without wandb. I can start using it as soon as the current batch (~10,736 tasks now) is processed. |
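To see how the time inside a step splits between the CPU-heavy data collection and the GPU-heavy weight update, each phase can be timed separately. The following is only a minimal sketch: `collect_data` and `update_weights` are hypothetical stand-ins, not functions from PyTorchRL, and the real `learner.step()` may not expose its phases this way.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Hypothetical placeholders for the two phases described in the post.
def collect_data():
    # Stand-in for the CPU-intensive environment rollouts.
    return [i * i for i in range(10_000)]


def update_weights(batch):
    # Stand-in for the GPU-intensive learning step.
    # With real CUDA work, call torch.cuda.synchronize() before reading
    # the clock, because kernel launches return asynchronously.
    return sum(batch) / len(batch)


def profile_step():
    """Time one collection phase and one update phase."""
    batch, t_collect = timed(collect_data)
    loss, t_update = timed(update_weights, batch)
    return {"collect_s": t_collect, "update_s": t_update, "loss": loss}


if __name__ == "__main__":
    print(profile_step())
```

If the collection phase dominates, the CPU side is the place to look, which matches the high CPU usage reported in this thread.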
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
|
Hello Ian, removing wandb could be a start, but it's also possible that it's not the sole cause of the problem. Are you able to see any soft errors in the logs from reported tasks? Do you have any higher core count (32+ cores) systems in your lab or available to test on?
|
|
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0
|
OK, in that case I will start by removing wandb in the next batch of tasks. Let's see if that improves performance. I will make a post to announce the submission once it is done; it will probably still take a few days, since the latest batch is still being processed. I have access to machines with up to 32 cores for testing. I will also try setting the same environment flags, with NUM_THREADS = "8", to see what happens. Unfortunately the error logs I get do not say much; at least I don't see any soft errors. Is there any information that could be printed from the run.py script that would help? Regarding full access to the job, the Python package we use to train the AI agents is public and mostly based on PyTorch, in case anyone is interested (https://github.com/PyTorchRL/pytorchrl). |
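For readers who want to try the thread cap themselves, here is a rough sketch of setting such flags from Python. The exact variable list used in the earlier posts is not shown on this page, so the names below are an assumption: they are the thread-count variables commonly read by OpenMP, MKL, OpenBLAS and NumExpr. They generally need to be set before the numeric libraries are imported.

```python
import os

# Assumed variable names; the post above only shows NUM_THREADS = "8".
THREAD_ENV_VARS = (
    "OMP_NUM_THREADS",
    "MKL_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "NUMEXPR_NUM_THREADS",
)


def limit_threads(num_threads="8", environ=os.environ):
    """Set every thread-cap variable to the same value.

    Call this before importing numpy or torch, since those libraries
    typically read the variables once at import time.
    """
    for name in THREAD_ENV_VARS:
        environ[name] = str(num_threads)
    return {name: environ[name] for name in THREAD_ENV_VARS}
```

Calling `limit_threads("8")` at the very top of a script applies the cap to the current process; passing a plain dict as `environ` makes it easy to inspect the result without touching the real environment.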
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
I think I may have encountered the Linux version of the Windows virtual memory problem. I have been concentrating on another project, where a new application is generating vast amounts of uploadable result data. They deployed a new upload server to handle this data, but it crashed almost immediately - on Christmas Eve. Another new upload server may come online tonight, but in the meantime, my hard disk has been filling up something rotten. It's now down to below 30 GB free for BOINC, so I thought it was wise to stop that project, and do something else until the disk starts to empty. So I tried a couple of python tasks on host 132158: both failed with "OSError: [Errno 28] No space left on device", and BOINC crashed at the same time. I'm doing some less data-intensive work at the moment, and handling the machine with kid gloves. Timeshift is implicated in a third crash, so I've been able to move that to a different drive - let's see how that goes. I'll re-test GPUGrid when things have settled down a bit, to try and confirm that virtual memory theory. |
|
Joined: 18 Jul 13 Posts: 79 Credit: 210,528,292 RAC: 0
|
There are programs that can display which files use the most space on disk, for example K4DirStat. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
|
"OK, in that case I will start by removing wandb in the next batch of tasks. Let's see if that improves performance. I will make a post to announce the submission once it is done; it will probably still take a few days, since the latest batch is still being processed." I'm sure if you set those same env flags, you'll get the same result I have: less CPU use and fewer threads used for Python per task, based on the NUM_THREADS you set. I'm testing "4" now and it doesn't seem slower either; I will need to run it a while longer to be sure. Let me get back to you on whether you could print some errors from within the run.py script. And yeah, no worries about waiting for the batch to finish up. Still over 9000 tasks to go.
|
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
|
"I think I may have encountered the Linux version of the Windows virtual memory problem." We probably need some more context about the system. How much disk drive space does it have? How much of that space have you allowed BOINC to use? How many Python tasks are you running? Do you have any other projects running that cause high disk use? Each expanded and running GPUGRID_Python slot looks to take up about 9GB (the 2.7GB archive gets copied there and expanded to ~6.xGB, and the archive remains in place). So that's 9GB per running task plus ~5GB for the GPUGRID project folder, depending on whether you've cleaned up old apps/archives or not. If your project folder is carrying lots of the old apps, a project reset might be in order to clean it out.
|
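The disk arithmetic in the post above (about 9GB per running task plus roughly 5GB for the project folder) can be turned into a quick check. This is a sketch only: the two constants come from the post and are approximate, and the check compares against free space on the filesystem, not against whatever disk limit is configured in BOINC.

```python
import shutil

GB = 1000 ** 3
SLOT_GB = 9      # per running Python task, approximate figure from the post
PROJECT_GB = 5   # GPUGRID project folder, approximate figure from the post


def estimated_need_gb(running_tasks):
    """Estimated disk footprint in GB for the given number of running tasks."""
    return running_tasks * SLOT_GB + PROJECT_GB


def has_room(path, running_tasks):
    """True if the filesystem holding `path` has enough free space.

    This ignores BOINC's own disk-usage preferences, which may be stricter.
    """
    free_gb = shutil.disk_usage(path).free / GB
    return free_gb >= estimated_need_gb(running_tasks)
```

For example, `estimated_need_gb(2)` gives 23, so two concurrent tasks would want a bit more than 23GB free by this estimate.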
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
"How much disk drive space does it have?" This is what BOINC sees: [screenshot of BOINC disk usage not preserved] It's running on a single 512 GB M.2 SSD. Much of that 200 GB is used by the errant project, and is dormant until they get their new upload server fettled. One Python task - the other GPU is excluded by cc_config. Some Einstein GPU tasks are just finishing. Apart from that, just NumberFields (lightweight integer maths). Within the next half hour, the Einstein tasks will vacate the machine. I'll try one Python, solo, as an experiment, and report back. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
|
So it looks like you’ve allowed BOINC to use the whole drive or so? Or only 50%? The 234GB “used by other programs” seems odd. Are you using this system to store a large amount of personal files too? Do you know what is taking up nearly half of the drive that’s not BOINC related? If you’re not aware of what’s taking up that space, check /var/log/. I’ve had large numbers of errors fill up the syslog and kern.log files and then the disk.
|
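For anyone who prefers a script to a graphical tool, the check suggested above (finding what is filling a directory such as /var/log/) can be sketched with the standard library alone. The function names here are illustrative, and unreadable files are simply skipped, so the totals may undercount when run without sufficient permissions.

```python
import os


def dir_size(path):
    """Total size in bytes of regular files under path (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path, onerror=lambda err: None):
        for name in files:
            full = os.path.join(root, name)
            try:
                if not os.path.islink(full):
                    total += os.path.getsize(full)
            except OSError:
                pass  # file vanished or is unreadable
    return total


def largest_entries(path, top=10):
    """Return up to `top` (size_bytes, name) pairs, largest first."""
    sizes = []
    for entry in os.scandir(path):
        try:
            if entry.is_dir(follow_symlinks=False):
                size = dir_size(entry.path)
            else:
                size = entry.stat(follow_symlinks=False).st_size
        except OSError:
            continue
        sizes.append((size, entry.name))
    return sorted(sizes, reverse=True)[:top]


if __name__ == "__main__":
    for size, name in largest_entries("/var/log"):
        print(f"{size / 1e6:10.1f} MB  {name}")
```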
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
The machine is primarily a BOINC cruncher, so yes - BOINC is allowed to use what it wants. I'm suspicious about those 'other programs', too - especially as my other Linux machine shows a much lower figure. The main difference between them is that I did an in-situ upgrade from Mint 20.3 to 21 not long ago, and the other machine is still at 20.3 - I suspect there may be a lot of rollback files kept 'just in case'. And yes, I'm suspicious of the logs too - especially BOINC writing to the systemd journal, and that upgrade. Next venue for an exploration. I've been watching the disk tab in my peripheral vision, as the test task started. 'Free space for BOINC' fell in steps through 26, 24, 22, 21, 20 as it started, and has stayed there. Now at around 10% progress / 1 hour elapsed. Should have mentioned - machine has 64 GB of physical RAM, in anticipation of some humongous multi-threaded tasks to come. Edit - new upload server won't be certified as 'fit for use' until tomorrow, so I've started Einstein again. |
|
Joined: 18 Jul 13 Posts: 79 Credit: 210,528,292 RAC: 0
|
What other project? |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
"What other project?" Name redacted to save the blushes of the guilty! |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
Looks like this was a false alarm - the probe task finished successfully, and I've started another. Must have been timeshift all along. The nameless project is still hors de combat. The new server is alive and ready, but can't be accessed by BOINC. |
|
Joined: 18 Jul 13 Posts: 79 Credit: 210,528,292 RAC: 0
|
You mean ithena? |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
|
"OK, in that case I will start by removing wandb in the next batch of tasks. Let's see if that improves performance. I will make a post to announce the submission once it is done; it will probably still take a few days, since the latest batch is still being processed." 4 seems to be working fine. abouh, if removing wandb doesn't fix the problem, then adding the env variables listed above with num_threads = 4 will probably be a suitable workaround for everyone. Probably not many hosts with fewer than 4 threads these days.
|
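On the rare host with fewer than 4 hardware threads, a fixed cap of 4 would oversubscribe the CPU. One hedged way to handle that is to clamp the requested cap to what the machine reports. This is a sketch under that assumption, not something the project's run.py is known to do; `thread_cap` is an illustrative name.

```python
import os


def thread_cap(requested=4, available=None):
    """Return the requested thread count, limited to the host's logical CPUs.

    os.cpu_count() reports logical CPUs and may return None, in which
    case a single thread is assumed.
    """
    if available is None:
        available = os.cpu_count() or 1
    return max(1, min(requested, available))
```

So `thread_cap(4, available=2)` gives 2 on a dual-thread host, while a 32-thread machine still gets 4.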
|
Joined: 6 Mar 18 Posts: 38 Credit: 1,340,042,080 RAC: 27
|
Excuse the dumb question, but would that then mean the app would only spin up 4 threads? On Windows, I have manually capped the app to 24 threads and it uses all of them, while my Linux box, capped at 6 threads, has half of them idling. Both seem to take about the same time, though. What is the Windows app doing with all the threads that the Linux app does not need? |
|
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0
|
I have been testing the new script without wandb and with the proposed environ configuration, and it works fine. On my machine performance is similar, but I am looking forward to receiving feedback from other users. I also need to update the PyTorchRL library (our main dependency), so my idea is to follow these steps:
1. Wait for the current batch to finish (currently 3,726 tasks).
2. Update the PyTorchRL library.
3. Send a small batch (20-50 tasks) to PythonGPUBeta with the new code to make sure everything works fine (I have tested locally, but it is always worth sending a test batch to Beta, in my opinion).
4. Send a big batch with the new code to PythonGPU.
The app will be short of tasks for a brief period of time. Even though the new version of PyTorchRL does not have huge changes, I don't want to risk updating it now while 3000+ tasks are still in the queue. I will make a post once I submit the Beta tasks. |
|
Joined: 18 Jul 13 Posts: 79 Credit: 210,528,292 RAC: 0
|
Can a link to http://www.gpugrid.net/apps.php be put next to the Server status link? |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
|
"I have been testing the new script without wandb and with the proposed environ configuration, and it works fine. On my machine performance is similar, but I am looking forward to receiving feedback from other users." Thanks abouh! Looking forward to testing out the new batch.
|
©2025 Universitat Pompeu Fabra