Experimental Python tasks (beta) - task description
Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
Interesting that sometimes jobs work and sometimes get stuck on the same machine. It also seems to me, based on your info, that something remains running at the end of the job and causes the next job to get stuck; presumably some Python thread. I will see if I can add some code at the end of the task to make sure all Python processes are killed and the main program exits correctly, and send another testing round. Another observation is that this problem does not seem to be OS-dependent, since it happened to STARBASEn on a Linux machine and to Richard on Windows.
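For illustration, a minimal sketch of that kind of end-of-task cleanup, assuming the psutil package is available in the task environment (the function name and structure are hypothetical, not the project's actual code):

```python
import os
import sys

import psutil  # assumed to be available in the task's conda environment

def kill_children_and_exit(exit_code: int = 0) -> None:
    """Terminate any lingering child processes, then exit the main program."""
    parent = psutil.Process(os.getpid())
    children = parent.children(recursive=True)

    # Ask children to terminate gracefully first.
    for child in children:
        child.terminate()

    # Give them a few seconds, then force-kill any survivors.
    _, alive = psutil.wait_procs(children, timeout=5)
    for child in alive:
        child.kill()

    sys.exit(exit_code)
```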
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 351
I've just had task 32876361 fail on a different, but identical, Windows machine. This time it seems to be explicitly, and simply, a "not enough memory" error; these machines only have 8 GB of RAM, which was fine when I bought them. I've suspended the beta programme for the time being, and I'll try to upgrade them.
Joined: 28 Mar 09 · Posts: 490 · Credit: 11,731,645,728 · RAC: 57
Another "Disk usage limit exceeded" error: https://www.gpugrid.net/result.php?resultid=32876568 And a successful one yesterday: https://www.gpugrid.net/result.php?resultid=32876288 |
Joined: 11 May 10 · Posts: 68 · Credit: 12,293,491,875 · RAC: 2,606
After having some errors with recent Python app betas, task 32876819 ran without error on an RTX 3070 Mobile under Windows 11. A few observations:
- GPU load was only between 4% and 8%, with a peak between 50% and 70% every 12 seconds.
- The indicated time remaining in the BOINC client was way off; it started at more than 7,000 (seven thousand) days.
- 15,000 BOINC credits for a 102,296 sec runtime. I assume that will be corrected once the Python app goes productive.
EDIT: The runtime indicated on the GPUGrid site is not correct; it was actually less.
Joined: 9 May 13 · Posts: 171 · Credit: 4,594,296,466 · RAC: 140
These tasks seem to run much better on my machines if I allocate 6 CPUs (threads) to each task. I managed to run one by itself and watched the performance monitor for CPU usage. During the initiation phase (about 5 minutes), the task used ~6 CPUs (threads). After the initiation phase, CPU usage oscillated between ~2 and ~5 threads. The task ran very quickly and has been validated. Please let me know if you have questions.
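In BOINC, reserving CPU threads per GPU task is usually done with an app_config.xml placed in the project directory. A minimal sketch along the lines of the allocation described above; the app name below is an assumption and should be checked against the <app> entries in client_state.xml:

```xml
<!-- Sketch of an app_config.xml for the GPUGRID project directory.
     The app name is an assumption; verify it in client_state.xml. -->
<app_config>
  <app>
    <name>PythonGPUbeta</name>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>   <!-- one task per GPU -->
      <cpu_usage>6.0</cpu_usage>   <!-- budget ~6 CPU threads per task -->
    </gpu_versions>
  </app>
</app_config>
```

After saving the file, use the client's "Read config files" option (or restart BOINC) for it to take effect.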
Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
Thanks a lot for the feedback:
- Cyclical GPU load is expected in reinforcement learning algorithms; whenever GPU load is lower, CPU usage should increase. That behaviour is correct (see the sketch below).
- The incorrect time-remaining prediction is an issue... it will only be fixed with time, once the tasks become stable in duration. It may even be necessary to create a new app and use this one only for debugging.
- Credits will also be corrected, yes; for now we will have something similar to what we have in the PythonGPU app.
Starting today I will send longer jobs, instead of the super-short test jobs I was using just to check that the code works on all OSes and machines.
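To illustrate why the GPU load oscillates: on-policy reinforcement learning alternates between CPU-bound environment interaction and a batched, GPU-bound policy update. A toy PyTorch sketch of that pattern (illustrative only; the environment, network, and loss are stand-ins, not GPUGRID's actual training code):

```python
import torch

# Toy policy network; a real agent would be much larger.
device = "cuda" if torch.cuda.is_available() else "cpu"
policy = torch.nn.Linear(8, 4).to(device)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def env_step(action):
    """Stand-in for a CPU-side simulated environment step."""
    return torch.randn(8), float(action == 0)

obs = torch.randn(8)
for update in range(10):
    # Phase 1: collect a rollout with many small CPU-side environment steps.
    # The GPU only runs tiny forward passes here, so its load drops.
    observations, rewards = [], []
    for _ in range(128):
        with torch.no_grad():
            logits = policy(obs.to(device))
        action = int(torch.argmax(logits))
        obs, reward = env_step(action)
        observations.append(obs)
        rewards.append(reward)

    # Phase 2: one large batched update on the GPU, producing the periodic
    # spike in GPU utilisation. The loss is a placeholder, not a real
    # reinforcement learning objective.
    batch = torch.stack(observations).to(device)
    returns = torch.tensor(rewards, device=device)
    loss = -(policy(batch).logsumexp(dim=1) * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```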
Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
The last batches seem to be working successfully on both Linux and Windows, and for GPUs with both CUDA 10 and CUDA 11. My main worry now is whether the problem of some jobs getting "stuck" and never being completed persists. It was reported that the cause was Python not finishing correctly between jobs, so I added a few changes to the code to try to solve this issue. Please let me know if you detect this problem in one of your tasks; that would be very helpful! Incidentally, once the PythonGPUBeta app is stable enough, it will replace the current PythonGPU app, which only works on Linux.
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 351
"It was reported that the cause was Python not finishing correctly between jobs, so I added a few changes to the code to try to solve this issue."
Well, that was one report of one task on one machine with limited memory. It seemed to be a case that, if it happened, caused problems for the following task. It's certainly worth looking at, and if it prevents some tasks failing, great. But I'd be cautious about assuming that it was the problem in all cases.
Joined: 17 Feb 09 · Posts: 91 · Credit: 1,603,303,394 · RAC: 0
"I will see if I can add some code at the end of the task to make sure all Python processes are killed and the main program exits correctly, and send another testing round."
I haven't gotten a new beta yet, so I will shut off all GPU work for other projects in the hope of getting some and helping resolve this issue.
Joined: 17 Feb 09 · Posts: 91 · Credit: 1,603,303,394 · RAC: 0
One other afterthought regarding that WU: I had checked my status page here prior to aborting the task. It indicated the task was still in progress, so no disposition had been assigned to the files that I presume were sent back sometime in the past (since the slot was empty). I wonder where they went?
Joined: 26 May 20 · Posts: 4 · Credit: 187,447,627 · RAC: 4
Can anybody explain the credits policy, please? My CPUs have been running the Python app relentlessly for up to 7 days for only 50,000 credits, yet I received 360,000 credits for ACEMD 3 after only 42,000 secs (11.6 hrs). A bit skew-whiff... see below: https://www.gpugrid.net/results.php?userid=562496

| Task | Work unit | Computer | Sent | Time reported or deadline | Status | Run time (sec) | CPU time (sec) | Credit | Application |
|---|---|---|---|---|---|---|---|---|---|
| 32877811 | 27214361 | 590351 | 1 Apr 2022, 9:34:34 UTC | 3 Apr 2022, 9:57:48 UTC | Completed and validated | 309,332.50 | 309,332.50 | 50,000.00 | Python apps for GPU hosts beta v1.10 (cuda1131) |
| 32877804 | 27214354 | 581235 | 1 Apr 2022, 9:38:33 UTC | 3 Apr 2022, 19:38:13 UTC | Completed and validated | 628,304.20 | 628,304.20 | 50,000.00 | Python apps for GPU hosts beta v1.10 (cuda1131) |
| 32876508 | 27207895 | 581235 | 29 Mar 2022, 9:50:08 UTC | 1 Apr 2022, 4:52:45 UTC | Completed and validated | 101,951.50 | 100,984.90 | 360,000.00 | ACEMD 3: molecular dynamics simulations for GPUs v2.19 (cuda1121) |
| 32876455 | 27213533 | 581235 | 29 Mar 2022, 9:17:17 UTC | 29 Mar 2022, 9:49:31 UTC | Completed and validated | 12,109.13 | 12,109.13 | 3,000.00 | Python apps for GPU hosts beta v1.09 (cuda1131) |
| 32876341 | 27213457 | 590351 | 29 Mar 2022, 4:33:52 UTC | 31 Mar 2022, 6:41:54 UTC | Completed and validated | 42,830.17 | 41,435.17 | 360,000.00 | ACEMD 3: molecular dynamics simulations for GPUs v2.19 (cuda1121) |
| 32875459 | 27212897 | 581235 | 27 Mar 2022, 2:32:46 UTC | 29 Mar 2022, 9:06:58 UTC | Completed and validated | 96,228.49 | 95,544.64 | 360,000.00 | ACEMD 3: molecular dynamics simulations for GPUs v2.19 (cuda1121) |

PS: How do I paste a neat image of the above?
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 351
Please note that other users can't see your entire task list by userid; that's a privacy policy common to all BOINC projects. The ones you're worried about seem to be the results for host 581235. The one you're specifically asking about, the Python GPU beta v1.10, was issued on Friday morning and returned on Sunday evening: it was only on your machine for about 58 hours. The run time of 628,304 seconds is misleading (it's a duplicate of the CPU time) and an error on this website. Runtime and credit are still being adjusted, and errors are a common feature of beta testing. Sometimes you win, other times (like this one) you lose. I'm sure your comments will be noted before testing is complete.
Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 731
For some reason I haven't been able to snag any of the Python beta tasks lately, just the old stock Python tasks. A couple of them failed at 30 minutes with the "no progress downloading the Python environment after 1800 seconds" error, which is one of the reasons I would like to get the new beta tasks that overcome that issue. I also found a task at 5 hours and counting, sitting at 100% completion and not reporting. I suspended and resumed the task in the hope that would nudge it to report, but it just restarted at 10% progress.
[Edit] Looks like the suspend/resume was the trick after all. Uploading now.
Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
The credits system is proportional to the amount of compute required to complete each task, as in acemd3. In acemd3 it is proportional to the complexity of the simulation. In the Python tasks, which train artificial-intelligence reinforcement learning agents, it is proportional to the number of interactions between the agent and its simulated environment that are required for the agent to learn how to behave in it. At the moment we give 2,000 credits per 1M interactions, and most tasks require 25M training interactions (except test tasks, which are shorter, normally just 1M). Therefore, completing a task gives 50,000 credits, or 75,000 if completed especially fast. Note that we are in a beta phase; while the credit difference between acemd3 and PythonGPU jobs should not be huge, we might need to adjust the credits given per 1M interactions to make them equivalent.
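A quick sketch of that arithmetic (the 1.5x fast-completion multiplier is inferred from the 75,000 figure above, not stated explicitly by the project):

```python
# Credit estimate from the figures in this post: 2,000 credits per 1M
# agent-environment interactions, with a bonus for fast returns.
CREDITS_PER_MILLION_INTERACTIONS = 2_000
FAST_RETURN_BONUS = 1.5  # inferred from 75,000 / 50,000; an assumption

def task_credit(interactions_millions: float, fast_return: bool = False) -> float:
    base = CREDITS_PER_MILLION_INTERACTIONS * interactions_millions
    return base * FAST_RETURN_BONUS if fast_return else base

print(task_credit(25))        # 50000   -> standard 25M-interaction task
print(task_credit(25, True))  # 75000.0 -> returned especially fast
print(task_credit(1))         # 2000    -> short 1M-interaction test task
```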
Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
Batches of both PythonGPU and PythonGPUBeta tasks are being sent out this week; hopefully the PythonGPUBeta tasks will run without issues. We want to wait a bit longer in case more bugs are detected, but we will soon update the PythonGPU app with the code from PythonGPUBeta, which seems to work well now. As mentioned, it does not have the problem of installing conda every time (instead it downloads the packed environment only the first time), and it works on both Linux and Windows. At that point we will keep PythonGPUBeta only for testing.
bcavnaugh · Joined: 8 Nov 13 · Posts: 56 · Credit: 1,002,640,163 · RAC: 0
So far some tasks run well, while others ran for 2 or 3 days; I aborted the ones that were still running after 3 days. I will pick back up in the Fall and hope to see good-running tasks on my GPUs. For now I am waiting for new 3 & 4 on two of my hosts; it is a real bummer that our hosts have to sit for days on end without getting any tasks.
Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 731
Looks like the standard BOINC mechanism: complain in a post on the forums about some topic, and the BOINC genies grant your wish. I've been getting nothing but solid Python beta tasks for the past couple of days.
Joined: 16 Dec 08 · Posts: 7 · Credit: 1,549,469,403 · RAC: 0
I have serious problems with my other machine running a 1080 Ti. So far, out of 20 tasks over the past 2 weeks, the best one ran for around 38 seconds before erroring. I tried underpowering and underclocking the core and memory, with the same result at around the same time. This is the output of the last one:

<core_client_version>7.16.20</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
10:11:26 (15136): wrapper (7.9.26016): starting
10:11:26 (15136): wrapper: running bin/acemd3.exe (--boinc --device 0)
10:11:29 (15136): bin/acemd3.exe exited; CPU time 0.000000
10:11:29 (15136): app exit status: 0xc0000135
10:11:29 (15136): called boinc_finish(195)

Is there something wrong with the newer NVIDIA drivers? The only difference between the machine that works and the one that doesn't, besides the CPU (3900X vs 5900X), is the graphics driver version. The machine that runs tasks has driver 496.49; the machine that fails tasks has driver 511.79.
Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 5,269
"I have serious problems with my other machine running 1080Ti."
You can try changing the driver back and see? It's an easy troubleshooting step, and it's definitely possible that it's the driver. But you seem to be having an issue with the ACEMD3 tasks, and this thread is about the Python tasks.
Joined: 16 Dec 08 · Posts: 7 · Credit: 1,549,469,403 · RAC: 0
Sorry for posting in the wrong thread. I changed the drivers to 496.49 on the other machine too... now I just have to wait for some work to see whether it works. Personally, when new things were coming, I was really hoping this project would finally ditch CUDA and move to OpenCL. No project that I have crunched with OpenCL has had extended issues like this, and most of those projects run on AMD cards too.