Experimental Python tasks (beta) - task description

abouh

Message 58577 - Posted: 29 Mar 2022, 7:52:03 UTC
Last modified: 29 Mar 2022, 8:01:53 UTC

Interesting that sometimes jobs work and sometimes get stuck on the same machine.

It also seems to me, based on your info, that something remains running at the end of the job and causes the next job to get stuck. Presumably some python thread.

I will see if I can add some code at the end of the task to make sure all python processes are killed and the main program exits correctly. And send another testing round.

Another observation is that this problem does not seem to be OS-dependent, since it happened to STARBASEn on a Linux machine and to Richard on Windows.
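
A minimal sketch of the kind of end-of-task cleanup described above, assuming the task keeps handles to the helper processes it starts. The thread does not say how the real app spawns its processes, so `start_worker` and `shutdown_workers` are hypothetical names and the sleeping child is only a stand-in for a process that would otherwise outlive the job.

```python
import subprocess
import sys


def start_worker():
    # Hypothetical stand-in for a helper process that would outlive the task.
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])


def shutdown_workers(workers, timeout=5.0):
    """Terminate every worker still running and wait until it has exited."""
    for proc in workers:
        if proc.poll() is None:      # None means the process is still running
            proc.terminate()
    for proc in workers:
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()              # escalate if terminate() was ignored
            proc.wait()
    return [proc.returncode for proc in workers]


workers = [start_worker() for _ in range(3)]
exit_codes = shutdown_workers(workers)
print(f"stopped {len(exit_codes)} workers")
```

Waiting on each process after terminating it matters here: the point is that nothing is left running when the next job starts, not merely that a stop signal was sent.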

Richard Haselgrove
Message 58578 - Posted: 29 Mar 2022, 8:45:27 UTC

I've just had task 32876361 fail on a different, but identical, Windows machine. This time, it seems to be explicitly, and simply, a "not enough memory" error - these machines only have 8 GB, which was fine when I bought them. I've suspended the beta programme for the time being, and I'll try to upgrade them.

Bedrich Hajek
Message 58581 - Posted: 29 Mar 2022, 20:42:53 UTC

Another "Disk usage limit exceeded" error:

https://www.gpugrid.net/result.php?resultid=32876568

And a successful one yesterday:

https://www.gpugrid.net/result.php?resultid=32876288



roundup
Message 58582 - Posted: 30 Mar 2022, 14:07:18 UTC
Last modified: 30 Mar 2022, 14:11:23 UTC

After having some errors with recent python app betas, task 32876819 ran without error on a RTX3070 Mobile under Win 11.
A few observations:
- GPU load only between 4% and 8% with a peak between 50% and 70% every 12 seconds.
- The indicated time remaining in the BOINC Client was way off. It started with >7000 (seven thousand) days.
- 15,000 BOINC credits for 102,296 sec runtime. I assume that will be corrected once the python app goes productive. EDIT: The runtime indicated on the GPUGrid site is not correct, it was actually less.

captainjack
Message 58588 - Posted: 31 Mar 2022, 17:33:27 UTC

These tasks seem to run much better on my machines if I allocate 6 CPUs (threads) to each task. I managed to run one by itself and watched the performance monitor for CPU usage. During the initiation phase (about 5 minutes), the task used ~6 CPUs (threads). After the initiation phase, the CPU usage oscillated between ~2 and ~5 threads. The task ran very quickly and has been validated. Please let me know if you have questions.

abouh
Message 58590 - Posted: 1 Apr 2022, 8:59:15 UTC - in response to Message 58582.  

Thanks a lot for the feedback:

- Cyclical GPU load is expected in Reinforcement Learning algorithms. Whenever GPU load is lower, CPU usage should increase. This is the expected behaviour.

- The incorrect time-remaining prediction is an issue. It will only be fixed with time, once the tasks become stable in duration. It may even be necessary to create a new app and use this one only for debugging.

- Credits will also be corrected, yes. For now we will have something similar to what we have in the PythonGPU app.

Starting today I will start sending longer jobs, instead of the super short test jobs I was using just to test the code was working in all OS's and machines.

abouh
Message 58591 - Posted: 1 Apr 2022, 9:05:50 UTC - in response to Message 58588.  

Last batches seem to be working successfully both in Linux and Windows, and also for GPUs with cuda 10 and cuda 11.

My main worry now is whether or not the problem of some jobs getting "stuck" and never being completed persists. It was reported that the reason was that Python was not finishing correctly between jobs, so I added a few changes to the code to try to solve this issue.

Please let me know if you detect this problem in one of your tasks, that would be very helpful!

Incidentally, once the PythonGPUBeta app is stable enough, it will replace the current PythonGPU app, which only works on Linux.

Richard Haselgrove
Message 58592 - Posted: 1 Apr 2022, 9:18:50 UTC - in response to Message 58591.  

It was reported that the reason was that Python was not finishing correctly between jobs, so I added a few changes to the code to try to solve this issue.

Well, that was one report of one task on one machine with limited memory. It seemed to be a case that, if it happened, caused problems for the following task. It's certainly worth looking at, and if it prevents some tasks failing - great. But I'd be cautious about assuming that it was the problem in all cases.

STARBASEn
Message 58593 - Posted: 1 Apr 2022, 23:39:38 UTC

I will see if I can add some code at the end of the task to make sure all python processes are killed and the main program exits correctly. And send another testing round.

Another observation is that this problem does not seem to be OS-dependent, since it happened to STARBASEn on a Linux machine and to Richard on Windows.


I haven't gotten a new beta yet so I will shut off all GPU work with other projects to hopefully get some and help resolve this issue.

STARBASEn
Message 58594 - Posted: 1 Apr 2022, 23:50:44 UTC

One other afterthought re that WU. I had checked my status page here prior to aborting the task. It indicated the task was still in progress, so no disposition had been assigned to the files that I presume were sent back sometime in the past (since the slot was empty). Wonder where they went?

Short Final
Message 58597 - Posted: 4 Apr 2022, 10:56:50 UTC

Can anybody explain the credits policy, please?
My CPUs have been running the Python app relentlessly for up to 7 days for only 50,000 credits. Yet I have received 360,000 credits for ACEMD 3 after only 42,000 secs (11.6 hrs). Bit skewiff; see below:

https://www.gpugrid.net/results.php?userid=562496


Task | Work unit | Computer | Sent | Time reported or deadline | Status | Run time (sec) | CPU time (sec) | Credit | Application
32877811 | 27214361 | 590351 | 1 Apr 2022 9:34:34 UTC | 3 Apr 2022 9:57:48 UTC | Completed and validated | 309,332.50 | 309,332.50 | 50,000.00 | Python apps for GPU hosts beta v1.10 (cuda1131)
32877804 | 27214354 | 581235 | 1 Apr 2022 9:38:33 UTC | 3 Apr 2022 19:38:13 UTC | Completed and validated | 628,304.20 | 628,304.20 | 50,000.00 | Python apps for GPU hosts beta v1.10 (cuda1131)
32876508 | 27207895 | 581235 | 29 Mar 2022 9:50:08 UTC | 1 Apr 2022 4:52:45 UTC | Completed and validated | 101,951.50 | 100,984.90 | 360,000.00 | ACEMD 3: molecular dynamics simulations for GPUs v2.19 (cuda1121)
32876455 | 27213533 | 581235 | 29 Mar 2022 9:17:17 UTC | 29 Mar 2022 9:49:31 UTC | Completed and validated | 12,109.13 | 12,109.13 | 3,000.00 | Python apps for GPU hosts beta v1.09 (cuda1131)
32876341 | 27213457 | 590351 | 29 Mar 2022 4:33:52 UTC | 31 Mar 2022 6:41:54 UTC | Completed and validated | 42,830.17 | 41,435.17 | 360,000.00 | ACEMD 3: molecular dynamics simulations for GPUs v2.19 (cuda1121)
32875459 | 27212897 | 581235 | 27 Mar 2022 2:32:46 UTC | 29 Mar 2022 9:06:58 UTC | Completed and validated | 96,228.49 | 95,544.64 | 360,000.00 | ACEMD 3: molecular dynamics simulations for GPUs v2.19 (cuda1121)

PS: How do I paste a neat image of the above?

Richard Haselgrove
Message 58598 - Posted: 4 Apr 2022, 13:12:34 UTC - in response to Message 58597.  

Please note that other users can't see your entire task list by userid - that's a privacy policy common to all BOINC projects.

The ones you're worried about seem to be Results for host 581235

The one you're specifically asking about - the Python GPU beta v1.10 - was issued on Friday morning and returned on Sunday evening: it was only on your machine for about 58 hours. The run time of 628,304 seconds is misleading (a duplicate of the CPU time) and an error on this website.

Runtime and credit are still being adjusted, and errors are a common feature of beta testing. Sometimes you win, others (like this one) you lose. I'm sure your comments will be noted before testing is complete.
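
The "about 58 hours" figure can be checked from the Sent and Time reported columns of task 32877804 quoted earlier in the thread. A small sketch, assuming only those two timestamps; the function name `elapsed_seconds` is hypothetical. The time a task spends on a host is an upper bound on its real run time, so a reported run time larger than that cannot be right.

```python
from datetime import datetime, timezone

FMT = "%d %b %Y %H:%M:%S"  # e.g. "1 Apr 2022 9:38:33" (English month names)


def elapsed_seconds(sent, reported):
    """Seconds between the 'Sent' and 'Time reported' timestamps (both UTC)."""
    start = datetime.strptime(sent, FMT).replace(tzinfo=timezone.utc)
    end = datetime.strptime(reported, FMT).replace(tzinfo=timezone.utc)
    return (end - start).total_seconds()


# Timestamps and run time for task 32877804, taken from the task list above.
wall = elapsed_seconds("1 Apr 2022 9:38:33", "3 Apr 2022 19:38:13")
reported_run_time = 628_304.20

print(f"time on host: {wall:,.0f} s ({wall / 3600:.1f} h)")
print(f"reported run time exceeds time on host: {reported_run_time > wall}")
```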

Keith Myers
Message 58599 - Posted: 4 Apr 2022, 18:32:01 UTC
Last modified: 4 Apr 2022, 18:33:58 UTC

For some reason I haven't been able to snag any of the Python beta tasks lately.

Just the old stock Python tasks.

A couple of them failed at 30 minutes with the "no progress downloading the Python environment after 1800 seconds" error.

One of the reasons I would like to get the new beta tasks that overcome that issue.

Also found a task at 5 hours and counting at 100% completion and not reporting. Suspended the task and resumed in the hope that would nudge it to report but it just restarted at 10% progress.

[Edit] Looks like the suspend/resume was the trick after all. Uploading now.

abouh
Message 58600 - Posted: 5 Apr 2022, 7:54:43 UTC - in response to Message 58597.  

The credits system is proportional to the amount of compute required to complete each task, like in acemd3.

In acemd3, it is proportional to the complexity of the simulation. In python tasks, which train artificial intelligence reinforcement learning agents, it is proportional to the number of interactions between the agent and its simulated environment required for the agent to learn how to behave in it.

At the moment, we give 2000 credits per 1M interactions, and most tasks require 25M training interactions (except test tasks, which are shorter, normally just 1M). Therefore, completing a task gives 50000 credits, or 75000 if completed especially fast.

Note that we are in beta phase, and while the credit difference between acemd and pythonGPU jobs should not be huge, we might need to adjust the credits given per 1M interactions to make them equivalent.
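
The credit arithmetic above can be written out as a short sketch. The rate of 2000 credits per 1M interactions and the 25M and 1M task sizes come from this post; the 1.5x multiplier for fast completion is an assumption inferred from the 75000 / 50000 figures, and `task_credits` is a hypothetical name, not the project's actual code.

```python
CREDITS_PER_MILLION = 2000   # stated in the post: 2000 credits per 1M interactions
FAST_BONUS = 1.5             # assumption: inferred from 75000 / 50000


def task_credits(interactions, fast=False):
    """Credits for a task that needs the given number of training interactions."""
    base = CREDITS_PER_MILLION * interactions / 1_000_000
    return base * FAST_BONUS if fast else base


print(task_credits(25_000_000))             # standard task
print(task_credits(25_000_000, fast=True))  # standard task, completed fast
print(task_credits(1_000_000))              # short test task
```

Under this scheme credit depends only on the number of interactions, not on how long a particular host took, which is why a slow host sees fewer credits per hour for the same task.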

abouh
Message 58601 - Posted: 5 Apr 2022, 8:04:42 UTC - in response to Message 58599.  
Last modified: 5 Apr 2022, 8:04:59 UTC

Batches of both pythonGPU and pythonGPUBeta are being sent out this week. Hopefully pythonGPUBeta tasks will run without issues.

We want to wait a bit more in case more bugs are detected, but we will soon update the pythonGPU app with the code from PythonGPUBeta, which seems to work well now. As mentioned, it does not have the problem of installing conda every time (instead, it downloads the packed environment only the first time). It also works on both Linux and Windows.

At that point we will keep PythonGPUBeta only for testing.

bcavnaugh
Message 58602 - Posted: 5 Apr 2022, 16:52:21 UTC
Last modified: 5 Apr 2022, 16:53:43 UTC

So far some ran well while others ran for 2 and 3 days.
I did abort the ones that were still running after 3 days.
I will pick back up in the Fall and I hope to see good running tasks on my GPUs.
For now I am waiting for new 3 & 4 on two of my hosts. It is a real bummer that our hosts have to sit for days on end without getting any tasks.

Keith Myers
Message 58603 - Posted: 5 Apr 2022, 17:38:10 UTC

Looks like the standard BOINC mechanism at work: complain in a forum post about some topic and the BOINC genies grant your wish.

Been getting nothing but solid Python beta tasks now for the past couple of days.

WR-HW95
Message 58604 - Posted: 5 Apr 2022, 18:04:48 UTC

I have serious problems with my other machine running a 1080Ti.
So far, out of 20 tasks over the past 2 weeks, the best one ran for around 38 secs before erroring.
I tried to underpower + underclock core and mem, still the same result at around the same time.
This is the result of the last one.
"<core_client_version>7.16.20</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
10:11:26 (15136): wrapper (7.9.26016): starting
10:11:26 (15136): wrapper: running bin/acemd3.exe (--boinc --device 0)
10:11:29 (15136): bin/acemd3.exe exited; CPU time 0.000000
10:11:29 (15136): app exit status: 0xc0000135
10:11:29 (15136): called boinc_finish(195)"

Is there something wrong with the newer NVIDIA drivers?
The only difference between the machine that works and the one that doesn't, besides the CPU (3900X and 5900X), is the gfx driver version.
Machine that runs tasks has driver 496.49.
Machine that fails tasks has driver 511.79.

Ian&Steve C.
Message 58605 - Posted: 5 Apr 2022, 19:06:38 UTC - in response to Message 58604.  

I have serious problems with my other machine running a 1080Ti.
So far, out of 20 tasks over the past 2 weeks, the best one ran for around 38 secs before erroring.
I tried to underpower + underclock core and mem, still the same result at around the same time.
This is the result of the last one.
"<core_client_version>7.16.20</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
10:11:26 (15136): wrapper (7.9.26016): starting
10:11:26 (15136): wrapper: running bin/acemd3.exe (--boinc --device 0)
10:11:29 (15136): bin/acemd3.exe exited; CPU time 0.000000
10:11:29 (15136): app exit status: 0xc0000135
10:11:29 (15136): called boinc_finish(195)"

Is there something wrong with the newer NVIDIA drivers?
The only difference between the machine that works and the one that doesn't, besides the CPU (3900X and 5900X), is the gfx driver version.
Machine that runs tasks has driver 496.49.
Machine that fails tasks has driver 511.79.


You can try changing the driver back and see; it's an easy troubleshooting step. It's definitely possible that the driver is the cause.

But you seem to be having an issue with the ACEMD3 tasks; this thread is about the Python tasks.
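
For anyone decoding the log quoted above, here is a small sketch that turns the two numeric codes into names. The lookup table is an assumption added for illustration and is not stated anywhere in this thread: 0xC0000135 is the Windows NTSTATUS value STATUS_DLL_NOT_FOUND, and 195 should be EXIT_CHILD_FAILED in the BOINC sources. If that reading is right, acemd3.exe exited because a DLL it needs could not be loaded, which may or may not be related to the driver version.

```python
# Assumed names for the two codes in the quoted log; not taken from this thread.
KNOWN_CODES = {
    0xC0000135: "STATUS_DLL_NOT_FOUND (Windows NTSTATUS)",
    195: "EXIT_CHILD_FAILED (BOINC)",
}


def describe(code_text):
    """Parse a decimal or 0x-prefixed code and look up its assumed name."""
    code = int(code_text, 0)   # base 0 accepts both "195" and "0xc3"
    name = KNOWN_CODES.get(code, "unknown")
    return f"{code_text} = {code} -> {name}"


for text in ("0xc0000135", "0xc3", "195"):
    print(describe(text))
```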


WR-HW95
Message 58606 - Posted: 5 Apr 2022, 21:38:04 UTC - in response to Message 58605.  

Sorry for posting in the wrong thread.
Changed drivers to 496.49 on the other machine too... now I just have to wait to get some work to see if it works.

Personally, I was really hoping, when new things were coming, that this project would ditch CUDA at last and move to OpenCL.

No project that I have crunched on OpenCL has had extended issues like this. And most of those projects run on AMD cards too.

©2025 Universitat Pompeu Fabra