Experimental Python tasks (beta)

Message boards : News : Experimental Python tasks (beta)
Message board moderation

To post messages, you must log in.

Previous · 1 · 2 · 3 · 4 · 5 . . . 6 · Next

AuthorMessage
Keith Myers
Avatar

Send message
Joined: 13 Dec 17
Posts: 1416
Credit: 9,119,446,190
RAC: 678,713
Level
Tyr
Scientific publications
watwatwatwatwat
Message 55958 - Posted: 10 Dec 2020, 19:05:05 UTC - in response to Message 55955.  

There's an explanation for 20 credit tasks over at Rosetta.
Has to do with a task being interrupted in calculation and restarted if I remember correctly.
ID: 55958 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Keith Myers
Avatar

Send message
Joined: 13 Dec 17
Posts: 1416
Credit: 9,119,446,190
RAC: 678,713
Level
Tyr
Scientific publications
watwatwatwatwat
Message 55959 - Posted: 10 Dec 2020, 19:07:58 UTC - in response to Message 55957.  
Last modified: 10 Dec 2020, 19:15:47 UTC

what kind of BOINC install do you have? does it run as a service? or a standalone install that runs from an executable?


That was one of the questions I wanted to ask Mr. Kevvy in the case he seems to be the first cruncher to successfully crunch a ton of them without errors.

I wondered if his BOINC was a service install or a standalone.

[Edit] OK, so Mr. Kevvy is still using the AIO. I wondered since a lot of our team seem to have dropped the AIO and gone back to the service install.

So, then likely the main difference is that Mr. Kevvy is using the older glibc 2.29 instead of the glibc 2.31 that we Ubuntu 20 users are running.
ID: 55959 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Ian&Steve C.

Send message
Joined: 21 Feb 20
Posts: 1114
Credit: 40,838,348,595
RAC: 4,765,598
Level
Trp
Scientific publications
wat
Message 55961 - Posted: 10 Dec 2020, 19:17:41 UTC - in response to Message 55959.  

I'm almost positive he's running a standalone install.
ID: 55961 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
biodoc

Send message
Joined: 26 Aug 08
Posts: 183
Credit: 10,085,929,375
RAC: 0
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55962 - Posted: 10 Dec 2020, 19:28:54 UTC - in response to Message 55957.  

I've got one running now on an RTX 2070S and the only real issue is low GPU utilization (60-70%). The current task is using ~2 GB of VRAM and ~3 GB of system RAM. I have one thread free on a ryzen 3900X to support the GPU and that thread is running at 100%. This computer has complete 3 of the new python tasks successfully.

Linux Mint 20; Driver Version: 440.95.01; CUDA Version: 10.2


what kind of BOINC install do you have? does it run as a service? or a standalone install that runs from an executable?

what is the clock speed of your 3900X and memory speed as well?

try letting there be 2 spare free threads (so you have one doing nothing) to avoid maxing out the CPU to 100% utilization on all threads. this is known to slow down GPU work. this might increase your GPU utilization a bit.


Boinc runs as a service and was installed from the Mint repository (version 17.16.6). The CPU clock speed is 3.9 GHz and the RAM is DDR4 3200 CL16. I did free up another thread but I didn't see an obvious difference in GPU utilization.
ID: 55962 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Ian&Steve C.

Send message
Joined: 21 Feb 20
Posts: 1114
Credit: 40,838,348,595
RAC: 4,765,598
Level
Trp
Scientific publications
wat
Message 55963 - Posted: 10 Dec 2020, 19:31:00 UTC - in response to Message 55959.  
Last modified: 10 Dec 2020, 19:39:47 UTC

So, then likely the main difference is that Mr. Kevvy is using the older glibc 2.29 instead of the glibc 2.31 that we Ubuntu 20 users are running.


difference in what sense?

you and I both have glibc 2.31 and we both have a bunch of successful completions. looks like Kevvy's Ubuntu 20 systems also have 2.31. all of us with these Ubuntu 20.04 systems have successful completions.

but of all of his Linux Mint (based on Ubuntu 19) systems, none have completed a single Python task successfully. I'm not sure if it's a problem with Linux Mint or what. I'm not sure its necessarily anything to do with the GLIBC since his error messages are varied, and none mention GLIBC as being the cause. It could just be that the app has some bugs to work out when running in different environments. I also don't know if he's using service installs on his Mint systems, he's got a lot of different BOINC versions across all his systems.
ID: 55963 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Ian&Steve C.

Send message
Joined: 21 Feb 20
Posts: 1114
Credit: 40,838,348,595
RAC: 4,765,598
Level
Trp
Scientific publications
wat
Message 55964 - Posted: 10 Dec 2020, 19:36:51 UTC - in response to Message 55962.  

Boinc runs as a service and was installed from the Mint repository (version 17.16.6). The CPU clock speed is 3.9 GHz and the RAM is DDR4 3200 CL16. I did free up another thread but I didn't see an obvious difference in GPU utilization.


thanks for the clarification. it was worth a shot on the GPU utilization with the free thread, low hanging fruit.

I run my memory at 3600 CL14, but I've never seen memory matter that much even for CPU tasks on other projects, let alone GPU tasks. (I saw no difference when changing from 3200CL16 to 3600CL14), but anything's possible I guess.

ID: 55964 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
biodoc

Send message
Joined: 26 Aug 08
Posts: 183
Credit: 10,085,929,375
RAC: 0
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55965 - Posted: 10 Dec 2020, 19:44:14 UTC - in response to Message 55963.  

So, then likely the main difference is that Mr. Kevvy is using the older glibc 2.29 instead of the glibc 2.31 that we Ubuntu 20 users are running.


difference in what sense?

you and I both have glibc 2.31 and we both have a bunch of successful completions. looks like Kevvy's Ubuntu 20 systems also have 2.31. all of us with these Ubuntu 20.04 systems have successful completions.

but of all of his Linux Mint (based on Ubuntu 19) systems, none have completed a single Python task successfully. I'm not sure if it's a problem with Linux Mint or what. I'm not sure its necessarily anything to do with the GLIBC since his error messages are varied, and none mention GLIBC as being the cause. It could just be that the app has some bugs to work out when running in different environments.


Mint 20 is based on Ubuntu 20.04 and has glibc 2.31. The 2 computers I have running GPUGrid have Mint 20 installed and the RTX cards on those computers are completing the new python tasks successfully.
ID: 55965 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Ian&Steve C.

Send message
Joined: 21 Feb 20
Posts: 1114
Credit: 40,838,348,595
RAC: 4,765,598
Level
Trp
Scientific publications
wat
Message 55966 - Posted: 10 Dec 2020, 20:01:37 UTC - in response to Message 55965.  
Last modified: 10 Dec 2020, 20:02:13 UTC

Mint 20 is based on Ubuntu 20.04 and has glibc 2.31. The 2 computers I have running GPUGrid have Mint 20 installed and the RTX cards on those computers are completing the new python tasks successfully.


Yes, I know. But my point was that there are many differences between Mint 19 and 20, not just GLIBC version, and usually when GLIBC is an issue that shows up as the reason for the error in the task results, but that hasn't been the case.

and conversely we have several examples of tasks hitting Ubuntu 20.04 systems with GLIBC of 2.31 and they still fail.

I think it's just buggy.
ID: 55966 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Keith Myers
Avatar

Send message
Joined: 13 Dec 17
Posts: 1416
Credit: 9,119,446,190
RAC: 678,713
Level
Tyr
Scientific publications
watwatwatwatwat
Message 55969 - Posted: 10 Dec 2020, 22:05:44 UTC - in response to Message 55966.  

Yes, I had over a half dozen failed tasks before the first successful task.
Why I was wondering if the failed tasks report the failed configuration upstream and change the future task configuration.

Pretty sure lots of prerequisite software is downloaded first from conda and configured on the system before finally actually starting real crunching.

And the configuration downloads happen for each task I think.

Not just some initial download and from then on all the file are static.

ID: 55969 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Ian&Steve C.

Send message
Joined: 21 Feb 20
Posts: 1114
Credit: 40,838,348,595
RAC: 4,765,598
Level
Trp
Scientific publications
wat
Message 55996 - Posted: 12 Dec 2020, 20:10:00 UTC

FYI, these tasks don't checkpoint properly.

if you need to stop BOINC or the system experiences a power outage, the tasks restart from the beginning (10%) but the task timer still tracks from where it left off even though the task restarted. if the tasks were short like MDAD (but MDAD checkpoints properly) it wouldn't be a huge problem. but when they run for 4-5hrs and need to start over for any interruption, it's a bit of a kick in the pants. even worse when these restarted tasks only get 20cred for up to 2x total run time. not worth finishing it at that point.

additionally as has been mentioned in the other thread, these tasks wreak havoc on the system's DCF since it seems to be set incorrectly for these tasks. you get these tasks that make boinc thing they will complete in 10 seconds, and they end up taking 4hrs, so BOINC counters with inflating the run time of normal tasks to 10+ days when they only take 20-40 min lol. and it swings wildly back and forth depending how many of each type you've completed.

and credit reward, other than being about 10x normal for tasks of this runtime, seems only tied to FLOPS and runtime without accounting for efficiency at all.

my 3900X/2080ti completes tasks on average much faster than my EPYC/2080ti system since the 3900X system is running higher GPU utilization allowing faster run times. but the 3900X system earns proportionally less credit. so both systems end up earning the same amount of credit per card. the 3900X/2080ti should be earning more credit since it's doing more tasks. reward is being overinflated for tasks that have longer run time due to inefficiency. it seems tied only to raw runtime and estimated flops. i understand that tasks can have varying run times, but if you wont account for efficiency you need to have a static reward not dependent on runtime at all. for reference, a static reward of about 175,000 would, on average, bring these tasks near the MDAD for cred/unit-time.
ID: 55996 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Greger

Send message
Joined: 6 Jan 15
Posts: 76
Credit: 25,499,534,331
RAC: 0
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwat
Message 55997 - Posted: 12 Dec 2020, 22:03:58 UTC
Last modified: 12 Dec 2020, 22:06:56 UTC

My host switch to another project task then resume and after a while i had to update system and restart. So it indeed fail to resume from last state so it looks like checkpoint was far behind or no checkpoint at all. Time stay at around 2 hour which was hours behind and est percentage locked at 10%

I aborted it next day as it reached 14 hours.

https://www.gpugrid.net/result.php?resultid=31701824

I would expect it not to be fully working and checkpoint added later on. There is much testing of this but low info for us still so we need to take it for what it is and deal with it if the don't work.
ID: 55997 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Send message
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Level
Ser
Scientific publications
watwatwatwat
Message 56007 - Posted: 15 Dec 2020, 15:52:05 UTC - in response to Message 55997.  

The Python app runs ACEMD, but uses additional libraries to compute additional force terms. These libraries are distributed as Conda (Python) packages.

For this to work, I had to make an App which installs a self-contained Conda install in the project dir. The installation is re-used from one run to the other.

This is rather finicky (for example, downloads are large, and I have to be careful with concurrent installs).

Two outstanding issues are over-crediting (I am using some default BOINC formula) and, as far as i understand, the flops estimate (?).
ID: 56007 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
zombie67 [MM]

Send message
Joined: 16 Jul 07
Posts: 209
Credit: 5,496,860,456
RAC: 8,582,660
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56008 - Posted: 15 Dec 2020, 17:04:43 UTC - in response to Message 56007.  

Two outstanding issues are over-crediting (I am using some default BOINC formula) and, as far as i understand, the flops estimate (?).


Over-crediting? I am seeing the opposite problem.

https://www.gpugrid.net/result.php?resultid=31902208

20.83 credits for 4.5 hours of run time on an RTX 2080 Ti. That is practically nothing. And this is not a one-off. All my tasks so far are similar.
Reno, NV
Team: SETI.USA
ID: 56008 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Richard Haselgrove

Send message
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 326,008
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56009 - Posted: 15 Dec 2020, 17:14:18 UTC - in response to Message 56007.  

Thanks for the details.

The flops estimate

Yes, the "size" of the tasks, as expressed by <rsc_fpops_est> in the workunit template. The current value is 3,000 GFLOPS: all other GPUGrid task types are are 5,000,000 GFLOPS.

An App which installs a self-contained Conda install

We are encountering an unfortunate clash with the security of BOINC running as a systemd service under Linux. Useful bits of BOINC (pausing computation when the computer's user is active on the mouse or keyboard) rely on having access to the public /tmp/ folder structure. The conda installer wants to make use of a temporary folder.

systemd allows us to have either public tmp folders (read only, for security), or private tmp folders (write access). But not both at the same time. We're exploring how to get the best of both worlds...

Discussions in
https://www.gpugrid.net/forum_thread.php?id=5204
https://github.com/BOINC/boinc/issues/4125

over-crediting

We're enjoying it while it lasts!
ID: 56009 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Richard Haselgrove

Send message
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 326,008
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56010 - Posted: 15 Dec 2020, 17:19:51 UTC - in response to Message 56008.  

Over-crediting?

OK, make that 'inconsistent crediting'. Mine are all in the 600,000 - 900,000 range, for much the same runtime on a 1660 Ti.

Host 508381
ID: 56010 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Ian&Steve C.

Send message
Joined: 21 Feb 20
Posts: 1114
Credit: 40,838,348,595
RAC: 4,765,598
Level
Trp
Scientific publications
wat
Message 56011 - Posted: 15 Dec 2020, 17:50:02 UTC - in response to Message 56010.  
Last modified: 15 Dec 2020, 18:13:03 UTC

Over-crediting?

OK, make that 'inconsistent crediting'. Mine are all in the 600,000 - 900,000 range, for much the same runtime on a 1660 Ti.

Host 508381


the 20 credits thing seems to only happen with restarted tasks from what ive seen. not sure if anything else triggers it.

but I can say with certainty that the credit allocation is "questionable", and only appears to be related to the flops of device 0 in BOINC, as well as runtime. slow devices masked behind a fast device0 will earn credit at the rate of the faster device...
ID: 56011 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Ian&Steve C.

Send message
Joined: 21 Feb 20
Posts: 1114
Credit: 40,838,348,595
RAC: 4,765,598
Level
Trp
Scientific publications
wat
Message 56012 - Posted: 15 Dec 2020, 17:53:40 UTC - in response to Message 56008.  

Two outstanding issues are over-crediting (I am using some default BOINC formula) and, as far as i understand, the flops estimate (?).


Over-crediting? I am seeing the opposite problem.

https://www.gpugrid.net/result.php?resultid=31902208

20.83 credits for 4.5 hours of run time on an RTX 2080 Ti. That is practically nothing. And this is not a one-off. All my tasks so far are similar.


this happens when the task is interrupted. started and resumed. you can't interrupt these tasks at all.
ID: 56012 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Richard Haselgrove

Send message
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 326,008
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56013 - Posted: 15 Dec 2020, 18:09:15 UTC

We should perhaps mention the lack of effective checkpointing while we have Toni's attention. Even though the tasks claim to checkpoint every 0.9% (after the initial 10% allowed for the setup), the apps are unable to resume from the point previously reached.
ID: 56013 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
zombie67 [MM]

Send message
Joined: 16 Jul 07
Posts: 209
Credit: 5,496,860,456
RAC: 8,582,660
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56014 - Posted: 15 Dec 2020, 18:22:52 UTC - in response to Message 56012.  

Over-crediting? I am seeing the opposite problem.

https://www.gpugrid.net/result.php?resultid=31902208

20.83 credits for 4.5 hours of run time on an RTX 2080 Ti. That is practically nothing. And this is not a one-off. All my tasks so far are similar.


this happens when the task is interrupted. started and resumed. you can't interrupt these tasks at all.


I'll check that out. But I have not suspended or otherwise interrupted any tasks. Unless BOINC is doing that without my knowledge. But I don't think so.
Reno, NV
Team: SETI.USA
ID: 56014 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Ian&Steve C.

Send message
Joined: 21 Feb 20
Posts: 1114
Credit: 40,838,348,595
RAC: 4,765,598
Level
Trp
Scientific publications
wat
Message 56015 - Posted: 15 Dec 2020, 18:28:46 UTC - in response to Message 56014.  

Over-crediting? I am seeing the opposite problem.

https://www.gpugrid.net/result.php?resultid=31902208

20.83 credits for 4.5 hours of run time on an RTX 2080 Ti. That is practically nothing. And this is not a one-off. All my tasks so far are similar.


this happens when the task is interrupted. started and resumed. you can't interrupt these tasks at all.


I'll check that out. But I have not suspended or otherwise interrupted any tasks. Unless BOINC is doing that without my knowledge. But I don't think so.


you also appear to have your hosts setup to ONLY crunch these beta tasks. is there a reason for that?

does your system process the normal tasks fine? maybe it's something going on with your system as a whole.

ID: 56015 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Previous · 1 · 2 · 3 · 4 · 5 . . . 6 · Next

Message boards : News : Experimental Python tasks (beta)

©2025 Universitat Pompeu Fabra