Experimental Python tasks (beta)
Author | Message |
---|---|
Joined: 13 Dec 17 Posts: 1416 Credit: 9,119,446,190 RAC: 678,713 |
There's an explanation for 20 credit tasks over at Rosetta. Has to do with a task being interrupted in calculation and restarted if I remember correctly. |
Joined: 13 Dec 17 Posts: 1416 Credit: 9,119,446,190 RAC: 678,713 |
what kind of BOINC install do you have? does it run as a service? or a standalone install that runs from an executable? That was one of the questions I wanted to ask Mr. Kevvy, since he seems to be the first cruncher to successfully crunch a ton of them without errors. I wondered if his BOINC was a service install or a standalone. [Edit] OK, so Mr. Kevvy is still using the AIO. I wondered since a lot of our team seem to have dropped the AIO and gone back to the service install. So, then likely the main difference is that Mr. Kevvy is using the older glibc 2.29 instead of the glibc 2.31 that we Ubuntu 20 users are running. |
Joined: 21 Feb 20 Posts: 1114 Credit: 40,838,348,595 RAC: 4,765,598 |
I'm almost positive he's running a standalone install. |
Joined: 26 Aug 08 Posts: 183 Credit: 10,085,929,375 RAC: 0 |
I've got one running now on an RTX 2070S and the only real issue is low GPU utilization (60-70%). The current task is using ~2 GB of VRAM and ~3 GB of system RAM. I have one thread free on a Ryzen 3900X to support the GPU and that thread is running at 100%. This computer has completed 3 of the new python tasks successfully. BOINC runs as a service and was installed from the Mint repository (version 7.16.6). The CPU clock speed is 3.9 GHz and the RAM is DDR4 3200 CL16. I did free up another thread but I didn't see an obvious difference in GPU utilization. |
Joined: 21 Feb 20 Posts: 1114 Credit: 40,838,348,595 RAC: 4,765,598 |
So, then likely the main difference is that Mr. Kevvy is using the older glibc 2.29 instead of the glibc 2.31 that we Ubuntu 20 users are running. difference in what sense? you and I both have glibc 2.31 and we both have a bunch of successful completions. looks like Kevvy's Ubuntu 20 systems also have 2.31. all of us with these Ubuntu 20.04 systems have successful completions. but of all of his Linux Mint (based on Ubuntu 19) systems, none have completed a single Python task successfully. I'm not sure if it's a problem with Linux Mint or what. I'm not sure it's necessarily anything to do with the GLIBC since his error messages are varied, and none mention GLIBC as being the cause. It could just be that the app has some bugs to work out when running in different environments. I also don't know if he's using service installs on his Mint systems, he's got a lot of different BOINC versions across all his systems. |
Joined: 21 Feb 20 Posts: 1114 Credit: 40,838,348,595 RAC: 4,765,598 |
BOINC runs as a service and was installed from the Mint repository (version 7.16.6). The CPU clock speed is 3.9 GHz and the RAM is DDR4 3200 CL16. I did free up another thread but I didn't see an obvious difference in GPU utilization. thanks for the clarification. it was worth a shot on the GPU utilization with the free thread, low hanging fruit. I run my memory at 3600 CL14, but I've never seen memory matter that much even for CPU tasks on other projects, let alone GPU tasks. (I saw no difference when changing from 3200CL16 to 3600CL14), but anything's possible I guess. |
Joined: 26 Aug 08 Posts: 183 Credit: 10,085,929,375 RAC: 0 |
So, then likely the main difference is that Mr. Kevvy is using the older glibc 2.29 instead of the glibc 2.31 that we Ubuntu 20 users are running. Mint 20 is based on Ubuntu 20.04 and has glibc 2.31. The 2 computers I have running GPUGrid have Mint 20 installed and the RTX cards on those computers are completing the new python tasks successfully. |
Joined: 21 Feb 20 Posts: 1114 Credit: 40,838,348,595 RAC: 4,765,598 |
Mint 20 is based on Ubuntu 20.04 and has glibc 2.31. The 2 computers I have running GPUGrid have Mint 20 installed and the RTX cards on those computers are completing the new python tasks successfully. Yes, I know. But my point was that there are many differences between Mint 19 and 20, not just GLIBC version, and usually when GLIBC is an issue that shows up as the reason for the error in the task results, but that hasn't been the case. and conversely we have several examples of tasks hitting Ubuntu 20.04 systems with GLIBC of 2.31 and they still fail. I think it's just buggy. |
Joined: 13 Dec 17 Posts: 1416 Credit: 9,119,446,190 RAC: 678,713 |
Yes, I had over half a dozen failed tasks before the first successful task. That's why I was wondering if the failed tasks report the failed configuration upstream and change the future task configuration. Pretty sure lots of prerequisite software is downloaded first from conda and configured on the system before finally actually starting real crunching. And the configuration downloads happen for each task, I think, not just in some initial download after which all the files are static. |
Joined: 21 Feb 20 Posts: 1114 Credit: 40,838,348,595 RAC: 4,765,598 |
FYI, these tasks don't checkpoint properly. if you need to stop BOINC or the system experiences a power outage, the tasks restart from the beginning (10%) but the task timer still tracks from where it left off even though the task restarted. if the tasks were short like MDAD (but MDAD checkpoints properly) it wouldn't be a huge problem. but when they run for 4-5hrs and need to start over for any interruption, it's a bit of a kick in the pants. even worse when these restarted tasks only get 20cred for up to 2x total run time. not worth finishing it at that point. additionally as has been mentioned in the other thread, these tasks wreak havoc on the system's DCF since it seems to be set incorrectly for these tasks. you get these tasks that make BOINC think they will complete in 10 seconds, and they end up taking 4hrs, so BOINC counters with inflating the run time of normal tasks to 10+ days when they only take 20-40 min lol. and it swings wildly back and forth depending how many of each type you've completed. and credit reward, other than being about 10x normal for tasks of this runtime, seems only tied to FLOPS and runtime without accounting for efficiency at all. my 3900X/2080ti completes tasks on average much faster than my EPYC/2080ti system since the 3900X system is running higher GPU utilization allowing faster run times. but the 3900X system earns proportionally less credit. so both systems end up earning the same amount of credit per card. the 3900X/2080ti should be earning more credit since it's doing more tasks. reward is being overinflated for tasks that have longer run time due to inefficiency. it seems tied only to raw runtime and estimated flops. i understand that tasks can have varying run times, but if you won't account for efficiency you need to have a static reward not dependent on runtime at all. for reference, a static reward of about 175,000 would, on average, bring these tasks near the MDAD for cred/unit-time. |
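
The argument about runtime-based versus static credit can be made concrete with a small sketch. This is only an illustration under stated assumptions: the 4 h and 5 h runtimes are picked from the "4-5hrs" range mentioned in the post, the 40,000 credits/hour rate is a made-up figure, and the function names are hypothetical. It does not reproduce the project's actual credit formula.

```python
def runtime_based_credit(runtime_hours, rate_per_hour):
    """Credit that grows linearly with runtime (simplified model, not BOINC's real formula)."""
    return runtime_hours * rate_per_hour


def credit_per_day(credit_per_task, runtime_hours):
    """Credit one card earns per day if it runs such tasks back to back."""
    tasks_per_day = 24 / runtime_hours
    return credit_per_task * tasks_per_day


FAST_HOST_HOURS = 4.0     # assumed runtime on the more efficient host
SLOW_HOST_HOURS = 5.0     # assumed runtime on the less efficient host
RATE_PER_HOUR = 40_000    # hypothetical runtime-based rate
STATIC_REWARD = 175_000   # static figure suggested in the post

for label, hours in [("fast host", FAST_HOST_HOURS), ("slow host", SLOW_HOST_HOURS)]:
    by_runtime = credit_per_day(runtime_based_credit(hours, RATE_PER_HOUR), hours)
    by_static = credit_per_day(STATIC_REWARD, hours)
    print(f"{label}: runtime-based {by_runtime:,.0f}/day, static {by_static:,.0f}/day")
```

Under the runtime-based model both hosts end up with the same credit per day, because the longer runtime is paid for in full. With a static reward, the host that finishes more tasks per day earns more, which is the behaviour the post asks for.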
Joined: 6 Jan 15 Posts: 76 Credit: 25,499,534,331 RAC: 0 |
My host switched to another project's task and then resumed, and after a while I had to update the system and restart. It did indeed fail to resume from its last state, so it looks like the checkpoint was far behind or there was no checkpoint at all. Elapsed time stayed at around 2 hours, which was hours behind, and the estimated percentage was locked at 10%. I aborted it the next day as it reached 14 hours. https://www.gpugrid.net/result.php?resultid=31701824 I would expect it not to be fully working yet, with checkpointing added later on. There is a lot of testing going on but little info for us so far, so we need to take it for what it is and deal with it if they don't work. |
Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 |
The Python app runs ACEMD, but uses additional libraries to compute additional force terms. These libraries are distributed as Conda (Python) packages. For this to work, I had to make an App which installs a self-contained Conda install in the project dir. The installation is re-used from one run to the other. This is rather finicky (for example, downloads are large, and I have to be careful with concurrent installs). Two outstanding issues are over-crediting (I am using some default BOINC formula) and, as far as I understand, the flops estimate (?). |
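
The post does not say how concurrent installs are guarded. As a generic illustration only, one common approach is an exclusive lock file, so that a second task starting at the same moment does not run the installer in parallel. The function names and the lock file name below are hypothetical and are not taken from the GPUGrid app.

```python
import os
import tempfile


def try_acquire_install_lock(lock_path):
    """Return True if this process created the lock file, False if it already existed."""
    try:
        # O_CREAT | O_EXCL makes creation fail if the file is already there.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as lock_file:
        lock_file.write(str(os.getpid()))  # record the owner for debugging
    return True


def release_install_lock(lock_path):
    """Remove the lock file so another process can install."""
    os.remove(lock_path)


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()
    lock = os.path.join(demo_dir, "conda_install.lock")  # hypothetical name
    print("first caller got lock:", try_acquire_install_lock(lock))
    print("second caller got lock:", try_acquire_install_lock(lock))
    release_install_lock(lock)
```

A real implementation would also have to deal with stale locks left behind by a crashed installer, which this sketch ignores.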
Joined: 16 Jul 07 Posts: 209 Credit: 5,496,860,456 RAC: 8,582,660 |
Two outstanding issues are over-crediting (I am using some default BOINC formula) and, as far as i understand, the flops estimate (?). Over-crediting? I am seeing the opposite problem. https://www.gpugrid.net/result.php?resultid=31902208 20.83 credits for 4.5 hours of run time on an RTX 2080 Ti. That is practically nothing. And this is not a one-off. All my tasks so far are similar. Reno, NV Team: SETI.USA |
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 326,008 |
Thanks for the details. The flops estimate Yes, the "size" of the tasks, as expressed by <rsc_fpops_est> in the workunit template. The current value is 3,000 GFLOPS: all other GPUGrid task types are 5,000,000 GFLOPS. An App which installs a self-contained Conda install We are encountering an unfortunate clash with the security of BOINC running as a systemd service under Linux. Useful bits of BOINC (pausing computation when the computer's user is active on the mouse or keyboard) rely on having access to the public /tmp/ folder structure. The conda installer wants to make use of a temporary folder. systemd allows us to have either public tmp folders (read only, for security), or private tmp folders (write access). But not both at the same time. We're exploring how to get the best of both worlds... Discussions in https://www.gpugrid.net/forum_thread.php?id=5204 https://github.com/BOINC/boinc/issues/4125 over-crediting We're enjoying it while it lasts! |
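
To see why the `<rsc_fpops_est>` value matters, here is a rough sketch. It assumes the client's initial runtime estimate is approximately the estimated work divided by the device speed, scaled by the duration correction factor (DCF); that is a simplification of what BOINC does. The 10 TFLOPS device speed is a made-up figure and the function name is hypothetical. The two work sizes are the ones quoted in the post.

```python
GIGA = 1e9


def estimated_runtime_s(rsc_fpops_est, device_flops, dcf=1.0):
    """Simplified estimate: work size / device speed, scaled by the correction factor."""
    return rsc_fpops_est / device_flops * dcf


python_task_fpops = 3_000 * GIGA        # value quoted for the Python beta tasks
regular_task_fpops = 5_000_000 * GIGA   # value quoted for other GPUGrid tasks
device_flops = 10e12                    # hypothetical 10 TFLOPS GPU

for name, fpops in [("Python beta", python_task_fpops), ("regular", regular_task_fpops)]:
    seconds = estimated_runtime_s(fpops, device_flops)
    print(f"{name}: about {seconds:.1f} s estimated before any correction")

ratio = regular_task_fpops / python_task_fpops
print(f"regular tasks are declared about {ratio:.0f} times larger")
```

If both task types really take hours, no single correction factor can fit both estimates, which would be consistent with the swings in estimated runtime reported earlier in the thread.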
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 326,008 |
Over-crediting? OK, make that 'inconsistent crediting'. Mine are all in the 600,000 - 900,000 range, for much the same runtime on a 1660 Ti. Host 508381 |
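
For a sense of scale, the figures reported in this thread can be converted to credit per hour. The 20.83 credits for 4.5 hours and the 600,000 - 900,000 range come from the posts above. Applying the same 4.5 hours to the larger awards is an assumption based on "much the same runtime", and the static 175,000 figure is the one proposed earlier for comparison.

```python
def credit_per_hour(credit, runtime_hours):
    """Average credit earned per hour of runtime."""
    return credit / runtime_hours


RUNTIME_HOURS = 4.5  # reported for the 20.83-credit task; assumed for the others

reported = {
    "low award (RTX 2080 Ti)": 20.83,
    "high award, lower bound (1660 Ti)": 600_000,
    "high award, upper bound (1660 Ti)": 900_000,
    "proposed static reward": 175_000,
}

for label, credit in reported.items():
    print(f"{label}: {credit_per_hour(credit, RUNTIME_HOURS):,.1f} credits/hour")
```

The gap between the lowest and highest figures is more than four orders of magnitude for comparable runtimes, which is why 'inconsistent crediting' describes the situation better than over-crediting.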
Joined: 21 Feb 20 Posts: 1114 Credit: 40,838,348,595 RAC: 4,765,598 |
Over-crediting? the 20 credits thing seems to only happen with restarted tasks from what I've seen. not sure if anything else triggers it. but I can say with certainty that the credit allocation is "questionable", and only appears to be related to the flops of device 0 in BOINC, as well as runtime. slow devices masked behind a fast device0 will earn credit at the rate of the faster device... |
Joined: 21 Feb 20 Posts: 1114 Credit: 40,838,348,595 RAC: 4,765,598 |
Two outstanding issues are over-crediting (I am using some default BOINC formula) and, as far as i understand, the flops estimate (?). this happens when the task is interrupted. started and resumed. you can't interrupt these tasks at all. |
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 326,008 |
We should perhaps mention the lack of effective checkpointing while we have Toni's attention. Even though the tasks claim to checkpoint every 0.9% (after the initial 10% allowed for the setup), the apps are unable to resume from the point previously reached. |
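
For readers unfamiliar with the term, effective checkpointing means that progress is saved in a form the app can actually reload after a restart. The sketch below is a generic illustration, not the ACEMD or GPUGrid implementation: the JSON state file, the function names and the step counter are all hypothetical stand-ins for real simulation state.

```python
import json
import os


def save_checkpoint(path, state):
    """Write state atomically: write a temp file, then rename it over the old one."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"step": 0}


def run(path, total_steps, checkpoint_every, stop_after=None):
    """Resume from the last checkpoint; optionally simulate an interruption."""
    state = load_checkpoint(path)
    done_this_run = 0
    while state["step"] < total_steps:
        state["step"] += 1  # stand-in for one unit of real work
        done_this_run += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
        if stop_after is not None and done_this_run >= stop_after:
            return state["step"]  # interrupted before finishing
    save_checkpoint(path, state)
    return state["step"]
```

With `checkpoint_every=10`, a run interrupted at step 35 restarts from step 30 and not from zero. Writing a checkpoint file is only half the job; the behaviour reported in this thread suggests that the reload half is the part that is not working.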
Joined: 16 Jul 07 Posts: 209 Credit: 5,496,860,456 RAC: 8,582,660 |
Over-crediting? I am seeing the opposite problem. I'll check that out. But I have not suspended or otherwise interrupted any tasks. Unless BOINC is doing that without my knowledge. But I don't think so. Reno, NV Team: SETI.USA |
Joined: 21 Feb 20 Posts: 1114 Credit: 40,838,348,595 RAC: 4,765,598 |
Over-crediting? I am seeing the opposite problem. you also appear to have your hosts set up to ONLY crunch these beta tasks. is there a reason for that? does your system process the normal tasks fine? maybe it's something going on with your system as a whole. |
©2025 Universitat Pompeu Fabra