Experimental Python tasks (beta)

Author	Message
Keith Myers Send message Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 731 Level Scientific publications	Message 55958 - Posted: 10 Dec 2020, 19:05:05 UTC - in response to Message 55955. There's an explanation for 20 credit tasks over at Rosetta. Has to do with a task being interrupted in calculation and restarted if I remember correctly. ID: 55958 · Rating: 0 · rate: / Reply Quote

Keith Myers Send message Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 731 Level Scientific publications	Message 55959 - Posted: 10 Dec 2020, 19:07:58 UTC - in response to Message 55957. Last modified: 10 Dec 2020, 19:15:47 UTC what kind of BOINC install do you have? does it run as a service? or a standalone install that runs from an executable? That was one of the questions I wanted to ask Mr. Kevvy in the case he seems to be the first cruncher to successfully crunch a ton of them without errors. I wondered if his BOINC was a service install or a standalone. [Edit] OK, so Mr. Kevvy is still using the AIO. I wondered since a lot of our team seem to have dropped the AIO and gone back to the service install. So, then likely the main difference is that Mr. Kevvy is using the older glibc 2.29 instead of the glibc 2.31 that we Ubuntu 20 users are running. ID: 55959 · Rating: 0 · rate: / Reply Quote

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269 Level Scientific publications	Message 55961 - Posted: 10 Dec 2020, 19:17:41 UTC - in response to Message 55959. I'm almost positive he's running a standalone install. ID: 55961 · Rating: 0 · rate: / Reply Quote

biodoc Send message Joined: 26 Aug 08 Posts: 183 Credit: 10,085,929,375 RAC: 0 Level Scientific publications	Message 55962 - Posted: 10 Dec 2020, 19:28:54 UTC - in response to Message 55957. I've got one running now on an RTX 2070S and the only real issue is low GPU utilization (60-70%). The current task is using ~2 GB of VRAM and ~3 GB of system RAM. I have one thread free on a ryzen 3900X to support the GPU and that thread is running at 100%. This computer has completed 3 of the new python tasks successfully. Linux Mint 20; Driver Version: 440.95.01; CUDA Version: 10.2 what kind of BOINC install do you have? does it run as a service? or a standalone install that runs from an executable? what is the clock speed of your 3900X and memory speed as well? try letting there be 2 spare free threads (so you have one doing nothing) to avoid maxing out the CPU to 100% utilization on all threads. this is known to slow down GPU work. this might increase your GPU utilization a bit. Boinc runs as a service and was installed from the Mint repository (version 17.16.6). The CPU clock speed is 3.9 GHz and the RAM is DDR4 3200 CL16. I did free up another thread but I didn't see an obvious difference in GPU utilization. ID: 55962 · Rating: 0 · rate: / Reply Quote

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269 Level Scientific publications	Message 55963 - Posted: 10 Dec 2020, 19:31:00 UTC - in response to Message 55959. Last modified: 10 Dec 2020, 19:39:47 UTC So, then likely the main difference is that Mr. Kevvy is using the older glibc 2.29 instead of the glibc 2.31 that we Ubuntu 20 users are running. difference in what sense? you and I both have glibc 2.31 and we both have a bunch of successful completions. looks like Kevvy's Ubuntu 20 systems also have 2.31. all of us with these Ubuntu 20.04 systems have successful completions. but of all of his Linux Mint (based on Ubuntu 19) systems, none have completed a single Python task successfully. I'm not sure if it's a problem with Linux Mint or what. I'm not sure it's necessarily anything to do with the GLIBC since his error messages are varied, and none mention GLIBC as being the cause. It could just be that the app has some bugs to work out when running in different environments. I also don't know if he's using service installs on his Mint systems, he's got a lot of different BOINC versions across all his systems. ID: 55963 · Rating: 0 · rate: / Reply Quote

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269 Level Scientific publications	Message 55964 - Posted: 10 Dec 2020, 19:36:51 UTC - in response to Message 55962. Boinc runs as a service and was installed from the Mint repository (version 17.16.6). The CPU clock speed is 3.9 GHz and the RAM is DDR4 3200 CL16. I did free up another thread but I didn't see an obvious difference in GPU utilization. thanks for the clarification. it was worth a shot on the GPU utilization with the free thread, low hanging fruit. I run my memory at 3600 CL14, but I've never seen memory matter that much even for CPU tasks on other projects, let alone GPU tasks. (I saw no difference when changing from 3200CL16 to 3600CL14), but anything's possible I guess. ID: 55964 · Rating: 0 · rate: / Reply Quote

biodoc Send message Joined: 26 Aug 08 Posts: 183 Credit: 10,085,929,375 RAC: 0 Level Scientific publications	Message 55965 - Posted: 10 Dec 2020, 19:44:14 UTC - in response to Message 55963. So, then likely the main difference is that Mr. Kevvy is using the older glibc 2.29 instead of the glibc 2.31 that we Ubuntu 20 users are running. difference in what sense? you and I both have glibc 2.31 and we both have a bunch of successful completions. looks like Kevvy's Ubuntu 20 systems also have 2.31. all of us with these Ubuntu 20.04 systems have successful completions. but of all of his Linux Mint (based on Ubuntu 19) systems, none have completed a single Python task successfully. I'm not sure if it's a problem with Linux Mint or what. I'm not sure it's necessarily anything to do with the GLIBC since his error messages are varied, and none mention GLIBC as being the cause. It could just be that the app has some bugs to work out when running in different environments. Mint 20 is based on Ubuntu 20.04 and has glibc 2.31. The 2 computers I have running GPUGrid have Mint 20 installed and the RTX cards on those computers are completing the new python tasks successfully. ID: 55965 · Rating: 0 · rate: / Reply Quote

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269 Level Scientific publications	Message 55966 - Posted: 10 Dec 2020, 20:01:37 UTC - in response to Message 55965. Last modified: 10 Dec 2020, 20:02:13 UTC Mint 20 is based on Ubuntu 20.04 and has glibc 2.31. The 2 computers I have running GPUGrid have Mint 20 installed and the RTX cards on those computers are completing the new python tasks successfully. Yes, I know. But my point was that there are many differences between Mint 19 and 20, not just GLIBC version, and usually when GLIBC is an issue that shows up as the reason for the error in the task results, but that hasn't been the case. and conversely we have several examples of tasks hitting Ubuntu 20.04 systems with GLIBC of 2.31 and they still fail. I think it's just buggy. ID: 55966 · Rating: 0 · rate: / Reply Quote

Keith Myers Send message Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 731 Level Scientific publications	Message 55969 - Posted: 10 Dec 2020, 22:05:44 UTC - in response to Message 55966. Yes, I had over a half dozen failed tasks before the first successful task. That's why I was wondering if the failed tasks report the failed configuration upstream and change the future task configuration. Pretty sure lots of prerequisite software is downloaded first from conda and configured on the system before finally actually starting real crunching. And the configuration downloads happen for each task I think. Not just some initial download after which all the files are static. ID: 55969 · Rating: 0 · rate: / Reply Quote

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269 Level Scientific publications	Message 55996 - Posted: 12 Dec 2020, 20:10:00 UTC FYI, these tasks don't checkpoint properly. if you need to stop BOINC or the system experiences a power outage, the tasks restart from the beginning (10%) but the task timer still tracks from where it left off even though the task restarted. if the tasks were short like MDAD (but MDAD checkpoints properly) it wouldn't be a huge problem. but when they run for 4-5hrs and need to start over for any interruption, it's a bit of a kick in the pants. even worse when these restarted tasks only get 20cred for up to 2x total run time. not worth finishing it at that point. additionally as has been mentioned in the other thread, these tasks wreak havoc on the system's DCF since it seems to be set incorrectly for these tasks. you get these tasks that make BOINC think they will complete in 10 seconds, and they end up taking 4hrs, so BOINC counters with inflating the run time of normal tasks to 10+ days when they only take 20-40 min lol. and it swings wildly back and forth depending how many of each type you've completed. and credit reward, other than being about 10x normal for tasks of this runtime, seems only tied to FLOPS and runtime without accounting for efficiency at all. my 3900X/2080ti completes tasks on average much faster than my EPYC/2080ti system since the 3900X system is running higher GPU utilization allowing faster run times. but the 3900X system earns proportionally less credit. so both systems end up earning the same amount of credit per card. the 3900X/2080ti should be earning more credit since it's doing more tasks. reward is being overinflated for tasks that have longer run time due to inefficiency. it seems tied only to raw runtime and estimated flops.
i understand that tasks can have varying run times, but if you won't account for efficiency you need to have a static reward not dependent on runtime at all. for reference, a static reward of about 175,000 would, on average, bring these tasks near the MDAD for cred/unit-time. ID: 55996 · Rating: 0 · rate: / Reply Quote
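To make the credit arithmetic in the post above concrete, here is a minimal Python sketch. It assumes credit is either strictly proportional to runtime or a fixed amount per task; the per-hour rate and the two runtimes are hypothetical, and only the 175,000 figure comes from the post.

```python
# Sketch only: compares credit per hour under two crediting schemes.
# HYPOTHETICAL_RATE and both runtimes are made-up illustration values.

def credit_per_hour_runtime_based(runtime_hours: float, rate_per_hour: float) -> float:
    """Credit proportional to runtime: a slower host gets more per task."""
    per_task_credit = rate_per_hour * runtime_hours
    return per_task_credit / runtime_hours


def credit_per_hour_static(runtime_hours: float, static_credit: float) -> float:
    """Fixed credit per task: finishing faster earns more per hour."""
    return static_credit / runtime_hours


HYPOTHETICAL_RATE = 150_000.0   # credit per hour, assumption
STATIC_REWARD = 175_000.0       # per-task figure suggested in the post
FAST_HOST_HOURS = 4.0           # hypothetical efficient host
SLOW_HOST_HOURS = 5.0           # hypothetical less efficient host

if __name__ == "__main__":
    for name, hours in (("fast", FAST_HOST_HOURS), ("slow", SLOW_HOST_HOURS)):
        print(name,
              credit_per_hour_runtime_based(hours, HYPOTHETICAL_RATE),
              credit_per_hour_static(hours, STATIC_REWARD))
```

Under the runtime-based scheme both hosts earn the same credit per hour regardless of how many tasks they finish, which is the behaviour the post objects to; under the static scheme the faster host comes out ahead.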

Greger Send message Joined: 6 Jan 15 Posts: 76 Credit: 25,499,534,331 RAC: 0 Level Scientific publications	Message 55997 - Posted: 12 Dec 2020, 22:03:58 UTC Last modified: 12 Dec 2020, 22:06:56 UTC My host switched to another project's task, then resumed, and after a while I had to update the system and restart. It did indeed fail to resume from the last state, so it looks like the checkpoint was far behind or there was no checkpoint at all. The time stayed at around 2 hours, which was hours behind, and the estimated percentage was locked at 10%. I aborted it the next day when it reached 14 hours. https://www.gpugrid.net/result.php?resultid=31701824 I would expect it not to be fully working yet, with checkpointing added later on. There is a lot of testing going on but little information for us still, so we need to take it for what it is and deal with it if they don't work. ID: 55997 · Rating: 0 · rate: / Reply Quote

Toni Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 Level Scientific publications	Message 56007 - Posted: 15 Dec 2020, 15:52:05 UTC - in response to Message 55997. The Python app runs ACEMD, but uses additional libraries to compute additional force terms. These libraries are distributed as Conda (Python) packages. For this to work, I had to make an App which installs a self-contained Conda install in the project dir. The installation is re-used from one run to the other. This is rather finicky (for example, downloads are large, and I have to be careful with concurrent installs). Two outstanding issues are over-crediting (I am using some default BOINC formula) and, as far as i understand, the flops estimate (?). ID: 56007 · Rating: 0 · rate: / Reply Quote
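The post above mentions having to be careful with concurrent installs into a shared project directory. One common way to guard that kind of step is an exclusive file lock plus a completion marker. The sketch below is an illustration under that assumption, not the project's actual installer: the file names `install.lock` and `install_complete.marker` and the `install` callback are hypothetical, and `fcntl` is available on Linux and other Unix systems only.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def exclusive_lock(lock_path: str):
    """Hold an exclusive advisory lock on lock_path for the duration of the block."""
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until other holders release
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def ensure_installed(project_dir: str, install) -> bool:
    """Run install(project_dir) once; later callers reuse the result.

    Returns True if this call performed the install, False if it was reused.
    """
    marker = os.path.join(project_dir, "install_complete.marker")  # hypothetical name
    lock_path = os.path.join(project_dir, "install.lock")          # hypothetical name
    with exclusive_lock(lock_path):
        if os.path.exists(marker):
            return False  # another task already finished the install
        install(project_dir)
        with open(marker, "w") as marker_file:
            marker_file.write("ok\n")
        return True
```

A second task starting while the first is still installing would block on the lock, then find the marker and skip the download.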

zombie67 [MM] Send message Joined: 16 Jul 07 Posts: 209 Credit: 5,496,860,456 RAC: 9,935 Level Scientific publications	Message 56008 - Posted: 15 Dec 2020, 17:04:43 UTC - in response to Message 56007. Two outstanding issues are over-crediting (I am using some default BOINC formula) and, as far as i understand, the flops estimate (?). Over-crediting? I am seeing the opposite problem. https://www.gpugrid.net/result.php?resultid=31902208 20.83 credits for 4.5 hours of run time on an RTX 2080 Ti. That is practically nothing. And this is not a one-off. All my tasks so far are similar. Reno, NV Team: SETI.USA ID: 56008 · Rating: 0 · rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351 Level Scientific publications	Message 56009 - Posted: 15 Dec 2020, 17:14:18 UTC - in response to Message 56007. Thanks for the details. The flops estimate Yes, the "size" of the tasks, as expressed by <rsc_fpops_est> in the workunit template. The current value is 3,000 GFLOPS: all other GPUGrid task types are 5,000,000 GFLOPS. An App which installs a self-contained Conda install We are encountering an unfortunate clash with the security of BOINC running as a systemd service under Linux. Useful bits of BOINC (pausing computation when the computer's user is active on the mouse or keyboard) rely on having access to the public /tmp/ folder structure. The conda installer wants to make use of a temporary folder. systemd allows us to have either public tmp folders (read only, for security), or private tmp folders (write access). But not both at the same time. We're exploring how to get the best of both worlds... Discussions in https://www.gpugrid.net/forum_thread.php?id=5204 https://github.com/BOINC/boinc/issues/4125 over-crediting We're enjoying it while it lasts! ID: 56009 · Rating: 0 · rate: / Reply Quote
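To show why the small <rsc_fpops_est> value upsets runtime estimates for every other task, here is a simplified Python model. It assumes the displayed estimate is the flops estimate divided by the device speed and multiplied by the host's duration correction factor (DCF), and that the DCF jumps straight to the observed ratio after one long task. This is a sketch, not the BOINC client's actual code; the device speed is hypothetical, and the two task sizes are the figures quoted above read as operation counts.

```python
# Simplified model of runtime estimation with a duration correction factor.
# Not the BOINC client's real algorithm; DEVICE_FLOPS is a hypothetical value.

def raw_estimate_seconds(rsc_fpops_est: float, device_flops: float) -> float:
    """Uncorrected estimate: operations to do divided by operations per second."""
    return rsc_fpops_est / device_flops


def displayed_estimate_seconds(rsc_fpops_est: float, device_flops: float, dcf: float) -> float:
    """Estimate after scaling by the host's duration correction factor."""
    return raw_estimate_seconds(rsc_fpops_est, device_flops) * dcf


def dcf_after_task(actual_seconds: float, rsc_fpops_est: float, device_flops: float) -> float:
    """Assumed update rule: DCF becomes actual runtime over raw estimate."""
    return actual_seconds / raw_estimate_seconds(rsc_fpops_est, device_flops)


PYTHON_FPOPS_EST = 3_000e9        # 3,000 G operations, from the post
NORMAL_FPOPS_EST = 5_000_000e9    # 5,000,000 G operations, from the post
DEVICE_FLOPS = 3.0e11             # hypothetical effective device speed
ACTUAL_PYTHON_SECONDS = 4 * 3600.0  # a roughly 4 hour Python task

python_raw = raw_estimate_seconds(PYTHON_FPOPS_EST, DEVICE_FLOPS)
inflated_dcf = dcf_after_task(ACTUAL_PYTHON_SECONDS, PYTHON_FPOPS_EST, DEVICE_FLOPS)
normal_raw = raw_estimate_seconds(NORMAL_FPOPS_EST, DEVICE_FLOPS)
normal_inflated = displayed_estimate_seconds(NORMAL_FPOPS_EST, DEVICE_FLOPS, inflated_dcf)

if __name__ == "__main__":
    print("Python task raw estimate (s):", python_raw)
    print("DCF after one Python task:", inflated_dcf)
    print("Normal task estimate before (s):", normal_raw)
    print("Normal task estimate after (s):", normal_inflated)
```

The absolute numbers depend entirely on the assumed device speed and update rule, so they should not be expected to match what any particular host displays; the point is that one badly sized task type drags the shared correction factor, and with it the estimates for correctly sized tasks.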

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351 Level Scientific publications	Message 56010 - Posted: 15 Dec 2020, 17:19:51 UTC - in response to Message 56008. Over-crediting? OK, make that 'inconsistent crediting'. Mine are all in the 600,000 - 900,000 range, for much the same runtime on a 1660 Ti. Host 508381 ID: 56010 · Rating: 0 · rate: / Reply Quote

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269 Level Scientific publications	Message 56011 - Posted: 15 Dec 2020, 17:50:02 UTC - in response to Message 56010. Last modified: 15 Dec 2020, 18:13:03 UTC Over-crediting? OK, make that 'inconsistent crediting'. Mine are all in the 600,000 - 900,000 range, for much the same runtime on a 1660 Ti. Host 508381 the 20 credits thing seems to only happen with restarted tasks from what I've seen. not sure if anything else triggers it. but I can say with certainty that the credit allocation is "questionable", and only appears to be related to the flops of device 0 in BOINC, as well as runtime. slow devices masked behind a fast device0 will earn credit at the rate of the faster device... ID: 56011 · Rating: 0 · rate: / Reply Quote

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269 Level Scientific publications	Message 56012 - Posted: 15 Dec 2020, 17:53:40 UTC - in response to Message 56008. Two outstanding issues are over-crediting (I am using some default BOINC formula) and, as far as i understand, the flops estimate (?). Over-crediting? I am seeing the opposite problem. https://www.gpugrid.net/result.php?resultid=31902208 20.83 credits for 4.5 hours of run time on an RTX 2080 Ti. That is practically nothing. And this is not a one-off. All my tasks so far are similar. this happens when the task is interrupted. started and resumed. you can't interrupt these tasks at all. ID: 56012 · Rating: 0 · rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351 Level Scientific publications	Message 56013 - Posted: 15 Dec 2020, 18:09:15 UTC We should perhaps mention the lack of effective checkpointing while we have Toni's attention. Even though the tasks claim to checkpoint every 0.9% (after the initial 10% allowed for the setup), the apps are unable to resume from the point previously reached. ID: 56013 · Rating: 0 · rate: / Reply Quote
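The figures in the post above (10% allowed for setup, then a checkpoint every 0.9%) imply 100 checkpoint intervals per task. A small Python sketch of that arithmetic follows, assuming reported progress is linear in the number of checkpoints; the function names are illustrative and not taken from the app.

```python
# Sketch of the progress arithmetic implied by the post: 10% setup, 0.9% per checkpoint.
# Linear progress is an assumption; names are illustrative only.

SETUP_FRACTION = 0.10
CHECKPOINT_STEP = 0.009


def checkpoints_in_task() -> int:
    """Number of 0.9% intervals needed to go from 10% to 100%."""
    return round((1.0 - SETUP_FRACTION) / CHECKPOINT_STEP)


def fraction_done(checkpoints_completed: int) -> float:
    """Reported progress after a given number of checkpoints."""
    return SETUP_FRACTION + CHECKPOINT_STEP * checkpoints_completed


def fraction_lost_on_restart(checkpoints_completed: int) -> float:
    """Progress thrown away if the app restarts at 10% instead of resuming."""
    return fraction_done(checkpoints_completed) - SETUP_FRACTION


if __name__ == "__main__":
    print(checkpoints_in_task())
    print(fraction_done(50), fraction_lost_on_restart(50))
```

With a working resume, an interruption would cost at most one 0.9% interval; with the behaviour reported in this thread, everything past the 10% setup mark is lost.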

zombie67 [MM] Send message Joined: 16 Jul 07 Posts: 209 Credit: 5,496,860,456 RAC: 9,935 Level Scientific publications	Message 56014 - Posted: 15 Dec 2020, 18:22:52 UTC - in response to Message 56012. Over-crediting? I am seeing the opposite problem. https://www.gpugrid.net/result.php?resultid=31902208 20.83 credits for 4.5 hours of run time on an RTX 2080 Ti. That is practically nothing. And this is not a one-off. All my tasks so far are similar. this happens when the task is interrupted. started and resumed. you can't interrupt these tasks at all. I'll check that out. But I have not suspended or otherwise interrupted any tasks. Unless BOINC is doing that without my knowledge. But I don't think so. Reno, NV Team: SETI.USA ID: 56014 · Rating: 0 · rate: / Reply Quote

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269 Level Scientific publications	Message 56015 - Posted: 15 Dec 2020, 18:28:46 UTC - in response to Message 56014. Over-crediting? I am seeing the opposite problem. https://www.gpugrid.net/result.php?resultid=31902208 20.83 credits for 4.5 hours of run time on an RTX 2080 Ti. That is practically nothing. And this is not a one-off. All my tasks so far are similar. this happens when the task is interrupted. started and resumed. you can't interrupt these tasks at all. I'll check that out. But I have not suspended or otherwise interrupted any tasks. Unless BOINC is doing that without my knowledge. But I don't think so. you also appear to have your hosts set up to ONLY crunch these beta tasks. is there a reason for that? does your system process the normal tasks fine? maybe it's something going on with your system as a whole. ID: 56015 · Rating: 0 · rate: / Reply Quote