Experimental Python tasks (beta) - task description

Keith Myers
Joined: 13 Dec 17
Posts: 1416
Credit: 9,119,446,190
RAC: 614,515
Message 58146 - Posted: 17 Dec 2021, 18:11:26 UTC

I must be crunching one of the fixed second batch on this daily driver. It seems to be progressing nicely.

It is using about 17 GB of system memory, and GPU utilization spikes up to 97% every once in a while, with long stretches around 12-17% and some brief spikes around 42%.

I got one of the first batch on another host; it failed fast with a similar error, as did all the wingmen.
Ian&Steve C.
Joined: 21 Feb 20
Posts: 1114
Credit: 40,838,535,595
RAC: 4,302,611
Message 58147 - Posted: 17 Dec 2021, 19:29:02 UTC

These new ones must be pretty long.

Mine has been running almost 2 hours now, with a lot higher VRAM use: over 6 GB per task. Will GPUs with less than 6 GB have issues?

It also seems that some of the system memory can be shared between tasks: running 1 task shows ~17 GB of system memory in use, but running 5 tasks shows about 53 GB. That's as far as I'll push it on my 64 GB machines.
kksplace
Joined: 4 Mar 18
Posts: 53
Credit: 2,815,476,011
RAC: 0
Message 58148 - Posted: 17 Dec 2021, 21:08:46 UTC
Last modified: 17 Dec 2021, 21:09:41 UTC

I got my first Python WU and am a little concerned: after 3.25 hours it is only 10% complete. GPU usage seems to be about what you are all describing, and the same goes for CPU. However, I only have 8 cores / 16 threads, with 6 other CPU work units running (TN-Grid and Rosetta 4.2). Should I be limiting the other work to let these run? (16 GB RAM.)
Keith Myers
Joined: 13 Dec 17
Posts: 1416
Credit: 9,119,446,190
RAC: 614,515
Message 58149 - Posted: 17 Dec 2021, 23:27:43 UTC - in response to Message 58148.  

I don't think BOINC knows how to interpret the estimated run times of these Python tasks. I wouldn't worry about it.

I am over 6.5 hours now on this daily driver with 10% still showing. I bet they never show anything BUT 10% done until they finish.
Ian&Steve C.
Joined: 21 Feb 20
Posts: 1114
Credit: 40,838,535,595
RAC: 4,302,611
Message 58150 - Posted: 18 Dec 2021, 0:09:18 UTC - in response to Message 58149.  

I had the same feeling, Keith.
Ian&Steve C.
Joined: 21 Feb 20
Posts: 1114
Credit: 40,838,535,595
RAC: 4,302,611
Message 58151 - Posted: 18 Dec 2021, 0:14:15 UTC
Last modified: 18 Dec 2021, 0:15:02 UTC

Also, those of us running these should probably prepare for a VERY low credit reward.

This is something I have observed for a long time with beta tasks here. There seems to be some kind of anti-cheat mechanism (or bug) built into BOINC's default credit reward scheme (based on FLOPS): if the calculated credit reward exceeds some value, it gets replaced with a very low default. Since these are so long-running, and beta, I fully expect to see this happen. I've reported this behavior in the past.

It would be a nice surprise if not, but I have a strong feeling it'll happen.
Keith Myers
Joined: 13 Dec 17
Posts: 1416
Credit: 9,119,446,190
RAC: 614,515
Message 58152 - Posted: 18 Dec 2021, 1:14:41 UTC - in response to Message 58151.  

I got one task early on that rewarded more than reasonable credit.
But the last one was way low, though I thought I read a post from @abouh saying he had made a mistake in the credit award algorithm and had corrected it:
https://www.gpugrid.net/forum_thread.php?id=5233&nowrap=true#58124
Ian&Steve C.
Joined: 21 Feb 20
Posts: 1114
Credit: 40,838,535,595
RAC: 4,302,611
Message 58153 - Posted: 18 Dec 2021, 2:36:47 UTC - in response to Message 58152.  
Last modified: 18 Dec 2021, 3:02:51 UTC

That task was short, though. The threshold is around 2 million credit, if I remember correctly.

I posted about it in the team forum almost exactly a year ago. I don't want to post the details publicly because it could encourage cheating, but for a long time the credit reward of the beta tasks has been inconsistent and, IMO, not calculated fairly. I noticed a trend: whenever the credit reward should have been high (extrapolating from the runtime and the expected reward rate), a very low value was granted instead. This only happened on long-running (and hence potentially high-reward) tasks. Since these tasks are so long, I think there's a possibility we'll see that again.
Ian&Steve C.
Joined: 21 Feb 20
Posts: 1114
Credit: 40,838,535,595
RAC: 4,302,611
Message 58154 - Posted: 18 Dec 2021, 4:53:29 UTC - in response to Message 58151.  
Last modified: 18 Dec 2021, 5:23:09 UTC

Confirmed.

Keith, you just reported this one:

http://www.gpugrid.net/result.php?resultid=32731284

That value of 34,722.22 is the exact same "penalty value" I noticed a year ago, for 11 hours of work (clock time) and 28 hours of CPU time. It's interesting that the multithreaded nature of these tasks inflates the CPU time so much.

Extrapolating from your successful run that did not hit the penalty, I'd guess that any task longer than about 2.5 hours is going to hit the penalty value. They really should just use the same credit scheme as acemd3, or assign static credit scaled to the expected runtime, as long as all the tasks are about the same size.

The BOINC documentation confirms my suspicion about what's happening:

https://boinc.berkeley.edu/trac/wiki/CreditNew

Peak FLOP Count

This system uses the Peak-FLOPS-based approach, but addresses its problems in a new way.

When a job J is issued to a host, the scheduler computes peak_flops(J) based on the resources used by the job and their peak speeds.

When a client finishes a job and reports its elapsed time T, we define peak_flop_count(J), or PFC(J) as

PFC(J) = T * peak_flops(J)

The credit for a job J is typically proportional to PFC(J), but is limited and normalized in various ways.

Notes:

PFC(J) is not reliable; cheaters can falsify elapsed time or device attributes.
We use elapsed time instead of actual device time (e.g., CPU time). If a job uses a resource inefficiently (e.g., a CPU job that does lots of disk I/O) PFC() won't reflect this. That's OK. The key thing is that BOINC allocated the device to the job, whether or not the job used it efficiently.
peak_flops(J) may not be accurate; e.g., a GPU job may take more or less CPU than the scheduler thinks it will. Eventually we may switch to a scheme where the client dynamically determines the CPU usage. For now, though, we'll just use the scheduler's estimate.


One-time cheats

For example, claiming a PFC of 1e304.

This is handled by the sanity check mechanism, which grants a default amount of credit and treats the host with suspicion for a while.
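
To make that concrete, here is a rough back-of-the-envelope sketch in Python. The peak-FLOPS figure is a number I picked for illustration, and the ~2 million ceiling is just what I have observed, not a documented BOINC constant:

# Rough sketch only: peak_flops is an assumed value, and the ~2M ceiling is an
# observation from these tasks, not a documented BOINC constant.
COBBLESTONES_PER_GFLOPS_DAY = 200                     # BOINC's credit definition
CREDIT_PER_FLOP = COBBLESTONES_PER_GFLOPS_DAY / (86400 * 1e9)

elapsed_seconds = 11 * 3600           # ~11 h of clock time, as reported above
peak_flops = 30e12                    # assumed peak speed the scheduler counted for the job
pfc = elapsed_seconds * peak_flops    # PFC(J) = T * peak_flops(J)
claimed_credit = pfc * CREDIT_PER_FLOP

print(f"claimed credit ~ {claimed_credit:,.0f}")      # roughly 2.75 million here
# A claim that lands above the sanity-check ceiling (~2 million, going by past
# results) appears to be discarded and replaced with the small default (~34,722).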

Keith Myers
Joined: 13 Dec 17
Posts: 1416
Credit: 9,119,446,190
RAC: 614,515
Message 58155 - Posted: 18 Dec 2021, 6:29:56 UTC

Yep, I saw that. Same credit as before, and now I remember this bit of code being brought up back in the old SETI days.

@abouh needs to be made aware of this and should assign fixed credit, as is done with acemd3.

Aurum
Joined: 12 Jul 17
Posts: 404
Credit: 17,408,899,587
RAC: 2
Message 58157 - Posted: 18 Dec 2021, 16:30:01 UTC
Last modified: 18 Dec 2021, 16:45:56 UTC

I awoke to find 4 PythonGPU WUs running on 3 computers. All had OPN and TN-Grid WUs running, with CPU use flat-lined at 100%. I suspended all other CPU WUs to see what PG was using and got a band mostly contained in the range 20-40%. Then I tried a few scenarios.

1. Rig-44 has an i9-9980XE 18c36t, 32 GB RAM with a 16 GB swap file, an SSD, and 2 x 2080 Tis. GPU use is so low that I switched GPU usage to 0.5 for both OPNG and PG and reread the config files. OPNG WUs started running and have all been reported fine; PG WUs kept running. Then I started adding back gene_pcim WUs. When I exceeded 4 gene_pcim WUs, the CPU use bands changed shape in a similar way to Rig-24, with a tight band around 30% and a number of curves bouncing off 100%.

2. Rig-26 has an E5-2699 22c44t, 32 GB with 16 GB of swap (yet to be used), an SSD, and a 2080 Ti. I've added back 24 gene_pcim WUs and the CPU use band has moved up to 40-80% with no peaks hitting 100%. Next I changed GPU usage to 0.5 for both OPNG and PG and reread the config files. Both seem to be running fine.

3. Rig-24 has an i7-6980X 10c20t, 32 GB with a 16 GB swap file, an SSD, and a 2080 Ti. This one has been running for 17 hours so far, with all other CPU work suspended for the last 2 hours. Its CPU usage graph looks different: there's a tight band oscillating around 20% and a single band oscillating between 60 and 90%. Since PG wants 32 CPUs and this CPU only has 20 threads, there's a constant queue for hyperthreading to feed. I'll let this one run by itself and hope it finishes soon.

Note: TN-Grid usually runs great in Resource Zero Mode, where it rarely sends more than one extra WU. With PG running and app_config reducing the maximum number of running WUs, TN-Grid just keeps sending more WUs; up to 280 now.
Ian&Steve C.
Joined: 21 Feb 20
Posts: 1114
Credit: 40,838,535,595
RAC: 4,302,611
Message 58158 - Posted: 18 Dec 2021, 17:03:32 UTC - in response to Message 58157.  
Last modified: 18 Dec 2021, 17:11:37 UTC

I did something similar with my two 7x GPU systems.

I limited these to 5 tasks running concurrently,

and set up the app_config files so that each GPU runs either 3x Einstein, OR 1x Einstein + 1x GPUGRID, since the resources used by the two are complementary (see the sketch below):

GPUGRID set to 0.6 GPU usage (prevents two from running on the same GPU, since 0.6 + 0.6 > 1.0)
Einstein set to 0.33 GPU usage (allows three on a single GPU, or one GPUGRID + one Einstein, since 0.33 + 0.33 + 0.33 < 1.0 and 0.6 + 0.33 < 1.0)
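
For reference, a minimal app_config.xml along those lines; this is a sketch only, since the app names are guesses ("PythonGPU" is what the scheduler seems to call the Python GPU beta here, and the Einstein app name depends on which of their searches you run), so check client_state.xml for the exact names:

<!-- projects/www.gpugrid.net/app_config.xml (sketch; app name assumed) -->
<app_config>
    <app>
        <name>PythonGPU</name>
        <max_concurrent>5</max_concurrent>
        <gpu_versions>
            <gpu_usage>0.6</gpu_usage>
            <cpu_usage>1.0</cpu_usage>
        </gpu_versions>
    </app>
</app_config>

<!-- projects/einstein.phys.uwm.edu/app_config.xml (sketch; app name assumed) -->
<app_config>
    <app>
        <name>hsgamma_FGRPB1G</name>
        <gpu_versions>
            <gpu_usage>0.33</gpu_usage>
            <cpu_usage>1.0</cpu_usage>
        </gpu_versions>
    </app>
</app_config>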

But running 5 tasks on a system with 64 GB of system memory was too ambitious: RAM use was initially OK, but grew to fill system RAM and swap (default 2 GB).

If these tasks become more common and plentiful, I might consider upgrading these 7x GPU systems to 128 GB of RAM so that they can run tasks on all GPUs at the same time, but I'm not going to bother if the project reduces the system requirements or these only show up infrequently.

The low credit reward per unit time due to the BOINC credit fail-safe default value should be fixed, though. Not many people will have much incentive to test the beta tasks at 10-20x less credit per unit time.

Oh, and these don't checkpoint properly (they checkpoint once, very early on). If you pause a task that has been running for 20 hours, it restarts from that first checkpoint 20 hours back.
abouh
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58161 - Posted: 20 Dec 2021, 10:29:54 UTC
Last modified: 20 Dec 2021, 13:55:24 UTC

Hello everyone,

The batch I sent on Friday was successfully completed, even though some jobs initially failed several times and got reassigned.

I went through all the failed jobs. Here is a summary of the errors I have seen:

1. Multiple CUDA out-of-memory errors. Locally the jobs use 6 GB of GPU memory. It seems difficult to lower the GPU memory requirement for now, so jobs running on GPUs with less memory are expected to fail (see the sketch after this list).
2. Conda environment conflicts with the package pinocchio. I talked about this one in a previous post; it requires resetting the app.
3. 'INTERNAL ERROR: cannot create temporary directory!' - I understand this one could be due to a full disk.
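
A minimal sketch (not the actual task code, and assuming PyTorch as the framework) of how such jobs could fail fast on under-sized GPUs instead of dying mid-run with an out-of-memory error:

import torch

# Sketch only: fail fast when the GPU has less than the ~6 GB these jobs
# currently need, rather than hitting "CUDA out of memory" during training.
MIN_VRAM_BYTES = 6 * 1024**3

if not torch.cuda.is_available():
    raise RuntimeError("No CUDA-capable GPU detected")

total_vram = torch.cuda.get_device_properties(0).total_memory
if total_vram < MIN_VRAM_BYTES:
    raise RuntimeError(
        f"GPU reports {total_vram / 1024**3:.1f} GB of VRAM; about 6 GB is required"
    )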

Also, based on the feedback, I will work on fixing the following things before the next batch:

1. Checkpoints will be created more often during training, so jobs can be restarted without going back to the beginning (a rough sketch follows this list).
2. Credits assigned. The idea is to progressively increase the credits until the credit return is similar to that of the acemd jobs. However, devising a general formula to calculate them is more complex in this case. For now the credit is based on the total number of data samples gathered from the environments and used to train the AI agent, which does not take into account the size of the agent's neural networks. For now we will keep the credits fixed, but it might be necessary to adjust them to solve other problems.
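
For point 1, a rough sketch of the kind of periodic checkpointing meant here (placeholder names and an assumed interval, not the real job's code):

import torch

CHECKPOINT_EVERY = 500  # training updates between snapshots (assumed value)

def maybe_checkpoint(update_idx, model, optimizer, path="checkpoint.pt"):
    """Periodically write a snapshot that a restarted job can resume from."""
    if update_idx % CHECKPOINT_EVERY != 0:
        return
    torch.save(
        {
            "update": update_idx,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        path,
    )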

Finally, I think I was a bit too ambitious about the total amount of training per job. I will break each job into two, so they don't take as long to complete.
Ian&Steve C.
Joined: 21 Feb 20
Posts: 1114
Credit: 40,838,535,595
RAC: 4,302,611
Message 58162 - Posted: 20 Dec 2021, 14:55:18 UTC - in response to Message 58161.  

Thanks!

I did notice that all of mine failed with "exceeded time limit".

It might be a good idea to increase the estimated FLOPs of these tasks so BOINC knows they are large and will run for a long time (see the sketch below).
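
For what it's worth, that knob lives on the server side; something like this in the workunit template (the numbers are made up for illustration):

<!-- Sketch with made-up values: a larger rsc_fpops_est lengthens BOINC's runtime
     estimate, and a generous rsc_fpops_bound avoids the "exceeded time limit" abort. -->
<workunit>
    <rsc_fpops_est>5e16</rsc_fpops_est>
    <rsc_fpops_bound>5e18</rsc_fpops_bound>
</workunit>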
ServicEnginIC
Joined: 24 Sep 10
Posts: 592
Credit: 11,972,186,510
RAC: 998,578
Message 58163 - Posted: 20 Dec 2021, 16:44:12 UTC - in response to Message 58161.  

1. Multiple CUDA out-of-memory errors. Locally the jobs use 6 GB of GPU memory. It seems difficult to lower the GPU memory requirement for now, so jobs running on GPUs with less memory are expected to fail.

I've tried to set preferences on all my hosts with GPUs of less than 6 GB so that they don't receive the Python Runtime (GPU, beta) app:

Run only the selected applications
ACEMD3: yes
Quantum Chemistry (CPU): yes
Quantum Chemistry (CPU, beta): yes
Python Runtime (CPU, beta): yes
Python Runtime (GPU, beta): no

If no work for selected applications is available, accept work from other applications?: no

But I've still received one more Python GPU task on one of them.
This makes me doubt whether GPUGRID preferences are currently working as intended...

Task e1a1-ABOU_rnd_ppod_8-0-1-RND5560_0

RuntimeError: CUDA out of memory.
Erich56
Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 869
Message 58164 - Posted: 20 Dec 2021, 17:12:00 UTC - in response to Message 58163.  

This makes me doubt whether GPUGRID preferences are currently working as intended...

My question is a different one: now that the GPUGRID team is concentrating on Python, will no more ACEMD tasks come?
PDW
Joined: 7 Mar 14
Posts: 18
Credit: 6,575,125,525
RAC: 1,038
Message 58166 - Posted: 20 Dec 2021, 18:21:34 UTC - in response to Message 58163.  

But I've still received one more Python GPU task on one of them.
This makes me doubt whether GPUGRID preferences are currently working as intended...


I had the same problem; you need to set 'Run test applications' to No.
It looks like having that set to Yes overrides any specific application settings you have made.
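
A client-side alternative, if the web preferences keep misbehaving, is an exclude_gpu entry in cc_config.xml (a sketch; "PythonGPU" is the app name these tasks report, but double-check it in client_state.xml):

<!-- cc_config.xml in the BOINC data directory (sketch) -->
<cc_config>
    <options>
        <exclude_gpu>
            <url>http://www.gpugrid.net/</url>
            <app>PythonGPU</app>
        </exclude_gpu>
    </options>
</cc_config>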
ServicEnginIC
Joined: 24 Sep 10
Posts: 592
Credit: 11,972,186,510
RAC: 998,578
Message 58167 - Posted: 20 Dec 2021, 19:26:34 UTC - in response to Message 58166.  

Thanks, I'll try that.
Keith Myers
Joined: 13 Dec 17
Posts: 1416
Credit: 9,119,446,190
RAC: 614,515
Message 58168 - Posted: 20 Dec 2021, 19:53:57 UTC - in response to Message 58164.  

This makes me doubt whether GPUGRID preferences are currently working as intended...

My question is a different one: now that the GPUGRID team is concentrating on Python, will no more ACEMD tasks come?

Hard to say. Toni and Gianni both stated that work would be very limited and infrequent until they can fill the new PhD positions.

But there have been occasional "drive-by" drops of cryptic scout work, along with the occasional standard acemd3 research resend.

It sounds like @abouh is getting ready to drop a larger, debugged batch of Python GPU tasks.
Erich56
Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 869
Message 58169 - Posted: 21 Dec 2021, 5:52:18 UTC - in response to Message 58168.  

It sounds like @abouh is getting ready to drop a larger, debugged batch of Python GPU tasks.

It would be great if they worked on Windows, too :-)