Experimental Python tasks (beta) - task description

STARBASEn
Message 58936 - Posted: 17 Jun 2022, 16:55:57 UTC

BOINC would have to completely rewrite that part of the code. The fact that these tasks run on both the CPU and GPU makes them impossible for BOINC to decipher.

The closest mechanism is the MT or multi-task category, but that only knows about CPU tasks, which run solely on the CPU.


I think BOINC uses the CPU exclusively in its Estimated Time to Completion algorithm for all WUs, including those using a GPU, which makes sense since the job cannot complete until both processors' work is done. Observing GPU work with E@H, it appears that the GPU finishes first and the CPU continues for a while to wrap the job up for return, and those BOINC ETCs are fairly accurate.

It is the multi-threaded WUs, like these Python jobs, that appear to throw a monkey wrench into the ETC. From my observations, the Python WUs use 32 processes regardless of the actual system configuration: I have two 16-core Ryzens and my old 8-core FX-8350, and each of them runs 32 processes per WU. It seems to me that the existing algorithm could be reused in a modular fashion: first estimate the MT WU as if it were a single-threaded CPU job, then compare the number of processes the WU requests with those available on the system and perform a simple division, which should produce a more accurate result for MT WUs as well. I don't know for sure, just speculating, but I do have the BOINC source code and might take a look and see if I can find the ETC stuff. Might be interesting.
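
Roughly what I am picturing, as a back-of-the-envelope sketch (the function and the figures below are made up for illustration, not actual BOINC code):

# Sketch of the idea above -- not actual BOINC code, just the arithmetic.
def estimate_mt_runtime(rsc_fpops_est, flops_per_core, requested_procs, available_cores):
    # Step 1: estimate the WU as if it were a single-threaded CPU job.
    single_thread_seconds = rsc_fpops_est / flops_per_core
    # Step 2: the WU can only use as many processes as the host has cores.
    usable_procs = min(requested_procs, available_cores)
    # Step 3: simple division (ignores scaling losses and the GPU portion).
    return single_thread_seconds / usable_procs

# Example: a WU asking for 32 processes on my 8-core FX-8350 would get 1/8th
# of the single-thread estimate, not 1/32nd.
print(estimate_mt_runtime(1.0e18, 5.0e9, 32, 8))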
ID: 58936
Keith Myers
Message 58937 - Posted: 17 Jun 2022, 17:57:47 UTC - in response to Message 58936.  

The server code for determining the ETC for MT tasks also has to account for task scheduling.

If it were adjusted as you suggest, any time a Python task ran on the host, the server would proclaim it severely overcommitted and prevent any other work from running, or worse, would actually prevent the Python task itself from running, since it blocks work from other projects in accordance with resource share and the round-robin scheduling algorithm in the server and client.

It is already a mess with MT work; I believe it would be even worse when accounting for these mixed-platform CPU-GPU Python tasks.

But go ahead and look at the code. You should also raise an issue on BOINC's GitHub repository so that the problem is logged and progress can be tracked.
ID: 58937
STARBASEn
Message 58943 - Posted: 18 Jun 2022, 18:52:40 UTC

You make a good point regarding the server-side issues. Perhaps the projects themselves, if they don't already, could submit the desired resources so the server can compare them with those available on the clients, similar to submitting in-house cluster jobs. I also agree that it is probably best to go through BOINC's GitHub and file a request for a potential fix, but I also want to see their ETC algorithms, both server and client, just out of curiosity. Nice, interesting discussion.
ID: 58943
Keith Myers
Message 58944 - Posted: 18 Jun 2022, 20:04:43 UTC

You need to review the code in the /client/work_fetch.cpp module and any of the old closed issues pertaining to use of max_concurrent statements in app_config.xml.
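
For reference, a max_concurrent statement is set per application in app_config.xml along these lines (a minimal example with an arbitrary limit, not a recommendation):

<app_config>
   <app>
      <name>PythonGPU</name>
      <max_concurrent>1</max_concurrent>
   </app>
</app_config>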

I have posted many conversations on this issue, collaborated with David Anderson and Richard Haselgrove to understand it, and have seen at least six attempts to fix it once and for all.

It is a very complicated part of the code. You might also want to review the many client-emulator bug-fix runs done on this topic.

https://boinc.berkeley.edu/sim_web.php

The meat of the issue was in PRs #2918, #3001, #3065, #3076, #4117 and #4592.

https://github.com/BOINC/boinc/pull/2918

Focus on the round-robin scheduling part of the code.
ID: 58944
STARBASEn
Message 58949 - Posted: 20 Jun 2022, 1:14:28 UTC

Thank you Keith, much appreciated background and starting points.
ID: 58949
Erich56

Message 58961 - Posted: 26 Jun 2022, 6:44:33 UTC

I need advice with regard to running Python tasks on one of my Windows machines:

On one of the Windows systems, with a GTX 980 Ti, an Intel i7-4930K CPU and 32 GB RAM, Python runs well.
GPU memory usage is almost constant at 2,679 MB, system memory usage varies between ~1,300 MB and ~5,000 MB, and task runtime is between ~510,000 and ~530,000 secs.

The other Windows system has two RTX 3070s, an Intel i9-10900KF CPU and 64 GB RAM, of which 32 GB are used for a Ramdisk, leaving 32 GB of system RAM.
When I try to download Python tasks, the BOINC event log says that some 22 GB more RAM are needed.
How come?
From what I see on the other machine, Python uses between 1.3 GB and 5 GB of RAM.

What can I do to get the machine with the two RTX 3070s to download and crunch Python tasks?
ID: 58961
Richard Haselgrove

Message 58962 - Posted: 26 Jun 2022, 7:09:04 UTC - in response to Message 58961.  

BOINC event log says that some 22GB more RAM are needed.

Could you post the exact text of the log message and a few lines either side for context? We might be able to decode it.
ID: 58962
Erich56

Message 58963 - Posted: 26 Jun 2022, 7:30:14 UTC - in response to Message 58962.  

BOINC event log says that some 22GB more RAM are needed.

Could you post the exact text of the log message and a few lines either side for context? We might be able to decode it.

here is the text of the log message:

26.06.2022 09:20:35 | GPUGRID | Requesting new tasks for CPU and NVIDIA GPU
26.06.2022 09:20:37 | GPUGRID | Scheduler request completed: got 0 new tasks
26.06.2022 09:20:37 | GPUGRID | No tasks sent
26.06.2022 09:20:37 | GPUGRID | No tasks are available for ACEMD 3: molecular dynamics simulations for GPUs
26.06.2022 09:20:37 | GPUGRID | Nachricht vom Server: Python apps for GPU hosts needs 22396.05MB more disk space. You currently have 10982.55 MB available and it needs 33378.60 MB.
26.06.2022 09:20:37 | GPUGRID | Project requested delay of 31 seconds


The reason it says at this point that I have 10,982 MB available is that I currently have some LHC projects running which use some RAM.
However, it also says I need 33,378 MB of RAM, so my 32 GB of RAM would not be enough anyway (even though the other machine, which also has 32 GB of RAM, has no problem downloading and crunching Python).

What surprises me is that the project requests so much free RAM, although in operation it uses only between 1.3 and 5 GB.
ID: 58963
Richard Haselgrove

Message 58964 - Posted: 26 Jun 2022, 8:06:41 UTC - in response to Message 58963.  

26.06.2022 09:20:37 | GPUGRID | Nachricht vom Server: Python apps for GPU hosts needs 22396.05MB more disk space. You currently have 10982.55 MB available and it needs 33378.60 MB.

Disk, not RAM. Probably one or other of your disk settings is blocking it.
ID: 58964
Erich56

Message 58965 - Posted: 26 Jun 2022, 8:21:42 UTC - in response to Message 58964.  

26.06.2022 09:20:37 | GPUGRID | Nachricht vom Server: Python apps for GPU hosts needs 22396.05MB more disk space. You currently have 10982.55 MB available and it needs 33378.60 MB.

Disk, not RAM. Probably one or other of your disk settings is blocking it.

Oh sorry, you are perfectly right. My mistake, how dumb :-(

So with my 32 GB Ramdisk it does not work, when it says that it needs 33,378 MB.

What I could do, theoretically, is shift BOINC from the Ramdisk to the 1 GB SSD. However, the reason I installed BOINC on the Ramdisk is that the LHC ATLAS tasks which I crunch permanently have enormous disk usage, and I don't want ATLAS to wear out the SSD too early.

I guess that there might be ways to install a second instance of BOINC on the SSD - I tried this on another PC years ago, but somehow I did not get it done properly :-(
ID: 58965
Richard Haselgrove

Message 58966 - Posted: 26 Jun 2022, 9:32:13 UTC - in response to Message 58965.  

You'll need to decide which copy of BOINC is going to be your 'primary' installation (default settings, autorun stuff in the registry, etc.), and which is going to be the 'secondary'.

The primary one can be exactly what is set up by the installer, with one change. The easiest way is to add the line

<allow_multiple_clients>1</allow_multiple_clients>

to the options section of cc_config.xml (or set the value to 1 if the line is already present). That needs a client restart if BOINC's already running.
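
For reference, a minimal cc_config.xml with just that option would look something like this (keep whatever other options or log flags you already have):

<cc_config>
   <options>
      <allow_multiple_clients>1</allow_multiple_clients>
   </options>
</cc_config>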

Then, these two batch files work for me. Adapt program and data locations as needed.

To run the client:
D:\BOINC\rh_boinc_test --allow_multiple_clients --allow_remote_gui_rpc --redirectio --detach_console --gui_rpc_port 31418 --dir D:\BOINCdata2\

To run a Manager to control the second client:
start D:\BOINC\boincmgr.exe /m /n 127.0.0.1 /g 31418 /p password

Note that I've set this up to run test clients alongside my main working installation - you can probably ignore that bit.
ID: 58966
Jim1348

Message 58968 - Posted: 30 Jun 2022, 15:56:37 UTC - in response to Message 58844.  

We have a time estimation problem, discussed previously in the thread. As Keith mentioned, the real walltime calculation should be much less than reported.

It would be very helpful if you could let us know if that is the case. In particular, if you are getting 75000 credits per jobs means the jobs are getting 25% extra credits for returning fast.

Are you still in need of that? My first Python task ran for 12 hours 55 minutes according to BoincTasks, but the website reported 156,269.60 seconds (over 43 hours). It got 75,000 credits.
http://www.gpugrid.net/results.php?hostid=593715
ID: 58968
abouh

Message 58969 - Posted: 1 Jul 2022, 13:10:32 UTC - in response to Message 58968.  
Last modified: 1 Jul 2022, 13:11:38 UTC

Thanks for the feedback, Jim1348! It is useful for us to confirm that jobs run in a reasonable time despite the wrong estimation issue. Maybe that can be solved somehow in the future. At least it seems it did not estimate dozens of days, as I have seen on other occasions.
ID: 58969
Ian&Steve C.

Message 58970 - Posted: 1 Jul 2022, 13:33:55 UTC - in response to Message 58969.  

It's because the app is reporting CPU time instead of runtime. Since it uses so many threads, it adds up the time spent on all of them: 2 threads working for 1 hr of wall-clock time would be reported as 2 hrs of CPU time. You need to track wall-clock time, and the app seems to have that capability, since it reports start and stop timestamps in the stderr.txt file.
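
In Python terms the difference is roughly this (a quick sketch, not the app's actual reporting code):

import os, time

t_start = time.time()          # wall-clock start, like the timestamp in stderr.txt
# ... the worker threads/processes do their thing here ...
t_end = time.time()            # wall-clock end

elapsed = t_end - t_start      # what ought to be reported as runtime
t = os.times()
cpu_total = t.user + t.system + t.children_user + t.children_system
# cpu_total adds up CPU seconds across every thread and child process,
# so with 32 busy workers it can be many times larger than elapsed.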

Also, the credit reward is static, and it should be a more dynamic scheme like the one for the acemd3 tasks. Look at Jim's tasks: there are tasks with 2,000 to 150,000 seconds reported, all with the same 75,000 credit reward. That's a good reward for the 2,000 s runs, but painfully low for the longer ones (the majority).
ID: 58970
Richard Haselgrove

Message 58971 - Posted: 1 Jul 2022, 13:55:40 UTC - in response to Message 58970.  

There are two separate problems with timing.

There's the display of CPU time instead of elapsed time on the website - that's purely cosmetic, as we report the correct elapsed time for the finished tasks.

And there's the estimation of anticipated runtime when a task is first issued, before it's even started to run. I would have thought that would have started to correct itself by now: with the steady supply of work recently, we will have got well past all the trigger points for the server averaging algorithms.

Next time I see a task waiting to run, I'll trap the numbers and try to make sense of them.
ID: 58971
Ian&Steve C.

Message 58972 - Posted: 1 Jul 2022, 14:56:15 UTC - in response to Message 58971.  



There's the display of CPU time instead of elapsed time on the website - that's purely cosmetic, as we report the correct elapsed time for the finished tasks.



That may be true, for NOW. However, if they move to a dynamic credit scheme (as they should) that awards credit based on flops and runtime (like ACEMD3 does), then the runtime will no longer be just cosmetic.
ID: 58972
Richard Haselgrove

Message 58973 - Posted: 1 Jul 2022, 17:27:17 UTC - in response to Message 58971.  

OK, I got one on host 508381. Initial estimate is 752d 05:26:18, task is 32940037

Size:
<rsc_fpops_est>1000000000000000000.000000</rsc_fpops_est>

Speed:
<flops>707646935000.048218</flops>

DCF:
<duration_correction_factor>45.991658</duration_correction_factor>

App_ver:
<app_name>PythonGPU</app_name>
<version_num>403</version_num>

Host details:
Number of tasks completed 80
Average processing rate 13025.358204684

Calculated time estimate (size / speed):
1413134.079355548 [seconds]
16.355718511 [days - raw]
752.226612105 [days - adjusted by DCF]

So my client is doing the calculations right.

The glaring difference is between flops and APR.

Re-doing the {size / speed} calculation with APR gives
76773.320494203 [seconds]
21.32592236 [hours]

which is a little high for this machine, but not bad. The last 'normal length' tasks ran in about 14 hours.
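
For anyone who wants to repeat the sums, the whole calculation is just the following (assuming <flops> is in FLOPS and the APR is reported in GFLOPS):

rsc_fpops_est = 1e18
flops         = 707646935000.048218
dcf           = 45.991658
apr_gflops    = 13025.358204684

print(rsc_fpops_est / flops)                      # 1,413,134 s  (16.36 days raw)
print(rsc_fpops_est / flops * dcf / 86400)        # ~752.2 days, adjusted by DCF
print(rsc_fpops_est / (apr_gflops * 1e9) / 3600)  # ~21.3 hours, using APR instead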

So, the question is: why is the server tracking APR, but not using it in the <app_version> blocks sent to our machines?
ID: 58973
Richard Haselgrove

Message 58974 - Posted: 2 Jul 2022, 9:04:06 UTC

Yesterday's task is just in the final stages - it'll finish after about 13 hours - and the next is ready to start. So here are the figures for the next in the cycle.

Initial estimate: 737d 06:19:25
<flops>707646935000.048218</flops>
<duration_correction_factor>45.076802</duration_correction_factor>
Average processing rate 13072.709605774

So, APR and DCF have both made a tiny movement in the right direction, but flops has remained stubbornly unchanged. And that's the one that controls the initial estimates.

(actually, a little short one crept in between the two I'm watching, so it's two cycles - but that doesn't change the principle)
ID: 58974
roundup

Message 58975 - Posted: 3 Jul 2022, 10:17:13 UTC

The credit per runtime for cuda1131 really looks strange sometimes:

Task 27246643: 2 Jul 2022, 8:13:32 UTC / 3 Jul 2022, 8:20:56 UTC
Runtimes 445,161.60 / 445,161.60; credits 62,500.00

Compare to this one:
Task 27246622: 2 Jul 2022, 7:55:03 UTC / 2 Jul 2022, 8:05:39 UTC
Runtimes 2,770.92 / 2,770.92; credits 75,000.00
ID: 58975
abouh

Message 58977 - Posted: 4 Jul 2022, 13:54:08 UTC - in response to Message 58970.  

Yes, you are right about that. There are 2 types of experiments I run now:

a) Normal experiments have tasks with a fixed target number of agent-environment interactions to process. The tasks finish once this number of interactions is reached. All tasks require the same amount of compute, so it makes sense (at least to me) to reward them with the same amount of credit, even if some tasks are completed in less time due to faster hardware.

b) I have recently introduced an "early stopping" mechanism in some experiments. The upper bound is the same as in the other type of experiment: a fixed number of agent-environment interactions. However, if the agent discovers interesting results before that, it returns early so this information can be shared with the other agents in the population of AI agents. Which agents will finish earlier, and by how much, is random, so it would indeed be interesting to adjust the credits dynamically. I will ask the acemd3 people how to do it.
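
Conceptually, the two types look something like this (a simplified sketch with made-up method names, not the actual task code):

MAX_INTERACTIONS = 10_000_000   # illustrative upper bound, the same for both types

def run_task(agent, env, early_stopping=False):
    interactions = 0
    while interactions < MAX_INTERACTIONS:
        interactions += agent.step(env)   # one batch of agent-environment interactions
        # Type (b): return early if the agent found something worth sharing
        # with the rest of the population.
        if early_stopping and agent.found_interesting_result():
            break
    return agent.results()
# Type (a) tasks always run to MAX_INTERACTIONS, so they all need the same compute;
# type (b) tasks may return at a random earlier point.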
ID: 58977