Experimental Python tasks (beta) - task description
| Author | Message |
|---|---|
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
> Maybe a temporary solution for the time estimation would be to set rsc_fpops_est for the PythonGPUbeta app to the same value we have in the PythonGPU app?

> This approach is wrong.

Correct. Next time I see a really gross (multi-year) runtime estimate, I'll dig out the exact figures, show you the working-out, and try to analyse where they've come from.

In the meantime, we're working through a glut of ACEMD3 tasks, and here's how they arrive:

12/03/2022 08:23:29 | GPUGRID | [sched_op] NVIDIA GPU work request: 11906.64 seconds; 0.00 devices

So, I'm asking for a few hours of work, and getting several days. Or so BOINC says. This is Windows host 45218, which is currently showing "Task duration correction factor 13.714405". (It was higher a few minutes ago, when that work was fetched - over 13.84.)

I forgot to mention yesterday that in the first phase of BOINC's life, both your server and our clients took account of DCF, so the 'request' and 'estimated' figures would have been much closer. But when the APR code was added in 2010, the DCF code was removed from the servers. So your server knows what my DCF is, but it doesn't use that information.

So the server probably assessed that each task would last about 11,055 seconds. That's why it added the second task to the allocation: it thought the first one didn't quite fill my request for 11,906 seconds. In reality, this is a short-running batch - although not marked as such - and the last one finished in 4,289 seconds. That's why DCF is falling after every task, though slowly. |
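The allocation described above can be sketched numerically. The figures come from the post; the fill loop is only a simplified sketch of the scheduler's behaviour (the real server-side logic is more involved), illustrating why a server that ignores DCF sends two tasks for an 11,906-second request:

```python
# Figures quoted in the post; the loop is a simplified sketch of how the
# BOINC scheduler (which ignores the client's DCF) fills a work request.
request = 11906.64      # seconds of work the client asked for
server_est = 11055.0    # server's per-task estimate: rsc_fpops_est / flops
dcf = 13.714405         # the client's Task Duration Correction Factor

tasks, filled = 0, 0.0
while filled < request:      # keep allocating until the request is covered
    tasks += 1
    filled += server_est

# The client, however, multiplies by DCF before displaying an estimate,
# turning each ~3-hour task into roughly 42 hours on screen.
client_est = server_est * dcf
```

With these numbers the loop allocates 2 tasks, and each one displays as roughly 151,600 seconds - which is how "a few hours of work" becomes "several days".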
|
Joined: 2 Jul 16 Posts: 338 Credit: 7,987,341,558 RAC: 213
|
Yes, I have seen this error on some other machines that could unpack the file with tar.exe - but only in a few of them, so it is an issue in the python script. I will be looking into it today. It does not happen on Linux with the same code.

Having tar.exe wasn't enough: I later saw a popup in W10 saying archiveint.dll was missing.

I had two python tasks on Linux error out in ~30 min with

15:33:14 (26820): task /usr/bin/flock reached time limit 1800
application ./gpugridpy/bin/python missing

That PC has Python 2.7.17 and 3.6.8 installed. |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
> Next time I see a really gross (multi-year) runtime estimate, I'll dig out the exact figures, show you the working-out, and try to analyse where they've come from.

Caught one! Task e1a5-ABOU_pythonGPU_beta2_test16-0-1-RND7314_1

Host is 43404, Windows 7. It has two GPUs, and GPUGrid is set to run on the other one, not as shown. The important bits are:

CUDA: NVIDIA GPU 0: NVIDIA GeForce GTX 1660 Ti (driver version 472.12, CUDA version 11.4, compute capability 7.5, 4096MB, 3032MB available, 5622 GFLOPS peak)

DCF is 8.882342, and the task shows up with a multi-year estimate. Why? This is what I got from the server, in the sched_reply file:

<app_version>
    <app_name>PythonGPUbeta</app_name>
    <version_num>104</version_num>
    ...
    <flops>47361236228.648697</flops>
    ...
<workunit>
    <rsc_fpops_est>1000000000000000000.000000</rsc_fpops_est>
    ...

1,000,000,000,000,000,000 fpops, at 47 GFLOPS, would take 21,114,313 seconds, or 244 days. Multiply in the DCF, and you get the 2,170 days shown.

According to the application details page, this host has completed one 'Python apps for GPU hosts beta 1.04 windows_x86_64 (cuda1131)' task (new apps always go right down to the bottom of that page). It recorded an APR of 1,279,539, which is bonkers the other way - these are GFlops, remember. It must have been task 32782603, which completed in 781 seconds.

So, lessons to be learned:

1) A shortened test task, described as running for the full-run number of fpops, will register an astronomical speed. If anyone completes 11 tasks like that, that speed will get locked into the system for that host, and will cause the 'runtime limit exceeded' error.

2) BOINC is extremely bad - stupidly bad - at generating a first guess for the speed of a 'new application, new host' combination. It's actually taken precisely one-tenth of the speed of the acemd3 application on this machine, which might be taken as a "safe working assumption" for the time being. I'll try to check that in the server code.

Oooh - I've let it run, and BOINC has remembered how I set up 7-Zip decompression last week. That's nice. |
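The working-out above is easy to reproduce from the two XML values and the host's DCF; this sketch just repeats the post's arithmetic:

```python
# Reproducing the runtime-estimate arithmetic from the sched_reply values.
rsc_fpops_est = 1e18             # from <rsc_fpops_est>
flops = 47361236228.648697       # from <flops>, ~47 GFLOPS
dcf = 8.882342                   # the host's duration correction factor

seconds = rsc_fpops_est / flops  # ~21,114,313 s, the server-side estimate
days = seconds / 86400           # ~244 days
shown = days * dcf               # ~2,170 days, as displayed by the client
```

The same arithmetic run backwards against the 781-second test task gives the absurd APR: 1e18 fpops booked against 781 s is ~1.28e15 flops/s, i.e. the 1,279,539 GFlops recorded on the application details page.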
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
But it hasn't remembered the increased disk limit. Never mind - nor did I. |
|
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0
|
Right now, the PythonGPU app works by dividing the job into 2 subtasks:

1 - first, installing conda and creating the conda environment.
2 - second, running the python script.

The error

15:33:14 (26820): task /usr/bin/flock reached time limit 1800

means that after 1800 seconds, the conda environment had not yet been created for some reason. This could be because the conda dependencies could not be downloaded in time, or because the machine was running the installation process more slowly than expected. We set this time limit of 30 mins because in theory it is plenty of time to create the environment.

However, in the new version (the current PythonGPUbeta), we send the whole conda environment compressed and simply unpack it on the machine. Therefore this error, which indeed happens every now and then, should disappear. |
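The two-phase pattern with a setup deadline can be sketched as follows. This is an illustration only, not the actual GPUGRID wrapper code (which uses /usr/bin/flock to serialise the setup step); the conda command line is a hypothetical stand-in:

```python
# Sketch of "run a sub-task under a time ceiling" - the wrapper's equivalent
# of 'task /usr/bin/flock reached time limit 1800'. Illustration only.
import subprocess

SETUP_LIMIT = 1800  # seconds allowed for conda environment creation

def run_with_limit(cmd, limit):
    """Run cmd; return True on success, False if it exceeds `limit` seconds."""
    try:
        subprocess.run(cmd, timeout=limit, check=True)
        return True
    except subprocess.TimeoutExpired:
        return False

# phase 1: create the environment under the ceiling (hypothetical command)
# ok = run_with_limit(["conda", "env", "create", "-f", "environment.yml"],
#                     SETUP_LIMIT)
# phase 2 (only if phase 1 succeeded): run the actual python script
# subprocess.run(["./gpugridpy/bin/python", "run.py"], check=True)
```

Shipping the environment pre-built and unpacking it, as v1.05+ does, removes the network-dependent download from inside that 1800-second window, which is why the error should disappear.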
|
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0
|
Ok, so my plan was to run at least a few more batches of test jobs, then start the real tasks. I understand now that if some machines have by then run several test tasks, that will create an estimation problem.

Does resetting the credit statistics help? Would it be better to create a new app for the real jobs once the testing is finished, so that the statistics are consistent and, in the long term, BOINC estimates the durations better? |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
My gut feeling is that it would be better to deploy the finished app (after all testing seems to be complete) as a new app_version. We would have to go through the training process for APR one last time, but then it should settle down.

I've seen the reference to resetting the credit statistics before, but only some years ago while scanning the documentation. I've never actually seen the console screen you use to control a BOINC server, let alone operated one for real, so I don't know whether you can limit the reset to a single app_version, or whether you have to nuke the entire project - best not to find out the hard way. You're right, of course - the whole runtime estimation (APR) structure is intimately bound up with the CreditNew tools, also introduced in 2010. So the credit reset is likely to include an APR reset - but I'd hold that back for now.

I see you've started sending out v1.05 betas. One has arrived on one of my Linux machines, and again, the estimated speed is exactly one-tenth of the acemd3 speed - with extreme precision, to the last decimal place:

<flops>707593666701.291382</flops>
<flops>70759366670.129135</flops>

That must be deliberate. |
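The one-tenth relationship between the two flops values is exact to the limits of floating-point precision, which is a quick check to run:

```python
# The two <flops> values quoted from the sched_reply files.
acemd3_flops = 707593666701.291382   # estimated speed of the acemd3 app
beta_flops = 70759366670.129135      # initial estimate for the new beta app

ratio = acemd3_flops / beta_flops    # 10.0, to floating-point precision
```

A ratio landing on 10.0 to the last representable digit is strong evidence of a deliberate divide-by-ten default for an untrained app_version, rather than any measured value.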
Retvari Zoltan · Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
> Would it be better to create a new app for real jobs once the testing is finished?

Based on the last few days' discussion here, I've understood the purpose of the former short and long queues from GPUGrid's perspective. By separating the tasks into two queues based on their length, the project's staff didn't have to bother setting the rsc_fpops_est value for each and every batch (note that the same app was assigned to each queue). The two queues used different (but constant across batches) rsc_fpops_est values, so BOINC's runtime estimation could not get so far off in either queue as to trigger the "won't finish on time" or "run time exceeded" situations. Perhaps this practice should be put into operation again, even at a finer level of granularity (S, M, L tasks, or even XS and XL tasks). |
|
Joined: 28 Mar 09 Posts: 490 Credit: 11,731,645,728 RAC: 57
|
I am getting a "Disk usage limit exceeded" error: https://www.gpugrid.net/result.php?resultid=32808038 I do have 400 GB reserved for BOINC. |
|
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0
|
I believe the "Disk usage limit exceeded" error is not related to the machine's resources; it is defined by an adjustable parameter of the app. The conda environment plus all the other files might be over this limit. I will review the current value; we might have to increase it. Thanks for pointing it out! |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
After a day out running a long acemd3 task, there's good news and bad news.

The good news: runtime estimates have reached sanity. The magic numbers are now

<flops>336636264786015.625000</flops>
<rsc_fpops_est>1000000000000000000.000000</rsc_fpops_est>

That ends up with an estimated runtime of about 9 hours - but at the cost of a speed estimate of 336,636 GFlops. That's way beyond a marketing department's dream. Either somebody has done open-heart surgery on the project's database (unlikely and unwise), or BOINC now has enough completed tasks for v1.05 to start taking notice of the reported values.

The bad news: I'm getting errors again.

ModuleNotFoundError: No module named 'gym' |
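Running the same arithmetic as before on these new magic numbers shows where the "about 9 hours" comes from - the raw server-side estimate is under an hour, and the client's DCF supplies the rest (the exact DCF in effect isn't quoted in the post, so the value below is an assumption):

```python
# New sched_reply values after BOINC started trusting the reported speeds.
flops = 336636264786015.625      # ~336,636 GFLOPS - the inflated speed estimate
rsc_fpops_est = 1e18

raw_estimate = rsc_fpops_est / flops   # ~2,971 s server-side estimate
# The ~9 h shown in the client is raw_estimate * DCF; a DCF around 11
# (assumed - the post doesn't quote the current value) gives ~32,700 s.
```

So a task that reads as "9 hours" is really a sub-hour server estimate dressed up by a DCF still inflated from earlier mis-estimated batches.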
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
v1.06 is released and working (very short test tasks only). Watch out for:

- Another 2.46 GB download
- Estimates are back up to multiple years |
|
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0
|
> ModuleNotFoundError: No module named 'gym'

The latest version should fix this error. |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
I have task 32836015 running - showing 50% after 30 minutes. That looks like it's giving the maths a good work-out.

Edit - actually, it's not doing much at all. You should be on NVidia device 1 - but it's cool, low power, 0% usage. No checkpoint, nothing written to stderr.txt in an hour and a half. |
|
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0
|
For now I am just trying to see the jobs finish - I am not even trying to make them run for a long time. The jobs should not even need checkpoints; they should last less than 15 mins.

So weird - some other jobs from the same batch, on Windows machines, managed to finish: for example those with result ids 32835825, 32836020 or 32835934. I don't understand why it works on some Windows machines and fails on others, sometimes without complaining about anything. And it works fine locally on my Windows laptop. Does Windows have trouble with multiprocessing?

I need to add many more checkpoints to the scripts, I guess. Pretty much after every line of code... |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
Err, this particular task is running on Linux - specifically, Mint v20.3. It ran the first short task OK at lunchtime - see Python apps for GPU hosts beta on host 508381. I think I'd better abort it while we think. |
|
Joined: 4 Mar 18 Posts: 53 Credit: 2,815,476,011 RAC: 0
|
This task https://www.gpugrid.net/result.php?resultid=32841161 has been running for nearly 26 hours now. It is the first Python beta task I have received that appears to be working. Green-With-Envy shows intermittent low activity on my 1080 GPU, and BoincTasks shows 100% CPU usage. It checkpointed only once, several minutes after it started, and has shown 50% complete ever since.

Should I let this task continue or abort it? (Linux Mint, 1080 driver is 510.47.03) |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
Sounds just like mine, including the 100% CPU usage - that'll be the wrapper app, rather than the main Python app. One thing I didn't try, but only thought about afterwards, is to suspend the task for a moment and then allow it to run again. That has re-vitalised some apps at other projects, but is not guaranteed to improve things: it might even cause it to fail. But if it goes back to 0% or 50%, and doesn't move further, it's probably not going anywhere. I'd abort it at that point. |
|
Joined: 4 Mar 18 Posts: 53 Credit: 2,815,476,011 RAC: 0
|
Well, after a suspend and allowing it to run, it went back to its checkpoint and has shown no progress since. I will abort it. Keep on learning.... |
|
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0
|
Ok, so it gets stuck at 50%. I will be reviewing it today - thanks for the feedback. It also seems to fail in most Windows cases without reporting any error. |
©2025 Universitat Pompeu Fabra