Experimental Python tasks (beta)

mmonnin · Joined: 2 Jul 16 · Posts: 339 · Credit: 7,990,341,558 · RAC: 103
Message 58483 - Posted: 10 Mar 2022, 11:47:57 UTC · Last modified: 10 Mar 2022, 11:49:40 UTC
I had a W10 PC without tar.exe. I noticed the error in a task and copied the exe to the system32 folder. This morning I noticed a task that had been running for 6.5 hours with no progress and no CPU usage.
https://www.gpugrid.net/result.php?resultid=32778132

Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 0
Message 58484 - Posted: 10 Mar 2022, 11:50:21 UTC · Last modified: 10 Mar 2022, 11:56:21 UTC
Damn. Where did that go wrong?
application C:\Windows\System32\tar.exe missing
Anyone else who wants to try this experiment can try https://www.7-zip.org/ - it looks as if the license would even allow the project to distribute it.
Edit - I edited the job.xml file while the previous task was finishing, and then stopped BOINC to increase the disk limit. On restart, BOINC must have noticed that the file had changed, and it downloaded a fresh copy. Near miss.

Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 0
Message 58485 - Posted: 10 Mar 2022, 13:42:43 UTC · Last modified: 10 Mar 2022, 14:19:37 UTC
application "C:\Program Files\7-Zip\7z" missing
Make that "C:\Program Files\7-Zip\7z.exe"
Or maybe not.
application "C:\Program Files\7-Zip\7z.exe" missing
Isn't the damn wrapper clever enough to remove the quotes I put in there to protect the space in "Program Files"?

abouh · Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
Message 58486 - Posted: 10 Mar 2022, 15:02:08 UTC
Using tar.exe in W10 and W11 seems to work now. However, it is true that:
a) Some machines do not have tar.exe. My initial idea was that older versions of Windows could download tar.exe, but it seems that it does not work.
b) The C:\Windows\System32\tar.exe path is hardcoded. I understand that ideally we should add to PATH all possible paths where this executable could be found, right?

Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 0
Message 58487 - Posted: 10 Mar 2022, 15:40:34 UTC - in response to Message 58486.
On this particular Windows 7 machine, I have:
PATH=
C:\Windows\system32;
C:\Windows;
C:\Windows\System32\Wbem;
C:\Windows\System32\WindowsPowerShell\v1.0\;;
C:\Program Files\Process Lasso\;
- I've split that into separate lines for clarity, but it's one single environment variable that has been added to by various installers over the years.
For a native Windows system component, I wouldn't have thought a path was necessary at all - Windows should handle all that. That's what path variables are for. But maybe the wrapper app is so dumb that it just throws the exact string it parses from job.xml at a file_open function? I'll have a look at the code.
I've got two thoughts left: try Program [space] Files without any quotes; or stick a copy of 7z.exe in Windows/system32 (although mine's a 64-bit version...), and call it explicitly from there. I don't think it'll have anywhere to hide from that...
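
The PATH lookup discussed here can be sketched in Python, the language of the task script. This is only an illustration of resolving an extractor through PATH instead of a hardcoded location; the helper name find_extractor and the candidate list are hypothetical, and nothing here is taken from the project's actual code.

```python
import shutil


def find_extractor(candidates=("tar", "7z"), search_path=None):
    """Return the full path of the first candidate found on PATH, else None.

    `candidates` and `search_path` are illustrative parameters; with
    search_path=None, shutil.which searches the PATH environment variable.
    """
    for name in candidates:
        resolved = shutil.which(name, path=search_path)
        if resolved is not None:
            return resolved
    return None


if __name__ == "__main__":
    extractor = find_extractor()
    if extractor is None:
        print("no extractor found on PATH")
    else:
        print("would use:", extractor)
```

On Windows, shutil.which also honours PATHEXT, so "tar" can resolve to tar.exe when it is present in a PATH directory.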

Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 0
Message 58488 - Posted: 10 Mar 2022, 17:53:57 UTC
Yay! That's what I wanted to see:
17:49:09 (21360): wrapper: running C:\Program Files\7-Zip\7z.exe (x windows_x86_64__cuda1131.tar.gz)
7-Zip [64] 15.14 : Copyright (c) 1999-2015 Igor Pavlov : 2015-12-31
Scanning the drive for archives:
1 file, 2666937516 bytes (2544 MiB)
Extracting archive: windows_x86_64__cuda1131.tar.gz
And I've got v1.04 in my sandbox...

Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 0
Message 58489 - Posted: 10 Mar 2022, 18:27:31 UTC
But not much more than that. After half an hour, it's got as far as:
Everything is Ok
Files: 13722
Size: 5270733721
Compressed: 5281648640
18:02:00 (21360): C:\Program Files\7-Zip\7z.exe exited; CPU time 6.567642
18:02:00 (21360): wrapper: running python.exe (run.py)
WARNING: The script shortuuid.exe is installed in 'D:\BOINCdata\slots\5\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
WARNING: The script normalizer.exe is installed in 'D:\BOINCdata\slots\5\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
WARNING: The scripts wandb.exe and wb.exe are installed in 'D:\BOINCdata\slots\5\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.
We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.
pytest 0.0.0 requires atomicwrites>=1.0, which is not installed.
pytest 0.0.0 requires attrs>=17.4.0, which is not installed.
pytest 0.0.0 requires iniconfig, which is not installed.
pytest 0.0.0 requires packaging, which is not installed.
pytest 0.0.0 requires py>=1.8.2, which is not installed.
pytest 0.0.0 requires toml, which is not installed.
aiohttp 3.7.4.post0 requires attrs>=17.3.0, which is not installed.
WARNING: The scripts pyrsa-decrypt.exe, pyrsa-encrypt.exe, pyrsa-keygen.exe, pyrsa-priv2pub.exe, pyrsa-sign.exe and pyrsa-verify.exe are installed in 'D:\BOINCdata\slots\5\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
WARNING: The script jsonschema.exe is installed in 'D:\BOINCdata\slots\5\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
WARNING: The script gpustat.exe is installed in 'D:\BOINCdata\slots\5\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
WARNING: The scripts ray-operator.exe, ray.exe, rllib.exe, serve.exe and tune.exe are installed in 'D:\BOINCdata\slots\5\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.
We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.
pytest 0.0.0 requires atomicwrites>=1.0, which is not installed.
pytest 0.0.0 requires iniconfig, which is not installed.
pytest 0.0.0 requires py>=1.8.2, which is not installed.
pytest 0.0.0 requires toml, which is not installed.
WARNING: The script f2py.exe is installed in 'D:\BOINCdata\slots\5\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
wandb: W&B API key is configured (use `wandb login --relogin` to force relogin)
wandb: Appending key for api.wandb.ai to your netrc file: D:\BOINCdata\slots\5/.netrc
wandb: Currently logged in as: rl-team-upf (use `wandb login --relogin` to force relogin)
wandb: Tracking run with wandb version 0.12.11
wandb: Run data is saved locally in D:\BOINCdata\slots\5\wandb\run-20220310_181709-mxbeog6d
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run MontezumaAgent_e1a12
wandb: View project at https://wandb.ai/rl-team-upf/MontezumaRevenge_rnd_ppo_cnn_nophi_baseline_beta
wandb: View run at https://wandb.ai/rl-team-upf/MontezumaRevenge_rnd_ppo_cnn_nophi_baseline_beta/runs/mxbeog6d
and doesn't seem to be getting any further. I'll see if it's moved on after dinner, and might abort it if it hasn't. Task is 32782603

Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 0
Message 58490 - Posted: 10 Mar 2022, 18:54:03 UTC
Then, lots of iterations of:
OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "D:\BOINCdata\slots\5\lib\site-packages\torch\lib\cudnn_cnn_train64_8.dll" or one of its dependencies.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "D:\BOINCdata\slots\5\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "D:\BOINCdata\slots\5\lib\multiprocessing\spawn.py", line 114, in _main
    prepare(preparation_data)
  File "D:\BOINCdata\slots\5\lib\multiprocessing\spawn.py", line 225, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "D:\BOINCdata\slots\5\lib\multiprocessing\spawn.py", line 277, in _fixup_main_from_path
    run_name="__mp_main__")
  File "D:\BOINCdata\slots\5\lib\runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "D:\BOINCdata\slots\5\lib\runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "D:\BOINCdata\slots\5\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "D:\BOINCdata\slots\5\run.py", line 23, in <module>
    import torch
  File "D:\BOINCdata\slots\5\lib\site-packages\torch\__init__.py", line 126, in <module>
    raise err
I've increased the paging file ten-fold, but that requires a reboot - and the task didn't survive. Trying one last time, then it's 'No new Tasks' for tonight.

Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 0
Message 58491 - Posted: 10 Mar 2022, 19:05:20 UTC
BTW, yes - the wrapper really is that dumb.
https://github.com/BOINC/boinc/blob/master/samples/wrapper/wrapper.cpp#L727
It just plods along, from beginning to end, copying the string byte by byte. The only thing it considers is which way the slashes are pointing.
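
To make the consequence concrete, here is a small Python sketch of the behaviour as described in this post. It is not the wrapper's C++ code: the function name naive_copy_path is hypothetical, and the direction of the slash conversion (forward to backward, as on Windows) is an assumption. The point is only that a byte-by-byte copy keeps any quote characters, so a quoted path no longer names an existing file.

```python
def naive_copy_path(raw: str) -> str:
    """Copy a path character by character, only flipping '/' to '\\'.

    Mimics the behaviour described in the thread (an assumption, not the
    real wrapper source): quotes and spaces are passed through untouched.
    """
    out = []
    for ch in raw:
        out.append("\\" if ch == "/" else ch)
    return "".join(out)


if __name__ == "__main__":
    quoted = '"C:/Program Files/7-Zip/7z.exe"'
    unquoted = "C:/Program Files/7-Zip/7z.exe"
    # The quoted form still starts with a quote character after copying,
    # so it cannot match a real file name on disk.
    print(naive_copy_path(quoted))
    print(naive_copy_path(unquoted))
```

This matches what was seen earlier in the thread: the quoted path was reported missing, while the unquoted path with a literal space ran.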

Bedrich Hajek · Joined: 28 Mar 09 · Posts: 490 · Credit: 11,739,145,728 · RAC: 2,991
Message 58492 - Posted: 11 Mar 2022, 0:16:42 UTC
I managed to complete 2 of these WUs successfully. They still need a lot of work. GPU usage is low, and they make the BOINC Manager slow, sluggish and unresponsive.
https://www.gpugrid.net/result.php?resultid=32784274
https://www.gpugrid.net/result.php?resultid=32783598
They were a pain to finish!!!!! And what for, only 3000 points for 882 days' worth of work per WU!!!!!!

mmonnin · Joined: 2 Jul 16 · Posts: 339 · Credit: 7,990,341,558 · RAC: 103
Message 58493 - Posted: 11 Mar 2022, 0:48:05 UTC - in response to Message 58483.
> I had a W10 PC without tar.exe. I noticed the error in a task and copied the exe to system32 folder. This morning I noticed a task running for 6.5 hours with no progress, no CPU usage. https://www.gpugrid.net/result.php?resultid=32778132
Disabling Python beta on this W10 PC: it has lost another 11+ hours.
https://www.gpugrid.net/result.php?resultid=32780319

abouh · Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
Message 58494 - Posted: 11 Mar 2022, 8:49:55 UTC - in response to Message 58490. · Last modified: 11 Mar 2022, 8:59:43 UTC
Yes, I have seen this error on some other machines that could unpack the file with tar.exe, though only on a few of them. So it is an issue in the Python script. Today I will be looking into it. It does not happen in Linux with the same code.

abouh · Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
Message 58495 - Posted: 11 Mar 2022, 8:58:52 UTC - in response to Message 58492.
Yes, regarding the workload, I have been testing the tasks with low GPU/CPU usage. I was interested in checking whether the conda environment was successfully unpacked and the Python script was able to complete a few iterations. The workload will be increased as soon as this part works, as will the points.
As for the completely wrong duration estimate, I will look into what can be done. I am not sure how BOINC estimates it. Could someone please confirm whether it is also wrong in Linux or whether it is only a Windows issue?

abouh · Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
Message 58496 - Posted: 11 Mar 2022, 9:15:30 UTC
Could the astronomical time estimates simply be due to a wrong configuration of the rsc_fpops_est parameter?

Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 0
Message 58497 - Posted: 11 Mar 2022, 9:28:05 UTC - in response to Message 58494. · Last modified: 11 Mar 2022, 9:46:21 UTC
> Yes, I have seen this error in some other machines that could unpack the file with tar.exe. In just a few of them. So it is an issue in the python script. Today I will be looking into it. It does not happen in linux with the same code.
I was a bit suspicious about the 'paging file too small' error - I didn't even think Windows applications could get information about what the current setting was. I'd suggest correlating the machines with this error with their reported physical memory. Mine is 'only' 8 GB - small by modern standards.
It looks like there may be some useful clues in
https://discuss.pytorch.org/t/winerror-1455-the-paging-file-is-too-small-for-this-operation-to-complete/131233

Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 0
Message 58498 - Posted: 11 Mar 2022, 9:34:16 UTC - in response to Message 58496.
> Could the astronomical time estimations be simply due to a wrong configuration of the rsc_fpops_est parameter?
That's certainly a part of it, but it's a very long, complicated, and historical story. It will affect any and all platforms, not just Windows, and other data as well as rsc_fpops_est. And it's also related to historical decisions by both BOINC and GPUGrid.
I'll try and write up some bedtime reading for you, but don't waste time on it in the meantime - there won't be an easy 'magic bullet' to fix it.

abouh · Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
Message 58499 - Posted: 11 Mar 2022, 10:21:10 UTC - in response to Message 58497.
Yes, I was looking at the same link. It seems related to limited memory. I might try running the suggested script before the job, which seems to mitigate the problem.

Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 0
Message 58500 - Posted: 11 Mar 2022, 13:50:18 UTC - in response to Message 58498.

Runtime estimation – and where it goes wrong

The estimates we see on our home computers are made up of three elements. They are:
The SIZE of a task – rsc_fpops_est
The SPEED of the device that’s calculating the result
One or more correction tweaks, designed to smooth off the rough edges.

The original system

In the early days, all BOINC projects ran on CPUs, and almost all the CPUs in use were single-core. The speed of that CPU was measured by a derivation of the Whetstone benchmark: this was originally designed to measure hardware speeds only, and deliberately excluded software optimisation. For scientific research, careful optimisation is a valid technique (provided it isn’t done at the expense of accuracy).

There was a general (but unspoken) assumption that projects would be running a single type of research task, using a single application. So the rough edges were smoothed by something called DCF (duration correction factor). That kept track of that single application, running on that single CPU, and gently adjusted it until the estimates were pretty good. It worked. The adjustments were calculated by, and stored on, the local computer.

The revised system

Starting in 2008, BOINC was adapted to support applications that ran on GPUs – GPUGrid and SETI@home first, others followed. There never was any attempt to benchmark GPUs, so the theoretical baseline speed of a GPU application was taken to be a figure derived from the hardware architecture, notably the number of shaders and the clock speed. This was known as “peak FLOPS”, or – to some of us – “marketing FLOPS”. No way has any programmer ever been able to write a scientific program which uses every clock cycle of every shader, with no overhead for synchronisation or data transfer. Whatever.

At the same time, projects kept their CPU apps running, and many developed multiple research streams using different apps. A single-valued DCF couldn’t smooth off all the different rough edges at the same time. There’s nothing in principle to stop the BOINC client keeping track of multiple application+device combinations, and such a system was in fact developed by a volunteer. But it was rejected by David Anderson in Berkeley, who devised his own system of Runtime Estimation, keeping track of the necessary tweaks on the project server. This was intended to replace client-based DCFs entirely, although the old system was retained for historical compatibility.

The implications for GPUGrid

As I think we all know, GPUGrid uses rsc_fpops_est, but I don’t think it’s realised quite how fundamental it is to the whole inverted pyramid. If tasks run much faster than their declared fpops, the only conclusion that BOINC can draw is the application speed has suddenly become much faster, and it tries to adapt accordingly.

GPUGrid has kept both of the adjustment methods active. If you look at any of our computer details, you will see that it contains a link to show application details: the smoothed average of all our successful tasks with each application. The critical one here is APR, or ‘average processing rate’. That’s the device+application speed, in GFlops. But on the computer details page, you’ll also see the DCF listed. Nominally, this should be 1, replaced by APR – but here, usually it isn’t.

The implications?

APR works adequately for long term, steady, production work. But it fails during periods of rapid change and testing.

1) APR is disregarded entirely when a new application version is activated on the server. It starts again from scratch, and the initial estimates are – questionable. In fact, I don’t have a clue what speed is assumed for the first few tasks allocated.

2) It kicks in in two stages. First, when 100 tasks have been completed for the whole ensemble, and again when each individual computer reaches 11 completed tasks. Note that ‘completed’ here means a normal end-of-run plus a validated result. Some app versions never achieve that!

Different GPUs run at very different speeds, and the first 100 tasks returned normally come back from the fastest cards. That skews the average speed. In the worst case, the first hundred back can set a standard which lesser cards can’t attain – so they are stopped by ‘run time exceeded’, can never achieve the necessary 11 validations to set their own, lower, bar, and are excluded for good.

The same can happen if deliberately short test tasks are put through early on, without an adjusted rsc_fpops_est: again, an unfeasibly fast target is set, and no-one can complete full-length tasks.

Sorry – I’ve been called out this afternoon, so I’ve dashed that off much quicker than I intended. I’ll leave it there for now, and we can all discuss the way forward later.
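
The three elements listed at the start of this post (size, speed, correction) can be put into a small arithmetic sketch. This is a simplified illustration in Python, not BOINC's actual estimator: the function names are hypothetical, the formula is just size divided by speed times a correction factor, and all numbers are made up.

```python
def estimated_runtime_seconds(rsc_fpops_est, device_flops, correction=1.0):
    """Estimate = SIZE / SPEED, scaled by a correction tweak (DCF-like)."""
    if device_flops <= 0:
        raise ValueError("device_flops must be positive")
    return rsc_fpops_est / device_flops * correction


def apparent_flops(rsc_fpops_est, elapsed_seconds):
    """Speed implied by a finished task: declared size / observed runtime."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return rsc_fpops_est / elapsed_seconds


if __name__ == "__main__":
    declared_size = 1e15   # hypothetical rsc_fpops_est
    device_speed = 1e10    # hypothetical device+application speed in FLOPS

    # Estimate for a device believed to run at device_speed.
    print(estimated_runtime_seconds(declared_size, device_speed))

    # A short test task that declares the full size but finishes in 100 s
    # makes the same device look far faster than it really is.
    print(apparent_flops(declared_size, 100.0))
```

Under these assumptions, a batch of short test tasks with an unadjusted rsc_fpops_est inflates the apparent speed, which is the "unfeasibly fast target" described above.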

abouh · Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
Message 58501 - Posted: 11 Mar 2022, 21:25:14 UTC - in response to Message 58500.
Thank you very much for the explanation, Richard, very helpful actually.
I have been using short test tasks to catch bugs in the early stages of the job. That might have caused problems, although I guess we can adjust rsc_fpops_est and reset statistics later. The idea is to have long term, steady, production work after the tests.
However, I don't fully understand how that could cause estimates of hundreds of days. In any case, the most reliable information for the host is then the progress percentage, which should be correct.
I remember the ‘run time exceeded’ error was happening previously in the app and we had to adjust the rsc_fpops_est parameter. Maybe a temporary solution for the time estimation would be to set rsc_fpops_est for the PythonGPUbeta app to the same value we have in the PythonGPU app?
The idea is that PythonGPUbeta eventually becomes the sole Python app, running the same Linux jobs PythonGPU is running now plus Windows jobs.

Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0
Message 58502 - Posted: 11 Mar 2022, 21:58:48 UTC - in response to Message 58501. · Last modified: 11 Mar 2022, 22:01:23 UTC
> Maybe a temporary solution for the time estimation would be to set rsc_fpops_est for the PythonGPUbeta app to the same value we have in the PythonGPU app?
This approach is wrong.
The rsc_fpops_est should be set according to the actual batch of workunits, not the app.
As test batches are much shorter than production batches, they should have a much lower rsc_fpops_est value, even though the same app processes them.
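
One simple way to follow this advice is to scale the value by the expected runtime of the batch. The Python sketch below assumes that runtime is roughly proportional to the work done; the helper name scaled_fpops_est and every number in it are hypothetical, not values used by the project.

```python
def scaled_fpops_est(production_fpops_est, production_seconds, batch_seconds):
    """Scale rsc_fpops_est by the ratio of expected batch runtime to
    production runtime, measured on the same reference hardware.

    Assumes runtime is proportional to work done, which is a simplification.
    """
    if production_seconds <= 0 or batch_seconds <= 0:
        raise ValueError("runtimes must be positive")
    return production_fpops_est * (batch_seconds / production_seconds)


if __name__ == "__main__":
    # Hypothetical: production tasks take 8 hours, test tasks 30 minutes.
    test_value = scaled_fpops_est(1.6e18, 8 * 3600, 30 * 60)
    print(f"rsc_fpops_est for the test batch: {test_value:.3e}")
```

With a per-batch value like this, short test tasks no longer make the device look much faster than it is.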
