Message boards : Number crunching : Anaconda Python 3 Environment v4.01 failures
Send message Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
Richard Haselgrove wrote: YAY! I got one, and it's running. Installed conda, got through all the setup stages, and it's now running ACEMD and reporting proper progress. Hoping for a successful run, but it'll probably take 5-6 hrs on a 1660 Ti.

Did you make any changes to your system between the last failed task and this one?
Send message Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
I applied the (very few) system updates that were waiting. I don't think anything would have been relevant to running tasks. [I remember a new version of curl, and of SSL, which might have affected external comms, but nothing else caught my eye.]

I spotted one problem, though. The initial estimate was 1 minute 46 seconds. I checked the <rsc_fpops_bound>, and it was a mere 50x the <rsc_fpops_est>. At the speed it was running, it would probably have timed out - so I've given the bound a couple of extra noughts. The ACEMD phase started at 10% done and is counting up in roughly 1% increments. It didn't seem to have checkpointed before I stopped it for the bound change, so I lost about 15 minutes of work.
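For anyone wanting to make the same edit, here is a minimal sketch. It assumes the default Debian data directory /var/lib/boinc-client (the BOINC_DATA variable is my own, for illustration); stop the client first, or the change is overwritten when the client writes its state back.

```shell
# Append two zeros ("a couple of extra noughts") to the integer part of every
# <rsc_fpops_bound> in client_state.xml, i.e. multiply each bound by 100.
# BOINC_DATA is an assumption -- point it at your own data directory.
BOINC_DATA="${BOINC_DATA:-/var/lib/boinc-client}"
STATE="$BOINC_DATA/client_state.xml"
if [ -f "$STATE" ]; then
    sed -i 's|<rsc_fpops_bound>\([0-9]*\)|<rsc_fpops_bound>\100|' "$STATE"
fi
```

Restart the client afterwards and it will pick up the new bound from client_state.xml.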
Send message Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
You probably didn't need to change the bounds to avoid a timeout. My early tasks all estimated sub-1-minute run times, but all ran to completion anyway - even on my 1660 Super, which took 5+ hrs.
Send message Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
Somebody mentioned a timeout, so I thought it was better to be safe than sorry. After all the troubles, it would have been sad to lose a successful test to that (besides being a waste of time - at least the temp directory errors only wasted a couple of seconds!).
Send message Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0
The last change to the BOINC service installation came as a result of https://github.com/BOINC/boinc/issues/3715#issuecomment-727609578. Previously, there was no access at all to /tmp/, which blocked idle detection.

Did you find that adding PrivateTmp=false to the boinc-client.service file changed the TMP access error in the original post? That is odd, as PrivateTmp defaults to false. I could set it to true on one of my working hosts to see what happens.

Richard Haselgrove wrote: YAY! I got one, and it's running. Installed conda, got through all the setup stages, and is now running ACEMD, and reporting proper progress.

Progress!

EDIT: I'm not entirely convinced by that. It's certainly the final destination for the installation, but is it the temp directory as well? Debian-based distros use the /proc/locks directory for lock-file information. The lslocks command will show the current locks, including any BOINC/GPUGRID locks. (cat /proc/locks will also work, but you will need to know the PIDs of the processes listed to interpret the output.) I can't think of anything that would prevent access to this directory structure, as it is critical to OS operations. I am leaning towards a bug in the Experimental Tasks scripts causing this issue (which has since been cleaned up).
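As a quick illustration of inspecting those locks (lslocks ships with util-linux on Debian-based systems; how the BOINC entries are labelled in the output is an assumption, so eyeball the full listing rather than relying on a filter):

```shell
# Raw kernel view of current advisory locks. Each line carries the holder's
# PID, so match it against the wrapper/flock processes (e.g. via pgrep boinc).
cat /proc/locks
# Friendlier view, if util-linux's lslocks is installed: it resolves the PID
# to a command name and the lock to a file path for you.
if command -v lslocks > /dev/null; then lslocks; fi
```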
Send message Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0
Ian&Steve C. wrote: Looks like I have 11 successful tasks, and 2 failures.

gemini8 wrote: Several failures for me as well.

I have seen the same errors on my hosts. It is a bug in those particular work units, as they fail on all hosts. Some hosts report the Disk Limit Exceeded error; some hosts report an AssertionError - assert os.path.exists('output.coor') - (an AssertionError comes from an assert statement inserted into the code by the programmer to trap and report errors).

A host belonging to gemini8 experienced a flock timeout on Work Unit 26277866. All other hosts processing this task reported this error:
os.remove('restart.chk')
FileNotFoundError: [Errno 2] No such file or directory: 'restart.chk'

gemini8, three of your hosts with older CUDA drivers report this:
ACEMD failed: Error loading CUDA module: CUDA_ERROR_INVALID_PTX (218).
Perhaps it's time for a driver update. It won't fix the errors being reported (those are task bugs), but it may prevent other issues developing in the future.
Send message Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
EDIT: Remember that I had a very specific reason for reverting a single recent change. The BOINC systemd file had PrivateTmp=true until very recently, and I had evidence that at least one Anaconda task had successfully passed the installation phase in the past. The systemd file as distributed now has PrivateTmp=false, and my machine had that setting when the error messages appeared. I changed it back, so I now have PrivateTmp=true again. And with that one change (plus an increased time limit for safety), the task has completed successfully.

I have a second machine where I haven't yet made that change. Compare tasks 31712246 (WU created 10 Dec 2020 | 20:54:14 UTC, PrivateTmp=true) and 31712588 (WU created 10 Dec 2020 | 21:05:06 UTC, PrivateTmp=false). The first succeeded; the second failed with 'cannot create temporary directory'. I don't think it was a design change by the project.

I now have a DCF of 87, and I need to go and talk to the Debian/BOINC maintenance team again...
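For reference, a sketch of what that revert amounts to. Running sudo systemctl edit boinc-client.service opens an editor on a drop-in override file; the lines to put in it are just:

```ini
# Override file for boinc-client.service (a systemd drop-in).
# Restores the old sandbox behaviour: give the service its own private /tmp.
[Service]
PrivateTmp=true
```

Then run sudo systemctl daemon-reload and restart the client so the sandbox change takes effect. The service name assumes the Debian boinc-client package; adjust if your distribution names it differently.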
Send message Joined: 3 Jul 16 Posts: 31 Credit: 2,248,809,169 RAC: 0
rod4x4 wrote: gemini8

Thanks for your concern! The systems run with the 'standard' proprietary Nvidia Debian driver. I wasn't able to get a later one working, so I keep them this way. I could switch the systems to Ubuntu, which ships later drivers, but I like Debian better and thus don't plan to do so.
- - - - - - - - - -
Greetings, Jens
Send message Joined: 4 Mar 18 Posts: 53 Credit: 2,815,476,011 RAC: 0
I'm not a technical guru like you all here, but if it helps: my system had one Anaconda failure back on 4 Dec, and has now completed 4 in the last several days, with one more in progress. Let me know if I can provide any information that helps.
ServicEnginIC Send message Joined: 24 Sep 10 Posts: 592 Credit: 11,972,186,510 RAC: 1,187
Richard Haselgrove: your suggested action at Message 55967 seems to have worked for me also. Thank you very much!

I had many failed Anaconda Python tasks, all reporting errors like:
INTERNAL ERROR: cannot create temporary directory!

I executed the stated command:
sudo systemctl edit boinc-client.service
and added the suggested lines to the override file:
[Service]
PrivateTmp=true

After saving the changes and rebooting, the mentioned error vanished. Task 2p95010002r00-RAIMIS_NNPMM-0-1-RND3897_0 has succeeded on my Host #482132, which had previously failed 28 Anaconda Python tasks one after another... I've applied the same remedy to several other hosts, and three more tasks seem to be progressing normally. You have hit the mark.
Send message Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0
Richard Haselgrove wrote: The systemd file as distributed now has PrivateTmp=false, and my machine had that setting when the error messages appeared. I changed it back, so I now have PrivateTmp=true again. And with that one change (plus an increased time limit for safety), the task has completed successfully.

On my hosts that are not reporting this issue, PrivateTmp is not set (so it defaults to false):
https://gpugrid.net/show_host_detail.php?hostid=483378
https://gpugrid.net/show_host_detail.php?hostid=544286
https://gpugrid.net/show_host_detail.php?hostid=483296
I do have ProtectHome=true set on all hosts.

One thing I did note on 31712588 is that the tmp directory creation error is reported under a different PID to the wrapper. All transactions should be on the same PID.
21:48:18 (21729): wrapper: running /usr/bin/flock (/var/lib/boinc-client/projects/www.gpugrid.net/miniconda.lock -c "/bin/bash ./miniconda-installer.sh -b -u -p /var/lib/boinc-client/projects/www.gpugrid.net/miniconda &&

My concern is that different outcomes are being seen with similar settings of PrivateTmp=false. I would be interested in the Debian/BOINC maintenance team's feedback.
Send message Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0
Richard Haselgrove: good feedback! So, does ProtectHome=true imply PrivateTmp=true (or vice versa)?
Send message Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 731
Yes, I too would like to hear feedback from the developers - in my case, about what can be done with the project DCF of 93 on all my hosts. That makes GPUGrid completely monopolize the gpus and prevents my other gpu projects from running.

The standard new acemd application tasks are the ones causing the week-long estimated completion times; the new beta anaconda/python tasks have reasonable estimates. The problem is the disparate individual DCF values for each application - a 100:1 ratio of python to acemd3 - while BOINC can only handle one DCF value for a project with multiple sub-apps.

My current solution is manual intervention: I stop allowing new work and finish off what is in the cache so that the other gpu projects can work on their caches. Then I let the other gpu projects run for a day or so, until the REC values of the projects come close to balancing out. But this is an undesirable solution. I can only think that stopping one or the other sub-app and exclusively running a single app is the only permanent solution, once the DCF normalizes against a single application.
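To see why a DCF that high locks everything else out, here is a back-of-envelope sketch of the client's (simplified) runtime estimate. The fpops and device-speed numbers are purely illustrative assumptions, not values from any real task:

```shell
# estimate ~ rsc_fpops_est / device_flops * DCF  (simplified client model)
awk 'BEGIN {
  est_fpops = 1.0e15   # assumed <rsc_fpops_est> for a task
  flops     = 5.0e12   # assumed device speed: 5 TFLOPS
  dcf       = 93       # the duration correction factor from this post
  secs      = est_fpops / flops * dcf
  printf "estimate: %.0f s (vs %.0f s at DCF=1)\n", secs, est_fpops / flops
}'
```

With these numbers it prints "estimate: 18600 s (vs 200 s at DCF=1)" - a 200-second task advertised as over five hours, which is what pushes the scheduler to clear the GPUGRID cache before anything else.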
ServicEnginIC Send message Joined: 24 Sep 10 Posts: 592 Credit: 11,972,186,510 RAC: 1,187
rod4x4 wrote: On my hosts that are not reporting this issue, PrivateTmp is not set (so defaults to false)

All three of your successful hosts are running BOINC version 7.9.3. That version defaulted to PrivateTmp=true; the newer 7.16.xx versions changed this default to PrivateTmp=false.
Send message Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0
ServicEnginIC wrote: All three of your successful hosts are running BOINC version 7.9.3. That version defaulted to PrivateTmp=true.

Good pickup - thanks for the clarification. I missed that! It answers all my concerns: I had noted that the security directories created by PrivateTmp existed on my system, but could not work out why they existed.
Send message Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
Thanks, guys. I've written up the problem for upstream at https://github.com/BOINC/boinc/issues/4125, and I'm hoping to get further ideas from there. At the moment, the situation seems to be that you can either use idle detection to keep BOINC out of the way when working, or run Python tasks, but not both together. That seems unsatisfactory. There are a couple of suggestions so far, one being to add /var/tmp and /tmp to ReadWritePaths=. Both sound insecure and are not recommended for general use, but try them at your own risk. I'll be doing that.

Responding to other comments:

rod4x4 wrote: the tmp directory creation error is reported on a different PID to the wrapper.
That's the nature of the wrapper. Its job is to spawn child processes, and it is the child processes which will be trying to create temporary folders, so their PIDs will differ from the parent's.

rod4x4 wrote: does ProtectHome=true imply PrivateTmp=true (or vice versa)?
I don't know. I've got a lot of experience with BOINC, but I'm a complete n00b with Linux - learning on the job, by using analogies from other areas of computing. We'll find out as we go along.

Keith Myers wrote: what can be done with the project DCF of 93 on all my hosts.
That's an easy problem, totally within the scope of the project admins here. We just have to ask TONI to have a word with RAIMIS, and ask him/her to increase the template <rsc_fpops_est> to a more realistic value. Then the single DCF value controlling all runtime estimates throughout this project can settle back to a value close to 1.

ServicEnginIC wrote: Newer versions 7.16.xx changed this criterium to PrivateTmp=false by default.
The suggestion for making this change was first made on 15 November 2020. Only people who use a quickly updating Linux repo - like Gianfranco Costamagna's PPA - will have encountered it so far. But the mainstream stable repos will get to it eventually.
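For anyone who does want to try the ReadWritePaths= suggestion, a drop-in sketch (with the same caveat as above - this punches holes in the sandbox, and the service name assumes the Debian boinc-client package). Put this in an override via sudo systemctl edit boinc-client.service:

```ini
# Drop-in for boinc-client.service: grant the sandboxed service explicit
# write access to the shared tmp directories instead of a private /tmp.
[Service]
ReadWritePaths=/tmp /var/tmp
```

Follow it with sudo systemctl daemon-reload and a service restart; systemctl show boinc-client.service -p PrivateTmp -p ProtectHome will then display the effective sandbox settings.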
Send message Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0
Richard Haselgrove wrote: Thanks guys. I've written up the problem for upstream at https://github.com/BOINC/boinc/issues/4125, and I'm hoping to get further ideas from there. At the moment, the situation seems to be that you can either use idle detection to keep BOINC out of the way when working, or run Python tasks, but not both together. That seems unsatisfactory.

Thanks, Richard Haselgrove. I will follow that issue with interest. The last suggestion there, by BryanQuigley, has merit:
Either way, the project has the BOINC writable directory - why not use that? You can make your own tmp directory there if I'm not mistaken (haven't tried it)
His suggestion would mean any temp files are protected by the BOINC security system already in place, bypass the need for PrivateTmp=true, and remain accessible to BOINC process calls. One issue is that a clean-up process for the temp folder would need to be controlled by BOINC to prevent runaway file storage.
Send message Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0
His suggestion would mean any temp files are protected by the BOINC security system already in place, bypass the need for PrivateTmp=true, and remain accessible to BOINC process calls.

Which circles back to how flock uses the /tmp directory. Perhaps BryanQuigley can clear that point up.
Send message Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0
His suggestion would mean any temp files are protected by the BOINC security system already in place, bypass the need for PrivateTmp=true, and remain accessible to BOINC process calls.

I am starting to think the error is not related to flock's temp-folder handling, nor is PrivateTmp the cause; rather, they highlight an error in the GPUGRID work unit script. I have noticed that 9 files are written to the /tmp folder for each experimental task. These files are readable and contain nvidia compiler data. The work unit script should be placing these files in the BOINC folder structure; it is these file writes that are causing the error.
21:48:18 (21729): wrapper: running /usr/bin/flock (/var/lib/boinc-client/projects/www.gpugrid.net/miniconda.lock -c "/bin/bash ./miniconda-installer.sh -b -u -p /var/lib/boinc-client/projects/www.gpugrid.net/miniconda &&

Once this script error is corrected, PrivateTmp can be reverted to the preferred setting. It should also be noted that the task is not cleaning up these files on completion. This will lead to disk-space exhaustion if not controlled.
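A sketch for spotting (and, once verified, removing) those leftovers. The tmpxft_ name pattern is an assumption here - nvcc uses that prefix for its scratch files, but check what the files on your own host are actually called before deleting anything:

```shell
# List compiler scratch files left in /tmp by the tasks, older than a day.
# Dry run first: -print only. Swap in -delete once the listing looks right.
# (On a multi-user box, also check they belong to the boinc user.)
find /tmp -maxdepth 1 -name 'tmpxft_*' -mtime +1 -print
```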
Send message Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
I think it's more likely to be the /bin/bash which actually needs to write temporary files. I've now managed (with some difficulty) to separate the 569 lines of actual script from the 90 MB of payload. There's
export TMP_BACKUP="$TMP"
export TMP=$PREFIX/install_tmp
but no sign of the TMPDIR variable mentioned in https://linux.die.net/man/1/bash. More than that is above my pay grade, I'm afraid.

You can run the script in your home directory if you want to know more about it. (Like the End User License Agreement, the Notice of Third Party Software Licenses, and the Export; Cryptography Notice!)
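A small sketch of that distinction: bash and tools like mktemp consult TMPDIR, not TMP, so exporting TMP alone (as the installer does) may not redirect everything. The prefix path below is just an illustration, not the real installer location:

```shell
# Mirror the installer's redirection, then add the TMPDIR the script omits.
PREFIX="$HOME/miniconda-test"          # assumed install prefix
mkdir -p "$PREFIX/install_tmp"
export TMP_BACKUP="${TMP:-}"           # what the installer saves
export TMP="$PREFIX/install_tmp"       # what the installer sets
export TMPDIR="$PREFIX/install_tmp"    # what bash/mktemp actually consult
mktemp                                 # now created inside install_tmp
```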
©2025 Universitat Pompeu Fabra