Message boards : Number crunching : Anaconda Python 3 Environment v4.01 failures
Send message Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
Richard Haselgrove wrote: YAY! I got one, and it's running. Installed conda, got through all the setup stages, and it's now running ACEMD and reporting proper progress. Hoping for a successful run, but it'll probably take 5-6 hrs on a 1660 Ti.

Did you make any changes to your system between the last failed task and this one?
Send message Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
I applied the (very few) system updates that were waiting. I don't think anything would have been relevant to running tasks. [I remember a new version of curl, and of SSL, which might have affected external comms, but nothing else caught my eye.]

I spotted one problem, though. The initial estimate was 1 minute 46 seconds. I checked the <rsc_fpops_bound>, and it was a mere 50x the <rsc_fpops_est>. At the speed it was running, it would probably have timed out - so I've given the bound a couple of extra noughts. The ACEMD phase started at 10% done and is counting up in roughly 1% increments. It didn't seem to have checkpointed before I stopped it for the bound change, so I lost about 15 minutes of work.
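For anyone wanting to make the same edit, here is a minimal sketch. It assumes the default Debian data directory /var/lib/boinc-client (the BOINC_DATA variable is my own, for illustration); stop the client first, or the change is overwritten when the client writes its state back.

```shell
# Append two zeros ("a couple of extra noughts") to the integer part of every
# <rsc_fpops_bound> in client_state.xml, i.e. multiply each bound by 100.
# BOINC_DATA is an assumption -- point it at your own data directory.
BOINC_DATA="${BOINC_DATA:-/var/lib/boinc-client}"
STATE="$BOINC_DATA/client_state.xml"
if [ -f "$STATE" ]; then
    sed -i 's|<rsc_fpops_bound>\([0-9]*\)|<rsc_fpops_bound>\100|' "$STATE"
fi
```

Restart the client afterwards and it will pick up the new bound from client_state.xml.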
Send message Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
You probably didn't need to change the bounds to avoid a timeout. My early tasks all estimated sub-1-minute run times, but all ran to completion anyway - even on my 1660 Super, which took 5+ hrs.
Send message Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
Somebody mentioned a timeout, so I thought it was better to be safe than sorry. After all the troubles, it would have been sad to lose a successful test to that (besides being a waste of time - at least the temp directory errors only wasted a couple of seconds!).
Send message Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0
The last change to the BOINC service installation came as a result of https://github.com/BOINC/boinc/issues/3715#issuecomment-727609578. Previously, there was no access at all to /tmp/, which blocked idle detection.

Did you find that adding PrivateTmp=false to the boinc-client.service file changed the TMP access error in the original post? That is odd, as PrivateTmp defaults to false. I could set it to true on one of my working hosts to see what happens.

Richard Haselgrove wrote: YAY! I got one, and it's running. Installed conda, got through all the setup stages, and is now running ACEMD, and reporting proper progress.

Progress!

EDIT: I'm not entirely convinced by that. It's certainly the final destination for the installation, but is it the temp directory as well? Debian-based distros use the /proc/locks directory for lock-file information. The lslocks command will show the current locks, including any BOINC/GPUGRID locks. (cat /proc/locks will also work, but you will need to know the PIDs of the processes listed to interpret the output.) I can't think of anything that would prevent access to this directory structure, as it is critical to OS operations. I am leaning towards a bug in the Experimental Tasks scripts causing this issue (which has since been cleaned up).
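As a quick illustration of inspecting those locks (lslocks ships with util-linux on Debian-based systems; how the BOINC entries are labelled in the output is an assumption, so eyeball the full listing rather than relying on a filter):

```shell
# Raw kernel view of current advisory locks. Each line carries the holder's
# PID, so match it against the wrapper/flock processes (e.g. via pgrep boinc).
cat /proc/locks
# Friendlier view, if util-linux's lslocks is installed: it resolves the PID
# to a command name and the lock to a file path for you.
if command -v lslocks > /dev/null; then lslocks; fi
```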
Send message Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0
Ian&Steve C. wrote: Looks like I have 11 successful tasks, and 2 failures.

gemini8 wrote: Several failures for me as well.

I have seen the same errors on my hosts. It is a bug in those particular work units, as they fail on all hosts. Some hosts report the Disk Limit Exceeded error; some hosts report an AssertionError - assert os.path.exists('output.coor') - (an AssertionError comes from an assert statement inserted into the code by the programmer to trap and report errors).

A host belonging to gemini8 experienced a flock timeout on Work Unit 26277866. All other hosts processing this task reported this error:
os.remove('restart.chk')
FileNotFoundError: [Errno 2] No such file or directory: 'restart.chk'

gemini8, three of your hosts with older CUDA drivers report this:
ACEMD failed: Error loading CUDA module: CUDA_ERROR_INVALID_PTX (218).
Perhaps it's time for a driver update. It won't fix the errors being reported (those are task bugs), but it may prevent other issues developing in the future.
Send message Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
EDIT: Remember that I had a very specific reason for reverting a single recent change. The BOINC systemd file had PrivateTmp=true until very recently, and I had evidence that at least one Anaconda task had successfully passed the installation phase in the past. The systemd file as distributed now has PrivateTmp=false, and my machine had that setting when the error messages appeared. I changed it back, so I now have PrivateTmp=true again. And with that one change (plus an increased time limit for safety), the task has completed successfully.

I have a second machine where I haven't yet made that change. Compare tasks 31712246 (WU created 10 Dec 2020 | 20:54:14 UTC, PrivateTmp=true) and 31712588 (WU created 10 Dec 2020 | 21:05:06 UTC, PrivateTmp=false). The first succeeded; the second failed with 'cannot create temporary directory'. I don't think it was a design change by the project.

I now have a DCF of 87, and I need to go and talk to the Debian/BOINC maintenance team again...
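For reference, a sketch of what that revert amounts to. Running sudo systemctl edit boinc-client.service opens an editor on a drop-in override file; the lines to put in it are just:

```ini
# Override file for boinc-client.service (a systemd drop-in).
# Restores the old sandbox behaviour: give the service its own private /tmp.
[Service]
PrivateTmp=true
```

Then run sudo systemctl daemon-reload and restart the client so the sandbox change takes effect. The service name assumes the Debian boinc-client package; adjust if your distribution names it differently.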
Send message Joined: 3 Jul 16 Posts: 31 Credit: 2,248,809,169 RAC: 0
rod4x4 wrote: gemini8

Thanks for your concern! The systems run with the 'standard' proprietary Nvidia Debian driver. I wasn't able to get a later one working, so I keep them this way. I could switch the systems to Ubuntu, which ships later drivers, but I like Debian better and thus don't plan to do so.
- - - - - - - - - -
Greetings, Jens
Send message Joined: 4 Mar 18 Posts: 53 Credit: 2,815,476,011 RAC: 0
I'm not a technical guru like you all here, but if it helps: my system had one Anaconda failure back on 4 Dec, and has now completed 4 in the last several days, with one more in progress. Let me know if I can provide any information that helps.
ServicEnginIC Send message Joined: 24 Sep 10 Posts: 592 Credit: 11,972,186,510 RAC: 1,187
Richard Haselgrove: your suggested action at Message 55967 seems to have worked for me also. Thank you very much!

I had many failed Anaconda Python tasks, all reporting errors like:
INTERNAL ERROR: cannot create temporary directory!

I executed the stated command:
sudo systemctl edit boinc-client.service
and added the suggested lines to the override file:
[Service]
PrivateTmp=true

After saving the changes and rebooting, the mentioned error vanished. Task 2p95010002r00-RAIMIS_NNPMM-0-1-RND3897_0 has succeeded on my Host #482132, which had previously failed 28 Anaconda Python tasks one after another... I've applied the same remedy to several other hosts, and three more tasks seem to be progressing normally. You have hit the mark.
Send message Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0
Richard Haselgrove wrote: The systemd file as distributed now has PrivateTmp=false, and my machine had that setting when the error messages appeared. I changed it back, so I now have PrivateTmp=true again. And with that one change (plus an increased time limit for safety), the task has completed successfully.

On my hosts that are not reporting this issue, PrivateTmp is not set (so it defaults to false):
https://gpugrid.net/show_host_detail.php?hostid=483378
https://gpugrid.net/show_host_detail.php?hostid=544286
https://gpugrid.net/show_host_detail.php?hostid=483296
I do have ProtectHome=true set on all hosts.

One thing I did note on 31712588 is that the tmp directory creation error is reported under a different PID to the wrapper. All transactions should be on the same PID.
21:48:18 (21729): wrapper: running /usr/bin/flock (/var/lib/boinc-client/projects/www.gpugrid.net/miniconda.lock -c "/bin/bash ./miniconda-installer.sh -b -u -p /var/lib/boinc-client/projects/www.gpugrid.net/miniconda &&

My concern is that different outcomes are being seen with similar settings of PrivateTmp=false. I would be interested in the Debian/BOINC maintenance team's feedback.
Send message Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0
Richard Haselgrove: good feedback! So, does ProtectHome=true imply PrivateTmp=true (or vice versa)?
Send message Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 731
Yes, I too would like to hear feedback from the developers - in my case, about what can be done with the project DCF of 93 on all my hosts. That makes GPUGrid completely monopolize the gpus and prevents my other gpu projects from running.

The standard new acemd application tasks are the ones causing the week-long estimated completion times; the new beta anaconda/python tasks have reasonable estimates. The problem is the disparate individual DCF values for each application - a 100:1 ratio of python to acemd3 - while BOINC can only handle one DCF value for a project with multiple sub-apps.

My current solution is manual intervention: I stop allowing new work and finish off what is in the cache so that the other gpu projects can work on their caches. Then I let the other gpu projects run for a day or so, until the REC values of the projects come close to balancing out. But this is an undesirable solution. I can only think that stopping one or the other sub-app and exclusively running a single app is the only permanent solution, once the DCF normalizes against a single application.
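To see why a DCF that high locks everything else out, here is a back-of-envelope sketch of the client's (simplified) runtime estimate. The fpops and device-speed numbers are purely illustrative assumptions, not values from any real task:

```shell
# estimate ~ rsc_fpops_est / device_flops * DCF  (simplified client model)
awk 'BEGIN {
  est_fpops = 1.0e15   # assumed <rsc_fpops_est> for a task
  flops     = 5.0e12   # assumed device speed: 5 TFLOPS
  dcf       = 93       # the duration correction factor from this post
  secs      = est_fpops / flops * dcf
  printf "estimate: %.0f s (vs %.0f s at DCF=1)\n", secs, est_fpops / flops
}'
```

With these numbers it prints "estimate: 18600 s (vs 200 s at DCF=1)" - a 200-second task advertised as over five hours, which is what pushes the scheduler to clear the GPUGRID cache before anything else.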
ServicEnginIC Send message Joined: 24 Sep 10 Posts: 592 Credit: 11,972,186,510 RAC: 1,187
rod4x4 wrote: On my hosts that are not reporting this issue, PrivateTmp is not set (so defaults to false)

All three of your successful hosts are running BOINC version 7.9.3. That version defaulted to PrivateTmp=true; the newer 7.16.xx versions changed this default to PrivateTmp=false.
Send message Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0
ServicEnginIC wrote: All three of your successful hosts are running BOINC version 7.9.3. That version defaulted to PrivateTmp=true.

Good pickup - thanks for the clarification. I missed that! It answers all my concerns: I had noted that the security directories created by PrivateTmp existed on my system, but could not work out why they existed.
Send message Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
Thanks, guys. I've written up the problem for upstream at https://github.com/BOINC/boinc/issues/4125, and I'm hoping to get further ideas from there. At the moment, the situation seems to be that you can either use idle detection to keep BOINC out of the way when working, or run Python tasks, but not both together. That seems unsatisfactory. There are a couple of suggestions so far, one being to add /var/tmp and /tmp to ReadWritePaths=. Both sound insecure and are not recommended for general use, but try them at your own risk. I'll be doing that.

Responding to other comments:

rod4x4 wrote: the tmp directory creation error is reported on a different PID to the wrapper.
That's the nature of the wrapper. Its job is to spawn child processes, and it is the child processes which will be trying to create temporary folders, so their PIDs will differ from the parent's.

rod4x4 wrote: does ProtectHome=true imply PrivateTmp=true (or vice versa)?
I don't know. I've got a lot of experience with BOINC, but I'm a complete n00b with Linux - learning on the job, by using analogies from other areas of computing. We'll find out as we go along.

Keith Myers wrote: what can be done with the project DCF of 93 on all my hosts.
That's an easy problem, totally within the scope of the project admins here. We just have to ask TONI to have a word with RAIMIS, and ask him/her to increase the template <rsc_fpops_est> to a more realistic value. Then the single DCF value controlling all runtime estimates throughout this project can settle back to a value close to 1.

ServicEnginIC wrote: Newer versions 7.16.xx changed this criterium to PrivateTmp=false by default.
The suggestion for making this change was first made on 15 November 2020. Only people who use a quickly updating Linux repo - like Gianfranco Costamagna's PPA - will have encountered it so far. But the mainstream stable repos will get to it eventually.
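For anyone who does want to try the ReadWritePaths= suggestion, a drop-in sketch (with the same caveat as above - this punches holes in the sandbox, and the service name assumes the Debian boinc-client package). Put this in an override via sudo systemctl edit boinc-client.service:

```ini
# Drop-in for boinc-client.service: grant the sandboxed service explicit
# write access to the shared tmp directories instead of a private /tmp.
[Service]
ReadWritePaths=/tmp /var/tmp
```

Follow it with sudo systemctl daemon-reload and a service restart; systemctl show boinc-client.service -p PrivateTmp -p ProtectHome will then display the effective sandbox settings.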
Send message Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0
Richard Haselgrove wrote: Thanks guys. I've written up the problem for upstream at https://github.com/BOINC/boinc/issues/4125, and I'm hoping to get further ideas from there. At the moment, the situation seems to be that you can either use idle detection to keep BOINC out of the way when working, or run Python tasks, but not both together. That seems unsatisfactory.

Thanks, Richard Haselgrove. I will follow that issue with interest. The last suggestion there, by BryanQuigley, has merit:
Either way, the project has the BOINC writable directory - why not use that? You can make your own tmp directory there if I'm not mistaken (haven't tried it)
His suggestion would mean any temp files are protected by the BOINC security system already in place, bypass the need for PrivateTmp=true, and remain accessible to BOINC process calls. One issue is that a clean-up process for the temp folder would need to be controlled by BOINC to prevent runaway file storage.
Send message Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0
His suggestion would mean any temp files are protected by the BOINC security system already in place, bypass the need for PrivateTmp=true, and remain accessible to BOINC process calls.

Which circles back to how flock uses the /tmp directory. Perhaps BryanQuigley can clear that point up.
Send message Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0
His suggestion would mean any temp files are protected by the BOINC security system already in place, bypass the need for PrivateTmp=true, and remain accessible to BOINC process calls.

I am starting to think the error is not related to flock's temp-folder handling, nor is PrivateTmp the cause; rather, they highlight an error in the GPUGRID work unit script. I have noticed that 9 files are written to the /tmp folder for each experimental task. These files are readable and contain nvidia compiler data. The work unit script should be placing these files in the BOINC folder structure; it is these file writes that are causing the error.
21:48:18 (21729): wrapper: running /usr/bin/flock (/var/lib/boinc-client/projects/www.gpugrid.net/miniconda.lock -c "/bin/bash ./miniconda-installer.sh -b -u -p /var/lib/boinc-client/projects/www.gpugrid.net/miniconda &&

Once this script error is corrected, PrivateTmp can be reverted to the preferred setting. It should also be noted that the task is not cleaning up these files on completion. This will lead to disk-space exhaustion if not controlled.
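A sketch for spotting (and, once verified, removing) those leftovers. The tmpxft_ name pattern is an assumption here - nvcc uses that prefix for its scratch files, but check what the files on your own host are actually called before deleting anything:

```shell
# List compiler scratch files left in /tmp by the tasks, older than a day.
# Dry run first: -print only. Swap in -delete once the listing looks right.
# (On a multi-user box, also check they belong to the boinc user.)
find /tmp -maxdepth 1 -name 'tmpxft_*' -mtime +1 -print
```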
Send message Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
I think it's more likely to be the /bin/bash which actually needs to write temporary files. I've now managed (with some difficulty) to separate the 569 lines of actual script from the 90 MB of payload. There's
export TMP_BACKUP="$TMP"
export TMP=$PREFIX/install_tmp
but no sign of the TMPDIR variable mentioned in https://linux.die.net/man/1/bash. More than that is above my pay grade, I'm afraid.

You can run the script in your home directory if you want to know more about it. (Like the End User License Agreement, the Notice of Third Party Software Licenses, and the Export; Cryptography Notice!)
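A small sketch of that distinction: bash and tools like mktemp consult TMPDIR, not TMP, so exporting TMP alone (as the installer does) may not redirect everything. The prefix path below is just an illustration, not the real installer location:

```shell
# Mirror the installer's redirection, then add the TMPDIR the script omits.
PREFIX="$HOME/miniconda-test"          # assumed install prefix
mkdir -p "$PREFIX/install_tmp"
export TMP_BACKUP="${TMP:-}"           # what the installer saves
export TMP="$PREFIX/install_tmp"       # what the installer sets
export TMPDIR="$PREFIX/install_tmp"    # what bash/mktemp actually consult
mktemp                                 # now created inside install_tmp
```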
©2025 Universitat Pompeu Fabra