Anaconda Python 3 Environment v4.01 failures

Message boards : Number crunching : Anaconda Python 3 Environment v4.01 failures

Ian&Steve C.

Message 55971 - Posted: 10 Dec 2020, 22:14:47 UTC - in response to Message 55968.  
Last modified: 10 Dec 2020, 22:15:40 UTC

Richard Haselgrove wrote:
YAY! I got one, and it's running. Installed conda, got through all the setup stages, and is now running ACEMD, and reporting proper progress.

Task 31712246, in case I go to bed before it finishes.


Hoping for a successful run, but it'll probably take 5-6 hours on a 1660 Ti.

Did you make any changes to your system between the last failed task and this one?
ID: 55971
Richard Haselgrove

Message 55972 - Posted: 10 Dec 2020, 22:25:40 UTC - in response to Message 55971.  

I applied the (very few) system updates that were waiting. I don't think anything would have been relevant to running tasks. [I remember a new version of curl, and of SSL, which might have affected external comms, but nothing caught my eye apart from those]

I spotted one problem, though. Initial estimate was 1 minute 46 seconds. I checked the <rsc_fpops_bound>, and it was a mere x50 over <rsc_fpops_est>. At the speed it was running, it would probably have timed out - so I've given it a couple of extra noughts on the bound.

The ACEMD phase started at 10% done, and is counting up in about 1% increments. Didn't seem to have checkpointed before I stopped it for the bound change, so I lost about 15 minutes of work.
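As a rough back-of-the-envelope check (the numbers below are taken from this thread, and the rule of thumb that BOINC aborts a task at roughly (rsc_fpops_bound / rsc_fpops_est) times the estimated runtime is a simplification):

```python
# Assumed numbers from this thread, not from the actual workunit template:
# initial estimate 1 min 46 s, and <rsc_fpops_bound> only 50x <rsc_fpops_est>.
est_seconds = 106            # 1 min 46 s initial estimate
bound_ratio = 50             # rsc_fpops_bound / rsc_fpops_est
abort_after_s = est_seconds * bound_ratio
print(round(abort_after_s / 3600, 2))   # about 1.5 h: well short of a 5-6 hour run
# "a couple of extra noughts" (x100 on the bound) raises the headroom to ~147 h
```

So on a slow card the original bound really would have killed the task mid-run.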
ID: 55972
Ian&Steve C.

Message 55973 - Posted: 10 Dec 2020, 22:32:21 UTC - in response to Message 55972.  

You probably didn't need to change the bounds to avoid a timeout. My early tasks all estimated sub-1-minute run times, but all ran to completion anyway, even on my 1660 Super, which took 5+ hours to run.
ID: 55973
Richard Haselgrove

Message 55974 - Posted: 10 Dec 2020, 22:43:53 UTC - in response to Message 55973.  

Somebody mentioned a timeout, so I thought it was better to be safe than sorry. After all the troubles, it would have been sad to lose a successful test to that (besides being a waste of time; at least the temp directory errors only wasted a couple of seconds!).
ID: 55974
rod4x4

Message 55977 - Posted: 11 Dec 2020, 1:27:54 UTC - in response to Message 55967.  
Last modified: 11 Dec 2020, 2:05:00 UTC

Richard Haselgrove wrote:
The last change to the BOINC service installation came as a result of https://github.com/BOINC/boinc/issues/3715#issuecomment-727609578. Previously, there was no access at all to /tmp/, which blocked idle detection.

Did you find that adding PrivateTmp=false to the boinc-client.service file changed the TMP access error in the original post? That is odd, as PrivateTmp defaults to false anyway.
I could set this to true on one of my working hosts to see what happens.

Richard Haselgrove wrote:
YAY! I got one, and it's running. Installed conda, got through all the setup stages, and is now running ACEMD, and reporting proper progress.

Progress!

EDIT:
I'm not entirely convinced by that. It's certainly the final destination for the installation, but is it the temp directory as well?

Call me naive, but wouldn't a generic temp folder be created in /tmp/, unless directed otherwise?

Debian-based distros seem to use the /proc/locks directory for lock information.
If you use the lslocks command, you will see the current locks, including any BOINC/GPUGRID locks. (cat /proc/locks will also work, but you will need to know the PIDs of the listed processes to interpret the output.)
I can't think of anything that would prevent access to this directory structure, as it is critical to OS operations.
I am leaning towards a bug in the Experimental Tasks scripts that causes this issue (and has since been cleaned up).
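For the curious, the same information lslocks reports can be pulled straight from /proc/locks; a small sketch (Linux-only; field layout per proc(5)):

```python
import os

def list_locks(path="/proc/locks"):
    """Parse /proc/locks (the file lslocks reads) and resolve each PID to a
    command name via /proc/<pid>/comm. Fields per proc(5): id, class
    (FLOCK/POSIX), type (ADVISORY/MANDATORY), mode, pid, dev:inode, start, end."""
    locks = []
    if not os.path.exists(path):            # e.g. non-Linux systems
        return locks
    with open(path) as f:
        for line in f:
            fields = line.split()
            if fields[1] == "->":           # blocked waiters are prefixed with '->'
                fields.pop(1)
            lock_class, mode, pid = fields[1], fields[3], fields[4]
            try:
                with open(f"/proc/{pid}/comm") as c:
                    command = c.read().strip()
            except OSError:                 # process already gone
                command = "?"
            locks.append((pid, command, lock_class, mode))
    return locks

# e.g. show only BOINC/GPUGRID-related locks
for pid, command, lock_class, mode in list_locks():
    if "boinc" in command.lower() or "acemd" in command.lower():
        print(pid, command, lock_class, mode)
```

(The filter names are just examples; adjust to whatever processes you want to watch.)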
ID: 55977
rod4x4

Message 55980 - Posted: 11 Dec 2020, 5:35:59 UTC - in response to Message 55944.  
Last modified: 11 Dec 2020, 6:30:13 UTC

Ian&Steve C. wrote:
Looks like I have 11 successful tasks, and 2 failures.

the two failures both failed with "196 (0xc4) EXIT_DISK_LIMIT_EXCEEDED" after a few mins and on different hosts.
https://www.gpugrid.net/result.php?resultid=31680145
https://www.gpugrid.net/result.php?resultid=31678136

curious, since both systems have plenty of free space, and I've allowed BOINC to use 90% of it.


gemini8 wrote:
Several failures for me as well.
Some because of time-limit 1800 which abort themselves after 1786 sec.
Unhid my machines for the case someone is interested in the stderr output.
There might be different output on different machines, so just help yourselves.


I have seen the same errors on my hosts.
It is a bug in those particular work units, as they fail on all hosts.
Some hosts report the Disk Limit Exceeded error; some hosts report an AssertionError from assert os.path.exists('output.coor'). (An AssertionError comes from an assert statement inserted into the code by the programmer to trap and report errors.)

A host belonging to gemini8 experienced a flock timeout on Work Unit 26277866. All other hosts processing this task reported this error:
os.remove('restart.chk') FileNotFoundError: [Errno 2] No such file or directory: 'restart.chk'
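Both failure signatures are cheap to guard against in a task script. A hypothetical defensive version of the wrap-up step (not GPUGRID's actual code; only the file names are taken from the errors above):

```python
import contextlib
import os

# Hypothetical defensive wrap-up, NOT GPUGRID's actual script: tolerate a
# missing checkpoint instead of dying with FileNotFoundError, and raise a
# descriptive error (rather than a bare assert) when no output was produced.
def finish_task(output="output.coor", checkpoint="restart.chk"):
    with contextlib.suppress(FileNotFoundError):
        os.remove(checkpoint)        # a no-op if the checkpoint never existed
    if not os.path.exists(output):
        raise RuntimeError(f"ACEMD produced no {output}; check stderr for the cause")
```

contextlib.suppress makes the removal idempotent; on Python 3.8+, pathlib's unlink(missing_ok=True) achieves the same thing.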


gemini8
Three of your hosts with older CUDA drivers report this: ACEMD failed: Error loading CUDA module: CUDA_ERROR_INVALID_PTX (218).
Perhaps it's time for a driver update. It won't fix the errors being reported (those are task bugs), but it may prevent other issues developing in the future.
ID: 55980
Richard Haselgrove

Message 55981 - Posted: 11 Dec 2020, 9:03:36 UTC - in response to Message 55977.  

rod4x4 wrote:
EDIT:
I'm not entirely convinced by that. It's certainly the final destination for the installation, but is it the temp directory as well?

Call me naive, but wouldn't a generic temp folder be created in /tmp/, unless directed otherwise?

Debian based distros seem to use /proc/locks directory for lock file location.
If you use lslocks command, you will see the current locks, including any boinc/Gpugrid locks. (cat /proc/locks will also work, but you will need to know the PID of the processes listed to interpret the output)
I cant think of anything that would prevent access to this directory structure, as it is critical in OS operations.
I am leaning towards a bug in the Experimental Tasks scripts that causes this issue (and has since been cleaned up).

Remember that I had a very specific reason for reverting a single recent change. The BOINC systemd file had PrivateTmp=true until very recently, and I had evidence that at least one Anaconda task had successfully passed the installation phase in the past.

The systemd file as distributed now has PrivateTmp=false, and my machine had that setting when the error messages appeared. I changed it back, so I now have PrivateTmp=true again. And with that one change (plus an increased time limit for safety), the task has completed successfully.

I have a second machine, where I haven't yet made that change. Compare tasks 31712246 (WU created 10 Dec 2020 | 20:54:14 UTC, PrivateTmp=true) and 31712588 (WU created 10 Dec 2020 | 21:05:06 UTC, PrivateTmp=false). The first succeeded, the second failed with 'cannot create temporary directory'. I don't think it was a design change by the project.

I now have a DCF of 87, and I need to go and talk to the Debian/BOINC maintenance team again...
ID: 55981
gemini8
Message 55982 - Posted: 11 Dec 2020, 10:44:35 UTC - in response to Message 55980.  

rod4x4 wrote:
gemini8
3 of your hosts with older CUDA driver report this: ACEMD failed: Error loading CUDA module: CUDA_ERROR_INVALID_PTX (218).
Perhaps time for a driver update. It won't fix the errors being reported (reported errors are task bugs), but may prevent other issues developing in the future

Thanks for your concern!
The systems run the 'standard' proprietary Nvidia Debian driver. I wasn't able to get a later one working, so I'm keeping them this way.
I could switch the systems to Ubuntu, which features later drivers, but I like Debian better and thus don't plan to do so.
- - - - - - - - - -
Greetings, Jens
ID: 55982
kksplace

Message 55983 - Posted: 11 Dec 2020, 11:33:58 UTC

Not a technical guru like you all here, but if it helps: my system had one Anaconda failure back on 4 Dec, and has now completed 4 in the last several days, with one more in progress. Let me know if I can provide any information that helps.
ID: 55983
Profile ServicEnginIC
Message 55986 - Posted: 11 Dec 2020, 23:43:34 UTC - in response to Message 55967.  

Richard Haselgrove:
Your suggested action at Message 55967 seems to have worked for me also.
Thank you very much!

I had many failed Anaconda Python tasks, reporting errors regarding:
INTERNAL ERROR: cannot create temporary directory!

I executed the stated command:
sudo systemctl edit boinc-client.service

And I added to the file the suggested lines:
[Service]
PrivateTmp=true

After saving changes and rebooting, the mentioned error vanished.
Task 2p95010002r00-RAIMIS_NNPMM-0-1-RND3897_0 has succeeded on my Host #482132.
This host had previously failed 28 Anaconda Python tasks one after another...
I've applied the same remedy to several other hosts, and three more tasks seem to be progressing normally.

You have hit the mark
ID: 55986
rod4x4

Message 55987 - Posted: 11 Dec 2020, 23:44:59 UTC - in response to Message 55981.  

Richard Haselgrove wrote:
The systemd file as distributed now has PrivateTmp=false, and my machine had that setting when the error messages appeared. I changed it back, so I now have PrivateTmp=true again. And with that one change (plus an increased time limit for safety), the task has completed successfully.

I have a second machine, where I haven't yet made that change. Compare tasks 31712246 (WU created 10 Dec 2020 | 20:54:14 UTC, PrivateTmp=true) and 31712588 (WU created 10 Dec 2020 | 21:05:06 UTC, PrivateTmp=false). The first succeeded, the second failed with 'cannot create temporary directory'. I don't think it was a design change by the project.

On my hosts that are not reporting this issue, PrivateTmp is not set (so defaults to false)
https://gpugrid.net/show_host_detail.php?hostid=483378
https://gpugrid.net/show_host_detail.php?hostid=544286
https://gpugrid.net/show_host_detail.php?hostid=483296

I do have ProtectHome=true set on all hosts.

One thing I did note on 31712588 is that the tmp directory creation error is reported on a different PID to the wrapper. All transactions should be on the same PID.

21:48:18 (21729): wrapper: running /usr/bin/flock (/var/lib/boinc-client/projects/www.gpugrid.net/miniconda.lock -c "/bin/bash ./miniconda-installer.sh -b -u -p /var/lib/boinc-client/projects/www.gpugrid.net/miniconda &&
/var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/conda install -m -y -p gpugridpy --file requirements.txt ")
[21755] INTERNAL ERROR: cannot create temporary directory!
[21759] INTERNAL ERROR: cannot create temporary directory!
21:48:19 (21729): /usr/bin/flock exited; CPU time 0.118700


My concern is that different outcomes are being seen with the same setting of PrivateTmp=false.

Would be interested in Debian/BOINC maintenance team feedback.
ID: 55987
rod4x4

Message 55988 - Posted: 12 Dec 2020, 0:03:26 UTC - in response to Message 55986.  

ServicEnginIC wrote:
Richard Haselgrove:
Your suggested action at Message 55967 seems to have worked for me also.
Thank you very much!

I had many failed Anaconda Python tasks, reporting errors regarding:
INTERNAL ERROR: cannot create temporary directory!

I executed the stated command:
sudo systemctl edit boinc-client.service

And I added to the file the suggested lines:
[Service]
PrivateTmp=true

After saving changes and rebooting, the mentioned error vanished.
Task 2p95010002r00-RAIMIS_NNPMM-0-1-RND3897_0 has succeeded on my Host #482132.
This host had previously failed 28 Anaconda Python tasks one after another...
I've applied the same remedy to several other hosts, and three more tasks seem to be progressing normally.

You have hit the mark


Good feedback!

So Richard Haselgrove, does ProtectHome=true imply PrivateTmp=true (or vice versa)?
ID: 55988
Keith Myers
Message 55989 - Posted: 12 Dec 2020, 0:06:14 UTC

Yes, I too would like to hear feedback from the developers; in my case, about what can be done with the project DCF of 93 on all my hosts.

That makes GPUGrid completely monopolize the GPUs and prevents my other GPU projects from running.

The standard new acemd application tasks are the ones causing the week-long estimated completion times.

The new beta anaconda/python tasks have reasonable estimated completion times.

The problem is the disparate individual DCF values for each application, with a 100:1 ratio of python to acemd3, and BOINC only being able to handle one DCF value for a project with multiple sub-apps.

My current solution is manual intervention: I stop allowing new work and finish off what is in the cache, so that the other GPU projects can work on their caches. Then I let the other GPU projects run for a day or so, so that the REC values of the projects come close to balancing out.

But this is an undesirable solution. I can only think that stopping one or the other sub-app, and exclusively running only one app, is the only permanent solution, once the DCF normalizes against a single application.
ID: 55989
Profile ServicEnginIC
Message 55990 - Posted: 12 Dec 2020, 0:12:06 UTC - in response to Message 55987.  

rod4x4 wrote:
On my hosts that are not reporting this issue, PrivateTmp is not set (so defaults to false)

All three of your successful hosts are running BOINC version 7.9.3.
That version defaulted to PrivateTmp=true.
Newer 7.16.xx versions changed this default to PrivateTmp=false.
ID: 55990
rod4x4

Message 55991 - Posted: 12 Dec 2020, 0:30:06 UTC - in response to Message 55990.  

On my hosts that are not reporting this issue, PrivateTmp is not set (so defaults to false)

ServicEnginIC wrote:
All three of your successful hosts are running BOINC version 7.9.3.
That version defaulted to PrivateTmp=true.
Newer 7.16.xx versions changed this default to PrivateTmp=false.

Good pickup, thanks for the clarification; I missed that!
That answers all my concerns.

I noted that the security directories created by PrivateTmp existed on my system, but could not work out why they existed.
ID: 55991
Richard Haselgrove

Message 55993 - Posted: 12 Dec 2020, 14:41:52 UTC

Thanks guys. I've written up the problem for upstream at https://github.com/BOINC/boinc/issues/4125, and I'm hoping to get further ideas from there. At the moment, the situation seems to be that you can either use idle detection to keep BOINC out of the way when working, or run Python tasks, but not both together. That seems unsatisfactory.

There are a couple of suggestions so far:
either adding /var/tmp and /tmp to ReadWritePaths=
or commenting out:
ProtectSystem=strict

Both sound insecure and not recommended for general use, but try them at your own risk. I'll be doing that.

Responding to other comments:

rod4x4 wrote:
the tmp directory creation error is reported on a different PID to the wrapper.

That's the nature of the wrapper. Its job is to spawn child processes. It will be the child processes which will be trying to create temporary folders, and their PIDs will be different from the parent's.

rod4x4 wrote:
does ProtectHome=true imply PrivateTmp=true (or vice versa)?

I don't know. I've got a lot of experience with BOINC, but I'm a complete n00b with Linux - learning on the job, by using analogies from other areas of computing. We'll find out as we go along.

Keith Myers wrote:
what can be done with the project DCF of 93 on all my hosts.

That's an easy problem, totally within the scope of the project admins here. We just have to ask TONI to have a word with RAIMIS, and ask him/her to increase the template <rsc_fpops_est> to a more realistic value. Then, the single DCF value controlling all runtime estimates throughout this project can settle back on a value close to 1.
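The mechanics can be illustrated with toy numbers (all assumed, not measured from any real host): the client scales every estimate by the one project-wide DCF.

```python
# Toy illustration of the single-DCF problem (all numbers assumed): the client
# computes estimated runtime roughly as rsc_fpops_est / device_flops * DCF,
# with one DCF shared by every app of the project.
def estimated_runtime_h(rsc_fpops_est, device_flops, dcf):
    return rsc_fpops_est / device_flops * dcf / 3600

GPU_FLOPS = 5e12                  # an assumed ~5 TFLOPS device
acemd_est = 1e17                  # an assumed acemd3 rsc_fpops_est

# A DCF near 1 gives a sane estimate (about 5.6 h here)...
print(estimated_runtime_h(acemd_est, GPU_FLOPS, dcf=1))
# ...but a DCF of ~93, learned from the under-estimated Python app,
# inflates the same acemd3 estimate 93-fold (about 517 h, i.e. weeks):
print(estimated_runtime_h(acemd_est, GPU_FLOPS, dcf=93))
```

Raising rsc_fpops_est on the Python tasks lets the shared DCF settle back towards 1, which deflates the acemd3 estimates too.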

ServicEnginIC wrote:
Newer versions 7.16.xx changed this criterium to PrivateTmp=false by default.

The suggestion for making this change was first made on 15 November 2020. Only people who use a quickly-updating Linux repo, like Gianfranco Costamagna's PPA, will have encountered it so far. But the mainstream stable repos will get to it eventually.
ID: 55993
rod4x4

Message 55998 - Posted: 12 Dec 2020, 23:59:48 UTC - in response to Message 55993.  
Last modified: 13 Dec 2020, 0:16:11 UTC

Richard Haselgrove wrote:
Thanks guys. I've written up the problem for upstream at https://github.com/BOINC/boinc/issues/4125, and I'm hoping to get further ideas from there. At the moment, the situation seems to be that you can either use idle detection to keep BOINC out of the way when working, or run Python tasks, but not both together. That seems unsatisfactory.

Thanks Richard Haselgrove. Will follow that issue request with interest.

The last suggestion by BryanQuigley has merit.
Either way, the project has the BOINC writable directory - why not use that? You can make your own tmp directory there if I'm not mistaken (haven't tried it)


His suggestion means any temp files would be protected by the BOINC security system already in place, bypasses the need for PrivateTmp=true, and would be accessible to BOINC process calls.
One issue: a clean-up process for the temp folder would need to be controlled by BOINC to prevent runaway file storage.
ID: 55998
rod4x4

Message 55999 - Posted: 13 Dec 2020, 1:20:55 UTC - in response to Message 55998.  

His suggestion will mean any temp files will be protected by the Boinc security system already in place, bypasses the need for PrivateTmp=true and be accessible for Boinc process calls.

Which circles back to how flock uses /tmp directory. Perhaps BryanQuigley can clear that point up.
ID: 55999
rod4x4

Message 56000 - Posted: 13 Dec 2020, 5:19:28 UTC - in response to Message 55999.  
Last modified: 13 Dec 2020, 6:19:24 UTC

His suggestion will mean any temp files will be protected by the Boinc security system already in place, bypasses the need for PrivateTmp=true and be accessible for Boinc process calls.

Which circles back to how flock uses /tmp directory. Perhaps BryanQuigley can clear that point up.

I am starting to think the error is not related to flock's temp folder handling, nor is PrivateTmp the cause; rather, they highlight an error in the GPUGRID work unit script.
I have noticed that 9 files are written to the /tmp folder for each experimental task. These files are readable and contain Nvidia compiler data.

The work unit script should be placing these files in the boinc folder structure.

It is these file writes that are causing the error:
21:48:18 (21729): wrapper: running /usr/bin/flock (/var/lib/boinc-client/projects/www.gpugrid.net/miniconda.lock -c "/bin/bash ./miniconda-installer.sh -b -u -p /var/lib/boinc-client/projects/www.gpugrid.net/miniconda &&
/var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/conda install -m -y -p gpugridpy --file requirements.txt ")
[21755] INTERNAL ERROR: cannot create temporary directory!
[21759] INTERNAL ERROR: cannot create temporary directory!
21:48:19 (21729): /usr/bin/flock exited; CPU time 0.118700


Once this script error is corrected, PrivateTmp can be reverted to the preferred setting.

It should also be noted that the task does not clean up these files on completion. This will lead to disk-space exhaustion if not controlled.
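A clean-up pass of the kind this would need could look like the sketch below (entirely hypothetical; the .ptx suffix and 7-day threshold are illustrative, not anything BOINC or GPUGRID actually does):

```python
import os
import time

# Hypothetical clean-up pass: remove compiler scratch files that a task left
# behind in a temp directory once they are older than max_age_days.
# Suffix and threshold are illustrative only.
def prune_stale(dirpath, suffix=".ptx", max_age_days=7):
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for name in os.listdir(dirpath):
        full = os.path.join(dirpath, name)
        if (name.endswith(suffix) and os.path.isfile(full)
                and os.path.getmtime(full) < cutoff):
            os.remove(full)
            removed.append(name)
    return removed
```

Keying the deletion on both a name pattern and an age threshold avoids racing with a still-running task that wrote a file moments ago.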
ID: 56000
Richard Haselgrove

Message 56001 - Posted: 13 Dec 2020, 13:52:48 UTC - in response to Message 56000.  

I think it's more likely to be the /bin/bash which actually needs to write temporary files. I've now managed (with some difficulty) to separate the 569 lines of actual script from the 90 MB of payload. There's

export TMP_BACKUP="$TMP"
export TMP=$PREFIX/install_tmp

but no sign of the TMPDIR mentioned in https://linux.die.net/man/1/bash. More than that is above my pay-grade, I'm afraid.
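For what it's worth, many tools consult TMPDIR before TMP; Python's tempfile module, for instance, checks TMPDIR, then TEMP, then TMP. A quick sketch showing that precedence (and why exporting only TMP may not redirect every temp-file consumer):

```python
import os
import tempfile

def effective_tmpdir(**env):
    """Return the directory tempfile would pick with ONLY the given vars set."""
    saved = {k: os.environ.pop(k, None) for k in ("TMPDIR", "TEMP", "TMP")}
    os.environ.update(env)
    tempfile.tempdir = None              # discard tempfile's cached choice
    chosen = tempfile.gettempdir()
    tempfile.tempdir = None
    for k, v in saved.items():           # restore the original environment
        os.environ.pop(k, None)
        if v is not None:
            os.environ[k] = v
    return chosen

a = tempfile.mkdtemp()
b = tempfile.mkdtemp()
print(effective_tmpdir(TMPDIR=a, TMP=b))   # TMPDIR wins over TMP
print(effective_tmpdir(TMP=b))             # with TMPDIR unset, TMP is consulted
```

This is Python's lookup order specifically; bash's mktemp honours TMPDIR too, so an installer that exports only TMP can still leave some of its children writing to the default /tmp.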

You can run the script in your home directory if you want to know more about it. (Like the End User License Agreement, the Notice of Third Party Software Licenses, and the Export; Cryptography Notice!)
ID: 56001

©2025 Universitat Pompeu Fabra