Anaconda Python 3 Environment v4.01 failures

Message boards : Number crunching : Anaconda Python 3 Environment v4.01 failures
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Level
Trp
Message 55917 - Posted: 9 Dec 2020, 17:58:27 UTC

On Linux Mint hosts 537311 and 508381

Sample error messages:
[26912] INTERNAL ERROR: cannot create temporary directory!
[26916] INTERNAL ERROR: cannot create temporary directory!
Sergey Kovalchuk

Joined: 18 Feb 16
Posts: 6
Credit: 1,121,331
RAC: 0
Level
Ala
Message 55918 - Posted: 9 Dec 2020, 18:21:37 UTC

The wrapper uses an absolute path (from root).
In my case the path contains spaces, so it is passed on truncated.

Other projects use a relative path from slot/N/:
wrapper: running ../../projects/boinc.project.org/apps (arg)
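
To illustrate the truncation described above, here is a minimal Python sketch. It is not the wrapper's actual code; the path and the `build_command` helper are hypothetical, and it only shows the general effect of splitting an unquoted command line on whitespace versus quoting each part first.

```python
import shlex

def build_command(executable, args):
    """Join an executable and its arguments into one shell-safe string."""
    return " ".join(shlex.quote(part) for part in [executable, *args])

# Hypothetical absolute path containing a space.
path = "/home/user/my boinc/projects/example/app"

# Naive approach: glue the pieces together and split on whitespace.
naive = (path + " --arg").split()

# Quoted approach: each part is quoted, so the space survives a round trip.
safe = shlex.split(build_command(path, ["--arg"]))

print(naive[0])  # the path cut off at the first space
print(safe[0])   # the full path
```

A relative path such as the one other projects use avoids the problem only if none of its own components contain spaces.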
Ian&Steve C.

Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Level
Trp
Message 55919 - Posted: 9 Dec 2020, 19:38:38 UTC
Last modified: 9 Dec 2020, 19:38:54 UTC

It's interesting that although all the wingmen hit some error, they are not all getting the same error, and some run for quite a while before erroring out.
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Level
Trp
Message 55922 - Posted: 9 Dec 2020, 20:21:47 UTC - in response to Message 55919.  

My Linux machines run under the systemd daemon install. We found a problem recently where systemd was set up to have no access at all to the Linux tmp area - couldn't even read it. My machines are now able to read there, but may still be prevented from writing - wherever flock is trying to unpack python for installation.
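
A quick way to test the "can read but cannot write" suspicion is to try creating a temporary directory yourself. The sketch below uses Python's `tempfile` module purely as an illustration; whether flock or the installer chooses its temp location the same way is an assumption, and the helper name is hypothetical. It only tells you something if it runs as the same user, and under the same service restrictions, as the BOINC client.

```python
import os
import tempfile

def can_create_temp_dir(parent=None):
    """Try to create and remove a temp directory under `parent`.

    With parent=None, Python's default temp location is used.
    Returns (True, None) on success or (False, reason) on failure.
    """
    try:
        path = tempfile.mkdtemp(dir=parent)
    except OSError as exc:
        return False, str(exc)
    os.rmdir(path)  # clean up the empty directory we just made
    return True, None

if __name__ == "__main__":
    ok, reason = can_create_temp_dir("/tmp")
    print("writable" if ok else "cannot create temp dir: " + reason)
```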
Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 891
Level
Tyr
Message 55924 - Posted: 9 Dec 2020, 23:26:30 UTC

Anybody successful yet in the new Python tasks?

I have just watched two tasks run; both are currently at 100% progress with the time remaining dashed out. Runtime is currently at 18 minutes for one and 12 minutes for the other, and counting.

Maybe progress is being made in the task development. At least they ran for more than the couple of seconds of the previous tries.
Ian&Steve C.

Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Level
Trp
Message 55925 - Posted: 9 Dec 2020, 23:30:34 UTC - in response to Message 55924.  
Last modified: 9 Dec 2020, 23:38:05 UTC

Looks like Kevvy's system completed a bunch of them.

It was also awarded over 1,500,000 credits for each one! I think the credit award calculation is off a bit LOL. It's about 10x the credit given for MDAD.
Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 891
Level
Tyr
Message 55927 - Posted: 9 Dec 2020, 23:49:54 UTC

Well, if mine really are going to run for 3 hours, then that is about 1.5X longer than ADRIA tasks on my hosts.

So at least 1.5X in credits.

But agree the credit awarded is way out there.

But maybe an exorbitant award for being a beta tester.
Ian&Steve C.

Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Level
Trp
Message 55929 - Posted: 10 Dec 2020, 0:15:37 UTC - in response to Message 55927.  

Looks like I picked up a couple of resends (I only just enabled Beta tasks). I'll see if they complete on my system. Both were picked up on my RTX 2070 system.
Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 891
Level
Tyr
Message 55930 - Posted: 10 Dec 2020, 2:25:39 UTC

Every one has failed again on my two hosts.

I wonder what Mr. Kevvy has done right for the crunching environment to get them to complete successfully?
Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 891
Level
Tyr
Message 55934 - Posted: 10 Dec 2020, 6:14:21 UTC

Finally had success on several new Python tasks. I wonder if the failures are logged and fed back to correct the task configuration, so that future attempts finish correctly.
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Level
Trp
Message 55937 - Posted: 10 Dec 2020, 10:42:22 UTC

I thought I'd turned off Python tasks after the failures, but I forgot to turn off test tasks as well, so I've been sent another one.

The initial runtime estimate is 43 seconds...

I'll go and grab the full task specification before it runs, and see if I can spot anything.
rod4x4

Joined: 4 Aug 14
Posts: 266
Credit: 2,219,935,054
RAC: 0
Level
Phe
Message 55938 - Posted: 10 Dec 2020, 11:45:29 UTC - in response to Message 55937.  

The latest round of tasks looks fairly good.
The current batch of errors I have seen is related to disk space and proxy pass-through (the Anaconda environment download doesn't use the proxy).

Can we post the stderr output for errored tasks to see what other issues remain?
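
For anyone behind a proxy, the sketch below shows how a Python download can be made to honour the usual `http_proxy` / `https_proxy` environment variables. How the project's installer actually fetches the Anaconda environment is not known from this thread, so treat this as an illustration only; the proxy address in the comment is hypothetical.

```python
import urllib.request

def opener_from_environment():
    """Build a URL opener that routes through proxies named in the environment.

    Returns the opener and the proxy mapping it was built with, so the
    mapping can be inspected when a download bypasses the proxy.
    """
    # Reads variables such as http_proxy and https_proxy.
    proxies = urllib.request.getproxies_environment()
    handler = urllib.request.ProxyHandler(proxies)
    return urllib.request.build_opener(handler), proxies

if __name__ == "__main__":
    # e.g. run with https_proxy=http://proxy.example:3128 (hypothetical)
    _opener, proxies = opener_from_environment()
    print(proxies or "no proxy variables set")
```

If the mapping comes back empty for the user that runs the BOINC client, the download has no way of knowing about the proxy.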
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Level
Trp
Message 55939 - Posted: 10 Dec 2020, 11:47:33 UTC - in response to Message 55937.  
Last modified: 10 Dec 2020, 11:50:55 UTC

Well, it failed again, with the program 'flock' failing to create a temporary directory. That's the very first task in the job spec, and it seems to be trying to install miniconda.

The strange thing is that I seem to have a full miniconda installation already, created on 11 November, but that job has long since vanished from my task list.

Forgot to mention - a machine belonging to ServicEnginIC has also tried the same task, and failed in the same way.
rod4x4

Joined: 4 Aug 14
Posts: 266
Credit: 2,219,935,054
RAC: 0
Level
Phe
Message 55940 - Posted: 10 Dec 2020, 12:46:40 UTC - in response to Message 55939.  

"Well, it failed again, with the program 'flock' failing to create a temporary directory. That's the very first task in the job spec, and it seems to be trying to install miniconda.

"The strange thing is that I seem to have a full miniconda installation already, created on 11 November, but that job has long since vanished from my task list.

"Forgot to mention - a machine belonging to ServicEnginIC has also tried the same task, and failed in the same way."


Apologies, I should have read the original post. Just to compare your failed tasks with a successful task: https://gpugrid.net/result.php?resultid=31667292
It appears the tmp folder is a subfolder of miniconda in the gpugrid.net project folder, and of gpugridpy in the slot N folder.
I have a task running now, and both are populated with the app compile environment.
Is the user listed in /lib/systemd/system/boinc-client.service also in the boinc group in /etc/group? Does this user have the correct permissions on the /var/lib/boinc-client folder and its subfolders?
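
The group question can be checked by reading /etc/group directly. Here is a small Python sketch that parses the standard `name:password:gid:members` format; the sample entries and user names are made up for illustration, and the helper name is hypothetical.

```python
def group_members(group_file_text, group_name):
    """Return the member list for `group_name`, or None if the group is absent.

    Only supplementary members appear in /etc/group; a user whose
    primary group is `group_name` is recorded in /etc/passwd instead.
    """
    for line in group_file_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) >= 4 and fields[0] == group_name:
            return [member for member in fields[3].split(",") if member]
    return None

# Hypothetical sample in /etc/group format.
SAMPLE = """\
video:x:44:boinc
boinc:x:130:richard
"""

print(group_members(SAMPLE, "boinc"))
```

On a real host, pass in `open("/etc/group").read()` instead of the sample text.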
Ian&Steve C.

Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Level
Trp
Message 55944 - Posted: 10 Dec 2020, 14:34:56 UTC
Last modified: 10 Dec 2020, 14:58:52 UTC

Looks like I have 11 successful tasks, and 2 failures.

The two failures both ended with "196 (0xc4) EXIT_DISK_LIMIT_EXCEEDED" after a few minutes, on different hosts.
https://www.gpugrid.net/result.php?resultid=31680145
https://www.gpugrid.net/result.php?resultid=31678136

Curious, since both systems have plenty of free space, and I've allowed BOINC to use 90% of it.
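
One possible explanation, not confirmed anywhere in this thread: BOINC also enforces a per-task disk bound (rsc_disk_bound in the workunit), which is separate from how much free space the host has. The Python sketch below measures what a slot directory holds so it can be compared with such a bound. The helper names are hypothetical, and the bound has to be looked up from the task itself.

```python
import os

def directory_size(path):
    """Total size in bytes of regular files under `path` (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total

def exceeds_bound(path, bound_bytes):
    """True if the directory holds more data than the given bound."""
    return directory_size(path) > bound_bytes
```

Running `directory_size` on the slot directory while a task is unpacking its environment would show how close it gets to the limit.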

These tasks also behave quite differently from the default new version acemd tasks, and they don't seem well optimized yet:
- less reliance on PCIe bandwidth: 2-8% PCIe 3.0 bus utilization
- more reliance on GPU VRAM: 2-3 GB memory used
- lower GPU utilization: 65-85% (maybe more dependent on a fast CPU/memory subsystem; my 3900X system gets better GPU% than my slower EPYC systems)

Contrast that with the default acemd3 tasks:
- 25-50% PCIe 3.0 bus utilization
- about 500 MB GPU VRAM used
- 95+% GPU utilization
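
For anyone who wants to collect comparable figures, GPU utilization and memory use can be sampled with nvidia-smi. This is a minimal Python sketch; the sample values in the usage line are made up, and PCIe bus utilization is not covered by this query.

```python
import subprocess

# Real nvidia-smi query fields; one CSV line is printed per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]

def parse_gpu_csv(text):
    """Parse lines like '85, 2431' into dicts (percent, MiB)."""
    rows = []
    for line in text.strip().splitlines():
        util, mem = (field.strip() for field in line.split(","))
        rows.append({"gpu_util_pct": int(util), "mem_used_mib": int(mem)})
    return rows

def sample_gpus():
    """Run nvidia-smi once and return one dict per GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)

# Hypothetical output for a two-GPU host.
print(parse_gpu_csv("85, 2431\n65, 2980\n"))
```

Calling `sample_gpus()` in a loop with a short sleep gives a rough utilization trace for a running task.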

Thinking about GPU utilization depending on CPU speed: it could also have to do with the relative speed of the GPU and the CPU. This is just something I observed on my systems. Slower GPUs seem to tolerate slower CPUs better, which makes sense if CPU speed is the limiting factor.

Ryzen 3900X @4.20GHz w/ 2080ti = 85% GPU Utilization
EPYC 7402P @3.30GHz w/ 2080ti = 65% GPU Utilization
EPYC 7402P @3.30GHz w/ 2070 = 76% GPU Utilization
EPYC 7642 @2.80GHz w/ 1660Super = 71% GPU Utilization

Needs more optimization IMO. The default app does much better at keeping the GPU fully loaded.
gemini8
Joined: 3 Jul 16
Posts: 31
Credit: 2,248,809,169
RAC: 0
Level
Phe
Message 55950 - Posted: 10 Dec 2020, 17:07:24 UTC

Several failures for me as well.
Some failed because of a time limit of 1800 and aborted themselves after 1786 sec.
I've unhidden my machines in case someone is interested in the stderr output.
There might be different output on different machines, so just help yourselves.
- - - - - - - - - -
Greetings, Jens
Ian&Steve C.

Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Level
Trp
Message 55952 - Posted: 10 Dec 2020, 17:22:53 UTC - in response to Message 55950.  

Just spitballing, but it seems the people failing all their tasks are using a repository/service install of BOINC.

I see successful completions from myself, Keith, and Kevvy. All of us run an all-in-one install from an executable in our home directories.

All of the failures on repo/service installs might be down to a permissions issue with how the new app is trying to work.
Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 891
Level
Tyr
Message 55960 - Posted: 10 Dec 2020, 19:11:19 UTC

The first thing I would ask the BOINC service install users: is your user added to the boinc group?

Also, is the boinc user added to the video group?

Permissions get tricky with the service BOINC install.
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Level
Trp
Message 55967 - Posted: 10 Dec 2020, 20:21:58 UTC - in response to Message 55940.  

"It appears the tmp folder is a sub folder of the gpugrid.net project folder - miniconda and in the slot N folder gpugridpy."

I'm not entirely convinced by that. It's certainly the final destination for the installation, but is it the temp directory as well?

Call me naive, but wouldn't a generic temp folder be created in /tmp/, unless directed otherwise?

The last change to the BOINC service installation came as a result of https://github.com/BOINC/boinc/issues/3715#issuecomment-727609578. Previously, there was no access at all to /tmp/, which blocked idle detection. I'm running the fixed version, so I have access - but maybe it's only read access? That may explain why I was able to install conda on 11 November - with no access to /tmp/, it would have been forced to try another location. But access with no write permission??? Consistent with the error message.
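
The read-versus-write question can be checked per location. The Python sketch below lists the places Python's own `tempfile` module would try (TMPDIR, TEMP and TMP from the environment, then /tmp, /var/tmp and /usr/tmp) and reports the access the current process has to each. Whether the installer searches the same list is an assumption, and the helper names are hypothetical.

```python
import os

def candidate_temp_dirs(environ):
    """Locations tried for temp files: environment overrides, then defaults."""
    from_env = [environ.get(name) for name in ("TMPDIR", "TEMP", "TMP")]
    return [path for path in from_env if path] + ["/tmp", "/var/tmp", "/usr/tmp"]

def access_report(paths):
    """Map each path to whether it exists and is readable or writable."""
    return {
        path: {
            "exists": os.path.isdir(path),
            "read": os.access(path, os.R_OK),
            # Creating entries in a directory needs write and search permission.
            "write": os.access(path, os.W_OK | os.X_OK),
        }
        for path in paths
    }

if __name__ == "__main__":
    for path, flags in access_report(candidate_temp_dirs(os.environ)).items():
        print(path, flags)
```

The result is only meaningful when run as the BOINC service user inside the same systemd restrictions; from an ordinary login shell, /tmp will normally look fine.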

Anyway, I've made it private again. All we need now is a test task...
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Level
Trp
Message 55968 - Posted: 10 Dec 2020, 22:05:15 UTC - in response to Message 55967.  

YAY! I got one, and it's running. Installed conda, got through all the setup stages, and is now running ACEMD, and reporting proper progress.

Task 31712246, in case I go to bed before it finishes.

©2025 Universitat Pompeu Fabra