Anaconda Python 3 Environment v4.01 failures
| Author | Message |
|---|---|
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
|
|
Joined: 18 Feb 16 Posts: 6 Credit: 1,121,331 RAC: 0
|
The wrapper uses an absolute path (from root). In my case the path contains spaces, so it is passed on truncated. Other projects use a relative path from slot/N/, for example: wrapper: running ../../projects/boinc.project.org/apps (arg) |
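The truncation described above is what typically happens when a command line is split on whitespace without quoting. Here is a minimal Python sketch of that general mechanism. The path is hypothetical, and this is not the wrapper's actual parsing code, only an illustration of why a space in an absolute path can cut it short.

```python
import shlex

# Hypothetical absolute path containing a space (not taken from the thread).
app_path = "/home/user/BOINC data/projects/example/app"

# Naive whitespace splitting cuts the path at the first space.
naive = f"{app_path} --arg".split()

# Quoting the path keeps it as one argument when the line is parsed.
quoted = shlex.split(f"{shlex.quote(app_path)} --arg")

print("naive first argument: ", naive[0])
print("quoted first argument:", quoted[0])
```

A relative path from the slot directory, as other projects use, avoids the problem only as long as the relative part itself contains no spaces.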
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
It's interesting that, although all the wingmen hit some error, they aren't all getting the same one, and some run for quite a while before erroring out.
|
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
My Linux machines run under the systemd daemon install. We found a problem recently where systemd was set up to have no access at all to the Linux tmp area - couldn't even read it. My machines are now able to read there, but may still be prevented from writing - wherever flock is trying to unpack python for installation. |
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891
|
Anybody successful yet with the new Python tasks? I have just watched two tasks start; both are currently at 100% progress with the time remaining dashed out. Runtime is currently 18 minutes for one and 12 minutes for the other, and counting. Maybe the task development is making progress. At least they have run for more than the couple of seconds the previous tries managed. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
Looks like Kevvy's system completed a bunch of them, and was awarded over 1,500,000 credits for each one! I think the credit award calculation is off a bit LOL. It's about 10x the credit given for MDAD tasks.
|
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891
|
Well, if mine really are going to run for 3 hours, then that is about 1.5X longer than ADRIA tasks on my hosts, so they should earn at least 1.5X the credits. But I agree the credit awarded is way out there. Maybe it's an exorbitant award for being a beta tester. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
Looks like I picked up a couple of resends (I only just enabled Beta tasks). Both were picked up by my RTX 2070 system. I'll see if they complete there.
|
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891
|
Every one has failed again on my two hosts. I wonder what Mr. Kevvy has done right in his crunching environment to get them to complete successfully? |
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891
|
Finally had success on several new Python tasks. I wonder whether the failures are logged and fed back to correct the task configuration, so that future attempts finish correctly. |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
I thought I'd turned off Python tasks after the failures, but I forgot to turn off test tasks as well, so I've been sent another one. The initial runtime estimate is 43 seconds... I'll go and grab the full task specification before it runs, and see if I can spot anything. |
|
Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0
|
The latest round of tasks looks fairly good. The errors I have seen in the current batch are related to disk space and proxy pass-through (the Anaconda environment download doesn't use the proxy). Can we post the stderr output for errored tasks to see what other issues remain? |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
Well, it failed again, with the program 'flock' failing to create a temporary directory. That's the very first task in the job spec, and it seems to be trying to install miniconda. The strange thing is that I seem to have a full miniconda installation already, created on 11 November, but that job has long since vanished from my task list. Forgot to mention - a machine belonging to ServicEnginIC has also tried the same task, and failed in the same way. |
|
Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0
|
"Well, it failed again, with the program 'flock' failing to create a temporary directory. That's the very first task in the job spec, and it seems to be trying to install miniconda." Apologies, I should have read the original post. To compare your failed tasks with a successful one, see https://gpugrid.net/result.php?resultid=31667292 It appears the tmp folder is a sub folder of the gpugrid.net project folder - miniconda and in the slot N folder gpugridpy. I have a task running now, and both folders are populated with the app compile environment. Is the user listed in /lib/systemd/system/boinc-client.service also in the boinc group in /etc/group? Does this user have correct permissions on the /var/lib/boinc-client folder and its subfolders? |
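The two questions above (group membership and folder permissions) can be checked with a short script. This is a minimal sketch that assumes the service account and the group are both named `boinc`, which may not match your install; the real user name is whatever your boinc-client.service file lists. The data directory path is the one mentioned in the post.

```python
import grp
import os
import pwd
import stat


def user_in_group(user, group):
    """True if `group` is the user's primary group or lists the user as a member."""
    try:
        g = grp.getgrnam(group)
        u = pwd.getpwnam(user)
    except KeyError:
        return False  # the user or the group does not exist on this host
    return u.pw_gid == g.gr_gid or user in g.gr_mem


def describe_dir(path):
    """Return owner, group and permission string for `path`."""
    st = os.stat(path)
    try:
        owner = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        owner = str(st.st_uid)  # uid without a passwd entry
    try:
        group = grp.getgrgid(st.st_gid).gr_name
    except KeyError:
        group = str(st.st_gid)  # gid without a group entry
    return {"owner": owner, "group": group, "mode": stat.filemode(st.st_mode)}


if __name__ == "__main__":
    # "boinc" as user and group name is an assumption; check your unit file.
    print("boinc in boinc group:", user_in_group("boinc", "boinc"))
    data_dir = "/var/lib/boinc-client"  # path mentioned in the thread
    if os.path.isdir(data_dir):
        print(describe_dir(data_dir))
```

The script only reads account and file metadata, so it can be run from a normal login. It does not show what the service can actually do once systemd sandboxing is applied.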
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
Looks like I have 11 successful tasks and 2 failures. The two failures both failed with "196 (0xc4) EXIT_DISK_LIMIT_EXCEEDED" after a few minutes, and on different hosts.

https://www.gpugrid.net/result.php?resultid=31680145
https://www.gpugrid.net/result.php?resultid=31678136

Curious, since both systems have plenty of free space, and I've allowed BOINC to use 90% of it.

These tasks also behave very differently from the default new-version acemd tasks, and they don't seem well optimized yet:
- less reliance on PCIe bandwidth, seeing 2-8% PCIe 3.0 bus utilization
- more reliance on GPU VRAM, seeing 2-3GB memory used
- less GPU utilization, seeing 65-85% GPU utilization (maybe more dependent on a fast CPU/mem subsystem; my 3900X system gets better GPU% than my slower EPYC systems)

Contrast that with the default acemd3 tasks:
- 25-50% PCIe 3.0 bus utilization
- about 500MB GPU VRAM used
- 95+% GPU utilization

Thinking about the GPU utilization being dependent on CPU speed: it could also have to do with the relative speed of the GPU and the CPU. Just something I observed on my systems. Slower GPUs seem to tolerate slower CPUs better, which makes sense if the CPU speed is a limiting factor.

Ryzen 3900X @4.20GHz w/ 2080ti = 85% GPU Utilization
EPYC 7402P @3.30GHz w/ 2080ti = 65% GPU Utilization
EPYC 7402P @3.30GHz w/ 2070 = 76% GPU Utilization
EPYC 7642 @2.80GHz w/ 1660Super = 71% GPU Utilization

Needs more optimization IMO. The default app performs much better at keeping the GPU fully loaded.
|
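On the EXIT_DISK_LIMIT_EXCEEDED failures: as far as I understand it, BOINC raises this error when a single task's own disk usage exceeds the per-task bound set by the project, not when the disk as a whole is full. That would explain failures on hosts with plenty of free space. A minimal sketch for measuring a slot directory against such a bound follows; the slot path and the bound value in the usage comment are hypothetical, not values from these tasks.

```python
import os


def dir_size_bytes(root):
    """Total size of regular files under `root`, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                continue
            try:
                total += os.path.getsize(full)
            except OSError:
                pass  # file vanished or is unreadable; ignore it
    return total


def exceeds_bound(root, bound_bytes):
    """True if the files under `root` take more than `bound_bytes`."""
    return dir_size_bytes(root) > bound_bytes


# Hypothetical usage: compare one slot directory against an assumed 10 GB bound.
# print(exceeds_bound("/var/lib/boinc-client/slots/0", 10 * 1000**3))
```

Running this while a Python task is unpacking its environment would show how close the slot gets to whatever bound the task actually specifies.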
|
Joined: 3 Jul 16 Posts: 31 Credit: 2,248,809,169 RAC: 0
|
Several failures for me as well. Some hit the time limit of 1800 and abort themselves after 1786 sec. I've unhidden my machines in case someone is interested in the stderr output. There might be different output on different machines, so just help yourselves. - - - - - - - - - - Greetings, Jens |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
Just spitballing, but it seems the people failing all their tasks are using a repository/service install of BOINC. I see successful completions from myself, Keith, and Kevvy. All of us run an all-in-one install from an executable in our home directories. The failures on repo/service installs might all be down to a permissions issue with how the new app is trying to work.
|
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891
|
First thing I would ask the BOINC service install users: is your user added to the boinc group? Also, is video added to the boinc group? Permissions get tricky with the service BOINC install. |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
"It appears the tmp folder is a sub folder of the gpugrid.net project folder - miniconda and in the slot N folder gpugridpy." I'm not entirely convinced by that. It's certainly the final destination for the installation, but is it the temp directory as well? Call me naive, but wouldn't a generic temp folder be created in /tmp/, unless directed otherwise? The last change to the BOINC service installation came as a result of https://github.com/BOINC/boinc/issues/3715#issuecomment-727609578. Previously, there was no access at all to /tmp/, which blocked idle detection. I'm running the fixed version, so I have access - but maybe it's only read access? That may explain why I was able to install conda on 11 November - with no access to /tmp/, it would have been forced to try another location. But access with no write permission??? Consistent with the error message. Anyway, I've made it private again. All we need now is a test task... |
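The read-but-not-write theory above is easy to probe directly. The following is a minimal sketch, not a definitive diagnostic: it reports whether a directory is readable and whether a temporary directory can actually be created inside it. The function name `probe_temp_dir` is made up for illustration. To be meaningful it has to run as the BOINC service user under the same systemd restrictions; running it from a normal login shell only tests your own account.

```python
import os
import tempfile


def probe_temp_dir(path):
    """Report read access to `path` and whether a temp dir can be created in it."""
    result = {
        "path": path,
        "readable": os.access(path, os.R_OK),
        "writable": False,
        "error": None,
    }
    try:
        # Actually create (and then remove) a directory, which is what
        # the failing install step needs to be able to do.
        with tempfile.TemporaryDirectory(dir=path):
            result["writable"] = True
    except OSError as exc:
        result["error"] = str(exc)
    return result


if __name__ == "__main__":
    print(probe_temp_dir(tempfile.gettempdir()))
```

If the output shows `readable` as True and `writable` as False for /tmp, that would match the 'flock' error reported earlier in the thread.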
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
YAY! I got one, and it's running. Installed conda, got through all the setup stages, and is now running ACEMD, and reporting proper progress. Task 31712246, in case I go to bed before it finishes. |
©2025 Universitat Pompeu Fabra