Message boards : News : Experimental Python tasks (beta)

Author Message
Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Send message
Joined: 9 Dec 08
Posts: 988
Credit: 5,068,599
RAC: 0
Level
Ser
Scientific publications
watwatwatwat
Message 55588 - Posted: 13 Oct 2020 | 6:07:19 UTC

I'm creating some experimental tasks for the Python app (made Beta). They are Linux and CUDA specific and serve in preparation for future batches.

They may use a relatively large amount of disk space (order of 1-10 GB) which persists between runs, and is cleared if you reset the project.

rod4x4
Send message
Joined: 4 Aug 14
Posts: 266
Credit: 2,200,441,910
RAC: 243,507
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 55590 - Posted: 13 Oct 2020 | 7:44:18 UTC - in response to Message 55588.
Last modified: 13 Oct 2020 | 8:24:54 UTC

I'm creating some experimental tasks for the Python app (made Beta). They are Linux and CUDA specific and serve in preparation for future batches.

They may use a relatively large amount of disk space (order of 1-10 GB) which persists between runs, and is cleared if you reset the project.



Preference Ticked, ready and waiting...

EDIT: Received some already
https://www.gpugrid.net/result.php?resultid=29466771
https://www.gpugrid.net/result.php?resultid=29466770

Conda Warnings reported. Will you push out an update to the app, or is it safe to ignore?

Also Warnings about path not found:
WARNING conda.core.envs_manager:register_env(50): Unable to register environment. Path not writable or missing. environment location: /var/lib/boinc-client/projects/www.gpugrid.net/miniconda registry file: /root/.conda/environments.txt

Registry file location ( /root/ ) will not be accessible to boinc user unless conda is already installed on the host (by root user) and conda file is world readable

Otherwise the task status is Completed and Validated

Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Send message
Joined: 9 Dec 08
Posts: 988
Credit: 5,068,599
RAC: 0
Level
Ser
Scientific publications
watwatwatwat
Message 55591 - Posted: 13 Oct 2020 | 9:25:38 UTC - in response to Message 55590.

Looks harmless, thanks for reporting. It's because the "boinc" user doesn't have a HOME directory I think.
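
A quick way to check that on your own host is to look the account up in the passwd database (user name assumed; the Debian/Ubuntu packages run the client as "boinc"):

getent passwd boinc
# the second-to-last field is the home directory; if it is missing, or points
# somewhere the client cannot write (such as /root), you get the registration
# warning quoted above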

rod4x4
Send message
Joined: 4 Aug 14
Posts: 266
Credit: 2,200,441,910
RAC: 243,507
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 55592 - Posted: 13 Oct 2020 | 11:14:14 UTC - in response to Message 55591.
Last modified: 13 Oct 2020 | 11:17:49 UTC

Looks harmless, thanks for reporting. It's because the "boinc" user doesn't have a HOME directory I think.


Agreed

Perhaps adding a "./envs" switch to the end of the command:

/var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/conda install

may help with setting up the environment.

This switch should create the environment files in the current directory from which the command is executed.
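
For reference, the documented conda option for this is -p/--prefix, which targets an environment at a given path rather than one registered under $HOME/.conda. A rough sketch of the suggestion (the package name is only a placeholder; the real package list is whatever the app needs):

/var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/conda install \
    --yes --prefix ./envs numpy
# --prefix ./envs keeps the environment files under the directory the command
# is run from, as described above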

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 688
Credit: 904,899,505
RAC: 377,648
Level
Glu
Scientific publications
watwatwatwatwat
Message 55724 - Posted: 12 Nov 2020 | 1:59:01 UTC

I got one of these tasks which confused me as I have not set "accept beta applications" in my project preferences.

Failed after 1200 seconds.

Any idea why I got this task even when I have not accepted the app through beta settings?

https://www.gpugrid.net/result.php?resultid=30508976

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 324
Credit: 3,365,990,576
RAC: 4,059,647
Level
Arg
Scientific publications
wat
Message 55920 - Posted: 9 Dec 2020 | 19:42:43 UTC

What is the difference between these test Python apps and the standard one? Is it just that this application is coded in Python? What language are the default apps coded in?
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 688
Credit: 904,899,505
RAC: 377,648
Level
Glu
Scientific publications
watwatwatwatwat
Message 55926 - Posted: 9 Dec 2020 | 23:40:13 UTC - in response to Message 55920.

Both apps are wrappered. One is the stock acemd3 and I assume is written in some form of C.

The new Anaconda Python task is a conda application. And Python.

I think Toni is going to have to explain what these new tasks are and how the application works.

Very strange behavior. I think the conda and python parts run first and communicate with the project doing some intermediary calculation/configuration/formatting or something. Lots of upstream network activity and nothing going on in the client transfers screen.

I saw the tasks get to 100% progress and no time remaining and then stall out. No upload of the finished task.

Looked away from the machine and looked again and now both tasks have reset their progress and now have 3 hours to run.

I first saw conda show up in the process list and now that has disappeared to be replaced with a acemd3 and python process for each task.

Must be doing something other than insta-failing like the previous tries did.

sph
Send message
Joined: 22 Oct 20
Posts: 4
Credit: 34,434,982
RAC: 79,115
Level
Val
Scientific publications
wat
Message 55933 - Posted: 10 Dec 2020 | 5:22:30 UTC

CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://conda.anaconda.org/conda-forge/linux-64/_libgcc_mutex-0.1-conda_forge.tar.bz2>
Elapsed: -

An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.


I am receiving this error in STDerr Output for Experimental Python tasks on all my hosts.

This is probably due to the fact all my PCs are behind a proxy. Can you please set the Python tasks to use the Proxy defined in the Boinc Client?

Work Units here:
https://www.gpugrid.net/result.php?resultid=31672354
https://www.gpugrid.net/result.php?resultid=31668427
https://www.gpugrid.net/result.php?resultid=31665961

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 688
Credit: 904,899,505
RAC: 377,648
Level
Glu
Scientific publications
watwatwatwatwat
Message 55936 - Posted: 10 Dec 2020 | 8:30:18 UTC

Boy, mixing both regular acemd3 and the python anaconda tasks sure F*s up the APR for both tasks. The insanely low APR for the Python tasks is forcing all GPUGrid tasks into High Priority.

The regular acemd3 tasks are getting 3-6 day estimated completions.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 324
Credit: 3,365,990,576
RAC: 4,059,647
Level
Arg
Scientific publications
wat
Message 55945 - Posted: 10 Dec 2020 | 15:25:26 UTC - in response to Message 55936.
Last modified: 10 Dec 2020 | 15:41:38 UTC

Boy, mixing both regular acemd3 and the python anaconda tasks sure F*s up the APR for both tasks. The insanely low APR for the Python tasks is forcing all GPUGrid tasks into High Priority.

The regular acemd3 tasks are getting 3-6 day estimated completions.


I'm seeing that too lol. but it doesn't seem to be causing too much trouble for me since I don't run more than one GPU project concurrently. Only have Prime and backup.

copying my message from another thread with my observations about these tasks, for Toni to see in case he doesn't check the other threads:

Looks like I have 11 successful tasks, and 2 failures.

the two failures both failed with "196 (0xc4) EXIT_DISK_LIMIT_EXCEEDED" after a few mins and on different hosts.
https://www.gpugrid.net/result.php?resultid=31680145
https://www.gpugrid.net/result.php?resultid=31678136

curious, since both systems have plenty of free space, and I've allowed BOINC to use 90% of it.

these tasks also have much different behavior compared to the default new version acemd tasks. and they don't seem well optimized yet.
-less reliance on PCIe bandwidth, seeing 2-8% PCIe 3.0 bus utilization
-more reliance on GPU VRAM, seeing 2-3GB memory used
-less GPU utilization, seeing 65-85% GPU utilization. (maybe more dependent on a fast CPU/mem subsystem. my 3900X system gets better GPU% than my slower EPYC systems)

contrast that with the default acemd3 tasks:
-25-50% PCIe 3.0 bus utilization
-about 500MB GPU VRAM used
-95+% GPU utilization

thinking about the GPU utilization being dependent on CPU speed. It could also have to do with the relative speed between the GPU:CPU. just something I observed on my systems. slower GPUs seem to tolerate slower CPUs better, which makes sense if the CPU speed is a limiting factor.

Ryzen 3900X @4.20GHz w/ 2080ti = 85% GPU Utilization
EPYC 7402P @3.30GHz w/ 2080ti = 65% GPU Utilization
EPYC 7402P @3.30GHz w/ 2070 = 76% GPU Utilization
EPYC 7642 @2.80GHz w/ 1660Super = 71% GPU Utilization

needs more optimization IMO. the default app sees much better performance keeping the GPU fully loaded.

____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1095
Credit: 3,258,384,910
RAC: 181,881
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55946 - Posted: 10 Dec 2020 | 16:03:34 UTC - in response to Message 55936.

Boy, mixing both regular acemd3 and the python anaconda tasks sure F*s up the APR for both tasks. The insanely low APR for the Python tasks is forcing all GPUGrid tasks into High Priority.

The regular acemd3 tasks are getting 3-6 day estimated completions.

Actually, that won't be the cause. The APRs are kept separately for each application, and once you have an 'active' APR (11 or more 'completions' - validated tasks for that app), they should keep out of each other's way.

What will F* things up is that this project still allows DCF to run free - and that's a single value which is applied to both task types.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 688
Credit: 904,899,505
RAC: 377,648
Level
Glu
Scientific publications
watwatwatwatwat
Message 55947 - Posted: 10 Dec 2020 | 16:07:55 UTC - in response to Message 55946.

Yeah, after I wrote that I realized I meant the DCF is what is messing up the runtime estimations.

I wonder if the regular acemd3 tasks will ever get their DCFs back to normal.

I haven't run ANY of my other gpu project tasks since these anaconda python tasks have shown up. I will eventually when the other projects deadlines approach of course.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 324
Credit: 3,365,990,576
RAC: 4,059,647
Level
Arg
Scientific publications
wat
Message 55948 - Posted: 10 Dec 2020 | 16:09:51 UTC - in response to Message 55946.

what's DCF?
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 688
Credit: 904,899,505
RAC: 377,648
Level
Glu
Scientific publications
watwatwatwatwat
Message 55949 - Posted: 10 Dec 2020 | 16:29:16 UTC - in response to Message 55948.

what's DCF?

Task Duration Correction Factor.
The older BOINC server versions use it like Einstein.
It messes up gpu tasks of different apps there too.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1095
Credit: 3,258,384,910
RAC: 181,881
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55951 - Posted: 10 Dec 2020 | 17:11:20 UTC - in response to Message 55947.

You can't talk about 'their DCFs' - there is only one (there could have been more than one, but that's the way David chose to play it)

You can see it in BOINC Manager, on the Projects|properties dialog. If it gets really, really high (above 90), it'll inch downwards at 1% per task. Below 90, it'll speed up to 10% per task. The standard advice used to be "two weeks to stabilise", but with modern machines (multi-core, multi-GPU, and faster), the tasks fly by, and it should be quicker.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 688
Credit: 904,899,505
RAC: 377,648
Level
Glu
Scientific publications
watwatwatwatwat
Message 55953 - Posted: 10 Dec 2020 | 17:28:15 UTC - in response to Message 55951.

What is also messed up is the size of the Anaconda Python task estimated computation size shown in the task properties.

The ones I crunched were only set for 3,000 GFLOPS.

The regular acemd3 tasks are set for 5,000,000 GFLOPS.

This also probably influenced the wildly inaccurate DCF's for the new python tasks.
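
A back-of-envelope illustration of how the two interact (the 1 TFLOPS projected app speed and the DCF of ~90 are assumed round numbers for illustration, not measured values):

# estimate = rsc_fpops_est / projected speed, then scaled by the single shared DCF
awk 'BEGIN {
  speed  = 1.0e12                # assumed projected flops/s for the GPU app
  py_est = 3.0e12 / speed        # 3,000 GFLOPS python task  -> ~3 s estimate
  md_est = 5.0e15 / speed        # 5,000,000 GFLOPS acemd3   -> ~5,000 s estimate
  dcf    = 90                    # roughly where the python runs drag the DCF
  printf "python %.0f s, acemd3 %.0f s x DCF %d = ~%.1f days\n",
         py_est, md_est, dcf, md_est * dcf / 86400
}'
# a multi-hour python run against a ~3 second estimate pulls the DCF way up,
# and that same DCF then inflates the acemd3 estimates into the multi-day range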

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 688
Credit: 904,899,505
RAC: 377,648
Level
Glu
Scientific publications
watwatwatwatwat
Message 55954 - Posted: 10 Dec 2020 | 17:33:17 UTC - in response to Message 55951.

You can't talk about 'their DCFs' - there is only one (there could have been more than one, but that's the way David chose to play it)

You can see it in BOINC Manager, on the Projects|properties dialog. If it gets really, really high (above 90), it'll inch downwards at 1% per task. Below 90, it'll speed up to 10% per task. The standard advice used to be "two weeks to stabilise", but with modern machines (multi-core, multi-GPU, and faster), the tasks fly by, and it should be quicker.

This daily driver has GPUGrid DCF Project properties currently at 85 and change.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 324
Credit: 3,365,990,576
RAC: 4,059,647
Level
Arg
Scientific publications
wat
Message 55955 - Posted: 10 Dec 2020 | 17:33:49 UTC - in response to Message 55953.
Last modified: 10 Dec 2020 | 17:37:11 UTC

What is also messed up is the size of the Anaconda Python task estimated computation size shown in the task properties.

The ones I crunched were only set for 3,000 GFLOPS.

The regular acemd3 tasks are set for 5,000,000 GFLOPS.

This also probably influenced the wildly inaccurate DCF's for the new python tasks.

can confirm.

could this be why the credit reward is so high too?

I wonder what the flop estimate was on this one from Kevvy:
https://www.gpugrid.net/result.php?resultid=31679003
he got wrecked on this one, over 5hrs on a 2080ti, and got a mere 20 credits lol.
____________

biodoc
Send message
Joined: 26 Aug 08
Posts: 174
Credit: 2,178,802,375
RAC: 535,448
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55956 - Posted: 10 Dec 2020 | 18:20:14 UTC

I've got one running now on an RTX 2070S and the only real issue is low GPU utilization (60-70%). The current task is using ~2 GB of VRAM and ~3 GB of system RAM. I have one thread free on a ryzen 3900X to support the GPU and that thread is running at 100%. This computer has completed 3 of the new python tasks successfully.

Linux Mint 20; Driver Version: 440.95.01; CUDA Version: 10.2

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 324
Credit: 3,365,990,576
RAC: 4,059,647
Level
Arg
Scientific publications
wat
Message 55957 - Posted: 10 Dec 2020 | 18:25:08 UTC - in response to Message 55956.

I've got one running now on an RTX 2070S and the only real issue is low GPU utilization (60-70%). The current task is using ~2 GB of VRAM and ~3 GB of system RAM. I have one thread free on a ryzen 3900X to support the GPU and that thread is running at 100%. This computer has completed 3 of the new python tasks successfully.

Linux Mint 20; Driver Version: 440.95.01; CUDA Version: 10.2


what kind of BOINC install do you have? does it run as a service? or a standalone install that runs from an executable?

what is the clock speed of your 3900X and memory speed as well?

try letting there be 2 spare free threads (so you have one doing nothing) to avoid maxing out the CPU to 100% utilization on all threads. this is known to slow down GPU work. this might increase your GPU utilization a bit.

____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 688
Credit: 904,899,505
RAC: 377,648
Level
Glu
Scientific publications
watwatwatwatwat
Message 55958 - Posted: 10 Dec 2020 | 19:05:05 UTC - in response to Message 55955.

There's an explanation for 20 credit tasks over at Rosetta.
Has to do with a task being interrupted in calculation and restarted if I remember correctly.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 688
Credit: 904,899,505
RAC: 377,648
Level
Glu
Scientific publications
watwatwatwatwat
Message 55959 - Posted: 10 Dec 2020 | 19:07:58 UTC - in response to Message 55957.
Last modified: 10 Dec 2020 | 19:15:47 UTC

what kind of BOINC install do you have? does it run as a service? or a standalone install that runs from an executable?


That was one of the questions I wanted to ask Mr. Kevvy, since he seems to be the first cruncher to successfully crunch a ton of them without errors.

I wondered if his BOINC was a service install or a standalone.

[Edit] OK, so Mr. Kevvy is still using the AIO. I wondered since a lot of our team seem to have dropped the AIO and gone back to the service install.

So, then likely the main difference is that Mr. Kevvy is using the older glibc 2.29 instead of the glibc 2.31 that we Ubuntu 20 users are running.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 324
Credit: 3,365,990,576
RAC: 4,059,647
Level
Arg
Scientific publications
wat
Message 55961 - Posted: 10 Dec 2020 | 19:17:41 UTC - in response to Message 55959.

I'm almost positive he's running a standalone install.
____________

biodoc
Send message
Joined: 26 Aug 08
Posts: 174
Credit: 2,178,802,375
RAC: 535,448
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55962 - Posted: 10 Dec 2020 | 19:28:54 UTC - in response to Message 55957.

I've got one running now on an RTX 2070S and the only real issue is low GPU utilization (60-70%). The current task is using ~2 GB of VRAM and ~3 GB of system RAM. I have one thread free on a ryzen 3900X to support the GPU and that thread is running at 100%. This computer has completed 3 of the new python tasks successfully.

Linux Mint 20; Driver Version: 440.95.01; CUDA Version: 10.2


what kind of BOINC install do you have? does it run as a service? or a standalone install that runs from an executable?

what is the clock speed of your 3900X and memory speed as well?

try letting there be 2 spare free threads (so you have one doing nothing) to avoid maxing out the CPU to 100% utilization on all threads. this is known to slow down GPU work. this might increase your GPU utilization a bit.


Boinc runs as a service and was installed from the Mint repository (version 7.16.6). The CPU clock speed is 3.9 GHz and the RAM is DDR4 3200 CL16. I did free up another thread but I didn't see an obvious difference in GPU utilization.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 324
Credit: 3,365,990,576
RAC: 4,059,647
Level
Arg
Scientific publications
wat
Message 55963 - Posted: 10 Dec 2020 | 19:31:00 UTC - in response to Message 55959.
Last modified: 10 Dec 2020 | 19:39:47 UTC

So, then likely the main difference is that Mr. Kevvy is using the older glibc 2.29 instead of the glibc 2.31 that we Ubuntu 20 users are running.


difference in what sense?

you and I both have glibc 2.31 and we both have a bunch of successful completions. looks like Kevvy's Ubuntu 20 systems also have 2.31. all of us with these Ubuntu 20.04 systems have successful completions.

but of all of his Linux Mint (based on Ubuntu 19) systems, none have completed a single Python task successfully. I'm not sure if it's a problem with Linux Mint or what. I'm not sure it's necessarily anything to do with the GLIBC since his error messages are varied, and none mention GLIBC as being the cause. It could just be that the app has some bugs to work out when running in different environments. I also don't know if he's using service installs on his Mint systems, he's got a lot of different BOINC versions across all his systems.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 324
Credit: 3,365,990,576
RAC: 4,059,647
Level
Arg
Scientific publications
wat
Message 55964 - Posted: 10 Dec 2020 | 19:36:51 UTC - in response to Message 55962.

Boinc runs as a service and was installed from the Mint repository (version 17.16.6). The CPU clock speed is 3.9 GHz and the RAM is DDR4 3200 CL16. I did free up another thread but I didn't see an obvious difference in GPU utilization.


thanks for the clarification. it was worth a shot on the GPU utilization with the free thread, low hanging fruit.

I run my memory at 3600 CL14, but I've never seen memory matter that much even for CPU tasks on other projects, let alone GPU tasks. (I saw no difference when changing from 3200CL16 to 3600CL14), but anything's possible I guess.

____________

biodoc
Send message
Joined: 26 Aug 08
Posts: 174
Credit: 2,178,802,375
RAC: 535,448
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55965 - Posted: 10 Dec 2020 | 19:44:14 UTC - in response to Message 55963.

So, then likely the main difference is that Mr. Kevvy is using the older glibc 2.29 instead of the glibc 2.31 that we Ubuntu 20 users are running.


difference in what sense?

you and I both have glibc 2.31 and we both have a bunch of successful completions. looks like Kevvy's Ubuntu 20 systems also have 2.31. all of us with these Ubuntu 20.04 systems have successful completions.

but of all of his Linux Mint (based on Ubuntu 19) systems, none have completed a single Python task successfully. I'm not sure if it's a problem with Linux Mint or what. I'm not sure it's necessarily anything to do with the GLIBC since his error messages are varied, and none mention GLIBC as being the cause. It could just be that the app has some bugs to work out when running in different environments.


Mint 20 is based on Ubuntu 20.04 and has glibc 2.31. The 2 computers I have running GPUGrid have Mint 20 installed and the RTX cards on those computers are completing the new python tasks successfully.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 324
Credit: 3,365,990,576
RAC: 4,059,647
Level
Arg
Scientific publications
wat
Message 55966 - Posted: 10 Dec 2020 | 20:01:37 UTC - in response to Message 55965.
Last modified: 10 Dec 2020 | 20:02:13 UTC

Mint 20 is based on Ubuntu 20.04 and has glibc 2.31. The 2 computers I have running GPUGrid have Mint 20 installed and the RTX cards on those computers are completing the new python tasks successfully.


Yes, I know. But my point was that there are many differences between Mint 19 and 20, not just the GLIBC version, and usually when GLIBC is an issue it shows up as the reason for the error in the task results, but that hasn't been the case.

and conversely we have several examples of tasks hitting Ubuntu 20.04 systems with GLIBC of 2.31 and they still fail.

I think it's just buggy.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 688
Credit: 904,899,505
RAC: 377,648
Level
Glu
Scientific publications
watwatwatwatwat
Message 55969 - Posted: 10 Dec 2020 | 22:05:44 UTC - in response to Message 55966.

Yes, I had over a half dozen failed tasks before the first successful task.
That's why I was wondering if the failed tasks report the failed configuration upstream and change the future task configuration.

Pretty sure lots of prerequisite software is downloaded first from conda and configured on the system before finally actually starting real crunching.

And the configuration downloads happen for each task, I think.

Not just some initial download and from then on all the files are static.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 324
Credit: 3,365,990,576
RAC: 4,059,647
Level
Arg
Scientific publications
wat
Message 55996 - Posted: 12 Dec 2020 | 20:10:00 UTC

FYI, these tasks don't checkpoint properly.

if you need to stop BOINC or the system experiences a power outage, the tasks restart from the beginning (10%) but the task timer still tracks from where it left off even though the task restarted. if the tasks were short like MDAD (but MDAD checkpoints properly) it wouldn't be a huge problem. but when they run for 4-5hrs and need to start over for any interruption, it's a bit of a kick in the pants. even worse when these restarted tasks only get 20cred for up to 2x total run time. not worth finishing it at that point.

additionally as has been mentioned in the other thread, these tasks wreak havoc on the system's DCF since it seems to be set incorrectly for these tasks. you get these tasks that make boinc think they will complete in 10 seconds, and they end up taking 4hrs, so BOINC counters with inflating the run time of normal tasks to 10+ days when they only take 20-40 min lol. and it swings wildly back and forth depending how many of each type you've completed.

and credit reward, other than being about 10x normal for tasks of this runtime, seems only tied to FLOPS and runtime without accounting for efficiency at all.

my 3900X/2080ti completes tasks on average much faster than my EPYC/2080ti system since the 3900X system is running higher GPU utilization allowing faster run times. but the 3900X system earns proportionally less credit. so both systems end up earning the same amount of credit per card. the 3900X/2080ti should be earning more credit since it's doing more tasks. reward is being overinflated for tasks that have longer run time due to inefficiency. it seems tied only to raw runtime and estimated flops. I understand that tasks can have varying run times, but if you won't account for efficiency you need to have a static reward not dependent on runtime at all. for reference, a static reward of about 175,000 would, on average, bring these tasks near the MDAD for cred/unit-time.
____________

Greger
Send message
Joined: 6 Jan 15
Posts: 48
Credit: 6,096,382,366
RAC: 59,156
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwat
Message 55997 - Posted: 12 Dec 2020 | 22:03:58 UTC
Last modified: 12 Dec 2020 | 22:06:56 UTC

My host switched to another project task, then resumed, and after a while I had to update the system and restart. It did indeed fail to resume from the last state, so it looks like the checkpoint was far behind or there was no checkpoint at all. The elapsed time stayed at around 2 hours, which was hours behind, and the estimated percentage was locked at 10%.

I aborted it the next day as it reached 14 hours.

https://www.gpugrid.net/result.php?resultid=31701824

I would expect checkpointing not to be fully working yet and to be added later on. There is a lot of testing going on but still little info for us, so we need to take it for what it is and deal with it if the tasks don't work.

Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Send message
Joined: 9 Dec 08
Posts: 988
Credit: 5,068,599
RAC: 0
Level
Ser
Scientific publications
watwatwatwat
Message 56007 - Posted: 15 Dec 2020 | 15:52:05 UTC - in response to Message 55997.

The Python app runs ACEMD, but uses additional libraries to compute additional force terms. These libraries are distributed as Conda (Python) packages.

For this to work, I had to make an App which installs a self-contained Conda install in the project dir. The installation is re-used from one run to the other.
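
(For the curious, this is not the actual wrapper, just the usual pattern for the self-contained install described above: the Miniconda installer takes -b for a silent install and -p for an arbitrary prefix, which is how a project-local miniconda/ directory like the one in the earlier logs can be created and then reused between runs.)

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p ./miniconda
./miniconda/bin/conda install --yes openmm   # example package only, not the real list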

This is rather finicky (for example, downloads are large, and I have to be careful with concurrent installs).

Two outstanding issues are over-crediting (I am using some default BOINC formula) and, as far as i understand, the flops estimate (?).

zombie67 [MM]
Avatar
Send message
Joined: 16 Jul 07
Posts: 188
Credit: 453,296,905
RAC: 395,110
Level
Gln
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56008 - Posted: 15 Dec 2020 | 17:04:43 UTC - in response to Message 56007.

Two outstanding issues are over-crediting (I am using some default BOINC formula) and, as far as i understand, the flops estimate (?).


Over-crediting? I am seeing the opposite problem.

https://www.gpugrid.net/result.php?resultid=31902208

20.83 credits for 4.5 hours of run time on an RTX 2080 Ti. That is practically nothing. And this is not a one-off. All my tasks so far are similar.
____________
Reno, NV
Team: SETI.USA

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1095
Credit: 3,258,384,910
RAC: 181,881
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56009 - Posted: 15 Dec 2020 | 17:14:18 UTC - in response to Message 56007.

Thanks for the details.

The flops estimate

Yes, the "size" of the tasks, as expressed by <rsc_fpops_est> in the workunit template. The current value is 3,000 GFLOPS: all other GPUGrid task types are are 5,000,000 GFLOPS.

An App which installs a self-contained Conda install

We are encountering an unfortunate clash with the security of BOINC running as a systemd service under Linux. Useful bits of BOINC (pausing computation when the computer's user is active on the mouse or keyboard) rely on having access to the public /tmp/ folder structure. The conda installer wants to make use of a temporary folder.

systemd allows us to have either public tmp folders (read only, for security), or private tmp folders (write access). But not both at the same time. We're exploring how to get the best of both worlds...

Discussions in
https://www.gpugrid.net/forum_thread.php?id=5204
https://github.com/BOINC/boinc/issues/4125
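
(Side note for anyone wanting to see which way their own service install is set up: my assumption is that the systemd directive at the centre of the linked discussion is PrivateTmp; the commands below only inspect or override it, they are not an official fix.)

systemctl show -p PrivateTmp boinc-client    # prints PrivateTmp=yes or PrivateTmp=no
sudo systemctl edit boinc-client             # opens a drop-in override where a
                                             # [Service] / PrivateTmp= line can be
                                             # added, followed by a service restart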

over-crediting

We're enjoying it while it lasts!

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1095
Credit: 3,258,384,910
RAC: 181,881
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56010 - Posted: 15 Dec 2020 | 17:19:51 UTC - in response to Message 56008.

Over-crediting?

OK, make that 'inconsistent crediting'. Mine are all in the 600,000 - 900,000 range, for much the same runtime on a 1660 Ti.

Host 508381

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 324
Credit: 3,365,990,576
RAC: 4,059,647
Level
Arg
Scientific publications
wat
Message 56011 - Posted: 15 Dec 2020 | 17:50:02 UTC - in response to Message 56010.
Last modified: 15 Dec 2020 | 18:13:03 UTC

Over-crediting?

OK, make that 'inconsistent crediting'. Mine are all in the 600,000 - 900,000 range, for much the same runtime on a 1660 Ti.

Host 508381


the 20 credits thing seems to only happen with restarted tasks from what I've seen. not sure if anything else triggers it.

but I can say with certainty that the credit allocation is "questionable", and only appears to be related to the flops of device 0 in BOINC, as well as runtime. slow devices masked behind a fast device0 will earn credit at the rate of the faster device...
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 324
Credit: 3,365,990,576
RAC: 4,059,647
Level
Arg
Scientific publications
wat
Message 56012 - Posted: 15 Dec 2020 | 17:53:40 UTC - in response to Message 56008.

Two outstanding issues are over-crediting (I am using some default BOINC formula) and, as far as i understand, the flops estimate (?).


Over-crediting? I am seeing the opposite problem.

https://www.gpugrid.net/result.php?resultid=31902208

20.83 credits for 4.5 hours of run time on an RTX 2080 Ti. That is practically nothing. And this is not a one-off. All my tasks so far are similar.


this happens when the task is interrupted (stopped and resumed). you can't interrupt these tasks at all.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1095
Credit: 3,258,384,910
RAC: 181,881
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56013 - Posted: 15 Dec 2020 | 18:09:15 UTC

We should perhaps mention the lack of effective checkpointing while we have Toni's attention. Even though the tasks claim to checkpoint every 0.9% (after the initial 10% allowed for the setup), the apps are unable to resume from the point previously reached.

zombie67 [MM]
Avatar
Send message
Joined: 16 Jul 07
Posts: 188
Credit: 453,296,905
RAC: 395,110
Level
Gln
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56014 - Posted: 15 Dec 2020 | 18:22:52 UTC - in response to Message 56012.

Over-crediting? I am seeing the opposite problem.

https://www.gpugrid.net/result.php?resultid=31902208

20.83 credits for 4.5 hours of run time on an RTX 2080 Ti. That is practically nothing. And this is not a one-off. All my tasks so far are similar.


this happens when the task is interrupted (stopped and resumed). you can't interrupt these tasks at all.


I'll check that out. But I have not suspended or otherwise interrupted any tasks. Unless BOINC is doing that without my knowledge. But I don't think so.
____________
Reno, NV
Team: SETI.USA

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 324
Credit: 3,365,990,576
RAC: 4,059,647
Level
Arg
Scientific publications
wat
Message 56015 - Posted: 15 Dec 2020 | 18:28:46 UTC - in response to Message 56014.

Over-crediting? I am seeing the opposite problem.

https://www.gpugrid.net/result.php?resultid=31902208

20.83 credits for 4.5 hours of run time on an RTX 2080 Ti. That is practically nothing. And this is not a one-off. All my tasks so far are similar.


this happens when the task is interrupted (stopped and resumed). you can't interrupt these tasks at all.


I'll check that out. But I have not suspended or otherwise interrupted any tasks. Unless BOINC is doing that without my knowledge. But I don't think so.


you also appear to have your hosts setup to ONLY crunch these beta tasks. is there a reason for that?

does your system process the normal tasks fine? maybe it's something going on with your system as a whole.

____________

zombie67 [MM]
Avatar
Send message
Joined: 16 Jul 07
Posts: 188
Credit: 453,296,905
RAC: 395,110
Level
Gln
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56016 - Posted: 15 Dec 2020 | 18:57:24 UTC - in response to Message 56015.

you also appear to have your hosts setup to ONLY crunch these beta tasks. is there a reason for that?

I have reached my wuprop goals for the other apps. So I am interested in only this particular app (for now).

does your system process the normal tasks fine? maybe it's something going on with your system as a whole.

Yep, all the other apps run fine, both here and on other projects.
____________
Reno, NV
Team: SETI.USA

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 324
Credit: 3,365,990,576
RAC: 4,059,647
Level
Arg
Scientific publications
wat
Message 56017 - Posted: 15 Dec 2020 | 20:40:18 UTC - in response to Message 56016.
Last modified: 15 Dec 2020 | 21:09:19 UTC

you also appear to have your hosts setup to ONLY crunch these beta tasks. is there a reason for that?

I have reached my wuprop goals for the other apps. So I am interested in only this particular app (for now).

does your system process the normal tasks fine? maybe it's something going on with your system as a whole.

Yep, all the other apps run fine, both here and on other projects.


I have a theory, but not sure if it's correct or not.

can you tell me the peak_flops value reported in your coproc_info.xml file for the 2080ti?

basically, you are using such an old version of BOINC (7.9.3) that it pre-dates the fixes implemented in 7.14.2 to properly calculate the peak flops of Turing cards. So I'm willing to bet that your version of BOINC is over-estimating your peak flops by a factor of 2. a 2080ti should read somewhere between 13.5 and 15 TFlops, and I'm guessing your old version of BOINC is thinking it's closer to double that (25-30 TFlops)

the second half of the theory is that there is some kind of hard limit (maybe an anti-cheat mechanism?) that prevents a credit reward somewhere around >2,000,000. maybe 1.8million, maybe 1.9million? but I haven't observed ANYONE getting a task earning that much, and all tasks that would reach that level based on runtime seem to get this 20-credit value.

that's my theory, I could be wrong. if you try a newer version of boinc that properly measures the flops on a turing card, and you start getting real credit, then it might hold water.
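
A quick way to pull the value in question (data directory path assumed for a package/service install; use your own BOINC data directory for a standalone client):

grep peak_flops /var/lib/boinc-client/coproc_info.xml
# a 2080 Ti should report on the order of 1.3e13-1.5e13; a value around 2.7e13
# or higher would fit the doubled estimate described above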
____________

sph
Send message
Joined: 22 Oct 20
Posts: 4
Credit: 34,434,982
RAC: 79,115
Level
Val
Scientific publications
wat
Message 56018 - Posted: 15 Dec 2020 | 23:13:08 UTC - in response to Message 56007.
Last modified: 15 Dec 2020 | 23:15:51 UTC

Two outstanding issues are over-crediting (I am using some default BOINC formula) and, as far as i understand, the flops estimate (?).


Toni, One more issue to add to the list.

The download from Ananconda website does not allow for hosts behind a proxy. Can you please add a check for Proxy settings in the BOINC client so external software can be downloaded?
I have other hosts that are not behind a proxy and they download and run the Experimental tasks fine.

Issue here:
CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://conda.anaconda.org/conda-forge/linux-64/_libgcc_mutex-0.1-conda_forge.tar.bz2>
Elapsed: -

An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.

This error repeats itself until it eventually gives up after 5 minutes and fails the task.

Happens on 2 hosts sitting behind a Web Proxy (Squid)
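
Until the app handles that, a possible volunteer-side workaround (an assumption on my part, untested here: conda's downloads honour the standard proxy environment variables) is to hand the proxy to the whole BOINC service, e.g. on a systemd install:

sudo systemctl edit boinc-client
# in the override that opens, add (substituting the real Squid host and port):
#   [Service]
#   Environment="http_proxy=http://proxy.example:3128"
#   Environment="https_proxy=http://proxy.example:3128"
sudo systemctl restart boinc-client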

zombie67 [MM]
Avatar
Send message
Joined: 16 Jul 07
Posts: 188
Credit: 453,296,905
RAC: 395,110
Level
Gln
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56019 - Posted: 16 Dec 2020 | 1:19:31 UTC - in response to Message 56017.

A second, identical machine, except it has dual RTX 1660 Ti cards, finally got some work. The tasks reported and were awarded the large credits. So that rules out the question WRT BOINC version. FWIW, that version of BOINC is the latest available from the repository.

So maybe it is due to interruptions after all, and I am just unaware? I am running some more tasks now, and will check again in the morning.
____________
Reno, NV
Team: SETI.USA

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 324
Credit: 3,365,990,576
RAC: 4,059,647
Level
Arg
Scientific publications
wat
Message 56020 - Posted: 16 Dec 2020 | 2:57:24 UTC - in response to Message 56019.
Last modified: 16 Dec 2020 | 3:01:29 UTC

A second, identical machine, except it has dual RTX 1660 Ti cards, finally got some work. The tasks reported and were awarded the large credits. So that rules out the question WRT BOINC version. FWIW, that version of BOINC is the latest available from the repository.

So maybe it is due to interruptions after all, and I am just unaware? I am running some more tasks now, and will check again in the morning.


it doesn't rule it out because a 1660ti has a much lower flops value. like 5.5 TFlop. so with the old boinc version, it's estimating ~11TFlop and that's not high enough to trigger the issue. you're only seeing it on the 2080ti because it's a much higher performing card. ~14TFlop by default, and the old boinc version is scaling it all the way up to 28+ TFlop. this causes the calculated credit to be MUCH higher than that of the 1660ti, and hence triggering the 20-cred issue, according to my theory of course. but your 1660ti tasks are well below the 2,000,000 credit threshold that I'm estimating. highest I've seen is ~1.7million, so the line can't be much higher. I'm willing to bet that if one of your tasks on that 1660ti system runs for ~30,000-40,000 seconds, it gets hit with 20 credits. ¯\_(ツ)_/¯
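
for reference, those ballpark figures are roughly CUDA cores x 2 FMA ops x clock; reference boost clocks assumed here:

awk 'BEGIN {
  printf "GTX 1660 Ti: %.1f TFLOPS\n", 1536 * 2 * 1.770e9 / 1e12   # ~5.4
  printf "RTX 2080 Ti: %.1f TFLOPS\n", 4352 * 2 * 1.545e9 / 1e12   # ~13.4
}'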

you really should try to get your hands on a newer version of BOINC. I use a version of BOINC that was compiled custom, and have usually used custom compiled versions from newer versions of the source code. maybe one of the other guys here can point you to a different repository that has a newer version of BOINC that can properly manage the Turing cards.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 324
Credit: 3,365,990,576
RAC: 4,059,647
Level
Arg
Scientific publications
wat
Message 56021 - Posted: 16 Dec 2020 | 3:13:29 UTC - in response to Message 56020.

I also verified that restarting ALONE won't necessarily trigger the 20-credit reward.

it depends WHEN you restart it. if you restart the task early enough that the combined runtime won't come close to the 2mil credit mark, you'll get the normal points

this task here: https://www.gpugrid.net/result.php?resultid=31934720

I restarted this task about 10-15mins into it. and it started over from the 10% mark, ran to completion, and still got normal crediting. and well below the threshold.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 324
Credit: 3,365,990,576
RAC: 4,059,647
Level
Arg
Scientific publications
wat
Message 56023 - Posted: 16 Dec 2020 | 14:36:25 UTC - in response to Message 56019.

A second, identical machine, except it has dual RTX 1660 Ti cards, finally got some work. The tasks reported and were awarded the large credits. So that rules out the question WRT BOINC version. FWIW, that version of BOINC is the latest available from the repository.

So maybe it is due to interruptions after all, and I am just unaware? I am running some more tasks now, and will check again in the morning.


I see you changed BOINC to 7.17.0.

another thing I noticed was that the change didn't take effect until new tasks were downloaded after it, so tasks that were already there and tagged with the overinflated flops value will probably still get 20-cred. only the newly downloaded tasks after the change should work better.

____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 324
Credit: 3,365,990,576
RAC: 4,059,647
Level
Arg
Scientific publications
wat
Message 56027 - Posted: 16 Dec 2020 | 18:10:19 UTC - in response to Message 56023.

aaaand your 2080ti just completed a task and got credit with the new BOINC version. called it.

http://www.gpugrid.net/result.php?resultid=31951281
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 324
Credit: 3,365,990,576
RAC: 4,059,647
Level
Arg
Scientific publications
wat
Message 56028 - Posted: 16 Dec 2020 | 18:13:53 UTC - in response to Message 56020.

I'm willing to bet that if one of your tasks on that 1660ti system runs for ~30,000-40,000 seconds, it gets hit with 20 credits. ¯\_(ツ)_/¯


looks like just 25,000s was enough to trigger it.

http://www.gpugrid.net/result.php?resultid=31946707

it'll even out over time, since your other tasks are earning 2x as much credit as they should, because the old version of BOINC is doubling your peak_flops value.

____________

zombie67 [MM]
Avatar
Send message
Joined: 16 Jul 07
Posts: 188
Credit: 453,296,905
RAC: 395,110
Level
Gln
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56030 - Posted: 17 Dec 2020 | 0:43:46 UTC

After upgrading all the BOINC clients, the tasks are erroring out. Ugh.
____________
Reno, NV
Team: SETI.USA

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 324
Credit: 3,365,990,576
RAC: 4,059,647
Level
Arg
Scientific publications
wat
Message 56031 - Posted: 17 Dec 2020 | 0:54:19 UTC - in response to Message 56030.

they were working fine on your 2080ti system when you had 7.17.0. why change it?

but the issue you're having now looks like the same issue that richard was dealing with here: https://www.gpugrid.net/forum_thread.php?id=5204

that thread has the steps they took to fix it. it's a permissions issue.
____________

zombie67 [MM]
Avatar
Send message
Joined: 16 Jul 07
Posts: 188
Credit: 453,296,905
RAC: 395,110
Level
Gln
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56033 - Posted: 17 Dec 2020 | 4:47:44 UTC - in response to Message 56031.

they were working fine on your 2080ti system when you had 7.17.0. why change it?

but the issue you're having now looks like the same issue that richard was dealing with here: https://www.gpugrid.net/forum_thread.php?id=5204

that thread has the steps they took to fix it. it's a permissions issue.


That was a kludge. There is no such thing as 7.17.0. =;^) Once I verified that the newer version worked, I updated all my machines with the latest repository version, so it would be clean and updated going forward.
____________
Reno, NV
Team: SETI.USA

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 324
Credit: 3,365,990,576
RAC: 4,059,647
Level
Arg
Scientific publications
wat
Message 56036 - Posted: 17 Dec 2020 | 5:05:48 UTC - in response to Message 56033.

There is such a thing. It’s the development branch. All of my systems use a version of BOINC based on 7.17.0 :)
____________

zombie67 [MM]
Avatar
Send message
Joined: 16 Jul 07
Posts: 188
Credit: 453,296,905
RAC: 395,110
Level
Gln
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56037 - Posted: 17 Dec 2020 | 5:23:58 UTC

Well sure. I meant a released version.
____________
Reno, NV
Team: SETI.USA

mmonnin
Send message
Joined: 2 Jul 16
Posts: 289
Credit: 1,152,511,238
RAC: 10,830
Level
Met
Scientific publications
watwatwatwatwat
Message 56046 - Posted: 18 Dec 2020 | 11:24:17 UTC
Last modified: 18 Dec 2020 | 11:24:46 UTC

So long start-to-end run times cause the 20-credit issue, not the fact that they were restarted. But tasks that are interrupted restart at 0, and thus have a longer start-to-end run time.

1070 or 1070Ti
27,656.18s received 1,316,998.40
42,652.74 received 20.83

1080Ti
21,508.23 received 1,694,500.25
25,133.86, 29,742.04, 38,297.41 tasks received 20.83

I doubt they were interrupted with the tasks being High Priority and nothing else but GPUGrid in the BOINC queue.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 324
Credit: 3,365,990,576
RAC: 4,059,647
Level
Arg
Scientific publications
wat
Message 56049 - Posted: 18 Dec 2020 | 14:57:21 UTC - in response to Message 56046.

yup I confirmed this. I manually restarted a task that didn't run very long and it didn't have the issue.

the issue only happens if your credit reward will be greater than about 1.9 million.

take some of your completed tasks, divide the total credit by the runtime seconds to figure how much credit you earn per second. then figure how many seconds you need to hit 1.9 million, and that's the runtime limit for your system, anything over that and you get the 20-credit bug
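
worked through with the 1080 Ti numbers mmonnin posted above (21,508 s for 1,694,500 credits) and the ~1.9 million cap, just as an illustration:

awk 'BEGIN {
  cps = 1694500.25 / 21508.23    # credits earned per second of runtime
  printf "%.1f credits/s -> ~%.0f s before the 20-credit bug\n", cps, 1.9e6 / cps
}'
# comes out around 24,000 s, which fits the 25,134 s task above getting 20.83 credits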
____________

zombie67 [MM]
Avatar
Send message
Joined: 16 Jul 07
Posts: 188
Credit: 453,296,905
RAC: 395,110
Level
Gln
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56148 - Posted: 24 Dec 2020 | 15:33:20 UTC

Why is the number of tasks in progress dwindling? Are no new tasks being issued?
____________
Reno, NV
Team: SETI.USA

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 324
Credit: 3,365,990,576
RAC: 4,059,647
Level
Arg
Scientific publications
wat
Message 56149 - Posted: 24 Dec 2020 | 15:48:21 UTC - in response to Message 56148.
Last modified: 24 Dec 2020 | 15:49:07 UTC

most of the Python tasks I've received in the last 3 days have been "_0", so that indicates brand new. and a few resends here and there.

the rate at which they are creating them has likely slowed, and the demand is high since points chasers have come to try to snatch them up. also possible that the recent new (_0) ones are only recreations of earlier failed tasks that had some bug that needed fixing. it does seem that this run is concluding.
____________

Profile trigggl
Send message
Joined: 6 Mar 09
Posts: 19
Credit: 90,319,257
RAC: 679
Level
Thr
Scientific publications
watwatwatwatwatwatwatwat
Message 56151 - Posted: 25 Dec 2020 | 16:41:49 UTC - in response to Message 55590.

...
Also Warnings about path not found:
WARNING conda.core.envs_manager:register_env(50): Unable to register environment. Path not writable or missing. environment location: /var/lib/boinc-client/projects/www.gpugrid.net/miniconda registry file: /root/.conda/environments.txt

Registry file location ( /root/ ) will not be accessible to boinc user unless conda is already installed on the host (by root user) and conda file is world readable
...

I had the same error message except that mine was trying to go to
/opt/boinc/.conda/environments.txt

Profile trigggl
Send message
Joined: 6 Mar 09
Posts: 19
Credit: 90,319,257
RAC: 679
Level
Thr
Scientific publications
watwatwatwatwatwatwatwat
Message 56152 - Posted: 25 Dec 2020 | 16:43:36 UTC - in response to Message 55590.
Last modified: 25 Dec 2020 | 16:59:59 UTC

...
Also Warnings about path not found:
WARNING conda.core.envs_manager:register_env(50): Unable to register environment. Path not writable or missing. environment location: /var/lib/boinc-client/projects/www.gpugrid.net/miniconda registry file: /root/.conda/environments.txt

Registry file location ( /root/ ) will not be accessible to boinc user unless conda is already installed on the host (by root user) and conda file is world readable
...

I had the same error message except that mine was trying to go to...
/opt/boinc/.conda/environments.txt
Looks harmless, thanks for reporting. It's because the "boinc" user doesn't have a HOME directory I think.

Gentoo put the home for boinc at /opt/boinc.
I updated the user file to change it to /var/lib/boinc.

ALAIN_13013
Avatar
Send message
Joined: 11 Sep 08
Posts: 18
Credit: 1,535,333,080
RAC: 28,843
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56177 - Posted: 29 Dec 2020 | 6:50:04 UTC - in response to Message 55588.

I'm creating some experimental tasks for the Python app (made Beta). They are Linux and CUDA specific and serve in preparation for future batches.

They may use a relatively large amount of disk space (order of 1-10 GB) which persists between runs, and is cleared if you reset the project.



What type of card is the minimum for this app? My 980Ti doesn't load WUs.
____________

rod4x4
Send message
Joined: 4 Aug 14
Posts: 266
Credit: 2,200,441,910
RAC: 243,507
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 56181 - Posted: 29 Dec 2020 | 13:08:38 UTC - in response to Message 56177.
Last modified: 29 Dec 2020 | 13:10:12 UTC

I'm creating some experimental tasks for the Python app (made Beta). They are Linux and CUDA specific and serve in preparation for future batches.

They may use a relatively large amount of disk space (order of 1-10 GB) which persists between runs, and is cleared if you reset the project.



What type of card is the minimum for this app? My 980Ti doesn't load WUs.

In "GPUGRID Preferences", ensure you select "Python Runtime (beta)" and "Run test applications?"
Your GPU, driver and OS should run these tasks fine

ALAIN_13013
Avatar
Send message
Joined: 11 Sep 08
Posts: 18
Credit: 1,535,333,080
RAC: 28,843
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56182 - Posted: 29 Dec 2020 | 13:32:58 UTC - in response to Message 56181.
Last modified: 29 Dec 2020 | 13:33:30 UTC

I'm creating some experimental tasks for the Python app (made Beta). They are Linux and CUDA specific and serve in preparation for future batches.

They may use a relatively large amount of disk space (order of 1-10 GB) which persists between runs, and is cleared if you reset the project.



What type of card is the minimum for this app? My 980Ti doesn't load WUs.

In "GPUGRID Preferences", ensure you select "Python Runtime (beta)" and "Run test applications?"
Your GPU, driver and OS should run these tasks fine


Merci, I just forgot Run test applications :)
____________

jiipee
Send message
Joined: 4 Jun 15
Posts: 7
Credit: 2,752,905,856
RAC: 209,440
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 56183 - Posted: 29 Dec 2020 | 13:35:30 UTC

All of these seem now to error out after computation has finished. On several computers:

<message>
upload failure: <file_xfer_error>
<file_name>2p95312000-RAIMIS_NNPMM-0-1-RND8920_1_0</file_name>
<error_code>-131 (file size too big)</error_code>
</file_xfer_error>

</message>


What causes this and how it can be fixed?

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1095
Credit: 3,258,384,910
RAC: 181,881
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56185 - Posted: 29 Dec 2020 | 14:24:17 UTC - in response to Message 56183.

What causes this and how it can be fixed?

I've just posted instructions in the Anaconda Python 3 Environment v4.01 failures thread (Number Crunching).

Read through the whole post. If you don't understand anything, or you don't know how to do any of the steps I've described - back away. Don't even attempt it until you're sure. You have to edit a very important, protected, file - and that needs care and experience.
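
(Background only, and my reading of the error rather than a summary of the instructions linked above: BOINC error -131 means an output file grew past the <max_nbytes> limit the server set for it, and that limit is recorded client-side in client_state.xml. With the client stopped you can at least look at it; the file name is taken from jiipee's error above.)

sudo grep -A6 '2p95312000-RAIMIS_NNPMM-0-1-RND8920_1_0' /var/lib/boinc-client/client_state.xml
# the surrounding <file_info> block contains the <max_nbytes> cap that the
# finished upload exceeded; editing that protected file is the risky step being
# warned about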

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 324
Credit: 3,365,990,576
RAC: 4,059,647
Level
Arg
Scientific publications
wat
Message 56186 - Posted: 29 Dec 2020 | 14:33:52 UTC - in response to Message 56185.

What causes this and how it can be fixed?

I've just posted instructions in the Anaconda Python 3 Environment v4.01 failures thread (Number Crunching).

Read through the whole post. If you don't understand anything, or you don't know how to do any of the steps I've described - back away. Don't even attempt it until you're sure. You have to edit a very important, protected, file - and that needs care and experience.


really needs to be fixed server side (or would be nice if it were configurable via cc_config, but that doesn't look to be the case either).

stopping and starting the client is a recipe for instant errors, and even where successful, this process will need to be repeated every time you download new tasks. not really a viable option unless you want to babysit the system all day.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1095
Credit: 3,258,384,910
RAC: 181,881
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56187 - Posted: 29 Dec 2020 | 14:45:32 UTC - in response to Message 56186.

Stopping and starting the client is a recipe for instant errors, and even where successful, this process will need to be repeated every time you download new tasks. not really a viable option unless you want to babysit the system all day.

By itself, it's fairly safe - provided you know and understand the software on your own system well enough. But you do need to have that experience and knowledge, which is why I put the caveats in.

I agree about having to re-do it for every new task, but I'd like to get my APR back up to something reasonable - and I'm happy to help nudge the admins one more step along the way to a fully-working, 'set and forget', application.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1095
Credit: 3,258,384,910
RAC: 181,881
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56189 - Posted: 29 Dec 2020 | 16:39:50 UTC - in response to Message 56187.

They're working on something...

WU 26917726

jiipee
Send message
Joined: 4 Jun 15
Posts: 7
Credit: 2,752,905,856
RAC: 209,440
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 56208 - Posted: 31 Dec 2020 | 8:59:22 UTC - in response to Message 56186.

What causes this and how it can be fixed?

I've just posted instructions in the Anaconda Python 3 Environment v4.01 failures thread (Number Crunching).

Read through the whole post. If you don't understand anything, or you don't know how to do any of the steps I've described - back away. Don't even attempt it until you're sure. You have to edit a very important, protected, file - and that needs care and experience.


really needs to be fixed server side (or would be nice if it were configurable via cc_config, but that doesn't look to be the case either).

stopping and starting the client is a recipe for instant errors, and even where successful, this process will need to be repeated every time you download new tasks. not really a viable option unless you want to babysit the system all day.

Exactly so. I don't know about others, but I have no time to sit and watch my hosts working. A host is working 10 hours to get the task done, and then everything turns out to be just a waste of time and energy because of this file size limitation. This is somewhat frustrating.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1095
Credit: 3,258,384,910
RAC: 181,881
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56209 - Posted: 31 Dec 2020 | 10:16:49 UTC - in response to Message 56208.

Opt out of the Beta test programme if you don't want to encounter those problems.

But as it happens, I haven't had a single over-run since they cancelled the one I highlighted in the post before yours.

jiipee
Send message
Joined: 4 Jun 15
Posts: 7
Credit: 2,752,905,856
RAC: 209,440
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 56210 - Posted: 31 Dec 2020 | 12:02:22 UTC - in response to Message 56209.

Opt out of the Beta test programme if you don't want to encounter those problems.

But as it happens, I haven't had a single over-run since they cancelled the one I highlighted in the post before yours.

Yes, I agree - something has changed.

It looks like the last full (successful) computation on my hosts that produced a too-large output file was WU 26900019, which ended 29 Dec 2020 | 15:00:52 UTC after 31,056 seconds of run time.
