
Message boards : Number crunching : Pablo, When is enough enough???

Aurum
Joined: 12 Jul 17
Posts: 401
Credit: 16,812,520,943
RAC: 2,764,099
Message 51134 - Posted: 29 Dec 2018 | 13:29:41 UTC
Last modified: 29 Dec 2018 | 13:30:15 UTC

Two days ago I completed a WU that ran for 21 days, the next one took 14 days, and today there are two running at 5 and 8 days (below):
Computer: Rig-26
Project GPUGRID
Name e7s25_e1s146p0f12-PABLO_v2O00512_MOR_5_IDP-1-2-RND0561_8
Application Long runs (8-12 hours on fastest card) 9.19 (cuda80)
Workunit name e7s25_e1s146p0f12-PABLO_v2O00512_MOR_5_IDP-1-2-RND0561
State Running High P.
Received 12/20/2018 7:51:57 AM
Report deadline 12/25/2018 7:51:56 AM
Estimated app speed 193.79 GFLOPs/sec
Estimated task size 5,000,000 GFLOPs
Resources 1 CPU + 0.5 NVIDIA GPUs (device 1)
CPU time at last checkpoint 00:00:00
CPU time 08d,19:21:05
Elapsed time 08d,19:23:43
Estimated time remaining 00:00:00
Fraction done 100.000%
Virtual memory size 22,284.77 MB
Working set size 77.79 MB
Directory slots/31
Process ID 2411

Computer: Rig-12
Project GPUGRID
Name e2s18_e1s312p2f159-PABLO_V3_p27_sj403_IDP-3-4-RND6569_0
Application Long runs (8-12 hours on fastest card) 9.19 (cuda80)
Workunit name e2s18_e1s312p2f159-PABLO_V3_p27_sj403_IDP-3-4-RND6569
State Running High P.
Received 12/18/2018 12:36:04 PM
Report deadline 12/23/2018 12:36:02 PM
Estimated app speed 121.06 GFLOPs/sec
Estimated task size 5,000,000 GFLOPs
Resources 1 CPU + 0.5 NVIDIA GPUs (device 0)
CPU time at last checkpoint 00:00:00
CPU time 05d,12:06:03
Elapsed time 05d,12:38:21
Estimated time remaining 00:00:00
Fraction done 99.999%
Virtual memory size 30,588.04 MB
Working set size 81.13 MB
Directory slots/27
Process ID 1794

I would think there should be a built-in rule that says:

IF ElapsedTime > 2(PredictedTime) THEN Abort&Flag ELSE WTF
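To make the idea concrete, here is a rough sketch of what I mean in C++ (purely hypothetical, not real BOINC client or server code; the struct and names are all invented for illustration):

#include <cstdio>

// Hypothetical watchdog: flag/abort a task once its elapsed time
// exceeds twice the runtime estimate it was issued with.
// Not real BOINC code; names are made up for illustration only.
struct TaskTimes {
    double elapsed_sec;    // wall-clock time the task has run so far
    double estimated_sec;  // runtime predicted when the task was sent out
};

bool should_abort(const TaskTimes& t) {
    // Trip the watchdog once the task has run more than 2x its prediction.
    return t.elapsed_sec > 2.0 * t.estimated_sec;
}

int main() {
    // Example: a task estimated at ~7 hours that has already run 9 days.
    TaskTimes t{9.0 * 24 * 3600, 7.0 * 3600};
    if (should_abort(t)) {
        std::printf("Abort & flag for the project to inspect\n");
    }
    return 0;
}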

Zalster
Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Message 51137 - Posted: 29 Dec 2018 | 14:31:45 UTC - in response to Message 51134.

Two days ago I completed a WU that ran for 21 days, the next one took 14 days, and today there are two running at 5 and 8 days (below): [...]

I would think there should be a built-in rule that says:
IF ElapsedTime > 2(PredictedTime) THEN Abort&Flag ELSE WTF


When running GPUGrid, you really should only run 1 work unit per GPU. Without knowing what the other work unit is on that same GPU, you are probably starving the GPUGrid work unit of time on the GPU. What GPUs are in those machines?

Aurum
Joined: 12 Jul 17
Posts: 401
Credit: 16,812,520,943
RAC: 2,764,099
Message 51138 - Posted: 29 Dec 2018 | 15:05:47 UTC - in response to Message 51137.

So GPUGrid has been shown to be incapable of sharing a GPU???
Both of these computers have SWAN_SYNC enabled, but their app_config.xml wasn't changed after enabling SWAN_SYNC, so they still show 0.5 GPU even though their mates completed days ago and each has been running alone.
So I'd bet you a GRC that's not the problem here. These WUs could not converge on a solution for any number of reasons that only Pablo can know.

Rig-12: http://www.gpugrid.net/show_host_detail.php?hostid=484061

Rig-26: http://www.gpugrid.net/show_host_detail.php?hostid=484035

3de64piB5uZAS6SUNt1GFDU9d...
Joined: 20 Apr 15
Posts: 285
Credit: 1,102,216,607
RAC: 0
Message 51139 - Posted: 29 Dec 2018 | 16:27:44 UTC

When running GPUGrid, you really should only run 1 work unit per GPU. Without knowing what the other work unit is on that same GPU, you are probably starving the GPUGrid work unit of time on the GPU. What GPUs are in those machines?


I don't have any problems with running two tasks per GPU. My Ryzen 1700 + 2x GTX 1070 system runs 4 long jobs in parallel, and they complete in 30,000-50,000 seconds. This config keeps the GPU working and at an approximately constant temperature (-> longer lifetime), whereas a single-job config would put more thermal stress on the card, especially when jobs are rare (many idle/zero-load periods in between).
____________
I would love to see HCF1 protein folding and interaction simulations to help my little boy... someday.

Zalster
Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Message 51140 - Posted: 29 Dec 2018 | 18:23:50 UTC - in response to Message 51139.

This config keeps the GPU working and at an approximately constant temperature (-> longer lifetime), whereas a single-job config would put more thermal stress on the card, especially when jobs are rare (many idle/zero-load periods in between).



My temps are never very high. All my cards are hybrids, so they never go above 50 C. Extended warranties cover a total of 10 years (longer than most people keep their cards), so no worries about them failing and not being able to replace them.

It is troubling that his machines are taking anywhere from 5-21 days to do what most do in a few hours. The only way to know would be to run those exact work units on another machine and see if the results are the same.

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2356
Credit: 16,400,393,858
RAC: 4,107,408
Message 51141 - Posted: 29 Dec 2018 | 19:26:53 UTC - in response to Message 51138.
Last modified: 29 Dec 2018 | 19:28:19 UTC

So GPUGrid has been shown to be incapable of sharing a GPU???
It's capable of sharing the GPU, but enabling SWAN_SYNC means that you dedicate your GPU to the GPUGrid app to make it as fast as possible (i.e. maximize GPU utilization). In that case you should not share it with other projects.

Both of these computers have SWAN_SYNC enabled, but their app_config.xml wasn't changed after enabling SWAN_SYNC, so they still show 0.5 GPU even though their mates completed days ago and each has been running alone.
The app_config.xml tells the BOINC manager what to expect from the given app; it does *not* instruct the app on the extent to which it should use resources (GPU, CPU). (There's no way to configure the GPUGrid app to a given GPU utilization percentage.) So if you enable SWAN_SYNC, you should set 1.0 GPU and 1.0 CPU in app_config.xml like this:
<app_config>
  <app>
    <name>acemdlong</name>
    <fraction_done_exact/>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>acemdshort</name>
    <fraction_done_exact/>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>acemdbeta</name>
    <fraction_done_exact/>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
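(A usage note, in case it helps: if I remember right, app_config.xml goes in the project's folder inside the BOINC data directory, which for GPUGrid should be projects/www.gpugrid.net/, and the client re-reads it via Options > Read config files in BOINC Manager or after a client restart.)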


BTW the deadline is 5 days, so if a workunit takes much longer than that you should abort it and look into the cause, as such a task should take 5-6 hours on a GTX 1070.

Aurum
Joined: 12 Jul 17
Posts: 401
Credit: 16,812,520,943
RAC: 2,764,099
Message 51142 - Posted: 30 Dec 2018 | 1:05:25 UTC

Pablo et alia, please provide some feedback on these WUs that take days or weeks to complete.
Do you learn something from them if we let them run so long???
Do you notice them if we abort them before they complete???
If we should just abort them if they take too long can you automate the process as I suggested in the lead post???

IF ElapsedTime > 2*PredictedTime THEN Abort&Flag ELSE WTF

If I abort them will they just queue up and go out again???

I don't want you to miss the best binding pocket the universe has ever seen.

{BTW, this thread is not about SWAN_SYNCing.}

tullio
Joined: 8 May 18
Posts: 190
Credit: 104,426,808
RAC: 0
Message 51147 - Posted: 30 Dec 2018 | 8:48:32 UTC
Last modified: 30 Dec 2018 | 8:49:42 UTC

On my SuSE Linux host, with a GTX 750 Ti graphics board, I find the following line in stderr.txt:
CUDA Synchronization mode SPIN
What does it mean?
Tullio

ServicEnginIC
Joined: 24 Sep 10
Posts: 581
Credit: 10,316,956,834
RAC: 13,884,279
Message 51148 - Posted: 30 Dec 2018 | 11:48:53 UTC

On my SuSE Linux host, with a GTX 750 Ti graphics board, I find the following line in stderr.txt:
CUDA Synchronization mode SPIN
What does it mean?


This means that SWAN_SYNC is successfully enabled on your system.
When it is not, it should read: CUDA Synchronization mode BLOCKING
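For the curious, SPIN vs BLOCKING corresponds to CUDA's device scheduling flags. A minimal host-code sketch of the difference (just an illustration, not the actual ACEMD source):

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // SPIN: the CPU thread busy-waits on the GPU (lowest latency, but it
    // pegs one CPU core, which is why SWAN_SYNC wants a dedicated core).
    cudaSetDeviceFlags(cudaDeviceScheduleSpin);

    // The BLOCKING alternative would instead be:
    // cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    // It yields the CPU thread until the GPU finishes, freeing the core
    // at the cost of slightly higher latency.

    // ... launch kernels here ...

    cudaDeviceSynchronize();   // with SPIN the CPU busy-waits in this call
    std::printf("done\n");
    return 0;
}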
{But this matter fits better in the SWAN_SYNC in Linux client thread, http://www.gpugrid.net/forum_thread.php?id=4813, as Aurum suggests.}

Aurum
Joined: 12 Jul 17
Posts: 401
Credit: 16,812,520,943
RAC: 2,764,099
Message 51149 - Posted: 30 Dec 2018 | 12:07:24 UTC

Today's long-running WU, to abort or not to abort:

Computer: Rig-26
Project GPUGRID
Name e7s25_e1s146p0f12-PABLO_v2O00512_MOR_5_IDP-1-2-RND0561_8
Application Long runs (8-12 hours on fastest card) 9.19 (cuda80)
Workunit name e7s25_e1s146p0f12-PABLO_v2O00512_MOR_5_IDP-1-2-RND0561
State Running High P.
Received 12/20/2018 7:51:57 AM
Report deadline 12/25/2018 7:51:56 AM
Estimated app speed 199.45 GFLOPs/sec
Estimated task size 5,000,000 GFLOPs
Resources 1 CPU + 0.5 NVIDIA GPUs (device 1)
CPU time at last checkpoint 00:00:00
CPU time 09d,18:08:21
Elapsed time 09d,18:11:16
Estimated time remaining 00:00:00
Fraction done 100.000%
Virtual memory size 22,284.77 MB
Working set size 77.79 MB
Directory slots/31
Process ID 2411

ServicEnginIC
Joined: 24 Sep 10
Posts: 581
Credit: 10,316,956,834
RAC: 13,884,279
Message 51150 - Posted: 30 Dec 2018 | 12:43:17 UTC
Last modified: 30 Dec 2018 | 12:50:01 UTC

If we should just abort them if they take too long can you automate the process as I suggested in the lead post???


There is a clue in your never-ending task 15602360 http://www.gpugrid.net/workunit.php?wuid=15602360
This task has failed or not finished on many different systems, so you can freely abort a task like this. It is clearly defective.
As you suggest, some in-task protection to avoid such problems would be appreciated.

If I abort them will they just queue up and go out again???


As seen with this particular defective task, it has been resent to many systems.
The task has now been automatically retired by the project, as it reached a total of 10 resends with no successful result.

Aurum
Joined: 12 Jul 17
Posts: 401
Credit: 16,812,520,943
RAC: 2,764,099
Message 51151 - Posted: 30 Dec 2018 | 13:26:04 UTC - in response to Message 51150.

Wow! If the other nine spent 9 days each that's a quarter of a GPU-year that could've been better used.

Ok, I'm convinced to abort this one, but I don't want to toss them out if there's something that Team GDF might learn from them.

PappaLitto
Joined: 21 Mar 16
Posts: 511
Credit: 4,672,242,755
RAC: 0
Message 51152 - Posted: 30 Dec 2018 | 17:24:31 UTC - in response to Message 51151.

Wow! If the other nine spent 9 days each that's a quarter of a GPU-year that could've been better used.

Ok, I'm convinced to abort this one, but I don't want to toss them out if there's something that Team GDF might learn from them.

Even when you abort, I'm pretty sure the results are still uploaded for their analysis.

Nick Name
Joined: 3 Sep 13
Posts: 53
Credit: 1,533,531,731
RAC: 0
Message 51153 - Posted: 30 Dec 2018 | 20:29:39 UTC - in response to Message 51138.

BOINC will sometimes report run times that aren't correct, although not usually off by days. I would first try to make sure these problem tasks are really running that long. If they are, that's almost certainly a problem with your host and not the app; problems like that usually get reported pretty quickly by multiple users. You may have a problem similar to what's reported here.

https://www.gpugrid.net/forum_thread.php?id=4868

Rig-12: http://www.gpugrid.net/show_host_detail.php?hostid=484061 definitely has a card or cards not working well; you shouldn't have that many errors.
____________
Team USA forum | Team USA page
Join us and #crunchforcures. We are now also folding: join team ID 236370!

jiipee
Joined: 4 Jun 15
Posts: 19
Credit: 8,558,357,584
RAC: 2,430,983
Message 51156 - Posted: 31 Dec 2018 | 7:34:49 UTC - in response to Message 51134.

I would think there should be a built-in rule that says:
IF ElapsedTime > 2(PredictedTime) THEN Abort&Flag ELSE WTF

I second that approach. Right now some extremely long-running (or endlessly looping?) tasks keep timing out and so waste many days of GPU time. For example:

e1s363_8_gen-PABLO_V3_p27_sj403_IDP-1-4-RND1764

timed out after running five days on a GTX 1080 Ti (a single GPU task on that computer, id 274120). It looks like that task was then assigned to others after erroring out or being cancelled by operators.

Aurum
Joined: 12 Jul 17
Posts: 401
Credit: 16,812,520,943
RAC: 2,764,099
Message 51157 - Posted: 31 Dec 2018 | 17:58:47 UTC - in response to Message 51153.

...almost certainly a problem with your host and not the app; problems like that usually get reported pretty quickly by multiple users.
Rig-12: http://www.gpugrid.net/show_host_detail.php?hostid=484061 definitely has a card or cards not working well; you shouldn't have that many errors.
Thanks for alerting me to that, and I'm taking Rig-12 offline. From its stderr.txt file:
# CUDA Synchronisation mode: SPIN
# SWAN Device 2 :
# Name : GeForce GTX 1070
# ECC : Disabled
# Global mem : 8117MB
# Capability : 6.1
# PCI ID : 0000:03:00.0
# Device clock : 1784MHz
# Memory clock : 4004MHz
# Memory width : 256bit
# GPU [GeForce GTX 1070] Platform [Linux] Rev [3212] VERSION [80]
# SWAN Device 2 :
# Name : GeForce GTX 1070
# ECC : Disabled
# Global mem : 8117MB
# Capability : 6.1
# PCI ID : 0000:03:00.0
# Device clock : 1784MHz
# Memory clock : 4004MHz
# Memory width : 256bit
# Simulation unstable. Flag 5 value 11
# Simulation unstable. Flag 6 value 28
# Simulation unstable. Flag 7 value 11
# Simulation unstable. Flag 9 value 18654
# Simulation unstable. Flag 10 value 20896
# The simulation has become unstable. Terminating to avoid lock-up
# The simulation has become unstable. Terminating to avoid lock-up (2)
# Attempting restart (step 5000)

Those 3 Gigabyte 1070s are probably my oldest cards. It may be time to retire them.

Nick, how did you spot this???

{It's so hard for me to use this web site. It took me 15 minutes to get that link to appear and another few minutes to get its Tasks list to appear.}

Nick Name
Joined: 3 Sep 13
Posts: 53
Credit: 1,533,531,731
RAC: 0
Message 51168 - Posted: 1 Jan 2019 | 2:14:29 UTC - in response to Message 51157.

Nick, how did you spot this???

{It's so hard for me to use this web site. It took me 15 minutes to get that link to appear and another few minutes to get its Tasks list to appear.}

I also have problems getting the site to load, but once it does it usually works OK. I guess I had a little better luck than you. 200+ errors on a single machine stood out pretty quickly, since I hadn't experienced or read reports here about a bad batch of tasks.
____________
Team USA forum | Team USA page
Join us and #crunchforcures. We are now also folding: join team ID 236370!
