ADRIA extremely slow - not checkpointing

Message boards : Number crunching : ADRIA extremely slow - not checkpointing

GDB
Joined: 24 Oct 11
Posts: 4
Credit: 433,680,314
RAC: 0
Message 56388 - Posted: 10 Feb 2021, 13:53:32 UTC

Something wrong with ADRIA. It's not checkpointing. After 7 hours - it's 99.5% done, but slowed to a crawl. I had to reboot, and it restarted from 0%.
ID: 56388 · Rating: 0

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Message 56389 - Posted: 10 Feb 2021, 14:34:59 UTC - in response to Message 56388.  

they run for a long time. my RTX 2070 took like 17hrs to complete. just let it go.
ID: 56389 · Rating: 0

lohphat
Joined: 21 Jan 10
Posts: 46
Credit: 1,388,234,528
RAC: 0
Message 56396 - Posted: 10 Feb 2021, 22:59:36 UTC - in response to Message 56389.  
Last modified: 10 Feb 2021, 23:08:08 UTC

I just got my first WU and the clock is climbing. It's taken 1:40:00 for 6%, the estimated time is now at 26h and climbing.

GTX 980 Ti
ID: 56396 · Rating: 0

lohphat
Joined: 21 Jan 10
Posts: 46
Credit: 1,388,234,528
RAC: 0
Message 56397 - Posted: 11 Feb 2021, 0:09:36 UTC - in response to Message 56396.  

Update: It reached 10% at 2h50m elapsed.
ID: 56397 · Rating: 0

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Message 56409 - Posted: 11 Feb 2021, 11:33:54 UTC
Last modified: 11 Feb 2021, 11:55:17 UTC

I have one running on my power limited RTX 2080 Ti.
It's at 3.333% after 26m 30s, so it will take around 11h 45m.
Perhaps it will take a bit less than that as the progress indicator restarted after a few minutes.
TBH this is the "usual" length of a long workunit (maybe 50% longer than usual).
EDIT:
I have another one running on another RTX 2080 Ti (it's not power limited, and it's a Linux-based host).
The progress indicator falls back to 0.666% after 4 minutes, then goes up in 0.666% increments roughly every 4 minutes, so it will take roughly 10 hours on that host.
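
The estimates in this post are plain linear extrapolation from elapsed time and reported progress. A minimal sketch of that arithmetic follows; the function names are made up for illustration, and it assumes progress advances at a steady rate, which the thread suggests only holds once the real checkpoint increments start.

```python
def estimate_total_seconds(elapsed_seconds: float, progress_percent: float) -> float:
    """Linear extrapolation: total run time = elapsed time / fraction done."""
    if not 0 < progress_percent <= 100:
        raise ValueError("progress_percent must be in the range (0, 100]")
    return elapsed_seconds / (progress_percent / 100.0)


def format_hm(seconds: float) -> str:
    """Format a duration in seconds as 'Hh MMm', rounded to the nearest minute."""
    minutes = round(seconds / 60)
    return f"{minutes // 60}h {minutes % 60:02d}m"


# One 0.666...% step every 4 minutes, as reported for the Linux host.
step_estimate = estimate_total_seconds(4 * 60, 2 / 3)
print(format_hm(step_estimate))
```

With one 2/3% step every 4 minutes this comes out at 10 hours, the same figure as above.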
ID: 56409 · Rating: 0

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Message 56412 - Posted: 11 Feb 2021, 12:11:26 UTC - in response to Message 56409.  

The progress indicator falls back to 0.666% after 4 minutes, then goes up in 0.666% increments roughly every 4 minutes, so it will take roughly 10 hours on that host.

For the first minute, progress will be shown as zero. The next three minutes (variable according to the speed of your device) will be pseudo-progress invented by BOINC: no significance should be attached to the value shown.

The 0.666% value and onwards will be genuine values, and can be used for estimating the final run time - as you have done. They show the points where the app writes a checkpoint file - I'm seeing "restart.chk" files being written in the slot directories.

But the app doesn't appear to be able to detect or use these files to resume computation after a restart.
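
To check whether checkpoint files are actually being written and refreshed, something like the following sketch could list them per slot. The file names are the ones mentioned in this thread; the slots path is an assumption that depends on where your BOINC data directory lives, so it is passed in as an argument.

```python
import time
from pathlib import Path

# File names mentioned in this thread. Which of them the app really reads
# on restart is not confirmed here.
CHECKPOINT_NAMES = ("restart.chk", "restart.chk.bkp", "wrapper_checkpoint.txt")


def find_checkpoints(slots_dir):
    """Map slot name -> {file name: (size in bytes, age in seconds)}."""
    now = time.time()
    found = {}
    for slot in sorted(Path(slots_dir).iterdir()):
        if not slot.is_dir():
            continue
        for name in CHECKPOINT_NAMES:
            candidate = slot / name
            if candidate.is_file():
                info = candidate.stat()
                found.setdefault(slot.name, {})[name] = (
                    info.st_size,
                    now - info.st_mtime,
                )
    return found
```

Usage would be along the lines of find_checkpoints("/var/lib/boinc-client/slots"), where that path is only a guess at a typical Linux package install. A restart.chk whose age keeps resetting every few minutes would suggest checkpoints are being written; it says nothing about whether they are read on restart.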
ID: 56412 · Rating: 0

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Message 56418 - Posted: 11 Feb 2021, 14:06:36 UTC - in response to Message 56412.  

The progress indicator falls back to 0.666% after 4 minutes, then goes up in 0.666% increments roughly every 4 minutes, so it will take roughly 10 hours on that host.

For the first minute, progress will be shown as zero. The next three minutes (variable according to the speed of your device) will be pseudo-progress invented by BOINC: no significance should be attached to the value shown.
Oh, I thought that the GPUGrid app does something strange. This must be a new feature, the v7.9 showed 0.000%.

The 0.666% value and onwards will be genuine values, and can be used for estimating the final run time - as you have done.
New estimation based on 20% done in 1h 56m: 9h 40m

They show the points where the app writes a checkpoint file - I'm seeing "restart.chk" files being written in the slot directories.

But the app doesn't appear to be able to detect or use these files to resume computation after a restart.
This is simply unacceptable behavior for such long workunits.
ID: 56418 · Rating: 0

Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 891
Message 56420 - Posted: 11 Feb 2021, 16:27:58 UTC - in response to Message 56412.  

The progress indicator falls back to 0.666% after 4 minutes, then goes up in 0.666% increments roughly every 4 minutes, so it will take roughly 10 hours on that host.

For the first minute, progress will be shown as zero. The next three minutes (variable according to the speed of your device) will be pseudo-progress invented by BOINC: no significance should be attached to the value shown.

The 0.666% value and onwards will be genuine values, and can be used for estimating the final run time - as you have done. They show the points where the app writes a checkpoint file - I'm seeing "restart.chk" files being written in the slot directories.

But the app doesn't appear to be able to detect or use these files to resume computation after a restart.

What about the wrapper_checkpoint.txt file in the slots? Is that the file used for checkpoint restarts with wrapper apps?

Or is it always the restart.chk or restart.chk.bkp files that are used to restart a task?
ID: 56420 · Rating: 0

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Message 56421 - Posted: 11 Feb 2021, 16:43:15 UTC - in response to Message 56420.  
Last modified: 11 Feb 2021, 16:44:10 UTC

What about the wrapper_checkpoint.txt file in the slots? Is that the file used for checkpoint restarts with wrapper apps?
I don't think this file is sufficient for that, as it is 29 bytes long:
0 19978.890625 19938.000000


Or is it always the restart.chk or restart.chk.bkp files that are used to restart a task?
I think these ones contain the real information. (~9 MB)
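
For anyone who wants to peek at wrapper_checkpoint.txt, here is a small sketch that parses the three numbers shown above. The field meanings (completed-task index, CPU seconds, elapsed seconds) are an assumption based on how the BOINC wrapper appears to write this file, not something confirmed in this thread, and the type and function names are made up for illustration.

```python
from typing import NamedTuple


class WrapperCheckpoint(NamedTuple):
    # Field meanings are assumed, not confirmed in the thread.
    tasks_completed: int
    cpu_seconds: float
    elapsed_seconds: float


def parse_wrapper_checkpoint(text: str) -> WrapperCheckpoint:
    """Parse a one-line checkpoint made of three whitespace-separated numbers."""
    fields = text.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}")
    return WrapperCheckpoint(int(fields[0]), float(fields[1]), float(fields[2]))


checkpoint = parse_wrapper_checkpoint("0 19978.890625 19938.000000\n")
print(checkpoint)
```

If that reading is right, the file only holds bookkeeping counters, which fits the point that the simulation state itself has to live in the much larger restart.chk files.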
ID: 56421 · Rating: 0

Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 891
Message 56424 - Posted: 11 Feb 2021, 18:11:45 UTC

This task must have used the checkpoint files. It also was able to switch between devices without erroring out.
https://www.gpugrid.net/result.php?resultid=32507970
ID: 56424 · Rating: 0

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Message 56427 - Posted: 11 Feb 2021, 19:24:19 UTC - in response to Message 56424.  

This task must have used the checkpoint files. It also was able to switch between devices without erroring out.
https://www.gpugrid.net/result.php?resultid=32507970

I confirm that. I restarted my Linux and Windows 10 hosts, and the GPUgrid app resumed just fine from the last checkpoint. These are single GPU hosts though.
ID: 56427 · Rating: 0

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Message 56428 - Posted: 11 Feb 2021, 19:38:00 UTC

And here. One on a 1050 Ti was swapped out after 24 hours (that was enough in the good old days ...), but didn't lose the 35% done on restart.

I do wish the devs could notify us when they've fixed reported problems. Toni?
ID: 56428 · Rating: 0

Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 891
Message 56429 - Posted: 11 Feb 2021, 19:41:46 UTC - in response to Message 56428.  

The one task I referenced that was able to switch and restart on a different device may have been a fluke or something.

I have another task that errored out for restarting on a different device already.
https://www.gpugrid.net/result.php?resultid=32509275
ID: 56429 · Rating: 0

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Message 56430 - Posted: 11 Feb 2021, 20:00:37 UTC - in response to Message 56429.  

The one task I referenced that was able to switch and restart on a different device may have been a fluke or something.

I have another task that errored out for restarting on a different device already.
https://www.gpugrid.net/result.php?resultid=32509275


from what I recall, this issue of "starting on a different device" was quite intermittent and inconsistent. I always thought that if it restarted on the same device type (RTX 2070 -> RTX 2070, but different IDs) it would be fine, and that it only caused an issue if the hardware was significantly different (RTX 2070 -> GTX 1060, for example). but that doesn't seem to be the case, as I've seen successful pickups and failed pickups in most situations with no clear reason:

sometimes it restarted on a different device id and was fine
sometimes it restarted on a different device id and threw the error
sometimes it got lucky and restarted on the same device id and was fine
sometimes it got lucky and restarted on the same device id and still threw the error

also from what I recall, I very rarely ever got failures from the MDAD tasks, but a much higher failure rate from restarts with PABLO tasks. I just try to never interrupt them anymore and just let it run. I'll only ever interrupt running GPUGRID tasks now if it's due to something like a power outage.
ID: 56430 · Rating: 0

Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 891
Message 56431 - Posted: 11 Feb 2021, 20:10:55 UTC
Last modified: 11 Feb 2021, 20:11:11 UTC

Up until this one outlier task, I had always gotten an error when restarting on a different device type. Two of my three hosts have mixed GPU types from different generations.

This daily driver has the same GPU type for all cards: RTX 2080 Hybrids.

I have never had a restart failure on any task on this host. Countless tasks have restarted on a different device ID and completed properly.
ID: 56431 · Rating: 0

lohphat
Joined: 21 Jan 10
Posts: 46
Credit: 1,388,234,528
RAC: 0
Message 56434 - Posted: 11 Feb 2021, 20:32:19 UTC

Datapoint: My progress has been in 0.333% increments. (GTX 980 Ti)
ID: 56434 · Rating: 0

klepel
Joined: 23 Dec 09
Posts: 189
Credit: 4,798,881,008
RAC: 0
Message 56446 - Posted: 12 Feb 2021, 0:46:06 UTC

No luck so far with these new ADRIA WUs!
This computer with an RTX 2070 http://www.gpugrid.net/result.php?resultid=32508594
produced an ERROR:
Detected memory leaks!

Does anybody have an idea what the cause might be? I moved this GPU from my Linux computer (where I never had a problem with the card) to my child's gaming computer. Other projects run fine on it, my child plays “Fortnite” on it once or twice a day, and we switch it off at night.
The Linux computer http://www.gpugrid.net/results.php?hostid=523675 received the GTX 1660 Ti from the gaming computer and has produced two ERRORs so far.
The cards are factory overclocked. Just a side question: how do you undervolt a GPU?
ID: 56446 · Rating: 0

Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 891
Message 56447 - Posted: 12 Feb 2021, 1:16:03 UTC - in response to Message 56446.  

AFAIK, there is no way to undervolt an Nvidia card.
ID: 56447 · Rating: 0

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Message 56448 - Posted: 12 Feb 2021, 1:18:59 UTC - in response to Message 56447.  

AFAIK, there is no way to undervolt an Nvidia card.


*in Linux

you can in Windows with programs like MSI Afterburner.

ID: 56448 · Rating: 0

OCNfranz
Joined: 24 Nov 20
Posts: 1
Credit: 895,706,324
RAC: 0
Message 56451 - Posted: 12 Feb 2021, 2:38:20 UTC
Last modified: 12 Feb 2021, 2:57:54 UTC

You can't undervolt in Linux directly, but you can lower the power limit if all you are trying to do is use less power.

check stock power limit with
nvidia-smi
~151W for my 1070
~170W for my 2060

set new power limit with
nvidia-smi --power-limit=***
*** is desired power limit in Watts
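
If you want to script this for several cards, a sketch like the one below builds the command lines without running them. The helper names are made up for illustration; the `-i` option (select a GPU by index) and the `--query-gpu` power fields are standard nvidia-smi options, but check `nvidia-smi --help` on your driver version. Setting the limit typically needs root, and it usually has to be reapplied after a reboot.

```python
def power_limit_command(watts, gpu_index=None):
    """Build the nvidia-smi argument list that sets a power limit in watts."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    cmd = ["nvidia-smi"]
    if gpu_index is not None:
        # Restrict the change to one GPU; without -i it applies to all GPUs.
        cmd += ["-i", str(gpu_index)]
    cmd.append(f"--power-limit={watts}")
    return cmd


def power_query_command():
    """Build the nvidia-smi argument list that reports current and default limits."""
    return [
        "nvidia-smi",
        "--query-gpu=index,name,power.limit,power.default_limit",
        "--format=csv",
    ]


# Print the commands instead of running them; 120 W on GPU 0 is only an example value.
print(" ".join(power_query_command()))
print(" ".join(power_limit_command(120, gpu_index=0)))
```

To actually apply a limit you could hand the list to subprocess.run(cmd, check=True) from a root shell.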

My 2060 is completing these tasks in around 69,500-70,000 seconds, or about 19.4 hours.


EDIT: I was brave and restarted my Win10 rig with one task at 54.667% complete, and it restarted the task at the same percentage. I will check it when it's done to see if there are any errors, but there is definitely a checkpoint being used.
ID: 56451 · Rating: 0

©2025 Universitat Pompeu Fabra