ADRIA extremely slow - not checkpointing
| Author | Message |
|---|---|
|
Joined: 24 Oct 11 Posts: 4 Credit: 433,680,314 RAC: 0
|
Something is wrong with ADRIA. It's not checkpointing. After 7 hours it was 99.5% done, but it had slowed to a crawl. I had to reboot, and it restarted from 0%. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
They run for a long time. My RTX 2070 took about 17 hours to complete one. Just let it run.
|
|
Joined: 21 Jan 10 Posts: 46 Credit: 1,388,234,528 RAC: 0
|
I just got my first WU and the clock is climbing. It's taken 1:40:00 for 6%, and the estimated time is now at 26h and climbing. GPU: GTX 980 Ti. |
|
Joined: 21 Jan 10 Posts: 46 Credit: 1,388,234,528 RAC: 0
|
Update: It reached 10% at 2h50m elapsed. |
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
I have one running on my power-limited RTX 2080 Ti. It's at 3.333% after 26m 30s, so it will take around 11h 45m. Perhaps it will take a bit less than that, as the progress indicator restarted after a few minutes. TBH this is the "usual" length of a long workunit (maybe 50% longer than usual). EDIT: I have another one running on another RTX 2080 Ti (not power-limited, and on a Linux-based host). The progress indicator falls back to 0.666% after 4 minutes, then goes up in 0.666% increments roughly every 4 minutes, so it will take roughly 10 hours on that host. |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
"The progress indicator falls back to 0.666% after 4 minutes, then goes up in 0.666% increments roughly every 4 minutes, so it will take roughly 10 hours on that host." For the first minute, progress will be shown as zero. The next three minutes (variable according to the speed of your device) will be pseudo-progress invented by BOINC: no significance should be attached to the value shown. The 0.666% value and onwards will be genuine values, and can be used for estimating the final run time, as you have done. They show the points where the app writes a checkpoint file: I'm seeing "restart.chk" files being written in the slot directories. But the app doesn't appear to be able to detect or use these files to resume computation after a restart. |
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
Oh, I thought that the GPUGrid app does something strange. This must be a new feature; v7.9 showed 0.000%. "The progress indicator falls back to 0.666% after 4 minutes, then goes up in 0.666% increments roughly every 4 minutes, so it will take roughly 10 hours on that host." "The 0.666% value and onwards will be genuine values, and can be used for estimating the final run time - as you have done." New estimation based on 20% done in 1h 56m: 9h 40m. "They show the points where the app writes a checkpoint file - I'm seeing 'restart.chk' files being written in the slot directories." This is simply unacceptable behavior for such long workunits. |
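
The run-time estimates in this thread are plain linear extrapolation from a genuine progress value. A minimal sketch in Python, assuming progress advances at a constant rate; the function names are illustrative and are not part of BOINC or the GPUGRID app:

```python
from datetime import timedelta


def estimate_total_runtime(elapsed: timedelta, percent_done: float) -> timedelta:
    """Linearly extrapolate the total run time from elapsed time and progress.

    Assumes progress grows at a constant rate, which only holds once the
    progress value is genuine (not BOINC's early pseudo-progress).
    """
    if not 0 < percent_done <= 100:
        raise ValueError("percent_done must be in the range (0, 100]")
    return timedelta(seconds=elapsed.total_seconds() * 100.0 / percent_done)


def format_hm(duration: timedelta) -> str:
    """Format a duration as hours and minutes, rounded to the nearest minute."""
    total_minutes = round(duration.total_seconds() / 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"


# Figures quoted in the thread: 20% done after 1h 56m.
total = estimate_total_runtime(timedelta(hours=1, minutes=56), 20.0)
print(format_hm(total))  # 9h 40m
```

The same calculation applied to 10% at 2h 50m gives 28h 20m, which is in line with the climbing estimate reported for the GTX 980 Ti.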
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891
|
"The progress indicator falls back to 0.666% after 4 minutes, then goes up in 0.666% increments roughly every 4 minutes, so it will take roughly 10 hours on that host." What about the wrapper_checkpoint.txt file in the slots? Is that the file used to checkpoint restart with wrapper apps? Or is it always the restart.chk or restart.chk.bkp files that are used to restart a task? |
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
"What about the wrapper_checkpoint.txt file in the slots? Is that the file used to checkpoint restart with wrapper apps?" I don't think this file is sufficient for that, as it is only 29 bytes long: 0 19978.890625 19938.000000 "Or is it always the restart.chk or restart.chk.bkp files that are used to restart a task?" I think these ones contain the real information (~9 MB). |
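
To make the size argument concrete, here is a minimal Python sketch that parses the one-line content quoted above. The thread does not say what the three numbers mean; the field names below are assumptions (a completed-task counter followed by CPU time and elapsed time in seconds), and the class and function names are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class WrapperCheckpoint:
    # All three field meanings are assumptions, not confirmed in the thread.
    tasks_completed: int
    cpu_seconds: float
    elapsed_seconds: float


def parse_wrapper_checkpoint(text: str) -> WrapperCheckpoint:
    """Parse a single line of three whitespace-separated numbers."""
    fields = text.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}")
    return WrapperCheckpoint(int(fields[0]), float(fields[1]), float(fields[2]))


# Content of wrapper_checkpoint.txt as quoted in the post.
checkpoint = parse_wrapper_checkpoint("0 19978.890625 19938.000000")
print(checkpoint)
```

Whatever the fields mean exactly, three numbers cannot hold simulation state, which fits the view that the ~9 MB restart.chk files carry the real checkpoint data.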
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891
|
This task must have used the checkpoint files. It also was able to switch between devices without erroring out. https://www.gpugrid.net/result.php?resultid=32507970 |
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
"This task must have used the checkpoint files. It also was able to switch between devices without erroring out." I confirm that. I restarted my Linux and Windows 10 hosts, and the GPUgrid app resumed just fine from the last checkpoint. These are single-GPU hosts though. |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
And here. One on a 1050 Ti was swapped out after 24 hours (that was enough in the good old days ...), but didn't lose the 35% done on restart. I do wish the devs could notify us when they've fixed reported problems. Toni? |
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891
|
The one task I referenced that was able to switch and restart on a different device may have been a fluke or something. I already have another task that errored out after restarting on a different device. https://www.gpugrid.net/result.php?resultid=32509275 |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
"The one task I referenced that was able to switch and restart on a different device may have been a fluke or something." From what I recall, this issue of "starting on a different device" was quite intermittent and inconsistent. I always thought that if it restarted on the same device type (RTX 2070 -> RTX 2070, but different IDs) it would be fine, and that it only caused an issue if the hardware was significantly different (RTX 2070 -> GTX 1060, for example). But that doesn't seem to be the case, as I've seen successful pickups and failed pickups in most situations with no clear reason: sometimes it restarted on a different device ID and was fine; sometimes it restarted on a different device ID and threw the error; sometimes it got lucky and restarted on the same device ID and was fine; sometimes it got lucky and restarted on the same device ID and still threw the error. Also, from what I recall, I very rarely got failures from the MDAD tasks, but a much higher failure rate from restarts with PABLO tasks. I just try to never interrupt them anymore and let them run. I'll only ever interrupt running GPUGRID tasks now if it's due to something like a power outage.
|
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891
|
Up until this one outlier task, I have always thrown an error when restarting on a different device type. Two of my three hosts have mixed GPU types from different generations. This daily driver has the same GPU type for all cards, RTX 2080 Hybrids. I have never had a restart failure on any task on this host. Countless tasks have restarted on a different device ID and completed properly. |
|
Joined: 21 Jan 10 Posts: 46 Credit: 1,388,234,528 RAC: 0
|
Datapoint: My progress has been in 0.333% increments. (GTX 980 Ti) |
|
Joined: 23 Dec 09 Posts: 189 Credit: 4,798,881,008 RAC: 0
|
No luck so far with these new ADRIA WUs! This computer with an RTX 2070 http://www.gpugrid.net/result.php?resultid=32508594 produced an ERROR: Detected memory leaks! Does anybody have an idea what the cause might be? I moved this GPU from my Linux computer (where I never had a problem with the card) to my child's gaming computer. Other projects run fine on it, my child plays “Fortnite” on it once or twice a day, and we switch it off at night. The Linux computer http://www.gpugrid.net/results.php?hostid=523675 received the GTX 1660 Ti from the gaming computer and has produced two ERRORs so far. The cards are factory overclocked. Just a side question: how do you undervolt a GPU? |
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891
|
AFAIK, there is no way to undervolt an Nvidia card. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
"AFAIK, there is no way to undervolt an Nvidia card." *in Linux. You can in Windows with programs like MSI Afterburner.
|
|
Joined: 24 Nov 20 Posts: 1 Credit: 895,706,324 RAC: 0
|
You can't undervolt in Linux directly, but you can lower the power limit if all you are trying to do is use less power. Check the stock power limit with nvidia-smi (~151 W for my 1070, ~170 W for my 2060). Set a new power limit with nvidia-smi --power-limit=***, where *** is the desired power limit in watts. My 2060 is completing these tasks in around 69,500-70,000 seconds, or 19.4 hours. EDIT: I was brave and restarted my Win10 rig with one task at 54.667% complete, and it restarted the task at the same percentage. I will check it when it's done to see if there are any errors, but there is definitely a checkpoint being used. |
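
The two nvidia-smi steps described above can be sketched as a small Python helper that only builds the command lines and never runs them. The helper names, the 150 W value, and the GPU index are hypothetical. Changing the limit normally needs root or administrator rights, and the driver only accepts values within the card's supported range:

```python
import shlex
from typing import List, Optional


def power_limit_query_cmd() -> List[str]:
    """Command that lists the current and default power limit for each GPU."""
    return [
        "nvidia-smi",
        "--query-gpu=index,name,power.limit,power.default_limit",
        "--format=csv",
    ]


def power_limit_set_cmd(watts: int, gpu_index: Optional[int] = None) -> List[str]:
    """Command that sets a new power limit in watts, optionally for one GPU."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    cmd = ["nvidia-smi"]
    if gpu_index is not None:
        cmd += ["-i", str(gpu_index)]  # restrict the change to a single GPU
    cmd.append(f"--power-limit={watts}")
    return cmd


# Hypothetical example: cap GPU 0 at 150 W.
print(shlex.join(power_limit_query_cmd()))
print(shlex.join(power_limit_set_cmd(150, gpu_index=0)))
```

The printed lines can be pasted into a shell, or the lists passed to `subprocess.run` on a host that has the NVIDIA driver installed.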
©2025 Universitat Pompeu Fabra