ADRIA extremely slow - not checkpointing

lohphat

Message 56452 - Posted: 12 Feb 2021, 3:13:51 UTC
Last modified: 12 Feb 2021, 3:18:24 UTC

My first WU completed and validated. Finally.

https://www.gpugrid.net/workunit.php?wuid=27023218

Sent 10 Feb 2021 | 17:42:37 UTC
Received 12 Feb 2021 | 2:30:35 UTC
Credit 435,937.50 (wow!)

GTX 980 Ti

My next WU is a retry of a failed attempt by another system. It's running A LOT faster than my first WU. I'll report back when it completes.

https://www.gpugrid.net/workunit.php?wuid=27025153

Ian&Steve C.
Message 56453 - Posted: 12 Feb 2021, 3:58:33 UTC - in response to Message 56451.  

You can't undervolt in Linux directly, but you can lower the power limit if all you are trying to do is use less power.

check stock power limit with
nvidia-smi
~151W for my 1070
~170W for my 2060

set new power limit with
nvidia-smi --power-limit=***
*** is desired power limit in Watts
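
A minimal Python sketch of the two nvidia-smi steps above. It only builds the command lines, so it can be read and run without a GPU. The function names and the 125 W value are illustrative assumptions, not from the post; `-i`, `--query-gpu`, `--format=csv` and `--power-limit` are regular nvidia-smi options, and changing the limit normally needs root.

```python
import shlex

def query_power_limit_cmd(gpu_index=0):
    """Build the nvidia-smi command that reports current and default power limits."""
    return [
        "nvidia-smi",
        "-i", str(gpu_index),
        "--query-gpu=name,power.limit,power.default_limit",
        "--format=csv",
    ]

def set_power_limit_cmd(watts, gpu_index=0):
    """Build the nvidia-smi command that sets a new power limit in watts."""
    if watts <= 0:
        raise ValueError("power limit must be a positive number of watts")
    return ["nvidia-smi", "-i", str(gpu_index), f"--power-limit={watts}"]

if __name__ == "__main__":
    # Print the commands instead of executing them (125 W is a made-up target).
    print(shlex.join(query_power_limit_cmd(0)))
    print("sudo " + shlex.join(set_power_limit_cmd(125, 0)))
```

To actually apply a limit, the second list could be passed to `subprocess.run` with sufficient privileges. The driver rejects values outside the card's supported range.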

My 2060 is completing these tasks in around 69,500-70,000 seconds, or about 19.4 hours.


EDIT: I was brave and restarted my Win10 rig with one task at 54.667% complete, and it restarted the task at the same percentage. I will check it when it's done to see if there are any errors, but there is definitely a checkpoint being used.


Yeah, I know about the power limiting. It's all you can do in Linux. But if you apply an overclock on top of the power limit, you can claw back some of the lost performance.

You can probably do better on power for that 2060.

I have a system with 2070s that I power limit to 150 W, and it completes tasks in about 60,000-61,000 s.
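
A rough way to compare the two setups is energy per task. The sketch below is only an upper-bound estimate: it assumes each card draws its full power limit for the whole run (170 W stock for the 2060, 150 W for the 2070) and uses the midpoints of the quoted run times. Real draw will differ.

```python
def energy_kwh(power_watts, run_seconds):
    """Energy in kWh for a task drawing power_watts for run_seconds."""
    return power_watts * run_seconds / 3.6e6  # 1 kWh = 3.6e6 joules

# Assumed figures: power limit as the draw, midpoint of each quoted run-time range.
rtx2070 = energy_kwh(150, 60_500)
rtx2060 = energy_kwh(170, 69_750)

print(f"2070 @ 150 W: {rtx2070:.2f} kWh per task")
print(f"2060 @ 170 W: {rtx2060:.2f} kWh per task")
```

Under those assumptions the power-limited 2070 comes out at roughly 2.5 kWh per task against roughly 3.3 kWh for the 2060 at its stock limit.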


lohphat
Message 56454 - Posted: 12 Feb 2021, 6:28:49 UTC - in response to Message 56452.  
Last modified: 12 Feb 2021, 6:29:45 UTC


My next WU is a retry of a failed attempt by another system. It's running A LOT faster than my first WU. I'll report back when it completes.

https://www.gpugrid.net/workunit.php?wuid=27025153


10% complete after 03:22:00.
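
For a sense of what that pace implies, here is a small sketch that extrapolates the total run time linearly from the progress figure. The helper names are made up for illustration, and it assumes progress advances at a constant rate, which BOINC tasks do not guarantee.

```python
def parse_hms(text):
    """Convert an 'HH:MM:SS' elapsed-time string to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def estimate_total_seconds(elapsed_seconds, percent_done):
    """Linear extrapolation of total run time from elapsed time and percent done."""
    if not 0 < percent_done <= 100:
        raise ValueError("percent_done must be in (0, 100]")
    return elapsed_seconds * 100 / percent_done

elapsed = parse_hms("03:22:00")
total = estimate_total_seconds(elapsed, 10)
print(f"estimated total: {total:,.0f} s ({total / 3600:.1f} h)")
```

At 10% after 3 h 22 min, the linear estimate is a little under 34 hours for the whole task.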

Oof.

At least it's running. The prior attempt by another user errored out in a few seconds.

lohphat
Message 56463 - Posted: 12 Feb 2021, 19:51:34 UTC

I was able to exit/restart the client at 54% and it continued as expected. (Win64)

lohphat
Message 56478 - Posted: 13 Feb 2021, 17:17:32 UTC - in response to Message 56454.  

And...scene.

Run time 106,374.11
CPU time 106,292.60
Validate state Valid
Credit 348,750.00

https://www.gpugrid.net/workunit.php?wuid=27025153

bozz4science
Message 56496 - Posted: 14 Feb 2021, 16:45:57 UTC
Last modified: 14 Feb 2021, 17:12:32 UTC

Be happy if they finish at all. Mine (3 at this time) ran between 5,000 and 90,000 sec on a GTX 1660 Super and all errored out. I aborted the last WU in progress now...

Ian&Steve C.
Message 56497 - Posted: 14 Feb 2021, 17:18:25 UTC - in response to Message 56496.  

Be happy if they finish at all. Mine (3 at this time) ran between 5,000 and 90,000 sec on a GTX 1660 Super and all errored out.


My 1660 Super takes about 96,000 seconds on Linux. Maybe 100,000+ on Windows since the Windows app is slower.

Of your 4 tasks on your 2-GPU Windows 10 host:

1 was aborted by the user.
1 failed due to a BOINC restart (or system restart) where it attempted to restart on a different device. It started on device 1 then tried to restart on device 0. This is a common and known situation that causes failures.

2 tasks failed with “particle coordinate is nan” (nan = not a number). This commonly happens from too much overclock or overheating. GPUGRID tasks are quite intense, and these tasks are no exception. An overclock that’s stable on another project or application can be unstable for GPUGRID.
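
To illustrate what that error means, here is a generic Python sketch of a NaN check on particle coordinates. It is not the GPUGRID application's code, and the coordinate list is invented. It only shows that one bad value is enough to poison later arithmetic, which is why a simulation stops once it appears.

```python
import math

def find_nan_particles(coords):
    """Return the indices of particles that have any NaN coordinate."""
    return [i for i, xyz in enumerate(coords) if any(math.isnan(c) for c in xyz)]

# Hypothetical coordinates: particle 1 has a corrupted x value.
coords = [(0.1, 0.2, 0.3), (float("nan"), 1.0, 2.0), (4.0, 5.0, 6.0)]

bad = find_nan_particles(coords)
print("particles with NaN:", bad)

# NaN propagates: any sum that touches the bad value is NaN as well.
total_x = sum(x for x, _, _ in coords)
print("sum of x is NaN:", math.isnan(total_x))
```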

Try to remove any overclock and make sure the GPU has sufficient airflow. Try to avoid restarting the system.

bozz4science
Message 56498 - Posted: 14 Feb 2021, 17:50:33 UTC

Thanks, Ian, for the diagnostics! I just reverted to the stock settings. I hope I'll catch a resend and can try again. It's still weird, as all other apps work just fine with the mild OC setting. I've never had a single error thrown so far, except on MLC. But that's mostly due to the inherent nature of those tasks, where occasionally a WU results in a NaN error.

About that WU restarted on the other device: I noticed that too. Because of the dry spell here, I forgot to set up the <exclude_gpu> policy in my cc_config file on this new host, and the slow 750 Ti just happened to pick it up. That's solved as well for now.
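
For reference, a small Python sketch that generates an `<exclude_gpu>` entry for cc_config.xml. The device number 0 for the 750 Ti is an assumption (check the BOINC event log for the real numbering on your host), and the project URL must match the one the client shows. The element names (`cc_config`, `options`, `exclude_gpu`, `url`, `device_num`) follow the BOINC client configuration format.

```python
import xml.etree.ElementTree as ET

def build_cc_config(project_url, device_num):
    """Return cc_config.xml text that excludes one GPU for one project."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    exclude = ET.SubElement(options, "exclude_gpu")
    ET.SubElement(exclude, "url").text = project_url
    ET.SubElement(exclude, "device_num").text = str(device_num)
    return ET.tostring(root, encoding="unicode")

# Assumption: device 0 is the slow card that should not run GPUGRID tasks.
xml_text = build_cc_config("https://www.gpugrid.net/", 0)
print(xml_text)
```

This builds a fresh file, so any options already in an existing cc_config.xml would need to be merged by hand. The client picks up changes after a restart or after re-reading the config files.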

I'll see how the 1660 Super handles tasks in the future!

ServicEnginIC
Message 56771 - Posted: 15 Mar 2021, 19:39:07 UTC

A small number of a new kind of ADRIA task seems to be in the field.
My host #557889 received one of them, named "e1s45_homeodomain_folded_100ns_44-ADRIA_HomeoFolded100ns-0-1-RND7761_0"
Progress for this task is about 66% after 5.5 hours on a GTX 1650 GPU.
So the initial estimate suggests that these tasks are much shorter than the previous ones.
I'm not testing for the moment whether they checkpoint correctly or not...

Ian&Steve C.
Message 56772 - Posted: 15 Mar 2021, 22:42:45 UTC - in response to Message 56771.  

I also received a couple of these tasks, and concur that they complete in less time. My 2080 Ti completed them in about 9,000 s vs. ~36,000 s on the longer-running tasks.

Payout is 76,500 credits.

ServicEnginIC
Message 56773 - Posted: 16 Mar 2021, 6:21:13 UTC - in response to Message 56772.  

For my GTX 1650 and the task mentioned above, the run time was 30,471 seconds versus ~170,000 seconds for the previous ADRIA tasks on this device.
The same amount of 76,500 credits was awarded, since the result was returned in time for the full bonus.
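
A quick sketch of the comparison using only the figures quoted in this thread. The 2080 Ti times are the approximate values given above, so the results are ballpark numbers; the helper names are illustrative.

```python
def speedup(old_seconds, new_seconds):
    """How many times shorter the new tasks are than the old ones."""
    return old_seconds / new_seconds

def credits_per_hour(credits, run_seconds):
    """Credit earned per hour of run time."""
    return credits * 3600 / run_seconds

print(f"GTX 1650:    {speedup(170_000, 30_471):.1f}x shorter")
print(f"RTX 2080 Ti: {speedup(36_000, 9_000):.1f}x shorter")
print(f"GTX 1650:    {credits_per_hour(76_500, 30_471):,.0f} credits/h")
print(f"RTX 2080 Ti: {credits_per_hour(76_500, 9_000):,.0f} credits/h")
```

On these numbers the new tasks are about 5.6 times shorter on the GTX 1650 and about 4 times shorter on the 2080 Ti.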

e1s45_homeodomain_folded_100ns_44-ADRIA_HomeoFolded100ns-0-1-RND7761_0
©2025 Universitat Pompeu Fabra