Message boards : Number crunching : GPUGrid problems, nothing has changed

Profile Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 51606 - Posted: 7 Mar 2019 | 17:10:39 UTC

This is the 3rd time that I've gone in heavily on GPUGrid over the last 10-11 years. Twice I've gotten frustrated with the problems and cut way back. I was hoping that some of the issues would have been fixed. There's been an ongoing problem of stalling uploads (not to mention downloads) for many years, and it's still not fixed. In addition, WUs that get interrupted often fail, even with write caching disabled on the drives.

Case in point: last night we had a 3-hour power outage. When I brought the machines back up, 18 out of 25 GPUGrid WUs failed. There was not even one failure for any WU from any other project. These failures also cause another problem: since 18 new WUs start at the same time, they also finish at about the same time, and so many huge GPUGrid WUs uploading at once saturates my bandwidth for many hours. Yes, I live in the US, so my DSL connection is not fast even though it was upgraded a few months ago (only 1 provider here; how do you spell monopoly?). Unbridled capitalism is a bad idea for 99.9% of the people. Anyway, the combination of poor broadband infrastructure and these long-standing GPUGrid problems sadly pushes me to cut back on this otherwise fine project. It seems to me that some of this shouldn't be that difficult to fix, but apparently the necessary skills aren't present. BTW, these "upload storms" have been happening regularly. For someone with a faster connection and/or fewer GPUs it may not seem like a problem, but it's a problem here, and I know of no way to solve it on my end. Thanks for listening to my frustration.

PappaLitto
Joined: 21 Mar 16
Posts: 511
Credit: 4,672,242,755
RAC: 0
Level
Arg
Message 51607 - Posted: 7 Mar 2019 | 18:50:11 UTC - in response to Message 51606.

Zoltan had some great advice for me a bit ago. I don't think I can fully remember every step, but it completely fixed my corrupted WUs after a power outage issue. It had to do with the device manager as far as I can recall. Maybe Zoltan can remember?

Profile Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 51608 - Posted: 7 Mar 2019 | 19:58:46 UTC - in response to Message 51607.

Zoltan had some great advice for me a bit ago. I don't think I can fully remember every step, but it completely fixed my corrupted WUs after a power outage issue. It had to do with the device manager as far as I can recall. Maybe Zoltan can remember?

That would be appreciated. Thanks. Another aspect of the upload congestion is that some uploads can take upwards of 10 hours when a dozen or more are trying at once. Then they start missing the 24-hour cutoff, which is also irritating.

AuxRx
Joined: 3 Jul 18
Posts: 22
Credit: 2,758,801
RAC: 0
Level
Ala
Message 51609 - Posted: 7 Mar 2019 | 20:49:22 UTC - in response to Message 51606.

I know the frustration, but ironically GPUGRID is still the better project for me, if only by a small margin.

I wouldn't even know how the team at GPUGRID could fix the issues you're describing. Aren't those BOINC-related issues? I know I have seen similar issues discussed on other projects. The solution was to run a startup script that booted the machines or restarted the clients with a staggered delay. Alternatively, you could consider limiting the number of connections for BOINC, which would be slower (unnecessarily slow at times) but more evenly distributed.
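
If I remember the BOINC client configuration options correctly, there is also a <start_delay> setting in cc_config.xml that delays running apps for a given number of seconds after the client starts. Giving each machine a different value is one possible way to stagger things after a power outage; this is only a sketch, and the 120 below is an arbitrary example:

<cc_config>
  <options>
    <start_delay>120</start_delay>
  </options>
</cc_config>

Each host would need its own cc_config.xml, and the client has to be restarted (or told to re-read its config files) for the change to take effect.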

Keith Myers
Joined: 13 Dec 17
Posts: 1352
Credit: 7,774,683,569
RAC: 10,103,437
Level
Tyr
Message 51610 - Posted: 7 Mar 2019 | 21:03:14 UTC - in response to Message 51608.

Zoltan had some great advice for me a bit ago. I don't think I can fully remember every step, but it completely fixed my corrupted WUs after a power outage issue. It had to do with the device manager as far as I can recall. Maybe Zoltan can remember?

That would be appreciated. Thanks. Another aspect of the upload congestion is that some uploads can take upwards of 10 hours when a dozen or more are trying at once. Then they start missing the 24-hour cutoff, which is also irritating.

If you don't have a big enough upload pipe for reporting multiple tasks, you can restrict the number of uploads in cc_config.xml

<max_file_xfers_per_project>1</max_file_xfers_per_project>

That way a single finished task will get all of the capacity of your upload pipe to itself and transfer faster.
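
In case it helps, this is a minimal sketch of the whole file, since that option has to sit inside the standard wrapper tags (cc_config.xml lives in the BOINC data directory):

<cc_config>
  <options>
    <max_file_xfers_per_project>1</max_file_xfers_per_project>
  </options>
</cc_config>

After saving it, restart the client or tell it to re-read its config files so the change takes effect.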

Profile Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 51611 - Posted: 8 Mar 2019 | 1:45:15 UTC - in response to Message 51610.

If you don't have a big enough upload pipe for reporting multiple tasks, you can restrict the number of uploads in cc_config.xml

<max_file_xfers_per_project>1</max_file_xfers_per_project>

That way a single finished task will get all of the capacity of your upload pipe to itself and transfer faster.

Thanks, I've been meaning to try this. The problem then becomes that the CPU WUs create a huge backlog while the huge GPUGrid upload stumbles along. The Ryzen 7 machines do a lot of CPU work pretty quickly.

No wait, that's an option I didn't know about (per project). I will definitely try it. Thanks again!

mmonnin
Joined: 2 Jul 16
Posts: 337
Credit: 7,741,017,411
RAC: 10,101,593
Level
Tyr
Message 51613 - Posted: 8 Mar 2019 | 2:00:51 UTC
Last modified: 8 Mar 2019 | 2:01:08 UTC

There is an option for the entire client and one per project.
https://boinc.berkeley.edu/wiki/Client_configuration
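
As a quick sketch of the difference (the values are only examples): <max_file_xfers> limits simultaneous transfers for the whole client, while <max_file_xfers_per_project> limits them per project, and both go in the same options block:

<cc_config>
  <options>
    <max_file_xfers>2</max_file_xfers>
    <max_file_xfers_per_project>1</max_file_xfers_per_project>
  </options>
</cc_config>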

Profile Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 51614 - Posted: 8 Mar 2019 | 3:56:05 UTC - in response to Message 51610.

<max_file_xfers_per_project>1</max_file_xfers_per_project>

That way a single finished task will get all of the capacity of your upload pipe to itself and transfer faster.

Seems to be helping, there's not as much stalling. Will continue to monitor.

PappaLitto
Joined: 21 Mar 16
Posts: 511
Credit: 4,672,242,755
RAC: 0
Level
Arg
Message 51634 - Posted: 15 Mar 2019 | 17:14:59 UTC - in response to Message 51607.
Last modified: 15 Mar 2019 | 17:15:56 UTC

Zoltan had some great advice for me a bit ago. I don't think I can fully remember every step, but it completely fixed my corrupted WUs after a power outage issue. It had to do with the device manager as far as I can recall. Maybe Zoltan can remember?

I recall what Zoltan once told me: go into Device Manager > Disk drives > the drive BOINC is on > Policies, uncheck "Enable write caching on this device", reboot, and you should be all set.

Erich56
Joined: 1 Jan 15
Posts: 1132
Credit: 10,521,847,676
RAC: 25,876,700
Level
Trp
Message 51635 - Posted: 15 Mar 2019 | 17:30:56 UTC - in response to Message 51634.

I recall what Zoltan once told me: go into Device Manager > Disk drives > the drive BOINC is on > Policies, uncheck "Enable write caching on this device", reboot, and you should be all set.

Yes, this was/is exactly it.

Profile Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 51640 - Posted: 18 Mar 2019 | 15:35:24 UTC - in response to Message 51634.

I recall what Zoltan once told me: go into Device Manager > Disk drives > the drive BOINC is on > Policies, uncheck "Enable write caching on this device", reboot, and you should be all set.

I've been unchecking that for years. Yes, it helps, but it didn't help with the power outage and the 18 failed WUs that I described in the OP. All the drives on all my BOINC machines had write caching disabled.

PappaLitto
Joined: 21 Mar 16
Posts: 511
Credit: 4,672,242,755
RAC: 0
Level
Arg
Message 51642 - Posted: 18 Mar 2019 | 21:22:07 UTC

Interesting, it seemed to eliminate the problem for me when I made that change.

Profile Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 51644 - Posted: 19 Mar 2019 | 15:28:20 UTC - in response to Message 51642.

Interesting, it seemed to eliminate the problem for me when I made that change.

I also believed that before March 7th. Then I was educated x 18. Disabling write caching does still help, though. One related thing I've found is that when Win10 reboots automatically to do updates, it must wait long enough for GPUGrid to close the WUs, as they seem to survive that situation. Knock on wood... ;-)

Erich56
Joined: 1 Jan 15
Posts: 1132
Credit: 10,521,847,676
RAC: 25,876,700
Level
Trp
Message 51645 - Posted: 19 Mar 2019 | 17:47:01 UTC - in response to Message 51644.

... when Win10 reboots automatically to do updates, it must wait long enough for GPUGrid to close the WUs ...

how do you educate Win10 to wait long enough until the GPUGRID tasks stop?

Even if a GPUGRID task is manually stopped in the BOINC manager, it takes up to a minute until it actually stops.

Profile Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 51646 - Posted: 20 Mar 2019 | 4:07:18 UTC - in response to Message 51645.
Last modified: 20 Mar 2019 | 4:08:40 UTC

... when Win10 reboots automatically to do updates, it must wait long enough for GPUGrid to close the WUs ...

how do you educate Win10 to wait long enough until the GPUGRID tasks stop?
Even if a GPUGRID task is manually stopped in the BOINC manager, it takes up to a minute until it actually stops.

I have no idea. My observation is that SO FAR with 5 Win10 machines running 3 GPUGrid WUs each, I haven't had any WUs fail when Win10 decides to reboot to process updates. This has happened quite a few times. Maybe I've just been lucky, maybe not.

Profile Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 51696 - Posted: 12 Apr 2019 | 19:39:08 UTC - in response to Message 51610.

Zoltan had some great advice for me a bit ago. I don't think I can fully remember every step, but it completely fixed my corrupted WUs after a power outage issue. It had to do with the device manager as far as I can recall. Maybe Zoltan can remember?

That would be appreciated. Thanks. Another aspect of the upload congestion is that some uploads can take upwards of 10 hours when a dozen or more are trying at once. Then they start missing the 24-hour cutoff, which is also irritating.

If you don't have a big enough upload pipe for reporting multiple tasks, you can restrict the number of uploads in cc_config.xml

<max_file_xfers_per_project>1</max_file_xfers_per_project>

That way a single finished task will get all of the capacity of your upload pipe to itself and transfer faster.

Thanks again for this. It allowed me to keep more GPUs on the project, though I never could get them all shoehorned into my paltry UL bandwidth. Now, with the rise of mostly KIX WUs and nearly double the UL size, I have the problem again. Maybe someday my area will have better connectivity. For now I've had to move many of my GPUs to projects with smaller UL requirements. I very much like GPUGrid but have to lighten up on it for now. Keep up the great work! I'll keep running what I'm able to here.

Helix Von Smelix
Joined: 13 Aug 08
Posts: 7
Credit: 602,893,064
RAC: 2,551,996
Level
Lys
Message 51701 - Posted: 16 Apr 2019 | 19:03:18 UTC - in response to Message 51606.

UPS.

flashawk
Joined: 18 Jun 12
Posts: 297
Credit: 3,572,627,986
RAC: 0
Level
Arg
Message 51703 - Posted: 17 Apr 2019 | 16:07:57 UTC

Extremely slow uploads here (Menlo Park, CA) at 9:00 AM Pacific time. I have 100 Mbps down and 40 Mbps up, and my connection is working perfectly according to a speed test I just ran. I've noticed this only happens about 25% of the time for me, but it's a major pain uploading at 300 Kbps.

Billy Ewell 1931
Joined: 22 Oct 10
Posts: 40
Credit: 1,561,318,440
RAC: 2,086,488
Level
His
Message 52521 - Posted: 24 Aug 2019 | 4:04:27 UTC
Last modified: 24 Aug 2019 | 4:21:31 UTC

In the past, my equipment was at times excluded from GPUGRID because of lower-quality, low-performing cards. So I finally broke down a few days back and bought an EVGA RTX 2080 with the anticipation of crunching along with the "Big Boys." And of course, quite naturally, I was able over the last couple of days to download a dozen tasks that require 8-12 hours on the fastest cards. And if failure is success, then I succeeded perfectly: every task errored out, with the minimum time being 8.11 seconds and the longest time before failure 14.71 seconds.

My driver when I began crunching was 436.02, and I changed to 431.60 before the last task failed. I did a clean install of the second driver.
Equipment is visible.

I looked at the performance page and do not see a performance record for the RTX 2080, and my cursory look at the task results did not show a wingman having processed a task with a 2080. So what do I do? These tasks come few and far between.

BTW, my other machine with a GTX 1060 has processed all tasks available without a failure.

Profile PDW
Joined: 7 Mar 14
Posts: 15
Credit: 5,280,474,525
RAC: 29,626,109
Level
Tyr
Message 52522 - Posted: 24 Aug 2019 | 7:59:41 UTC - in response to Message 52521.

Your 2080 isn't supported yet; see here for more details...

http://gpugrid.org/forum_thread.php?id=4952

Profile ServicEnginIC
Joined: 24 Sep 10
Posts: 581
Credit: 10,002,528,085
RAC: 18,845,979
Level
Trp
Message 52523 - Posted: 24 Aug 2019 | 8:24:08 UTC

Your 2080 isn't supported yet; see here for more details...

http://gpugrid.org/forum_thread.php?id=4952


Here the new Nvidia Turing-series GPUs are listed:

- NVIDIA TITAN RTX
- RTX 2080 TI
- RTX 2080 SUPER
- RTX 2080
- RTX 2070 SUPER
- RTX 2070
- RTX 2060 SUPER
- RTX 2060
- GTX 1660 TI
- GTX 1660
- GTX 1650

They will all fail every current ACEMD version WU a few seconds after it starts.
The reason: Turing GPUs are not yet supported.
The GPUGrid team is developing a new ACEMD3 version that is expected to support Turing GPUs.
I have a GTX 1660 TI and a GTX 1650 (impatiently) waiting for this ;-)

Billy Ewell 1931
Joined: 22 Oct 10
Posts: 40
Credit: 1,561,318,440
RAC: 2,086,488
Level
His
Message 52524 - Posted: 24 Aug 2019 | 14:19:39 UTC - in response to Message 52522.

PDW: Thank you! Bill
