Message boards :
Number crunching :
GPUGrid problems, nothing has changed
Beyond - Joined: 23 Nov 08, Posts: 1112, Credit: 6,162,416,256, RAC: 0

This is the third time I've gone in heavily on GPUGrid over the last 10-11 years. Twice I've gotten frustrated with the problems and cut way back. I was hoping that some of the issues would have been fixed. There's been an ongoing problem of stalling uploads (not to mention downloads) for many years, and it's still not fixed. In addition, WUs that get interrupted often fail even with write caching disabled on the drives. Case in point: last night we had a 3-hour power outage. When I brought the machines back up, 18 out of 25 GPUGrid WUs failed. There was not even one failure for any WU from any other project.

These failures also cause another problem. Since 18 new WUs start at the same time, they finish at about the same time, and that many huge GPUGrid WUs uploading at once saturates my bandwidth for many hours. Yes, I live in the US, so my DSL connection is not fast even though it was upgraded a few months ago (only one provider here; how do you spell monopoly). Unbridled capitalism is a bad idea for 99.9% of the people.

Anyway, the combination of poor broadband infrastructure and these long-standing GPUGrid problems sadly pushes me to cut back on this otherwise fine project. It seems to me that some of this shouldn't be that difficult to fix, but apparently the necessary skills aren't present. BTW, these "upload storms" have been happening regularly. For someone with a faster connection and/or fewer GPUs it may not seem like a problem, but it's a problem here and I know of no way to solve it on my end. Thanks for listening to my frustration.
Joined: 21 Mar 16, Posts: 513, Credit: 4,673,458,277, RAC: 0

Zoltan had some great advice for me a while ago. I don't think I can fully remember every step, but it completely fixed my issue of corrupted WUs after a power outage. It had to do with Device Manager, as far as I can recall. Maybe Zoltan can remember?
Beyond - Joined: 23 Nov 08, Posts: 1112, Credit: 6,162,416,256, RAC: 0

> Zoltan had some great advice for me a while ago. [...] Maybe Zoltan can remember?

That would be appreciated, thanks. Another facet of the upload congestion problem is that some uploads can take upwards of 10 hours when a dozen or more are trying at once. Then they start missing the 24-hour cutoff, which is also irritating.
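The arithmetic behind those stalls is easy to sketch. The numbers below (result size, uplink speed) are illustrative assumptions, not GPUGrid's actual figures, but they show why a dozen simultaneous uploads on a slow DSL line can blow past the 24-hour cutoff while a serialized upload would not:

```python
def upload_hours(file_mb, uplink_mbps, concurrent):
    """Hours to finish one upload when `concurrent` transfers share the uplink
    equally (idealized fair sharing, no protocol overhead)."""
    share_mbps = uplink_mbps / concurrent   # each transfer's slice of the pipe
    seconds = (file_mb * 8) / share_mbps    # MB -> megabits, then divide by rate
    return seconds / 3600

# Assumed for illustration: ~500 MB per result on a 1 Mbps DSL uplink.
print(round(upload_hours(500, 1.0, 12), 1))  # 12 at once: ~13.3 h per result
print(round(upload_hours(500, 1.0, 1), 1))   # one at a time: ~1.1 h per result
```

Serializing the transfers doesn't reduce the total upload time, but each individual result completes far sooner, which is what a per-result deadline cares about.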
Joined: 3 Jul 18, Posts: 22, Credit: 2,758,801, RAC: 0

I know the frustration, but ironically GPUGRID is the better project for me by a small margin. I wouldn't even know how the team at GPUGRID could fix the issues you're describing; aren't those BOINC-related issues? I've seen similar issues discussed in other projects. The solution was to run a startup script that booted the machines or restarted the clients with some delay. Alternatively, you could limit the number of connections for BOINC, which would be slower (unnecessarily slow at times) but more evenly distributed.
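A minimal sketch of that staggered-start idea. The per-host index is an assumption (you would assign a different value on each machine), and the actual client start command varies by platform, so it is left commented out:

```python
import subprocess  # only needed if you uncomment the start command below
import time

def stagger_delay(host_index, step_seconds=120):
    """Startup delay (seconds) for a given host, so machines that lost power
    together come back online, and later finish and upload, at staggered
    times instead of in lockstep."""
    return host_index * step_seconds

HOST_INDEX = 3  # assumed: set a unique value per machine

print(stagger_delay(HOST_INDEX))  # host 3 would wait 360 s
# On a real machine you would then do something like:
#   time.sleep(stagger_delay(HOST_INDEX))
#   subprocess.run(["systemctl", "start", "boinc-client"])  # platform-specific
```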
Joined: 13 Dec 17, Posts: 1419, Credit: 9,119,446,190, RAC: 731

> Zoltan had some great advice for me a while ago. [...] Maybe Zoltan can remember?

If you don't have a big enough upload pipe for reporting multiple tasks, you can restrict the number of uploads in cc_config.xml:

`<max_file_xfers_per_project>1</max_file_xfers_per_project>`

That way a single finished task will get all of the capacity of your upload pipe to itself and transfer faster.
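For anyone trying this, a minimal cc_config.xml sketch: the option sits inside the standard `<options>` block (see the BOINC client configuration wiki), and the client needs a restart or a "read config files" for it to take effect:

```xml
<cc_config>
  <options>
    <!-- allow only one simultaneous file transfer per project -->
    <max_file_xfers_per_project>1</max_file_xfers_per_project>
  </options>
</cc_config>
```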
Beyond - Joined: 23 Nov 08, Posts: 1112, Credit: 6,162,416,256, RAC: 0

> If you don't have a big enough upload pipe for reporting multiple tasks, you can restrict the number of uploads in cc_config.xml

Thanks, I've been meaning to try this. The problem then becomes that the CPU WUs create a huge backlog waiting while the huge GPUGrid upload stumbles along; the Ryzen 7 machines get through a lot of CPU work pretty quickly. No wait, that's an option I didn't know about (per project). I will definitely try it. Thanks again!
Joined: 2 Jul 16, Posts: 338, Credit: 7,987,341,558, RAC: 213

There is an option for the entire client and one per project: https://boinc.berkeley.edu/wiki/Client_configuration
Beyond - Joined: 23 Nov 08, Posts: 1112, Credit: 6,162,416,256, RAC: 0

> `<max_file_xfers_per_project>1</max_file_xfers_per_project>`

Seems to be helping; there's not as much stalling. Will continue to monitor.
Joined: 21 Mar 16, Posts: 513, Credit: 4,673,458,277, RAC: 0

> Zoltan had some great advice for me a while ago. [...] Maybe Zoltan can remember?

I recall what Zoltan once told me: go into Device Manager / Disk drives / the drive BOINC is on / Policies / uncheck "Enable write caching on this device", then reboot and you should be all set.
Joined: 1 Jan 15, Posts: 1166, Credit: 12,260,898,501, RAC: 1

> I recall what Zoltan once told me: go into Device Manager / Disk drives / the drive BOINC is on / Policies / uncheck "Enable write caching on this device", then reboot and you should be all set.

Yes, this was/is exactly it.
Beyond - Joined: 23 Nov 08, Posts: 1112, Credit: 6,162,416,256, RAC: 0

> Go into Device Manager / Disk drives / the drive BOINC is on / Policies / uncheck "Enable write caching on this device" [...]

I've been unchecking that for years. Yes, it helps, but it didn't help with the power outage and 18 failed WUs that I described in the OP. All the drives on all my BOINC machines had write caching disabled.
Joined: 21 Mar 16, Posts: 513, Credit: 4,673,458,277, RAC: 0

Interesting; it seemed to eliminate the problem for me when I applied it.
Beyond - Joined: 23 Nov 08, Posts: 1112, Credit: 6,162,416,256, RAC: 0

> Interesting; it seemed to eliminate the problem for me when I applied it.

I also believed that before March 7th. Then I was educated x 18. It does help when write caching is disabled, though. One related thing I've found: when Win10 reboots automatically to do updates, it must wait long enough for GPUGrid to close the WUs, as they seem to survive that situation. Knock on wood... ;-)
Joined: 1 Jan 15, Posts: 1166, Credit: 12,260,898,501, RAC: 1

> ... when Win10 reboots automatically to do updates, it must wait long enough for GPUGrid to close the WUs ...

How do you educate Win10 to wait long enough until the GPUGRID task stops? Even if a GPUGRID task is manually stopped in the BOINC Manager, it takes up to a minute until it actually stops.
Beyond - Joined: 23 Nov 08, Posts: 1112, Credit: 6,162,416,256, RAC: 0

> ... when Win10 reboots automatically to do updates, it must wait long enough for GPUGrid to close the WUs ...

I have no idea. My observation is that SO FAR, with 5 Win10 machines running 3 GPUGrid WUs each, I haven't had any WUs fail when Win10 decides to reboot to process updates. This has happened quite a few times. Maybe I've just been lucky, maybe not.
Beyond - Joined: 23 Nov 08, Posts: 1112, Credit: 6,162,416,256, RAC: 0

> Zoltan had some great advice for me a while ago. [...] Maybe Zoltan can remember?

Thanks again for this. It allowed me to keep more GPUs on the project, though I never could get them all shoehorned into my paltry UL bandwidth. Now, with the rise of mostly KIX WUs and nearly double the UL size, I have the problem again. Maybe someday my area will have better connectivity. For now I've had to move many of my GPUs to projects with smaller UL requirements. I very much like GPUGrid but have to lighten up on it for now. Keep up the great work! I'll keep running what I'm able to here.
Joined: 13 Aug 08, Posts: 7, Credit: 772,857,675, RAC: 0

UPS.
Joined: 18 Jun 12, Posts: 297, Credit: 3,572,627,986, RAC: 0

Extremely slow uploads here (Menlo Park, CA) at 9:00 AM Pacific time. I have 100 Mbps down and 40 Mbps up, and my connection is working perfectly according to a speed test I just did. I've noticed this only happens about 25% of the time for me, but it's a major pain uploading at 300 Kbps.
Joined: 22 Oct 10, Posts: 42, Credit: 1,752,050,315, RAC: 47

In the past, my equipment was excluded from GPUGRID at times because of lower-quality, low-performing cards. So I finally broke down a few days back and bought an EVGA RTX 2080, anticipating crunching along with the "Big Boys." And of course, quite naturally, over the last couple of days I was able to download a dozen tasks that require 8-12 hours on the fastest cards. If failure is success, then I succeeded perfectly: every task errored out, with the minimum time being 8.11 seconds and the longest time before failure 14.71 seconds. My driver when I began crunching was 436.02, and I changed to 431.60 before the failure of the last task; I did a clean install of the second driver. Equipment is visible. I looked on the performance page and I do not see a performance record for the RTX 2080 card, and my cursory look at the task results did not show a wingman having processed a task with the 2080. So what do I do? These tasks come few and far between. BTW, my other machine with a GTX 1060 has processed all available tasks without a failure.
PDW - Joined: 7 Mar 14, Posts: 18, Credit: 6,575,125,525, RAC: 1

Your 2080 isn't supported yet; see here for more details: http://gpugrid.org/forum_thread.php?id=4952
©2025 Universitat Pompeu Fabra