Really low Run Times, but still Completed and Successful?
| Author | Message |
|---|---|
Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0
|
But in the meantime why do something that's causing corrupted results? |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
Not all results are corrupted, and the admins still have a way of finding the corrupted ones after they've been erroneously validated. So, to answer your question "why keep doing it", the answer is "the science" (Overall I do more science with a higher throughput, running 2-at-a-time, even if I get some that immediately error out). If the admins make a request for me to change, I will of course honor it. But, so far, I've been encouraged to test (find problems) and report results (and problems). I just hope they fix the problems I find. Regards, Jacob |
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
The apps at GPUGrid were not designed to run more than one task at a time on the same GPU. If running several tasks at once causes failures, it might be massively detrimental to the project, so don't be surprised if the use of app_config gets banned. Getting lots of credits for producing failed tasks won't endear you to anyone and typically results in having your credit reduced or account suspended. I don't think projects have to facilitate crunchers or BOINC add-ons. app_config was designed for all that can use it but is better used at other projects than here. At GPUGrid it seems the improvement is limited to but a few task types, on cards with more RAM and on Vista/W7/W8. FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help |
|
Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 |
JK is the only one affected by the problem, as far as I can tell, and it's a nasty one. My guess is that the client does not keep the two WUs separate. Disable as soon as possible. (Unless there is something more odd like read-only filesystems, file permissions, or the like). |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
Quote: "JK is the only one affected by the problem, as far as I can tell, and it's a nasty one. My guess is that the client does not keep the two WUs separate. Disable as soon as possible."
Toni: Per your request (as a project administrator), I will stop running 2-at-a-time on the same GPU. Toni / Nate / GDF: Could you please do more research into the problem (hopefully not just guessing and giving up)... and fix the apps so that they can run 2-at-a-time consistently, and fix the validator so it does proper additional validation checks before granting credits? Running the tasks 2-at-a-time does increase throughput, when they work properly, and thus supporting it would be beneficial to your science. I am available for any testing, and am eager to be allowed to run tasks 2-at-a-time, again. Thanks, Jacob Klein |
|
Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 |
Hi JK, thanks for your fix. Your results now seem to work properly. My guess is that it is a bug in the client (not so much the application) which makes the two tasks not isolated from each other. |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
Toni, There are times when GPUGrid-2-at-a-time works just fine for me. In fact, I'd say most of the time, it works correctly. If it's a bug in the client, as you suggest, then let's get it fixed! I'm a beta tester, testing 7.0.62, and if you can isolate the reason you think it's a client bug, we can contact David Anderson and he can fix it. What makes you think it's a client bug? I'm much more likely to believe that it's an application problem, though. I'm able to run x-at-a-time successfully on all of my other GPU projects. If you determine it's an application problem, could you please fix it, as well as the validator? I look forward to your reply, Jacob |
|
Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 |
JK's got a point in that running multiple concurrent WUs works for many projects. In fact, GPU-Grid is the first one I am aware of where it doesn't work. Generally the others don't officially support or encourage it, but didn't have to adjust their code either. Running multiple WUs per system, on multiple GPUs, works. Since GPU-Grid is rather complex and surely uses quite a few CUDA libraries, I could imagine the following: maybe some of these libraries contain code / variables / initializations that are global per GPU, so that multiple WUs share them (unintentionally, in this case). An error could then appear if 2 WUs conflict in their use of the resource; otherwise it will run just fine. MrS Scanning for our furry friends since Jan 2002 |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
Toni / Nate / GDF: I contacted David Anderson. He said that he'd be willing to help with any client problem. I'd like to make progress on fixing the issues within this thread. Have you guys begun an investigation yet into the cause? It seems like the stderr output only had a single line of info, showing the BOINC version number. Maybe the code could be changed to indicate how far along in the execution it got, before crashing/completing? I'd like to resume processing GPUGrid tasks x-at-a-time on the same GPU, like I can with all of my other GPU projects. If there's anything I can do to help test a change or expedite the fix, I am at your command. Thank you, Jacob |
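The stderr suggestion above could look roughly like the following. This is only an illustrative sketch in Python, not the GPUGrid application's code: the function names `log_stage` and `run_task` and the stage names are hypothetical. The idea is to write a marker before and after each phase and flush it right away, so that the last line in stderr shows how far a task got before it crashed or exited.

```python
import sys
import time


def log_stage(stage, stream=sys.stderr):
    """Write one timestamped progress marker and flush it immediately."""
    stream.write(f"{time.strftime('%H:%M:%S')} stage={stage}\n")
    stream.flush()  # flush so the marker survives an abrupt crash


def run_task(steps, stream=sys.stderr):
    """Run (name, callable) steps in order, logging around each one.

    The step names are hypothetical placeholders for whatever phases
    the real application goes through.
    """
    log_stage("start", stream)
    for name, func in steps:
        log_stage(f"{name}:begin", stream)
        func()
        log_stage(f"{name}:done", stream)
    log_stage("finished", stream)
```

With markers like these, a task that exits after a few seconds would leave a last line such as `stage=simulate:begin` instead of only the BOINC version number.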
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
Toni / Nate / GDF: I disabled x-at-a-time for GPUGrid tasks, per Toni's request, on 4/12/2013. Yet, I had 2 more results that were marked "Completed and validated", with Run time less than 4 seconds, granted full+bonus credit of 70,800, on 4/14/2013! This leads me to believe that the problem is unrelated to running x-at-a-time. Again, I ask you, have you begun your investigation??

| Task | Work unit | Computer | Sent | Time reported or deadline | Status | Run time (sec) | CPU time (sec) | Credit | Application |
|---|---|---|---|---|---|---|---|---|---|
| 6755331 | 4362872 | 126725 | 14 Apr 2013 7:01:18 UTC | 14 Apr 2013 10:38:35 UTC | Completed and validated | 2.40 | 0.84 | 70,800.00 | Long runs (8-12 hours on fastest card) v6.18 (cuda42) |
| 6754567 | 4362305 | 126725 | 14 Apr 2013 2:29:48 UTC | 14 Apr 2013 10:38:01 UTC | Completed and validated | 3.19 | 0.84 | 70,800.00 | Long runs (8-12 hours on fastest card) v6.18 (cuda42) |

Specs: Windows 8 x64, BOINC v7.0.62 x64 Beta, nVidia GeForce 314.22 WHQL, eVGA GTX 660 TI 3GB FTW, eVGA GTX 460 1GB, GPUGrid app_config using the following settings for all 4 apps: <max_concurrent>9999</max_concurrent> <gpu_versions> <gpu_usage>1</gpu_usage> <cpu_usage>0.001</cpu_usage> </gpu_versions> |
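Results like the two listed above can be picked out of a task list with a simple run-time check. The sketch below is a rough illustration, not how the project's validator works: the `Result` class, the `suspicious` function, and the 60-second threshold are all hypothetical. It flags tasks that were validated and credited but ran for an implausibly short time, given that the application is described as "Long runs (8-12 hours on fastest card)".

```python
from dataclasses import dataclass


@dataclass
class Result:
    task_id: int
    status: str
    run_time_s: float
    credit: float


# The two results quoted in the post above.
RESULTS = [
    Result(6755331, "Completed and validated", 2.40, 70800.00),
    Result(6754567, "Completed and validated", 3.19, 70800.00),
]

# Hypothetical threshold: a long-run task finishing in under a minute
# is treated as implausible. The real cutoff would be a project decision.
MIN_PLAUSIBLE_RUN_TIME_S = 60.0


def suspicious(results, min_run_time_s=MIN_PLAUSIBLE_RUN_TIME_S):
    """Return validated, credited results whose run time is implausibly short."""
    return [
        r
        for r in results
        if r.status == "Completed and validated"
        and r.credit > 0
        and r.run_time_s < min_run_time_s
    ]


for r in suspicious(RESULTS):
    print(f"task {r.task_id}: {r.run_time_s:.2f} s, credit {r.credit:,.2f}")
```

A check like this only marks results for review; it says nothing about why the tasks exited early.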
Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0
|
<cpu_usage>.001</cpu_usage> Jacob, could you try running without the app_config and see if you still get these glitches? |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
Maybe. Could you also try running WITH the app_config to see what results you get? I'm quite frustrated -- it really feels like I'm the only one trying to solve this problem. |
Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0
|
Quote: "Maybe. Could you also try running WITH the app_config to see what results you get?"
I don't see it as a problem. Many projects support running multiple instances (although not necessarily officially), this one does not. GPUGrid is extremely demanding compared to most projects. So far you've been seeing only the easiest of the long WUs compared to what we had not too long ago. I wouldn't want to try running multiple long NOELIA or GIANNI WUs. The NATHAN WUs previous to these would not even run properly 1x on my GTX 460/768 cards due to too large a memory footprint. I know you're trying to squeeze every last bit of performance out of your GTX 660 TI 3GB and that's commendable. Maybe the developers can find a solution for you, but I wouldn't be recommending either running 2x or limiting CPU reservation at this point. |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
I don't think you understand the CPU limitation portion. I removed the GPUGrid.net project, re-added it, and started working on some long-run Nathans (without any app_config override). By default, they use "0.73 CPUs + 1 NVIDIA GPU" on my machine. Having an app_config that sets it to 0.001 does not make any difference if only 1 task is running. That number is used to determine how many CPU jobs to add. For instance:

Without app_config, my system runs:
- 1 GPUGrid.net task (0.73 CPUs + 1 NVIDIA GPU)
- 2 WCG HCC GPU tasks (each: 1 CPU + 0.5 NVIDIA GPUs)
- 6 CPU jobs

With app_config, my system runs:
- 1 GPUGrid.net task (0.001 CPUs + 1 NVIDIA GPU)
- 2 WCG HCC GPU tasks (each: 1 CPU + 0.5 NVIDIA GPUs)
- 6 CPU jobs

See? I don't see how the app_config could possibly make a difference, when running only 1 GPUGrid task, but I'm testing it anyways, since I'm losing hope that the admins care. You may not see it as a problem, but it affects their science results. Tasks are failing immediately, their validator is erroneously marking those invalid results as valid... and if I can prove (to myself) that running x-at-a-time is not the cause, then I will likely switch back to running x-at-a-time, for increased performance. The admins need to begin an investigation, if they care. |
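The arithmetic behind the two job lists above can be sketched as follows. This is not the BOINC client's actual scheduling code. Two things here are assumptions chosen only because they reproduce the job counts reported in the post: a machine with 8 logical CPUs, and a rule that only whole CPUs (the floor of the summed GPU-task CPU fractions) are taken away from CPU jobs. The function name `cpu_jobs_that_fit` is hypothetical.

```python
import math


def cpu_jobs_that_fit(total_cpus, gpu_task_cpu_fractions):
    """Estimate how many single-CPU jobs run next to the GPU tasks.

    Assumption (not taken from BOINC's source): only whole CPUs
    reserved by GPU tasks reduce the number of CPU jobs.
    """
    reserved = math.floor(sum(gpu_task_cpu_fractions))
    return max(0, total_cpus - reserved)


TOTAL_CPUS = 8  # assumption: the post does not state the CPU count

# 1 GPUGrid task plus 2 WCG HCC GPU tasks, as listed in the post.
without_app_config = [0.73, 1.0, 1.0]
with_app_config = [0.001, 1.0, 1.0]

print(cpu_jobs_that_fit(TOTAL_CPUS, without_app_config))
print(cpu_jobs_that_fit(TOTAL_CPUS, with_app_config))
```

Under these assumptions both configurations leave room for the same number of CPU jobs, which matches the point made in the post: with a single GPUGrid task, changing its CPU fraction from 0.73 to 0.001 does not change what else runs.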
Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0
|
Quote: "See? I don't see how the app_config could possibly make a difference, when running only 1 GPUGrid task, but I'm testing it anyways"
That's all I asked, since it seems like these failures are unique to you and you're probably the only one running the above configuration. The only way to know for sure is to test it.
Quote: "You may not see it as a problem, but it affects their science results."
Exactly, and if the config is the cause, the admins should disable the use of app_info and app_config until it's sorted out. |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
And how do you propose they disable those things? Seriously. I don't believe it's possible. <sigh> Man, I am SICK of all of these "ifs" and "let's do this and that." I'm the only one testing to find the root cause, so far as I know. It's frustrating, to say the least, it feels like I'm in a hole. But if I'm the only one having the issue, then sure, I can accept being the only one doing the testing. If anyone else truly cares, they should be trying to reproduce the problem, instead of guessing at causes/fixes. These proposals, some of which are ludicrous (making validation based on CPU time??), some of which are impossible (ban app-config??), and some of which are detrimental (limit to 1 gpu task per person??)... they're all the wrong approaches, in my opinion. I said it before, and I'll say it again: What they should do is 2 things:
- Priority 1) Fix the validator to stop marking these results valid, and thus stop issuing credits for invalid results
- Priority 2) Fix the workunits so they do not error under whatever conditions they are erroring

If you want to help me test, then help me test. Otherwise, please don't guess at causes/solutions. Thanks, Jacob |
|
Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 |
Jacob.. relax. It's only been 1 work day since Toni last replied. He could be on vacation or babysitting as far as we know.. or something else got in his way. I agree the validator should be fixed, and this doesn't sound too hard. Whether they can fix "your" problem, on the other hand, is up in the air as long as we have no further insight into why this is happening. One more possibility (I'm not calling it a guess ;) would be some weird interaction with WCG@GPU, but then you probably wouldn't be the only one affected. MrS Scanning for our furry friends since Jan 2002 |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
I'll try to relax. If I knew that an investigation was proceeding, or that a fix was being worked on, then I could relax a lot easier. It's been 2 full weeks since the problem was reported, and the admin responses thus far have been "oh wow that's odd and beyond me", "wait and see if it happens again", and "probably a client issue". |
Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0
|
Quote: "some of which are impossible (ban app-config??)"
It was my understanding that there is a server setting to allow/disallow the anonymous platform (app_info.xml). Not positive that's true and also don't know if there's a similar way to disallow the app_config. Maybe someone can enlighten us on that point. |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
Toni / Nate / GDF: I have been doing additional testing. Since removing/readding the project, I have not had the issue, even when using a custom app_config for 0.001 CPU and 1-at-a-time task processing. In order to test whether the remove/readd fixed the problem, I was wondering... Could I please turn on 2-at-a-time task processing, to continue my testing? Awaiting your reply, Jacob |
©2025 Universitat Pompeu Fabra