Really low Run Times, but still Completed and Successful?
| Author | Message |
|---|---|
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
GPU Developers: I've been testing running multiple tasks at the same time on my GPU, and just noticed something odd in my results. If you look at these results, you'll see that they say "Completed and Validated", and I got the bonus credit, but... Look at those Run Times and CPU Times! How can they be valid if they ran for less than 10 seconds? What happened? http://www.gpugrid.net/result.php?resultid=6695537 http://www.gpugrid.net/result.php?resultid=6695975 Note 1: I didn't actually watch these units, so I don't know how long they really took. Sorry. Note 2: I'm not trying to cheat any system. This just happened, and I'm looking for an explanation. Can you figure out what happened here? Regards, Jacob |
|
Joined: 15 May 11 Posts: 108 Credit: 297,176,099 RAC: 0
|
As for the CPU time? No idea. Maybe that's the total amount of time the WU spent moving data around on your system. Operator |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
One was a short unit, one was a long unit. I already understood that. My question is: How could they have only used that little time, and yet been completed and validated, and granted bonus credit? What went wrong? |
|
Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 |
Being valid with such short runtimes is puzzling: thanks for catching it. After a DB check, it appears that the problem is exclusive to your last two WUs. How it happens is beyond me, but I think BOINC got stuck somewhere in between WUs. |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
Well... I'll tell you my setup, to see if that helps you at all in determining the cause. Basically, I have 2 devices:
- GPU Device 0: eVGA GTX 660 Ti 3GB FTW
- GPU Device 1: eVGA GTX 460 1GB

For each of the 4 GPUGrid applications, I had been running the following in app_config.xml, in an attempt to run 2 tasks on a single GPU device without dedicating a CPU to each task (since, when they run on Device 1, they don't usually use CPU at all):
<gpu_usage>0.5</gpu_usage> <cpu_usage>0.001</cpu_usage>

I'm also attached to POEM@Home, and in order to run up to 6 at a time while allocating a CPU for each task, POEM's app_config.xml has:
<gpu_usage>0.166</gpu_usage> <cpu_usage>1</cpu_usage>

GPUGrid mainly runs on Device 0, and at one point, I believe I saw the following all running on GPU Device 0 at the same time:
- GPUGrid Task (0.5 GPU)
- Poem Task (0.166 GPU)
- Poem Task (0.166 GPU)
- Poem Task (0.166 GPU)

Now, that GPUGrid task may have been one that completed normally (i.e., took ~35,000 secs), or it may have been one of these low-run-time tasks... I can't be sure, because I wasn't watching it closely enough. I'll try to keep a better eye on this, but if you find out anything more on your end, please let me know. I'd like to figure out what's happening. |
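For readers following along, here is a minimal Python sketch of the arithmetic behind those settings. It assumes, as a simplification, that the BOINC client keeps placing tasks on a device until the reserved `gpu_usage` fractions add up to one GPU. The function names and the `example_app` application name are hypothetical, and the `<app_config>` / `<app>` / `<gpu_versions>` nesting is the usual app_config.xml layout rather than something quoted in this thread.

```python
import math
import xml.etree.ElementTree as ET
from typing import Sequence


def tasks_per_gpu(gpu_usage: float) -> int:
    """Number of tasks of one app that fit on a single GPU."""
    # The epsilon absorbs float noise so that 1 / 0.5 yields exactly 2.
    return math.floor(1.0 / gpu_usage + 1e-9)


def fits_on_one_gpu(shares: Sequence[float]) -> bool:
    """True if the reserved GPU fractions sum to at most one device."""
    return sum(shares) <= 1.0 + 1e-9


def app_config_xml(app_name: str, gpu_usage: float, cpu_usage: float) -> str:
    """Build an app_config.xml document for one application."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name  # hypothetical app name
    versions = ET.SubElement(app, "gpu_versions")
    ET.SubElement(versions, "gpu_usage").text = str(gpu_usage)
    ET.SubElement(versions, "cpu_usage").text = str(cpu_usage)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(tasks_per_gpu(0.5))    # GPUGrid setting from the post
    print(tasks_per_gpu(0.166))  # POEM setting from the post
    # One GPUGrid task plus three POEM tasks on the same device.
    print(fits_on_one_gpu([0.5, 0.166, 0.166, 0.166]))
    print(app_config_xml("example_app", 0.5, 0.001))
```

With the values from the post, this gives 2 GPUGrid tasks or 6 POEM tasks per device, and one GPUGrid task plus three POEM tasks reserves 0.5 + 3 × 0.166 = 0.998 of the GPU, which is consistent with the mix described in the post.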
nate - Joined: 6 Jun 11 Posts: 124 Credit: 2,928,865 RAC: 0
|
Running multiple jobs on the same GPU is somewhat unorthodox, and beyond what we can troubleshoot. Feel free to keep experimenting for now, but ultimately this raises two concerns for us: 1) Users could potentially cheat the credit system this way, if this is in fact a bug (and the run times are real, and not a bug that is misleading us). 2) Equally concerning for us, it could affect the integrity of the returned data. Corrupted results might be returned without being flagged as such. For now, it just looks like a one-off peculiarity. But if either of these things becomes true, we would have to do something to fix it. Thanks for letting us know, and keep us posted if something like this happens again. |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
Nate, I understand that we want to prohibit cheating, and that we want to make sure the results are valid. I'll definitely keep watching for more of these, and if I get any more, I'll report them. But, since it already happened, and it seems that there's no way the result could be valid, then you might consider seeing if you can somehow account for it now (in the validator?), instead of waiting for it to happen again. I'll let you know when it happens again. Thanks, Jacob |
Carlesa25 - Joined: 13 Nov 10 Posts: 328 Credit: 72,619,453 RAC: 0
|
"Running multiple jobs on the same GPU is somewhat unorthodox, and beyond what we can troubleshoot. Feel free to keep experimenting for now, but ultimately this raises two concerns for us:" Hello: I can only say that I ran several batches of short tasks, two tasks per GPU (GTX 590 = 4 tasks), without problems on Linux Ubuntu 12.10 64-bit; it is a serious OS ... Greetings. |
nate - Joined: 6 Jun 11 Posts: 124 Credit: 2,928,865 RAC: 0
|
"But, since it already happened, and it seems that there's no way the result could be valid, then you might consider seeing if you can somehow account for it now (in the validator?), instead of waiting for it to happen again." Indeed. It appears that there is some problem. A file is missing, and the simulation may be affected. Let us know if it happens again. We'll have to figure out a way to search all the files/database and see if this is an issue elsewhere. |
skgiven - Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
Thought it might have switched cards mid-run or that the WU might have restarted following a driver crash/restart and Boinc reset the runtime, but that's not the case:
Received 1 Apr 2013 | 8:17:30 UTC
FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
"Maybe you could add a validation criteria so that the WU has to run for at least X amount of time to Validate. Probably best based on the app types (Short, Long)."

No, don't put a validation based on time... which may have to be rewritten someday to accommodate faster cards. Validation should be based on actual validation of the results, and it's clear you guys are currently missing a check somewhere for these units (as Nate said, "A file is missing"). So, fix it by checking for that file.

Note: I thought I'd copy/paste the workunit details of the 2 workunits, in case the details get removed:

| Field | Workunit 4315248 | Workunit 4315612 |
|---|---|---|
| Name | I1R446-NATHAN_RPS1_respawn3-8-32-RND4245_0 | I13R89-NATHAN_dhfr36_3-23-32-RND7378_0 |
| Created | 1 Apr 2013, 1:43:35 UTC | 1 Apr 2013, 4:33:06 UTC |
| Sent | 1 Apr 2013, 6:18:32 UTC | 1 Apr 2013, 6:01:20 UTC |
| Received | 1 Apr 2013, 8:18:18 UTC | 1 Apr 2013, 8:17:30 UTC |
| Server state | Over | Over |
| Outcome | Success | Success |
| Client state | Done | Done |
| Exit status | 0 (0x0) | 0 (0x0) |
| Computer ID | 126725 | 126725 |
| Report deadline | 6 Apr 2013, 6:18:32 UTC | 6 Apr 2013, 6:01:20 UTC |
| Run time | 11.36 | 8.37 |
| CPU time | 3.37 | 1.40 |
| Validate state | Valid | Valid |
| Credit | 16,200.00 | 70,800.00 |
| Application version | Short runs (2-3 hours on fastest card) v6.52 (cuda42) | Long runs (8-12 hours on fastest card) v6.18 (cuda42) |
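The "check for that file" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the thread never says which file was missing, so the file names used in the demo are hypothetical, and a real BOINC validator is a server-side program with its own interface that this sketch does not reproduce.

```python
from pathlib import Path
from typing import Iterable, List


def missing_outputs(result_dir: Path, required: Iterable[str]) -> List[str]:
    """Return the required output files that are absent or empty."""
    missing = []
    for name in required:
        path = result_dir / name
        # A zero-byte file is treated the same as a missing one.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


def has_all_outputs(result_dir: Path, required: Iterable[str]) -> bool:
    """True only if every required output file exists and is non-empty."""
    return not missing_outputs(result_dir, required)


if __name__ == "__main__":
    # Hypothetical file names; the real output set is project-specific.
    required_files = ["trajectory.out", "restart.out"]
    print(missing_outputs(Path("."), required_files))
```

A check of this kind does not depend on run time, so it would not need updating when faster cards appear.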
|
Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 |
"No, don't put a validation based on time... which may have to be rewritten someday to accommodate faster cards. Validation should be based on actual validation of the results, and it's clear you guys are currently missing a check somewhere for these units (as Nate said "A file is missing"). So, fix it by checking for that file." Completely agreed! A time-based check is like sloppy 20th century programming, aka "640k is enough for everyone!", and someone will forget to update it. One could extend the usual server-side checks to include such a test and bring unusually fast WUs to the devs' attention automatically, but don't automatically invalidate them. MrS Scanning for our furry friends since Jan 2002 |
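The "flag, but don't invalidate" suggestion could look roughly like the sketch below. It compares each run time with the median for the same application instead of a fixed number of seconds, so the threshold moves as cards get faster. The function name, the 0.2 fraction, and the four "normal" run times in the demo are assumptions made for illustration; only the three short run times and the ~35,000 s ballpark come from this thread.

```python
from statistics import median
from typing import Dict, List


def flag_suspiciously_fast(run_times: Dict[int, float],
                           fraction: float = 0.2) -> List[int]:
    """Return IDs whose run time is below `fraction` of the median.

    The results are only reported for review; nothing is invalidated.
    """
    if not run_times:
        return []
    threshold = fraction * median(run_times.values())
    return sorted(wu for wu, secs in run_times.items() if secs < threshold)


if __name__ == "__main__":
    long_runs = {
        4315612: 8.37,      # from the thread
        4348765: 3280.11,   # from the thread
        4348769: 3.12,      # from the thread
        # Hypothetical normal runs near the ~35,000 s mentioned earlier.
        1: 33000.0,
        2: 34000.0,
        3: 35000.0,
        4: 36000.0,
    }
    for workunit in flag_suspiciously_fast(long_runs):
        print("needs review:", workunit)
```

Because the cutoff is relative, a burst of bad results large enough to drag the median down would weaken the check, so a sketch like this complements a real result check rather than replacing it.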
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
nate, I had 2 more tasks that appear to have completed way prematurely, yet were granted full credit plus bonus credit. I think I was suspending and resuming various GPU tasks when these tasks decided they were completed. That's about all the information I have :-/

Can you please verify that the results are unusable? And then, could you also build additional checks into the validator, so that it will not validate tasks like these, whose results are unusable?

Thanks, Jacob

| Field | Result 6738039 | Result 6738043 |
|---|---|---|
| Result page | http://www.gpugrid.net/result.php?resultid=6738039 | http://www.gpugrid.net/result.php?resultid=6738043 |
| Name | I3R50-NATHAN_dhfr36_3-31-32-RND3834_0 | I10R38-NATHAN_dhfr36_3-31-32-RND3600_0 |
| Workunit | 4348765 | 4348769 |
| Created | 9 Apr 2013, 22:06:35 UTC | 9 Apr 2013, 22:09:36 UTC |
| Sent | 10 Apr 2013, 3:51:18 UTC | 10 Apr 2013, 3:51:18 UTC |
| Received | 10 Apr 2013, 9:06:18 UTC | 10 Apr 2013, 9:06:18 UTC |
| Server state | Over | Over |
| Outcome | Success | Success |
| Client state | Done | Done |
| Exit status | 0 (0x0) | 0 (0x0) |
| Computer ID | 126725 | 126725 |
| Report deadline | 15 Apr 2013, 3:51:18 UTC | 15 Apr 2013, 3:51:18 UTC |
| Run time | 3,280.11 | 3.12 |
| CPU time | 3,247.89 | 0.98 |
| Validate state | Valid | Valid |
| Credit | 70,800.00 | 70,800.00 |
| Application version | Long runs (8-12 hours on fastest card) v6.18 (cuda42) | Long runs (8-12 hours on fastest card) v6.18 (cuda42) |
skgiven - Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
Do you think these are actual run times, or bogus? I've seen run times reset before. 5h15min would be within the range I've seen for one NATHAN_dhfr36 task at a time. If you know what your cache was set to and if you were running one at a time or two, you might get some idea. Thanks,
FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
For that GPU, I usually have 4 GPUGrid tasks queued up, and as many POEM OpenCL tasks queued as possible (they rarely have them available). So the 5-hour difference between download and report means nothing in this scenario. I'm pretty sure the task started, then immediately died, then immediately reported the result, and cashed in on bonus credits. I hope they can fix this. I don't want the credits, but even more importantly, I don't want error results mixed in with their other successful results. |
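The 5h15min figure under discussion is the gap between the Sent and Received timestamps, not the time spent computing. A small Python sketch makes the distinction concrete; the function names are hypothetical, and the parser assumes timestamps written like `10 Apr 2013 | 3:51:18 UTC` with English month abbreviations.

```python
from datetime import datetime


def parse_result_time(text: str) -> datetime:
    """Parse a timestamp such as '10 Apr 2013 | 3:51:18 UTC'."""
    # Drop the separator between date and time, then normalise whitespace.
    cleaned = " ".join(text.replace("|", " ").replace(",", " ").split())
    # %b expects English month names under the default C locale.
    return datetime.strptime(cleaned, "%d %b %Y %H:%M:%S UTC")


def turnaround_seconds(sent: str, received: str) -> float:
    """Seconds between a task being sent and its result being received."""
    delta = parse_result_time(received) - parse_result_time(sent)
    return delta.total_seconds()


def parse_seconds(text: str) -> float:
    """Parse a run time such as '3,280.11' (comma as thousands separator)."""
    return float(text.replace(",", ""))


if __name__ == "__main__":
    gap = turnaround_seconds("10 Apr 2013 | 3:51:18 UTC",
                             "10 Apr 2013 | 9:06:18 UTC")
    print(gap, parse_seconds("3.12"), parse_seconds("3,280.11"))
```

For result 6738043 the turnaround is 18,900 s (5 h 15 min) while the reported run time is 3.12 s. With several tasks queued, most of that gap can be waiting time, so the turnaround alone cannot confirm or rule out a long run.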
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
Uhoh, Nate! I just had a whole rash of these, 18 of them, all Nathan long-run dhfr, just happen! And, from the best I can tell, this happened in the middle of the night while I was away from the PC. Each task took 3 seconds, was marked completed and validated, and then was granted the full 70,800.00 credit. It looks like I'm going to be getting the most credit today that I've ever gotten before.

Have you guys begun your investigation into the validator?? I urge you to take action ASAP. If there's anything I can do to help test, please let me know.

Thanks, Jacob

| Task | Work unit | Computer | Sent | Time reported | Status | Run time (sec) | CPU time (sec) | Credit | Application |
|---|---|---|---|---|---|---|---|---|---|
| 6742602 | 4352606 | 126725 | 11 Apr 2013, 11:31:23 UTC | 11 Apr 2013, 11:34:25 UTC | Completed and validated | 3.61 | 0.90 | 70,800.00 | Long runs (8-12 hours on fastest card) v6.18 (cuda42) |
| 6742575 | 4352579 | 126725 | 11 Apr 2013, 11:25:58 UTC | 11 Apr 2013, 11:28:41 UTC | Completed and validated | 3.23 | 0.92 | 70,800.00 | Long runs (8-12 hours on fastest card) v6.18 (cuda42) |
| 6742574 | 4352578 | 126725 | 11 Apr 2013, 11:25:58 UTC | 11 Apr 2013, 11:28:41 UTC | Completed and validated | 3.56 | 0.98 | 70,800.00 | Long runs (8-12 hours on fastest card) v6.18 (cuda42) |
| 6742557 | 4352561 | 126725 | 11 Apr 2013, 11:25:58 UTC | 11 Apr 2013, 11:28:41 UTC | Completed and validated | 3.60 | 0.86 | 70,800.00 | Long runs (8-12 hours on fastest card) v6.18 (cuda42) |
| 6742544 | 4352548 | 126725 | 11 Apr 2013, 11:28:41 UTC | 11 Apr 2013, 11:31:23 UTC | Completed and validated | 3.25 | 0.92 | 70,800.00 | Long runs (8-12 hours on fastest card) v6.18 (cuda42) |
| 6742543 | 4352547 | 126725 | 11 Apr 2013, 11:28:41 UTC | 11 Apr 2013, 11:31:23 UTC | Completed and validated | 3.71 | 0.95 | 70,800.00 | Long runs (8-12 hours on fastest card) v6.18 (cuda42) |
| 6742528 | 4352532 | 126725 | 11 Apr 2013, 11:23:07 UTC | 11 Apr 2013, 11:25:58 UTC | Completed and validated | 3.66 | 0.89 | 70,800.00 | Long runs (8-12 hours on fastest card) v6.18 (cuda42) |
| 6742481 | 4352486 | 126725 | 11 Apr 2013, 11:18:59 UTC | 11 Apr 2013, 11:20:56 UTC | Completed and validated | 3.24 | 0.86 | 70,800.00 | Long runs (8-12 hours on fastest card) v6.18 (cuda42) |
| 6742441 | 4352446 | 126725 | 11 Apr 2013, 11:31:23 UTC | 11 Apr 2013, 11:34:25 UTC | Completed and validated | 3.26 | 0.92 | 70,800.00 | Long runs (8-12 hours on fastest card) v6.18 (cuda42) |
| 6742440 | 4352445 | 126725 | 11 Apr 2013, 11:31:23 UTC | 11 Apr 2013, 11:34:25 UTC | Completed and validated | 3.64 | 0.94 | 70,800.00 | Long runs (8-12 hours on fastest card) v6.18 (cuda42) |
| 6742431 | 4352436 | 126725 | 11 Apr 2013, 11:20:56 UTC | 11 Apr 2013, 11:23:07 UTC | Completed and validated | 3.65 | 0.90 | 70,800.00 | Long runs (8-12 hours on fastest card) v6.18 (cuda42) |
| 6742430 | 4352435 | 126725 | 11 Apr 2013, 11:20:56 UTC | 11 Apr 2013, 11:23:07 UTC | Completed and validated | 3.21 | 0.98 | 70,800.00 | Long runs (8-12 hours on fastest card) v6.18 (cuda42) |
| 6742409 | 4352415 | 126725 | 11 Apr 2013, 11:16:58 UTC | 11 Apr 2013, 11:18:59 UTC | Completed and validated | 3.71 | 0.87 | 70,800.00 | Long runs (8-12 hours on fastest card) v6.18 (cuda42) |
| 6742386 | 4352392 | 126725 | 11 Apr 2013, 11:23:07 UTC | 11 Apr 2013, 11:25:58 UTC | Completed and validated | 3.19 | 1.01 | 70,800.00 | Long runs (8-12 hours on fastest card) v6.18 (cuda42) |
| 6742385 | 4352391 | 126725 | 11 Apr 2013, 11:23:07 UTC | 11 Apr 2013, 11:25:58 UTC | Completed and validated | 3.43 | 0.98 | 70,800.00 | Long runs (8-12 hours on fastest card) v6.18 (cuda42) |
| 6742383 | 4352389 | 126725 | 11 Apr 2013, 11:28:41 UTC | 11 Apr 2013, 11:31:23 UTC | Completed and validated | 3.74 | 0.86 | 70,800.00 | Long runs (8-12 hours on fastest card) v6.18 (cuda42) |
| 6742377 | 4352383 | 126725 | 11 Apr 2013, 11:16:58 UTC | 11 Apr 2013, 11:20:56 UTC | Completed and validated | 3.22 | 0.86 | 70,800.00 | Long runs (8-12 hours on fastest card) v6.18 (cuda42) |
| 6741810 | 4351878 | 126725 | 11 Apr 2013, 6:50:41 UTC | 11 Apr 2013, 11:16:58 UTC | Completed and validated | 3.27 | 0.90 | 70,800.00 | Long runs (8-12 hours on fastest card) v6.18 (cuda42) |
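To put a number on the batch listed above, the run times and credits can simply be added up. The sketch below hard-codes the 18 run times and the per-task credit from the listing; the variable and function names are hypothetical.

```python
from typing import Dict, Sequence

# Run times (seconds) of the 18 tasks, in the order they are listed.
RUN_TIMES = [
    3.61, 3.23, 3.56, 3.60, 3.25, 3.71, 3.66, 3.24, 3.26,
    3.64, 3.65, 3.21, 3.71, 3.19, 3.43, 3.74, 3.22, 3.27,
]
CREDIT_PER_TASK = 70_800.00  # credit granted to each task in the listing


def summarize(run_times: Sequence[float], credit_per_task: float) -> Dict[str, float]:
    """Total task count, run time, and credit for a batch of results."""
    return {
        "tasks": len(run_times),
        "total_run_time": sum(run_times),
        "total_credit": len(run_times) * credit_per_task,
    }


if __name__ == "__main__":
    summary = summarize(RUN_TIMES, CREDIT_PER_TASK)
    print(summary["tasks"], round(summary["total_run_time"], 2),
          summary["total_credit"])
```

That works out to 18 tasks, about 62 seconds of reported run time in total, and 1,274,400 credits granted.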
Beyond - Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0
|
They should probably disable running multiple WUs here, at least until this issue is resolved. |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
I disagree. They have a way to find the erroneously-validated results, and so they have a way to filter them out whenever they go to actually use the data. Disabling multiple WUs would hinder performance for a lot of people, those that run 2-at-a-time as well as those that run 1-at-a-time (and may not be connected to the internet very often). What they should do is 2 things: Priority 1) Fix the validator to stop marking these results valid, and thus stop issuing credits for invalid results. Priority 2) Fix the workunits so they do not error under whatever conditions they are erroring. - Jacob |
Beyond - Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0
|
"What they should do is 2 things:" Jacob, that would be fine, but primarily this is about the science. If the results are being corrupted, the short-term answer is to protect the validity of the science. Taking a credit hit is irrelevant. |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
I know this is about the science, silly! Fixing priority 1, the validator, will ensure that invalid results are not included within any list of valid results, thus preserving "the science". Fixing priority 2 will prevent network strain and user confusion. I hope they are working on both. |
©2025 Universitat Pompeu Fabra