Canceled SDOERR WU's
Author | Message |
---|---|
Joined: 5 Mar 13 Posts: 348 Credit: 0 RAC: 0 |
I apologize to all whose WU's were cancelled today in the process of stopping the old ones and re-sending the new fixed ones. At the beginning I didn't realize the effect and extent of the cancellation (since it was not intended to work this way), but I see now that many hours of computation and credits were lost due to this. It seems to have been a bug in a script that had not been seen until now, so we will refrain from using it until it has been fixed. I realize that the computation time is very important for everyone, so we hope that such hiccups won't happen too often in the future. Many thanks for your patience and calculations! |
Joined: 18 Sep 08 Posts: 368 Credit: 4,174,624,885 RAC: 0 |
Well at least now I know why a lot of my Wu's were getting the Boot from the Server ... o_0 STE\/E |
Joined: 28 Mar 09 Posts: 490 Credit: 11,731,645,728 RAC: 52,725 |
Okay, fine, learn from your mistakes, and let's go on! |
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
This morning I saw that a system had recovered from a problem and was sitting at the Windows log-on screen. When I logged on it became unresponsive, though after waiting about a minute (due to my reg settings) I got 2 app crashed error messages for SDOERR WU's:
I2HDQ_21R8-SDOERR_2HDQd-1-4-RND9506_3, work unit 4467647, sent 18 May 2013 21:10:00 UTC, reported 19 May 2013 11:13:06 UTC, Error while computing, run time 18,441.84 s, CPU time 18,362.63 s, credit ---, Long runs (8-12 hours on fastest card) v6.18 (cuda42)
I2HDQ_22R9-SDOERR_2HDQd-0-4-RND3283_1, work unit 4465782, sent 18 May 2013 17:35:46 UTC, reported 19 May 2013 11:36:31 UTC, Aborted by user, run time 33,710.79 s, CPU time 33,611.30 s, credit ---, Long runs (8-12 hours on fastest card) v6.18 (cuda42)
These 2 WU's were around 50% and 79% complete (9h). I tried Boinc exits and log-ons, cold restarts, but always got the same app crashes. Even tried a clean driver install. 2 Climate models also failed (only 87h lost this time). When I suspended and enabled the GPUGrid WU's they showed a fixed progress (~50% and 79%), but the elapsed and remaining time kept ticking over.
This application has requested the Runtime to terminate it in an unusual way. Please contact the application's support team for more information. Kernel not found. Assertion failed: a, file swanlibnv2.cpp, line 59
FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help |
Joined: 25 May 09 Posts: 224 Credit: 34,057,374,498 RAC: 0 |
You would be doing the recipient of the re-sends a favor by server aborting them, even mid-run; it's better than a crash that kills the WU's and takes out other work! Uh Oh. That's me, for the first one. Let's see if Linux can crunch it, fingers crossed. After looking at past attempts, two were with Linux, so I don't hold out much hope. |
Joined: 5 Mar 13 Posts: 348 Credit: 0 RAC: 0 |
Hm Skgiven, unfortunately it is probably not related to the specific jobs. I cannot guarantee it obviously, but from the statistics I see that the jobs have a higher success rate than most others right now on GPUgrid. Also my modifications to Paola's jobs (which these jobs used to be) are essentially none except letting them run longer, meaning they have run before on GPUgrid without too grave errors. I am really sorry for the damage done to the other projects too, it must be very bad when so much work is lost, but at this point it doesn't seem that anyone has encountered similar problems to warrant cancelling the rest. I will definitely keep an eye on it though and if more people report the same problem I might cancel them. |
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
Stefan, I wasn't suggesting that you cancel the batch, just the individual WU's. It was just a grumpy suggestion. If you think these aren't WU specific problems then there is no point cancelling them. I was concerned about the I2HDQ_21R8-SDOERR_2HDQd-1-4-RND9506 WU, as it has already produced 4 different errors on different systems (with varying operating systems, GPU types/generations and probably drivers). While 2 systems have a high error rate, two don't. Stoneageman, it seems you ended up with resends for both of my failures! http://www.gpugrid.net/workunit.php?wuid=4465782 I hope they fare better for you. |
Joined: 5 Mar 13 Posts: 348 Credit: 0 RAC: 0 |
Ah ok now I get it, sorry. I will keep the tab open and check if it fails again in which case it gets the boot. I don't mind much about canceling a single chain if it fails everywhere. Thanks for the heads up. |
Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0 |
unfortunately it is probably not related to the specific jobs. I cannot guarantee it obviously, but from the statistics I see that the jobs have a higher success rate than most others right now on GPUgrid. Also my modifications to Paola's jobs (which these jobs used to be) are essentially none except letting them run longer, meaning they have run before on GPUgrid without too grave errors. For me the SDOERR WUs aren't running quite as well as the NATHAN WUs but are running MUCH better than the NOELIA WUs. I had virtually no problems with the NATHANs, 2 errors with the SDOERR (both recovered and completed). I'd count the problem NOELIAS but probably don't have enough fingers and toes... |
Joined: 5 Mar 13 Posts: 348 Credit: 0 RAC: 0 |
Completed and validated :) Yay! http://www.gpugrid.net/workunit.php?wuid=4467647 |
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
and the other one, http://www.gpugrid.net/workunit.php?wuid=4465782 |
Joined: 17 Feb 13 Posts: 181 Credit: 144,871,276 RAC: 0 |
Darn, these things run and run and run. I2HDQ_8R2-SDOERR_2HDQd-3-4-RND2343_0 using acemdlong version 618 (cuda42) in slot 5 has been running 17:46 with 4:06 to go and 63.8% complete..... GTX 650 Ti with AMD A10 5800K @ 3.8 GHz John |
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 |
John, They take about 11 1/2 hours on my GTX 660s, so your results are in line with that. I2HDQ_28R1-SDOERR_2HDQd-2-4-RND3969_2 11:24:55 (11:21:05) 5/22/2013 1:05:26 AM 5/22/2013 1:14:31 AM 0.629C + 1NV (d1) 99.44 Reported: OK * I have not picked one up yet on my GTX 650 Ti (just started it up again), but it looks like they will squeak in under the deadline. |
Joined: 17 Feb 13 Posts: 181 Credit: 144,871,276 RAC: 0 |
Thanks, Jim. |
Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0 |
Darn, these things run and run and run. John, the SDOERR WUs are averaging a little under 16 hours on my OCed 650 Ti GPUs in Win7-64. (No failed WUs on the 3 OCed cards I might add.) A GTX 460/768 ran a bit over 20 hours (stock clocks). Edit: A new 650 Ti GPU I installed yesterday (non-OCed version) is running its first SDOERR WU now and is at 73% after 12:49 hours. What % GPU usage are you getting? |
Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0 |
Edit: A new 650 Ti GPU I installed yesterday (non-OCed version) is running its first SDOERR WU now and is at 73% after 12:49 hours. Finished in 17:33 on the stock MSI 650 Ti. |
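
The progress figures quoted in this thread can be sanity-checked with a simple linear extrapolation. The sketch below is only an illustration: it assumes a WU progresses at a constant rate, and the helper names (`parse_hhmm`, `format_hhmm`, `projected_total_hours`) are made up for this example and are not part of BOINC or the GPUGrid application.

```python
def parse_hhmm(text):
    """Convert an 'H:MM' elapsed-time string into fractional hours."""
    hours, minutes = text.split(":")
    return int(hours) + int(minutes) / 60


def format_hhmm(hours):
    """Convert fractional hours back into an 'H:MM' string, rounded to the minute."""
    total_minutes = round(hours * 60)
    return f"{total_minutes // 60}:{total_minutes % 60:02d}"


def projected_total_hours(elapsed_hours, fraction_done):
    """Linear estimate of total run time (assumption: constant progress rate)."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_hours / fraction_done


# Stock 650 Ti from the thread: 73% after 12:49.
stock_total = projected_total_hours(parse_hhmm("12:49"), 0.73)
print("stock 650 Ti projected total:", format_hhmm(stock_total))

# John's task: 63.8% after 17:46.
john_elapsed = parse_hhmm("17:46")
john_total = projected_total_hours(john_elapsed, 0.638)
print("John's projected total:", format_hhmm(john_total))
print("John's projected remaining:", format_hhmm(john_total - john_elapsed))
```

Under that assumption, the stock 650 Ti projection comes out at 17:33, matching the reported finish time, while John's task projects to roughly 10 hours remaining rather than the 4:06 shown by the client. The client's remaining-time figure uses its own estimate, so the two need not agree.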
©2025 Universitat Pompeu Fabra