WU failures discussion

Profile skgiven
Volunteer moderator
Volunteer tester
Message 31937 - Posted: 10 Aug 2013, 12:44:05 UTC - in response to Message 31933.  

Long runs (8-12 hours on fastest card) 3,511 1,510 6.13 (0.62 - 19.10) 409

So, 3,511 of what?
Stefan
Project administrator
Project developer
Project tester
Project scientist

Message 31938 - Posted: 10 Aug 2013, 13:07:49 UTC

Message received. We will have a meeting on Monday to see what we can do about the problems.
Profile Retvari Zoltan
Message 31941 - Posted: 10 Aug 2013, 17:38:05 UTC - in response to Message 31937.  

Long runs (8-12 hours on fastest card) 3,511 1,510 6.13 (0.62 - 19.10) 409

So, 3,511 of what?

I've 5 pieces of SANTI_RAP74wt, 1 NATHAN_KIDKIX, 1 NOELIA_KLEBE in progress at the moment.
Out of curiosity I've installed the v326.41 driver on one of my hosts. It has had no errors since then. Not enough workunits have completed on this host yet to say it solves our problems, but I'm keeping my fingers crossed that it does.
The 3rd heatwave of this summer is gone here, so I can crunch all day long.
I'm thinking of a more complicated batch program, which checks the progress and the error messages of the acemd client, and restarts the host automatically if there's no progress.
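
The progress check described above could be sketched roughly like this. It is only an illustration of the idea, not the actual batch program: the class name `ProgressWatchdog`, the function `choose_action`, and the 600-second stall limit are all assumptions, and how the progress value and error messages are read from the acemd client is left to the caller.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProgressWatchdog:
    """Tracks a task's reported progress and flags a stall (hypothetical helper)."""
    stall_seconds: float                   # time without progress that counts as a hang
    last_fraction: Optional[float] = None  # highest progress value seen so far
    last_change: Optional[float] = None    # timestamp of the last progress increase

    def update(self, fraction_done: float, now: float) -> bool:
        """Record one progress sample; return True if the task looks stalled."""
        if self.last_fraction is None or fraction_done > self.last_fraction:
            # First sample, or progress advanced: reset the stall timer.
            self.last_fraction = fraction_done
            self.last_change = now
            return False
        return (now - self.last_change) >= self.stall_seconds


def choose_action(stalled: bool, error_seen: bool) -> str:
    """Decide what the watchdog should do; the action names are made up."""
    if stalled or error_seen:
        return "restart_host"
    return "wait"


# Example with made-up samples: progress stuck at 10% for 600 seconds.
watchdog = ProgressWatchdog(stall_seconds=600)
watchdog.update(0.10, now=0)
watchdog.update(0.10, now=300)
stalled = watchdog.update(0.10, now=600)
action = choose_action(stalled, error_seen=False)
```

Passing the timestamp in explicitly keeps the stall logic separate from the clock, so it can be tried out without waiting for a real hang.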
ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Message 31942 - Posted: 10 Aug 2013, 19:37:31 UTC - in response to Message 31941.  

I'm thinking of a more complicated batch program, which checks the progress and the error messages of the acemd client, and restarts the host automatically if there's no progress.

In an ideal world the app itself would take care of this. But apparently it doesn't work like that yet, or at least not for some errors.

The "assertion failed" error messages we're frequently seeing are actually checks built in by the devs, which spot a critical (non-correctable?) error and halt the app (error out).

MrS
Scanning for our furry friends since Jan 2002
Profile Retvari Zoltan
Message 31944 - Posted: 10 Aug 2013, 22:57:43 UTC - in response to Message 31942.  

I'm thinking of a more complicated batch program, which checks the progress and the error messages of the acemd client, and restarts the host automatically if there's no progress.

In an ideal world the app itself would take care of this.

In an ideal world an app couldn't crash other apps, nor the GPU driver, nor the OS.

But apparently it doesn't work like that yet, or at least not for some errors.

We had similar problems before. Last time, when the application error popped up, it was correctable by a system restart, and after the restart the app could continue from the last checkpoint. Now we are facing a different situation: the app hangs, or runs into an error after the restart, so the user has to click a button to terminate the app. I still have to experiment with this function (the progress check is already working). Luckily I haven't had such an error since then, mostly because I've received only a couple of NATHAN_KIDKIXc22's, which are quite stable. Maybe the other batches were put on hold (or on lower priority) because of our complaints?

The "assertion failed" error messages we're frequently seeing are actually checks built in by the devs, which spot a critical (non-correctable?) error and halt the app (error out).

The devs should respond to this...
Profile Beyond
Message 31946 - Posted: 11 Aug 2013, 1:11:16 UTC - in response to Message 31928.  

I have had four separate system crashes in the last 24hrs. I am now aborting every Noelia task when possible. This I do with a heavy heart. The risk of crashing when running Noelia tasks is no longer acceptable to me. For some reason, this latest batch has been particularly difficult on all cards.

This has also been my experience. The SANTI and NATHAN WUs run fine. The NOELIA WUs have become so problematic that they're not even worth attempting. I'm tired of my GPUs locking up because of poor programming skills.
dyeman

Message 31948 - Posted: 11 Aug 2013, 4:24:08 UTC

Just got another Noelia (9-NOELIA_005p_express-2-3-RND7707) which crashed after about 3000 seconds.

Should I feel satisfaction that none of the previous 8 failures on this WU (including a Quadro) lasted more than 23 seconds?
Profile skgiven
Volunteer moderator
Volunteer tester
Message 31949 - Posted: 11 Aug 2013, 9:57:42 UTC - in response to Message 31948.  

A 50min failure is worse than a 23sec failure as you've wasted more time.
TJ

Message 31952 - Posted: 11 Aug 2013, 15:32:51 UTC

I don't think the problems are easy to solve as there are many different things that play a part.

My rigs have the least problems with Noelia's and the most trouble with Santi's. Others have more problems with Noelia's. Perhaps a card and/or driver issue? People have longer-running WUs, some are suspending for gaming, some have BSODs, some have acemd crashes, reboots, and driver crashes. So all types of problems are present.

My quad-core Vista 32-bit rig with a GTX550Ti and driver 320.49 (the driver that is causing problems according to a lot of crunchers here) did very well. All SR Santi's (they take long on this card, LR more than 30-40 hours, which is why I run only SR on it): 51 in a row without error. So the driver seems to be no issue here.

My rig with driver 310.99, the one promoted as good, has again had Santi SRs error out. Noelia's and Nathan's do fine (mostly) with all the different drivers as well. It's the GTX660. I had a bunch of betas with a few errors. I have now opted out of them to leave them to the Titans and 780s.

My AMD rig with the GTX770 and driver 320.49 has done all types, LR and SR, with no errors yet from the moment it started. It was off for a few days due to heat. Win7 64-bit. 29 so far error free, including Noelia's.

Cards are not OC'd unless by the factory. I have most clocks set a bit lower.

Then we see a small group of crunchers with comments and complaints. Does this mean the rest have no problems? No, it doesn't. They don't look as closely as we do, or do not react on the forum. So the error rate has to be monitored by the servers. We don't hear about all the ones that run well, either. A status page like the one Einstein@home has would be great for seeing all this.
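
As a rough illustration of what server-side error-rate monitoring could look like, here is a minimal sketch. It assumes the server can produce a list of (batch name, succeeded) pairs; the function names, the 25% threshold, and the sample records are all made up for the example and do not describe the project's real server code or real batch results.

```python
from collections import defaultdict


def error_rates(results):
    """Map each batch name to the fraction of its results that failed."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for batch, succeeded in results:
        totals[batch] += 1
        if not succeeded:
            failures[batch] += 1
    return {batch: failures[batch] / totals[batch] for batch in totals}


def flag_batches(rates, threshold=0.25):
    """Return the batches whose failure fraction exceeds the (assumed) threshold."""
    return sorted(batch for batch, rate in rates.items() if rate > threshold)


# Made-up records, purely illustrative: (batch name, task succeeded?)
sample = [
    ("BATCH_A", False),
    ("BATCH_A", True),
    ("BATCH_B", True),
    ("BATCH_B", True),
    ("BATCH_A", False),
]
rates = error_rates(sample)
flagged = flag_batches(rates)
```

Counting per batch rather than per host is a deliberate choice here: it separates "this batch fails everywhere" from "this one machine has a problem", which is the distinction being argued about in this thread.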

Lastly, I want to stress that Noelia's WUs are no worse than the others on my cards, the GTX660 and 770.

We are here to help science and in science you have a lot of trial and error. Not everything can be completely controlled as in a lab as we are the lab.

Greetings from TJ
Profile dskagcommunity
Message 31954 - Posted: 11 Aug 2013, 16:51:25 UTC - in response to Message 31952.  
Last modified: 11 Aug 2013, 16:54:07 UTC



My quad-core Vista 32-bit rig with a GTX550Ti and driver 320.49 (the driver that is causing problems according to a lot of crunchers here) did very well. All SR Santi's (they take long on this card, LR more than 30-40 hours, which is why I run only SR on it): 51 in a row without error. So the driver seems to be no issue here.

Yes, it seems to be a problem with Kepler and above (but it could still be a driver issue then?). Fermi is running fine, except for these "klebes", which need more than 1 GB ;) 570s and 560 Ti 448s (all with 1.28 GB and overclocked) are running fine here too (but with the 310.xx driver). I still use XP32 too.

I now have two workunits with high run times, but only because I didn't notice that the only error I had, two days ago, hung up one of the cards for a while ^^
DSKAG Austria Research Team: http://www.research.dskag.at



TJ

Message 31955 - Posted: 11 Aug 2013, 18:34:27 UTC - in response to Message 31954.  



My quad-core Vista 32-bit rig with a GTX550Ti and driver 320.49 (the driver that is causing problems according to a lot of crunchers here) did very well. All SR Santi's (they take long on this card, LR more than 30-40 hours, which is why I run only SR on it): 51 in a row without error. So the driver seems to be no issue here.

Yes, it seems to be a problem with Kepler and above (but it could still be a driver issue then?). Fermi is running fine, except for these "klebes", which need more than 1 GB ;) 570s and 560 Ti 448s (all with 1.28 GB and overclocked) are running fine here too (but with the 310.xx driver). I still use XP32 too.

I now have two workunits with high run times, but only because I didn't notice that the only error I had, two days ago, hung up one of the cards for a while ^^


I had a Noelia WU (not the Klebe) that used less than 700 MB on the GTX660. It seems that it depends on the WU?
Greetings from TJ
Profile dskagcommunity
Message 31956 - Posted: 11 Aug 2013, 19:48:53 UTC

For sure, every scientist has several simulations, which you can recognize by the name. So they have different requirements.
DSKAG Austria Research Team: http://www.research.dskag.at



Profile Retvari Zoltan
Message 31957 - Posted: 11 Aug 2013, 21:55:02 UTC

I've made a rather particular observation:
One of my hosts has a WLAN connection (through USB), since the integrated LAN chip has no WinXP drivers. This host usually loses the WLAN connection daily, so I've made a scheduled batch program which checks the WLAN connection and, if necessary, restarts the WLAN connection or the host to restore it. When the NOELIA tasks showed up, the WLAN connection's reliability dropped so much that I bought another USB-WLAN adapter based on a different chip to fix it, but the new one turned out to be as unreliable as the old one. Since the NOELIAs have gone, this host hasn't had a single occurrence of such a WLAN connectivity failure... That is strange.
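
For readers who want to try something similar, the escalation logic of such a connection check might look like the sketch below. This is not the original batch program: the function names, the retry limit, and the idea of probing a TCP port are assumptions chosen for illustration, and the actual commands to restart the adapter or the host are deliberately left out.

```python
import socket


def link_is_up(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def next_step(link_ok: bool, wlan_restarts_tried: int, max_wlan_restarts: int = 2) -> str:
    """Pick the next recovery action; the action names are hypothetical."""
    if link_ok:
        return "ok"
    if wlan_restarts_tried < max_wlan_restarts:
        # Cheaper fix first: restart only the WLAN connection.
        return "restart_wlan"
    # The adapter restarts did not help, so fall back to rebooting the host.
    return "restart_host"
```

A scheduled job would call `link_is_up` against some address it expects to reach (the router, for example), pass the result to `next_step`, and keep the restart counter between runs.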
Profile Beyond
Message 31958 - Posted: 11 Aug 2013, 22:17:13 UTC - in response to Message 31952.  

Lastly, I want to stress that Noelia's WUs are no worse than the others on my cards, the GTX660 and 770.

You've only run 1 of the problem NOELIA long runs and while it completed it took a very long time for a 770. A sample of 1 doesn't mean a lot, but maybe a 770 can run these WUs. Maybe not. Try running them on any 1GB GPU.
Vagelis Giannadakis

Message 31959 - Posted: 12 Aug 2013, 12:26:50 UTC - in response to Message 31958.  

Lastly, I want to stress that Noelia's WUs are no worse than the others on my cards, the GTX660 and 770.

You've only run 1 of the problem NOELIA long runs and while it completed it took a very long time for a 770. A sample of 1 doesn't mean a lot, but maybe a 770 can run these WUs. Maybe not. Try running them on any 1GB GPU.

My 1GB GTX 650Ti has successfully run a couple of the problematic NOELIAs. They took ~45h to complete, but they sure did complete without error.

When these NOELIAs first appeared, I tried to avoid them. Then a heatwave came along and they were ideal with the relatively low GPU and CPU load.
Stefan
Project administrator
Project developer
Project tester
Project scientist

Message 31965 - Posted: 12 Aug 2013, 17:08:19 UTC
Last modified: 12 Aug 2013, 21:32:10 UTC

After the meeting there are some updates on the subject:

1. The Noelia WU's were put on low priority due to the crashes. You should be getting mostly Nate's WU's on long now.
2. Now for every batch we send out we decided we will make a thread in the News section with the exact batch name. If someone forgets to do that send a message quick and I will remind them :D These threads will also contain information about the specific batch.
3. There are plans to test features (maybe on a new project?) such as adding hardware requirements to WU's so that they only run on specified hardware and thus prevent unnecessary crashes.
4. Shorter WU's with faster turnaround might make their appearance also in the following months.
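
To make point 3 a little more concrete, a hardware-requirement check could be sketched as below. This only illustrates the idea and is not a description of any planned scheduler feature: the `Host` and `WorkUnit` fields, the names, and the numbers in the example are assumptions.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Host:
    gpu_memory_mb: int
    compute_capability: Tuple[int, int]  # (major, minor) of the GPU


@dataclass(frozen=True)
class WorkUnit:
    name: str
    min_gpu_memory_mb: int = 0
    min_compute_capability: Tuple[int, int] = (0, 0)


def eligible(wu: WorkUnit, host: Host) -> bool:
    """True if the host meets every requirement attached to the work unit."""
    return (
        host.gpu_memory_mb >= wu.min_gpu_memory_mb
        and host.compute_capability >= wu.min_compute_capability
    )


# Hypothetical example: a WU that needs a bit more than 1 GB of GPU memory.
big_wu = WorkUnit("EXAMPLE_BIG", min_gpu_memory_mb=1100)
card_1gb = Host(gpu_memory_mb=1024, compute_capability=(3, 0))
card_2gb = Host(gpu_memory_mb=2048, compute_capability=(3, 0))
send_to_1gb = eligible(big_wu, card_1gb)
send_to_2gb = eligible(big_wu, card_2gb)
```

With a check like this, a memory-hungry task would simply never be sent to a 1 GB card, instead of being sent and then failing or crawling.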

I hope this solves the most immediate problems and adds some interesting stuff for the future.
Profile dskagcommunity
Message 31968 - Posted: 12 Aug 2013, 18:45:41 UTC

Thx for the updates, sound good.
DSKAG Austria Research Team: http://www.research.dskag.at



Profile Stoneageman
Message 31969 - Posted: 12 Aug 2013, 18:52:39 UTC

Welcome news, but does (1) mean that these Noelia Wu's will eventually get released by the server when Nates start drying up? What % of Noelia WU's are left to crunch?
Stefan
Project administrator
Project developer
Project tester
Project scientist

Message 31971 - Posted: 12 Aug 2013, 19:57:23 UTC - in response to Message 31969.  
Last modified: 12 Aug 2013, 20:03:03 UTC

Hm there are still quite a few left. I do not know why Gianni didn't outright cancel them. I will ask him tomorrow. Maybe because Noelia is on holidays. So yeah I guess they would come if Nate's dry up, but I think he suggested that he would keep it filled.(?)
In any case the error occurrences should go down. I remember StoneageMan that you said in another thread that they caused you system crashes no? That's weird, because until now no one else reported such problems. I mean they crash but only the WU's not the machine.
neilp62

Message 31973 - Posted: 12 Aug 2013, 20:42:05 UTC - in response to Message 31971.  

That's weird, because until now no one else reported such problems. I mean they crash but only the WU's not the machine.


Stefan, I agree with TJ that a lot of the severe platform crashes may have gone unreported. I am within the group of crunchers he described that don't look as closely or react on the forum as a few more active forum participants do. I've had 11 Noelia WU failures in the last 14 days, and most of these WUs also failed for several other users. In two instances, a single WU failure seems to have induced failure in other active WUs on my platform. One failure left my platform's GPU that is attached to the display in a state that required a full power down before Windows7 would start properly.

©2025 Universitat Pompeu Fabra