Message boards :
Number crunching :
WU failures discussion
| Author | Message |
|---|---|
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
Long runs (8-12 hours on fastest card) 3,511 1,510 6.13 (0.62 - 19.10) 409 So, 3,511 of what? |
|
Joined: 5 Mar 13 Posts: 348 Credit: 0 RAC: 0 |
Message received. We will have a meeting on Monday to see what we can do about the problems. |
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
Long runs (8-12 hours on fastest card) 3,511 1,510 6.13 (0.62 - 19.10) 409 I've 5 pieces of SANTI_RAP74wt, 1 NATHAN_KIDKIX, 1 NOELIA_KLEBE in progress at the moment. Out of curiosity I've installed the v326.41 driver on one of my hosts. It had no errors since then, but there haven't been enough workunits completed on this host to say it solves our problems, but I keep my fingers crossed that it does. The 3rd heatwave of this summer is gone here, so I can crunch all day long. I'm thinking of a more complicated batch program, which checks the progress and the error messages of the acemd client, and restarts the host automatically if there's no progress. |
|
Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 |
I'm thinking of a more complicated batch program, which checks the progress and the error messages of the acemd client, and restarts the host automatically if there's no progress. In an ideal world the app itself would take care of this. But apparently it doesn't work like that yet, or at least for some errors. The "assertion failed" error messages we're frequently seeing are actually checks built in by the devs which spotted a critical (non-correctable?) error and halt the app (error out). MrS Scanning for our furry friends since Jan 2002 |
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
I'm thinking of a more complicated batch program, which checks the progress and the error messages of the acemd client, and restarts the host automatically if there's no progress. In an ideal world an app couldn't crash other apps, nor the GPU driver, nor the OS. But apparently it doesn't work like that yet, or at least for some errors. We had similar problems before. Last time, when the application error popped up it was correctable by a system restart, and after the restart the app could continue from the last checkpoint. Now we are facing a different situation: the app hangs, or runs into an error after the restart, so the user has to click a button to terminate the app, and I have to experiment with this function (the progress check is already working). Luckily I haven't had such an error since then, mostly because I've received only a couple of NATHAN_KIDKIXc22's, which are quite stable. Maybe the other batches were put on hold (or on lower priority) because of our complaints? The "assertion failed" error messages we're frequently seeing are actually checks built in by the devs which spotted a critical (non-correctable?) error and halt the app (error out). The devs should respond to this... |
Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0
|
I have had four separate system crashes in the last 24hrs. I am now aborting every Noelia task when possible. This I do with a heavy heart. The risk of crashing when running Noelia tasks is no longer acceptable to me. For some reason, this latest batch has been particularly difficult on all cards. This has also been my experience. The SANTI and NATHAN WUs run fine. The NOELIA WUs have become so problematic that they're not even worth attempting. I'm tired of my GPUs locking up because of poor programming skills. |
|
Joined: 21 Mar 09 Posts: 35 Credit: 591,434,551 RAC: 0
|
Just got another Noelia (9-NOELIA_005p_express-2-3-RND7707) which crashed after about 3000 seconds. Should I feel satisfaction that none of the previous 8 failures on this WU (including a Quadro) lasted more than 23 seconds? |
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
A 50min failure is worse than a 23sec failure as you've wasted more time. |
|
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0
|
I don't think the problems are easy to solve, as many different things play a part. My rigs have the fewest problems with Noelia's and the most trouble with Santi's. Others have more problems with Noelia's. Perhaps a card and/or driver issue? People have longer-running WUs, some are suspending for gaming, some have BSODs, some have acemd crashes, reboots, and driver crashes. So all types of problems are present. My quad-core Vista 32-bit rig with a GTX550Ti and driver 320.49 (the driver that is causing problems according to a lot of crunchers here) did very well: all SR Santi's (they take long on this card, LR more than 30-40 hours, which is why I run only SR on it), 51 in a row without error. So the driver seems to be no issue here. My rig with the recommended "good" driver 310.99 has again had Santi SRs error out. Noelia's and Nathan's do fine (mostly) with all the different drivers as well. It's the GTX660. I had a bunch of betas with a few errors. I have now opted out of them to leave them to the Titans and 780s. My AMD rig with the GTX770 and driver 320.49 has done all types, LR and SR, with no errors yet since it started. It was off for a few days due to heat. Win7 64-bit. 29 so far, error free, including Noelia's. The cards are not overclocked unless by the factory. I have most clocks set a bit lower. Then we see a small group of crunchers with comments and complaints. Does this mean the rest have no problems? No, it doesn't. They don't look as closely as we do, or do not react on the forum. So the error rate has to be monitored by the servers. We don't hear about all the ones that run well either. A status page like the one Einstein@Home has would be great for seeing all this. Finally, I want to stress that Noelia's WUs are no worse than the others on my cards, the GTX660 and 770. We are here to help science, and in science you have a lot of trial and error. Not everything can be completely controlled as in a lab, as we are the lab. Greetings from TJ |
dskagcommunity Joined: 28 Apr 11 Posts: 463 Credit: 958,266,958 RAC: 25
|
Yes, it seems to be a problem with Kepler and above (but it could still be a driver issue then?). Fermi is running fine, except for these "klebes", which need more than 1GB ;) 570s and 560Ti 448s (all with 1.28GB and overclocked) are running fine here too (but with the 310.xx driver). I still use XP32 too. I now have two workunits with high run times, but only because I didn't notice that the only error I had, two days ago, hung up one of the cards a bit ^^ DSKAG Austria Research Team: http://www.research.dskag.at
|
|
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0
|
I had a Noelia WU (not the Klebe) that used less than 700MB on the GTX660. It seems to depend on the WU? Greetings from TJ |
dskagcommunity Joined: 28 Apr 11 Posts: 463 Credit: 958,266,958 RAC: 25
|
For sure, every scientist has several simulations, which you can recognize by the name. So they have different requirements. DSKAG Austria Research Team: http://www.research.dskag.at
|
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
I've made a rather peculiar observation: One of my hosts has a WLAN connection (through USB), since the integrated LAN chip has no WinXP drivers. This host usually loses the WLAN connection daily, so I've made a scheduled batch program which checks the WLAN connection, and restarts the WLAN connection or the host if that is necessary to restore it. When the NOELIA tasks showed up, the WLAN connection's reliability dropped so much that I bought another USB-WLAN adapter based on a different chip to fix it, but the new one turned out to be as unreliable as the old one. Since the NOELIAs have gone, this host hasn't had a single occurrence of such a WLAN connectivity failure.... That is strange. |
Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0
|
Finally, I want to stress that Noelia's WUs are no worse than the others on my cards, the GTX660 and 770. You've only run 1 of the problem NOELIA long runs, and while it completed, it took a very long time for a 770. A sample of 1 doesn't mean a lot, but maybe a 770 can run these WUs. Maybe not. Try running them on any 1GB GPU. |
|
Joined: 5 May 13 Posts: 187 Credit: 349,254,454 RAC: 0
|
Finally, I want to stress that Noelia's WUs are no worse than the others on my cards, the GTX660 and 770. My 1GB GTX 650Ti has successfully run a couple of the problematic NOELIAs. They took ~45h to complete, but they did complete without error. When these NOELIAs first appeared, I tried to avoid them. Then a heatwave came along and they were ideal, with their relatively low GPU and CPU load.
|
|
Joined: 5 Mar 13 Posts: 348 Credit: 0 RAC: 0 |
After the meeting there are some updates on the subject: 1. The Noelia WUs were put on low priority due to the crashes. You should be getting mostly Nate's WUs on long now. 2. From now on, for every batch we send out we will make a thread in the News section with the exact batch name. If someone forgets to do that, send a message quickly and I will remind them :D These threads will also contain information about the specific batch. 3. There are plans to test features (maybe on a new project?) such as adding hardware requirements to WUs so that they only run on specified hardware and thus prevent unnecessary crashes. 4. Shorter WUs with faster turnaround might also make their appearance in the following months. I hope this solves the most immediate problems and adds some interesting stuff for the future. |
dskagcommunity Joined: 28 Apr 11 Posts: 463 Credit: 958,266,958 RAC: 25
|
|
Stoneageman Joined: 25 May 09 Posts: 224 Credit: 34,057,374,498 RAC: 0
|
Welcome news, but does (1) mean that these Noelia WUs will eventually get released by the server when Nate's start drying up? What % of Noelia WUs are left to crunch? |
|
Joined: 5 Mar 13 Posts: 348 Credit: 0 RAC: 0 |
Hm, there are still quite a few left. I do not know why Gianni didn't outright cancel them. I will ask him tomorrow. Maybe because Noelia is on holiday. So yeah, I guess they would come if Nate's dry up, but I think he suggested that he would keep it filled.(?) In any case the error occurrences should go down. I remember, StoneageMan, that you said in another thread that they caused you system crashes, no? That's weird, because until now no one else reported such problems. I mean they crash, but only the WUs, not the machine. |
|
Joined: 23 Nov 10 Posts: 14 Credit: 8,017,535,732 RAC: 0
|
That's weird, because until now no one else reported such problems. I mean they crash but only the WU's not the machine. Stefan, I agree with TJ that a lot of the severe platform crashes may have gone unreported. I am within the group of crunchers he described that don't look as closely or react on the forum as a few more active forum participants do. I've had 11 Noelia WU failures in the last 14 days, and most of these WUs also failed for several other users. In two instances, a single WU failure seems to have induced failure in other active WUs on my platform. One failure left my platform's GPU that is attached to the display in a state that required a full power down before Windows 7 would start properly. |
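
Retvari Zoltan's watchdog idea from earlier in the thread (check the progress of the running tasks and react when there is none) could be sketched roughly as below. This is only an illustration, not his batch program: it is written in Python rather than as a batch file, it assumes the progress text looks like the output of `boinccmd --get_tasks` (a `name:` line followed by a `fraction done:` line for each task), and `parse_progress` and `StallDetector` are hypothetical names invented for this example. What to do with a stalled task (abort it, restart the host) is deliberately left out.

```python
import re
from typing import Dict, List, Optional, Tuple

# Assumption: each task is reported with a "name:" line and, somewhere below it,
# a "fraction done:" line. Lines such as "WU name:" are ignored on purpose.
TASK_NAME = re.compile(r"^\s*name:\s*(\S+)\s*$")
FRACTION = re.compile(r"^\s*fraction done:\s*([0-9.]+)\s*$")


def parse_progress(report: str) -> Dict[str, float]:
    """Map each task name in the report to its reported fraction done."""
    progress: Dict[str, float] = {}
    current: Optional[str] = None
    for line in report.splitlines():
        name = TASK_NAME.match(line)
        if name:
            current = name.group(1)
            continue
        frac = FRACTION.match(line)
        if frac and current is not None:
            progress[current] = float(frac.group(1))
    return progress


class StallDetector:
    """Remember when each task's progress last changed and flag idle tasks."""

    def __init__(self, max_idle_seconds: float) -> None:
        self.max_idle_seconds = max_idle_seconds
        # task name -> (last seen fraction, time at which it last changed)
        self._last: Dict[str, Tuple[float, float]] = {}

    def update(self, progress: Dict[str, float], now: float) -> List[str]:
        """Record a new sample taken at time `now`; return the stalled task names."""
        stalled: List[str] = []
        for name, fraction in progress.items():
            previous = self._last.get(name)
            # Any change counts as progress: a task that restarts from a
            # checkpoint may legitimately move backwards.
            if previous is None or fraction != previous[0]:
                self._last[name] = (fraction, now)
            elif now - previous[1] >= self.max_idle_seconds:
                stalled.append(name)
        # Forget tasks that are no longer reported (finished or aborted).
        for name in list(self._last):
            if name not in progress:
                del self._last[name]
        return sorted(stalled)
```

A scheduled job would call `parse_progress` on a fresh report every few minutes and pass the result, together with the current time, to the same `StallDetector` instance. The idle threshold needs to be generous, since the reported fraction may not change between two closely spaced samples even on a healthy task.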
©2025 Universitat Pompeu Fabra