Message boards : Number crunching : WU failures discussion
**Beyond** · Joined: 23 Nov 08 · Posts: 1112 · Credit: 6,162,416,256 · RAC: 0

I don't see a thread about the current NOELIAs, but they are erroring on all my rigs. Please? They locked the Noelia thread, too many complaints I'd guess. Here are some of the NOELIA WUs I've gotten lately:

http://www.gpugrid.net/workunit.php?wuid=4629901
http://www.gpugrid.net/workunit.php?wuid=4630935
http://www.gpugrid.net/workunit.php?wuid=4630045
http://www.gpugrid.net/workunit.php?wuid=4631659

Notice they've all been sent out 8 times with no successes and are now marked "Too many errors (may have bug)". That's the understatement of the month. It's getting so that when a NOELIA WU downloads, the Daleks in the basement get all excited and start yelling "EXTERMINATE, EXTERMINATE!". Gets kind of noisy sometimes...
**Beyond** · Joined: 23 Nov 08 · Posts: 1112 · Credit: 6,162,416,256 · RAC: 0

> The last 5 Noelia's I got finished all without error. I like to take them all...

You must have forgotten this one: http://www.gpugrid.net/workunit.php?wuid=4633344
Joined: 5 Mar 13 · Posts: 348 · Credit: 0 · RAC: 0

I explained the reasons why I locked Noelia's thread in the last post of that thread. You can look it up.

Otherwise, the failure rate of the WUs is at an acceptable level. Unfortunately she is simulating more complex stuff (bigger systems) than Santi's WUs right now, so they do have more failures than Santi's.
**Retvari Zoltan** · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0

> I explained the reasons why I locked Noelia's thread in the last post of that thread. You can look it up.

You can avoid crunchers complaining in the news thread by opening a new thread (named after the batch) for each and every new batch of workunits, in a specialized topic (named "Workunit Batch Problems" or similar). It's too bad that there is no such topic yet. Usually we notice new batches only when they start to fail on our hosts - obviously the threads for problem-free batches would stay empty, but it is good to know where to look (or post) in the forums if someone experiences failures with a batch, and how many others have problems with that batch. In my opinion it is your job to start such threads before we even receive the first workunits of a new batch.

> Otherwise, the failure rate of the WUs is at an acceptable level. Unfortunately she is simulating more complex stuff (bigger systems) than Santi's WUs right now, so they do have more failures than Santi's.

I think you (the staff) and we (the crunchers) have different ideas about what an acceptable failure rate is. There are some factors which make a host more prone to errors, making our perception of the failure rate worse than yours:
- hosts running Windows (especially 7 and 8)
- hosts running multiple GPUs
- overclocked (even factory-overclocked) GPUs (it is more complicated to overclock a Kepler than a Fermi-based card)
- hosts with updated drivers
- hosts which had an error and weren't rebooted afterwards
- we do like to pay the cost of crunching, but we don't like to pay the cost of failures

I am aware that none of the above is directly your concern, so it would be nice to have a topic where we can discuss our problems with each other without disturbing the news (or other) threads.
Joined: 17 Aug 08 · Posts: 2705 · Credit: 1,311,122,549 · RAC: 0

> You can avoid crunchers complaining in the news thread by opening a new thread (named after the batch) for each and every new batch of workunits, in a specialized topic (named "Workunit Batch Problems" or similar). It's too bad that there is no such topic yet. Usually we notice new batches only when they start to fail on our hosts - obviously the threads for problem-free batches would stay empty, but it is good to know where to look (or post) in the forums if someone experiences failures with a batch, and how many others have problems with that batch. In my opinion it is your job to start such threads before we even receive the first workunits of a new batch.

That's a very good suggestion and goes hand in hand with what was also requested lately: some information about runs / batches. What is being simulated, what are the special requirements? That would be of practical use and make us feel more involved. It should be worth the time to set these threads up.

MrS
Scanning for our furry friends since Jan 2002
Joined: 5 Mar 13 · Posts: 348 · Credit: 0 · RAC: 0

Both very nice ideas, actually. I will pass along that everyone should create a new "problems" thread when they send out a batch.

Yes, you do make a point on the failure rate. It's a big discussion though, so maybe it should also be moved. My current problem is the lack of a specific subforum for this. I will see if I can somehow get a new one made.
**Beyond** · Joined: 23 Nov 08 · Posts: 1112 · Credit: 6,162,416,256 · RAC: 0

> I explained the reasons why I locked Noelia's thread in the last post of that thread. You can look it up.

I read it. It was locked but not moved. I don't agree that the failure rate of Noelia WUs is acceptable. They're awful and getting worse. There should be a separate queue for WUs that require more than 1 GB, or you should detect the GPU's memory size and send those WUs only to machines that can handle them.
**skgiven** · Joined: 23 Apr 09 · Posts: 3968 · Credit: 1,995,359,260 · RAC: 0

> ...Yes, you do make a point on the failure rate. It's a big discussion though, so maybe it should also be moved.

One thing a project/app failure rate cannot identify is the fact that individuals (or systems) with high failure rates leave the project and crunch elsewhere. Thus the failure rates may look about the same. You have to look at the number of active crunchers, work turnover... Even those that stay with the project move to the short queue to crunch different WU's (despite the lesser credit) - to avoid system failures/crashes/restarts/loops/failed WU's/failures at other projects...

The bottom line is that you are reliant on crunchers. If crunchers are not happy, some just leave without saying a word. The more vocal complain, make suggestions, and if nothing is done then they quit the project. This is the very reason why some people don't stick around, and that has been identified as a bigger problem for the project. Even if you think the failure rate isn't that bad, to the individual cruncher failing work is terrible, and if it crashes drivers or the system, causes restarts and loses other work, it's fatal. This imbalance of opinion needs to be redressed.

The problems crunchers face with some WU types have not been dealt with to their satisfaction. It's not that suggestions haven't been made by the researchers, it's that they have been slow in coming and communication has not been great. Problems that exist for months, despite suggested workarounds, are like an unhealed wound. Some of the crunchers' suggestions might be a pain for you to deal with, but they would keep crunchers happy.

Number Crunching might be an appropriate sub-forum, as it's 'about performance', though the name 'Crunching' would have been sufficient. Also, Wish List and other threads already contain many suggestions (that were not implemented, often due to staff shortages, funding or technical limitations).

FAQ's · HOW TO: Opt out of Beta Tests · Ask for Help
Joined: 28 Jul 12 · Posts: 819 · Credit: 1,591,285,971 · RAC: 0

> One thing a project/app failure rate cannot identify is the fact that individuals (or systems) with high failure rates leave the project and crunch elsewhere. Thus the failure rates may look about the same. You have to look at the number of active crunchers, work turnover... Even those that stay with the project move to the short queue to crunch different WU's (despite the lesser credit) - to avoid system failures/crashes/restarts/loops/failed WU's/failures at other projects...

Excellent point. If I leave on a trip, I don't want my machine to seize up or BSOD while I am gone. I work on a variety of BOINC projects, and they would all be affected. So even if I am willing to put up with a failure on one, I don't want it to take down all the others too.
Joined: 5 Mar 13 · Posts: 348 · Credit: 0 · RAC: 0

Let's see if I know how to move threads :)

Edit: OK, I found out how, but it's no fun. I have to do it post by post.

So, on the subject: we take system crashes very seriously, because it's obviously very difficult to have to reboot and lose all work from other projects, especially since it's volunteer work. So when there are systematic system crashes we cancel the batches, as we have done before. I haven't heard of any systematic ones recently, though, so if I missed any please inform me.
Joined: 17 Aug 08 · Posts: 2705 · Credit: 1,311,122,549 · RAC: 0

At Einstein@Home they've got a subforum group "Help Desk". One subforum there is called "Problems and Bug Reports", and whenever a new search / batch is started they post a feedback thread there. Most of the feedback still ends up in the news thread, but at least the mechanism is there :) Granted, their searches and batches don't change as often as ours, but I still think something like this would be appropriate.

Regarding system crashes: I'm not aware of this currently happening. But failure rates of certain tasks on certain hosts seem really bad. Other hosts seem fine, though, so the average failure rate may be far lower than individual failure rates. And lacking statistically significant hard numbers, I don't think we can say much about the reasons for this behaviour.

I know Noelia is pushing the boundaries with the new features and more complex systems she simulates. But I feel part of this could be handled better. For example the GPU memory requirements: BOINC knows how much memory cards have and has a mechanism to avoid overloading GPU memory. But it has to be told to use this, i.e. the expected maximum GPU memory needed has to be included in the WUs. From what we're seeing this is not being done (cards with less memory outright failing tasks or becoming very slow). This should easily be avoided!

And some of her tasks which require more memory also take longer. Longer than we'd like, as even mid-range Keplers are in trouble making the 24h deadline. That's the current generation of GPUs; we can hardly buy anything faster without spending really big bucks. And the probability of failures probably increases with runtime. These should be put into smaller work chunks, or you should establish a "super queue" for such long tasks. But not many people would be able to participate there, making it pretty much redundant.

Then there's the topic of information, which I touched upon before. I think it would really help to lower frustration levels if we'd get more feedback:
- why the current WUs?
- any special requirements, like driver versions or amount of memory
- we're getting xx% bonus credits for this, since it's "risk production" (*)
- once finished: what did you learn from this batch? (might sometimes be difficult to answer, or not look all that great.. but if there's anything to talk about, do so!)

(*) The credit bonus for the long-runs queue used to be static, as far as I know. You could add a dynamic bit based on the average failure rate of batches.. though I don't know how much they actually differ.

Some of this was actually directed towards Noelia. But I feel you've got an open ear and we're in a broad discussion here anyway.. so I hope it's OK to just tell you about these thoughts of mine and hope they'll be passed on :)

MrS
Scanning for our furry friends since Jan 2002
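To make the memory point above concrete, here is a minimal sketch of the gating idea in Python - an illustration under stated assumptions, not BOINC's actual scheduler code or the project's configuration. BOINC clients do report how much RAM each GPU has, and the server can carry a per-batch minimum GPU RAM requirement; the class names and fields below are hypothetical stand-ins for that mechanism.

```python
# Illustrative sketch only: shows the gating logic described above, i.e. only
# dispatching a workunit to hosts whose GPU has enough memory. The names here
# (Host, Workunit, required_gpu_ram_mb) are hypothetical, not BOINC's API.
from dataclasses import dataclass

@dataclass
class Host:
    hostid: int
    gpu_ram_mb: int           # reported by the client, e.g. 1024 for a 1 GB card

@dataclass
class Workunit:
    name: str
    required_gpu_ram_mb: int  # would come from the batch / plan-class definition

def can_send(wu: Workunit, host: Host, headroom: float = 0.9) -> bool:
    """Dispatch only if the card covers the declared requirement,
    keeping some headroom for the driver and the desktop."""
    return host.gpu_ram_mb * headroom >= wu.required_gpu_ram_mb

if __name__ == "__main__":
    noelia = Workunit("NOELIA_klebe_example", required_gpu_ram_mb=1300)
    small_card = Host(hostid=1, gpu_ram_mb=1024)   # e.g. a 1 GB 560 Ti
    big_card = Host(hostid=2, gpu_ram_mb=2048)     # e.g. a 2 GB GTX 660
    print(can_send(noelia, small_card))  # False -> WU not dispatched
    print(can_send(noelia, big_card))    # True  -> WU can be dispatched
```

The design point is simply that the requirement has to be declared per batch; once it is, 1 GB cards like the 560 Ti mentioned later in this thread would never receive the oversized NOELIA units in the first place.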
Joined: 21 Mar 09 · Posts: 35 · Credit: 591,434,551 · RAC: 0

When I checked this morning, out of 3 WUs one had completed (Santi), one had failed (Noelia - the "ACEMD terminated abnormally" dialog was displayed), while a third had managed 5% progress in nearly three hours (also Noelia). This was one with very low memory controller use - it also crashed as soon as I tried to suspend and resume it.

Looking at the several failures I have had in the previous 3 days, most if not all of them had one or more failures prior to mine. Of the 3 WUs I am currently processing, two are resends after a previous failure (one Noelia and one Santi). The Noelia is one of the slow 290px ones. It had previously crashed after 16 hours of processing on a 660 Ti. It is now 9% complete after nearly 4 hours.

So the failure rate on current WUs for me is more than 50%. It is happening on all of my cards (two 660s and one 660 Ti - I've already moved the 560 Ti to Folding because of the performance impact of running some Noelia WUs on 1 GB cards). Of the WUs I have been able to complete successfully in the last 3 days, only one has been a Noelia, and on that one there had been two previous failures. This is not an acceptable failure rate as far as I am concerned.
**Stoneageman** · Joined: 25 May 09 · Posts: 224 · Credit: 34,057,374,498 · RAC: 0

I have had four separate system crashes in the last 24hrs. I am now aborting every Noelia task when possible. This I do with a heavy heart. The risk of crashing when running Noelia tasks is no longer acceptable to me. For some reason, this latest batch has been particularly difficult on all cards.
**skgiven** · Joined: 23 Apr 09 · Posts: 3968 · Credit: 1,995,359,260 · RAC: 0

As people have pointed out, some WU's take way too long (days on the top GPU's). This would not be detected by a 'failure-rate' model; you would also have to look at run times. The current output is only 846,081 GigaFLOPS and the number of active crunchers is quite low, even for the Northern summer months.

You probably have enough data to see what the best drivers are for each WU type and each GPU type. Query for it and share the results. What about the latest beta drivers? If you recompiled for those, would it also improve performance on other card types, as well as on Titans and 780s?

Queues: on the Project Status Page, http://www.gpugrid.net/server_status.php, there are 3 queues:
- Short runs (2-3 hours on fastest card): 1,320 / 928 / 2.56 (0.22 - 10.90) / 404
- ACEMD beta version: 0 / 10 / 0.02 (0.01 - 0.06) / 24
- Long runs (8-12 hours on fastest card): 3,493 / 1,533 / 5.96 (0.80 - 27.02) / 414

On our preferences page, http://www.gpugrid.net/prefs.php?subset=project, there are four:
- ACEMD short runs (2-3 hours on fastest card) for CUDA 4.2
- ACEMD short runs (2-3 hours on fastest card) for CUDA 3.1
- ACEMD beta
- ACEMD long runs (8-12 hours on fastest GPU) for CUDA 4.2

The 3.1 app is now deprecated. That queue slot could be used for the more troublesome WU's - the NOELIA and first-batch runs - with the credit improved accordingly, either statically (by +75%) or dynamically as MrS suggests.

Also, the amount of work we do for different research (paper/presentation/thesis) is not visible. It's about time it was, so we can determine how much we contribute. After all, the wee badges are based on this. An "Active Research" section (put in the MyAccount area) could read:

Cancer - Top 1% (11th/6582) - contribution ongoing - towards Nathaniel (who might want to mention some of the work he does and add a poster/presentation...)

FAQ's · HOW TO: Opt out of Beta Tests · Ask for Help
Joined: 5 May 13 · Posts: 187 · Credit: 349,254,454 · RAC: 0

Yesterday evening I discovered that a NOELIA_klebe had been running for ~13 hours and was stuck at 0%. I suspended / resumed it, but that didn't make it progress. The GPU was also idling. Strange thing: the acemd process was consuming one full CPU core...

I aborted the NOELIA, then got a SANTI_RAP. These generally go well, so I was glad I got one, but my gladness didn't last long; pretty soon I discovered the same thing was happening: WU not progressing, GPU idling, one CPU core at 100%. Suspending / resuming the task didn't do anything.

I killed BOINC and restarted it, and it got into a state where it didn't start any tasks, did not respond to GUI RPC or even the command line (boinccmd), and consumed one full CPU core (the boinc executable). Several BOINC restarts didn't help, nor did rebooting the machine. I ended up reinstalling BOINC, which finally fixed this. All this on this machine - the OS is Ubuntu 12.04 64-bit.
**skgiven** · Joined: 23 Apr 09 · Posts: 3968 · Credit: 1,995,359,260 · RAC: 0

The acemd process kept polling the CPU, and so appeared to consume one full CPU core. I've experienced something similar on Ubuntu 13.04 (304.88 drivers). I shut down, powered the PSU off for a few minutes, and when I started up things ran normally. The thing is, I was running a POEM WU on the card! >30h too.

It's long been the case that GPUGrid has bigger WU's than other GPU projects. This inherently means a higher task error/problem rate. If a WU fails after 6h, you lose 6h of work and 1 WU at GPUGrid. At POEM 18 WU's would need to fail to reach 6h, and several hundred at MW. Einstein's WU's are similar to the short WU's here. Ditto for Albert. However, if a WU runs perpetually it's the same level of problem for any project. While some people don't like the idea of hard cut-off times, I've had the misfortune to run numerous tasks without progression on many different systems over the years, sometimes for several hundred hours before I noticed. Hal was desperate.

FAQ's · HOW TO: Opt out of Beta Tests · Ask for Help
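Until something like a hard cut-off exists server-side, a cruncher-side watchdog can at least surface tasks that stop progressing, as described above. The sketch below is only an illustration: it shells out to `boinccmd --get_tasks` (a real BOINC command) and flags tasks whose reported "fraction done" has not moved for a few hours. The output parsing is best-effort and the thresholds are arbitrary, so adjust both for your client version and preferences; nothing is suspended or aborted automatically.

```python
# Watchdog sketch: flag BOINC tasks whose "fraction done" has not moved for
# STALL_HOURS. Parsing of `boinccmd --get_tasks` output is best-effort and
# may need adjusting for your client version; nothing is aborted automatically.
import subprocess
import time

STALL_HOURS = 3          # how long a task may sit at the same progress
POLL_SECONDS = 600       # check every 10 minutes

def get_progress():
    """Return {task_name: fraction_done} parsed from boinccmd output."""
    out = subprocess.run(["boinccmd", "--get_tasks"],
                         capture_output=True, text=True, check=True).stdout
    progress, name = {}, None
    for line in out.splitlines():
        line = line.strip()
        if line.startswith("name:"):
            name = line.split("name:", 1)[1].strip()
        elif line.startswith("fraction done:") and name:
            progress[name] = float(line.split(":", 1)[1])
    return progress

def main():
    last_change = {}   # task -> (fraction done, timestamp of last change)
    while True:
        now = time.time()
        for task, frac in get_progress().items():
            prev = last_change.get(task)
            if prev is None or frac > prev[0]:
                last_change[task] = (frac, now)
            elif now - prev[1] > STALL_HOURS * 3600:
                print(f"WARNING: {task} stuck at {frac:.1%} "
                      f"for over {STALL_HOURS}h - consider suspend/abort")
        time.sleep(POLL_SECONDS)

if __name__ == "__main__":
    main()
```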
**Stoneageman** · Joined: 25 May 09 · Posts: 224 · Credit: 34,057,374,498 · RAC: 0

Typical! Looks like the server has run out of Santi long tasks. Anyone else getting some? Moving to short tasks for now :-(
Joined: 17 Feb 13 · Posts: 181 · Credit: 144,871,276 · RAC: 0

Hi, Folks:

Given the waste of resources recently experienced, I cannot continue processing GPUGrid WUs. I fully support the ideas posted, in particular shorter and more reliable WUs, with the ability for us to select the contribution area. Unfortunately the management in Barcelona appears unable, in my view, to communicate with us crunchers. After all, we contribute our resources pro bono and deserve more respect.

While I understand the importance of GPUGrid research, I have stopped contributing to GPUGrid for now. From time to time I will review forum posts to determine my future involvement.

Regards,
John
Joined: 17 Aug 08 · Posts: 2705 · Credit: 1,311,122,549 · RAC: 0

John, you wouldn't have to quit completely if you wanted to. The SR WUs still seem to be OK.

@All: keep two things in mind. First, it's the weekend. Second, even failed WUs are very likely to help in some way. Sure, this feels like involuntary beta testing.. but I'm sure the researchers gain some knowledge from running them. I just can't tell you what and how much, since I don't have such insider information either.

SK wrote:
> Also, the amount of work we do for different research (paper/presentation/thesis) is not visible. It's about time it was, so we can determine how much we contribute. After all, the wee badges are based on this.

I think the badges are quite nice already, if combined with general information on what the individual batches are doing. We already have some general information on the HIV / cancer etc. types of work, but this only covers the broad scope and the highlights. Which is good and appreciated, but cannot easily be connected to the individual batches we run.

MrS
Scanning for our furry friends since Jan 2002
Joined: 28 Jul 12 · Posts: 819 · Credit: 1,591,285,971 · RAC: 0

Hi, Folks:

John, me too. I like the project, but my GTX 660s are not now suited for the longs (BSODs or hangs with Santis, slow running and errors with Noelias). It is not that I get them all the time, just enough that I can't trust it to run reliably for days at a time. I will check back later when the quality control improves (or is implemented).