Working Unit Hanging...different than others reported?
| Author | Message |
|---|---|
|
Joined: 21 Oct 08 Posts: 144 Credit: 2,973,555 RAC: 0
|
Had to cancel a workunit that hung at about 85% complete (see here). Was curious if this is a different error than the others since it 1) is not one of the KASHIF_HIVPR ones--it is an IBUCH_KID, 2) I am using BOINC 6.5.0, so no 6.6.x problems, and 3) I believe that the driver is 178.24, so definitely not the 185.xx issues. The machine in question is running an 8800GS with shaders OC'ed, but this is the first hanging unit on it so far. |
Paul D. Buck Joined: 9 Jun 08 Posts: 1050 Credit: 37,321,185 RAC: 0
|
"Had to cancel a workunit that hung at about 85% complete (see here). Was curious if this is a different error than the others since it 1) is not one of the KASHIF_HIVPR ones--it is an IBUCH_KID, 2) I am using BOINC 6.5.0, so no 6.6.x problems, and 3) I believe that the driver is 178.24, so definitely not the 185.xx issues." OK, we KNOW that 6.6.20 hangs work units badly. I have seen it with other versions. The problem is that we do NOT know what is causing this, so there is no way to tell for sure which version the problem was introduced in... Or to put it another way, you could be seeing the earliest occurrence of this bug. Try a reboot, and if it is the 6.6.20 problem you will likely start to see an increase in speed. USUALLY you will start to see the time to completion drop several seconds per second if this WAS the long run bug... |
|
Joined: 1 Feb 09 Posts: 139 Credit: 575,023 RAC: 0
|
I returned home 2 hours ago and found another so-called big unit stuck at 7.480%, where it has stayed for at least the 2 hours I have been home. Since I am now running 6.5.0 again, I can no longer see how long it has actually been stuck at that percentage. The only thing I see is the time to completion steadily increasing, from 25 to 32 hours now, so I guess this one is also going to error out after many hours of running. I really don't think a unit would make no progress in more than 2 hours, even if it is a long-running one. |
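The "no progress in more than 2 hours" rule of thumb from the post above can be written down as a small check. This is only a sketch: the `Sample` record, the function names, and the 2-hour threshold are illustrative assumptions, and how the progress readings are collected from the client is left open.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One reading of a task's state (hypothetical record, not a BOINC type)."""
    elapsed_s: float      # seconds since the task started
    fraction_done: float  # reported progress, 0.0 to 1.0


def stalled_for(samples):
    """Return how many seconds the latest progress value has been unchanged."""
    if not samples:
        return 0.0
    last = samples[-1]
    since = last.elapsed_s
    # Walk backwards until the reported progress differs from the latest value.
    for s in reversed(samples):
        if s.fraction_done != last.fraction_done:
            break
        since = s.elapsed_s
    return last.elapsed_s - since


def is_stalled(samples, threshold_s=2 * 3600):
    """Flag a task whose progress has not moved for at least threshold_s seconds."""
    return stalled_for(samples) >= threshold_s


# Made-up readings: progress reaches 7.48% and then stops moving.
readings = [
    Sample(0, 0.0),
    Sample(3600, 0.05),
    Sample(7200, 0.0748),
    Sample(10800, 0.0748),
    Sample(14400, 0.0748),
]
print(stalled_for(readings), is_stalled(readings))
```

A task flagged this way is only a candidate for closer inspection; as noted elsewhere in the thread, some legitimately long tasks are hard to tell apart from hung ones until they finish.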
|
Joined: 21 Oct 08 Posts: 144 Credit: 2,973,555 RAC: 0
|
Drat...I had already aborted that one before I thought about it being potentially different from the already known problem (or potentially informative as an earlier-version example). That machine has already completed another unit--a RAUL unit--in typical time, without a restart. |
Paul D. Buck Joined: 9 Jun 08 Posts: 1050 Credit: 37,321,185 RAC: 0
|
Scott, that is what is making this bug so much fun. 6.6.20 was unique in that it affected nearly 50% of the tasks I ran on it. I think I have seen it on 6.6.23 or .25 ... not sure ... but there are those odd long tasks, so sometimes it is hard to know for sure until they are done. Sadly you cannot always tell by the names ... or I can't remember the key ... At any rate, it is still on my list of things to look for ... I found one more pointer today, not that it will do much good ... |
[AF>DoJ] supersonic Joined: 8 Nov 08 Posts: 8 Credit: 3,032,744 RAC: 0
|
Hello, just to report that after a first IBUCH_KID that hung last week, I had a GIANNI_FB that hung this weekend. After aborting it, my machine is now stuck on a RAUL_pY ... BOINC 6.4.7 (I can't remember the drivers), 8800 GTS 512. I'm surprised, because my machine has been running since December without a problem. My RAC is dropping... |
Paul D. Buck Joined: 9 Jun 08 Posts: 1050 Credit: 37,321,185 RAC: 0
|
Have you tried just stopping and restarting BOINC? We KNOW that there are some issues in the Resource Scheduler with regard to starting tasks (at least) and some other issues that MAY be related. The trouble is that at the moment I am chasing rumors and have no data to work with yet (I am hoping to be collecting some as we speak)... We KNOW 6.6.20 could cause tasks to hang, or run slow, but we do ****NOT**** know that the problem is restricted to that version. So, try just stopping and restarting BOINC first; if that does not unstick it, suspend it and let another task run; also try rebooting ... I know it is a lot to ask ... but if we are ever going to get our hands around this we have to figure out what the problem is and what versions it might affect. |
|
Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 |
You're right, Paul, BUT.. this is at least the second report of repeated hanging tasks and different WUs with 6.4.7. It looks like *something* is up, but it does not seem to be "the 6.6.20 problem". MrS Scanning for our furry friends since Jan 2002 |
Paul D. Buck Joined: 9 Jun 08 Posts: 1050 Credit: 37,321,185 RAC: 0
|
"You're right, Paul, BUT.. this is at least the second report of repeated hanging tasks and different WUs with 6.4.7. It looks like *something* is up, but it does not seem to be 'the 6.6.20 problem'." I did not think that I said it was ... I said we don't know enough ... and that it is a possibility ... in the meantime, try these things ... :) The more I dig into the Resource Scheduler and ponder the implications of the code buried therein, the less sanguine I get about how this system works. Richard Haselgrove has documented a problem on SaH where the CUDA tasks are started before they are initialized... and the task of course immediately crashes ... also not this problem, but it is a flaw in the way resources and tasks are managed. I am trying to gather data for another error I am calling "silent restart", which may or may not be related to long-running tasks. The fundamental problem is that there is too much we do not know ... and too many times this last month I have dug deep into a problem and found that the error is one that has been plaguing us for years. Meaning what? Meaning that superficial changes in some of the code may cause long-standing problems to slip in and out of view. The part of the code that I am worried about has not changed in a long time that I know of... which means that what was a disaster in 6.6.20 may only be a mild annoyance in other versions ... but the bug is still there... As proof of my case, the "no heartbeat" and "too many restarts" errors have been long-standing problems where people would lose tasks, and we had been pulling our hair out trying to figure out what was causing them ... well, I now know of two different potential causes. Neither is related to the tasks that were being mangled. Or to put it another way, we were looking in the wrong places... {edit} As an example of how this can happen, take 6.4.7 (or actually any version of BOINC), a specific task, and a specific card ...
The task hits a particular point in the loops and takes slightly longer to get through the loop than expected. The science application does not send its heartbeat message in time, BOINC shoots the application and relaunches it at the prior checkpoint, which means that you could see very little advancement of the task because it was being killed and restarted all the time. One quick way to see if this is happening is to watch the PID of the processes running under BOINC; if the one for GPU Grid keeps increasing then ... (you have to turn on the additional column in the View menu of Task Manager for Windows). Again, we don't know why 6.6.20 caused many tasks to seemingly run forever, and though here we are concerned with the GPU, I also have experience with a system where it was happening to a CPU-class task. And I am pretty sure I saw it on a task run with 6.6.23 ... |
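The PID-watching tip in the post above boils down to counting how often the process ID of the science application changes between observations. A minimal sketch, assuming the PIDs have already been noted down by hand (for example from Task Manager) or by some other tool; the function name and the sample values are hypothetical.

```python
def count_restarts(pid_samples):
    """Count how often the observed PID changed between consecutive samples.

    pid_samples is a sequence of PIDs noted at regular intervals;
    None means the process was not found at that moment and is skipped.
    """
    restarts = 0
    previous = None
    for pid in pid_samples:
        if pid is None:
            continue
        # A different PID than last time means the application was relaunched.
        if previous is not None and pid != previous:
            restarts += 1
        previous = pid
    return restarts


# Made-up observations: the application was relaunched twice.
observed = [4120, 4120, 4388, 4388, None, 5012]
print(count_restarts(observed))
```

A count that keeps climbing while the progress percentage barely moves would be consistent with the kill-and-relaunch cycle described above; a count of zero on a stuck task points elsewhere.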
|
Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 |
OK, sorry, so I read too much into what you actually wrote. Apart from this.. still agreed ;) MrS Scanning for our furry friends since Jan 2002 |
Paul D. Buck Joined: 9 Jun 08 Posts: 1050 Credit: 37,321,185 RAC: 0
|
"OK, sorry, so I read too much into what you actually wrote. Apart from this.. still agreed ;)" No worries ... I did not take offense or get bugged... :) Just trying to keep us all on the same page... Though my input is discounted, debugging software is something I have been doing for some 34 years. All the way from assembly language programs to Ada. Even when I play computer games I rarely play to win; I usually play to kill time and to learn how the AI cheats ... What concerns me with this area of code is that I suspect that the same fundamental flaw is presenting itself under different guises... so we see what we think are 3-4 problems and it is really one flaw. The problem is that our diagnostic tools are very limited and hard to use. Anyway, onward ... |
[AF>DoJ] supersonic Joined: 8 Nov 08 Posts: 8 Credit: 3,032,744 RAC: 0
|
Hello, I did try a stop / start of the BOINC client, but no luck for me. I got another KASHIF_HIVPR unit that has been hanging for 4 days now... :( As there is no way to stop a WU other than being connected to the machine (BAM manages projects only, not WUs), and as I have no remote connection established with this machine, here is my question: does a WU stop crunching once its deadline has been reached and passed? Or does the crunching go on cycling and cycling until the end of time? ;) Thank you. |
Paul D. Buck Joined: 9 Jun 08 Posts: 1050 Credit: 37,321,185 RAC: 0
|
All tasks have a drop dead time ... the problem is that if the task or machine is hung, this may or may not be detected. Without knowing more it is hard to say what is going on. If the task is "running" it will eventually get to the drop dead time and quit as running too long. |
[AF>DoJ] supersonic Joined: 8 Nov 08 Posts: 8 Credit: 3,032,744 RAC: 0
|
OK. And how long is the drop dead time? Say, 24 hours after the deadline, or more than that? Thanks. |
Paul D. Buck Joined: 9 Jun 08 Posts: 1050 Credit: 37,321,185 RAC: 0
|
OK. The short answer is that I am not sure. It is not a set time per se; it is the "max CPU time exceeded" limit, which is a function of the speed of the system. I don't know what option the project selected here because this is about the first time we have had this issue ... If you have BAM access you could detach and reattach through BAM, and that would clear the task ... since this is a remote machine, that is the only thing I can think of so you can get back to being productive. I cannot remember if you can do a project reset through BAM or not ... that is the other option to try if it is available. |
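To make "a function of the speed of the system" concrete, here is a rough sketch of how such a limit could be derived. It assumes the limit is simply a per-workunit operations bound (BOINC workunits carry a field of this kind, `rsc_fpops_bound`) divided by the host's measured floating-point speed. The exact rule the client applies, and the bound this project sets, are not given in the thread, so the function name and all numbers below are illustrative only.

```python
def max_runtime_seconds(fpops_bound, host_flops):
    """Rough upper limit on run time before a task is aborted.

    fpops_bound: upper bound on floating-point operations for the task
                 (assumed to correspond to the workunit's rsc_fpops_bound).
    host_flops:  measured speed of the host in operations per second.
    """
    if host_flops <= 0:
        raise ValueError("host_flops must be positive")
    return fpops_bound / host_flops


# Hypothetical bound of 1e17 operations on two hosts of different speed.
fast_host = max_runtime_seconds(1e17, 2.5e9)
slow_host = max_runtime_seconds(1e17, 1.25e9)
print(fast_host / 3600, slow_host / 3600)  # limits in hours
```

Under this assumption a host half as fast gets twice as long before the task is cut off, which is why the limit cannot be stated as a fixed number of hours after the deadline.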
©2025 Universitat Pompeu Fabra