Abrupt computer restart - Tasks stuck - Kernel not found
Joined: 11 Oct 08 | Posts: 1127 | Credit: 1,901,927,545 | RAC: 0
I recently had a power outage here; the computer lost power while it was working on BOINC. When I turned the computer on again, the two Long-run GPUGrid tasks got stuck in a continual "driver reset" loop, with continuous Windows balloons saying the driver had successfully recovered, over and over. I looked at the stderr.txt file in the slots directory, and remember seeing:

Kernel not found# SWAN swan_assert 0

over and over, once for each retry of the task. The only way I could see to get out of the loop was to abort the work units, so I did. The tasks are below. Curiously, there was also a beta task that I had worked on (which errored out and was reported well before the power outage) where it also said:

Kernel not found# SWAN swan_assert 0

1) Why was the full stderr.txt not included in my aborted task logs?
2) Why did the app continually try to restart this unresumable situation?
3) Was the error in the beta task intentionally set (to test the retry logic)?

Thanks,
Jacob

Name: I66R8-NATHAN_KIDKIXc22_6-9-50-RND7714_1
Workunit: 4795185
Created: 29 Sep 2013 | 9:39:42 UTC
Sent: 29 Sep 2013 | 9:56:59 UTC
Received: 30 Sep 2013 | 4:01:08 UTC
Server state: Over
Outcome: Computation error
Client state: Aborted by user
Exit status: 203 (0xcb) EXIT_ABORTED_VIA_GUI
Computer ID: 153764
Report deadline: 4 Oct 2013 | 9:56:59 UTC
Run time: 48,589.21
CPU time: 48,108.94
Validate state: Invalid
Credit: 0.00
Application version: Long runs (8-12 hours on fastest card) v8.14 (cuda55)
Stderr output:
<core_client_version>7.2.16</core_client_version>
<![CDATA[
<message>
aborted by user
</message>
]]>

Name: 17x6-SANTI_RAP74wtCUBIC-13-34-RND0681_0
Workunit: 4807187
Created: 29 Sep 2013 | 13:06:23 UTC
Sent: 29 Sep 2013 | 17:32:54 UTC
Received: 30 Sep 2013 | 4:01:08 UTC
Server state: Over
Outcome: Computation error
Client state: Aborted by user
Exit status: 203 (0xcb) EXIT_ABORTED_VIA_GUI
Computer ID: 153764
Report deadline: 4 Oct 2013 | 17:32:54 UTC
Run time: 17,822.88
CPU time: 3,669.02
Validate state: Invalid
Credit: 0.00
Application version: Long runs (8-12 hours on fastest card) v8.14 (cuda55)
Stderr output:
<core_client_version>7.2.16</core_client_version>
<![CDATA[
<message>
aborted by user
</message>
]]>

Name: 112-MJHARVEY_CRASH3-14-25-RND0090_2
Workunit: 4807215
Created: 29 Sep 2013 | 17:32:12 UTC
Sent: 29 Sep 2013 | 17:32:54 UTC
Received: 29 Sep 2013 | 19:04:42 UTC
Server state: Over
Outcome: Computation error
Client state: Compute error
Exit status: -226 (0xffffffffffffff1e) ERR_TOO_MANY_EXITS
Computer ID: 153764
Report deadline: 4 Oct 2013 | 17:32:54 UTC
Run time: 4,020.13
CPU time: 1,062.94
Validate state: Invalid
Credit: 0.00
Application version: ACEMD beta version v8.14 (cuda55)
Stderr output:
<core_client_version>7.2.16</core_client_version>
<![CDATA[
<message>
too many exit(0)s
</message>
<stderr_txt>
# GPU [GeForce GTX 460] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:08:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : r325_00 : 32723
# GPU 0 : 68C
# GPU 1 : 61C
# GPU 2 : 83C
# GPU 1 : 63C
# GPU 1 : 64C
# GPU 1 : 65C
# GPU 1 : 66C
# GPU 1 : 67C
# GPU 1 : 68C
# GPU 0 : 69C
# GPU 1 : 69C
# GPU 1 : 70C
# GPU 0 : 70C
# GPU 1 : 71C
# GPU 0 : 71C
# GPU [GeForce GTX 460] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:08:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : r325_00 : 32723
Kernel not found# SWAN swan_assert 0
14:56:38 (1696): Can't acquire lockfile (32) - waiting 35s
# GPU [GeForce GTX 460] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:08:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : r325_00 : 32723
Kernel not found# SWAN swan_assert 0
# GPU [GeForce GTX 460] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:08:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : r325_00 : 32723
Kernel not found# SWAN swan_assert 0
...
# GPU [GeForce GTX 460] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:08:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : r325_00 : 32723
Kernel not found# SWAN swan_assert 0
</stderr_txt>
]]>
Joined: 11 Jul 09 | Posts: 1639 | Credit: 10,159,968,649 | RAC: 351
Interesting that you and Matt - no, not Matt Harvey, the guy in GPUGrid Start Up/Recovery Issues - should both post about similar issues on the same day. I've also had the problem of the continual "driver reset" loop after an abnormal shutdown, also mostly with NATHAN_KIDKIXc22 tasks. The problem would appear to be a failure to restart the tasks from a (possibly damaged or corrupt) checkpoint file - maybe the project team could look into that?

My workaround has been to restart Windows in safe mode (which prevents BOINC from loading) and edit client_state.xml to add the line <suspended_via_gui/> to the <result> block for the suspect task (see the sketch below). As the name suggests, that's the same as clicking 'Suspend' for the task while BOINC is running, and it gets control of the machine back so you can investigate on the next normal restart. By convention, the line goes just under <plan_class> in client_state, but I think anywhere at the first indent level will do.

Interesting point about stderr.txt - I hadn't looked that far into it. The process for stderr is:

1. It gets written as a file in the slot directory.
2. On task completion, the contents of the file get copied into that same <result> block in client_state.xml.
3. The <result> data is copied into a sched_request file for the project's server.
4. The scheduler result handler copies it into the database for display on the web.

So, which of those gets skipped if a task gets aborted? Next time it happens, I'll follow the process through and see where it goes missing. Any which way, it's probably a BOINC problem, and I agree it would be better if partial information were available for aborted tasks. You and I both know where and how to get that changed once we've narrowed down the problem ;)
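For anyone wanting to try that workaround, here is a minimal sketch of what the edited <result> block in client_state.xml might look like. The surrounding element names and values are illustrative only (a typical result block, using one of the task names from this thread as an example); the single line being added is <suspended_via_gui/>:

```xml
<!-- Illustrative fragment of client_state.xml; values are examples, not from a real file. -->
<!-- The only line being added is <suspended_via_gui/>. -->
<result>
    <name>I66R8-NATHAN_KIDKIXc22_6-9-50-RND7714_1</name>
    <final_cpu_time>0.000000</final_cpu_time>
    <final_elapsed_time>0.000000</final_elapsed_time>
    <exit_status>0</exit_status>
    <state>2</state>
    <plan_class>cuda55</plan_class>
    <suspended_via_gui/>
    <version_num>814</version_num>
    <!-- remaining elements of the block left unchanged -->
</result>
```

On the next normal boot the task should then show as suspended in BOINC Manager, and you can resume or abort it from there once you've had a look at the slot directory.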
Joined: 5 Jan 09 | Posts: 670 | Credit: 2,498,095,550 | RAC: 0
I have the same problem with Nathan units on a GTX 460, but I didn't have any power outages.

ADDED: I wish they would beta test full-size WUs before releasing them on an unsuspecting public. It's little wonder there's only a very small bunch of hardcore GPUGrid crunchers; it's just too much hassle for the ordinary user and causes too many problems. They join and quickly leave... a shame, because a little more full beta testing would catch these problems.

Radio Caroline, the world's most famous offshore pirate radio station. Great music since April 1964. Support Radio Caroline Team - Radio Caroline
Joined: 5 Jan 09 | Posts: 670 | Credit: 2,498,095,550 | RAC: 0
My GTX 660 Ti also threw a wobbly on a Noelia WU, here: http://www.gpugrid.net/result.php?resultid=7310174

Radio Caroline, the world's most famous offshore pirate radio station. Great music since April 1964. Support Radio Caroline Team - Radio Caroline
Joined: 15 May 11 | Posts: 108 | Credit: 297,176,099 | RAC: 0
> I recently had a power outage here, where the computer lost power while it had been working on BOINC.

Jacob, this has been my life with my GTX 590 box for the last month. I usually just end up resetting the whole project because the apps will not continue. It may run for a day or two, or it may run for only two hours before a BSOD. I'm fighting the nvlddmkm.sys thing right now and will probably end up reinstalling as a last-ditch effort.

This system does not normally crash unless BOINC is running GPUGrid WUs. It is not overclocked and is water cooled. All timings and specs are as they came from the Dell factory for this T7500.

But yeah... I completely understand what you're going through.

Operator
Joined: 11 Oct 08 | Posts: 1127 | Credit: 1,901,927,545 | RAC: 0
MJH: Can you try to reproduce this problem (described in my report in the first post) and fix it?
Joined: 15 May 11 | Posts: 108 | Credit: 297,176,099 | RAC: 0
I did reinstall the OS on my GTX 590 box and have not installed any updates. I am using driver 314.22 right now, and it's been running for two days without any errors at all.

I am now convinced that my problems were caused by a "third-party software" issue, or possibly by the Microsoft WDDM 1.1/1.2/1.3 update package. I'm using Win7 x64, so I really don't think I need the update to the Windows display driver model to work with Win 8 or 8.1 if I'm not using that OS.

Regardless, it's working now!

Operator
Joined: 22 Nov 09 | Posts: 114 | Credit: 589,114,683 | RAC: 0
I had the same problem with this WU on my GTX 580 machine - http://www.gpugrid.net/workunit.php?wuid=4819239 - only I did not have a power outage. My symptom was finding my computer frozen, with no choice other than to hit the reset switch. When the computer came back up, I kept getting Windows balloons saying that there were driver problems and that the driver had failed to start, and then blue screens.

I booted into safe mode, then downloaded and installed the latest WHQL NVIDIA driver. I then rebooted and got exactly the same thing again. I figured it was the GPUGrid WU, so I again booted into safe mode, brought up BOINC Manager and aborted the task. Now my computer comes back up and is running; however, I got a computation error on this WU - http://www.gpugrid.net/workunit.php?wuid=4820870 - which also caused a blue screen. I've set my GPUGrid project to not get new tasks for the time being.

Interestingly enough, my GTX 460 machine seems to be having no problems at the moment.
Joined: 11 Oct 08 | Posts: 1127 | Credit: 1,901,927,545 | RAC: 0
MJH: Any response? I tried to provide as much detail as possible.
MJH | Joined: 12 Nov 07 | Posts: 696 | Credit: 27,266,655 | RAC: 0
Jacob,

Next time this happens, please email me the contents of the slot directory and the task files.

MJH
Joined: 11 Oct 08 | Posts: 1127 | Credit: 1,901,927,545 | RAC: 0
Ha! It seems like it should be easy to reproduce (turn off the PC via the switch, not via a normal shutdown, in the middle of a GPUGrid task)... Challenge accepted.
Joined: 11 Oct 08 | Posts: 1127 | Credit: 1,901,927,545 | RAC: 0
MJH: If I'm able to reproduce the issue, where should I email the requested files? Can you please PM me your email address?

Also... for my first test, the issue did not occur on my Long-run SANTI-baxbim tasks. I wonder if it is task-type-specific? I'll try to test an "abrupt computer restart" against other task types.
Joined: 11 Oct 08 | Posts: 1127 | Credit: 1,901,927,545 | RAC: 0
MJH: I have been able to reproduce the problem with a SANTI_MAR422dim task. Can you please PM me your email address?

Thanks,
Jacob
Joined: 11 Oct 08 | Posts: 1127 | Credit: 1,901,927,545 | RAC: 0
Matt, I have received your PM and have sent you the files. Please let me know if you need anything or find anything!

Thanks,
Jacob
Joined: 22 Nov 09 | Posts: 114 | Credit: 589,114,683 | RAC: 0
MJH: FWIW - my GTX 460 machine finished the task that I posted about, although it took longer than 24 hours. It was a SANTI-baxbim task - http://www.gpugrid.net/workunit.php?wuid=4818983

Also, I have to say that I somewhat agree with the post above about people who run this project really needing to know what they are doing. I'm a software developer / computer scientist by trade, and I build my own PCs when I need them. One reason I think many people might leave this project is that it takes so long to run a WU, and that they must be returned in 5 days. Some people might turn their machines off, and thus would not be able to return a WU in 5 days. Personally, I only run this project on weekends.

In general, I have found this project to be relatively stable, with this perhaps the only serious fault I have encountered so far. However, when faults like this arise, it almost certainly takes skilled people to get out of the situation created. Unfortunately, this and other similar projects, at least as I see it, are on the bleeding edge. As in my job, where the software I work with is also on the bleeding edge (a custom FEA program), it can be extraordinarily difficult to catch a bug like this, since it sounds like it occurs only under limited circumstances that may not be caught in testing unless, as in this case, the PC is shut down abnormally.

Just my $0.02.
Joined: 5 Jan 09 | Posts: 670 | Credit: 2,498,095,550 | RAC: 0
> One reason I think many people might leave this project is that it takes so long to run a WU, and that they must be returned in 5 days. Some people might turn their machines off, and thus would not be able to return a WU in 5 days. Personally, I only run this project on weekends.

That's another thing that is a "trap" and confusing. While the deadline is 5 days, if you don't return the WU within 2 days it is resent to another host, and if that host returns first (likely), your computing time has been wasted (even if you do eventually return a result), because the first valid result returned is the canonical one and yours is binned.

Radio Caroline, the world's most famous offshore pirate radio station. Great music since April 1964. Support Radio Caroline Team - Radio Caroline
Retvari Zoltan | Joined: 20 Jan 09 | Posts: 2380 | Credit: 16,897,957,044 | RAC: 0
> One reason I think many people might leave this project is that it takes so long to run a WU, and that they must be returned in 5 days. Some people might turn their machines off, and thus would not be able to return a WU in 5 days. Personally, I only run this project on weekends.

The 2-day resend was discontinued long ago.
Joined: 5 Jan 09 | Posts: 670 | Credit: 2,498,095,550 | RAC: 0
> One reason I think many people might leave this project is that it takes so long to run a WU, and that they must be returned in 5 days. Some people might turn their machines off, and thus would not be able to return a WU in 5 days. Personally, I only run this project on weekends.

I type corrected :-)

Radio Caroline, the world's most famous offshore pirate radio station. Great music since April 1964. Support Radio Caroline Team - Radio Caroline
Joined: 11 Oct 08 | Posts: 1127 | Credit: 1,901,927,545 | RAC: 0
Alright, back on topic here... I'm waiting for MJH to analyze the files that I sent him.
Joined: 15 May 11 | Posts: 108 | Credit: 297,176,099 | RAC: 0
> Alright, back on topic here...

Jacob,

Are you talking about how, when one of the GPUs hits a TDR, it screws up all the tasks running on the other GPUs as well? That happens to me on my GTX 590 box all the time (mostly power outages). If one task messes up and ends up causing a TDR, or a complete dump and reboot, then when I start BOINC again all the remaining WUs in progress on the other GPUs also cause more TDRs unless I abort them. Sometimes even that doesn't help and I have to completely reset the project.

Example: I had a TDR the other day. Three WUs were uploading at the time; only one was actually processing. Fine. So I reboot, catch BOINC before it starts processing the problem WU, and suspend processing so the three that did complete can upload for credit. Then I abort the problem WU and let the system download 4 new WUs. As soon as processing starts, wham! Another TDR. So I do a reset of the project, and 4 more WUs download and start processing without any problem at all.

So the point is, unless I reset the project when I get a TDR, I'm just wasting my time downloading new WUs, because they are all going to continue to crash until I do a complete reset. I'm not sure what leftover file in the BOINC or GPUGrid project folder(s) is causing the TDRs after the original event.

Is that the same issue you are talking about here, or am I way off?

Operator