
Message boards : Number crunching : GIANNI_ntl units crash on reboot

Bedrich Hajek
Message 35893 - Posted: 26 Mar 2014 | 0:32:38 UTC
Last modified: 26 Mar 2014 | 0:37:51 UTC

GIANNI_ntl units crash on reboot. This happened on both Windows XP and Windows 7. The units were running fine before the reboot.


3/25/2014 8:12:03 PM | | Starting BOINC client version 7.2.42 for windows_x86_64
3/25/2014 8:12:03 PM | | log flags: file_xfer, sched_ops, task
3/25/2014 8:12:03 PM | | Libraries: libcurl/7.25.0 OpenSSL/1.0.1 zlib/1.2.6
3/25/2014 8:12:03 PM | | Data directory: D:\BOINC APPLACATION
3/25/2014 8:12:03 PM | | Running under account Bedrich Hajek
3/25/2014 8:12:03 PM | | CUDA: NVIDIA GPU 0: GeForce GTX 690 (driver version 335.23, CUDA version 6.0, compute capability 3.0, 2048MB, 1954MB available, 3132 GFLOPS peak)
3/25/2014 8:12:03 PM | | CUDA: NVIDIA GPU 1: GeForce GTX 690 (driver version 335.23, CUDA version 6.0, compute capability 3.0, 2048MB, 1954MB available, 3132 GFLOPS peak)
3/25/2014 8:12:03 PM | | CUDA: NVIDIA GPU 2: GeForce GTX 690 (driver version 335.23, CUDA version 6.0, compute capability 3.0, 2048MB, 1954MB available, 3132 GFLOPS peak)
3/25/2014 8:12:03 PM | | CUDA: NVIDIA GPU 3: GeForce GTX 690 (driver version 335.23, CUDA version 6.0, compute capability 3.0, 2048MB, 1949MB available, 3132 GFLOPS peak)
3/25/2014 8:12:03 PM | | OpenCL: NVIDIA GPU 0: GeForce GTX 690 (driver version 335.23, device version OpenCL 1.1 CUDA, 2048MB, 1954MB available, 3132 GFLOPS peak)
3/25/2014 8:12:03 PM | | OpenCL: NVIDIA GPU 1: GeForce GTX 690 (driver version 335.23, device version OpenCL 1.1 CUDA, 2048MB, 1954MB available, 3132 GFLOPS peak)
3/25/2014 8:12:03 PM | | OpenCL: NVIDIA GPU 2: GeForce GTX 690 (driver version 335.23, device version OpenCL 1.1 CUDA, 2048MB, 1954MB available, 3132 GFLOPS peak)
3/25/2014 8:12:03 PM | | OpenCL: NVIDIA GPU 3: GeForce GTX 690 (driver version 335.23, device version OpenCL 1.1 CUDA, 2048MB, 1949MB available, 3132 GFLOPS peak)
3/25/2014 8:12:03 PM | | OpenCL CPU: AMD Phenom(tm) II X6 1090T Processor (OpenCL driver vendor: Advanced Micro Devices, Inc., driver version 2.0, device version OpenCL 1.1 AMD-APP-SDK-v2.5 (709.2))
3/25/2014 8:12:03 PM | | Host name: New-PC
3/25/2014 8:12:03 PM | | Processor: 6 AuthenticAMD AMD Phenom(tm) II X6 1090T Processor [Family 16 Model 10 Stepping 0]
3/25/2014 8:12:03 PM | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 htt pni cx16 popcnt syscall nx lm svm sse4a osvw ibs skinit wdt page1gb rdtscp 3dnowext 3dnow
3/25/2014 8:12:03 PM | | OS: Microsoft Windows 7: Professional x64 Edition, Service Pack 1, (06.01.7601.00)
3/25/2014 8:12:03 PM | | Memory: 7.99 GB physical, 15.98 GB virtual
3/25/2014 8:12:03 PM | | Disk: 1.82 TB total, 1.68 TB free
3/25/2014 8:12:03 PM | | Local time is UTC -4 hours
3/25/2014 8:12:03 PM | | Config: report completed tasks immediately
3/25/2014 8:12:03 PM | | Config: use all coprocessors
3/25/2014 8:12:03 PM | Einstein@Home | URL http://einstein.phys.uwm.edu/; Computer ID 5457828; resource share 100
3/25/2014 8:12:03 PM | GPUGRID | URL http://www.gpugrid.net/; Computer ID 127986; resource share 28800
3/25/2014 8:12:03 PM | GPUGRID | General prefs: from GPUGRID (last modified 03-Feb-2010 21:03:49)
3/25/2014 8:12:03 PM | GPUGRID | Host location: none
3/25/2014 8:12:03 PM | GPUGRID | General prefs: using your defaults
3/25/2014 8:12:03 PM | | Reading preferences override file
3/25/2014 8:12:03 PM | | Preferences:
3/25/2014 8:12:03 PM | | max memory usage when active: 4091.88MB
3/25/2014 8:12:03 PM | | max memory usage when idle: 7365.38MB
3/25/2014 8:12:03 PM | | max disk usage: 50.00GB
3/25/2014 8:12:03 PM | | max CPUs used: 4
3/25/2014 8:12:03 PM | | max download rate: 3000003 bytes/sec
3/25/2014 8:12:03 PM | | max upload rate: 384000 bytes/sec
3/25/2014 8:12:03 PM | | (to change preferences, visit a project web site or select Preferences in the Manager)
3/25/2014 8:12:03 PM | | Not using a proxy
3/25/2014 8:12:44 PM | GPUGRID | Computation for task 1029-GIANNI_ntl-1-4-RND8646_0 finished
3/25/2014 8:12:44 PM | GPUGRID | Output file 1029-GIANNI_ntl-1-4-RND8646_0_1 for task 1029-GIANNI_ntl-1-4-RND8646_0 absent
3/25/2014 8:12:44 PM | GPUGRID | Output file 1029-GIANNI_ntl-1-4-RND8646_0_2 for task 1029-GIANNI_ntl-1-4-RND8646_0 absent
3/25/2014 8:12:44 PM | GPUGRID | Output file 1029-GIANNI_ntl-1-4-RND8646_0_3 for task 1029-GIANNI_ntl-1-4-RND8646_0 absent
3/25/2014 8:12:44 PM | GPUGRID | Computation for task 1435-GIANNI_ntl-1-4-RND1081_0 finished
3/25/2014 8:12:44 PM | GPUGRID | Output file 1435-GIANNI_ntl-1-4-RND1081_0_1 for task 1435-GIANNI_ntl-1-4-RND1081_0 absent
3/25/2014 8:12:44 PM | GPUGRID | Output file 1435-GIANNI_ntl-1-4-RND1081_0_2 for task 1435-GIANNI_ntl-1-4-RND1081_0 absent
3/25/2014 8:12:44 PM | GPUGRID | Output file 1435-GIANNI_ntl-1-4-RND1081_0_3 for task 1435-GIANNI_ntl-1-4-RND1081_0 absent
3/25/2014 8:12:44 PM | GPUGRID | Computation for task 1784-GIANNI_ntl-1-4-RND9043_0 finished
3/25/2014 8:12:44 PM | GPUGRID | Output file 1784-GIANNI_ntl-1-4-RND9043_0_1 for task 1784-GIANNI_ntl-1-4-RND9043_0 absent
3/25/2014 8:12:44 PM | GPUGRID | Output file 1784-GIANNI_ntl-1-4-RND9043_0_2 for task 1784-GIANNI_ntl-1-4-RND9043_0 absent
3/25/2014 8:12:44 PM | GPUGRID | Output file 1784-GIANNI_ntl-1-4-RND9043_0_3 for task 1784-GIANNI_ntl-1-4-RND9043_0 absent


Task | Work unit | Computer | Sent | Reported | Status | Run time (s) | CPU time (s) | Credit | Application
1784-GIANNI_ntl-1-4-RND9043_0 | 5404600 | 127986 | 25 Mar 2014 23:31:49 UTC | 26 Mar 2014 0:17:49 UTC | Error while computing | 2.18 | 0.00 | --- | Long runs (8-12 hours on fastest card) v8.15 (cuda55)
1229-GIANNI_ntl-0-4-RND0298_1 | 5398972 | 30790 | 25 Mar 2014 21:11:42 UTC | 26 Mar 2014 0:19:20 UTC | Error while computing | 60.33 | 0.00 | --- | Long runs (8-12 hours on fastest card) v8.15 (cuda55)
1029-GIANNI_ntl-1-4-RND8646_0 | 5403779 | 127986 | 25 Mar 2014 18:01:01 UTC | 26 Mar 2014 0:17:49 UTC | Error while computing | 2.18 | 0.00 | --- | Long runs (8-12 hours on fastest card) v8.15 (cuda42)
1435-GIANNI_ntl-1-4-RND1081_0 | 5403702 | 127986 | 25 Mar 2014 19:20:06 UTC | 26 Mar 2014 0:17:49 UTC | Error while computing | 2.18 | 0.00 | --- | Long runs (8-12 hours on fastest card) v8.15 (cuda42)

http://www.gpugrid.net/results.php?hostid=30790&offset=0&show_names=1&state=5&appid=
http://www.gpugrid.net/results.php?hostid=127986&offset=0&show_names=0&state=5&appid=

Matt
Message 35895 - Posted: 26 Mar 2014 | 2:17:52 UTC

Same here. If I close BOINC or just suspend crunching, these WUs crash on restart every time.

Snow Crash
Message 35914 - Posted: 26 Mar 2014 | 16:58:24 UTC - in response to Message 35895.
Last modified: 26 Mar 2014 | 17:04:16 UTC

Same here, on 2 different rigs running Win7 x64:

GTX670 - Driver version : r334_89 : 33523
GTX480 - Driver version : r331_00 : 33182

Retvari Zoltan
Message 35919 - Posted: 26 Mar 2014 | 18:18:21 UTC

I've also experienced that.
These workunits don't make any checkpoints.
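
One way to verify the missing checkpoints is to watch the per-task checkpoint CPU time the client reports. A minimal Python sketch, assuming boinccmd is on the PATH and parsing its plain-text --get_tasks output (field labels can differ between client versions):

# Sketch: flag running tasks that have never written a checkpoint, i.e. whose
# reported "checkpoint CPU time" is still zero. Assumes boinccmd is on the
# PATH; the exact output format is an assumption and may vary.
import subprocess

out = subprocess.run(["boinccmd", "--get_tasks"],
                     capture_output=True, text=True, check=True).stdout

name = None
for line in out.splitlines():
    line = line.strip()
    if line.startswith("name:"):
        name = line.split(":", 1)[1].strip()
    elif line.startswith("checkpoint CPU time:"):
        if float(line.split(":", 1)[1]) == 0.0:
            print(f"{name}: no checkpoint written yet")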

skgiven
Volunteer moderator
Volunteer tester
Message 35937 - Posted: 27 Mar 2014 | 18:26:06 UTC - in response to Message 35919.

I'm running a GIANNI_ntl WU now on W7 x64. When I suspend and restart it, it continues from its last checkpoint. I did this a few times, no problems.
335.23 driver and CUDA 5.5 app; the WU is running on the 2nd GPU.

Bedrich Hajek
Message 35945 - Posted: 27 Mar 2014 | 22:03:38 UTC - in response to Message 35937.

Were you doing your stops and starts on a "xxxx"-GIANNI_ntl WU or on an e"xx"s"xx"_e"xx"s"xxxxx"-GIANNI_ntl WU?

I hope you understand my nomenclature: each "x" stands for any number or letter.

It was the "xxxx"-GIANNI_ntl WUs that crashed; the others were good.

I should have been more specific the first time.
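
(For concreteness, the two name styles can be told apart mechanically; a small Python sketch, with patterns inferred from the task names quoted elsewhere in this thread:)

# Sketch: classify GIANNI_ntl task names into the two prefix styles described
# above. The regexes are inferred from names quoted in this thread.
import re

PLAIN = re.compile(r"^\d+-GIANNI_ntl")                   # "xxxx"-GIANNI_ntl
PREFIXED = re.compile(r"^e\w+s\w+_e\w+s\w+-GIANNI_ntl")  # e"xx"s"xx"_...-GIANNI_ntl

for name in ("1029-GIANNI_ntl-1-4-RND8646_0",
             "e40s7_e20s6f82-GIANNI_ntl9c-0-1-RND9820_0"):
    if PLAIN.match(name):
        print(name, "-> crashes on restart (per this thread)")
    elif PREFIXED.match(name):
        print(name, "-> restarts cleanly (per this thread)")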



Retvari Zoltan
Message 35954 - Posted: 28 Mar 2014 | 9:40:23 UTC - in response to Message 35937.
Last modified: 28 Mar 2014 | 9:42:26 UTC

(No checkpoints made by GIANNI_ntl + my script looking for stuck tasks = computer restarts every 10 minutes) + GIANNI_ntl failing at restart = 33 + 31 + 17 failed GIANNI_ntl's.

BTW I've suspended my script until these GIANNI_ntl's are cleared from the queue.
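
A hypothetical sketch of such a watchdog, only to make the interaction concrete; the real script is not shown in this thread, so the logic, the 10-minute threshold, and the reboot command below are all assumptions:

# Hypothetical watchdog: reboot when no running task advances its checkpoint
# within a window. Non-checkpointing tasks like GIANNI_ntl never advance, so
# a script like this restarts the machine over and over, and each restart
# then errors the task out.
import subprocess, time

WINDOW = 600  # seconds; the "every 10 minutes" mentioned above

def checkpoint_times():
    out = subprocess.run(["boinccmd", "--get_tasks"],
                         capture_output=True, text=True, check=True).stdout
    return [float(l.split(":", 1)[1]) for l in out.splitlines()
            if l.strip().startswith("checkpoint CPU time:")]

before = checkpoint_times()
time.sleep(WINDOW)
after = checkpoint_times()
if before and after and max(after) <= max(before):
    subprocess.run(["shutdown", "/r", "/t", "0"])  # Windows reboot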

klepel
Message 36012 - Posted: 30 Mar 2014 | 23:41:03 UTC

This might not be the right topic, but since I detected it with a GIANNI WU, I'll place it here:

Lately I have had poor run times, ever since the new climateprediction batch (EU and ANZ) arrived on the AMD system with my GTX670 card. I noticed that the fan often slows down, which made me curious, and with EVGA OC Scanner I could see that the video card stalls several times for short periods. On the other hand, in the Task Manager (after Ctrl+Alt+Del) I can see that all of my cores are fully loaded, and it makes little difference if I free one or two cores; all cores stay at 100%. I always run with one core left free for the GTX670. Under Processes, I see acemd using one core and 6 or 7 hadam3p_anz tasks using the remaining 6 or 7 cores. The strange thing is that there are additional hadam3p_anz processes occupying the same amount of memory as the running ones, and some with low memory usage; none of the latter are running (CPU: 00). I had not noticed those before.

Today I saw the following messages; please scroll down until you see the GPUGRID one. The climateprediction lines are only illustration, although it is the same message:

30/03/2014 03:57:46 p.m. | climateprediction.net | Task hadam3p_anz_nb1g_2012_1_008587996_1 exited with zero status but no 'finished' file
30/03/2014 03:57:46 p.m. | climateprediction.net | If this happens repeatedly you may need to reset the project.
30/03/2014 03:57:46 p.m. | climateprediction.net | Task hadam3p_anz_n4y1_2012_1_008580097_1 exited with zero status but no 'finished' file
30/03/2014 03:57:46 p.m. | climateprediction.net | If this happens repeatedly you may need to reset the project.
30/03/2014 03:57:46 p.m. | climateprediction.net | Task hadam3p_anz_nb56_2012_1_008602058_0 exited with zero status but no 'finished' file
30/03/2014 03:57:46 p.m. | climateprediction.net | If this happens repeatedly you may need to reset the project.
30/03/2014 03:57:46 p.m. | climateprediction.net | Task hadam3p_anz_n2de_2012_1_008592431_0 exited with zero status but no 'finished' file
30/03/2014 03:57:46 p.m. | climateprediction.net | If this happens repeatedly you may need to reset the project.
30/03/2014 03:57:46 p.m. | GPUGRID | Task e40s7_e20s6f82-GIANNI_ntl9c-0-1-RND9820_0 exited with zero status but no 'finished' file
30/03/2014 03:57:46 p.m. | GPUGRID | If this happens repeatedly you may need to reset the project.
30/03/2014 03:57:46 p.m. | climateprediction.net | Restarting task hadam3p_anz_nb1g_2012_1_008587996_1 using hadam3p_anz version 610 in slot 7
30/03/2014 03:57:46 p.m. | climateprediction.net | Restarting task hadam3p_anz_n4y1_2012_1_008580097_1 using hadam3p_anz version 610 in slot 2
30/03/2014 03:57:46 p.m. | climateprediction.net | Restarting task hadam3p_anz_nb56_2012_1_008602058_0 using hadam3p_anz version 610 in slot 1
30/03/2014 03:57:46 p.m. | climateprediction.net | Restarting task hadam3p_anz_n2de_2012_1_008592431_0 using hadam3p_anz version 610 in slot 6
30/03/2014 03:57:46 p.m. | GPUGRID | Restarting task e40s7_e20s6f82-GIANNI_ntl9c-0-1-RND9820_0 using acemdlong version 815 (cuda55) in slot 3

I should mention the following as well:
1) My motherboard and CPU burned out (990FXA-UD and AMD 8150) and I had to replace them with the same motherboard (newer BIOS) and an AMD 8350, so I was expecting better run times, not worse. With the AMD 8150 everything was easy cruising: I never had problems, everything was stable, and the system took about 8 hours for NATHANs.
2) After a long time I changed the Nvidia driver from 301.42 to 332.21.

This has been bothering me for several days, but the message above I saw for the first time today. Is there a connection between my slow GPUGRID run times and the new batch of climateprediction (the combination worked very well before), or the driver change, or the new AMD 8350? Or does it have something to do with "exited with zero status but no 'finished' file"?

Please comment!
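
For context on that message: a BOINC science app signals normal completion by writing a finish marker into its slot directory just before exiting; if the process exits with status 0 but the marker is missing, the client logs the line above and restarts the task from its last checkpoint. A minimal Python sketch of the same check, assuming the conventional marker name boinc_finish_called and the default Windows data directory:

# Sketch: look for the completion marker the client expects in each slot.
# The marker name "boinc_finish_called" is the conventional one; DATA_DIR
# is an assumption (default Windows install) - adjust for your setup.
from pathlib import Path

DATA_DIR = Path(r"C:\ProgramData\BOINC")

for slot in sorted((DATA_DIR / "slots").iterdir()):
    marker = slot / "boinc_finish_called"
    state = "finish marker present" if marker.exists() else "no 'finished' file"
    print(f"slot {slot.name}: {state}")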

Jacob Klein
Message 36014 - Posted: 31 Mar 2014 | 4:19:45 UTC
Last modified: 31 Mar 2014 | 4:37:51 UTC

If you click the task on the GPUGrid website and view the stderr.txt at the bottom, and it says something like

Application version Long runs (8-12 hours on fastest card) v8.15 (cuda42)

Stderr output
<core_client_version>7.0.25</core_client_version>
<![CDATA[
<message>
Este archivo ya existe. (0x50) - exit code 80 (0x50)
</message>

or, the same message in English:

<![CDATA[
<message>
The file exists.
(0x50) - exit code 80 (0x50)
</message>


...then it is a known bug when suspending/exiting/resuming in the 8.15 application. Supposedly, MJH has fixed it in the 8.20 application (which has been deployed to the CUDA 6.0 tasks on the Short queue), but for reasons beyond my comprehension, he has not yet deployed the new version to the Long queue.

Current GPUGrid applications:
http://www.gpugrid.net/apps.php

See here:
http://www.gpugrid.net/forum_thread.php?id=3621
... where I reported the problem over 6 weeks ago!

MJH: Still awaiting the fix on Long queue! :(
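
To spot that signature without clicking through every task, one could scan a saved copy of a task's Stderr output; a small Python sketch (the stderr.txt file name is an assumption: save the task page's Stderr output section into it):

# Sketch: detect the exit-code-80 ("The file exists." / 0x50) signature of
# the 8.15 suspend/resume bug in a locally saved copy of a task's stderr.
# The default file name "stderr.txt" is an assumption.
import re, sys

path = sys.argv[1] if len(sys.argv) > 1 else "stderr.txt"
text = open(path, encoding="utf-8", errors="replace").read()

if re.search(r"exit code 80 \(0x50\)", text):
    print("known 8.15 suspend/resume bug: exit code 80 (file exists)")
else:
    print("exit code 80 signature not found")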
