Message boards : Number crunching : All ADRIA_2OV5_CLOSED, ADRIA_2OV5_CONF_CLOSED & ADRIA_2OV5_OPEN failing

Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 44321 - Posted: 29 Aug 2016 | 16:35:38 UTC
Last modified: 29 Aug 2016 | 16:37:24 UTC

All ADRIA_2OV5_CLOSED, ADRIA_2OV5_CONF_CLOSED & ADRIA_2OV5_OPEN are failing.

All ADRIA WUs with either CLOSED or OPEN in the name are failing, and there are enough of them that some machines are being disallowed WUs until tomorrow. Would someone please cancel these bad WUs, and please do a little testing before releasing large numbers of workunits?

Bedrich Hajek
Joined: 28 Mar 09
Posts: 467
Credit: 8,194,346,966
RAC: 10,534,102
Level
Tyr
Message 44329 - Posted: 29 Aug 2016 | 21:40:43 UTC - in response to Message 44321.

I had 10 tasks fail on me today:


e1s12_0-ADRIA_2OV5_CLOSED2-0-1-RND0076_3 11710442 263612 29 Aug 2016 | 19:56:28 UTC 29 Aug 2016 | 20:30:34 UTC Error while computing 2.48 0.16 --- Long runs (8-12 hours on fastest card) v8.48 (cuda65)
e1s28_2-ADRIA_2OV5_CLOSED1-0-1-RND5679_2 11710388 263612 29 Aug 2016 | 16:11:41 UTC 29 Aug 2016 | 17:03:24 UTC Error while computing 2.16 0.16 --- Long runs (8-12 hours on fastest card) v8.48 (cuda65)
e1s28_2-ADRIA_2OV5_OPEN2-0-1-RND8992_2 11710322 30790 29 Aug 2016 | 14:51:14 UTC 29 Aug 2016 | 15:51:19 UTC Error while computing 2.09 0.17 --- Long runs (8-12 hours on fastest card) v8.48 (cuda65)
e1s37_2-ADRIA_2OV5_OPEN2-0-1-RND0901_1 11710331 30790 29 Aug 2016 | 13:05:48 UTC 29 Aug 2016 | 14:00:43 UTC Error while computing 3.08 0.13 --- Long runs (8-12 hours on fastest card) v8.48 (cuda65)
e1s16_1-ADRIA_2OV5_OPEN1-0-1-RND6003_1 11710260 30790 29 Aug 2016 | 12:49:55 UTC 29 Aug 2016 | 13:05:47 UTC Error while computing 3.70 0.16 --- Long runs (8-12 hours on fastest card) v8.48 (cuda65)
e1s50_3-ADRIA_2OV5_CLOSED1-0-1-RND0724_0 11710422 263612 29 Aug 2016 | 15:40:14 UTC 29 Aug 2016 | 16:11:41 UTC Error while computing 2.42 0.20 --- Long runs (8-12 hours on fastest card) v8.48 (cuda65)
e1s46_3-ADRIA_2OV5_CLOSED1-0-1-RND8830_0 11710416 30790 29 Aug 2016 | 14:00:43 UTC 29 Aug 2016 | 14:51:51 UTC Error while computing 2.02 0.19 --- Long runs (8-12 hours on fastest card) v8.48 (cuda65)
e1s21_1-ADRIA_2OV5_OPEN2-0-1-RND0203_0 11710315 263612 29 Aug 2016 | 11:22:16 UTC 29 Aug 2016 | 11:23:45 UTC Error while computing 2.09 0.19 --- Long runs (8-12 hours on fastest card) v8.48 (cuda65)
e1s2_0-ADRIA_2OV5_OPEN1-0-1-RND7002_0 11710246 263612 29 Aug 2016 | 11:33:31 UTC 29 Aug 2016 | 12:07:18 UTC Error while computing 2.47 0.22 --- Long runs (8-12 hours on fastest card) v8.48 (cuda65)
e1s38_2-ADRIA_2OV5_CONF_CLOSED2-0-1-RND5866_0 11710244 263612 29 Aug 2016 | 11:23:45 UTC 29 Aug 2016 | 11:24:49 UTC Error while computing 2.08 0.14 --- Long runs (8-12 hours on fastest card) v8.48 (cuda65)
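The failures listed above span several batch names, so it can help to tally them per batch before reporting. Below is a minimal Python sketch of that tally. The task-name layout it assumes (`<prefix>-<batch>-<n>-<n>-RND<digits>` with an optional `_<attempt>` suffix) is inferred only from the names pasted in this thread, not from project documentation, and the function names `batch_family` and `count_failures` are hypothetical.

```python
import re
from collections import Counter

# Assumed layout, inferred only from the names pasted in this thread:
#   <prefix>-<batch>-<n>-<n>-RND<digits>[_<attempt>]
TASK_NAME = re.compile(
    r"^[^-\s]+-(?P<batch>[A-Z0-9_]+)-\d+-\d+-RND\d+(?:_\d+)?$"
)


def batch_family(task_name):
    """Return the batch name without its trailing index, or None if unparsable."""
    match = TASK_NAME.match(task_name)
    if match is None:
        return None
    # ADRIA_2OV5_CLOSED2 -> ADRIA_2OV5_CLOSED
    # (assumes every batch name ends in a numeric index, as in this thread)
    return re.sub(r"\d+$", "", match.group("batch"))


def count_failures(rows):
    """Count rows marked 'Error while computing', grouped by batch family."""
    counts = Counter()
    for row in rows:
        fields = row.split()
        if not fields or "Error while computing" not in row:
            continue
        family = batch_family(fields[0])
        if family is not None:
            counts[family] += 1
    return counts


if __name__ == "__main__":
    # Two rows copied from the listing above.
    sample = [
        "e1s12_0-ADRIA_2OV5_CLOSED2-0-1-RND0076_3 11710442 263612 "
        "29 Aug 2016 | 19:56:28 UTC 29 Aug 2016 | 20:30:34 UTC "
        "Error while computing 2.48 0.16 --- Long runs v8.48 (cuda65)",
        "e1s28_2-ADRIA_2OV5_OPEN2-0-1-RND8992_2 11710322 30790 "
        "29 Aug 2016 | 14:51:14 UTC 29 Aug 2016 | 15:51:19 UTC "
        "Error while computing 2.09 0.17 --- Long runs v8.48 (cuda65)",
    ]
    for family, n in count_failures(sample).most_common():
        print(family, n)
```

Rows copied from the task page can be pasted in as a list of strings; anything that does not look like a task row is skipped rather than raising.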


All with the same error:

ERROR: file mdioload.cpp line 229: Error reading parmtop file
16:33:42 (5512): called boinc_finish

These tasks should be canceled immediately.

This is another reason why we need the long runs broken down into subcategories, so we can temporarily opt out of a bad batch of tasks and not have a perfectly good computer labeled bad for too many errors and denied further tasks.


http://www.gpugrid.net/forum_thread.php?id=4366&nowrap=true#44307


jjch
Joined: 10 Nov 13
Posts: 98
Credit: 15,288,150,388
RAC: 1,732,962
Level
Trp
Message 44334 - Posted: 30 Aug 2016 | 4:38:37 UTC

Multiple WUs are failing on several of my systems. I will set "no new work" until the problem is resolved.

Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 44335 - Posted: 30 Aug 2016 | 4:58:03 UTC

Some of my GPUs are not getting work because of these flawed WUs. Ridiculous.

Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 44336 - Posted: 30 Aug 2016 | 5:31:06 UTC

One of my machines with 2 GPUs has gotten 12 of these bad ADRIA WUs today and now won't get work. What's worse, it says it's now allowed only 1 WU/day. WOULD SOMEBODY PLEASE FIX THIS!

caffeineyellow5
Joined: 30 Jul 14
Posts: 225
Credit: 2,658,976,345
RAC: 0
Level
Phe
Message 44337 - Posted: 30 Aug 2016 | 7:15:16 UTC

Same here.

e1s1_0-ADRIA_2OV5_OPEN1-0-1-RND9061_6 11710245 335350 30 Aug 2016 | 5:10:13 UTC 30 Aug 2016 | 5:11:01 UTC Error while computing 2.07 0.20 --- Long runs (8-12 hours on fastest card) v8.48 (cuda65)
e1s40_3-ADRIA_2OV5_CLOSED1-0-1-RND6672_7 11710407 335350 30 Aug 2016 | 5:11:49 UTC 30 Aug 2016 | 5:13:37 UTC Error while computing 2.06 0.19 --- Long runs (8-12 hours on fastest card) v8.48 (cuda65)
e1s30_2-ADRIA_2OV5_OPEN2-0-1-RND6447_2 11710324 335350 30 Aug 2016 | 5:11:01 UTC 30 Aug 2016 | 5:11:49 UTC Error while computing 2.08 0.20 --- Long runs (8-12 hours on fastest card) v8.48 (cuda65)
e1s25_1-ADRIA_2OV5_CLOSED1-0-1-RND3264_7 11710383 335350 30 Aug 2016 | 5:07:38 UTC 30 Aug 2016 | 5:09:39 UTC Error while computing 2.04 0.19 --- Long runs (8-12 hours on fastest card) v8.48 (cuda65)
e1s39_3-ADRIA_2OV5_CLOSED2-0-1-RND4880_3 11710483 335350 30 Aug 2016 | 3:06:15 UTC 30 Aug 2016 | 4:01:06 UTC Error while computing 2.04 0.19 --- Long runs (8-12 hours on fastest card) v8.48 (cuda65)
e1s27_2-ADRIA_2OV5_OPEN2-0-1-RND6642_0 11710321 335350 29 Aug 2016 | 11:29:15 UTC 29 Aug 2016 | 11:31:19 UTC Error while computing 2.07 0.20 --- Long runs (8-12 hours on fastest card) v8.48 (cuda65)
e1s36_2-ADRIA_2OV5_OPEN1-0-1-RND4404_0 11710280 335350 29 Aug 2016 | 10:30:10 UTC 29 Aug 2016 | 10:32:04 UTC Error while computing 2.09 0.23 --- Long runs (8-12 hours on fastest card) v8.48 (cuda65)
e1s25_1-ADRIA_2OV5_OPEN1-0-1-RND0409_0 11710269 335350 29 Aug 2016 | 10:34:07 UTC 29 Aug 2016 | 11:29:15 UTC Error while computing 2.06 0.19 --- Long runs (8-12 hours on fastest card) v8.48 (cuda65)
e1s9_0-ADRIA_2OV5_OPEN1-0-1-RND3309_0 11710253 335350 29 Aug 2016 | 10:32:04 UTC 29 Aug 2016 | 10:34:07 UTC Error while computing 2.06 0.22 --- Long runs (8-12 hours on fastest card) v8.48 (cuda65)

Long runs (8-12 hours on fastest card) 8.48 windows_intelx86 (cuda65)
Number of tasks completed 10
Max tasks per day 9
Number of tasks today 15
Consecutive valid tasks 1

Luckily I caught some Gianni and Gerard tasks to fill this machine before it ran out of allowed tasks or dropped to zero!
____________
1 Corinthians 9:16 "For though I preach the gospel, I have nothing to glory of: for necessity is laid upon me; yea, woe is unto me, if I preach not the gospel!"
Ephesians 6:18-20, please ;-)
http://tbc-pa.org

Betting Slip
Joined: 5 Jan 09
Posts: 670
Credit: 2,498,095,550
RAC: 0
Level
Phe
Message 44355 - Posted: 31 Aug 2016 | 9:41:03 UTC - in response to Message 44336.
Last modified: 31 Aug 2016 | 10:07:23 UTC

One of my machines with 2 GPUs has gotten 12 of these bad ADRIA WUs today and now won't get work. What's worse, it says it's now allowed only 1 WU/day. WOULD SOMEBODY PLEASE FIX THIS!


When your quota says 1 WU/day you will actually get 2 a day. You are compounding the problem by aborting WUs, as these also reduce your daily quota.


Adria is going for it again with

ADRIA_2OV5_AMBER_CLOSED
ADRIA_2OV5_AMBER_OPEN

Make sure you don't abort these as they may actually work. 3rd time lucky.

Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 44357 - Posted: 31 Aug 2016 | 13:39:03 UTC - in response to Message 44355.

One of my machines with 2 GPUs has gotten 12 of these bad ADRIA WUs today and now won't get work. What's worse, it says it's now allowed only 1 WU/day. WOULD SOMEBODY PLEASE FIX THIS!


When your quota says 1 WU/day you will actually get 2 a day.

GPUGRID 8/30/2016 12:43:17 AM Requesting new tasks for NVIDIA GPU
GPUGRID 8/30/2016 12:43:22 AM Scheduler request completed: got 0 new tasks
GPUGRID 8/30/2016 12:43:22 AM No tasks sent
GPUGRID 8/30/2016 12:43:22 AM No tasks are available for Long runs (8-12 hours on fastest card)
GPUGRID 8/30/2016 12:43:22 AM This computer has finished a daily quota of 1 tasks
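For anyone who wants to spot this condition across several hosts, here is a minimal Python sketch that scans event-log text for the quota message shown above. It assumes the line layout exactly as pasted in this post (project name, US-style date, 12-hour time, then the message); other BOINC Manager versions or locales may format the log differently, and the function name `quota_hits` is hypothetical.

```python
import re
from datetime import datetime

# Layout assumed from the log lines pasted above:
#   <project> <M/D/YYYY> <h:mm:ss AM|PM> <message>
QUOTA_LINE = re.compile(
    r"^(?P<project>\S+)\s+"
    r"(?P<stamp>\d{1,2}/\d{1,2}/\d{4}\s+\d{1,2}:\d{2}:\d{2}\s+[AP]M)\s+"
    r"This computer has finished a daily quota of (?P<quota>\d+) tasks?\s*$"
)


def quota_hits(lines):
    """Return (project, timestamp, quota) for every daily-quota message found."""
    hits = []
    for line in lines:
        match = QUOTA_LINE.match(line.strip())
        if match is None:
            continue
        # Collapse runs of whitespace (e.g. tabs) so strptime sees single spaces.
        stamp = " ".join(match.group("stamp").split())
        # %p relies on an English-style AM/PM locale.
        when = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")
        hits.append((match.group("project"), when, int(match.group("quota"))))
    return hits


if __name__ == "__main__":
    log = [
        "GPUGRID 8/30/2016 12:43:22 AM No tasks sent",
        "GPUGRID 8/30/2016 12:43:22 AM This computer has finished a daily quota of 1 tasks",
    ]
    for project, when, quota in quota_hits(log):
        print(project, when.isoformat(), quota)
```

Lines that do not carry the quota message are ignored, so the whole event log can be passed in unfiltered.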

Well at least SETI got some GPU time and their WUs actually run.

Just got my first ADRIA_2OV5_AMBER. After the last 2 batches I don't have a lot of confidence but we'll see. It's crazy that they just let these bad batches run without cancelling them when they know they're flawed. 12 bad WUs in a day is ridiculous.

Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 44358 - Posted: 31 Aug 2016 | 14:46:59 UTC - in response to Message 44357.

Just got my first ADRIA_2OV5_AMBER.

After 10 minutes it's still running. That's a good sign. It looks to be about the same length as the old ADRIA WUs that would complete (pre OPEN & CLOSED). A bit longer than the recent GERARD WUs on my machines.

Erich56
Joined: 1 Jan 15
Posts: 1090
Credit: 6,603,906,926
RAC: 18,783,925
Level
Tyr
Message 44359 - Posted: 31 Aug 2016 | 16:27:22 UTC
Last modified: 31 Aug 2016 | 16:30:04 UTC

About 1 hour ago, one of my GTX980Ti hosts downloaded:

e1s24_1-ADRIA_2OV5_AMBER_CLOSED1-0-1-RND8355

It is still running (12.45% done after 56 minutes), but why is it called "CLOSED1"?

jjch
Joined: 10 Nov 13
Posts: 98
Credit: 15,288,150,388
RAC: 1,732,962
Level
Trp
Message 44360 - Posted: 31 Aug 2016 | 17:51:59 UTC

It looks like the jobs have been reworked by Amber and are being sent out again. Here is some info from the Project Status page.

ADRIA_2OV5_AMBER_CLOSED 14 86 0 just released...
ADRIA_2OV5_AMBER_OPEN 13 87 0 just released...
ADRIA_2OV5_CLOSED 0 42 0 100%
ADRIA_2OV5_OPEN 0 41 0 100%

The remaining original jobs probably will still fail but the new ones look promising. I will fire things back up and see how it goes.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1576
Credit: 5,605,386,851
RAC: 8,706,046
Level
Tyr
Message 44361 - Posted: 31 Aug 2016 | 19:18:05 UTC - in response to Message 44360.

I have e1s45_3-ADRIA_2OV5_AMBER_OPEN2-0-1-RND2778_1, which seems to be running well (with low CPU usage) on a windows GTX 970.

My wingmate blew _0 away with

#SWAN: FATAL: cannot find image for module [.nonbonded.cu.] for device version 610

- so no Linux GTX 1080 support yet.

jjch
Joined: 10 Nov 13
Posts: 98
Credit: 15,288,150,388
RAC: 1,732,962
Level
Trp
Message 44366 - Posted: 31 Aug 2016 | 22:55:24 UTC

I have one, e1s16_1-ADRIA_2OV5_AMBER_OPEN1-0-1-RND6090_1. It has been running for a little over 7 hrs. It says it has a day left to go, so it looks good so far. Its CPU usage is 0.7% on average and GPU load is about 70% on a Quadro M4000.

caffeineyellow5
Joined: 30 Jul 14
Posts: 225
Credit: 2,658,976,345
RAC: 0
Level
Phe
Message 44370 - Posted: 1 Sep 2016 | 9:02:26 UTC - in response to Message 44359.

About 1 hour ago, one of my GTX980Ti hosts downloaded:

e1s24_1-ADRIA_2OV5_AMBER_CLOSED1-0-1-RND8355

It is still running (12.45% done after 56 minutes), but why is it called "CLOSED1"?

The tasks are named based on their function, the function of the test or desired result, so OPEN and CLOSED probably refer to the state of a protein or interaction substance or desired effect and not the task type on our end. It is an indicator name for the result so the servers know how to receive and store it and how the scientists then have to classify and analyze it.
____________
1 Corinthians 9:16 "For though I preach the gospel, I have nothing to glory of: for necessity is laid upon me; yea, woe is unto me, if I preach not the gospel!"
Ephesians 6:18-20, please ;-)
http://tbc-pa.org

Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 44384 - Posted: 1 Sep 2016 | 18:34:17 UTC - in response to Message 44358.

Just got my first ADRIA_2OV5_AMBER.

After 10 minutes it's still running. That's a good sign. It looks to be about the same length as the old ADRIA WUs that would complete (pre OPEN & CLOSED). A bit longer than the recent GERARD WUs on my machines.

This AMBER_OPEN finished in just under 27 hours on a 750Ti. Unfortunately just had an ADRIA_2OV5_AMBER_CLOSED crash and actually cause the machine to reboot. The WU on the other GPU crashed on restart and the ADRIA_2OV5_AMBER_CLOSED restarted from zero. I aborted it. Maybe it'll run on a different GPU...

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Message 44385 - Posted: 1 Sep 2016 | 21:03:49 UTC - in response to Message 44384.

Unfortunately just had an ADRIA_2OV5_AMBER_CLOSED crash and actually cause the machine to reboot.

I have been debating with myself whether a bad WU can cause a machine to crash. It seems to have happened at times, but it is hard to pin down.
I often find that it is really a hardware problem, but you never know.

caffeineyellow5
Joined: 30 Jul 14
Posts: 225
Credit: 2,658,976,345
RAC: 0
Level
Phe
Message 44387 - Posted: 2 Sep 2016 | 6:14:40 UTC - in response to Message 44385.
Last modified: 2 Sep 2016 | 6:18:28 UTC

Unfortunately just had an ADRIA_2OV5_AMBER_CLOSED crash and actually cause the machine to reboot.

I have been debating with myself whether a bad WU can cause a machine to crash. It seems to have happened at times, but it is hard to pin down.
I often find that it is really a hardware problem, but you never know.

I've had it happen several times across all my systems except my laptop. I think the fact that there is a battery in it makes it more stable, so it does not shut down. On a regular PC, even one with a backup, a software error (which a failed WU might be considered) could surely send the OS into a soft crash that ends as a full reboot. (What I mean by a soft crash is when it affects other parts of the OS like explorer.exe or some dependent system function.) One program affects the next; some of those programs can just restart as per OS function, like explorer.exe, but others need a reboot and some can't wait, like the Winlogon process. I suspect the same about Linux systems, though I am speaking in Windows terms. If it does fail and restarts explorer.exe, for example, the task has failed but affects no other tasks or the system. If it affects a driver or critical system process that can be reloaded, it could fail all the tasks out or just the other one on the same GPU. And if it is critical to the OS, it could fail all tasks and reboot. Even when a program does all its work 'inside a bubble' that does not affect the rest of the system, the program itself is not a bubble unto itself; it still draws on other system resources and processes. You can't isolate anything on a system that runs other things, including an OS.

About the laptop too, I can't say for certain that it never crashed for a WU. It's just that I never noticed it happen that way. It is as vulnerable to a software error as any other PC without a battery. I just don't expect too much from it, so I don't check BOINC on it as much as I do on the stronger PCs. I can't say I never woke up to find it on a login screen after some shutdown and figured it was a Windows update or something, but I can't say for certain it was never a failed task either.

Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 44403 - Posted: 2 Sep 2016 | 15:50:41 UTC - in response to Message 44387.

Unfortunately just had an ADRIA_2OV5_AMBER_CLOSED crash and actually cause the machine to reboot.

I've had it happen several times across all my systems except my laptop. I think the fact that there is a battery in it makes it more stable, so it does not shut down. On a regular PC, even one with a backup, a software error (which a failed WU might be considered) could surely send the OS into a soft crash that ends as a full reboot. (What I mean by a soft crash is when it affects other parts of the OS like explorer.exe or some dependent system function.) One program affects the next; some of those programs can just restart as per OS function, like explorer.exe, but others need a reboot and some can't wait, like the Winlogon process. I suspect the same about Linux systems, though I am speaking in Windows terms. If it does fail and restarts explorer.exe, for example, the task has failed but affects no other tasks or the system. If it affects a driver or critical system process that can be reloaded, it could fail all the tasks out or just the other one on the same GPU. And if it is critical to the OS, it could fail all tasks and reboot. Even when a program does all its work 'inside a bubble' that does not affect the rest of the system, the program itself is not a bubble unto itself; it still draws on other system resources and processes. You can't isolate anything on a system that runs other things, including an OS.

About the laptop too, I can't say for certain that it never crashed for a WU. It's just that I never noticed it happen that way. It is as vulnerable to a software error as any other PC without a battery. I just don't expect too much from it, so I don't check BOINC on it as much as I do on the stronger PCs. I can't say I never woke up to find it on a login screen after some shutdown and figured it was a Windows update or something, but I can't say for certain it was never a failed task either.

The battery (as in your laptop) will only protect it from power glitches, not from the "software errors" you mention. Simply as a point of information, I've built hundreds (maybe thousands) of PCs since the days of the original 8088/8086 (not to mention V20 & V30), and Apple II clones and CP/M machines before that. At the peak I sometimes built 2 or 3 per week. Sheesh, I'm showing my age. Point is that I test them. I know how to spot hardware errors. Almost all my crunchers run only a minimal set of programs to support BOINC. The one area that could be a possibility is that new higher-usage WUs may be stressing a GPU further than previous WUs. If I suspect that, I downclock the GPU. None of my GPUs are OCed except for factory OC, and many are downclocked. I did have a "validate" error (as you ask about in the other thread) a while ago on one 750Ti and downclocked it a bit. Haven't had one since. My errors are mostly caused by power glitches, which are too common around here. The power generally goes out for a second or two, enough to reboot the computers. Hopefully the next app will be more fault tolerant. The apps from most other projects do not cause their WUs to error in my experience.

What I did notice in tracking down your box with validate errors (for Richard in the other thread), is that your GTX 980Ti and 980 GPUs are throwing quite a few errors on WUs that should run on them (the GTX 980Ti cards seem to be failing the long GIANNIs, sometimes after running for a long time). You may wish to try downclocking. Voltage increase may also work but that causes more heat and power draw for any extra speed that you get and may also adversely affect GPU life. Best of luck getting it sorted out!

Tomas Brada
Joined: 3 Nov 15
Posts: 38
Credit: 6,768,093
RAC: 0
Level
Ser
Message 44496 - Posted: 12 Sep 2016 | 20:18:53 UTC - in response to Message 44387.

I would say the most fragile things in the OS are the nVidia GPU drivers.
A battery does not protect you from software errors.
An OS by design should not crash from an error in an application. Applications are protected from each other, but the OS kernel is not protected from itself, and if an error occurs in the kernel (or a driver) then the machine (in the best case) reboots.
Winlogon (in your example) is a process that the OS relies upon, and the OS will crash if Winlogon crashes; in Linux the equivalent is the init process.