All ADRIA_2OV5_CLOSED, ADRIA_2OV5_CONF_CLOSED & ADRIA_2OV5_OPEN failing
| Author | Message |
|---|---|
Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0
|
All ADRIA_2OV5_CLOSED, ADRIA_2OV5_CONF_CLOSED & ADRIA_2OV5_OPEN are failing. All ADRIA WUs with either CLOSED or OPEN in the name are failing, and there are enough of them that some machines are being disallowed WUs until tomorrow. Would someone please cancel these bad WUs and please do a little testing before releasing large numbers of workunits? |
|
Joined: 28 Mar 09 Posts: 490 Credit: 11,731,645,728 RAC: 57
|
I had 10 tasks fail on me today:
e1s12_0-ADRIA_2OV5_CLOSED2-0-1-RND0076_3 11710442 263612 29 Aug 2016 | 19:56:28 UTC 29 Aug 2016 | 20:30:34 UTC Error while computing 2.48 0.16 --- Long runs (8-12 hours on fastest card) v8.48 (cuda65)
e1s28_2-ADRIA_2OV5_CLOSED1-0-1-RND5679_2 11710388 263612 29 Aug 2016 | 16:11:41 UTC 29 Aug 2016 | 17:03:24 UTC Error while computing 2.16 0.16 --- Long runs (8-12 hours on fastest card) v8.48 (cuda65)
e1s28_2-ADRIA_2OV5_OPEN2-0-1-RND8992_2 11710322 30790 29 Aug 2016 | 14:51:14 UTC 29 Aug 2016 | 15:51:19 UTC Error while computing 2.09 0.17 --- Long runs (8-12 hours on fastest card) v8.48 (cuda65)
e1s37_2-ADRIA_2OV5_OPEN2-0-1-RND0901_1 11710331 30790 29 Aug 2016 | 13:05:48 UTC 29 Aug 2016 | 14:00:43 UTC Error while computing 3.08 0.13 --- Long runs (8-12 hours on fastest card) v8.48 (cuda65)
e1s16_1-ADRIA_2OV5_OPEN1-0-1-RND6003_1 11710260 30790 29 Aug 2016 | 12:49:55 UTC 29 Aug 2016 | 13:05:47 UTC Error while computing 3.70 0.16 --- Long runs (8-12 hours on fastest card) v8.48 (cuda65)
e1s50_3-ADRIA_2OV5_CLOSED1-0-1-RND0724_0 11710422 263612 29 Aug 2016 | 15:40:14 UTC 29 Aug 2016 | 16:11:41 UTC Error while computing 2.42 0.20 --- Long runs (8-12 hours on fastest card) v8.48 (cuda65)
e1s46_3-ADRIA_2OV5_CLOSED1-0-1-RND8830_0 11710416 30790 29 Aug 2016 | 14:00:43 UTC 29 Aug 2016 | 14:51:51 UTC Error while computing 2.02 0.19 --- Long runs (8-12 hours on fastest card) v8.48 (cuda65)
e1s21_1-ADRIA_2OV5_OPEN2-0-1-RND0203_0 11710315 263612 29 Aug 2016 | 11:22:16 UTC 29 Aug 2016 | 11:23:45 UTC Error while computing 2.09 0.19 --- Long runs (8-12 hours on fastest card) v8.48 (cuda65)
e1s2_0-ADRIA_2OV5_OPEN1-0-1-RND7002_0 11710246 263612 29 Aug 2016 | 11:33:31 UTC 29 Aug 2016 | 12:07:18 UTC Error while computing 2.47 0.22 --- Long runs (8-12 hours on fastest card) v8.48 (cuda65)
e1s38_2-ADRIA_2OV5_CONF_CLOSED2-0-1-RND5866_0 11710244 263612 29 Aug 2016 | 11:23:45 UTC 29 Aug 2016 | 11:24:49 UTC Error while computing 2.08 0.14 --- Long runs (8-12 hours on fastest card) v8.48 (cuda65)
All with the same error:
ERROR: file mdioload.cpp line 229: Error reading parmtop file
16:33:42 (5512): called boinc_finish
These tasks should be canceled immediately. This is another reason why we need to have long runs broken down into subcategories, so we can temporarily click "no" on a bad batch of tasks and not have a perfectly good computer labeled bad for too many errors and cut off from further tasks. http://www.gpugrid.net/forum_thread.php?id=4366&nowrap=true#44307 |
|
Joined: 10 Nov 13 Posts: 101 Credit: 15,773,211,122 RAC: 0
|
Multiple WUs are failing on several of my systems. I will set no new work until the problem is resolved. |
Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0
|
Some of my GPUs are not getting work because of these flawed WUs. Ridiculous. |
Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0
|
One of my machines with 2 GPUs has gotten 12 of these bad ADRIA WUs today and now won't get work. What's worse, it says it's now allowed only 1 WU/day. WOULD SOMEBODY PLEASE FIX THIS! |
caffeineyellow5 Joined: 30 Jul 14 Posts: 225 Credit: 2,658,976,345 RAC: 0
|
Same here.
e1s1_0-ADRIA_2OV5_OPEN1-0-1-RND9061_6 11710245 335350 30 Aug 2016 | 5:10:13 UTC 30 Aug 2016 | 5:11:01 UTC Error while computing 2.07 0.20 --- Long runs (8-12 hours on fastest card) v8.48 (cuda65)
e1s40_3-ADRIA_2OV5_CLOSED1-0-1-RND6672_7 11710407 335350 30 Aug 2016 | 5:11:49 UTC 30 Aug 2016 | 5:13:37 UTC Error while computing 2.06 0.19 --- Long runs (8-12 hours on fastest card) v8.48 (cuda65)
e1s30_2-ADRIA_2OV5_OPEN2-0-1-RND6447_2 11710324 335350 30 Aug 2016 | 5:11:01 UTC 30 Aug 2016 | 5:11:49 UTC Error while computing 2.08 0.20 --- Long runs (8-12 hours on fastest card) v8.48 (cuda65)
e1s25_1-ADRIA_2OV5_CLOSED1-0-1-RND3264_7 11710383 335350 30 Aug 2016 | 5:07:38 UTC 30 Aug 2016 | 5:09:39 UTC Error while computing 2.04 0.19 --- Long runs (8-12 hours on fastest card) v8.48 (cuda65)
e1s39_3-ADRIA_2OV5_CLOSED2-0-1-RND4880_3 11710483 335350 30 Aug 2016 | 3:06:15 UTC 30 Aug 2016 | 4:01:06 UTC Error while computing 2.04 0.19 --- Long runs (8-12 hours on fastest card) v8.48 (cuda65)
e1s27_2-ADRIA_2OV5_OPEN2-0-1-RND6642_0 11710321 335350 29 Aug 2016 | 11:29:15 UTC 29 Aug 2016 | 11:31:19 UTC Error while computing 2.07 0.20 --- Long runs (8-12 hours on fastest card) v8.48 (cuda65)
e1s36_2-ADRIA_2OV5_OPEN1-0-1-RND4404_0 11710280 335350 29 Aug 2016 | 10:30:10 UTC 29 Aug 2016 | 10:32:04 UTC Error while computing 2.09 0.23 --- Long runs (8-12 hours on fastest card) v8.48 (cuda65)
e1s25_1-ADRIA_2OV5_OPEN1-0-1-RND0409_0 11710269 335350 29 Aug 2016 | 10:34:07 UTC 29 Aug 2016 | 11:29:15 UTC Error while computing 2.06 0.19 --- Long runs (8-12 hours on fastest card) v8.48 (cuda65)
e1s9_0-ADRIA_2OV5_OPEN1-0-1-RND3309_0 11710253 335350 29 Aug 2016 | 10:32:04 UTC 29 Aug 2016 | 10:34:07 UTC Error while computing 2.06 0.22 --- Long runs (8-12 hours on fastest card) v8.48 (cuda65)
Long runs (8-12 hours on fastest card) 8.48 windows_intelx86 (cuda65)
Number of tasks completed: 10
Max tasks per day: 9
Number of tasks today: 15
Consecutive valid tasks: 1
Luckily I caught some Gianni and Gerard tasks to fill this machine before it ran out of allowed tasks or dropped to zero!
1 Corinthians 9:16 "For though I preach the gospel, I have nothing to glory of: for necessity is laid upon me; yea, woe is unto me, if I preach not the gospel!" Ephesians 6:18-20, please ;-) http://tbc-pa.org |
|
Joined: 5 Jan 09 Posts: 670 Credit: 2,498,095,550 RAC: 0
|
"One of my machines with 2 GPUs has gotten 12 of these bad ADRIA WUs today and now won't get work. What's worse, it says it's now allowed only 1 WU/day. WOULD SOMEBODY PLEASE FIX THIS!" When your quota says 1 WU/day you will actually get 2 a day. You are compounding the problem by aborting WUs, as these also reduce your daily quota. Adria is going for it again with ADRIA_2OV5_AMBER_CLOSED and ADRIA_2OV5_AMBER_OPEN. Make sure you don't abort these, as they may actually work. 3rd time lucky. |
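To make the quota point concrete, here is a toy model in Python. It is only a sketch under loud assumptions: the thread says errors and aborts lower the daily quota, but the exact rules (decrement by one, floor of 1, doubling on a valid result) and the names `update_quota` and `project_max` are hypothetical and not taken from the BOINC source.

```python
def update_quota(quota: int, outcome: str, project_max: int = 9) -> int:
    """Toy model of a per-host daily task quota.

    Assumptions (not confirmed in this thread): an error or an abort
    lowers the quota by one, never below 1, and a valid result doubles
    it, capped at a hypothetical project_max.
    """
    if outcome in ("error", "aborted"):
        return max(1, quota - 1)
    if outcome == "valid":
        return min(project_max, quota * 2)
    raise ValueError(f"unknown outcome: {outcome!r}")


# Twelve bad WUs in a row, starting from a hypothetical quota of 9.
quota = 9
for outcome in ["error"] * 12:
    quota = update_quota(quota, outcome)
print(quota)  # prints 1 under the assumptions above
```

Under this model an aborted task costs exactly as much as a failed one, which is why aborting a bad batch by hand does not protect the host.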
Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0
|
"One of my machines with 2 GPUs has gotten 12 of these bad ADRIA WUs today and now won't get work. What's worse, it says it's now allowed only 1 WU/day. WOULD SOMEBODY PLEASE FIX THIS!"
GPUGRID 8/30/2016 12:43:17 AM Requesting new tasks for NVIDIA GPU
GPUGRID 8/30/2016 12:43:22 AM Scheduler request completed: got 0 new tasks
GPUGRID 8/30/2016 12:43:22 AM No tasks sent
GPUGRID 8/30/2016 12:43:22 AM No tasks are available for Long runs (8-12 hours on fastest card)
GPUGRID 8/30/2016 12:43:22 AM This computer has finished a daily quota of 1 tasks
Well, at least SETI got some GPU time and their WUs actually run. Just got my first ADRIA_2OV5_AMBER. After the last 2 batches I don't have a lot of confidence, but we'll see. It's crazy that they just let these bad batches run without cancelling them when they know they're flawed. 12 bad WUs in a day is ridiculous. |
Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0
|
"Just got my first ADRIA_2OV5_AMBER." After 10 minutes it's still running. That's a good sign. Looks to be about the same length as the old ADRIA WUs that would complete (pre OPEN & CLOSED). A bit longer than the recent GERARD WUs on my machines. |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
About 1 hour ago, one of my GTX980Ti hosts downloaded e1s24_1-ADRIA_2OV5_AMBER_CLOSED1-0-1-RND8355. It is still running (12.45% done after 56 minutes), but why is it called "CLOSED1"? |
|
Joined: 10 Nov 13 Posts: 101 Credit: 15,773,211,122 RAC: 0
|
It looks like the jobs have been re-worked by Amber and are being sent out again. Here is some info from the Project Status page:
ADRIA_2OV5_AMBER_CLOSED 14 86 0 just released...
ADRIA_2OV5_AMBER_OPEN 13 87 0 just released...
ADRIA_2OV5_CLOSED 0 42 0 100%
ADRIA_2OV5_OPEN 0 41 0 100%
The remaining original jobs will probably still fail, but the new ones look promising. I will fire things back up and see how it goes. |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
I have e1s45_3-ADRIA_2OV5_AMBER_OPEN2-0-1-RND2778_1, which seems to be running well (with low CPU usage) on a Windows GTX 970. My wingmate blew _0 away with "#SWAN: FATAL: cannot find image for module [.nonbonded.cu.] for device version 610", so no Linux GTX 1080 support yet. |
|
Joined: 10 Nov 13 Posts: 101 Credit: 15,773,211,122 RAC: 0
|
I have one, e1s16_1-ADRIA_2OV5_AMBER_OPEN1-0-1-RND6090_1. It has been running for a little over 7 hrs. It says it has a day left to go, so it looks good so far. Its CPU usage averages 0.7% and GPU load is about 70% on a Quadro M4000. |
caffeineyellow5 Joined: 30 Jul 14 Posts: 225 Credit: 2,658,976,345 RAC: 0
|
"About 1 hour ago, one of my GTX980Ti hosts downloaded:" The tasks are named for their function, meaning the function of the test or the desired result, so OPEN and CLOSED probably refer to the state of a protein, an interacting substance, or a desired effect, and not to the task type on our end. The name is an indicator for the result, so the servers know how to receive and store it and the scientists know how to classify and analyze it.
1 Corinthians 9:16 "For though I preach the gospel, I have nothing to glory of: for necessity is laid upon me; yea, woe is unto me, if I preach not the gospel!" Ephesians 6:18-20, please ;-) http://tbc-pa.org |
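Since the batch is encoded in the task name, failures can be tallied per batch. Below is a minimal Python sketch. It assumes, purely from the task names posted in this thread and not from any documented format, that the second dash-separated field is the batch name and that its trailing digits mark a sub-batch; `batch_of` and `failures_per_batch` are hypothetical helper names.

```python
import re
from collections import Counter


def batch_of(task_name: str) -> str:
    """Return the batch part of a task name, e.g. 'ADRIA_2OV5_CLOSED'.

    Assumption: names look like '<prefix>-<BATCH><n>-<i>-<j>-RND<id>',
    as in the tasks posted in this thread.
    """
    fields = task_name.split("-")
    if len(fields) < 2:
        raise ValueError(f"unexpected task name: {task_name!r}")
    # Drop the trailing sub-batch digits (CLOSED2 -> CLOSED).
    return re.sub(r"\d+$", "", fields[1])


def failures_per_batch(task_names):
    """Count failed tasks per batch name."""
    return Counter(batch_of(name) for name in task_names)


# A few of the failed tasks reported earlier in the thread.
failed = [
    "e1s12_0-ADRIA_2OV5_CLOSED2-0-1-RND0076_3",
    "e1s28_2-ADRIA_2OV5_OPEN2-0-1-RND8992_2",
    "e1s38_2-ADRIA_2OV5_CONF_CLOSED2-0-1-RND5866_0",
    "e1s50_3-ADRIA_2OV5_CLOSED1-0-1-RND0724_0",
]
print(failures_per_batch(failed))
```

A tally like this would make it easy to see that the failures cluster in the OPEN and CLOSED batches rather than across all long runs.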
Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0
|
"Just got my first ADRIA_2OV5_AMBER." This AMBER_OPEN finished in just under 27 hours on a 750Ti. Unfortunately just had an ADRIA_2OV5_AMBER_CLOSED crash and actually cause the machine to reboot. The WU on the other GPU crashed on restart and the ADRIA_2OV5_AMBER_CLOSED restarted from zero. I aborted it. Maybe it'll run on a different GPU... |
|
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0
|
"Unfortunately just had an ADRIA_2OV5_AMBER_CLOSED crash and actually cause the machine to reboot." I have been debating with myself whether a bad WU can cause a machine to crash. It seems to have happened at times, but it is hard to pin down. I often find that it is really a hardware problem, but you never know. |
caffeineyellow5 Joined: 30 Jul 14 Posts: 225 Credit: 2,658,976,345 RAC: 0
|
"Unfortunately just had an ADRIA_2OV5_AMBER_CLOSED crash and actually cause the machine to reboot." I've had it happen several times across all my systems except my laptop. I think the fact that there is a battery in it makes it more stable, so it does not shut down. On a regular PC, even with a backup, a software error (which a failed WU might be considered) could surely send the OS into a soft crash that ends in a full reboot. (What I mean by a soft crash is one that affects other parts of the OS, like explorer.exe or some dependent system function.) One program affects the next; some of those programs can just restart as per OS function, like explorer.exe, but others need a reboot and some can't wait, like the Winlogon process. I suspect the same about Linux systems, though I am speaking in Windows terms. If a failure only restarts explorer.exe, for example, the task has failed but affects no other tasks or the system. If it affects a driver or critical system process that can be reloaded, it could fail all the tasks or just the other one on the same GPU. And if it is critical to the OS, it could fail all tasks and reboot. Even when a program does all its work "inside a bubble" that does not affect the rest of the system, the program itself is not a bubble unto itself; it still draws on other system resources and processes. You can't isolate anything on a system that runs other things, including an OS. About the laptop too, I can't say for certain that it never crashed because of a WU. It's just that I never noticed it happen that way. It is as vulnerable to a software error as any other PC without a battery. I just don't expect too much from it, so I don't check BOINC on it as much as I do on the stronger PCs. I can't say I never woke up to find it on a login screen after some shutdown and figured it was a Windows update or something, but I can't say for certain it was never a failed task either. |
Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0
|
"Unfortunately just had an ADRIA_2OV5_AMBER_CLOSED crash and actually cause the machine to reboot." The battery (as in your laptop) will only protect it from power glitches, not from the "software errors" you mention. Simply as a point of information, I've built hundreds (maybe thousands) of PCs since the days of the original 8088/8086 (not to mention V20 & V30) (and Apple II clones and CP/M before that). At the peak I sometimes built 2 or 3 per week. Sheesh, I'm showing my age. Point is that I test them. I know how to spot hardware errors. Almost all my crunchers run only a minimal set of programs to support BOINC. The one area that could be a possibility is that new higher usage WUs may be stressing a GPU further than previous WUs. If I suspect that, I downclock the GPU. None of my GPUs are OCed except for factory OC, and many are downclocked. I did have a "validate" error (as you ask about in the other thread) a while ago on one 750Ti and downclocked it a bit. Haven't had one since. My errors are mostly caused by power glitches, which are too common around here. The power generally goes out for a second or two, enough to reboot the computers. Hopefully the next app will be more fault tolerant. The apps from most other projects do not cause their WUs to error in my experience. What I did notice in tracking down your box with validate errors (for Richard in the other thread) is that your GTX 980Ti and 980 GPUs are throwing quite a few errors on WUs that should run on them (the GTX 980Ti cards seem to be failing the long GIANNIs, sometimes after running for a long time). You may wish to try downclocking. A voltage increase may also work, but that causes more heat and power draw for any extra speed that you get and may also adversely affect GPU life. Best of luck getting it sorted out! |
|
Joined: 3 Nov 15 Posts: 38 Credit: 6,768,093 RAC: 0
|
I would say the most fragile thing in the OS is the nVidia GPU drivers. A battery does not protect you from software errors. An OS by design should not crash from an error in an application. Applications are protected from each other, but kernel code is not protected from other kernel code, and if an error occurs in the kernel (or a driver) then the machine (in the best case) reboots. Winlogon (in your example) is a process that the OS relies upon, and the OS will crash if Winlogon crashes; in Linux there is the init process.
|
©2025 Universitat Pompeu Fabra