acemdbeta application - discussion

Richard Haselgrove
Message 33952 - Posted: 21 Nov 2013, 12:19:38 UTC - in response to Message 33949.  

OK, now I'm awake, I've checked the logs for those two tasks, and the sequence of events is as I surmised.

20-Nov-2013 16:50:15 [---] NOTICES::write: sending notice 6
20-Nov-2013 17:03:31 [---] Starting BOINC client version 7.2.28 for windows_x86_64

That was the host crash interval.

20-Nov-2013 16:43:15 [GPUGRID] Starting task 1-KLAUDE_6429-1-2-RND1937_0 using acemdbeta version 815 (cuda55) in slot 7
20-Nov-2013 17:04:06 [GPUGRID] Restarting task 1-KLAUDE_6429-1-2-RND1937_0 using acemdbeta version 815 (cuda55) in slot 7
20-Nov-2013 17:04:14 [GPUGRID] Task 1-KLAUDE_6429-1-2-RND1937_0 exited with zero status but no 'finished' file
20-Nov-2013 17:04:14 [GPUGRID] Restarting task 1-KLAUDE_6429-1-2-RND1937_0 using acemdbeta version 815 (cuda55) in slot 7
20-Nov-2013 17:04:16 [GPUGRID] [sched_op] Deferring communication for 00:01:39
20-Nov-2013 17:04:16 [GPUGRID] [sched_op] Reason: Unrecoverable error for task 1-KLAUDE_6429-1-2-RND1937_0
20-Nov-2013 17:04:16 [GPUGRID] Computation for task 1-KLAUDE_6429-1-2-RND1937_0 finished

That task crashed, but in a 'benign' way (it didn't take the driver down with it).

20-Nov-2013 16:35:47 [GPUGRID] Starting task 95-KLAUDE_6429-0-2-RND2489_1 using acemdbeta version 815 (cuda55) in slot 3
20-Nov-2013 17:04:06 [GPUGRID] Restarting task 95-KLAUDE_6429-0-2-RND2489_1 using acemdbeta version 815 (cuda55) in slot 3
20-Nov-2013 23:43:46 [GPUGRID] Computation for task 95-KLAUDE_6429-0-2-RND2489_1 finished

And that task validated.
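
For reference, a single task's lifecycle can be pulled out of a saved BOINC event log with a short script. Here is a minimal Python sketch; the log file name (stdoutdae.txt in the BOINC data directory) and the task name are assumptions, so adjust both for your own client:

# Sketch: list every logged event for one task (start, restart,
# "no 'finished' file", unrecoverable error, finish).
import sys

EVENTS = ("Starting task", "Restarting task",
          "exited with zero status but no 'finished' file",
          "Unrecoverable error for task",
          "Computation for task")

def task_history(log_path, task_name):
    """Return the event-log lines that mention the given task."""
    hits = []
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            if task_name in line and any(e in line for e in EVENTS):
                hits.append(line.rstrip())
    return hits

if __name__ == "__main__":
    # Example: python task_history.py stdoutdae.txt 1-KLAUDE_6429-1-2-RND1937_0
    for line in task_history(sys.argv[1], sys.argv[2]):
        print(line)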

Damaraland
Message 33999 - Posted: 24 Nov 2013, 0:05:08 UTC

This unit, 3-KLAUDE_6429-0-2-RND6465, worked fine for me after it had failed on 6 computers before.

Maybe this can help?

If you need any further conf, just ask.

nanoprobe
Message 34005 - Posted: 24 Nov 2013, 2:33:02 UTC

All KLAUDE failed here after 2 seconds. Found this in the BOINC event log.

11/23/2013 9:28:11 PM | GPUGRID | Output file 91-KLAUDE_6444-0-3-RND9028_1_1 for task 91-KLAUDE_6444-0-3-RND9028_1 absent
11/23/2013 9:28:11 PM | GPUGRID | Output file 91-KLAUDE_6444-0-3-RND9028_1_2 for task 91-KLAUDE_6444-0-3-RND9028_1 absent
11/23/2013 9:28:11 PM | GPUGRID | Output file 91-KLAUDE_6444-0-3-RND9028_1_3 for task 91-KLAUDE_6444-0-3-RND9028_1 absent

Bedrich Hajek
Message 34006 - Posted: 24 Nov 2013, 2:55:07 UTC - in response to Message 34005.  

All KLAUDE failed here after 2 seconds. Found this in the BOINC event log.

11/23/2013 9:28:11 PM | GPUGRID | Output file 91-KLAUDE_6444-0-3-RND9028_1_1 for task 91-KLAUDE_6444-0-3-RND9028_1 absent
11/23/2013 9:28:11 PM | GPUGRID | Output file 91-KLAUDE_6444-0-3-RND9028_1_2 for task 91-KLAUDE_6444-0-3-RND9028_1 absent
11/23/2013 9:28:11 PM | GPUGRID | Output file 91-KLAUDE_6444-0-3-RND9028_1_3 for task 91-KLAUDE_6444-0-3-RND9028_1 absent



Same here!

Dagorath
Message 34008 - Posted: 24 Nov 2013, 8:14:55 UTC

I had several KLAUDE tasks fail after I configured my system to crunch two GPUGrid tasks simultaneously on my single 660 Ti. They failed ~2 seconds after starting, were reported as "error while computing", and did not validate.

I then turned off beta tasks in my website preferences and received a NATHAN, which ran fine alongside the KLAUDE I had run to ~50% completion before trying two simultaneous tasks.

I understand GPUGrid does not support two tasks on one GPU and I'm not expecting a fix for that; I'm just passing along what I saw, FWIW, as I find it interesting that two KLAUDEs would not run together, while one KLAUDE + one NATHAN were OK, and now one NATHAN plus one SANTI are crunching fine.

I expect I'll see what others have reported (that two tasks don't give much of a production increase), but I want to try it for a day or two just to say I tried it. At that point I'll likely turn beta tasks on again and revert to just one task at a time.
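
For anyone wondering how the two-tasks-per-GPU setup above is usually done: BOINC reads an app_config.xml from the project directory, with gpu_usage set to 0.5. Below is a minimal Python sketch that writes such a file; the app name "acemdbeta" and the project directory path are assumptions, so check both against your own client_state.xml before using it.

# Sketch: write an app_config.xml asking BOINC to run two tasks of one
# application per GPU (gpu_usage 0.5). App name and project directory are
# assumptions -- verify them, then use "Read config files" in BOINC Manager.
from pathlib import Path

APP_CONFIG = """<app_config>
  <app>
    <name>acemdbeta</name>
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
"""

def write_app_config(project_dir):
    path = Path(project_dir) / "app_config.xml"
    path.write_text(APP_CONFIG, encoding="utf-8")
    return path

if __name__ == "__main__":
    # Example path for a Linux client; on Windows the projects folder
    # lives under the BOINC data directory instead.
    print(write_app_config("/var/lib/boinc-client/projects/www.gpugrid.net"))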

Richard Haselgrove
Message 34009 - Posted: 24 Nov 2013, 8:57:58 UTC

It looks as if there was a bad batch of KLAUDE workunits overnight, all of which failed with

ERROR: file mdioload.cpp line 209: Error reading parmtop file

That includes yours, Dagorath - I don't think you can draw a conclusion that running two at a time had anything to do with the failures.

(retired account)
Message 34010 - Posted: 24 Nov 2013, 9:34:12 UTC

Yes, same here, lots of errors overnight. Luckily, the Titan had already hit its max-per-day limit of 15.

skgiven
Message 34011 - Posted: 24 Nov 2013, 10:17:10 UTC - in response to Message 34010.  
Last modified: 24 Nov 2013, 10:20:18 UTC

Yeah, bad batch.

57-KLAUDE_6444-0-3-RND0265
ACEMD beta version

Task ID   Computer ID   Sent   Time reported or deadline   Status   Run time (s)   CPU time (s)   Credit   Application
7485150 129153 24 Nov 2013 | 0:53:30 UTC 24 Nov 2013 | 0:55:10 UTC Error while computing 2.14 0.08 --- ACEMD beta version v8.15 (cuda42)
7487860 159186 24 Nov 2013 | 2:18:25 UTC 24 Nov 2013 | 5:58:30 UTC Error while computing 2.06 0.04 --- ACEMD beta version v8.14 (cuda55)
7488724 99934 24 Nov 2013 | 6:00:24 UTC 24 Nov 2013 | 6:06:35 UTC Error while computing 1.30 0.13 --- ACEMD beta version v8.15 (cuda55)
7488745 160877 24 Nov 2013 | 6:08:47 UTC 24 Nov 2013 | 6:14:58 UTC Error while computing 2.08 0.25 --- ACEMD beta version v8.15 (cuda55)
7488772 161748 24 Nov 2013 | 6:17:42 UTC 29 Nov 2013 | 6:17:42 UTC In progress --- --- --- ACEMD beta version v8.15 (cuda55)...

It's the same on Windows and Linux, and the errors occur on different generations of GPU, from the GTX 400s to the GTX 700s.

Exit status 98 (0x62) Unknown error number
process exited with code 98 (0x62, -158)

ERROR: file mdioload.cpp line 209: Error reading parmtop file
05:54:54 (21170): called boinc_finish

I expect this batch just wasn't built correctly.

I see that some WUs have already failed 8 times (the cutoff failure point), so they won't be resent. Given that they fail after 2 seconds, and there are only ~130 of these betas, the batch probably won't be around for too long.
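
As a back-of-envelope check on that, using the figures above (~130 workunits, the 8-error cutoff, ~2 seconds per failed attempt), the worst case is small:

# Sketch: worst-case cost of the bad batch, using the figures quoted above.
workunits = 130          # approximate size of the bad KLAUDE batch
errors_to_abandon = 8    # failed results after which a WU is no longer resent
seconds_per_failure = 2  # the tasks die about two seconds after starting

max_failed_results = workunits * errors_to_abandon
wasted_gpu_seconds = max_failed_results * seconds_per_failure

print(f"at most {max_failed_results} failed results, "
      f"roughly {wasted_gpu_seconds} GPU-seconds wasted in total")
# -> at most 1040 failed results, roughly 2080 GPU-seconds wasted in total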

Dagorath
Message 34012 - Posted: 24 Nov 2013, 10:18:35 UTC - in response to Message 34009.  

It looks as if there was a bad batch of KLAUDE workunits overnight, all of which failed with

ERROR: file mdioload.cpp line 209: Error reading parmtop file

That includes yours, Dagorath - I don't think you can draw a conclusion that running two at a time had anything to do with the failures.


That's good to know, thanks.

MJH
Message 34014 - Posted: 24 Nov 2013, 14:04:16 UTC - in response to Message 34012.  

Yes, my fault. I put out some broken WUs on the beta channel.

Matt

TJ
Message 34015 - Posted: 24 Nov 2013, 16:25:39 UTC - in response to Message 34014.  

And not only KLAUDE: a trypsin task has resulted in yet another fatal CUDA driver error and downclocked the card to a third of its normal clock speed. I know trypsin is nasty stuff, as its purpose is to "break down" molecules, but that it can also "break down" a GPU clock is new to me :)
So I have now opted for LR (long runs) only on the 660.

skgiven
Message 34016 - Posted: 24 Nov 2013, 19:57:57 UTC - in response to Message 34015.  

The Trp tasks are not part of the bad KLAUDE batch:

trypsin_lig_75x2-NOELIA_RCDOS-0-1-RND0557_1 4940668 159186 24 Nov 2013 | 8:14:31 UTC 24 Nov 2013 | 15:31:51 UTC Completed and validated 6,504.01 2,200.87 30,000.00 ACEMD beta version v8.14 (cuda55)

...and yes, Trypsin is a very useful digestive enzyme.

Bedrich Hajek
Message 34036 - Posted: 27 Nov 2013, 11:04:52 UTC
Last modified: 27 Nov 2013, 11:08:11 UTC

The latest NOELIA_RCDOSequ betas run fine, but they do downclock the video card to 914 MHz, not the 1019 MHz recorded in the stderr output below. Notice the temperature readings: they are in the 50s, not the 60s to low 70s I see when running the long tasks. This is true for both Windows XP and Windows 7. Below is a typical output for all these betas. The cards do return to normal speed when they run the long runs.

Stderr output

<core_client_version>7.0.64</core_client_version>
<![CDATA[
<stderr_txt>
# GPU [GeForce GTX 690] Platform [Windows] Rev [3203M] VERSION [55]
# SWAN Device 1 :
# Name : GeForce GTX 690
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:04:00.0
# Device clock : 1019MHz
# Memory clock : 3004MHz
# Memory width : 256bit
# Driver version : r325_00 : 32723
# GPU 0 : 52C
# GPU 1 : 51C
# GPU 2 : 51C
# GPU 3 : 49C
# GPU 1 : 52C
# GPU 1 : 53C
# GPU 1 : 54C
# GPU 1 : 55C
# GPU 1 : 56C
# GPU 1 : 57C
# GPU 1 : 58C
# GPU 3 : 51C
# GPU 3 : 52C
# GPU 3 : 53C
# GPU 3 : 54C
# GPU 3 : 55C
# GPU 3 : 56C
# GPU 2 : 53C
# GPU 2 : 54C
# GPU 2 : 55C
# GPU 2 : 56C
# GPU 2 : 57C
# GPU 2 : 58C
# GPU 0 : 53C
# GPU 0 : 54C
# GPU 0 : 55C
# GPU 0 : 56C
# GPU 0 : 57C
# GPU 0 : 58C
# Time per step (avg over 525000 steps): 5.742 ms
# Approximate elapsed time for entire WU: 3014.682 s
02:08:35 (2912): called boinc_finish

</stderr_txt>
]]>
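
Since the stderr header only records the card's rated device clock once at start-up, the clock actually held during a run has to be sampled separately. A minimal Python sketch around nvidia-smi (GPU-Z does the same job graphically on Windows; the query field names assume a reasonably recent driver):

# Sketch: sample the actual SM clock, temperature and utilization every few
# seconds while a task runs, since the task's stderr only reports the rated
# device clock at start-up. Requires nvidia-smi on the PATH.
import subprocess
import time

QUERY = ["nvidia-smi",
         "--query-gpu=index,clocks.sm,temperature.gpu,utilization.gpu",
         "--format=csv,noheader"]

def sample(interval_s=5.0, samples=12):
    for _ in range(samples):
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
        print(out.stdout.strip())
        time.sleep(interval_s)

if __name__ == "__main__":
    sample()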

Richard Haselgrove
Message 34037 - Posted: 27 Nov 2013, 15:52:51 UTC

I see the latest GPU-Z gives a reason for performance capping:

[screenshot: GPU-Z sensor panel showing the performance cap reason]

That GTX 670 was capped for reliability voltage and operating voltage (it was running a SANTI_MAR423cap from the long queue at the time, not a Beta - just illustrating the point).

Snow Crash
Message 34046 - Posted: 27 Nov 2013, 21:34:00 UTC - in response to Message 34037.  

I see the latest GPU-Z gives a reason for performance capping:

That GTX 670 was capped for reliability voltage and operating voltage (it was running a SANTI_MAR423cap from the long queue at the time, not a Beta - just illustrating the point).


<ot>

I have GPU-Z 0.7.4 (which says it is the latest version) and I do not see that entry (I have a 670 also). Is that a beta version, and can you provide a link to where you got it?

</ot>

Thanks - Steve

skgiven
Message 34047 - Posted: 27 Nov 2013, 21:48:57 UTC - in response to Message 34046.  

GPU-Z has been showing this for months. I'm still using 0.7.3 and it shows it, as did the previous version, and probably the one before that. I thought I posted about this several months ago?!? Anyway, it's a useful tool, but it only works on Windows. My GTX 660 Ti (which is hanging out of the case against a wall, courtesy of a riser) is limited by VRel and VOp (reliability voltage and operating voltage, respectively). My GTX 660 is limited by power and reliability voltage. My GTX 770 is limited by reliability voltage and operating voltage. All in the same system. Of note is that only the GTX 660 is limited by power!
Just saying...

TJ
Message 34048 - Posted: 27 Nov 2013, 22:07:55 UTC
Last modified: 27 Nov 2013, 22:08:24 UTC

Yes, I have seen that already from GPU-Z in previous versions, just as skgiven says.
My 660 and 770 are limited by power/reliability voltage. Changing the power limit made no difference, and the PSUs are powerful enough for the cards.
I saw/see it with beta, SR and LR tasks, from all the scientists.

ExtraTerrestrial Apes
Message 34123 - Posted: 4 Dec 2013, 20:22:09 UTC

If GPU-Z reports "VRel, VOp" as the throttling reason, this actually means the card is running at full throttle and has reached its highest boost bin. Since it would need a higher voltage for the next bin, it is reported as being throttled by voltage. Unless the power limit is set tightly or cooling is poor, this should be the default state for a Kepler crunching GPUGrid.

Bedrich wrote:
but they do down clock the video card speed to 914 Mhz, not the 1019 Mhz speed as recorded on the Stderr output below. Notice the temperature readings, they are in the 50's, not the 60's to low 70's when I run the long tasks.

This sounds like the GPU utilization was low, in which case the driver doesn't consider it necessary to push to the full boost clock. In that case GPU-Z reports "Util" as the throttle reason, meaning "GPU utilization too low". This mostly happens with small or short WUs. Those are also the ones where running two concurrent WUs actually provides some throughput gain.
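
To see which limiter the driver itself reports, without GPU-Z, something like the sketch below works. Note that nvidia-smi exposes a similar but not identical set of flags (power cap, HW/thermal slowdown, idle); the VRel/VOp voltage limiters are an NVAPI detail that GPU-Z reads directly, so they are not reported one-to-one here:

# Sketch: dump the driver's "Clocks Throttle Reasons" for each GPU.
# The exact labels vary with driver version.
import subprocess

def throttle_reasons():
    # "-q -d PERFORMANCE" prints the performance state and throttle-reason flags.
    result = subprocess.run(["nvidia-smi", "-q", "-d", "PERFORMANCE"],
                            capture_output=True, text=True, check=True)
    return result.stdout

if __name__ == "__main__":
    print(throttle_reasons())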

MrS