acemdbeta application - discussion

MJH
Message 32697 - Posted: 4 Sep 2013, 14:42:04 UTC - in response to Message 32696.  


Note: A Blue-Screen is worse than a driver reset. And I have sometimes seen GPUGrid tasks, when being suspended (looking at the NOELIA ones again)... give me blue screens in the past, with error DPC_WATCHDOG_VIOLATION.


Yes - I read "blue-screen" but heard "driver reset". BSODs are by definition a driver bug. It's axiomatic that no user program should be able to crash the kernel.
"DPC_WATCHDOG_VIOLATION" is the event that the driver is supposed to trap and recover from, triggering a driver reset. Evidently that's not a perfect process.

MJH
Jacob Klein
Message 32699 - Posted: 4 Sep 2013, 15:01:26 UTC - in response to Message 32697.  

I think if a driver encounters too many TDRs in a short period of time, the OS issues the DPC_WATCHDOG_VIOLATION BSOD.

I believe it is not a driver issue.
I believe it is a result of getting too many TDRs (from GPUGrid apps).
Stoneageman
Message 32707 - Posted: 4 Sep 2013, 18:03:34 UTC
Last modified: 4 Sep 2013, 18:11:59 UTC

5/5 of the 8.11 Noelia beta WUs failed on "time exceeded". Sample
Previously, the 8.11 MJHarvey betas ran OK, as did the 8.05 Noelia betas.
XP 32-bit, GTX 570 at stock clocks, driver 314.22, BOINC 7.0.64
MJH
Message 32709 - Posted: 4 Sep 2013, 18:06:57 UTC - in response to Message 32707.  

Just killing off the remaining beta WUs now.

MJH
Retvari Zoltan
Message 32716 - Posted: 4 Sep 2013, 21:40:58 UTC - in response to Message 32709.  

Just killing off the remaining beta WUs now.

MJH

I had a beta WU from the TEST18 series; my host reported it successfully, and it didn't receive an abort request from the GPUGrid scheduler.
Is there anything we should do? (For example, manually abort all beta tasks, including NOELIA_KLEBEbeta tasks?)
Bedrich Hajek
Message 32718 - Posted: 5 Sep 2013, 1:51:46 UTC - in response to Message 32709.  
Last modified: 5 Sep 2013, 1:54:36 UTC

I guess we did enough testing for now, and we can forget beta versions 8.05 to 8.10. Here is the proof: look at the finishing times.

Versions 8.05 & 8.10

Task 7245172 (WU 4749995): sent 3 Sep 2013 23:26:39 UTC, reported 4 Sep 2013 19:10:37 UTC, Completed and validated, run time 68,637.53 s, CPU time 21,605.83 s, credit 142,800.00, ACEMD beta version v8.10 (cuda55)

Task 7245072 (WU 4732574): sent 3 Sep 2013 22:31:54 UTC, reported 4 Sep 2013 20:20:27 UTC, Completed and validated, run time 75,759.22 s, CPU time 21,670.35 s, credit 142,800.00, ACEMD beta version v8.05 (cuda42)

Versus version 8.11

Task 7247558 (WU 4731877): sent 4 Sep 2013 13:32:52 UTC, reported 5 Sep 2013 1:24:54 UTC, Completed and validated, run time 35,208.78 s, CPU time 7,022.73 s, credit 142,800.00, ACEMD beta version v8.11 (cuda55)

Task 7247095 (WU 4751418): sent 4 Sep 2013 10:00:08 UTC, reported 5 Sep 2013 1:03:53 UTC, Completed and validated, run time 34,651.15 s, CPU time 7,074.61 s, credit 142,800.00, ACEMD beta version v8.11 (cuda55)


Talk about the down-clocking and low GPU usage that happened in Windows 7!
MJH
Message 32722 - Posted: 5 Sep 2013, 11:02:48 UTC

The "simulation unstable" (err -97) failure mode can be quite painful in terms of lost credit. In some circumstances this can be recoverable error, so aborting the WU is unnecessarily wasteful.

There'll be a new beta out in a few hours putting this recovery into practice. It will also be accompanied by a new batch of beta WUs, MJHARVEY-CRASHY.

If you have been encountering this error a lot please start taking these WUs - I need to see err -97 failures, and lots of them.

MJH
juan BFP
Message 32724 - Posted: 5 Sep 2013, 12:15:05 UTC
Last modified: 5 Sep 2013, 12:20:19 UTC

Sorry if I put this in the wrong thread.

It may just be curiosity, but could somebody explain why these WUs received such different credit, when they ran on the same host and are the same kind of WU (NOELIA)?

WU 1 - http://www.gpugrid.net/result.php?resultid=7238706

WU 2 - http://www.gpugrid.net/result.php?resultid=7247939

WU1 is cuda42 and WU2 is cuda55; WU1 ran for less time than WU2 and received about 20% more credit. Both WUs were reported within the 24-hour limit.
Retvari Zoltan
Message 32725 - Posted: 5 Sep 2013, 12:20:44 UTC - in response to Message 32724.  

It may just be curiosity, but could somebody explain why these WUs receive such different credit, if they use about the same GPU/CPU time on similar hosts and are the same kind of WU (NOELIA)?


These workunits are from the same scientist (Noelia), but they are not in the same batch.

The first workunit is a NOELIA_FRAG041p.
The second workunit is a NOELIA_INS1P.
juan BFP
Message 32726 - Posted: 5 Sep 2013, 13:11:25 UTC - in response to Message 32725.  
Last modified: 5 Sep 2013, 13:15:48 UTC

It may just be curiosity, but could somebody explain why these WUs receive such different credit, if they use about the same GPU/CPU time on similar hosts and are the same kind of WU (NOELIA)?


These workunits are from the same scientist (Noelia), but they are not in the same batch.

The first workunit is a NOELIA_FRAG041p.
The second workunit is a NOELIA_INS1P.

From that I understand that the credit "paid" is not related to the processing power used to crunch the WU, but to the batch the WU belongs to, where somebody decides the amount of credit paid for that batch's WUs. That's different from most other BOINC projects, which is why it bugs my mind. Initially I expected the same number of credits for the same processing time on the same host (or approximately so).

Please understand me, I don't question the method, I just want to understand why. That's OK now.

Thanks for the answer and happy crunching.
Retvari Zoltan
Message 32727 - Posted: 5 Sep 2013, 13:56:08 UTC - in response to Message 32726.  

From that I understand that the credit "paid" is not related to the processing power used to crunch the WU, but to the batch the WU belongs to, where somebody decides the amount of credit paid for that batch's WUs. That's different from most other BOINC projects, which is why it bugs my mind. Initially I expected the same number of credits for the same processing time on the same host (or approximately so).

Please understand me, I don't question the method, I just want to understand why. That's OK now.

Thanks for the answer and happy crunching.

On reading your previous post a second time, it occurred to me that your problem could be that a shorter WU received more credit than a longer one. Well, that's a paradox. It happens from time to time; later on you will get used to it. It's probably caused by the method used for estimating the processing power needed for a given WU (based on the complexity of the model and the number of steps).
The shorter (~30 ksec) WU had 6.25 million steps and received 180k credit (6 credit/sec).
The longer (~31.4 ksec) WU had 4.2 million steps and received 148.5k credit (4.73 credit/sec).
There is a 27% difference between the credit/sec rates of the two workunits. It's significant, but not unusual.
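For the curious, those credit/sec figures are just credit divided by run time; a quick check with the rounded numbers quoted above (illustrative arithmetic only):

# Rough check of the credit-per-second figures quoted above
# (run times and credits are the approximate values from the post).
wus = [
    {"name": "shorter WU (6.25M steps)", "runtime_s": 30_000, "credit": 180_000},
    {"name": "longer WU (4.2M steps)",   "runtime_s": 31_400, "credit": 148_500},
]

rates = []
for wu in wus:
    rate = wu["credit"] / wu["runtime_s"]      # credit per second of run time
    rates.append(rate)
    print(f'{wu["name"]}: {rate:.2f} credit/sec')

# 6.00 / 4.73 - 1 is roughly 27%
print(f'difference: {(max(rates) / min(rates) - 1) * 100:.0f}%')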
juan BFP
Message 32728 - Posted: 5 Sep 2013, 14:09:42 UTC - in response to Message 32727.  

On reading your previous post a second time, it occurred to me that your problem could be that a shorter WU received more credit than a longer one. Well, that's a paradox... It's significant, but not unusual.

That's exactly what I meant: the paradox (less time, more credit; more time, less credit). But if it's normal and not a bug... then I'll go on crunching both. Thanks for your time and explanations.

MJH
Message 32729 - Posted: 5 Sep 2013, 14:11:15 UTC

Ok chaps, there's a new beta 8.12.
This comes along with a bunch of WUs, MJHARVEY-CRASH1.

If you have suffered error -97 "simulation unstable" errors, please take some of these WUs.

The new app implements a recovery mechanism that should see unstable simulations automatically restarted from an earlier checkpoint. Recoveries should be accompanied by a message to the BOINC console, and in the stderr when the job is complete.

I have had to update the BOINC client library to implement this, so expect everything to go hilariously wrong.

MJH
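In outline, the recovery described above amounts to trapping the instability and rewinding to the most recent checkpoint rather than failing the task with err -97. A minimal, hypothetical sketch of that control flow (the names are illustrative; this is not the actual ACEMD source):

# Hypothetical sketch of checkpoint-based recovery from an unstable simulation,
# as described in the post above. Not the actual ACEMD implementation.
MAX_RECOVERIES = 5   # assumed limit; give up and fail the task beyond this

class SimulationUnstable(Exception):
    """Raised when the integrator detects a blow-up (e.g. NaN coordinates)."""

def run_with_recovery(sim):
    recoveries = 0
    while not sim.finished():
        try:
            sim.step_block()            # advance a block of MD steps
            sim.write_checkpoint()      # periodic checkpoint of the full state
        except SimulationUnstable:
            recoveries += 1
            if recoveries > MAX_RECOVERIES:
                raise                   # unrecoverable: report err -97 to BOINC
            print("# The simulation has become unstable. Terminating to avoid lock-up")
            print(f"# Attempting restart (step {sim.last_checkpoint_step()})")
            sim.restore_checkpoint()    # rewind to the last good checkpoint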
Richard Haselgrove
Message 32730 - Posted: 5 Sep 2013, 14:54:14 UTC - in response to Message 32729.  

I have had to update the BOINC client library to implement this, so expect everything to go hilariously wrong.

ROFL - now that's a good way to attract testers! We have been duly warned, and I'm off to try and download some now.

I've seen a few exits with 'error -97', but not any great number. If I get any CRASH1 tasks now, they will run on device 0 - the hotter of my two GTX 670s - hopefully that will generate some material for you to work with.
Jacob Klein
Message 32731 - Posted: 5 Sep 2013, 14:57:52 UTC - in response to Message 32729.  
Last modified: 5 Sep 2013, 15:00:34 UTC

I grabbed 4 of them too. It'll be a couple hours before my 2 GPUs can begin work on them.

Note: It looks like the 8.11 app "floods" the stderr.txt file with tons of lines of GPU temperature readings. This makes it impossible for me to see all the "GPU start blocks" on the results page.

Is there any way to either not flood it with temps, or maybe put continuous temp readings on a single line?

Basically, if possible, for any of my completed tasks, I'd prefer to see ALL of the blocks that look like this:

# GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 0 :
# Name : GeForce GTX 660 Ti
# ECC : Disabled
# Global mem : 3072MB
# Capability : 3.0
# PCI ID : 0000:09:00.0
# Device clock : 1124MHz
# Memory clock : 3004MHz
# Memory width : 192bit
# Driver version : r325_00 : 32680

... but, instead, all I'd get to see in the "truncated" web result is:

# GPU 0 Current Temp: 67 C
# GPU 1 Current Temp: 71 C
# GPU 2 Current Temp: 80 C
# GPU 0 Current Temp: 67 C
# GPU 1 Current Temp: 71 C
# GPU 2 Current Temp: 80 C
# GPU 0 Current Temp: 67 C
# GPU 1 Current Temp: 71 C
# GPU 2 Current Temp: 80 C
# GPU 0 Current Temp: 67 C
# GPU 1 Current Temp: 71 C
# GPU 2 Current Temp: 80 C
# GPU 0 Current Temp: 67 C
# GPU 1 Current Temp: 71 C
# GPU 2 Current Temp: 80 C
# GPU 0 Current Temp: 67 C
# GPU 1 Current Temp: 71 C
# GPU 2 Current Temp: 80 C
# GPU 0 Current Temp: 67 C
# GPU 1 Current Temp: 71 C
# GPU 2 Current Temp: 80 C
Richard Haselgrove
Message 32732 - Posted: 5 Sep 2013, 15:07:47 UTC - in response to Message 32731.  

I hadn't reset my DCF, so I only got one job, and it started immediately in high priority - task 7250880. Initial indications are that it will run for roughly 2 hours, if it doesn't spend too much time rewinding.

@ MJH - those temperature readings.

BOINC has a limit on the amount of text it will return via stderr - 64KB, IIRC. Depending on the client version in use, you might get the first 64KB (with that startup block Jacob wanted), or the last 64KB (which is more likely to contain a crash dump analysis, of interest to debuggers). We could look up which version does what, if you wish.

Of course, if you could shrink the bit in the middle, you might be able to fit both ends into 64KB.
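The "shrink the bit in the middle" idea would, in effect, keep the head of stderr (the startup blocks) and its tail (any crash analysis) and drop whatever lies between, so both ends fit under the limit. A rough sketch of that kind of truncation (illustrative only; not how BOINC itself implements it):

# Illustrative: keep the start and end of a long stderr log so that both the
# startup blocks and any crash-dump analysis fit within a ~64 KB limit.
LIMIT = 64 * 1024

def shrink_middle(text: str, limit: int = LIMIT) -> str:
    data = text.encode("utf-8", errors="replace")
    if len(data) <= limit:
        return text                       # already fits, nothing to do
    marker = b"\n[... middle of output trimmed ...]\n"
    keep = (limit - len(marker)) // 2     # bytes to keep from each end
    trimmed = data[:keep] + marker + data[-keep:]
    return trimmed.decode("utf-8", errors="replace")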
MJH
Message 32733 - Posted: 5 Sep 2013, 15:10:38 UTC - in response to Message 32732.  

The next version will emit temperatures only when they change.

MJH
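"Only when they change" presumably just means suppressing consecutive identical readings per GPU, along these lines (a sketch of the idea, not the app's code):

# Sketch: print a GPU's temperature only when it differs from the last reading.
_last_temp = {}

def log_temperature(gpu_id: int, temp_c: int) -> None:
    if _last_temp.get(gpu_id) != temp_c:
        _last_temp[gpu_id] = temp_c
        print(f"# GPU {gpu_id} Current Temp: {temp_c} C")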
Jacob Klein
Message 32734 - Posted: 5 Sep 2013, 15:21:05 UTC - in response to Message 32733.  
Last modified: 5 Sep 2013, 15:25:39 UTC

Just to clarify... Most of my GPUGrid tasks get "started" about 5-10 times, as I'm suspending the GPU often (for exclusive apps), and restarting the machine often (troubleshooting nVidia driver problems).

What I'm MOST interested in is seeing the "GPU identification block" for every restart. So, if a task was restarted 10 times, I expect to see 10 such blocks, without truncation, in the web result.

Hopefully that's not too much to ask.
Thanks.
MJH
Message 32736 - Posted: 5 Sep 2013, 15:50:32 UTC - in response to Message 32734.  

8.13: reduces temperature verbosity, and traps and recovers from access violations.

MJH
Richard Haselgrove
Message 32738 - Posted: 5 Sep 2013, 17:47:26 UTC - in response to Message 32732.  
Last modified: 5 Sep 2013, 18:03:23 UTC

task 7250880.

I think we got what you were hoping for:

# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 2561000)
# GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55]

Edit - on the other hand, task 7250828 wasn't so lucky. And I can't see any special messages in the BOINC event log either: these are all ones I would have expected to see anyway.

05/09/2013 18:52:21 | GPUGRID | Finished download of 147-MJHARVEY_CRASH1-0-xsc_file
05/09/2013 18:52:21 | GPUGRID | [coproc] NVIDIA instance 1: confirming for 170xlow-NOELIA_INS1P-0-12-RND0803_1
05/09/2013 18:52:21 | GPUGRID | [coproc] Assigning 0.990000 of NVIDIA free instance 0 to 119-MJHARVEY_CRASH1-0-25-RND2510_0
05/09/2013 18:52:21 | SETI@home | [coproc] NVIDIA device 0 already assigned: task 29mr08ag.26459.13160.3.12.250_0
05/09/2013 18:52:21 | SETI@home | [coproc] NVIDIA device 0 already assigned: task 29mr08ag.26459.13160.3.12.74_0
05/09/2013 18:52:21 | SETI@home | [cpu_sched] Preempting 29mr08ag.26459.13160.3.12.74_0 (removed from memory)
05/09/2013 18:52:21 | SETI@home | [cpu_sched] Preempting 29mr08ag.26459.13160.3.12.250_0 (removed from memory)
05/09/2013 18:52:21 | GPUGRID | [coproc] NVIDIA instance 1: confirming for 170xlow-NOELIA_INS1P-0-12-RND0803_1
05/09/2013 18:52:21 | GPUGRID | [coproc] Assigning 0.990000 of NVIDIA free instance 0 to 119-MJHARVEY_CRASH1-0-25-RND2510_0
05/09/2013 18:52:22 | GPUGRID | [coproc] NVIDIA instance 1: confirming for 170xlow-NOELIA_INS1P-0-12-RND0803_1
05/09/2013 18:52:22 | GPUGRID | [coproc] Assigning 0.990000 of NVIDIA free instance 0 to 119-MJHARVEY_CRASH1-0-25-RND2510_0
05/09/2013 18:52:22 | GPUGRID | Restarting task 119-MJHARVEY_CRASH1-0-25-RND2510_0 using acemdbeta version 812 (cuda55) in slot 10
05/09/2013 18:52:23 | GPUGRID | [sched_op] Deferring communication for 00:01:36
05/09/2013 18:52:23 | GPUGRID | [sched_op] Reason: Unrecoverable error for task 119-MJHARVEY_CRASH1-0-25-RND2510_0
05/09/2013 18:52:23 | GPUGRID | Computation for task 119-MJHARVEY_CRASH1-0-25-RND2510_0 finished
05/09/2013 18:52:23 | GPUGRID | Output file 119-MJHARVEY_CRASH1-0-25-RND2510_0_1 for task 119-MJHARVEY_CRASH1-0-25-RND2510_0 absent
05/09/2013 18:52:23 | GPUGRID | Output file 119-MJHARVEY_CRASH1-0-25-RND2510_0_2 for task 119-MJHARVEY_CRASH1-0-25-RND2510_0 absent
05/09/2013 18:52:23 | GPUGRID | Output file 119-MJHARVEY_CRASH1-0-25-RND2510_0_3 for task 119-MJHARVEY_CRASH1-0-25-RND2510_0 absent
05/09/2013 18:52:23 | GPUGRID | [coproc] NVIDIA instance 1: confirming for 170xlow-NOELIA_INS1P-0-12-RND0803_1
05/09/2013 18:52:23 | GPUGRID | [coproc] Assigning 0.990000 of NVIDIA free instance 0 to 147-MJHARVEY_CRASH1-0-25-RND2539_0
05/09/2013 18:52:34 | GPUGRID | [coproc] NVIDIA instance 1: confirming for 170xlow-NOELIA_INS1P-0-12-RND0803_1
05/09/2013 18:52:34 | GPUGRID | [coproc] Assigning 0.990000 of NVIDIA free instance 0 to 147-MJHARVEY_CRASH1-0-25-RND2539_0
05/09/2013 18:52:34 | GPUGRID | Starting task 147-MJHARVEY_CRASH1-0-25-RND2539_0 using acemdbeta version 813 (cuda55) in slot 9
05/09/2013 18:52:35 | GPUGRID | Started upload of 119-MJHARVEY_CRASH1-0-25-RND2510_0_0