acemdbeta application - discussion

MJH
Message 32697 - Posted: 4 Sep 2013, 14:42:04 UTC - in response to Message 32696.  


Note: A Blue-Screen is worse than a driver reset. And I have sometimes seen GPUGrid tasks, when being suspended (looking at the NOELIA ones again)... give me blue screens in the past, with error DPC_WATCHDOG_VIOLATION.


Yes - I read "blue-screen" but heard "driver reset". BSODs are by definition a driver bug. It's axiomatic that no user program should be able to crash the kernel.
"DPC_WATCHDOG_VIOLATION" is the event that the driver is supposed to trap and recover from, triggering a driver reset. Evidently that's not a perfect process.

MJH
Jacob Klein
Message 32699 - Posted: 4 Sep 2013, 15:01:26 UTC - in response to Message 32697.  

I think if a driver encounters too many TDRs in a short period of time, the OS issues the DPC_WATCHDOG_VIOLATION BSOD.

I believe it is not a driver issue.
I believe it is a result of getting too many TDRs (from GPUGrid apps).
Stoneageman
Message 32707 - Posted: 4 Sep 2013, 18:03:34 UTC
Last modified: 4 Sep 2013, 18:11:59 UTC

5/5 of the 8.11 Noelia beta WUs failed on "time exceeded". Sample
Previously, the 8.11 MJHarvey betas ran OK, as did the 8.05 Noelia betas.
XP 32-bit, GTX 570 at stock clocks, driver 314.22, BOINC 7.0.64
MJH
Message 32709 - Posted: 4 Sep 2013, 18:06:57 UTC - in response to Message 32707.  

Just killing off the remaining beta WUs now.

MJH
Retvari Zoltan
Message 32716 - Posted: 4 Sep 2013, 21:40:58 UTC - in response to Message 32709.  

Just killing off the remaining beta WUs now.

MJH

I had a beta WU from the TEST18 series; my host reported it successfully, and it didn't receive an abort request from the GPUGrid scheduler.
Is there anything we should do? (For example, manually abort all beta tasks, including NOELIA_KLEBEbeta tasks?)
Bedrich Hajek
Message 32718 - Posted: 5 Sep 2013, 1:51:46 UTC - in response to Message 32709.  
Last modified: 5 Sep 2013, 1:54:36 UTC

I guess we did enough testing for now, and we can forget beta versions 8.05 to 8.10. Here is the proof: look at the finishing times.

Versions 8.05 & 8.10

Task 7245172 (WU 4749995): sent 3 Sep 2013 23:26:39 UTC, reported 4 Sep 2013 19:10:37 UTC, Completed and validated, run time 68,637.53 s, CPU time 21,605.83 s, credit 142,800.00, ACEMD beta version v8.10 (cuda55)

Task 7245072 (WU 4732574): sent 3 Sep 2013 22:31:54 UTC, reported 4 Sep 2013 20:20:27 UTC, Completed and validated, run time 75,759.22 s, CPU time 21,670.35 s, credit 142,800.00, ACEMD beta version v8.05 (cuda42)

Versus version 8.11

Task 7247558 (WU 4731877): sent 4 Sep 2013 13:32:52 UTC, reported 5 Sep 2013 1:24:54 UTC, Completed and validated, run time 35,208.78 s, CPU time 7,022.73 s, credit 142,800.00, ACEMD beta version v8.11 (cuda55)

Task 7247095 (WU 4751418): sent 4 Sep 2013 10:00:08 UTC, reported 5 Sep 2013 1:03:53 UTC, Completed and validated, run time 34,651.15 s, CPU time 7,074.61 s, credit 142,800.00, ACEMD beta version v8.11 (cuda55)


Talk about the down-clocking and low GPU usage that happened in Windows 7!
MJH
Message 32722 - Posted: 5 Sep 2013, 11:02:48 UTC

The "simulation unstable" (err -97) failure mode can be quite painful in terms of lost credit. In some circumstances this can be recoverable error, so aborting the WU is unnecessarily wasteful.

There'll be a new beta out in a few hours putting this recovery into practice. It will also be accompanied by a new batch of beta WUs, MJHARVEY-CRASHY.

If you have been encountering this error a lot please start taking these WUs - I need to see err -97 failures, and lots of them.

MJH
juan BFP
Message 32724 - Posted: 5 Sep 2013, 12:15:05 UTC
Last modified: 5 Sep 2013, 12:20:19 UTC

Sorry if I put this in the wrong thread.

It may just be curiosity, but could somebody explain why these WUs received such different credit, when they ran on the same host and are the same kind of WU (NOELIA)?

WU 1 - http://www.gpugrid.net/result.php?resultid=7238706

WU 2 - http://www.gpugrid.net/result.php?resultid=7247939

WU1 is cuda42 and WU2 is cuda55; WU1 ran for less time than WU2 and received about 20% more credit. Both WUs were reported within the 24-hour limit.
Retvari Zoltan
Message 32725 - Posted: 5 Sep 2013, 12:20:44 UTC - in response to Message 32724.  

It may just be curiosity, but could somebody explain why these WUs receive such different credit, if they use about the same GPU/CPU time on similar hosts and are the same kind of WU (NOELIA)?


These workunits are from the same scientist (Noelia), but they are not in the same batch.

The first workunit is a NOELIA_FRAG041p.
The second workunit is a NOELIA_INS1P.
juan BFP
Message 32726 - Posted: 5 Sep 2013, 13:11:25 UTC - in response to Message 32725.  
Last modified: 5 Sep 2013, 13:15:48 UTC

It may just be curiosity, but could somebody explain why these WUs receive such different credit, if they use about the same GPU/CPU time on similar hosts and are the same kind of WU (NOELIA)?


These workunits are from the same scientist (Noelia), but they are not in the same batch.

The first workunit is a NOELIA_FRAG041p.
The second workunit is a NOELIA_INS1P.

From that I understand that the credit "paid" is not related to the processing power used to crunch the WU, but to the batch the WU belongs to, where somebody decides the amount of credit paid for that batch's WUs. That's different from most other BOINC projects, which is why it bugs my mind. Initially I expected the same number of credits for the same processing time on the same host (or approximately so).

Please understand me, I don't question the method, I just want to understand why. That's OK now.

Thanks for the answer and happy crunching.
Retvari Zoltan
Message 32727 - Posted: 5 Sep 2013, 13:56:08 UTC - in response to Message 32726.  

From that I understand that the credit "paid" is not related to the processing power used to crunch the WU, but to the batch the WU belongs to, where somebody decides the amount of credit paid for that batch's WUs. That's different from most other BOINC projects, which is why it bugs my mind. Initially I expected the same number of credits for the same processing time on the same host (or approximately so).

Please understand me, I don't question the method, I just want to understand why. That's OK now.

Thanks for the answer and happy crunching.

On reading your previous post a second time, it occurred to me that your problem could be that a shorter WU received more credit than a longer one. Well, that's a paradox. It happens from time to time; later on you will get used to it. It's probably caused by the method used for estimating the processing power needed for a given WU (based on the complexity of the model and the number of steps).
The shorter (~30 ksec) WU had 6.25 million steps and received 180k credit (6 credit/sec).
The longer (~31.4 ksec) WU had 4.2 million steps and received 148.5k credit (4.73 credit/sec).
There is a 27% difference between the credit/sec rates of the two workunits. It's significant, but not unusual.
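For the curious, those credit/sec figures are just credit divided by run time; a quick check with the rounded numbers quoted above (illustrative arithmetic only):

# Rough check of the credit-per-second figures quoted above
# (run times and credits are the approximate values from the post).
wus = [
    {"name": "shorter WU (6.25M steps)", "runtime_s": 30_000, "credit": 180_000},
    {"name": "longer WU (4.2M steps)",   "runtime_s": 31_400, "credit": 148_500},
]

rates = []
for wu in wus:
    rate = wu["credit"] / wu["runtime_s"]      # credit per second of run time
    rates.append(rate)
    print(f'{wu["name"]}: {rate:.2f} credit/sec')

# 6.00 / 4.73 - 1 is roughly 27%
print(f'difference: {(max(rates) / min(rates) - 1) * 100:.0f}%')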
juan BFP
Message 32728 - Posted: 5 Sep 2013, 14:09:42 UTC - in response to Message 32727.  

On reading your previous post a second time, it occurred to me that your problem could be that a shorter WU received more credit than a longer one. Well, that's a paradox... It's significant, but not unusual.

That's exactly what I meant: the paradox (less time, more credit; more time, less credit). But if it's normal and not a bug... then I'll go on crunching both. Thanks for your time and explanations.

MJH
Message 32729 - Posted: 5 Sep 2013, 14:11:15 UTC

Ok chaps, there's a new beta 8.12.
This comes along with a bunch of WUs, MJHARVEY-CRASH1.

If you have suffered error -97 "simulation unstable" errors, please take some of these WUs.

The new app implements a recovery mechanism that should see unstable simulations automatically restarted from an earlier checkpoint. Recoveries should be accompanied by a message to the BOINC console, and in the stderr when the job is complete.

I have had to update the BOINC client library to implement this, so expect everything to go hilariously wrong.

MJH
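In outline, the recovery described above amounts to trapping the instability and rewinding to the most recent checkpoint rather than failing the task with err -97. A minimal, hypothetical sketch of that control flow (the names are illustrative; this is not the actual ACEMD source):

# Hypothetical sketch of checkpoint-based recovery from an unstable simulation,
# as described in the post above. Not the actual ACEMD implementation.
MAX_RECOVERIES = 5   # assumed limit; give up and fail the task beyond this

class SimulationUnstable(Exception):
    """Raised when the integrator detects a blow-up (e.g. NaN coordinates)."""

def run_with_recovery(sim):
    recoveries = 0
    while not sim.finished():
        try:
            sim.step_block()            # advance a block of MD steps
            sim.write_checkpoint()      # periodic checkpoint of the full state
        except SimulationUnstable:
            recoveries += 1
            if recoveries > MAX_RECOVERIES:
                raise                   # unrecoverable: report err -97 to BOINC
            print("# The simulation has become unstable. Terminating to avoid lock-up")
            print(f"# Attempting restart (step {sim.last_checkpoint_step()})")
            sim.restore_checkpoint()    # rewind to the last good checkpoint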
Richard Haselgrove
Message 32730 - Posted: 5 Sep 2013, 14:54:14 UTC - in response to Message 32729.  

I have had to update the BOINC client library to implement this, so expect everything to go hilariously wrong.

ROFL - now that's a good way to attract testers! We have been duly warned, and I'm off to try and download some now.

I've seen a few exits with 'error -97', but not any great number. If I get any CRASH1 tasks now, they will run on device 0 - the hotter of my two GTX 670s - hopefully that will generate some material for you to work with.
Jacob Klein
Message 32731 - Posted: 5 Sep 2013, 14:57:52 UTC - in response to Message 32729.  
Last modified: 5 Sep 2013, 15:00:34 UTC

I grabbed 4 of them too. It'll be a couple hours before my 2 GPUs can begin work on them.

Note: It looks like the 8.11 app "floods" the stderr.txt file with tons of lines of GPU temperature readings. This makes it impossible for me to see all the "GPU start blocks" on the results page.

Is there any way to either not flood it with temps, or maybe put continuous temp readings on a single line?

Basically, if possible, for any of my completed tasks, I'd prefer to see ALL of the blocks that look like this:

# GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 0 :
# Name : GeForce GTX 660 Ti
# ECC : Disabled
# Global mem : 3072MB
# Capability : 3.0
# PCI ID : 0000:09:00.0
# Device clock : 1124MHz
# Memory clock : 3004MHz
# Memory width : 192bit
# Driver version : r325_00 : 32680

... but, instead, all I'd get to see in the "truncated" web result is:

# GPU 0 Current Temp: 67 C
# GPU 1 Current Temp: 71 C
# GPU 2 Current Temp: 80 C
# GPU 0 Current Temp: 67 C
# GPU 1 Current Temp: 71 C
# GPU 2 Current Temp: 80 C
# GPU 0 Current Temp: 67 C
# GPU 1 Current Temp: 71 C
# GPU 2 Current Temp: 80 C
# GPU 0 Current Temp: 67 C
# GPU 1 Current Temp: 71 C
# GPU 2 Current Temp: 80 C
# GPU 0 Current Temp: 67 C
# GPU 1 Current Temp: 71 C
# GPU 2 Current Temp: 80 C
# GPU 0 Current Temp: 67 C
# GPU 1 Current Temp: 71 C
# GPU 2 Current Temp: 80 C
# GPU 0 Current Temp: 67 C
# GPU 1 Current Temp: 71 C
# GPU 2 Current Temp: 80 C
Richard Haselgrove
Message 32732 - Posted: 5 Sep 2013, 15:07:47 UTC - in response to Message 32731.  

I hadn't reset my DCF, so I only got one job, and it started immediately in high priority - task 7250880. Initial indications are that it will run for roughly 2 hours, if it doesn't spend too much time rewinding.

@ MJH - those temperature readings.

BOINC has a limit on the amount of text it will return via stderr - 64KB, IIRC. Depending on the client version in use, you might get the first 64KB (with that startup block Jacob wanted), or the last 64KB (which is more likely to contain a crash dump analysis, of interest to debuggers). We could look up which version does what, if you wish.

Of course, if you could shrink the bit in the middle, you might be able to fit both ends into 64KB.
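The "shrink the bit in the middle" idea would, in effect, keep the head of stderr (the startup blocks) and its tail (any crash analysis) and drop whatever lies between, so both ends fit under the limit. A rough sketch of that kind of truncation (illustrative only; not how BOINC itself implements it):

# Illustrative: keep the start and end of a long stderr log so that both the
# startup blocks and any crash-dump analysis fit within a ~64 KB limit.
LIMIT = 64 * 1024

def shrink_middle(text: str, limit: int = LIMIT) -> str:
    data = text.encode("utf-8", errors="replace")
    if len(data) <= limit:
        return text                       # already fits, nothing to do
    marker = b"\n[... middle of output trimmed ...]\n"
    keep = (limit - len(marker)) // 2     # bytes to keep from each end
    trimmed = data[:keep] + marker + data[-keep:]
    return trimmed.decode("utf-8", errors="replace")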
MJH
Message 32733 - Posted: 5 Sep 2013, 15:10:38 UTC - in response to Message 32732.  

The next version will emit temperatures only when they change.

MJH
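"Only when they change" presumably just means suppressing consecutive identical readings per GPU, along these lines (a sketch of the idea, not the app's code):

# Sketch: print a GPU's temperature only when it differs from the last reading.
_last_temp = {}

def log_temperature(gpu_id: int, temp_c: int) -> None:
    if _last_temp.get(gpu_id) != temp_c:
        _last_temp[gpu_id] = temp_c
        print(f"# GPU {gpu_id} Current Temp: {temp_c} C")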
Jacob Klein
Message 32734 - Posted: 5 Sep 2013, 15:21:05 UTC - in response to Message 32733.  
Last modified: 5 Sep 2013, 15:25:39 UTC

Just to clarify... Most of my GPUGrid tasks get "started" about 5-10 times, as I'm suspending the GPU often (for exclusive apps), and restarting the machine often (troubleshooting nVidia driver problems).

What I'm MOST interested in is seeing the "GPU identification block" for every restart. So, if a task was restarted 10 times, I expect to see 10 such blocks, without truncation, in the web result.

Hopefully that's not too much to ask.
Thanks.
MJH
Message 32736 - Posted: 5 Sep 2013, 15:50:32 UTC - in response to Message 32734.  

8.13: reduces temperature verbosity, and traps and recovers from access violations.

MJH
Richard Haselgrove
Message 32738 - Posted: 5 Sep 2013, 17:47:26 UTC - in response to Message 32732.  
Last modified: 5 Sep 2013, 18:03:23 UTC

task 7250880.

I think we got what you were hoping for:

# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 2561000)
# GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55]

Edit - on the other hand, task 7250828 wasn't so lucky. And I can't see any special messages in the BOINC event log either: these are all ones I would have expected to see anyway.

05/09/2013 18:52:21 | GPUGRID | Finished download of 147-MJHARVEY_CRASH1-0-xsc_file
05/09/2013 18:52:21 | GPUGRID | [coproc] NVIDIA instance 1: confirming for 170xlow-NOELIA_INS1P-0-12-RND0803_1
05/09/2013 18:52:21 | GPUGRID | [coproc] Assigning 0.990000 of NVIDIA free instance 0 to 119-MJHARVEY_CRASH1-0-25-RND2510_0
05/09/2013 18:52:21 | SETI@home | [coproc] NVIDIA device 0 already assigned: task 29mr08ag.26459.13160.3.12.250_0
05/09/2013 18:52:21 | SETI@home | [coproc] NVIDIA device 0 already assigned: task 29mr08ag.26459.13160.3.12.74_0
05/09/2013 18:52:21 | SETI@home | [cpu_sched] Preempting 29mr08ag.26459.13160.3.12.74_0 (removed from memory)
05/09/2013 18:52:21 | SETI@home | [cpu_sched] Preempting 29mr08ag.26459.13160.3.12.250_0 (removed from memory)
05/09/2013 18:52:21 | GPUGRID | [coproc] NVIDIA instance 1: confirming for 170xlow-NOELIA_INS1P-0-12-RND0803_1
05/09/2013 18:52:21 | GPUGRID | [coproc] Assigning 0.990000 of NVIDIA free instance 0 to 119-MJHARVEY_CRASH1-0-25-RND2510_0
05/09/2013 18:52:22 | GPUGRID | [coproc] NVIDIA instance 1: confirming for 170xlow-NOELIA_INS1P-0-12-RND0803_1
05/09/2013 18:52:22 | GPUGRID | [coproc] Assigning 0.990000 of NVIDIA free instance 0 to 119-MJHARVEY_CRASH1-0-25-RND2510_0
05/09/2013 18:52:22 | GPUGRID | Restarting task 119-MJHARVEY_CRASH1-0-25-RND2510_0 using acemdbeta version 812 (cuda55) in slot 10
05/09/2013 18:52:23 | GPUGRID | [sched_op] Deferring communication for 00:01:36
05/09/2013 18:52:23 | GPUGRID | [sched_op] Reason: Unrecoverable error for task 119-MJHARVEY_CRASH1-0-25-RND2510_0
05/09/2013 18:52:23 | GPUGRID | Computation for task 119-MJHARVEY_CRASH1-0-25-RND2510_0 finished
05/09/2013 18:52:23 | GPUGRID | Output file 119-MJHARVEY_CRASH1-0-25-RND2510_0_1 for task 119-MJHARVEY_CRASH1-0-25-RND2510_0 absent
05/09/2013 18:52:23 | GPUGRID | Output file 119-MJHARVEY_CRASH1-0-25-RND2510_0_2 for task 119-MJHARVEY_CRASH1-0-25-RND2510_0 absent
05/09/2013 18:52:23 | GPUGRID | Output file 119-MJHARVEY_CRASH1-0-25-RND2510_0_3 for task 119-MJHARVEY_CRASH1-0-25-RND2510_0 absent
05/09/2013 18:52:23 | GPUGRID | [coproc] NVIDIA instance 1: confirming for 170xlow-NOELIA_INS1P-0-12-RND0803_1
05/09/2013 18:52:23 | GPUGRID | [coproc] Assigning 0.990000 of NVIDIA free instance 0 to 147-MJHARVEY_CRASH1-0-25-RND2539_0
05/09/2013 18:52:34 | GPUGRID | [coproc] NVIDIA instance 1: confirming for 170xlow-NOELIA_INS1P-0-12-RND0803_1
05/09/2013 18:52:34 | GPUGRID | [coproc] Assigning 0.990000 of NVIDIA free instance 0 to 147-MJHARVEY_CRASH1-0-25-RND2539_0
05/09/2013 18:52:34 | GPUGRID | Starting task 147-MJHARVEY_CRASH1-0-25-RND2539_0 using acemdbeta version 813 (cuda55) in slot 9
05/09/2013 18:52:35 | GPUGRID | Started upload of 119-MJHARVEY_CRASH1-0-25-RND2510_0_0