acemdbeta application - discussion
Author | Message |
---|---|
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 |
Yes - I read "blue-screen" but heard "driver reset". BSODs are by definition a driver bug. It's axiomatic that no user program should be able to crash the kernel. "DPC_WATCHDOG_VIOLATION" is the event that the driver is supposed to trap and recover from, triggering a driver reset. Evidently that's not a perfect process. MJH |
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 |
I think if a driver encounters too many TDRs in a short period of time, the OS issues the DPC_WATCHDOG_VIOLATION BSOD. I believe it is not a driver issue. I believe it is a result of getting too many TDRs (from GPUGrid apps). |
Joined: 25 May 09 Posts: 224 Credit: 34,057,374,498 RAC: 0 |
5/5 8.11 Noelia beta WUs failed on time exceeded. Sample. Previously, 8.11 MJHarvey betas ran OK, as did 8.05 Noelia betas. XP32, GTX 570 stock, 314.22, 7.0.64 |
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 |
Just killing off the remaining beta WUs now. MJH |
Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 1 |
“Just killing off the remaining beta WUs now.” I had a beta WU from the TEST18 series; my host reported it successfully, and it didn't receive an abort request from the GPUGrid scheduler. Is there anything we should do? (For example, manually abort all beta tasks, including NOELIA_KLEBEbeta tasks?) |
Joined: 28 Mar 09 Posts: 490 Credit: 11,731,645,728 RAC: 47,738 |
I guess we did enough testing for now, and we can forget beta versions 8.05 to 8.10. Here is the proof. Look at the finishing times. Versions 8.05 & 8.10: 7245172 4749995 3 Sep 2013, 23:26:39 UTC 4 Sep 2013, 19:10:37 UTC Completed and validated 68,637.53 21,605.83 142,800.00 ACEMD beta version v8.10 (cuda55); 7245072 4732574 3 Sep 2013, 22:31:54 UTC 4 Sep 2013, 20:20:27 UTC Completed and validated 75,759.22 21,670.35 142,800.00 ACEMD beta version v8.05 (cuda42). Versus version 8.11: 7247558 4731877 4 Sep 2013, 13:32:52 UTC 5 Sep 2013, 1:24:54 UTC Completed and validated 35,208.78 7,022.73 142,800.00 ACEMD beta version v8.11 (cuda55); 7247095 4751418 4 Sep 2013, 10:00:08 UTC 5 Sep 2013, 1:03:53 UTC Completed and validated 34,651.15 7,074.61 142,800.00 ACEMD beta version v8.11 (cuda55). Talk about the downclocking and low GPU usage that happened in Windows 7!! |
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 |
The "simulation unstable" (err -97) failure mode can be quite painful in terms of lost credit. In some circumstances this can be a recoverable error, so aborting the WU is unnecessarily wasteful. There'll be a new beta out in a few hours putting this recovery into practice. It will also be accompanied by a new batch of beta WUs, MJHARVEY-CRASHY. If you have been encountering this error a lot, please start taking these WUs - I need to see err -97 failures, and lots of them. MJH |
Joined: 11 Dec 11 Posts: 21 Credit: 145,887,858 RAC: 0 |
Sorry if I put this in the wrong thread. It could be just curiosity, but could somebody explain why these WUs receive such different credit, when they ran on the same host and are the same kind of WU (NOELIA)? WU 1 - http://www.gpugrid.net/result.php?resultid=7238706 WU 2 - http://www.gpugrid.net/result.php?resultid=7247939 WU1 is cuda42 and WU2 is cuda55; WU1 runs for less time than WU2 and receives about 20% more credit. Both WUs were reported within the 24 hr limit. |
Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 1 |
“Could be just curiosity but somebody could explain why this WU receives such diferences in credits, if they uses about the same GPU/CPU times on similar hosts and are from the same kind of WU (NOELIA)” These workunits are from the same scientist (Noelia), but they are not in the same batch. The first workunit is a NOELIA_FRAG041p. The second workunit is a NOELIA_INS1P. |
Joined: 11 Dec 11 Posts: 21 Credit: 145,887,858 RAC: 0 |
“Could be just curiosity but somebody could explain why this WU receives such diferences in credits, if they uses about the same GPU/CPU times on similar hosts and are from the same kind of WU (NOELIA)” By that I understand the credit "paid" is not related to the processing power used to crunch the WU; it is related to the batch of the WU, when somebody decides the number of credits paid for that batch's WUs. That's different from most other BOINC projects, and that's what bugs my mind. Initially I expected the same number of credits for the same processing time used on the same host (or something close to it). Please understand me, I don't question the method; I just want to find the answer why. That's OK now. Thanks for the answer and happy crunching. |
Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 1 |
“By that i understand, the credit "paid" is not related to the processing power used to crunch the WU is related to the batch of the WU when somebody decides the number of credit paid by the batch WU´s, that´s diferent from most of the other Boinc projects and why that´s bugs my mind. Initialy i expect the same number of credits for the same processing time used on the same host (or something aproximately).” On reading your previous post a second time, it occurred to me that your problem could be that a shorter WU received more credit than a longer WU. Well, that's a paradox. It happens from time to time. Later on you will get used to it. It's probably caused by the method used for approximating the processing power needed for the given WU (based on the complexity of the model and the number of steps needed). The shorter (~30 ksec) WU had 6.25 million steps and received 180k credit (6 credit/sec). The longer (~31.4 ksec) WU had 4.2 million steps and received 148.5k credit (4.73 credit/sec). There is a 27% difference between the credit/sec rates of the two workunits. It's significant, but not unusual. |
Joined: 11 Dec 11 Posts: 21 Credit: 145,887,858 RAC: 0 |
“For the second time I've read your previous post it came to my mind that your problem could be that a shorter WU received more credit than a longer WU. Well, that's a paradox......It's significant, but not unusual.” That's exactly what I mean: the paradox (less time, more credit; more time, less credit). But if it is normal and not a bug... then I'll go on crunching both. Thanks for your time and explanations. |
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 |
Ok chaps, there's a new beta 8.12. This comes along with a bunch of WUs, MJHARVEY-CRASH1. If you have suffered error -97 "simulation unstable" errors, please take some of these WUs. The new app implements a recovery mechanism that should see unstable simulations automatically restarted from an earlier checkpoint. Recoveries should be accompanied by a message to the BOINC console, and in the stderr when the job is complete. I have had to update the BOINC client library to implement this, so expect everything to go hilariously wrong. MJH |
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 295,172 |
“I have had to update the BOINC client library to implement this, so expect everything to go hilariously wrong.” ROFL - now that's a good way to attract testers! We have been duly warned, and I'm off to try and download some now. I've seen a few exits with 'error -97', but not any great number. If I get any CRASH1 tasks now, they will run on device 0 - the hotter of my two GTX 670s - hopefully that will generate some material for you to work with. |
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 |
I grabbed 4 of them too. It'll be a couple of hours before my 2 GPUs can begin work on them. Note: It looks like the 8.11 app "floods" the stderr.txt file with tons of lines of GPU temperature readings. This makes it impossible for me to see all the "GPU start blocks" on the results page. Is there any way to either not flood it with temps, or maybe put continuous temp readings on a single line? Basically, if possible, for any of my completed tasks, I'd prefer to see ALL of the blocks that look like this: # GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 0 : # Name : GeForce GTX 660 Ti # ECC : Disabled # Global mem : 3072MB # Capability : 3.0 # PCI ID : 0000:09:00.0 # Device clock : 1124MHz # Memory clock : 3004MHz # Memory width : 192bit # Driver version : r325_00 : 32680 ... but, instead, all I'd get to see in the "truncated" web result is: # GPU 0 Current Temp: 67 C # GPU 1 Current Temp: 71 C # GPU 2 Current Temp: 80 C # GPU 0 Current Temp: 67 C # GPU 1 Current Temp: 71 C # GPU 2 Current Temp: 80 C # GPU 0 Current Temp: 67 C # GPU 1 Current Temp: 71 C # GPU 2 Current Temp: 80 C # GPU 0 Current Temp: 67 C # GPU 1 Current Temp: 71 C # GPU 2 Current Temp: 80 C # GPU 0 Current Temp: 67 C # GPU 1 Current Temp: 71 C # GPU 2 Current Temp: 80 C # GPU 0 Current Temp: 67 C # GPU 1 Current Temp: 71 C # GPU 2 Current Temp: 80 C # GPU 0 Current Temp: 67 C # GPU 1 Current Temp: 71 C # GPU 2 Current Temp: 80 C |
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 295,172 |
I hadn't reset my DCF, so I only got one job, and it started immediately in high priority - task 7250880. Initial indications are that it will run for roughly 2 hours, if it doesn't spend too much time rewinding. @ MJH - those temperature readings. BOINC has a limit on the amount of text it will return via stderr - 64KB, IIRC. Depending on the client version in use, you might get the first 64KB (with that startup block Jacob wanted), or the last 64KB (which is more likely to contain a crash dump analysis, of interest to debuggers). We could look up which version does what, if you wish. Of course, if you could shrink the bit in the middle, you might be able to fit both ends into 64KB. |
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 |
The next version will emit temperatures only when they change. MJH |
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 |
Just to clarify... Most of my GPUGrid tasks get "started" about 5-10 times, as I'm suspending the GPU often (for exclusive apps), and restarting the machine often (troubleshooting nVidia driver problems). What I'm MOST interested in, is seeing the "GPU identification block" for every restart. So, if it was restarted 10 times, I expect to see 10 blocks, without truncation, in the web result. Hopefully that's not too much to ask. Thanks. |
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 |
8.13: reduce temperature verbosity and trap access violations and recover. MJH |
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 295,172 |
task 7250880. I think we got what you were hoping for: # The simulation has become unstable. Terminating to avoid lock-up (1) Edit - on the other hand, task 7250828 wasn't so lucky. And I can't see any special messages in the BOINC event log either: these are all ones I would have expected to see anyway. 05/09/2013 18:52:21 | GPUGRID | Finished download of 147-MJHARVEY_CRASH1-0-xsc_file |
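
The thread's request to stop flooding stderr, and MJH's reply that temperatures will be emitted "only when they change", can be illustrated with a small sketch. This is not the ACEMD implementation, which the thread does not show; it is a Python illustration under the assumption that readings look like the `# GPU 0 Current Temp: 67 C` lines quoted above. The names `TEMP_LINE` and `compact_temperatures` are hypothetical.

```python
import re

# Assumed line format, copied from the stderr excerpt quoted in the thread:
# "# GPU 0 Current Temp: 67 C"
TEMP_LINE = re.compile(r"^# GPU (\d+) Current Temp: (\d+) C$")


def compact_temperatures(lines):
    """Yield every line, but drop a temperature line when it repeats the
    last reading already emitted for the same GPU."""
    last_seen = {}  # GPU index -> last temperature emitted for that GPU
    for line in lines:
        match = TEMP_LINE.match(line.strip())
        if match is None:
            # Not a temperature reading (e.g. a GPU start block): always keep.
            yield line
            continue
        gpu, temp = int(match.group(1)), int(match.group(2))
        if last_seen.get(gpu) != temp:
            last_seen[gpu] = temp
            yield line


sample = [
    "# GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3203] VERSION [55]",
    "# GPU 0 Current Temp: 67 C",
    "# GPU 1 Current Temp: 71 C",
    "# GPU 2 Current Temp: 80 C",
    "# GPU 0 Current Temp: 67 C",
    "# GPU 1 Current Temp: 71 C",
    "# GPU 2 Current Temp: 80 C",
    "# GPU 0 Current Temp: 68 C",
]
compacted = list(compact_temperatures(sample))
for line in compacted:
    print(line)
```

Tracking the last reading per GPU, rather than only the previous line, matters here because the readings for GPUs 0, 1 and 2 are interleaved; comparing adjacent lines would never find a repeat. With less text in the middle, the GPU start blocks are more likely to fit inside the stderr size limit mentioned earlier in the thread.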
©2025 Universitat Pompeu Fabra