acemdlong application 8.14 - discussion

skgiven (Volunteer moderator, volunteer tester)
Message 32925 - Posted: 12 Sep 2013, 23:15:21 UTC - in response to Message 32924.  
Last modified: 12 Sep 2013, 23:15:55 UTC

Operator, what setting do you have in place for 'Switch between applications every (time)'?
- BOINC Manager (Advanced View), Tools, Computing Preferences, Other Options.

I use 990 min, but I think the default is 60 min (which means BOINC will run one app, then another...).
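
For reference, that Manager setting corresponds to a single tag in global_prefs_override.xml in the BOINC data directory. This is only a minimal sketch from memory of the standard BOINC 7.x format (check the tag against an existing prefs file on your machine, then use 'Read local prefs file' in the Manager):

<global_preferences>
   <!-- 'Switch between applications every' - value in minutes -->
   <cpu_scheduling_period_minutes>990</cpu_scheduling_period_minutes>
</global_preferences>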
FAQs

HOW TO:
- Opt out of Beta Tests
- Ask for Help

MJH
Message 32926 - Posted: 12 Sep 2013, 23:16:40 UTC - in response to Message 32924.  


Oh and the Titan machine keeps switching to waiting WUs and then back again. Meaning it will work on two of them for 10 or 20% and then stop and start on the other two in the queue, and then go back to the first two.


And that's in addition to the "thread suspend"ing? Can you say how frequently that is happening (watch the output to the stderr file as the app is running)? Does the task state that the client reports (running, suspended, etc.) match, or is it always showing the task as running, even when you can see that it has suspended?

Could you try running just a single task, and see whether that behaves itself? Use the app_config setting Beyond advises, here:
[quote]
http://www.gpugrid.net/forum_thread.php?id=3473&nowrap=true#32913
[/quote]

Matt
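
For anyone who can't open the link, here is a minimal sketch of an app_config.xml along those lines; the exact values Beyond advises are in the linked post, so the max_concurrent of 1 and the projects\www.gpugrid.net folder location are assumptions here. This limits the client to one acemdlong task at a time; after saving, re-read config files from the Manager or restart BOINC:

<app_config>
   <app>
      <name>acemdlong</name>
      <!-- assumption: run only one long task at a time, per Matt's suggestion -->
      <max_concurrent>1</max_concurrent>
      <gpu_versions>
         <gpu_usage>1.0</gpu_usage>
         <cpu_usage>1.0</cpu_usage>
      </gpu_versions>
   </app>
</app_config>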

Operator
Message 32927 - Posted: 13 Sep 2013, 2:09:00 UTC - in response to Message 32926.  


[quote]
And that's in addition to the "thread suspend"ing? Can you say how frequently that is happening (watch the output to the stderr file as the app is running)? Does the task state that the client reports (running, suspended, etc.) match, or is it always showing the task as running, even when you can see that it has suspended?

Could you try running just a single task, and see whether that behaves itself? Use the app_config setting Beyond advises, here:
http://www.gpugrid.net/forum_thread.php?id=3473&nowrap=true#32913

Matt
[/quote]

I've just changed the computing prefs from 60.0 minutes to 990.0.

Don't know why this Titan box would start hopping around now when it never has before, and the GTX 590 box has the exact same settings and doesn't do it with the 4 WUs it has waiting in the queue... but...

And as for the stderr:

Check this out; this is just a portion:

9/12/2013 8:09:37 PM | GPUGRID | Restarting task e7s7_e6s12f416-SDOERR_VillinAdaptN2-0-1-RND2117_0 using acemdlong version 814 (cuda55) in slot 1
9/12/2013 8:09:37 PM | GPUGRID | Restarting task e7s15_e4s11f480-SDOERR_VillinAdaptN4-0-1-RND1175_0 using acemdlong version 814 (cuda55) in slot 3
9/12/2013 8:14:11 PM | GPUGRID | Restarting task I93R10-NATHAN_KIDc22_glu-1-10-RND9319_0 using acemdlong version 814 (cuda55) in slot 0
9/12/2013 8:15:26 PM | GPUGRID | Restarting task e7s7_e6s12f416-SDOERR_VillinAdaptN2-0-1-RND2117_0 using acemdlong version 814 (cuda55) in slot 1
9/12/2013 8:17:38 PM | GPUGRID | Restarting task e7s15_e4s11f480-SDOERR_VillinAdaptN4-0-1-RND1175_0 using acemdlong version 814 (cuda55) in slot 3
9/12/2013 8:19:38 PM | GPUGRID | Restarting task I93R10-NATHAN_KIDc22_glu-1-10-RND9319_0 using acemdlong version 814 (cuda55) in slot 0
9/12/2013 8:22:10 PM | GPUGRID | Restarting task e7s7_e6s12f416-SDOERR_VillinAdaptN2-0-1-RND2117_0 using acemdlong version 814 (cuda55) in slot 1
9/12/2013 8:23:00 PM | GPUGRID | Restarting task e7s15_e4s11f480-SDOERR_VillinAdaptN4-0-1-RND1175_0 using acemdlong version 814 (cuda55) in slot 3
9/12/2013 8:27:13 PM | GPUGRID | Restarting task I93R10-NATHAN_KIDc22_glu-1-10-RND9319_0 using acemdlong version 814 (cuda55) in slot 0
9/12/2013 8:28:06 PM | GPUGRID | Restarting task e7s15_e4s11f480-SDOERR_VillinAdaptN4-0-1-RND1175_0 using acemdlong version 814 (cuda55) in slot 3
9/12/2013 8:30:32 PM | GPUGRID | Restarting task e7s7_e6s12f416-SDOERR_VillinAdaptN2-0-1-RND2117_0 using acemdlong version 814 (cuda55) in slot 1
9/12/2013 8:36:27 PM | GPUGRID | Restarting task I93R10-NATHAN_KIDc22_glu-1-10-RND9319_0 using acemdlong version 814 (cuda55) in slot 0
9/12/2013 8:39:18 PM | GPUGRID | Restarting task e7s7_e6s12f416-SDOERR_VillinAdaptN2-0-1-RND2117_0 using acemdlong version 814 (cuda55) in slot 1
9/12/2013 8:48:54 PM | GPUGRID | Restarting task e7s15_e4s11f480-SDOERR_VillinAdaptN4-0-1-RND1175_0 using acemdlong version 814 (cuda55) in slot 3
9/12/2013 8:51:12 PM | GPUGRID | Restarting task e7s7_e6s12f416-SDOERR_VillinAdaptN2-0-1-RND2117_0 using acemdlong version 814 (cuda55) in slot 1
9/12/2013 8:55:34 PM | GPUGRID | Restarting task I93R10-NATHAN_KIDc22_glu-1-10-RND9319_0 using acemdlong version 814 (cuda55) in slot 0
9/12/2013 8:56:47 PM | GPUGRID | Restarting task e7s7_e6s12f416-SDOERR_VillinAdaptN2-0-1-RND2117_0 using acemdlong version 814 (cuda55) in slot 1
9/12/2013 9:01:11 PM | GPUGRID | Restarting task e7s15_e4s11f480-SDOERR_VillinAdaptN4-0-1-RND1175_0 using acemdlong version 814 (cuda55) in slot 3

I will try running just one WU (suspending the others) and see what happens.

Operator

Beyond
Message 32930 - Posted: 13 Sep 2013, 12:49:35 UTC - in response to Message 32925.  

Operator, what setting do you have in place for, Switch Between Applications Every (time)?
- Boinc Manager (Advanced View), Tools, Computing Preferences, Other Options.

I use 990min, but I think the default is 60min (which means Boinc will run one app then another...).

I use a higher setting too, but this should only apply when running more than one project. If it's switching between WUs of the same project, it's a BOINC bug (heaven forbid) and should be reported.

5pot
Message 32931 - Posted: 13 Sep 2013, 13:30:03 UTC

My 780 completes them fine, but it has constant access violations.

Richard Haselgrove
Message 32932 - Posted: 13 Sep 2013, 13:56:07 UTC - in response to Message 32930.  

Operator, what setting do you have in place for, Switch Between Applications Every (time)?
- Boinc Manager (Advanced View), Tools, Computing Preferences, Other Options.

I use 990min, but I think the default is 60min (which means Boinc will run one app then another...).

I use a higher setting too but this should only apply when running more than 1 project. If it's switching between WUs of the same project it's a BOINC bug (heaven forbid) and should be reported.

That is one side-effect of the "BOINC temporary exit" procedure, which I think is what Matt Harvey is using for his new crash-recovery application.

When an internal problem occurs, the new v8.14 application exits completely, and as far as BOINC knows, the GPU is free and available to schedule another task. If you have another task - from this or any other project - ready to start, BOINC will start it.

On my system, with a (still) stupidly high DCF (duration correction factor), the sequence is:
Task 1 errors and exits
Task 2 starts on the vacant GPU
Task 1 becomes ready ('waiting to run') again.
BOINC notices a deadline miss looming
BOINC pre-empts Task 2, and restarts Task 1, marking it 'high priority'

That results in Task 2 showing 'Waiting to run', with a minute or two of runtime completed, before Task 1 finally completes.

With more normal estimates and no EDF (earliest-deadline-first mode), Task 1 and Task 2 would swap places every time a fault and temporary exit occurred. Even if you run a minimal cache and don't normally fetch the next task until shortly before it is needed, you may find that a work fetch is triggered by the first error and temporary exit, and the swapping behaviour starts after that.

If the tasks swap places more than a few times each run, your GPU is marginal and you should investigate the cause.
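
If you want to see the client's reasoning for each of those pre-emptions and restarts in the Event Log, one option is to enable the scheduler debug flags in cc_config.xml in the BOINC data directory and then 'Read config files' from the Manager. This is only a sketch using standard BOINC client log flags, nothing project-specific:

<cc_config>
   <log_flags>
      <!-- explain CPU/GPU scheduler decisions (why a task is preempted or started) -->
      <cpu_sched_debug>1</cpu_sched_debug>
      <!-- log task start/suspend/resume events -->
      <task_debug>1</task_debug>
      <!-- log coprocessor (GPU) assignment -->
      <coproc_debug>1</coproc_debug>
   </log_flags>
</cc_config>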

Operator
Message 32933 - Posted: 13 Sep 2013, 15:48:12 UTC - in response to Message 32932.  



When an internal problem occurs, the new v8.14 application exits completely, and as far as BOINC knows, the GPU is free and available to schedule another task. If you have another task - from this or any other project - ready to start, BOINC will start it.

On my system, with (still) a stupidly high DCF, the sequence is:
Task 1 errors and exits
Task 2 starts on the vacant GPU
Task 1 becomes ready ('waiting to run') again.
BOINC notices a deadline miss looming
BOINC pre-empts Task 2, and restarts Task 1, marking it 'high priority'

That results in Task 2 showing 'Waiting to run', with a minute or two of runtime completed, before Task 1 finally completes.

With more normal estimates and no EDF, Task 1 and Task 2 would swap places every time a fault and temporary exit occurred. Even if you run minimal cache and don't normally fetch the next task until shortly before it is needed, you may find that a work fetch is triggered by the first error and temporary exit, and the swapping behaviour starts after that.

If the tasks swap places more than a few times each run, your GPU is marginal and you should investigate the cause.



I'm not sure what your last statement, "your GPU is marginal", means.

I have reviewed some completed WUs for another Titan host to see if we were having similar issues.
http://www.gpugrid.net/results.php?hostid=156948

And this host (Anonymous) is posting completion times I used to get while running the 8.03 long app.

The major thing that jumped out at me was that this box is running driver 326.41.

Still showing lots of "Access violations" though.

I believe I'm running 326.84.

I just can't figure out why this started seemingly out of the blue. I don't remember making any changes to the system, and this box does nothing but GPUGrid.

Operator

Richard Haselgrove
Message 32934 - Posted: 13 Sep 2013, 16:29:17 UTC - in response to Message 32933.  

I'm not sure what your last statement means "your GPU is marginal".

For me, a marginal GPU is one which throws a lot of errors, and a GPU which throws a lot of errors is marginal, even if it recovers from them.

By "marginal" I mean faulty, badly installed, badly maintained, or just in need of some TLC.

Things like overclocking, overheating, under-ventilation, under-powering (small PSU), bad seating (in the PCIe slot) - anything which makes it unhappy.

MJH
Message 32935 - Posted: 13 Sep 2013, 17:04:03 UTC - in response to Message 32934.  

Error -97s are a strong indication that the GPU is misbehaving. When I get around to it, I'll sort out a memory testing program for you all to test for this. The access violations are indicative of GPU hardware problems. There are a few hosts that are disproportionately affected by these (Operator, 5pot) and it's not clear why. I suspect there's a relationship with some third-party software, but don't yet have a handle on it.

Matt

TJ
Message 32936 - Posted: 13 Sep 2013, 17:15:39 UTC - in response to Message 32935.  

Error -97s are a strong indication that the GPU is misbehaving. When I get around to it, I'll sort out a memory testing program for you all to test for this. The access violations are indicative of GPU hardware problems. There are a few hosts that are disproportionately affected by these (Operator, 5pot) and it's not clear why. I suspect there's a relationship with some third-party software, but don't yet have a handle on it.

Matt

Well, my GTX 660 has a lot of -97 errors, especially with the CRASH (Santi) tests. LRs and the Noelias are handled error-free most of the time on it. Einstein, Albert and Milkyway run almost error-free on it too. So if it were something in the PC or other software (I don't have much installed), wouldn't it also result in 50% errors with the other three GPU projects? Or not?
Greetings from TJ

Retvari Zoltan
Message 32943 - Posted: 13 Sep 2013, 23:14:45 UTC - in response to Message 32936.  
Last modified: 13 Sep 2013, 23:16:55 UTC

Well my GTX660 has a lot of -97 errors, especially with the CRASH (Santi) tests. LR's and the Noelia's are handled most of the time error free on it. Einstein, Albert and Milkyway run almost error free on it too. So if it is something of the PC or other software (I don't have many installed), would also result in 50% error with the other 3 GPU projects. Or not?

No. Different projects have different apps, which use different parts of the GPU (causing different GPU usage, and/or power draw).
I have two Asus GTX 670 DC2OG 2GD5 (factory overclocked) cards in one of my hosts, and they were unreliable with some batches of workunits. I upgraded the MB/CPU/RAM in this host, but its reliability stayed low with those batches. Even the WLAN connection was lost when the GPU-related failures happened. Then I found a voltage-tweaking utility for Kepler-based cards, and with its help I raised the GPU's boost voltage by 25 mV and its power limit by 25 W. Since then this host hasn't had any GPU-related errors. Maybe you should give it a try too. It's quite a simple tool; if you put nvflash beside its files, it can flash the modified BIOS directly.

5pot
Message 32944 - Posted: 13 Sep 2013, 23:34:38 UTC

The *only* other thing running besides GPUGrid was WCG. And I mean no anti-virus, nothing else at all. Still got the access violations. So I suspended the WCG app. The access violations still occurred.

Although my GPU appears to be functioning correctly (just in general), I bumped the voltage to 1.175 V from 1.163 V. I will report back on whether this has any impact on the access violations.

TJ
Message 32945 - Posted: 13 Sep 2013, 23:39:43 UTC - in response to Message 32943.  

Thanks Zoltan, I will install them and try tomorrow (later today after sleep).
Greetings from TJ

Operator
Message 32954 - Posted: 14 Sep 2013, 17:37:41 UTC - in response to Message 32935.  
Last modified: 14 Sep 2013, 17:44:58 UTC

Error -97s are a strong indication that the GPU is misbehaving. When I get around to it, I'll sort out a memory testing program for you all to test for this. The access violations are indicative of GPU hardware problems. There are a few hosts that are disproportionately affected by these (Operator, 5pot) and it's not clear why. I suspect there's a relationship with some third-party software, but don't yet have a handle on it.

Matt


Yesterday evening, in response to Mr. Haselgrove's post, I decided to do a thorough review of my Titan system.

I removed the 326.84 drivers and did a clean install of the 326.41 drivers.

Removed the Precision X utility completely.

Removed, inspected and reinstalled both Titan GPUs. No dust or other foreign contaminants were found inside the case or GPU enclosures.

All cabling checks out.

Confirmed all BIOS settings were as originally set and correct.

No other programs are starting with Windows except Teamviewer (which is on all my systems and was long before these problems started).

I reinstalled BOINC from scratch with no settings saved from the old installation.

I set up GPUGrid all over again (new machine number now).

No app_config or any other XML tweak files are present.

Both GPUs are running with factory settings (EVGA GTX Titan SC).

I downloaded two long 8.14 WUs and they started. Then the second two downloaded and went from "Ready to Start" to "Waiting to Run" as the first two paused and the system started processing the newly arrived ones. And then, after a few minutes, it went back to the first two that were downloaded.

Temps the whole time were nominal - that is to say, in the 70s C.

I monitored Task Manager and would occasionally see one of the 8.14 apps drop off, leaving only one running, and then after a minute or so there would be two of them running again.

The results show multiple access violations:

http://www.gpugrid.net/result.php?resultid=7277767

http://www.gpugrid.net/result.php?resultid=7267751

It is also taking longer to complete these WUs using 8.14 (than with the 8.03 app).

Again, this started for me with the 8.14 app. None of this happened on the 8.03 app; even with a bit of overclocking it was dead stable.

After what I went through last evening, I am confident that this issue is not with my system.

I have now set this box to do betas only.

- Update: the betas (8.14) are doing the same swapping as the normal long WUs were doing. I have to manually suspend two of them to keep this from happening.

On the other hand my dual GTX 590 system is happy as a clam.

Two steps forward, one step back.

Operator

5pot
Message 32955 - Posted: 14 Sep 2013, 17:54:40 UTC

Yours appear to be happening much more frequently than mine. Even after I increased the voltage, I still get access violations.

To reiterate, this is with absolutely no third-party software running at all. In all honesty, I can only assume that something in the app is causing this behavior.

Why it's affecting Operator's Titans more than my 780s, I'm not sure.

Operator
Message 32957 - Posted: 14 Sep 2013, 18:59:54 UTC - in response to Message 32954.  
Last modified: 14 Sep 2013, 19:02:07 UTC


I have now set this box to do betas only.

- Updated the betas (8.14) are doing the same swapping as the normal long WUs were doing. I have to manually suspend two of them to keep this from happening.



And now...

9/14/2013 12:44:56 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 12:48:12 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 12:51:13 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 12:54:26 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 12:57:43 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:00:45 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:03:48 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:07:02 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:10:19 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:13:22 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:16:24 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:19:27 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:22:29 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:25:32 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:28:34 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:31:38 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:34:40 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:37:33 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:40:50 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:43:52 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:46:56 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:49:59 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:53:02 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:56:04 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:58:56 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0

So even with the two waiting WUs manually suspended, the ones that are supposed to be running keep starting and stopping.

I really have no idea what could be causing this.

Operator

TJ
Message 32961 - Posted: 14 Sep 2013, 23:56:07 UTC - in response to Message 32957.  

Hello Operator,

Perhaps a ridiculous idea and perhaps you have tried it already: what happens when you only use one Titan in that PC?

Have you set the minimum and maximum work buffer as low as possible, so that you don't get new WUs too fast?

It seems to be happening with the Titan and the 780, the two top cards. The 326.41 drivers are not the latest ones on NVIDIA's site.
Greetings from TJ
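
For reference, the work buffer TJ mentions maps to two tags in global_prefs_override.xml. A minimal sketch only - the 0.01-day values are just an example of "as low as possible", and the tag names should be checked against an existing prefs file before use:

<global_preferences>
   <!-- example values: keep the cache close to empty so spare WUs don't pile up -->
   <work_buf_min_days>0.01</work_buf_min_days>
   <work_buf_additional_days>0.01</work_buf_additional_days>
</global_preferences>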

Operator
Message 32963 - Posted: 15 Sep 2013, 3:18:04 UTC - in response to Message 32961.  

Hello Operator,

Perhaps a ridiculous idea and perhaps you have tried it already: what happens when you only use one Titan in that PC?

Have you set the minimum and maximum work buffer as low as possible, so that you don't get to fast new WU's?

It seems to be happening with Titan and 780 the two top cards. The 326.41 drivers are not the latest one at nVidia's site.


TJ

I thought of taking one card out to see what happens and will probably try that tomorrow.

I observed today, running the beta apps (8.14), that there were times when both WUs were "Waiting to Run" with "Scheduler: Access Violation". Meaning that with only two WUs, one for each GPU, nothing was getting done because the app(s) had temporarily shut down.

So I would expect that with just one card the same thing would happen: just one WU, and sometimes it would stop and then restart.

I will try that tomorrow anyway.

Operator


ExtraTerrestrial Apes (Volunteer moderator, volunteer tester)
Message 32970 - Posted: 15 Sep 2013, 12:15:55 UTC - in response to Message 32963.  

Operator: this sounds like your card(s?) may be experiencing computation errors which were not detected by the 8.03 app. I don't know for sure, but it could well be that the new recovery mode also added enhanced fault detection. What I'd try:

- use 1 Titan (as TJ said) to rule out an insufficient PSU
- downclock the cards a fair bit (~50 MHz should do). I know they're at factory clocks now, but it wouldn't be the first time that a manufacturer set factory overclocks too high
- try a card in a different PC

MrS
Scanning for our furry friends since Jan 2002

TJ
Message 32977 - Posted: 15 Sep 2013, 20:54:00 UTC

This morning I saw that my GTX 660 was downclocked to 50% again (it would be nice, Matt, if we could see that in the stderr output file), so I rebooted the system.
Then in the evening I saw that one WU was 'Waiting to run' while the other one was running, with only 2% left on the WU that was waiting. Same as Operator had on his Titan. However, I did not intervene, and the WU finished with a good result, but I saw this in the output
(only a part of it, of course):
# GPU 0 : 64C
SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1963.
# SWAN swan_assert 0
# GPU [GeForce GTX 660] Platform [Windows] Rev [3203] VERSION [42]
# SWAN Device 0	:
#	Name		: GeForce GTX 660


I have never seen this before. Perhaps "strange" things happened in the past as well, but now we can look for it.
Greetings from TJ