acemdlong application 8.14 - discussion

skgiven (Volunteer moderator, volunteer tester)
Message 32925 - Posted: 12 Sep 2013, 23:15:21 UTC - in response to Message 32924.  
Last modified: 12 Sep 2013, 23:15:55 UTC

Operator, what setting do you have in place for 'Switch between applications every (time)'?
- BOINC Manager (Advanced View), Tools, Computing Preferences, Other Options.

I use 990 min, but I think the default is 60 min (which means BOINC will run one app, then another...).
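
For reference, that Manager setting corresponds to a single tag in global_prefs_override.xml in the BOINC data directory. This is only a minimal sketch from memory of the standard BOINC 7.x format (check the tag against an existing prefs file on your machine, then use 'Read local prefs file' in the Manager):

<global_preferences>
   <!-- 'Switch between applications every' - value in minutes -->
   <cpu_scheduling_period_minutes>990</cpu_scheduling_period_minutes>
</global_preferences>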
FAQs

HOW TO:
- Opt out of Beta Tests
- Ask for Help

MJH
Message 32926 - Posted: 12 Sep 2013, 23:16:40 UTC - in response to Message 32924.  


Oh and the Titan machine keeps switching to waiting WUs and then back again. Meaning it will work on two of them for 10 or 20% and then stop and start on the other two in the queue, and then go back to the first two.


And that's in addition to the "thread suspend"ing? Can you say how frequently that is happening (watch the output to the stderr file as the app is running)? Does the task state that the client reports (running, suspended, etc.) match, or is it always showing the task as running, even when you can see that it has suspended?

Could you try running just a single task, and see whether that behaves itself? Use the app_config setting Beyond advises, here:
[quote]
http://www.gpugrid.net/forum_thread.php?id=3473&nowrap=true#32913
[/quote]

Matt
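
For anyone who can't open the link, here is a minimal sketch of an app_config.xml along those lines; the exact values Beyond advises are in the linked post, so the max_concurrent of 1 and the projects\www.gpugrid.net folder location are assumptions here. This limits the client to one acemdlong task at a time; after saving, re-read config files from the Manager or restart BOINC:

<app_config>
   <app>
      <name>acemdlong</name>
      <!-- assumption: run only one long task at a time, per Matt's suggestion -->
      <max_concurrent>1</max_concurrent>
      <gpu_versions>
         <gpu_usage>1.0</gpu_usage>
         <cpu_usage>1.0</cpu_usage>
      </gpu_versions>
   </app>
</app_config>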

Operator
Message 32927 - Posted: 13 Sep 2013, 2:09:00 UTC - in response to Message 32926.  


[quote]
And that's in addition to the "thread suspend"ing? Can you say how frequently that is happening (watch the output to the stderr file as the app is running)? Does the task state that the client reports (running, suspended, etc.) match, or is it always showing the task as running, even when you can see that it has suspended?

Could you try running just a single task, and see whether that behaves itself? Use the app_config setting Beyond advises, here:
http://www.gpugrid.net/forum_thread.php?id=3473&nowrap=true#32913

Matt
[/quote]

I've just changed the computing prefs from 60.0 minutes to 990.0.

Don't know why this Titan box would start hopping around now when it never has before, and the GTX 590 box has the exact same settings and doesn't do it with the 4 WUs it has waiting in the queue... but...

And as for the stderr:

Check this out; this is just a portion:

9/12/2013 8:09:37 PM | GPUGRID | Restarting task e7s7_e6s12f416-SDOERR_VillinAdaptN2-0-1-RND2117_0 using acemdlong version 814 (cuda55) in slot 1
9/12/2013 8:09:37 PM | GPUGRID | Restarting task e7s15_e4s11f480-SDOERR_VillinAdaptN4-0-1-RND1175_0 using acemdlong version 814 (cuda55) in slot 3
9/12/2013 8:14:11 PM | GPUGRID | Restarting task I93R10-NATHAN_KIDc22_glu-1-10-RND9319_0 using acemdlong version 814 (cuda55) in slot 0
9/12/2013 8:15:26 PM | GPUGRID | Restarting task e7s7_e6s12f416-SDOERR_VillinAdaptN2-0-1-RND2117_0 using acemdlong version 814 (cuda55) in slot 1
9/12/2013 8:17:38 PM | GPUGRID | Restarting task e7s15_e4s11f480-SDOERR_VillinAdaptN4-0-1-RND1175_0 using acemdlong version 814 (cuda55) in slot 3
9/12/2013 8:19:38 PM | GPUGRID | Restarting task I93R10-NATHAN_KIDc22_glu-1-10-RND9319_0 using acemdlong version 814 (cuda55) in slot 0
9/12/2013 8:22:10 PM | GPUGRID | Restarting task e7s7_e6s12f416-SDOERR_VillinAdaptN2-0-1-RND2117_0 using acemdlong version 814 (cuda55) in slot 1
9/12/2013 8:23:00 PM | GPUGRID | Restarting task e7s15_e4s11f480-SDOERR_VillinAdaptN4-0-1-RND1175_0 using acemdlong version 814 (cuda55) in slot 3
9/12/2013 8:27:13 PM | GPUGRID | Restarting task I93R10-NATHAN_KIDc22_glu-1-10-RND9319_0 using acemdlong version 814 (cuda55) in slot 0
9/12/2013 8:28:06 PM | GPUGRID | Restarting task e7s15_e4s11f480-SDOERR_VillinAdaptN4-0-1-RND1175_0 using acemdlong version 814 (cuda55) in slot 3
9/12/2013 8:30:32 PM | GPUGRID | Restarting task e7s7_e6s12f416-SDOERR_VillinAdaptN2-0-1-RND2117_0 using acemdlong version 814 (cuda55) in slot 1
9/12/2013 8:36:27 PM | GPUGRID | Restarting task I93R10-NATHAN_KIDc22_glu-1-10-RND9319_0 using acemdlong version 814 (cuda55) in slot 0
9/12/2013 8:39:18 PM | GPUGRID | Restarting task e7s7_e6s12f416-SDOERR_VillinAdaptN2-0-1-RND2117_0 using acemdlong version 814 (cuda55) in slot 1
9/12/2013 8:48:54 PM | GPUGRID | Restarting task e7s15_e4s11f480-SDOERR_VillinAdaptN4-0-1-RND1175_0 using acemdlong version 814 (cuda55) in slot 3
9/12/2013 8:51:12 PM | GPUGRID | Restarting task e7s7_e6s12f416-SDOERR_VillinAdaptN2-0-1-RND2117_0 using acemdlong version 814 (cuda55) in slot 1
9/12/2013 8:55:34 PM | GPUGRID | Restarting task I93R10-NATHAN_KIDc22_glu-1-10-RND9319_0 using acemdlong version 814 (cuda55) in slot 0
9/12/2013 8:56:47 PM | GPUGRID | Restarting task e7s7_e6s12f416-SDOERR_VillinAdaptN2-0-1-RND2117_0 using acemdlong version 814 (cuda55) in slot 1
9/12/2013 9:01:11 PM | GPUGRID | Restarting task e7s15_e4s11f480-SDOERR_VillinAdaptN4-0-1-RND1175_0 using acemdlong version 814 (cuda55) in slot 3

I will try running just one WU (suspending the others) and see what happens.

Operator

Beyond
Message 32930 - Posted: 13 Sep 2013, 12:49:35 UTC - in response to Message 32925.  

Operator, what setting do you have in place for, Switch Between Applications Every (time)?
- Boinc Manager (Advanced View), Tools, Computing Preferences, Other Options.

I use 990min, but I think the default is 60min (which means Boinc will run one app then another...).

I use a higher setting too, but this should only apply when running more than one project. If it's switching between WUs of the same project, it's a BOINC bug (heaven forbid) and should be reported.

5pot
Message 32931 - Posted: 13 Sep 2013, 13:30:03 UTC

My 780 completes them fine, but it has constant access violations.

Richard Haselgrove
Message 32932 - Posted: 13 Sep 2013, 13:56:07 UTC - in response to Message 32930.  

Operator, what setting do you have in place for, Switch Between Applications Every (time)?
- Boinc Manager (Advanced View), Tools, Computing Preferences, Other Options.

I use 990min, but I think the default is 60min (which means Boinc will run one app then another...).

I use a higher setting too but this should only apply when running more than 1 project. If it's switching between WUs of the same project it's a BOINC bug (heaven forbid) and should be reported.

That is one side-effect of the "BOINC temporary exit" procedure, which I think is what Matt Harvey is using for his new crash-recovery application.

When an internal problem occurs, the new v8.14 application exits completely, and as far as BOINC knows, the GPU is free and available to schedule another task. If you have another task - from this or any other project - ready to start, BOINC will start it.

On my system, with a (still) stupidly high DCF (duration correction factor), the sequence is:
Task 1 errors and exits
Task 2 starts on the vacant GPU
Task 1 becomes ready ('waiting to run') again.
BOINC notices a deadline miss looming
BOINC pre-empts Task 2, and restarts Task 1, marking it 'high priority'

That results in Task 2 showing 'Waiting to run', with a minute or two of runtime completed, before Task 1 finally completes.

With more normal estimates and no EDF (earliest-deadline-first mode), Task 1 and Task 2 would swap places every time a fault and temporary exit occurred. Even if you run a minimal cache and don't normally fetch the next task until shortly before it is needed, you may find that a work fetch is triggered by the first error and temporary exit, and the swapping behaviour starts after that.

If the tasks swap places more than a few times each run, your GPU is marginal and you should investigate the cause.
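
If you want to see the client's reasoning for each of those pre-emptions and restarts in the Event Log, one option is to enable the scheduler debug flags in cc_config.xml in the BOINC data directory and then 'Read config files' from the Manager. This is only a sketch using standard BOINC client log flags, nothing project-specific:

<cc_config>
   <log_flags>
      <!-- explain CPU/GPU scheduler decisions (why a task is preempted or started) -->
      <cpu_sched_debug>1</cpu_sched_debug>
      <!-- log task start/suspend/resume events -->
      <task_debug>1</task_debug>
      <!-- log coprocessor (GPU) assignment -->
      <coproc_debug>1</coproc_debug>
   </log_flags>
</cc_config>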

Operator
Message 32933 - Posted: 13 Sep 2013, 15:48:12 UTC - in response to Message 32932.  



When an internal problem occurs, the new v8.14 application exits completely, and as far as BOINC knows, the GPU is free and available to schedule another task. If you have another task - from this or any other project - ready to start, BOINC will start it.

On my system, with (still) a stupidly high DCF, the sequence is:
Task 1 errors and exits
Task 2 starts on the vacant GPU
Task 1 becomes ready ('waiting to run') again.
BOINC notices a deadline miss looming
BOINC pre-empts Task 2, and restarts Task 1, marking it 'high priority'

That results in Task 2 showing 'Waiting to run', with a minute or two of runtime completed, before Task 1 finally completes.

With more normal estimates and no EDF, Task 1 and Task 2 would swap places every time a fault and temporary exit occurred. Even if you run minimal cache and don't normally fetch the next task until shortly before it is needed, you may find that a work fetch is triggered by the first error and temporary exit, and the swapping behaviour starts after that.

If the tasks swap places more than a few times each run, your GPU is marginal and you should investigate the cause.



I'm not sure what your last statement, "your GPU is marginal", means.

I have reviewed some completed WUs for another Titan host to see if we were having similar issues.
http://www.gpugrid.net/results.php?hostid=156948

And this host (Anonymous) is posting completion times I used to get while running the 8.03 long app.

The major thing that jumped out at me was that this box is running driver 326.41.

Still showing lots of "Access violations" though.

I believe I'm running 326.84.

I just can't figure out why this started seemingly out of the blue. I don't remember making any changes to the system, and this box does nothing but GPUGrid.

Operator

Richard Haselgrove
Message 32934 - Posted: 13 Sep 2013, 16:29:17 UTC - in response to Message 32933.  

I'm not sure what your last statement means "your GPU is marginal".

For me, a marginal GPU is one which throws a lot of errors, and a GPU which throws a lot of errors is marginal, even if it recovers from them.

By "marginal" I mean faulty, badly installed, badly maintained, or just in need of some TLC.

Things like overclocking, overheating, under-ventilation, under-powering (small PSU), bad seating (in the PCIe slot) - anything which makes it unhappy.

MJH
Message 32935 - Posted: 13 Sep 2013, 17:04:03 UTC - in response to Message 32934.  

Error -97s are a strong indication that the GPU is misbehaving. When I get around to it, I'll sort out a memory testing program for you all to test for this. The access violations are indicative of GPU hardware problems. There are a few hosts that are disproportionately affected by these (Operator, 5pot) and it's not clear why. I suspect there's a relationship with some third-party software, but don't yet have a handle on it.

Matt

TJ
Message 32936 - Posted: 13 Sep 2013, 17:15:39 UTC - in response to Message 32935.  

Error -97s are a strong indication that the GPU is misbehaving. When I get around to it, I'll sort out a memory testing program for you all to test for this. The access violations are indicative of GPU hardware problems. There are a few hosts that are disproportionately affected by these (Operator, 5pot) and it's not clear why. I suspect there's a relationship with some third-party software, but don't yet have a handle on it.

Matt

Well, my GTX 660 has a lot of -97 errors, especially with the CRASH (Santi) tests. LRs and the Noelias are handled error-free most of the time on it. Einstein, Albert and Milkyway run almost error-free on it too. So if it were something in the PC or other software (I don't have much installed), wouldn't it also result in 50% errors with the other three GPU projects? Or not?
Greetings from TJ

Retvari Zoltan
Message 32943 - Posted: 13 Sep 2013, 23:14:45 UTC - in response to Message 32936.  
Last modified: 13 Sep 2013, 23:16:55 UTC

Well my GTX660 has a lot of -97 errors, especially with the CRASH (Santi) tests. LR's and the Noelia's are handled most of the time error free on it. Einstein, Albert and Milkyway run almost error free on it too. So if it is something of the PC or other software (I don't have many installed), would also result in 50% error with the other 3 GPU projects. Or not?

No. Different projects have different apps, which use different parts of the GPU (causing different GPU usage, and/or power draw).
I have two Asus GTX 670 DC2OG 2GD5 (factory overclocked) cards in one of my hosts, and they were unreliable with some batches of workunits. I upgraded the MB/CPU/RAM in this host, but its reliability stayed low with those batches. Even the WLAN connection was lost when the GPU-related failures happened. Then I found a voltage-tweaking utility for Kepler-based cards, and with its help I raised the GPU's boost voltage by 25 mV and its power limit by 25 W. Since then this host hasn't had any GPU-related errors. Maybe you should give it a try too. It's quite a simple tool; if you put nvflash beside its files, it can flash the modified BIOS directly.

5pot
Message 32944 - Posted: 13 Sep 2013, 23:34:38 UTC

The *only* other thing running besides GPUGrid was WCG. And I mean no anti-virus, nothing else at all. Still got the access violations. So I suspended the WCG app. The access violations still occurred.

Although my GPU appears to be functioning correctly (just in general), I bumped the voltage to 1.175 V from 1.163 V. I will report back on whether this has any impact on the access violations.

TJ
Message 32945 - Posted: 13 Sep 2013, 23:39:43 UTC - in response to Message 32943.  

Thanks Zoltan, I will install them and try tomorrow (later today after sleep).
Greetings from TJ

Operator
Message 32954 - Posted: 14 Sep 2013, 17:37:41 UTC - in response to Message 32935.  
Last modified: 14 Sep 2013, 17:44:58 UTC

Error -97s are a strong indication that the GPU is misbehaving. When I get around to it, I'll sort out a memory testing program for you all to test for this. The access violations are indicative of GPU hardware problems. There are a few hosts that are disproportionately affected by these (Operator, 5pot) and it's not clear why. I suspect there's a relationship with some third-party software, but don't yet have a handle on it.

Matt


Yesterday evening, in response to Mr. Haselgrove's post, I decided to do a thorough review of my Titan system.

I removed the 326.84 drivers and did a clean install of the 326.41 drivers.

Removed the Precision X utility completely.

Removed, inspected and reinstalled both Titan GPUs. No dust or other foreign contaminants were found inside the case or GPU enclosures.

All cabling checks out.

Confirmed all BIOS settings were as originally set and correct.

No other programs are starting with Windows except Teamviewer (which is on all my systems and was long before these problems started).

I reinstalled BOINC from scratch with no settings saved from the old installation.

I set up GPUGrid all over again (new machine number now).

No app_config or any other XML tweak files are present.

Both GPUs are running with factory settings (EVGA GTX Titan SC).

I downloaded two long 8.14 WUs and they started. Then the second two downloaded and went from "Ready to Start" to "Waiting to Run" as the first two paused and the system started processing the newly arrived ones. And then, after a few minutes, it went back to the first two that were downloaded.

Temps the whole time were nominal - that is to say, in the 70s C.

I monitored Task Manager and would occasionally see one of the 8.14 apps drop off, leaving only one running, and then after a minute or so there would be two of them running again.

The results show multiple access violations:

http://www.gpugrid.net/result.php?resultid=7277767

http://www.gpugrid.net/result.php?resultid=7267751

It is also taking longer to complete these WUs using 8.14 (than with the 8.03 app).

Again, this started for me with the 8.14 app. None of this happened on the 8.03 app; even with a bit of overclocking it was dead stable.

After what I went through last evening, I am confident that this issue is not with my system.

I have now set this box to do betas only.

- Update: the betas (8.14) are doing the same swapping as the normal long WUs were doing. I have to manually suspend two of them to keep this from happening.

On the other hand my dual GTX 590 system is happy as a clam.

Two steps forward, one step back.

Operator

5pot
Message 32955 - Posted: 14 Sep 2013, 17:54:40 UTC

Yours appear to be happening much more frequently than mine. Even after I increased the voltage, I still get access violations.

To reiterate, this is with absolutely no third-party software running at all. In all honesty, I can only assume that something in the app is causing this behavior.

Why it's affecting Operator's Titans more than my 780s, I'm not sure.

Operator
Message 32957 - Posted: 14 Sep 2013, 18:59:54 UTC - in response to Message 32954.  
Last modified: 14 Sep 2013, 19:02:07 UTC


I have now set this box to do betas only.

- Updated the betas (8.14) are doing the same swapping as the normal long WUs were doing. I have to manually suspend two of them to keep this from happening.



And now...

9/14/2013 12:44:56 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 12:48:12 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 12:51:13 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 12:54:26 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 12:57:43 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:00:45 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:03:48 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:07:02 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:10:19 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:13:22 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:16:24 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:19:27 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:22:29 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:25:32 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:28:34 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:31:38 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:34:40 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:37:33 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:40:50 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:43:52 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:46:56 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:49:59 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:53:02 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:56:04 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0
9/14/2013 1:58:56 PM | GPUGRID | Restarting task 152-MJHARVEY_CRASH2-24-25-RND4251_0 using acemdbeta version 814 (cuda55) in slot 0

So even with the two waiting WUs manually suspended, the ones that are supposed to be running keep starting and stopping.

I really have no idea what could be causing this.

Operator

TJ
Message 32961 - Posted: 14 Sep 2013, 23:56:07 UTC - in response to Message 32957.  

Hello Operator,

Perhaps a ridiculous idea and perhaps you have tried it already: what happens when you only use one Titan in that PC?

Have you set the minimum and maximum work buffer as low as possible, so that you don't get new WUs too fast?

It seems to be happening with the Titan and the 780, the two top cards. The 326.41 drivers are not the latest ones on NVIDIA's site.
Greetings from TJ
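
For reference, the work buffer TJ mentions maps to two tags in global_prefs_override.xml. A minimal sketch only - the 0.01-day values are just an example of "as low as possible", and the tag names should be checked against an existing prefs file before use:

<global_preferences>
   <!-- example values: keep the cache close to empty so spare WUs don't pile up -->
   <work_buf_min_days>0.01</work_buf_min_days>
   <work_buf_additional_days>0.01</work_buf_additional_days>
</global_preferences>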

Operator
Message 32963 - Posted: 15 Sep 2013, 3:18:04 UTC - in response to Message 32961.  

Hello Operator,

Perhaps a ridiculous idea and perhaps you have tried it already: what happens when you only use one Titan in that PC?

Have you set the minimum and maximum work buffer as low as possible, so that you don't get to fast new WU's?

It seems to be happening with Titan and 780 the two top cards. The 326.41 drivers are not the latest one at nVidia's site.


TJ

I thought of taking one card out to see what happens and will probably try that tomorrow.

I observed today, running the beta apps (8.14), that there were times when both WUs were "Waiting to Run" with "Scheduler: Access Violation". Meaning that with only two WUs, one for each GPU, nothing was getting done because the app(s) had temporarily shut down.

So I would expect that with just one card the same thing would happen: just one WU, and sometimes it would stop and then restart.

I will try that tomorrow anyway.

Operator


ExtraTerrestrial Apes (Volunteer moderator, volunteer tester)
Message 32970 - Posted: 15 Sep 2013, 12:15:55 UTC - in response to Message 32963.  

Operator: this sounds like your card(s?) may be experiencing computation errors which were not detected by the 8.03 app. I don't know for sure, but it could well be that the new recovery mode also added enhanced fault detection. What I'd try:

- use 1 Titan (as TJ said) to rule out an insufficient PSU
- downclock the cards a fair bit (~50 MHz should do). I know they're at factory clocks now, but it wouldn't be the first time that a manufacturer set factory overclocks too high
- try a card in a different PC

MrS
Scanning for our furry friends since Jan 2002

TJ
Message 32977 - Posted: 15 Sep 2013, 20:54:00 UTC

This morning I saw that my GTX 660 was downclocked to 50% again (it would be nice, Matt, if we could see that in the stderr output file), so I rebooted the system.
Then in the evening I saw that one WU was 'Waiting to run' while the other one was running, with only 2% left on the WU that was waiting. Same as Operator had on his Titan. However, I did not intervene, and the WU finished with a good result, but I saw this in the output
(only a part of it, of course):
# GPU 0 : 64C
SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1963.
# SWAN swan_assert 0
# GPU [GeForce GTX 660] Platform [Windows] Rev [3203] VERSION [42]
# SWAN Device 0	:
#	Name		: GeForce GTX 660


I have never seen this before. Perhaps "strange" things happened in the past as well, but now we can look for it.
Greetings from TJ