Message boards : Number crunching : strange behaviour...

capeITLabs
Joined: 17 Nov 12
Posts: 30
Credit: 111,887,025
RAC: 0
Message 33072 - Posted: 18 Sep 2013 | 19:36:05 UTC

Hi there,

one of my BOINC machines is a Win7 Pro 64-bit box with an ASUS GTX570 card. The NVidia driver is the latest, 320.49. This machine shows strange behaviour: each of the WUs (http://www.gpugrid.net/results.php?hostid=158339) starts without any failure and seems to run for hours, but nothing happens... no CPU usage, no GPU usage, no progress...

What's wrong here ?

best regards,
Rene

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 33076 - Posted: 18 Sep 2013 | 20:10:39 UTC - in response to Message 33072.

Did you reboot the machine? Power off, remove the power cord, wait 10+ mins and power back on? Have you tried reinstalling the driver, maybe going straight to the new 326.80? Is BOINC actually saying "running" in the manager?

MrS
____________
Scanning for our furry friends since Jan 2002

capeITLabs
Joined: 17 Nov 12
Posts: 30
Credit: 111,887,025
RAC: 0
Message 33080 - Posted: 18 Sep 2013 | 20:38:37 UTC - in response to Message 33076.

Hi,

yes I did. The BOINC manager says that it is running and the messages file shows it too. I've got another machine for GPUGRID with the same OS and drivers, but with a GTX480 and a GTX560Ti. This machine doesn't show any unusual behaviour.

Hmmm...the 326.80 isn't stable but beta. Since this is not a boinc-only machine, I'd prefer to stay with the stable drivers.

Rene

Profile Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 33081 - Posted: 18 Sep 2013 | 21:12:52 UTC - in response to Message 33080.

Hmmm...the 326.80 isn't stable but beta. Since this is not a boinc-only machine, I'd prefer to stay with the stable drivers.

Hi Rene, just for info I have 8 machines running here on 326.80 with no noticeable problems. In fact they all have both NVidia and AMD GPUs installed.

capeITLabs
Joined: 17 Nov 12
Posts: 30
Credit: 111,887,025
RAC: 0
Message 33085 - Posted: 19 Sep 2013 | 5:28:25 UTC - in response to Message 33081.

Hi,

thanks for the info. Maybe I should give it a try...

capeITLabs
Joined: 17 Nov 12
Posts: 30
Credit: 111,887,025
RAC: 0
Message 33086 - Posted: 19 Sep 2013 | 5:50:20 UTC - in response to Message 33085.

No, it doesn't work even with the new drivers. The application still does nothing... I cancelled both WUs.

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 33089 - Posted: 19 Sep 2013 | 8:40:51 UTC - in response to Message 33086.
Last modified: 19 Sep 2013 | 8:42:23 UTC

pnitrox122-NOELIA_INS1P-1-12-RND5810_0
2Mgx191-NOELIA_INS1P-6-12-RND2605_0
I99R1-NATHAN_KIDc22_glu-3-10-RND8774_1

Yesterday I reported similar behavior while running a NOELIA_INS1P WU (even on Linux),
http://www.gpugrid.net/forum_thread.php?id=3466&nowrap=true#33057
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Paul
Joined: 25 Apr 13
Posts: 26
Credit: 219,745,553
RAC: 238,476
Message 33091 - Posted: 19 Sep 2013 | 9:10:39 UTC

This run keeps increasing its remaining time with no end in sight.

Should I abort it?

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 33094 - Posted: 19 Sep 2013 | 14:34:19 UTC - in response to Message 33089.

Was that on the GTX 650 Ti BOOST? I think you also have a GTX 660 as I recall. I want to give mine a try again on the just-released 327.23 drivers, but the 660s seem to have been somewhat problematic recently.

capeITLabs
Joined: 17 Nov 12
Posts: 30
Credit: 111,887,025
RAC: 0
Message 33095 - Posted: 19 Sep 2013 | 16:37:30 UTC - in response to Message 33089.

@skgiven

the WU that is running (more or less) at the moment is a SANTI_RAP74. This one also does nothing... :(

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 33097 - Posted: 19 Sep 2013 | 18:52:56 UTC

The WU which did run for some hours has lots of "# BOINC suspending at user request (thread suspend)" lines in the log. If it's a new installation: did you already check "Nutze die GPU wenn der Computer benutzt wird" ("Use GPU while the computer is in use") in the local BOINC settings, CPU tab? And "Wenn CPU-Auslastung geringer als x%" ("While processor usage is less than x%") with x set to 0?

MrS
____________
Scanning for our furry friends since Jan 2002

capeITLabs
Joined: 17 Nov 12
Posts: 30
Credit: 111,887,025
RAC: 0
Message 33098 - Posted: 19 Sep 2013 | 19:19:45 UTC - in response to Message 33097.

Sure, see screenshot. Those message lines are more than interesting, but I can't explain what causes them.

capeITLabs
Joined: 17 Nov 12
Posts: 30
Credit: 111,887,025
RAC: 0
Message 33099 - Posted: 19 Sep 2013 | 19:21:36 UTC - in response to Message 33098.

The thing is that all other GPU tasks (SETI, Einstein, PrimeGrid, POEM) are running fine on this machine.

Jozef J
Joined: 7 Jun 12
Posts: 112
Credit: 1,140,895,172
RAC: 315,774
Message 33100 - Posted: 19 Sep 2013 | 20:00:18 UTC

I have had a similar problem for months. I have already tried all the tutorials on this forum and a clean reinstall of Win 8 64-bit, and I even looked into the problem on my hardware manufacturer's website.

The problem is in the communication between GPUGrid tasks and the NVIDIA drivers: CUDA and programming errors.
GPUGrid works normally for just two weeks, and then bad tasks arrive and all the work is **
I see a lot of people who do not have problems, but they probably use their computers only for GPUGrid, or they use Linux. But these problems discourage many people from crunching for GPUGRID.
For example, on Collatz Conjecture I had an average RAC of 650,000 for about a week or two, and the same on GPUGrid for a few months, but then the problems started that the forum is now full of.
Two days ago I did one job in about 8-9 hours. Two run at a time because I have two cards in SLI. Today the NVIDIA driver crashed, followed by a BSOD and forced restarts, which obviously wasted the tasks and the credit. The manager now shows one task as having run for about 9-10 hours.
Weeks ago, before the clean installation of Win 8 64-bit on my SSD, one task took 14-16 hours. At that time I had an older BIOS on the board. Then, voilà, one task took 8 hours, until this morning when the old problem came back: crashes of the NVIDIA driver, the Chrome browser, and others.
I have never installed beta NVIDIA drivers, only WHQL ones, because with the beta drivers it is worse.

Just going to install the NVIDIA 327.23 driver... :(

Jozef J
Joined: 7 Jun 12
Posts: 112
Credit: 1,140,895,172
RAC: 315,774
Message 33102 - Posted: 19 Sep 2013 | 20:25:33 UTC - in response to Message 33100.

When I installed the NVIDIA drivers, the driver crashed again at intervals of a few seconds, and pop-up notifications of the driver crash kept flashing... it's crazy. After the next reboot it works well for a while, but one task takes 9-10 hours, so something is really wrong again.
These are probably never-ending problems in this project :-)

Jozef J
Joined: 7 Jun 12
Posts: 112
Credit: 1,140,895,172
RAC: 315,774
Message 33103 - Posted: 19 Sep 2013 | 20:26:14 UTC - in response to Message 33099.

Just so..

capeITLabs
Joined: 17 Nov 12
Posts: 30
Credit: 111,887,025
RAC: 0
Message 33104 - Posted: 19 Sep 2013 | 20:40:06 UTC - in response to Message 33099.

Ok, did a couple of debug sessions and took a look into the app_control code.

It seems that the task gets suspended due to CPU throttling. I'll have a deeper look now to find out why this is happening.

Will keep you posted...

capeITLabs
Joined: 17 Nov 12
Posts: 30
Credit: 111,887,025
RAC: 0
Message 33105 - Posted: 19 Sep 2013 | 20:52:26 UTC - in response to Message 33104.

Ok, this was a quick solution ;) I think the CPU throttling in BOINC 7.0.64 is non-optimal. When you take a look at my screenshot of the BOINC options, you'll notice that I only allow BOINC to use 75% of my CPUs. I'm not running any CPU-only WUs on this machine, so there is always just 1 active WU, since I've only got one GPU.

After analyzing the debug output and the source code, I've just changed the option from 75% to 100%...BINGO!!! That worked :)

Now the WU is running fine.

But I think the CPU throttle handling in BOINC needs a bit of tweaking, since the GPUGrid task never ever used 75% of one CPU...

capeITLabs
Joined: 17 Nov 12
Posts: 30
Credit: 111,887,025
RAC: 0
Message 33106 - Posted: 19 Sep 2013 | 21:05:28 UTC - in response to Message 33100.

@Jozef

maybe you should lower the GPU and memory clock speeds a bit. If the GPUs are running nearly at 100% for a long period of time, the electronics might not be able to support the factory clock speeds any longer.

In the past I've had the same problems (see http://www.gpugrid.net/forum_thread.php?id=3421#31554). After I lowered the clocks a bit, everything has been running smoothly.

cheers
Rene

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1620
Credit: 8,873,762,138
RAC: 19,679,913
Message 33107 - Posted: 19 Sep 2013 | 21:42:29 UTC - in response to Message 33105.

Ok, this was a quick solution ;) I think the CPU throttling in BOINC 7.0.64 is non-optimal.

Correct. That was a brief (and fortunately now abandoned) aberration in BOINC. Later developmental versions (and BOINC v7.2 when it's released "real soon now") will go back to the old behaviour - CPU throttling not applied to GPU apps.

I've written up details of the exact versions affected on some project's message board - I'll try and work out which project it was, and copy them back here later.

Profile MJH
Project administrator
Project developer
Project scientist
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 33109 - Posted: 19 Sep 2013 | 22:14:28 UTC - in response to Message 33107.

Richard,

What's this CPU throttling thing? Do you know how it works? There's nothing germane in the library code so presumably it's all in the client.

Matt

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1620
Credit: 8,873,762,138
RAC: 19,679,913
Message 33111 - Posted: 19 Sep 2013 | 23:26:00 UTC - in response to Message 33109.

Richard,

What's this CPU throttling thing? Do you know how it works? There's nothing germane in the library code so presumably it's all in the client.

Matt

Yes, in the client. It's meant for thermal control of CPUs, and it dates back to the early days of BOINC. If you look at the Computing preferences on your account here, the bottom item under Processor usage is:

Use at most 100% of CPU time (Can be used to reduce CPU heat)

The implementation is crude: they wanted it to use the same source code on every platform, and there isn't a fine control like that. So it operates on a granularity of 1 second, so capeITLabs' 75% would have been 3 seconds on and 1 second off. That, of course, means three eternities on and one eternity off at the speeds GPUs operate.
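As a rough illustration of that granularity, here is a minimal sketch (not BOINC's actual code) of how a percentage setting turns into whole-second on/off periods. The function names are made up for this example, and it assumes the client always uses the shortest whole-second cycle that matches the percentage.

```python
from fractions import Fraction


def whole_second_cycle(cpu_time_pct):
    """Shortest on/off cycle, in whole seconds, for a 'use at most X% of CPU time' setting.

    Hypothetical helper for illustration only; expects an integer percentage from 1 to 100.
    """
    share = Fraction(cpu_time_pct, 100)                  # e.g. 75 -> 3/4
    seconds_on = share.numerator                         # 3 seconds running
    seconds_off = share.denominator - share.numerator    # 1 second suspended
    return seconds_on, seconds_off


def suspends_per_hour(cpu_time_pct):
    """How often per hour a task would be suspended under that cycle."""
    seconds_on, seconds_off = whole_second_cycle(cpu_time_pct)
    if seconds_off == 0:
        return 0                                         # 100%: never suspended
    return 3600 // (seconds_on + seconds_off)


print(whole_second_cycle(75))   # (3, 1)
print(suspends_per_hour(75))    # 900
```

Under these assumptions a 75% setting suspends the task hundreds of times per hour, which would fit the many "suspending at user request" lines seen in the task log.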

David Anderson made a gut reaction to a single user's request on the mailing list back in January: http://lists.ssl.berkeley.edu/pipermail/boinc_dev/2013-January/019305.html - I'm sure you can think of such a reason. That emerged in version 7.0.45

It was removed with v7.2.1. You might like to look at the note:

client: don't apply CPU throttling to apps that use < .5 CPUs (like GPU, NCI).

and http://boinc.berkeley.edu/trac/changeset/4cb34a123aacfaccc28b5f1f76717864b0b63a57/boinc-v2 with respect to the requested CPU reservation for Keplers and above (and make the same suggestion to any OpenCL developers you know).

Links to the earlier changesets are contained in my email at http://lists.ssl.berkeley.edu/pipermail/boinc_dev/2013-July/020131.html
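The rule in the v7.2.1 note quoted above can be sketched as a simple check. This is only an illustration of the stated rule, not the client's source: the function and parameter names are assumptions, and the only value taken from the note is the 0.5 CPU threshold.

```python
def is_throttled(cpus_used_by_app, cpu_time_limit_pct):
    """Return True if CPU throttling would apply to an app under the v7.2.1 rule.

    cpus_used_by_app   -- CPUs the app is budgeted to use (a GPU app typically uses a fraction)
    cpu_time_limit_pct -- the 'use at most X% of CPU time' preference
    Both names are hypothetical and chosen for this example.
    """
    if cpu_time_limit_pct >= 100:
        return False                 # no throttling requested at all
    # Apps using less than half a CPU (GPU apps, non-CPU-intensive apps) are exempt.
    return cpus_used_by_app >= 0.5


print(is_throttled(0.2, 75))   # GPU app with a small CPU share: False
print(is_throttled(1.0, 75))   # ordinary CPU app: True
```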

Any casual reader here who wishes to apply thermal control to their CPU or GPU under Windows (only) would be better advised to consider TThrottle

Profile MJH
Project administrator
Project developer
Project scientist
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 33112 - Posted: 19 Sep 2013 | 23:29:57 UTC - in response to Message 33111.

Thanks Richard,

I guess I'd better take a look and see exactly how this third suspend-resume mechanism works under the hood..

MJH

Paul
Joined: 25 Apr 13
Posts: 26
Credit: 219,745,553
RAC: 238,476
Message 33117 - Posted: 20 Sep 2013 | 20:40:06 UTC - in response to Message 33094.

660 ... have aborted run ... just installed latest driver.

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 33120 - Posted: 20 Sep 2013 | 23:16:27 UTC - in response to Message 33117.

660 ... have aborted run ... just installed latest driver.

I have updated the drivers on my two GTX 660s to 327.23 and completed my first Noelia with no problems (4-NOELIA_INS1P-9-15-RND4205_0 14:09:09). Each card is running another Noelia with no problems thus far, so I will let them run and see what happens.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1620
Credit: 8,873,762,138
RAC: 19,679,913
Message 33122 - Posted: 21 Sep 2013 | 7:21:32 UTC - in response to Message 33112.

Thanks Richard,

I guess I'd better take a look and see exactly how this third suspend-resume mechanism works under the hood..

MJH

I don't know whether this is coincidence, or whether you've been in communication behind the scenes, but David Anderson has just started work on a better throttling solution.

"client: preliminary implementation (commented out) of sub-second throttling"
http://boinc.berkeley.edu/trac/changeset/ebde7809ceaca8cc35d75c2a2b5adc32c19694e5/boinc-v2
http://boinc.berkeley.edu/trac/changeset/35f489d36f4c7734d13f76af5844ec42d244be59/boinc-v2

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 33124 - Posted: 21 Sep 2013 | 11:41:09 UTC

I'm against coarse-grained throttling for thermal control as it's inefficient for any hardware using adaptive power states (like boosting nVidias and Intels + AMDs with Turbo). The reason: during activity the hardware boosts into the maximum power state supported, which implies a high voltage and lower power efficiency, whereas during idle periods it obviously does nothing.

If the throttling were fine-grained, the hardware could adjust to the requested performance level, sustain a lower power state (lower voltage, higher power efficiency) and achieve the same throughput. Starting and stopping this often is inefficient from a software perspective, though.
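A toy calculation makes the efficiency argument concrete. It is a sketch under loud assumptions: it uses the textbook approximation that dynamic power scales with capacitance × voltage² × frequency, it ignores leakage and idle power, and the voltages and clock below are invented numbers, not measurements from any card.

```python
def dynamic_power(capacitance, voltage, frequency):
    """Textbook approximation: dynamic power ~ C * V^2 * f (arbitrary units)."""
    return capacitance * voltage ** 2 * frequency


def average_power_duty_cycled(capacitance, v_max, f_max, duty):
    """Run at full clock and full voltage for a fraction `duty` of the time, idle otherwise."""
    return duty * dynamic_power(capacitance, v_max, f_max)


def average_power_steady(capacitance, v_low, f_max, duty):
    """Run continuously at a reduced clock (duty * f_max) and a reduced voltage."""
    return dynamic_power(capacitance, v_low, duty * f_max)


# Invented example values: same throughput (75% of full clock on average) either way.
cycled = average_power_duty_cycled(1.0, v_max=1.2, f_max=1000.0, duty=0.75)
steady = average_power_steady(1.0, v_low=1.0, f_max=1000.0, duty=0.75)
print(steady < cycled)   # True: the steady, lower-voltage state draws less in this model
```

In this simplified model the ratio of the two averages is just (v_low / v_max)², so any voltage reduction the hardware can sustain at the lower clock shows up directly as saved power.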

At least for GPU-Grid there's a far better solution: simply lower the GPU's power target and leave it at 100% time. It will take care of adjusting clocks and voltages down by itself. The downside of this: it requires the user to use tuning software, since this is not even available in nVidia's control panel under Win (just checked mine). Let alone Linux or Mac OS, which generally don't have working hardware control software.

Adjusting the power target down for CPU is also not as easy as it should be.. given Intel's mobile chips already support cTDP in principle. And with AMD GPUs boosting is not yet as widespread, efficient and controllable as for the green team :/

MrS
____________
Scanning for our furry friends since Jan 2002

Paul
Joined: 25 Apr 13
Posts: 26
Credit: 219,745,553
RAC: 238,476
Message 33163 - Posted: 23 Sep 2013 | 9:26:42 UTC - in response to Message 33117.

I91R9-NATHAN_KIDc22_glu-7-10-RND1126_1

Has been running for over 49 hours ... elapsed time increases, remaining time barely moves, but increases.

John C MacAlister
Joined: 17 Feb 13
Posts: 181
Credit: 144,871,276
RAC: 0
Message 33165 - Posted: 23 Sep 2013 | 9:54:28 UTC - in response to Message 33163.
Last modified: 23 Sep 2013 | 10:47:50 UTC

Hi, GPUGrid Folks:

Short run task has been grinding away for over 14 h......


251-NOELIA_CRYST1-9-12-RND5111_0 (60% complete)

:(

John

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 33166 - Posted: 23 Sep 2013 | 11:21:25 UTC - in response to Message 33165.
Last modified: 23 Sep 2013 | 11:22:51 UTC

Paul and John,

If the GPU is cooler than normal, I suggest shutting the system down, turning the PSU off for a minute and then turning it back on and starting the system up again - doing this allowed me to finish a WU that had run for 5 days (but had really stopped after about 6h). Keep an eye on your runtime before and after you restart.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 33167 - Posted: 23 Sep 2013 | 12:35:59 UTC - in response to Message 33166.

If the GPU is cooler than normal, I suggest shutting the system down, turning the PSU off for a minute and then turning it back on and starting the system up again - doing this allowed me to finish a WU that had run for 5days (but had really stopped after about 6h).

That fixed it for me with I18R10-NATHAN_KIDc22_glu-8-10-RND4986_1, which was taking 30 hours to complete on a GTX 660 (327.23 drivers). It had previously completed three others in the NATHAN_KIDc22 series with no problems in about 12 hours.

That is unfortunately not a practical solution for me, since I lost 10 hours of CEP2 work running on the CPU. It seems to be more of a problem with the mid-range cards (GTX 660, 660 Ti). Are the 700 series cards immune?

John C MacAlister
Joined: 17 Feb 13
Posts: 181
Credit: 144,871,276
RAC: 0
Message 33168 - Posted: 23 Sep 2013 | 12:57:43 UTC - in response to Message 33166.
Last modified: 23 Sep 2013 | 12:58:18 UTC

Many thanks, skgiven. Problem fixed. I had hoped to run these tasks in a 'set and forget' mode, but that may not be possible. Being unable to sleep last night, I took a peek at my machine at around 05:00h to see if all is well and that's when I discovered the long run.

I will try again and if the problem recurs I will make the suggested fix.

Thanks again,

John

Operator
Joined: 15 May 11
Posts: 108
Credit: 297,176,099
RAC: 0
Message 33171 - Posted: 23 Sep 2013 | 16:04:33 UTC - in response to Message 33167.

It seems to be more of a problem with the mid-range cards (GTX 660, 660 Ti). Are the 700 series cards immune?


Depends on whether the same thing that is causing your system to just stop processing is the same thing that causes 780s/Titans to have constant "Access violations" and app restarts.

Could be the same thing causing different symptoms using different GPUs.

I think it's all down to 8.14 myself.

Operator.
____________

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 33172 - Posted: 23 Sep 2013 | 18:17:05 UTC - in response to Message 33171.

Depends on whether the same thing that is causing your system to just stop processing is the same thing that causes 780s/Titans to have constant "Access violations" and app restarts.

The Memory Controller Load apparently runs at a constant 14% rate when it is running slowly, so I doubt that it is the start/stop condition. (It should run about 30% normally on these work units.) I know they had a similar problem with the older apps (before the 8 series), particularly with the GTX 660s, and thought it might have been solved. Otherwise, the 8.14 app works very nicely that I can see, except for one Noelia that errored out, but no crashes or other bad behavior. I hope they can get the last wrinkles ironed out for the mid-range cards, and also for the 700 cards or else there is not much incentive to upgrade to those.

Paul
Joined: 25 Apr 13
Posts: 26
Credit: 219,745,553
RAC: 238,476
Message 33173 - Posted: 23 Sep 2013 | 19:45:52 UTC - in response to Message 33166.

Shut the machine down while I went to work, 12 hrs later turned it back on.

The elapsed time still increases, while the remaining time stays stagnant?

Paul
Joined: 25 Apr 13
Posts: 26
Credit: 219,745,553
RAC: 238,476
Message 33174 - Posted: 23 Sep 2013 | 20:06:18 UTC - in response to Message 33173.

This is the second run I have aborted. My GPUGRID credits are decreasing because I am running programs that don't work and I have to abort.

Paul
Joined: 25 Apr 13
Posts: 26
Credit: 219,745,553
RAC: 238,476
Message 33175 - Posted: 23 Sep 2013 | 20:11:55 UTC - in response to Message 33174.

All this started happening just recently ...

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Message 33176 - Posted: 23 Sep 2013 | 21:26:38 UTC - in response to Message 33172.

Depends on whether the same thing that is causing your system to just stop processing is the same thing that causes 780s/Titans to have constant "Access violations" and app restarts.

The Memory Controller Load apparently runs at a constant 14% rate when it is running slowly, so I doubt that it is the start/stop condition. (It should run about 30% normally on these work units.) I know they had a similar problem with the older apps (before the 8 series), particularly with the GTX 660s, and thought it might have been solved. Otherwise, the 8.14 app works very nicely that I can see, except for one Noelia that errored out, but no crashes or other bad behavior. I hope they can get the last wrinkles ironed out for the mid-range cards, and also for the 700 cards or else there is not much incentive to upgrade to those.

Have you checked whether the GPU clock still runs at full speed (the speed you want or have set it to)? I have had a lot of trouble with my 660s, and even bought a new motherboard. They do fine now with the betas and long runs and 8.14. Short runs (still) give the most problems.
My 770 from Asus is almost error free with all types of WU, and moreover most WUs don't even stop en route, they run in one go. We can now see that with the new stderr Matt has made. So in my new builds only 770, 780 and Titan.

____________
Greetings from TJ

Paul
Joined: 25 Apr 13
Posts: 26
Credit: 219,745,553
RAC: 238,476
Message 33178 - Posted: 23 Sep 2013 | 22:51:42 UTC - in response to Message 33175.

If nothing will fix this, I will delete GPUGRID and run another BOINC program.

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 33179 - Posted: 24 Sep 2013 | 0:59:01 UTC - in response to Message 33176.
Last modified: 24 Sep 2013 | 1:04:20 UTC

Have you checked whether the GPU clock still runs at full speed (the speed you want or have set it to)? I have had a lot of trouble with my 660s, and even bought a new motherboard. They do fine now with the betas and long runs and 8.14. Short runs (still) give the most problems.

Yes, the GPU clock shows as running at full speed in GPU-Z. It is normally 993 MHz as set by the card, but I had reduced it to 980 MHz (hardly a difference) and also bumped up the core voltage slightly (by 12.5 mV) with Nvidia Inspector. But there was no obvious down-clocking, as was a problem for some Nvidia cards a few years ago. But maybe not all the relevant clocks are shown by GPU-Z? It is nothing I can fix at any rate, and I have seen no reports of such problems for these current drivers. It is on a Z77 motherboard with an Ivy Bridge i7-3770, with each GPU supported by a virtual CPU core, so that should not be a limitation. And the fact that a reboot fixes it would indicate that it is a software, not a hardware problem (to me at any rate).

There was some speculation earlier on various reasons that some cards were affected and others weren't, such as cache size, memory bandwidth, etc., but I don't think any definitive answer has been found. It is apparently something only GPUGrid can fix.

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 33180 - Posted: 24 Sep 2013 | 9:11:15 UTC - in response to Message 33179.
Last modified: 24 Sep 2013 | 9:13:21 UTC

Jim, I think that is a fair assessment.
This issue is most likely caused by the WU running on the app in a slightly different way than the other WU's. It's been around for some time, but difficult to spot due to other errors (especially in the summer months). I have mainly just been running the Beta WU's (for half the normal Long WU credits), and recently only came across this issue the once on a Linux system.

If it's just being caused by Noelia WU's then we need a mechanism to allow crunchers to select to not run these WU's; either put the Noelia's in the Beta queue or create another queue. It doesn't appear to be affecting some systems, and the Noelia WU's pay the best, so others will want to run these WU's.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,222,865,968
RAC: 1,764,666
Message 33182 - Posted: 24 Sep 2013 | 13:41:03 UTC - in response to Message 33178.

If nothing will fix this, I will delete GPUGRID and run another BOINC program.

Paul,
Looking at your tasks' stderr.out file, they are full of:
# BOINC suspending at user request (thread suspend)
# BOINC resuming at user request (thread suspend)

which means that your BOINC manager keeps on suspending and resuming the GPUGrid application. This could be the result of improper settings of the BOINC manager and/or Windows.
For example:
1. The CPU project you are running uses too much CPU time
resolution: limit the CPU usage of those projects so that each GPU task gets a CPU core of its own. Set "On multiprocessor systems, use at most" to 50% of the processor cores (you have a dual-core CPU, so 1 CPU core is 50% on your system) in Boinc Manager (Advanced View) / Tools / Computing preferences / processor usage tab
2. The BOINC manager "throttles" its applications
This setting is used to limit the heat generated by the CPU, but it throttles the GPU applications also by mistake.
resolution: go to Boinc Manager (Advanced View) / Tools / Computing preferences / processor usage tab / use at most 100% cpu time
3. The BOINC manager is not using the GPU while you are using your computer
resolution: go to Boinc Manager (Advanced View) / Tools / Computing preferences / processor usage tab / check the "While the computer is in use" and the "Use GPU while the computer is in use" checkboxes
If you play games that need the GPU, you should put them on the list in the "Exclusive Applications" tab
4. Windows power management limits the time while your computer is "awake"
resolution: go to Start / Control Panel (large icon view) / Power options / change current scheme settings / Put the computer to sleep: Never
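A quick way to confirm this diagnosis is to count the suspend/resume lines in a task's stderr output, and to work out the core percentage from point 1. The sketch below is only an illustration: the two message strings are the ones quoted above, while the function names and the idea of pasting the stderr text into a string are assumptions made for this example.

```python
SUSPEND_LINE = "# BOINC suspending at user request (thread suspend)"
RESUME_LINE = "# BOINC resuming at user request (thread suspend)"


def count_suspend_resume(stderr_text):
    """Count suspend and resume messages in the stderr text of one task."""
    suspends = 0
    resumes = 0
    for raw_line in stderr_text.splitlines():
        line = raw_line.strip()
        if line == SUSPEND_LINE:
            suspends += 1
        elif line == RESUME_LINE:
            resumes += 1
    return suspends, resumes


def core_percentage_for_cpu_tasks(total_cores, gpu_tasks):
    """Percentage of cores left for CPU projects when each GPU task keeps one core free."""
    free_cores = max(total_cores - gpu_tasks, 0)
    return 100 * free_cores // total_cores


sample = "\n".join([SUSPEND_LINE, RESUME_LINE, SUSPEND_LINE, RESUME_LINE])
print(count_suspend_resume(sample))           # (2, 2)
print(core_percentage_for_cpu_tasks(2, 1))    # 50, the dual-core case from point 1
```

A task that was never throttled or paused should show few or none of these lines, so a large count points at one of the settings listed above rather than at the work unit itself.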

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Message 33187 - Posted: 24 Sep 2013 | 22:12:56 UTC - in response to Message 33180.

On my 660 the GPU clock was down-clocked two times today. The WUs were beta and Santi. My 660 has the most problems with Santi and almost none with Noelias.
I regret that I bought two 660s during the summer, as my 770 is running fine with all WUs and very few errors. The card was more expensive but is absolutely no frustration to run, while the 660 is frustrating me several times a day. They will be replaced by 7xx cards before the year is over.
____________
Greetings from TJ

Paul
Joined: 25 Apr 13
Posts: 26
Credit: 219,745,553
RAC: 238,476
Message 33191 - Posted: 24 Sep 2013 | 22:36:04 UTC

I have deleted gpugrid from my computer.

I suspended all runs except for the gpu

It runs, but the remaining time keeps increasing. Never more, never more.

Paul
Joined: 25 Apr 13
Posts: 26
Credit: 219,745,553
RAC: 238,476
Message 33213 - Posted: 26 Sep 2013 | 9:19:07 UTC - in response to Message 33182.

Will try 50% processor and 100% CPU.

Will let you know how this works.

Paul
Joined: 25 Apr 13
Posts: 26
Credit: 219,745,553
RAC: 238,476
Message 33226 - Posted: 27 Sep 2013 | 19:18:22 UTC - in response to Message 33182.

Had my first GPU completion in quite a while after the CPU & processor usage change ... Thanks !

Profile dskagcommunity
Joined: 28 Apr 11
Posts: 456
Credit: 817,865,789
RAC: 0
Level
Glu
Message 33638 - Posted: 27 Oct 2013 | 11:47:45 UTC
Last modified: 27 Oct 2013 | 11:48:44 UTC

So what's the official solution for this? The problem still seems to occur on the short queue here on a 560 Ti 448-core edition. Luckily I can switch this card to long runs only, but I'd like to enable both queues again later so I don't lose any publication badges ;) Restarting the machine after every short unit is no solution on an unattended machine ;)
____________
DSKAG Austria Research Team: http://www.research.dskag.at



Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Message 33639 - Posted: 27 Oct 2013 | 12:02:18 UTC - in response to Message 33638.

What drivers are you using? It looks like your computer is using 301.42 from May 2012.

I'd recommend trying the most recent 331.58 drivers, and if those still don't work, possibly try 314.22 (which predates the 320 branch, where some things got broken).

Believe it or not, new drivers can actually fix things!

dskagcommunity
Joined: 28 Apr 11
Posts: 456
Credit: 817,865,789
RAC: 0
Level
Glu
Message 33640 - Posted: 27 Oct 2013 | 12:47:23 UTC

Hmm, OK. I was only wondering because TJ has the same problem and updated the drivers, but it still fails.
____________
DSKAG Austria Research Team: http://www.research.dskag.at



TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Level
Met
Message 33657 - Posted: 28 Oct 2013 | 23:15:33 UTC - in response to Message 33640.

Hmm, OK. I was only wondering because TJ has the same problem and updated the drivers, but it still fails.

Hello dskagcommunity,
I need to tell you that I haven't run any short WUs in the last 30 days; I don't dare to.
Later this week, when I am at the computer with the 660, I will run a few short ones and let you know how they do.
____________
Greetings from TJ

dskagcommunity
Joined: 28 Apr 11
Posts: 456
Credit: 817,865,789
RAC: 0
Level
Glu
Message 33667 - Posted: 29 Oct 2013 | 19:23:43 UTC

Oh, I read it like that because you answered me O.o OK then, I will upgrade the driver ^^
____________
DSKAG Austria Research Team: http://www.research.dskag.at



TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Level
Met
Message 33681 - Posted: 30 Oct 2013 | 16:49:38 UTC - in response to Message 33667.

Hello dskagcommunity, as promised, I'm letting you know about the SRs.
Well, the first one did finish okay, but with this "result" in the stderr:
# GPU 0 : 65C
SWAN : FATAL : Cuda driver error 716 in file 'swanlibnv2.cpp' in line 1963.
# SWAN swan_assert 0
# GPU [GeForce GTX 660] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 0 :
# Name : GeForce GTX 660
That means the GPU clock has dropped to half, so processing is very slow now.
The PC had run for ten and a half days with LRs and no such "error".
To me the SRs are a pain on the 660. However, 331.40 is no longer the latest beta driver.
I do hope your SRs do well. If you have the time and are willing, you could try the latest drivers. Good luck.
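
For anyone who wants to check their own task output for the same message, here is a minimal Python sketch. It assumes only the line layout shown in the stderr excerpt above; `find_swan_fatals` and `sample` are names made up for illustration and are not part of BOINC or the GPUGRID application. It does not interpret the error code, it just pulls the fields out.

```python
import re

# Pattern for the fatal line exactly as it appears in the excerpt above.
# Assumption: other SWAN fatal lines follow the same layout.
FATAL_RE = re.compile(
    r"SWAN : FATAL : Cuda driver error (\d+) in file '([^']+)' in line (\d+)"
)


def find_swan_fatals(stderr_text):
    """Return a list of (error_code, source_file, line_no) tuples."""
    hits = []
    for line in stderr_text.splitlines():
        match = FATAL_RE.search(line)
        if match:
            hits.append((int(match.group(1)), match.group(2), int(match.group(3))))
    return hits


# Sample text copied from the stderr excerpt in this post.
sample = """# GPU 0 : 65C
SWAN : FATAL : Cuda driver error 716 in file 'swanlibnv2.cpp' in line 1963.
# SWAN swan_assert 0
"""

print(find_swan_fatals(sample))  # [(716, 'swanlibnv2.cpp', 1963)]
```

Pasting the stderr of a finished task into `sample` shows at a glance whether the run hit this error, even when the task itself is reported as completed.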
____________
Greetings from TJ

Jozef J
Joined: 7 Jun 12
Posts: 112
Credit: 1,140,895,172
RAC: 315,774
Level
Met
Message 33773 - Posted: 4 Nov 2013 | 16:34:38 UTC

For a few weeks I have had a problem. When I start BOINC Manager and GPUGRID begins working, and I then run the Chrome browser, a movie player, GPU-Z or YouTube, the whole computer slows down as if in slow motion. If I manage, in this slow mode, to get the mouse onto BOINC Manager and pause only GPUGRID, everything starts working normally again.

I have been fighting this for a few days. Sometimes I manage to run without dropping into slow mode; once it happens, only a reboot or reset fixes it. Right after Windows boots, BOINC Manager and GPUGRID start normally (this is a clean Windows 8.1 install with the NVIDIA GeForce R331 driver), but after starting certain applications (Chrome, PotPlayer ..) the slowdown begins again .. It has also happened to me without launching any program ... I have checked everything, and since this is almost a clean install, I think the problem lies with GPUGRID, the NVIDIA drivers and some acceleration in programs that run at the same time as GPUGRID ..

It would be quicker to explain if I recorded a video of it ...
GPUGRID tasks ran perfectly for a long time, but this problem started recently .. and the solution will be hard to find ..
I swapped the HDD and SSD to check whether it was really a hardware fault: no problem there ..
Strangely, I have not had a single BSOD in a month from failed GPUGRID work or the other task problems the forum is full of :)
But then this slow-motion problem began to appear .. that is the situation.

I think it is some conflict between GPUGRID, the NVIDIA driver and acceleration in programs that use the graphics card,
because another project that runs on the CPU works at the same time with no problem.

Betting Slip
Joined: 5 Jan 09
Posts: 670
Credit: 2,498,095,550
RAC: 0
Level
Phe
Message 33775 - Posted: 4 Nov 2013 | 18:06:48 UTC - in response to Message 33773.

For a few weeks I have had a problem. When I start BOINC Manager and GPUGRID begins working, and I then run the Chrome browser, a movie player, GPU-Z or YouTube, the whole computer slows down as if in slow motion. If I manage, in this slow mode, to get the mouse onto BOINC Manager and pause only GPUGRID, everything starts working normally again.

I have been fighting this for a few days. Sometimes I manage to run without dropping into slow mode; once it happens, only a reboot or reset fixes it. Right after Windows boots, BOINC Manager and GPUGRID start normally (this is a clean Windows 8.1 install with the NVIDIA GeForce R331 driver), but after starting certain applications (Chrome, PotPlayer ..) the slowdown begins again .. It has also happened to me without launching any program ... I have checked everything, and since this is almost a clean install, I think the problem lies with GPUGRID, the NVIDIA drivers and some acceleration in programs that run at the same time as GPUGRID ..

It would be quicker to explain if I recorded a video of it ...
GPUGRID tasks ran perfectly for a long time, but this problem started recently .. and the solution will be hard to find ..
I swapped the HDD and SSD to check whether it was really a hardware fault: no problem there ..
Strangely, I have not had a single BSOD in a month from failed GPUGRID work or the other task problems the forum is full of :)
But then this slow-motion problem began to appear .. that is the situation.

I think it is some conflict between GPUGRID, the NVIDIA driver and acceleration in programs that use the graphics card,
because another project that runs on the CPU works at the same time with no problem.


You are running Noelia Ins1p units on your 275 card, which doesn't have enough memory; this will cause what you describe.

____________
Radio Caroline, the world's most famous offshore pirate radio station.
Great music since April 1964. Support Radio Caroline Team -
Radio Caroline
