Old Noelia WUs

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 31475 - Posted: 13 Jul 2013, 22:09:47 UTC - in response to Message 31474.  

I may have identified the source of some problems with the present Noelia WU's. When I checked the Memory Controller Load it was 1% for a GTX 660Ti. The last time I looked it was around 40%. The GPU load was 98% and clocks were normal (high).


FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help
Jim1348

Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 31476 - Posted: 13 Jul 2013, 22:15:53 UTC - in response to Message 31475.  

skgiven wrote:
I may have identified the source of some problems with the present Noelia WU's. When I checked the Memory Controller Load it was 1% for a GTX 660Ti. The last time I looked it was around 40%. The GPU load was 98% and clocks were normal (high).

I saw that a couple of days ago with one of my cards, I think a 660. I exited BOINC as normal, and when it restarted, the Noelia errored out.

But that means the work unit could hang that way for a long time unless you manually intervene; not a fun thought.
TJ

Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Message 31477 - Posted: 13 Jul 2013, 22:35:04 UTC - in response to Message 31475.  
Last modified: 13 Jul 2013, 22:35:25 UTC

skgiven wrote:
I may have identified the source of some problems with the present Noelia WU's. When I checked the Memory Controller Load it was 1% for a GTX 660Ti. The last time I looked it was around 40%. The GPU load was 98% and clocks were normal (high).

I have that too on my quad with the 660 still in it. I made some alterations with Precision X and rebooted, but the MCU load stays at 1% and the GPU power sits around 62%. It has done 34% in 17 hours. The other 660 in the T7400 does great.

skgiven, how did you fix this problem with the 1% MCU load?
Greetings from TJ
Carlesa25
Joined: 13 Nov 10
Posts: 328
Credit: 72,619,453
RAC: 0
Message 31478 - Posted: 13 Jul 2013, 22:43:16 UTC

Hello: It seems that I have a problem on Linux/Ubuntu with the GTX 770 and Noelia tasks: performance is pitiful at the GPU, with no CPU usage.

I cannot drop back to a driver older than 319.23, as earlier drivers do not support the 770.

Is there any forecast of when this issue will be resolved, or do we have to wait for a new driver from Nvidia?
skgiven
Message 31479 - Posted: 13 Jul 2013, 22:46:11 UTC - in response to Message 31476.  
Last modified: 13 Jul 2013, 23:05:27 UTC

I've restarted BOINC and the system, and suspended and resumed tasks to make them swap GPUs, and now both Noelia WUs are using 0 or 1% Memory Controller Load. The worrying thing is that one WU is at 52% after 24h, mostly on a GTX 660Ti, and the other is at 39% after 5h40min but will no doubt take days, since the memory controller load is banjaxed.
The GPU temperature, Fan speeds and Power targets are all down but the clocks are normal (high).

If I raise the Power Limit using Afterburner from 100% to 101% the GPU power drops from 65% to 56%, when I raise it to 102% it goes back to 65%. It appears that something is either being set to on or off. It doesn't matter what the percentage is, it changes to 65 then 56 and back to 65...

I'm going to dispose of the 314.22 drivers and try 310.90, but since I have not experienced the memory controller issues with other WU's I would say it's task related. I'm also seeing wonky driver restarts, but I've seen that before with Noelia WU's.

... No difference.
I will have to abort the WU's, as they will take days at 1% memory controller load.
Short queue here I come,
TJ

Message 31481 - Posted: 13 Jul 2013, 23:01:29 UTC - in response to Message 31479.  

Could it be hardware/software related? Your 660Ti isn't worse than my 660. Both of my 660s are exactly the same, both EVGA and not overclocked. One is in the T7400 with PCIe 2.0 and is doing well, with 93% GPU load, 65°C, 35% MCU load and 96% GPU power. It does a Noelia in about 14 hours.
The other is in a quad core with PCIe 1.1 and shows 1% MCU load at 60% GPU power and 97% GPU load.
So there is something wrong. Your card is taking a lot of time to finish as well.
Greetings from TJ
Retvari Zoltan
Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Message 31483 - Posted: 13 Jul 2013, 23:06:19 UTC - in response to Message 31479.  
Last modified: 13 Jul 2013, 23:09:22 UTC

skgiven wrote:
I'm going to dispose of the 314.22 drivers and try 306.97, but since I have not experienced the memory controller issues with other WU's I would say it's task related. I'm also seeing wonky driver restarts, but I've seen that before with Noelia WU's.

What kind of OS is running on this host?
skgiven
Message 31485 - Posted: 13 Jul 2013, 23:22:42 UTC - in response to Message 31483.  

W7x64, but went to 310.90.

I aborted both WU's and started running short WU's. So far no issues, 6% in.
Jim1348

Message 31486 - Posted: 13 Jul 2013, 23:34:01 UTC

I have noticed recently that when exiting BOINC (7.0.64 x64) I have been getting crashes of the Nvidia drivers. But I have just upgraded to BOINC 7.2.4, and don't see this. Whether that has anything to do with the present Noelia problems is another matter, but it is worth watching.
http://boinc.berkeley.edu/dl/

(I am currently on the 311.06 drivers, a Windows update from the 310.90 drivers, but I think it happens on the other versions too.)
Retvari Zoltan
Message 31487 - Posted: 13 Jul 2013, 23:54:36 UTC - in response to Message 31485.  
Last modified: 13 Jul 2013, 23:54:52 UTC

skgiven wrote:
W7x64, but went to 310.90.

I have the feeling that Win7x64 is more prone to cause workunit errors (especially Noelia's) than WinXPx64.
skgiven
Message 31489 - Posted: 14 Jul 2013, 0:18:53 UTC - in response to Message 31487.  

I still think this might be a WU issue, but I've suspected for some time that hidden WDDM bugs could occasionally cause issues.
Chilean
Joined: 8 Oct 12
Posts: 98
Credit: 385,652,461
RAC: 0
Message 31491 - Posted: 14 Jul 2013, 3:53:05 UTC - in response to Message 31486.  

Jim1348 wrote:
I have noticed recently that when exiting BOINC (7.0.64 x64) I have been getting crashes of the Nvidia drivers. But I have just upgraded to BOINC 7.2.4, and don't see this. Whether that has anything to do with the present Noelia problems is another matter, but it is worth watching.
http://boinc.berkeley.edu/dl/

(I am currently on the 311.06 drivers, a Windows update from the 310.90 drivers, but I think it happens on the other versions too.)


I think this is a bug with NOELIA's long WUs.

I switched over to short WUs only, to avoid this NVIDIA driver crash every time I suspend or exit BOINC. It sometimes crashes my whole system and forces me to hard reboot.
5pot

Joined: 8 Mar 12
Posts: 411
Credit: 2,083,882,218
RAC: 0
Message 31493 - Posted: 14 Jul 2013, 5:08:57 UTC

I've only had two or three that errored out. Both were almost immediate, with one unit erroring for everyone. And the one that I just crashed on went to SAM, and he finished it.

The WUs are currently at 1.2 GB, AFAIK, and I haven't experienced any running for an abnormally long time. Although all my cards are on the high end side of things.

EDIT: I can say this. I thought the WUs were supposed to have swan_sync=0 enabled by default for 6xx+ cards? The latest Noelia units have not been doing this, and have been running at about 1/2 that, meaning 2:1 GPU:CPU time.
skgiven
Message 31495 - Posted: 14 Jul 2013, 9:37:44 UTC - in response to Message 31493.  
Last modified: 14 Jul 2013, 9:41:28 UTC

These latest Noelia WU's use the older v6.18 Application. Previous Noelia WU's used v6.49. So that may explain behavior differences. Previous Noelia WU's did not use a full CPU core/thread (the only type of work that doesn't).

While some WUs complete successfully, there seem to be at least four types of problem:
- Early failures after a few minutes,
- Driver restarts that crash the work,
- High GDDR memory usage that prevents some cards from being used,
- A reduction in Memory Controller load, which causes the tasks to appear to run normally (even faster, going by GPU usage) but actually slows the work down massively (causing it to take days).

WU behavior may be different on different operating systems, and with different drivers. GPU card architectural differences may also be an issue, and these WUs might challenge the GPU in different ways, exposing weaknesses in the GPU that were previously unseen with different WUs. It's not often you see 3 or 4 known good cards all fail a WU, and then see the WU succeed on another card.
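
For anyone who wants to watch for the memory controller symptom described above without keeping a monitoring window open, here is a minimal Python sketch. It assumes `nvidia-smi` is on the PATH and that its `utilization.memory` figure is comparable to the Memory Controller Load shown by tools such as Precision X, which may not hold on every card or driver. The function names and the 90% / 1% thresholds are hypothetical, chosen loosely from the figures reported in this thread.

```python
import subprocess

# Properties requested from nvidia-smi via --query-gpu.
QUERY = "utilization.gpu,utilization.memory,power.draw,clocks.sm"


def parse_sample(line):
    """Parse one CSV line of nvidia-smi output into a dict of floats.

    Fields the driver does not report (e.g. "[Not Supported]") become None.
    """
    names = ("gpu_util", "mem_util", "power_w", "sm_clock_mhz")
    values = []
    for field in line.split(","):
        try:
            values.append(float(field.strip()))
        except ValueError:
            values.append(None)
    return dict(zip(names, values))


def looks_stalled(sample, gpu_min=90.0, mem_max=1.0):
    """True if GPU load is high while memory controller load is near zero.

    The thresholds are assumptions for illustration, not project guidance.
    """
    if sample["gpu_util"] is None or sample["mem_util"] is None:
        return False
    return sample["gpu_util"] >= gpu_min and sample["mem_util"] <= mem_max


def read_samples():
    """Query every GPU once and return one parsed sample per card."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_sample(line) for line in out.splitlines() if line.strip()]
```

Calling `read_samples()` every minute or so and checking each result with `looks_stalled` would flag a card that reports something like 98% GPU load with 1% memory controller load. It only detects the state; it does not say whether the WU, the driver or the hardware is at fault.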
TJ

Message 31496 - Posted: 14 Jul 2013, 9:51:40 UTC - in response to Message 31495.  
Last modified: 14 Jul 2013, 9:51:58 UTC

If I understand correctly, then the 1% MCU load is a result of the WU and we cannot do anything about it?
Well, it has done 54% on the 660 in 24 hours, so aborting it seems a waste. I will let it run for another day and then set no new work for that rig.
Greetings from TJ
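
The progress figures quoted in this thread (54% in 24 hours here, 34% in 17 hours earlier) can be turned into a rough time estimate. This is a minimal Python sketch, assuming the task keeps progressing at a constant rate, which the memory controller drops reported in this thread suggest is not guaranteed. `estimate_hours` is a hypothetical helper, not part of BOINC.

```python
def estimate_hours(progress_percent, elapsed_hours):
    """Linearly extrapolate total and remaining run time for a task.

    Assumes a constant rate of progress, which is only an approximation.
    Returns a (total_hours, remaining_hours) tuple.
    """
    if not 0 < progress_percent <= 100:
        raise ValueError("progress_percent must be in the range (0, 100]")
    total = elapsed_hours * 100.0 / progress_percent
    return total, total - elapsed_hours
```

With 54% after 24 hours this gives roughly 44 hours in total, so about 20 more to go; with 34% after 17 hours it gives 50 hours in total.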
skgiven
Message 31497 - Posted: 14 Jul 2013, 10:19:53 UTC - in response to Message 31496.  
Last modified: 14 Jul 2013, 10:20:27 UTC

It will be interesting to see how that turns out.
Some WU's start normally and run normally for hours and then the Memory Controller load drops. From then on progress will be very slow.
This reminds me of what was happening in Linux for some WU's a few months back.
TJ

Message 31501 - Posted: 14 Jul 2013, 11:38:16 UTC - in response to Message 31497.  

Since you mentioned it, I kept a close eye on it. I watched at least two Noelia WUs from the start, and the MCU load was 1% from the beginning onwards.
I could be wrong, but I have a bit of a feeling that it is hardware related as well. My quad has the "difficulties", even though its CPU is not crunching at the moment. The 7-year-old high-end T7400 runs smoothly at steady loads. I can stop BOINC, reboot the system, or power it down when I leave for longer, and the WUs keep running smoothly, taking about 1 hour longer than a Nathan did last week (on the 660).
Greetings from TJ
ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 31505 - Posted: 14 Jul 2013, 14:02:24 UTC - in response to Message 31487.  

Zoltan wrote:
I have the feeling that Win7x64 is more prone to cause workunit errors (especially Noelia's) than WinXPx64.

This could well be. XP uses the old driver architecture versus WDDM on Vista/7/8, so they're actually on different branches now. Generally they should be similar, but especially corner cases like bugs being triggered would be expected to differ between them.

Carlesa25 wrote:
It seems that I have a problem on Linux/Ubuntu with the GTX 770 and Noelia tasks: performance is pitiful at the GPU, with no CPU usage.
I cannot drop back to a driver older than 319.23, as earlier drivers do not support the 770.
Is there any forecast of when this issue will be resolved, or do we have to wait for a new driver from Nvidia?

Well, it's obviously a driver issue, since it works with the older versions. I can't see anything BOINC or GPU-Grid could do about this other than to inform nVidia and hope they'll fix it at some point. If the most recent beta drivers are still not working, chances are that nVidia doesn't yet know about this problem.

As a workaround you could switch the GTX 770 to a Windows box, if you've got one. And the issue applies to other WUs as well, doesn't it? Otherwise you could go for the short queue.

@1% MCU load: so far the only reports of this happening have been from SK and TJ. Are you guys just watching more closely than others.. or is the error only happening on your systems? In the latter case it could be the disabled driver watchdog (did you apply this registry change as well, TJ?). If something goes wrong in the GPU and normally the watchdog would reset the driver & GPU (with task failure or not, whatever)... and you disable the watchdog, then your GPU may just continue to do something in this strange state.

MrS
Scanning for our furry friends since Jan 2002
TJ

Message 31507 - Posted: 14 Jul 2013, 14:16:04 UTC - in response to Message 31505.  
Last modified: 14 Jul 2013, 14:17:13 UTC

ExtraTerrestrial Apes wrote:
@1% MCU load: so far the only reports of this happening have been from SK and TJ. Are you guys just watching more closely than others.. or is the error only happening on your systems? In the latter case it could be the disabled driver watchdog (did you apply this registry change as well, TJ?). If something goes wrong in the GPU and normally the watchdog would reset the driver & GPU (with task failure or not, whatever)... and you disable the watchdog, then your GPU may just continue to do something in this strange state.

MrS

No I did not change this in the registry. I have looked for it but didn't find it. So to not mess things up I left it. Yes I look closely at these WU's at the moment, and I guess skgiven does too.

skgiven said:
To be fair I've had 13 Noelia WU's finish and only 2 fail (both within a few minutes, which is a lot better than after 10h). That said I did edit the registry to try to prevent failures.

Perhaps skgiven can give a hint about what needs to be changed. I suppose it is under Software, for the nVidia driver or the card manufacturer?
Greetings from TJ
dskagcommunity
Joined: 28 Apr 11
Posts: 463
Credit: 958,266,958
RAC: 31
Message 31511 - Posted: 14 Jul 2013, 14:41:26 UTC

skgiven, do you mean the registry change that stops Windows error messages from popping up and blocking the GPU/BOINC slot from continuing to work? I added that on my systems too. Or do you mean another registry edit?
DSKAG Austria Research Team: http://www.research.dskag.at




©2025 Universitat Pompeu Fabra