Old Noelia WUs
| Author | Message |
|---|---|
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
I may have identified the source of some problems with the present Noelia WU's. When I checked the Memory Controller Load it was 1% for a GTX 660Ti. The last time I looked it was around 40%. The GPU load was 98% and clocks were normal (high). FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help |
|
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0
|
skgiven wrote: "I may have identified the source of some problems with the present Noelia WU's. When I checked the Memory Controller Load it was 1% for a GTX 660Ti. The last time I looked it was around 40%. The GPU load was 98% and clocks were normal (high)." I saw that a couple of days ago with one of my cards, I think a 660. I exited BOINC as normal, and when it restarted, the Noelia errored out. But that means the work unit could hang that way for a long time unless you manually intervene; not a fun thought. |
|
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0
|
skgiven wrote: "I may have identified the source of some problems with the present Noelia WU's. When I checked the Memory Controller Load it was 1% for a GTX 660Ti. The last time I looked it was around 40%. The GPU load was 98% and clocks were normal (high)." I have that too on my quad with the 660 still in it. I made some alterations with Precision X and rebooted, but the MCU load stays at 1% and the GPU power sits around 62%. It has done 34% in 17 hours. The other 660 in the T7400 does great. How did you fix this problem with the 1% MCU load, skgiven? Greetings from TJ |
Carlesa25 Joined: 13 Nov 10 Posts: 328 Credit: 72,619,453 RAC: 0
|
Hello: It seems that I have a problem on Linux / Ubuntu with the GTX 770 and Noelia tasks: performance on the GPU is pitiful and there is no CPU usage. I cannot go back to a driver older than 319.23, as earlier drivers do not support the 770. Is there any forecast of when this issue will be resolved, or do we have to wait for a new driver from Nvidia? |
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
I've restarted BOINC and the system, and suspended and resumed tasks to make them swap GPUs, and now both Noelia WU's are using 0 or 1% Memory Controller Load. The worrying thing is that one WU is at 52% after 24h, mostly on a GTX660Ti, and the other is at 39% after 5h40min, but will no doubt take days since the memory controller load is banjaxed. The GPU temperature, fan speeds and power targets are all down, but the clocks are normal (high). If I raise the Power Limit using Afterburner from 100% to 101%, the GPU power drops from 65% to 56%; when I raise it to 102% it goes back to 65%. It appears that something is being switched either on or off. It doesn't matter what the percentage is, it changes to 65 then 56 and back to 65... I'm going to dispose of the 314.22 drivers and try 310.90, but since I have not experienced the memory controller issues with other WU's I would say it's task related. I'm also seeing wonky driver restarts, but I've seen that before with Noelia WU's. ... No difference. I will have to abort the WU's, as they will take days at 1% memory controller load. Short queue here I come. |
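The symptom described in this post (GPU load near 98% while the Memory Controller Load sits at 0 or 1%) can also be checked from a script rather than by watching a monitoring tool. The following is only a minimal sketch. It assumes `nvidia-smi` is installed and that its `utilization.memory` field corresponds to the Memory Controller Load shown by tools such as Afterburner or Precision X; some cards and drivers report this field as unsupported. The thresholds and the names `GpuSample`, `parse_samples`, `looks_stalled` and `read_samples` are illustrative and do not come from the thread.

```python
import subprocess
from typing import List, NamedTuple


class GpuSample(NamedTuple):
    index: int
    name: str
    gpu_load: int        # percent, nvidia-smi "utilization.gpu"
    mem_ctrl_load: int   # percent, nvidia-smi "utilization.memory"


QUERY = "index,name,utilization.gpu,utilization.memory"


def parse_samples(csv_text: str) -> List[GpuSample]:
    """Parse 'csv,noheader,nounits' output, skipping rows that are not numeric."""
    samples = []
    for line in csv_text.strip().splitlines():
        try:
            idx, rest = line.split(",", 1)
            name, gpu, mem = rest.rsplit(",", 2)
            samples.append(
                GpuSample(int(idx), name.strip(), int(gpu), int(mem))
            )
        except ValueError:
            # Malformed row, or a value such as "[N/A]" on unsupported cards.
            continue
    return samples


def looks_stalled(sample: GpuSample,
                  min_gpu_load: int = 90,
                  max_mem_ctrl_load: int = 2) -> bool:
    """Assumed heuristic: GPU looks busy but the memory controller is idle."""
    return (sample.gpu_load >= min_gpu_load
            and sample.mem_ctrl_load <= max_mem_ctrl_load)


def read_samples() -> List[GpuSample]:
    """Query every GPU once via nvidia-smi (must be on the PATH)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_samples(out)
```

Calling `read_samples()` every few minutes and logging any card for which `looks_stalled()` is true would show whether the drop happens from the start of a task or partway through. A single low reading is not proof of a problem, so several consecutive samples are probably a safer trigger.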
|
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0
|
Could it be hardware/software related? Your 660Ti isn't worse than my 660. Both 660s are exactly the same, both EVGA, not overclocked. One is in the T7400 with PCIe 2.0 and is doing well with 93% GPU load, 65°C, 35% MCU load and 96% GPU power. It does a Noelia in about 14 hours. The other is in a quad core with PCIe 1.1 and shows 1% MCU load at 60% GPU power and 97% GPU load. So there is something wrong. Your card is taking a lot of time to finish as well. Greetings from TJ |
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
skgiven wrote: "I'm going to dispose of the 314.22 drivers and try 306.97, but since I have not experienced the memory controller issues with other WU's I would say it's task related. I'm also seeing wonky driver restarts, but I've seen that before with Noelia WU's." What kind of OS is running on this host? |
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
W7x64, but went to 310.90. I aborted both WU's and started running short WU's. So far no issues, 6% in. |
|
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0
|
I have noticed recently that when exiting BOINC (7.0.64 x64) I have been getting crashes of the Nvidia drivers. But I have just upgraded to BOINC 7.2.4, and don't see this. Whether that has anything to do with the present Noelia problems is another matter, but it is worth watching. http://boinc.berkeley.edu/dl/ (I am currently on the 311.06 drivers, a Windows update from the 310.90 drivers, but I think it happens on the other versions too.) |
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
skgiven wrote: "W7x64, but went to 310.90." I have the feeling that Win7x64 is more prone to cause workunit errors (especially Noelia's) than WinXPx64. |
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
I still think this might be a WU issue, but I've suspected for some time that hidden WDDM bugs could occasionally cause issues. |
Chilean Joined: 8 Oct 12 Posts: 98 Credit: 385,652,461 RAC: 0
|
"I have noticed recently that when exiting BOINC (7.0.64 x64) I have been getting crashes of the Nvidia drivers. But I have just upgraded to BOINC 7.2.4, and don't see this. Whether that has anything to do with the present Noelia problems is another matter, but it is worth watching." I think this is a bug with NOELIA's long WUs. I switched over to short WUs only, to avoid this NVIDIA driver crash every time I suspend or exit BOINC, which sometimes crashes my whole system and forces me to hard reboot. |
|
Joined: 8 Mar 12 Posts: 411 Credit: 2,083,882,218 RAC: 0
|
I've only had two or three that errored out. Both were almost immediate, with one unit erroring for everyone. And the one that I just crashed on went to SAM, and he finished it. The WUs are currently at 1.2 GB, AFAIK, and I haven't experienced any running for an abnormally long time. Although all my cards are on the high end side of things. EDIT: I can say this. I thought the WU's were supposed to have swan_sync=0 enabled by default for 6xx+ cards? The latest Noelia units have not been doing this, and have been running at about 1/2 that. Meaning 2:1 GPU:CPU time. |
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
These latest Noelia WU's use the older v6.18 Application. Previous Noelia WU's used v6.49. So that may explain behavior differences. Previous Noelia WU's did not use a full CPU core/thread (the only type of work that doesn't). While some WU's complete successfully, there seem to be at least four types of problem: (1) early failures after a few minutes; (2) driver restarts that crash the work; (3) high GDDR memory usage that prevents some cards from being used; (4) a reduction in Memory Controller load which causes the tasks to appear to run normally (even faster going by GPU usage) but actually slows the work down massively (causing it to take days). WU behavior may be different on different operating systems, and with different drivers. GPU card architectural differences may also be an issue, and these WU's might challenge the GPU in different ways, exposing weaknesses in the GPU that were previously unseen with different WU's. It's not often you see 3 or 4 known good cards all fail a WU, and then the WU succeed on another card. |
|
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0
|
If I understand correctly, then the 1% MCU load is a result of the WU and we cannot do anything about it? Well, it has done 54% on the 660 in 24 hours, so aborting it seems a waste. I'll let it run for another day and then set no new work for that rig. Greetings from TJ |
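The progress figures quoted in this thread (34% in 17 hours, 54% in 24 hours, against roughly 14 hours for a healthy run on the same model of card) can be turned into a rough completion estimate. This is a sketch only, and it assumes progress is linear over time, which may not hold if the Memory Controller load drops partway through a task. The function names are illustrative.

```python
def estimate_total_hours(progress_pct: float, elapsed_hours: float) -> float:
    """Extrapolate total run time, assuming progress is linear in time."""
    if not 0 < progress_pct <= 100:
        raise ValueError("progress_pct must be in (0, 100]")
    if elapsed_hours < 0:
        raise ValueError("elapsed_hours must be non-negative")
    return elapsed_hours * 100.0 / progress_pct


def remaining_hours(progress_pct: float, elapsed_hours: float) -> float:
    """Hours still needed under the same linear assumption."""
    return estimate_total_hours(progress_pct, elapsed_hours) - elapsed_hours


# Figures quoted in the thread for the slow 660:
print(estimate_total_hours(34, 17))          # 34% in 17 h -> 50.0 h in total
print(round(remaining_hours(54, 24), 1))     # hours left after 54% in 24 h
```

Under that assumption the task at 54% after 24 hours would need a bit under another day to finish, which is in line with the plan to let it run for one more day.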
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
It will be interesting to see how that turns out. Some WU's start normally and run normally for hours and then the Memory Controller load drops. From then on progress will be very slow. This reminds me of what was happening in Linux for some WU's a few months back. |
|
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0
|
Since you mentioned it, I kept a close eye on it. I watched at least two Noelia WU's from the start, and the MCU load was 1% from the beginning onwards. I could be wrong, but I have a bit of a feeling that it is hardware related as well. My quad has the "difficulties", even though its CPU is not crunching at the moment. The 7-year-old high-end T7400 runs smoothly at steady loads. I can stop BOINC, reboot the system or power it down when I leave for longer, and the WU's keep going smoothly, taking about 1 hour longer than a Nathan did last week (on the 660). Greetings from TJ |
|
Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 |
Zoltan wrote: "I have the feeling that Win7x64 is more prone to cause workunit errors (especially Noelia's) than WinXPx64." This could well be. XP uses the old driver architecture versus WDDM on Vista/7/8, so they're actually on different branches now. Generally they should be similar, but corner cases such as bugs being triggered would be expected to differ between them. Carlesa25 wrote: "It seems that I have a problem on Linux / Ubuntu with the GTX 770 and Noelia tasks: performance on the GPU is pitiful and there is no CPU usage." Well, it's obviously a driver issue, since it works with the older versions. I can't see anything BOINC or GPU-Grid could do about this other than to inform nVidia and hope they'll fix it at some point. If the most recent beta drivers are still not working, chances are that nVidia doesn't yet know about this problem. As a workaround you could switch the GTX770 to a Windows box, if you've got one. And.. the issue applies to other WUs as well, doesn't it? Otherwise you could go for the short queue. @1% MCU load: so far the only reports of this happening have been from SK and TJ. Are you guys just watching more closely than others.. or is the error only happening on your systems? In the latter case it could be the disabled driver watchdog (did you apply this registry change as well, TJ?). If something goes wrong in the GPU, normally the watchdog would reset the driver & GPU (with task failure or not, whatever)... but if you disable the watchdog, your GPU may just continue to do something in this strange state. MrS Scanning for our furry friends since Jan 2002 |
|
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0
|
"@1% MCU load: so far the only reports of this happening have been from SK and TJ. Are you guys just watching more closely than others.. or is the error only happening on your systems? In the latter case it could be the disabled driver watchdog (did you apply this registry change as well, TJ?). If something goes wrong in the GPU and normally the watchdog would reset the driver & GPU (with task failure or not, whatever)... and you disable the watchdog, then your GPU may just continue to do something in this strange state." No, I did not change this in the registry. I looked for it but didn't find it, so to avoid messing things up I left it alone. Yes, I am looking closely at these WU's at the moment, and I guess skgiven is too. skgiven said: "To be fair I've had 13 Noelia WU's finish and only 2 fail (both within a few minutes, which is a lot better than after 10h). That said I did edit the registry to try to prevent failures." Perhaps skgiven can give a hint about what needs to be changed. I suppose it is under Software, nVidia driver or card manufacturer? Greetings from TJ |
dskagcommunity Joined: 28 Apr 11 Posts: 463 Credit: 958,266,958 RAC: 31
|
skgiven, do you mean the registry change that stops Windows error messages from popping up and blocking the GPU/BOINC slot from working? I added it on my systems too. Or do you mean another registry edit? DSKAG Austria Research Team: http://www.research.dskag.at
|
©2025 Universitat Pompeu Fabra