acemdlong application 8.14 - discussion

TJ

Message 33028 - Posted: 16 Sep 2013, 23:37:06 UTC

This afternoon I looked at my PC and, just by coincidence, saw a WU (CRASH) stop and another one start, at which point the GPU clock dropped to half. Nothing I tried (suspending, resuming, the EVGA software) got the clock back up; only rebooting the system did, 1 day and 11 hours after its last reboot for the same issue.

Now that the WU has finished with a good result, I looked at it and found this again:
# BOINC suspending at user request (exit)

I did nothing, and the PC was only running GPUGRID plus 5 Rosetta WUs on the CPU. The virus scanner was not running; it only scans at night.
And I used the line from Operator in cc_config so that benchmarks never run.
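
For anyone looking for that setting: a minimal sketch, assuming the line in question is BOINC's `skip_cpu_benchmarks` option in cc_config.xml (Operator's original line is not quoted here, so that is a guess). The helper name `build_cc_config` is made up for illustration; it just builds the XML text with Python's standard library.

```python
import xml.etree.ElementTree as ET


def build_cc_config(skip_benchmarks: bool = True) -> str:
    """Return the text of a minimal cc_config.xml.

    Assumption: the option meant in the post is <skip_cpu_benchmarks>,
    placed inside the <options> section of the file.
    """
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    flag = ET.SubElement(options, "skip_cpu_benchmarks")
    flag.text = "1" if skip_benchmarks else "0"
    # encoding="unicode" makes tostring() return a str instead of bytes
    return ET.tostring(root, encoding="unicode")


print(build_cc_config())
```

The printed text would go into cc_config.xml in the BOINC data directory, and the client has to re-read its config files (or be restarted) before the change takes effect.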

I think Matt has made a good diagnostic program, and we now get to see things we never saw before but that could have been happening all along. It would be nice, though, to see somewhere what all these messages mean (and what we could or could not do about them).
But only when you have time, Matt; we know you are busy with programming and you need to get your PhD as well.

I am now 3 days error free even on my 660, so things have improved, for me at least. Thanks for that.
Greetings from TJ
5pot

Message 33030 - Posted: 16 Sep 2013, 23:42:31 UTC - in response to Message 33026.  

For these access violation problems, it seems that I'm going to have to set up a Windows system with a Titan in the lab and try to reproduce it. Unfortunately I'll not be back to do that until mid October at the earliest. I hope you can tolerate the current state of affairs until then?

Matt


Like I said, they're running and validating. Fine with me.
Operator

Message 33032 - Posted: 17 Sep 2013, 0:38:46 UTC - in response to Message 33026.  

I hope you can tolerate the current state of affairs until then?

Matt


Matt;

Will have to do. Thanks for looking into it though. That's encouraging.

As I indicated I would, I removed one GPU and booted up to run long WUs.

Got one NATHAN_KIDc downloaded and running, and a second, a NOELIA_INS, showing "Ready to Start".

After one hour I came back to check and sure enough the first one had stopped and was now "Waiting to run" and the second one was running.

I'm sure they'll swap back and forth again several times before completion.

Operator

Jacob Klein

Message 33033 - Posted: 17 Sep 2013, 2:09:17 UTC - in response to Message 33032.  

Operator:

I sent you a private message on GPUGrid, with my email address, requesting some files from you. I'd like to help your situation. Can you send me those files?

Thanks,
Jacob
Richard Haselgrove

Message 33035 - Posted: 17 Sep 2013, 10:08:00 UTC

Here's an anecdotal story, based on a random sample of one. YMMV - in fact, your system will certainly be different - but this may be of interest.

My GTX 670 host has been having a lot of problems - starting in August, which was particularly warm here. "Problems" were the occasional BSOD, but most commonly a total system freeze - Windows desktop shows on screen as normal, but the system clock stops updating and there's no response to mouse or keyboard. First suspect was overheating, so I installed extra side fans in an already well-ventilated HAF case and moved the machine to a cooler room - that seemed to improve things, but wasn't a complete cure.

Then, after this month's Windows security updates, it got much worse again - freezing every six hours or so. OS is Windows 7 Home Premium, 64-bit, and CPU is an 'Ivy Bridge' (third generation) i7 with HD 4000 graphics. Motherboard is by Gigabyte with Z77 express chipset.

Looking around, I found:



After consulting an experienced developer and system builder, I installed - in this order - the following updates:

1) Platform Update - http://support.microsoft.com/kb/2670838
2) Intel HD 4000 driver from the Intel site - Intel Download Centre
3) The two Driver Framework updates from the list above - Kernel-Mode and User-Mode
4) The most recent NVidia driver available - 326.80 Beta (using the 'clean install' option)

Since I did all that, the machine has run without error, and no errors have been logged in the most recent beta tasks. I'm going to try switching back to long tasks after the current beta has finished.
Operator

Message 33038 - Posted: 17 Sep 2013, 12:56:52 UTC - in response to Message 33033.  
Last modified: 17 Sep 2013, 13:20:37 UTC

Jacob;

Files are in your inbox now.

Thanks,

Operator
Operator

Message 33040 - Posted: 17 Sep 2013, 13:49:12 UTC - in response to Message 33035.  



After consulting an experienced developer and system builder, I installed - in this order - the following updates:

1) Platform Update - http://support.microsoft.com/kb/2670838
2) Intel HD 4000 driver from the Intel site - Intel Download Centre
3) The two Driver Framework updates from the list above - Kernel-Mode and User-Mode
4) The most recent NVidia driver available - 326.80 Beta (using the 'clean install' option)



Richard;

Thanks. My system board (Dell) has no integrated Intel HD video, discrete only.

I do have the platform updates already installed, and in fact have most if not all of the other updates you show there installed as well.

I actually did have Nvidia driver version 326.84 installed and reverted back to a clean install of 326.41 to determine if that had anything to do with the problem, but apparently it didn't. I think it's the way the 8.14 app runs on 780/Titan GPUs that is the issue. I don't see any of these problems with apps running on my 590 box. Matt (MJH) says he's going to have a go at investigating when he gets a chance.

I'm considering doing a Linux build to see if that makes any difference because it seems that the development branches may be different for Windows vs Linux GPUGrid apps. But I have very little experience with Linux in general so this would be time consuming for me to get spun up on.

Operator

John C MacAlister

Message 33041 - Posted: 17 Sep 2013, 15:25:33 UTC
Last modified: 17 Sep 2013, 15:26:57 UTC

Hi, Folks

26h run time forecast....Is this reasonable?

AMD FX-8350 with GTX 650 Ti.

Computer ID 158482
Name Panzer-001
Location home
Avg. credit 52,782.88
Total credit 786,400
BOINC version 7.0.64
CPU AuthenticAMD AMD FX(tm)-8350 Eight-Core Processor [Family 21 Model 2 Stepping 0] (8 processors)
GPU [2] NVIDIA GeForce GTX 650 Ti (1023MB) driver: 314.22
Operating System Microsoft Windows 7 Ultimate x64 Edition, Service Pack 1, (06.01.7601.00)
Last contact 17 Sep 2013 | 15:16:19 UTC

Name 35x7-SANTI_RAP74wtCUBIC-5-34-RND8406_1
Workunit 4779214
Created 17 Sep 2013 | 12:01:53 UTC
Sent 17 Sep 2013 | 15:16:19 UTC
Received ---
Server state In progress
Outcome ---
Client state New
Exit status 0 (0x0)
Computer ID 158482
Report deadline 22 Sep 2013 | 15:16:19 UTC
Run time 0.00
CPU time 0.00
Validate state Initial
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v8.14 (cuda42)
MJH

Message 33043 - Posted: 17 Sep 2013, 15:45:04 UTC - in response to Message 33040.  

Operator,

Would a bootable Linux image be useful for you?
Was planning to put one together for the memory tester anyway.

Matt
Jim1348

Message 33044 - Posted: 17 Sep 2013, 16:12:48 UTC - in response to Message 33041.  

Hi, Folks

26h run time forecast....Is this reasonable?

I wouldn't pay much attention to the forecast. See what the actual run time is; it should be about 18 hours.
John C MacAlister

Message 33045 - Posted: 17 Sep 2013, 16:25:39 UTC - in response to Message 33044.  
Last modified: 17 Sep 2013, 16:26:19 UTC

Hi, Folks

26h run time forecast....Is this reasonable?

I wouldn't pay much attention to the forecast. See what the actual run time is; it should be about 18 hours.



Many thanks, Jim.

Regards,

John
ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester

Message 33048 - Posted: 17 Sep 2013, 20:27:40 UTC - in response to Message 33017.  

What could be interesting is whether these access violations already happened in 8.03 but had no visible effect, or if they're caused by some change made to the app since then.

MrS

MrS;

Easy enough to find out.

Here's the results of the last 4 WUs crunched on the Titan system using 8.03 before 8.14 got downloaded automatically:
...

I don't see any evidence of either errors or Access violations.

If I remember correctly, Matt only introduced the error handling with 8.11, and may also have improved the error detection. So I still think it's possible that whatever triggers the error detection now was happening before, but did not actually harm the WUs. It's just one possibility, though, and one I don't think we can answer.

Matt, would it be sufficient if you got remote access to a Titan on Windows? I don't have one, but others might want to help. That would certainly be quicker than setting the system up yourself, although you might want to have a Windows system for hunting nasty bugs anyway.

MrS
Scanning for our furry friends since Jan 2002
Operator

Message 33049 - Posted: 18 Sep 2013, 0:38:51 UTC - in response to Message 33048.  

What could be interesting is whether these access violations already happened in 8.03 but had no visible effect, or if they're caused by some change made to the app since then.

MrS

MrS;

Easy enough to find out.

Here's the results of the last 4 WUs crunched on the Titan system using 8.03 before 8.14 got downloaded automatically:
...

I don't see any evidence of either errors or Access violations.


If I remember correctly, Matt only introduced the error handling with 8.11, and may also have improved the error detection. So I still think it's possible that whatever triggers the error detection now was happening before, but did not actually harm the WUs. It's just one possibility, though, and one I don't think we can answer.

MrS


MrS;

Looking back at the last 10 or so SANTI_RAP, NOELIA-INSP, and NATHAN_KIDKIX WUs that were run on the 8.03 app just before the switch to 8.14...

http://www.gpugrid.net/results.php?hostid=158641&offset=20&show_names=1&state=0&appid=

you can see that average completion times were about 20k seconds.

After 8.14? Sometimes double that due to the constant restarts.

So even if error checking was introduced with version 8.11, and there may have been hidden errors created when running the 8.03 app (I'm not sure how that follows logically though), the near doubling of the work unit completion times immediately upon initial usage of the 8.14 app is enough of a smoking gun that there is something amiss.

And that is the real problem here I think, the amount of time it takes a WU to complete due to all the starts and stops. That directly impacts the number of WUs that this system (and other Titan/780 equipped systems like it) can get returned. If you like, look at it from the perspective of the "return on the Kilowatts consumed".

Now, I am perfectly happy to wait till Matt has a chance to do some testing, and see where that takes us.

I'll put the second Titan GPU back in the case and continue as before until...whatever.

Operator
TJ

Message 33054 - Posted: 18 Sep 2013, 14:41:59 UTC

A CRASHNPT WU was suspended with 3% still to go, and another WU was running. I suppose this happened due to the "termination by the app to avoid hangup". So I suspended the other WU, and the one that was almost finished started again but failed immediately. So is manual suspending not working properly anymore, or is it because the app had stopped the WU itself?
Greetings from TJ
Operator

Message 33056 - Posted: 18 Sep 2013, 16:03:19 UTC - in response to Message 33049.  



And that is the real problem here I think, the amount of time it takes a WU to complete due to all the starts and stops. That directly impacts the number of WUs that this system (and other Titan/780 equipped systems like it) can get returned. If you like, look at it from the perspective of the "return on the Kilowatts consumed".

Operator



As an example of what I was referring to above:

With one Titan GPU installed and only one WU downloaded and crunching, the amount of time 'wasted' by the "Scheduler: Access violation, Waiting to Run" issue for I59R6-NATHAN_KIDc22_glu-6-10-RND3767_1 was 2 hours 47 minutes and 31 seconds of nothing happening.

This data came from the stdoutdae.txt file and was imported into Excel where the 'gaps' between restarts for this WU were totalled up.

So this WU could have finished in 'real time' (not GPU time) almost three hours earlier than it did and would have allowed another WU to have been mostly completed if not for all the restarts.
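
The same totalling can be done without Excel. Below is a rough Python sketch, assuming the stop and restart times for the WU have already been picked out of stdoutdae.txt by hand and that they use the `dd-Mon-yyyy hh:mm:ss` layout. The function name `total_idle` and the sample timestamps are invented for illustration and are not taken from the real log.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout, e.g. "18-Sep-2013 16:03:19".
# Note: %b expects English month abbreviations under the default locale.
STAMP = "%d-%b-%Y %H:%M:%S"


def total_idle(gaps):
    """Sum the idle time over (stopped, restarted) timestamp pairs."""
    total = timedelta()
    for stopped, restarted in gaps:
        start = datetime.strptime(stopped, STAMP)
        end = datetime.strptime(restarted, STAMP)
        if end < start:
            raise ValueError(f"restart {restarted} is before stop {stopped}")
        total += end - start
    return total


# Made-up pairs for illustration only; not taken from the real log.
example_gaps = [
    ("18-Sep-2013 01:00:00", "18-Sep-2013 01:12:30"),  # 12 min 30 s idle
    ("18-Sep-2013 03:40:10", "18-Sep-2013 03:55:10"),  # 15 min idle
]

print(total_idle(example_gaps))  # prints 0:27:30
```

For scale, the 2 hours 47 minutes and 31 seconds quoted above comes to 10,051 seconds.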

Let me know if anybody sees this a different way.

Operator
skgiven
Volunteer moderator
Volunteer tester

Message 33061 - Posted: 18 Sep 2013, 16:56:54 UTC - in response to Message 33056.  

I agree that it's possible that the loading and clearing of the app could use up a substantial amount of time. This again suggests that recoverable errors are now triggering the app suspension and recovery mechanism. Maybe the app just needs to be refined so that it doesn't get triggered so often.
Beyond

Message 33062 - Posted: 18 Sep 2013, 17:04:08 UTC - in response to Message 33043.  

Operator,

Would a bootable Linux image be useful for you?
Was planning to put one together for the memory tester anyway.

Matt

It would be nice to have a 64bit Linux image with BOINC, NVidia and ATI drivers installed if that's even possible. No need for anything else. All my boxes are AMD with both NVidia and AMD GPUs. Haven't had a lot of success getting Linux running so that BOINC will work for both GPU types.
5pot

Message 33065 - Posted: 18 Sep 2013, 17:41:45 UTC - in response to Message 33056.  



And that is the real problem here I think, the amount of time it takes a WU to complete due to all the starts and stops. That directly impacts the number of WUs that this system (and other Titan/780 equipped systems like it) can get returned. If you like, look at it from the perspective of the "return on the Kilowatts consumed".

Operator



As an example of what I was referring to above:

With one Titan GPU installed and only one WU downloaded and crunching, the amount of time 'wasted' by the "Scheduler: Access violation, Waiting to Run" issue for I59R6-NATHAN_KIDc22_glu-6-10-RND3767_1 was 2 hours 47 minutes and 31 seconds of nothing happening.

This data came from the stdoutdae.txt file and was imported into Excel where the 'gaps' between restarts for this WU were totalled up.

So this WU could have finished in 'real time' (not GPU time) almost three hours earlier than it did and would have allowed another WU to have been mostly completed if not for all the restarts.

Let me know if anybody sees this a different way.

Operator


Yours is doing something completely different from mine. Why, I don't know. But since mine suspends and starts another task, very little is lost. In fact, my times are pretty much unchanged.

Your issue is odd and unique.
MJH

Message 33066 - Posted: 18 Sep 2013, 17:47:45 UTC - in response to Message 33062.  


Haven't had a lot of success getting Linux running so that BOINC will work for both GPU types.


Unsurprising. It's difficult to do, and fragile when it's done. The trick is to do the installation in this order:
* Install the operating system's X and mesa packages
* Install the Nvidia driver
* Force a re-install of the X and mesa packages
* Install Catalyst
* Configure the X server for the AMD card
* Start X

MJH
Operator

Message 33069 - Posted: 18 Sep 2013, 19:19:14 UTC - in response to Message 33065.  



Yours is doing something completely different from mine. Why, I don't know. But since mine suspends and starts another task, very little is lost. In fact, my times are pretty much unchanged.

Your issue is odd and unique.


To be clear I'm referring to the difference between the WU runtime showing in the results (20+k seconds) and the actual 'real' time the computer took to complete the WU from start to finish.

As an example, if you start a WU and you only have that one running, and it repeatedly starts and stops until it's finished, there will be a difference between the 'GPU runtime' and the actual clock time the WU took to complete.

Unless I'm way off base the GPU time is logged only when the WU is being actively worked. If it's "Waiting to run" I don't think that time counts. So that's why I said that there was 2 hours 47 minutes and 31 seconds of nothing happening that was essentially lost.

Now, if I completely have this wrong about GPU time vs. 'real time' please jump in here and straighten me out!

Operator
