acemdlong application 8.14 - discussion
Author | Message |
---|---|
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0 |
In the afternoon I looked at my PC and, just by coincidence, I saw a WU stop (CRASH) and another one start, but the GPU clock dropped to half when that happened. Nothing I tried (suspending, resuming, the EVGA software) got the clock up again except rebooting the system, 1 day and 11 hours after its last reboot for the same issue. Now that the WU has finished with a good result, I looked at it and found this again: # BOINC suspending at user request (exit) I did nothing, and the PC was only running GPUGRID and 5 Rosetta WUs on the CPU. The virus scanner was not running; it only runs at night. And I used the line from Operator in cc_config to never run a benchmark. I think Matt has made a good diagnostic program, and we now get to see things we never saw before but that could have been happening all along. It would be nice, though, to see somewhere what all these messages mean (and what we could or could not do about them). But only when you have time, Matt; we know you are busy with programming and you need to get your PhD as well. I have now been error free for 3 days, even on my 660, so things have improved, for me at least. Thanks for that. Greetings from TJ |
Joined: 8 Mar 12 Posts: 411 Credit: 2,083,882,218 RAC: 0 |
"For these access violation problems, it seems that I'm going to have to set up a Windows system with a Titan in the lab and try to reproduce it. Unfortunately I'll not be back to do that until mid October at the earliest. I hope you can tolerate the current state of affairs until then?" Like I said, they're running and validating. Fine with me. |
Joined: 15 May 11 Posts: 108 Credit: 297,176,099 RAC: 0 |
"I hope you can tolerate the current state of affairs until then?" Matt; Will have to do. Thanks for looking into it though. That's encouraging. As I indicated I would, I removed one GPU and booted up to run long WUs. I got one NATHAN_KIDc downloaded and running, with the second, a NOELIA_INS, "Ready to Start". After one hour I came back to check, and sure enough the first one had stopped and was now "Waiting to run" while the second one was running. I'm sure they'll swap back and forth again several times before completion. Operator |
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 |
Operator: I sent you a private message on GPUGrid, with my email address, requesting some files from you. I'd like to help your situation. Can you send me those files? Thanks, Jacob |
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 295,172 |
Here's an anecdotal story, based on a random sample of one. YMMV - in fact, your system will certainly be different - but this may be of interest.
My GTX 670 host has been having a lot of problems - starting in August, which was particularly warm here. "Problems" were the occasional BSOD, but most commonly a total system freeze - Windows desktop shows on screen as normal, but the system clock stops updating and there's no response to mouse or keyboard.
First suspect was overheating, so I installed extra side fans in an already well-ventilated HAF case and moved the machine to a cooler room - that seemed to improve things, but wasn't a complete cure. Then, after this month's Windows security updates, it got much worse again - freezing every six hours or so.
OS is Windows 7 Home Premium, 64-bit, and CPU is an 'Ivy Bridge' (third generation) i7 with HD 4000 graphics. Motherboard is by Gigabyte with Z77 express chipset.
Looking around, I found: [screenshot: list of Windows updates]
After consulting an experienced developer and system builder, I installed - in this order - the following updates:
1) Platform Update - http://support.microsoft.com/kb/2670838
2) Intel HD 4000 driver from the Intel site - Intel Download Centre
3) The two Driver Framework updates from the list above - Kernel-Mode and User-Mode
4) The most recent NVidia driver available - 326.80 Beta (using the 'clean install' option)
Since I did all that, the machine has run without error, and no errors have been logged in the most recent beta tasks. I'm going to try switching back to long tasks after the current beta has finished. |
Joined: 15 May 11 Posts: 108 Credit: 297,176,099 RAC: 0 |
Jacob; Files are in your inbox now. Thanks, Operator |
Joined: 15 May 11 Posts: 108 Credit: 297,176,099 RAC: 0 |
Richard; Thanks. My system board (Dell) has no integrated Intel HD video, discrete only. I do have the platform updates already installed, and in fact have most if not all of the other updates you show there installed as well. I actually did have Nvidia driver version 326.84 installed and reverted back to a clean install of 326.41 to determine if that had anything to do with the problem, but apparently it didn't. I think it's the way the 8.14 app runs on 780/Titan GPUs that is the issue. I don't see any of these problems with apps running on my 590 box. Matt (MJH) says he's going to have a go at investigating when he gets a chance. I'm considering doing a Linux build to see if that makes any difference because it seems that the development branches may be different for Windows vs Linux GPUGrid apps. But I have very little experience with Linux in general so this would be time consuming for me to get spun up on. Operator |
Joined: 17 Feb 13 Posts: 181 Credit: 144,871,276 RAC: 0 |
Hi, Folks
26h run time forecast... Is this reasonable? AMD FX-8350 with GTX 650 Ti.
Computer:
* Computer ID: 158482
* Name: Panzer-001
* Location: home
* Avg. credit: 52,782.88
* Total credit: 786,400
* BOINC version: 7.0.64
* CPU: AuthenticAMD AMD FX(tm)-8350 Eight-Core Processor [Family 21 Model 2 Stepping 0] (8 processors)
* GPU: [2] NVIDIA GeForce GTX 650 Ti (1023MB) driver: 314.22
* Operating System: Microsoft Windows 7 Ultimate x64 Edition, Service Pack 1, (06.01.7601.00)
* Last contact: 17 Sep 2013 | 15:16:19 UTC
Task:
* Name: 35x7-SANTI_RAP74wtCUBIC-5-34-RND8406_1
* Workunit: 4779214
* Created: 17 Sep 2013 | 12:01:53 UTC
* Sent: 17 Sep 2013 | 15:16:19 UTC
* Received: ---
* Server state: In progress
* Outcome: ---
* Client state: New
* Exit status: 0 (0x0)
* Computer ID: 158482
* Report deadline: 22 Sep 2013 | 15:16:19 UTC
* Run time: 0.00
* CPU time: 0.00
* Validate state: Initial
* Credit: 0.00
* Application version: Long runs (8-12 hours on fastest card) v8.14 (cuda42) |
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 |
Operator, Would a bootable Linux image be useful for you? Was planning to put one together for the memory tester anyway. Matt |
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 |
Hi, Folks I wouldn't pay much attention to the forecast. See what the actual run time is; it should be about 18 hours. |
Joined: 17 Feb 13 Posts: 181 Credit: 144,871,276 RAC: 0 |
Hi, Folks Many thanks, Jim. Regards, John |
Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 |
What could be interesting is whether these access violations already happened in 8.03 but had no visible effect, or if they're caused by some change made to the app since then. If I remember correctly, Matt only introduced the error handling with 8.11, and may have also improved the error detection. So I still think it's possible that whatever triggers the error detection now was happening before, but did not actually harm the WUs. It's just one possibility, though, and a question I don't think we can answer. Matt, would it be sufficient if you got remote access to a Titan on Win? I don't have one, but others might want to help. That would certainly be quicker than setting the system up yourself, although you might want to have some Windows system to hunt nasty bugs anyway. MrS Scanning for our furry friends since Jan 2002 |
Joined: 15 May 11 Posts: 108 Credit: 297,176,099 RAC: 0 |
"What could be interesting is whether these access violations already happened in 8.03 but had no visible effect, or if they're caused by some change made to the app since then." MrS; Looking back at the last 10 or so SANTI_RAP, NOELIA-INSP, and NATHAN_KIDKIX WUs that were run on the 8.03 app just before the switch to 8.14... http://www.gpugrid.net/results.php?hostid=158641&offset=20&show_names=1&state=0&appid= you can see that average completion times were about 20k seconds. After 8.14? Sometimes double that, due to the constant restarts. So even if error checking was introduced with version 8.11, and there may have been hidden errors when running the 8.03 app (I'm not sure how that follows logically though), the near doubling of work unit completion times immediately upon first use of the 8.14 app is enough of a smoking gun that something is amiss. And that is the real problem here, I think: the amount of time it takes a WU to complete due to all the starts and stops. That directly impacts the number of WUs that this system (and other Titan/780 equipped systems like it) can return. If you like, look at it from the perspective of the "return on the Kilowatts consumed". Now, I am perfectly happy to wait till Matt has a chance to do some testing, and see where that takes us. I'll put the second Titan GPU back in the case and continue as before until...whatever. Operator |
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0 |
A CRASHNPT was suspended, with 3% still to finish, and another was running. I suppose this happened due to the "termination by the app to avoid hangup". So I suspended the other WU, and the one that was almost finished started again but failed immediately. So is manual suspending not working properly anymore, or is it because the app had stopped the WU itself? Greetings from TJ |
Joined: 15 May 11 Posts: 108 Credit: 297,176,099 RAC: 0 |
As an example of what I was referring to above: With one Titan GPU installed and only one WU downloaded and crunching, the amount of time 'wasted' by the "Scheduler: Access violation, Waiting to Run" issue for I59R6-NATHAN_KIDc22_glu-6-10-RND3767_1 was 2 hours 47 minutes and 31 seconds of nothing happening. This data came from the stdoutdae.txt file and was imported into Excel where the 'gaps' between restarts for this WU were totalled up. So this WU could have finished in 'real time' (not GPU time) almost three hours earlier than it did and would have allowed another WU to have been mostly completed if not for all the restarts. Let me know if anybody sees this a different way. Operator |
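The spreadsheet step described above (totalling the gaps between each stop and the next restart) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the timestamps below are made up, the `total_idle` helper is hypothetical, and the stop/restart pairs are assumed to have already been pulled out of stdoutdae.txt by hand, since the exact log wording is not reproduced in this thread.

```python
from datetime import datetime, timedelta

def total_idle(intervals):
    """Sum the gaps between each stop and the restart that follows it.

    `intervals` is a list of (stopped, restarted) datetime pairs.
    """
    total = timedelta()
    for stopped, restarted in intervals:
        if restarted < stopped:
            raise ValueError("restart time precedes stop time")
        total += restarted - stopped
    return total

FMT = "%Y-%m-%d %H:%M:%S"

# Made-up example timestamps, NOT taken from any real log.
raw = [
    ("2013-09-17 10:00:00", "2013-09-17 10:45:10"),
    ("2013-09-17 12:30:00", "2013-09-17 13:15:00"),
    ("2013-09-17 15:00:00", "2013-09-17 15:20:05"),
]
intervals = [
    (datetime.strptime(stop, FMT), datetime.strptime(start, FMT))
    for stop, start in raw
]

idle = total_idle(intervals)
print("Total time spent waiting:", idle)
```

With real data, each pair would be the time a task stopped and the time the same task next restarted.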
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
I agree that it's possible that the loading and clearing of the app could use up a substantial amount of time. This again suggests that recoverable errors are now triggering the app suspension and recovery mechanism. Maybe the app just needs to be refined so that it doesn't get triggered so often. FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help |
Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0 |
Operator, It would be nice to have a 64bit Linux image with BOINC, NVidia and ATI drivers installed if that's even possible. No need for anything else. All my boxes are AMD with both NVidia and AMD GPUs. Haven't had a lot of success getting Linux running so that BOINC will work for both GPU types. |
Joined: 8 Mar 12 Posts: 411 Credit: 2,083,882,218 RAC: 0 |
Yours is doing something completely different from mine. Why, I don't know. But since mine suspends and starts another task, very little is lost. In fact, my times are pretty much unchanged. Your issue is odd and unique. |
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 |
Unsurprising. It's difficult to do, and fragile when it's done. The trick is to do the installation in this order:
* Operating System's X, mesa packages
* Nvidia driver
* Force a re-install of the X, mesa packages
* Catalyst
* Configure X server for the AMD card
* Start X
MJH |
Joined: 15 May 11 Posts: 108 Credit: 297,176,099 RAC: 0 |
To be clear, I'm referring to the difference between the WU runtime showing in the results (20+k seconds) and the actual 'real' time the computer took to complete the WU from start to finish. As an example, if you start a WU and you only have that one running, and it repeatedly starts and stops until it's finished, there will be a difference between the 'GPU runtime' and the actual clock time the WU took to complete. Unless I'm way off base, the GPU time is logged only when the WU is being actively worked. If it's "Waiting to run" I don't think that time counts. So that's why I said that there was 2 hours 47 minutes and 31 seconds of nothing happening that was essentially lost. Now, if I completely have this wrong about GPU time vs. 'real time', please jump in here and straighten me out! Operator |
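For a rough sense of scale, the arithmetic behind the run time versus wall-clock distinction can be worked through as follows. This is a sketch, not a measurement: it takes Operator's reading at face value (reported run time counts only the time the task was actively worked), and it uses a round 20,000 seconds for the reported run time, which is an assumption standing in for the "20+k seconds" mentioned above.

```python
def hms_to_seconds(hours, minutes, seconds):
    """Convert an h:m:s duration to whole seconds."""
    return hours * 3600 + minutes * 60 + seconds

# Idle time quoted in the thread: 2 hours 47 minutes 31 seconds.
idle_s = hms_to_seconds(2, 47, 31)

# Assumption: a round figure for the reported run time ("20+k seconds").
run_time_s = 20_000

# If run time only accrues while the task is active, wall-clock time
# is approximately the reported run time plus the time spent waiting.
wall_clock_s = run_time_s + idle_s
overhead = idle_s / run_time_s

print(f"idle:       {idle_s} s")
print(f"wall clock: {wall_clock_s} s")
print(f"overhead:   {overhead:.0%} on top of the reported run time")
```

Under these assumptions the waiting time adds roughly half again to the reported run time, which is in line with the "almost three hours" described earlier in the thread.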
©2025 Universitat Pompeu Fabra