acemdlong application 8.14 - discussion
Author | Message |
---|---|
Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 |
"Now, if I completely have this wrong about GPU time vs. 'real time' please jump in here and straighten me out!" You're certainly not wrong here, but looking at your tasks I actually see 2 different issues. With 8.03 you needed ~20ks for the long runs. Now you get some tasks which take ~23ks and have many access violations and subsequent restarts. Not good, for sure. These use as much CPU time as GPU time. Then there are the tasks taking 40 - 50ks with lots of "# BOINC suspending at user request (thread suspend)" and the occasional access violation thrown in for fun. Here the GPU time is twice as high as the CPU time. These ones really hurt, I think. Maybe a stupid question, but just to make sure: do you have BOINC set to use 100% CPU time? And "only run if CPU load is lower than 0%"? Do you run TThrottle? It's curious.. which user is requesting this suspension? "So even if error checking was introduced with version 8.11, and there may have been hidden errors created when running the 8.03 app (I'm not sure how that follows logically though), the near doubling of the work unit completion times immediately upon initial usage of the 8.14 app is enough of a smoking gun that there is something amiss." No doubt about something being wrong. The error I was speculating about is this: I don't know how exactly Matt's error detection works, but he certainly has to look for unusual / unwanted behaviour of the simulation. Now it could be that something fulfils his criteria, something which has been happening all along and which is not an actual error, in the sense that the simulation can simply continue despite it happening. It's just me speculating about a possibility, though, so don't spend too much time wondering about it. We can't do anything to research this. Another wild guess: if some functionality added between 8.03 and 8.14 triggers the error.. why not deactivate half of the new features (as far as possible) and try a bisection search with a new beta?
This could identify at least the offending functionality within a few days. If it's as simple as the temperature reporting, it could easily be removed for GK110 until nVidia fixes it. MrS Scanning for our furry friends since Jan 2002 |
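The bisection idea suggested above can be sketched roughly as follows. This is only an illustration, not how the project's betas are actually built: it assumes exactly one feature is responsible, that the failure reproduces reliably whenever that feature is enabled, and that features can be switched off independently. The function names and all feature names except temperature reporting are hypothetical.

```python
from typing import Callable, List, Sequence


def find_offending_feature(
    features: Sequence[str],
    fails_with: Callable[[List[str]], bool],
) -> str:
    """Return the one feature whose presence makes a build fail.

    Assumes a single culprit and a failure that reproduces reliably.
    """
    if not features:
        raise ValueError("need at least one candidate feature")
    candidates = list(features)
    while len(candidates) > 1:
        mid = len(candidates) // 2
        first_half = candidates[:mid]
        # Build a beta with only `first_half` enabled and check whether it fails.
        if fails_with(first_half):
            candidates = first_half
        else:
            candidates = candidates[mid:]
    return candidates[0]


# Hypothetical feature list; only temperature reporting is named in the thread.
features = ["temperature_reporting", "feature_b", "feature_c", "feature_d"]


def fake_beta_fails(enabled: List[str]) -> bool:
    """Stand-in for 'release a beta and wait for results' (made up)."""
    return "feature_c" in enabled


print(find_offending_feature(features, fake_beta_fails))  # prints feature_c
```

Each round halves the candidate list, so four features need two betas and eight need three. With intermittent errors like the ones reported here, each beta would need to run enough tasks to tell "fails" from "got lucky".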
Joined: 8 Mar 12 Posts: 411 Credit: 2,083,882,218 RAC: 0 |
Correct, if you only have 1 WU, and none downloaded. I'm saying, if you have one you're working on, and one that's next in line, the time lost switching between tasks won't be nearly that large. Will it still affect real time computation? Yes. But maybe by a couple of minutes. |
Joined: 15 May 11 Posts: 108 Credit: 297,176,099 RAC: 0 |
Operator, Matt; Yes it would! Operator |
Joined: 15 May 11 Posts: 108 Credit: 297,176,099 RAC: 0 |
"Correct, if you only have 1 WU, and none downloaded. I'm saying, if you have one you're working on, and one that's next in line, the time lost switching between tasks won't be nearly that large. Will it still affect real time computation? Yes. But maybe by a couple of minutes." I agree. In part. Most of the time when one WU goes into the "Waiting to run" state the next one in the queue resumes computation. But not always! There have actually been times when all WUs were showing "Waiting to run" and absolutely nothing was happening. Doesn't happen often, I'll admit. So I do have the system set to keep an additional 0.2 days' worth of work in the queue, and that does provide another WU for the system to start crunching when one is stopped for some reason. But I was referring to a specific set of circumstances where I had only one Titan installed and only one WU downloaded (nothing waiting to start). That's the worst case scenario. When I did the calculations of the 'gaps' between the stops and restarts (previous post about the 2:47:31) I was struck by the fact that the majority of the time spans in which the app was not actively working (stopped and waiting to restart) were 2 minutes and 20 seconds long. Not every gap, but most of them were precisely that long. Curious. Operator |
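For anyone who wants to repeat the gap calculation described above, here is a minimal sketch. It assumes you have already pulled (stop, restart) time pairs out of the task output by hand; the helper names are hypothetical and the timestamps below are made up for illustration, not taken from any real task. It also assumes each stop and its restart fall on the same day.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import List, Tuple

FMT = "%H:%M:%S"


def idle_gaps(pairs: List[Tuple[str, str]]) -> List[timedelta]:
    """Time between each stop and the following restart (same-day times)."""
    return [
        datetime.strptime(restart, FMT) - datetime.strptime(stop, FMT)
        for stop, restart in pairs
    ]


def summarize(gaps: List[timedelta]) -> Tuple[timedelta, timedelta, int]:
    """Return (total idle time, most common gap, how often it occurs)."""
    if not gaps:
        raise ValueError("no gaps to summarize")
    total = sum(gaps, timedelta())
    most_common, count = Counter(gaps).most_common(1)[0]
    return total, most_common, count


# Made-up (stop, restart) pairs, for illustration only.
sample = [
    ("10:00:00", "10:02:20"),
    ("11:15:30", "11:17:50"),
    ("12:40:00", "12:45:10"),
    ("14:05:05", "14:07:25"),
]

total, most_common, count = summarize(idle_gaps(sample))
print(f"total idle: {total}, most common gap: {most_common} ({count}x)")
```

A gap length that repeats exactly, like the 2 minutes 20 seconds noted above, shows up as a high count for a single value, which hints at a fixed timer rather than random delays.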
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0 |
My troublesome GTX660 has now done a Noelia LR and a Santi LR, both without any interruption! So no "Terminating to avoid lock-up" and no "BOINC suspending at user request (exit)" (whatever that may be), while with SRs this usually happens at least once. I would have thought that an LR running for more than 12 hours would be more susceptible to interruption. Could it be that the LRs are written differently than the SRs? Greetings from TJ |
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 |
I don't think a single successful completion with a Noelia long proves much. My GTX 660s successfully completed three NOELIA_INS1P tasks before erroring out on the fourth. But a mere error is not a big deal; it was the slow run on a NATHAN_KIDKIXc22 that caused me the real problem. |
Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 |
Could be lower temperatures helping you, TJ (in this case it would be hardware-related). MrS Scanning for our furry friends since Jan 2002 |
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0 |
"Could be lower temperatures helping you, TJ (in this case it would be hardware-related)." Or throttling the GPU clock a little :) However, my joy came a bit too early: I had a beta (Santi SR) again with "The simulation has become unstable. Terminating to avoid lock-up" and a downclocking again overnight. Another beta ran without any interruption. So it's a bit random. I know now for sure that I don't like 660s. Greetings from TJ |
©2025 Universitat Pompeu Fabra