acemdlong application 8.14 - discussion

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 33073 - Posted: 18 Sep 2013, 20:02:03 UTC - in response to Message 33069.  

Now, if I completely have this wrong about GPU time vs. 'real time' please jump in here and straighten me out!

You're certainly not wrong here.. but looking at your tasks I actually see 2 different issues. With 8.03 you needed ~20ks for the long runs.

Now you get some tasks which take ~23ks and have many access violations and subsequent restarts. Not good, for sure. These use as much CPU as GPU time.

Then there are the tasks taking 40 - 50ks with lots of
"# BOINC suspending at user request (thread suspend)"
and the occasional access violation thrown in for fun. Here the GPU time is twice as high as CPU time. These ones really hurt, I think.

Maybe a stupid question, but just to make sure: do you have BOINC set to use 100% CPU time? And "only run if CPU load is lower than 0%"? Do you run TThrottle? It's curious.. which user is requesting this suspension?

So even if error checking was introduced with version 8.11, and there may have been hidden errors created when running the 8.03 app (I'm not sure how that follows logically though), the near doubling of the work unit completion times immediately upon initial usage of the 8.14 app is enough of a smoking gun that there is something amiss.

No doubt about something being wrong. The error I was speculating about is this: I don't know exactly how Matt's error detection works, but he certainly has to look for unusual / unwanted behaviour of the simulation. Now it could be that something fulfils his criteria which has been happening all along and which is not an actual error, in the sense that the simulation can simply continue despite it.

It's just me speculating about a possibility, though, so don't spend too much time wondering about it. We can't do anything to research this.

Another wild guess: if some functionality added between 8.03 and 8.14 triggers the error, why not deactivate half of the additions (as far as possible) and try a bisection search with a new beta? This could identify at least the offending functionality within a few days. If it's as simple as the temperature reporting, it could easily be removed for GK110 until nVidia fixes it.
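To make the bisection idea concrete, here is a minimal sketch in Python. It assumes exactly one new feature is responsible and that features can be switched off independently of each other; neither is confirmed for the actual app. The feature list and the `fails_with` check are hypothetical stand-ins (only error detection and temperature reporting are named in this thread); in practice each call to `fails_with` would be one beta release tested by volunteers.

```python
from typing import Callable, List, Sequence


def bisect_offender(
    features: Sequence[str],
    fails_with: Callable[[List[str]], bool],
) -> str:
    """Return the one feature whose presence makes the test fail.

    Assumes a single offending feature and independent features.
    """
    candidates = list(features)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        # One "beta round": enable only this half of the remaining candidates.
        if fails_with(half):
            candidates = half
        else:
            candidates = candidates[len(half):]
    return candidates[0]


# Hypothetical feature list; "feature C" and "feature D" are placeholders.
features = ["error detection", "temperature reporting", "feature C", "feature D"]

# Stand-in for a real beta test: pretend temperature reporting is the culprit.
culprit = bisect_offender(
    features, lambda enabled: "temperature reporting" in enabled
)
print(culprit)
```

With n candidate features this needs roughly log2(n) beta rounds, which is why a few days could be enough.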

MrS
Scanning for our furry friends since Jan 2002
ID: 33073 · Rating: 0
5pot

Joined: 8 Mar 12
Posts: 411
Credit: 2,083,882,218
RAC: 0
Message 33078 - Posted: 18 Sep 2013, 20:21:45 UTC

Correct, if you only have 1 WU, and none downloaded. I'm saying, if you have one you're working on, and one that's next in line, the time lost switching between tasks won't be nearly that large. Will it still affect real time computation? Yes. But maybe by a couple of minutes.
ID: 33078 · Rating: 0
Operator

Joined: 15 May 11
Posts: 108
Credit: 297,176,099
RAC: 0
Message 33092 - Posted: 19 Sep 2013, 13:54:31 UTC - in response to Message 33043.  

Operator,

Would a bootable Linux image be useful for you?
Was planning to put one together for the memory tester anyway.

Matt


Matt;

Yes it would!

Operator

ID: 33092 · Rating: 0
Operator

Joined: 15 May 11
Posts: 108
Credit: 297,176,099
RAC: 0
Message 33093 - Posted: 19 Sep 2013, 14:13:10 UTC - in response to Message 33078.  

Correct, if you only have 1 WU, and none downloaded. I'm saying, if you have one you're working on, and one that's next in line, the time lost switching between tasks won't be nearly that large. Will it still affect real time computation? Yes. But maybe by a couple of minutes.


I agree. In part.

Most of the time when one WU goes into the "Waiting to run" state the next one in the queue resumes computation. But not always!

There have actually been times when all WUs were showing "Waiting to run" and absolutely nothing was happening. Doesn't happen often I'll admit.

So I do have the system set to have an additional 0.2 days worth of work in the queue and that does provide another WU for the system to start crunching when one is stopped for some reason.

But I was referring to a specific set of circumstances where I had only one Titan installed and only one WU downloaded (nothing waiting to start). That's the worst case scenario.

When I calculated the 'gaps' between the stops and restarts (previous post about the 2:47:31) I was struck by the fact that most of the time spans in which the app was not actively working (stopped and waiting to restart) were 2 minutes and 20 seconds long. Not every gap, but most of them were precisely that long. Curious.
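For anyone who wants to repeat this kind of gap calculation, here is a rough Python sketch. The timestamps below are made up purely for illustration and are not taken from my logs, and the `(timestamp, kind)` event format is an assumption; the real stop and restart times would have to be pulled out of the task output by hand or with a parser of your own.

```python
from collections import Counter
from datetime import datetime, timedelta


def idle_gaps(events):
    """Yield the time between each 'stop' and the following 'restart'."""
    stopped_at = None
    for timestamp, kind in events:
        if kind == "stop":
            stopped_at = timestamp
        elif kind == "restart" and stopped_at is not None:
            yield timestamp - stopped_at
            stopped_at = None


# Illustrative, invented timestamps (assumed event format: time, kind).
fmt = "%H:%M:%S"
events = [
    (datetime.strptime("10:00:00", fmt), "stop"),
    (datetime.strptime("10:02:20", fmt), "restart"),
    (datetime.strptime("11:15:00", fmt), "stop"),
    (datetime.strptime("11:17:20", fmt), "restart"),
    (datetime.strptime("12:40:10", fmt), "stop"),
    (datetime.strptime("12:41:00", fmt), "restart"),
]

gaps = list(idle_gaps(events))
total_idle = sum(gaps, timedelta())  # total time spent not crunching
most_common_gap, count = Counter(gaps).most_common(1)[0]
print(total_idle, most_common_gap, count)
```

Counting the gaps by exact length is what makes a recurring value such as 2 minutes 20 seconds stand out from the occasional odd one.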

Operator

ID: 33093 · Rating: 0
TJ

Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Message 33290 - Posted: 30 Sep 2013, 17:51:08 UTC

My troublesome GTX660 has now done a Noelia LR and a Santi LR, both without any interruption! So no "Terminating to avoid lock-up" and no "BOINC suspending at user request (exit)" (whatever that may be).
With SRs this mostly happens at least once, although I would have thought that an LR running for more than 12 hours would be more susceptible to interruption.
Could it be that the LRs are written differently from the SRs?

Greetings from TJ
ID: 33290 · Rating: 0
Jim1348

Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 33291 - Posted: 30 Sep 2013, 18:22:23 UTC - in response to Message 33290.  

I don't think a single successful completion with a Noelia long proves much. My GTX 660s successfully completed three NOELIA_INS1P before erroring out on the fourth. But a mere error is not a big deal; it was the slow run on a NATHAN_KIDKIXc22 that caused me the real problem.
ID: 33291 · Rating: 0
ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 33314 - Posted: 1 Oct 2013, 21:07:24 UTC

Could be lower temperatures helping you, TJ (in this case it would be hardware-related).

MrS
Scanning for our furry friends since Jan 2002
ID: 33314 · Rating: 0
TJ

Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Message 33316 - Posted: 1 Oct 2013, 21:23:39 UTC - in response to Message 33314.  

Could be lower temperatures helping you, TJ (in this case it would be hardware-related).

MrS

Or throttling the GPU clock a little :)
However, it was a bit of joy too early: I had a beta (Santi SR) again with "The simulation has become unstable. Terminating to avoid lock-up" and a downclocking again overnight. Another beta ran without any interruption. So it's a bit random. I know now for sure that I don't like 660s.
Greetings from TJ
ID: 33316 · Rating: 0
©2025 Universitat Pompeu Fabra