Message boards : Server and website : More detailed server status page

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 6,169
Message 41895 - Posted: 23 Sep 2015 | 15:05:40 UTC
Last modified: 23 Sep 2015 | 15:07:06 UTC

Is it possible to display the batches (with the number of workunits) in the queues, not just the total number of tasks?
I am thinking of something like this:

Application                             unsent  in progress  valid  invalid  error
Short runs (2-3 hours on fastest card)       1           48    130        5     17
  NOELIA_467x                                1           48    130        5     17
Long runs (8-12 hours on fastest card)     114        2,295  1,400       34    420
  GERARD_FXCXCL12_LIG                       34        1,865    970       39     40
  GERARD_PTCL2_CTL_IPZ1                     15          320    294       22     34
  GERARD_PTCL2_CTL_PRZ1                     10          201    231       12     22
  GERARD_PTCL_CTL_IPZ2                      11          181    175       10     12
  GERARD_PTCL_CTL_PRZ1                       9          117    111        7      9
  GERARD_PTCTL_LFE_AIN3                     14          104     94        5      8
  GERARD_PTCTL_LFE_IBP2                      7           97     55        2      5
  GERARD_PTCTL_PLA2_AIN3                    12           86     64        3      6
  GERARD_PTCTL_PLA2_IBP1                    67           34     73        4      9
  GERARD_VACXCL12_LIG                       42          309    211       20     19
  SDOERR_ntl9evSSXX3                         3           15     15        0      1

These are *not* the actual numbers, so they won't add up.

Gerard
Joined: 26 Mar 14
Posts: 101
Credit: 0
RAC: 0
Message 41904 - Posted: 25 Sep 2015 | 15:56:15 UTC - in response to Message 41895.

Beta version released on the server_status page. I tweaked the original idea a bit, but you can see the information you wanted. I was a bit surprised by the error rate at first, but later I realised how many errors a client can make, so it is not that bad after all. Hopefully this data will improve our ability to detect corrupted batches.

Retvari Zoltan
Message 41905 - Posted: 25 Sep 2015 | 17:15:54 UTC - in response to Message 41904.

Beta version released on the server_status page.
Thank you!

I tweaked the original idea a bit, but you can see the information you wanted.
Well, that's the point of it. :)
Another purpose of this list is to "announce" new batches.

I was a bit surprised by the error rate at first, but later I realised how many errors a client can make, so it is not that bad after all.
Oh, another idea popped into my mind while I read your words:
There should be a top list of the worst hosts (most errors per day) on the performance chart.
Is there a way to make a "normalized" error rate column by filtering out these worst hosts? Would such a column be more conclusive?

Hopefully this data will improve our ability to detect corrupted batches.
Hopefully you've already had something like this for internal use before, right? :)

Snow Crash
Joined: 4 Apr 09
Posts: 450
Credit: 539,316,349
RAC: 0
Message 41906 - Posted: 25 Sep 2015 | 21:08:04 UTC

Nice work!!!

How about adding "total iterations" (you know how each WU begets another - what is the total number of WUs planned for each batch?) and then the math is easy to calculate "remaining". This would give us a more complete perspective of the crunching "pool" and its current status. I'm not going to go into "estimated completion date of each batch" yet but I can see someone asking that question next :-)
____________
Thanks - Steve

Bedrich Hajek
Joined: 28 Mar 09
Posts: 467
Credit: 8,196,221,966
RAC: 10,419,715
Message 41907 - Posted: 26 Sep 2015 | 2:41:47 UTC - in response to Message 41905.

There should be a top list of the worst hosts (most errors per day) on the performance chart.
Is there a way to make a "normalized" error rate column by filtering out these worst hosts? Would such a column be more conclusive?


Before you start listing the worst hosts, it would be a good idea to set up proper criteria for this. Errors can be put into two categories: host errors and non-host errors (like a bad batch of WUs, or the server canceling the units), so make sure hosts are labeled with host errors only. OK, this is obvious, but I don't want to be labeled with a scarlet letter because of a bad batch.

Also, errors that happened a while back (which are mostly bad batch errors) should not count either. I would think that cleaning these errors out of the database should be a prerequisite.



Retvari Zoltan
Message 41913 - Posted: 28 Sep 2015 | 11:38:26 UTC - in response to Message 41907.
Last modified: 28 Sep 2015 | 11:39:15 UTC

Before you start listing the worst hosts, it would be a good idea to set up proper criteria for this.
At first, I thought of excluding from the statistics only the obviously bad hosts: those that fail every task (for example hosts 255774 and 180977) or finish a task only occasionally (their error rate is above, say, 90%; for example hosts 179830 and 74100). But a more sophisticated statistical algorithm could be used.

Errors can be put into two categories: host errors and non-host errors (like a bad batch of WUs, or the server canceling the units), so make sure hosts are labeled with host errors only. OK, this is obvious, but I don't want to be labeled with a scarlet letter because of a bad batch.
By "most errors per day" I meant "most errors in the past 24 hours", so this list would be refreshed automatically and hosts that have been fixed would drop off it.
The purpose of filtering out the worst hosts is to avoid putting a scarlet letter on a batch. When the worst hosts fail workunits from a more demanding batch (presumably because the host's GPU is overclocked above its maximum), the resulting percentages are misleading.
A "scarlet letter" on a batch could be dangerous, as it could lead some crunchers to selectively cancel workunits from the (mistakenly labeled) worst batches, making the whole process worse.
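
The filtering idea in this post can be sketched in a few lines of Python. This is only an illustration, not how the server status page works: the function name batch_error_rates, the (host_id, batch_name, outcome) tuple layout, the min_tasks cutoff and the host IDs in the sample are all assumptions made for the example. Only the 90% threshold and the 24-hour window come from the post itself.

```python
from collections import defaultdict

def batch_error_rates(results, host_threshold=0.9, min_tasks=5):
    """Return {batch: {"raw": rate, "normalized": rate or None}}.

    results: list of (host_id, batch_name, outcome) tuples, where outcome is
    "valid" or "error". The list is assumed to cover only the past 24 hours.
    """
    # Count errors and totals per host to find hosts that fail almost everything.
    per_host = defaultdict(lambda: [0, 0])  # host_id -> [errors, total]
    for host, _batch, outcome in results:
        per_host[host][0] += outcome == "error"
        per_host[host][1] += 1

    # Hypothetical rule: a host is "bad" if it returned enough tasks and
    # more than host_threshold of them were errors.
    bad_hosts = {
        host
        for host, (errors, total) in per_host.items()
        if total >= min_tasks and errors / total > host_threshold
    }

    # Accumulate per-batch counts twice: for all hosts, and without bad hosts.
    raw = defaultdict(lambda: [0, 0])
    kept = defaultdict(lambda: [0, 0])
    for host, batch, outcome in results:
        is_error = outcome == "error"
        raw[batch][0] += is_error
        raw[batch][1] += 1
        if host not in bad_hosts:
            kept[batch][0] += is_error
            kept[batch][1] += 1

    rates = {}
    for batch, (errors, total) in raw.items():
        k_errors, k_total = kept[batch]
        rates[batch] = {
            "raw": errors / total,
            # None when every result for the batch came from a bad host.
            "normalized": k_errors / k_total if k_total else None,
        }
    return rates

# Made-up data: host 1 fails all 10 tasks, host 2 fails 1 of 10.
sample = (
    [(1, "BATCH_A", "error")] * 10
    + [(2, "BATCH_A", "valid")] * 9
    + [(2, "BATCH_A", "error")]
)
rates = batch_error_rates(sample)
print(rates)
```

With this made-up sample the raw error rate of BATCH_A is 11/20 = 55%, while the normalized rate, with host 1 filtered out, is 1/10 = 10%. Sorting the per-host error counts in descending order would give the "worst hosts" top list.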

Also, errors that happened a while back (which are mostly bad batch errors) should not count either. I would think that cleaning these errors out of the database should be a prerequisite.
That's a good idea anyway.

Bedrich Hajek
Message 42012 - Posted: 17 Oct 2015 | 14:24:19 UTC - in response to Message 41904.

I am not sure how the Server Status page calculates the error rate, but it looks to me like units in progress are counted as errors until they complete successfully. Is my observation correct?



Retvari Zoltan
Message 42013 - Posted: 17 Oct 2015 | 19:26:11 UTC - in response to Message 42012.

I am not sure how the Server Status page calculates the error rate, but it looks to me like units in progress are counted as errors until they complete successfully. Is my observation correct?
I think the high error rate arises because workunits that run into an error immediately (or early) are returned much sooner than those that run to completion (which takes 5, 8, 12, 24 or even more hours), so the error rate normalizes only after this period. That's why I wanted to exclude from this calculation the hosts that fail every single workunit: they do not contribute to the project but actually spam it, making these statistics misleading.
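
A small worked example of this effect, in Python. It assumes the error rate is computed as errors divided by all returned results, which is a guess and not something confirmed in this thread; the function name observed_error_rate and all the numbers are made up for illustration.

```python
def observed_error_rate(tasks, now_hours):
    """Error rate over the results that have been returned by now_hours.

    tasks: list of (return_time_hours, outcome) tuples, outcome being
    "valid" or "error". Tasks still in progress are simply not counted.
    Returns None if nothing has been returned yet.
    """
    returned = [outcome for t, outcome in tasks if t <= now_hours]
    if not returned:
        return None
    return sum(outcome == "error" for outcome in returned) / len(returned)

# Made-up batch of 100 workunits sent at hour 0:
# 10 fail after about 6 minutes, 90 finish successfully after 10 hours.
tasks = [(0.1, "error")] * 10 + [(10.0, "valid")] * 90

early = observed_error_rate(tasks, 1.0)   # only the failures are back
late = observed_error_rate(tasks, 12.0)   # everything is back
print(early, late)
```

One hour after the batch is sent, only the 10 failed results have come back, so the observed error rate is 100%. After 12 hours all 100 results are in and the rate settles at 10%, even though no in-progress unit was ever counted as an error.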

Gerard
Message 42019 - Posted: 19 Oct 2015 | 9:17:32 UTC - in response to Message 42013.

You are right, Retvari. I have also observed this behavior. In the first days after a new batch is launched, the error rate is pretty high. Presumably some hosts receive WUs and either the user cancels them manually or they abort for various other reasons. I believe a user stops receiving WUs from a certain batch after x failures, and as the successful WUs complete, the error rate corrects itself.
