More detailed server status page

Message boards : Server and website : More detailed server status page
Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Message 41895 - Posted: 23 Sep 2015, 15:05:40 UTC
Last modified: 23 Sep 2015, 15:07:06 UTC

Is it possible to display the batches (with their number of workunits) in the queues, not just the total number of tasks?
I'm thinking of something like this:

Application                             unsent   in progress   valid   invalid   error

Short runs (2-3 hours on fastest card)       1            48     130         5      17
NOELIA_467x                                  1            48     130         5      17

Long runs (8-12 hours on fastest card)     114         2,295   1,400        34     420
GERARD_FXCXCL12_LIG                         34         1,865     970        39      40
GERARD_PTCL2_CTL_IPZ1                       15           320     294        22      34
GERARD_PTCL2_CTL_PRZ1                       10           201     231        12      22
GERARD_PTCL_CTL_IPZ2                        11           181     175        10      12
GERARD_PTCL_CTL_PRZ1                         9           117     111         7       9
GERARD_PTCTL_LFE_AIN3                       14           104      94         5       8
GERARD_PTCTL_LFE_IBP2                        7            97      55         2       5
GERARD_PTCTL_PLA2_AIN3                      12            86      64         3       6
GERARD_PTCTL_PLA2_IBP1                      67            34      73         4       9
GERARD_VACXCL12_LIG                         42           309     211        20      19
SDOERR_ntl9evSSXX3                           3            15      15         0       1

These are *not* the actual numbers, so they won't add up.
Gerard

Joined: 26 Mar 14
Posts: 101
Credit: 0
RAC: 0
Message 41904 - Posted: 25 Sep 2015, 15:56:15 UTC - in response to Message 41895.  

A beta version has been released on the server_status page. I tweaked the original idea a bit, but you can see the information you wanted. I was a bit surprised by the error rate at first, but then I realised how many errors a single client can produce, so it is not that bad after all. Hopefully this data will make it easier to detect corrupted batches.
Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Message 41905 - Posted: 25 Sep 2015, 17:15:54 UTC - in response to Message 41904.  

A beta version has been released on the server_status page.
Thank you!

I tweaked the original idea a bit, but you can see the information you wanted.
Well, that's the point of it. :)
The other purpose of this list is to "announce" new batches.

I was a bit surprised by the error rate at first, but then I realised how many errors a single client can produce, so it is not that bad after all.
Oh, another idea popped into my mind while reading your words:
There should be a top list of the worst hosts (most errors per day) on the performance chart.
Is there a way to make a "normalized" error rate column by filtering out these worst hosts? Would such a column be more conclusive?

Hopefully this data will make it easier to detect corrupted batches.
Hopefully you've already had something like this for internal use before, right? :)
Snow Crash

Joined: 4 Apr 09
Posts: 450
Credit: 539,316,349
RAC: 0
Message 41906 - Posted: 25 Sep 2015, 21:08:04 UTC

Nice work!!!

How about adding "total iterations" (you know how each WU begets another: what is the total number of WUs planned for each batch?), so that "remaining" is easy to calculate. This would give us a more complete picture of the crunching "pool" and its current status. I'm not going to go into "estimated completion date of each batch" yet, but I can see someone asking that question next :-)
Thanks - Steve
Bedrich Hajek

Joined: 28 Mar 09
Posts: 490
Credit: 11,850,145,728
RAC: 301,281
Message 41907 - Posted: 26 Sep 2015, 2:41:47 UTC - in response to Message 41905.  

There should be a top list of the worst hosts (most errors per day) on the performance chart.
Is there a way to make a "normalized" error rate column by filtering out these worst hosts? Would such a column be more conclusive?


Before you start listing the worst hosts, it would be a good idea to set up proper criteria for this. Errors can be put into 2 categories: host errors and non-host errors (like a bad batch of WUs, or the server canceling the units), so make sure hosts are labeled with host errors only. Ok, this is obvious, but I don't want to be labeled with a scarlet letter because of a bad batch.

Also, errors that happened a while back (which are mostly bad batch errors) should not count either. I would think that cleaning these errors out of the database should be a prerequisite.
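
A minimal sketch of the two criteria above (count host errors only, and ignore old errors), assuming each error record carries a host ID, a category, and a report time. The category labels, the record layout, and the host IDs in the sample data are all made up for illustration; this is not how the project's server classifies errors.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical category labels; a real server would need its own classification.
HOST_ERROR = "host"
NON_HOST_ERROR = "non_host"  # e.g. bad batch, server-side cancellation


def worst_hosts(errors, now, window=timedelta(hours=24), top=3):
    """Rank hosts by host-attributable errors reported within `window` of `now`.

    `errors` is an iterable of (host_id, category, reported_at) tuples.
    Non-host errors and errors older than the window are ignored.
    """
    cutoff = now - window
    counts = Counter(
        host_id
        for host_id, category, reported_at in errors
        if category == HOST_ERROR and reported_at >= cutoff
    )
    return counts.most_common(top)


# Made-up sample data.
now = datetime(2015, 9, 26, 12, 0)
errors = [
    (1, HOST_ERROR, now - timedelta(hours=1)),
    (1, HOST_ERROR, now - timedelta(hours=2)),
    (2, NON_HOST_ERROR, now - timedelta(hours=1)),  # bad batch: not counted
    (3, HOST_ERROR, now - timedelta(days=3)),       # too old: not counted
    (4, HOST_ERROR, now - timedelta(hours=5)),
]
print(worst_hosts(errors, now))
```

With this sample, only hosts 1 and 4 appear in the list; host 2 is spared because its error came from a bad batch, and host 3 because its error is outside the window.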



Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Message 41913 - Posted: 28 Sep 2015, 11:38:26 UTC - in response to Message 41907.  
Last modified: 28 Sep 2015, 11:39:15 UTC

Before you start listing the worst hosts, it would be a good idea to set up proper criteria for this.
At first I thought of excluding from the statistics only the obviously bad hosts: those which fail every task (for example hosts 255774 and 180977) or only occasionally finish one (error rate above, say, 90%; for example hosts 179830 and 74100). But a more sophisticated statistical algorithm could be used.

Errors can be put into 2 categories: host errors and non-host errors (like a bad batch of WUs, or the server canceling the units), so make sure hosts are labeled with host errors only. Ok, this is obvious, but I don't want to be labeled with a scarlet letter because of a bad batch.
By "most errors per day" I meant "most errors in the past 24 hours", so this list would refresh automatically and fixed hosts would drop off it.
The purpose of filtering out the worst hosts is to avoid putting a scarlet letter on a batch just because the worst hosts fail workunits from a more demanding batch (presumably because the host's GPU is overclocked beyond its limit), which results in misleading percentages.
A "scarlet letter" on a batch could be dangerous, as it could lead some crunchers to selectively cancel workunits from the (mistakenly labeled) worst batches, making the whole process worse.

Also, errors that happened a while back (which are mostly bad batch errors) should not count either. I would think that cleaning these errors out of the database should be a prerequisite.
That's a good idea anyway.
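
A rough sketch of the "normalized" error rate idea, assuming the simple rule of dropping every host whose own error rate is above 90%. The function name, the record layout, and the sample hosts are hypothetical, and as a simplification each host's error rate is computed from the same list of results rather than from its full history.

```python
def batch_error_rates(results, host_threshold=0.9):
    """Return (raw, normalized) error rates for one batch.

    `results` is a list of (host_id, failed) pairs, one per returned task.
    Hosts whose own error rate exceeds `host_threshold` are left out of
    the normalized figure.
    """
    per_host = {}  # host_id -> (tasks returned, tasks failed)
    for host_id, failed in results:
        total, errors = per_host.get(host_id, (0, 0))
        per_host[host_id] = (total + 1, errors + int(failed))

    def rate(hosts):
        total = sum(per_host[h][0] for h in hosts)
        errors = sum(per_host[h][1] for h in hosts)
        return errors / total if total else 0.0

    good = [h for h, (t, e) in per_host.items() if e / t <= host_threshold]
    return rate(per_host), rate(good)


# Made-up sample: hosts A and B are healthy, host C fails every task.
results = (
    [("A", False)] * 9 + [("A", True)]
    + [("B", False)] * 10
    + [("C", True)] * 10
)
raw, normalized = batch_error_rates(results)
print(f"raw: {raw:.1%}, normalized: {normalized:.1%}")
```

In this sample a single always-failing host lifts the raw rate to about 37%, while the normalized rate over the remaining hosts is 5%.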
Bedrich Hajek

Joined: 28 Mar 09
Posts: 490
Credit: 11,850,145,728
RAC: 301,281
Message 42012 - Posted: 17 Oct 2015, 14:24:19 UTC - in response to Message 41904.  

I am not sure how the Server Status page calculates the error rate, but it looks to me like units in progress are counted as errors until they complete successfully. Is my observation correct?



Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Message 42013 - Posted: 17 Oct 2015, 19:26:11 UTC - in response to Message 42012.  

I am not sure how the Server Status page calculates the error rate, but it looks to me like units in progress are counted as errors until they complete successfully. Is my observation correct?
I think the error rate looks high because workunits that fail immediately (or early) are returned much sooner than those that run to completion (which takes 5, 8, 12, 24 or even more hours), so the error rate only normalizes after that period. That's why I wanted to exclude hosts that fail every single workunit from this calculation: they do not contribute to the project but effectively spam it, making these statistics misleading.
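
A toy illustration of that effect, under the assumptions that every task starts at batch launch and counts as returned once its runtime has elapsed. The runtimes and the 10% failure share are invented numbers, not project data.

```python
def observed_error_rate(tasks, hours_since_launch):
    """Error rate among the tasks already returned at a given time.

    `tasks` is a list of (runtime_hours, failed) pairs. Returns None
    when no task has been returned yet.
    """
    returned = [failed for runtime, failed in tasks if runtime <= hours_since_launch]
    if not returned:
        return None
    return sum(returned) / len(returned)


# Made-up batch: 10 tasks fail after 0.1 h, 90 tasks succeed after 10 h.
tasks = [(0.1, True)] * 10 + [(10.0, False)] * 90
print(observed_error_rate(tasks, 1))   # only the failures are back
print(observed_error_rate(tasks, 12))  # everything is back
```

One hour after launch every returned task is an error, so the observed rate is 100%; once the long tasks come back it settles at the true 10%.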
Gerard

Joined: 26 Mar 14
Posts: 101
Credit: 0
RAC: 0
Message 42019 - Posted: 19 Oct 2015, 9:17:32 UTC - in response to Message 42013.  

You are right, Retvari. I have observed this behavior too. In the first days after a new batch is launched, the error rate is pretty high. Presumably some hosts receive WUs and either cancel them manually or abort them for various other reasons. I believe a user stops receiving WUs from a given batch after some number x of failures, and as the successful WUs complete, the error rate corrects itself.
