More detailed server status page
| Author | Message |
|---|---|
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
Is it possible to display the batches (with the number of workunits) in the queues, not just the total number of tasks? I am thinking of something like this:

| Application | unsent | in progress | valid | invalid | error |
|---|---|---|---|---|---|
| Short runs (2-3 hours on fastest card) | 1 | 48 | 130 | 5 | 17 |
| NOELIA_467x | 1 | 48 | 130 | 5 | 17 |
| Long runs (8-12 hours on fastest card) | 114 | 2,295 | 1,400 | 34 | 420 |
| GERARD_FXCXCL12_LIG | 34 | 1,865 | 970 | 39 | 40 |
| GERARD_PTCL2_CTL_IPZ1 | 15 | 320 | 294 | 22 | 34 |
| GERARD_PTCL2_CTL_PRZ1 | 10 | 201 | 231 | 12 | 22 |
| GERARD_PTCL_CTL_IPZ2 | 11 | 181 | 175 | 10 | 12 |
| GERARD_PTCL_CTL_PRZ1 | 9 | 117 | 111 | 7 | 9 |
| GERARD_PTCTL_LFE_AIN3 | 14 | 104 | 94 | 5 | 8 |
| GERARD_PTCTL_LFE_IBP2 | 7 | 97 | 55 | 2 | 5 |
| GERARD_PTCTL_PLA2_AIN3 | 12 | 86 | 64 | 3 | 6 |
| GERARD_PTCTL_PLA2_IBP1 | 67 | 34 | 73 | 4 | 9 |
| GERARD_VACXCL12_LIG | 42 | 309 | 211 | 20 | 19 |
| SDOERR_ntl9evSSXX3 | 3 | 15 | 15 | 0 | 1 |

These are *not* the actual numbers, so they won't add up. |
|
Joined: 26 Mar 14 Posts: 101 Credit: 0 RAC: 0
|
A beta version has been released on the server_status page. I tweaked the original idea a bit, but you can see the information you wanted. I was a bit surprised by the error rate at first, but then I realised how many errors a single client can produce, so it is not that bad after all. Hopefully this data will make it easier to detect corrupted batches. |
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
> A beta version has been released on the server_status page.

Thank you!

> I tweaked the original idea a bit, but you can see the information you wanted.

Well, that's the point of it. :) Another purpose of this list is to "announce" new batches.

> I was a bit surprised by the error rate at first, but then I realised how many errors a single client can produce, so it is not that bad after all.

Oh, another idea popped into my mind while I read your words: there should be a top list of the worst hosts (most errors per day) on the performance chart. Is there a way to make a "normalized" error rate column by filtering out these worst hosts? Would such a column be more conclusive?

> Hopefully this data will make it easier to detect corrupted batches.

Hopefully you've already had something like this for internal use before, right? :) |
|
Joined: 4 Apr 09 Posts: 450 Credit: 539,316,349 RAC: 0
|
Nice work!!! How about adding "total iterations" (you know how each WU begets another; what is the total number of WUs planned for each batch?), so that "remaining" is easy to calculate? This would give us a more complete picture of the crunching "pool" and its current status. I'm not going to go into "estimated completion date of each batch" yet, but I can see someone asking that question next :-) Thanks - Steve |
|
Joined: 28 Mar 09 Posts: 490 Credit: 11,850,145,728 RAC: 301,281
|
> There should be a top list of the worst hosts (most errors per day) on the performance chart.

Before you start listing the worst hosts, it would be a good idea to set up proper criteria for this. Errors can be put into 2 categories: host errors and non-host errors (like a bad batch of WUs, or the server cancelling the units), so make sure hosts are labeled based on host errors only. OK, this is obvious, but I don't want to be labeled with a scarlet letter because of a bad batch. Also, errors that happened a while back (which are mostly bad-batch errors) should not count either. I would think that cleaning these errors out of the database should be a prerequisite. |
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
> Before you start listing the worst hosts, it would be a good idea to set up proper criteria for this.

At first, I thought of excluding from the statistics only the obviously bad hosts: those which fail on every task (for example: hosts 255774, 180977) or only occasionally finish a task (their error rate is above, say, 90%; for example: hosts 179830, 74100). But it could be a more sophisticated statistical algorithm.

> Errors can be put into 2 categories: host errors and non-host errors (like a bad batch of WUs, or the server cancelling the units), so make sure hosts are labeled based on host errors only. OK, this is obvious, but I don't want to be labeled with a scarlet letter because of a bad batch.

By "most errors per day" I meant "most errors in the past 24 hours", so this list would be refreshed automatically and fixed hosts would be cleared from it. The purpose of filtering out the worst hosts is to avoid putting a scarlet letter on a batch. That happens when the worst hosts fail workunits from a more demanding batch (presumably because the host's GPU is overclocked above its maximum), which results in misleading percentages. A "scarlet letter" on a batch could be dangerous, as it could make some crunchers selectively cancel workunits from the (mistakenly labeled) worst batches, making the whole process worse.

> Also, errors that happened a while back (which are mostly bad-batch errors) should not count either. I would think that cleaning these errors out of the database should be a prerequisite.

That's a good idea anyway. |
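To make the filtering idea above concrete, here is a minimal sketch in Python. It is not how the server_status page works. It assumes results are available as `(host_id, batch, outcome)` tuples, and the function name `error_rates`, the 90% threshold and the `min_tasks` cutoff are all hypothetical choices for illustration. Hosts whose own error rate exceeds the threshold are dropped before the per-batch rate is computed.

```python
from collections import defaultdict


def error_rates(results, host_threshold=0.9, min_tasks=5):
    """Return {batch: (raw_rate, filtered_rate)}.

    results: iterable of (host_id, batch, outcome), outcome "valid" or "error".
    host_threshold and min_tasks are illustrative assumptions, not project values.
    """
    results = list(results)

    # Per-host [errors, total], used to find hosts that fail almost everything.
    host_stats = defaultdict(lambda: [0, 0])
    for host, _batch, outcome in results:
        host_stats[host][1] += 1
        if outcome == "error":
            host_stats[host][0] += 1

    bad_hosts = {
        host
        for host, (errors, total) in host_stats.items()
        if total >= min_tasks and errors / total > host_threshold
    }

    # Per-batch [errors, total], with and without the bad hosts.
    raw = defaultdict(lambda: [0, 0])
    filtered = defaultdict(lambda: [0, 0])
    for host, batch, outcome in results:
        is_error = outcome == "error"
        raw[batch][1] += 1
        raw[batch][0] += is_error
        if host not in bad_hosts:
            filtered[batch][1] += 1
            filtered[batch][0] += is_error

    def rate(counts):
        errors, total = counts
        return errors / total if total else None

    return {batch: (rate(raw[batch]), rate(filtered[batch])) for batch in raw}


# Hypothetical data: host 1001 fails every task, host 1002 fails one in ten.
sample = (
    [(1001, "BATCH_A", "error")] * 10
    + [(1002, "BATCH_A", "valid")] * 9
    + [(1002, "BATCH_A", "error")]
)
print(error_rates(sample))
```

With this made-up sample, the raw error rate of the batch is 55%, while the rate with the always-failing host removed is 10%. The `min_tasks` cutoff is there so that a host with only one or two returned results is not excluded on thin evidence.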
|
Joined: 28 Mar 09 Posts: 490 Credit: 11,850,145,728 RAC: 301,281
|
I am not sure how the Server Status page calculates the error rate, but it looks to me as if units in progress are counted as errors until they complete successfully. Is my observation correct? |
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
> I am not sure how the Server Status page calculates the error rate, but it looks to me as if units in progress are counted as errors until they complete successfully. Is my observation correct?

I think the high error rate is a consequence of the fact that workunits which run into an error immediately (or early) are returned much sooner than those which run to completion (which takes 5, 8, 12, 24 or even more hours), so the error rate normalizes only after this period. That's why I wanted to exclude from this calculation those hosts which fail every single workunit: they do not contribute to the project but actually spam it, making these statistics misleading. |
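The timing effect described above can be illustrated with a small sketch. It assumes, purely for illustration, that all tasks of a batch are sent at hour 0, that failing tasks come back after 0.1 hours and that successful ones take 8 hours. The function name `observed_error_rate` and all numbers are hypothetical. The rate is computed only over results that have already been returned.

```python
def observed_error_rate(tasks, now_hours):
    """Error rate over tasks already returned at time now_hours.

    tasks: list of (return_time_hours, outcome), outcome "valid" or "error".
    Returns None while nothing has been returned yet.
    """
    returned = [outcome for finish, outcome in tasks if finish <= now_hours]
    if not returned:
        return None
    return returned.count("error") / len(returned)


# Hypothetical batch: 10 tasks fail after 0.1 h, 90 tasks succeed after 8 h.
batch = [(0.1, "error")] * 10 + [(8.0, "valid")] * 90

for hour in (1, 4, 8, 12):
    print(hour, observed_error_rate(batch, hour))
```

In this made-up batch the true error rate is 10%, but until the first successful results arrive at hour 8, every returned result is an error, so the observed rate is 100%. No in-progress task is counted as an error here; the early figure is inflated only because failures report back first.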
|
Joined: 26 Mar 14 Posts: 101 Credit: 0 RAC: 0
|
You are right, Retvari. I have also observed this behavior. In the first days after a new batch is launched, the error rate is pretty high. Presumably, some hosts receive WUs and either cancel them manually or abort them for various other reasons. I believe a user stops receiving WUs from a given batch after some number x of failures, and as the successful WUs complete, the error rate corrects itself. |
©2026 Universitat Pompeu Fabra