User reset for host error count
| Author | Message |
|---|---|
Bikermatt (Joined: 8 Apr 10, Posts: 37, Credit: 4,422,457,619, RAC: 93) |
Right now I have a host that cannot get tasks because it was returning errors on a lot of tasks. I think one of the video cards may be bad, or I may have a driver issue. Either way, this host will not get tasks from GPUGRID for a while, which makes tracking down the problem very hard. I switched out a video card, but by the time the host starts getting tasks again I may not be around. If I didn't fix the issue and the errors continue, it could be a really long time before I get tasks again, making the problem even harder to fix. I'm glad GPUGRID shuts down my hosts, because it saves bandwidth and alerts me that there is a problem. What would be nice, though, is somewhere I could go to manually reset a host's error count. That would allow troubleshooting and get hosts back online sooner once the user sees that there is a problem. Is there any way this is possible? |
skgiven (Joined: 23 Apr 09, Posts: 3968, Credit: 1,995,359,260, RAC: 0) |
For some reason (probably system related) you started to get runaway failures:

| Task ID | Work unit ID | Sent | Time reported | Status | Run time (sec) | CPU time (sec) | Claimed credit | Granted credit | Application |
|---|---|---|---|---|---|---|---|---|---|
| 3590506 | 2256661 | 20 Jan 2011 21:39:05 UTC | 20 Jan 2011 21:41:40 UTC | Error while computing | 3.10 | 1.00 | 7,491.18 | --- | ACEMD2: GPU molecular dynamics v6.12 (cuda) |
| 3590481 | 2256651 | 20 Jan 2011 21:30:06 UTC | 20 Jan 2011 21:32:57 UTC | Error while computing | 3.09 | 1.09 | 7,491.18 | --- | ACEMD2: GPU molecular dynamics v6.12 (cuda) |
| 3590465 | 2256641 | 20 Jan 2011 21:32:57 UTC | 20 Jan 2011 21:35:42 UTC | Error while computing | 2.09 | 0.66 | 0.00 | --- | ACEMD2: GPU molecular dynamics v6.12 (cuda) |
| 3589838 | 2256236 | 20 Jan 2011 21:58:20 UTC | 20 Jan 2011 22:02:25 UTC | Error while computing | 3.49 | 1.84 | 7,645.29 | --- | ACEMD2: GPU molecular dynamics v6.12 (cuda) |
| 3589656 | 2256194 | 20 Jan 2011 21:44:37 UTC | 20 Jan 2011 21:47:23 UTC | Error while computing | 4.12 | 1.86 | 7,645.29 | --- | ACEMD2: GPU molecular dynamics v6.12 (cuda) |
| 3589635 | 2256184 | 20 Jan 2011 21:41:41 UTC | 20 Jan 2011 21:44:36 UTC | Error while computing | 2.09 | 1.84 | 7,645.29 | --- | ACEMD2: GPU molecular dynamics v6.12 (cuda) |
| 3589571 | 2256141 | 20 Jan 2011 21:35:46 UTC | 20 Jan 2011 21:39:04 UTC | Error while computing | 2.09 | 1.78 | 7,645.29 | --- | ACEMD2: GPU molecular dynamics v6.12 (cuda) |
| 3588292 | 2255374 | 20 Jan 2011 10:16:36 UTC | 20 Jan 2011 22:01:07 UTC | Error while computing | 33,279.59 | 3,327.99 | 7,903.02 | --- | ACEMD2: GPU molecular dynamics v6.12 (cuda) |
| 3587776 | 2254389 | 20 Jan 2011 6:57:29 UTC | 20 Jan 2011 21:30:06 UTC | Error while computing | 43,945.54 | 4,487.77 | 7,903.02 | --- | ACEMD2: GPU molecular dynamics v6.12 (cuda) |

If the system could determine that your computer had been restarted, it could start resending tasks to your system. One for the BOINC developers, perhaps. For now, the alternative suggestion of giving the user the option to reset the error count would be useful, though I would suggest capping it (you can only pick up 1 task per GPU per hour) until a task is returned successfully. Alternatively, test work units (say around 10 min long) could be created. If the cruncher can run these, this would in itself allow new tasks to be sent under the existing system. |
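The capped reset suggested above could look roughly like the sketch below. It is only an illustration of the idea in this post, not how the BOINC scheduler is implemented: `HostQuota`, `user_reset`, `can_send`, and the rolling one-hour window are all hypothetical names and assumptions, and the normal quota rules that apply outside probation are left out.

```python
from dataclasses import dataclass, field

SECONDS_PER_HOUR = 3600


@dataclass
class HostQuota:
    """Hypothetical per-host state; not the real BOINC data model."""
    gpu_count: int
    probation: bool = False  # True after a user-requested error reset
    sent_times: list = field(default_factory=list)  # send times while on probation

    def user_reset(self) -> None:
        # The user clears the error count; the host enters capped probation.
        self.probation = True
        self.sent_times.clear()

    def can_send(self, now: float) -> bool:
        # Outside probation this sketch imposes no limit of its own.
        if not self.probation:
            return True
        # On probation: at most 1 task per GPU within the last hour.
        recent = [t for t in self.sent_times if now - t < SECONDS_PER_HOUR]
        return len(recent) < self.gpu_count

    def record_sent(self, now: float) -> None:
        if self.probation:
            self.sent_times.append(now)

    def record_result(self, success: bool) -> None:
        # A successful return lifts the cap; a failure keeps it in place.
        if success:
            self.probation = False
            self.sent_times.clear()
```

With two GPUs, a host that has just been reset would get two tasks straight away and then nothing more for roughly an hour unless one of them comes back successfully.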
Stoneageman (Joined: 25 May 09, Posts: 224, Credit: 34,057,374,498, RAC: 0) |
As there are quite a few rogue work units about now, the server is penalizing healthy GPUs. This can leave them idle for many hours if one is unlucky enough to get two of these duff tasks in a row! The reset option discussed here needs to be looked at sooner rather than later. Less patient crunchers may quit and forget to come back, lol. |
skgiven (Joined: 23 Apr 09, Posts: 3968, Credit: 1,995,359,260, RAC: 0) |
Any chance of introducing a system whereby crunchers are sent an auto-generated email if tasks continuously fail? It could contain a suggestion to restart, a link to the recommended driver, and the FAQ. If the system is reset or their driver updated, they could be allowed to do an error reset that permits up to, say, 5 failures. An email could also be sent to a CA, for guidance, if needed. Thanks, |
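A rough sketch of how the email trigger and the 5-failure allowance might fit together is given below. Everything in it is hypothetical: the class and method names are made up for illustration, and the streak length that triggers the email (3 here) is an assumption, since the post only gives the figure of 5 failures for the reset allowance.

```python
from dataclasses import dataclass

FAILURE_STREAK_FOR_EMAIL = 3  # assumed threshold; not specified in the thread
RESET_FAILURE_ALLOWANCE = 5   # "up to, say, 5 failures" from the post


@dataclass
class HostErrorState:
    """Hypothetical failure tracking for one host."""
    consecutive_failures: int = 0
    allowance: int = 0      # failures still tolerated after a user reset
    notified: bool = False  # whether the email for this streak was sent

    def record_result(self, success: bool) -> bool:
        """Update state; return True if an email should be sent now."""
        if success:
            self.consecutive_failures = 0
            self.allowance = 0
            self.notified = False
            return False
        self.consecutive_failures += 1
        if self.allowance > 0:
            self.allowance -= 1
        if self.consecutive_failures >= FAILURE_STREAK_FOR_EMAIL and not self.notified:
            self.notified = True  # send only one email per failure streak
            return True
        return False

    def user_error_reset(self) -> None:
        # The cruncher has restarted or updated the driver and asks for a reset.
        self.consecutive_failures = 0
        self.allowance = RESET_FAILURE_ALLOWANCE
        self.notified = False

    def is_blocked(self) -> bool:
        # Blocked once the streak is long enough and no allowance remains.
        return (self.consecutive_failures >= FAILURE_STREAK_FOR_EMAIL
                and self.allowance == 0)
```

Under these assumptions a host is blocked after 3 straight failures, and after a user reset it can fail up to 5 more times before being blocked again; any successful task clears the state.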
©2025 Universitat Pompeu Fabra