Experimental Python tasks (beta) - task description
**Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 4,772**

Are newer tasks using more VRAM? Or is there something on your system using more VRAM? What is the breakdown of VRAM used by the different processes? That will tell you which process is actually using it.
**Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1**

> What is the breakdown of VRAM used by the different processes? That will tell you which process is actually using it.

Hm, I will have to find a tool that tells me :-) Any recommendation?
**Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 662**

> What is the breakdown of VRAM used by the different processes?

nvidia-smi in the Terminal does nicely.
**Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 4,772**

Check here for nvidia-smi use on Windows. It's easy on Linux, but less intuitive on Windows: https://stackoverflow.com/questions/57100015/how-do-i-run-nvidia-smi-on-windows
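A minimal sketch of that per-process breakdown, assuming a driver recent enough to support nvidia-smi's query flags; the hard-coded Windows fallback path is an assumption and varies by driver version:

```python
# Print per-process VRAM usage via nvidia-smi (ships with the NVIDIA
# driver). Assumes nvidia-smi is on PATH; the Windows fallback location
# is a guess and differs between driver versions.
import shutil
import subprocess

smi = shutil.which("nvidia-smi") or r"C:\Windows\System32\nvidia-smi.exe"
result = subprocess.run(
    [smi, "--query-compute-apps=pid,process_name,used_memory",
     "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)  # one line per compute process: pid, name, VRAM used
```

Plain `nvidia-smi` with no arguments prints the same per-process table at the bottom of its output.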
**Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1**

My hosts still keep receiving faulty tasks which are totally "fresh", not re-submitted ones. So there must be tons of those still in the bucket :-( Just noticed that a task failed after more than 19 hours. This is not nice :-(
**Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1**

> My hosts still keep receiving faulty tasks which are totally "fresh", not re-submitted ones.

I was out for a few hours, and when I came back I noticed 2 more failed tasks (both ran for almost 3 hours before they crashed). Whereas at the beginning of the problem the tasks failed within a short time, as Abouh also noted, so there was not too much waste, these tasks now fail only after several hours. Within the past 24 hours, my hosts' total computation time on all the failing tasks was 104,526 seconds = 29 hours!

I am very much willing to support the science with my time, my equipment and my permanently increasing electricity bill, as long as it makes sense (and as long as I can afford it). FYI, my electricity costs have more than tripled since the beginning of the year, for known reasons. That's significant!

I simply cannot believe that all these faulty tasks in the big download bucket cannot be stopped, retrieved, cancelled or whatever else. It makes absolutely no sense to leave them in there and send them out to us for the next several weeks.

If the GPUGRID people cannot confirm that they will find a way to stop these faulty tasks quickly, I have no other choice, as sorry as I am, but to switch to other projects :-(
**Joined: 22 May 20 · Posts: 110 · Credit: 115,525,136 · RAC: 0**

For the time being, I have suspended receiving new tasks and reverted back to E@H & F@H until this situation with faulty tasks has been sorted out.
**Joined: 27 Jul 11 · Posts: 138 · Credit: 539,953,398 · RAC: 0**

Most peculiar: I have had no failed tasks, seven successes so far. I wish that with internet problems we could also get a standby task. Maybe they are sending these tasks to those machines crunching multiple WUs, which can quickly clear up the backlog :)
**Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0**

I have reviewed your recent tasks and there is a mix of faulty and successful ones. The successful ones are newer and are the only ones being submitted now. I could not figure out how to cancel the faulty tasks earlier. However, they should be almost all, if not all, crunched by now. Maybe other hosts can confirm whether they are still getting tasks that crash, but I expect the problem to be solved now. For the last 2-3 days only good tasks have been sent.
**Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 318**

@Erich56: you have to look into the history and the reason for the crashes. I got one of the last replications from workunit 27327972 last night, but that is one that was created on 16 October, almost a week ago. It's just that the first owner hung on to it for five days and did nothing. That's not the project's fault, even if the initial error was.
**Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1**

> For the last 2-3 days only good tasks have been sent.

Thanks, Abouh, for your reply. When you say what I quoted above, you are talking about "fresh" tasks, right? However, repetitions (up to 8) of the former faulty tasks are still going out. Just an example: a task which one of my hosts received this morning and which failed after about 2 1/2 hours: https://www.gpugrid.net/result.php?resultid=33112434
**Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 318**

Likewise. Since I posted, I've received another one which is likely to go the same way, from workunit 27328975. Another 5-day no-show by a Science United user.
**Joined: 16 Oct 22 · Posts: 12 · Credit: 1,382,500 · RAC: 0**

Maybe that person misjudged how long these tasks can take. If that person asked for a stock of tasks for 7 or 10 days, some will probably never start on that machine before the deadline.
**Joined: 27 Jul 11 · Posts: 138 · Credit: 539,953,398 · RAC: 0**

> Maybe that person misjudged how long these tasks can take. If that person asked for a stock of tasks for 7 or 10 days, some will probably never start on that machine before the deadline.

Mine are set to ten plus ten days, but I still got one. This is not the reason.
**Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 4,772**

> For the last 2-3 days only good tasks have been sent.

When you get a resend, especially a high-number resend like that, check the reason it was resent so many times. If there are tons of errors, it's probably safe to just abort it and not waste your time on it, especially when you know a bunch of bad tasks went out recently.
**Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 4,772**

> Maybe that person misjudged how long these tasks can take. If that person asked for a stock of tasks for 7 or 10 days, some will probably never start on that machine before the deadline.

Ideally, when a task approaches its deadline it should jump into high-priority mode and move to the front of the line for scheduling. But the process doesn't always work ideally with BOINC. There are also many people who blindly download tasks and then shut off their computers for extended periods of time.
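For what it's worth, a toy model of that deadline check (a simplification, not the real client scheduler, and the 0.8 "on fraction" is an invented example value): the client projects the finish time from the remaining runtime and switches the task to high priority when it would miss the deadline.

```python
# Toy model of BOINC's deadline-miss check; not the actual client code.
import time

def needs_high_priority(deadline_ts: float, remaining_runtime_s: float,
                        on_fraction: float = 0.8) -> bool:
    """on_fraction approximates the share of wall time the host computes."""
    projected_finish = time.time() + remaining_runtime_s / on_fraction
    return projected_finish > deadline_ts

# Example: 20 h of work left, deadline 24 h away, host computing 80% of
# the time -> 20 / 0.8 = 25 h > 24 h, so the task would run high priority.
```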
**Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0**

Yes, I meant fresh tasks, which are sent out for the first time of 8 possible attempts. And yes, repetitions are an issue. I understand why the limit was set to a relatively high value: many machines in the network have limited GPU memory (e.g. 2 GB) or configuration problems and inevitably fail these tasks, so the high value gave the experiments some error tolerance. Ideally, however, I would like to be able to modify it momentarily, just for the Python apps, in cases like this one: I could set it to 1 for a few hours so all bad tasks are processed quickly, and then go back to 8.
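For reference, a stock BOINC server keeps that limit per workunit in the workunit.max_error_results column, so in principle the temporary change is a one-off database update. A rough sketch, assuming direct database access; the app name "PythonGPU", the user and the database name are placeholders, and a real project would go through its own admin tooling:

```python
# Rough sketch: temporarily drop max_error_results to 1 for one app's
# workunits on a stock BOINC server schema. App name, user and database
# name are placeholders.
import mysql.connector  # pip install mysql-connector-python

db = mysql.connector.connect(user="boincadm", database="gpugrid_db")
cur = db.cursor()
cur.execute(
    "UPDATE workunit SET max_error_results = 1 "
    "WHERE appid = (SELECT id FROM app WHERE name = %s)",
    ("PythonGPU",),
)
db.commit()
print(cur.rowcount, "workunits updated")  # set back to 8 once the bad batch drains
db.close()
```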
**Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1**

Ian&Steve C. wrote:
> When you get a resend, especially a high-number resend like that, check the reason it was resent so many times. If there are tons of errors, it's probably safe to just abort it and not waste your time on it, especially when you know a bunch of bad tasks went out recently.

Well, not a bad idea, if I had the time to babysit my hosts 24/7 :-) However, this would run into a problem rather quickly: isn't it still the case that once a certain number of downloaded tasks is aborted, no further ones will be sent within the following 24 hours? In fact, I remember that this was even true for failing tasks in the past, based on the assumption that something is wrong with the host. So, in view of the many failed tasks now, I am surprised that I still get new ones despite the 24-hour ban I mentioned.
**Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 662**

It depends on how they have set up the server software. There are BOINC configs by which "bad actors" are put into timeout mode when they return a large number of bad results in a short time period; that is the 24-hour timeout you mentioned. Once a host starts returning valid results, it is given increasing amounts of work on each scheduler request.
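Roughly, that mechanism is the per-host daily quota. A simplified model (not the actual scheduler code, and the project-wide cap of 200 is an assumed example value):

```python
# Simplified model of BOINC's per-host daily quota: each error result
# shrinks the quota toward 1, each valid result doubles it back toward
# the project-wide cap. Not the actual server code; the cap is an
# example value.
DAILY_QUOTA_CAP = 200

def update_quota(max_results_day: int, result_ok: bool) -> int:
    if result_ok:
        return min(max_results_day * 2, DAILY_QUOTA_CAP)
    return max(max_results_day - 1, 1)

quota = DAILY_QUOTA_CAP
for ok in [False] * 250 + [True] * 8:  # a bad streak, then recovery
    quota = update_quota(quota, ok)
print(quota)  # the bad streak pins it at 1; eight valid results restore 200
```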
**Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1**

Crazy: I had another task which failed after more than 20 hours :-( I could live with the situation if a task failed after, say, 20 minutes or half an hour once in a while. But there was another task yesterday which failed after almost 20 hours, and numerous tasks in addition which failed after less than one hour, but also after much more than one hour.

My assumption is that these misconfigured tasks with 8 repetitions each will be around for many more weeks. I am sorry, but I can no longer live with this waste, particularly at what electricity costs here by now (and it will get even more expensive soon). So I have put GPUGRID on NNT and will crunch other projects instead. As sorry as I am for this step :-(

What I hope is that one day BOINC will develop a mechanism for recalling faulty batches. I don't understand why this has not been possible so far.