Experimental Python tasks (beta) - task description
**Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 4,772**

Are newer tasks using more VRAM? Or is there something on your system using more VRAM? What is the breakdown of VRAM used by the different processes? That will tell you which process is actually using it.
**Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1**

> What is the breakdown of VRAM used by the different processes? That will tell you which process is actually using it.

Hm, I will have to find a tool that tells me :-) Any recommendation?
**Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 662**

> What is the breakdown of VRAM used by the different processes?

nvidia-smi in the Terminal does nicely.
**Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 4,772**

Check here for nvidia-smi use on Windows. It's easy on Linux, but less intuitive on Windows: https://stackoverflow.com/questions/57100015/how-do-i-run-nvidia-smi-on-windows
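A minimal sketch of that per-process breakdown, assuming a driver recent enough to support nvidia-smi's query flags; the hard-coded Windows fallback path is an assumption and varies by driver version:

```python
# Print per-process VRAM usage via nvidia-smi (ships with the NVIDIA
# driver). Assumes nvidia-smi is on PATH; the Windows fallback location
# is a guess and differs between driver versions.
import shutil
import subprocess

smi = shutil.which("nvidia-smi") or r"C:\Windows\System32\nvidia-smi.exe"
result = subprocess.run(
    [smi, "--query-compute-apps=pid,process_name,used_memory",
     "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)  # one line per compute process: pid, name, VRAM used
```

Plain `nvidia-smi` with no arguments prints the same per-process table at the bottom of its output.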
**Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1**

My hosts still keep receiving faulty tasks which are totally "fresh", not re-submitted ones. So there must be tons of those still in the bucket :-( Just noticed that a task failed after more than 19 hours. This is not nice :-(
**Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1**

> My hosts still keep receiving faulty tasks which are totally "fresh", not re-submitted ones.

I was out for a few hours, and when I came back I noticed 2 more failed tasks (both ran for almost 3 hours before they crashed). Whereas at the beginning of the problem the tasks failed within a short time, as Abouh also noted, so there was not too much waste, these tasks now fail only after several hours. Within the past 24 hours, my hosts' total computation time on all the failing tasks was 104,526 seconds = 29 hours!

I am very much willing to support the science with my time, my equipment and my permanently increasing electricity bill, as long as it makes sense (and as long as I can afford it). FYI, my electricity costs have more than tripled since the beginning of the year, for known reasons. That's significant!

I simply cannot believe that all these faulty tasks in the big download bucket cannot be stopped, retrieved, cancelled or whatever else. It makes absolutely no sense to leave them in there and send them out to us for the next several weeks.

If the GPUGRID people cannot confirm that they will find a way to stop these faulty tasks quickly, I have no other choice, as sorry as I am, but to switch to other projects :-(
**Joined: 22 May 20 · Posts: 110 · Credit: 115,525,136 · RAC: 0**

For the time being, I have suspended receiving new tasks and reverted back to E@H & F@H until this situation with faulty tasks has been sorted out.
**Joined: 27 Jul 11 · Posts: 138 · Credit: 539,953,398 · RAC: 0**

Most peculiar: I have had no failed tasks, seven successes so far. I wish that with internet problems we could also get a standby task. Maybe they are sending these tasks to those machines crunching multiple WUs, which can quickly clear up the backlog :)
**Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0**

I have reviewed your recent tasks and there is a mix of faulty and successful ones. The successful ones are newer and are the only ones being submitted now. I could not figure out how to cancel the faulty tasks earlier. However, they should be almost all, if not all, crunched by now. Maybe other hosts can confirm whether they are still getting tasks that crash, but I expect the problem to be solved now. For the last 2-3 days only good tasks have been sent.
**Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 318**

@Erich56: you have to look into the history and the reason for the crashes. I got one of the last replications from workunit 27327972 last night, but that is one that was created on 16 October, almost a week ago. It's just that the first owner hung on to it for five days and did nothing. That's not the project's fault, even if the initial error was.
**Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1**

> For the last 2-3 days only good tasks have been sent.

Thanks, Abouh, for your reply. When you say what I quoted above, you are talking about "fresh" tasks, right? However, repetitions (up to 8) of the former faulty tasks are still going out. Just an example: a task which one of my hosts received this morning and which failed after about 2 1/2 hours: https://www.gpugrid.net/result.php?resultid=33112434
**Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 318**

Likewise. Since I posted, I've received another one which is likely to go the same way, from workunit 27328975. Another 5-day no-show by a Science United user.
**Joined: 16 Oct 22 · Posts: 12 · Credit: 1,382,500 · RAC: 0**

Maybe that person misjudged how long these tasks can take. If that person asked for a stock of tasks for 7 or 10 days, some will probably never start on that machine before the deadline.
**Joined: 27 Jul 11 · Posts: 138 · Credit: 539,953,398 · RAC: 0**

> Maybe that person misjudged how long these tasks can take. If that person asked for a stock of tasks for 7 or 10 days, some will probably never start on that machine before the deadline.

Mine are set to ten plus ten days, but I still got one. This is not the reason.
**Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 4,772**

> For the last 2-3 days only good tasks have been sent.

When you get a resend, especially a high-number resend like that, check the reason it was resent so many times. If there are tons of errors, it's probably safe to just abort it and not waste your time on it, especially when you know a bunch of bad tasks went out recently.
**Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 4,772**

> Maybe that person misjudged how long these tasks can take. If that person asked for a stock of tasks for 7 or 10 days, some will probably never start on that machine before the deadline.

Ideally, when a task approaches its deadline it should jump into high-priority mode and move to the front of the line for scheduling. But the process doesn't always work ideally with BOINC. There are also many people who blindly download tasks and then shut off their computers for extended periods of time.
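For what it's worth, a toy model of that deadline check (a simplification, not the real client scheduler, and the 0.8 "on fraction" is an invented example value): the client projects the finish time from the remaining runtime and switches the task to high priority when it would miss the deadline.

```python
# Toy model of BOINC's deadline-miss check; not the actual client code.
import time

def needs_high_priority(deadline_ts: float, remaining_runtime_s: float,
                        on_fraction: float = 0.8) -> bool:
    """on_fraction approximates the share of wall time the host computes."""
    projected_finish = time.time() + remaining_runtime_s / on_fraction
    return projected_finish > deadline_ts

# Example: 20 h of work left, deadline 24 h away, host computing 80% of
# the time -> 20 / 0.8 = 25 h > 24 h, so the task would run high priority.
```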
**Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0**

Yes, I meant fresh tasks, which are sent out for the first time of 8 possible attempts. And yes, repetitions are an issue. I understand why the limit was set to a relatively high value: many machines in the network have limited GPU memory (e.g. 2 GB) or configuration problems and inevitably fail these tasks, so the high value gave the experiments some error tolerance. Ideally, however, I would like to be able to modify it momentarily, just for the Python apps, in cases like this one: I could set it to 1 for a few hours so all bad tasks are processed quickly, and then go back to 8.
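For reference, a stock BOINC server keeps that limit per workunit in the workunit.max_error_results column, so in principle the temporary change is a one-off database update. A rough sketch, assuming direct database access; the app name "PythonGPU", the user and the database name are placeholders, and a real project would go through its own admin tooling:

```python
# Rough sketch: temporarily drop max_error_results to 1 for one app's
# workunits on a stock BOINC server schema. App name, user and database
# name are placeholders.
import mysql.connector  # pip install mysql-connector-python

db = mysql.connector.connect(user="boincadm", database="gpugrid_db")
cur = db.cursor()
cur.execute(
    "UPDATE workunit SET max_error_results = 1 "
    "WHERE appid = (SELECT id FROM app WHERE name = %s)",
    ("PythonGPU",),
)
db.commit()
print(cur.rowcount, "workunits updated")  # set back to 8 once the bad batch drains
db.close()
```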
**Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1**

Ian&Steve C. wrote:
> When you get a resend, especially a high-number resend like that, check the reason it was resent so many times. If there are tons of errors, it's probably safe to just abort it and not waste your time on it, especially when you know a bunch of bad tasks went out recently.

Well, not a bad idea, if I had the time to babysit my hosts 24/7 :-) However, this would run into a problem rather quickly: isn't it still the case that once a certain number of downloaded tasks is aborted, no further ones will be sent within the following 24 hours? In fact, I remember that this was even true for failing tasks in the past, based on the assumption that something is wrong with the host. So, in view of the many failed tasks now, I am surprised that I still get new ones despite the 24-hour ban I mentioned.
**Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 662**

It depends on how they have set up the server software. There are BOINC configs by which "bad actors" are put into timeout mode when they return a large number of bad results in a short time period; that is the 24-hour timeout you mentioned. Once a host starts returning valid results, it is given increasing amounts of work on each scheduler request.
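Roughly, that mechanism is the per-host daily quota. A simplified model (not the actual scheduler code, and the project-wide cap of 200 is an assumed example value):

```python
# Simplified model of BOINC's per-host daily quota: each error result
# shrinks the quota toward 1, each valid result doubles it back toward
# the project-wide cap. Not the actual server code; the cap is an
# example value.
DAILY_QUOTA_CAP = 200

def update_quota(max_results_day: int, result_ok: bool) -> int:
    if result_ok:
        return min(max_results_day * 2, DAILY_QUOTA_CAP)
    return max(max_results_day - 1, 1)

quota = DAILY_QUOTA_CAP
for ok in [False] * 250 + [True] * 8:  # a bad streak, then recovery
    quota = update_quota(quota, ok)
print(quota)  # the bad streak pins it at 1; eight valid results restore 200
```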
**Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1**

Crazy: I had another task which failed after more than 20 hours :-( I could live with the situation if a task failed after, say, 20 minutes or half an hour once in a while. But there was another task yesterday which failed after almost 20 hours, and numerous tasks in addition which failed after less than one hour, but also after much more than one hour.

My assumption is that these misconfigured tasks with 8 repetitions each will be around for many more weeks. I am sorry, but I can no longer live with this waste, particularly at what electricity costs here by now (and it will get even more expensive soon). So I have put GPUGRID on NNT and will crunch other projects instead. As sorry as I am for this step :-(

What I hope is that one day BOINC will develop a mechanism for recalling faulty batches. I don't understand why this has not been possible so far.