Experimental Python tasks (beta) - task description

Ian&Steve C.

Message 59504 - Posted: 21 Oct 2022, 14:30:14 UTC

are newer tasks using more VRAM? or is there something on your system using more VRAM?

what is the breakdown of VRAM used by the different processes? that will tell you what process is actually using the vram

Erich56

Message 59505 - Posted: 21 Oct 2022, 14:45:56 UTC - in response to Message 59504.  

Ian&Steve C. wrote:
what is the breakdown of VRAM used by the different processes? that will tell you what process is actually using the vram

hm, I will have to find a tool that tells me :-)
Any recommendation?

Keith Myers

Message 59506 - Posted: 21 Oct 2022, 15:43:50 UTC - in response to Message 59505.  

Ian&Steve C. wrote:
what is the breakdown of VRAM used by the different processes? that will tell you what process is actually using the vram

Erich56 wrote:
hm, I will have to find a tool that tells me :-)
Any recommendation?

nvidia-smi in the Terminal does nicely.
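
For anyone who wants that per-process breakdown from a script, here is a minimal sketch. It assumes nvidia-smi is on the PATH and uses its --query-compute-apps option; the helper names (parse_compute_apps, vram_by_process, print_breakdown) are made up for illustration. It lists compute processes only, and on some setups the memory field may come back as [N/A], which the sketch maps to None.

```python
import csv
import io
import subprocess

# nvidia-smi query for compute processes; the field list is one reasonable choice.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Turn nvidia-smi CSV rows into (pid, process_name, used_mib) tuples."""
    rows = []
    for fields in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        pid, name, used = (f.strip() for f in fields)
        # The memory column is not always a number (it may read "[N/A]").
        used_mib = int(used) if used.isdigit() else None
        rows.append((int(pid), name, used_mib))
    return rows


def vram_by_process():
    """Run nvidia-smi and return the parsed per-process VRAM usage."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_compute_apps(out)


def print_breakdown():
    """Print one line per compute process: pid, VRAM in MiB, process name."""
    for pid, name, used_mib in vram_by_process():
        shown = "n/a" if used_mib is None else str(used_mib)
        print(f"{pid:>8}  {shown:>6} MiB  {name}")
```

Calling print_breakdown() on a host with an NVIDIA driver installed should show which process is holding the VRAM.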

Ian&Steve C.

Message 59507 - Posted: 21 Oct 2022, 16:18:26 UTC

check here for nvidia-smi use on Windows. it's easy on Linux, but less intuitive on Windows

https://stackoverflow.com/questions/57100015/how-do-i-run-nvidia-smi-on-windows

Erich56

Message 59510 - Posted: 22 Oct 2022, 7:16:46 UTC
Last modified: 22 Oct 2022, 7:43:42 UTC

my hosts still keep receiving faulty tasks which are totally "fresh", not re-submitted ones.
So there must be tons of those still in the bucket :-(

Just noticed that a task failed after >19 hours. This is not nice :-(

Erich56

Message 59511 - Posted: 22 Oct 2022, 11:40:15 UTC - in response to Message 59510.  

Erich56 wrote:
my hosts still keep receiving faulty tasks which are totally "fresh", not re-submitted ones.
So there must be tons of those still in the bucket :-(

Just noticed that a task failed after >19 hours. This is not nice :-(

I was out for a few hours, and when I came back, I noticed 2 more failed tasks (both ran for almost 3 hours before they crashed).

At the beginning of the problem the tasks failed within a short time, as Abouh also noted, so not much was wasted. Now they fail only after several hours.

Within the past 24 hours, the failing tasks on my hosts added up to 104,526 seconds of computation time, about 29 hours!

I am very much willing to support the science with my time, my equipment and my permanently increasing electricity bill as long as it makes sense (and as long as I can afford it).
FYI, my electricity costs have more than tripled since the beginning of the year, for known reasons. That's significant!

I simply cannot believe that all these faulty tasks in the big download bucket cannot be stopped, retrieved, cancelled or whatever else. It makes absolutely no sense to leave them in there and send them out to us for the next several weeks.

If the GPUGRID people cannot confirm that they are quickly finding a way to stop these faulty tasks, I have no other choice, as sorry as I am, but to switch to other projects :-(

bozz4science

Message 59512 - Posted: 22 Oct 2022, 11:45:46 UTC

For the time being, I have suspended receiving new tasks and gone back to E@H & F@H until this situation with faulty tasks has been sorted out.

KAMasud

Message 59513 - Posted: 22 Oct 2022, 13:26:11 UTC
Last modified: 22 Oct 2022, 13:29:49 UTC

Most peculiar: I have had no failed tasks, out of seven so far.
With my internet problems, I wish we could also get a standby task.
Maybe they are sending these tasks to machines that crunch multiple WUs at once and can quickly clear the backlog :)

abouh

Message 59514 - Posted: 22 Oct 2022, 15:29:36 UTC - in response to Message 59510.  

I have reviewed your recent tasks and there is a mix of faulty and successful tasks. The successful ones are newer and are the only ones being submitted now.

I could not figure out how to cancel the faulty tasks earlier. However, they should be almost all if not all crunched by now.

Maybe other hosts can confirm if they are still getting tasks that crash, but I expect the problem to be solved now. For the last 2-3 days only good tasks have been sent.


Richard Haselgrove

Message 59515 - Posted: 22 Oct 2022, 15:51:18 UTC - in response to Message 59514.  

@ Erich56: you have to look into the history and the reason for the crashes. I got one of the last replications from workunit 27327972 last night - but that's one that was created on 16 October, almost a week ago. It's just that the first owner hung on to it for five days and did nothing. That's not the project's fault, even if the initial error was.

Erich56

Message 59516 - Posted: 22 Oct 2022, 15:55:21 UTC - in response to Message 59514.  

abouh wrote:
For the last 2-3 days only good tasks have been sent.

thanks, Abouh, for your reply.
When you say what I quoted above - you are talking about "fresh" tasks, right?
However, repetitions (up to 8) of the former, faulty tasks are still going out.

Just an example of a task which one of my hosts received this morning, and which failed after about 2 1/2 hours:

https://www.gpugrid.net/result.php?resultid=33112434

Richard Haselgrove

Message 59517 - Posted: 22 Oct 2022, 15:58:18 UTC

Likewise. Since I posted, I've received another one which is likely to go the same way, from workunit 27328975. Another 5-day no-show by a Science United user.

GS

Message 59518 - Posted: 22 Oct 2022, 16:17:52 UTC

Maybe that person misjudged how long these tasks can take. If that person asked for a stock of tasks for 7 or 10 days, some probably will never start on that machine before the deadline.
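
A rough way to check that scenario is sketched below. It is a simplified model, not how the BOINC client actually schedules: the queue is assumed to run first-in, first-out, all tasks are assumed to share one deadline, and the function name and the numbers in the usage note are hypothetical.

```python
def tasks_missing_deadline(est_hours, deadline_hours, concurrent=1):
    """Return indices of queued tasks that would finish after the deadline.

    est_hours: estimated runtime of each queued task, in queue order.
    deadline_hours: hours from now until the (shared) deadline.
    concurrent: how many tasks the host runs at the same time.
    """
    # Track when each execution slot becomes free (simple FIFO model).
    slots = [0.0] * concurrent
    late = []
    for i, hours in enumerate(est_hours):
        k = slots.index(min(slots))  # next slot to become free
        slots[k] += hours            # the task finishes at this time
        if slots[k] > deadline_hours:
            late.append(i)
    return late
```

For example, a host that caches 14 tasks of roughly 12 hours each against a 120-hour deadline and runs one at a time would finish the last four too late: tasks_missing_deadline([12] * 14, 120) returns [10, 11, 12, 13].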

KAMasud

Message 59519 - Posted: 22 Oct 2022, 16:37:02 UTC - in response to Message 59518.  

GS wrote:
Maybe that person misjudged how long these tasks can take. If that person asked for a stock of tasks for 7 or 10 days, some probably will never start on that machine before the deadline.


Mine are set to ten plus ten days but I still get one. This is not the reason.

Ian&Steve C.

Message 59520 - Posted: 22 Oct 2022, 16:52:27 UTC - in response to Message 59516.  

abouh wrote:
For the last 2-3 days only good tasks have been sent.

Erich56 wrote:
thanks, Abouh, for your reply.
When you say what I quoted above - you are talking about "fresh" tasks, right?
However, repetitions (up to 8) of the former, faulty tasks are still going out.

Just an example of a task which one of my hosts received this morning, and which failed after about 2 1/2 hours:

https://www.gpugrid.net/result.php?resultid=33112434


When you get a resend, especially a high number resend like that, check the reason that it was resent so much. If there’s tons of errors, probably safe to just abort it and not waste your time on it. Especially when you know a bunch of bad tasks had gone out recently.
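
That rule of thumb could be sketched roughly as follows. Everything here is an assumption for illustration: the task names are taken to end in a _N replication suffix, the outcome labels are whatever you read off the workunit page by hand, and the threshold of three errors is arbitrary. A copy that simply timed out on someone else's machine is deliberately not counted as an error.

```python
def replication_index(task_name):
    """Return the trailing _N replication number of a task name."""
    head, sep, tail = task_name.rpartition("_")
    if not sep or not tail.isdigit():
        raise ValueError(f"no replication suffix in {task_name!r}")
    return int(tail)


def should_abort_resend(task_name, prior_outcomes, max_errors=3):
    """Heuristic: abort a resend if enough earlier copies ended in errors.

    prior_outcomes: outcomes of the earlier copies, e.g.
    ["error", "error", "no reply"]. The labels are hypothetical.
    """
    if replication_index(task_name) == 0:
        return False  # first copy, there is no history to learn from
    errors = sum(1 for outcome in prior_outcomes if outcome == "error")
    return errors >= max_errors
```

With a made-up name such as "wu_example_7" and six earlier "error" outcomes, should_abort_resend returns True; a _1 resend whose only earlier copy was a "no reply" returns False.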

Ian&Steve C.

Message 59521 - Posted: 22 Oct 2022, 16:57:30 UTC - in response to Message 59518.  

GS wrote:
Maybe that person misjudged how long these tasks can take. If that person asked for a stock of tasks for 7 or 10 days, some probably will never start on that machine before the deadline.


Ideally, when a task approaches its deadline it should switch into high priority mode and move to the front of the queue. But the process doesn’t always work ideally with BOINC.

But there are also many people who blindly download tasks then shut off their computer for extended periods of time.

abouh

Message 59522 - Posted: 22 Oct 2022, 17:10:49 UTC - in response to Message 59516.  
Last modified: 22 Oct 2022, 17:12:02 UTC

Yes, I meant fresh tasks, which are sent out for the first of 8 possible attempts.

Yes, repetitions are an issue. I understand why the limit was set to a relatively high value. Many machines in the network have limited GPU memory (e.g. 2 GB) or configuration problems and inevitably fail these tasks. The high limit gave the experiments some error tolerance.

However, ideally I would like to be able to change it temporarily, just for the python apps, in cases like this one. I could set it to 1 for a few hours so that all bad tasks are processed quickly, and then go back to 8.
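
A back-of-the-envelope sketch of why that limit matters. It assumes the worst case, in which every attempt at a faulty workunit crashes after a similar runtime; the function name and the numbers in the usage note are hypothetical, not project figures.

```python
def wasted_gpu_hours(faulty_workunits, attempts_allowed, hours_before_crash):
    """Upper bound on GPU hours burned if every attempt at a faulty workunit crashes."""
    return faulty_workunits * attempts_allowed * hours_before_crash
```

With 1000 faulty workunits that each crash after about 2.5 hours, 8 allowed attempts cost up to wasted_gpu_hours(1000, 8, 2.5) = 20000.0 GPU hours across the network, against 2500.0 with a limit of 1.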

Erich56

Message 59523 - Posted: 22 Oct 2022, 18:10:21 UTC - in response to Message 59520.  

Ian&Steve C. wrote:
When you get a resend, especially a high number resend like that, check the reason that it was resent so much. If there’s tons of errors, probably safe to just abort it and not waste your time on it. Especially when you know a bunch of bad tasks had gone out recently.

Well, not a bad idea, if I had the time to babysit my hosts 24/7 :-)

However, this would end up with a problem rather quickly: isn't it still the case that once a certain number of downloaded tasks have been deleted, no further ones will be sent within the following 24 hours?
In fact, I remember that this was even true for failing tasks in the past, based on the assumption that there is something wrong with the host. So, in view of the many failed tasks now, I am surprised that I still get new ones within that 24-hour ban period.

Keith Myers

Message 59524 - Posted: 22 Oct 2022, 18:20:24 UTC

Depends on how they have set up the server software.

There are BOINC configs so that "bad actors" are put into timeout mode when they return a large number of bad results in a short time period. That is the 24-hour timeout you mentioned.

Once a host starts returning valid results, it is given increasing amounts of work on each scheduler request.
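
A toy model of that behaviour is sketched below. The exact rule depends on how the project has configured its server, so treat this as an assumption: here an error lowers the host's daily quota by one (never below 1) and a valid result doubles it, capped at a project-wide maximum. The function names are made up.

```python
def update_quota(quota, result_ok, max_quota):
    """One step of a simplified per-host daily quota rule."""
    if result_ok:
        return min(max_quota, quota * 2)  # reward: double, up to the cap
    return max(1, quota - 1)              # penalty: one less, never below 1


def simulate(results, max_quota):
    """Return the quota after each returned result, starting from max_quota."""
    quota = max_quota
    history = []
    for ok in results:
        quota = update_quota(quota, ok, max_quota)
        history.append(quota)
    return history
```

Under these assumptions, three errors followed by three valid results with a cap of 8 gives simulate([False] * 3 + [True] * 3, 8) == [7, 6, 5, 8, 8, 8]: a single valid result is enough to restore the full quota, which may be part of why a host with a mix of good and bad tasks keeps getting work.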

Erich56

Message 59525 - Posted: 23 Oct 2022, 16:16:10 UTC

Crazy, I had another task which failed after more than 20 hours :-(

I could live with the situation when a task fails after say 20 minutes or half an hour, once in a while.
There was another task yesterday which failed after almost 20 hours.
In addition, numerous tasks failed, some after less than one hour and others after much more than one hour.

My assumption is that these misconfigured tasks with 8 repetitions each will be around for many more weeks.
I am sorry, but I can no longer live with this waste, particularly given what electricity costs here by now (and it is getting even more expensive soon).

So I put GPUGRID on NNT (No New Tasks) and will crunch other projects. As sorry as I am for this step :-(

What I hope is that one day BOINC will develop a mechanism for calling back faulty batches. And I don't understand why this is not possible so far.

©2025 Universitat Pompeu Fabra