ATMML

Erich56

Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Message 61939 - Posted: 15 Nov 2024, 19:39:23 UTC

In the past few days, a lot of ATMML tasks have failed after a few minutes. It's the sub-type PTP1B...
Clicking on the workunit reveals that this happens not only on my hosts but also on other volunteers' hosts.
Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 891
Message 61941 - Posted: 15 Nov 2024, 19:49:08 UTC

Yes, that batch was badly configured. Seems to have been cleared out and expired relatively soon. I'm on to a different MCL1 batch that is working correctly.
Keith Myers
Message 61998 - Posted: 4 Dec 2024, 19:44:57 UTC

Looks like the latest batch of ATMML work finally has working checkpointing.

This should make everyone happy with these long running tasks.

Congrats to @Quico and @[BAT]Svennemans for the code improvement.
Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Message 62211 - Posted: 17 Feb 2025, 16:52:56 UTC
Last modified: 17 Feb 2025, 16:53:32 UTC

I can't get my head round what this project is doing at the moment. I was sent an ATMML task at 04:30 this morning (a brand-new workunit, not a user resend). I've now completed running it, but I find that not only is the Scheduler down (as shown on the server status page), but the upload servers are also 'down for maintenance'.

What's the point in preparing and sending out new work, if you're not interested in seeing if it worked?

[typo]
Keith Myers
Message 62212 - Posted: 17 Feb 2025, 17:48:59 UTC

The admin is preparing to move the servers.

"Now that all running tasks have completed I am beginning to stop various server functions, it will take some days to move. We will update here as we make progress."
Richard Haselgrove
Message 62213 - Posted: 17 Feb 2025, 17:54:14 UTC - in response to Message 62212.  

Has the admin communicated that to the researcher who created new work this morning?

WU 31438508
Keith Myers
Message 62214 - Posted: 17 Feb 2025, 18:40:53 UTC - in response to Message 62213.  

Don't know. All the researchers belong to the Discord channel and comment there regularly.

Assume that all are aware of the project move.

Quico must not have stopped all of his processes, or thought he did and a task slipped out.
Richard Haselgrove
Message 62215 - Posted: 17 Feb 2025, 19:02:59 UTC - in response to Message 62214.  

Well, the scheduler must have been running at 04:30 this morning, otherwise I couldn't have picked up the new task.

And yesterday, there were three tasks running. Today, there are also three tasks running, and one user has reported a runtime in the last 24 hours - so the upload server must have been running too.

Unless they manage comms in both directions in tandem, admin will never reach his aim of "all running tasks have completed".

I don't get on with Discord - vastly bloated, IMO. Could you pass a message back to the team, please? I've changed my 15 minute script to one which retries uploads as well, so he only needs to switch comms back on for a few minutes.
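A minimal sketch of what such a periodic retry script could look like, assuming the stock `boinccmd` tool is on the PATH and can reach the local client. `PROJECT_URL` and the function names are assumptions for illustration, not the script actually used here; the two `boinccmd` operations are `--network_available` (retry deferred network communication, which covers stalled uploads) and `--project URL update` (contact the scheduler).

```python
import subprocess

# Assumption: replace with the master URL that `boinccmd --get_project_status`
# reports for the project on your host.
PROJECT_URL = "https://www.gpugrid.net/"


def build_commands(project_url):
    """Return the boinccmd invocations for one retry pass."""
    return [
        # Retry deferred network communication (includes pending uploads).
        ["boinccmd", "--network_available"],
        # Ask the client to contact the project's scheduler.
        ["boinccmd", "--project", project_url, "update"],
    ]


def run_once(project_url, runner=subprocess.run):
    """Run one retry pass; return a list of (command, return code) pairs."""
    results = []
    for cmd in build_commands(project_url):
        completed = runner(cmd, check=False)
        results.append((cmd, completed.returncode))
    return results
```

Calling `run_once(PROJECT_URL)` from a scheduler such as cron every 15 minutes would reproduce the behaviour described; the `runner` parameter only exists so the function can be exercised without a BOINC client installed.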
Ian&Steve C.
Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Message 62245 - Posted: 6 Mar 2025, 17:08:09 UTC - in response to Message 62215.  

> I don't get on with Discord - vastly bloated, IMO.


you can use discord just in the web browser. you don't need to download their application. not any more bloated than any other webpage.
Greg _BE
Joined: 30 Jun 14
Posts: 153
Credit: 129,654,684
RAC: 0
Message 62248 - Posted: 8 Mar 2025, 15:19:02 UTC

Erich56
Message 62401 - Posted: 7 May 2025, 19:10:08 UTC

Task failing with "energy is NaN" after more than 8 hours runtime is not nice:

https://gpugrid.net/gpugrid/result.php?resultid=38499642

:-(
Keith Myers
Message 62402 - Posted: 7 May 2025, 19:39:36 UTC

Your link is unusable. Those hosts' errors simply mean you ran out of disk space or exceeded time limits.
Erich56
Message 62404 - Posted: 8 May 2025, 6:36:13 UTC - in response to Message 62402.  

> Your link is unusable. Those hosts' errors simply mean you ran out of disk space or exceeded time limits.
I think neither of your assumptions is the case: I have about 42 GB of free disk space. Is that not enough to process an ATMML task? As for exceeding a time limit: for ATMML this is 14-1/2 hours, so I don't think so.
Ian&Steve C.
Message 62406 - Posted: 9 May 2025, 0:08:27 UTC - in response to Message 62401.  

> Task failing with "energy is NaN" after more than 8 hours runtime is not nice:
>
> https://gpugrid.net/gpugrid/result.php?resultid=38499642
>
> :-(

this is a known failure mode of these types of simulations and is not anything new. nothing you can do about it really.

if you're getting a lot of the same kind of error at very high frequency, and you notice that others are reprocessing your task successfully, then you might look into the system configuration or system overclocks or stability that might be causing issues.
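To illustrate the failure mode in general terms, here is a hedged sketch of how a simulation driver might detect a NaN energy and give up after a bounded number of restarts. It is not the ATMML application's code; `step_fn`, `run_with_retries` and the default retry limit are hypothetical.

```python
import math


def run_with_retries(step_fn, n_steps, max_attempts=3):
    """Run n_steps of a simulation, restarting from step 0 on a NaN energy.

    step_fn(step) is a hypothetical callable that advances the simulation
    and returns the potential energy after that step.
    Returns the number of the attempt that finished cleanly.
    """
    for attempt in range(1, max_attempts + 1):
        for step in range(n_steps):
            energy = step_fn(step)
            if math.isnan(energy):
                # A NaN energy means the state is unusable; abandon this attempt.
                break
        else:
            # The inner loop completed without ever seeing a NaN.
            return attempt
    raise RuntimeError(f"Simulation failed {max_attempts} times!")
```

Because a NaN can appear at any step, a run can fail late just as easily as early, which is why hours of work can be lost before the retry budget is exhausted.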
Keith Myers
Message 62413 - Posted: 9 May 2025, 23:56:42 UTC - in response to Message 62404.  
Last modified: 10 May 2025, 0:00:37 UTC

> > Your link is unusable. Those hosts' errors simply mean you ran out of disk space or exceeded time limits.
> I think neither of your assumptions is the case: I have about 42 GB of free disk space. Is that not enough to process an ATMML task? As for exceeding a time limit: for ATMML this is 14-1/2 hours, so I don't think so.

I've simply stated exactly what your failed tasks reported. Just look at the output file for the failed tasks and the application prints why it failed.

For the tasks that reported out of disk space, that is within your control. Just allocate more space.

For the tasks that reported 'exceeded time limits' you have no control, since the developer/admin has set the rsc_fpops_bound limit in the input template too low and the tasks are misconfigured. Nothing you can do about that unless you abort the tasks or wait for the admin to fix the task generation template with a bigger value.
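To make the 'exceeded time limits' case concrete: as far as I understand the BOINC client, it aborts a task once the elapsed time passes roughly the workunit's floating-point operations bound (the `rsc_fpops_bound` field of the workunit) divided by the client's speed estimate for the app version. The sketch below works through that arithmetic; all numbers and function names are hypothetical and not taken from any real ATMML workunit.

```python
def max_elapsed_seconds(rsc_fpops_bound, app_flops):
    """Approximate elapsed-time limit: operations bound / estimated speed."""
    if app_flops <= 0:
        raise ValueError("app_flops must be positive")
    return rsc_fpops_bound / app_flops


def exceeds_limit(expected_runtime_s, rsc_fpops_bound, app_flops):
    """True if a task of the given runtime would be aborted by the limit."""
    return expected_runtime_s > max_elapsed_seconds(rsc_fpops_bound, app_flops)


# Hypothetical values: a bound of 1e18 operations on a host whose app
# version is estimated at 5e13 FLOPS gives a limit of 20,000 s (about 5.6 h).
limit = max_elapsed_seconds(1e18, 5e13)
# A task that really needs 12 hours (43,200 s) would then be aborted.
would_abort = exceeds_limit(12 * 3600, 1e18, 5e13)
```

The limit shrinks as the speed estimate grows, so a fast GPU with a generous estimate can hit the limit on a long task even though nothing is wrong on the host.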
Erich56
Message 62425 - Posted: 16 May 2025, 16:57:53 UTC - in response to Message 62413.  

https://gpugrid.net/gpugrid/result.php?resultid=38508162

really annoying - task failed with "RuntimeError: Simulation failed 5 times!" after 12 hours :-(((
Erich56
Message 62436 - Posted: 20 May 2025, 17:29:54 UTC

the next one, failing after 9 hours, with "Simulation failed: Particle coordinate is NaN"

https://gpugrid.net/gpugrid/result.php?resultid=38512126

By now I am thinking about abandoning GPUGRID: too many faulty tasks and a high electricity bill (even more so at this time of year, when higher temperatures mean I need to run the air conditioning).
William Albert
Joined: 22 Sep 24
Posts: 9
Credit: 195,120,851
RAC: 0
Message 62437 - Posted: 20 May 2025, 18:12:36 UTC - in response to Message 62436.  

> the next one, failing after 9 hours, with "Simulation failed: Particle coordinate is NaN"
>
> https://gpugrid.net/gpugrid/result.php?resultid=38512126
>
> By now I am thinking about abandoning GPUGRID: too many faulty tasks and a high electricity bill (even more so at this time of year, when higher temperatures mean I need to run the air conditioning).


I have seen tasks fail with this error before, but the failure usually happens pretty quickly, and others crunching the WU also report similar failures.

If your WUs are reporting as invalid after hours of crunching, while your wingmen are able to successfully process it, I would lean toward a potential hardware stability issue (or some issue with the driver/OS/simultaneous usage that is causing these WUs to fail in this way).

That being said, if you're confident that your machines are running properly, and don't want to "waste" energy crunching these WUs, you can opt out of the ATMML project in your user preferences, without abandoning GPUGRID entirely.
Pascal
Joined: 15 Jul 20
Posts: 95
Credit: 2,550,803,412
RAC: 248
Message 62438 - Posted: 20 May 2025, 18:33:09 UTC - in response to Message 62437.  

Personally, this is what I did, because since Christmas GPUGRID has been working like a flea market. Between errors and work units that last too long, GPUGRID has become too greedy with the GPUs. It was my priority GPU project, but I moved to something else that works properly. Einstein and Asteroids have priority now, and GPUGRID is in 3rd position.
Erich56
Message 62444 - Posted: 21 May 2025, 8:26:32 UTC - in response to Message 62437.  

> I have seen tasks fail with this error before, but the failure usually happens pretty quickly, and others crunching the WU also report similar failures.
>
> If your WUs are reporting as invalid after hours of crunching, while your wingmen are able to successfully process it, I would lean toward a potential hardware stability issue (or some issue with the driver/OS/simultaneous usage that is causing these WUs to fail in this way).


Interesting findings about one such task, which failed after a very long time on one of my hosts:

https://gpugrid.net/gpugrid/workunit.php?wuid=31494114

In 2 cases the task failed after a few minutes, on my host 622440 it failed after 12 (!) hours, and in only 1 case was the task successful.

It's a conundrum, isn't it? These tasks are definitely kind of unpredictable :-(

©2025 Universitat Pompeu Fabra