ATMML

Author	Message
Erich56 Joined: 1 Jan 15 Posts: 1171 Credit: 12,662,148,501 RAC: 10,668	Message 61939 - Posted: 15 Nov 2024, 19:39:23 UTC
In the past few days, a lot of ATMML tasks have failed after a few minutes. They are of the sub-type PTP1B... Clicking on the workunit reveals that this happens not only on my hosts, but also on those of other volunteers.

Keith Myers Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 0	Message 61941 - Posted: 15 Nov 2024, 19:49:08 UTC
Yes, that batch was badly configured. Seems to have been cleared out and expired relatively soon. I'm on to a different MCL1 batch that is working correctly.

Keith Myers Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 0	Message 61998 - Posted: 4 Dec 2024, 19:44:57 UTC
Looks like the latest batch of ATMML work finally has working checkpointing. This should make everyone happy with these long-running tasks. Congrats to @Quico and @[BAT]Svennemans for the code improvement.

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 62211 - Posted: 17 Feb 2025, 16:52:56 UTC Last modified: 17 Feb 2025, 16:53:32 UTC
I can't get my head round what this project is doing at the moment. I was sent an ATMML task at 04:30 this morning (a brand-new workunit, not a user resend). I've now completed running it, but I find that not only is the Scheduler down (as shown on the server status page), but the upload servers are also 'down for maintenance'.
What's the point in preparing and sending out new work, if you're not interested in seeing if it worked?
[typo]

Keith Myers Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 0	Message 62212 - Posted: 17 Feb 2025, 17:48:59 UTC
The admin is preparing to move the servers.
"Now that all running tasks have completed I am beginning to stop various server functions, it will take some days to move. We will update here as we make progress."

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 62213 - Posted: 17 Feb 2025, 17:54:14 UTC - in response to Message 62212.
Has the admin communicated that to the researcher who created new work this morning? WU 31438508

Keith Myers Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 0	Message 62214 - Posted: 17 Feb 2025, 18:40:53 UTC - in response to Message 62213.
Don't know. All the researchers belong to the Discord channel and comment there regularly. I assume that all are aware of the project move. Quico must not have stopped all of his processes, or thought he did and a task slipped out.

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 62215 - Posted: 17 Feb 2025, 19:02:59 UTC - in response to Message 62214.
Well, the scheduler must have been running at 04:30 this morning, otherwise I couldn't have picked up the new task. And yesterday, there were three tasks running. Today, there are also three tasks running, and one user has reported a runtime in the last 24 hours - so the upload server must have been running too. Unless they manage comms in both directions in tandem, admin will never reach his aim of "all running tasks have completed".
I don't get on with Discord - vastly bloated, IMO. Could you pass a message back to the team, please? I've changed my 15 minute script to one which retries uploads as well, so he only needs to switch comms back on for a few minutes.
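The script referred to above is not shown in the thread. As a rough sketch of what one cycle of such a job might look like, the following assumes the BOINC command-line tool `boinccmd` is on the PATH and able to authenticate with the local client, and uses its `--network_available` option (retry deferred network communication) followed by a project update. `PROJECT_URL`, `build_commands` and `run_cycle` are hypothetical names chosen for illustration, and the URL is a placeholder: use whatever master URL `boinccmd --get_project_status` reports on your host.

```python
import subprocess

# Hypothetical placeholder: replace with the master URL your client reports.
PROJECT_URL = "https://www.gpugrid.net/"


def build_commands(project_url):
    """Return the boinccmd invocations for one retry cycle."""
    return [
        # Ask the client to retry deferred network communication,
        # which includes pending uploads.
        ["boinccmd", "--network_available"],
        # Ask the client to contact the project's scheduler.
        ["boinccmd", "--project", project_url, "update"],
    ]


def run_cycle(project_url, runner=subprocess.run):
    """Run one retry cycle and return the exit code of each command."""
    codes = []
    for cmd in build_commands(project_url):
        result = runner(cmd, check=False)
        codes.append(result.returncode)
    return codes
```

Calling `run_cycle(PROJECT_URL)` from a scheduler such as cron every 15 minutes would give the cadence mentioned in the post. The `runner` parameter exists only so the command list can be checked without a BOINC client installed.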

Ian&Steve C. Joined: 21 Feb 20 Posts: 1117 Credit: 40,876,970,595 RAC: 0	Message 62245 - Posted: 6 Mar 2025, 17:08:09 UTC - in response to Message 62215.
Quote: I don't get on with Discord - vastly bloated, IMO.
you can use discord just in the web browser. you don't need to download their application. not any more bloated than any other webpage.

Greg _BE Joined: 30 Jun 14 Posts: 154 Credit: 131,154,684 RAC: 27	Message 62248 - Posted: 8 Mar 2025, 15:19:02 UTC
Go here for a message from Steve: https://gpugrid.net/gpugrid/forum_thread.php?id=5409&postid=62242

Erich56 Joined: 1 Jan 15 Posts: 1171 Credit: 12,662,148,501 RAC: 10,668	Message 62401 - Posted: 7 May 2025, 19:10:08 UTC
Task failing with "energy is NaN" after more than 8 hours runtime is not nice: https://gpugrid.net/gpugrid/result.php?resultid=38499642 :-(

Keith Myers Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 0	Message 62402 - Posted: 7 May 2025, 19:39:36 UTC
Your link is unusable. Those host errors simply mean that you ran out of disk space or exceeded time limits.

Erich56 Joined: 1 Jan 15 Posts: 1171 Credit: 12,662,148,501 RAC: 10,668	Message 62404 - Posted: 8 May 2025, 6:36:13 UTC - in response to Message 62402.
Quote: Your link is unusable. Those host errors simply mean that you ran out of disk space or exceeded time limits.
I think that neither of your assumptions is the case: I have about 42GB free disk space - not enough to process an ATMML? As for exceeding a time limit: for ATMML this is 14-1/2 hours; I don't think so.

Ian&Steve C. Joined: 21 Feb 20 Posts: 1117 Credit: 40,876,970,595 RAC: 0	Message 62406 - Posted: 9 May 2025, 0:08:27 UTC - in response to Message 62401.
Quote: Task failing with "energy is NaN" after more than 8 hours runtime is not nice: https://gpugrid.net/gpugrid/result.php?resultid=38499642 :-(
this is a known failure mode of these types of simulations and is not anything new. nothing you can do about it really. if you're getting a lot of the same kind of error at very high frequency, and you notice that others are reprocessing your task successfully, then you might look into the system configuration or system overclocks or stability that might be causing issues.

Keith Myers Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 0	Message 62413 - Posted: 9 May 2025, 23:56:42 UTC - in response to Message 62404. Last modified: 10 May 2025, 0:00:37 UTC
Quote: Your link is unusable. Those host errors simply mean that you ran out of disk space or exceeded time limits.
Quote: I think that neither of your assumptions is the case: I have about 42GB free disk space - not enough to process an ATMML? As for exceeding a time limit: for ATMML this is 14-1/2 hours; I don't think so.
I've simply stated exactly what your failed tasks reported. Just look at the output file for the failed tasks and the application prints why it failed.
For the tasks that reported out of disk space, the fix is in your control. Just allocate more space.
For the tasks that reported 'exceeded time limits' you have no control, since the developer/admin has set the input file rsc_bound limit too low and the tasks are misconfigured. Nothing you can do about that unless you abort the tasks or wait for the admin to fix the task generation template with a bigger value.
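To illustrate how such a limit can bite, here is a small worked sketch. It rests on an assumption: that the "rsc_bound limit" above refers to the workunit's `rsc_fpops_bound`, which the BOINC client divides by its speed estimate for the application to get the maximum elapsed time before aborting the task. The function names and both numeric values are hypothetical and are not taken from any ATMML task.

```python
def max_elapsed_seconds(rsc_fpops_bound, estimated_flops):
    """Maximum run time the client allows before aborting a task."""
    return rsc_fpops_bound / estimated_flops


def would_be_aborted(elapsed_seconds, rsc_fpops_bound, estimated_flops):
    """True if a task running this long exceeds the elapsed-time limit."""
    return elapsed_seconds > max_elapsed_seconds(rsc_fpops_bound, estimated_flops)


# Hypothetical values, for illustration only.
bound = 1.0e18   # floating-point operations allowed by the workunit
flops = 2.0e13   # client's speed estimate for the app on this host

limit = max_elapsed_seconds(bound, flops)   # 50,000 seconds
limit_hours = limit / 3600                  # just under 14 hours
```

Under these made-up numbers a task still running at 14.5 hours would be aborted even though it was computing correctly. Because the limit shrinks as the speed estimate grows, a bound that is adequate on one host can be too low on another, which is consistent with the point that only a larger value in the task template fixes it.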

Erich56 Joined: 1 Jan 15 Posts: 1171 Credit: 12,662,148,501 RAC: 10,668	Message 62425 - Posted: 16 May 2025, 16:57:53 UTC - in response to Message 62413.
https://gpugrid.net/gpugrid/result.php?resultid=38508162
really annoying - task failed with "RuntimeError: Simulation failed 5 times!" after 12 hours :-(((

Erich56 Joined: 1 Jan 15 Posts: 1171 Credit: 12,662,148,501 RAC: 10,668	Message 62436 - Posted: 20 May 2025, 17:29:54 UTC
The next one, failing after 9 hours, with "Simulation failed: Particle coordinate is NaN": https://gpugrid.net/gpugrid/result.php?resultid=38512126
By now I am thinking about abandoning GPUGRID - too many faulty tasks, high electricity bill (even more so now that, due to higher temperatures, I need to run the air conditioning).

William Albert Joined: 22 Sep 24 Posts: 9 Credit: 241,620,851 RAC: 1,110	Message 62437 - Posted: 20 May 2025, 18:12:36 UTC - in response to Message 62436.
Quote: The next one, failing after 9 hours, with "Simulation failed: Particle coordinate is NaN": https://gpugrid.net/gpugrid/result.php?resultid=38512126 By now I am thinking about abandoning GPUGRID - too many faulty tasks, high electricity bill (even more so now that, due to higher temperatures, I need to run the air conditioning).
I have seen tasks fail with this error before, but the failure usually happens pretty quickly, and others crunching the WU also report similar failures.
If your WUs are reporting as invalid after hours of crunching, while your wingmen are able to process them successfully, I would lean toward a potential hardware stability issue (or some issue with the driver/OS/simultaneous usage that is causing these WUs to fail in this way).
That being said, if you're confident that your machines are running properly, and don't want to "waste" energy crunching these WUs, you can opt out of the ATMML project in your user preferences, without abandoning GPUGRID entirely.

Pascal Joined: 15 Jul 20 Posts: 96 Credit: 2,748,053,412 RAC: 6,063	Message 62438 - Posted: 20 May 2025, 18:33:09 UTC - in response to Message 62437.
Personally, this is what I did, because since Christmas GPUGRID has worked like a flea market. Between errors and work units that last too long, GPUGRID has become too greedy with the GPUs. It was my priority project on GPU, but I moved to something else that works properly. Einstein and Asteroids are the priority now and GPUGRID goes into 3rd position.

Erich56 Joined: 1 Jan 15 Posts: 1171 Credit: 12,662,148,501 RAC: 10,668	Message 62444 - Posted: 21 May 2025, 8:26:32 UTC - in response to Message 62437.
Quote: I have seen tasks fail with this error before, but the failure usually happens pretty quickly, and others crunching the WU also report similar failures. If your WUs are reporting as invalid after hours of crunching, while your wingmen are able to process them successfully, I would lean toward a potential hardware stability issue (or some issue with the driver/OS/simultaneous usage that is causing these WUs to fail in this way).
Interesting findings about one such task, which failed after a very long time on one of my hosts: https://gpugrid.net/gpugrid/workunit.php?wuid=31494114
In 2 cases the task failed after a few minutes, on my host 622440 it failed after 12 (!) hours, and in only 1 case was the task successful.
It's a conundrum, isn't it? These tasks are definitely kind of unpredictable :-(