ATMML
| Author | Message |
|---|---|
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
In the past few days, there have been a lot of ATMML tasks which failed after a few minutes. It's the sub-type PTP1B... Clicking on the work unit reveals that this does not only happen with my hosts, but likewise with other volunteers. |
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891
|
Yes, that batch was badly configured. Seems to have been cleared out and expired relatively soon. I'm on to a different MCL1 batch that is working correctly. |
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891
|
Looks like the latest batch of ATMML work finally has working checkpointing. This should make everyone happy with these long-running tasks. Congratz to @Quico and @[BAT]Svennemans for the code improvement. |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
I can't get my head round what this project is doing at the moment. I was sent an ATMML task at 04:30 this morning (a brand-new workunit, not a user resend). I've now completed running it, but I find that not only is the Scheduler down (as shown on the server status page), but the upload servers are also 'down for maintenance'. What's the point in preparing and sending out new work, if you're not interested in seeing if it worked? |
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891
|
The admin is preparing to move the servers. "Now that all running tasks have completed I am beginning to stop various server functions, it will take some days to move. We will update here as we make progress." |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
|
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891
|
Don't know. All the researchers belong to and comment regularly on the Discord channel. I assume that all are aware of the project move. Quico must not have stopped all of his processes, or thought he did and a task slipped out. |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
Well, the scheduler must have been running at 04:30 this morning, otherwise I couldn't have picked up the new task. And yesterday, there were three tasks running. Today, there are also three tasks running, and one user has reported a runtime in the last 24 hours - so the upload server must have been running too. Unless they manage comms in both directions in tandem, admin will never reach his aim of "all running tasks have completed". I don't get on with Discord - vastly bloated, IMO. Could you pass a message back to the team, please? I've changed my 15-minute script to one which retries uploads as well, so he only needs to switch comms back on for a few minutes. |
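A minimal sketch of what such a 15-minute retry script could look like, not the poster's actual script. It assumes `boinccmd` is on the PATH and can reach the local client, and it relies on two `boinccmd` operations: `--network_available` (retry deferred network communication, which covers stalled uploads) and `--project URL update` (contact the scheduler). `PROJECT_URL`, `poke_client` and `poke_loop` are names made up for this example; the URL must be replaced with the master URL your own client shows for the project.

```python
import subprocess
import time

# Assumption: replace with the project URL exactly as your BOINC client lists it.
PROJECT_URL = "https://www.gpugrid.net/"
INTERVAL_SECONDS = 15 * 60


def build_commands(project_url):
    """Return the boinccmd invocations for one retry cycle."""
    return [
        # Ask the client to retry deferred transfers (pending uploads included).
        ["boinccmd", "--network_available"],
        # Ask the client to contact the project's scheduler.
        ["boinccmd", "--project", project_url, "update"],
    ]


def poke_client(project_url, run=subprocess.run):
    """Run one cycle; return (command, return code or None) pairs."""
    outcomes = []
    for cmd in build_commands(project_url):
        try:
            done = run(cmd, capture_output=True, text=True, timeout=60)
            outcomes.append((cmd, done.returncode))
        except (OSError, subprocess.TimeoutExpired):
            # boinccmd missing or hung: note it and try again next cycle.
            outcomes.append((cmd, None))
    return outcomes


def poke_loop(project_url, cycles, interval=INTERVAL_SECONDS, sleep=time.sleep):
    """Repeat the cycle a fixed number of times, sleeping in between."""
    for _ in range(cycles):
        poke_client(project_url)
        sleep(interval)
```

Calling `poke_loop(PROJECT_URL, cycles=96)` would cover roughly a day. With both commands in the cycle, a short window in which the servers are reachable is enough to get a finished result uploaded and reported.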
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
> I don't get on with Discord - vastly bloated, IMO.

You can use Discord just in the web browser. You don't need to download their application. It's not any more bloated than any other webpage.
|
|
Joined: 30 Jun 14 Posts: 153 Credit: 129,654,684 RAC: 0
|
Go here for a message from Steve: https://gpugrid.net/gpugrid/forum_thread.php?id=5409&postid=62242 |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
Task failing with "energy is NaN" after more than 8 hours runtime is not nice: https://gpugrid.net/gpugrid/result.php?resultid=38499642 :-( |
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891
|
Your link is unusable. Those host errors simply mean you ran out of disk space or exceeded time limits. |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
> Your link is unusable. Those host errors simply mean you ran out of disk space or exceeded time limits.

I think that neither of your assumptions is the case: I have about 42GB free disk space - not enough to process an ATMML? As for exceeding a time limit: for ATMML this is 14-1/2 hours; I don't think so. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
> Task failing with "energy is NaN" after more than 8 hours runtime is not nice:

This is a known failure mode of these types of simulations and is not anything new. Nothing you can do about it really. If you're getting a lot of the same kind of error at very high frequency, and you notice that others are reprocessing your task successfully, then you might look into the system configuration, system overclocks, or stability problems that might be causing issues.
|
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891
|
> > Your link is unusable. Those host errors simply mean you ran out of disk space or exceeded time limits.
>
> I think that neither of your assumptions is the case: I have about 42GB free disk space - not enough to process an ATMML? As for exceeding a time limit: for ATMML this is 14-1/2 hours; I don't think so.

I've simply stated exactly what your failed tasks reported. Just look at the output file for the failed tasks and the application prints why it failed. For the tasks that reported out of disk space, that is in your control. Just allocate more space. For the tasks that reported 'exceeded time limits' you have no control, since the developer/admin has set the input file rsc_bound limit too low and the tasks are misconfigured. Nothing you can do about that unless you abort the tasks or wait for the admin to fix the task generation template with a bigger value. |
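A rough illustration of why a bound that is set too low kills long tasks. To my understanding, the BOINC client derives a task's elapsed-time limit by dividing the workunit's `rsc_fpops_bound` by its speed estimate for the app version on that host, and aborts the task once the limit is passed. That the "rsc_bound" mentioned above means `rsc_fpops_bound` is an assumption, the function names are made up for this sketch, and the numbers are hypothetical rather than taken from any real ATMML template.

```python
def max_elapsed_seconds(rsc_fpops_bound, app_flops):
    """Elapsed-time limit implied by the workunit bound and the host speed estimate."""
    if app_flops <= 0:
        raise ValueError("app_flops must be positive")
    return rsc_fpops_bound / app_flops


def exceeds_time_limit(elapsed_seconds, rsc_fpops_bound, app_flops):
    """True if a task that has run this long would be over the limit."""
    return elapsed_seconds > max_elapsed_seconds(rsc_fpops_bound, app_flops)


# Hypothetical values, for illustration only.
bound = 1e18   # rsc_fpops_bound from the workunit template
speed = 2e13   # client's estimated FLOPS for the app version on this host

limit = max_elapsed_seconds(bound, speed)
print(f"limit: {limit:.0f} s = {limit / 3600:.1f} h")   # limit: 50000 s = 13.9 h
print(exceeds_time_limit(12 * 3600, bound, speed))      # False
print(exceeds_time_limit(14.5 * 3600, bound, speed))    # True
```

With these made-up numbers a 12-hour run stays under the limit while a 14.5-hour run is aborted. Because the limit depends on a server-side value and on the client's speed estimate, the only fix is a larger bound in the template.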
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
https://gpugrid.net/gpugrid/result.php?resultid=38508162 really annoying - task failed with "RuntimeError: Simulation failed 5 times!" after 12 hours :-((( |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
The next one, failing after 9 hours, with "Simulation failed: Particle coordinate is NaN" https://gpugrid.net/gpugrid/result.php?resultid=38512126 By now I am thinking about abandoning GPUGRID - too many faulty tasks, high electricity bill (even more so at this time of year, when higher temperatures mean I need to run the air conditioning). |
|
Joined: 22 Sep 24 Posts: 9 Credit: 195,120,851 RAC: 0
|
> the next one, failing after 9 hours, with "Simulation failed: Particle coordinate is NaN"

I have seen tasks fail with this error before, but the failure usually happens pretty quickly, and others crunching the WU also report similar failures. If your WUs are reporting as invalid after hours of crunching, while your wingmen are able to successfully process them, I would lean toward a potential hardware stability issue (or some issue with the driver/OS/simultaneous usage that is causing these WUs to fail in this way). That being said, if you're confident that your machines are running properly, and don't want to "waste" energy crunching these WUs, you can opt out of the ATMML project in your user preferences, without abandoning GPUGRID entirely. |
|
Joined: 15 Jul 20 Posts: 95 Credit: 2,550,803,412 RAC: 248
|
Personally, this is what I did, because since Christmas GPUGRID has worked like a flea market. Between errors and work units that last too long, GPUGRID has become too greedy with the GPUs. It was my priority GPU project, but I moved to something else that works properly. Einstein and Asteroids have priority now and GPUGRID goes into 3rd position. |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
> I have seen tasks fail with this error before, but the failure usually happens pretty quickly, and others crunching the WU also report similar failures.

Interesting findings about such a task, which failed after a very long time on one of my hosts: https://gpugrid.net/gpugrid/workunit.php?wuid=31494114 In 2 cases the task failed after a few minutes, on my host 622440 it failed after 12 (!) hours, and in only 1 case was the task successful. It's a conundrum, isn't it? These tasks are definitely kind of unpredictable :-( |
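The rule of thumb discussed in this thread (compare your own failure with what the other hosts on the same work unit reported) can be written down as a small sketch. This is only an illustration of that heuristic, not a GPUGRID tool: `HostResult` and `classify_failure` are hypothetical names, and the host IDs other than 622440 are placeholders.

```python
from dataclasses import dataclass


@dataclass
class HostResult:
    host_id: int
    succeeded: bool


def classify_failure(results, my_host_id):
    """Guess whether a failure points at the host, the work unit, or neither."""
    mine = [r for r in results if r.host_id == my_host_id]
    others = [r for r in results if r.host_id != my_host_id]
    if not mine or all(r.succeeded for r in mine):
        return "no failure on this host"
    if not others:
        return "inconclusive"
    others_ok = sum(1 for r in others if r.succeeded)
    if others_ok == len(others):
        return "suspect host"        # everyone else finished the same WU
    if others_ok == 0:
        return "suspect work unit"   # nobody could finish it
    return "inconclusive"            # mixed outcomes


# The work unit described above: two quick failures, one failure after
# 12 hours on host 622440, one success. Other host IDs are placeholders.
wu = [
    HostResult(1, False),
    HostResult(2, False),
    HostResult(622440, False),
    HostResult(3, True),
]
print(classify_failure(wu, 622440))   # inconclusive
```

For this work unit the mixed outcomes give no clear signal, which matches the "conundrum" above. A pattern only becomes meaningful over many work units, for example one host failing repeatedly on work units that everyone else completes.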
©2025 Universitat Pompeu Fabra