many faulty ATMMLs recently
| Author | Message |
|---|---|
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
Within the past few hours, I've been receiving several ATMML tasks which errored out after about 1 1/2 minutes, e.g.: https://www.gpugrid.net/result.php?resultid=37138619 Viewing the workunit reveals that these tasks failed on other hosts, too. What's happening (besides the upload problem due to "disk full")? |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
If you look down towards the end of that task report, you can read: "bzip2: Compressed file ends unexpectedly", which is pretty self-explanatory. It may be an effect of the lack of disk space: to make an archive, you need enough space for both the source file and the compressed version at the same time.

The depressing thing about this sequence of errors is that this is a project which has said many times that its research results are important and cutting-edge, and that speed is of the essence. They use many tools to encourage us to send the results back as quickly as possible: the 24-hour return bonus, manipulating task sizes in a way which discourages large (idle) caches, and so on.

Some of the applications process a sequence of tasks based on a single dataset: later tasks are generated from the results of the early runs. But not if we can't return them, so work supply grinds to a halt too. And as others have pointed out, this has happened before: this outage was entirely predictable and manageable. They need to keep an eye on administration, as well as research. |
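For anyone who wants to check an archive for this kind of truncation locally, here is a minimal Python sketch. The function name `is_complete_bz2` is made up for illustration and is not part of the GPUGRID application; it only assumes the standard `bz2` module, which raises `EOFError` when a stream stops before its end-of-stream marker.

```python
import bz2


def is_complete_bz2(path, chunk_size=1 << 20):
    """Return True if the bzip2 file at `path` decompresses to a clean end.

    Hypothetical helper: reads the whole stream in chunks and discards the
    output, so only the integrity of the archive is checked.
    """
    try:
        with bz2.open(path, "rb") as fh:
            while fh.read(chunk_size):
                pass
        return True
    except EOFError:
        # Stream ended before the end-of-stream marker (truncated file).
        return False
    except OSError:
        # Data is not a valid bzip2 stream at all.
        return False
```

A check like this only tells you that the file is damaged, not where the damage happened (on the server when the archive was written, or in transit).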
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
"And as others have pointed out, this has happened before: this outage was entirely predictable and manageable. They need to keep an eye on administration, as well as research."
+1 |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
"If you look down towards the end of that task report, you can read:"
I investigated, and my finding is that the problem is NOT caused by a lack of disk space. As I noticed a long time ago, these ATMMLs use up to ~14.4 GB of disk space in the first minutes after start; right after the compressed file has been decompressed it is deleted, and from then on disk usage stays slightly above 9 GB for the remaining processing time. The host on which these tasks error out after a few minutes has about 45 GB of disk space left for GPUGRID, so there cannot possibly be a disk space problem there. Furthermore, a look at the workunit of such a task reveals that it has failed on other hosts as well, see here: https://www.gpugrid.net/workunit.php?wuid=30372600
Hence, my finding is that these tasks are most probably misconfigured :-( |
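The arithmetic behind that conclusion can be sketched in a few lines of Python. The figures (about 14.4 GB peak, about 45 GB free) come from the post above; the function names and the 20% safety margin are assumptions made for illustration. BOINC's own disk-usage preferences may impose a tighter limit than the raw free space on the volume.

```python
import shutil

GB = 10**9  # decimal gigabytes, matching the rough figures in the post


def enough_space(free_bytes, peak_bytes, margin=1.2):
    """True if free space covers the peak requirement plus a safety margin."""
    return free_bytes >= peak_bytes * margin


def check_path(path, peak_bytes, margin=1.2):
    """Apply enough_space() to the free space of the volume holding `path`."""
    return enough_space(shutil.disk_usage(path).free, peak_bytes, margin)


if __name__ == "__main__":
    peak = 14.4 * GB  # compressed archive plus unpacked files, per the post
    print(enough_space(45 * GB, peak))  # prints True: 45 GB covers 14.4 GB
```

To test a real host, call `check_path()` with the BOINC data directory and the same peak figure.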
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
I was suggesting that the server might be struggling when creating the archive, not that your machine introduced the errors when unpacking it. But the effect's the same - it doesn't work, and we can't fix it 'in the field'. |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
Although the server should have been okay since yesterday, I still get tasks which error out after a few minutes, like: https://www.gpugrid.net/result.php?resultid=37154558
Excerpt from stderr:
File "C:\ProgramData\BOINC\slots\0\Lib\site-packages\sync\worker.py", line 124, in run
raise RuntimeError(f"Simulation failed {ntry} times!")
RuntimeError: Simulation failed 5 times!
Any explanation for this behaviour? |
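The message in that traceback looks like the result of a bounded retry loop. The following Python sketch shows the general pattern only; it is NOT the actual code of `worker.py`, and `run_with_retries`, `simulate` and `max_tries` are hypothetical names chosen for illustration.

```python
def run_with_retries(simulate, max_tries=5):
    """Call simulate() up to max_tries times.

    Hypothetical sketch: returns the first successful result, and raises
    RuntimeError only after every attempt has failed.
    """
    ntry = 0
    while ntry < max_tries:
        try:
            return simulate()
        except Exception as exc:
            ntry += 1
            print(f"Attempt {ntry} failed: {exc}")
    raise RuntimeError(f"Simulation failed {ntry} times!")
```

With a pattern like this, the time until the final error depends on how far each attempt gets before it fails, which would be consistent with some tasks giving up within minutes and others only after hours.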
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
GIGO - Garbage In, Garbage Out. |
|
Joined: 29 Aug 24 Posts: 71 Credit: 3,321,790,989 RAC: 1,408
|
True, but at least the garbage is out pretty quickly. I'm thankful. |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
"True, but at least the garbage is out pretty quickly. I'm thankful."
If such faulty tasks error out a few minutes after they start, it's okay with me. However, if a task runs for about 3 1/2 hours and then crashes: https://www.gpugrid.net/result.php?resultid=37190122 it's a real waste of resources :-( |
|
Joined: 29 Aug 24 Posts: 71 Credit: 3,321,790,989 RAC: 1,408
|
After I sent the previous message, I also got a few "Simulation failed 5 times!" errors after a long run time. Sigh. |
©2025 Universitat Pompeu Fabra