All ATM beta error out
| Author | Message |
|---|---|
JStateson Joined: 31 Oct 08 Posts: 186 Credit: 3,578,903,157 RAC: 0
|
Pair RTX-2070: 6/26/2023 10:58:22 PM CUDA: NVIDIA GPU 0: NVIDIA GeForce RTX 2070 SUPER (driver version 535.98, CUDA version 12.2, compute capability 7.5, 8192MB, 8192MB available, 9216 GFLOPS peak); 6/26/2023 10:58:22 PM CUDA: NVIDIA GPU 1: NVIDIA GeForce RTX 2070 SUPER (driver version 535.98, CUDA version 12.2, compute capability 7.5, 8192MB, 8192MB available, 9216 GFLOPS peak). Einstein and Asteroids run fine, but 9 out of 10 GPUGRID tasks terminate after about 90 seconds. All 78 tasks show up as errors on the website. I have no idea what the problem is: http://www.gpugrid.net/results.php?hostid=605305&offset=60&show_names=0&state=0&appid= Try my performance program, the BoincTasks History Reader. Find and read about it here: https://forum.efmer.com/index.php?topic=1355.0 |
robertmiles Joined: 16 Apr 09 Posts: 503 Credit: 769,991,668 RAC: 0
|
I often see a similar problem: https://www.gpugrid.net/result.php?resultid=33631003 Also, these workunits usually sit at 100% completion for hours before they finish. |
|
Joined: 6 Mar 18 Posts: 38 Credit: 1,340,042,080 RAC: 33
|
Failing for me as well on both of my systems: Linux Mint with an Nvidia A4000, and Windows 11 with an RTX 4090. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
Working fine on all my systems (Linux + various Nvidia GPUs), with roughly a 7% error rate, which is in line with previous batches.
|
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891
|
I have never been able to decipher most of the errors on Windows hosts. But the error on your Linux A4000 host was simply that you interrupted the task during its run. ATMbeta tasks can't be interrupted, paused or restarted once they have begun, or the task will error out. If you let the tasks run without any stoppage, they will complete just fine and validate. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
The Windows issue might be a permissions issue or something. All of Ryan's Windows tasks end with "cmd.exe exited", which might be due to BOINC lacking permissions to call the local system's cmd.exe? Try running BOINC as Administrator. Just a guess; maybe that's not it. As for his Linux task, he might not have interrupted it intentionally. It looks like the app (or the wrapper) hit a segfault for some reason after running for about 4 minutes. Then it restarted, and of course it will fail after a restart, as they always do.
|
|
Joined: 6 Mar 18 Posts: 38 Credit: 1,340,042,080 RAC: 33
|
Quote: "the windows issue might be a permissions issue or something. all Ryan's windows tasks end with 'cmd.exe exited' which might be due to boinc lacking permissions to call the local system's cmd.exe?" Just had another one fail on the Linux box. The machine was not touched between the task being downloaded and me seeing it sitting there with a computation error; it failed in 42s. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
Quote: "the windows issue might be a permissions issue or something. all Ryan's windows tasks end with 'cmd.exe exited' which might be due to boinc lacking permissions to call the local system's cmd.exe?" The one that just failed today actually ran for over an hour. See here. It failed with the common "energy is NaN" error, which happens occasionally. You just had bad luck that it happened on one of the first tasks you've run on this batch. Nothing is wrong with your system; this just happens sometimes (to everyone). The one that failed in "42 seconds" was from yesterday. It also actually ran for about 4 minutes, but the timer reset when the task tried to restart (which is what ultimately caused the computation error). You can clearly see the real runtime from the timestamps in the logs. See here. Started at 09:26:01; segfault at about 9:30:xx; restarted at 9:30:26; error at 9:30:44. All very understandable and explainable given the current idiosyncrasies of the ATMbeta app and what specifically happened on your system.
|
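For anyone who wants to check the real runtime themselves, here is a minimal sketch of the arithmetic using only the timestamps quoted in the post above. It assumes all three timestamps fall on the same day; the variable names are just for illustration and are not taken from any BOINC or GPUGRID log format.

```python
from datetime import datetime

# Timestamps as quoted in the post above (same day assumed).
FMT = "%H:%M:%S"
started = datetime.strptime("09:26:01", FMT)
restarted = datetime.strptime("09:30:26", FMT)
errored = datetime.strptime("09:30:44", FMT)

first_run = restarted - started      # wall-clock time before the restart
after_restart = errored - restarted  # time between the restart and the error
total = errored - started            # full span from first start to error

print("before restart:", first_run)
print("after restart: ", after_restart)
print("total:         ", total)
```

The span before the restart comes out at a bit under four and a half minutes, which matches the "about 4 mins" of real runtime described in the post.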
|
Joined: 21 Jan 10 Posts: 46 Credit: 1,388,234,528 RAC: 0
|
I'm seeing the same problem with ATMbeta. They error out in the first 5 minutes, then go to 100%, then take a long time to error out. I just upgraded to an RTX 4080 from my GTX 980 Ti, as I was getting a lot of failures from Einstein tasks. That has stopped, but the only GPU tasks erroring are from this project. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
Quote: "I'm seeing the same problem with ATMbeta. They error out in the first 5 minutes then go to 100% then take a long time to error out." All of your tasks from the last few days completed successfully. You have no errors listed on your host details.
|
|
Joined: 21 Jan 10 Posts: 46 Credit: 1,388,234,528 RAC: 0
|
Yes, because it's STILL running at 100% after over an hour. It was supposed to run for 28 days, then went to 100% in 5 minutes, and now it has been running for 1h22m at 100% complete and has not terminated. The task manager is still showing high utilization on the task. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
This behavior has been known and discussed for like 8 months now. The first segments of these tasks (those with “0-5” or “0-10” in the filename) will show normal progress. Subsequent segment tasks with “1-5” or “2-5” or “1-10” or “2-10” etc. will all jump to 100% immediately due to a bug in how the application reports progress. If you leave it alone and don’t touch it, it will complete successfully. The tasks you have on your host now are indeed this type, with both being listed as “2-10” units. This is normal/expected. And you can expect the same with the rest of the batch all the way to the 9-10s. Just leave them be and let them do their thing. It’s not an error unless BOINC says “Computation Error”, which it hasn’t.
|
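The segment rule described above can be sketched as a small check. This is only an illustration: it assumes you have already picked the segment label (such as "0-5" or "2-10") out of the task filename by hand, and the function names are hypothetical, not part of BOINC or the ATMbeta app.

```python
def parse_segment(label: str) -> tuple[int, int]:
    """Split a segment label like "2-10" into (segment index, segment count)."""
    index_text, count_text = label.split("-")
    return int(index_text), int(count_text)


def progress_is_reliable(label: str) -> bool:
    """Per the post above, only the first segment (index 0) shows normal progress.

    Later segments jump to 100% immediately because of the progress-reporting
    bug, so the percentage shown for them says nothing about time remaining.
    """
    index, _count = parse_segment(label)
    return index == 0


for label in ["0-5", "0-10", "1-5", "2-10"]:
    state = "normal progress" if progress_is_reliable(label) else "expect 100% right away"
    print(label, "->", state)
```

Either way, per the post, the task is only a failure if BOINC itself reports "Computation Error".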
|
Joined: 6 Mar 18 Posts: 38 Credit: 1,340,042,080 RAC: 33
|
Looks like units are completing fine now. |
|
Joined: 21 Jan 10 Posts: 46 Credit: 1,388,234,528 RAC: 0
|
You spoke too soon: https://www.gpugrid.net/workunit.php?wuid=27605492 |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
Quote: "You spoke too soon:" Not really. You hit the "energy is NaN" error, which is also well known to happen. It happens to everyone occasionally. About 7% of my results hit this error at some point (69 out of 1102 completed results), sometimes 5 minutes in, sometimes 5 hours in. It usually means that there is something wrong with the task setup (from the project), though sometimes it will complete successfully on another host on a resend. That's not the same issue that some folks are having where all tasks error out. That usually happens with Windows hosts, unfortunately, and might be a permissions issue or some other fringe thing with Windows that hasn't been hashed out yet by the project (or users). Some Windows users seem to have little issue and others seem to have only issues. But your host is in general working as expected, since you've submitted successful tasks. Hanging at 100% is NOT the root cause of the error you got. It's simply an idiosyncrasy of the higher-level task segments. Two totally different things are at play there. Your other tasks which also stuck at 100% completed fine.
|
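As a quick sanity check on the numbers quoted above, here is the arithmetic spelled out. The post does not say whether the 1102 results include the 69 errored ones, so the sketch below computes both readings; that ambiguity is an assumption on my part, not something stated in the thread.

```python
errors = 69
completed = 1102

# Reading 1: the 69 errors are part of the 1102 results.
rate_included = errors / completed * 100

# Reading 2: the 1102 results are in addition to the 69 errors.
rate_separate = errors / (completed + errors) * 100

print(f"errors counted within the total: {rate_included:.1f}%")
print(f"errors counted on top of total:  {rate_separate:.1f}%")
```

Both readings land at roughly 6%, in the same ballpark as the "about 7%" mentioned in the post.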
|
Joined: 21 Jan 10 Posts: 46 Credit: 1,388,234,528 RAC: 0
|
Thanks for the background. I would, however, reconsider the term "well known": it's clearly not well known to the casual participant. Has anyone considered or tabulated the wasted power, time, and corresponding CO2 emissions from these broken WUs? Wasting resources on crypto is one thing, but we shouldn't be wasting volunteers' resources. These add up. Another just hit: https://www.gpugrid.net/workunit.php?wuid=27606226 |
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891
|
If you are concerned, then you can stop your consternation by simply removing this project. |
|
Joined: 21 Jan 10 Posts: 46 Credit: 1,388,234,528 RAC: 0
|
Done. Thanks! |
|
Joined: 6 Mar 18 Posts: 38 Credit: 1,340,042,080 RAC: 33
|
Still failing for me, so I've blocked the 4090 machine for now. Has anyone on the 40xx series got them working? If so, what magic are you using? |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
Quote: "Still failing for me, blocked the 4090 machine for now, anyone on 40xx series get them working? if so what magic are you using?" Did you try running BOINC as Administrator so that you have elevated privileges? Or switch to Linux.
|
©2025 Universitat Pompeu Fabra