Number crunching: Error While Computing

Joined: 21 Mar 20 Posts: 6 Credit: 53,007,324 RAC: 0

The vast majority of the units my computer completes have been reported as 'Error While Computing'. This has been going on for a few months. For a while, a few weeks ago, the units seemed much smaller and took only a few hours to complete. These seemed to validate much more often than the large units that take a couple of days of crunching. Is there a larger reason for this, or is it a problem with my machine?

Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891

The acemd4 and Python tasks are still being debugged by the admins and developers, so there are still lots of errors, and nothing is wrong with your host. The acemd3 tasks have been stable for over a year, so they should validate on everyone's hardware. Only investigate your hardware if the errors are with acemd3 tasks.

Joined: 21 Mar 20 Posts: 6 Credit: 53,007,324 RAC: 0

I think the only tasks I've gotten have been ACEMD3. Some validate, but many show 'Error while computing'. What could cause this on my end?

Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1

> What could cause this on my end?

Do you overclock your GPU? What's the temperature of the GPU?

Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428

> The acemd3 tasks have been stable for over a year.

And one of mine has just crashed on a normally stable computer: Result 32884789, exit code 0, "Incorrect function", after 5 seconds. The acemd3 application normally has a usage lifetime of around a year before it needs a software licence renewal. Are we reaching that time again? It shouldn't be: it was last refreshed on 10 Nov 2021.

Joined: 30 Jun 14 Posts: 153 Credit: 129,654,684 RAC: 0

Just to piggyback on this thread with something else: comparing run time with CPU time on an ACEMD3 task (over 2 days of run time so far, still well short of the deadline), I am seeing a 2-hour difference between the two. I have never seen that on my other projects. Everything is running OK, but is this normal? The two hours?

Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423

> Just to piggyback on this thread with something else

I would say no, that's not normal. I'm going to guess that you're also running the CPU at 100% utilization on some CPU project? That's probably the reason: you're starving the GPU task of CPU resources.

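To make the reply's reasoning concrete: assuming, as the reply implies, that a healthy ACEMD3 task keeps its CPU thread busy for essentially the whole run, CPU time should track run time closely, and any gap is time the task spent waiting for a CPU instead of feeding the GPU. The sketch below only does that arithmetic; `cpu_share` is a hypothetical helper, and the 48-hour figure is an illustrative reading of "2 days plus", not a measured value.

```python
def cpu_share(run_time_h: float, cpu_time_h: float) -> float:
    """Fraction of wall-clock run time during which the task actually had a CPU."""
    if run_time_h <= 0:
        raise ValueError("run time must be positive")
    return cpu_time_h / run_time_h


if __name__ == "__main__":
    # Illustrative numbers only: "2 days plus" of run time taken as 48 h,
    # with the 2-hour gap between run time and CPU time reported above.
    run_h = 48.0
    cpu_h = run_h - 2.0
    share = cpu_share(run_h, cpu_h)
    print(f"CPU share: {share:.1%}, waiting for a CPU: {run_h - cpu_h:.1f} h")
```

With those illustrative numbers the task had a CPU for roughly 96% of its run time, meaning about two hours in which it was waiting rather than keeping the GPU busy.
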
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891

> I think the only tasks I've gotten have been ACEMD3. Some validate, many show an error while computing. What could cause this on my end?

Looking at your error log, after `08:26:39 (15796): wrapper: running bin/acemd3.exe (--boinc --device 0)` it reports `Detected memory leaks!`. You are having issues with either a hot GPU, a hot CPU, or flaky memory. These are the typical causes of memory errors.

Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423

> I think the only tasks I've gotten have been ACEMD3. Some validate, many show an error while computing. What could cause this on my end?

You quoted the wrong issue. `Detected memory leaks` is ubiquitous in the Windows ACEMD3 app. Even successful runs show that error. It's benign and not indicative of any problem. His real issue is here:

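A minimal sketch of the filtering this reply implies, assuming you have pasted a task's stderr output into a string: it drops the `Detected memory leaks!` line, which the reply says appears even on successful Windows runs, so the lines that remain are the ones worth reading. `interesting_lines` and `BENIGN_MARKERS` are hypothetical names, and the marker list is an assumption drawn from this thread, not an official GPUGRID or BOINC list.

```python
# Hypothetical list of known-benign messages; based only on this thread,
# not on any official GPUGRID or BOINC documentation.
BENIGN_MARKERS = ("Detected memory leaks!",)


def interesting_lines(stderr_text: str) -> list[str]:
    """Return non-empty stderr lines that are not known-benign noise."""
    kept = []
    for line in stderr_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        if any(marker in stripped for marker in BENIGN_MARKERS):
            continue  # benign on the Windows ACEMD3 app, per the reply
        kept.append(stripped)
    return kept


if __name__ == "__main__":
    # Excerpt quoted earlier in this thread.
    sample = (
        "08:26:39 (15796): wrapper: running bin/acemd3.exe (--boinc --device 0)\n"
        "Detected memory leaks!\n"
    )
    for line in interesting_lines(sample):
        print(line)
```

Run on the excerpt quoted earlier, only the wrapper's `running bin/acemd3.exe` line survives, which fits the point that the memory-leak message is not the real error.
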
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891

Thanks for the correction. I wasn't aware that the memory leak message is so common on Windows hosts.

Joined: 21 Mar 20 Posts: 6 Credit: 53,007,324 RAC: 0

OK, since Keith Myers quoted me: are you saying I have a different problem on my end, or that there is no problem on my end?

Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891

You had a problem with the task configuration. It was a server issue, not a hardware issue on your end after all.

Joined: 21 Mar 20 Posts: 6 Credit: 53,007,324 RAC: 0

Thanks. Incidentally, I'm also getting 'Error while computing' on Rosetta@Home units...

Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891

Then something is wrong with the tasks' Python environment, I guess; I believe Rosetta is running Python tasks too. But there is still nothing wrong on your end. It's up to the project to package all the Python bits needed to crunch the task and send it to you properly.

Joined: 30 Jun 14 Posts: 153 Credit: 129,654,684 RAC: 0

`ERROR: C:\Users\admin\miniconda3\conda-bld\acemd3_1632736748005\work\src\mdsim\context.cpp line 318: Cannot use a restart file on a different device!`

http://www.gpugrid.net/result.php?resultid=32884878

ACEMD3 task exit status: 195 (0xc3) EXIT_CHILD_FAILED

Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423

> `ERROR: C:\Users\admin\miniconda3\conda-bld\acemd3_1632736748005\work\src\mdsim\context.cpp line 318: Cannot use a restart file on a different device!`

This is a well-known issue. You can't restart the task on a different GPU. Basically, you can't interrupt a running task at all.

Joined: 30 Jun 14 Posts: 153 Credit: 129,654,684 RAC: 0

> `ERROR: C:\Users\admin\miniconda3\conda-bld\acemd3_1632736748005\work\src\mdsim\context.cpp line 318: Cannot use a restart file on a different device!`

I have had the same error. At the end of my computing day I suspend the task, shut down the client, and exit via the menu. The next morning I start up again and the task resumes on the same GPU. But half a day or a full day later, it crashes.

Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423

> `ERROR: C:\Users\admin\miniconda3\conda-bld\acemd3_1632736748005\work\src\mdsim\context.cpp line 318: Cannot use a restart file on a different device!`

You can see in your task log that it actually restarted on a different GPU; that's why it failed. In the log starting at `08:01:06 (9168): wrapper (7.9.26016): starting`, it started on device 1, then the final restart happened on device 0. I would recommend not restarting your computer until the GPUGRID task finishes. I've even seen this issue happen when restarting on the same GPU after something like a driver update. Just don't interrupt the task at all.
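
A rough sketch of the log check described in this reply, assuming the stderr log contains wrapper lines of the form `wrapper: running bin/acemd3.exe (--boinc --device N)`, as in the excerpt quoted earlier in this thread. It collects the device index from every (re)start and reports whether the task moved between GPUs. The regex and the function names are hypothetical, and the sample log is made up for illustration. It only reads a log after the fact; it cannot catch the same-GPU failure after a driver update mentioned above.

```python
import re

# Assumed log format, based on the wrapper line quoted in this thread:
#   08:26:39 (15796): wrapper: running bin/acemd3.exe (--boinc --device 0)
DEVICE_RE = re.compile(r"running bin/acemd3(?:\.exe)? \([^)]*--device (\d+)\)")


def devices_used(stderr_text: str) -> list[int]:
    """Return the GPU index used at each (re)start, in log order."""
    return [int(m.group(1)) for m in DEVICE_RE.finditer(stderr_text)]


def moved_between_gpus(stderr_text: str) -> bool:
    """True if the task was (re)started on more than one distinct device."""
    return len(set(devices_used(stderr_text))) > 1


if __name__ == "__main__":
    # Made-up log for illustration (not from a real task): first start on
    # device 1, later restart on device 0, the pattern described in the reply.
    log = (
        "00:00:00 (1111): wrapper: running bin/acemd3.exe (--boinc --device 1)\n"
        "12:00:00 (2222): wrapper: running bin/acemd3.exe (--boinc --device 0)\n"
    )
    print(devices_used(log))
    print(moved_between_gpus(log))
```

Using a set of device indices keeps the check independent of how many times the task was restarted: any restart on a second GPU is enough to flag it.
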
|
©2025 Universitat Pompeu Fabra