Error While Computing

Author	Message
jkdma Joined: 21 Mar 20 Posts: 6 Credit: 53,007,324 RAC: 0	Message 58656 - Posted: 16 Apr 2022, 17:02:55 UTC
The vast majority of the units my computer completes have been reported as 'Error While Computing'. This has been going on for a few months. For a while a few weeks ago, the units seemed to be much smaller and took only a few hours to complete. These seemed to be validated much more often than the large units that take a couple of days of crunching. Is there a larger reason for this, or is it a problem with my machine?

Keith Myers Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 0	Message 58657 - Posted: 17 Apr 2022, 1:31:09 UTC
The acemd4 and python tasks are still being debugged by the admin developers, so expect lots of errors from them; those errors do not mean anything is wrong with your host. The acemd3 tasks have been stable for over a year, so they should validate on everyone's hardware. Only investigate your hardware if the errors are with acemd3 tasks.

jkdma Joined: 21 Mar 20 Posts: 6 Credit: 53,007,324 RAC: 0	Message 58658 - Posted: 17 Apr 2022, 4:52:42 UTC - in response to Message 58657. Last modified: 17 Apr 2022, 5:30:24 UTC
I think the only tasks I've gotten have been ACEMD3. Some validate, many show an error while computing. What could cause this on my end??

Erich56 Joined: 1 Jan 15 Posts: 1171 Credit: 12,662,148,501 RAC: 10,668	Message 58659 - Posted: 17 Apr 2022, 5:40:09 UTC - in response to Message 58658. Last modified: 17 Apr 2022, 5:40:25 UTC
Quote (Message 58658): What could cause this on my end??
Do you overclock your GPU? What's the temperature of the GPU?

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 58660 - Posted: 17 Apr 2022, 7:50:16 UTC - in response to Message 58657.
Quote (Message 58657): The acemd3 tasks have been stable for over a year.
And one of mine has just crashed on a normally stable computer. Result 32884789: exit code 0, "Incorrect function", after 5 seconds. The acemd3 application normally has a usage lifetime of around a year before it needs a software licence renewal. Are we reaching that time again? Shouldn't be - it was last refreshed on 10 Nov 2021.

Greg _BE Joined: 30 Jun 14 Posts: 154 Credit: 131,154,684 RAC: 27	Message 58661 - Posted: 17 Apr 2022, 7:56:05 UTC
Just to piggyback on this thread with something else: run time vs. CPU time. Over the course of working on an ACEMD3 task (2 days plus of run time, still far from the deadline), I am seeing a 2-hour difference between the two. I have never seen that on my other projects. Everything is running OK, but is the two-hour gap normal?

Ian&Steve C. Joined: 21 Feb 20 Posts: 1117 Credit: 40,876,970,595 RAC: 0	Message 58662 - Posted: 17 Apr 2022, 12:49:41 UTC - in response to Message 58661.
Quote (Message 58661): Just to piggyback on this thread with something else: run time vs. CPU time. Over the course of working on an ACEMD3 task (2 days plus of run time, still far from the deadline), I am seeing a 2-hour difference between the two. I have never seen that on my other projects. Everything is running OK, but is the two-hour gap normal?
I would say no, that's not normal. I'm going to guess that you're also running the CPU at 100% utilization on some CPU project? That's probably the reason: you're starving the GPU task of CPU resources.
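For anyone who wants to put a number on the gap discussed above, here is a minimal Python sketch. The 48 h and 46 h figures are hypothetical stand-ins (the post only says "2 days plus" of run time and a 2-hour difference), and `cpu_time_gap` is an illustrative helper, not part of BOINC or GPUGRID.

```python
def cpu_time_gap(run_time_s, cpu_time_s):
    """Return (gap in seconds, CPU time as a fraction of run time)."""
    if run_time_s <= 0:
        raise ValueError("run_time_s must be positive")
    return run_time_s - cpu_time_s, cpu_time_s / run_time_s


# Hypothetical values: 48 h of run time with 46 h of CPU time (a 2 h gap).
gap_s, ratio = cpu_time_gap(48 * 3600, 46 * 3600)
print(f"gap: {gap_s / 3600:.1f} h, CPU/run ratio: {ratio:.3f}")
```

A ratio that drifts well below 1 on a task that normally keeps a CPU core busy would be consistent with the CPU-starvation guess above, though it does not prove it.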

Keith Myers Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 0	Message 58663 - Posted: 17 Apr 2022, 16:23:27 UTC - in response to Message 58658.
Quote (Message 58658): I think the only tasks I've gotten have been ACEMD3. Some validate, many show an error while computing. What could cause this on my end??
Looking at your error:
08:26:39 (15796): wrapper: running bin/acemd3.exe (--boinc --device 0)
Detected memory leaks!
You are having issues with either a hot GPU, a hot CPU or flaky memory. These are the typical issues that cause memory errors.

Ian&Steve C. Joined: 21 Feb 20 Posts: 1117 Credit: 40,876,970,595 RAC: 0	Message 58664 - Posted: 17 Apr 2022, 16:32:21 UTC - in response to Message 58663. Last modified: 17 Apr 2022, 16:35:47 UTC
Quote (Message 58663): Looking at your error: 08:26:39 (15796): wrapper: running bin/acemd3.exe (--boinc --device 0) Detected memory leaks! You are having issues with either a hot gpu, hot cpu or flaky memory. These are the typical issues that cause memory errors.
You quoted the wrong issue. "Detected memory leaks" is ubiquitous in the Windows ACEMD3 app. Even successful runs show that message. It’s benign and not indicative of any problem. His real issue is here:
01:56:42 (4340): wrapper: running bin/acemd3.exe (--boinc --device 0)
ERROR: C:\Users\admin\miniconda3\conda-bld\acemd3_1632736748005\work\src\mdsim\trajectory.cpp line 103: Cannot open XTC file
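The distinction made above (ignore the benign "Detected memory leaks!" notice, look for the `ERROR:` line) can be sketched in Python roughly as follows. This is only an illustration: `find_fatal_errors` and `BENIGN_MARKERS` are made-up names, the sample text is stitched together from the two log excerpts quoted in this thread, and the real stderr output of a task may be laid out differently.

```python
# Assumption: markers that the Windows ACEMD3 app prints even on successful
# runs, per the post above. Extend the tuple if other benign notices turn up.
BENIGN_MARKERS = ("Detected memory leaks!",)


def find_fatal_errors(stderr_text):
    """Return lines starting with 'ERROR:', skipping known-benign notices."""
    fatal = []
    for line in stderr_text.splitlines():
        stripped = line.strip()
        if any(marker in stripped for marker in BENIGN_MARKERS):
            continue  # benign noise, not the cause of the failure
        if stripped.startswith("ERROR:"):
            fatal.append(stripped)
    return fatal


# Sample assembled from the excerpts quoted in this thread (line breaks assumed).
sample = "\n".join([
    r"01:56:42 (4340): wrapper: running bin/acemd3.exe (--boinc --device 0)",
    r"Detected memory leaks!",
    r"ERROR: C:\Users\admin\miniconda3\conda-bld\acemd3_1632736748005\work\src\mdsim\trajectory.cpp line 103: Cannot open XTC file",
])

for line in find_fatal_errors(sample):
    print(line)
```

Run against a task's stderr text, this would surface only the line that actually explains the failure.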

Keith Myers Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 0	Message 58665 - Posted: 17 Apr 2022, 18:39:57 UTC
Thanks for the correction. I wasn't aware that memory leaks are a common problem on Windows hosts.

jkdma Joined: 21 Mar 20 Posts: 6 Credit: 53,007,324 RAC: 0	Message 58667 - Posted: 18 Apr 2022, 1:52:19 UTC - in response to Message 58664.
OK, since Keith Myers quoted me, are you saying I have a different problem on my end, or that there is no problem on my end?

Keith Myers Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 0	Message 58668 - Posted: 18 Apr 2022, 5:25:53 UTC - in response to Message 58667.
You had a problem with the task configuration. That is a server issue, not a hardware issue on your end after all.

jkdma Joined: 21 Mar 20 Posts: 6 Credit: 53,007,324 RAC: 0	Message 58670 - Posted: 18 Apr 2022, 16:16:30 UTC - in response to Message 58668.
Thanks. Incidentally, I'm also getting 'error while computing' issues on Rosetta@Home units...

Keith Myers Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 0	Message 58671 - Posted: 18 Apr 2022, 17:10:14 UTC - in response to Message 58670.
Then something is wrong with the Python environment, I guess. Rosetta is also doing Python tasks, I believe. But there is still nothing wrong on your end: it is up to the project to package all the Python bits necessary to crunch the task and send them to you properly.

Greg _BE Joined: 30 Jun 14 Posts: 154 Credit: 131,154,684 RAC: 27	Message 58697 - Posted: 21 Apr 2022, 17:39:46 UTC Last modified: 21 Apr 2022, 17:40:13 UTC
ERROR: C:\Users\admin\miniconda3\conda-bld\acemd3_1632736748005\work\src\mdsim\context.cpp line 318: Cannot use a restart file on a different device!
http://www.gpugrid.net/result.php?resultid=32884878
ACEMD3 task, exit status 195 (0xc3) EXIT_CHILD_FAILED

Ian&Steve C. Joined: 21 Feb 20 Posts: 1117 Credit: 40,876,970,595 RAC: 0	Message 58698 - Posted: 21 Apr 2022, 22:34:28 UTC - in response to Message 58697.
Quote (Message 58697): ERROR: C:\Users\admin\miniconda3\conda-bld\acemd3_1632736748005\work\src\mdsim\context.cpp line 318: Cannot use a restart file on a different device! http://www.gpugrid.net/result.php?resultid=32884878 ACEMD3 task, exit status 195 (0xc3) EXIT_CHILD_FAILED
This is a well-known issue. You can’t restart the task on a different GPU. Basically, you can’t interrupt a running task at all.

Greg _BE Joined: 30 Jun 14 Posts: 154 Credit: 131,154,684 RAC: 27	Message 58702 - Posted: 22 Apr 2022, 18:00:44 UTC - in response to Message 58698.
Quote (Message 58698): This is a well known issue. You can’t restart the task on a different GPU. Basically can’t interrupt a running task at all.
I have had the same error. I suspend and shut down the client and exit via the menu at the end of my computing day. The next morning I start up again and the task resumes on the same GPU. But half a day or a full day later, it crashes.

Ian&Steve C. Joined: 21 Feb 20 Posts: 1117 Credit: 40,876,970,595 RAC: 0	Message 58703 - Posted: 22 Apr 2022, 18:03:57 UTC - in response to Message 58702.
Quote (Message 58702): I have had the same error, I suspend and shut down the client and exit via the menu at the end of my computing day. The next morning I start up again and the task resumes on the same GPU. But a half day later for full day later then it crashes.
You can see in your task log that it actually restarted on a different GPU. That's why it failed.
08:01:06 (9168): wrapper (7.9.26016): starting
08:01:06 (9168): wrapper: running bin/acemd3.exe (--boinc --device 1)
Detected memory leaks!
Dumping objects ->
..\api\boinc_api.cpp(309) : {389760} normal block at 0x000001DBD2145AF0, 8 bytes long.
Data: < > 00 00 CC D3 DB 01 00 00
..\lib\diagnostics_win.cpp(417) : {388486} normal block at 0x000001DBD2175230, 1080 bytes long.
Data: <8 $ > 38 1E 00 00 CD CD CD CD 24 01 00 00 00 00 00 00
..\zip\boinc_zip.cpp(122) : {153} normal block at 0x000001DBD2150860, 260 bytes long.
Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Object dump complete.
08:37:06 (15692): wrapper (7.9.26016): starting
08:37:06 (15692): wrapper: running bin/acemd3.exe (--boinc --device 0)
ERROR: C:\Users\admin\miniconda3\conda-bld\acemd3_1632736748005\work\src\mdsim\context.cpp line 318: Cannot use a restart file on a different device!
It started on device 1, then the final restart happened on device 0. I would recommend not restarting your computer until the GPUGRID task finishes.

I've even seen this issue happen from restarting on the same GPU after something like a driver update. Just don't interrupt the task at all.
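The check described above (compare the `--device` index across restarts in the task log) could be automated along these lines. This is a rough Python sketch, not a GPUGRID or BOINC tool: `devices_used` and `switched_device` are hypothetical helper names, and the regular expression assumes the wrapper line looks exactly like the ones quoted in this thread.

```python
import re

# Matches wrapper lines such as:
#   08:37:06 (15692): wrapper: running bin/acemd3.exe (--boinc --device 0)
# [^)]* keeps the match inside one pair of parentheses, so it also works
# if the log has been flattened onto a single line.
DEVICE_RE = re.compile(r"wrapper: running \S+ \([^)]*--device (\d+)\)")


def devices_used(stderr_text):
    """Return the --device index of each (re)start, in log order."""
    return [int(m.group(1)) for m in DEVICE_RE.finditer(stderr_text)]


def switched_device(stderr_text):
    """True if the task was started on more than one device index."""
    return len(set(devices_used(stderr_text))) > 1


# Wrapper lines taken from the log quoted above.
sample = "\n".join([
    "08:01:06 (9168): wrapper (7.9.26016): starting",
    "08:01:06 (9168): wrapper: running bin/acemd3.exe (--boinc --device 1)",
    "08:37:06 (15692): wrapper (7.9.26016): starting",
    "08:37:06 (15692): wrapper: running bin/acemd3.exe (--boinc --device 0)",
])

print(devices_used(sample), switched_device(sample))
```

Note that this only catches a change of device index; as mentioned above, a restart on the same GPU (for example after a driver update) has also been seen to fail, and this sketch would not flag that case.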