ATMML

Erich56 · Joined: 1 Jan 15 · Posts: 1171 · Credit: 12,662,148,501 · RAC: 10,668
Message 61799 - Posted: 12 Sep 2024, 12:45:01 UTC
Could it be that my Quadro P5000 is unable to crunch ATMMLs? Several days ago I tried twice, and each time the task errored out after a few minutes (my guess, though I cannot tell for sure, is that this was the moment the GPU was supposed to start working after the initial steps). BTW: the CPU is an Intel Xeon E5 2667 v4 (there are two such CPUs in the box). Any ideas?

Steve · Volunteer moderator, Project administrator, Project developer, Project tester, Volunteer developer, Volunteer tester, Project scientist · Joined: 21 Dec 23 · Posts: 51 · Credit: 0 · RAC: 0
Message 61800 - Posted: 12 Sep 2024, 13:10:01 UTC - in response to Message 61799.
I think a P5000 should work, given that it is the same generation as a 1080, which I have confirmed to work. It may be that your drivers are too old. I have a 1080 with driver version 536.

Erich56 · Joined: 1 Jan 15 · Posts: 1171 · Credit: 12,662,148,501 · RAC: 10,668
Message 61801 - Posted: 12 Sep 2024, 18:00:32 UTC - in response to Message 61800.
> I think a P5000 should work, given that it is the same generation as a 1080, which I have confirmed to work. It may be that your drivers are too old. I have a 1080 with driver version 536.
Thanks, Steve, for your quick reply. Some 5 hours ago I started another task - and it is still running :-) So I am keeping my fingers crossed that it will finish successfully. No idea why the two earlier ones failed. BTW: the driver is 537.99.

TofPete · Joined: 17 Mar 24 · Posts: 15 · Credit: 63,874,103 · RAC: 0
Message 61816 - Posted: 18 Sep 2024, 13:47:47 UTC
Hi, why have I been receiving error messages like this in ATMML tasks recently?
Stderr output
    <core_client_version>8.0.2</core_client_version>
    <![CDATA[
    <message>
    (unknown error) (0) - exit code 195 (0xc3)</message>
    <stderr_txt>
    09:59:48 (19024): wrapper (7.9.26016): starting
    09:59:48 (19024): wrapper: running Library/usr/bin/tar.exe (xjvf input.tar.bz2)
    aceforce_dft_v0.4.ckpt

Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 0
Message 61817 - Posted: 18 Sep 2024, 19:18:18 UTC - in response to Message 61816.
You have to read a long way further down to find the real answer to your question! In the one I picked, I see:
    Traceback (most recent call last):
      File "D:\ProgramData\BOINC\slots\2\Scripts\rbfe_explicit_sync.py", line 11, in <module>
        rx.scheduleJobs()
      File "D:\ProgramData\BOINC\slots\2\Lib\site-packages\sync\atm.py", line 126, in scheduleJobs
        self.worker.run(replica)
      File "D:\ProgramData\BOINC\slots\2\Lib\site-packages\sync\worker.py", line 124, in run
        raise RuntimeError(f"Simulation failed {ntry} times!")
    RuntimeError: Simulation failed 5 times!
That looks like something is wrong in the way that job was set up by the project - that's not your fault, and there's nothing you can do about it except report it here - and move on to the next one.
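
For anyone unsure where to look in a long stderr listing, here is a minimal Python sketch that pulls out the final exception line of the last traceback. It assumes the stderr text has already been copied into a string. The function name `last_exception_line` and the sample text (stitched together from the excerpts quoted in this thread, not a complete real log) are illustrative only; none of this is part of BOINC or the GPUGrid application.

```python
from typing import Optional

TRACEBACK_HEADER = "Traceback (most recent call last):"

# Illustrative sample assembled from excerpts quoted in this thread.
SAMPLE_STDERR = r"""09:59:48 (19024): wrapper (7.9.26016): starting
Traceback (most recent call last):
  File "D:\ProgramData\BOINC\slots\2\Scripts\rbfe_explicit_sync.py", line 11, in <module>
    rx.scheduleJobs()
  File "D:\ProgramData\BOINC\slots\2\Lib\site-packages\sync\worker.py", line 124, in run
    raise RuntimeError(f"Simulation failed {ntry} times!")
RuntimeError: Simulation failed 5 times!
"""


def last_exception_line(stderr_text: str) -> Optional[str]:
    """Return the exception line that ends the last Python traceback, if any."""
    lines = stderr_text.splitlines()
    # Indices of every traceback header; only the last one is of interest.
    starts = [i for i, line in enumerate(lines) if line.startswith(TRACEBACK_HEADER)]
    if not starts:
        return None
    # Frame lines are indented; the first unindented, non-empty line after
    # the header is the exception type and message.
    for line in lines[starts[-1] + 1:]:
        if line and not line[0].isspace():
            return line
    return None


print(last_exception_line(SAMPLE_STDERR))
# prints: RuntimeError: Simulation failed 5 times!
```

If the function returns None, the log contains no Python traceback and the failure has to be looked for elsewhere, for example in the wrapper's exit code.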

Erich56 · Joined: 1 Jan 15 · Posts: 1171 · Credit: 12,662,148,501 · RAC: 10,668
Message 61834 - Posted: 27 Sep 2024, 12:53:24 UTC
Today, I've so far had 3 tasks which were "aborted by server" after several hours' runtime. Which is not really nice. If the team decides that certain types of tasks are no longer needed, they should delete them from the distribution pool before they are sent out, instead of having us volunteers crunch them for hours before stopping them.

Bedrich Hajek · Joined: 28 Mar 09 · Posts: 490 · Credit: 11,850,145,728 · RAC: 3,168
Message 61835 - Posted: 27 Sep 2024, 12:57:15 UTC - in response to Message 61834.
> Today, I've so far had 3 tasks which were "aborted by server" after several hours' runtime. Which is not really nice. If the team decides that certain types of tasks are no longer needed, they should delete them from the distribution pool before they are sent out, instead of having us volunteers crunch them for hours before stopping them.
I agree.

Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 0
Message 61836 - Posted: 27 Sep 2024, 12:59:50 UTC - in response to Message 61834. Last modified: 27 Sep 2024, 13:02:17 UTC
> Today, I've so far had 3 tasks which were "aborted by server" after several hours' runtime. Which is not really nice. If the team decides that certain types of tasks are no longer needed, they should delete them from the distribution pool before they are sent out, instead of having us volunteers crunch them for hours before stopping them.
I came here to report exactly the same thing. In the last hour, I've had 6 ATMML tasks which went wrong, and I only have 5 video cards! Two were a relatively quick 'Error while computing' - perhaps around the 5% mark. Four were 'Cancelled by server', after runs from 14 ksec to 50 ksec. I'm switching to Quantum Chemistry for the moment, until we get a handle on what the problem is.
Edit - there goes another one: 'Error while computing' around the 5% mark.
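
For scale, the run times quoted above are in kiloseconds. A small Python sketch of the conversion to hours follows; the helper name `ksec_to_hours` is made up for this illustration.

```python
def ksec_to_hours(ksec: float) -> float:
    """Convert kiloseconds to hours (1 ksec = 1000 s, 1 h = 3600 s)."""
    return ksec * 1000 / 3600


# The two run times mentioned in the post above.
for ksec in (14, 50):
    print(f"{ksec} ksec = {ksec_to_hours(ksec):.1f} h")
# prints:
# 14 ksec = 3.9 h
# 50 ksec = 13.9 h
```

So the cancelled tasks had been running for roughly 4 to 14 hours each.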

WPrion · Joined: 30 Apr 13 · Posts: 109 · Credit: 3,977,737,860 · RAC: 6,051
Message 61837 - Posted: 27 Sep 2024, 13:10:42 UTC - in response to Message 61836. Last modified: 27 Sep 2024, 13:11:34 UTC
>> Today, I've so far had 3 tasks which were "aborted by server" after several hours' runtime. Which is not really nice. If the team decides that certain types of tasks are no longer needed, they should delete them from the distribution pool before they are sent out, instead of having us volunteers crunch them for hours before stopping them.
> I came here to report exactly the same thing. Two were a relatively quick 'Error while computing' - perhaps around the 5% mark.
Something is strange. The work queue was over 800 tasks yesterday, now it's 7.

Freewill · Joined: 18 Mar 10 · Posts: 28 · Credit: 43,098,337,419 · RAC: 27,323
Message 61838 - Posted: 27 Sep 2024, 13:15:55 UTC - in response to Message 61837.
There was a message from Quico posted on the GPUGrid Discord server. They cancelled some parts of the project as they need to finish other parts quickly. They will resend those later. I also lost quite a bit of run time, but they didn't have a better way to do it.

Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 0
Message 61839 - Posted: 27 Sep 2024, 13:26:49 UTC - in response to Message 61838.
> They cancelled some parts of the project as they need to finish other parts quickly.
Let's hope the ones they need to finish quickly aren't the most recent ones that end with 'Error while computing'. My 'Error while computing' today are all replication _0, type CDK2 or - a recent one - NEW_CDK2.

Freewill · Joined: 18 Mar 10 · Posts: 28 · Credit: 43,098,337,419 · RAC: 27,323
Message 61841 - Posted: 27 Sep 2024, 14:59:39 UTC - in response to Message 61839.
>> They cancelled some parts of the project as they need to finish other parts quickly.
> Let's hope the ones they need to finish quickly aren't the most recent ones that end with 'Error while computing'. My 'Error while computing' today are all replication _0, type CDK2 or - a recent one - NEW_CDK2.
If you're using MPS with Nvidia cards, I have seen that killing or stopping tasks while they are loaded in GPU memory can really screw things up and cause newly started tasks to fail as well. That was happening after the server cancellations, so I am restarting my PCs.

Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 0
Message 61842 - Posted: 27 Sep 2024, 18:15:21 UTC - in response to Message 61841.
It doesn't really feel like that sort of problem, but I'll keep an eye on it and restart if it happens again. At the moment, the machines seem happy on QC tasks - and that seems to be the work that they want crunching, judging by the SSP. When I see the RTS queue filling up again, I'll try one or two to see what happens.

Erich56 · Joined: 1 Jan 15 · Posts: 1171 · Credit: 12,662,148,501 · RAC: 10,668
Message 61843 - Posted: 27 Sep 2024, 19:27:21 UTC - in response to Message 61842.
> At the moment, the machines seem happy on QC tasks - and that seems to be the work that they want crunching
But this strategy does not work the way it was probably supposed to, as QC tasks are not available for Windows crunchers (and those are still the majority). Another indication of this intention seems to be that they reduced the credit points for ATMML by 50% - as I noticed with the last few tasks that were not aborted by the server and hence could finish - see here: https://www.gpugrid.net/result.php?resultid=36020678

ServicEnginIC · Joined: 24 Sep 10 · Posts: 595 · Credit: 13,083,686,510 · RAC: 31,373
Message 61844 - Posted: 27 Sep 2024, 20:13:59 UTC - in response to Message 61834.
> Today, I've so far had 3 tasks which were "aborted by server" after several hours' runtime. Which is not really nice.
I have added up the processing time from my hosts for 11 ATMML tasks "Aborted by server" over the past three days. About 135 hours. Really not nice. The tradition used to be that only tasks that had not yet started were aborted, and started ones were allowed to finish. There must be weighty reasons for breaking this tradition, I suppose.

Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 0
Message 61845 - Posted: 27 Sep 2024, 21:17:54 UTC - in response to Message 61843.
> Another indication of this intention seems to be that they reduced the credit points for ATMML by 50%
I've seen that well before the cancellations started - but only for tasks which ran significantly quicker than the ones we're used to. I think that's fair.

KeithBriggs · Joined: 29 Aug 24 · Posts: 71 · Credit: 3,419,290,989 · RAC: 1,689
Message 61846 - Posted: 27 Sep 2024, 21:18:28 UTC - in response to Message 61844.
135 hours is a drag! Sorry to hear. We want our GPUs productive and not just hamster wheels.

Erich56 · Joined: 1 Jan 15 · Posts: 1171 · Credit: 12,662,148,501 · RAC: 10,668
Message 61849 - Posted: 28 Sep 2024, 5:58:40 UTC - in response to Message 61845.
>> Another indication of this intention seems to be that they reduced the credit points for ATMML by 50%
> I've seen that well before the cancellations started - but only for tasks which ran significantly quicker than the ones we're used to. I think that's fair.
I agree that for significantly shorter tasks the credit is lower, no question. But the task which I cited was one of the "long ones".

Erich56 · Joined: 1 Jan 15 · Posts: 1171 · Credit: 12,662,148,501 · RAC: 10,668
Message 61850 - Posted: 28 Sep 2024, 5:59:23 UTC - in response to Message 61846.
> 135 hours is a drag! Sorry to hear. We want our GPUs productive and not just hamster wheels.
+1

Greg _BE · Joined: 30 Jun 14 · Posts: 154 · Credit: 131,154,684 · RAC: 27
Message 61860 - Posted: 1 Oct 2024, 20:59:28 UTC Last modified: 1 Oct 2024, 21:00:26 UTC
I just processed my first one after getting my computer back online. It errored out after 12.9 hours of run time. There is a lot of deprecation going on in the task. What is that all about?
https://www.gpugrid.net/results.php?userid=107556&offset=0&show_names=0&state=5&appid=