ATMML
| Author | Message |
|---|---|
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
Could it be that my Quadro P5000 is unable to crunch ATMML tasks? Several days ago I tried twice, and each time the task errored out after a few minutes. I cannot tell for sure, but my guess is that it failed at the moment the GPU was supposed to start working after the initial steps. BTW: the CPU is an Intel Xeon E5-2667 v4 (there are two of them in the box). Any ideas? |
|
Joined: 21 Dec 23 Posts: 51 Credit: 0 RAC: 0 |
I think a P5000 should work, given that it is the same generation as a 1080 which I have confirmed to work. It may be that your drivers are too old. I have a 1080 with driver version 536. |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
> I think a P5000 should work, given that it is the same generation as a 1080 which I have confirmed to work. It may be that your drivers are too old. I have a 1080 with driver version 536.

Thanks, Steve, for your quick reply. Some 5 hours ago I started another task, and it is still running :-) So I'm keeping my fingers crossed that it will finish successfully. No idea why the two earlier ones failed. BTW: the driver is 537.99 |
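For anyone who wants to compare GPU model and driver version the way the two posts above do, here is a minimal sketch. It assumes `nvidia-smi` is on the PATH; `parse_gpu_query` and `query_gpus` are hypothetical helper names made up for this example. Note that 536 is only the driver version reported to work with a 1080 in this thread, not a documented minimum for ATMML.

```python
import subprocess


def parse_gpu_query(csv_text):
    """Parse lines like 'Quadro P5000, 537.99' into (name, major_version) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Split on the last comma so a comma inside the GPU name cannot break parsing.
        name, version = [field.strip() for field in line.rsplit(",", 1)]
        gpus.append((name, int(version.split(".")[0])))
    return gpus


def query_gpus():
    """Ask nvidia-smi for each GPU's name and driver version."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_query(result.stdout)
```

Calling `query_gpus()` on a machine with the NVIDIA driver installed returns one tuple per card, which makes it easy to see whether a failing host runs a much older driver than a working one.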
|
Joined: 17 Mar 24 Posts: 15 Credit: 63,874,103 RAC: 0
|
Hi, why have I been receiving error messages like this in ATMML tasks recently?
Stderr output
<core_client_version>8.0.2</core_client_version>
<![CDATA[
<message> (unknown error) (0) - exit code 195 (0xc3)</message>
<stderr_txt>
09:59:48 (19024): wrapper (7.9.26016): starting
09:59:48 (19024): wrapper: running Library/usr/bin/tar.exe (xjvf input.tar.bz2)
aceforce_dft_v0.4.ckpt |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
You have to read a long way further down to find the real answer to your question! In the one I picked, I see:
Traceback (most recent call last):
  File "D:\ProgramData\BOINC\slots\2\Scripts\rbfe_explicit_sync.py", line 11, in <module>
    rx.scheduleJobs()
  File "D:\ProgramData\BOINC\slots\2\Lib\site-packages\sync\atm.py", line 126, in scheduleJobs
    self.worker.run(replica)
  File "D:\ProgramData\BOINC\slots\2\Lib\site-packages\sync\worker.py", line 124, in run
    raise RuntimeError(f"Simulation failed {ntry} times!")
RuntimeError: Simulation failed 5 times!
That looks like something is wrong in the way that job was set up by the project. That's not your fault, and there's nothing you can do about it except report it here and move on to the next one. |
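Since the useful part of the stderr output is buried far below the wrapper messages, a small helper can pull out the last Python traceback. This is only a sketch: `last_traceback` is a hypothetical name, and it relies on the heuristic that frame lines are indented while the final exception line is not, so chained exceptions or multi-line exception messages may be cut short.

```python
def last_traceback(stderr_text):
    """Return the last Python traceback found in stderr_text, or None if there is none."""
    marker = "Traceback (most recent call last):"
    start = stderr_text.rfind(marker)
    if start == -1:
        return None
    lines = stderr_text[start:].splitlines()
    block = [lines[0]]
    for line in lines[1:]:
        block.append(line)
        # The first non-indented line after the marker is taken to be the exception itself.
        if line and not line[0].isspace():
            break
    return "\n".join(block)
```

Pasting the saved stderr text of a failed task into `last_traceback` would return just the block ending in the `RuntimeError` line quoted above.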
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
Today, I've so far had 3 tasks which were "aborted by server" after several hours' runtime. Which is not really nice. If the team decides that certain types of tasks are no longer needed, they should delete them from the distribution pool before they are sent out, instead of having us volunteers crunch them for hours before stopping them. |
|
Joined: 28 Mar 09 Posts: 490 Credit: 11,731,645,728 RAC: 57
|
> Today, I've so far had 3 tasks which were "aborted by server" after several hours' runtime. Which is not really nice.

I agree. |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
> Today, I've so far had 3 tasks which were "aborted by server" after several hours' runtime. Which is not really nice.

I came here to report exactly the same thing. In the last hour, I've had 6 ATMML tasks which went wrong, and I only have 5 video cards! Two were a relatively quick 'Error while computing', perhaps around the 5% mark. Four were 'Cancelled by server', after runs from 14 ksec to 50 ksec (roughly 4 to 14 hours). I'm switching to Quantum Chemistry for the moment, until we get a handle on what the problem is. Edit - there goes another one: 'Error while computing' around the 5% mark. |
|
Joined: 30 Apr 13 Posts: 106 Credit: 3,805,237,860 RAC: 53
|
> Today, I've so far had 3 tasks which were "aborted by server" after several hours' runtime. Which is not really nice.

Something is strange. The work queue was over 800 tasks yesterday, now it's 7. |
|
Joined: 18 Mar 10 Posts: 28 Credit: 41,810,583,419 RAC: 10,891
|
There was a message from Quico posted on the GPUGrid Discord server. They cancelled some parts of the project as they need to finish other parts quickly. They will resend those later. I also lost quite a bit of run time, but they didn't have a better way to do it. |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
> They cancelled some parts of the project as they need to finish other parts quickly.

Let's hope the ones they need to finish quickly aren't the most recent ones that end with 'Error while computing'. My 'Error while computing' tasks today are all replication _0, type CDK2 or - a recent one - NEW_CDK2. |
|
Joined: 18 Mar 10 Posts: 28 Credit: 41,810,583,419 RAC: 10,891
|
> They cancelled some parts of the project as they need to finish other parts quickly.

If you're using MPS with Nvidia cards, I have seen that killing or stopping tasks while they are loaded into GPU memory can really screw things up and cause newly started tasks to fail as well. That was happening after the server-side cancellations, so I am restarting my PCs. |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
It doesn't really feel like that sort of problem, but I'll keep an eye on it and restart if it happens again. At the moment, the machines seem happy on QC tasks - and that seems to be the work that they want crunching, judging by the server status page (SSP). When I see the ready-to-send (RTS) queue filling up again, I'll try one or two ATMML tasks to see what happens. |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
> At the moment, the machines seem happy on QC tasks - and that seems to be the work that they want crunching

But this strategy does not work the way it was probably supposed to, as QC tasks are not available for Windows crunchers (and those are still the majority). Another indication of this intention seems to be that they reduced the credit points for ATMML by 50%, as I noticed with the last few tasks that were not aborted by server and hence could finish - see here: https://www.gpugrid.net/result.php?resultid=36020678 |
ServicEnginIC Joined: 24 Sep 10 Posts: 592 Credit: 11,972,186,510 RAC: 1,187
|
> Today, I've so far had 3 tasks which were "aborted by server" after several hours' runtime. Which is not really nice.

I have added up the processing time on my hosts for the 11 ATMML tasks "Aborted by server" over the past three days: about 135 hours. Really not nice. The tradition used to be that only tasks not yet started were aborted, and started ones were allowed to finish. There must be weighty reasons for breaking this tradition, I suppose. |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
> Another indication for this intention seems to be that they reduced the credit points for ATMML by 50 %

I've seen that well before the cancellations started, but only for tasks which ran significantly quicker than the ones we're used to. I think that's fair. |
|
Joined: 29 Aug 24 Posts: 71 Credit: 3,321,790,989 RAC: 1,155
|
135 hours is a drag! Sorry to hear. We want our GPUs productive and not just hamster wheels. |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
> Another indication for this intention seems to be that they reduced the credit points for ATMML by 50 %

I agree that for significantly shorter tasks the credit is lower, no question. But the task which I cited was one of the "long ones". |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
> 135 hours is a drag! Sorry to hear. We want our GPUs productive and not just hamster wheels.

+1 |
|
Joined: 30 Jun 14 Posts: 153 Credit: 129,654,684 RAC: 0
|
I just processed my first one after getting my computer back online. It errored out after 12.9 hours of run time. There is a lot of deprecation going on in the task. What is that all about? https://www.gpugrid.net/results.php?userid=107556&offset=0&show_names=0&state=5&appid=
©2025 Universitat Pompeu Fabra