ATMML

Message boards : Number crunching : ATMML
Erich56

Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Message 61799 - Posted: 12 Sep 2024, 12:45:01 UTC

Could it be that my Quadro P5000 is unable to crunch ATMMLs?
Several days ago I tried twice, and each time the task errored out after a few minutes (I guess, but cannot tell for sure, at the moment the GPU was supposed to start working after the initial steps).
BTW: the CPU is an Intel Xeon E5-2667 v4 (there are two such CPUs in the box).

Any ideas?
Steve
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist

Joined: 21 Dec 23
Posts: 51
Credit: 0
RAC: 0
Message 61800 - Posted: 12 Sep 2024, 13:10:01 UTC - in response to Message 61799.  

I think a P5000 should work, given that it is the same generation as a 1080 which I have confirmed to work. It may be that your drivers are too old. I have a 1080 with driver version 536.
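
For anyone wanting to compare their own driver against the 536 mentioned above, here is a minimal sketch. It assumes `nvidia-smi` is on the PATH; the helper names and the `REFERENCE_MAJOR` constant are made up for illustration, and 536 is only the version reported as working on a 1080, not an official minimum. Driver branch numbers also differ between Windows and Linux, so the comparison is only a rough hint.

```python
import shutil
import subprocess


def parse_driver_lines(output):
    """Parse 'name, driver_version' CSV lines into (name, major_version) pairs."""
    gpus = []
    for line in output.strip().splitlines():
        # Split on the last comma so the GPU name is kept whole.
        name, version = [part.strip() for part in line.rsplit(",", 1)]
        gpus.append((name, int(version.split(".")[0])))
    return gpus


def query_gpus():
    """Ask nvidia-smi for GPU names and driver versions; [] if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    return parse_driver_lines(result.stdout) if result.returncode == 0 else []


# Assumption: the version reported as working in this thread, not a documented requirement.
REFERENCE_MAJOR = 536

for name, major in query_gpus():
    status = "at or above" if major >= REFERENCE_MAJOR else "below"
    print(f"{name}: driver major version {major} is {status} {REFERENCE_MAJOR}")
```

On a machine without an NVIDIA driver installed the loop simply prints nothing.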
Erich56

Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Message 61801 - Posted: 12 Sep 2024, 18:00:32 UTC - in response to Message 61800.  

> I think a P5000 should work, given that it is the same generation as a 1080 which I have confirmed to work. It may be that your drivers are too old. I have a 1080 with driver version 536.

Thanks, Steve, for your quick reply.
Some 5 hours ago, I started another task - and it is still running :-)
So I keep my fingers crossed that it will finish successfully.
No idea why the two earlier ones failed.
BTW: the driver is 537.99
TofPete

Joined: 17 Mar 24
Posts: 15
Credit: 63,874,103
RAC: 0
Message 61816 - Posted: 18 Sep 2024, 13:47:47 UTC

Hi,

Why have I been receiving error messages like this in ATMML tasks recently?

Stderr output
<core_client_version>8.0.2</core_client_version>
<![CDATA[
<message>
(unknown error) (0) - exit code 195 (0xc3)</message>
<stderr_txt>
09:59:48 (19024): wrapper (7.9.26016): starting
09:59:48 (19024): wrapper: running Library/usr/bin/tar.exe (xjvf input.tar.bz2)
aceforce_dft_v0.4.ckpt
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 351
Message 61817 - Posted: 18 Sep 2024, 19:18:18 UTC - in response to Message 61816.  

You have to read a long way further down to find the real answer to your question!

In the one I picked, I see:
Traceback (most recent call last):
  File "D:\ProgramData\BOINC\slots\2\Scripts\rbfe_explicit_sync.py", line 11, in <module>
    rx.scheduleJobs()
  File "D:\ProgramData\BOINC\slots\2\Lib\site-packages\sync\atm.py", line 126, in scheduleJobs
    self.worker.run(replica)
  File "D:\ProgramData\BOINC\slots\2\Lib\site-packages\sync\worker.py", line 124, in run
    raise RuntimeError(f"Simulation failed {ntry} times!")
RuntimeError: Simulation failed 5 times!

That looks like something is wrong in the way that job was set up by the project - that's not your fault, and there's nothing you can do about it except report it here - and move on to the next one.
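
Since the real error sits a long way down the stderr output, a small script can find it for you. This is only a sketch: `last_exception_line` is a hypothetical helper, not part of BOINC or GPUGRID, and it assumes you have copied the task's stderr text into a string. The sample text is shortened from the output quoted above, with the file path abbreviated.

```python
import re

# Matches a Python exception summary line such as
# "RuntimeError: Simulation failed 5 times!" or a bare "KeyboardInterrupt"-style
# name ending in Error/Exception. Requiring ":" or end-of-line after the name
# avoids matching status text like "Error while computing".
_EXC_LINE = re.compile(r"^(?:\w+\.)*\w*(?:Error|Exception)(?::|$)")


def last_exception_line(stderr_text):
    """Return the last exception-summary line in stderr_text, or None."""
    found = None
    for line in stderr_text.splitlines():
        stripped = line.strip()
        if _EXC_LINE.match(stripped):
            found = stripped
    return found


sample = """\
09:59:48 (19024): wrapper (7.9.26016): starting
Traceback (most recent call last):
  File "worker.py", line 124, in run
    raise RuntimeError(f"Simulation failed {ntry} times!")
RuntimeError: Simulation failed 5 times!
"""

print(last_exception_line(sample))  # RuntimeError: Simulation failed 5 times!
```

The function keeps the last match rather than the first because a long log can contain earlier, recovered errors before the one that finally ended the task.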
Erich56

Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Message 61834 - Posted: 27 Sep 2024, 12:53:24 UTC

Today, I've so far had 3 tasks which were "aborted by server" after several hours' runtime. Which is not really nice.
If the team decides that certain types of tasks are no longer needed, they should delete them from the distribution pool before they are sent out, instead of having us volunteers crunch them for hours before stopping them.
Bedrich Hajek

Joined: 28 Mar 09
Posts: 490
Credit: 11,731,645,728
RAC: 57
Message 61835 - Posted: 27 Sep 2024, 12:57:15 UTC - in response to Message 61834.  

> Today, I've so far had 3 tasks which were "aborted by server" after several hours' runtime. Which is not really nice.
> If the team decides that certain types of tasks are no longer needed, they should delete them from the distribution pool before they are sent out, instead of having us volunteers crunch them for hours before stopping them.


I agree.


Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 351
Message 61836 - Posted: 27 Sep 2024, 12:59:50 UTC - in response to Message 61834.  
Last modified: 27 Sep 2024, 13:02:17 UTC

> Today, I've so far had 3 tasks which were "aborted by server" after several hours' runtime. Which is not really nice.
> If the team decides that certain types of tasks are no longer needed, they should delete them from the distribution pool before they are sent out, instead of having us volunteers crunch them for hours before stopping them.

I came here to report exactly the same thing. In the last hour, I've had 6 ATMML tasks which went wrong, and I only have 5 video cards!

Two were a relatively quick 'Error while computing' - perhaps around the 5% mark.
Four were 'Cancelled by server', after runs from 14 ksec to 50 ksec.

I'm switching to Quantum Chemistry for the moment, until we get a handle on what the problem is.

Edit - there goes another one: 'Error while computing' around the 5% mark.
WPrion

Joined: 30 Apr 13
Posts: 106
Credit: 3,805,237,860
RAC: 53
Message 61837 - Posted: 27 Sep 2024, 13:10:42 UTC - in response to Message 61836.  
Last modified: 27 Sep 2024, 13:11:34 UTC

>> Today, I've so far had 3 tasks which were "aborted by server" after several hours' runtime. Which is not really nice.
>> If the team decides that certain types of tasks are no longer needed, they should delete them from the distribution pool before they are sent out, instead of having us volunteers crunch them for hours before stopping them.
>
> I came here to report exactly the same thing.
>
> Two were a relatively quick 'Error while computing' - perhaps around the 5% mark.


Something is strange. The work queue was over 800 tasks yesterday, now it's 7.
Freewill

Joined: 18 Mar 10
Posts: 28
Credit: 41,810,583,419
RAC: 10,891
Message 61838 - Posted: 27 Sep 2024, 13:15:55 UTC - in response to Message 61837.  

There was a message from Quico posted on the GPUGrid Discord server. They cancelled some parts of the project as they need to finish other parts quickly. They will resend those later.

I also lost quite a bit of run time, but they didn't have a better way to do it.
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 351
Message 61839 - Posted: 27 Sep 2024, 13:26:49 UTC - in response to Message 61838.  

> They cancelled some parts of the project as they need to finish other parts quickly.

Let's hope the ones they need to finish quickly aren't the most recent ones that end with 'Error while computing'.

My 'Error while computing' today are all replication _0, type CDK2 or - a recent one - NEW_CDK2.
Freewill

Joined: 18 Mar 10
Posts: 28
Credit: 41,810,583,419
RAC: 10,891
Message 61841 - Posted: 27 Sep 2024, 14:59:39 UTC - in response to Message 61839.  

>> They cancelled some parts of the project as they need to finish other parts quickly.
>
> Let's hope the ones they need to finish quickly aren't the most recent ones that end with 'Error while computing'.
>
> My 'Error while computing' today are all replication _0, type CDK2 or - a recent one - NEW_CDK2.


If you're using MPS with Nvidia cards, I have seen that killing or stopping tasks while they are loaded into GPU memory can really screw things up and cause newly started tasks to fail as well. That was happening after the server cancellations, so I am restarting my PCs.
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 351
Message 61842 - Posted: 27 Sep 2024, 18:15:21 UTC - in response to Message 61841.  

It doesn't really feel like that sort of problem, but I'll keep an eye on it and restart if it happens again.

At the moment, the machines seem happy on QC tasks - and that seems to be the work that they want crunching, judging by the SSP. When I see the RTS queue filling up again, I'll try one or two to see what happens.
Erich56

Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Message 61843 - Posted: 27 Sep 2024, 19:27:21 UTC - in response to Message 61842.  

> At the moment, the machines seem happy on QC tasks - and that seems to be the work that they want crunching

But this strategy does not work the way it was probably supposed to, as QC tasks are not available for Windows crunchers (and those are still the majority).
Another indication of this intention seems to be that they reduced the credit points for ATMML by 50%, as I noticed with the last few tasks that were not aborted by the server and hence could finish. See here:
https://www.gpugrid.net/result.php?resultid=36020678
ServicEnginIC

Joined: 24 Sep 10
Posts: 592
Credit: 11,972,186,510
RAC: 1,187
Message 61844 - Posted: 27 Sep 2024, 20:13:59 UTC - in response to Message 61834.  

> Today, I've so far had 3 tasks which were "aborted by server" after several hours' runtime. Which is not really nice.

I have added up the processing time on my hosts for the 11 ATMML tasks "Aborted by server" over the past three days.
About 135 hours. Really not nice.
The tradition used to be that only tasks not yet started were aborted, and started ones were allowed to finish.
There must be weighty reasons for breaking this tradition, I suppose.
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 351
Message 61845 - Posted: 27 Sep 2024, 21:17:54 UTC - in response to Message 61843.  

> Another indication of this intention seems to be that they reduced the credit points for ATMML by 50 %

I've seen that well before the cancellations started - but only for tasks which ran significantly quicker than the ones we're used to. I think that's fair.
KeithBriggs

Joined: 29 Aug 24
Posts: 71
Credit: 3,321,790,989
RAC: 1,155
Message 61846 - Posted: 27 Sep 2024, 21:18:28 UTC - in response to Message 61844.  

135 hours is a drag! Sorry to hear. We want our GPUs productive and not just hamster wheels.
Erich56

Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Message 61849 - Posted: 28 Sep 2024, 5:58:40 UTC - in response to Message 61845.  

>> Another indication of this intention seems to be that they reduced the credit points for ATMML by 50 %
>
> I've seen that well before the cancellations started - but only for tasks which ran significantly quicker than the ones we're used to. I think that's fair.

I agree that for significantly shorter tasks the credit is lower, no question.
But the task which I cited was one of the "long ones".
Erich56

Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Message 61850 - Posted: 28 Sep 2024, 5:59:23 UTC - in response to Message 61846.  

> 135 hours is a drag! Sorry to hear. We want our GPUs productive and not just hamster wheels.

+ 1
Greg _BE

Joined: 30 Jun 14
Posts: 153
Credit: 129,654,684
RAC: 0
Message 61860 - Posted: 1 Oct 2024, 20:59:28 UTC
Last modified: 1 Oct 2024, 21:00:26 UTC

I just processed my first one after getting my computer back online.
It errored out.
There is a lot of deprecation going on in the task.
What is that all about?

12.9 hours run time.
https://www.gpugrid.net/results.php?userid=107556&offset=0&show_names=0&state=5&appid=

©2025 Universitat Pompeu Fabra