ACEMD updated app

Keith Myers
Joined: 13 Dec 17
Posts: 1416
Credit: 9,119,446,190
RAC: 614,515
Level: Tyr
Message 60927 - Posted: 3 Jan 2024, 22:30:58 UTC - in response to Message 60926.  

You need to look at a running task while it is still in its slot and capture the stderr.txt and progress files for later examination before the task errors out and clears the slot.
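A minimal sketch of how such a capture could be scripted, assuming Python is available on the host. The glob pattern "progress*" is only a guess at the progress file names, and the slot path in the usage note is an assumption about a typical Linux package install, not something confirmed for this project.

```python
import shutil
import time
from pathlib import Path


def snapshot_slot(slot_dir, dest_dir, patterns=("stderr.txt", "progress*")):
    """Copy diagnostic files out of a BOINC slot directory while the task runs.

    `patterns` are glob patterns; "progress*" is a hypothetical name for the
    progress files and may need adjusting for real acemd tasks.
    """
    slot = Path(slot_dir)
    # Timestamped subfolder so repeated snapshots do not overwrite each other.
    target = Path(dest_dir) / time.strftime("%Y%m%d-%H%M%S")
    copied = []
    for pattern in patterns:
        for src in sorted(slot.glob(pattern)):
            if src.is_file():
                target.mkdir(parents=True, exist_ok=True)
                shutil.copy2(src, target / src.name)
                copied.append(src.name)
    return copied
```

Something like snapshot_slot("/var/lib/boinc-client/slots/0", "/tmp/acemd-debug") could then be run by hand, or in a loop, while the task is still in its slot; both paths are placeholders.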

Your uploaded result files do not have any useful information about why the tasks are failing.

You should at least examine the acemd application for its license expiration as posted in my last post. Assuming the Windows application got the same license expiration, the tasks should run.

acemd --licence would at least eliminate that as the issue. Or prove it.

Erich56
Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 869
Level: Trp
Message 60928 - Posted: 4 Jan 2024, 6:08:43 UTC - in response to Message 60927.  

... Your uploaded result files do not have any useful information about why the tasks are failing.

Yes, you are right: the task from the link I uploaded before does not show any stderr.txt, for whatever reason (I did not check this before, sorry for that). I have noticed that this is the case with all tasks from this PC, regardless of whether they succeed or fail; no idea why.

However, the stderr from the other PC where ACEMD 3 tasks also failed does work, here is an example:
http://www.gpugrid.net/result.php?resultid=33725327

You should at least examine the acemd application for its license expiration as posted in my last post. Assuming the Windows application got the same license expiration, the tasks should run.

acemd --licence would at least eliminate that as the issue. Or prove it.

Since the ACEMD 3 tasks started failing yesterday at about the same time on both of my PCs (on a third PC I unfortunately cannot crunch ACEMD 3, because the app does not work with Ada Lovelace yet), my guess was that this is not due to any problem with my hardware or software, but rather to a problem with the app itself, probably with the license.

Keith Myers
Joined: 13 Dec 17
Posts: 1416
Credit: 9,119,446,190
RAC: 614,515
Level: Tyr
Message 60929 - Posted: 4 Jan 2024, 7:34:07 UTC - in response to Message 60928.  

The stderr.txt on Windows hosts never shows any reason for failing or succeeding.
I've never been able to decipher why all Windows tasks have the debug dump in their outputs.

You get the same dump output whether it succeeds or fails. They only ever display the generic BOINC catch-all error code 195, which does not explain anything.

The Linux stderr.txt output actually does show explicit reasons for why a task fails.

Your Quadro P5000 is NOT Ada generation, it's Pascal generation and Pascal cards have always worked with acemd tasks.

I've been trying to find the code path for these acemd tasks and haven't been able to deduce anything beyond the fact that they set up the CONDA environment in the job file and pass the parameter file on to the app.

They don't have the same layout as the ATMbeta tasks, where you can follow along with the processing and figure out where they fail in the setup or processing flow.

A slug of acemd tasks released today had the same issue, with no Windows file found, and they failed on all the Linux hosts. They were not from the initial bad release, though, but newly generated today.

This task is an example of what I mean:

https://www.gpugrid.net/result.php?resultid=33725353

It was attempted by 7 hosts, both Windows and Linux, so the task itself is badly configured.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 295,172
Level: Trp
Message 60930 - Posted: 4 Jan 2024, 9:56:31 UTC - in response to Message 60929.  

Actually, Erich's https://www.gpugrid.net/result.php?resultid=33725327 does contain a useful error code:

app exit status: 0xc0000135

That's a generic Windows NT code:

0xC0000135

STATUS_DLL_NOT_FOUND

{Unable To Locate Component} This application has failed to start because %hs was not found. Reinstalling the application might fix this problem.

You have to be careful and search Microsoft itself for that one: the general internet chatterbox will usually say that a specific component is at fault (usually the .NET framework), which is unlikely to be relevant for research applications. You might be able to get a name for the missing component by trying to launch the application manually in a terminal window - it should populate that %hs parameter.
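As a rough illustration of the lookup described above, here is a small Python sketch that pulls the exit status out of a stderr snippet and maps it to a name. The table holds only a few well-known NTSTATUS values and is deliberately incomplete; the function name and the regular expression are illustrative, not part of BOINC.

```python
import re

# A few well-known NTSTATUS values; deliberately not a complete table.
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC0000135: "STATUS_DLL_NOT_FOUND",
    0xC000013A: "STATUS_CONTROL_C_EXIT",
}


def decode_exit_status(stderr_text):
    """Return (code, name) for the first 'app exit status' line, or None."""
    match = re.search(r"app exit status:\s*(0x[0-9a-fA-F]+)", stderr_text)
    if match is None:
        return None
    code = int(match.group(1), 16)
    return code, NTSTATUS_NAMES.get(code, "unknown")


print(decode_exit_status("app exit status: 0xc0000135"))
```

This only names the status; it does not say which DLL is missing, so launching the application manually is still needed for that.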

ServicEnginIC
Joined: 24 Sep 10
Posts: 592
Credit: 11,972,186,510
RAC: 998,578
Level: Trp
Message 60935 - Posted: 8 Jan 2024, 5:44:10 UTC - in response to Message 60904.  

Four ACEMD tasks received at this Linux host on January 7-8 still failed after a few seconds.
One example:

Application: ACEMD 3: molecular dynamics simulations for GPUs 2.22 (cuda1121)
Name: 0_2-CRYPTICSCOUT_pocket_discovery_f279f6d5_5830_427a_b012_ee7935c48e7f-1-3-RND8942
State: Computation error
Received: Mon 08 Jan 2024 02:58:37 WET
Report deadline: Sat 13 Jan 2024 02:58:36 WET
Resources: 0.49 CPUs + 1 NVIDIA GPU
Estimated computation size: 5,000,000 GFLOPs
CPU time: 00:00:00
Elapsed time: 00:00:33
Executable: wrapper_26198_x86_64-pc-linux-gnu

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 295,172
Level: Trp
Message 60936 - Posted: 8 Jan 2024, 8:22:14 UTC - in response to Message 60935.  

They have "ModuleNotFoundError: No module named 'msvcrt'".

I think that stands for "MicroSoft Visual C RunTime [module]" - which is odd to see in a Linux package.
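For context, Python's standard library does ship a module named msvcrt, but only on Windows builds, so an unconditional import of it fails on Linux with exactly this kind of error. The sketch below shows one way a script can check for the module before using it; it is a generic illustration and not taken from the acemd task scripts.

```python
import importlib.util
import sys


def has_module(name):
    """True if `name` can be imported on this interpreter."""
    return importlib.util.find_spec(name) is not None


def console_backend():
    """Pick a platform-appropriate backend instead of importing msvcrt blindly."""
    if has_module("msvcrt"):
        return "msvcrt"   # Windows-only standard-library module
    return "posix"        # fallback label for Linux and other platforms


print(sys.platform, console_backend())
```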

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 295,172
Level: Trp
Message 61504 - Posted: 15 May 2024, 10:39:23 UTC

Following on from the reported issue in the ATM thread ("exceeded elapsed time limit" error - message 61483):

I've finally caught one of these for inspection in daylight. It's on a Linux machine, so a slightly different version - v2.24, deployed 15 Apr 2024 - but it should be close enough.

Here are the vital statistics:
App speed:  <flops> 6271039115434
Task size:  <rsc_fpops_est> 1000000000000000000
Correction: <duration_correction_factor> 0.010000

for an estimated run time of 1594 seconds - or 26 minutes 34 seconds, shown in BOINC Manager.

The time limit for the task is set by <rsc_fpops_bound>, which is 10 times larger than the estimate. So, 4 hours, 25 minutes, 40 seconds on this GeForce GTX 1660 Ti. I'll let you know how it gets on - or you can look it up yourself this afternoon, at task 35250069.
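The arithmetic above can be reproduced with a short Python sketch. It simply mirrors the figures in this post: the estimate is taken as rsc_fpops_est / flops scaled by the duration correction factor, and the limit as ten times the displayed estimate. Whether the BOINC client computes the limit in exactly this way is an assumption here, and the function names are made up for illustration.

```python
def estimated_runtime_s(rsc_fpops_est, flops, dcf):
    """Estimated run time in seconds: task size / app speed, scaled by DCF."""
    return rsc_fpops_est / flops * dcf


def hms(seconds):
    """Format whole seconds as h:mm:ss."""
    seconds = int(seconds)
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{h}:{m:02d}:{s:02d}"


estimate = estimated_runtime_s(1_000_000_000_000_000_000, 6_271_039_115_434, 0.010000)
limit = 10 * int(estimate)  # bound is 10x the estimate, as stated in the post

print(hms(estimate))  # 0:26:34
print(hms(limit))     # 4:25:40
```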

Or not.
ACEMD failed:
    Error loading CUDA module: CUDA_ERROR_UNSUPPORTED_PTX_VERSION (222)

Back to the drawing board, while it gets on with Quantum chemistry as usual!

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1114
Credit: 40,838,722,595
RAC: 4,266,994
Level: Trp
Message 61505 - Posted: 15 May 2024, 12:54:14 UTC - in response to Message 61504.  

You may need to update your drivers.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 295,172
Level: Trp
Message 61509 - Posted: 16 May 2024, 8:21:07 UTC - in response to Message 61505.  

You may need to update your drivers.

It's a possibility - but the card/driver combo is accepted to run the cuda1121 version of QC. It's only the Python beta which needs cuda1131.

We'll see what happens when my other Linux machine catches a task - that does have a newer card and driver.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 295,172
Level: Trp
Message 61519 - Posted: 21 May 2024, 16:28:22 UTC

OK, that's looking more plausible. My other machine (driver 535.99) has completed tasks on the primary RTX 3060 GPU, and is now running one on the secondary GTX 1660 GPU - no problems so far.

So I've upgraded the failing machine from driver 470.99 up to a matching 535.99: back to the long slow fishing game!
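A tiny sketch of the comparison being made here, using the two driver versions from this post. The helper names are hypothetical, and using 535.99 as the reference is just "the version that works on the other machine", not a documented minimum for the app.

```python
def parse_version(text):
    """Turn a driver string such as '535.99' into a comparable tuple."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_at_least(installed, reference):
    """True if the installed driver is the same as or newer than the reference."""
    return parse_version(installed) >= parse_version(reference)


print(driver_at_least("470.99", "535.99"))  # False
print(driver_at_least("535.99", "535.99"))  # True
```

On a host with the NVIDIA tools installed, the installed version string can usually be read with nvidia-smi --query-gpu=driver_version --format=csv,noheader.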

Meanwhile, I'll check the estimates for the task on the slower secondary card - that might be a (different) problem.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 295,172
Level: Trp
Message 61526 - Posted: 24 May 2024, 14:25:12 UTC

I see we've been given a big new block of ACEMD tasks to chew on.

Here are my current estimates for host 132158, after 9 completed tasks:



nearly 12 days for Quantum Chemistry
5.5 hours for ACEMD 3

That's still pretty tight on maximum time, but I've already got two more tasks to run - and they're all running to completion for now. We'll take another look after 11 completed tasks, to see what effect that has.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 295,172
Level: Trp
Message 61527 - Posted: 25 May 2024, 7:33:20 UTC

Yup, confirmed:



If you can get to 11 completed tasks, it's plain sailing from there on. The original 'time limit exceeded' problem was caused by the project's poor estimation of the work involved in completing the different work types - but it would be devilishly difficult for them to correct it at this late stage, without causing similar problems for other apps too. I suspect we'll have to live with it.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1114
Credit: 40,838,722,595
RAC: 4,266,994
Level: Trp
Message 61528 - Posted: 25 May 2024, 15:18:24 UTC

I guess updating the drivers solved your previous problem.

The app may be labelled with an incorrect CUDA version requirement.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 295,172
Level: Trp
Message 61529 - Posted: 25 May 2024, 15:28:45 UTC - in response to Message 61528.  

I guess updating the drivers solved your previous problem.

Yes, that machine is running fine now - 6 tasks completed, plus two running.

It's still in the danger zone for 'exceeded elapsed time limit', but looks like it should pull through.

mrchips
Joined: 9 May 21
Posts: 16
Credit: 1,435,881,404
RAC: 20
Level: Met
Message 61570 - Posted: 29 Jun 2024, 11:11:40 UTC

All of my WUs have failed for the past 3 days:

-112 (0xffffffffffffff90) ERR_XML_PARSE
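The hexadecimal value in brackets is simply the same error code, -112, shown as an unsigned 64-bit two's-complement number. A short Python sketch of the conversion in both directions (the function names are illustrative only):

```python
def to_unsigned_hex(code, bits=64):
    """Render a signed error code as unsigned two's-complement hex."""
    return hex(code & ((1 << bits) - 1))


def from_unsigned_hex(text, bits=64):
    """Recover the signed value from an unsigned two's-complement hex string."""
    value = int(text, 16)
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value


print(to_unsigned_hex(-112))
print(from_unsigned_hex(to_unsigned_hex(-112)))
```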


ServicEnginIC
Joined: 24 Sep 10
Posts: 592
Credit: 11,972,186,510
RAC: 998,578
Level: Trp
Message 61571 - Posted: 29 Jun 2024, 13:04:07 UTC - in response to Message 61570.  

It may be related to the ACEMD 3 app update to v2.28, deployed on 26/06/2024.
Your previous v2.27 tasks were completing correctly.
Wait until no tasks are running, then try resetting the GPUGRID project in BOINC Manager to download all related libraries afresh.
If that doesn't help, something might be wrong with the new version.

mrchips
Joined: 9 May 21
Posts: 16
Credit: 1,435,881,404
RAC: 20
Level: Met
Message 61572 - Posted: 29 Jun 2024, 13:16:03 UTC

I just started a new computer yesterday to run GPUGRID, so all files were transferred fresh. I think the new version is corrupt.

Is there a way to get back to the previous version?


ServicEnginIC
Joined: 24 Sep 10
Posts: 592
Credit: 11,972,186,510
RAC: 998,578
Level: Trp
Message 61573 - Posted: 29 Jun 2024, 13:58:54 UTC - in response to Message 61572.  

No way back.
We'll have to wait for debugging on the project's side...

tomaras
Joined: 4 Mar 20
Posts: 18
Credit: 3,119,821,062
RAC: 1,336,390
Level: Arg
Message 61574 - Posted: 30 Jun 2024, 22:59:01 UTC

Are there no work units? Is something amiss?

roundup
Joined: 11 May 10
Posts: 68
Credit: 12,293,491,875
RAC: 2,191,793
Level: Trp
Message 61575 - Posted: 1 Jul 2024, 4:59:43 UTC - in response to Message 61574.  

Are there no work units? Is something amiss?


Watch the Server Status. It tells you how many work units are available:
https://www.gpugrid.net/server_status.php

©2025 Universitat Pompeu Fabra