ACEMD updated app

Author	Message
Keith Myers Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 34,713	Message 60927 - Posted: 3 Jan 2024, 22:30:58 UTC - in response to Message 60926. You need to look at a running task while it is still in its slot and capture the stderr.txt and progress files for later examination before the task errors out and clears the slot. Your uploaded result files do not have any useful information about why the tasks are failing. You should at least examine the acemd application for its license expiration as posted in my last post. Assuming the Windows application got the same license expiration, the tasks should run. acemd --licence would at least eliminate that as the issue. Or prove it.
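
The advice above (grab stderr.txt and the progress files while the task still occupies its slot) could be scripted roughly as follows. This is only a sketch: the function name `snapshot_slot`, the `progress*` file pattern, and any slot path you pass in are assumptions, not something the project documents.

```python
import shutil
import time
from pathlib import Path


def snapshot_slot(slot_dir, dest_root, patterns=("stderr.txt", "progress*")):
    """Copy matching files from a BOINC slot directory into a timestamped folder.

    The patterns are assumptions; adjust them to whatever the running task
    actually writes into its slot.
    """
    slot = Path(slot_dir)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(dest_root) / f"slot{slot.name}-{stamp}"
    dest.mkdir(parents=True, exist_ok=True)

    copied = []
    for pattern in patterns:
        for src in sorted(slot.glob(pattern)):
            if src.is_file():
                # copy2 keeps timestamps, which helps when comparing runs later
                shutil.copy2(src, dest / src.name)
                copied.append(src.name)
    return dest, copied
```

It would be called while the task is running, for example `snapshot_slot("/var/lib/boinc-client/slots/0", "/tmp/gpugrid-debug")`. That slot path is a guess at a typical Linux layout and differs between installations.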

Erich56 Joined: 1 Jan 15 Posts: 1168 Credit: 12,317,898,501 RAC: 75,187	Message 60928 - Posted: 4 Jan 2024, 6:08:43 UTC - in response to Message 60927. "... Your uploaded result files do not have any useful information about why the tasks are failing." Yes, you are right, the task from the link I uploaded before does not show any stderr.txt, for whatever reason (I did not check this before, sorry for that). I have noticed that this is the case with all tasks from this PC, regardless of whether they succeed or fail; no idea why. However, the stderr from the other PC where ACEMD 3 tasks also failed does work, here is an example: http://www.gpugrid.net/result.php?resultid=33725327 "You should at least examine the acemd application for its license expiration as posted in my last post. Assuming the Windows application got the same license expiration, the tasks should run. acemd --licence would at least eliminate that as the issue. Or prove it." Since ACEMD 3 started failing yesterday at about the same time on both of my PCs (on a third PC, unfortunately, I cannot crunch ACEMD 3 because the app does not work with Ada Lovelace yet), my guess, of course, was that this is not due to any problem with my hardware or my software, but rather to a problem with the app itself, probably with the license.

Keith Myers Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 34,713	Message 60929 - Posted: 4 Jan 2024, 7:34:07 UTC - in response to Message 60928. The stderr.txt on Windows hosts never shows any reason for failing or succeeding. I've never been able to decipher why all Windows tasks have the debug dump in their outputs. You get the same dump output whether it succeeds or fails. They only ever display the generic error 195, the BOINC catchall error code, which does not explain anything. The Linux stderr.txt output actually does show explicit reasons for why a task fails. Your Quadro P5000 is NOT Ada generation, it's Pascal generation, and Pascal cards have always worked with acemd tasks. I've been trying to find the code path for these acemd tasks and haven't been able to deduce anything beyond the CONDA environment they set up in the job file and the parameter file they pass on to the app. They don't have the layout the ATMbeta tasks use, which lets you follow along with the processing and figure out where they fail in the setup or processing flow. There was a slug of acemd tasks released today that had the same issue with no Windows file found, and all failed on all the Linux hosts. But they were not from the initial bad release; they were newly generated tasks today. This task is an example of what I said: https://www.gpugrid.net/result.php?resultid=33725353 It was attempted by 7 hosts, both Windows and Linux, so the task itself is badly configured.

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 60930 - Posted: 4 Jan 2024, 9:56:31 UTC - in response to Message 60929. Actually, Erich's https://www.gpugrid.net/result.php?resultid=33725327 does contain a useful error code: "app exit status: 0xc0000135". That's a generic Windows NT code: 0xC0000135 STATUS_DLL_NOT_FOUND, "{Unable To Locate Component} This application has failed to start because %hs was not found. Reinstalling the application might fix this problem." You have to be careful and search Microsoft itself for that one: the general internet chatterbox will usually say that a specific component is at fault (usually the .NET framework), which is unlikely to be relevant for research applications. You might be able to get a name for the missing component by trying to launch the application manually in a terminal window - it should populate that %hs parameter.
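
As a rough illustration of the lookup described above, an "app exit status" value can be normalised and matched against a table of known NTSTATUS names. The table below is a tiny hand-picked sample rather than a complete list, and the helper name `describe_exit_status` is made up for this sketch.

```python
# Small hand-picked sample; NOT a complete list of NTSTATUS values.
KNOWN_NTSTATUS = {
    0xC0000135: "STATUS_DLL_NOT_FOUND",
    0xC0000005: "STATUS_ACCESS_VIOLATION",
}


def describe_exit_status(text):
    """Turn a hex exit status string from stderr into 'code name' form."""
    # int(..., 16) accepts an optional 0x prefix; the mask keeps 32 bits
    code = int(text, 16) & 0xFFFFFFFF
    name = KNOWN_NTSTATUS.get(code, "unknown (search Microsoft's NTSTATUS list)")
    return f"0x{code:08X} {name}"


print(describe_exit_status("0xc0000135"))
```

The name alone does not say which DLL is missing; for that, launching the application by hand as suggested in the post is still the way to go.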

ServicEnginIC Joined: 24 Sep 10 Posts: 595 Credit: 12,249,686,510 RAC: 1,140,567	Message 60935 - Posted: 8 Jan 2024, 5:44:10 UTC - in response to Message 60904. 4 ACEMD tasks received at this Linux host on January 7-8th continued to fail after a few seconds. One example: Application: ACEMD 3: molecular dynamics simulations for GPUs 2.22 (cuda1121); Name: 0_2-CRYPTICSCOUT_pocket_discovery_f279f6d5_5830_427a_b012_ee7935c48e7f-1-3-RND8942; State: Computation error; Received: Mon 08 Jan 2024 02:58:37 WET; Report deadline: Sat 13 Jan 2024 02:58:36 WET; Resources: 0.49 CPUs + 1 NVIDIA GPU; Estimated computation size: 5,000,000 GFLOPs; CPU time: 00:00:00; Elapsed time: 00:00:33; Executable: wrapper_26198_x86_64-pc-linux-gnu

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 60936 - Posted: 8 Jan 2024, 8:22:14 UTC - in response to Message 60935. They have "ModuleNotFoundError: No module named 'msvcrt'". I think that stands for "MicroSoft Visual C RunTime [module]" - which is odd to see in a Linux package.
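
For context, `msvcrt` is a Python standard-library module that only exists on Windows, so an unconditional `import msvcrt` raises exactly this error on Linux. How the ACEMD package actually imports it is not visible from the thread; the snippet below only sketches the usual defensive pattern, with `load_msvcrt` as a made-up helper name.

```python
import sys


def load_msvcrt():
    """Return the msvcrt module on Windows, or None on other platforms."""
    try:
        import msvcrt  # Windows-only standard-library module
    except ModuleNotFoundError:
        return None
    return msvcrt


mod = load_msvcrt()
if mod is None:
    print(f"msvcrt is not available on {sys.platform}; using a fallback path")
else:
    print("msvcrt loaded; Windows-specific code can run")
```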

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 61504 - Posted: 15 May 2024, 10:39:23 UTC Following on from the reported issue in the ATM thread ("exceeded elapsed time limit" error - message 61483): I've finally caught one of these for inspection in daylight. It's on a Linux machine, so a slightly different version - v2.24, deployed 15 Apr 2024 - but it should be close enough. Here are the vital statistics: App speed: <flops> 6271039115434; Task size: <rsc_fpops_est> 1000000000000000000; Correction: <duration_correction_factor> 0.010000; for an estimated run time of 1594 seconds - or 26 minutes 34 seconds, shown in BOINC Manager. The time limit for the task is set by <rsc_fpops_bound>, which is 10 times larger than the estimate. So, 4 hours, 25 minutes, 40 seconds on this GeForce GTX 1660 Ti. I'll let you know how it gets on - or you can look it up yourself this afternoon, at task 35250069. Or not. "ACEMD failed: Error loading CUDA module: CUDA_ERROR_UNSUPPORTED_PTX_VERSION (222)". Back to the drawing board, while it gets on with Quantum chemistry as usual!
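
The arithmetic in the post above can be reproduced in a few lines. This mirrors the calculation as the post describes it (estimate = task size / app speed x correction factor, limit = 10 x estimate) and is not meant as a statement of how the BOINC client computes these values internally. The variable names are illustrative.

```python
def split_hms(seconds):
    """Split a duration in seconds into whole (hours, minutes, seconds)."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return hours, minutes, secs


# Values quoted in the post
flops = 6271039115434                       # <flops>, app speed
rsc_fpops_est = 1_000_000_000_000_000_000   # <rsc_fpops_est>, task size
dcf = 0.010000                              # <duration_correction_factor>

estimate_s = rsc_fpops_est / flops * dcf
limit_s = 10 * estimate_s  # per the post: the bound is 10x the estimate

print("estimate:", int(estimate_s), "s =", split_hms(estimate_s))
print("limit:   ", int(limit_s), "s =", split_hms(limit_s))
```

The estimate comes out at 1594 whole seconds, matching the post. The limit computed from the unrounded estimate lands a few seconds above the post's figure, which appears to have been derived from the rounded 1594 seconds.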

Ian&Steve C. Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 8,067	Message 61505 - Posted: 15 May 2024, 12:54:14 UTC - in response to Message 61504. You may need to update your drivers.

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 61509 - Posted: 16 May 2024, 8:21:07 UTC - in response to Message 61505. "You may need to update your drivers." It's a possibility - but the card/driver combo is accepted to run the cuda1121 version of QC. It's only the Python beta which needs cuda1131. We'll see what happens when my other Linux machine catches a task - that does have a newer card and driver.

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 61519 - Posted: 21 May 2024, 16:28:22 UTC OK, that's looking more plausible. My other machine (driver 535.99) has completed tasks on the primary RTX 3060 GPU, and is now running one on the secondary GTX 1660 GPU - no problems so far. So I've upgraded the failing machine from driver 470.99 up to a matching 535.99: back to the long slow fishing game! Meanwhile, I'll check the estimates for the task on the slower secondary card - that might be a (different) problem.

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 61526 - Posted: 24 May 2024, 14:25:12 UTC I see we've been given a big new block of ACEMD tasks to chew on. Here are my current estimates for host 132158, after 9 completed tasks: nearly 12 days for Quantum Chemistry; 5.5 hours for ACEMD 3. That's still pretty tight on maximum time, but I've already got two more tasks to run - and they're all running to completion for now. We'll take another look after 11 completed tasks, to see what effect that has.

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 61527 - Posted: 25 May 2024, 7:33:20 UTC Yup, confirmed: If you can get to 11 completed tasks, it's plain sailing from there on. The original 'time limit exceeded' problem was caused by the project's poor estimation of the work involved in completing the different work types - but it would be devilishly difficult for them to correct it at this late stage, without causing similar problems for other apps too. I suspect we'll have to live with it.

Ian&Steve C. Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 8,067	Message 61528 - Posted: 25 May 2024, 15:18:24 UTC I guess updating the drivers solved your previous problem. The app may be labelled with an incorrect CUDA version requirement.

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 61529 - Posted: 25 May 2024, 15:28:45 UTC - in response to Message 61528. "I guess updating the drivers solved your previous problem." Yes, that machine is running fine now - 6 tasks completed, plus two running. It's still in the danger zone for 'exceeded elapsed time limit', but looks like it should pull through.

mrchips Joined: 9 May 21 Posts: 16 Credit: 1,435,881,404 RAC: 0	Message 61570 - Posted: 29 Jun 2024, 11:11:40 UTC All of my WUs have failed for the past 3 days with: -112 (0xffffffffffffff90) ERR_XML_PARSE
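
The two numbers in that error line are the same value written two ways: -112 as a signed integer, and its 64-bit two's-complement bit pattern in hex. A small sketch of the conversion, with illustrative helper names:

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def to_signed64(value):
    """Interpret a 64-bit unsigned pattern as a signed integer."""
    value &= MASK64
    # If the top bit is set, the value is negative in two's complement
    return value - (1 << 64) if value & (1 << 63) else value


def to_unsigned64_hex(value):
    """Render a (possibly negative) integer as its 64-bit hex pattern."""
    return f"0x{value & MASK64:016x}"


print(to_unsigned64_hex(-112))
print(to_signed64(int(to_unsigned64_hex(-112), 16)))
```

So when searching for the error, the short signed form (-112, ERR_XML_PARSE) is the one to look up; the long hex form is just how the exit code was printed.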

ServicEnginIC Joined: 24 Sep 10 Posts: 595 Credit: 12,249,686,510 RAC: 1,140,567	Message 61571 - Posted: 29 Jun 2024, 13:04:07 UTC - in response to Message 61570. It may be related to the ACEMD 3 app update to v2.28, deployed on 26/06/2024. Your previous v2.27 tasks were completing correctly. Wait until no tasks are running and try resetting the GPUGRID project in BOINC Manager, to freshly download all related libraries. If that doesn't help, something might be wrong with the new version.

mrchips Joined: 9 May 21 Posts: 16 Credit: 1,435,881,404 RAC: 0	Message 61572 - Posted: 29 Jun 2024, 13:16:03 UTC I just started a new computer yesterday to run gpugrid; all new uploads were performed. I think the new version is corrupt. Is there a way to get back to the previous version?

ServicEnginIC Joined: 24 Sep 10 Posts: 595 Credit: 12,249,686,510 RAC: 1,140,567	Message 61573 - Posted: 29 Jun 2024, 13:58:54 UTC - in response to Message 61572. No way back. We have to wait for debugging on the project's side...

tomaras Joined: 4 Mar 20 Posts: 18 Credit: 3,128,071,062 RAC: 20,837	Message 61574 - Posted: 30 Jun 2024, 22:59:01 UTC Are there no work units? Is something amiss?

roundup Joined: 11 May 10 Posts: 68 Credit: 12,531,253,875 RAC: 67,608	Message 61575 - Posted: 1 Jul 2024, 4:59:43 UTC - in response to Message 61574. "Are there no work units? Is something amiss?" Watch the Server Status. It tells you how many work units are available: https://www.gpugrid.net/server_status.php