ACEMD3 High error rates

Message boards : Number crunching : ACEMD3 High error rates

homer__simpsons

Joined: 17 Nov 15
Posts: 14
Credit: 136,767,025
RAC: 0
Level
Cys
Message 61952 - Posted: 24 Nov 2024, 13:58:39 UTC

Looking at my host, 6 of 15 tasks failed (40%):



I believe this is not expected. I started to see this with the new batch (from 2024-11-17?). I had previously paused ACEMD3 tasks, but have since started running them again.

Luckily they fail early in the computation, so not too many resources are wasted, but it should probably be investigated.

Host: https://www.gpugrid.net/show_host_detail.php?hostid=611890

ID: 61952 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Paul Forsdick

Joined: 21 Feb 09
Posts: 1
Credit: 42,661,435
RAC: 19
Level
Val
Message 61953 - Posted: 24 Nov 2024, 17:58:52 UTC - in response to Message 61952.  

I have the same problem
ID: 61953 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 891
Level
Tyr
Message 61956 - Posted: 25 Nov 2024, 8:04:55 UTC - in response to Message 61953.  

I have the same problem

You do not have the same problem referenced in this thread, since you haven't run any acemd3 tasks.

All your errors are the ATMML tasks.
ID: 61956 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
homer__simpsons

Joined: 17 Nov 15
Posts: 14
Credit: 136,767,025
RAC: 0
Level
Cys
Message 61975 - Posted: 29 Nov 2024, 10:41:01 UTC

I upgraded my Game Ready drivers to v566.14 just in case.

Now I am at 15 errors out of 37 ACEMD3 tasks, so still about 40%.

The jobs fail rather early, so "this is fine", but it is still a waste of resources.

If I can provide anything to help debug this, please let me know.
ID: 61975 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
TofPete

Joined: 17 Mar 24
Posts: 15
Credit: 63,874,103
RAC: 0
Level
Thr
Message 62012 - Posted: 10 Dec 2024, 8:44:09 UTC - in response to Message 61952.  

I have the same problem.

Most of the acemd3 tasks failed with a memory leak or an unknown error:



It's a bit annoying that 32 of my most recent 54 tasks failed.
That's a failure rate of 59%.

Can anyone help me solve this?
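For anyone comparing numbers, here is a minimal sketch of the arithmetic behind the percentages quoted so far in this thread. The function name `failure_rate` is only illustrative; the counts are the ones posted above.

```python
def failure_rate(failed: int, total: int) -> float:
    """Return the fraction of tasks that failed (0.0 when no tasks ran)."""
    if failed < 0 or total < 0 or failed > total:
        raise ValueError("need 0 <= failed <= total")
    return failed / total if total else 0.0


# (failed, total) pairs as reported by posters in this thread.
reported = {"6 of 15": (6, 15), "15 of 37": (15, 37), "32 of 54": (32, 54)}

for label, (failed, total) in reported.items():
    print(f"{label}: {failure_rate(failed, total):.1%}")
```

With samples this small, a few percentage points of difference between hosts probably does not mean much.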

ID: 62012 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
den777

Joined: 29 Apr 13
Posts: 1
Credit: 71,060,506
RAC: 0
Level
Thr
Message 62013 - Posted: 10 Dec 2024, 13:01:42 UTC

Same problem here.
The worst thing is that Windows shows a popup about a memory access violation, and until I manually click OK, the task won't finish and just sits idle.
ID: 62013 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Michael H.W. Weber

Joined: 9 Feb 16
Posts: 78
Credit: 656,229,684
RAC: 0
Level
Lys
Message 62076 - Posted: 24 Dec 2024, 14:59:15 UTC
Last modified: 24 Dec 2024, 14:59:55 UTC

Same problem - two error messages, the first being the major one:

(unknown error) (0) - exit code 195 (0xc3)
(unknown error) (87) - exit code 195 (0xc3)

Michael.
President of Rechenkraft.net - Germany's first and largest distributed computing organization.
ID: 62076 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Ian&Steve C.

Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Level
Trp
Message 62077 - Posted: 24 Dec 2024, 18:39:12 UTC - in response to Message 62076.  

Same problem - two error messages, the first being the major one:

(unknown error) (0) - exit code 195 (0xc3)
(unknown error) (87) - exit code 195 (0xc3)

Michael.


Unhide your hosts so that the whole error can be seen. Any "195" code is not helpful; that's just the generic error from the BOINC app or wrapper. The actual reason for the failure may be buried deeper in the stderr output, and could very well be related to your hardware and software configuration (such as incorrect drivers).
ID: 62077 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
TofPete

Joined: 17 Mar 24
Posts: 15
Credit: 63,874,103
RAC: 0
Level
Thr
Message 62160 - Posted: 22 Jan 2025, 8:49:43 UTC - in response to Message 62077.  

I have the same problem: the error rate is about 50% (16 errors out of 33 tasks), which is annoying!
I use this computer for other computing projects as well, but only the GPUGrid ACEMD3 tasks produce errors.

I can see "unknown error" and memory leaks in the logs:
https://www.gpugrid.net/result.php?resultid=37934054

All operating system and graphics card driver updates are installed on my machine, so I don't know what else I can do to solve these memory leak errors.

How can I "unhide" my host so that more details about this problem can be seen?



ID: 62160 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Level
Trp
Message 62161 - Posted: 22 Jan 2025, 10:45:53 UTC - in response to Message 62160.  

How can I "unhide" my host to see more details about this problem?

Log in to your home page on this website (https://www.gpugrid.net/myaccount.php).

Under 'Preferences', choose GPUGRID preferences (https://www.gpugrid.net/prefs.php?subset=project).

[don't worry about the error messages - it still works]

Edit the top group - 'Primary (default) preferences'.

Check 'Should GPUGRID show your computers on its web site?' and update. That's all.
ID: 62161 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
William Albert

Joined: 22 Sep 24
Posts: 9
Credit: 195,120,851
RAC: 0
Level
Ile
Message 62162 - Posted: 22 Jan 2025, 17:13:05 UTC

My host has been crunching these units non-stop without issue.

I initially thought that perhaps this was a Windows-specific problem, since that was the common factor among the people in this thread who complained and had their hosts visible. However, the work units in question were re-assigned to other Windows hosts, and those hosts processed their respective tasks to completion without issue.

The other possibility that came to mind was the task failing due to running out of VRAM. It's a possibility (especially if the host is being used for other graphical things or is simultaneously running other GPU work units), but in past instances I've seen of this, the task failed with a specific out-of-memory message.

Given that these tasks aren't universally failing, there's also the possibility that those who are reporting issues simply have failing or unreliable hardware. A quick (if perhaps disruptive) way of testing this is to power- or clock-limit the GPU and see if the errors stop.
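As a rough sketch of the power-limit test, the snippet below builds an `nvidia-smi -pl` call for one GPU. The wattage is a placeholder, not a recommendation; the helper names are made up for illustration; changing the limit normally needs administrator rights, and some cards do not allow it at all, in which case the call simply fails.

```python
import shutil
import subprocess


def power_limit_command(watts: int, gpu_index: int = 0) -> list[str]:
    """Build the nvidia-smi call that caps board power for one GPU."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]


def apply_power_limit(watts: int, gpu_index: int = 0) -> bool:
    """Run the command if nvidia-smi is on PATH; return True on success."""
    if shutil.which("nvidia-smi") is None:
        return False  # no NVIDIA tools available on this machine
    result = subprocess.run(
        power_limit_command(watts, gpu_index),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


# Placeholder value: pick something below the card's default limit.
print(" ".join(power_limit_command(60)))
```

If the errors disappear at the lower limit, that points towards marginal hardware rather than the work units.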
ID: 62162 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
kcharuso

Joined: 7 Oct 13
Posts: 5
Credit: 1,077,934,108
RAC: 207
Level
Met
Message 62163 - Posted: 23 Jan 2025, 1:15:59 UTC - in response to Message 62162.  
Last modified: 23 Jan 2025, 1:24:54 UTC

My failure rate for ACEMD3 ranges from 1-2 tasks to 7-9 tasks a day. Tasks usually fail within a few minutes of starting, so not many resources are wasted, though none would be better. I also noticed that the GPU time and run time are similar, differing by about a hundred seconds. Is this normal?
ID: 62163 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
TofPete

Joined: 17 Mar 24
Posts: 15
Credit: 63,874,103
RAC: 0
Level
Thr
Message 62165 - Posted: 23 Jan 2025, 17:15:38 UTC - in response to Message 62161.  

I checked the setting mentioned and it's already enabled.


ID: 62165 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
TofPete

Joined: 17 Mar 24
Posts: 15
Credit: 63,874,103
RAC: 0
Level
Thr
Message 62166 - Posted: 23 Jan 2025, 17:28:46 UTC - in response to Message 62162.  

I understand this, but my problem is that I don't know which settings need to be changed to fix these failures.

Sometimes I only lose several minutes, but there are tasks that needed about 9800 seconds to fail.

I think that 32 GB of RAM, a 3 GHz CPU clock and an Nvidia GTX 1050 Ti with 4 GB of VRAM are enough for this kind of task. I use my CPU and video card at their normal settings; I don't use overclocking, etc. And the strange thing is that the problem occurs only with ACEMD3 tasks.

And the logs say that there were memory leaks.
Why?
What settings should I change to prevent the leaks?


ID: 62166 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Pascal

Joined: 15 Jul 20
Posts: 95
Credit: 2,550,803,412
RAC: 248
Level
Phe
Message 62167 - Posted: 23 Jan 2025, 18:05:01 UTC - in response to Message 62166.  

Hello,

You should try increasing the virtual memory to 50 GB, that is, increasing the pagefile size in Windows to around 50-60 GB.

https://forums.cnetfrance.fr/tutoriels-windows-10/575813-windows-10-augmenter-la-memoire-de-pagination-ou-memoire-virtuelle

https://answers.microsoft.com/fr-fr/windows/forum/all/restauration-pagefilesys/628b8a32-f8cd-4481-95a1-2ebd1ef08ce1

It worked for me before I switched to Linux.
ID: 62167 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Level
Trp
Message 62168 - Posted: 23 Jan 2025, 18:07:56 UTC - in response to Message 62165.  

And I can see your computer (host 619264) and tasks just fine - not sure why others were having problems.

Your computer is completing ATM tasks OK, but failing ACEMD3 tasks. The logs show the underlying errors:

16:28:06 (6248): bin/acemd.exe exited; CPU time 0.000000
16:28:06 (6248): app exit status: 0xc0000005

09:46:12 (20156): bin/acemd.exe exited; CPU time 0.015625
09:46:12 (20156): app exit status: 0xc0000005

03:13:45 (10704): bin/acemd.exe exited; CPU time 0.000000
03:13:45 (10704): app exit status: 0xc0000005

17:47:38 (11904): bin/acemd.exe exited; CPU time 0.000000
17:47:38 (11904): app exit status: 0xc0000005

14:00:30 (25420): bin/acemd.exe exited; CPU time 0.000000
14:00:30 (25420): app exit status: 0xc0000005

The exit status (normally written 0xC0000005) is a Windows code defined as "STATUS_ACCESS_VIOLATION", which in full would be reported as 'The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.' - BOINC hasn't passed on those extra parameters.

Many online 'answers' to online searches will suggest that this could be caused by faulty computer RAM, but that's not the only answer - it can also be caused by bad programming. In your case, every example still visible occurs as the application starts or restarts. I'd recommend that you try to avoid pausing ACEMD3 tasks mid-run - try to let them run continuously to completion. See if that reduces the error rate to an acceptable level.
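To check a longer stderr log for the same pattern, a small sketch like the following can tally the exit statuses. It assumes the wrapper lines look exactly like the ones quoted above ("app exit status: 0x..."); the lookup table is deliberately tiny and only lists a few well-known Windows NTSTATUS values, and the function name is just illustrative.

```python
import re

# A few well-known NTSTATUS values; this table is not exhaustive.
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
    0xC000013A: "STATUS_CONTROL_C_EXIT",
}

EXIT_LINE = re.compile(r"app exit status: (0x[0-9a-fA-F]+)")


def summarize_exit_statuses(stderr_text: str) -> dict[str, int]:
    """Count how often each exit status appears in a stderr log."""
    counts: dict[str, int] = {}
    for match in EXIT_LINE.finditer(stderr_text):
        code = int(match.group(1), 16)
        name = NTSTATUS_NAMES.get(code, f"unknown (0x{code:08X})")
        counts[name] = counts.get(name, 0) + 1
    return counts


# Two of the log excerpts quoted above.
sample = """16:28:06 (6248): bin/acemd.exe exited; CPU time 0.000000
16:28:06 (6248): app exit status: 0xc0000005
09:46:12 (20156): bin/acemd.exe exited; CPU time 0.015625
09:46:12 (20156): app exit status: 0xc0000005
"""
print(summarize_exit_statuses(sample))
```

Codes missing from the table are reported as "unknown" along with their hex value, so they can be looked up by hand.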

ID: 62168 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 891
Level
Tyr
Message 62169 - Posted: 23 Jan 2025, 18:52:19 UTC

I'd just increase the Windows pagefile size first to 50-60GB and reboot to see if that fixes the issue.

If that fails I would start investigating your memory for errors.
ID: 62169 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Pascal

Joined: 15 Jul 20
Posts: 95
Credit: 2,550,803,412
RAC: 248
Level
Phe
Message 62170 - Posted: 23 Jan 2025, 19:14:00 UTC - in response to Message 62169.  

To test memory, use Memtest rather than the tool built into Windows. Memtest is more reliable.

https://www.memtest86.com/
ID: 62170 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Pascal

Joined: 15 Jul 20
Posts: 95
Credit: 2,550,803,412
RAC: 248
Level
Phe
Message 62171 - Posted: 23 Jan 2025, 19:19:32 UTC - in response to Message 62170.  

I also advise you to disable memory integrity.

After that, if the problem continues, it's beyond my knowledge.


https://www.malekal.com/desactiver-isolation-noyau-windows-11-10/
ID: 62171 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Pascal

Joined: 15 Jul 20
Posts: 95
Credit: 2,550,803,412
RAC: 248
Level
Phe
Message 62172 - Posted: 23 Jan 2025, 19:23:35 UTC - in response to Message 62171.  
Last modified: 23 Jan 2025, 19:52:11 UTC

Then again, no one is safe from work units that seem to have a bug; at least, I assume that is the cause here and not my PC.

https://www.gpugrid.net/workunit.php?wuid=31255135

Remember to check your graphics card's operating temperature, in case the fan is worn out.

Thermal and Power Specs:
Maximum GPU Temperature (in C): 97 (the same value in all three listed columns)
https://www.nvidia.com/en-us/geforce/10-series/
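One way to watch the temperature (and VRAM use) while a task runs is to poll `nvidia-smi`. The sketch below assumes the NVIDIA driver tools are installed and on PATH; the function names are illustrative, and the sample line at the end is made-up data, not a reading from anyone's host.

```python
import csv
import io
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=temperature.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_query(text: str) -> list[dict[str, int]]:
    """Parse 'temp, used MiB, total MiB' rows, skipping non-numeric rows."""
    readings = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) != 3:
            continue
        try:
            temp, used, total = (int(field.strip()) for field in row)
        except ValueError:
            continue  # e.g. a field reported as not available
        readings.append(
            {"temp_c": temp, "vram_used_mib": used, "vram_total_mib": total}
        )
    return readings


def read_gpus() -> list[dict[str, int]]:
    """Query every GPU once; return an empty list without nvidia-smi."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_query(out.stdout)


# Hypothetical sample line, only to show the parsed shape.
print(parse_gpu_query("65, 1234, 4096\n"))
```

Readings that stay close to the 97 C maximum quoted above would suggest a cooling problem; VRAM use close to the total would point to the out-of-memory explanation instead.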
ID: 62172 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote

©2025 Universitat Pompeu Fabra