Message boards :
Number crunching :
All ATM beta error out
| Author | Message |
|---|---|
|
Joined: 6 Mar 18 Posts: 38 Credit: 1,340,042,080 RAC: 33
|
Yea tried running as admin, same issue, switching this machine to Linux is not an option. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
Do you have AV on the system? Is it possible that it's blocking some activity from the app, like preventing the download of the extra packages it needs? It seems to be failing sometime between downloading the extra packages and running the app. Try disabling your AV, or whitelisting the BOINC data directory.
|
|
Joined: 30 Jun 14 Posts: 153 Credit: 129,654,684 RAC: 0
|
This is a dead task's stderr output (currently at 2 hours and showing 100%), and since I am going to bed for 8 hrs I am aborting it. It's only my 3rd, and I have a 4th like it.
23:28:42 (16788): wrapper (7.9.26016): starting
23:28:42 (16788): wrapper: running python.exe (bin/conda-unpack)
23:28:44 (16788): python.exe exited; CPU time 0.000000
23:28:44 (16788): wrapper: running Library/usr/bin/tar.exe (xjvf input.tar.bz2)
run.bat
run.sh
tnks2_m5b_m5c_0.xml
tnks2_m5b_m5c_asyncre.cntl
tnks2_m5b_m5c.inpcrd
tnks2_m5b_m5c.prmtop
23:28:45 (16788): Library/usr/bin/tar.exe exited; CPU time 0.031250
23:28:45 (16788): wrapper: running C:/Windows/system32/cmd.exe (/c call run.bat)
Running command git clone --filter=blob:none --quiet https://github.com/raimis/AToM-OpenMM.git 'D:\data\slots\13\tmp\pip-req-build-y0gn8rc9'
Running command git rev-parse -q --verify 'sha^d7931b9a6217232d481731f7589d64b100a514ac'
Running command git fetch -q https://github.com/raimis/AToM-OpenMM.git d7931b9a6217232d481731f7589d64b100a514ac
Running command git checkout -q d7931b9a6217232d481731f7589d64b100a514ac
Warning: importing 'simtk.openmm' is deprecated. Import 'openmm' instead.
This is a completed one (at a quick glance at the opening commands I can see no difference):
<core_client_version>7.24.1</core_client_version>
<![CDATA[
<stderr_txt>
19:06:49 (13988): wrapper (7.9.26016): starting
19:06:49 (13988): wrapper: running python.exe (bin/conda-unpack)
19:06:52 (13988): python.exe exited; CPU time 0.000000
19:06:52 (13988): wrapper: running Library/usr/bin/tar.exe (xjvf input.tar.bz2)
run.bat
run.sh
tnks2_m1b_m5h_0.xml
tnks2_m1b_m5h_asyncre.cntl
tnks2_m1b_m5h.inpcrd
tnks2_m1b_m5h.prmtop
19:06:53 (13988): Library/usr/bin/tar.exe exited; CPU time 0.015625
19:06:53 (13988): wrapper: running C:/Windows/system32/cmd.exe (/c call run.bat)
Running command git clone --filter=blob:none --quiet https://github.com/raimis/AToM-OpenMM.git 'D:\data\slots\15\tmp\pip-req-build-o6pc0y1a'
Running command git rev-parse -q --verify 'sha^d7931b9a6217232d481731f7589d64b100a514ac'
Running command git fetch -q https://github.com/raimis/AToM-OpenMM.git d7931b9a6217232d481731f7589d64b100a514ac
Running command git checkout -q d7931b9a6217232d481731f7589d64b100a514ac
Warning: importing 'simtk.openmm' is deprecated. Import 'openmm' instead.
tar: run.log: file changed as we read it
03:07:01 (13988): C:/Windows/system32/cmd.exe exited; CPU time 27177.796875
03:07:01 (13988): called boinc_finish(0)
0 bytes in 0 Free Blocks.
310 bytes in 4 Normal Blocks.
1144 bytes in 1 CRT Blocks.
0 bytes in 0 Ignore Blocks.
0 bytes in 0 Client Blocks.
Largest number used: 0 bytes.
Total allocations: 427320182 bytes.
Dumping objects ->
{3078551} normal block at 0x00000261DB821170, 48 bytes long.
Data: <PATH=D:\data\slo> 50 41 54 48 3D 44 3A 5C 64 61 74 61 5C 73 6C 6F
{3078530} normal block at 0x00000261D9D674A0, 139 bytes long.
Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65
..\api\boinc_api.cpp(309) : {3078527} normal block at 0x00000261D9D35490, 8 bytes long.
Data: < ƒÛa > 00 00 83 DB 61 02 00 00
{3077868} normal block at 0x00000261D9D67640, 139 bytes long.
Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65 {3077233} normal block at 0x00000261D9D34C20, 8 bytes long. Data: < {”Ûa > 80 7B 94 DB 61 02 00 00 ..\zip\boinc_zip.cpp(122) : {278} normal block at 0x00000261D9D3F160, 260 bytes long. Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 {263} normal block at 0x00000261D9D408E0, 16 bytes long. Data: <¨ ÔÙa > A8 19 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00 {262} normal block at 0x00000261D9D410B0, 16 bytes long. Data: < ÔÙa > 80 19 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00 {261} normal block at 0x00000261D9D41010, 16 bytes long. Data: <X ÔÙa > 58 19 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00 {260} normal block at 0x00000261D9D40E30, 16 bytes long. Data: <0 ÔÙa > 30 19 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00 {259} normal block at 0x00000261D9D40CF0, 16 bytes long. Data: < ÔÙa > 08 19 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00 {258} normal block at 0x00000261D9D40700, 16 bytes long. Data: <à ÔÙa > E0 18 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00 {257} normal block at 0x00000261D9D393D0, 32 bytes long. Data: <CUDA_DEVICE=0 PU> 43 55 44 41 5F 44 45 56 49 43 45 3D 30 00 50 55 {256} normal block at 0x00000261D9D407A0, 16 bytes long. Data: < ¢ÓÙa > 10 A2 D3 D9 61 02 00 00 00 00 00 00 00 00 00 00 {255} normal block at 0x00000261D9D3A210, 40 bytes long. Data: <  ÔÙa ГÓÙa > A0 07 D4 D9 61 02 00 00 D0 93 D3 D9 61 02 00 00 {254} normal block at 0x00000261D9D411F0, 16 bytes long. Data: <À ÔÙa > C0 18 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00 {253} normal block at 0x00000261D9D406B0, 16 bytes long. Data: < ÔÙa > 98 18 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00 {252} normal block at 0x00000261D9D38AD0, 32 bytes long. Data: <C:/Windows/syste> 43 3A 2F 57 69 6E 64 6F 77 73 2F 73 79 73 74 65 {251} normal block at 0x00000261D9D41060, 16 bytes long. Data: <p ÔÙa > 70 18 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00 {250} normal block at 0x00000261D9D391F0, 32 bytes long. 
Data: <xjvf input.tar.b> 78 6A 76 66 20 69 6E 70 75 74 2E 74 61 72 2E 62 {249} normal block at 0x00000261D9D40DE0, 16 bytes long. Data: <¸ ÔÙa > B8 17 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00 {248} normal block at 0x00000261D9D40F70, 16 bytes long. Data: < ÔÙa > 90 17 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00 {247} normal block at 0x00000261D9D40F20, 16 bytes long. Data: <h ÔÙa > 68 17 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00 {246} normal block at 0x00000261D9D40CA0, 16 bytes long. Data: <@ ÔÙa > 40 17 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00 {245} normal block at 0x00000261D9D40C50, 16 bytes long. Data: < ÔÙa > 18 17 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00 {244} normal block at 0x00000261D9D40C00, 16 bytes long. Data: <ð ÔÙa > F0 16 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00 {242} normal block at 0x00000261D9D40BB0, 16 bytes long. Data: < ¥ÓÙa > 20 A5 D3 D9 61 02 00 00 00 00 00 00 00 00 00 00 {241} normal block at 0x00000261D9D3A520, 40 bytes long. Data: <° ÔÙa p ‚Ûa > B0 0B D4 D9 61 02 00 00 70 11 82 DB 61 02 00 00 {240} normal block at 0x00000261D9D40B60, 16 bytes long. Data: <Ð ÔÙa > D0 16 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00 {239} normal block at 0x00000261D9D40FC0, 16 bytes long. Data: <¨ ÔÙa > A8 16 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00 {238} normal block at 0x00000261D9D39130, 32 bytes long. Data: <Library/usr/bin/> 4C 69 62 72 61 72 79 2F 75 73 72 2F 62 69 6E 2F {237} normal block at 0x00000261D9D411A0, 16 bytes long. Data: < ÔÙa > 80 16 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00 {236} normal block at 0x00000261D9D392B0, 32 bytes long. Data: <bin/conda-unpack> 62 69 6E 2F 63 6F 6E 64 61 2D 75 6E 70 61 63 6B {235} normal block at 0x00000261D9D40ED0, 16 bytes long. Data: <È ÔÙa > C8 15 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00 {234} normal block at 0x00000261D9D40E80, 16 bytes long. Data: <  ÔÙa > A0 15 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00 {233} normal block at 0x00000261D9D40B10, 16 bytes long. 
Data: <x ÔÙa > 78 15 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00 {232} normal block at 0x00000261D9D40A20, 16 bytes long. Data: <P ÔÙa > 50 15 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00 {231} normal block at 0x00000261D9D40610, 16 bytes long. Data: <( ÔÙa > 28 15 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00 {230} normal block at 0x00000261D9D40570, 16 bytes long. Data: < ÔÙa > 00 15 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00 {229} normal block at 0x00000261D9D41150, 16 bytes long. Data: <à ÔÙa > E0 14 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00 {228} normal block at 0x00000261D9D40520, 16 bytes long. Data: <¸ ÔÙa > B8 14 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00 {227} normal block at 0x00000261D9D40980, 16 bytes long. Data: < ÔÙa > 90 14 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00 {226} normal block at 0x00000261D9D41490, 1488 bytes long. Data: < ÔÙa python.e> 80 09 D4 D9 61 02 00 00 70 79 74 68 6F 6E 2E 65 {90} normal block at 0x00000261D9D39070, 32 bytes long. Data: <windows_x86_64__> 77 69 6E 64 6F 77 73 5F 78 38 36 5F 36 34 5F 5F {89} normal block at 0x00000261D9D34900, 16 bytes long. Data: <À§ÓÙa > C0 A7 D3 D9 61 02 00 00 00 00 00 00 00 00 00 00 {88} normal block at 0x00000261D9D3A7C0, 40 bytes long. Data: < IÓÙa p ÓÙa > 00 49 D3 D9 61 02 00 00 70 90 D3 D9 61 02 00 00 {67} normal block at 0x00000261D9D34AE0, 16 bytes long. Data: < ê,ã÷ > 80 EA 2C E3 F7 7F 00 00 00 00 00 00 00 00 00 00 {66} normal block at 0x00000261D9D34A90, 16 bytes long. Data: <@é,ã÷ > 40 E9 2C E3 F7 7F 00 00 00 00 00 00 00 00 00 00 {65} normal block at 0x00000261D9D35170, 16 bytes long. Data: <øW)ã÷ > F8 57 29 E3 F7 7F 00 00 00 00 00 00 00 00 00 00 {64} normal block at 0x00000261D9D348B0, 16 bytes long. Data: <ØW)ã÷ > D8 57 29 E3 F7 7F 00 00 00 00 00 00 00 00 00 00 {63} normal block at 0x00000261D9D35350, 16 bytes long. Data: <P )ã÷ > 50 04 29 E3 F7 7F 00 00 00 00 00 00 00 00 00 00 {62} normal block at 0x00000261D9D34EA0, 16 bytes long. 
Data: <0 )ã÷ > 30 04 29 E3 F7 7F 00 00 00 00 00 00 00 00 00 00 {61} normal block at 0x00000261D9D352B0, 16 bytes long. Data: <à )ã÷ > E0 02 29 E3 F7 7F 00 00 00 00 00 00 00 00 00 00 {60} normal block at 0x00000261D9D34A40, 16 bytes long. Data: < )ã÷ > 10 04 29 E3 F7 7F 00 00 00 00 00 00 00 00 00 00 {59} normal block at 0x00000261D9D353F0, 16 bytes long. Data: <p )ã÷ > 70 04 29 E3 F7 7F 00 00 00 00 00 00 00 00 00 00 {58} normal block at 0x00000261D9D35260, 16 bytes long. Data: < À'ã÷ > 18 C0 27 E3 F7 7F 00 00 00 00 00 00 00 00 00 00 Object dump complete. </stderr_txt> ]]> |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
It's not dead. It's just not finished. Please read the lengthy ATM post in the News forum. It's a known issue that all task segments except the first exhibit this behavior of staying pegged at 100%. It's fine, just let it finish.
|
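For anyone comparing logs like the two above: the completed task's stderr contains a `called boinc_finish(0)` line, and the in-progress one does not. A minimal sketch of that check follows. The function name `task_status` is made up for illustration, and it assumes a nonzero exit code would be printed in the same `called boinc_finish(N)` form. A missing line only means the task has not finished yet, so it cannot distinguish a slow task from a hung one.

```python
import re

# Line printed to stderr when the task calls boinc_finish; the exit code is in parentheses.
FINISH_RE = re.compile(r"called boinc_finish\((-?\d+)\)")


def task_status(stderr_text: str) -> str:
    """Classify a pasted wrapper stderr dump (hypothetical helper, not part of BOINC)."""
    match = FINISH_RE.search(stderr_text)
    if match is None:
        # Either still running or killed before finishing; the log alone cannot say which.
        return "no finish line yet"
    code = int(match.group(1))
    return "finished OK" if code == 0 else f"finished with exit code {code}"


# Shortened excerpts of the two logs quoted in this thread.
in_progress = "23:28:45 (16788): wrapper: running C:/Windows/system32/cmd.exe (/c call run.bat)"
completed = (
    "19:06:53 (13988): wrapper: running C:/Windows/system32/cmd.exe (/c call run.bat)\n"
    "03:07:01 (13988): called boinc_finish(0)"
)

print(task_status(in_progress))
print(task_status(completed))
```

Pasting a task's full stderr text into `task_status` is enough; the timestamps and process IDs are ignored.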
|
Joined: 30 Jun 14 Posts: 153 Credit: 129,654,684 RAC: 0
|
"It's not dead. It's just not finished." OK, thanks. |
|
Joined: 6 Mar 18 Posts: 38 Credit: 1,340,042,080 RAC: 33
|
Thought I would try again, still failing. Did anyone ever come up with a fix to get these running on 40 series cards? Failed unit: https://www.gpugrid.net/result.php?resultid=33751449 |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
Using Linux seems to be the best way to run them.
|
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891
|
"Thought I would try again, still failing, did anyone ever come up with a fix to get these running on 40 series cards?" You got one of the bad work units that are still floating around. Read the News ATMbeta thread. The researcher updated the Windows packaging to fix the type of error in your referenced task, but the fix will only apply to newly generated work units. Try to get one of the new ones. |
|
Joined: 20 May 11 Posts: 16 Credit: 86,798,974 RAC: 0
|
I have a 5950X setup with a perfectly running 2080 Super, and the 3 tasks I got today (so far) are all failing... 8-)
GPUGRID 1.09 ATMbeta: Free energy calculations of protein-ligand binding (cuda1121) JNK1_m29_m27_3-QUICO_ATM_500K_dih14fit-2-5-RND8000_0 00:00:32 (00:00:00) 1/25/2024 1:16:36 PM 1/25/2024 1:19:36 PM 0.989C + 1NV 0.0 Reported: Computation error (195,) Win11-Main |
|
Joined: 27 May 21 Posts: 54 Credit: 1,004,151,720 RAC: 0
|
"I have a 5950X setup with a perfectly running 2080 Super and 3 tasks I got today (so far) are all failing..." As Keith already said just above: the fix is known but will be deployed, at the earliest, with the next batch of tasks. The current batch cannot be fixed retroactively. See the post here: https://www.gpugrid.net/forum_thread.php?id=5379&nowrap=true#61076 I'm waiting on that just as you are. All it needs is a little patience. |
|
Joined: 20 May 11 Posts: 16 Credit: 86,798,974 RAC: 0
|
Ohhh! So those were stray OLD tasks.. Gotchya! Thanks! 8-) |
|
Joined: 6 Mar 18 Posts: 38 Credit: 1,340,042,080 RAC: 33
|
Cheers, I'll leave it enabled then and see what happens. |
|
Joined: 21 Jul 12 Posts: 9 Credit: 433,306,729 RAC: 0
|
I'm noticing an interesting aside wrt ATM WUs. All mine seem to be erroring out atm, but drilling into the WUs, it isn't just my computer. As an example: https://www.gpugrid.net/workunit.php?wuid=27653498 Now the odd thing is that all the other computers seem to have an Intel processor (when I inspect each computer showing the failed result), and going down the list each one shows a failure. But then 1 result was returned successfully, from a host with an AMD processor. Now I know the bulk of the processing is on the GPU, but there is some CPU processing as I understand it. Perhaps there's something in the code which is playing nasty with Intel's implementation of something? Thought I'd pass this observation along, it being beta and all... |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
The one that succeeded was because it was on Linux, not because it was on AMD.
|
ServicEnginIC Joined: 24 Sep 10 Posts: 592 Credit: 11,972,186,510 RAC: 1,447
|
On this last batch of ATMbeta, I've noticed an increase in tasks failing with "ValueError: Energy is NaN". Currently the failure ratio is about 1/3. Since January 22nd: 64 valid tasks, and 20 failed with that error on my Linux hosts. 14 of the failed tasks exceeded 1 hour of processing time before erroring, and 9 of those exceeded 3 hours. Anyway, I still consider it a reasonable ratio for these well-rewarded beta tasks, and I keep my hope of contributing to science. |
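The counts in the post above can be turned into rates with a few lines of arithmetic. The sketch below uses only the numbers quoted there (64 valid, 20 failed, 14 of the failures past 1 hour, 9 past 3 hours); the variable names are illustrative. It shows that "about 1/3" matches failed-to-valid, while the share of all returned tasks that failed is closer to one quarter.

```python
# Counts as quoted in the post above (Linux hosts, since January 22nd).
valid = 64
failed = 20
failed_over_1h = 14   # failed tasks that ran more than 1 hour first
failed_over_3h = 9    # subset of those that ran more than 3 hours

total = valid + failed
failure_rate = failed / total        # share of all returned tasks that failed
failed_to_valid = failed / valid     # failures per valid task

print(f"failed / total: {failure_rate:.1%}")                    # 23.8%
print(f"failed / valid: {failed_to_valid:.2f}")                 # 0.31
print(f"failures after >1 h: {failed_over_1h / failed:.0%}")    # 70%
print(f"failures after >3 h: {failed_over_3h / failed:.0%}")    # 45%
```

Which denominator to use is a matter of taste, but stating it avoids comparing a 1/3 figure from one host against a success percentage from another.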
|
Joined: 28 Mar 09 Posts: 490 Credit: 11,731,645,728 RAC: 69
|
"On this last batch of ATMbeta, I've noticed an increase of tasks failing with ValueError: Energy is NaN" I noticed that too. About 1/3 was about right for me at one point, on the computer with the Intel chip: Energy is NaN. I usually get about a 90% success rate. Ignore the AMD chip computer, that's another story. But I seem to be ending on a high note, with 4 in a row finishing successfully so far. Let's hope it continues. |
©2025 Universitat Pompeu Fabra