ATM

Erich56
Message 60333 - Posted: 16 Apr 2023, 18:26:41 UTC - in response to Message 60320.  

I, too, had such an error after the task had run for 7.885 seconds:

File "Z:\BOINC\slots\3\lib\site-packages\openmm\app\statedatareporter.py", line 365, in _checkForErrors
raise ValueError('Energy is NaN. For more information, see https://github.com/openmm/openmm/wiki/Frequently-Asked-Questions#nan')
ValueError: Energy is NaN


https://www.gpugrid.net/result.php?resultid=33436488

no overclocking.


this time, the task errored out after 16.400 seconds :-(

https://www.gpugrid.net/result.php?resultid=33442242

Aurum
Message 60335 - Posted: 17 Apr 2023, 8:10:17 UTC

It feels like there are at least four categories of ATMbeta WUs running simultaneously.
None have checkpointing.
Top priority should be to make checkpointing work.
The shotgun approach squanders a lot of compute time.

KAMasud
Message 60336 - Posted: 17 Apr 2023, 11:15:57 UTC

My nation, like many others, has gone into default. The most expensive item is the supply of electricity, and they are frequently switching off the grid without informing us.
David H says the WUs are checkpointing. If they are checkpointing, then why are they not recovering? Well, recovering or not, I cannot do a thing about the electric grid. So, best of luck to the WUs and, as it is Ramadan, I have nothing left in the upper chamber to argue. Over and out.

Aurum
Message 60352 - Posted: 26 Apr 2023, 16:16:15 UTC

Still no checkpointing.
Suspend then unsuspend = crash.
Many WUs failing due to subprocess.

KAMasud
Message 60353 - Posted: 28 Apr 2023, 5:30:09 UTC

If there is a storm and the electricity goes, the WU crashes. I know that BOINCers do not restart for months on end, but I have to restart, and the WU crashes. If the GPU updates or the system updates, the WU crashes. If the cat plays with the keyboard, the WU crashes.
I do not want catty remarks but will keep crashing them from now on. Who cares.

[CSF] Aleksey Belkov
Message 60354 - Posted: 30 Apr 2023, 12:40:47 UTC - in response to Message 60353.  
Last modified: 30 Apr 2023, 12:42:32 UTC

Quote (Message 60353): "Who cares."

No, it's not about who cares.

This is about which of the project employees has the knowledge and resources to implement the necessary functionality, and which of them has the time for it.
And as you should understand, they don't make decisions there on their own; it's not a hobby.
The necessary specialists may currently be assigned to other, higher-priority projects for the institute, and neither we nor the employees themselves can influence this.
Deal with it.

No number of tearful posts about the problem will change anything, no matter how much someone would like it to.
Unless, of course, the goal is once again just to let off steam somewhere out of indignation.

Richard Haselgrove
Message 60357 - Posted: 3 May 2023, 8:34:49 UTC
Last modified: 3 May 2023, 9:32:57 UTC

Task TYK2_m44_m55_5_FIX-QUICO_ATM_Sage_xTB-0-5-RND2847_0 (today):

FileNotFoundError: [Errno 2] No such file or directory: 'TYK2_m44_m55_0.xml'


Later - CDK2_miu_m26_4-QUICO_ATM_Sage_xTB-0-5-RND8419_0 running OK.

Richard Haselgrove
Message 60358 - Posted: 5 May 2023, 7:14:35 UTC
Last modified: 5 May 2023, 7:46:54 UTC

And a similar batch configuration error with today's BACE run, like

BACE_m24_m7e_5-QUICO_ATM_Sage_xTB-0-5-RND7993_0

08:05:32 (386384): wrapper: running bin/bash (run.sh)
bin/bash: run.sh: No such file or directory

(five so far)

Edit - now wasted 20 of the things, and switched to Python to avoid quota errors. I should have dropped in to give you a hand when passing through Barcelona at the weekend!

KAMasud
Message 60359 - Posted: 5 May 2023, 8:04:09 UTC

I cannot resource-share ATMBeta with other projects: when it is suspended so that other projects can run, it ends up with an error.

[CSF] Aleksey Belkov
Message 60360 - Posted: 5 May 2023, 9:01:39 UTC - in response to Message 60358.  

Quote (Message 60358): "And a similar batch configuration error with today's BACE run, like"

Same for Win apps:
https://www.gpugrid.net/result.php?resultid=33475629
https://www.gpugrid.net/results.php?userid=101590
Sad : /

Ian&Steve C.
Message 60361 - Posted: 5 May 2023, 11:44:07 UTC - in response to Message 60359.  

Quote (Message 60359): "I cannot resource share ATMBeta with other projects because it is stopped to run other projects. Ends up with an error."


Set all other GPU projects to a resource share of 0; then they won't run at all when you have ATM work.

Erich56
Message 60362 - Posted: 5 May 2023, 16:10:12 UTC

Many of the recent ATM tasks errored out after less than a minute; stderr says:

wrapper: running C:/Windows/system32/cmd.exe (/c call run.bat)
Der Befehl "run.bat" ist entweder falsch geschrieben oder
konnte nicht gefunden werden.

In English: the command "run.bat" is either misspelled or could not be found.

What's up?

Keith Myers
Message 60363 - Posted: 5 May 2023, 16:16:35 UTC

The equivalent error occurs on Linux for a great many tasks.

bin/bash: run.sh: No such file or directory

BACE_m7g_m7c_3-QUICO_ATM_Sage_xTB-0-5-RND8127_3

KAMasud
Message 60364 - Posted: 5 May 2023, 17:19:23 UTC

Got a collection of twenty-one errored tasks. Suspended work fetch on that computer. The other is busy with Abous WU.

Richard Haselgrove
Message 60365 - Posted: 5 May 2023, 19:07:51 UTC

Now these are doing it as well: MCL1_m28_m47_1_FIX-QUICO_ATM_Sage_xTB-0-5-RND0954_0

18:09:56 (394275): wrapper: running bin/bash (run.sh)
bin/bash: run.sh: No such file or directory

The experimenters and/or staff have got to get a grip on this - you are wasting everybody's time and electricity.

BOINC is very unforgiving: you have to get it 100% exact, all at the same time, every time. It's worth you taking a pause after each new batch is prepared, and then going back and proof-reading the configuration. Five minutes spent checking would probably have meant getting some real research results over the weekend: now, nothing will probably work until Monday (and I'm not holding my breath then, either).
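
A minimal sketch of the kind of pre-release check described above, assuming each batch has a staging directory and a list of file names the wrapper will need. The function name `missing_batch_files`, the directory layout, and the idea of a hand-written file list are all hypothetical; nothing here reflects how the project actually prepares its batches.

```python
from pathlib import Path
from typing import Iterable, List


def missing_batch_files(task_dir: Path, required: Iterable[str]) -> List[str]:
    """Return the names in `required` that are not regular files in `task_dir`."""
    return [name for name in required if not (task_dir / name).is_file()]


if __name__ == "__main__":
    import tempfile

    # Build a throwaway staging directory that contains only run.sh.
    with tempfile.TemporaryDirectory() as tmp:
        task = Path(tmp)
        (task / "run.sh").write_text("#!/bin/bash\n")

        # Hypothetical list of files the wrapper is expected to find,
        # using file names that appear in the error messages in this thread.
        required = ["run.sh", "run.bat", "TYK2_m44_m55_0.xml"]

        missing = missing_batch_files(task, required)
        if missing:
            print("Do not release this batch, missing:", missing)
        else:
            print("All required files present.")
```

Run against a real staging directory before a batch is released, a check like this would flag a missing run.sh or run.bat in seconds rather than after hundreds of failed tasks.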

bluestang
Message 60366 - Posted: 5 May 2023, 21:27:48 UTC - in response to Message 60365.  
Last modified: 5 May 2023, 21:28:11 UTC

Quote (Message 60365):
"Now these are doing it as well: MCL1_m28_m47_1_FIX-QUICO_ATM_Sage_xTB-0-5-RND0954_0

18:09:56 (394275): wrapper: running bin/bash (run.sh)
bin/bash: run.sh: No such file or directory

The experimenters and/or staff have got to get a grip on this - you are wasting everybody's time and electricity.

BOINC is very unforgiving: you have to get it 100% exact, all at the same time, every time. It's worth you taking a pause after each new batch is prepared, and then going back and proof-reading the configuration. Five minutes spent checking would probably have meant getting some real research results over the weekend: now, nothing will probably work until Monday (and I'm not holding my breath then, either)."



Exactly!

When you have more Tasks that Error (277) than Valid (240) ... that is pretty damn sad!

Erich56
Message 60369 - Posted: 6 May 2023, 7:05:05 UTC - in response to Message 60365.  

Quote (Message 60365):
"The experimenters and/or staff have got to get a grip on this - you are wasting everybody's time and electricity.

BOINC is very unforgiving: you have to get it 100% exact, all at the same time, every time. It's worth you taking a pause after each new batch is prepared, and then going back and proof-reading the configuration. Five minutes spent checking would probably have meant getting some real research results over the weekend: now, nothing will probably work until Monday (and I'm not holding my breath then, either)."

+ 1

KAMasud
Message 60370 - Posted: 6 May 2023, 11:20:02 UTC - in response to Message 60364.  
Last modified: 6 May 2023, 11:23:43 UTC

Quote (Message 60364): "Got a collection of twenty-one errored tasks. Suspended work fetch on that computer. The other is busy with Abous WU."

___________

Abous, WU finished and I got one ATMBeta. It lasted all of one minute and three seconds. Suspended work fetch on this computer also.
Two ATMBeta tasks validated, twenty-two errored.

KAMasud
Message 60371 - Posted: 6 May 2023, 12:21:56 UTC

Maybe someone can answer a question I have. After running ATMBeta, Einstein starts but reports that the GPU is missing. How does this happen?

Ian&Steve C.
Message 60372 - Posted: 6 May 2023, 12:58:47 UTC - in response to Message 60371.  
Last modified: 6 May 2023, 12:59:14 UTC

ATMbeta likely has nothing to do with it.

But ATMbeta uses CUDA, while Einstein uses OpenCL. Does BOINC still report OpenCL support in the startup log? You might need to reinstall your drivers.
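
One rough way to answer that question is to search the client's startup messages for GPU detection lines. The sketch below assumes those lines contain a `CUDA:` or `OpenCL:` prefix; the sample text and the function name `gpu_api_lines` are made up for illustration, and the real log wording may differ between BOINC versions.

```python
from typing import Dict, List


def gpu_api_lines(log_text: str) -> Dict[str, List[str]]:
    """Group log lines by which GPU API prefix ("CUDA:" or "OpenCL:") they contain."""
    found: Dict[str, List[str]] = {"CUDA": [], "OpenCL": []}
    for line in log_text.splitlines():
        for api in found:
            # Assumption: detection lines carry the API name followed by a colon.
            if f"{api}:" in line:
                found[api].append(line.strip())
    return found


if __name__ == "__main__":
    # Illustrative startup text only, not copied from a real client.
    sample = (
        "[---] Starting BOINC client\n"
        "[---] CUDA: NVIDIA GPU 0: ExampleCard\n"
        "[---] Data directory: /example\n"
    )
    result = gpu_api_lines(sample)
    if not result["OpenCL"]:
        print("No OpenCL device reported; OpenCL-only projects will not see the GPU.")
    print(result)
```

If CUDA lines appear but no OpenCL lines do, that would be consistent with a driver install that lost its OpenCL component.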

©2025 Universitat Pompeu Fabra