ATM

Message boards : News : ATM
Bedrich Hajek

Message 60728 - Posted: 5 Sep 2023, 2:35:40 UTC

The uploads are definitely slow: upwards of 25 minutes for an 86 MB file. I just noticed it this weekend. Until last week, uploads of this particular file were taking less than a minute, though downloads still take less than a minute for me.

I think the server needs a reboot, and the management of this project needs a good kick in the ass...

That's just my opinion.



Erich56

Message 60734 - Posted: 6 Sep 2023, 16:58:05 UTC

Any idea why there is no longer any Stderr information on failed tasks?

Example: https://www.gpugrid.net/result.php?resultid=33616730

The task failed after 2344 secs, and I have no idea why, although I guess it may again have been the old and well-known "energy is nan" problem.

P.S. The servers, download as well as upload, are getting worse and worse :-(
Ian&Steve C.

Message 60735 - Posted: 6 Sep 2023, 17:34:36 UTC - in response to Message 60734.  

likely something specific to that host, nothing to do with the project. none of my errors exhibit that.

stderr information is stored and uploaded from the client state file directly. for it to be empty probably means something got corrupted with that file.
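
A minimal sketch of how one might look for failed results whose stderr text is empty, assuming the client state file can be read as XML. The element names used here (result, name, exit_status, stderr_out) and the sample data are assumptions for illustration only; the real file layout may differ and may not be strictly well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a client state file.
# The element names are assumptions, not a documented schema.
SAMPLE_STATE = """<client_state>
  <result>
    <name>task_a</name>
    <exit_status>195</exit_status>
    <stderr_out>tar: Exiting with failure status due to previous errors</stderr_out>
  </result>
  <result>
    <name>task_b</name>
    <exit_status>195</exit_status>
    <stderr_out></stderr_out>
  </result>
  <result>
    <name>task_c</name>
    <exit_status>0</exit_status>
  </result>
</client_state>"""


def results_missing_stderr(state_xml: str) -> list[str]:
    """Return names of failed results that carry no stderr text."""
    root = ET.fromstring(state_xml)
    missing = []
    for result in root.iter("result"):
        exit_status = int(result.findtext("exit_status", default="0"))
        stderr_text = (result.findtext("stderr_out") or "").strip()
        # A failed task with nothing recorded is the suspicious case.
        if exit_status != 0 and not stderr_text:
            missing.append(result.findtext("name", default="(unnamed)"))
    return missing


if __name__ == "__main__":
    print(results_missing_stderr(SAMPLE_STATE))
```

If a check like this flags tasks on only one host, that would fit the suggestion that the problem is local to that host rather than on the project side.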
Keith Myers
Message 60736 - Posted: 6 Sep 2023, 20:31:02 UTC

I am also finally noticing the slow downloads/uploads for this project that many have been complaining about for a week.

It started to get really bad yesterday and today: only 20-50 KB/s and lots of stalled transfers. Uploads, for example, are now taking 25-30 minutes across all hosts.

I agree the servers need a good reboot. They should do that now that the available tasks have dwindled down to a dozen.
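
As a rough cross-check of the figures reported in this thread (an 86 MB upload taking about 25 minutes, and transfer rates of 20-50 KB/s), here is a small sketch of the arithmetic. It assumes decimal units (1 MB = 1000 KB) and a constant transfer rate with no stalls; the function names are made up for this illustration.

```python
def average_rate_kb_per_s(size_mb: float, minutes: float) -> float:
    """Average transfer rate in KB/s for a file of size_mb taking `minutes`."""
    return size_mb * 1000 / (minutes * 60)


def transfer_minutes(size_mb: float, rate_kb_per_s: float) -> float:
    """Minutes needed to move size_mb at a constant rate_kb_per_s."""
    return size_mb * 1000 / rate_kb_per_s / 60


if __name__ == "__main__":
    # 86 MB in 25 minutes, as reported earlier in the thread.
    print(f"{average_rate_kb_per_s(86, 25):.1f} KB/s")
    # The same file at the upper and lower ends of the reported range.
    print(f"{transfer_minutes(86, 50):.1f} min at 50 KB/s")
    print(f"{transfer_minutes(86, 20):.1f} min at 20 KB/s")
```

Under these assumptions the 25-minute upload works out to roughly 57 KB/s, and 50 KB/s gives just under 29 minutes, so the separate reports are consistent with each other.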
Bedrich Hajek

Message 60739 - Posted: 11 Sep 2023, 8:32:51 UTC

Getting the "Energy is NaN" error doesn't necessarily mean that the unit is a bad one. It means that the unit is more sensitive to failure.

Here is an example where my computer and two others had that error, but someone else finished it successfully:

https://www.gpugrid.net/workunit.php?wuid=27572371

Here is an example where my computer finished it successfully, while others didn't:

https://www.gpugrid.net/workunit.php?wuid=27573320

My computer is 610674.

BTW: The uploads are running at normal speeds, when there is very little work. The project obviously needs more bandwidth.



Erich56

Message 60740 - Posted: 11 Sep 2023, 9:25:02 UTC - in response to Message 60739.  

BTW: The uploads are running at normal speeds, when there is very little work. The project obviously needs more bandwidth.

Download problems still persist; see a recent example here:

https://www.gpugrid.net/result.php?resultid=33624287
Erich56

Message 60741 - Posted: 11 Sep 2023, 12:28:05 UTC - in response to Message 60739.  

Getting the "Energy is NaN" error doesn't necessarily mean that the unit is a bad one. It means that the unit is more sensitive to failure.

This may be the case, but after such a long time, the developer should have been able to iron out this unusually high sensitivity.

2 or 3 weeks ago I stopped crunching ATMs on one of my hosts, after about every other task had failed for unknown reasons (no overclocking, no other tasks from other projects running).

Now I am seeing a curious situation on the other two hosts: since tasks from other projects stopped running a short time ago (because none are available), the failure rate of ATMs has even increased.
No idea why.
And if this happens after several hours (which it does), it is more than annoying. If the situation continues like this, I will stop crunching ATMs on these other two hosts as well, coming back only once the developer has improved the performance of the ATMs.

Too bad that GPUGRID has stopped all other sub-projects like Python or ACEMD (both 3 and 4).

I have been with GPUGRID for almost 9 years, but during this period of time, the situation has never been as bad as it has been in the recent past, strange server problems included :-(
No idea what's going on there ???
Erich56

Message 60742 - Posted: 13 Sep 2023, 16:00:45 UTC

I added up the runtimes of tasks that failed with "energy is NAN" on September 11 alone:
45,228 seconds = 12.56 hours.
That's not nice :-(
Speedy

Message 60743 - Posted: 13 Sep 2023, 21:01:00 UTC - in response to Message 60742.  

It may not be nice but it's part of beta testing
Erich56

Message 60744 - Posted: 14 Sep 2023, 10:35:39 UTC - in response to Message 60743.  

It may not be nice but it's part of beta testing

well, as already discussed here earlier: ATM has run as beta for half a year now.
So, at some point the ongoing problems could be solved, right?
Speedy

Message 60745 - Posted: 14 Sep 2023, 21:39:02 UTC - in response to Message 60744.  

It may not be nice but it's part of beta testing

well, as already discussed here earlier: ATM has run as beta for half a year now.
So, at some point the ongoing problems could be solved, right?

Yes, you could be right. Nevertheless, the application is still in "Beta", so errors are still likely to show up, whether every now and then or in large numbers.
AnandBhat

Message 60746 - Posted: 15 Sep 2023, 0:19:15 UTC

A couple of my ATM tasks (my only GPUgrid tasks for a loooong time) failed with this error: openmm.OpenMMException: Unknown property 'version' in node 'IntegratorParameters'

https://www.gpugrid.net/result.php?resultid=33624768
https://www.gpugrid.net/result.php?resultid=33624760

Looking at a few of my wingmen who've had similar failures, it appears to be a bad batch.
Erich56

Message 60747 - Posted: 15 Sep 2023, 16:27:42 UTC - in response to Message 60746.  

A couple of my ATM tasks (my only GPUgrid tasks for a loooong time) failed with this error: openmm.OpenMMException: Unknown property 'version' in node 'IntegratorParameters'

https://www.gpugrid.net/result.php?resultid=33624768
https://www.gpugrid.net/result.php?resultid=33624760

Looking at a few of my wingmen who've had similar failures, it appears to be a bad batch.

One thing you can be happy about: these tasks seem to fail after a few minutes, and not after many hours, as is often the case with the "Energy is NAN" error.
So not much waste of resources.
Richard Haselgrove

Message 60748 - Posted: 18 Sep 2023, 9:28:45 UTC

Two errors so far with today's new batch. Both are of the form:

FileNotFoundError: [Errno 2] No such file or directory: 'MCL1_mXX_mXX_0.xml'
Keith Myers
Message 60750 - Posted: 19 Sep 2023, 0:17:15 UTC
Last modified: 19 Sep 2023, 0:17:35 UTC

Just an FYI: here is an interesting observation of two tasks run on the same host, where the first task finished and reported correctly.

Then the next task started up in the same slot 110 and errored out.

The difference was that the second task failed to find the original xml file.

Compare the ends of the two output files to see the difference.

Task https://www.gpugrid.net/result.php?resultid=33626002

+ echo 'Save output'
+ tar cjvf output.tar.bz2 run.log r0/MCL1_m27_m05.out r1/MCL1_m27_m05.out r10/MCL1_m27_m05.out r11/MCL1_m27_m05.out r12/MCL1_m27_m05.out r13/MCL1_m27_m05.out r14/MCL1_m27_m05.out r15/MCL1_m27_m05.out r16/MCL1_m27_m05.out r17/MCL1_m27_m05.out r18/MCL1_m27_m05.out r19/MCL1_m27_m05.out r2/MCL1_m27_m05.out r20/MCL1_m27_m05.out r21/MCL1_m27_m05.out r3/MCL1_m27_m05.out r4/MCL1_m27_m05.out r5/MCL1_m27_m05.out r6/MCL1_m27_m05.out r7/MCL1_m27_m05.out r8/MCL1_m27_m05.out r9/MCL1_m27_m05.out r0/MCL1_m27_m05.dcd r1/MCL1_m27_m05.dcd r10/MCL1_m27_m05.dcd r11/MCL1_m27_m05.dcd r12/MCL1_m27_m05.dcd r13/MCL1_m27_m05.dcd r14/MCL1_m27_m05.dcd r15/MCL1_m27_m05.dcd r16/MCL1_m27_m05.dcd r17/MCL1_m27_m05.dcd r18/MCL1_m27_m05.dcd r19/MCL1_m27_m05.dcd r2/MCL1_m27_m05.dcd r20/MCL1_m27_m05.dcd r21/MCL1_m27_m05.dcd r3/MCL1_m27_m05.dcd r4/MCL1_m27_m05.dcd r5/MCL1_m27_m05.dcd r6/MCL1_m27_m05.dcd r7/MCL1_m27_m05.dcd r8/MCL1_m27_m05.dcd r9/MCL1_m27_m05.dcd
tar: run.log: file changed as we read it
+ true
+ echo 'Save restart'
+ tar cjvf restart.tar.bz2 r0/MCL1_m27_m05_ckpt.xml r1/MCL1_m27_m05_ckpt.xml r10/MCL1_m27_m05_ckpt.xml r11/MCL1_m27_m05_ckpt.xml r12/MCL1_m27_m05_ckpt.xml r13/MCL1_m27_m05_ckpt.xml r14/MCL1_m27_m05_ckpt.xml r15/MCL1_m27_m05_ckpt.xml r16/MCL1_m27_m05_ckpt.xml r17/MCL1_m27_m05_ckpt.xml r18/MCL1_m27_m05_ckpt.xml r19/MCL1_m27_m05_ckpt.xml r2/MCL1_m27_m05_ckpt.xml r20/MCL1_m27_m05_ckpt.xml r21/MCL1_m27_m05_ckpt.xml r3/MCL1_m27_m05_ckpt.xml r4/MCL1_m27_m05_ckpt.xml r5/MCL1_m27_m05_ckpt.xml r6/MCL1_m27_m05_ckpt.xml r7/MCL1_m27_m05_ckpt.xml r8/MCL1_m27_m05_ckpt.xml r9/MCL1_m27_m05_ckpt.xml
16:23:56 (1259959): bin/bash exited; CPU time 6653.704111
16:23:56 (1259959): called boinc_finish(0)


And the task https://www.gpugrid.net/result.php?resultid=33626029
that failed later in the same slot.

+ echo 'Save output'
+ tar cjvf output.tar.bz2 run.log r0/MCL1_m09_m41.out r1/MCL1_m09_m41.out r10/MCL1_m09_m41.out r11/MCL1_m09_m41.out r12/MCL1_m09_m41.out r13/MCL1_m09_m41.out r14/MCL1_m09_m41.out r15/MCL1_m09_m41.out r16/MCL1_m09_m41.out r17/MCL1_m09_m41.out r18/MCL1_m09_m41.out r19/MCL1_m09_m41.out r2/MCL1_m09_m41.out r20/MCL1_m09_m41.out r21/MCL1_m09_m41.out r3/MCL1_m09_m41.out r4/MCL1_m09_m41.out r5/MCL1_m09_m41.out r6/MCL1_m09_m41.out r7/MCL1_m09_m41.out r8/MCL1_m09_m41.out r9/MCL1_m09_m41.out r0/MCL1_m09_m41.dcd r1/MCL1_m09_m41.dcd r10/MCL1_m09_m41.dcd r11/MCL1_m09_m41.dcd r12/MCL1_m09_m41.dcd r13/MCL1_m09_m41.dcd r14/MCL1_m09_m41.dcd r15/MCL1_m09_m41.dcd r16/MCL1_m09_m41.dcd r17/MCL1_m09_m41.dcd r18/MCL1_m09_m41.dcd r19/MCL1_m09_m41.dcd r2/MCL1_m09_m41.dcd r20/MCL1_m09_m41.dcd r21/MCL1_m09_m41.dcd r3/MCL1_m09_m41.dcd r4/MCL1_m09_m41.dcd r5/MCL1_m09_m41.dcd r6/MCL1_m09_m41.dcd r7/MCL1_m09_m41.dcd r8/MCL1_m09_m41.dcd r9/MCL1_m09_m41.dcd
tar: run.log: file changed as we read it
+ true
+ echo 'Save restart'
+ tar cjvf restart.tar.bz2 'r*/*.xml'
tar: r*/*.xml: Cannot stat: No such file or directory
tar: Exiting with failure status due to previous errors
16:44:27 (15335): bin/bash exited; CPU time 58.042528
16:44:27 (15335): app exit status: 0x2
16:44:27 (15335): called boinc_finish(195)


I want to draw attention to the part where the xml file names were not filled in correctly:

+ tar cjvf restart.tar.bz2 'r*/*.xml'
tar: r*/*.xml: Cannot stat: No such file or directory
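
The quoted 'r*/*.xml' in the failing trace is what bash shows when a glob matches nothing: by default the pattern is then passed to the command unchanged, so tar looks for a file literally named r*/*.xml. Below is a minimal Python sketch of a guard that checks for matches before archiving. It is not the project's actual wrapper script; save_restart and its arguments are hypothetical names used only for illustration.

```python
import glob
import os
import tarfile


def save_restart(run_dir: str, archive_name: str = "restart.tar.bz2") -> list[str]:
    """Archive r*/*.xml checkpoint files under run_dir (hypothetical helper).

    Raises FileNotFoundError with a clear message when nothing matches,
    instead of handing the unexpanded pattern to the archiver.
    """
    pattern = os.path.join(run_dir, "r*", "*.xml")
    checkpoints = sorted(glob.glob(pattern))
    if not checkpoints:
        raise FileNotFoundError(f"no checkpoint files match {pattern!r}")

    archive_path = os.path.join(run_dir, archive_name)
    with tarfile.open(archive_path, "w:bz2") as archive:
        for path in checkpoints:
            # Store paths relative to run_dir, e.g. r0/NAME_ckpt.xml.
            archive.add(path, arcname=os.path.relpath(path, run_dir))
    return [os.path.relpath(path, run_dir) for path in checkpoints]
```

A guard like this would not fix the missing input, but it would report the failure as "no checkpoint files" rather than as a tar error about a pattern.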
GoodOlClint

Message 60751 - Posted: 19 Sep 2023, 2:39:34 UTC

Got the "Energy is NaN" error today. Interestingly, the task jumped to 100% after just a few minutes; however, it continued running for over two hours afterwards. When it completed, it uploaded a 45 MB result file.

https://www.gpugrid.net/result.php?resultid=33626669
Erich56

Message 60752 - Posted: 19 Sep 2023, 5:44:42 UTC - in response to Message 60751.  

Got the "Energy is NaN" error today.

Same here a few hours ago, after so many times before :-(
It's really unbelievable that after this problem has happened on a regular basis for half a year now, the developer is still not willing/able to iron out this nasty error.

Erich56

Message 60753 - Posted: 19 Sep 2023, 6:11:02 UTC - in response to Message 60728.  

On September 5, Bedrich Hajek wrote:

The uploads are definitely slow: upwards of 25 minutes for an 86 MB file. I just noticed it this weekend. Until last week, uploads of this particular file were taking less than a minute, though downloads still take less than a minute for me.

I think the server needs a reboot, and the management of this project needs a good kick in the ass...

That's just my opinion.

Still nothing has changed. Despite only very few new tasks being available once in a while, downloads and uploads take forever (25-30 KB/s).
So it's clear that traffic congestion is definitely not the reason.
I am curious when the project management will finally move ahead and straighten out all the problems.
Seemingly, it really needs a good kick in the ass to wake them up ...
Keith Myers
Message 60754 - Posted: 19 Sep 2023, 7:11:15 UTC

Complaining here is doing no good other than making the other forum participants tired of the constant diatribes.

The ATMbeta researcher has nothing to do with the application development. He just uses the tools that the project gives to him.

You need to vent your frustration at acellera.com

They are the ones that develop the apps. Look here.

https://www.gpugrid.net/about.php

Complain to the principal investigators Gianni and Toni.

Their contact information is in their website URLs posted on the About page.
[CSF] Aleksey Belkov

Message 60764 - Posted: 21 Sep 2023, 22:32:00 UTC - in response to Message 60754.  

Complaining here is doing no good other than making the other forum participants tired of the constant diatribes.

Golden words

However, it’s still no use, because as practice shows, no matter how you argue, such people will still pour out their dissatisfaction over and over again...

©2025 Universitat Pompeu Fabra