ATM

Message boards : News : ATM
Speedy

Joined: 19 Aug 07
Posts: 46
Credit: 45,339,082
RAC: 28
Level
Val
Message 60698 - Posted: 29 Aug 2023, 21:34:21 UTC - in response to Message 60697.  

I totally agree.
ID: 60698 · Rating: 0
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 261
Level
Trp
Message 60699 - Posted: 29 Aug 2023, 21:52:37 UTC - in response to Message 60697.  

Quico has said multiple times that he doesn't know how to fix it (the runtime/% and checkpointing). Complaining more won't get it fixed. At this point, it's your own choice to run this or not. If you don't like it, don't do it.

Quico is a research scientist - and at least he communicates with us (thank you). I wouldn't expect him to be an expert in project administration.

That's why my comment was explicitly directed at the (silent) administrators.
ID: 60699 · Rating: 0
Erich56

Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Level
Trp
Message 60700 - Posted: 30 Aug 2023, 5:16:27 UTC - in response to Message 60699.  

That's why my comment was explicitly directed at the (silent) administrators.

Yes, they are very silent, and obviously they don't care whether or not we volunteers are confronted with annoyingly faulty tasks :-(
ID: 60700 · Rating: 0
bluestang

Joined: 13 Apr 15
Posts: 11
Credit: 3,003,712,606
RAC: 1,912
Level
Arg
Message 60701 - Posted: 30 Aug 2023, 15:48:08 UTC - in response to Message 60699.  

Yes exactly. My comments are about the Admins/Devs...not Quico.

And as Richard has said, at least he communicates with us and does what he can. It's a shame the others can't, or won't.
ID: 60701 · Rating: 0
Stoneageman

Joined: 25 May 09
Posts: 224
Credit: 34,057,374,498
RAC: 0
Level
Trp
Message 60704 - Posted: 31 Aug 2023, 12:04:54 UTC

As I understand it, GPUGRID is now just one of several projects under the Computational Science Lab, and the developers are mostly involved with Acellera.
ID: 60704 · Rating: 0
Erich56

Message 60705 - Posted: 31 Aug 2023, 18:15:03 UTC

The next problem I have been facing for several days: downloading a task takes forever. The speed is about 10 kB/s :-(

This is nothing new, though. I remember that this kind of server problem comes up on a pretty regular basis :-(
ID: 60705 · Rating: 0
Erich56

Message 60706 - Posted: 31 Aug 2023, 20:10:49 UTC - in response to Message 60705.  

Right now, the download of one task has been running for 1 h 40 min and the progress is only about 55%.
That's ridiculous :-(

What's going on at GPUGRID? Are the servers breaking down?
ID: 60706 · Rating: 0
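For a rough sense of scale, the remaining time can be extrapolated from the figures in the post above (1 h 40 min elapsed, about 55% done). This is a minimal Python sketch that assumes the average transfer rate so far simply continues, which a congested or unstable server does not guarantee. The function name is illustrative and not part of BOINC.

```python
def remaining_minutes(elapsed_minutes: float, fraction_done: float) -> float:
    """Estimate the remaining time, assuming the average rate so far continues."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in the range (0, 1]")
    # Time per unit of progress so far, multiplied by the progress still missing.
    return elapsed_minutes * (1 - fraction_done) / fraction_done


elapsed = 100  # 1 h 40 min, as reported in the post
left = remaining_minutes(elapsed, 0.55)
print(f"about {left:.0f} more minutes, {elapsed + left:.0f} minutes in total")
```

Under that assumption the download would need roughly another hour and twenty minutes, for a total of about three hours.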
Keith Myers

Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 543
Level
Tyr
Message 60707 - Posted: 31 Aug 2023, 22:30:17 UTC

Lots of tasks going out to hosts and lots of results returning.

Network speed has decreased under the increased congestion.

We've seen this before when we had tons of acemd3 work.
ID: 60707 · Rating: 0
Erich56

Message 60708 - Posted: 1 Sep 2023, 5:46:33 UTC - in response to Message 60707.  

Lots of tasks going out to hosts and lots of results returning.
Network speed has decreased under the increased congestion.

Currently only 171 users are receiving and sending tasks, with several hours between receiving and sending.
So we are definitely not talking about outrageously high network traffic.
Something seems to be wrong with their servers.

ID: 60708 · Rating: 0
Keith Myers
Message 60711 - Posted: 1 Sep 2023, 19:16:39 UTC

The download times you mentioned are very long and not at all what I am experiencing.

Don't know if your network connection speed is very slow or whether your ISP is having issues routing your traffic from the project to you.

My download speeds are mainly in the range of 50-100 Mb/s according to the Transfers page in the Manager when I download new work.
ID: 60711 · Rating: 0
roundup

Joined: 11 May 10
Posts: 68
Credit: 12,293,503,875
RAC: 3,253
Level
Trp
Message 60712 - Posted: 2 Sep 2023, 7:41:43 UTC - in response to Message 60711.  

I had previously reported that ATMbeta fails after about 40 seconds on my RTX4080 under Windows 11, while I see other users getting valid results on different RTX40x0s.
Yesterday I installed Linux on the same machine and ATMbeta delivered valid results. The known bugs can still be observed, of course: energy is NaN (some WUs), the progress bar jumps to 100% (except for the 0-5 units), and there are no checkpoints.

On Windows, ATMbeta seems to have a particular problem.
ID: 60712 · Rating: 0
Erich56

Message 60714 - Posted: 3 Sep 2023, 13:00:17 UTC - in response to Message 60711.  

Here, some downloads finish rather quickly, others take forever, and sometimes they error out after a long time. STDERR then says the following:

<message>
WU download error: couldn't get input files:
<file_xfer_error>
<file_name>cmet_m16_m20_3-QUICO_ATM_Mck_GAFF2_v4-3-cmet_m16_m20_3-QUICO_ATM_Mck_GAFF2_v4-2-5-RND1222_1</file_name>
<error_code>-119 (md5 checksum failed for file)</error_code>
</file_xfer_error>
</message>

The download speed of my ISP is 300 Mbit/s which normally works well as long as the download server at the other end has no problems.
ID: 60714 · Rating: 0
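The -119 error above means the downloaded input file did not match its expected MD5 checksum, which is what a truncated or corrupted transfer would produce. As a generic illustration of that kind of check (not the BOINC client's own code), a minimal Python sketch follows. The function names are hypothetical, and the expected checksum is assumed to come from the project.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download_is_intact(path, expected_md5):
    """Compare the file's digest with the expected value (hypothetical helper)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

A file that stops short of its full size hashes to a different value, so a broken-off slow download and a checksum failure can be two symptoms of the same transfer problem.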
Erich56

Message 60715 - Posted: 3 Sep 2023, 16:11:31 UTC

Since yesterday, I have been facing a new problem:
a task fails after some time, but there is no stderr output to show what the problem was.
Example:
https://www.gpugrid.net/result.php?resultid=33612295
The task failed after 2.731 seconds.
ID: 60715 · Rating: 0
Erich56

Message 60716 - Posted: 3 Sep 2023, 16:17:43 UTC - in response to Message 60714.  

Here is an example of the download problem that I keep facing:

https://www.gpugrid.net/result.php?resultid=33613115

Created 3 Sep 2023 | 11:20:22 UTC
Sent 3 Sep 2023 | 12:29:54 UTC
Received 3 Sep 2023 | 12:43:06 UTC

Since the download still had not finished after almost 70 minutes, it broke off :-(

I think GPUGRID needs to work on their servers quickly.
ID: 60716 · Rating: 0
Keith Myers
Message 60717 - Posted: 3 Sep 2023, 19:33:56 UTC

What do you have for transfers in your cc_config.xml file?

Just the basic 2 connections?

Any rate limiting?

I think the default should be 8 connections per project and 32 per host.

Especially if there are other BOINC projects running besides GPUGrid.
ID: 60717 · Rating: 0
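For reference, the transfer limits asked about above correspond to the `max_file_xfers` (per host) and `max_file_xfers_per_project` options in cc_config.xml. The following is a minimal Python sketch that only prints such a fragment using the 8 and 32 figures suggested in the post. The function name is hypothetical, and the output would still have to be merged by hand into an existing cc_config.xml in the BOINC data directory.

```python
def build_cc_config(per_project: int, total: int) -> str:
    """Return a cc_config.xml fragment that sets the file transfer limits."""
    if per_project < 1 or total < per_project:
        raise ValueError("need 1 <= per_project <= total")
    return (
        "<cc_config>\n"
        "  <options>\n"
        f"    <max_file_xfers>{total}</max_file_xfers>\n"
        f"    <max_file_xfers_per_project>{per_project}</max_file_xfers_per_project>\n"
        "  </options>\n"
        "</cc_config>\n"
    )


# Values suggested in the post: 8 connections per project, 32 per host.
print(build_cc_config(per_project=8, total=32))
```

After editing the file, the client has to re-read it, for example via the Manager's option to read config files or with `boinccmd --read_cc_config`.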
Erich56

Message 60719 - Posted: 3 Sep 2023, 20:26:11 UTC - in response to Message 60717.  

It's 8 connections per project.

And the downloads are getting even worse now. Several times, downloads have stopped after proceeding extremely slowly, with "download failed" in the BOINC Manager :-(
ID: 60719 · Rating: 0
Erich56

Message 60720 - Posted: 3 Sep 2023, 20:44:51 UTC
Last modified: 3 Sep 2023, 20:45:57 UTC

the BOINC event log keeps saying: project servers may be temporarily down.

And the pending upload jumps to "repeat in ... hours" immediately.

So there is definitely something wrong with the servers over there.

P.S.: even sending this posting out took almost 1 minute
ID: 60720 · Rating: 0
Keith Myers
Message 60721 - Posted: 3 Sep 2023, 22:11:23 UTC - in response to Message 60720.  

I still believe the issue is local to you. In all the time you have been reporting issues with the downloads, I have not experienced any issues or backoffs.

Project is working normally for me though the speeds have degraded from what I experienced a month ago or so.

Still no issues keeping all the hosts crunching the ATMbeta tasks.
ID: 60721 · Rating: 0
Stoneageman
Message 60726 - Posted: 4 Sep 2023, 9:24:19 UTC
Last modified: 4 Sep 2023, 9:29:31 UTC

I regularly get backoffs on transfers, so it's not just you.
Their server has issues for sure. I have to use a different IP address from the one my hosts use, just to access this site. Some ideas:

Reboot your router at least every 24hrs. If you are not on a fixed IP, this is likely to get you a new IP address. This helps me greatly with maintaining good transfer speeds.

Use a VPN and set location as Spain.

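Use a script on each host to keep tickling their server, such as:

REM Ask the local BOINC client to retry deferred network transfers every 5 minutes.
REM Adjust the boinccmd path, port, and password to match your install.
:top
"C:\Program Files\BOINC\boinccmd" --host 127.0.0.1:31416 --passwd "yourpasswordhere" --network_available
TIMEOUT /T 300
goto top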


Create a text file with this script. Edit to suit your install. Save as a batch file then double click to run it.
ID: 60726 · Rating: 0
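On hosts where a batch file is not an option (Linux, for example), the same idea can be sketched in Python. This is only an illustration under assumptions: it presumes `boinccmd` is on the PATH, reuses the host, port, and placeholder password from the script above, and the function names are made up for this example.

```python
import subprocess
import time


def build_command(boinccmd="boinccmd", host="127.0.0.1:31416",
                  password="yourpasswordhere"):
    """Build the boinccmd call that asks the client to retry deferred transfers."""
    return [boinccmd, "--host", host, "--passwd", password, "--network_available"]


def tickle_forever(interval_seconds=300, **kwargs):
    """Run the command every interval_seconds until interrupted (Ctrl+C)."""
    command = build_command(**kwargs)
    while True:
        try:
            result = subprocess.run(command, capture_output=True, text=True)
            if result.returncode != 0:
                print("boinccmd reported a problem:", result.stderr.strip())
        except FileNotFoundError:
            print("boinccmd not found; pass its full path via boinccmd=...")
        time.sleep(interval_seconds)


# To start the loop, call: tickle_forever(password="yourpasswordhere")
```

The loop is deliberately not started automatically, so importing or pasting the code has no side effects until `tickle_forever` is called.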
Ian&Steve C.

Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 3,915
Level
Trp
Message 60727 - Posted: 4 Sep 2023, 16:06:36 UTC
Last modified: 4 Sep 2023, 16:07:19 UTC

I started processing ATM again on my known stable host (Linux Ubuntu LTS, EPYC + 4x A4000).

Out of 160 tasks that have processed, 9 had an error (excluding 9 tasks that had download errors and never wasted any processing time; download errors are just a "cost of doing business" with GPUGRID, IMO).

That's roughly a 5.6% error rate, and reasonable IMO. Yeah, some failed after a decent processing time, but I'm not going to get upset about it since the vast majority of tasks that touch my system complete successfully.

If anyone is having a significantly higher error rate, you might need to look into the stability of the system itself, switch to Linux, or re-examine how you are operating (don't stop the tasks for any reason if you can help it, don't reboot, don't run other projects, etc.), or any combination of the three.

When set up properly and accounting for project-specific idiosyncrasies, these tasks mostly run fine.
ID: 60727 · Rating: 0
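The error-rate figure in the post above can be checked with a couple of lines of Python. The first call uses the numbers as stated (9 errors out of 160 processed tasks). The second is an assumption for comparison only: it counts the 9 download errors as failures as well, which the poster explicitly excluded.

```python
def error_rate(failed: int, total: int) -> float:
    """Return the fraction of tasks that failed."""
    if total <= 0:
        raise ValueError("total must be positive")
    return failed / total


# As stated in the post: 9 compute errors out of 160 processed tasks.
print(f"compute errors only: {error_rate(9, 160):.3f}")

# Assumption for comparison: also count the 9 download errors as failed tasks.
print(f"including download errors: {error_rate(9 + 9, 160 + 9):.3f}")
```

Counted the first way the rate is a little under 6%; counting download errors too would put it closer to 11%.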

©2025 Universitat Pompeu Fabra