ATM

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 4,772
Message 60615 - Posted: 20 Jul 2023, 17:00:39 UTC - in response to Message 60610.  
Last modified: 20 Jul 2023, 17:09:15 UTC

Sorry for missing out for a while. We were testing ATM in a setup not available for GPUGRID. But we're back to crunching :)

I've seen that more or less everything is running fine. Apart from some crashes that can happen, everything seems to come back to me fine.

Is there anything specific I should look into? I already know about the progress reporting issue (if it persists) but there's not much more I can do on my end. If they plan to update the GPUGRID app at some point I'll insist.


several long-standing and well-discussed issues are still unresolved with these tasks. in decreasing order of priority:

1. task checkpointing still does not work properly. it may be writing to the checkpoint file, but it never resumes from the checkpoint. pausing or suspending work units for any reason will cause them to error out when they attempt to resume. this is an issue for anyone who runs multiple projects (BOINC will occasionally pause in-progress units to crunch other projects) or needs to shut down their computer for updates or whatever.

2. runtime progress reporting ONLY works for the first batch of "0-5" labelled tasks. anything "1-5" through "4-5" does not work properly: they jump immediately to 100% and stay there until the task is complete. this makes it hard to know how long they will run.

3. the estimated flops setting on these tasks is probably way too high, leading to crazy high runtime estimates. this could cause indirect issues with the BOINC client either not fetching work properly or not managing other projects properly.

4. many batches are occasionally sent out malformed, leading to errors. most seem to be due to incorrect formatting or naming. stuff like this:
"+ tar cjvf restart.tar.bz2 'r*/*.xml'
tar: r*/*.xml: Cannot stat: No such file or directory"
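
For what it's worth, the leading `+` looks like shell tracing output (as from `set -x`), and if the script runs under bash, a pattern that matches nothing is passed through to `tar` unchanged, which would produce exactly this message. So the likely cause is that no `r*/*.xml` files existed when the archive step ran, though that is a guess from the log alone. A minimal sketch of guarding such a step, written in Python; the function name `archive_restart_files` and the directory layout are assumptions for illustration, not the project's actual packaging code:

```python
import tarfile
from pathlib import Path


def archive_restart_files(workdir, archive_name="restart.tar.bz2", pattern="r*/*.xml"):
    """Archive files matching `pattern` under `workdir`; return the matched paths.

    Hypothetical helper: names and layout are assumptions for illustration.
    """
    workdir = Path(workdir)
    matches = sorted(workdir.glob(pattern))
    if not matches:
        # Nothing matched: report that instead of handing a literal pattern on.
        return []
    with tarfile.open(str(workdir / archive_name), "w:bz2") as tar:
        for path in matches:
            # Store paths relative to workdir, as tar run inside workdir would.
            tar.add(str(path), arcname=path.relative_to(workdir).as_posix())
    return matches
```

With a check like this the task could log "no restart files found" and fail early with a clear reason, rather than surfacing a bare `tar` error.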

these are things I've seen constant complaints about every time these tasks come back.

I would highly recommend that you guys attach a computer to the project like a normal user so that you can experience them first hand and properly troubleshoot them.
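
To illustrate point 3: as far as I understand it, the BOINC client's initial runtime estimate is roughly the workunit's `rsc_fpops_est` divided by the speed it assumes for the app on that host, so an inflated estimate scales the predicted runtime proportionally. A rough sketch with made-up numbers (the device speed and work estimates below are assumptions, not values from real ATM workunits):

```python
def estimated_runtime_hours(rsc_fpops_est, device_flops):
    """Rough initial runtime estimate: work estimate divided by assumed speed."""
    if device_flops <= 0:
        raise ValueError("device_flops must be positive")
    return rsc_fpops_est / device_flops / 3600.0


# Illustrative numbers only (assumptions, not taken from real workunits).
device_flops = 10e12         # speed assumed for the GPU app: 10 TFLOPS
realistic = 1.8e17           # work estimate that matches about 5 hours here
inflated = realistic * 100   # same task with an estimate 100x too high

print(estimated_runtime_hours(realistic, device_flops))
print(estimated_runtime_hours(inflated, device_flops))
```

With these numbers the same task goes from an estimate of about 5 hours to about 500 hours, which is the kind of figure that can make the client stop fetching work.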

bluestang
Joined: 13 Apr 15
Posts: 11
Credit: 3,003,712,606
RAC: 2,331
Message 60616 - Posted: 20 Jul 2023, 18:39:57 UTC

Yes, the constant WUs throwing errors! Luckily most at the beginning, but some run for quite some time before erroring out. More Errors than Valid is a huge waste of resources and time for everyone.

Ian&Steve says it well!

KAMasud
Joined: 27 Jul 11
Posts: 138
Credit: 539,953,398
RAC: 0
Message 60617 - Posted: 21 Jul 2023, 7:48:12 UTC - in response to Message 60610.  

Sorry for missing out for a while. We were testing ATM in a setup not available for GPUGRID. But we're back to crunching :)

I've seen that more or less everything is running fine. Apart from some crashes that can happen, everything seems to come back to me fine.

Is there anything specific I should look into? I already know about the progress reporting issue (if it persists) but there's not much more I can do on my end. If they plan to update the GPUGRID app at some point I'll insist.

________

Could you please make these tasks able to suspend? It is monsoon season in my part of the world, and every time it rains there is a power outage. Even though I have set Preferences to keep WUs in memory while on batteries, every time the power goes the WU ends up with an error. Now 100% of the errored WUs at my end are due to this reason.
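
For anyone wondering what "able to suspend" would need, here is a minimal sketch of the generic pattern: save progress periodically, and on start-up load the last saved state instead of starting from zero. This is only an illustration with a dummy workload; the function name, the JSON file format and the `stop_after` parameter (which simulates a suspend or power cut) are all assumptions, and it is not how the ATM app is actually written.

```python
import json
import os


def run_with_checkpoints(total_steps, checkpoint_path, checkpoint_every=100, stop_after=None):
    """Run a dummy loop that can be interrupted and later resumed.

    `stop_after` simulates an interruption after that many steps in this call.
    Returns the (step, state) reached when the call ends.
    """
    step, state = 0, 0
    if os.path.exists(checkpoint_path):
        # Resume: this is the part that reportedly never happens for these tasks.
        with open(checkpoint_path) as f:
            saved = json.load(f)
        step, state = saved["step"], saved["state"]

    done_this_call = 0
    while step < total_steps:
        if stop_after is not None and done_this_call >= stop_after:
            return step, state  # simulated suspend or power cut
        state += step           # stand-in for one unit of real work
        step += 1
        done_this_call += 1
        if step % checkpoint_every == 0 or step == total_steps:
            tmp = checkpoint_path + ".tmp"
            with open(tmp, "w") as f:
                json.dump({"step": step, "state": state}, f)
            os.replace(tmp, checkpoint_path)  # swap in the new file in one step
    return step, state
```

Writing to a temporary file and renaming it means a power cut in the middle of a write leaves the previous checkpoint intact, which matters for exactly the outage scenario described above.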

Erich56
Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Message 60618 - Posted: 21 Jul 2023, 8:39:49 UTC - in response to Message 60616.  

Yes, the constant WUs throwing errors! Luckily most at the beginning, but some run for quite some time before erroring out. More Errors than Valid is a huge waste of resources and time for everyone.

Ian&Steve says it well!

Mentioning the "waste of resources": "ValueError: Energy is NaN." has happened again quite a lot recently, mostly after 1.5 to 2 hours of runtime.
Given that electricity costs have tripled here since last year, such waste has become quite expensive :-(

bluestang
Joined: 13 Apr 15
Posts: 11
Credit: 3,003,712,606
RAC: 2,331
Message 60619 - Posted: 26 Jul 2023, 22:57:14 UTC

Another new batch...same old Errors.

Does Krembil have anything to do with this project lololol.

KAMasud
Joined: 27 Jul 11
Posts: 138
Credit: 539,953,398
RAC: 0
Message 60620 - Posted: 27 Jul 2023, 17:20:42 UTC

Valid 17, error 26. I know it makes no difference; plenty of computers are standing by and it will get done.

roundup
Joined: 11 May 10
Posts: 68
Credit: 12,293,501,875
RAC: 3,114
Message 60621 - Posted: 29 Jul 2023, 10:21:38 UTC - in response to Message 60612.  

AFAIK it should run anywhere, maybe the issue is more driver related? We recently tested on 40 series GPUs locally and it ran fine, since I saw some comments in the thread.


My driver for the RTX 4080 under Win11 is 536.23
All units error out after about 40 seconds.
I do not see this on a 2070S nor on a 3070 Laptop.

Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 662
Message 60622 - Posted: 29 Jul 2023, 16:26:53 UTC - in response to Message 60621.  

AFAIK it should run anywhere, maybe the issue is more driver related? We recently tested on 40 series GPUs locally and it ran fine, since I saw some comments in the thread.


My driver for the RTX 4080 under Win11 is 536.23
All units error out after about 40 seconds.
I do not see this on a 2070S nor on a 3070 Laptop.

It would be helpful if you unhid your computers so we could examine the output files to get a clue as to why the tasks are failing on your 40 series card.

roundup
Joined: 11 May 10
Posts: 68
Credit: 12,293,501,875
RAC: 3,114
Message 60623 - Posted: 29 Jul 2023, 19:35:33 UTC - in response to Message 60622.  
Last modified: 29 Jul 2023, 19:45:12 UTC

It would be helpful if you unhid your computers so we could examine the output files to get a clue as to why the tasks are failing on your 40 series card.


Done.
I just ran 2 fresh WUs that errored out as usual.
Thank you very much.

Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 662
Message 60624 - Posted: 29 Jul 2023, 19:38:57 UTC - in response to Message 60623.  
Last modified: 29 Jul 2023, 19:41:14 UTC


It would be helpful if you unhid your computers so we could examine the output files to get a clue as to why the tasks are failing on your 40 series card.


Done. Thank you very much.

Wasn't helpful. You don't have any result output at all. The tasks never even get to start the setup process. They just exit immediately.

Quico needs to reexamine his statement that the 40 series cards are working OK on the ATMbeta tasks.

[Edit]
I would reset the project to start with, in the hope that the task and app packages get downloaded again. Maybe the necessary Python environment never got set up correctly initially.

roundup
Joined: 11 May 10
Posts: 68
Credit: 12,293,501,875
RAC: 3,114
Message 60625 - Posted: 29 Jul 2023, 19:52:58 UTC - in response to Message 60624.  


I would reset the project to start with, in the hope that the task and app packages get downloaded again. Maybe the necessary Python environment never got set up correctly initially.

I reset the project and tried two new WUs. Same result: error after a few seconds.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 4,772
Message 60626 - Posted: 29 Jul 2023, 21:30:00 UTC - in response to Message 60624.  
Last modified: 29 Jul 2023, 21:32:16 UTC

Quico needs to reexamine his statement that the 40 series cards are working OK on the ATMbeta tasks.


they do. look at the leaderboard. many 40-series hosts are returning valid work from both Linux and Windows.

Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 662
Message 60627 - Posted: 29 Jul 2023, 21:40:38 UTC - in response to Message 60626.  

OK, so 40 series works fine for both Windows and Linux.

So what would you recommend this volunteer do for troubleshooting when the tasks don't report any useful information?

The most logical step of project reset was not fruitful.

Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 662
Message 60628 - Posted: 29 Jul 2023, 22:51:39 UTC

Does the 4080 run other projects' GPU tasks without errors?

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 4,772
Message 60629 - Posted: 30 Jul 2023, 3:12:46 UTC - in response to Message 60627.  

Maybe a problem with BOINC itself. Might try a different BOINC version.

roundup
Joined: 11 May 10
Posts: 68
Credit: 12,293,501,875
RAC: 3,114
Message 60630 - Posted: 30 Jul 2023, 5:45:30 UTC - in response to Message 60628.  

Does the 4080 run other projects' GPU tasks without errors?

Yes, it runs them without errors: PrimeGrid, SRBase, Einstein and WCG OPNG.
BOINC was updated to 7.22.2 within this Beta phase - same result.

Erich56
Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Message 60632 - Posted: 3 Aug 2023, 5:25:52 UTC - in response to Message 60619.  

Another new batch...same old Errors.

Does Krembil have anything to do with this project lololol.

forget Krembil - it's down most of the time. Too bad what happened to WCG :-(

KAMasud
Joined: 27 Jul 11
Posts: 138
Credit: 539,953,398
RAC: 0
Message 60633 - Posted: 3 Aug 2023, 11:21:33 UTC

It does not make any difference to Quico. The task will be completed on one computer or another and his science is done. It is our very expensive energy that is wasted but, as Quico himself said, the science gets done, so who cares about wasted energy?

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 4,772
Message 60634 - Posted: 4 Aug 2023, 17:50:37 UTC - in response to Message 60633.  

You also have the option to crunch something else if your time is wasted here.

KAMasud
Joined: 27 Jul 11
Posts: 138
Credit: 539,953,398
RAC: 0
Message 60636 - Posted: 6 Aug 2023, 1:02:51 UTC - in response to Message 60634.  

You also have the option to crunch something else if your time is wasted here.


I wish you would put a dirty sock where required. In Asia, the transmission of power is through overhead lines. They run red hot and expand in our heat. Many people used to die due to electrocution. They switch off the grid. If the WUs cannot handle a suspension then there is no need for catty, useless remarks. You also have the option of not running off with your writing skills.

©2025 Universitat Pompeu Fabra