ATMML

Message boards : Number crunching : ATMML
Richard

Joined: 13 Jan 24
Posts: 2
Credit: 39,763,706
RAC: 0
Message 61692 - Posted: 22 Aug 2024, 20:11:48 UTC - in response to Message 61689.  

Looks like it finally started and ran for a few minutes, then uploaded...

Richard
Opolis

Joined: 19 Feb 12
Posts: 3
Credit: 1,508,126,091
RAC: 11
Message 61693 - Posted: 22 Aug 2024, 22:21:42 UTC - in response to Message 61687.  

These tasks are running fine for me so far. The only thing I noticed was that the points awarded seem off. The second task I completed took an hour longer than the first but received 900k fewer points. So far they have been taking 5-6 hours on a 3080ti, driver version 535.183.01.


The points are accurate. You get a 50% bonus, if you finish the task successfully and return the results within 24 hours from downloading it. There is a 25% bonus if you do it within 48 hours. No bonus if you return it after 48 hours. This is an incentive for quick return of results.




Ah you are correct. I had the one task stuck in "downloading" for a while and I didn't run it until the next day.
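
The bonus rule quoted above can be written down as a small function. This is only a sketch of the rule as described in this thread, not the project's actual crediting code; the function names and the choice to treat exactly 24 and 48 hours as still inside the window are assumptions.

```python
def bonus_multiplier(hours_to_return: float) -> float:
    """Credit multiplier from the hours between download and successful return.

    Thresholds come from the post above; inclusive boundaries are an assumption.
    """
    if hours_to_return <= 24:
        return 1.5   # 50% bonus for returning within 24 hours
    if hours_to_return <= 48:
        return 1.25  # 25% bonus for returning within 48 hours
    return 1.0       # no bonus after 48 hours


def awarded_credit(base_credit: float, hours_to_return: float) -> float:
    """Apply the bonus to a hypothetical base credit value."""
    return base_credit * bonus_multiplier(hours_to_return)


# A task that sat in "downloading" and was returned the next day
# drops from the 1.5x tier to the 1.25x tier.
print(bonus_multiplier(6), bonus_multiplier(30), bonus_multiplier(72))
```

Under this reading, a task returned on day two earns one sixth less credit than the same task returned on day one, whatever the run time was.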
WPrion

Joined: 30 Apr 13
Posts: 106
Credit: 3,805,237,860
RAC: 65
Message 61694 - Posted: 23 Aug 2024, 1:13:04 UTC

Are there no checkpoints on ATMML tasks?

I was about 30% complete when I had to suspend the task and shut down the computer. When I restarted both the % done and elapsed time were zero.
Bedrich Hajek

Joined: 28 Mar 09
Posts: 490
Credit: 11,731,645,728
RAC: 69
Message 61695 - Posted: 23 Aug 2024, 2:00:00 UTC - in response to Message 61694.  

Are there no checkpoints on ATMML tasks?

I was about 30% complete when I had to suspend the task and shut down the computer. When I restarted both the % done and elapsed time were zero.



No, there are not. The same goes for quantum chemistry and ATM. They haven't figured out how to do it yet.

Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Message 61696 - Posted: 23 Aug 2024, 7:14:23 UTC

I hope this doesn't backfire. This morning I see 800 tasks in progress, but zero ready to send.

My last two downloads have been replica _3 tasks, each WU having failed on three Windows machines first.

I do hope new Windows users pay attention to the 'tricks of the trade' we've learned over the years:

    * small cache, especially with slower GPUs.
    * run continuously, don't allow interruptions (especially auto-updates)
    * don't swap to a different GPU type mid-run

ServicEnginIC

Joined: 24 Sep 10
Posts: 592
Credit: 11,972,186,510
RAC: 1,447
Message 61697 - Posted: 23 Aug 2024, 10:32:49 UTC - in response to Message 61696.  

I do hope new Windows users pay attention to the 'tricks of the trade' we've learned over the years:

Thank you for your ever-sharing expertise


My last two downloads have been replica _3 tasks, each WU having failed on three Windows machines first.

Despite this, there is a noticeable increase in the number of users returning ATMML results.
This is likely the effect of Windows users now being added to the existing Linux ones.
Before the new Windows ATMML app was released, users/24h was consistently about 80-100.
Currently it is more than 230, as can be seen on the Server status page.
WPrion

Joined: 30 Apr 13
Posts: 106
Credit: 3,805,237,860
RAC: 65
Message 61698 - Posted: 23 Aug 2024, 10:55:09 UTC - in response to Message 61595.  
Last modified: 23 Aug 2024, 10:58:27 UTC

Re: the Apps page: https://www.gpugrid.net/apps.php

I wish, for consistency, it would state:

ATMML: Free energy with neural networks for GPU

Also, when selecting projects in project preferences, it would be nice if it stated:

ATMML on GPU
Ian&Steve C.

Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Message 61699 - Posted: 23 Aug 2024, 11:21:33 UTC - in response to Message 61698.  

Re: the Apps page: https://www.gpugrid.net/apps.php

I wish, for consistency, it would state:

ATMML: Free energy with neural networks for GPU

Also, when selecting projects in project preferences, it would be nice if it stated:

ATMML on GPU


this is GPUgrid. all tasks are for GPU
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Message 61700 - Posted: 23 Aug 2024, 12:41:38 UTC - in response to Message 61697.  

Despite this, there is a noticeable increase in the number of users returning ATMML results.

Indeed. But the question is: are those completed, end-of-run, scientifically useful results - or are they early crashes, resulting only in the creation and issue of another replica, to take its place in the 'in progress' count?

We can't tell from the outside. But runtimes starting at 0.04 hours don't look too good.
Steve
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist

Joined: 21 Dec 23
Posts: 51
Credit: 0
RAC: 0
Message 61701 - Posted: 23 Aug 2024, 12:57:06 UTC - in response to Message 61700.  
Last modified: 23 Aug 2024, 12:59:22 UTC

Hi, the Windows hosts are working successfully. There are more errors than on Linux, as expected, but plenty are working well.

Unfortunately, some WUs affected by the bug that gives a very short run time but a validated status are still in circulation. (Each WU runs in a chain of 5 steps; when a step finishes, it launches a new job with the same settings.) New WUs do not have this bug.
This is the bug I am talking about: https://www.gpugrid.net/forum_thread.php?id=5468&nowrap=true#61682
WPrion

Joined: 30 Apr 13
Posts: 106
Credit: 3,805,237,860
RAC: 65
Message 61702 - Posted: 23 Aug 2024, 16:55:08 UTC - in response to Message 61696.  


* small cache, especially with slower GPUs.


Which cache? Where is it set?? What should it be set at???
WPrion

Joined: 30 Apr 13
Posts: 106
Credit: 3,805,237,860
RAC: 65
Message 61703 - Posted: 23 Aug 2024, 17:03:59 UTC

I just started ATMML yesterday. Out of seven starts only one completed. The rest errored-out after 1-1.5 hours. Windows11/RTX4090.

I'd like to get some actual work done...
Ian&Steve C.

Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Message 61704 - Posted: 23 Aug 2024, 17:19:31 UTC - in response to Message 61702.  


* small cache, especially with slower GPUs.


Which cache? Where is it set?? What should it be set at???


He's talking about the work cache on the host. you can (kind of) control that in the BOINC Manager Options->"Computing Preferences" menu. set it to something less than 1 day probably.

you'll be limited to 4 tasks from the project (per GPU) anyway.
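
For anyone who prefers a file over the Manager dialog, the same work-cache setting can also be expressed as a local override. The sketch below only builds the XML text; it assumes the standard `global_prefs_override.xml` layout with the `work_buf_min_days` and `work_buf_additional_days` elements, and the values 0.10 and 0.00 are illustrative picks for "something less than 1 day", not a project recommendation.

```python
import xml.etree.ElementTree as ET


def build_override(min_days: float, additional_days: float) -> str:
    """Return the text of a minimal global_prefs_override.xml.

    min_days:        "store at least N days of work"
    additional_days: "store up to an additional N days of work"
    """
    root = ET.Element("global_preferences")
    ET.SubElement(root, "work_buf_min_days").text = f"{min_days:.2f}"
    ET.SubElement(root, "work_buf_additional_days").text = f"{additional_days:.2f}"
    return ET.tostring(root, encoding="unicode")


# Illustrative values only: keep roughly a tenth of a day of work queued.
override_xml = build_override(0.10, 0.00)
print(override_xml)
```

The resulting text would go into `global_prefs_override.xml` in the BOINC data directory, after which the client has to be told to re-read its local preferences. Setting the same two numbers in the Computing Preferences dialog should have the same effect.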

Farscape

Joined: 1 Feb 09
Posts: 6
Credit: 1,937,116,460
RAC: 0
Message 61705 - Posted: 23 Aug 2024, 17:30:31 UTC

The Windows tasks ARE NOT working as advertised....

On two 3090 Ti computers and one 3090, 11 work units have errored out after between 2 and 4 hours of run time.

Previous successful tasks ran for 17000-18500 seconds.

Errored tasks ran for 5000-8500 seconds.

I am killing the app in preferences until it sorts itself out....
WPrion

Joined: 30 Apr 13
Posts: 106
Credit: 3,805,237,860
RAC: 65
Message 61706 - Posted: 23 Aug 2024, 18:22:30 UTC - in response to Message 61705.  

Thanks. There are too many caches out there. Let's call this the work queue.
zombie67 [MM]

Joined: 16 Jul 07
Posts: 209
Credit: 5,496,860,456
RAC: 12,111
Message 61707 - Posted: 23 Aug 2024, 20:14:08 UTC - in response to Message 61705.  

The Windows tasks ARE NOT working as advertised....

On two 3090 Ti computers and one 3090, 11 work units have errored out after between 2 and 4 hours of run time.

Previous successful tasks ran for 17000-18500 seconds.

Errored tasks ran for 5000-8500 seconds.

I am killing the app in preferences until it sorts itself out....


All 8 of the 8 tasks I have completed and returned were also categorized as errors. This is on Win10 with 4080 and 4090 GPUs. Here is a sample:
http://www.gpugrid.net/result.php?resultid=35743812
Reno, NV
Team: SETI.USA
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Message 61708 - Posted: 23 Aug 2024, 21:09:54 UTC - in response to Message 61707.  

Exit status 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED

That's going to be a difficult one to overcome unless the project addresses its job estimation. You need to 'complete' (which includes a successful finish plus validation) 11 tasks before the estimates are normalised - and if every task fails, you'll never get there.
Steve
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist

Joined: 21 Dec 23
Posts: 51
Credit: 0
RAC: 0
Message 61709 - Posted: 24 Aug 2024, 8:17:34 UTC - in response to Message 61708.  

Hello. I apologise for the time-limit-exceeded errors. I did not expect this. The jobs run for the same time as the Linux ones, which have all been working, so I don't really understand what is happening.

Unfortunately, the way BOINC deals with "runtime" is completely inadequate for GPU projects. In a WU we have to estimate the FLOP count, which is difficult to do for a GPU app. The BOINC client then somehow estimates the FLOPS performance of your computer in a way I don't understand. I cannot simply set a runtime limit of x hours, as would be typical.


Does anyone know where the denominator comes from in this line?:
  <message>
exceeded elapsed time limit 5454.20 (10000000000.00G/1712015.37G)</message>
<stderr_txt>


The numerator I believe is the fpops_bound that is set in the WU template which is controlled by us.
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Message 61710 - Posted: 24 Aug 2024, 9:16:26 UTC - in response to Message 61709.  
Last modified: 24 Aug 2024, 9:18:42 UTC

Does anyone know where the denominator comes from in this line?:

<message>
exceeded elapsed time limit 5454.20 (10000000000.00G/1712015.37G)</message>
<stderr_txt>

The numerator I believe is the fpops_bound that is set in the WU template which is controlled by us.

Yes. It's the current estimated speed for the task, which should be 'learned' by BOINC for the individual computer running this particular task type ('app_version').
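
To make the two figures in that message concrete, here is a small sketch that pulls them apart and divides one by the other. The reading of the parenthesised values (bound in giga floating-point operations over estimated speed in giga-operations per second) follows the discussion in this thread; the regular expression and the dictionary keys are made up for illustration.

```python
import re

# The line quoted in this thread, copied verbatim.
MESSAGE = "exceeded elapsed time limit 5454.20 (10000000000.00G/1712015.37G)"


def parse_limit_message(message: str) -> dict:
    """Split the message into printed limit, bound, speed and their quotient."""
    pattern = r"exceeded elapsed time limit ([\d.]+) \(([\d.]+)G/([\d.]+)G\)"
    match = re.search(pattern, message)
    if match is None:
        raise ValueError("unrecognised message format")
    limit_s, bound_g, speed_g = (float(group) for group in match.groups())
    return {
        "printed_limit_s": limit_s,      # seconds, as printed by the client
        "fpops_bound_gflop": bound_g,    # numerator, set in the WU template
        "est_speed_gflops": speed_g,     # denominator, the learned speed
        "quotient_s": bound_g / speed_g, # bound divided by speed, in seconds
    }


info = parse_limit_message(MESSAGE)
for key, value in info.items():
    print(f"{key}: {value:,.2f}")
```

For this message the quotient comes out at roughly 5,841 seconds, in the same range as the printed 5454.20 but not identical, so the sketch reports both rather than assuming exactly how the client derives the printed limit. The denominator is the striking part: about 1.7 million GFLOPS is far more than any single consumer GPU can deliver, which would explain why a task that needs around five hours hits the limit after about an hour and a half.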

It's a complex three-stage process, and unfortunately it doesn't go down to the granularity of individual GPU types - all GPUs are considered equal.

1) When a new app version is created, the server will set a first, initial, value for GPU speeds for that version. I'm afraid I don't know how that initial value is estimated, but I'll try to find out.

2) Once the app version is up and running, the server monitors the runtime of the successful tasks returned. That's done at both the project level, and the individual host level. The first critical point is probably when the project has received 100 results: the calculated average speed from those 100 is used to set the expected speed for all tasks issued from that point forward. [aside - 'obviously' the first results received will be from the fastest machines, so that value is skewed]

3) Later, as each individual host reports tasks, once 11 successful tasks have been returned, future tasks assigned to that host are assigned the running average value for that host.
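
The three stages can be summarised as a simple selection rule. This is a rough sketch of the behaviour as described in this post, not the server's actual code: the function names are hypothetical, only the thresholds of 100 and 11 come from the text above, and the numbers in the example call are invented.

```python
from typing import Optional

PROJECT_THRESHOLD = 100  # results received project-wide (from the post)
HOST_THRESHOLD = 11      # successful results from one host (from the post)


def estimated_speed(
    initial_speed: float,
    project_results: int,
    project_avg_speed: Optional[float],
    host_results: int,
    host_avg_speed: Optional[float],
) -> float:
    """Pick the speed estimate (FLOPS) used for the next task sent to a host."""
    if host_results >= HOST_THRESHOLD and host_avg_speed is not None:
        return host_avg_speed      # stage 3: the host's own running average
    if project_results >= PROJECT_THRESHOLD and project_avg_speed is not None:
        return project_avg_speed   # stage 2: project-wide average
    return initial_speed           # stage 1: initial value for the app version


def elapsed_limit_seconds(fpops_bound: float, speed: float) -> float:
    """Time limit implied by the WU's fpops bound and the chosen speed."""
    return fpops_bound / speed


# Invented numbers: a host with only 3 successful results still gets the
# project-wide average, even if its own measured speed is much lower.
speed = estimated_speed(1e12, 250, 5e14, 3, 2e13)
print(speed, elapsed_limit_seconds(1e19, speed))
```

The catch discussed in this thread falls straight out of the rule: a host whose tasks all hit the time limit never reaches 11 successful results, so it never moves on to its own average.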

The current speed estimate ('fpops_est') can be seen in the application_details page for each host. zombie67 hasn't completed an ATMML task yet, so no 'Average processing rate' for his machine is shown yet for ATMML (at the bottom), but you can see it for other task types.

Phew. That's probably more than enough for now, so I'll leave you to digest it.
wujj123456

Joined: 9 Jun 10
Posts: 19
Credit: 2,233,932,323
RAC: 0
Message 61711 - Posted: 24 Aug 2024, 20:16:04 UTC - in response to Message 61710.  

I'm curious why we even bother to intentionally error out a task based on runtime at all. Usually a wrong runtime estimate just messes with local client scheduling a bit, but tasks finish fine eventually. It's not as if GPUGrid had accurate runtime estimation before, but previous tasks didn't fail.

Does this batch/app have a bug that could cause it to get stuck computing forever, which is why we need additional protection to abort tasks after a certain runtime?

©2025 Universitat Pompeu Fabra