ATMML

Author	Message
Richard Joined: 13 Jan 24 Posts: 2 Credit: 39,763,706 RAC: 0	Message 61692 - Posted: 22 Aug 2024, 20:11:48 UTC - in response to Message 61689. Looks like it finally started and ran for a few minutes, then uploaded... Richard

Opolis Joined: 19 Feb 12 Posts: 3 Credit: 1,523,876,091 RAC: 382	Message 61693 - Posted: 22 Aug 2024, 22:21:42 UTC - in response to Message 61687. "These tasks are running fine for me so far. The only thing I noticed was that the points awarded seem off. The second task I completed took an hour longer than the first but received 900k fewer points. So far they have been taking 5-6 hours on a 3080 Ti, driver version 535.183.01." "The points are accurate. You get a 50% bonus if you finish the task successfully and return the results within 24 hours of downloading it. There is a 25% bonus if you do it within 48 hours. No bonus if you return it after 48 hours. This is an incentive for quick return of results." Ah, you are correct. I had the one task stuck in "downloading" for a while and I didn't run it until the next day.
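The bonus rule quoted above can be sketched in a few lines. This is only an illustration of the rule as stated in the post, not the project's actual credit code: the function name, the `base_credit` parameter, and the choice to treat the 24-hour and 48-hour boundaries as inclusive are all assumptions.

```python
def credit_with_bonus(base_credit: float, hours_to_return: float) -> float:
    """Apply the quick-return bonus described in the post.

    base_credit is a hypothetical pre-bonus value; hours_to_return is the
    time from download to reporting a successful result.
    """
    if hours_to_return <= 24:      # boundary assumed inclusive
        multiplier = 1.50          # 50% bonus
    elif hours_to_return <= 48:    # boundary assumed inclusive
        multiplier = 1.25          # 25% bonus
    else:
        multiplier = 1.00          # no bonus
    return base_credit * multiplier


# Same hypothetical task, returned the same day versus the next day.
same_day = credit_with_bonus(1000.0, 6.0)
next_day = credit_with_bonus(1000.0, 30.0)
print(same_day, next_day)
```

Under these assumptions, a task that sits in "downloading" overnight and is returned after the 24-hour mark drops from the 1.50 multiplier to 1.25, even if its compute time is no longer than a same-day task.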

WPrion Joined: 30 Apr 13 Posts: 109 Credit: 3,977,737,860 RAC: 6,051	Message 61694 - Posted: 23 Aug 2024, 1:13:04 UTC Are there no checkpoints on ATMML tasks? I was about 30% complete when I had to suspend the task and shut down the computer. When I restarted, both the % done and the elapsed time were zero.

Bedrich Hajek Joined: 28 Mar 09 Posts: 490 Credit: 11,850,145,728 RAC: 3,168	Message 61695 - Posted: 23 Aug 2024, 2:00:00 UTC - in response to Message 61694. "Are there no checkpoints on ATMML tasks? I was about 30% complete when I had to suspend the task and shut down the computer. When I restarted both the % done and elapsed time were zero." No, there are not. Same goes for quantum chemistry and ATM. They haven't figured out how to do it yet.

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 61696 - Posted: 23 Aug 2024, 7:14:23 UTC I hope this doesn't backfire. This morning I see 800 tasks in progress, but zero ready to send. My last two downloads have been replica _3 tasks, each WU having failed on three Windows machines first. I do hope new Windows users pay attention to the 'tricks of the trade' we've learned over the years:
* small cache, especially with slower GPUs.
* run continuously, don't allow interruptions (especially auto-updates)
* don't swap to a different GPU type mid-run

ServicEnginIC Joined: 24 Sep 10 Posts: 595 Credit: 13,083,686,510 RAC: 31,373	Message 61697 - Posted: 23 Aug 2024, 10:32:49 UTC - in response to Message 61696. "I do hope new Windows users pay attention to the 'tricks of the trade' we've learned over the years:" Thank you for your ever-sharing expertise. "My last two downloads have been replica _3 tasks, each WU having failed on three Windows machines first." Despite this, there is a noticeable increase in the number of users returning ATMML results. This is likely the effect of Windows users now being added to the previous Linux ones. Before the new Windows ATMML app was released, users/24h was consistently about 80 - 100. Currently it is more than 230, as can be seen on the Server status page.

WPrion Joined: 30 Apr 13 Posts: 109 Credit: 3,977,737,860 RAC: 6,051	Message 61698 - Posted: 23 Aug 2024, 10:55:09 UTC - in response to Message 61595. Last modified: 23 Aug 2024, 10:58:27 UTC Re: the Apps page: https://www.gpugrid.net/apps.php I wish, for consistency, it would state: ATMML: Free energy with neural networks for GPU. Also, when selecting projects in project preferences, it would be nice if it stated: ATMML on GPU.

Ian&Steve C. Joined: 21 Feb 20 Posts: 1117 Credit: 40,876,970,595 RAC: 0	Message 61699 - Posted: 23 Aug 2024, 11:21:33 UTC - in response to Message 61698. "Re: the Apps page: https://www.gpugrid.net/apps.php I wish, for consistency, it would state: ATMML: Free energy with neural networks for GPU. Also, when selecting projects in project preferences, it would be nice if it stated: ATMML on GPU." This is GPUgrid. All tasks are for GPU.

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 61700 - Posted: 23 Aug 2024, 12:41:38 UTC - in response to Message 61697. "Despite this, there is a noticeable increase in the number of users returning ATMML results." Indeed. But the question is: are those completed, end-of-run, scientifically useful results - or are they early crashes, resulting only in the creation and issue of another replica, to take its place in the 'in progress' count? We can't tell from the outside. But runtimes starting at 0.04 hours don't look too good.

Steve Volunteer moderator Project administrator Project developer Project tester Volunteer developer Volunteer tester Project scientist Joined: 21 Dec 23 Posts: 51 Credit: 0 RAC: 0	Message 61701 - Posted: 23 Aug 2024, 12:57:06 UTC - in response to Message 61700. Last modified: 23 Aug 2024, 12:59:22 UTC Hi, the Windows hosts are working successfully. There are more errors than on Linux, as expected, but plenty are working well. Unfortunately, some WUs with the "very short run time but validated status" bug are still in circulation. (Each WU runs in a chain of 5 steps; when a step finishes, it launches a new job with the same settings.) New WUs do not have this bug. This is the bug I am talking about: https://www.gpugrid.net/forum_thread.php?id=5468&nowrap=true#61682

WPrion Joined: 30 Apr 13 Posts: 109 Credit: 3,977,737,860 RAC: 6,051	Message 61702 - Posted: 23 Aug 2024, 16:55:08 UTC - in response to Message 61696. "* small cache, especially with slower GPUs." Which cache? Where is it set?? What should it be set at???

WPrion Joined: 30 Apr 13 Posts: 109 Credit: 3,977,737,860 RAC: 6,051	Message 61703 - Posted: 23 Aug 2024, 17:03:59 UTC I just started ATMML yesterday. Out of seven starts, only one completed. The rest errored out after 1-1.5 hours. Windows 11 / RTX 4090. I'd like to get some actual work done...

Ian&Steve C. Joined: 21 Feb 20 Posts: 1117 Credit: 40,876,970,595 RAC: 0	Message 61704 - Posted: 23 Aug 2024, 17:19:31 UTC - in response to Message 61702. "* small cache, especially with slower GPUs." "Which cache? Where is it set?? What should it be set at???" He's talking about the work cache on the host. You can (kind of) control that in the BOINC Manager Options->"Computing Preferences" menu. Set it to something less than 1 day, probably. You'll be limited to 4 tasks from the project (per GPU) anyway.

Farscape Joined: 1 Feb 09 Posts: 6 Credit: 1,937,116,460 RAC: 0	Message 61705 - Posted: 23 Aug 2024, 17:30:31 UTC The Windows tasks ARE NOT working as advertised.... On two 3090 Ti computers and one 3090, 11 work units have errored out after 2-4 hours of run time. Previous successful task run times were between 17000 and 18500 seconds. Errored tasks ran for 5000-8500 seconds. I am killing the app in preferences until it sorts itself out....

WPrion Joined: 30 Apr 13 Posts: 109 Credit: 3,977,737,860 RAC: 6,051	Message 61706 - Posted: 23 Aug 2024, 18:22:30 UTC - in response to Message 61705. Thanks. There are too many caches out there. Let's call this the work queue.

zombie67 [MM] Joined: 16 Jul 07 Posts: 209 Credit: 6,054,860,456 RAC: 15,024	Message 61707 - Posted: 23 Aug 2024, 20:14:08 UTC - in response to Message 61705. "The Windows tasks ARE NOT working as advertised.... On two 3090 Ti computers and one 3090, 11 work units have errored out after 2-4 hours of run time. Previous successful task run times were between 17000 and 18500 seconds. Errored tasks ran for 5000-8500 seconds. I am killing the app in preferences until it sorts itself out...." All 8 of the 8 tasks I have completed and returned were also categorized as errors. This is on Win10 with 4080 and 4090 GPUs. Here is a sample: http://www.gpugrid.net/result.php?resultid=35743812 Reno, NV Team: SETI.USA

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 61708 - Posted: 23 Aug 2024, 21:09:54 UTC - in response to Message 61707. "Exit status 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED" That's going to be a difficult one to overcome unless the project addresses its job estimation. You need to 'complete' (which includes a successful finish plus validation) 11 tasks before the estimates are normalised - and if every task fails, you'll never get there.

Steve Volunteer moderator Project administrator Project developer Project tester Volunteer developer Volunteer tester Project scientist Joined: 21 Dec 23 Posts: 51 Credit: 0 RAC: 0	Message 61709 - Posted: 24 Aug 2024, 8:17:34 UTC - in response to Message 61708. Hello. I apologise for the time limit exceeded errors. I did not expect this. The jobs run for the same time as the Linux ones, which have all been working, so I don't really understand what is happening. Unfortunately, the way BOINC deals with "runtime" is completely inadequate for GPU projects. In a WU we have to estimate the flops use, which is a difficult thing to do for a GPU app. The BOINC client then somehow estimates the flops performance of your computer in a way I don't understand. I cannot simply put a runtime limit of x hours, as would be typical. Does anyone know where the denominator comes from in this line?: <message> exceeded elapsed time limit 5454.20 (10000000000.00G/1712015.37G)</message> <stderr_txt> The numerator, I believe, is the fpops_bound that is set in the WU template, which is controlled by us.
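The numbers in that message can be pulled apart and divided to see what limit they imply. The sketch below is only arithmetic on the quoted line: the regular expression and function name are illustrative, and reading the first figure as a limit in seconds, the numerator as fpops_bound, and the denominator as a speed estimate are assumptions based on the posts in this thread rather than on the BOINC source.

```python
import re

# The line quoted in the post, without the surrounding tags.
MESSAGE = "exceeded elapsed time limit 5454.20 (10000000000.00G/1712015.37G)"


def parse_time_limit_message(message: str) -> tuple:
    """Return (printed_limit, numerator_g, denominator_g) as floats."""
    pattern = r"exceeded elapsed time limit ([\d.]+) \(([\d.]+)G/([\d.]+)G\)"
    match = re.search(pattern, message)
    if match is None:
        raise ValueError("message does not have the expected shape")
    return tuple(float(group) for group in match.groups())


printed_limit, numerator_g, denominator_g = parse_time_limit_message(MESSAGE)

# The "G" suffix appears on both terms, so it cancels in the ratio.
ratio_s = numerator_g / denominator_g

print(f"printed limit : {printed_limit:.2f} s ({printed_limit / 3600:.2f} h)")
print(f"numerator/denominator : {ratio_s:.2f} s ({ratio_s / 3600:.2f} h)")
```

The plain ratio comes out at roughly 5841 seconds, which is close to but not the same as the printed 5454.20; nothing in the thread explains that gap, so it is left open here. Either figure is around an hour and a half, far below the 17000-18500 second run times reported above for successful tasks.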

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 61710 - Posted: 24 Aug 2024, 9:16:26 UTC - in response to Message 61709. Last modified: 24 Aug 2024, 9:18:42 UTC "Does anyone know where the denominator comes from in this line?: <message> exceeded elapsed time limit 5454.20 (10000000000.00G/1712015.37G)</message> <stderr_txt> The numerator, I believe, is the fpops_bound that is set in the WU template, which is controlled by us." Yes. It's the current estimated speed for the task, which should be 'learned' by BOINC for the individual computer running this particular task type ('app_version'). It's a complex three-stage process, and unfortunately it doesn't go down to the granularity of individual GPU types - all GPUs are considered equal.
1) When a new app version is created, the server will set a first, initial, value for GPU speeds for that version. I'm afraid I don't know how that initial value is estimated, but I'll try to find out.
2) Once the app version is up and running, the server monitors the runtime of the successful tasks returned. That's done at both the project level, and the individual host level. The first critical point is probably when the project has received 100 results: the calculated average speed from those 100 is used to set the expected speed for all tasks issued from that point forward. [aside - 'obviously' the first results received will be from the fastest machines, so that value is skewed]
3) Later, as each individual host reports tasks, once 11 successful tasks have been returned, future tasks assigned to that host are assigned the running average value for that host.
The current speed estimate ('fpops_est') can be seen in the application_details page for each host. zombie67 hasn't completed an ATMML task yet, so no 'Average processing rate' for his machine is shown yet for ATMML (at the bottom), but you can see it for other task types. Phew. That's probably more than enough for now, so I'll leave you to digest it.
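The three stages described above can be sketched as a simple selection rule. This is a reading of the post, not BOINC's server code: the function names, the GFLOPS figures, and the choice to treat "100 results" and "11 successful tasks" as inclusive thresholds are all assumptions made for illustration.

```python
from typing import Optional, Tuple

PROJECT_RESULTS_NEEDED = 100  # "when the project has received 100 results"
HOST_RESULTS_NEEDED = 11      # "once 11 successful tasks have been returned"


def pick_speed_estimate(
    initial_gflops: float,
    project_avg_gflops: Optional[float],
    project_results: int,
    host_avg_gflops: Optional[float],
    host_results: int,
) -> Tuple[str, float]:
    """Choose which speed estimate applies, following the three stages."""
    # Stage 3: the host has enough successful results of its own.
    if host_avg_gflops is not None and host_results >= HOST_RESULTS_NEEDED:
        return "host", host_avg_gflops
    # Stage 2: the project-wide average is available.
    if project_avg_gflops is not None and project_results >= PROJECT_RESULTS_NEEDED:
        return "project", project_avg_gflops
    # Stage 1: fall back to the initial value set for the app version.
    return "initial", initial_gflops


def elapsed_time_limit_s(fpops_bound_gflop: float, speed_gflops: float) -> float:
    """Time limit as bound divided by estimated speed (units cancel to seconds)."""
    return fpops_bound_gflop / speed_gflops


# Hypothetical host with only 3 successful results: it stays on the project
# average, which is skewed towards the fastest machines.
source, speed = pick_speed_estimate(1000.0, 2000.0, 100, 500.0, 3)
print(source, elapsed_time_limit_s(10000.0, speed))
```

With these made-up numbers the host is held to a 5-second limit from the project average, whereas its own rate would allow 20 seconds. That is the trap described in Message 61708: a host whose tasks all hit the limit never reaches the 11 successes needed to switch to its own estimate.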

wujj123456 Joined: 9 Jun 10 Posts: 19 Credit: 2,233,932,323 RAC: 0	Message 61711 - Posted: 24 Aug 2024, 20:16:04 UTC - in response to Message 61710. I'm curious: why do we even bother to intentionally error out a task based on runtime at all? Usually a wrong runtime estimate just messes with local client scheduling a bit, but tasks finish fine eventually. It's not like GPUGrid had accurate runtime estimation before, but previous tasks didn't fail. Does this batch/app have a bug that could cause it to get stuck computing forever, which is why we need additional protection to abort tasks after a certain runtime?