ATMML
| Author | Message |
|---|---|
|
Joined: 13 Jan 24 Posts: 2 Credit: 39,763,706 RAC: 0
|
Looks like it finally started and ran for a few minutes, then uploaded... Richard |
Opolis Joined: 19 Feb 12 Posts: 3 Credit: 1,508,126,091 RAC: 11
|
These tasks are running fine for me so far. The only thing I noticed was that the points awarded seem off. The second task I completed took an hour longer than the first but received 900k fewer points. So far they have been taking 5-6 hours on a 3080ti, driver version 535.183.01. Ah you are correct. I had the one task stuck in "downloading" for a while and I didn't run it until the next day. |
|
Joined: 30 Apr 13 Posts: 106 Credit: 3,805,237,860 RAC: 65
|
Are there no checkpoints on ATMML tasks? I was about 30% complete when I had to suspend the task and shut down the computer. When I restarted, both the % done and the elapsed time were zero. |
|
Joined: 28 Mar 09 Posts: 490 Credit: 11,731,645,728 RAC: 69
|
"Are there no checkpoints on ATMML tasks?" No, there are not. The same goes for quantum chemistry and ATM. They haven't figured out how to do it yet. |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
I hope this doesn't backfire. This morning I see 800 tasks in progress, but zero ready to send. My last two downloads have been replica _3 tasks, each WU having failed on three Windows machines first. I do hope new Windows users pay attention to the 'tricks of the trade' we've learned over the years: * small cache, especially with slower GPUs. * run continuously, don't allow interruptions (especially auto-updates) * don't swap to a different GPU type mid-run |
ServicEnginIC Joined: 24 Sep 10 Posts: 592 Credit: 11,972,186,510 RAC: 1,447
|
"I do hope new Windows users pay attention to the 'tricks of the trade' we've learned over the years:" Thank you for always sharing your expertise. "My last two downloads have been replica _3 tasks, each WU having failed on three Windows machines first." Despite this, there is a noticeable increase in the number of users returning ATMML results, likely because Windows users have now been added to the existing Linux ones. Before the new Windows ATMML app was released, users/24h was consistently about 80-100. It is currently more than 230, as can be seen on the Server status page. |
|
Joined: 30 Apr 13 Posts: 106 Credit: 3,805,237,860 RAC: 65
|
Re: the Apps Page: https://www.gpugrid.net/apps.php I wish, for consistency, it would state: "ATMML: Free energy with neural networks for GPU". Also, when selecting projects in project preferences, it would be nice if it stated: "ATMML on GPU". |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
"Re: the Apps Page: https://www.gpugrid.net/apps.php" This is GPUGrid. All tasks are for GPU.
|
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
"Despite this, there is a noticeable increase in the number of users returning ATMML results." Indeed. But the question is: are those completed, end-of-run, scientifically useful results - or are they early crashes, resulting only in the creation and issue of another replica, to take its place in the 'in progress' count? We can't tell from the outside. But runtimes starting at 0.04 hours don't look too good. |
|
Joined: 21 Dec 23 Posts: 51 Credit: 0 RAC: 0 |
Hi, the Windows hosts are working successfully. There are more errors than on Linux, as expected, but plenty are working well. Unfortunately, some WUs with the "very short run time but validated status" bug are still in circulation. (Each WU runs in a chain of 5 steps; when a step finishes, it launches a new job with the same settings.) New WUs do not have this bug. This is the bug I am talking about: https://www.gpugrid.net/forum_thread.php?id=5468&nowrap=true#61682 |
|
Joined: 30 Apr 13 Posts: 106 Credit: 3,805,237,860 RAC: 65
|
Which cache? Where is it set?? What should it be set at??? |
|
Joined: 30 Apr 13 Posts: 106 Credit: 3,805,237,860 RAC: 65
|
I just started ATMML yesterday. Out of seven starts only one completed. The rest errored-out after 1-1.5 hours. Windows11/RTX4090. I'd like to get some actual work done... |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
He's talking about the work cache on the host. You can (kind of) control that in the BOINC Manager Options->"Computing Preferences" menu. Set it to something less than 1 day, probably. You'll be limited to 4 tasks from the project (per GPU) anyway.
|
Farscape Joined: 1 Feb 09 Posts: 6 Credit: 1,937,116,460 RAC: 0
|
The Windows tasks ARE NOT working as advertised.... On two 3090ti computers and one 3090, 11 work units have errored out after between 2 and 4 hours of run time. Previous successful task run times were between 17000 and 18500 seconds. Errored tasks are 5000-8500 seconds. I am killing the app in preferences until it sorts itself out.... |
|
Joined: 30 Apr 13 Posts: 106 Credit: 3,805,237,860 RAC: 65
|
Thanks. There are too many caches out there. Let's call this the work queue. |
|
Joined: 16 Jul 07 Posts: 209 Credit: 5,496,860,456 RAC: 12,111
|
"The Windows tasks ARE NOT working as advertised...." All 8 of the 8 tasks I have completed and returned were also categorized as errors. This is on Win10 with 4080 and 4090 GPUs. Here is a sample: http://www.gpugrid.net/result.php?resultid=35743812 Reno, NV Team: SETI.USA |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
Exit status 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED. That's going to be a difficult one to overcome unless the project addresses its job estimation. You need to 'complete' (which includes a successful finish plus validation) 11 tasks before the estimates are normalised - and if every task fails, you'll never get there. |
|
Joined: 21 Dec 23 Posts: 51 Credit: 0 RAC: 0 |
Hello. I apologise for the time limit exceeded errors. I did not expect this. The jobs run for the same time as the Linux ones, which have all been working, so I don't really understand what is happening. Unfortunately, the way BOINC deals with "runtime" is completely inadequate for GPU projects. In a WU we have to estimate the flop use, which is a difficult thing to do for a GPU app. The BOINC client then somehow estimates the flops performance of your computer in a way I don't understand. I cannot simply put a runtime limit of x hours, as would be typical. Does anyone know where the denominator comes from in this line?: <message> exceeded elapsed time limit 5454.20 (10000000000.00G/1712015.37G)</message> <stderr_txt> The numerator, I believe, is the fpops_bound that is set in the WU template, which is controlled by us. |
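
For readers following the arithmetic in the post above, here is a minimal sketch. It assumes, as the post suggests but this thread does not confirm, that the limit is simply the numerator divided by the denominator. The function name `elapsed_time_limit` and the constant names are hypothetical; only the two figures are taken from the quoted log line.

```python
def elapsed_time_limit(fpops_bound_g: float, speed_g: float) -> float:
    """Return the runtime limit in seconds implied by a bound and a speed.

    Assumption (not confirmed in the thread): limit = fpops_bound / speed.
    Both arguments use the 'G' units shown in the log message.
    """
    if speed_g <= 0:
        raise ValueError("speed estimate must be positive")
    return fpops_bound_g / speed_g


# Figures copied from the bracketed part of the quoted <message> line.
FPOPS_BOUND_G = 10_000_000_000.00  # numerator
SPEED_G = 1_712_015.37             # denominator (the estimated speed)

limit_s = elapsed_time_limit(FPOPS_BOUND_G, SPEED_G)
print(f"implied limit: {limit_s:.0f} s ({limit_s / 3600:.2f} h)")
```

With these two figures the quotient comes to a little over an and a half hours. That is close to, though not identical with, the 5454.20 seconds printed in the message, and the thread does not explain the gap. Either way it is well short of the 17000-18500 seconds reported earlier for successful runs, which fits tasks being aborted partway through.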
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
"Does anyone know where the denominator comes from in this line?:" Yes. It's the current estimated speed for the task, which should be 'learned' by BOINC for the individual computer running this particular task type ('app_version'). It's a complex three-stage process, and unfortunately it doesn't go down to the granularity of individual GPU types - all GPUs are considered equal. 1) When a new app version is created, the server will set a first, initial value for GPU speeds for that version. I'm afraid I don't know how that initial value is estimated, but I'll try to find out. 2) Once the app version is up and running, the server monitors the runtime of the successful tasks returned. That's done at both the project level and the individual host level. The first critical point is probably when the project has received 100 results: the calculated average speed from those 100 is used to set the expected speed for all tasks issued from that point forward. [aside - 'obviously' the first results received will be from the fastest machines, so that value is skewed] 3) Later, as each individual host reports tasks, once 11 successful tasks have been returned, future tasks assigned to that host are assigned the running average value for that host. The current speed estimate ('fpops_est') can be seen in the application_details page for each host. zombie67 hasn't completed an ATMML task yet, so no 'Average processing rate' for his machine is shown yet for ATMML (at the bottom), but you can see it for other task types. Phew. That's probably more than enough for now, so I'll leave you to digest it. |
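
The three stages described in the post above can be sketched roughly as follows. This is an illustration of the description only, not BOINC's server code: the class `SpeedStats`, the function `estimated_speed`, and all field names are hypothetical, and the only values taken from the post are the thresholds of 100 project results and 11 successful host tasks.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds as stated in the post; everything else here is illustrative.
PROJECT_RESULTS_NEEDED = 100
HOST_TASKS_NEEDED = 11


@dataclass
class SpeedStats:
    initial_flops: float                       # stage 1: server's first value
    project_results: int = 0                   # results received project-wide
    project_avg_flops: Optional[float] = None  # stage 2 average, once known
    host_successes: int = 0                    # successful tasks from this host
    host_avg_flops: Optional[float] = None     # stage 3 running average


def estimated_speed(stats: SpeedStats) -> float:
    """Pick the speed estimate following the three stages in the post."""
    if stats.host_successes >= HOST_TASKS_NEEDED and stats.host_avg_flops is not None:
        return stats.host_avg_flops            # stage 3: per-host average
    if stats.project_results >= PROJECT_RESULTS_NEEDED and stats.project_avg_flops is not None:
        return stats.project_avg_flops         # stage 2: project-wide average
    return stats.initial_flops                 # stage 1: initial value


# Made-up numbers: a host with only 3 successes still gets the project average.
demo = SpeedStats(initial_flops=1.0e12, project_results=250,
                  project_avg_flops=1.7e15, host_successes=3,
                  host_avg_flops=4.0e14)
print(estimated_speed(demo))
```

The sketch also shows the trap raised earlier in the thread: a host whose tasks keep being aborted never reaches 11 successes, so it stays on the project-wide figure, which the post notes is skewed towards the fastest machines.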
|
Joined: 9 Jun 10 Posts: 19 Credit: 2,233,932,323 RAC: 0
|
I'm curious: why do we even bother to intentionally error out a task based on runtime at all? Usually a wrong runtime estimate just messes with local client scheduling a bit, but tasks finish fine eventually. It's not like GPUGrid had accurate runtime estimation before, but previous tasks didn't fail. Does this batch/app have a bug that could cause it to get stuck computing forever, which is why we need additional protection to abort tasks after a certain runtime? |
©2025 Universitat Pompeu Fabra