Message boards : Number crunching : ATMML
| Author | Message |
|---|---|
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 731
|
Generally, if I know I must reboot the system in the near future, I will just wait until the current tasks are finished, or reboot shortly after a new task starts, so I don't begrudge the little time it has already spent crunching and will have to restart after the reboot. It is generally safe to stop a task soon after it starts because, with the exception of the acemd tasks, all the other task types spend several minutes unpacking the python environment in the slot directory and haven't actually started calculating anything yet. I have found you can get away with interrupting that startup process with a reboot, and you won't throw away the task or error it out. |
|
Joined: 14 Feb 20 Posts: 16 Credit: 27,395,983 RAC: 0
|
The time bonus system has been in place with GPUGrid for years. (And yes, the GG tasks download several GB of data [WHY? well, another issue], and the download time does count against the deadline.) BUT the points awarded are nonetheless, shall one say, unfathomable.

Case in point: ATMML has a very, very high failure rate [yet another issue, AND an important one], and when completed it usually awards 300,000 points, at least on my NVIDIA, which is better in some ways than this guy's. HOWEVER, host 621740 has had seven successful ATMML tasks (see below) in the last six days, with EACH being awarded 2,700,000 points. SO, what gives? WHY a 9-fold difference?

WUid - other task results:
29271283 - 1 error, 1 abort
29270516 - 3 errors
29265238 - 1 error, 1 abort
29204796 - 1 time out, 5 errors (1 of these after 50,905 sec = 14+ hrs)
29268456 - n/a
29267692 - 3 errors
29267146 - 2 errors

Host 621740's specs:
GenuineIntel Intel(R) Core(TM) i7 CPU 930 @ 2.80GHz [Family 6 Model 26 Stepping 5]
Number of processors: 8
Coprocessors: NVIDIA GeForce RTX 3060 (12287MB), driver: 560.81
Operating System: Microsoft Windows 10 Professional
Memory: 12279.11 MB
Cache: 256 KB

My NVIDIA has:
Memory: 16316.07 MB
Cache: 512 KB
PLUS Swap space 45668.07 MB and Total disk space 464.49 GB

Lasslo P, PhD, Prof Engr. |
|
Joined: 14 Feb 20 Posts: 16 Credit: 27,395,983 RAC: 0
|
Good point, but winDoze update does not make it easy to avoid its "decision" about updating and the time to restart my system. (Don't you hate it when big tech is so much more brilliant and all-knowing than you?) I turn OFF updates for the 5 weeks maximum allowed, then, as the month ends, I pick the time when I will download and install OS updates. Even then, I set the "active hours" to the times my PC is LEAST likely to be in use, usually including late PM to early AM. Lasslo P, PhD, Prof Engr |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
|
you're comparing multiple levels of apples vs oranges.

a GTX 1660 Ti is in no way better than an RTX 3060: it's an older generation, has less than half the CUDA cores, no tensor cores (which ATMML will use), a slower clock speed and slower memory speed. truly, basically every performance metric favors the 3060.

the task you completed for 300,000 cr was ACEMD3, not ATMML, and you also need to consider that ATMML tasks run much longer than ACEMD3 and use more resources, so the higher credit reward is appropriate. your one ACEMD3 task, if you were to crunch 24/7, would come out to a production of a little over 1,000,000 points per day. your competitor also completed one ACEMD3 task recently, and scaling that to 24/7 production comes out to around 1,500,000 points per day. it takes them about 4x longer to run ATMML, and based on their recent production, including the failure, that is about 3,600,000 PPD.

the project admins have resolved a lot of the problems with ATMML. if you want to have better success with this project, and GPUGRID in general, you can consider switching to Linux. otherwise, maybe investigate what's going wrong with your system to cause the failures. it looks like a permissions issue to me, since your failed tasks have a bunch of "Access is denied" errors in the WU log, possibly an overzealous AV software. that could be the reason for your download errors also, or just spotty internet.
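For anyone checking the arithmetic, the points-per-day comparison above is just per-task credit scaled to a 24-hour day. A minimal sketch of that scaling - the credit values come from the posts, but the runtimes below are hypothetical placeholders, not the actual times from either host:

```python
# Points-per-day scaling used in the comparison above.
# Credit values are from the posts; the runtimes are hypothetical examples.
SECONDS_PER_DAY = 86_400

def points_per_day(credit_per_task, runtime_seconds):
    return credit_per_task * SECONDS_PER_DAY / runtime_seconds

print(points_per_day(300_000, 25_000))    # ~1,036,800 PPD for a 300k-credit task in ~7 hours
print(points_per_day(2_700_000, 64_800))  # ~3,600,000 PPD for a 2.7M-credit task in ~18 hours
```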
|
Life under progressive, coerci... Joined: 7 Feb 12 Posts: 5 Credit: 334,557,283 RAC: 0
|
"a GTX 1660 Ti is in no way better"
"in no way" is an absolute statement, and is false: my NVIDIA has 33% more memory and double the cache. But, admittedly, it is not in general as powerful. |
Life under progressive, coerci... Joined: 7 Feb 12 Posts: 5 Credit: 334,557,283 RAC: 0
|
"maybe investigate what's going wrong with your system to cause the failures."
How bizarre... Batting 1000: the GPUGrid tasks which fail on my system have ALSO failed on several, perhaps even 6 or 7, other systems (except when I take a newly issued task, and when I check those later they also fail after bombing on my system). So, if the problem is on my end and not in any way on GPUGrid's end, then there must be dozens and dozens (and dozens) of other systems which apparently need to "investigate what's going wrong" with them...

"that could be the reason for your download errors also"
I have no "download errors" except when I abort the download of a task which has already had repeated compute errors. GPUGrid needs 8 failures before figuring out that a workunit has 'too many errors (may have bug)'. If I can, I'd rather give them this insight before I waste 5-10 minutes of time on my GPU, such as it is. Anyway, thanks for your feedback.

Oh, and by the way, I run 12-13 other projects, including at least three others where I run GPU tasks. This very high task error rate is NOT an issue whatsoever with any of them.

LLP, PhD, Prof Engr |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
|
"a GTX 1660 Ti is in no way better"
read up. my "absolute" statement is correct. your 1660Ti has half the memory of a 3060. your 1660Ti also has half the cache of the 3060.

GTX 1660 Ti: 6GB
RTX 3060: 12GB (you can see this in the host details you referenced)

not sure where you're getting your information, but it's wrong or missing context (like comparing to a laptop GPU or something)
|
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
|
"maybe investigate what's going wrong with your system to cause the failures."
there are more people running Windows, so there's a higher probability for resends to land on another problematic Windows host. it's more common for Windows users to be running AV software, and it's common for Windows users to have issues with BOINC projects and AV software. not hard to imagine that these factors mean a large number of people would have problems when they all come into play.

check your AV settings, whitelist the BOINC data directories, and try again.
|
Life under progressive, coerci... Joined: 7 Feb 12 Posts: 5 Credit: 334,557,283 RAC: 0
|
"your 1660Ti has half the memory of a 3060."
My information is from GPUGrid's host information, https://gpugrid.net/show_host_detail.php?hostid=613323, which states 16GB, but this may be unreliable, as TechPowerUp GPU-Z does give the NVIDIA site's number of 6GB. My numbers for the cache also come from gpugrid.net/show_host_detail.php, as indeed do all the memory figures in my original post, so I guess my mistake was trusting the gpugrid.net/show_host_detail info. And no, this is not a laptop, but a 12-core desktop. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
|
"your 1660Ti has half the memory of a 3060."
TPU is not out of date, and is probably one of the most reliable databases for GPU (and other) specifications.

there lies the issue: you're looking at system memory, not the GPU memory. system memory has little to do with GPUGRID tasks, which run on the GPU and not the CPU. at all BOINC projects, the GPU VRAM is listed in parentheses next to the GPU model name on the Coprocessors line. for further context, there was a long-standing bug in BOINC versions older than about 7.18 that capped the Nvidia memory reported (not the actual memory) to only 4GB, so old versions were wrong in what they reported for a long time.

so still, the 3060 beats the 1660Ti in every metric. you just happen to have populated more system memory on the motherboard, but that has nothing to do with comparing the GPUs themselves.
|
Life under progressive, coerci... Joined: 7 Feb 12 Posts: 5 Credit: 334,557,283 RAC: 0
|
"it's common for Windows users to have issues with BOINC projects and AV software"
Again, I run 12-13 other projects, including at least three others where I run GPU tasks. I have a zero error rate on the other projects. But I do appreciate your suggestion, as I like the science behind GPUGrid and would very much like to RUN tasks rather than have them error out. I have searched PC settings and Control Panel settings, as well as file options, for "AV" and do not get any relevant hits. Could you please elaborate on what you mean by AV settings and whitelisting the BOINC directories? Thanks. LLP, PhD, Prof Engr. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
|
AV = Anti Virus software.
|
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
Switching back to BOINC software and (specifically) ATMML tasks. I've posted extensively in this thread about the problems of task duration estimation at this project. I've got some new data, which I can't explain.

Last week, I added a new Linux host (host 625407). It's a pretty plain vanilla Intel i5 with a single RTX 3060 - should be fairly average for this project. It's completed 17 tasks so far, with number 18 in progress - around 3 per day. I attached it to a venue with only ATMML tasks allowed. Given my interest in BOINC's server-side task duration estimation for GPUs, I've been logging the stats. Here's what I've got so far:

| Task number | rsc_fpops_est | rsc_fpops_bound | flops | DCF | Runtime estimate |
|---|---|---|---|---|---|
| 1 | 1E+18 | 1E+21 | 20,146,625,396,909 | 1.0000 | 49636 s (13.79 hours) |
| 2 | | | | | |
| 3 | | | | | |
| 4 | 1E+18 | 1E+21 | 20,218,746,342,900 | 0.8351 | 41301 s (11.47 hours) |
| 5 | | | | | |
| 6 | | | | | |
| 7 | 1E+18 | 1E+21 | 19,777,581,461,665 | 0.9931 | 50214 s (13.95 hours) |
| 8 | | | | | |
| 9 | | | | | |
| 10 | 1E+18 | 1E+21 | 19,446,193,249,403 | 0.8926 | 45900 s (12.75 hours) |
| 11 | 1E+18 | 1E+21 | 19,506,082,146,580 | 0.8247 | 42279 s (11.74 hours) |
| 12 | 1E+18 | 1E+21 | 19,522,515,301,144 | 0.7661 | 39242 s (10.90 hours) |
| 13 | | | | | |
| 14 | 1E+18 | 1E+20 | 99,825,140,137 | 0.7585 | 7598256 s (87.94 days) |
| 15 | | | | | |
| 16 | 1E+18 | 1E+21 | 99,825,140,137 | 0.7360 | 7373243 s (85.34 days) |
| 17 | 1E+18 | 1E+21 | 99,825,140,137 | 0.7287 | 7300045 s (84.49 days) |
| 18 | 1E+18 | 1E+21 | 99,825,140,137 | 0.7215 | 7227478 s (83.65 days) |

My understanding of the BOINC server code is that, for a mature app_version (Linux ATMML has been around for 2 months), the initial estimates should be based on the average speed of the tasks so far across the project as a whole. So it seems reasonable that the initial estimates were for 10-12 hours - that's about what I expected for this GPU. Then, after the first 11 tasks have been reported successful, it should switch to the average for this specific host. So why does it appear that this particular host is reporting a speed something like 1/200th of the project average? So now, it's frantically attempting to compensate by driving my DCF through the floor, as with my two older machines.

The absolute values are interesting too. The initial (project-wide) flops estimates are hovering around 20,000 GFlops - does that sound right, for those who know the hardware in detail? And they are fluctuating a bit, as might be expected for an average with variable task durations for completions. After the transition, my card dropped to below 100 GFlops - and has remained rock-steady. That's not in the script. The APR for the card (which should match the flops figure for allocated tasks) is 35599.725995644 GFlops - which doesn't match any of the figures above.

Where does this take us? And what, if anything, can we do about it? I'll try to get my head round the published BOINC server code on GitHub, but this area is notoriously complex. And the likelihood is that the current code differs to a greater or lesser extent from the code in use at this project. I invite others of similarly inquisitive mind to join in with suggestions. |
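For what it's worth, the logged estimates look internally consistent with a simple relationship - this is my reading of how the estimate is formed, not something taken from the server code: estimated runtime ≈ DCF × rsc_fpops_est / flops. A minimal check against a few rows of the table:

```python
# Sanity check: the runtime estimates above appear to follow
#   estimate_secs = DCF * rsc_fpops_est / flops
# (an assumption about the formula; the figures below are copied from the table)
rows = [
    # (task, rsc_fpops_est, flops, DCF, logged_estimate_secs)
    (1,  1e18, 20_146_625_396_909, 1.0000, 49_636),
    (4,  1e18, 20_218_746_342_900, 0.8351, 41_301),
    (14, 1e18,     99_825_140_137, 0.7585, 7_598_256),
]

for task, fpops_est, flops, dcf, logged in rows:
    calc = dcf * fpops_est / flops
    print(f"task {task}: calculated {calc:,.0f} s vs logged {logged:,} s")

# task 1:  ~49,636 s    vs 49,636 s
# task 4:  ~41,304 s    vs 41,301 s
# task 14: ~7,598,287 s vs 7,598,256 s
```

If that relationship holds, the jump from hours to days is driven almost entirely by the flops figure collapsing from roughly 20 TFlops to below 100 GFlops; the falling DCF only trims the result.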
|
Joined: 8 Oct 16 Posts: 27 Credit: 4,153,801,869 RAC: 0
|
I didn't run ATMML, but I'm currently running Qchem on a Tesla P100 with short run times (averaging somewhere around 12 minutes or so per task). I see a similar behavior/pattern when starting a new instance. If I were to guess, your DCF will eventually go down from the last value of 0.7215 to 0.01 after running 100+ tasks, and your final estimated run time could be about 1.16 days, which is still higher than the average expected run time for your card. However, if you run the CPU benchmark, the DCF number will go up from 0.01 to something higher, and it will take another 100+ tasks for the DCF to go down to 0.01 again, but this time the estimated run time will go below 1.16 days. I didn't take any notes, just making an observation, so I could be wrong. My wild guess is that when running a GPU-only task there is an associated CPU % required to run that GPU task, and running the benchmark takes care of the CPU portion needed for the GPU task. My one cent. |
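As a rough illustration of why that takes on the order of a hundred tasks, here is a toy model of a DCF drifting toward its 0.01 floor. The update rule and the 0.1 step size are assumptions for the sketch only (the client raises DCF quickly on overruns and lowers it slowly, but the exact constants are not checked here):

```python
# Toy model only: how a duration correction factor might creep toward its floor
# when tasks keep finishing at ~1% of their estimated runtime.
# The 10% step per task is an assumed constant, not taken from the BOINC client.
dcf = 0.7215            # last logged value from the table above
observed_ratio = 0.01   # actual runtime / estimated runtime
floor = 0.01            # lower clamp on DCF

tasks = 0
while dcf > 0.02:
    dcf = max(floor, dcf + 0.1 * (observed_ratio - dcf))  # slow downward drift
    tasks += 1

print(f"~{tasks} tasks for DCF to approach its floor")
# With a smaller per-task step, the same curve stretches out to the 100+ tasks
# described above.
```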
|
Joined: 21 Dec 23 Posts: 51 Credit: 0 RAC: 0 |
This is very interesting, thank you for the numbers. I still don't understand where the flops number for a machine comes from. Does it use the data of your hardware? Or is it purely based on maths done from the rsc_fpops_est number we have set and the time taken for WUs?

I am also unsure how I would set this rsc_fpops_est number to be more accurate. Given one of these WUs takes maybe an hour on a 4090: a 4090 is 80 TFLOPS, x 1 hour = ~3x10^17 floating point operations, which is not actually that far off the estimated value of 1x10^18. Of course the WUs will not be using all the TFLOPS of the 4090, and there is no sane way for me to calculate the number of floating point operations the program actually uses. |
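A quick back-of-the-envelope version of that arithmetic (the 80 TFLOPS peak and the one-hour runtime are the assumptions from the post above; real tasks won't sustain anywhere near peak):

```python
# Rough check of the flop-count estimate above.
peak_flops = 80e12      # ~80 TFLOPS quoted for a 4090
runtime_s = 3600        # roughly one hour per WU
total_ops = peak_flops * runtime_s

print(f"{total_ops:.2e} floating point operations")               # 2.88e+17, i.e. ~3e17
print(f"fraction of rsc_fpops_est (1e18): {total_ops/1e18:.0%}")  # ~29%
```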
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
Well, I said that this is going to be difficult...

My knowledge and understanding comes from already being an active volunteer at SETI@home on 18 Dec 2008, when BOINC was first announced as being able to manage and use CUDA on GPUs for scientific computing - GPUGrid is also mentioned as becoming CUDA-enabled on the same day. We spent the following months and years knocking the rough edges off the initial launch code. I think the names and features I was referring to in my post were introduced in a sort-of relaunch around 2011. That's still a long way back in the memory bank, and that makes it difficult to find precise references in code or documentation.

My understanding is that the system was designed to be as easy as possible for researchers to implement. I believe the only key information required is rsc_fpops_est - the estimated size of the task. From your comments on the 4090, and my logging of the early tasks, I think we can accept the current figure as being 'near enough right', and that's all it needs to be.

I think that the flops value is - since 2011-ish - reverse-engineered from your estimate of fpops_est and the measured runtime of the task on the volunteer's hardware. Pre-2011, BOINC took more notice of the 'peak flops' calculated by our computers from the speed and internal geometry of the GPU in use. BOINC guessed a 'fiddle factor' - I think something like 5% - as the ratio between the maximum usable speed on real-life jobs and the calculated peak speed. But that was abandoned, except possibly for use as an initial seeding value.

Once the real-life data is available from the initial tasks run on a new computer, the server should maintain a running average value of flops for each computer attached to the project. That should wobble with small changes in actual task duration, which is why I was surprised to see it remain identical to the last significant digit in my run so far.

All the data necessary to calculate flops is returned as each task is reported complete. It's stored in the result table on the server, and should be transferred/averaged to the host table. I should be able to point you to the current code for managing that transfer, and the variable names and db field names used - though I may not be able to post them until after the weekend. Perhaps we could compare notes once I've found them? |
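To make the two regimes concrete, here is a minimal sketch of the relationships described above. The 5% seeding factor is recalled from memory rather than taken from any particular BOINC release, so treat both functions as illustrations, not actual server code:

```python
# Sketch of the two flops-estimation regimes described above.
# Not actual BOINC server code; constants and names are illustrative.

def seeded_flops(peak_flops, fraction=0.05):
    """Pre-2011 style: a fixed fraction of the device's calculated peak speed.
    The 5% fraction is recalled from memory and may not match any shipped version."""
    return peak_flops * fraction

def measured_flops(rsc_fpops_est, elapsed_seconds):
    """Post-2011 style: reverse-engineered from the task size estimate and the
    runtime actually measured on the volunteer's hardware."""
    return rsc_fpops_est / elapsed_seconds

# Cross-check against the APR quoted above (35,599.7 GFlops): at that speed a
# 1e18-fpops task takes about 1e18 / 3.56e13 = ~28,100 s, i.e. ~7.8 hours,
# which is consistent with the host completing roughly 3 tasks per day.
print(measured_flops(1e18, 28_100))   # ~3.56e13, close to the APR figure
```

If that cross-check is right, the APR at least looks like a genuine fpops-per-elapsed-second figure, and the open question is only why the per-task flops field collapsed to below 100 GFlops.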
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
08/09/2024 21:08:27 | GPUGRID | [error] Error reported by file upload server: Server is out of disk space |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
"08/09/2024 21:08:27 | GPUGRID | [error] Error reported by file upload server: Server is out of disk space"
This has happened at irregular intervals over all the years - the last time about 2 weeks ago. Hard to believe how difficult it must be to take measures against it. |
|
Joined: 6 Mar 15 Posts: 2 Credit: 169,990,150 RAC: 343
|
Well, how does one manage to complete a unit in 5 min? I'm sitting on a more than OK PC with a decent 4090 card, and my units are close to 5 hours. :) |
|
Joined: 8 Oct 16 Posts: 27 Credit: 4,153,801,869 RAC: 0
|
"Well, how does one manage to complete a unit in 5 min?"
If you are referring to this post, https://www.gpugrid.net/forum_thread.php?id=5468&nowrap=true#61786, Steve posted in the GPUGrid Discord that there are still tasks being generated by the older batch from before the code was updated. I don't know how long it will be before these older tasks are flushed out of the system, but it has now been more than 21 days. |