Message boards : Number crunching : ATMML
| Author | Message |
|---|---|
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 731
|
Generally, if I know I must reboot the system in the near future, I will just wait until the current tasks are finished, or reboot shortly after a new task starts, so I don't begrudge the little time it has already spent crunching and will have to restart after the reboot. It is generally safe to stop a task soon after it starts because, with the exception of the acemd tasks, all the other task types spend several minutes unpacking the python environment in the slot directory and haven't actually started calculating anything yet. I have found you can get away with interrupting that startup process with a reboot, and you won't throw away the task or error it out. |
|
Joined: 14 Feb 20 Posts: 16 Credit: 27,395,983 RAC: 0
|
The time bonus system has been in place with GPUGrid for years. (And yes, the GG tasks download several GB of data [WHY? well, another issue], and the download time does count against the deadline.) BUT the points awarded are nonetheless, shall one say, unfathomable.

Case in point: ATMML has a very, very high failure rate [yet another issue, AND an important one], and when completed it usually awards 300,000 points, at least on my NVIDIA, which is better in some ways than this guy's. HOWEVER, host 621740 has had seven successful ATMML tasks (see below) in the last six days, with EACH being awarded 2,700,000 points. SO, what gives? WHY a 9-fold difference?

WUid - other task results:
29271283 - 1 error, 1 abort
29270516 - 3 errors
29265238 - 1 error, 1 abort
29204796 - 1 time out, 5 errors (1 of these after 50,905 sec = 14+ hrs)
29268456 - n/a
29267692 - 3 errors
29267146 - 2 errors

Host 621740's specs:
GenuineIntel Intel(R) Core(TM) i7 CPU 930 @ 2.80GHz [Family 6 Model 26 Stepping 5]
Number of processors: 8
Coprocessors: NVIDIA GeForce RTX 3060 (12287MB), driver: 560.81
Operating System: Microsoft Windows 10 Professional
Memory: 12279.11 MB
Cache: 256 KB

My NVIDIA has:
Memory: 16316.07 MB
Cache: 512 KB
PLUS Swap space 45668.07 MB and Total disk space 464.49 GB

Lasslo P, PhD, Prof Engr. |
|
Joined: 14 Feb 20 Posts: 16 Credit: 27,395,983 RAC: 0
|
Good point, but winDoze update does not make it easy to avoid its "decision" about updating and the time to restart my system. (Don't you hate it when big tech is so much more brilliant and all-knowing than you?) I turn OFF updates for the 5 weeks maximum allowed, then, as the month ends, I pick the time when I will download and install OS updates. Even then, I set the "active hours" to the times my PC is LEAST likely to be in use, usually including late PM to early AM. Lasslo P, PhD, Prof Engr |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
|
you're comparing multiple levels of apples vs oranges.

a GTX 1660 Ti is in no way better than an RTX 3060: it's an older generation, has less than half the CUDA cores, no tensor cores (which ATMML will use), a slower clock speed and slower memory speed. truly, basically every performance metric favors the 3060.

the task you completed for 300,000 cr was ACEMD3, not ATMML, and you also need to consider that ATMML tasks run much longer than ACEMD3 and use more resources, so the higher credit reward is appropriate. your one ACEMD3 task, if you were to crunch 24/7, would come out to a production of a little over 1,000,000 points per day. your competitor also completed one ACEMD3 task recently, and scaling that to 24/7 production comes out to around 1,500,000 points per day. it takes them about 4x longer to run ATMML, and based on their recent production, including the failure, that is about 3,600,000 PPD.

the project admins have resolved a lot of the problems with ATMML. if you want to have better success with this project, and GPUGRID in general, you can consider switching to Linux. otherwise, maybe investigate what's going wrong with your system to cause the failures. it looks like a permissions issue to me, since your failed tasks have a bunch of "Access is denied" errors in the WU log, possibly an overzealous AV software. that could be the reason for your download errors also, or just spotty internet.
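For anyone checking the arithmetic, the points-per-day comparison above is just per-task credit scaled to a 24-hour day. A minimal sketch of that scaling - the credit values come from the posts, but the runtimes below are hypothetical placeholders, not the actual times from either host:

```python
# Points-per-day scaling used in the comparison above.
# Credit values are from the posts; the runtimes are hypothetical examples.
SECONDS_PER_DAY = 86_400

def points_per_day(credit_per_task, runtime_seconds):
    return credit_per_task * SECONDS_PER_DAY / runtime_seconds

print(points_per_day(300_000, 25_000))    # ~1,036,800 PPD for a 300k-credit task in ~7 hours
print(points_per_day(2_700_000, 64_800))  # ~3,600,000 PPD for a 2.7M-credit task in ~18 hours
```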
|
Life under progressive, coerci... Joined: 7 Feb 12 Posts: 5 Credit: 334,557,283 RAC: 0
|
"a GTX 1660 Ti is in no way better"
"in no way" is an absolute statement, and is false: my NVIDIA has 33% more memory and double the cache. But, admittedly, it is not in general as powerful. |
Life under progressive, coerci... Joined: 7 Feb 12 Posts: 5 Credit: 334,557,283 RAC: 0
|
"maybe investigate what's going wrong with your system to cause the failures."
How bizarre... Batting 1000: the GPUGrid tasks which fail on my system have ALSO failed on several, perhaps even 6 or 7, other systems (except when I take a newly issued task, and when I check those later they also fail after bombing on my system). So, if the problem is on my end and not in any way on GPUGrid's end, then there must be dozens and dozens (and dozens) of other systems which apparently need to "investigate what's going wrong" with them...

"that could be the reason for your download errors also"
I have no "download errors" except when I abort the download of a task which has already had repeated compute errors. GPUGrid needs 8 failures before figuring out that a workunit has 'too many errors (may have bug)'. If I can, I'd rather give them this insight before I waste 5-10 minutes of time on my GPU, such as it is. Anyway, thanks for your feedback.

Oh, and by the way, I run 12-13 other projects, including at least three others where I run GPU tasks. This very high task error rate is NOT an issue whatsoever with any of them.

LLP, PhD, Prof Engr |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
|
"a GTX 1660 Ti is in no way better"
read up. my "absolute" statement is correct. your 1660Ti has half the memory of a 3060. your 1660Ti also has half the cache of the 3060.

GTX 1660 Ti: 6GB
RTX 3060: 12GB (you can see this in the host details you referenced)

not sure where you're getting your information, but it's wrong or missing context (like comparing to a laptop GPU or something)
|
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
|
"maybe investigate what's going wrong with your system to cause the failures."
there are more people running Windows, so there's a higher probability for resends to land on another problematic Windows host. it's more common for Windows users to be running AV software, and it's common for Windows users to have issues with BOINC projects and AV software. not hard to imagine that these factors mean a large number of people would have problems when they all come into play.

check your AV settings, whitelist the BOINC data directories, and try again.
|
Life under progressive, coerci... Joined: 7 Feb 12 Posts: 5 Credit: 334,557,283 RAC: 0
|
"your 1660Ti has half the memory of a 3060."
My information is from GPUGrid's host information, https://gpugrid.net/show_host_detail.php?hostid=613323, which states 16GB, but this may be unreliable, as TechPowerUp GPU-Z does give the NVIDIA site's number of 6GB. My numbers for the cache also come from gpugrid.net/show_host_detail.php, as indeed do all the memory figures in my original post, so I guess my mistake was trusting the gpugrid.net/show_host_detail info. And no, this is not a laptop, but a 12-core desktop. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
|
"your 1660Ti has half the memory of a 3060."
TPU is not out of date, and is probably one of the most reliable databases for GPU (and other) specifications.

there lies the issue: you're looking at system memory, not the GPU memory. system memory has little to do with GPUGRID tasks, which run on the GPU and not the CPU. at all BOINC projects, the GPU VRAM is listed in parentheses next to the GPU model name on the Coprocessors line. for further context, there was a long-standing bug in BOINC versions older than about 7.18 that capped the Nvidia memory reported (not the actual memory) to only 4GB, so old versions were wrong in what they reported for a long time.

so still, the 3060 beats the 1660Ti in every metric. you just happen to have populated more system memory on the motherboard, but that has nothing to do with comparing the GPUs themselves.
|
Life under progressive, coerci... Joined: 7 Feb 12 Posts: 5 Credit: 334,557,283 RAC: 0
|
"it's common for Windows users to have issues with BOINC projects and AV software"
Again, I run 12-13 other projects, including at least three others where I run GPU tasks. I have a zero error rate on the other projects. But I do appreciate your suggestion, as I like the science behind GPUGrid and would very much like to RUN tasks rather than have them error out. I have searched PC settings and Control Panel settings, as well as file options, for "AV" and do not get any relevant hits. Could you please elaborate on what you mean by AV settings and whitelisting the BOINC directories? Thanks. LLP, PhD, Prof Engr. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
|
AV = Anti Virus software.
|
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
Switching back to BOINC software and (specifically) ATMML tasks. I've posted extensively in this thread about the problems of task duration estimation at this project. I've got some new data, which I can't explain.

Last week, I added a new Linux host (host 625407). It's a pretty plain vanilla Intel i5 with a single RTX 3060 - should be fairly average for this project. It's completed 17 tasks so far, with number 18 in progress - around 3 per day. I attached it to a venue with only ATMML tasks allowed. Given my interest in BOINC's server-side task duration estimation for GPUs, I've been logging the stats. Here's what I've got so far:

| Task number | rsc_fpops_est | rsc_fpops_bound | flops | DCF | Runtime estimate |
|---|---|---|---|---|---|
| 1 | 1E+18 | 1E+21 | 20,146,625,396,909 | 1.0000 | 49636 s (13.79 hours) |
| 2 | | | | | |
| 3 | | | | | |
| 4 | 1E+18 | 1E+21 | 20,218,746,342,900 | 0.8351 | 41301 s (11.47 hours) |
| 5 | | | | | |
| 6 | | | | | |
| 7 | 1E+18 | 1E+21 | 19,777,581,461,665 | 0.9931 | 50214 s (13.95 hours) |
| 8 | | | | | |
| 9 | | | | | |
| 10 | 1E+18 | 1E+21 | 19,446,193,249,403 | 0.8926 | 45900 s (12.75 hours) |
| 11 | 1E+18 | 1E+21 | 19,506,082,146,580 | 0.8247 | 42279 s (11.74 hours) |
| 12 | 1E+18 | 1E+21 | 19,522,515,301,144 | 0.7661 | 39242 s (10.90 hours) |
| 13 | | | | | |
| 14 | 1E+18 | 1E+20 | 99,825,140,137 | 0.7585 | 7598256 s (87.94 days) |
| 15 | | | | | |
| 16 | 1E+18 | 1E+21 | 99,825,140,137 | 0.7360 | 7373243 s (85.34 days) |
| 17 | 1E+18 | 1E+21 | 99,825,140,137 | 0.7287 | 7300045 s (84.49 days) |
| 18 | 1E+18 | 1E+21 | 99,825,140,137 | 0.7215 | 7227478 s (83.65 days) |

My understanding of the BOINC server code is that, for a mature app_version (Linux ATMML has been around for 2 months), the initial estimates should be based on the average speed of the tasks so far across the project as a whole. So it seems reasonable that the initial estimates were for 10-12 hours - that's about what I expected for this GPU. Then, after the first 11 tasks have been reported successful, it should switch to the average for this specific host. So why does it appear that this particular host is reporting a speed something like 1/200th of the project average? So now, it's frantically attempting to compensate by driving my DCF through the floor, as with my two older machines.

The absolute values are interesting too. The initial (project-wide) flops estimates are hovering around 20,000 GFlops - does that sound right, for those who know the hardware in detail? And they are fluctuating a bit, as might be expected for an average with variable task durations for completions. After the transition, my card dropped to below 100 GFlops - and has remained rock-steady. That's not in the script. The APR for the card (which should match the flops figure for allocated tasks) is 35599.725995644 GFlops - which doesn't match any of the figures above.

Where does this take us? And what, if anything, can we do about it? I'll try to get my head round the published BOINC server code on GitHub, but this area is notoriously complex. And the likelihood is that the current code differs to a greater or lesser extent from the code in use at this project. I invite others of similarly inquisitive mind to join in with suggestions. |
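For what it's worth, the logged estimates look internally consistent with a simple relationship - this is my reading of how the estimate is formed, not something taken from the server code: estimated runtime ≈ DCF × rsc_fpops_est / flops. A minimal check against a few rows of the table:

```python
# Sanity check: the runtime estimates above appear to follow
#   estimate_secs = DCF * rsc_fpops_est / flops
# (an assumption about the formula; the figures below are copied from the table)
rows = [
    # (task, rsc_fpops_est, flops, DCF, logged_estimate_secs)
    (1,  1e18, 20_146_625_396_909, 1.0000, 49_636),
    (4,  1e18, 20_218_746_342_900, 0.8351, 41_301),
    (14, 1e18,     99_825_140_137, 0.7585, 7_598_256),
]

for task, fpops_est, flops, dcf, logged in rows:
    calc = dcf * fpops_est / flops
    print(f"task {task}: calculated {calc:,.0f} s vs logged {logged:,} s")

# task 1:  ~49,636 s    vs 49,636 s
# task 4:  ~41,304 s    vs 41,301 s
# task 14: ~7,598,287 s vs 7,598,256 s
```

If that relationship holds, the jump from hours to days is driven almost entirely by the flops figure collapsing from roughly 20 TFlops to below 100 GFlops; the falling DCF only trims the result.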
|
Joined: 8 Oct 16 Posts: 27 Credit: 4,153,801,869 RAC: 0
|
I didn't run ATMML, but I'm currently running Qchem on a Tesla P100 with short run times (averaging somewhere around 12 minutes or so per task). I see a similar behavior/pattern when starting a new instance. If I were to guess, your DCF will eventually go down from the last value of 0.7215 to 0.01 after running 100+ tasks, and your final estimated run time could be about 1.16 days, which is still higher than the average expected run time for your card. However, if you run the CPU benchmark, the DCF number will go up from 0.01 to something higher, and it will take another 100+ tasks for the DCF to go down to 0.01 again, but this time the estimated run time will go below 1.16 days. I didn't take any notes, just making an observation, so I could be wrong. My wild guess is that when running a GPU-only task there is an associated CPU % required to run that GPU task, and running the benchmark takes care of the CPU portion needed for the GPU task. My one cent. |
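As a rough illustration of why that takes on the order of a hundred tasks, here is a toy model of a DCF drifting toward its 0.01 floor. The update rule and the 0.1 step size are assumptions for the sketch only (the client raises DCF quickly on overruns and lowers it slowly, but the exact constants are not checked here):

```python
# Toy model only: how a duration correction factor might creep toward its floor
# when tasks keep finishing at ~1% of their estimated runtime.
# The 10% step per task is an assumed constant, not taken from the BOINC client.
dcf = 0.7215            # last logged value from the table above
observed_ratio = 0.01   # actual runtime / estimated runtime
floor = 0.01            # lower clamp on DCF

tasks = 0
while dcf > 0.02:
    dcf = max(floor, dcf + 0.1 * (observed_ratio - dcf))  # slow downward drift
    tasks += 1

print(f"~{tasks} tasks for DCF to approach its floor")
# With a smaller per-task step, the same curve stretches out to the 100+ tasks
# described above.
```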
|
Joined: 21 Dec 23 Posts: 51 Credit: 0 RAC: 0 |
This is very interesting, thank you for the numbers. I still don't understand where the flops number for a machine comes from. Does it use the data of your hardware? Or is it purely based on maths done from the rsc_fpops_est number we have set and the time taken for WUs?

I am also unsure how I would set this rsc_fpops_est number to be more accurate. Given one of these WUs takes maybe an hour on a 4090: a 4090 is 80 TFLOPS, x 1 hour = ~3x10^17 floating point operations, which is not actually that far off the estimated value of 1x10^18. Of course the WUs will not be using all the TFLOPS of the 4090, and there is no sane way for me to calculate the number of floating point operations the program actually uses. |
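A quick back-of-the-envelope version of that arithmetic (the 80 TFLOPS peak and the one-hour runtime are the assumptions from the post above; real tasks won't sustain anywhere near peak):

```python
# Rough check of the flop-count estimate above.
peak_flops = 80e12      # ~80 TFLOPS quoted for a 4090
runtime_s = 3600        # roughly one hour per WU
total_ops = peak_flops * runtime_s

print(f"{total_ops:.2e} floating point operations")               # 2.88e+17, i.e. ~3e17
print(f"fraction of rsc_fpops_est (1e18): {total_ops/1e18:.0%}")  # ~29%
```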
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
Well, I said that this is going to be difficult...

My knowledge and understanding comes from already being an active volunteer at SETI@home on 18 Dec 2008, when BOINC was first announced as being able to manage and use CUDA on GPUs for scientific computing - GPUGrid is also mentioned as becoming CUDA-enabled on the same day. We spent the following months and years knocking the rough edges off the initial launch code. I think the names and features I was referring to in my post were introduced in a sort-of relaunch around 2011. That's still a long way back in the memory bank, and that makes it difficult to find precise references in code or documentation.

My understanding is that the system was designed to be as easy as possible for researchers to implement. I believe the only key information required is rsc_fpops_est - the estimated size of the task. From your comments on the 4090, and my logging of the early tasks, I think we can accept the current figure as being 'near enough right', and that's all it needs to be.

I think that the flops value is - since 2011-ish - reverse-engineered from your estimate of fpops_est and the measured runtime of the task on the volunteer's hardware. Pre-2011, BOINC took more notice of the 'peak flops' calculated by our computers from the speed and internal geometry of the GPU in use. BOINC guessed a 'fiddle factor' - I think something like 5% - as the ratio between the maximum usable speed on real-life jobs and the calculated peak speed. But that was abandoned, except possibly for use as an initial seeding value.

Once the real-life data is available from the initial tasks run on a new computer, the server should maintain a running average value of flops for each computer attached to the project. That should wobble with small changes in actual task duration, which is why I was surprised to see it remain identical to the last significant digit in my run so far.

All the data necessary to calculate flops is returned as each task is reported complete. It's stored in the result table on the server, and should be transferred/averaged to the host table. I should be able to point you to the current code for managing that transfer, and the variable names and db field names used - though I may not be able to post them until after the weekend. Perhaps we could compare notes once I've found them? |
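To make the two regimes concrete, here is a minimal sketch of the relationships described above. The 5% seeding factor is recalled from memory rather than taken from any particular BOINC release, so treat both functions as illustrations, not actual server code:

```python
# Sketch of the two flops-estimation regimes described above.
# Not actual BOINC server code; constants and names are illustrative.

def seeded_flops(peak_flops, fraction=0.05):
    """Pre-2011 style: a fixed fraction of the device's calculated peak speed.
    The 5% fraction is recalled from memory and may not match any shipped version."""
    return peak_flops * fraction

def measured_flops(rsc_fpops_est, elapsed_seconds):
    """Post-2011 style: reverse-engineered from the task size estimate and the
    runtime actually measured on the volunteer's hardware."""
    return rsc_fpops_est / elapsed_seconds

# Cross-check against the APR quoted above (35,599.7 GFlops): at that speed a
# 1e18-fpops task takes about 1e18 / 3.56e13 = ~28,100 s, i.e. ~7.8 hours,
# which is consistent with the host completing roughly 3 tasks per day.
print(measured_flops(1e18, 28_100))   # ~3.56e13, close to the APR figure
```

If that cross-check is right, the APR at least looks like a genuine fpops-per-elapsed-second figure, and the open question is only why the per-task flops field collapsed to below 100 GFlops.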
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
08/09/2024 21:08:27 | GPUGRID | [error] Error reported by file upload server: Server is out of disk space |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
"08/09/2024 21:08:27 | GPUGRID | [error] Error reported by file upload server: Server is out of disk space"
This has happened at irregular intervals over all the years - the last time about 2 weeks ago. Hard to believe how difficult it must be to take measures against it. |
|
Joined: 6 Mar 15 Posts: 2 Credit: 169,990,150 RAC: 343
|
Well, how does one manage to complete a unit in 5 min? I'm sitting on a more than OK PC with a decent 4090 card, and my units are close to 5 hours. :) |
|
Joined: 8 Oct 16 Posts: 27 Credit: 4,153,801,869 RAC: 0
|
"Well, how does one manage to complete a unit in 5 min?"
If you are referring to this post, https://www.gpugrid.net/forum_thread.php?id=5468&nowrap=true#61786, Steve posted in the GPUGrid Discord that there are still tasks being generated by the older batch from before the code was updated. I don't know how long it will be before these older tasks are flushed out of the system, but it has now been more than 21 days. |