Message boards : Number crunching : ATMML
Author | Message |
---|---|
I just finished crunching a task for this new application successfully. | |
ID: 61582 | Rating: 0 | rate: / Reply Quote | |
Judging by the name of the app, it somehow uses machine learning. | |
ID: 61583 | Rating: 0 | rate: / Reply Quote | |
I just finished crunching a task for this new application successfully. How did you manage to download such a task? The list in which you can choose from the various subprojects does NOT include ATMML. | |
ID: 61584 | Rating: 0 | rate: / Reply Quote | |
This is an app in testing mode, it does not appear as one to select yet. You will only get the WUs if you have selected to run the test applications. It is a different version of the existing ATM app that includes machine learning based forcefields for the molecular dynamics. | |
ID: 61585 | Rating: 0 | rate: / Reply Quote | |
Thanks for the progress update and explanation of just what kind of ML is being used for the ATM tasks, Steve. | |
ID: 61586 | Rating: 0 | rate: / Reply Quote | |
Is it Windows, Linux or both? | |
ID: 61594 | Rating: 0 | rate: / Reply Quote | |
You can verify OS compatibility for the different applications at the GPUGRID apps page. | |
ID: 61595 | Rating: 0 | rate: / Reply Quote | |
I noticed that this batch of ATMML units takes almost 3 times longer than the previous batches to complete. I suspended one of them, and when I restarted it, it would not resume; I kept it "running" for over an hour with no progress, so I had no option but to abort it. | |
ID: 61602 | Rating: 0 | rate: / Reply Quote | |
Indeed, they take a very long time to compute. I am going to stop them as well. | |
ID: 61604 | Rating: 0 | rate: / Reply Quote | |
I cancelled the 4 ATMML tasks I had because they were taking too long to compute. | |
ID: 61607 | Rating: 0 | rate: / Reply Quote | |
Didn't have any issues with the new ATMML tasks I received. Rescued one at "the last chance saloon" as the _7 wingman. | |
ID: 61609 | Rating: 0 | rate: / Reply Quote | |
I don't recall a larger executable from a BOINC project. 4.67 GB! That is larger than some LHC VDI files. | |
ID: 61611 | Rating: 0 | rate: / Reply Quote | |
I have had 57 units so far without a single error. Great! | |
ID: 61612 | Rating: 0 | rate: / Reply Quote | |
Hello everyone! My four hosts running a 3060, 3060 Ti and 3070 Ti have not been able to complete a single unit so far. They all fail at the very beginning with the following STDERR output: "Error loading cuda module". I am running Linux Mint and Ubuntu with Nvidia driver 470. The newer drivers produce errors in other projects, so I decided to stick to that driver version. I noticed that a lot of my wingmen successfully crunch the units with driver 530 or 535. Is that a driver issue? All other projects run just fine on version 470. | |
ID: 61615 | Rating: 0 | rate: / Reply Quote | |
With that error, yes, I would assume the old driver version is the issue. | |
ID: 61616 | Rating: 0 | rate: / Reply Quote | |
On my machine the system's stock drivers work very well. They are the 535 drivers supplied with the Linux Mint install. | |
ID: 61617 | Rating: 0 | rate: / Reply Quote | |
On my machine the system's stock drivers work very well. They are the 535 drivers supplied with the Linux Mint install. I tried to install the 535 driver, but after that my GPU is no longer recognised by Amicable, Einstein and Asteroids. GPUGRID lets me start new WUs, but they fail after 43 seconds saying that no Nvidia GPU was found. Do I have to install additional libraries or something like that? I also noticed that there is an open driver package from Nvidia, a regular meta package, and a server version of that driver. Which one are you guys using? | |
ID: 61619 | Rating: 0 | rate: / Reply Quote | |
On my machine the system's stock drivers work very well. They are the 535 drivers supplied with the Linux Mint install. If you're running OpenCL applications then yes, you need the additional OpenCL package: sudo apt install ocl-icd-libopencl1. The 535 drivers work fine for Einstein; most of my hosts are on that driver and I contribute to Einstein primarily. ____________ | |
ID: 61620 | Rating: 0 | rate: / Reply Quote | |
I don't use any additional packages. | |
ID: 61621 | Rating: 0 | rate: / Reply Quote | |
To start with, I advise you to test your RAM sticks with the free Memtest and not another program. It works very well and is reliable. | |
ID: 61622 | Rating: 0 | rate: / Reply Quote | |
When you install the 535 driver on Linux, it also installs everything needed for computing. | |
ID: 61623 | Rating: 0 | rate: / Reply Quote | |
I also advise you to run a test on your hard drive to check that there are no defective clusters. | |
ID: 61624 | Rating: 0 | rate: / Reply Quote | |
When troubleshooting a PC, I always start with these 2 things. | |
ID: 61625 | Rating: 0 | rate: / Reply Quote | |
It may be true for Mint that the OpenCL components are installed with the normal driver package, but that is not the case for Ubuntu, and you do need to install the OpenCL components separately with the command in my previous post. | |
ID: 61626 | Rating: 0 | rate: / Reply Quote | |
OK, I managed to get one of my Ubuntu hosts running with the 535 driver and the additional OpenCL libraries installed like you said, Ian&Steve. It has been crunching an ATMML unit for 5 hours so far, and it's looking good! I am surprised to see that you seemingly need OpenCL to run CUDA code, because that is the only difference from my previous attempts. | |
ID: 61631 | Rating: 0 | rate: / Reply Quote | |
You don't need the OpenCL driver components to run true CUDA code, but a lot of other projects are not running apps compiled in CUDA, but rather OpenCL. | |
ID: 61632 | Rating: 0 | rate: / Reply Quote | |
...running 3060, 3060ti and 3070ti were not able to complete a single unit so far. It's hit or miss for 1080 Ti, 2070 Ti, and 3060 Ti to successfully complete ATMML WUs. They have no problem with 1.05 QC so I dedicate them to QC. My 2080 Ti, 3080, and 3080 Ti GPUs have no problem running ATMML so they're dedicated to ATMML. All running Linux Mint with Nvidia 550.54.14. ____________ | |
ID: 61633 | Rating: 0 | rate: / Reply Quote | |
Hello. | |
ID: 61634 | Rating: 0 | rate: / Reply Quote | |
Until yesterday I received the ATMML WUs and my hosts crunched them successfully with driver version 535. Today I am not getting any new ones. The server says there are none available, but on the server main page there are always around 300 available for download. And that number fluctuates, so others are getting them. Does anybody else have that problem? | |
ID: 61646 | Rating: 0 | rate: / Reply Quote | |
Does anybody else have that problem? No. Since the new batch arrived I have had a constant supply on two machines with drivers 535 and 550. The new work units seem to take much longer to calculate (with higher credits): 11700 s on a 4080 Super, 12000 s on a 4080, 14700 s on a 4070 Ti. | |
ID: 61647 | Rating: 0 | rate: / Reply Quote | |
Huh... Do you know how much VRAM they need? My GPUs are limited to 8 GB. | |
ID: 61648 | Rating: 0 | rate: / Reply Quote | |
Huh... Do you know how much VRAM they need? My GPUs are limited to 8 GB. 8 GB is plenty. They only use 3-4 GB at peak. | |
ID: 61649 | Rating: 0 | rate: / Reply Quote | |
Is it Windows, Linux or both? Unfortunately it is only for Linux according to the application page. | |
ID: 61650 | Rating: 0 | rate: / Reply Quote | |
Still waiting and hoping for some Windows tasks. | |
ID: 61651 | Rating: 0 | rate: / Reply Quote | |
I tried to install the 535 driver, but after that my GPU is no longer recognised by Amicable, Einstein and Asteroids. GPUGRID lets me start new WUs, but they fail after 43 seconds saying that no Nvidia GPU was found. It looks like OpenCL is not installed with the distro's 535 drivers. You need to download the official installer (with the *.run extension) from the Nvidia website and install it manually, not from the distro's driver manager. Something like this method, but with a newer *.run version of course: https://askubuntu.com/questions/66328/how-do-i-install-the-latest-nvidia-drivers-from-the-run-file | |
ID: 61670 | Rating: 0 | rate: / Reply Quote | |
These tasks are running fine for me so far. The only thing I noticed was that the points awarded seem off. The second task I completed took an hour longer than the first but received 900k fewer points. So far they have been taking 5-6 hours on a 3080ti, driver version 535.183.01. | |
ID: 61685 | Rating: 0 | rate: / Reply Quote | |
It is lower credit because it took longer from receiving the task to returning the result. | |
ID: 61686 | Rating: 0 | rate: / Reply Quote | |
These tasks are running fine for me so far. The only thing I noticed was that the points awarded seem off. The second task I completed took an hour longer than the first but received 900k fewer points. So far they have been taking 5-6 hours on a 3080ti, driver version 535.183.01. The points are accurate. You get a 50% bonus if you finish the task successfully and return the results within 24 hours of downloading it. There is a 25% bonus if you do it within 48 hours. No bonus if you return it after 48 hours. This is an incentive for quick return of results. | |
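A minimal sketch of that bonus arithmetic, assuming granted credit is simply the base award scaled by the turnaround bonus described above; the function name and base value are illustrative, not project code:

def granted_credit(base_credit, turnaround_hours):
    # Bonus scheme as described above: +50% if the result is returned
    # within 24 hours of download, +25% within 48 hours, none after that.
    if turnaround_hours <= 24:
        return base_credit * 1.50
    if turnaround_hours <= 48:
        return base_credit * 1.25
    return base_credit

# Example with an illustrative base award of 600,000 credits:
print(granted_credit(600_000, 20))   # 900000.0 - returned within a day
print(granted_credit(600_000, 30))   # 750000.0 - returned within two days
print(granted_credit(600_000, 60))   # 600000.0 - no bonus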
ID: 61687 | Rating: 0 | rate: / Reply Quote | |
This task has been "downloading" for almost 4 hrs, has 0% completion and estimated run time of 400+ days. | |
ID: 61689 | Rating: 0 | rate: / Reply Quote | |
Patience . . . . grasshopper. This is a new app for Windows hosts so you are competing for download bandwidth with the cohort of other Windows hosts. Which are many. | |
ID: 61690 | Rating: 0 | rate: / Reply Quote | |
Looks like it finally started and ran for a few minutes, then uploaded... | |
ID: 61692 | Rating: 0 | rate: / Reply Quote | |
These tasks are running fine for me so far. The only thing I noticed was that the points awarded seem off. The second task I completed took an hour longer than the first but received 900k fewer points. So far they have been taking 5-6 hours on a 3080ti, driver version 535.183.01. Ah you are correct. I had the one task stuck in "downloading" for a while and I didn't run it until the next day. | |
ID: 61693 | Rating: 0 | rate: / Reply Quote | |
Are there no checkpoints on ATMML tasks? | |
ID: 61694 | Rating: 0 | rate: / Reply Quote | |
Are there no checkpoints on ATMML tasks? No, there are not. Same goes for quantum chemistry and ATM. They haven't figured out how to do it, yet. | |
ID: 61695 | Rating: 0 | rate: / Reply Quote | |
I hope this doesn't backfire. This morning I see 800 tasks in progress, but zero ready to send. | |
ID: 61696 | Rating: 0 | rate: / Reply Quote | |
I do hope new Windows users pay attention to the 'tricks of the trade' we've learned over the years: Thank you for your ever-sharing expertise. My last two downloads have been replica _3 tasks, each WU having failed on three Windows machines first. Despite this, there is a noticeable increase in the number of users returning ATMML results. That is likely due to the Windows users now added to the previous Linux ones. Before the new Windows ATMML app was released, users/24h was consistently about 80-100. Currently it is more than 230, as can be seen on the Server status page. | |
ID: 61697 | Rating: 0 | rate: / Reply Quote | |
Re: the Apps Page: https://www.gpugrid.net/apps.php | |
ID: 61698 | Rating: 0 | rate: / Reply Quote | |
Re: the Apps Page: https://www.gpugrid.net/apps.php This is GPUGRID; all tasks are for the GPU. ____________ | |
ID: 61699 | Rating: 0 | rate: / Reply Quote | |
Despite this, there is a noticeable increase in the number of users returning ATMML results. Indeed. But the question is: are those completed, end-of-run, scientifically useful results - or are they early crashes, resulting only in the creation and issue of another replica, to take its place in the 'in progress' count? We can't tell from the outside. But runtimes starting at 0.04 hours don't look too good. | |
ID: 61700 | Rating: 0 | rate: / Reply Quote | |
Hi, the Windows hosts are working successfully. There are more errors than on Linux, as expected, but plenty are working well. | |
ID: 61701 | Rating: 0 | rate: / Reply Quote | |
Which cache? Where is it set?? What should it be set at??? | |
ID: 61702 | Rating: 0 | rate: / Reply Quote | |
I just started ATMML yesterday. Out of seven starts only one completed. The rest errored-out after 1-1.5 hours. Windows11/RTX4090. | |
ID: 61703 | Rating: 0 | rate: / Reply Quote | |
He's talking about the work cache on the host. You can (kind of) control that in the BOINC Manager Options -> "Computing Preferences" menu. Set it to something less than 1 day, probably. You'll be limited to 4 tasks from the project (per GPU) anyway. ____________ | |
ID: 61704 | Rating: 0 | rate: / Reply Quote | |
The Windows tasks ARE NOT working as advertised.... | |
ID: 61705 | Rating: 0 | rate: / Reply Quote | |
Thanks. There are too many caches out there. Let's call this the work queue. | |
ID: 61706 | Rating: 0 | rate: / Reply Quote | |
The Windows tasks ARE NOT working as advertised.... All 8 of the 8 tasks I have completed and returned were also categorized as errors. This is on Win10 with 4080 and 4090 GPUs. Here is a sample: http://www.gpugrid.net/result.php?resultid=35743812 ____________ Reno, NV Team: SETI.USA | |
ID: 61707 | Rating: 0 | rate: / Reply Quote | |
Exit status 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED That's going to be a difficult one to overcome unless the project addresses its job estimation. You need to 'complete' (which includes a successful finish plus validation) 11 tasks before the estimates are normalised - and if every task fails, you'll never get there. | |
ID: 61708 | Rating: 0 | rate: / Reply Quote | |
Hello. I apologise for the 'time limit exceeded' errors. I did not expect this. The jobs run for the same time as the Linux ones, which have all been working, so I don't really understand what is happening. <message> exceeded elapsed time limit 5454.20 (10000000000.00G/1712015.37G)</message> <stderr_txt> The numerator, I believe, is the fpops_bound that is set in the WU template, which is controlled by us. | |
ID: 61709 | Rating: 0 | rate: / Reply Quote | |
Does anyone know where the denominator comes from in this line?: Yes. It's the current estimated speed for the task, which should be 'learned' by BOINC for the individual computer running this particular task type ('app_version'). It's a complex three-stage process, and unfortunately it doesn't go down to the granularity of individual GPU types - all GPUs are considered equal. 1) When a new app version is created, the server will set a first, initial, value for GPU speeds for that version. I'm afraid I don't know how that initial value is estimated, but I'll try to find out. 2) Once the app version is up and running, the server monitors the runtime of the successful tasks returned. That's done at both the project level, and the individual host level. The first critical point is probably when the project has received 100 results: the calculated average speed from those 100 is used to set the expected speed for all tasks issued from that point forward. [aside - 'obviously' the first results received will be from the fastest machines, so that value is skewed] 3) Later, as each individual host reports tasks, once 11 successful tasks have been returned, future tasks assigned to that host are assigned the running average value for that host. The current speed estimate ('fpops_est') can be seen in the application_details page for each host. zombie67 hasn't completed an ATMML task yet, so no 'Average processing rate' for his machine is shown yet for ATMML (at the bottom), but you can see it for other task types. Phew. That's probably more than enough for now, so I'll leave you to digest it. | |
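A minimal sketch of where that time limit comes from, assuming the standard BOINC client behaviour described above: the client aborts a task once its elapsed time exceeds the workunit's rsc_fpops_bound divided by the server's speed estimate for this host and app version. The values below are illustrative, not taken from a real task:

# Illustrative values only.
rsc_fpops_bound = 1e19    # hard limit on floating-point ops, set in the WU template
est_flops = 1.7e15        # server's estimated speed for this host/app version (flops)

# Seconds allowed before the client reports "exceeded elapsed time limit".
elapsed_time_limit = rsc_fpops_bound / est_flops
print(f"task aborted after {elapsed_time_limit:.0f} s of elapsed time")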
ID: 61710 | Rating: 0 | rate: / Reply Quote | |
I'm curious why we even bother to intentionally error out a task based on runtime at all. Usually a wrong runtime estimate just messes with local client scheduling a bit, but tasks eventually finish fine. It's not as if GPUGrid had accurate runtime estimation before, yet previous tasks didn't fail. | |
ID: 61711 | Rating: 0 | rate: / Reply Quote | |
Probably the decision is because this project depends on fast turnaround and turnover for tasks. | |
ID: 61712 | Rating: 0 | rate: / Reply Quote | |
I've seen people misuse this "fail fast" philosophy very often. "Fail fast" makes sense only when it's going to be a failure anyway. Turning a successful result into a failure proactively is the opposite of making progress. | |
ID: 61713 | Rating: 0 | rate: / Reply Quote | |
I'm pretty sure the "exceeded elapsed time limit" is not because the project scientists just decided on a whim to utilize it. | |
ID: 61714 | Rating: 0 | rate: / Reply Quote | |
I agree - the runtime errors are an issue mainly of the BOINC software, but they are appearing because the GPUGrid teams - admin and research - have over the years failed to fully come to terms with the changes introduced by BOINC around 2010. We are running a very old copy of the BOINC server code here, which includes the beginnings of the 2010 changes, but which makes it very difficult for us to dig our way out of the hole we're in. The runtime estimate is calculated as {size} / {size per second}: the sizes cancel out, and duration is the inverse of speed. In the case of my new task, I have: <rsc_fpops_est>1000000000000000000.000000</rsc_fpops_est> (size) <flops>698258637765.176392</flops> (speed) My calculator makes that a duration of 1,432,134 seconds, or about 16.5 days. But our BOINC clients have a trick up their sleeves for coping with that - it's called the DCF, or duration correction factor. For this machine, it has settled to 0.045052. Popping that into the calculator, the estimate comes down to 64,520 seconds, or 17.92 hours. BOINC Manager displays 17:55:20, and that's about right for these tasks (although they do vary). CONCLUSION: The task sizes set by the project for this app are unrealistically high, and the runtime estimates only approach sanity through the heavy application of DCF - which should normally hover around 1. DCF is self-adjusting, but very slowly for these extreme limits. And you have to do the work first, which may not be possible. Volunteers with knowledge and skill can adjust their own DCFs, but I wouldn't advise it for novices. @ Steve: That's even more indigestible than the essay I wrote you yesterday. Please don't jump into changing things until they've both sunk in fully: meddling (and in particular, meddling with one task type at a project with multiple applications) can cause even more problems than it cures. Mull it over, discuss it with your administrators and fellow researchers, and above all - ask as many questions as you need. | |
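The same arithmetic as a short sketch, using the exact values quoted in this post; this is just the estimate calculation described above, not BOINC code:

rsc_fpops_est = 1e18                    # task "size" from the WU template
flops = 698_258_637_765.176392          # server's speed estimate for this host
dcf = 0.045052                          # duration correction factor settled on this machine

raw_estimate = rsc_fpops_est / flops    # ~1,432,134 s, about 16.5 days
corrected = raw_estimate * dcf          # ~64,500 s, about 17.9 hours

print(f"raw estimate: {raw_estimate / 86400:.1f} days")
print(f"after DCF:    {corrected / 3600:.2f} hours")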
ID: 61716 | Rating: 0 | rate: / Reply Quote | |
"transient upload error: server out of disk space" | |
ID: 61721 | Rating: 0 | rate: / Reply Quote | |
So this runtime exceeded failure is actually related to the absurd rsc_fpops_est setting. This number is obviously not accurate. Could the project fix the number to be more reasonable, instead of relying on client's trial and error for adjustment that wastes lots of compute power? | |
ID: 61724 | Rating: 0 | rate: / Reply Quote | |
So this runtime exceeded failure is actually related to the absurd rsc_fpops_est setting. This number is obviously not accurate. Could the project fix the number to be more reasonable, instead of relying on client's trial and error for adjustment that wastes lots of compute power? Not so fast, please. The rsc_fpops_est figure is obviously wrong, but that's the result of many years of twiddling knobs without really understanding what they do. Two flies in that pot of ointment: If they reduce rsc_fpops_est by itself, the time limit will reduce, and more tasks will fail. There's a second value - rsc_fpops_bound - which actually triggers the failure. In my worked example, that was set to 1,000x the estimate, or several years. That was one of the knobs they twiddled some years ago: the default is 10x. So something else is seriously wrong as well. Soon after the Windows app was launched, I saw tasks with very high replication numbers, which had failed on multiple machines - up to 7, the limit here. But very few of them were 'time limit exceeded'. The tasks I'm running now have low replication numbers, so we may be over the worst of it. I repeat my plea to Steve - please take your time to think, discuss, understand what's going on. Don't change anything until you've worked out what it'll do. | |
ID: 61730 | Rating: 0 | rate: / Reply Quote | |
Thank you for the explanation. | |
ID: 61732 | Rating: 0 | rate: / Reply Quote | |
Yes, I found host 591089. That succeeded on its first task, but then failed five in a row on the time limit. | |
ID: 61733 | Rating: 0 | rate: / Reply Quote | |
I found a way to get the Windows WUs to finish without errors. | |
ID: 61735 | Rating: 0 | rate: / Reply Quote | |
WUs starting with "MCL1" are all erroring out on Windows and Linux. | |
ID: 61739 | Rating: 0 | rate: / Reply Quote | |
Quite a few of my tasks starting with PTP1B are failing in both OSs | |
ID: 61740 | Rating: 0 | rate: / Reply Quote | |
Quite a few of my tasks starting with PTP1B are failing in both OSs Those worked fine for me under Linux, and took a shorter time to finish. | |
ID: 61741 | Rating: 0 | rate: / Reply Quote | |
Same issue here: | |
ID: 61742 | Rating: 0 | rate: / Reply Quote | |
That is a bad batch. All units error out after several weeks of trouble-free calculation under Linux. Example: | |
ID: 61743 | Rating: 0 | rate: / Reply Quote | |
Yes, I had 90 failed ATMML tasks overnight. The earliest was issued just after 18:00 UTC yesterday, but was created at 27 Aug 2024 | 13:28:15 UTC. | |
ID: 61747 | Rating: 0 | rate: / Reply Quote | |
I experienced the same problems over several days when I was suspending GPU processing because of very hot temps in Texas; the result was the loss of many hours of processing until I discovered that LTIWS (leave tasks in memory while suspended) was apparently not working. I am suspending ATMML tasks until cooler weather arrives in the fall. BET | |
ID: 61757 | Rating: 0 | rate: / Reply Quote | |
I can't remember all of the task types that allow suspending or exiting Boinc without erroring out. | |
ID: 61758 | Rating: 0 | rate: / Reply Quote | |
LTI[M]WS only applies to CPU tasks. GPUs don't have that much spare memory. | |
ID: 61760 | Rating: 0 | rate: / Reply Quote | |
I appreciate the informative responses of BOTH Keith and Richard immediately below!! | |
ID: 61761 | Rating: 0 | rate: / Reply Quote | |
Generally, if I know I must reboot the system shortly, I will just wait until the current tasks are finished, or reboot shortly after a new task starts, so I won't begrudge the little time it has already spent crunching, which it will have to restart again after the reboot. | |
ID: 61762 | Rating: 0 | rate: / Reply Quote | |
The time bonus system has been in place w/ GPUGrid for years. (And yes, the GG tasks download several GIG of data... [WHY? well, another issue] and the download time does count against the deadline) | |
ID: 61763 | Rating: 0 | rate: / Reply Quote | |
Good point, but winDoze update does not make it easy to avoid IT's "decision" about updating and the time to restart my system. (Don't you hate it when big tech is so much more brilliant and all-knowing than you?) | |
ID: 61764 | Rating: 0 | rate: / Reply Quote | |
you're comparing multiple levels of apples vs oranges. | |
ID: 61765 | Rating: 0 | rate: / Reply Quote | |
GTX1660Ti is in no ways better "in no way" is an absolute statement, and is false: my NVIDIA has 33% more memory and double the cache. But, admittedly it is not in general as powerful | |
ID: 61766 | Rating: 0 | rate: / Reply Quote | |
maybe investigate what's going wrong with your system to cause the failures. How bizarre... Batting 1000, the GPUGrid tasks which fail on my system have ALSO failed on several, perhaps even 6 or 7, other systems (except when I take a newly issued task, and when I check those they also fail after bombing on my system). So, if the problem is on my end and not in any way on GPUGrid's end, then there must be dozens and dozens (and dozens) of other systems which apparently need to "investigate what's going wrong" with them... that could be the reason for your download errors also I have no "Download errors" except when I abort the download of a task which has already had repeated compute errors. GPUGrid needs 8 failures before figuring out that there are 'too many errors (may have bug)'. If I can, I'd rather give them this insight before I waste 5-10 minutes of time on my GPU, such as it is. Anyway, thanks for your feedback. Oh, and by the way, I run 12-13 other projects, including at least three others where I run GPU tasks. This very high error rate of tasks is NOT an issue whatsoever with any of them. LLP, PhD, Prof Engr | |
ID: 61767 | Rating: 0 | rate: / Reply Quote | |
GTX1660Ti is in no ways better Read up. My "absolute" statement is correct. Your 1660Ti has half the memory of a 3060. Your 1660Ti also has half the cache of the 3060. The GTX 1660 Ti has 6 GB; the RTX 3060 has 12 GB (you can see this in the host details you referenced). Not sure where you're getting your information, but it's wrong or missing context (like comparing to a laptop GPU or something). ____________ | |
ID: 61768 | Rating: 0 | rate: / Reply Quote | |
maybe investigate what's going wrong with your system to cause the failures. There are more people running Windows, so there is a higher probability for resends to land on another problematic Windows host. It's more common for Windows users to be running AV software, and it's common for Windows users to have issues with BOINC projects and AV software. It's not hard to imagine that these factors mean a large number of people would have problems when they all come to play. Check your AV settings, whitelist the BOINC data directories, and try again. ____________ | |
ID: 61769 | Rating: 0 | rate: / Reply Quote | |
your 1660Ti has half the memory of a 3060. My information is from GPUGrid's host information, https://gpugrid.net/show_host_detail.php?hostid=613323 which states 16 GB, but this may be unreliable, as TechPowerUp GPU-Z does give the NVIDIA site's number of 6 GB. My numbers for the cache also come from gpugrid.net/show_host_detail.php, as indeed do all the memory figures in my original post, so I guess my mistake was trusting the gpugrid.net/show_host_detail info. And no, this is not a laptop, but a 12-core desktop. | |
ID: 61770 | Rating: 0 | rate: / Reply Quote | |
your 1660Ti has half the memory of a 3060. TPU is not out of date, and is probably one of the most reliable databases for GPU (and other) specifications. There lies the issue: you're looking at system memory, not the GPU memory. System memory has little to do with GPUGRID tasks, which run on the GPUs and not the CPU. At all BOINC projects, the GPU VRAM is listed in parentheses next to the GPU model name on the Coprocessors line. For further context, there was a long-standing bug with BOINC versions older than about 7.18 that capped the Nvidia memory reported (not the actual memory) to only 4 GB, so old versions were wrong in what they reported for a long time. So still, the 3060 beats the 1660Ti in every metric. You just happened to have populated more system memory on the motherboard, but that has nothing to do with comparing the GPUs themselves. ____________ | |
ID: 61771 | Rating: 0 | rate: / Reply Quote | |
Windows users to have issues with BOINC projects Again, I run 12-13 other projects, including at least three others where I run GPU tasks. I have a zero error rate on the other projects. But I do appreciate your suggestion, as I like the science behind GPUGrid and would very much like to RUN tasks rather than have them error out. I have searched PC settings and Control Panel settings, as well as file options, for "AV" and do not get any relevant hits. Could you please elaborate on what you mean by AV settings and whitelisting the BOINC directories? Thanks. LLP, PhD, Prof Engr. | |
ID: 61772 | Rating: 0 | rate: / Reply Quote | |
AV = Anti Virus software. | |
ID: 61773 | Rating: 0 | rate: / Reply Quote | |
Switching back to BOINC software and (specifically) ATMML tasks. I've posted extensively in this thread about the problems of task duration estimation at this project. I've got some new data, which I can't explain.

Task | rsc_fpops_est | rsc_fpops_bound | flops | DCF | Runtime estimate
1 | 1E+18 | 1E+21 | 20,146,625,396,909 | 1.0000 | 49636 s (13.79 hours)
2 | - | - | - | - | -
3 | - | - | - | - | -
4 | 1E+18 | 1E+21 | 20,218,746,342,900 | 0.8351 | 41301 s (11.47 hours)
5 | - | - | - | - | -
6 | - | - | - | - | -
7 | 1E+18 | 1E+21 | 19,777,581,461,665 | 0.9931 | 50214 s (13.95 hours)
8 | - | - | - | - | -
9 | - | - | - | - | -
10 | 1E+18 | 1E+21 | 19,446,193,249,403 | 0.8926 | 45900 s (12.75 hours)
11 | 1E+18 | 1E+21 | 19,506,082,146,580 | 0.8247 | 42279 s (11.74 hours)
12 | 1E+18 | 1E+21 | 19,522,515,301,144 | 0.7661 | 39242 s (10.90 hours)
13 | - | - | - | - | -
14 | 1E+18 | 1E+20 | 99,825,140,137 | 0.7585 | 7598256 s (87.94 days)
15 | - | - | - | - | -
16 | 1E+18 | 1E+21 | 99,825,140,137 | 0.7360 | 7373243 s (85.34 days)
17 | 1E+18 | 1E+21 | 99,825,140,137 | 0.7287 | 7300045 s (84.49 days)
18 | 1E+18 | 1E+21 | 99,825,140,137 | 0.7215 | 7227478 s (83.65 days)

My understanding of the BOINC server code is that, for a mature app_version (Linux ATMML has been around for 2 months), the initial estimates should be based on the average speed of the tasks so far across the project as a whole. So it seems reasonable that the initial estimates were for 10-12 hours - that's about what I expected for this GPU. Then, after the first 11 tasks have been reported successful, it should switch to the average for this specific host. So why does it appear that this particular host is reporting a speed something like 1/200th of the project average? So now, it's frantically attempting to compensate by driving my DCF through the floor, as with my two older machines.

The absolute values are interesting too. The initial (project-wide) flops estimates are hovering around 20,000 GFlops - does that sound right, for those who know the hardware in detail? And they are fluctuating a bit, as might be expected for an average with variable task durations for completions. After the transition, my card dropped to below 100 GFlops - and has remained rock-steady. That's not in the script. The APR for the card (which should match the flops figure for allocated tasks) is 35599.725995644 GFlops - which doesn't match any of the figures above.

Where does this take us? And what, if anything, can we do about it? I'll try to get my head round the published BOINC server code on GitHub, but this area is notoriously complex. And the likelihood is that the current code differs to a greater or lesser extent from the code in use at this project. I invite others of similarly inquisitive mind to join in with suggestions. | |
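For anyone who wants to reproduce the estimates in the table, a short sketch using rows 1 and 18 (just rsc_fpops_est / flops x DCF, as discussed earlier in the thread); it also shows the roughly 200-fold drop in the reported speed:

rsc_fpops_est = 1e18

# Row 1: initial, project-wide speed estimate
flops_before, dcf_before = 20_146_625_396_909.0, 1.0000
est_before = rsc_fpops_est / flops_before * dcf_before   # ~49,636 s (13.79 hours)

# Row 18: after the switch to the host-specific figure
flops_after, dcf_after = 99_825_140_137.0, 0.7215
est_after = rsc_fpops_est / flops_after * dcf_after      # ~7.23e6 s, about 83.7 days

print(f"before: {est_before / 3600:.2f} hours")
print(f"after:  {est_after / 86400:.1f} days")
print(f"speed ratio: {flops_before / flops_after:.0f}x")  # ~202x, matching the '1/200th' above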
ID: 61774 | Rating: 0 | rate: / Reply Quote | |
I didn't run ATMML, but I'm currently running Qchem on a Tesla P100 with short run times (averaging somewhere around 12 minutes or so per task). I see a similar behavior/pattern when starting a new instance. If I were to guess, your DCF will eventually go down from the last value of 0.7215 to 0.01 after running 100+ tasks, and your final estimated run time could be about 1.16 days, which is still higher than the average expected run time for your card. However, if you run a CPU benchmark, the DCF will go up from 0.01 to something higher and will take another 100+ tasks to come down to 0.01 again, but this time the estimated run time will go below 1.16 days. | |
ID: 61775 | Rating: 0 | rate: / Reply Quote | |
This is very interesting, thank you for the numbers. | |
ID: 61776 | Rating: 0 | rate: / Reply Quote | |
Well, I said that this is going to be difficult ... | |
ID: 61777 | Rating: 0 | rate: / Reply Quote | |
08/09/2024 21:08:27 | GPUGRID | [error] Error reported by file upload server: Server is out of disk space | |
ID: 61778 | Rating: 0 | rate: / Reply Quote | |
08/09/2024 21:08:27 | GPUGRID | [error] Error reported by file upload server: Server is out of disk space This has happened at irregular intervals over all the years - the last time about 2 weeks ago. Hard to believe how difficult it must be to take measures against it. | |
ID: 61783 | Rating: 0 | rate: / Reply Quote | |
Well. How does one manage to complete a unit in 5 min? | |
ID: 61790 | Rating: 0 | rate: / Reply Quote | |
Well. How does one manage to complete a unit in 5 min? If you are referring to this post https://www.gpugrid.net/forum_thread.php?id=5468&nowrap=true#61786, Steve posted on the GPUGRID Discord that there are still tasks being generated by the older batch, from before the code was updated. I don't know how long before these older tasks will be flushed out of the system, but it has now been more than 21 days. | |
ID: 61791 | Rating: 0 | rate: / Reply Quote | |
Could it be that my Quadro P5000 is unable to crunch ATMMLs? | |
ID: 61799 | Rating: 0 | rate: / Reply Quote | |
I think a P5000 should work, given that it is the same generation as a 1080 which I have confirmed to work. It may be that your drivers are too old. I have a 1080 with driver version 536. | |
ID: 61800 | Rating: 0 | rate: / Reply Quote | |
I think a P5000 should work, given that it is the same generation as a 1080 which I have confirmed to work. It may be that your drivers are too old. I have a 1080 with driver version 536. thanks, Steve, for your quick reply. Some 5 hours ago, I started another task - and it is still running :-) So I keep my fingers crossed that it will finish successfully. No idea why the other two ones before failed. BTW: the driver is 537.99 | |
ID: 61801 | Rating: 0 | rate: / Reply Quote | |
Hi,
Stderr output
<core_client_version>8.0.2</core_client_version>
<![CDATA[
<message>
(unknown error) (0) - exit code 195 (0xc3)</message>
<stderr_txt>
09:59:48 (19024): wrapper (7.9.26016): starting
09:59:48 (19024): wrapper: running Library/usr/bin/tar.exe (xjvf input.tar.bz2)
aceforce_dft_v0.4.ckpt | |
ID: 61816 | Rating: 0 | rate: / Reply Quote | |
You have to read a long way further down to find the real answer to your question!
Traceback (most recent call last):
  File "D:\ProgramData\BOINC\slots\2\Scripts\rbfe_explicit_sync.py", line 11, in <module>
    rx.scheduleJobs()
  File "D:\ProgramData\BOINC\slots\2\Lib\site-packages\sync\atm.py", line 126, in scheduleJobs
    self.worker.run(replica)
  File "D:\ProgramData\BOINC\slots\2\Lib\site-packages\sync\worker.py", line 124, in run
    raise RuntimeError(f"Simulation failed {ntry} times!")
RuntimeError: Simulation failed 5 times!
That looks like something is wrong in the way that job was set up by the project - that's not your fault, and there's nothing you can do about it except report it here - and move on to the next one. | |
ID: 61817 | Rating: 0 | rate: / Reply Quote | |
Today, I've so far had 3 tasks which were "aborted by server" after several hours' runtime. Which is not really nice. | |
ID: 61834 | Rating: 0 | rate: / Reply Quote | |
Today, I've so far had 3 tasks which were "aborted by server" after several hours' runtime. Which is not really nice. I agree. | |
ID: 61835 | Rating: 0 | rate: / Reply Quote | |
Today, I've so far had 3 tasks which were "aborted by server" after several hours' runtime. Which is not really nice. I came here to report exactly the same thing. In the last hour, I've had 6 ATMML tasks which went wrong, and I only have 5 video cards! Two were a relatively quick 'Error while computing' - perhaps around the 5% mark. Four were 'Cancelled by server', after runs from 14 ksec to 50 ksec. I'm switching to Quantum Chemistry for the moment, until we get a handle on what the problem is. Edit - there goes another one: 'Error while computing' around the 5% mark. | |
ID: 61836 | Rating: 0 | rate: / Reply Quote | |
Today, I've so far had 3 tasks which were "aborted by server" after several hours' runtime. Which is not really nice. Something is strange. The work queue was over 800 tasks yesterday, now it's 7. | |
ID: 61837 | Rating: 0 | rate: / Reply Quote | |
There was a message from Quico posted on the GPUGrid Discord server. They cancelled some parts of the project as they need to finish other parts quickly. They will resend those later. | |
ID: 61838 | Rating: 0 | rate: / Reply Quote | |
They cancelled some parts of the project as they need to finish other parts quickly. Let's hope the ones they need to finish quickly aren't the most recent ones that end with 'Error while computing'. My 'Error while computing' today are all replication _0, type CDK2 or - a recent one - NEW_CDK2. | |
ID: 61839 | Rating: 0 | rate: / Reply Quote | |
They cancelled some parts of the project as they need to finish other parts quickly. If you're using MPS with Nvidia cards, I have seen that killing or stopping tasks while they are loaded into GPU memory can really screw things up and cause newly started tasks to fail as well. That was happening after the server cancellations, so I am restarting my PCs. | |
ID: 61841 | Rating: 0 | rate: / Reply Quote | |
It doesn't really feel like that sort of problem, but I'll keep an eye on it and restart if it happens again. | |
ID: 61842 | Rating: 0 | rate: / Reply Quote | |
At the moment, the machines seem happy on QC tasks - and that seems to be the work that they want crunching. But this strategy does not work the way it was probably supposed to, as QC tasks are not available for Windows crunchers (and those are still the majority). Another indication for this intention seems to be that they reduced the credit points for ATMML by 50 % - as I noticed with the last few tasks that were not aborted by the server and hence could finish - see here: https://www.gpugrid.net/result.php?resultid=36020678 | |
ID: 61843 | Rating: 0 | rate: / Reply Quote | |
Today, I've so far had 3 tasks which were "aborted by server" after several hours' runtime. Which is not really nice. I have added up the processing time from my hosts for 11 ATMML tasks "Aborted by server" on past three days. About 135 hours. Really not nice. The tradition used to be that only not started tasks were aborted, and started ones were allowed to finish. Heavy reasons for breaking this tradition, I suppose. | |
ID: 61844 | Rating: 0 | rate: / Reply Quote | |
Another indication for this intention seems to be that they reduced the credit points for ATMML by 50 % I've seen that well before the cancellations started - but only for tasks which ran significantly quicker that the ones we're used to. I think that's fair. | |
ID: 61845 | Rating: 0 | rate: / Reply Quote | |
135 hours is a drag! Sorry to hear. We want our GPUs productive and not just hamster wheels. | |
ID: 61846 | Rating: 0 | rate: / Reply Quote | |
Another indication for this intention seems to be that they reduced the credit points for ATMML by 50 % I agree that for significantly shorter tasks the credit is lower, no question. But the task which I cited was one of the "long ones". | |
ID: 61849 | Rating: 0 | rate: / Reply Quote | |
135 hours is a drag! Sorry to hear. We want our GPUs productive and not just hamster wheels. + 1 | |
ID: 61850 | Rating: 0 | rate: / Reply Quote | |
I just processed my first one after getting my computer back online. | |
ID: 61860 | Rating: 0 | rate: / Reply Quote | |
In the past few days, there have been a lot of ATMML tasks which failed after a few minutes. It's the sub-type PTP1B... | |
ID: 61939 | Rating: 0 | rate: / Reply Quote | |
Yes, that batch was badly configured. Seems to have been cleared out and expired relatively soon. I'm on to a different MCL1 batch that is working correctly. | |
ID: 61941 | Rating: 0 | rate: / Reply Quote | |
Message boards : Number crunching : ATMML