ATMML

Bedrich Hajek Send message Joined: 28 Mar 09 Posts: 485 Credit: 11,079,048,466 RAC: 15,671,380 Level Scientific publications	Message 61582 - Posted: 5 Jul 2024 \| 14:15:37 UTC
	I just finished crunching a task for this new application successfully. https://www.gpugrid.net/result.php?resultid=35379717 What exactly are we crunching here?

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,649,638,294 RAC: 13,283,785 Level Scientific publications	Message 61583 - Posted: 5 Jul 2024 \| 15:08:01 UTC
	Going by the name of the app, it somehow uses machine learning.

Erich56 Send message Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level Scientific publications	Message 61584 - Posted: 6 Jul 2024 \| 16:38:31 UTC - in response to Message 61582.
	"I just finished crunching a task for this new application successfully." How did you manage to download such a task? The list in which you can choose from the various subprojects does NOT include ATMML.

Steve Volunteer moderator Project administrator Project developer Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 21 Dec 23 Posts: 46 Credit: 0 RAC: 0 Level Scientific publications	Message 61585 - Posted: 6 Jul 2024 \| 18:36:43 UTC
	This is an app in testing mode; it does not yet appear as one you can select. You will only get the WUs if you have opted in to run the test applications. It is a different version of the existing ATM app that includes machine-learning-based force fields for the molecular dynamics.

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,649,638,294 RAC: 13,283,785 Level Scientific publications	Message 61586 - Posted: 6 Jul 2024 \| 19:59:54 UTC - in response to Message 61585.
	Thanks for the progress update and explanation of just what kind of ML is being used for the ATM tasks, Steve. I see also you released a new beta ATM app yesterday to go along with the ATMML app. Already did one of those today.

Drago Send message Joined: 3 May 20 Posts: 18 Credit: 831,594,060 RAC: 3,662,077 Level Scientific publications	Message 61594 - Posted: 15 Jul 2024 \| 9:34:47 UTC - in response to Message 61586.
	Is it Windows, Linux or both?

ServicEnginIC Send message Joined: 24 Sep 10 Posts: 581 Credit: 9,762,712,024 RAC: 21,245,153 Level Scientific publications	Message 61595 - Posted: 15 Jul 2024 \| 10:32:05 UTC
	You can verify OS compatibility for the different applications on the GPUGRID apps page.

Bedrich Hajek Send message Joined: 28 Mar 09 Posts: 485 Credit: 11,079,048,466 RAC: 15,671,380 Level Scientific publications	Message 61602 - Posted: 17 Jul 2024 \| 1:10:16 UTC
	I noticed that this batch of ATMML units takes almost 3 times longer to complete than the previous batches. I suspended one of them, and when I resumed it, it would not start. I kept "running" it for over an hour with no progress, so I had no option but to abort it.

Pascal Send message Joined: 15 Jul 20 Posts: 77 Credit: 1,558,772,434 RAC: 11,364,608 Level Scientific publications	Message 61604 - Posted: 17 Jul 2024 \| 7:45:34 UTC - in response to Message 61602.
	They really do take a very long time to compute. I am going to stop them too. 9h20 on my RTX 4060 and 14h20 on my RTX A2000.

Pascal Send message Joined: 15 Jul 20 Posts: 77 Credit: 1,558,772,434 RAC: 11,364,608 Level Scientific publications	Message 61607 - Posted: 17 Jul 2024 \| 20:37:40 UTC
	I cancelled the 4 ATMML tasks I had because they take too long to compute: between 16 and 24 hours. PROGRAMMERS, I hope you will look into this problem?

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,649,638,294 RAC: 13,283,785 Level Scientific publications	Message 61609 - Posted: 18 Jul 2024 \| 16:30:26 UTC
	I didn't have any issues with the new ATMML tasks I received. I rescued one at "the last chance saloon" as the _7 wingman. They don't seem to have an "unreasonable" crunch time for the hardware used: about 7 hours or so. I've had acemd tasks that went for 12-14 hours before.

mmonnin Send message Joined: 2 Jul 16 Posts: 337 Credit: 7,617,663,350 RAC: 11,036,158 Level Scientific publications	Message 61611 - Posted: 20 Jul 2024 \| 1:46:11 UTC
	I don't recall a larger executable from a BOINC project. 4.67 GB! That is larger than some LHC VDI files.

roundup Send message Joined: 11 May 10 Posts: 63 Credit: 9,096,655,193 RAC: 54,351,602 Level Scientific publications	Message 61612 - Posted: 20 Jul 2024 \| 4:15:20 UTC
	I have had 57 units so far without a single error. Great! The fastest unit took 4,197 seconds (1.17 hours) on a 4080 Super; the longest one took a bit over 30,000 seconds (8.33 hours) on a 4060 Ti. More than reasonable.

Drago Send message Joined: 3 May 20 Posts: 18 Credit: 831,594,060 RAC: 3,662,077 Level Scientific publications	Message 61615 - Posted: 22 Jul 2024 \| 13:05:22 UTC
	Hello everyone! My four hosts running 3060, 3060 Ti and 3070 Ti cards have not been able to complete a single unit so far. They all fail at the very beginning with the following STDERR output: "Error loading cuda module". I am running Linux Mint and Ubuntu with Nvidia driver 470. The newer drivers produce errors in other projects, so I decided to stick with that driver version. I noticed that a lot of my wingmen successfully crunch the units with driver 530 or 535. Is that a driver issue? All other projects run just fine on version 470.
	Warning: importing 'simtk.openmm' is deprecated. Import 'openmm' instead.
	Traceback (most recent call last):
	  File "/var/lib/boinc-client/slots/24/bin/rbfe_explicit_sync.py", line 10, in <module>
	    rx.setupJob()
	  File "/var/lib/boinc-client/slots/24/lib/python3.11/site-packages/sync/atm.py", line 85, in setupJob
	    self.worker = OMMWorkerATM(ommsystem, self.config, self.logger)
	                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	  File "/var/lib/boinc-client/slots/24/lib/python3.11/site-packages/sync/worker.py", line 34, in __init__
	    self.simulation = Simulation(self.topology, self.ommsystem.system, self.integrator, platform, properties)
	                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	  File "/var/lib/boinc-client/slots/24/lib/python3.11/site-packages/openmm/app/simulation.py", line 106, in __init__
	    self.context = mm.Context(self.system, self.integrator, platform, platformProperties)
	                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	  File "/var/lib/boinc-client/slots/24/lib/python3.11/site-packages/openmm/openmm.py", line 12171, in __init__
	    _openmm.Context_swiginit(self, _openmm.new_Context(*args))
	                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
	openmm.OpenMMException: Error loading CUDA module: CUDA_ERROR_UNSUPPORTED_PTX_VERSION (222)

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1069 Credit: 40,231,533,983 RAC: 527 Level Scientific publications	Message 61616 - Posted: 22 Jul 2024 \| 15:07:38 UTC - in response to Message 61615.
	With that error, yes, I would assume the old driver version is the issue.
	CUDA historically has not been forward compatible: a CUDA 10 binary could not run on a system with only CUDA 8 drivers. The opposite was true in most cases: backward compatibility is fine, and you can run even very old CUDA code with the latest drivers.
	Only starting with CUDA 11.1 was forward compatibility introduced, and only within the same major version. So a system with only CUDA 11.1 drivers could still run binaries up to CUDA 11.8. The same goes for CUDA 12, where all CUDA 12 drivers will be compatible with all CUDA 12+ binaries.
	I have a feeling that some parts of this new ATMML app, probably OpenMM in particular (based on what's throwing the error), actually require CUDA 12+ drivers, and that the app is misidentified at the project as being CUDA 11 compatible. You could test this by installing the newer drivers and seeing whether the tasks then run.
	Which other project has issues with the newer drivers?

Pascal Send message Joined: 15 Jul 20 Posts: 77 Credit: 1,558,772,434 RAC: 11,364,608 Level Scientific publications	Message 61617 - Posted: 22 Jul 2024 \| 16:10:31 UTC - in response to Message 61616.
	On my machine the system's stock drivers work very well. They are the 535 drivers supplied with the Linux Mint install. https://www.gpugrid.net/results.php?userid=563937

Drago Send message Joined: 3 May 20 Posts: 18 Credit: 831,594,060 RAC: 3,662,077 Level Scientific publications	Message 61619 - Posted: 23 Jul 2024 \| 12:25:52 UTC - in response to Message 61617.
	"On my machine the system's stock drivers work very well. They are the 535 drivers supplied with the Linux Mint install."
	I tried to install the 535 driver, but after that my GPU is no longer recognised by Amicable, Einstein and Asteroids. GPUgrid lets me start new WUs, but they fail after 43 seconds saying that no Nvidia GPU was found. Do I have to install additional libraries or something like that? I also noticed that there is an open driver package from Nvidia, a regular meta package and a server version of that driver. Which one are you guys using?

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1069 Credit: 40,231,533,983 RAC: 527 Level Scientific publications	Message 61620 - Posted: 23 Jul 2024 \| 13:13:11 UTC - in response to Message 61619.
	"I tried to install the 535 driver, but after that my GPU is no longer recognised by Amicable, Einstein and Asteroids. [...] Do I have to install additional libraries or something like that? [...] Which one are you guys using?"
	If you're running OpenCL applications then yes, you need an additional OpenCL package:
	sudo apt install ocl-icd-libopencl1
	The 535 drivers work fine for Einstein; most of my hosts are on that driver and I contribute primarily to Einstein.

Pascal Send message Joined: 15 Jul 20 Posts: 77 Credit: 1,558,772,434 RAC: 11,364,608 Level Scientific publications	Message 61621 - Posted: 23 Jul 2024 \| 15:40:38 UTC - in response to Message 61620.
	I don't use any additional packages. I installed Linux Mint normally and applied the system and driver updates. I installed the 535 drivers through the driver manager and everything works fine. BOINC recognizes my RTX 4060, my RTX A2000 and my GTX 1650 in the same PC. I crunch for GPUGRID and Amicable Numbers without problems. Either your system installation is broken or you have a hardware problem.

Pascal Send message Joined: 15 Jul 20 Posts: 77 Credit: 1,558,772,434 RAC: 11,364,608 Level Scientific publications	Message 61622 - Posted: 23 Jul 2024 \| 15:55:08 UTC
	To start with, I advise you to test your RAM modules with the free version of MemTest86 and not another program. It works very well and is reliable. https://www.memtest86.com/

Pascal Send message Joined: 15 Jul 20 Posts: 77 Credit: 1,558,772,434 RAC: 11,364,608 Level Scientific publications	Message 61623 - Posted: 23 Jul 2024 \| 15:57:56 UTC
	When you install the 535 driver under Linux, it also installs everything needed for crunching, that is, Nvidia OpenCL and CUDA, so there is nothing else to install. Just the 535 driver and that's it.

Pascal Send message Joined: 15 Jul 20 Posts: 77 Credit: 1,558,772,434 RAC: 11,364,608 Level Scientific publications	Message 61624 - Posted: 23 Jul 2024 \| 16:46:34 UTC
	I also advise you to test your hard drive to see whether it has any bad clusters. You need to run a surface test with the free HD Tune 2.55 or another program such as CrystalDiskInfo. https://www.hdtune.com/download.html

Pascal Send message Joined: 15 Jul 20 Posts: 77 Credit: 1,558,772,434 RAC: 11,364,608 Level Scientific publications	Message 61625 - Posted: 23 Jul 2024 \| 16:48:16 UTC
	When troubleshooting a PC I always start with these 2 things.

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1069 Credit: 40,231,533,983 RAC: 527 Level Scientific publications	Message 61626 - Posted: 23 Jul 2024 \| 17:45:43 UTC - in response to Message 61623.
	It may be true for Mint that the OpenCL components are installed with the normal driver package. But that is not the case for Ubuntu, where you do need to install the OpenCL components separately with the command in my previous post.

Drago Send message Joined: 3 May 20 Posts: 18 Credit: 831,594,060 RAC: 3,662,077 Level Scientific publications	Message 61631 - Posted: 24 Jul 2024 \| 13:58:43 UTC Last modified: 24 Jul 2024 \| 13:59:35 UTC
	OK, I managed to get one of my Ubuntu hosts running with the 535 driver and the additional OpenCL libraries installed, like you said, Ian&Steve. It has been crunching an ATMML unit for 5h so far, and it's looking good! I am surprised to see that you seemingly need OpenCL to run CUDA code, because that is the only difference from my previous attempts. So far no luck with my Mint laptop. If I change the driver to any version other than 470, the GPU is no longer detected. That may also have to do with the AMD driver for the on-board AMD GPU. Maybe they interfere with each other. I will try my other hosts this weekend. Thanks for the help.

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1069 Credit: 40,231,533,983 RAC: 527 Level Scientific publications	Message 61632 - Posted: 24 Jul 2024 \| 14:13:39 UTC - in response to Message 61631.
	You don't need the OpenCL driver components to run true CUDA code. But a lot of other projects are not running apps compiled in CUDA, but rather in OpenCL.

Aurum Send message Joined: 12 Jul 17 Posts: 401 Credit: 16,755,010,632 RAC: 220,113 Level Scientific publications	Message 61633 - Posted: 25 Jul 2024 \| 6:20:50 UTC - in response to Message 61615.
	"...running 3060, 3060ti and 3070ti were not able to complete a single unit so far."
	It's hit or miss for 1080 Ti, 2070 Ti, and 3060 Ti to successfully complete ATMML WUs. They have no problem with 1.05 QC, so I dedicate them to QC. My 2080 Ti, 3080, and 3080 Ti GPUs have no problem running ATMML, so they're dedicated to ATMML. All are running Linux Mint with Nvidia 550.54.14.

Steve Volunteer moderator Project administrator Project developer Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 21 Dec 23 Posts: 46 Credit: 0 RAC: 0 Level Scientific publications	Message 61634 - Posted: 25 Jul 2024 \| 8:14:34 UTC Last modified: 25 Jul 2024 \| 8:24:13 UTC
	Hello. This app is actually built with cudatoolkit version 11.8.
	For the reasons indicated here: https://docs.nvidia.com/deploy/cuda-compatibility/#application-considerations-for-minor-version-compatibility the minimum driver version is not 450.80.02 as stated in Table 1 of that link, but most likely 520, which was released with 11.8: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#id5
	This app uses OpenMM, which uses PTX code. Hence, if your driver is too old, you will see the error: CUDA_ERROR_UNSUPPORTED_PTX_VERSION (222)
	For reference, my test machine is a GTX 1080 with driver version 545.
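
For anyone who wants to check a host against that threshold, here is a minimal sketch in Python. It assumes the "most likely 520" figure from the post above is the right cut-off (it is not an official minimum), and the helper names are made up for illustration. The `nvidia-smi` query is only wrapped in a function, not called, so the comparison logic can be tried on machines without an Nvidia GPU.

```python
import subprocess
from typing import List

# Assumption: 520 is the "most likely" minimum driver branch quoted in the post,
# not an officially documented requirement of the ATMML app.
MIN_DRIVER_MAJOR = 520


def driver_major(version: str) -> int:
    """Return the major component of a driver string such as '535.183.01'."""
    return int(version.strip().split(".")[0])


def driver_is_new_enough(version: str, minimum_major: int = MIN_DRIVER_MAJOR) -> bool:
    """True if the driver branch is at or above the assumed minimum."""
    return driver_major(version) >= minimum_major


def query_driver_versions() -> List[str]:
    """Ask nvidia-smi for the driver version, one line per GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]
```

On a real host, something like `all(driver_is_new_enough(v) for v in query_driver_versions())` would tell you whether every GPU's driver clears the assumed threshold; the 470 branch mentioned earlier in the thread would not.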

Drago Send message Joined: 3 May 20 Posts: 18 Credit: 831,594,060 RAC: 3,662,077 Level Scientific publications	Message 61646 - Posted: 31 Jul 2024 \| 11:34:25 UTC
	Until yesterday I received the ATMML WUs and my hosts crunched them successfully with driver version 535. Today I am not getting any new ones. The server says there are none available, but on the server main page there are always around 300 available for download. And that number fluctuates, so others are getting them. Does anybody else have that problem?

roundup Send message Joined: 11 May 10 Posts: 63 Credit: 9,096,655,193 RAC: 54,351,602 Level Scientific publications	Message 61647 - Posted: 31 Jul 2024 \| 12:54:39 UTC - in response to Message 61646.
	"Does anybody else have that problem?"
	No. Since the new batch arrived I have had a constant supply on two machines with drivers 535 and 550. The new work units seem to take much longer to calculate (with higher credits):
	11700s on 4080 Super
	12000s on 4080
	14700s on 4070Ti

Drago Send message Joined: 3 May 20 Posts: 18 Credit: 831,594,060 RAC: 3,662,077 Level Scientific publications	Message 61648 - Posted: 31 Jul 2024 \| 15:28:07 UTC
	Huh... Do you know how much vram they need? My GPUs are limited to 8 GB.

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,649,638,294 RAC: 13,283,785 Level Scientific publications	Message 61649 - Posted: 31 Jul 2024 \| 15:45:03 UTC - in response to Message 61648.
	"Huh... Do you know how much vram they need? My GPUs are limited to 8 GB."
	8 GB is plenty. They use 3-4 GB at most, and only some of the time.

Speedy Send message Joined: 19 Aug 07 Posts: 43 Credit: 31,091,082 RAC: 4,197 Level Scientific publications	Message 61650 - Posted: 1 Aug 2024 \| 3:23:45 UTC - in response to Message 61594.
	"Is it Windows, Linux or both?"
	Unfortunately it is only for Linux, according to the applications page.

mrchips Send message Joined: 9 May 21 Posts: 16 Credit: 1,380,930,500 RAC: 2,815,839 Level Scientific publications	Message 61651 - Posted: 2 Aug 2024 \| 11:25:00 UTC
	Still waiting for, and wanting, some Windows tasks.

Dmit Send message Joined: 12 Sep 10 Posts: 8 Credit: 157,470,524 RAC: 139,900 Level Scientific publications	Message 61670 - Posted: 11 Aug 2024 \| 23:25:47 UTC - in response to Message 61619. Last modified: 11 Aug 2024 \| 23:31:13 UTC
	"I tried to install the 535 driver but after that my GPU is no longer recognised by Amicable, Einstein and Asteroids. GPUgrid lets me start new wus but they fail after 43 seconds saying that no Nvidia GPU was found."
	It looks like OpenCL is not installed with the distro's 535 drivers. You need to download the official installer with the .run extension from the Nvidia website and install it manually, not through the distro's driver manager. Something like this method, but with a newer .run version of course: https://askubuntu.com/questions/66328/how-do-i-install-the-latest-nvidia-drivers-from-the-run-file

Opolis Send message Joined: 19 Feb 12 Posts: 3 Credit: 1,193,303,370 RAC: 9,015,094 Level Scientific publications	Message 61685 - Posted: 22 Aug 2024 \| 16:59:03 UTC
	These tasks are running fine for me so far. The only thing I noticed was that the points awarded seem off. The second task I completed took an hour longer than the first but received 900k fewer points. So far they have been taking 5-6 hours on a 3080ti, driver version 535.183.01.

Greger Send message Joined: 6 Jan 15 Posts: 76 Credit: 24,192,102,249 RAC: 13,992,829 Level Scientific publications	Message 61686 - Posted: 22 Aug 2024 \| 18:07:41 UTC - in response to Message 61685.
	It got lower credit because it took longer from receiving the task to returning the result:
	Received: 19 Aug 2024, 19:41:09 UTC
	Returned: 21 Aug 2024, 20:35:33 UTC
	That is more than 24 hours, and therefore the points were reduced.

Bedrich Hajek Send message Joined: 28 Mar 09 Posts: 485 Credit: 11,079,048,466 RAC: 15,671,380 Level Scientific publications	Message 61687 - Posted: 22 Aug 2024 \| 18:13:19 UTC - in response to Message 61685.
	"These tasks are running fine for me so far. The only thing I noticed was that the points awarded seem off. The second task I completed took an hour longer than the first but received 900k fewer points. So far they have been taking 5-6 hours on a 3080ti, driver version 535.183.01."
	The points are accurate. You get a 50% bonus if you finish the task successfully and return the results within 24 hours of downloading it. There is a 25% bonus if you do it within 48 hours. No bonus if you return it after 48 hours. This is an incentive for quick return of results.
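
The bonus rule described in this post can be written down as a small Python sketch. The function name is hypothetical, and treating the 24 h and 48 h boundaries as inclusive is an assumption; the project's actual credit code is not shown in this thread. The example timestamps are the ones quoted earlier for the task that received fewer points.

```python
from datetime import datetime, timedelta


def bonus_multiplier(received: datetime, returned: datetime) -> float:
    """Credit multiplier as described in the post: +50% within 24 h, +25% within 48 h.

    Assumption: the boundaries are inclusive; the real server-side rule may differ.
    """
    turnaround = returned - received
    if turnaround <= timedelta(hours=24):
        return 1.5
    if turnaround <= timedelta(hours=48):
        return 1.25
    return 1.0


# Timestamps quoted earlier in the thread for the lower-credit task.
received = datetime(2024, 8, 19, 19, 41, 9)
returned = datetime(2024, 8, 21, 20, 35, 33)
print(returned - received)                    # turnaround time
print(bonus_multiplier(received, returned))   # multiplier under the assumed rule
```

With those timestamps the turnaround is a little over two days, so under this reading the task earned no bonus at all, which is consistent with the large gap in points.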

Richard Send message Joined: 13 Jan 24 Posts: 2 Credit: 27,850,000 RAC: 293,591 Level Scientific publications	Message 61689 - Posted: 22 Aug 2024 \| 18:24:03 UTC
	This task has been "downloading" for almost 4 hrs, has 0% completion and an estimated run time of 400+ days. Looks like it needs killing? Running on Win 11, high-end machine.
	Richard
	==============================
	Application: ATMML: Free energy with neural networks 1.01 (cuda1121)
	Name: TYK2_A02_A09_r0_2-QUICO_ATM_AF_04_Benchmark-12-20-RND2746
	State: Downloading
	Received: 2024-08-22 7:40:17 AM
	Report deadline: 2024-08-27 7:40:19 AM
	Resources: 0.996 CPUs + 1 NVIDIA GPU
	Estimated computation size: 1,000,000,000 GFLOPs
	Executable: wrapper_6.1_windows_x86_64.exe

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,649,638,294 RAC: 13,283,785 Level Scientific publications	Message 61690 - Posted: 22 Aug 2024 \| 18:29:45 UTC - in response to Message 61689.
	Patience . . . . grasshopper. This is a new app for Windows hosts so you are competing for download bandwidth with the cohort of other Windows hosts. Which are many. The task runtime estimation is not accurate until your host has returned 11 valid tasks of that type to develop a correct and accurate APR rate. Only then will the estimated runtimes be correct.
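
To see why the first estimate can be so far off, here is a rough Python sketch of the arithmetic. It assumes the client's estimate is essentially the task's estimated computation size divided by an assumed processing speed, which is a simplification of what BOINC does. The 1,000,000,000 GFLOPs size and the 400-day estimate come from the post being answered; the 5-hour real run time is a hypothetical figure in line with run times reported elsewhere in this thread.

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_HOUR = 3_600


def implied_speed_gflops(size_gflops: float, runtime_seconds: float) -> float:
    """Speed (GFLOPS) that makes `size_gflops` of work take `runtime_seconds`."""
    return size_gflops / runtime_seconds


size = 1_000_000_000  # GFLOPs, "Estimated computation size" from the task properties

# Speed the client must be assuming to arrive at a 400-day estimate.
assumed = implied_speed_gflops(size, 400 * SECONDS_PER_DAY)

# Hypothetical: speed implied if the same task actually finishes in 5 hours.
actual = implied_speed_gflops(size, 5 * SECONDS_PER_HOUR)

print(f"assumed speed: {assumed:,.1f} GFLOPS")
print(f"implied real speed: {actual:,.1f} GFLOPS")
print(f"estimate is off by a factor of about {actual / assumed:,.0f}")
```

Until the host has returned enough valid tasks for the server to measure a realistic rate, the estimate is driven by the placeholder speed rather than by what the GPU can really do.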

Richard Send message Joined: 13 Jan 24 Posts: 2 Credit: 27,850,000 RAC: 293,591 Level Scientific publications	Message 61692 - Posted: 22 Aug 2024 \| 20:11:48 UTC - in response to Message 61689.
	Looks like it finally started and ran for a few minutes, then uploaded... Richard

Opolis Send message Joined: 19 Feb 12 Posts: 3 Credit: 1,193,303,370 RAC: 9,015,094 Level Scientific publications	Message 61693 - Posted: 22 Aug 2024 \| 22:21:42 UTC - in response to Message 61687.
	"The points are accurate. You get a 50% bonus, if you finish the task successfully and return the results within 24 hours from downloading it. There is a 25% bonus if you do it within 48 hours. No bonus if you return it after 48 hours. This is an incentive for quick return of results."
	Ah, you are correct. I had the one task stuck in "downloading" for a while and I didn't run it until the next day.

WPrion Send message Joined: 30 Apr 13 Posts: 96 Credit: 2,638,034,111 RAC: 20,338,166 Level Scientific publications	Message 61694 - Posted: 23 Aug 2024 \| 1:13:04 UTC
	Are there no checkpoints on ATMML tasks? I was about 30% complete when I had to suspend the task and shut down the computer. When I restarted both the % done and elapsed time were zero.

Bedrich Hajek Send message Joined: 28 Mar 09 Posts: 485 Credit: 11,079,048,466 RAC: 15,671,380 Level Scientific publications	Message 61695 - Posted: 23 Aug 2024 \| 2:00:00 UTC - in response to Message 61694.
	"Are there no checkpoints on ATMML tasks? I was about 30% complete when I had to suspend the task and shut down the computer. When I restarted both the % done and elapsed time were zero."
	No, there are not. The same goes for quantum chemistry and ATM. They haven't figured out how to do it yet.

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,820,144,351 RAC: 19,471,837 Level Scientific publications	Message 61696 - Posted: 23 Aug 2024 \| 7:14:23 UTC
	I hope this doesn't backfire. This morning I see 800 tasks in progress, but zero ready to send. My last two downloads have been replica _3 tasks, each WU having failed on three Windows machines first. I do hope new Windows users pay attention to the 'tricks of the trade' we've learned over the years:
	* small cache, especially with slower GPUs.
	* run continuously, don't allow interruptions (especially auto-updates)
	* don't swap to a different GPU type mid-run

ServicEnginIC Send message Joined: 24 Sep 10 Posts: 581 Credit: 9,762,712,024 RAC: 21,245,153 Level Scientific publications	Message 61697 - Posted: 23 Aug 2024 \| 10:32:49 UTC - in response to Message 61696.
	"I do hope new Windows users pay attention to the 'tricks of the trade' we've learned over the years:"
	Thank you for always sharing your expertise.
	"My last two downloads have been replica _3 tasks, each WU having failed on three Windows machines first."
	Despite this, there is a noticeable increase in the number of users returning ATMML results, likely because Windows users have now been added to the previous Linux ones. Before the new Windows ATMML app was released, users/24h was consistently about 80 - 100. Currently it is more than 230, as can be seen on the Server status page.

WPrion Send message Joined: 30 Apr 13 Posts: 96 Credit: 2,638,034,111 RAC: 20,338,166 Level Scientific publications	Message 61698 - Posted: 23 Aug 2024 \| 10:55:09 UTC - in response to Message 61595. Last modified: 23 Aug 2024 \| 10:58:27 UTC
	Re: the Apps page: https://www.gpugrid.net/apps.php I wish, for consistency, it would state: "ATMML: Free energy with neural networks for GPU". Also, when selecting projects in project preferences, it would be nice if it stated: "ATMML on GPU".

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1069 Credit: 40,231,533,983 RAC: 527 Level Scientific publications	Message 61699 - Posted: 23 Aug 2024 \| 11:21:33 UTC - in response to Message 61698.
	"Re: the Apps page: https://www.gpugrid.net/apps.php I wish, for consistency, it would state: ATMML: Free energy with neural networks for GPU. Also, when selecting projects in project preferences, it would be nice if it stated: ATMML on GPU."
	This is GPUgrid. All tasks are for GPU.

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,820,144,351 RAC: 19,471,837 Level Scientific publications	Message 61700 - Posted: 23 Aug 2024 \| 12:41:38 UTC - in response to Message 61697.
	"Despite this, there is a noticeable increase in the number of users returning ATMML results."
	Indeed. But the question is: are those completed, end-of-run, scientifically useful results - or are they early crashes, resulting only in the creation and issue of another replica, to take its place in the 'in progress' count? We can't tell from the outside. But runtimes starting at 0.04 hours don't look too good.

Steve Volunteer moderator Project administrator Project developer Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 21 Dec 23 Posts: 46 Credit: 0 RAC: 0 Level Scientific publications	Message 61701 - Posted: 23 Aug 2024 \| 12:57:06 UTC - in response to Message 61700. Last modified: 23 Aug 2024 \| 12:59:22 UTC
	Hi, the Windows hosts are working successfully. There are more errors than on Linux, as expected, but plenty are working well. Unfortunately, some WUs affected by the bug that gives a very short run time but a validated status are still in circulation. (Each WU runs in a chain of 5 steps; when a step finishes, it launches a new job with the same settings.) New WUs do not have this bug. This is the bug I am talking about: https://www.gpugrid.net/forum_thread.php?id=5468&nowrap=true#61682

WPrion Send message Joined: 30 Apr 13 Posts: 96 Credit: 2,638,034,111 RAC: 20,338,166 Level Scientific publications	Message 61702 - Posted: 23 Aug 2024 \| 16:55:08 UTC - in response to Message 61696.
	"* small cache, especially with slower GPUs."
	Which cache? Where is it set?? What should it be set at???

WPrion Send message Joined: 30 Apr 13 Posts: 96 Credit: 2,638,034,111 RAC: 20,338,166 Level Scientific publications	Message 61703 - Posted: 23 Aug 2024 \| 17:03:59 UTC
	I just started ATMML yesterday. Out of seven starts only one completed. The rest errored-out after 1-1.5 hours. Windows11/RTX4090. I'd like to get some actual work done...

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1069 Credit: 40,231,533,983 RAC: 527 Level Scientific publications	Message 61704 - Posted: 23 Aug 2024 \| 17:19:31 UTC - in response to Message 61702.
	"* small cache, especially with slower GPUs. Which cache? Where is it set?? What should it be set at???"
	He's talking about the work cache on the host. You can (kind of) control that in the BOINC Manager under Options -> "Computing Preferences". Set it to something less than 1 day, probably. You'll be limited to 4 tasks from the project (per GPU) anyway.
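
For hosts managed without the BOINC Manager GUI, the same setting can, as far as I know, be supplied through a `global_prefs_override.xml` file in the BOINC data directory. The Python sketch below only builds that XML as a string; the function name is made up for illustration, and the 0.1-day and 0-day values are example choices rather than a project recommendation.

```python
import xml.etree.ElementTree as ET


def build_work_buffer_override(min_days: float, additional_days: float) -> str:
    """Build a global_prefs_override.xml body that sets the work buffer size.

    work_buf_min_days        -> "Store at least N days of work"
    work_buf_additional_days -> "Store up to an additional N days of work"
    """
    root = ET.Element("global_preferences")
    ET.SubElement(root, "work_buf_min_days").text = f"{min_days:.2f}"
    ET.SubElement(root, "work_buf_additional_days").text = f"{additional_days:.2f}"
    return ET.tostring(root, encoding="unicode")


# Example values (assumption): keep roughly 0.1 days of work and nothing extra.
print(build_work_buffer_override(0.1, 0.0))
```

After writing the file, the client has to be told to re-read its local preferences (or be restarted) before the smaller buffer takes effect.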
	ID: 61704 \| Rating: 0 \| rate: / Reply Quote

Farscape Send message Joined: 1 Feb 09 Posts: 5 Credit: 1,609,692,118 RAC: 9,579,370 Level Scientific publications	Message 61705 - Posted: 23 Aug 2024 \| 17:30:31 UTC
	The Windows tasks ARE NOT working as advertised.... On two 3090ti computers and one 3090, 11 work units have errored out after between 2 and 4 hours of run time. Previous successful task run times were between 17000 and 18500 seconds. Errored tasks ran for 5000-8500 seconds. I am killing the app in preferences until it sorts itself out.... ____________
	ID: 61705 \| Rating: 0 \| rate: / Reply Quote

WPrion Send message Joined: 30 Apr 13 Posts: 96 Credit: 2,638,034,111 RAC: 20,338,166 Level Scientific publications	Message 61706 - Posted: 23 Aug 2024 \| 18:22:30 UTC - in response to Message 61705.
	Thanks. There are too many caches out there. Let's call this the work queue.
	ID: 61706 \| Rating: 0 \| rate: / Reply Quote

zombie67 [MM] Send message Joined: 16 Jul 07 Posts: 209 Credit: 4,095,161,456 RAC: 22,338,324 Level Scientific publications	Message 61707 - Posted: 23 Aug 2024 \| 20:14:08 UTC - in response to Message 61705.
	The Windows tasks ARE NOT working as advertised.... On two 3090ti computers and one 3090, 11 work units have errored out after between 2 and 4 hours of run time. Previous successful task run times were between 17000 and 18500 seconds. Errored tasks ran for 5000-8500 seconds. I am killing the app in preferences until it sorts itself out.... All 8 of 8 tasks I have completed and returned were also categorized as errors. This is on win10 with 4080 and 4090 GPUs. Here is a sample: http://www.gpugrid.net/result.php?resultid=35743812 ____________ Reno, NV Team: SETI.USA
	ID: 61707 \| Rating: 0 \| rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,820,144,351 RAC: 19,471,837 Level Scientific publications	Message 61708 - Posted: 23 Aug 2024 \| 21:09:54 UTC - in response to Message 61707.
	Exit status 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED That's going to be a difficult one to overcome unless the project addresses its job estimation. You need to 'complete' (which includes a successful finish plus validation) 11 tasks before the estimates are normalised - and if every task fails, you'll never get there.
	ID: 61708 \| Rating: 0 \| rate: / Reply Quote

Steve Volunteer moderator Project administrator Project developer Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 21 Dec 23 Posts: 46 Credit: 0 RAC: 0 Level Scientific publications	Message 61709 - Posted: 24 Aug 2024 \| 8:17:34 UTC - in response to Message 61708.
	Hello. I apologise for the time limit exceeded errors. I did not expect this. The jobs run for the same time as the Linux ones, which have all been working, so I don't really understand what is happening. Unfortunately the way BOINC deals with "runtime" is completely inadequate for GPU projects. In a WU we have to estimate the FLOP usage, which is a difficult thing to do for a GPU app. The BOINC client then somehow estimates the FLOPS performance of your computer in a way I don't understand. I cannot simply put a runtime limit of x hours as would be typical. Does anyone know where the denominator comes from in this line?: <message> exceeded elapsed time limit 5454.20 (10000000000.00G/1712015.37G)</message> <stderr_txt> The numerator I believe is the fpops_bound that is set in the WU template which is controlled by us.
	ID: 61709 \| Rating: 0 \| rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,820,144,351 RAC: 19,471,837 Level Scientific publications	Message 61710 - Posted: 24 Aug 2024 \| 9:16:26 UTC - in response to Message 61709. Last modified: 24 Aug 2024 \| 9:18:42 UTC
	Does anyone know where the denominator comes from in this line?: <message> exceeded elapsed time limit 5454.20 (10000000000.00G/1712015.37G)</message> <stderr_txt> The numerator I believe is the fpops_bound that is set in the WU template which is controlled by us.
	Yes. It's the current estimated speed for the task, which should be 'learned' by BOINC for the individual computer running this particular task type ('app_version'). It's a complex three-stage process, and unfortunately it doesn't go down to the granularity of individual GPU types - all GPUs are considered equal.
	1) When a new app version is created, the server will set a first, initial, value for GPU speeds for that version. I'm afraid I don't know how that initial value is estimated, but I'll try to find out.
	2) Once the app version is up and running, the server monitors the runtime of the successful tasks returned. That's done at both the project level, and the individual host level. The first critical point is probably when the project has received 100 results: the calculated average speed from those 100 is used to set the expected speed for all tasks issued from that point forward. [aside - 'obviously' the first results received will be from the fastest machines, so that value is skewed]
	3) Later, as each individual host reports tasks, once 11 successful tasks have been returned, future tasks assigned to that host are assigned the running average value for that host.
	The current speed estimate ('fpops_est') can be seen in the application_details page for each host. zombie67 hasn't completed an ATMML task yet, so no 'Average processing rate' for his machine is shown yet for ATMML (at the bottom), but you can see it for other task types.
	Phew. That's probably more than enough for now, so I'll leave you to digest it.
	ID: 61710 \| Rating: 0 \| rate: / Reply Quote
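
The three stages described in the post above can be sketched roughly as follows. This is only an illustration of the selection logic as Richard describes it, not BOINC server code: the function name, the parameter names, and the exact form of the comparisons are assumptions. Only the thresholds of 100 project results and 11 host results come from the post.

```python
# Illustrative sketch only; names are hypothetical and this is not BOINC code.
# The two thresholds are the figures quoted in the post.
PROJECT_RESULTS_NEEDED = 100
HOST_RESULTS_NEEDED = 11


def pick_speed_estimate(initial_flops, project_avg_flops, project_results,
                        host_avg_flops, host_results):
    """Return the speed figure (flops) used to estimate a new task's runtime."""
    # Stage 3: a host with enough successful results uses its own average.
    if host_results >= HOST_RESULTS_NEEDED:
        return host_avg_flops
    # Stage 2: otherwise fall back to the project-wide average, once it exists.
    if project_results >= PROJECT_RESULTS_NEEDED:
        return project_avg_flops
    # Stage 1: a brand-new app version only has the server's initial value.
    return initial_flops


# Made-up speeds, purely to show which branch is taken.
new_host = pick_speed_estimate(1e12, 5e12, 250, 0.0, 0)
veteran_host = pick_speed_estimate(1e12, 5e12, 250, 7e11, 52)
print(new_host, veteran_host)
```

Under these assumptions, a host whose tasks all fail never collects its 11 successes and stays on the project-wide figure, which is the trap described a few posts earlier.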

wujj123456 Send message Joined: 9 Jun 10 Posts: 19 Credit: 2,233,932,323 RAC: 1,049 Level Scientific publications	Message 61711 - Posted: 24 Aug 2024 \| 20:16:04 UTC - in response to Message 61710.
	I'm curious: why do we even bother to intentionally error out a task based on runtime at all? Usually a wrong estimate of runtime just messes with local client scheduling a bit, but tasks finish fine eventually. It's not like GPUGrid had accurate runtime estimation before, but previous tasks didn't fail. Does this batch/app have a bug that could cause it to get stuck computing forever, which is why we need additional protection to abort tasks after a certain runtime?
	ID: 61711 \| Rating: 0 \| rate: / Reply Quote

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,649,638,294 RAC: 13,283,785 Level Scientific publications	Message 61712 - Posted: 24 Aug 2024 \| 21:30:21 UTC - in response to Message 61711.
	Probably the decision is because this project depends on fast turnaround and turnover for tasks. Science can't proceed till the earlier result is returned, validated and then iterated into the next task. Better to fail fast and send out the next wingman task until the task gets retired at 8 fails. Deadline is always 5 days to get through 8 tries and why they reward 50% credit bonus for returning results within 24 hours.
	ID: 61712 \| Rating: 0 \| rate: / Reply Quote

wujj123456 Send message Joined: 9 Jun 10 Posts: 19 Credit: 2,233,932,323 RAC: 1,049 Level Scientific publications	Message 61713 - Posted: 24 Aug 2024 \| 22:25:16 UTC - in response to Message 61712. Last modified: 24 Aug 2024 \| 22:34:54 UTC
	I've seen people misuse this "fail fast" philosophy very often. "Fail fast" makes sense only when it's going to be a failure anyway. Turning a successful result into a failure proactively is the opposite of making progress. Look at the errors on this host. It took ~20K seconds for my host to finish, but all the prematurely killed results ended up wasting way more compute time and on average suffered another half a day of delay before getting a successful return. That's not speeding up but slowing down. That's why I'm curious what this limit is trying to protect against. Such a limit would make sense only if we knew there was a chance that a task could get stuck computing indefinitely. That would generally indicate some bug that needs to be fixed. Even then, given how long the turnaround would be after killing an otherwise successful task, the project should have set a floor of a few hours on the limit. In addition, "science" does not equal this project alone. Wasting hours of compute that could have been used by other projects isn't advancing "science". It's advancing this project at the cost of other science. That would be a bit disrespectful to fellow scientists if it's done intentionally. However, I'd rather assume good intentions here and that this is just a misguided optimization. I know software isn't easy, so the project owners should take a look at the resulting data and try to make more efficient use of available compute by reducing waste, which would also speed up progress for this project.
	ID: 61713 \| Rating: 0 \| rate: / Reply Quote

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,649,638,294 RAC: 13,283,785 Level Scientific publications	Message 61714 - Posted: 25 Aug 2024 \| 7:40:25 UTC - in response to Message 61713.
	I'm pretty sure the "exceeded elapsed time limit" is not there because the project scientists just decided on a whim to use it. It's part of the Boinc code and nothing they have control over. It's present for all projects that use the Boinc code unmodified. Only the Boinc developers have the knowledge of how that function is implemented. The project scientist already stated he was surprised by the errors, since the exact same task template was used for the Linux tasks and they have not had any issues with elapsed time limit errors. So it is something specific to Windows. And they do not develop for Windows first, since all the tools and software they use are primarily Linux based, and that is where their expertise is greatest. Some of the toolchains they use have never had Windows versions, which is why it has taken so long to produce Windows versions of some of the native Linux apps.
	ID: 61714 \| Rating: 0 \| rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,820,144,351 RAC: 19,471,837 Level Scientific publications	Message 61716 - Posted: 25 Aug 2024 \| 11:49:19 UTC - in response to Message 61714.
	I agree - the runtime errors are an issue mainly of the BOINC software, but they are appearing because the GPUGrid teams - admin and research - have over the years failed to fully come to terms with the changes introduced by BOINC around 2010. We are running a very old copy of the BOINC server code here, which includes the beginnings of the 2010 changes, but which makes it very difficult for us to dig our way out of the hole we're in.
	But I don't agree that only the BOINC developers understand the code. It's all open-source, and other projects manage to control it reasonably well. The finer points are indeed cloaked in obscure language, but the resulting data is visible on all our machines.
	Let's play with a current worked example. I've just downloaded a new ATMML task on host 508381. That's a Linux machine, with 2 GPUs. They are in fact identical, so for once the 'Coprocessors' line is true. It has completed 52 ATMML tasks so far, so it has had plenty of time to reach a steady state. [BOINC loves steady states - it's the edge cases, like deploying a new app_version, which cause the problems]
	My key objective is to see how the runtime estimate was derived, and to see what was done well, and what was done badly.
	BOINC works out the runtime from the size and the speed of the task. In dimensional terms, that's {size} / {size per second}. The sizes cancel out, and duration is the inverse of speed. In the case of my new task, I have:
	<rsc_fpops_est>1000000000000000000.000000</rsc_fpops_est> (size)
	<flops>698258637765.176392</flops> (speed)
	My calculator makes that
	Duration 1,432,134 seconds, or about 16.5 days.
	But our BOINC clients have a trick up their sleeves for coping with that - it's called the DCF, or duration correction factor. For this machine, it's settled to 0.045052. Popping that into the calculator, that comes down to:
	Runtime estimate 64,520 seconds, or 17.92 hours.
	BOINC Manager displays 17:55:20, and that's about right for these tasks (although they do vary).
	CONCLUSION
	The task sizes set by the project for this app are unrealistically high, and the runtime estimates only approach sanity through the heavy application of DCF - which should normally hover around 1. DCF is self-adjusting, but very slowly for these extreme limits. And you have to do the work first, which may not be possible. Volunteers with knowledge and skill can adjust their own DCFs, but I wouldn't advise it for novices.
	@ Steve
	That's even more indigestible than the essay I wrote you yesterday. Please don't jump into changing things until they've both sunk in fully: meddling (and in particular, meddling with one task type at a project with multiple applications) can cause even more problems than it cures. Mull it over, discuss it with your administrators and fellow researchers, and above all - ask as many questions as you need.
	ID: 61716 \| Rating: 0 \| rate: / Reply Quote
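
For anyone who wants to repeat the arithmetic in the post above, here is a minimal sketch. It assumes the estimate is simply size divided by speed, scaled by the DCF, which is how the post does the sum. The function names are hypothetical, and the three input figures are the ones quoted in the post.

```python
def estimated_runtime_seconds(rsc_fpops_est, flops, dcf=1.0):
    """Size divided by speed, scaled by the duration correction factor."""
    return rsc_fpops_est / flops * dcf


def as_hms(seconds):
    """Format whole seconds as H:MM:SS."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


# Figures quoted in the post for host 508381.
rsc_fpops_est = 1_000_000_000_000_000_000.0   # task size
flops = 698_258_637_765.176392                # estimated speed
dcf = 0.045052                                # duration correction factor

raw = estimated_runtime_seconds(rsc_fpops_est, flops)
corrected = estimated_runtime_seconds(rsc_fpops_est, flops, dcf)

print(f"without DCF: {int(raw):,} s, about {raw / 86400:.1f} days")
print(f"with DCF:    {int(corrected):,} s, about {corrected / 3600:.2f} hours")
print(f"as H:MM:SS:  {as_hms(corrected)}")  # 17:55:20, the BOINC Manager figure
```

Setting dcf back to 1.0 shows how far the raw estimate is from the real run time, which is the point of the CONCLUSION in the post.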

Erich56 Send message Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level Scientific publications	Message 61721 - Posted: 25 Aug 2024 \| 19:56:36 UTC
	"transient upload error: server out of disk space" the old problem which has been occurring over the years :-( Unbelievable that this is still happening.
	ID: 61721 \| Rating: 0 \| rate: / Reply Quote

wujj123456 Send message Joined: 9 Jun 10 Posts: 19 Credit: 2,233,932,323 RAC: 1,049 Level Scientific publications	Message 61724 - Posted: 26 Aug 2024 \| 0:56:52 UTC - in response to Message 61716.
	So this runtime exceeded failure is actually related to the absurd rsc_fpops_est setting. This number is obviously not accurate. Could the project fix the number to be more reasonable, instead of relying on client's trial and error for adjustment that wastes lots of compute power?
	ID: 61724 \| Rating: 0 \| rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,820,144,351 RAC: 19,471,837 Level Scientific publications	Message 61730 - Posted: 26 Aug 2024 \| 7:52:36 UTC - in response to Message 61724.
	So this runtime exceeded failure is actually related to the absurd rsc_fpops_est setting. This number is obviously not accurate. Could the project fix the number to be more reasonable, instead of relying on client's trial and error for adjustment that wastes lots of compute power? Not so fast, please. The rsc_fpops_est figure is obviously wrong, but that's the result of many years of twiddling knobs without really understanding what they do. Two flies in that pot of ointment: If they reduce rsc_fpops_est by itself, the time limit will reduce, and more tasks will fail. There's a second value - rsc_fpops_bound - which actually triggers the failure. In my worked example, that was set to 1,000x the estimate, or several years. That was one of the knobs they twiddled some years ago: the default is 10x. So something else is seriously wrong as well. Soon after the Windows app was launched, I saw tasks with very high replication numbers, which had failed on multiple machines - up to 7, the limit here. But very few of them were 'time limit exceeded'. The tasks I'm running now have low replication numbers, so we may be over the worst of it. I repeat my plea to Steve - please take your time to think, discuss, understand what's going on. Don't change anything until you've worked out what it'll do.
	ID: 61730 \| Rating: 0 \| rate: / Reply Quote

Steve Volunteer moderator Project administrator Project developer Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 21 Dec 23 Posts: 46 Credit: 0 RAC: 0 Level Scientific publications	Message 61732 - Posted: 26 Aug 2024 \| 8:32:33 UTC - in response to Message 61730.
	Thank you for the explanation. The time limit exceeded error therefore happened because:
	- we had a bug in some circulating WUs where certain errors would not trigger a proper error code. The result would then be validated with short runtimes.
	- these fast runtime results then skewed the correction factors for the newly released Windows app version.
	To fix the problem I 10x'ed the rsc_fpops_bound value while leaving the rsc_fpops_est unchanged. This appears to have worked, and hosts that previously had the time limit exceeded errors now do not.
	ID: 61732 \| Rating: 0 \| rate: / Reply Quote
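
A rough sketch of why raising rsc_fpops_bound clears the error, assuming the client's limit is simply rsc_fpops_bound divided by the speed figure, as the two numbers in parentheses in the log line quoted earlier in the thread suggest. The function name is hypothetical. The bound and speed are the figures from that log line, and 18,500 s is the upper end of the successful run times reported earlier. Plain division of those two figures does not exactly reproduce the 5454.20 s printed in the log, and the sketch makes no attempt to explain that gap.

```python
GIGA = 1e9


def elapsed_time_limit(rsc_fpops_bound, flops):
    """Assumed limit in seconds: the bound divided by the estimated speed."""
    return rsc_fpops_bound / flops


# Figures from the log line: (10000000000.00G/1712015.37G)
bound = 10_000_000_000.00 * GIGA
flops = 1_712_015.37 * GIGA
typical_runtime = 18_500  # seconds, upper end of reported successful runs

old_limit = elapsed_time_limit(bound, flops)
new_limit = elapsed_time_limit(10 * bound, flops)  # after the 10x change

for label, limit in (("original bound", old_limit), ("10x bound", new_limit)):
    verdict = "killed early" if typical_runtime > limit else "allowed to finish"
    print(f"{label}: limit {limit:,.0f} s -> an 18,500 s task is {verdict}")
```

With the inflated speed figure left as it is, only the larger bound puts the limit above a normal run time. The estimate itself is untouched, so displayed run times and scheduling stay the same.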

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,820,144,351 RAC: 19,471,837 Level Scientific publications	Message 61733 - Posted: 26 Aug 2024 \| 9:05:27 UTC - in response to Message 61732.
	Yes, I found host 591089. That succeeded on its first task, but then failed five in a row on the time limit. It's had the current one for two days now, so hopefully it'll work. One to watch.
	ID: 61733 \| Rating: 0 \| rate: / Reply Quote

EA6LE Send message Joined: 28 Dec 20 Posts: 7 Credit: 20,035,707,011 RAC: 127,982,323 Level Scientific publications	Message 61735 - Posted: 26 Aug 2024 \| 13:16:48 UTC - in response to Message 61733.
	I found a way to get the Windows WUs to finish without errors. After you get a WU, go to the Projects tab and select "No new tasks". Be sure you don't have other projects running at the same time. Once it has finished and uploaded, you can allow another WU and repeat.
	ID: 61735 \| Rating: 0 \| rate: / Reply Quote

EA6LE Send message Joined: 28 Dec 20 Posts: 7 Credit: 20,035,707,011 RAC: 127,982,323 Level Scientific publications	Message 61739 - Posted: 27 Aug 2024 \| 22:05:37 UTC - in response to Message 61735.
	WUs starting with "MCL1" are all erroring out in windows or linux.
	ID: 61739 \| Rating: 0 \| rate: / Reply Quote

mmonnin Send message Joined: 2 Jul 16 Posts: 337 Credit: 7,617,663,350 RAC: 11,036,158 Level Scientific publications	Message 61740 - Posted: 27 Aug 2024 \| 22:21:52 UTC
	Quite a few of my tasks starting with PTP1B are failing in both OSs
	ID: 61740 \| Rating: 0 \| rate: / Reply Quote

EA6LE Send message Joined: 28 Dec 20 Posts: 7 Credit: 20,035,707,011 RAC: 127,982,323 Level Scientific publications	Message 61741 - Posted: 27 Aug 2024 \| 22:31:58 UTC - in response to Message 61740. Last modified: 27 Aug 2024 \| 22:32:11 UTC
	Quite a few of my tasks starting with PTP1B are failing in both OSs Those worked fine for me under linux. took shorter time to finish them.
	ID: 61741 \| Rating: 0 \| rate: / Reply Quote

Bedrich Hajek Send message Joined: 28 Mar 09 Posts: 485 Credit: 11,079,048,466 RAC: 15,671,380 Level Scientific publications	Message 61742 - Posted: 28 Aug 2024 \| 0:37:26 UTC
	Same issue here: https://www.gpugrid.net/result.php?resultid=35798577
	Tue 27 Aug 2024 08:34:58 PM EDT \| GPUGRID \| Computation for task MCL1_A23_A35_r0_4-QUICO_ATM_AF_04_Benchmark_MCL1-0-20-RND1911_1 finished
	Tue 27 Aug 2024 08:34:58 PM EDT \| GPUGRID \| Output file MCL1_A23_A35_r0_4-QUICO_ATM_AF_04_Benchmark_MCL1-0-20-RND1911_1_0 for task MCL1_A23_A35_r0_4-QUICO_ATM_AF_04_Benchmark_MCL1-0-20-RND1911_1 absent
	Tue 27 Aug 2024 08:34:59 PM EDT \| GPUGRID \| Started upload of MCL1_A23_A35_r0_4-QUICO_ATM_AF_04_Benchmark_MCL1-0-20-RND1911_1_1
	ID: 61742 \| Rating: 0 \| rate: / Reply Quote

roundup Send message Joined: 11 May 10 Posts: 63 Credit: 9,096,655,193 RAC: 54,351,602 Level Scientific publications	Message 61743 - Posted: 28 Aug 2024 \| 3:47:48 UTC - in response to Message 61742.
	That is a bad batch. All units error out after several weeks of trouble-free calculation under Linux. Example: https://www.gpugrid.net/result.php?resultid=35800268
	Warning: importing 'simtk.openmm' is deprecated. Import 'openmm' instead.
	[W output_modules.py:45] Warning: CUDA graph capture will lock the batch to the current number of samples (2). Changing this will result in a crash (function )
	[W output_modules.py:45] Warning: CUDA graph capture will lock the batch to the current number of samples (2). Changing this will result in a crash (function )
	Traceback (most recent call last):
	  File "/var/lib/boinc-client/slots/28/bin/rbfe_explicit_sync.py", line 11, in <module>
	    rx.scheduleJobs()
	  File "/var/lib/boinc-client/slots/28/lib/python3.11/site-packages/sync/atm.py", line 142, in scheduleJobs
	    if isample % int(self.config['CHECKPOINT_FREQUENCY']) == 0 or isample == num_samples:
	                     ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
	  File "/var/lib/boinc-client/slots/28/lib/python3.11/site-packages/configobj/__init__.py", line 554, in __getitem__
	    val = dict.__getitem__(self, key)
	          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
	KeyError: 'CHECKPOINT_FREQUENCY'
	ID: 61743 \| Rating: 0 \| rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,820,144,351 RAC: 19,471,837 Level Scientific publications	Message 61747 - Posted: 28 Aug 2024 \| 9:31:22 UTC
	Yes, I had 90 failed ATMML tasks overnight. The earliest was issued just after 18:00 UTC yesterday, but was created at 27 Aug 2024 \| 13:28:15 UTC. I've switched to helping with the quantum chemistry backlog for the time being.
	ID: 61747 \| Rating: 0 \| rate: / Reply Quote

Billy Ewell 1931 Send message Joined: 22 Oct 10 Posts: 40 Credit: 1,542,168,656 RAC: 3,801,702 Level Scientific publications	Message 61757 - Posted: 3 Sep 2024 \| 18:30:20 UTC - in response to Message 61694. Last modified: 3 Sep 2024 \| 18:37:38 UTC
	I experienced the same problems over several days when I was suspending GPU processing because of very hot temperatures in Texas; the result was the loss of many hours of processing until I discovered that LTIWS (leave tasks in memory while suspended) was apparently not working. I am suspending ATMML tasks until cooler weather arrives in the fall. BET
	ID: 61757 \| Rating: 0 \| rate: / Reply Quote

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,649,638,294 RAC: 13,283,785 Level Scientific publications	Message 61758 - Posted: 3 Sep 2024 \| 19:46:19 UTC - in response to Message 61757.
	I can't remember all of the task types that allow suspending or exiting Boinc without erroring out. The acemd tasks properly checkpoint, but you also can't allow a restarted WU to start again on a different gpu or it will also error out. Best practice for GPUGrid generally has always been to let all tasks run to completion before exiting Boinc. No guarantees that any task will resume without loss of prior work done or just error out.
	ID: 61758 \| Rating: 0 \| rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,820,144,351 RAC: 19,471,837 Level Scientific publications	Message 61760 - Posted: 3 Sep 2024 \| 21:07:48 UTC - in response to Message 61757.
	LTI[M]WS only applies to CPU tasks. GPUs don't have that much spare memory.
	ID: 61760 \| Rating: 0 \| rate: / Reply Quote

Billy Ewell 1931 Send message Joined: 22 Oct 10 Posts: 40 Credit: 1,542,168,656 RAC: 3,801,702 Level Scientific publications	Message 61761 - Posted: 4 Sep 2024 \| 17:12:09 UTC - in response to Message 61760.
	I appreciate the informative responses of BOTH Keith and Richard immediately below!!
	ID: 61761 \| Rating: 0 \| rate: / Reply Quote

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,649,638,294 RAC: 13,283,785 Level Scientific publications	Message 61762 - Posted: 4 Sep 2024 \| 18:08:37 UTC
	Generally, if I know I must reboot the system in the near future, I will just wait till the current tasks are finished, or reboot shortly after a new task starts, so I won't begrudge the little time it has already spent crunching and will have to repeat after the reboot. It is generally safe to stop a task soon after it starts because, with the exception of the acemd tasks, all the other task types need several minutes to unpack the Python environment in the slots and haven't actually started calculating anything yet. I have found that you can get away with interrupting the startup process with a reboot, and you won't throw away the task or error it out.
	ID: 61762 \| Rating: 0 \| rate: / Reply Quote

Life v lies: Dont be a DN... Send message Joined: 14 Feb 20 Posts: 16 Credit: 27,395,983 RAC: 1,420 Level Scientific publications	Message 61763 - Posted: 4 Sep 2024 \| 21:35:14 UTC - in response to Message 61687. Last modified: 4 Sep 2024 \| 22:07:43 UTC
	The time bonus system has been in place w/ GPUGrid for years. (And yes, the GG tasks download several GIG of data... [WHY? well, another issue] and the download time does count against the deadline) BUT the points awarded are nonetheless - shall one say - unfathomable.
	Case in point: ATMML has a very, very high failure rate [yet another issue, AND an important one], and when completed usually awards 300,000 points, at least to my NVIDIA which is better in some ways than this guy's...
	HOWEVER, host 621740 has had seven successful ATMML tasks (see below) in the last six days with EACH being awarded 2,700,000 points .... SO, what gives?? WHY a 9-fold difference???
	WUid	other task result
	29271283	1 error, 1 abort
	29270516	3 errors
	29265238	1 error, 1 abort
	29204796	1 time out, 5 errors (1 of these after 50,905 sec = 14+ Hrs)
	29268456	n/a
	29267692	3 errors
	29267146	2 errors
	621740's specs:
	GenuineIntel Intel(R) Core(TM) i7 CPU 930 @ 2.80GHz [Family 6 Model 26 Stepping 5]
	Number of processors 8
	Coprocessors NVIDIA NVIDIA GeForce RTX 3060 (12287MB) driver: 560.81
	Operating System Microsoft Windows 10 Professional
	Memory 12279.11 MB
	Cache 256 KB
	My NVIDIA has
	Memory 16316.07 MB
	Cache 512 KB
	PLUS Swap space 45668.07 MB
	Total disk space 464.49 GB
	Lasslo P, PhD, Prof Engr.
	ID: 61763 \| Rating: 0 \| rate: / Reply Quote

Life v lies: Dont be a DN... Send message Joined: 14 Feb 20 Posts: 16 Credit: 27,395,983 RAC: 1,420 Level Scientific publications	Message 61764 - Posted: 4 Sep 2024 \| 22:00:06 UTC - in response to Message 61762.
	Good point, but winDoze update does not make it easy to avoid IT's "decision" about updating and the time to restart my system. (Don't you hate it when big tech is so much more brilliant and all-knowing than you?) I turn OFF updates for the 5 weeks max allowed, then as the month ends, I pick the time when I will download and install OS updates. Even then, I set the "active hours" to the times LEAST likely my PC is in use, usually including late PM to early AM Lasslo P, PhD, Prof Engr
	ID: 61764 \| Rating: 0 \| rate: / Reply Quote

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1069 Credit: 40,231,533,983 RAC: 527 Level Scientific publications	Message 61765 - Posted: 4 Sep 2024 \| 22:08:10 UTC - in response to Message 61763. Last modified: 4 Sep 2024 \| 22:11:37 UTC
	you're comparing multiple levels of apples vs oranges. a GTX1660Ti is in no ways better than a RTX3060, it's older gen, less than half the CUDA cores, no tensor cores (which ATMML will use), slower clock speed, slower memory speed, truly basically every performance metric favors the 3060. your task you completed for 300,000cr was ACEMD3, not ATMML, and you also need to consider that the ATMML tasks run much longer than ACEMD3 and use more resources, so the higher credit reward is appropriate. your 1x ACEMD3 task if you were to crunch 24/7 would come up with a production of a little over 1,000,000 points per day. your competitor also completed one ACEMD3 task recently, and scaling that to 24/7 production comes out to around 1,500,000 points per day. it takes them about 4x longer to run ATMML. and based on their recent production, including the failure, is about 3,600,000 ppd. the project admins have resolved a lot of the problems with ATMML, if you want to have better success with this project, and GPUGRID in general, you can consider switching to Linux. otherwise, maybe investigate what's going wrong with your system to cause the failures. looks like a permissions issue to me since your failed tasks have a bunch of Access is denied errors in the WU log. possibly an over zealous AV software. that could be the reason for your download errors also, or just spotty internet ____________
	ID: 61765 \| Rating: 0 \| rate: / Reply Quote

Life v lies. Dont be a DN... Send message Joined: 7 Feb 12 Posts: 5 Credit: 333,019,294 RAC: 511,201 Level Scientific publications	Message 61766 - Posted: 4 Sep 2024 \| 23:19:20 UTC - in response to Message 61765.
	GTX1660Ti is in no ways better "in no way" is an absolute statement, and is false: my NVIDIA has 33% more memory and double the cache. But, admittedly it is not in general as powerful
	ID: 61766 \| Rating: 0 \| rate: / Reply Quote

Life v lies. Dont be a DN... Send message Joined: 7 Feb 12 Posts: 5 Credit: 333,019,294 RAC: 511,201 Level Scientific publications	Message 61767 - Posted: 4 Sep 2024 \| 23:34:23 UTC - in response to Message 61765. Last modified: 4 Sep 2024 \| 23:47:58 UTC
	maybe investigate what's going wrong with your system to cause the failures. how bizarre ... Batting 1000, the GPUGrid tasks which fail on my system have ALSO failed on several, perhaps even 6 or 7, other systems (except when I take a newly issued task, and when I check those they also fail after bombing on my system.) So, if the problem is on my end and not in any way on GPUGrid's end, then there must be dozens and dozens (and dozens) of other systems which apparently need to "investigate what's going wrong" with them... that could be the reason for your download errors also I have no "Download errors" except when I abort the download of a task which has already had repeated compute errors. GPUGrid needs 8 failures before figuring out that there are 'too many error (may have bug)' If I can, I'd rather give them this insight before I waste 5-10 minutes of time on my GPU, such as it is. Anyway, thanks for your feedback Oh, and by the way, I run 12-13 other projects, including at least three others where I run GPU tasks. This very high error rate of tasks is NOT an issue whatsoever with any of them. LLP, PhD, Prof Engr
	ID: 61767 \| Rating: 0 \| rate: / Reply Quote

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1069 Credit: 40,231,533,983 RAC: 527 Level Scientific publications	Message 61768 - Posted: 4 Sep 2024 \| 23:34:57 UTC - in response to Message 61766. Last modified: 4 Sep 2024 \| 23:41:05 UTC
	GTX1660Ti is in no ways better "in no way" is an absolute statement, and is false: my NVIDIA has 33% more memory and double the cache. But, admittedly it is not in general as powerful read up. my "absolute" statement is correct. your 1660Ti has half the memory of a 3060. your 1660Ti also has half the cache of the 3060. GTX 1660Ti has 6GB RTX 3060 has 12GB (you can see this in the host details you referenced) not sure where you're getting your information, but it's wrong or missing context (like comparing to a laptop GPU or something) ____________
	ID: 61768 \| Rating: 0 \| rate: / Reply Quote

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1069 Credit: 40,231,533,983 RAC: 527 Level Scientific publications	Message 61769 - Posted: 4 Sep 2024 \| 23:38:05 UTC - in response to Message 61767. Last modified: 4 Sep 2024 \| 23:41:39 UTC
	maybe investigate what's going wrong with your system to cause the failures. how bizarre ... Batting 1000, the GPUGrid tasks which fail on my system have ALSO failed on several, perhaps even 6 or 7, other systems (except when I take a newly issued task, and when I check those they also fail after bombing on my system.) So, if the problem is on my end and not in any way on GPUGrid's end, then there must be dozens and dozens (and dozens) of other systems which apparently need to "investigate what's going wrong" with them... that could be the reason for your download errors also I have no "Download errors" except when I abort the download of a task which has already had repeated compute errors. Anyway, thanks for your feedback there are more people running Windows. higher probability for resends to land on another problematic windows host. it's more common for Windows users to be running AV software. it's common for windows users to have issues with BOINC projects and AV software. not hard to imagine that these factors mean that a large number of people would have problems when they're all coming to play. check your AV settings, whitelist the BOINC data directories and try again. ____________
	ID: 61769 \| Rating: 0 \| rate: / Reply Quote

Life v lies. Dont be a DN... Send message Joined: 7 Feb 12 Posts: 5 Credit: 333,019,294 RAC: 511,201 Level Scientific publications	Message 61770 - Posted: 5 Sep 2024 \| 0:07:54 UTC - in response to Message 61768. Last modified: 5 Sep 2024 \| 0:08:32 UTC
	your 1660Ti has half the memory of a 3060. your 1660Ti also has half the cache of the 3060. GTX 1660Ti has 6GB RTX 3060 has 12GB (you can see this in the host details you referenced) not sure where you're getting your information, but it's wrong or missing context (like comparing to a laptop GPU or something) My information is from GPUGrid's host information, https://gpugrid.net/show_host_detail.php?hostid=613323 which states 16GB, but this may be unreliable as TechPowerUp GPU-Z does give the NVIDIA site's number of 6GB. My numbers for the cache also come from gpugrid.net/show_host_detail.php, as indeed do all the memory figures in my original post, so I guess my mistake was trusting gpugrid.net/show_host_detail info. And no, this is not a laptop, but a 12-core desktop.
	ID: 61770 \| Rating: 0 \| rate: / Reply Quote

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1069 Credit: 40,231,533,983 RAC: 527 Level Scientific publications	Message 61771 - Posted: 5 Sep 2024 \| 0:18:58 UTC - in response to Message 61770. Last modified: 5 Sep 2024 \| 0:24:01 UTC
	your 1660Ti has half the memory of a 3060. your 1660Ti also has half the cache of the 3060. GTX 1660Ti has 6GB RTX 3060 has 12GB (you can see this in the host details you referenced) not sure where you're getting your information, but it's wrong or missing context (like comparing to a laptop GPU or something) My information is from GPUGrid's host information, https://gpugrid.net/show_host_detail.php?hostid=613323 which states 16GB, but this may be unreliable as TechPowerUp GPU-Z does give the NVIDIA site's number of 6GB. My numbers for the cache also come from gpugrid.net/show_host_detail.php, as indeed do all the memory figures in my original post, so I guess my mistake was trusting gpugrid.net/show_host_detail info. And no, this is not a laptop, but a 12-core desktop. TPU is not out of date, and probably one of the most reliable databases for GPU (and other) specifications. there lies the issue. you're looking at system memory, not the GPU memory. system memory has little to do with GPUGRID tasks that run on the GPUs and not the CPU. at all BOINC projects, the GPU VRAM is listed in parentheses next to the GPU model name on the Coprocessors line. and for further context, there was a long-standing bug with BOINC versions older than about 7.18 that capped Nvidia memory reported (not actual) to only 4GB. so old versions were wrong in what they reported for a long time. so still, the 3060 beats the 1660Ti in every metric. you just happened to have populated more system memory on the motherboard, but that has nothing to do with comparing the GPUs themselves.
	ID: 61771 \| Rating: 0 \| rate: / Reply Quote

Life v lies. Dont be a DN... Send message Joined: 7 Feb 12 Posts: 5 Credit: 333,019,294 RAC: 511,201 Level Scientific publications	Message 61772 - Posted: 5 Sep 2024 \| 0:39:58 UTC - in response to Message 61769.
	windows users to have issues with BOINC projects Again, I run 12-13 other projects, including at least three others where I run GPU tasks. I have a zero error rate on other projects. But I do appreciate your suggestion, as I like the science behind GPUGrid and would very much like to RUN tasks rather than have them error out. I have searched PC settings and Control Panel settings as well as file options for "AV" and do not get any relevant hits. Could you please elaborate on what you mean by AV settings and whitelisting the BOINC directories? Thanks. LLP, PhD, Prof Engr.
	ID: 61772 \| Rating: 0 \| rate: / Reply Quote

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1069 Credit: 40,231,533,983 RAC: 527 Level Scientific publications	Message 61773 - Posted: 5 Sep 2024 \| 0:43:07 UTC - in response to Message 61772.
	AV = Anti Virus software.
	ID: 61773 \| Rating: 0 \| rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,820,144,351 RAC: 19,471,837 Level Scientific publications	Message 61774 - Posted: 5 Sep 2024 \| 12:21:27 UTC
	Switching back to BOINC software and (specifically) ATMML tasks. I've posted extensively in this thread about the problems of task duration estimation at this project. I've got some new data, which I can't explain. Last week, I added a new Linux host (host 625407). It's a pretty plain vanilla Intel i5 with a single RTX 3060 - should be fairly average for this project. It's completed 17 tasks so far, with number 18 in progress - around 3 per day. I attached it to a venue with only ATMML tasks allowed. Given my interest in BOINC's server-side task duration estimation for GPUs, I've been logging the stats. Here's what I've got so far:

Task number  rsc_fpops_est  rsc_fpops_bound  flops               DCF     Runtime estimate (secs)
1            1E+18          1E+21            20,146,625,396,909  1.0000  49636     13.79 hours
2
3
4            1E+18          1E+21            20,218,746,342,900  0.8351  41301     11.47 hours
5
6
7            1E+18          1E+21            19,777,581,461,665  0.9931  50214     13.95 hours
8
9
10           1E+18          1E+21            19,446,193,249,403  0.8926  45900     12.75 hours
11           1E+18          1E+21            19,506,082,146,580  0.8247  42279     11.74 hours
12           1E+18          1E+21            19,522,515,301,144  0.7661  39242     10.90 hours
13
14           1E+18          1E+20            99,825,140,137      0.7585  7598256   87.94 days
15
16           1E+18          1E+21            99,825,140,137      0.7360  7373243   85.34 days
17           1E+18          1E+21            99,825,140,137      0.7287  7300045   84.49 days
18           1E+18          1E+21            99,825,140,137      0.7215  7227478   83.65 days

My understanding of the BOINC server code is that, for a mature app_version (Linux ATMML has been around for 2 months), the initial estimates should be based on the average speed of the tasks so far across the project as a whole. So it seems reasonable that the initial estimates were for 10-12 hours - that's about what I expected for this GPU. Then, after the first 11 tasks have been reported successful, it should switch to the average for this specific host. So why does it appear that this particular host is reporting a speed something like 1/200th of the project average?
So now, it's frantically attempting to compensate by driving my DCF through the floor, as with my two older machines. The absolute values are interesting too. The initial (project-wide) flops estimates are hovering around 20,000 GFlops - does that sound right, for those who know the hardware in detail? And they are fluctuating a bit, as might be expected for an average with variable task durations for completions. After the transition, my card dropped to below 100 GFlops - and has remained rock-steady. That's not in the script. The APR for the card (which should match the flops figure for allocated tasks) is 35599.725995644 GFlops - which doesn't match any of the figures above. Where does this take us? And what, if anything, can we do about it? I'll try to get my head round the published BOINC server code on GitHub, but this area is notoriously complex. And the likelihood is that the current code differs to a greater or lesser extent from the code in use at this project. I invite others of similarly inquisitive mind to join in with suggestions.
	ID: 61774 \| Rating: 0 \| rate: / Reply Quote
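The figures logged above look consistent with a simple relation: runtime estimate ≈ rsc_fpops_est / flops × DCF. The following is a minimal Python sketch that checks this against three of the logged rows. The relation is inferred from the numbers in the post, not taken from the BOINC source, and the function name `estimate_runtime` is made up for illustration.

```python
def estimate_runtime(rsc_fpops_est: float, flops: float, dcf: float) -> float:
    """Estimated runtime in seconds, ASSUMING estimate = fpops_est / flops * DCF."""
    return rsc_fpops_est / flops * dcf


# (task, flops, DCF, logged estimate in seconds), copied from the post above
rows = [
    (1, 20_146_625_396_909, 1.0000, 49_636),
    (4, 20_218_746_342_900, 0.8351, 41_301),
    (14, 99_825_140_137, 0.7585, 7_598_256),
]

for task, flops, dcf, logged in rows:
    est = estimate_runtime(1e18, flops, dcf)
    print(f"task {task}: computed {est:,.0f} s, logged {logged:,} s")

# Ratio between the project-wide and the host-specific flops figures
speed_ratio = 20_146_625_396_909 / 99_825_140_137
print(f"flops ratio: {speed_ratio:.0f}x")
```

The computed values should land within rounding of the logged ones (DCF is only shown to four decimals), and the flops ratio comes out close to the "1/200th" mentioned in the post.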

pututu Send message Joined: 8 Oct 16 Posts: 25 Credit: 4,153,801,869 RAC: 11,857,495 Level Scientific publications	Message 61775 - Posted: 5 Sep 2024 \| 14:38:15 UTC
	I didn't run ATMML, but I'm currently running Qchem on a Tesla P100 with short run times (averaging somewhere around 12 mins or so per task). I see similar behavior when starting a new instance. If I were to guess, your DCF will eventually go down from the last value of 0.7215 to 0.01 after running 100+ tasks, and your final estimated run time could be about 1.16 days, which is still higher than the average expected run time for your card. However, if you run the CPU benchmark, the DCF will go up from 0.01 to something higher, and it will take another 100+ tasks for the DCF to go down to 0.01 again, but this time the estimated run time will go below 1.16 days. I didn't take any notes and am only going from observation, so I could be wrong. My wild guess is that a GPU-only task still requires some CPU share, and running the benchmark accounts for the CPU portion needed for the GPU task. My one cent.
	ID: 61775 \| Rating: 0 \| rate: / Reply Quote
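The 1.16-day figure above can be reproduced from the numbers logged earlier in the thread, assuming the estimate is rsc_fpops_est / flops × DCF and that DCF bottoms out at 0.01 as the post describes. A rough Python sketch (the constant names are illustrative only):

```python
RSC_FPOPS_EST = 1e18           # task size set by the project
HOST_FLOPS = 99_825_140_137    # host-specific flops figure logged for tasks 14-18
DCF_FLOOR = 0.01               # lowest DCF value mentioned in the post above

# Assumed relation: estimate = fpops_est / flops * DCF
seconds = RSC_FPOPS_EST / HOST_FLOPS * DCF_FLOOR
days = seconds / 86_400
print(f"{seconds:,.0f} s = {days:.2f} days")
```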

Steve Volunteer moderator Project administrator Project developer Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 21 Dec 23 Posts: 46 Credit: 0 RAC: 0 Level Scientific publications	Message 61776 - Posted: 5 Sep 2024 \| 14:38:59 UTC - in response to Message 61774.
	This is very interesting, thank you for the numbers. I still don't understand where the flops number for a machine comes from. Does it use the data of your hardware? Or is it purely based on maths done from the rsc_fpops_est number we have set and the time taken for WUs? I am also unsure how I would set this rsc_fpops_est number to be more accurate. Given that one of these WUs takes maybe an hour on a 4090: a 4090 is 80 TFLOPS; x 1 hour = ~3x10^17 floating point operations. That is not actually far off the estimated value of 1x10^18. Of course, the WUs will not be using all the TFLOPS of the 4090, and there is no sane way for me to calculate the number of floating point operations the program uses.
	ID: 61776 \| Rating: 0 \| rate: / Reply Quote
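The back-of-the-envelope estimate above can be written out as a short Python sketch. The 80 TFLOPS and one-hour figures are the ones quoted in the post; the constant names are only illustrative.

```python
PEAK_FLOPS_4090 = 80e12    # ~80 TFLOPS, as quoted above
RUNTIME_S = 3600           # "maybe an hour"
RSC_FPOPS_EST = 1e18       # current project setting

# Upper bound on the work done if the card ran at peak speed for the whole hour
peak_ops = PEAK_FLOPS_4090 * RUNTIME_S
print(f"upper bound on operations in one hour: {peak_ops:.2e}")
print(f"rsc_fpops_est / upper bound: {RSC_FPOPS_EST / peak_ops:.1f}")
```

Since a real task will not sustain peak throughput, the true operation count is presumably below this bound, so the current rsc_fpops_est sits within an order of magnitude of it.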

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,820,144,351 RAC: 19,471,837 Level Scientific publications	Message 61777 - Posted: 5 Sep 2024 \| 17:07:17 UTC - in response to Message 61776.
	Well, I said that this is going to be difficult ... My knowledge and understanding comes from already being an active volunteer at SETI@home on 18 Dec 2008, when BOINC was first announced as being able to manage and use CUDA on GPUs for scientific computing - GPUGrid is also mentioned as becoming CUDA-enabled on the same day. We spent the following months and years knocking the rough edges off the initial launch code. I think the names and features I was referring to in my post were introduced in a sort-of relaunch around 2011. That's still a long way back in the memory bank, and that makes it difficult to find precise references in code or documentation. My understanding is that the system was designed to be as easy as possible for researchers to implement. I believe the only key information required is the rsc_fpops_est - the estimated size of the task. From your comments on the 4090, and my logging of the early tasks, I think we can accept the current figure as being 'near enough right', and that's all it needs to be. I think that the flops value is - since 2011-ish - reverse-engineered from your estimate of fpops_est and the measured runtime of the task on the volunteer's hardware. Pre 2011, BOINC took more notice of the 'peak flops' calculated by our computers from the speed and internal geometry of the GPU in use. BOINC guessed a 'fiddle factor' - I think something like 5% - as the ratio between the maximum usable speed on real-life jobs, and the calculated peak speed. But that was abandoned, except possibly for use as an initial seeding value. Once the real-life data is available from the initial tasks run on a new computer, the server should maintain a running average value for flops for each computer attached to the project. That should wobble with small changes in actual task duration, which is why I was surprised to see it remained identical to the last significant digit in my run so far. 
All the data necessary to calculate flops is returned as each task is reported complete. It's stored in the result table on the server, and should be transferred/averaged to the host table. I should be able to point you to the current code for managing that transfer, and the variable names and db field names used - though I may not be able to post them until after the weekend. Perhaps we could compare notes once I've found them?
	ID: 61777 \| Rating: 0 \| rate: / Reply Quote
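The idea described above, deriving a per-host flops value from rsc_fpops_est and the measured runtime and keeping a running average, could look roughly like the Python sketch below. This is not the BOINC server code: the class `HostSpeed`, its method names, and the plain arithmetic mean are all assumptions made for illustration, and the runtimes are invented.

```python
class HostSpeed:
    """Running mean of effective flops for one host (illustrative only)."""

    def __init__(self) -> None:
        self.n = 0
        self.mean_flops = 0.0

    def report(self, rsc_fpops_est: float, elapsed_s: float) -> float:
        """Fold one completed task into the running mean and return it."""
        sample = rsc_fpops_est / elapsed_s   # effective speed for this task
        self.n += 1
        self.mean_flops += (sample - self.mean_flops) / self.n
        return self.mean_flops


host = HostSpeed()
for elapsed in (41_000, 39_500, 42_300):   # made-up runtimes in seconds
    flops = host.report(1e18, elapsed)
    print(f"after {host.n} task(s): {flops / 1e9:,.0f} GFlops")
```

With an average like this, the flops value shifts a little after every reported task, which is why a figure that stays identical to the last digit looks odd.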

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,820,144,351 RAC: 19,471,837 Level Scientific publications	Message 61778 - Posted: 8 Sep 2024 \| 20:09:24 UTC
	08/09/2024 21:08:27 \| GPUGRID \| [error] Error reported by file upload server: Server is out of disk space
	ID: 61778 \| Rating: 0 \| rate: / Reply Quote

Erich56 Send message Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level Scientific publications	Message 61783 - Posted: 9 Sep 2024 \| 2:39:37 UTC - in response to Message 61778.
	08/09/2024 21:08:27 \| GPUGRID \| [error] Error reported by file upload server: Server is out of disk space this has happened at irregular intervals over all the years - last time about 2 weeks ago. Hard to believe how difficult it must be to take measures against it.
	ID: 61783 \| Rating: 0 \| rate: / Reply Quote

Maxxina Send message Joined: 6 Mar 15 Posts: 2 Credit: 165,500,150 RAC: 227,831 Level Scientific publications	Message 61790 - Posted: 11 Sep 2024 \| 20:27:28 UTC
	Well. How does one manage to complete a unit in 5 min? I'm sitting on a more than OK PC with a decent 4090 card, and my units are close to 5 hours .)
	ID: 61790 \| Rating: 0 \| rate: / Reply Quote

pututu Send message Joined: 8 Oct 16 Posts: 25 Credit: 4,153,801,869 RAC: 11,857,495 Level Scientific publications	Message 61791 - Posted: 11 Sep 2024 \| 21:36:50 UTC - in response to Message 61790.
	Well. How does one manage to complete a unit in 5 min? I'm sitting on a more than OK PC with a decent 4090 card, and my units are close to 5 hours .) If you are referring to this post https://www.gpugrid.net/forum_thread.php?id=5468&nowrap=true#61786, Steve posted in the gpugrid discord that there are still tasks that will be generated by the older batch from before the code was updated. I don't know how long it will be before these older tasks are flushed out of the system, but it has now been more than 21 days.
	ID: 61791 \| Rating: 0 \| rate: / Reply Quote

Erich56 Send message Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level Scientific publications	Message 61799 - Posted: 12 Sep 2024 \| 12:45:01 UTC
	Could it be that my Quadro P5000 is unable to crunch ATMMLs? Several days ago, I tried it twice, and each time the tasks errored out after a few minutes (I guess, but cannot tell for sure: at the moment the GPU was supposed to start working after the initial steps). BTW: the CPU is Intel Xeon E5 2667 v4 (two such CPUs are in the box). Any ideas ?
	ID: 61799 \| Rating: 0 \| rate: / Reply Quote

Steve Volunteer moderator Project administrator Project developer Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 21 Dec 23 Posts: 46 Credit: 0 RAC: 0 Level Scientific publications	Message 61800 - Posted: 12 Sep 2024 \| 13:10:01 UTC - in response to Message 61799.
	I think a P5000 should work, given that it is the same generation as a 1080 which I have confirmed to work. It may be that your drivers are too old. I have a 1080 with driver version 536.
	ID: 61800 \| Rating: 0 \| rate: / Reply Quote

Erich56 Send message Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level Scientific publications	Message 61801 - Posted: 12 Sep 2024 \| 18:00:32 UTC - in response to Message 61800.
	I think a P5000 should work, given that it is the same generation as a 1080 which I have confirmed to work. It may be that your drivers are too old. I have a 1080 with driver version 536. thanks, Steve, for your quick reply. Some 5 hours ago, I started another task - and it is still running :-) So I keep my fingers crossed that it will finish successfully. No idea why the other two ones before failed. BTW: the driver is 537.99
	ID: 61801 \| Rating: 0 \| rate: / Reply Quote

TofPete Send message Joined: 17 Mar 24 Posts: 7 Credit: 52,127,500 RAC: 208,643 Level Scientific publications	Message 61816 - Posted: 18 Sep 2024 \| 13:47:47 UTC
	Hi, Why have I been receiving error messages like this in ATMML tasks recently?
Stderr output
<core_client_version>8.0.2</core_client_version>
<![CDATA[
<message> (unknown error) (0) - exit code 195 (0xc3)</message>
<stderr_txt>
09:59:48 (19024): wrapper (7.9.26016): starting
09:59:48 (19024): wrapper: running Library/usr/bin/tar.exe (xjvf input.tar.bz2)
aceforce_dft_v0.4.ckpt
	ID: 61816 \| Rating: 0 \| rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,820,144,351 RAC: 19,471,837 Level Scientific publications	Message 61817 - Posted: 18 Sep 2024 \| 19:18:18 UTC - in response to Message 61816.
	You have to read a long way further down to find the real answer to your question! In the one I picked, I see:
Traceback (most recent call last):
  File "D:\ProgramData\BOINC\slots\2\Scripts\rbfe_explicit_sync.py", line 11, in <module>
    rx.scheduleJobs()
  File "D:\ProgramData\BOINC\slots\2\Lib\site-packages\sync\atm.py", line 126, in scheduleJobs
    self.worker.run(replica)
  File "D:\ProgramData\BOINC\slots\2\Lib\site-packages\sync\worker.py", line 124, in run
    raise RuntimeError(f"Simulation failed {ntry} times!")
RuntimeError: Simulation failed 5 times!
That looks like something is wrong in the way that job was set up by the project - that's not your fault, and there's nothing you can do about it except report it here - and move on to the next one.
	ID: 61817 \| Rating: 0 \| rate: / Reply Quote
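The last frame of the traceback above suggests a retry loop that gives up after a fixed number of attempts. The following is a hypothetical Python sketch of that general pattern, not the project's actual `worker.py`; `run_with_retries`, `simulate`, `max_tries`, and `always_fails` are invented names.

```python
from typing import Callable


def run_with_retries(simulate: Callable[[], None], max_tries: int = 5) -> int:
    """Call simulate() until it succeeds; raise after max_tries failures.

    Returns the number of attempts that were needed.
    """
    for ntry in range(1, max_tries + 1):
        try:
            simulate()
            return ntry
        except Exception as exc:  # a real worker would catch narrower errors
            print(f"attempt {ntry} failed: {exc}")
    raise RuntimeError(f"Simulation failed {max_tries} times!")


def always_fails() -> None:
    raise ValueError("bad input")  # stand-in for a badly configured job


try:
    run_with_retries(always_fails)
except RuntimeError as err:
    print(err)
```

In a pattern like this, a job whose inputs are wrong fails identically on every attempt, so the final error appears on every host that receives it.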

Erich56 Send message Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level Scientific publications	Message 61834 - Posted: 27 Sep 2024 \| 12:53:24 UTC
	Today, I've so far had 3 tasks which were "aborted by server" after several hours' runtime. Which is not really nice. If the team decides that certain types of tasks are no longer needed, they should delete them from the distribution pool before they are sent out, instead of having us volunteers crunch them for hours before stopping them.
	ID: 61834 \| Rating: 0 \| rate: / Reply Quote

Bedrich Hajek Send message Joined: 28 Mar 09 Posts: 485 Credit: 11,079,048,466 RAC: 15,671,380 Level Scientific publications	Message 61835 - Posted: 27 Sep 2024 \| 12:57:15 UTC - in response to Message 61834.
	Today, I've so far had 3 tasks which were "aborted by server" after several hours' runtime. Which is not really nice. If the team decides that certain types of tasks are no longer needed, they should delete them from the distribution pool before they are sent out, instead of having us volunteers crunch them for hours before stopping them. I agree.
	ID: 61835 \| Rating: 0 \| rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,820,144,351 RAC: 19,471,837 Level Scientific publications	Message 61836 - Posted: 27 Sep 2024 \| 12:59:50 UTC - in response to Message 61834. Last modified: 27 Sep 2024 \| 13:02:17 UTC
	Today, I've so far had 3 tasks which were "aborted by server" after several hours' runtime. Which is not really nice. If the team decides that certain types of tasks are no longer needed, they should delete them from the distribution pool before they are sent out, instead of having us volunteers crunch them for hours before stopping them. I came here to report exactly the same thing. In the last hour, I've had 6 ATMML tasks which went wrong, and I only have 5 video cards! Two were a relatively quick 'Error while computing' - perhaps around the 5% mark. Four were 'Cancelled by server', after runs from 14 ksec to 50 ksec. I'm switching to Quantum Chemistry for the moment, until we get a handle on what the problem is. Edit - there goes another one: 'Error while computing' around the 5% mark.
	ID: 61836 \| Rating: 0 \| rate: / Reply Quote

WPrion Send message Joined: 30 Apr 13 Posts: 96 Credit: 2,638,034,111 RAC: 20,338,166 Level Scientific publications	Message 61837 - Posted: 27 Sep 2024 \| 13:10:42 UTC - in response to Message 61836. Last modified: 27 Sep 2024 \| 13:11:34 UTC
	Today, I've so far had 3 tasks which were "aborted by server" after several hours' runtime. Which is not really nice. If the team decides that certain types of tasks are no longer needed, they should delete them from the distribution pool before they are sent out, instead of having us volunteers crunch them for hours before stopping them. I came here to report exactly the same thing. Two were a relatively quick 'Error while computing' - perhaps around the 5% mark. Something is strange. The work queue was over 800 tasks yesterday, now it's 7.
	ID: 61837 \| Rating: 0 \| rate: / Reply Quote

Freewill Send message Joined: 18 Mar 10 Posts: 20 Credit: 30,473,432,894 RAC: 155,013,392 Level Scientific publications	Message 61838 - Posted: 27 Sep 2024 \| 13:15:55 UTC - in response to Message 61837.
	There was a message from Quico posted on the GPUGrid Discord server. They cancelled some parts of the project as they need to finish other parts quickly. They will resend those later. I also lost quite a bit of run time, but they didn't have a better way to do it.
	ID: 61838 \| Rating: 0 \| rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,820,144,351 RAC: 19,471,837 Level Scientific publications	Message 61839 - Posted: 27 Sep 2024 \| 13:26:49 UTC - in response to Message 61838.
	They cancelled some parts of the project as they need to finish other parts quickly. Let's hope the ones they need to finish quickly aren't the most recent ones that end with 'Error while computing'. My 'Error while computing' today are all replication _0, type CDK2 or - a recent one - NEW_CDK2.
	ID: 61839 \| Rating: 0 \| rate: / Reply Quote

Freewill Send message Joined: 18 Mar 10 Posts: 20 Credit: 30,473,432,894 RAC: 155,013,392 Level Scientific publications	Message 61841 - Posted: 27 Sep 2024 \| 14:59:39 UTC - in response to Message 61839.
	They cancelled some parts of the project as they need to finish other parts quickly. Let's hope the ones they need to finish quickly aren't the most recent ones that end with 'Error while computing'. My 'Error while computing' today are all replication _0, type CDK2 or - a recent one - NEW_CDK2. If you're using MPS with Nvidia cards, I have seen that killing or stopping tasks while loaded into the GPU memory can really screw things up and cause newly started tasks to fail as well. That was happening after their server cancels, so I am restarting my PCs.
	ID: 61841 \| Rating: 0 \| rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,820,144,351 RAC: 19,471,837 Level Scientific publications	Message 61842 - Posted: 27 Sep 2024 \| 18:15:21 UTC - in response to Message 61841.
	It doesn't really feel like that sort of problem, but I'll keep an eye on it and restart if it happens again. At the moment, the machines seem happy on QC tasks - and that seems to be the work that they want crunching, judging by the SSP. When I see the RTS queue filling up again, I'll try one or two to see what happens.
	ID: 61842 \| Rating: 0 \| rate: / Reply Quote

Erich56 Send message Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level Scientific publications	Message 61843 - Posted: 27 Sep 2024 \| 19:27:21 UTC - in response to Message 61842.
	At the moment, the machines seem happy on QC tasks - and that seems to be the work that they want crunching but this strategy does not work the way it was probably supposed to - as QC tasks are not available for Windows crunchers (and those are still the majority). Another indication for this intention seems to be that they reduced the credit points for ATMML by 50 % - as I noticed with the last few tasks that were not aborted by server and hence could finish - see here: https://www.gpugrid.net/result.php?resultid=36020678
	ID: 61843 \| Rating: 0 \| rate: / Reply Quote

ServicEnginIC Send message Joined: 24 Sep 10 Posts: 581 Credit: 9,762,712,024 RAC: 21,245,153 Level Scientific publications	Message 61844 - Posted: 27 Sep 2024 \| 20:13:59 UTC - in response to Message 61834.
	Today, I've so far had 3 tasks which were "aborted by server" after several hours' runtime. Which is not really nice. I have added up the processing time from my hosts for 11 ATMML tasks "Aborted by server" over the past three days: about 135 hours. Really not nice. The tradition used to be that only tasks not yet started were aborted, and started ones were allowed to finish. There must be weighty reasons for breaking this tradition, I suppose.
	ID: 61844 \| Rating: 0 \| rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,820,144,351 RAC: 19,471,837 Level Scientific publications	Message 61845 - Posted: 27 Sep 2024 \| 21:17:54 UTC - in response to Message 61843.
	Another indication for this intention seems to be that they reduced the credit points for ATMML by 50 % I've seen that well before the cancellations started - but only for tasks which ran significantly quicker than the ones we're used to. I think that's fair.
	ID: 61845 \| Rating: 0 \| rate: / Reply Quote

KeithBriggs Send message Joined: 29 Aug 24 Posts: 10 Credit: 786,150,000 RAC: 9,564,086 Level Scientific publications	Message 61846 - Posted: 27 Sep 2024 \| 21:18:28 UTC - in response to Message 61844.
	135 hours is a drag! Sorry to hear. We want our GPUs productive and not just hamster wheels.
	ID: 61846 \| Rating: 0 \| rate: / Reply Quote

Erich56 Send message Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level Scientific publications	Message 61849 - Posted: 28 Sep 2024 \| 5:58:40 UTC - in response to Message 61845.
	Another indication for this intention seems to be that they reduced the credit points for ATMML by 50 % I've seen that well before the cancellations started - but only for tasks which ran significantly quicker than the ones we're used to. I think that's fair. I agree that for significantly shorter tasks the credit is lower, no question. But the task which I cited was one of the "long ones".
	ID: 61849 \| Rating: 0 \| rate: / Reply Quote

Erich56 Send message Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level Scientific publications	Message 61850 - Posted: 28 Sep 2024 \| 5:59:23 UTC - in response to Message 61846.
	135 hours is a drag! Sorry to hear. We want our GPUs productive and not just hamster wheels. + 1
	ID: 61850 \| Rating: 0 \| rate: / Reply Quote

Greg _BE Send message Joined: 30 Jun 14 Posts: 135 Credit: 121,356,939 RAC: 28,544 Level Scientific publications	Message 61860 - Posted: 1 Oct 2024 \| 20:59:28 UTC Last modified: 1 Oct 2024 \| 21:00:26 UTC
	I just processed my first one after getting my computer back online. It errored out. There are a lot of deprecation messages in the task output. What is that all about? 12.9 hours run time. https://www.gpugrid.net/results.php?userid=107556&offset=0&show_names=0&state=5&appid=
	ID: 61860 \| Rating: 0 \| rate: / Reply Quote

Erich56 Send message Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level Scientific publications	Message 61939 - Posted: 15 Nov 2024 \| 19:39:23 UTC
	in the past few days, there have been a lot of ATMML tasks which failed after a few minutes. It's the sub-type PTP1B... Clicking on the workunit reveals that this does not only happen with my hosts, but likewise with other volunteers.
	ID: 61939 \| Rating: 0 \| rate: / Reply Quote

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,649,638,294 RAC: 13,283,785 Level Scientific publications	Message 61941 - Posted: 15 Nov 2024 \| 19:49:08 UTC
	Yes, that batch was badly configured. Seems to have been cleared out and expired relatively soon. I'm on to a different MCL1 batch that is working correctly.
	ID: 61941 \| Rating: 0 \| rate: / Reply Quote
