Message boards : News : New D3RBanditTest workunits
Joined: 12 Jul 17 · Posts: 404 · Credit: 17,408,899,587 · RAC: 2
> Are there prolonged CPU-heavy periods where GPU util drops nearly to zero? Otherwise, I'll probably have an issue with my card. Just ~2 hrs into a new WU and it stalled. BOINC Manager reports steadily increasing processor time since the last checkpoint, but GPU util has been at 0% for nearly 30 min. Is that normal?

No, that's not normal. I haven't seen that behavior myself. Are you using an app_config.xml to, say, run 2 WUs on the same GPU or max out the CPUs? This is mine:

```
<app_config>
  <app>
    <name>acemd3</name>
    <gpu_versions>
      <cpu_usage>1.00</cpu_usage>
      <gpu_usage>1.00</gpu_usage>
    </gpu_versions>
  </app>
  <project_max_concurrent>4</project_max_concurrent>
</app_config>
```

Might be something else you're running.
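(Usage note on the snippet above: app_config.xml goes in the project's directory under the BOINC data directory, e.g. projects/www.gpugrid.net/, and BOINC picks up edits via Options → Read config files in BOINC Manager, or on a client restart.)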
Joined: 12 Jul 17 · Posts: 404 · Credit: 17,408,899,587 · RAC: 2
> I'm aware of the situation. but if you sell a 2080ti, and re-buy a 2070, you're still left with more money, no? even at higher prices, everything just shifts up because of the market. You're restricting the 2080ti so much that it performs similarly to a 2070S, so why have the 2080ti? that was my only point.

I'm selling off my second string (1070s & 1080s) but keeping my 2080s. Waiting for RTC 3080s with the new design rules, but that'll probably be April.

> I agree about 240V, and I normally run my systems in a remote location on 240V, but due to renovations, I have one system temporarily at my house. but if you're on 240V, why restrict it so much?

This is funny, since you gave me the script in one of the GG threads and I've been grateful ever since; it cut my electric bill by a third. My 100 Amp load center is maxed out. No wiggle room left.

> I use the voltage telemetry (via IPMI) to identify when a PSU might be failing.

Sounds good, but Dr Google threw so much stuff at me. Voltage telemetry can be so many different things. The IPMI article seemed to indicate that AMT might be better for me: https://en.wikipedia.org/wiki/Intel_Active_Management_Technology. Most of my PSUs have been running for about 7 years now and are starting to die of old age. Hard failures are nice since they're easy to diagnose; it's the flaky ones with intermittent problems that are the headache. I swap parts between a good computer and a bad actor, and sometimes I get lucky and can convince myself that a PSU is over the hill. I have one now that randomly stops communicating after a couple of hours while remaining powered up. I need to put a head on it and play swapsies. Does the voltage telemetry you observe give you an unambiguous indication of the demise of a PSU?
Joined: 13 Dec 17 · Posts: 1416 · Credit: 9,119,446,190 · RAC: 614,515
IPMI stands for Intelligent Platform Management Interface. https://en.wikipedia.org/wiki/Intelligent_Platform_Management_Interface It's always found on server motherboards with a BMC (Baseboard Management Controller). https://searchnetworking.techtarget.com/definition/baseboard-management-controller You can look at the voltages coming out of the power supply while the system is under load and spot a power supply that is flaking out or on the edge of stability. You can also set warnings on voltage levels and such. Very handy, and all done remotely.
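For anyone who'd rather poll those readings from a script than watch a webGUI, here is a minimal sketch, assuming `ipmitool` is installed and the BMC is reachable. The rail names in `NOMINALS` are placeholders to adjust to your board's sensor labels, and the ±5% band is the generic ATX tolerance, not your BMC's configured thresholds:

```python
import subprocess

# Nominal rails checked against the generic ATX +/-5% tolerance.
# These sensor-name keys are assumptions; match them to your board's labels.
NOMINALS = {"3.3V": 3.3, "5V": 5.0, "12V": 12.0}

def read_voltages():
    """Poll the BMC via `ipmitool sensor` and return {sensor_name: volts}."""
    out = subprocess.run(["ipmitool", "sensor"], capture_output=True,
                         text=True, check=True).stdout
    readings = {}
    for line in out.splitlines():
        fields = [f.strip() for f in line.split("|")]
        # ipmitool sensor rows look like: name | value | unit | status | ...
        if len(fields) >= 3 and fields[2] == "Volts":
            try:
                readings[fields[0]] = float(fields[1])
            except ValueError:
                pass  # reading was "na" or similar
    return readings

def check(readings, tolerance=0.05):
    """Print each matched rail and flag values outside the tolerance band."""
    for label, nominal in NOMINALS.items():
        for sensor, value in readings.items():
            if label in sensor:
                lo, hi = nominal * (1 - tolerance), nominal * (1 + tolerance)
                flag = "" if lo <= value <= hi else "  <-- out of spec"
                print(f"{sensor}: {value:.3f} V (expect {lo:.3f}-{hi:.3f}){flag}")

if __name__ == "__main__":
    check(read_voltages())
```

Run from cron every few minutes, a log of this output gives you the sag-over-time trend that the posts below describe watching for.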
Joined: 21 Feb 20 · Posts: 1114 · Credit: 40,838,909,595 · RAC: 4,232,576
As Keith said, IPMI is the remote management interface built into the ASRock Rack and Supermicro motherboards that I use. They have voltage monitoring built in for the most part: the BMC measures the voltages it sees at the 24-pin connector and reports them out over the IPMI interface. I access the telemetry via the dedicated webGUI served at a configurable IP address on the LAN. Telltale signs of failure are usually sagging voltages, and not always where you expect. I have a PSU that I think is on the way out. It's a 1200W unit, only loaded to maybe 600W now, but it's 10 years old and was previously run pretty hard at its limit, in hot temps, for a long time. I've noticed random system restarts, and lots of warnings in the IPMI about the 3.3V rail dropping below the 3.04V threshold. I'll need to replace this soon.
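For context, a quick worked check (assuming the board follows the generic ATX ±5% tolerance on the 3.3 V rail): the spec floor is 3.3 V × 0.95 ≈ 3.14 V, while that 3.04 V alarm threshold sits about 8% below nominal (1 − 3.04/3.3 ≈ 0.079). Readings that trip it are therefore well out of spec, genuine sag rather than sensor noise.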
Joined: 21 Feb 20 · Posts: 1114 · Credit: 40,838,909,595 · RAC: 4,232,576
> “RTC 3080s with new design rules”? Huh?

It's unfortunate that even existing 30-series cards still don't work for GPUGRID; the app is holding them back. Right now the only options for 30-series are unoptimized OpenCL projects, FAH, or PrimeGrid (if you think finding big prime numbers is useful?).
Joined: 15 Jan 12 · Posts: 5 · Credit: 59,000,574 · RAC: 0
Seems I did not get one yet, and I have a GeForce RTX 2060.
Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 1
> Seems I did not get one yet, and I have a GeForce RTX 2060.

According to the status page of your host, you have two, and you had 7 before.
Joined: 15 Jan 12 · Posts: 5 · Credit: 59,000,574 · RAC: 0
Ok, maybe I was looking for a different name than what is there, so what are the names of the two new ones then? |
Joined: 8 May 18 · Posts: 190 · Credit: 104,426,808 · RAC: 0
GTX 1060: completed in 157,027.38 s
GTX 1650: completed in 175,814.54 s

Both on Windows 10 computers, the first with a Ryzen 5 1400 CPU, the second with an Intel i5 9400F.
Tullio

Sorry, I had mixed up the two computers. The GTX 1060 has 3 GB of video RAM and was excluded from running Einstein@home gravitational-wave tasks, which needed more than 3 GB. The GTX 1650 has 4 GB of video RAM.
Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 1
> > > Seems I did not get one yet, and I have a GeForce RTX 2060.

> > According to the status page of your host, you have two, and you had 7 before.

> Ok, maybe I was looking for a different name than what is there, so what are the names of the two new ones then?

Click on the link in my reply, and you'll see.
Joined: 12 Jul 17 · Posts: 404 · Credit: 17,408,899,587 · RAC: 2
> “RTC 3080s with new design rules”? Huh?

They're switching some or all production from Samsung 8 nm design rules to TSMC 7 nm design rules. An RTX 3080 Ti with more memory is also expected. https://hexus.net/tech/news/graphics/146170-nvidia-shift-rtx-30-gpus-tsmc-7nm-2021-says-report/
Joined: 21 Feb 20 · Posts: 1114 · Credit: 40,838,909,595 · RAC: 4,232,576
> “RTC 3080s with new design rules”? Huh?

Oh, it was a typo on the RTX/RTC. They've had those rumors since pretty much launch (note the date on that article, lol). Personally I wouldn't hold my breath. If they release anything, expect more paper launches with a few thousand cards available day one, then basically nothing for months again. But it's all moot for GPUGRID anyway until they get around to making the app compatible with Ampere. There are 5 different Ampere models that I've seen attempted here (A100, 3090, 3080, 3070, 3060 Ti), and the 3060 is set to launch in late February. More models are meaningless for us if we can't use them :(
Joined: 12 Jul 17 · Posts: 404 · Credit: 17,408,899,587 · RAC: 2
> it's all moot for GPUGRID anyway until they get around to making the app compatible with Ampere.

I bet they don't have an Ampere GPU to do development and testing on.
Joined: 21 Feb 20 · Posts: 1114 · Credit: 40,838,909,595 · RAC: 4,232,576
The thing is, they don't even need to. They can just download the new CUDA toolkit, edit a few arguments in the config file or makefile, and re-compile the app as-is. It's not a whole lot of work, and it'll work. They basically just need to unlock the new architecture: Ampere cards report compute capability 8.x, which older CUDA toolkits don't support, so the app currently doesn't even try to run; it fails at the architecture check.
Joined: 16 Apr 09 · Posts: 503 · Credit: 769,991,668 · RAC: 0
> I have a dual GPU system, GTX 980 and GTX 1080 Ti, and all of these work units have failed. Drivers are current as of December. I had to roll back the January update as it didn't play nice with Milkyway@Home while you folks were on holiday. Suddenly I can't complete a work unit without error.

Running for how many hours a day? The GTX 1080 Ti should be adequate if you run it 24 hours a day, but I'm not sure the GTX 980 will be.
Joined: 15 Jan 12 · Posts: 5 · Credit: 59,000,574 · RAC: 0
I'm talking about the new test units that were being talked about by ADMIN, not WUs from ACEMD, and if they have similar names then I wouldn't know. But they are most certainly not taking 18 hours to crunch; they are taking the same amount of time to crunch as before, so I think they are the same WUs that I have been getting for six months, not the new ones listed at the top of this thread.
Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 1
> I'm talking about the new test units that were being talked about by ADMIN,

These workunits are the same as the "test" batch.

> not WUs from ACEMD,

ACEMD is the app that processes all of the GPUGrid workunits, regardless of their size.

> and if they have similar names then I wouldn't know.

They are named like: e22s16_e2s343p0f9-ADRIA_D3RBandit_batch2-0-1-RND4443_1

> But they are most certainly not taking 18 hours to crunch,

They take about 72,800~74,000 seconds on your host with an RTX 2060. That is 20h 13m ~ 20h 33m.

> they are taking the same amount of time to crunch so I think they are the same WUs that I have been getting for six months, not the new ones listed at the top of this thread.

They are definitely not the same, as those took less than 2 hours on a similar GPU.
Joined: 21 Feb 20 · Posts: 1114 · Credit: 40,838,909,595 · RAC: 4,232,576
> I have a dual GPU system, GTX 980 and GTX 1080 Ti, and all of these work units have failed. Drivers are current as of December. I had to roll back the January update as it didn't play nice with Milkyway@Home while you folks were on holiday. Suddenly I can't complete a work unit without error.

They have not all failed; you actually have a few that were submitted fine. See your tasks for that system here: http://www.gpugrid.net/results.php?hostid=514949

From your errors:

```
00:31:16 (11412): wrapper: running acemd3.exe (--boinc input --device 0)
00:28:21 (11224): wrapper: running acemd3.exe (--boinc input --device 0)
00:28:21 (3488): wrapper: running acemd3.exe (--boinc input --device 1)
23:35:51 (3068): wrapper: running acemd3.exe (--boinc input --device 0)
23:39:12 (1400): wrapper: running acemd3.exe (--boinc input --device 1)
23:39:12 (4088): wrapper: running acemd3.exe (--boinc input --device 0)
```

So it's clear what's causing your issue: you're either routinely starting and stopping BOINC computation, or the long run of these tasks is triggering task switching with other projects, which sometimes results in the process restarting on a different card. It's fairly well known that this will result in an error for GPUGRID tasks. You should increase the time threshold for task switching ("Switch between tasks every X minutes" in the computing preferences) to longer than the run of these tasks (24 hrs?), and avoid interrupting the computation. Do not turn off the computer or let it go to sleep. I would probably even avoid the "Suspend GPU computing while computer is in use" option in BOINC. Anything to avoid interrupting these very long tasks.
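If you want to scan a task's saved stderr output for this pattern yourself, here is a minimal sketch. The line format is taken from the wrapper output quoted above; the file path argument is just wherever you saved a copy of the task's stderr:

```python
import re
import sys

# Matches the wrapper restart lines shown above, e.g.
# "00:31:16 (11412): wrapper: running acemd3.exe (--boinc input --device 0)"
LINE = re.compile(r"wrapper: running acemd3\.exe \(.*--device (\d+)\)")

def devices_used(stderr_text):
    """Return the GPU device index for each (re)start of acemd3, in order."""
    return [int(m.group(1)) for m in LINE.finditer(stderr_text)]

if __name__ == "__main__":
    text = open(sys.argv[1]).read()  # e.g. a saved copy of the task's stderr
    devs = devices_used(text)
    print("starts on devices:", devs)
    if len(set(devs)) > 1:
        print("task restarted on a different GPU -- expect an error")
    elif len(devs) > 1:
        print("task restarted, but stayed on the same GPU")
```

More than one line means the task was interrupted and restarted; a change in the device number between lines is the failure mode described above.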
Joined: 12 Jul 17 · Posts: 404 · Credit: 17,408,899,587 · RAC: 2
> you should increase the time threshold for task switching to longer than the run of these tasks (24 hrs?), and avoid interrupting the computation.

Try setting the resource share to zero on the other project you wish to time-slice with. Then it should only send you its WUs when you have no GG WUs left.
Joined: 21 Feb 20 · Posts: 1114 · Credit: 40,838,909,595 · RAC: 4,232,576
True, I do this. But some people like to crunch multiple projects concurrently, giving some love to them all, not just a primary/backup arrangement.