New D3RBanditTest workunits

Author	Message
Ian&Steve C. · Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 5,269
Message 56554 - Posted: 16 Feb 2021, 13:21:02 UTC - in response to Message 56540.
Quote (Message 56540): I'm not seeing many resends, mostly _0 and _1 original tasks. No issues getting work or returning valid results.
There's no quorum required here, so _1 is a resend.

YamFan · Joined: 11 Feb 16 · Posts: 2 · Credit: 60,774,031 · RAC: 0
Message 56557 - Posted: 16 Feb 2021, 14:24:00 UTC
WU: New version of ACEMD 2.11 (cuda101)
Name: e16s23_e1s182p0f136-ADRIA_D3RBandit_batch0-0-1-RND8763
Currently at 4 days, 6 hours, with approx. 6 hours left (just in time for the deadline, I hope). The GPU is a 750 Ti, running continuously, although I have been using the computer for other things too at times.
So, yes, current work units are long. Much longer than "normal". This post is just for reference. ✌

YamFan · Joined: 11 Feb 16 · Posts: 2 · Credit: 60,774,031 · RAC: 0
Message 56558 - Posted: 16 Feb 2021, 14:26:02 UTC - in response to Message 56553.
I agree. Overclocking the GPU is much more likely to cause WU failure. ✌

tullio · Joined: 8 May 18 · Posts: 190 · Credit: 104,426,808 · RAC: 0
Message 56560 - Posted: 16 Feb 2021, 15:15:04 UTC. Last modified: 16 Feb 2021, 15:15:35 UTC
Task 27021821 was canceled by server. Why? It was waiting to run.
Tullio

Ian&Steve C. · Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 5,269
Message 56561 - Posted: 16 Feb 2021, 15:18:30 UTC - in response to Message 56560.
Quote (Message 56560): Task 27021821 was canceled by server. Why? It was waiting to run. Tullio
http://www.gpugrid.net/workunit.php?wuid=27021821
Because it was no longer needed. The original host returned the task after the deadline, but before you had processed it. Since a quorum is not required, they only need one result. Letting you process something that already has a result would just be a waste of time.

Bill · Joined: 28 Oct 20 · Posts: 2 · Credit: 153,096,086 · RAC: 0
Message 56562 - Posted: 16 Feb 2021, 15:31:46 UTC - in response to Message 56535. Last modified: 16 Feb 2021, 15:32:12 UTC
Quote (Message 56535): I have a 1660 super that takes about 34-38 hours on these new units, still seeing temps in the 60s, only uses 15% of gpu in task manager tho
Interesting. My 1660 Ti is completing these tasks in about 25 hours. I would not have thought that there would be that much of a lead in crunching time. You have a 2600 as well, and I'm running a 2200G, so my theory of a slower CPU doesn't seem to hold here.
EDIT: Oh hey, this is my first post here. Hi everyone!

Ian&Steve C. · Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 5,269
Message 56564 - Posted: 16 Feb 2021, 16:03:37 UTC - in response to Message 56562.
The 1660 Ti is still faster than a 1660S. You're both on Windows, so your times should be a little more comparable to my 1660S Linux times (27hrs). His first task actually completed in about 29hrs, which seems right. The second task took 40hrs, but I can only speculate about the reason. Maybe he was doing other things on the system that slowed down processing.

Aurum · Joined: 12 Jul 17 · Posts: 404 · Credit: 17,408,899,587 · RAC: 0
Message 56566 - Posted: 16 Feb 2021, 16:11:20 UTC
Is it overclocking if you're underpowering???
This script runs 2080 Ti WUs in about 15 hours at 36% less power.
First run this command:
sudo nvidia-xconfig --enable-all-gpus --cool-bits=28 --allow-empty-initial-configuration
Then execute this script:
#!/bin/bash
/usr/bin/nvidia-smi -pm 1 # enable persistence mode
/usr/bin/nvidia-smi -acp UNRESTRICTED # let non-root users change application clocks
/usr/bin/nvidia-smi -i 0 -pl 160 # cap GPU 0 at a 160 W power limit
/usr/bin/nvidia-settings -a "[gpu:0]/GPUPowerMizerMode=1" # 0=Adaptive, 1=Prefer Maximum Performance, 2=Auto
# apply memory and graphics clock offsets at performance level 3
/usr/bin/nvidia-settings -a "[gpu:0]/GPUMemoryTransferRateOffset[3]=400" -a "[gpu:0]/GPUGraphicsClockOffset[3]=100"

Ian&Steve C. · Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 5,269
Message 56568 - Posted: 16 Feb 2021, 16:26:54 UTC - in response to Message 56566.
160W seems too low for a 2080ti IMO. You're really restricting performance at that point.
For reference, my RTX 2070s run in about 17hr @ 150W (not far from the performance of your 2080ti, but the 2070 is much less expensive).
My RTX 2080tis run in about 10hr @ 225W: 50% faster for 40% more power.
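The figures traded in the last two posts can be compared as energy per work unit. The sketch below is only a rough check: it assumes each card draws exactly its configured power limit for the whole run, which the thread does not confirm, and the helper names `energy_wh` and `pct_more` are made up for illustration.

```shell
#!/bin/sh
# Energy per work unit in watt-hours: run time (hours) x power limit (watts).
# ASSUMPTION: the card sits at its power limit for the entire run.
energy_wh() {
  echo $(( $1 * $2 ))
}

# Percentage by which the second value exceeds the first (integer, rounded down).
pct_more() {
  echo $(( ($2 - $1) * 100 / $1 ))
}

echo "2080 Ti, 15 h @ 160 W: $(energy_wh 15 160) Wh"   # 2400 Wh
echo "2080 Ti, 10 h @ 225 W: $(energy_wh 10 225) Wh"   # 2250 Wh
echo "2070,    17 h @ 150 W: $(energy_wh 17 150) Wh"   # 2550 Wh
echo "Power increase 160 W -> 225 W: $(pct_more 160 225)%"      # 40%
echo "Throughput increase 15 h -> 10 h: $(pct_more 10 15)%"     # 50%
```

Under that assumption the 225 W setting uses slightly less energy per work unit than the 160 W setting, which is consistent with the "50% faster for 40% more power" remark.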

Pop Piasa · Joined: 8 Aug 19 · Posts: 252 · Credit: 458,054,251 · RAC: 0
Message 56569 - Posted: 16 Feb 2021, 16:37:57 UTC - in response to Message 56545.
Quote (Message 56545): One example: WU #27023500 has been reported by a GTX 750 Ti at one of my hosts in 446,845.54 seconds, more than 4 hours after deadline. This task has been rewarded with 348,750.00 credits, and hasn't been resent to any other host, with "Didn't need" legend. Maybe the project managers are quietly addressing the request of many Gpugrid users in this regard this way (?)
This is an interesting opportunity to study how the server handles tardy but completed WUs. My GTX 750ti was also allowed to run past the deadline and received equal credit to yours.
https://www.gpugrid.net/workunit.php?wuid=27023657
Could it be that the server can detect a WU being actively crunched and stop cancellation? Or is this a normal delay function we see? It had already created another task but had not sent it yet, and marked it as "didn't need" when my host reported.
Run time was 408,214.47 sec.
CPU time was 406,250.30 sec.
GPU clock was 1366MHz.
Mem clock was 2833MHz.
With my homespun fan intake mod it ran at 55C with 22C ambient room temp. Just can't kill it.
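The "more than 4 hours after deadline" figure can be reproduced with a little arithmetic. The sketch below assumes a 5-day deadline and a task that started the moment it was received; neither is stated in the thread, and the function names are hypothetical.

```shell
#!/bin/sh
# Seconds by which an elapsed time exceeds a deadline given in days.
# ASSUMPTION: 5-day deadline, counted from the moment the task started.
overrun_seconds() {
  echo $(( $1 - $2 * 86400 ))
}

# Format a non-negative number of seconds as "Xh Ym".
fmt_hm() {
  echo "$(( $1 / 3600 ))h $(( $1 % 3600 / 60 ))m"
}

late=$(overrun_seconds 446845 5)   # 446,845.54 s, truncated to whole seconds
echo "Past deadline by: $(fmt_hm "$late")"   # 4h 7m
```

The same calculation on the 408,214 s run time gives a negative number, so that task's lateness would have to come from time spent waiting before or between runs rather than from run time alone.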

Aurum · Joined: 12 Jul 17 · Posts: 404 · Credit: 17,408,899,587 · RAC: 0
Message 56570 - Posted: 16 Feb 2021, 16:39:16 UTC - in response to Message 56568.
Quote (Message 56568): 160W seems too low for a 2080ti
My circuit breakers are perfectly optimized. Not to mention heat management. I'm going to have to sit this race out.

Ian&Steve C. · Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 5,269
Message 56571 - Posted: 16 Feb 2021, 16:51:35 UTC - in response to Message 56569.
Quote (Message 56545): One example: WU #27023500 has been reported by a GTX 750 Ti at one of my hosts in 446,845.54 seconds, more than 4 hours after deadline. This task has been rewarded with 348,750.00 credits, and hasn't been resent to any other host, with "Didn't need" legend. Maybe the project managers are quietly addressing the request of many Gpugrid users in this regard this way (?)
Quote (Message 56569): This is an interesting opportunity to study how the server handles tardy but completed WUs. My GTX 750ti was also allowed to run past the deadline and received equal credit to yours. https://www.gpugrid.net/workunit.php?wuid=27023657 Could it be that the server can detect a WU being actively crunched and stop cancellation? Or is this a normal delay function we see? It had already created another task but had not sent it yet, and marked it as "didn't need" when my host reported. Run time was 408,214.47 sec. CPU time was 406,250.30 sec. GPU clock was 1366MHz. Mem clock was 2833MHz. With my homespun fan intake mod it ran at 55C with 22C ambient room temp. Just can't kill it.
I think it's probably just first come, first served: whoever returns it first "wins" and the other task gets the axe. Another user reported that his task was cancelled off his host (but he hadn't started crunching yet). I'm unsure what would happen to a task that was axed after crunching had already started.

Ian&Steve C. · Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 5,269
Message 56572 - Posted: 16 Feb 2021, 16:54:16 UTC - in response to Message 56570. Last modified: 16 Feb 2021, 17:01:15 UTC
Quote (Message 56568): 160W seems too low for a 2080ti
Quote (Message 56570): My circuit breakers are perfectly optimized. Not to mention heat management. I'm going to have to sit this race out.
I understand. I actually have my 6x 2080ti host split between 2 breakers to avoid overloading a single 20A one (there are other systems on one of the circuits). I'm just saying that if you're going to restrict it that far, you might be better off with a cheaper card to begin with and save some money.

Pop Piasa · Joined: 8 Aug 19 · Posts: 252 · Credit: 458,054,251 · RAC: 0
Message 56575 - Posted: 16 Feb 2021, 17:26:46 UTC - in response to Message 56534. Last modified: 16 Feb 2021, 17:38:52 UTC
Thanks Zoltan, I gave up and switched that GPU to FAH for now. It seems to be doing more FLOPS/hr when running FAHcore CUDA vs ACEMD, but that may be just the difference in scoring procedures.
I have received another WU for the GTX 750ti. The send glitch has been remediated. I'm the 5th user to receive this WU iteration.
https://www.gpugrid.net/workunit.php?wuid=27024256
It is a batch_1 (the 2nd batch), and the 0_1 refers to it being number 0 of 1, hence it's the only iteration of this WU that will exist.
[edit] It just dawned on me that these WUs might be an experiment in having the same host perform all the generations of the model simulation consecutively instead of on different hosts. Or am I way off?
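The "number 0 of 1" reading of a task name can be made concrete by splitting the name on hyphens. This is a sketch of the poster's interpretation, not a documented naming scheme: the field labels are guesses, `parse_wu_name` is a hypothetical helper, and the sample name is the one YamFan posted earlier in the thread.

```shell
#!/bin/sh
# Split a work unit name on "-" and label the pieces.
# ASSUMPTION: the fields mean run id, batch, step, total steps, random suffix.
parse_wu_name() {
  old_ifs=$IFS
  IFS=-
  set -- $1          # unquoted on purpose so the name splits on "-"
  IFS=$old_ifs
  echo "run=$1 batch=$2 step=$3 of=$4 suffix=$5"
}

parse_wu_name "e16s23_e1s182p0f136-ADRIA_D3RBandit_batch0-0-1-RND8763"
# run=e16s23_e1s182p0f136 batch=ADRIA_D3RBandit_batch0 step=0 of=1 suffix=RND8763
```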

Keith Myers · Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 731
Message 56576 - Posted: 16 Feb 2021, 17:31:28 UTC - in response to Message 56554.
Quote (Message 56540): I'm not seeing many resends, mostly _0 and _1 original tasks. No issues getting work or returning valid results.
Quote (Message 56554): There's no quorum required here. So _1 is a resend.
Uhh, Duh . . forgot where I'm at.

ServicEnginIC · Joined: 24 Sep 10 · Posts: 592 · Credit: 11,972,186,510 · RAC: 1,187
Message 56579 - Posted: 16 Feb 2021, 17:57:03 UTC - in response to Message 56571.
Quote (Message 56571): another user reported that his task was cancelled off of his host (but he hadn't started crunching yet). Unsure what would happen to a task that was axed while crunching had already started.
When a resent task has already started processing by the time the overdue host reports its result, the resent task is allowed to run to the end, and three scenarios are possible:
- 1) The deadline for the resent task is reached at the second host. Then, even if it is completed afterwards, it won't receive any credits, because there is already a previous valid result for it. The server will label it "Completed, too late to validate".
- 2) What I call a "credit paradox" happens when the resent task finishes at the second host in time for the full bonus or mid bonus. It will still receive the standard credit amount without any bonus, to match the credit amount the overdue task has already received.
- 3) When the resent task finishes past 48 hours but before its deadline, it also receives the standard (no bonus) credit amount.
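The three scenarios can be written as a small decision function. This is a sketch of the behaviour as described in the post, not of the actual server code; `resend_scenario` is a hypothetical name, the 48-hour bonus cutoff is taken from scenario 3, and the 120-hour deadline in the usage lines is an assumed value.

```shell
#!/bin/sh
# Classify a resent task whose overdue original has already been validated.
# $1 = hours the resent task took to come back, $2 = its deadline in hours.
resend_scenario() {
  if [ "$1" -gt "$2" ]; then
    echo "1: no credit (Completed, too late to validate)"
  elif [ "$1" -le 48 ]; then
    echo "2: standard credit, bonus forfeited"
  else
    echo "3: standard credit, no bonus"
  fi
}

# ASSUMPTION: a 120-hour (5-day) deadline, used here only for illustration.
resend_scenario 130 120
resend_scenario 20 120
resend_scenario 60 120
```

Scenarios 2 and 3 end with the same credit; they differ only in whether a bonus would otherwise have applied.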

Aurum · Joined: 12 Jul 17 · Posts: 404 · Credit: 17,408,899,587 · RAC: 0
Message 56581 - Posted: 16 Feb 2021, 18:37:28 UTC - in response to Message 56572.
Quote (Message 56572): better off with a cheaper card to begin with and save some money.
I guess you haven't been shopping for a GPU lately. Prices are really high. Perfect time to sell cards, not buy them.
Best thing I did was switch computers over to 240 V 20 A circuits. The most frequent problem I used to have was the A-phase getting out of balance with the B-phase and tripping one leg of the main breaker. With 240 V both phases are exactly balanced and PSUs run 6-8% more efficiently.
Now if I could just find a diagnostic tool to tell me when a PSU is on its last leg. E.g., this Inline PSU Tester for $590 is pricey, but it looks like the most comprehensive I've found:
https://www.passmark.com/products/inline-psu-tester/index.php
If anyone knows of other brands, please share.

Ian&Steve C. · Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 5,269
Message 56583 - Posted: 16 Feb 2021, 19:14:55 UTC - in response to Message 56581. Last modified: 16 Feb 2021, 19:16:11 UTC
Quote (Message 56572): better off with a cheaper card to begin with and save some money.
Quote (Message 56581): I guess you haven't been shopping for a GPU lately. Prices are really high. Perfect time to sell cards, not buy them. Best thing I did was switch computers over to 240 V 20 A circuits. The most frequent problem I used to have was the A-phase getting out of balance with the B-phase and tripping one leg of the main breaker. With 240 V both phases are exactly balanced and PSUs run 6-8% more efficiently. Now if I could just find a diagnostic tool to tell me when a PSU is on its last leg. E.g., this Inline PSU Tester for $590 is pricey, but it looks like the most comprehensive I've found: https://www.passmark.com/products/inline-psu-tester/index.php If anyone knows of other brands, please share.
I'm aware of the situation. But if you sell a 2080ti and buy a 2070 instead, you're still left with more money, no? Even at higher prices, everything just shifts up because of the market. You're restricting the 2080ti so much that it performs similarly to a 2070S, so why have the 2080ti? That was my only point.
I agree about 240V, and I normally run my systems in a remote location on 240V, but due to renovations I have one system temporarily at my house. But if you're on 240V, why restrict it so much?
I use the voltage telemetry (via IPMI) to identify when a PSU might be failing.

bozz4science · Joined: 22 May 20 · Posts: 110 · Credit: 115,525,136 · RAC: 0
Message 56587 - Posted: 16 Feb 2021, 23:07:56 UTC. Last modified: 16 Feb 2021, 23:10:51 UTC
Are there prolonged CPU-heavy periods where GPU util drops nearly to zero? If not, I probably have an issue with my card. Just ~2 hrs into a new WU, it stalled. BOINC manager reports steadily increasing processor time since the last checkpoint, but GPU util has been at 0% for nearly 30 min. Is that normal?
Weird, I just suspended/unsuspended, it jumped back to the latest checkpoint, and immediately the GPU util spiked back to normal levels. I'll see if the same issue comes up again between now and the next checkpoint.

Ian&Steve C. · Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 5,269
Message 56589 - Posted: 17 Feb 2021, 0:17:31 UTC - in response to Message 56587.
Quote (Message 56587): Are there prolonged CPU-heavy periods where GPU util drops nearly to zero? If not, I probably have an issue with my card. Just ~2 hrs into a new WU, it stalled. BOINC manager reports steadily increasing processor time since the last checkpoint, but GPU util has been at 0% for nearly 30 min. Is that normal? Weird, I just suspended/unsuspended, it jumped back to the latest checkpoint, and immediately the GPU util spiked back to normal levels. I'll see if the same issue comes up again between now and the next checkpoint.
Not normal in my experience. Mine stay pegged at 98% GPU utilization for the entire run.
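A stall like the one described, with utilization stuck at 0% while CPU time keeps climbing, can be flagged from a series of utilization samples. The sketch below only does the checking; `gpu_stalled` is a hypothetical helper, and the window size is an arbitrary choice. On an NVIDIA system the samples could come from `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits`, but how they are collected is left out here.

```shell
#!/bin/sh
# Succeed (exit status 0) when the last $1 samples are all 0% utilization.
# Usage: gpu_stalled WINDOW SAMPLE [SAMPLE...]
gpu_stalled() {
  window=$1
  shift
  # Not enough samples yet to judge.
  [ "$#" -ge "$window" ] || return 1
  # Drop older samples until only the last $window remain.
  while [ "$#" -gt "$window" ]; do
    shift
  done
  for u in "$@"; do
    [ "$u" -eq 0 ] || return 1
  done
  return 0
}

# Example with made-up samples, one per interval: healthy, then stuck at zero.
if gpu_stalled 3 98 97 0 0 0; then
  echo "GPU looks stalled; a suspend/resume may bring it back"
fi
```

With one sample every 10 minutes, a window of 3 would match the "0% for nearly 30 min" situation in the post.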