What is happening and what will happen at GPUGRID, update for 2021

bozz4science Joined: 22 May 20 Posts: 110 Credit: 115,525,136 RAC: 0
Message 58040 - Posted: 8 Dec 2021, 9:02:30 UTC
Can't someone pls just block this guy/bot? He's posting random stuff all over and cluttering message boards across various projects. Thx.

marsinph Joined: 11 Feb 18 Posts: 41 Credit: 579,891,424 RAC: 0
Message 58043 - Posted: 10 Dec 2021, 8:54:57 UTC
Admin, please block this "administrator" (userid 573008). He (or it) publishes nonsense, and not only here!!!

Greger Joined: 6 Jan 15 Posts: 76 Credit: 25,499,534,331 RAC: 0
Message 58059 - Posted: 11 Dec 2021, 0:36:50 UTC - in response to Message 58043.
marsinph, that is a hassle for the admin to do, but you can block him on your end.

Erich56 Joined: 1 Jan 15 Posts: 1168 Credit: 12,317,898,501 RAC: 75,187
Message 58064 - Posted: 11 Dec 2021, 7:32:09 UTC - in response to Message 58059.
Quote: "marsinph, that is a hassle for the admin to do"
Why so? Should be easy.

Aurum Joined: 12 Jul 17 Posts: 404 Credit: 17,408,899,587 RAC: 0
Message 58066 - Posted: 11 Dec 2021, 10:08:20 UTC - in response to Message 57963. Last modified: 11 Dec 2021, 10:08:36 UTC
Quote: "Incidentally, how does Gridcoin work, and where is the list of greylisted projects?"
Get a GridCoin wallet and post your address. I'd be glad to sidestake you my GRC earnings from all projects, not just GPUgrid.
I swear there used to be a way to see the whitelist the scraper was using in real time. Maybe they dropped it. They still have obsolete stuff like Team_Whitelist displayed on the Help/Debug window/Scraper tab.

Keith Myers Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 34,713
Message 58072 - Posted: 11 Dec 2021, 16:47:47 UTC - in response to Message 58066.
The website for the status of project whitelisting, https://gridcoin.ddns.net/pages/project-list.php, isn't being maintained anymore. G-UK stated this to me recently when I added an issue to the code repo for the page. So you can't depend on it for the true status.
But the wallet client's Help/Debug window/Scraper tab does show the status of currently whitelisted projects. Unlisted or greylisted projects fall off the scraper convergence tally.
Also, the Main page of the wallet has the gear icon, which takes you to the Researcher configuration page with the Summary and Projects tabs. The Projects tab lists all the projects and their status as listed or unlisted, and if a magnitude is shown, you know that the project is still listed. This updates at every Superblock.

Aurum Joined: 12 Jul 17 Posts: 404 Credit: 17,408,899,587 RAC: 0
Message 58078 - Posted: 12 Dec 2021, 14:37:34 UTC
I had a pair of WUs complete successfully on a computer with a pair of 2080 Ti's. My other GG computer, with a 3080 + 3080 Ti, has failed 6 times. E.g.,
<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
14:02:25 (654733): wrapper (7.7.26016): starting
14:02:44 (654733): wrapper (7.7.26016): starting
14:02:44 (654733): wrapper: running bin/acemd3 (--boinc --device 1)
16:50:46 (662618): wrapper (7.7.26016): starting
16:51:07 (662618): wrapper (7.7.26016): starting
16:51:07 (662618): wrapper: running bin/acemd3 (--boinc --device 0)
ERROR: /home/user/conda/conda-bld/acemd3_1632842613607/work/src/mdsim/context.cpp line 318: Cannot use a restart file on a different device!
16:51:11 (662618): bin/acemd3 exited; CPU time 4.131044
16:51:11 (662618): app exit status: 0x9e
16:51:11 (662618): called boinc_finish(195)
</stderr_txt>
]]>
The first two that failed were probably caused by me trying to use the BOINC option <ignore_nvidia_dev>1</ignore_nvidia_dev> in my cc_config and restarting BOINC after suspending both tasks. Common sense says the proximal GPU would be GPU0, but apparently the distal GPU is device 0. So I deleted the option to allow acemd3 WUs to run on both GPUs.
The next pair of WUs also failed with "Cannot use a restart file on a different device!" While they were running, a batch of OPNG WUs arrived, and BOINC suspended the acemd3 WUs and let the OPNG WUs run. Maybe when they restarted they tried to swap GPUs?
So, can acemd3 play nice with others, or must I stop accepting OPNG if I want to run acemd3 WUs?
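A minimal sketch for spotting this failure pattern in a task's stderr, assuming the wrapper lines always look like the ones in the log above. The function names and the SAMPLE text are illustrative only and are not part of BOINC or acemd3.

```python
import re

# Matches wrapper lines such as:
#   wrapper: running bin/acemd3 (--boinc --device 1)
# The pattern is an assumption based on the log excerpt quoted in this thread.
DEVICE_RE = re.compile(r"wrapper: running bin/acemd3 \(--boinc --device (\d+)\)")
RESTART_ERROR = "Cannot use a restart file on a different device!"

# Excerpt of the stderr posted above, trimmed to the relevant lines.
SAMPLE = """\
14:02:44 (654733): wrapper: running bin/acemd3 (--boinc --device 1)
16:51:07 (662618): wrapper: running bin/acemd3 (--boinc --device 0)
ERROR: /home/user/conda/conda-bld/acemd3_1632842613607/work/src/mdsim/context.cpp line 318: Cannot use a restart file on a different device!
16:51:11 (662618): called boinc_finish(195)
"""


def devices_used(stderr_text):
    """Return the device numbers the task was started on, in log order."""
    return [int(n) for n in DEVICE_RE.findall(stderr_text)]


def restarted_on_other_device(stderr_text):
    """True if the restart error appears and more than one device was used."""
    devices = devices_used(stderr_text)
    return RESTART_ERROR in stderr_text and len(set(devices)) > 1


if __name__ == "__main__":
    print(devices_used(SAMPLE))
    print(restarted_on_other_device(SAMPLE))
```

On the excerpt above, the task starts on device 1 and is later restarted on device 0, which is the point at which the error is raised.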

Keith Myers Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 34,713
Message 58084 - Posted: 12 Dec 2021, 16:55:41 UTC - in response to Message 58078.
If you have different devices, you can't interrupt the acemd3 calculation, or it will error out on restart. The only solution is to change your "switch between applications every xx minutes" setting in the Manager and pass that to the client. Set it to the longest run time any task has had on the host, plus 10%. I have mine set at 2880 minutes, or two days. The tasks run to completion on the card, and then the card can move on to other GPU work.
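The "longest run time plus 10%" rule can be sketched as below. This is only an illustration of the arithmetic: the function name is made up, the run times in the example are invented values, and the result still has to be entered by hand in the Manager's "switch between applications" setting.

```python
def switch_interval_minutes(runtimes_seconds, margin_percent=10):
    """Return the longest run time plus a margin, rounded up to whole minutes.

    runtimes_seconds: completed task run times on this host, in whole seconds.
    margin_percent:   safety margin to add on top (10 follows the advice above).
    """
    longest = max(runtimes_seconds)
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-longest * (100 + margin_percent) // (60 * 100))


if __name__ == "__main__":
    # Hypothetical run times of one hour and two hours.
    print(switch_interval_minutes([3600, 7200]))  # prints 132
```

Rounding up rather than down keeps the interval on the safe side, so a task that runs slightly longer than any seen so far is still unlikely to be switched out.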

Ian&Steve C. Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 8,067
Message 58085 - Posted: 12 Dec 2021, 17:32:13 UTC - in response to Message 58084.
To add on to what Keith said, it's become apparent that the acemd3 app is generating some memory load which is specific to the hardware it's running on. Look at different GPUs and you'll see different amounts of memory used by each, and it seems to scale with total memory size (i.e., a 12GB GPU will show more memory used than a 4GB GPU). Or maybe it scales with memory configuration (bandwidth, speed) and not just total size. Or some combination.
This is why you can't restart a task on a different device. A calculation that was set up for some specific hardware cannot continue if you change the device hardware midway through. They seem to have some logic built in to catch when the hardware has changed.
If you have identical GPUs, most times it will restart on a different device OK, but I've seen times when even restarting on an identical GPU still triggers this and it fails right away.
Best option is to never interrupt GPUGRID tasks.

Aurum Joined: 12 Jul 17 Posts: 404 Credit: 17,408,899,587 RAC: 0
Message 58086 - Posted: 12 Dec 2021, 18:44:10 UTC
The root question is: why does GG switch to a different device when its name did not change???
If another GPU WU switches to Running High Priority, it will override that "switch every 2880 minutes" approach.
When we get a continuous supply of acemd3 WUs, my urge to timeslice them will vanish :-)

Keith Myers Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 34,713
Message 58087 - Posted: 12 Dec 2021, 18:56:25 UTC - in response to Message 58086.
It has nothing to do with GPUGrid or the acemd3 application. The issue is with BOINC. BOINC does not care that you were running on any particular device. It just knows that a GPU resource became available when a task finished on a GPU, and it assigns the interrupted or checkpointed acemd3 task to the card that just became available.
If that is a different device than the one the task started on, then the task errors out in the manner that Ian mentioned.

Ian&Steve C. Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 8,067
Message 58088 - Posted: 12 Dec 2021, 19:24:06 UTC - in response to Message 58086.
Make sure your task switching option is set to longer than the estimated task run time. As I recall, you also run WCG OPNG tasks. If you run those, or some other project, at a higher priority (resource share) and your GPUGRID task has been running longer than the task switch time, BOINC will switch over to the new high-priority task and stop GPUGRID. If the task then restarts on a different device, you get the error. But if your task switch time is longer than the run time, it shouldn't ever switch away to high-priority work until the task is complete.

Aurum Joined: 12 Jul 17 Posts: 404 Credit: 17,408,899,587 RAC: 0
Message 58095 - Posted: 13 Dec 2021, 14:31:19 UTC - in response to Message 58087. Last modified: 13 Dec 2021, 14:32:56 UTC
Quote: "Has nothing to do with GPUGrid or the acemd3 application. The issue is with BOINC. BOINC does not care that you were running on any particular device. It just knows that a gpu resource just became available when a task finishes on a gpu and assigns the interrupted or checkpointed acemd3 task to the card that just became available. If that is a different device than what the task started on, then the task errors out in the manner that Ian mentioned."
Sounds like a defect that must be remedied. If it's not already a GitHub issue, someone should file one. I've worn out my welcome with 4 issues that will never be fixed.
I wonder if it's not possible for GG to assign a GPU device number to the WU and stick with the same one after time-slicing?
I set switching to 1440 minutes and got 4 acemd3 WUs overnight that are running nicely. The OPNG WUs are sitting there patiently like a bug in a rug :-)

Keith Myers Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 34,713
Message 58097 - Posted: 13 Dec 2021, 19:04:39 UTC - in response to Message 58095.
Again, the acemd3 application would have to be completely rewritten to hook into BOINC in a manner that BOINC does not currently support. So rewrite BOINC first to get the ability to lock individual tasks to specific hardware, and then rewrite GG to use that BOINC feature.
Or, if the task is detected to be NOT running on the original hardware, start the task from zero again so that it does not error. That is not conducive to returning work within the original 5-day deadline on slower cards.

Ian&Steve C. Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 8,067
Message 58100 - Posted: 13 Dec 2021, 19:41:41 UTC
keep task switching set to the proper value and operate with an understanding of the idiosyncrasies of different projects and this won't be an issue. for GPUGRID that means not turning off your PC or doing anything else that would cause the task to restart while processing GPUGRID work.
I take it a step further and have designed my GPUGRID systems with identical GPUs. not just the same model, but where possible the same exact SKU from the same manufacturer (the 7x 2080Ti system has all EVGA 2080Ti XC Ultra cards; the 7x 2080 system has all ASUS 2080Ti Turbo custom watercooled). Making the system homogeneous in this way greatly reduces the chance that a restart detects a new device as "different" in the event that a restart is unavoidable (like a power outage or hardware issue), since they are all identical.
You can take this mindset even further with battery backup to cover short power outages. My smaller 1-GPU system is on a 1500W battery backup that can keep the system up for a few minutes during short power blips, which is all I usually experience in my area. It's just enough that the mains voltage drop doesn't induce a reboot: the battery kicks on for a few seconds and the system stays up.

Gogian Joined: 1 May 20 Posts: 2 Credit: 141,805,632 RAC: 0
Message 58235 - Posted: 3 Jan 2022, 22:03:28 UTC
I haven't received any new tasks since roughly November, and I just recently replaced the certificate. My other 3 projects are working just fine; only this one is not. I have tried resetting, removing and re-adding GPUGRID, but no work gets downloaded.
Any ideas?

Ian&Steve C. Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 8,067
Message 58236 - Posted: 3 Jan 2022, 22:21:28 UTC - in response to Message 58235.
Quote: "Any ideas?"
no work is available to send you. GPUGRID always operates with intermittent work availability, especially recently, when they really only release a small batch at a time. you're trying to get work at a time when work isn't available.

Gogian Joined: 1 May 20 Posts: 2 Credit: 141,805,632 RAC: 0
Message 58252 - Posted: 6 Jan 2022, 15:24:05 UTC - in response to Message 58236.
ok, Thanks!