What is happening and what will happen at GPUGRID, update for 2021

bozz4science

Message 58040 - Posted: 8 Dec 2021, 9:02:30 UTC

Can't someone please just block this guy/bot? He's posting random stuff all over and cluttering message boards across various projects. Thanks.
marsinph

Message 58043 - Posted: 10 Dec 2021, 8:54:57 UTC

Admin,
please block this "administrator" (userid 573008).
He (or it) publishes nonsense, and not only here!

Greger

Message 58059 - Posted: 11 Dec 2021, 0:36:50 UTC - in response to Message 58043.  

marsinph, that is a hassle for the admin to do, but you can block him on your end.
Erich56

Message 58064 - Posted: 11 Dec 2021, 7:32:09 UTC - in response to Message 58059.  

"marsinph that is a hassle to do for admin"

Why so? Should be easy.
Aurum
Message 58066 - Posted: 11 Dec 2021, 10:08:20 UTC - in response to Message 57963.  
Last modified: 11 Dec 2021, 10:08:36 UTC

"Incidentally how does gridcoin works and where is the list of project gray listed?"

Get a GridCoin wallet and post your address. I'd be glad to sidestake you my GRC earnings from all projects, not just GPUgrid.

I swear there used to be a way to see the whitelist the scraper was using in real time. Maybe they dropped it. They still have obsolete stuff like Team_Whitelist displayed on the Help/Debug window/Scraper tab.
Keith Myers
Message 58072 - Posted: 11 Dec 2021, 16:47:47 UTC - in response to Message 58066.  

The website for the status of project whitelisting, https://gridcoin.ddns.net/pages/project-list.php, isn't being maintained anymore. G-UK stated this to me recently when I added an issue to the code repo for the page. So you can't depend on it for the true status.

But the wallet client Help/Debug window/Scraper tab does show the status of current whitelisted projects. Unlisted or greylisted projects fall off the scraper convergence tally.

Also, the main page of the wallet has a gear icon that takes you to the Researcher configuration page with the Summary and Projects tabs. The Projects tab lists all the projects and whether each is listed or unlisted; if a magnitude is shown, you know the project is still listed. This updates at every Superblock.
Aurum
Message 58078 - Posted: 12 Dec 2021, 14:37:34 UTC

I had a pair of WUs complete successfully on a computer with a pair of 2080 Ti's. My other GG computer with a 3080 + 3080 Ti has failed 6 times. E.g.,
<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
14:02:25 (654733): wrapper (7.7.26016): starting
14:02:44 (654733): wrapper (7.7.26016): starting
14:02:44 (654733): wrapper: running bin/acemd3 (--boinc --device 1)
16:50:46 (662618): wrapper (7.7.26016): starting
16:51:07 (662618): wrapper (7.7.26016): starting
16:51:07 (662618): wrapper: running bin/acemd3 (--boinc --device 0)
ERROR: /home/user/conda/conda-bld/acemd3_1632842613607/work/src/mdsim/context.cpp line 318: Cannot use a restart file on a different device!
16:51:11 (662618): bin/acemd3 exited; CPU time 4.131044
16:51:11 (662618): app exit status: 0x9e
16:51:11 (662618): called boinc_finish(195)

</stderr_txt>
]]>

The first two failures were probably caused by me trying to use the BOINC option <ignore_nvidia_dev>1</ignore_nvidia_dev> in my cc_config and restarting BOINC after suspending both tasks. Common sense says the proximal GPU would be GPU0, but apparently the distal GPU is device 0. So I deleted the option to allow acemd3 WUs to run on both GPUs.
The next pair of WUs also failed with "Cannot use a restart file on a different device!" While they were running, a batch of OPNG WUs arrived and BOINC suspended the acemd3 WUs and let the OPNG WUs run. Maybe when they restarted they tried to swap GPUs?
So, can acemd3 play nice with others or must I stop accepting OPNG if I want to run acemd3 WUs?
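
One way to check the "swapped GPUs" guess is to read the device numbers out of the wrapper lines in the stderr text. The following is only a rough sketch: the function names are made up for illustration, and the pattern assumes the wrapper keeps printing lines shaped like the ones in the log above.

```python
import re

# Matches wrapper launch lines such as:
#   16:51:07 (662618): wrapper: running bin/acemd3 (--boinc --device 0)
# The pattern is an assumption based only on the log lines quoted in this thread.
DEVICE_RE = re.compile(r"wrapper: running \S+ \(.*--device (\d+)\)")


def devices_used(stderr_text):
    """Return the device numbers, in launch order, found in the stderr text."""
    return [int(m.group(1)) for m in DEVICE_RE.finditer(stderr_text)]


def switched_device(stderr_text):
    """True if the task was launched on more than one device number."""
    return len(set(devices_used(stderr_text))) > 1


# Two launch lines copied from the failed task above.
SAMPLE = (
    "14:02:44 (654733): wrapper: running bin/acemd3 (--boinc --device 1)\n"
    "16:51:07 (662618): wrapper: running bin/acemd3 (--boinc --device 0)\n"
)

if __name__ == "__main__":
    print(devices_used(SAMPLE))
    print(switched_device(SAMPLE))
```

For the log above, the task starts on device 1 and is relaunched on device 0 immediately before the error, which is consistent with the restart landing on the other card.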
Keith Myers
Message 58084 - Posted: 12 Dec 2021, 16:55:41 UTC - in response to Message 58078.  

If you have different devices, you can't interrupt the acemd3 calculation or you will error it out on restart. The only solution is to change your "switch between applications every xx minutes" in the Manager and pass that to the client.

Set it to the runtime of the longest-running task on the host plus 10%. I have mine set at 2880 minutes, or two days. The tasks run to completion on the card, and then the card can move on to other GPU work.
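
As a rough sketch of that arithmetic: the helper names below are hypothetical, and the file name global_prefs_override.xml and the tag cpu_sched_period_minutes are assumptions about how the BOINC client stores the "switch between" setting, so check them against the BOINC documentation before relying on them.

```python
from xml.etree import ElementTree as ET


def switch_interval_minutes(longest_runtime_minutes, margin_percent=10):
    """Longest observed runtime plus a safety margin, rounded up to a whole minute."""
    scaled = longest_runtime_minutes * (100 + margin_percent)
    return -(-scaled // 100)  # integer ceiling division, avoids float rounding


def override_xml(minutes):
    """Build a preferences override fragment (tag name is an assumption)."""
    root = ET.Element("global_preferences")
    ET.SubElement(root, "cpu_sched_period_minutes").text = str(minutes)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Example: the longest task seen on this host took 24 hours (1440 minutes).
    minutes = switch_interval_minutes(1440)
    print(minutes)
    print(override_xml(minutes))
```

Setting the value in the Manager, as described above, should achieve the same thing without touching any file by hand.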
Ian&Steve C.

Message 58085 - Posted: 12 Dec 2021, 17:32:13 UTC - in response to Message 58084.  

To add on to what Keith said, it’s become apparent that the acemd3 app is generating some memory load which is specific to the hardware it’s running on. Look at different GPUs and you’ll see different amounts of memory used by each, and it seems to scale by total memory size (i.e., a 12GB GPU will show more memory used than a 4GB GPU). Or maybe it scales by memory configuration (bandwidth, speed) and not just total size. Or some combination.

This is why you can’t restart a task on a different device. The calculation which was set up for some specific hardware cannot continue if you change the device hardware midway through. They seem to have some logic built in to catch when hardware has changed. If you have identical GPUs, most times it will restart on a different device OK, but I’ve seen times when even restarting on an identical GPU still triggers this and it fails right away.
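
To illustrate the general idea of a restart guard, here is a purely hypothetical sketch. It is not acemd3's actual logic, which is not public in this thread; the classes, the fields compared, and the function name are all invented for the example. Since even identical cards sometimes trip the real check, the real test evidently looks at more than a model name and memory size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Hypothetical hardware fingerprint stored alongside a checkpoint."""
    name: str
    total_memory_mb: int


@dataclass(frozen=True)
class Checkpoint:
    """Hypothetical restart file: simulation step plus the device it was made on."""
    step: int
    device: Device


class DifferentDeviceError(RuntimeError):
    """Raised when a checkpoint is resumed on hardware it was not created on."""


def resume(checkpoint, current):
    """Return the step to continue from, or refuse if the hardware differs."""
    if checkpoint.device != current:
        raise DifferentDeviceError(
            "Cannot use a restart file on a different device!"
        )
    return checkpoint.step


if __name__ == "__main__":
    gpu_a = Device("RTX 3080", 10240)
    gpu_b = Device("RTX 3080 Ti", 12288)
    saved = Checkpoint(step=125000, device=gpu_a)
    print(resume(saved, gpu_a))  # same device: continues from the saved step
    try:
        resume(saved, gpu_b)     # different device: refused
    except DifferentDeviceError as err:
        print(err)
```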

Best option is to never interrupt GPUGRID tasks.
Aurum
Message 58086 - Posted: 12 Dec 2021, 18:44:10 UTC

The root of this problem is: why does GG switch to a different device when its name did not change?

If another GPU WU switches to Running High Priority, it will override that "switch every 2880 minutes" approach.

When we get a continuous supply of acemd3 WUs my urge to timeslice them will vanish :-)
Keith Myers
Message 58087 - Posted: 12 Dec 2021, 18:56:25 UTC - in response to Message 58086.  

Has nothing to do with GPUGrid or the acemd3 application. The issue is with BOINC.

BOINC does not care that you were running on any particular device. It just knows that a GPU resource became available when a task finished on that GPU, and it assigns the interrupted or checkpointed acemd3 task to the card that just became available.

If that is a different device than the one the task started on, then the task errors out in the manner that Ian mentioned.
Ian&Steve C.

Message 58088 - Posted: 12 Dec 2021, 19:24:06 UTC - in response to Message 58086.  

Make sure your task switching options are set to longer than the estimated task run time.

As I recall, you also run WCG OPNG tasks. If you run those, or some other project, at a higher priority (resource share) and your GPUGRID task has been running longer than the task switch time, BOINC will switch over to the new high-priority task, stopping GPUGRID. If the GPUGRID task then restarts on a different device, you get the error. But if your task switch time is longer than the run time, it shouldn’t ever switch away to high-priority work until the task is complete.
Aurum
Message 58095 - Posted: 13 Dec 2021, 14:31:19 UTC - in response to Message 58087.  
Last modified: 13 Dec 2021, 14:32:56 UTC

"Has nothing to do with GPUGrid or the acemd3 application. The issue is with BOINC.

BOINC does not care that you were running on any particular device. It just knows that a gpu resource just became available when a task finishes on a gpu and assigns the interrupted or checkpointed acemd3 task to the card that just became available.

If that is a different device that what the task started on, then the task errors out in the manner that Ian mentioned."

Sounds like a defect that must be remedied. If it's not already a GitHub issue, someone should file one. I've worn out my welcome with 4 issues that will never be fixed.

I wonder whether it would be possible for GG to assign a GPU device number to the WU and stick with it after time-slicing?

Set switching to 1440 minutes and got 4 acemd3 WUs overnight that are running nicely. The OPNG WUs are sitting there patiently like a bug in a rug :-)
Keith Myers
Message 58097 - Posted: 13 Dec 2021, 19:04:39 UTC - in response to Message 58095.  

Again, the acemd3 application would have to be completely rewritten to hook into BOINC in a manner that BOINC does not currently have.

So rewrite BOINC first to get the ability to assign individual tasks to be locked to specific hardware and then rewrite GG to use that BOINC feature.

Or, if the task is detected to be NOT running on the original hardware, start the task from zero again so that it does not error. That is not conducive to returning work within the original 5-day deadline for slower cards.
Ian&Steve C.

Message 58100 - Posted: 13 Dec 2021, 19:41:41 UTC

Keep task switching set to the proper value and operate with an understanding of the idiosyncrasies of different projects, and this won't be an issue. For GPUGRID, that means not turning off your PC or doing anything else that would cause the task to restart while processing GPUGRID work.

I take it a step further and have designed my GPUGRID systems with identical GPUs: not just the same model, but where possible the exact same SKU from the same manufacturer (the 7x 2080Ti system has all EVGA 2080Ti XC Ultra cards; the 7x 2080 system has all ASUS 2080Ti Turbo custom watercooled). Making the system homogeneous in this way greatly reduces the chance that a restart detects a new device as "different" in the event that a restart is unavoidable (like a power outage or hardware issue), since they are all identical. You can take this mindset even further with battery backup to cover short power outages. My smaller 1-GPU system is on a 1500W battery backup that can keep the system up for a few minutes during short power blips, which is all I usually experience in my area. That is just enough that the mains voltage drop doesn't induce a reboot; the battery just kicks on for a few seconds and the system stays up.
Gogian

Message 58235 - Posted: 3 Jan 2022, 22:03:28 UTC

I haven't received any new projects since roughly November, and I just recently replaced the certificate. My other 3 projects are working just fine; only this one is affected. I have tried resetting, removing and re-adding GPUGRID, but no work gets downloaded. Any ideas?
Ian&Steve C.

Message 58236 - Posted: 3 Jan 2022, 22:21:28 UTC - in response to Message 58235.  

"Any ideas?"

No work is available to send you. GPUGRID always operates with intermittent work availability, especially recently, when they really only release a small batch at a time. You're trying to get work at a time when work isn't available.


Gogian

Message 58252 - Posted: 6 Jan 2022, 15:24:05 UTC - in response to Message 58236.  

ok, Thanks!