What is happening and what will happen at GPUGRID, update for 2021

Keith Myers
Message 57300 - Posted: 14 Sep 2021, 18:22:03 UTC

Great news! Now to just have some tasks ready to send.
Philip C Swift [Gridcoin]

Message 57304 - Posted: 17 Sep 2021, 9:57:51 UTC - in response to Message 57240.  

Good news! :-)
stiwi

Message 57305 - Posted: 17 Sep 2021, 15:57:22 UTC
Last modified: 17 Sep 2021, 16:00:31 UTC

Got 1 WU.

BOINC shows a runtime of 6 days on my 2080 Ti with a 60% power target. Hopefully that's not the real time.

Edit: 1% after 10 minutes, so probably around 16 hours.
Keith Myers
Message 57308 - Posted: 17 Sep 2021, 18:17:05 UTC

Picked up a task apiece on two hosts. Some 800 tasks in progress now.

Hope this means the project is getting back to releasing steady work.
Ian&Steve C.

Message 57360 - Posted: 23 Sep 2021, 15:20:57 UTC
Last modified: 23 Sep 2021, 15:23:55 UTC

It seems that even though the cuda1121 app is available and works fine on Ampere cards, there's nothing preventing the cuda101 app from being sent to an Ampere host. Those tasks will always fail.

example: https://gpugrid.net/result.php?resultid=32643471

The project-side scheduler needs to be adjusted so that the cuda101 app is not sent to Ampere hosts. This can be achieved by checking the compute capability reported by the host: in addition to the CUDA version checks, the cuda101 app should be limited to hosts with compute capability below 8.0, while hosts with 8.0 or greater should only get the cuda1121 app.

Or simply remove the cuda101 app and require all users to update their video drivers to use the cuda1121 app.
Keith Myers
Message 57363 - Posted: 24 Sep 2021, 0:00:36 UTC - in response to Message 57360.  
Last modified: 24 Sep 2021, 0:00:55 UTC

This is correct. I had exactly the same failure when the CUDA101 app was sent to my Ampere RTX 3080.

The failure was an inability to compile the CUDA kernel, because it was expecting a different architecture.

https://www.gpugrid.net/result.php?resultid=32642922
Toni
Volunteer moderator · Project administrator · Project developer · Project tester · Project scientist
Message 57365 - Posted: 24 Sep 2021, 13:57:01 UTC - in response to Message 57363.  

Technical question: does anybody know whether the compute capability (CC) is available in the scheduler?
Richard Haselgrove

Message 57366 - Posted: 24 Sep 2021, 14:19:16 UTC - in response to Message 57365.  
Last modified: 24 Sep 2021, 14:30:08 UTC

Technical question: does anybody know whether the compute capability (CC) is available in the scheduler?

I'm sure it is. I'll start digging out some references, if you want.

Sorry, you caught me in the middle of a driver update.

Try https://boinc.berkeley.edu/trac/wiki/AppPlanSpec#NVIDIAGPUapps:

<min_nvidia_compcap>MMmm</min_nvidia_compcap>
minimum compute capability
<max_nvidia_compcap>MMmm</max_nvidia_compcap>
maximum compute capability
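If the project defines its plan classes in XML (plan_class_spec.xml) rather than in C++, something along these lines might do it. This is only a rough sketch: the two compcap elements are the ones documented above, but the surrounding element names and the version numbers are from memory and should be checked against the wiki page before use.

<plan_classes>
    <plan_class>
        <name>cuda101</name>
        <gpu_type>nvidia</gpu_type>
        <cuda/>
        <!-- placeholder floor: a CUDA 10.1 capable driver -->
        <min_cuda_version>10010</min_cuda_version>
        <!-- intended to exclude compute capability 8.0 (Ampere) and above;
             verify whether the bound is inclusive or exclusive -->
        <max_nvidia_compcap>800</max_nvidia_compcap>
    </plan_class>
    <plan_class>
        <name>cuda1121</name>
        <gpu_type>nvidia</gpu_type>
        <cuda/>
        <!-- placeholder floor: a CUDA 11.2 capable driver; no compcap cap needed -->
        <min_cuda_version>11020</min_cuda_version>
    </plan_class>
</plan_classes>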
Ian&Steve C.

Message 57367 - Posted: 24 Sep 2021, 14:28:27 UTC - in response to Message 57366.  

To expand on what Richard wrote, I'm sure it's available. Einstein@Home uses this metric in its scheduler to restrict some of its apps to certain generations of Nvidia GPUs. So it's definitely information that's provided from the host to the project via BOINC.
Richard Haselgrove

Send message
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57368 - Posted: 24 Sep 2021, 14:33:31 UTC - in response to Message 57367.  
Last modified: 24 Sep 2021, 14:34:00 UTC

So it’s definitely information that’s provided from the host to the project via BOINC.

From the most recent sched_request file sent from this computer to your server:

<coproc_cuda>
   ...
   <major>7</major>
   <minor>5</minor>
Toni
Volunteer moderator · Project administrator · Project developer · Project tester · Project scientist
Message 57369 - Posted: 24 Sep 2021, 14:44:30 UTC - in response to Message 57368.  

Uhm... yes, but I was wondering how to retrieve it in the C++ code.
Richard Haselgrove

Send message
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57370 - Posted: 24 Sep 2021, 14:56:53 UTC - in response to Message 57369.  

Uhm... yes, but I was wondering how to retrieve it in the C++ code.

Same principle. Start with Specifying plan classes in C++, third example.

    ...
    if (!strcmp(plan_class, "cuda23")) {
        if (!cuda_check(c, hu,
            100,        // minimum compute capability (1.0)
            200,        // max compute capability (2.0)
            2030,       // min CUDA version (2.3)
            19500,      // min display driver version (195.00)
            384*MEGA,   // min video RAM
            1.,         // # of GPUs used (may be fractional, or an integer > 1)
            .01,        // fraction of FLOPS done by the CPU
            .21            // estimated GPU efficiency (actual/peak FLOPS)
        )) {
            return false;
        }
    }
ServicEnginIC
Message 57377 - Posted: 25 Sep 2021, 21:52:50 UTC
Last modified: 25 Sep 2021, 21:53:52 UTC

Uhm... yes, but I was wondering how to retrieve it in the C++ code.

I guess there might be an easier workaround, with no need to touch the current code.
It would consist of splitting the ACEMD3 app on the Project Preferences page into ACEMD3 (cuda 101) and ACEMD3 (cuda 1121).
This way, Ampere users would be able to untick the ACEMD3 (cuda 101) app, manually preventing themselves from receiving tasks that are sure to fail.
Keith Myers
Message 57378 - Posted: 26 Sep 2021, 2:49:42 UTC - in response to Message 57377.  

That won't work for multi-GPU users who have both Turing and Ampere cards installed in the same host.

I have a 2080 and 3080 together in a host.
ServicEnginIC
Message 57379 - Posted: 26 Sep 2021, 7:58:44 UTC - in response to Message 57378.  

That won't work for multi-GPU users who have both Turing and Ampere cards installed in the same host.

I have a 2080 and 3080 together in a host.

It should work as a manual selection in Project Preferences to receive ACEMD3 (cuda 1121) tasks only.
Your RTX 3080 (Ampere - device 0) can't process ACEMD3 (cuda 101), as seen in your failed task e1s627_I757-ADRIA_AdB_KIXCMYB_HIP-1-2-RND0972_0, but it can process ACEMD3 (cuda 1121), as seen in your successful task e1s385_I477-ADRIA_AdB_KIXCMYB_HIP-0-2-RND6281_2.
And your RTX 2080 (Turing - device 1) on the same host can also process ACEMD3 (cuda 1121) tasks, as seen in your successful task e1s667_I831-ADRIA_AdB_KIXCMYB_HIP-1-2-RND8282_1.
Therefore, restricting preferences in a particular venue for your host #462662 to receiving only ACEMD3 (cuda 1121) tasks would work for both cards.

The exception is the general limitation of the ACEMD3 app on all kinds of mixed multi-GPU systems when restarting tasks on a different device.
It was described by Toni in his Message #52865, dated 17 Oct 2019.
Paragraph: Can I use it on multi-GPU systems?

Can I use it on multi-GPU systems?

In general yes, with one caveat: if you have DIFFERENT types of NVIDIA GPUs in the same PC, suspending a job in one and restarting it in the other will NOT be possible (errors on restart). Consider restricting the client to one GPU type only ("exclude_gpu", see here).
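For reference, that exclusion lives in cc_config.xml on the client. A minimal sketch might look like the following; the URL and the device number are only an example (device 1 would be the RTX 2080 on the host above), so adjust them to whichever card you want to keep away from GPUGRID:

<cc_config>
    <options>
        <exclude_gpu>
            <!-- project whose tasks should not run on this GPU -->
            <url>https://www.gpugrid.net/</url>
            <!-- device number as reported by the BOINC client at startup -->
            <device_num>1</device_num>
        </exclude_gpu>
    </options>
</cc_config>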
Richard Haselgrove

Send message
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57380 - Posted: 26 Sep 2021, 9:59:18 UTC

There are two different aspects to this debate:

1) What will a project server send to a mixed-GPU client?
2) Which card will the client choose to allocate a task to?

The project server will allocate work solely on the basis of Keith's 3080. BOINC has been developed, effectively, to hide his 2080 from the server.

Keith has some degree of control over the behaviour of his client. He can exclude certain applications from particular cards (using cc_config.xml), but he can't exclude particular versions of the same application - the control structure is too coarse.

He can also control certain behaviours of applications at the plan_class level (using app_config.xml), but that control structure is too fine - it doesn't contain any device-level controls.

Other projects have been able to develop general-purpose GPU applications which are at least compatible with mixed-device hosts - tasks assigned to the 'wrong' or 'changed' device at least run, even if efficiency is downgraded. If this project is unable to follow that design criterion (and I don't know why it is unable at this moment), then I think the only available solution at this time is to divide the versions into separate applications - analogous to the old short/long tasks - so that the limited range of available client options can be leveraged.
Ian&Steve C.

Send message
Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Level
Trp
Scientific publications
wat
Message 57381 - Posted: 26 Sep 2021, 12:15:28 UTC - in response to Message 57380.  

I haven’t seen enough compelling evidence to justify keeping the cuda101 app. The cuda1121 app works on all hosts and is basically the same speed.

Removing the cuda101 app would solve all problems.
Keith Myers
Message 57383 - Posted: 26 Sep 2021, 16:51:41 UTC - in response to Message 57381.  

I agree. Simply remove the CUDA101 app and stop sending tasks to any host that hasn't updated its drivers to the CUDA 11.2 level.
Richard Haselgrove

Send message
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57391 - Posted: 29 Sep 2021, 13:22:37 UTC

What's with the new ACEMD beta version 9.17, introduced today? What are we testing?

I got a couple of really short tasks on Linux host 508381. That risks really messing up the DCF (duration correction factor).
SolidAir79

Message 57392 - Posted: 29 Sep 2021, 14:50:26 UTC

I received 15 of the test WUs. No problems on Ampere; they all crunched without issue. I just want more :)