Message boards : News : What is happening and what will happen at GPUGRID, update for 2021
Author | Message |
---|---|
As you know, GPUGRID was the first BOINC project to run GPU applications; in fact, we helped create the infrastructure for that. That was many years ago, and many things have changed since then. In particular, we have recently not had a constant stream of workunits. I would like to explain the present and expected future of GPUGRID here. | |
ID: 57240 | Rating: 0 | rate: / Reply Quote | |
Thanks for the much anticipated update! Appreciate that you provide a roadmap for the future. Hopefully there aren't too many roadblocks ahead with the development of OpenMM. The future project direction sounds very exciting :) I'll take that as an opportunity to upgrade my host by the end of the year to contribute more intensively next year! | |
ID: 57241 | Rating: 0 | rate: / Reply Quote | |
Are there any plans to add support for AMD GPUs now that ACEMD3 supports OpenCL? https://software.acellera.com/docs/latest/acemd3/capabilities.html This would increase participation. | |
ID: 57242 | Rating: 0 | rate: / Reply Quote | |
Good news, everyone! | |
ID: 57244 | Rating: 0 | rate: / Reply Quote | |
Great News, Thanks!!!! | |
ID: 57245 | Rating: 0 | rate: / Reply Quote | |
Thanks for the news, I hope the work goes well. I am looking forward to making new calculations. | |
ID: 57246 | Rating: 0 | rate: / Reply Quote | |
I've got a 3090 and 2 1080ti's waiting for some work. Looking forward to the new updates. | |
ID: 57249 | Rating: 0 | rate: / Reply Quote | |
Thanks for the update, GDF, it's very much appreciated! | |
ID: 57250 | Rating: 0 | rate: / Reply Quote | |
On the practical side, we expect to have the ACEMD application fixed for RTX 30xx within a few weeks, as the developer of ACEMD is now also handling the deployment on GPUGRID, which makes everything simpler. One of my hosts, with two RTX 3070s inside, will be pleased :-) | |
ID: 57251 | Rating: 0 | rate: / Reply Quote | |
Are there any plans to add support for AMD GPUs now that ACEMD3 supports OpenCL? https://software.acellera.com/docs/latest/acemd3/capabilities.html This would increase participation. I would also like to know if AMD will finally be supported. I have a water-cooled Radeon RX 6800 XT and am ready to utilize its full capacity for cancer and COVID research, as well as other projects as they may come. AMD Ryzen 9 5950X, AMD Radeon RX 6800 XT, 32GB 3200MHz CAS-14 RAM, NVMe 4th-gen storage, custom water cooling | |
ID: 57252 | Rating: 0 | rate: / Reply Quote | |
Some initial new version of ACEMD has been deployed on Linux and it's working, but we are still testing. | |
ID: 57257 | Rating: 0 | rate: / Reply Quote | |
Some initial new version of ACEMD has been deployed on Linux and it's working, but we are still testing. What are the criteria for sending the cuda101 app vs the cuda1121 app? I see both apps exist, and new drivers on even older cards will support both. For example, if you have CUDA 11.2 drivers on a Turing card, you can run both the 11.2 app and the 10.1 app. So what criteria does the server use to determine which app to send my Turing cards? Of course, Ampere cards should only get the 11.2 app. Also, it looks like the Windows apps are missing for the new ACEMD; are you dropping Windows support? | |
ID: 57259 | Rating: 0 | rate: / Reply Quote | |
Support for AMD cards would be good for the project. | |
ID: 57260 | Rating: 0 | rate: / Reply Quote | |
Some initial new version of ACEMD has been deployed on Linux and it's working, but we are still testing. There seems to be a problem with the new 2.17 app: it's always trying to run on GPU0 even when BOINC assigns it to another GPU. I have had this happen on two separate hosts now. The host picked up a new task, BOINC assigned it to some other GPU (like device 6 or device 3), but the acemd process spun up on GPU0 anyway, even though it was already occupied by another BOINC process from another project. I think there's something off in how the BOINC device assignment is communicated to the app. This results in multiple processes running on a single GPU, and no process running on the device that BOINC assigned the GPUGRID task to. Restarting the BOINC client brings it back to "OK", since it prioritizes the GPUGRID task to GPU0 on startup (probably due to resource share), but I feel this will keep happening. It needs an update ASAP. | |
ID: 57261 | Rating: 0 | rate: / Reply Quote | |
Haven't snagged one of the new ones as yet (I was out all day), but I'll watch out for them, and try to find out where the device allocation is failing. | |
ID: 57262 | Rating: 0 | rate: / Reply Quote | |
there seems to be a problem with the new 2.17 app. it's always trying to run on GPU0 even when BOINC assigns it to another GPU. First of all: congratulations, both of your multi-GPU systems that weren't getting work from previous app versions seem to have the problem solved with this new one. Welcome back to the field! I'm experiencing the same behavior, and I can go even further: I caught six WUs of the new app version 2.17 at my triple 1650 GPU system. Then I aborted three of these WUs, and two of them were caught again by my twin 1650 GPU system. At the triple GPU system: while all three WUs seem to be progressing normally from the BOINC Manager point of view, looking at Psensor only GPU #0 (first PCIe slot) is working, and GPUs #1 and #2 are inactive. It's like GPU #0 is carrying the workload for all three WUs: the same 63% fraction done after 8.25 hours for all three. However, CPU usage is consistent with three WUs running concurrently on this system. At the twin GPU system: while both WUs seem to be progressing normally from the BOINC Manager point of view, looking at Psensor only GPU #0 (first PCIe slot) is working and GPU #1 is inactive. It's like GPU #0 is carrying the workload for both WUs: the same 89% fraction done after 8 hours for both. Again, CPU usage is consistent with two WUs running concurrently on this system. | |
ID: 57264 | Rating: 0 | rate: / Reply Quote | |
there seems to be a problem with the new 2.17 app. it's always trying to run on GPU0 even when BOINC assigns it to another GPU. Confirmed: While Boinc Manager was saying that Task #32640074, Task #32640075 and Task #32640080 were running at devices #0, #1 and #2 at this triple GTX 1650 GPU system, they actually were all processed concurrently at the same device #0. | |
ID: 57265 | Rating: 0 | rate: / Reply Quote | |
Yeah, it was actually partially caused by some settings on my end, combined with the fact that when the cuda1121 app was released on July 1st they deleted/retired/removed the cuda100 app. Had they left the cuda100 app in place, I would have at least received that one still. I'll post more details in the original thread about that issue. | |
ID: 57266 | Rating: 0 | rate: / Reply Quote | |
The device problem should be fixed now. | |
ID: 57298 | Rating: 0 | rate: / Reply Quote | |
Windows version deployed | |
ID: 57299 | Rating: 0 | rate: / Reply Quote | |
Great news! Now to just have some tasks ready to send. | |
ID: 57300 | Rating: 0 | rate: / Reply Quote | |
Good news! :-) | |
ID: 57304 | Rating: 0 | rate: / Reply Quote | |
Got 1 WU. | |
ID: 57305 | Rating: 0 | rate: / Reply Quote | |
Picked up a task apiece on two hosts. Some 800 tasks in progress now. | |
ID: 57308 | Rating: 0 | rate: / Reply Quote | |
It seems that even though the cuda1121 app is available and works fine on Ampere cards, there's nothing preventing the cuda101 app from being sent to an Ampere host. These will always fail. | |
ID: 57360 | Rating: 0 | rate: / Reply Quote | |
This is correct. I had the same exact failure with the CUDA101 app sent to my Ampere RTX 3080. | |
ID: 57363 | Rating: 0 | rate: / Reply Quote | |
Technical question: does anybody have an idea whether the compute capability (CC) is available in the scheduler? | |
ID: 57365 | Rating: 0 | rate: / Reply Quote | |
Technical question: has anybody idea if the CC is available in the scheduler? I'm sure it is. I'll start digging out some references, if you want. Sorry, you caught me in the middle of a driver update. Try https://boinc.berkeley.edu/trac/wiki/AppPlanSpec#NVIDIAGPUapps: <min_nvidia_compcap>MMmm</min_nvidia_compcap> | |
ID: 57366 | Rating: 0 | rate: / Reply Quote | |
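Following that AppPlanSpec pointer, a hedged sketch of what a project-side plan_class_spec.xml entry gating an app by CUDA version and compute capability might look like. The element names come from the linked wiki page; the numeric values are illustrative assumptions, not GPUGRID's actual settings:

```xml
<plan_classes>
    <plan_class>
        <name>cuda1121</name>
        <gpu_type>nvidia</gpu_type>
        <cuda/>
        <!-- 11020 = CUDA 11.2 in BOINC's encoding (major*1000 + minor*10); illustrative -->
        <min_cuda_version>11020</min_cuda_version>
        <!-- MMmm: 700 = compute capability 7.0; illustrative -->
        <min_nvidia_compcap>700</min_nvidia_compcap>
    </plan_class>
</plan_classes>
```

With a spec like this, the scheduler itself would refuse to pair the plan class with hosts below the stated driver and CC levels, with no per-user preference changes needed.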
To expand on what Richard wrote, I’m sure it’s available. Einstein@home uses this metric in their scheduler to gatekeep some of their apps to certain generations of Nvidia GPUs. So it’s definitely information that’s provided from the host to the project via BOINC. | |
ID: 57367 | Rating: 0 | rate: / Reply Quote | |
So it’s definitely information that’s provided from the host to the project via BOINC. From the most recent sched_request file sent from this computer to your server:

    <coproc_cuda>
    ...
    <major>7</major>
    <minor>5</minor>
 | |
ID: 57368 | Rating: 0 | rate: / Reply Quote | |
Uhm... yes but I was wondering how to retrieve it in the C++ code. | |
ID: 57369 | Rating: 0 | rate: / Reply Quote | |
Uhm... yes but I was wondering how to retrieve it in the C++ code. Same principle. Start with "Specifying plan classes in C++", third example:

    ...
    if (!strcmp(plan_class, "cuda23")) {
        if (!cuda_check(c, hu,
            100,        // minimum compute capability (1.0)
            200,        // max compute capability (2.0)
            2030,       // min CUDA version (2.3)
            19500,      // min display driver version (195.00)
            384*MEGA,   // min video RAM
            1.,         // # of GPUs used (may be fractional, or an integer > 1)
            .01,        // fraction of FLOPS done by the CPU
            .21         // estimated GPU efficiency (actual/peak FLOPS)
        )) {
            return false;
        }
    }
 | |
ID: 57370 | Rating: 0 | rate: / Reply Quote | |
Uhm... yes but I was wondering how to retrieve it in the C++ code. I guess there might be an easier workaround, with no need to touch the current code. It would consist of splitting the ACEMD3 app in the Project Preferences page into ACEMD3 (cuda 101) and ACEMD3 (cuda 1121). This way, Ampere users would be able to untick the ACEMD3 (cuda 101) app, manually preventing themselves from receiving tasks that are sure to fail. | |
ID: 57377 | Rating: 0 | rate: / Reply Quote | |
Won't work for multi-gpu users that have both Turing and Ampere cards installed in a host. | |
ID: 57378 | Rating: 0 | rate: / Reply Quote | |
Won't work for multi-gpu users that have both Turing and Ampere cards installed in a host. It should work as a manual selection in Project Preferences for receiving ACEMD3 (cuda 1121) tasks only. Your RTX 3080 (Ampere - device 0) can't process ACEMD3 (cuda 101), as seen in your failed task e1s627_I757-ADRIA_AdB_KIXCMYB_HIP-1-2-RND0972_0, but it can process ACEMD3 (cuda 1121), as seen in your succeeded task e1s385_I477-ADRIA_AdB_KIXCMYB_HIP-0-2-RND6281_2. And your RTX 2080 (Turing - device 1) on the same host can also process ACEMD3 (cuda 1121) tasks, as seen in your succeeded task e1s667_I831-ADRIA_AdB_KIXCMYB_HIP-1-2-RND8282_1. Therefore, restricting preferences in a particular venue for your host #462662 to only receiving ACEMD3 (cuda 1121) tasks would work for both cards. The exception is the general limitation of the ACEMD3 app on all kinds of mixed multi-GPU systems when restarting tasks on a different device. It was described by Toni in his Message #52865, dated Oct 17 2019, in the paragraph "Can I use it on multi-GPU systems?" | |
ID: 57379 | Rating: 0 | rate: / Reply Quote | |
There are two different aspects to this debate: | |
ID: 57380 | Rating: 0 | rate: / Reply Quote | |
I haven’t seen enough compelling evidence to justify keeping the cuda101 app. The cuda1121 app works on all hosts and is basically the same speed. | |
ID: 57381 | Rating: 0 | rate: / Reply Quote | |
I agree. Simply remove the CUDA101 app and restrict sending tasks to any host that hasn't updated the drivers to the CUDA11.2 level. | |
ID: 57383 | Rating: 0 | rate: / Reply Quote | |
What's with the new ACEMD beta version 9.17, introduced today? What are we testing? | |
ID: 57391 | Rating: 0 | rate: / Reply Quote | |
I received 15 of the test WUs. No problems on Ampere; all crunched without issue. Just want more :) | |
ID: 57392 | Rating: 0 | rate: / Reply Quote | |
Yes, you crunched those tasks with the CUDA1121 app on your Ampere which was always working. | |
ID: 57393 | Rating: 0 | rate: / Reply Quote | |
NVIDIA GeForce RTX 3080 TI | |
ID: 57394 | Rating: 0 | rate: / Reply Quote | |
Mine also failed on a single 3080, driver 470.63: | |
ID: 57395 | Rating: 0 | rate: / Reply Quote | |
Yes, you crunched those tasks with the CUDA1121 app on your Ampere which was always working. They’ll never get cuda101 working on Ampere as long as they have the architecture check. | |
ID: 57396 | Rating: 0 | rate: / Reply Quote | |
I received 15 of the test wu's no problems on Ampere all crunched without issue just want more :) I got a few on my Nvidia 1660 and would also like to see more come my way. | |
ID: 57408 | Rating: 0 | rate: / Reply Quote | |
Yes, you crunched those tasks with the CUDA1121 app on your Ampere which was always working. Just threw away a couple more tasks because the server scheduler sent me the CUDA101 app for my Ampere card. | |
ID: 57466 | Rating: 0 | rate: / Reply Quote | |
Are we out of work again? I am going to have to greylist GPUGrid again unless I see better WU flow soon. | |
ID: 57859 | Rating: 0 | rate: / Reply Quote | |
Looks like it. I was hoping the new work from the new researcher doing machine learning with PyTorch was going to provide consistent work again. Until we fill these positions, we have little capacity to send jobs. It also didn't help the WAS and ZCD scores that the server script for generating the work credits export wasn't running for 5 days. | |
ID: 57860 | Rating: 0 | rate: / Reply Quote | |
Project has been greylisted again for Gridcoin. 😱 | |
ID: 57861 | Rating: 0 | rate: / Reply Quote | |
Looks like it. I was hoping the new work from the new researcher doing machine learning with pytorch was going to provide consistent work again. To me, all this suggests that no one can really tell when GPUGRID will be back to "normal", which is very sad in a way :-( | |
ID: 57862 | Rating: 0 | rate: / Reply Quote | |
If the project's ostracism lasts much longer, there is a real risk of it gradually losing the prestige it has earned over the years. | |
ID: 57863 | Rating: 0 | rate: / Reply Quote | |
Actually, could we have an approximate date for the "restart" of the project? | |
ID: 57864 | Rating: 0 | rate: / Reply Quote | |
Incidentally, how does Gridcoin work, and where is the list of greylisted projects? | |
ID: 57963 | Rating: 0 | rate: / Reply Quote | |
The project won't be whitelisted again until it can consistently create work and have its credits exported for enough days to get its work available and zero credit days scores below greylist criteria. | |
ID: 57965 | Rating: 0 | rate: / Reply Quote | |
and what is the point of gridcoin? | |
ID: 57991 | Rating: 0 | rate: / Reply Quote | |
and what is the point of gridcoin? The cruncher receives Gridcoin for crunching. Gridcoin is worth much less than Bitcoin as of yet (1 GRC = 0.01 USD). | |
ID: 57992 | Rating: 0 | rate: / Reply Quote | |
Gridcoin rewards citizen scientists for their distributed computing contribution. | |
ID: 58001 | Rating: 0 | rate: / Reply Quote | |
The human foot did not evolve with shoes. | |
ID: 58034 | Rating: 0 | rate: / Reply Quote | |
Can't someone please just block this guy/bot? He's posting random stuff all over and cluttering message boards across various projects. Thanks. | |
ID: 58040 | Rating: 0 | rate: / Reply Quote | |
Admin, | |
ID: 58043 | Rating: 0 | rate: / Reply Quote | |
marsinph, that is a hassle for the admin to do, but you can block him on your end. | |
ID: 58059 | Rating: 0 | rate: / Reply Quote | |
marsinph that is a hassle to do for admin Why so? It should be easy. | |
ID: 58064 | Rating: 0 | rate: / Reply Quote | |
Incidentally how does gridcoin works and where is the list of project gray listed? Get a GridCoin wallet and post your address. I'd be glad to sidestake you my GRC earnings from all projects, not just GPUgrid. I swear there used to be a way to see the whitelist the scraper was using in real time. Maybe they dropped it. They still have obsolete stuff like Team_Whitelist displayed on the Help/Debug window, Scraper tab. | |
ID: 58066 | Rating: 0 | rate: / Reply Quote | |
The website for the status of project whitelisting https://gridcoin.ddns.net/pages/project-list.php | |
ID: 58072 | Rating: 0 | rate: / Reply Quote | |
I had a pair of WUs complete successfully on a computer with a pair of 2080 Ti's. My other GG computer with a 3080 + 3080 Ti has failed 6 times. E.g.:

    <core_client_version>7.16.6</core_client_version>
    <![CDATA[
    <message>
    process exited with code 195 (0xc3, -61)</message>
    <stderr_txt>
    14:02:25 (654733): wrapper (7.7.26016): starting
    14:02:44 (654733): wrapper (7.7.26016): starting
    14:02:44 (654733): wrapper: running bin/acemd3 (--boinc --device 1)
    16:50:46 (662618): wrapper (7.7.26016): starting
    16:51:07 (662618): wrapper (7.7.26016): starting
    16:51:07 (662618): wrapper: running bin/acemd3 (--boinc --device 0)
    ERROR: /home/user/conda/conda-bld/acemd3_1632842613607/work/src/mdsim/context.cpp line 318: Cannot use a restart file on a different device!
    16:51:11 (662618): bin/acemd3 exited; CPU time 4.131044
    16:51:11 (662618): app exit status: 0x9e
    16:51:11 (662618): called boinc_finish(195)
    </stderr_txt>
    ]]>

The first two that failed were probably caused by me trying to use the BOINC command <ignore_nvidia_dev>1</ignore_nvidia_dev> in my cc_config and restarting BOINC after suspending both tasks. Common sense says the proximal GPU would be GPU0, but apparently the distal GPU is device 0. So I deleted the command to allow acemd3 WUs to run on both GPUs. The next pair of WUs also failed with "Cannot use a restart file on a different device!" While they were running, a batch of OPNG WUs arrived and BOINC suspended the acemd3 WUs to let the OPNG WUs run. Maybe when they restarted they tried to swap GPUs? So, can acemd3 play nice with others, or must I stop accepting OPNG if I want to run acemd3 WUs? | |
ID: 58078 | Rating: 0 | rate: / Reply Quote | |
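A side note on the <ignore_nvidia_dev> experiment above: BOINC's cc_config.xml also supports per-project GPU exclusion via <exclude_gpu>, which would keep the excluded card available to other projects instead of hiding it from BOINC entirely. A sketch; the URL and device number are placeholders that must match your own client's values:

```xml
<cc_config>
  <options>
    <!-- Exclude one GPU for one project only. device_num is BOINC's own
         device index, which (as noted above) may not match physical
         slot order, so check the event log's startup device list. -->
    <exclude_gpu>
      <url>https://www.gpugrid.net/</url>
      <device_num>1</device_num>
    </exclude_gpu>
  </options>
</cc_config>
```

This is a client-side workaround only; it does not address the restart-on-a-different-device limitation itself.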
If you have different devices, you can't interrupt the acemd3 calculation, or it will error out on restart. The only solution is to increase "switch between applications every xx minutes" in the Manager and pass that to the client. | |
ID: 58084 | Rating: 0 | rate: / Reply Quote | |
To add on to what Keith said, it’s become apparent that the acemd3 app generates a memory load which is specific to the hardware it’s running on. Look at different GPUs and you’ll see different amounts of memory used by each, and it seems to scale with total memory size (i.e., a 12GB GPU will show more memory used than a 4GB GPU). Or maybe it scales with memory configuration (bandwidth, speed) and not just total size. Or some combination. | |
ID: 58085 | Rating: 0 | rate: / Reply Quote | |
At the root of this problem: why does GG switch to a different device when its name did not change? | |
ID: 58086 | Rating: 0 | rate: / Reply Quote | |
Has nothing to do with GPUGrid or the acemd3 application. The issue is with BOINC. | |
ID: 58087 | Rating: 0 | rate: / Reply Quote | |
Make sure your task switching options are set to longer than the estimated task run time. | |
ID: 58088 | Rating: 0 | rate: / Reply Quote | |
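The task-switch interval mentioned above can be set in the Manager (Options → Computing preferences) or directly in global_prefs_override.xml. A sketch, assuming a 1440-minute interval is long enough to outlast any acemd3 task on your hardware; the value is an example, not a recommendation:

```xml
<global_preferences>
  <!-- "Switch between tasks every X minutes"; 1440 min = 24 h, so a
       running acemd3 task is not preempted mid-run and restarted on a
       different device. -->
  <cpu_scheduling_period_minutes>1440</cpu_scheduling_period_minutes>
</global_preferences>
```

After editing the file, use the Manager's "Read local prefs file" option so the client picks up the change.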
Has nothing to do with GPUGrid or the acemd3 application. The issue is with BOINC. Sounds like a defect that must be remedied. If it's not already a GitHub issue, someone should file one. I've worn out my welcome with 4 issues that will never be fixed. I wonder if it's not possible for GG to assign a GPU device number to the WU and stick with the same one after time-slicing? Set switching to 1440 minutes and got 4 acemd3 WUs overnight that are running nicely. The OPNG WUs are sitting there patiently like a bug in a rug :-) | |
ID: 58095 | Rating: 0 | rate: / Reply Quote | |
Again, the acemd3 application would have to be completely rewritten to hook into BOINC in a manner that BOINC does not currently have. | |
ID: 58097 | Rating: 0 | rate: / Reply Quote | |
Keep task switching set to the proper value and operate with an understanding of the idiosyncrasies of different projects, and this won't be an issue. For GPUGRID that means not turning off your PC or doing anything else that would cause a task to restart while processing GPUGRID work. | |
ID: 58100 | Rating: 0 | rate: / Reply Quote | |
I haven't received any new tasks since roughly November, and I just recently replaced the certificate. My other 3 projects are working just fine; only this one isn't. I have tried resetting, removing, and re-adding GPUGRID, but no work gets downloaded. Any ideas? | |
ID: 58235 | Rating: 0 | rate: / Reply Quote | |
Any ideas? No work is available to send you. GPUGRID has always operated with intermittent work availability, especially recently, when they only release small batches at a time. You're trying to get work at a time when work isn't available. | |
ID: 58236 | Rating: 0 | rate: / Reply Quote | |
ok, Thanks! | |
ID: 58252 | Rating: 0 | rate: / Reply Quote | |