Message boards : Number crunching : GTX 770 won't get work
**Michael H.W. Weber** · Joined: 9 Feb 16 · Posts: 78 · Credit: 656,229,684 · RAC: 0

> Do you have the "Use NVidia GPU" and the "Use Graphics Processing Unit (GPU) if available" options selected in the GPUGrid preferences?

Yes.

> Do you have at least 8 GB of disk space in the partition where the BOINC data directory resides?

Will check, but I would quite certainly say yes.

> How many other GPU projects is this host attached to?

Just Primegrid and GPUGRID, but even if I suspend Primegrid, the machine won't fetch work for GPUGRID.

> You could try to increase the work buffer (it is set to 1 day now) for testing.

I did, but it doesn't change the situation even if I set the work buffer to 10/10 days.

Michael.
President of Rechenkraft.net - Germany's first and largest distributed computing organization.
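A side note on where the 10/10-day buffer mentioned above lives: when it is set through BOINC Manager's local computing preferences, it is written to `global_prefs_override.xml` in the BOINC data directory. A minimal sketch (standard BOINC element names; the values simply mirror the test described above):

```xml
<!-- global_prefs_override.xml (sketch) - local override of the web preferences.
     work_buf_min_days      = "store at least N days of work"
     work_buf_additional_days = "store up to an additional N days of work"
     The client applies the file after a restart or after local preferences are re-read. -->
<global_preferences>
    <work_buf_min_days>10.0</work_buf_min_days>
    <work_buf_additional_days>10.0</work_buf_additional_days>
</global_preferences>
```

A buffer this large that still produces no work points away from the buffer settings and toward the server side.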
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 261

Getting new work is a two-part collaboration between your computer and the project server.

The first necessary condition is that your computer requests new work. The work fetch log has confirmed that your machine is requesting work for your NVidia GPU - job done. You can turn that logging off now, and save some disk space and processing cycles.

The second necessary condition is that the server responds by allocating new work - which it isn't. The question is - why not?

One more to check - are you allowing 'ACEMD long runs' (project preferences)? Short run jobs are as rare as hen's teeth these days.

After that, it's a question of verifying that your GPU's 'compute capability' and graphics driver together match the current minimum project requirements.
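The logging referred to above is presumably BOINC's `<work_fetch_debug>` flag (an assumption - the flag name is not stated in the thread), which is set in `cc_config.xml` in the BOINC data directory. A minimal sketch of switching it off again:

```xml
<!-- cc_config.xml (sketch). 0 disables the verbose work-fetch lines in the
     Event Log, 1 enables them. The client picks the change up after the
     config files are re-read from the Manager or the client is restarted. -->
<cc_config>
    <log_flags>
        <work_fetch_debug>0</work_fetch_debug>
    </log_flags>
</cc_config>
```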
**Retvari Zoltan** · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0

> After that, it's a question of verifying that your GPU's 'compute capability' and graphics driver together match the current minimum project requirements.

It's been done. Moreover, this host has already successfully completed a single CUDA 8.0 task, but no more were sent by the project.

I experienced the same behavior once when I was trying my GTX 1080 under Linux. I thought then that I'd messed up something while trying to make SWAN_SYNC work under Linux (well, I couldn't). That host stopped receiving work, even though GPUGrid was the only project on it the whole time. Then I installed Windows 10 on the same hardware, and it received work again, and it hasn't stopped receiving work after I set SWAN_SYNC on.
**Michael H.W. Weber** · Joined: 9 Feb 16 · Posts: 78 · Credit: 656,229,684 · RAC: 0

> One more to check - are you allowing 'ACEMD long runs' (project preferences)?

As said above: Yes, ALL GPUGRID subprojects are allowed for this machine.

Michael.
President of Rechenkraft.net - Germany's first and largest distributed computing organization.
Joined: 28 Jul 12 · Posts: 819 · Credit: 1,591,285,971 · RAC: 0

I would detach from everything and reattach only to GPUGrid, as Retvari suggests. If that doesn't work, you have found the one hardware/software configuration that just doesn't obey the rules as we know them. It happens, as you know.
**Michael H.W. Weber** · Joined: 9 Feb 16 · Posts: 78 · Credit: 656,229,684 · RAC: 0

> I would detach from everything and reattach only to GPUGrid, as Retvari suggests.

I detached from GPUGRID, rebooted the system and re-attached to GPUGRID. No improvement.

> If that doesn't work, you have found the one hardware/software configuration that just doesn't obey the rules as we know them. It happens, as you know.

No, I do not know or accept that. This is science, not homeopathy (although homeopathy at least offers the placebo effect - for those of us who are believers - which cannot in principle be excluded from being something scientifically accessible too, although we still have no clue how that might be possible).

Michael.
President of Rechenkraft.net - Germany's first and largest distributed computing organization.
Joined: 28 Jul 12 · Posts: 819 · Credit: 1,591,285,971 · RAC: 0

I am not sure you followed the instructions. Homeopathy is your idea, not mine. However, if you're enthusiastic about spending more time on this, I would try earlier drivers (still CUDA 8). Nvidia may have introduced problems in the later ones. I have seen it on other projects on occasion.
Joined: 11 Oct 08 · Posts: 1127 · Credit: 1,901,927,545 · RAC: 0

It's frustrating that the server doesn't give more details in its reply. I think your problem can only be investigated further by the project admins, who really should throw us more bones in the server replies in the Event Log, to further explain WHY tasks were not given.
Joined: 5 Jan 09 · Posts: 670 · Credit: 2,498,095,550 · RAC: 0

> I would detach from everything and reattach only to GPUGrid, as Retvari suggests.

Are you sure it's not your work buffer or some other config in BOINC?
**Michael H.W. Weber** · Joined: 9 Feb 16 · Posts: 78 · Credit: 656,229,684 · RAC: 0

> Are you sure it's not your work buffer or some other config in BOINC?

Yes, I am sure about that. This machine just stopped receiving any work from one day to the next, without me having altered any of the BOINC or project settings.

Michael.
President of Rechenkraft.net - Germany's first and largest distributed computing organization.
Joined: 27 Feb 14 · Posts: 4 · Credit: 121,376,887 · RAC: 0

I have two 750 Ti's on different machines at different physical locations using the same BOINC and GPUGrid settings/prefs. I noticed one was getting GPUGrid work and the other wasn't. After a couple of days of not getting work I began to investigate. I found this thread, did some reading, and noticed the driver difference between the two. I updated the driver to the newest (382.33) and got work.

TL;DR: update your drivers.
**Michael H.W. Weber** · Joined: 9 Feb 16 · Posts: 78 · Credit: 656,229,684 · RAC: 0

> Moreover, this host has already successfully completed a single CUDA 8.0 task...

How did you actually find out about that? I couldn't see any of the completed WUs in my client's history even before I started this thread.

Michael.
President of Rechenkraft.net - Germany's first and largest distributed computing organization.
Joined: 11 Oct 08 · Posts: 1127 · Credit: 1,901,927,545 · RAC: 0

First post has a link to a host. Then on that page, you can click Application Details to see application details for that host.
**Michael H.W. Weber** · Joined: 9 Feb 16 · Posts: 78 · Credit: 656,229,684 · RAC: 0

> First post has a link to a host. Then on that page, you can click Application Details to see application details for that host.

Indeed. Never checked that link. One question, though: why aren't all the tasks completed using CUDA 6.5 listed as valid (although they all were valid)?

Michael.
President of Rechenkraft.net - Germany's first and largest distributed computing organization.
Joined: 11 Oct 08 · Posts: 1127 · Credit: 1,901,927,545 · RAC: 0

> First post has a link to a host. Then on that page, you can click Application Details to see application details for that host.

Your assumption that they were all valid seems invalid :) From my experience, if a task is suspended and resumed, or stopped and resumed, then it has a chance of being marked invalid, even if you watched it complete without error. Something in the validator must not like the output sometimes when those scenarios happen.

Getting back on topic, I'm sure that GPUGrid changed their logic for deciding when to give hosts work, and I'm fairly certain that the detected driver version has a hand in that criteria. I wonder if they screwed something up in the app version criteria for the 700-series GPUs on Linux? Also, can you see if you can upgrade your driver? (I looked briefly and there might be a minor update available to you.)
**Michael H.W. Weber** · Joined: 9 Feb 16 · Posts: 78 · Credit: 656,229,684 · RAC: 0

> Your assumption that they were all valid seems invalid :)

Not really. They generated at least 73 billion credits, so at least a few should have been OK. :) The point is that not a single valid task is listed (and no invalid ones either).

> Getting back on topic, I'm sure that GPUGrid changed their logic for deciding when to give hosts work, and I'm fairly certain that the detected driver version has a hand in that criteria. I wonder if they screwed something up in the app version criteria for the 700-series GPUs on Linux?

Two things: (1) The NVIDIA proprietary driver is updated from time to time through Ubuntu's auto-update. I actually do not like to change this manually, as everything except GPUGRID works perfectly. (2) This GTX 770 machine uses the same driver as my GTX 970 machine. The latter receives tasks on a daily basis; the former does not. So I don't really see why the current driver should be the problem - especially since, as stated above, even the GTX 770 completed a CUDA 8 WU successfully.

But why should I care? It is not my project, and the GTX 770 now contributes to some other project until the GPUGRID team decides to do something in order to keep or increase their number of contributors. I find it rather strange that - if I got it correctly - this topic has so far been discussed exclusively by volunteers? Thank you guys, I think you did your best.

Michael.
President of Rechenkraft.net - Germany's first and largest distributed computing organization.
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 261

> The point is that not a single valid task is listed (and no invalid ones either).

Don't worry about that. Task data is kept in a short-term 'transactional' database and purged (to save space and processing time) when no longer needed - usually after 10 days or so. The important scientific data is transferred to a long-term scientific database and kept indefinitely.

But from the same 'application details' link for your machine, we can see for cuda65 (long tasks):

Number of tasks completed: 249
Max tasks per day: 1
Number of tasks today: 0
Consecutive valid tasks: 0

'Max tasks per day' and 'consecutive valid tasks' together imply that your machine produced a considerable number of invalid tasks at some point. No shame in that - we all did the same thing when the cuda65 licence expired - but it shows the sort of inferences you can draw.

> Two things:

I'm not a Linux user, but I have read comments that Linux GPU drivers tend to be compiled against a specific Linux kernel. If your kernel also auto-updates, you may need to take precautions to ensure that your kernel and driver updates are kept in sync.
**Michael H.W. Weber** · Joined: 9 Feb 16 · Posts: 78 · Credit: 656,229,684 · RAC: 0

> But from the same 'application details' link for your machine, we can see for cuda65 (long tasks):

Hm, I do not understand how that conclusion can be drawn. The number of tasks is limited to two per day by the GPUGRID server anyway. The GTX 770 mostly got long runs, so it can rarely complete more than one per day. Moreover, I checked the machine and its output virtually on a daily basis during the entire year 2016 and early 2017. Rarely have I seen an invalid task, and when it happened, I caused it by accidentally updating the system (including the NVIDIA drivers) during full DC operation. I must confess, though, that around the time when GPUGRID stopped sending tasks to my system, I had not checked regularly for probably a few weeks.

IF there had been many, many consecutive errors at the transition from CUDA 6.5 to 8.0, wouldn't it be possible that some information flag is stored somewhere, locally on my machine or on the GPUGRID server, that causes my system to be marked as permanently unreliable? And that this flag somehow has not yet been removed and now causes WUs not to be sent? Hm, probably also not the case, as it completed a CUDA 8.0 task...

> I'm not a Linux user, but I have read comments that Linux GPU drivers tend to be compiled against a specific Linux kernel. If your kernel also auto-updates, you may need to take precautions to ensure that your kernel and driver updates are kept in sync.

See, that is exactly why I am hesitant to manually install a more recent NVIDIA driver: when you use the console to update the whole system, everything is brought to the most recent state in a coordinated (!) way. A new kernel plus the corresponding and tested GPU driver will be delivered. For now, I will just wait and see whether GPUGRID again sends WUs to my GTX 770 after a future system update with even more recent drivers than the ones I currently have in use. Until then, other DC projects will be supported.

Michael.
President of Rechenkraft.net - Germany's first and largest distributed computing organization.
**Retvari Zoltan** · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0

> IF there had been many, many consecutive errors at the transition from CUDA 6.5 to 8.0, wouldn't it be possible that some information flag is stored somewhere, locally on my machine or on the GPUGRID server, that causes my system to be marked as permanently unreliable? And that this flag somehow has not yet been removed and now causes WUs not to be sent? Hm, probably also not the case, as it completed a CUDA 8.0 task...

This came to my mind too. Perhaps you should try to force the BOINC manager to request a new host ID for your host. You can do that by stopping the BOINC manager, editing `client_state.xml`, searching for `<hostid>342877</hostid>`, replacing the number with the number of a previous host of yours (or a random number if you don't have an older host), saving `client_state.xml`, and restarting the BOINC manager.
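A minimal sketch of the `client_state.xml` edit described above (the host ID 342877 is the one quoted in the post; the `<project>` block is heavily abbreviated, with the GPUGRID `<master_url>` shown only to indicate which block to edit). Stop the client before editing, since it rewrites the file while running:

```xml
<!-- client_state.xml excerpt (sketch, abbreviated). Locate the GPUGRID
     <project> block and change the value of <hostid>; per the post above, use
     the ID of a previous host of yours, or a random number if you have none. -->
<project>
    <master_url>http://www.gpugrid.net/</master_url>
    <hostid>342877</hostid>  <!-- replace this number, save, then restart BOINC -->
</project>
```

The intent, as described above, is that the server no longer recognises the machine under its old host record and issues it a fresh host ID on the next scheduler contact.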
**Michael H.W. Weber** · Joined: 9 Feb 16 · Posts: 78 · Credit: 656,229,684 · RAC: 0

Thanks for this suggestion. A second idea of mine: the client reports a GPU memory of 1998 MB instead of the expected 2048 MB. What is the minimum V-RAM that GPUGRID requires before it sends tasks, is this value stored somewhere in the BOINC system files, and can it be modified without causing trouble?

Michael.
President of Rechenkraft.net - Germany's first and largest distributed computing organization.