GPU work units [network connection issue]
| Author | Message |
|---|---|
|
Joined: 20 Aug 07 Posts: 18 Credit: 1,319,274 RAC: 0
|
Are GPU work units dependent on the internet for anything other than sending and receiving? I ask because our internet went out three times over the last month, and the work units that were being processed at the time continued on until completion and were then reported as client error/compute error. During the last outage, I turned off the computer and restarted it after the outage, and the work unit processed correctly. Are the work units dependent on a good internet connection while being processed? I have lost 6 work units due to this. |
Paul D. Buck Joined: 9 Jun 08 Posts: 1050 Credit: 37,321,185 RAC: 0
|
Um, no, and, um, well, yes ... Which is confusing ... bear with me ... There is no need for an internet connection to compute a GPU Grid task. On the other hand, the TCP/IP connection used to connect the client to the manager can be farbled up if you lose your external connection. This is a long-standing issue with the BOINC client that, because of its "intermittent" nature, is difficult to locate. *IF* you know you are losing, or will lose, or have lost your connection, you have two choices to protect your work: one, shut down the system, or two, turn off BOINC's connection (Activity menu, suspend internet). What killed your tasks was "No heartbeat", meaning the client and the science application lost "sight" of each other ... most specifically, the application thinks that the client is dead. It isn't really, it is just tied up trying to make a connection to the Internet. And it is locked up, so to speak, in the deadly embrace waiting for the connection to fail ... but while it is doing that it is not paying any attention to anything else ... {edit} Like the running applications ... so they try to make the connection a specific number of times and if they fail that number of times, they quit ... *SO* ... Hope this helps ... |
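The "try a specific number of times, then quit" behaviour described in this post can be pictured with a toy model. This is only a sketch under stated assumptions: the class `ScienceApp`, the function `simulate`, and the limit of three missed checks are hypothetical and are not taken from the BOINC source. The thread does not give the real threshold.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ScienceApp:
    """Toy stand-in for a science application watching for client heartbeats."""

    max_missed: int = 3   # hypothetical limit; the real value is not given in the thread
    missed: int = 0       # consecutive checks without a heartbeat
    running: bool = True

    def tick(self, heartbeat_seen: bool) -> None:
        """Process one check interval."""
        if not self.running:
            return  # once the app has quit, later heartbeats do not revive it
        if heartbeat_seen:
            self.missed = 0
        else:
            self.missed += 1
            if self.missed >= self.max_missed:
                self.running = False  # the app gives up: "No heartbeat"


def simulate(heartbeats: Iterable[bool], max_missed: int = 3) -> bool:
    """Return True if the toy app is still running after the given sequence."""
    app = ScienceApp(max_missed=max_missed)
    for seen in heartbeats:
        app.tick(seen)
    return app.running
```

In this model a short stall (two missed checks followed by a heartbeat) is survivable, while a client that stays busy for three checks in a row loses the task, which matches the pattern of work units dying only during longer outages.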
|
Joined: 13 Mar 09 Posts: 59 Credit: 324,366 RAC: 0
|
Thanks for that. I was coincidentally having latency issues last night when BOINC lost connection to the client. Isn't there a reference server (or servers) that the system uses to check whether the connection is good? Maybe it has something to do with losing sight of that. Rob |
|
Joined: 20 Aug 07 Posts: 18 Credit: 1,319,274 RAC: 0
|
Thanks for the info, I'll shut down the internet connection next time and let the application continue to process. |
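On a machine without the Manager GUI, the same "suspend internet" step can probably be scripted. A minimal sketch, assuming the `boinccmd` command-line tool that ships with BOINC is on the PATH and accepts `--set_network_mode never [duration]` (older releases may name the tool differently). The helper names `network_mode_command` and `suspend_network` are hypothetical.

```python
import shutil
import subprocess
from typing import List, Optional


def network_mode_command(mode: str, duration_s: Optional[int] = None) -> List[str]:
    """Build the boinccmd argument list for a network mode change."""
    if mode not in ("always", "auto", "never"):
        raise ValueError(f"unknown network mode: {mode!r}")
    cmd = ["boinccmd", "--set_network_mode", mode]
    if duration_s is not None:
        # Optional duration in seconds, after which the previous mode returns.
        cmd.append(str(duration_s))
    return cmd


def suspend_network(duration_s: Optional[int] = None) -> bool:
    """Suspend BOINC network activity; return False if boinccmd is not installed."""
    if shutil.which("boinccmd") is None:
        return False
    subprocess.run(network_mode_command("never", duration_s), check=True)
    return True
```

Calling `suspend_network(3600)` before a planned outage would, under these assumptions, keep the client off the network for an hour while the tasks continue to compute.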
|
Joined: 19 Feb 09 Posts: 37 Credit: 30,657,566 RAC: 0
|
I think the reference server it tries is Google. I think I saw it somewhere but no idea where. |
Michael Goetz Joined: 2 Mar 09 Posts: 124 Credit: 124,873,744 RAC: 0
|
"I think the reference server it tries is Google. I think I saw it somewhere but no idea where." The default reference site is indeed Google. You can change that if you wish; I believe it's an option you can put in the cc_config.xml file (or whatever the filename is). The place you saw that information was probably the documentation for the cc_config.xml file. Mike |
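For anyone who wants to try changing the reference site, here is a small sketch that builds such a file. It assumes the option is called `network_test_url` and sits inside an `<options>` element, which is how I recall the BOINC client configuration documentation describing it; the thread itself does not name the option, so check the documentation for your client version. The function name `build_cc_config` and the example URL are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_cc_config(test_url: str) -> str:
    """Return cc_config.xml text that sets the network test URL.

    Assumption: the client reads <network_test_url> from <options>.
    """
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    ET.SubElement(options, "network_test_url").text = test_url
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Print the config so it can be reviewed before saving it as
    # cc_config.xml in the BOINC data directory.
    print(build_cc_config("http://www.example.com/"))
```

The client normally needs to re-read its config file (or be restarted) before a change like this takes effect.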
|
Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 |
Mhh, I also had occasional network failures, but I didn't see the apps losing connection to the BOINC client, even though I didn't suspend network activity (after all, there's always the chance the connection is restored and I won't have to run dry ...). So I'm wondering: is it really your external connection? Or some strange problem with your Windows setup? MrS Scanning for our furry friends since Jan 2002 |
Paul D. Buck Joined: 9 Jun 08 Posts: 1050 Credit: 37,321,185 RAC: 0
|
"Mhh, I also had occasional network failures, but I didn't see the apps losing connection to the BOINC client, even though I didn't suspend network activity (after all, there's always the chance the connection is restored and I won't have to run dry ...)." That is why it is such a hard problem to find. It does not happen to all people all the time. But if you look into the past, the "No heartbeat" has been an annoying bug for a long time in the BOINC world. Just like the "Can't acquire lockfile" ... another pest ... |
|
Joined: 20 Aug 07 Posts: 18 Credit: 1,319,274 RAC: 0
|
I don't think it's a problem with the Windows setup. This is just a recent problem and it only happens when the internet in the area goes down. It has gone down three times in the last month, and the two work units being processed at the time crashed and burned. The only problem I can think of is a bug in the BOINC Application 6.6.20 files (which were also installed about the time the problems started). I will revert to a previous version and see what happens. I am also running the most current NVIDIA drivers for the video cards. |
Paul D. Buck Joined: 9 Jun 08 Posts: 1050 Credit: 37,321,185 RAC: 0
|
"I don't think it's a problem with the Windows setup. This is just a recent problem and it only happens when the internet in the area goes down. It has gone down three times in the last month, and the two work units being processed at the time crashed and burned. The only problem I can think of is a bug in the BOINC Application 6.6.20 files (which were also installed about the time the problems started). I will revert to a previous version and see what happens. I am also running the most current NVIDIA drivers for the video cards." No Heartbeat has been around for years ... but trying another client may work ... :) |
|
Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 |
Yes, if inet was really broken then it surely was not an issue with Windows and the installed programs. However, what I was thinking: what if some program went berserk and blocked your inet access as well as your local servers, and thus caused the no heartbeat issue? In this case it would also look like a broken inet from your point of view. Except if you have different computers and/or you know the neighbours' inet is also gone or you see the service guys working or whatever. I don't know your situation, so this was just an idea ... maybe a crazy one ;) MrS Scanning for our furry friends since Jan 2002 |
Paul D. Buck Joined: 9 Jun 08 Posts: 1050 Credit: 37,321,185 RAC: 0
|
"Except if you have different computers and/or you know the neighbours' inet is also gone or you see the service guys working or whatever. I don't know your situation, so this was just an idea ... maybe a crazy one ;)" Not really. ANOTHER bug I am chasing causes OS X versions of BOINC Manager to lose connection to the Client, though it continues to run, apparently properly ... but the manager cannot connect to the client. I have sent Charlie Fenton I think 4 reports now of what I have discovered and what I suspect ... nothing back from him yet ... (sadly) ... But Charlie is a good guy, I think, so patience is a virtue, which is probably why I don't have much of it ... {edit} Forgot to mention, it looks like a TCP/IP bug also ... |
|
Joined: 20 Aug 07 Posts: 18 Credit: 1,319,274 RAC: 0
|
Well, I reverted back to BOINC 6.4.7 and everything has been running properly for the last two days. No problems to report at all. In my opinion there is a problem in the BOINC 6.6.20 code as it is applied to GPU/CUDA functions. BOINC 6.6.20 runs fine on my other computers, however they are not running GPU/CUDA functions. |
|
Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 |
"In my opinion there is a problem in the BOINC 6.6.20 code as it is applied to GPU/CUDA functions." There is a problem? Boy, we could all finally be happy again if it was only one ;) MrS Scanning for our furry friends since Jan 2002 |
|
Joined: 20 Aug 07 Posts: 18 Credit: 1,319,274 RAC: 0
|
Maybe I should have said "some problems" instead of "a problem." Bad choice of words on my part. However, these problems have not manifested themselves on my computers that are not using CUDA. |
Paul D. Buck Joined: 9 Jun 08 Posts: 1050 Credit: 37,321,185 RAC: 0
|
There are two major problems with 6.6.20, neither of which is recognized by UCB as far as I know. One of them has been fixed in 6.6.23 and later, though 6.6.24 introduced a new issue, addressed in 6.6.25 (multi-GPU users only). The two problems in 6.6.20 show up as, first, long-running tasks on either CPU or GPU (I now have a confirmed instance of this relating to AP tasks of SaH), and second, a gradual imbalance in internal debts that results in a very poor mix of work selection on the system. |
robertmiles Joined: 16 Apr 09 Posts: 503 Credit: 769,991,668 RAC: 0
|
Could the problem with long-running tasks be only for those tasks that don't have fairly frequent checkpoints available yet? |
Paul D. Buck Joined: 9 Jun 08 Posts: 1050 Credit: 37,321,185 RAC: 0
|
"Could the problem with long-running tasks be only for those tasks that don't have fairly frequent checkpoints available yet?" No, because it is universal with tasks from more than one project. I saw it only with GPU Grid for sure (it may have affected other tasks, I just did not see it). Another user saw the estimated times of his AP tasks of SaH go from 187-plus hours to 63 hours by going back to 6.4.7 (I think) ... I have not seen it with 6.6.23, and there was a change in the resource scheduler (in the release notes), and though I forget what it said, it certainly sounded like the issue. I have been running 6.6.23 and 6.6.25 for several weeks so I can get at the other issues (if you watched the mailing list this weekend, I sure did fill that up), and the only way you can get the developers' attention is to run the latest versions (or close to it; there is no significant change in 6.6.26, so I have not tried it yet). |