no cuda work requested

Message boards : Graphics cards (GPUs) : no cuda work requested
uBronan
Joined: 1 Feb 09
Posts: 139
Credit: 575,023
RAC: 0
Message 8847 - Posted: 24 Apr 2009, 16:07:01 UTC
Last modified: 24 Apr 2009, 16:07:58 UTC

This 6.6.20 problem may mostly affect people running multiple GPU setups

Sadly no; on my single-GPU system I see the same thing, although my system just finishes the unit, sends it, and then receives a new one, or sometimes I receive 4 new ones which probably get cancelled sooner or later by the server ;)
So the issue is more widespread and seems to affect more projects, but those projects send many more units, and/or their units have longer deadlines or run much longer.
That makes them have fewer problems than the GPUGrid project, which is time-critical.
ID: 8847

Profile Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 8862 - Posted: 24 Apr 2009, 19:35:54 UTC - in response to Message 8841.  

I've been getting the 'No cuda work requested' messages too. It has been days since I got a GPUGrid WU, but since SETI-CUDA is running fine, I knew my hardware and drivers were okay.

I reset the GPUGrid project and immediately got workunits.

FWIW.

Edit: but all is not well. With a quad-core CPU and a dual-GPU card (Nvidia 9800GX2), I should be running 6 tasks: 4 CPU and 2 CUDA. It just paused a SETI-CUDA task to run the GPUGrid one, leaving only 5 tasks active... <sigh>

6.6.20 and above are still works in progress. I did not, and do not think 6.6.20 was ready for prime time. It works, mostly, but it actually does not work as well as 6.5.0 IMO ... especially when you have more than one GPU in the system.

When you did a project reset you reset the debt on that one project. The problem is that you did not reset the debts on the other projects. To clear up most of the scheduling problems when you have anomalies like this, you need to set the debt-reset flag in the cc_config.xml file, then stop and restart the client (a simple "read config file" will not reset debts). Be sure to change the flag back to 0 after you stop and restart.
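To make the procedure concrete, a minimal cc_config.xml along these lines (placed in the BOINC data directory) is what is being described; set the flag, stop and restart the client, then set it back to 0:

```xml
<!-- cc_config.xml in the BOINC data directory -->
<cc_config>
  <options>
    <!-- 1 = zero all per-project long- and short-term debts at client
         startup; set back to 0 (or remove) after one restart, or the
         debts will be wiped on every start -->
    <zero_debts>1</zero_debts>
  </options>
</cc_config>
```

This is a sketch of the file as used by the 6.6.x clients discussed in this thread; other options you already have in cc_config.xml should be kept alongside the flag.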

6.6.23 actually seems to be worse on debt management. And 6.6.24 seems to insist that the number-2 GPU is not like all the others and refuses to use it, regardless of how identical it is ...

The fix in 6.6.24 to address excessive task switching also did not clear up the problem though it may have addressed a bug that exaggerated the problem (or may have been inconsequential).

Waiting on 6.6.25 ...

Seriously, if you are having problems with work fetch, drop back to 6.5.0 ... the only things you lose are some debug message improvements and the change to time tracking (you can't correctly see how long a task still has to run).
ID: 8862

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 8894 - Posted: 25 Apr 2009, 12:33:40 UTC - in response to Message 8862.  

Thanks for your effort to put this bug report together. If the developers are not totally blind and have not put you on their ignore lists, they should be able to see that this is not just chatting; it's a real problem.

And I guess sometimes you hate to be right.. you said many times that 6.6.20 was not ready (and from the various reports it clearly wasn't) and had serious debt issues. Well, that's what we're seeing now.

/me is sticking to 6.5.0 a while longer.

MrS
Scanning for our furry friends since Jan 2002
ID: 8894

MarkJ
Volunteer moderator
Volunteer tester
Joined: 24 Dec 08
Posts: 738
Credit: 200,909,904
RAC: 0
Message 8902 - Posted: 25 Apr 2009, 13:10:34 UTC - in response to Message 8797.  
Last modified: 25 Apr 2009, 13:11:37 UTC

No work sent
Full-atom molecular dynamics on Cell processor is not available for your type of computer.
cuda app exists for Full-atom molecular dynamics but no cuda work requested.


I understand this message to mean that BOINC does request work from GPU-Grid, but it does not request CUDA work (which would be extremely strange / stupid) and hence the server is not sending CUDA work.
Am I totally wrong here?

MrS


If you turn on the cc_config flag <sched_op_debug> you will see what it's requesting. It is not a bug.

The BOINC 6.6 series makes two requests: one for CPU work and one for CUDA work. GPUgrid does not have CPU work, so when the client asks for some you get the message above. It should then make another request for CUDA work, which the project can provide.

If you have recently upgraded to the 6.6 client, I would suggest you reset the debts. This can be done using the cc_config flag <zero_debts>.
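For reference, the logging flag mentioned above lives in the <log_flags> section of the same cc_config.xml file; a sketch of how to enable it:

```xml
<!-- cc_config.xml: enable scheduler-request logging -->
<cc_config>
  <log_flags>
    <!-- logs each scheduler RPC in the Messages tab, including how
         much CPU and CUDA work the client asked the project for -->
    <sched_op_debug>1</sched_op_debug>
  </log_flags>
</cc_config>
```

With this set you can see the separate CPU and CUDA requests the 6.6 client makes, which is where the "no cuda work requested" message comes from.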
BOINC blog
ID: 8902

Profile Michael Goetz
Joined: 2 Mar 09
Posts: 124
Credit: 124,873,744
RAC: 0
Message 8905 - Posted: 25 Apr 2009, 13:20:06 UTC - in response to Message 8902.  

If you have recently upgraded to the 6.6 client I would suggest you reset the debts. This can be done using the cc_config flag <zero_debts>


And if you do this, don't forget to remove it after you restart BOINC. If left in there, the debts will be reset every time BOINC starts.

ID: 8905

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 8917 - Posted: 25 Apr 2009, 15:00:16 UTC - in response to Message 8905.  
Last modified: 25 Apr 2009, 15:00:45 UTC

If left in there, the debts will be reset every time BOINC starts.


Which actually might be a good idea. Ignore the debts and treat the resource share as an approximation.. don't stick to the code, they're more like guidelines anyway ;)

The BOINC 6.6 series makes two requests: one for CPU work and one for CUDA work. GPUgrid does not have CPU work, so when the client asks for some you get the message above. It should then make another request for CUDA work, which the project can provide.


Wouldn't that completely screw up the scheduling? BOINC would quickly assign a massive debt to the CPU side of GPU-Grid, which can never be reduced as there is no CPU client, and which would in turn screw up the scheduling of all the other CPU projects?

This assumes there are separate debts for CPUs and coprocessors. If this is not the case ... well, the entire debt system is broken anyway and by definition cannot work.

(.. please don't take this as a personal offense, I'm just thinking a little further ahead ;)

MrS
Scanning for our furry friends since Jan 2002
ID: 8917

Profile Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 8934 - Posted: 25 Apr 2009, 18:12:25 UTC - in response to Message 8917.  

This assumes there are separate debts for CPUs and coprocessors. If this is not the case ... well, the entire debt system is broken anyway and by definition cannot work.

The new debt system is supposed to track two debt levels for each project. The problem is that if you have only one project of a given resource class, you can and will get unconstrained growth of the debt for that resource.

I get it for GPU Grid on my i7 (running 6.6.23 - *NOT RECOMMENDED*; I do not recommend running 6.6.23 or 6.6.24. Note I am testing ... and 6.6.23 is PAINFUL for GPU Grid ... YMMV) ...

Another participant running only GPU Grid and Rosetta@Home gets it for Rosetta ...

See the fuller discussion in the thread BOINC v6.6.20 scheduler issues (most specifically Message 60808), or the BOINC Alpha and Dev mailing lists for an even fuller discussion.

The net effect is that you stop getting a full queue of tasks for the one resource.

Sadly, even in the face of providing them lots of logs and other data I am not sure they have even started looking at this problem.

The good news is that they are finally starting to take seriously a problem I first pointed out around 2005, when I bought my first dual Xeons with HT (the first quad-CPU systems); it is now a killer on 8-CPU systems ... especially if you also add in multiple GPUs ...

The system I am considering building this summer will be at least an i7, and I hope to put at least 3 GTX 295 cards into it ... making it 8 CPUs and 6 GPUs ... 14 processors ... an alternative is a dual Xeon again ... that would be 16 CPUs and 6 GPUs (or 8 GPUs with 4 PCI-e slots) ... that will make the problem I noted a real killer ...
ID: 8934

MarkJ
Volunteer moderator
Volunteer tester
Joined: 24 Dec 08
Posts: 738
Credit: 200,909,904
RAC: 0
Message 8940 - Posted: 26 Apr 2009, 1:16:44 UTC - in response to Message 8917.  

If left in there, the debts will be reset every time BOINC starts.


Which actually might be a good idea. Ignore the debts and treat the resource share as an approximation.. don't stick to the code, they're more like guidelines anyway ;)

The BOINC 6.6 series makes two requests: one for CPU work and one for CUDA work. GPUgrid does not have CPU work, so when the client asks for some you get the message above. It should then make another request for CUDA work, which the project can provide.


Wouldn't that completely screw up the scheduling? BOINC would quickly assign a massive debt to the CPU side of GPU-Grid, which can never be reduced as there is no CPU client, and which would in turn screw up the scheduling of all the other CPU projects?

This assumes there are separate debts for CPUs and coprocessors. If this is not the case ... well, the entire debt system is broken anyway and by definition cannot work.

(.. please don't take this as a personal offense, I'm just thinking a little further ahead ;)

MrS


It's supposed to maintain 2 sets of debts (i.e. one for CPU and one for GPU). With projects like SETI, which use both types of resource, that is useful. GPUgrid causes it grief because it only uses one resource type. There is supposed to be a check on a resource's debt growing too much, but it doesn't seem to work.

Then there is the scheduling system, which is where the current discussions are at the moment. I don't quite share Paul's pessimism regarding 6.6.23 (or 6.6.24). It has improved since 6.6.20, though not substantially. Now if they can fix the remaining issues it could once again become reliable.

Don't worry, I'm not offended - I didn't write BOINC. Paul and I make suggestions, but they usually get ignored by the developers anyway.
BOINC blog
ID: 8940

Profile Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 8942 - Posted: 26 Apr 2009, 1:28:20 UTC - in response to Message 8940.  

Then there is the scheduling system, which is where the current discussions are at the moment. I don't quite share Paul's pessimism regarding 6.6.23 (or 6.6.24). It has improved since 6.6.20, though not substantially. Now if they can fix the remaining issues it could once again become reliable.

Don't worry, I'm not offended - I didn't write BOINC. Paul and I make suggestions, but they usually get ignored by the developers anyway.

Um, did not think I was being pessimistic ... I thought rational was more like it ... but Ok ... :)

If 6.6.23 or .24 works for you ... cool ... .23 *IS* better than .20 in my opinion, though if you run a single project as I do it seems to have the debt problem. If you don't mind resetting debts on occasion, then go for it.

The main improvements in .23 had to do with initialization crashes and CUDA task switches that were not handled properly. What I saw on .20 was that at times tasks took twice as long to run. I have not seen that at all on .23 ... and I have been running the heck out of .23 on the i7 ... but, 24-48 hours later, I can't get 4 queued tasks from GPU Grid ... reset debts and I am good to go ...

In .24 there is a huge mistake of some kind and my second of 4 GPUs is suddenly not the same as the others ... it is always the second of the GPUs, which sounds like a bug to me ... not sure where ... I suggested a change to print out the exact error; let's see if they pick that up ... and/or find the real problem (I looked and saw nothing that leaped out at me ... but I am not a C programmer).

For me to notice someone trying to offend me you have to be at Dick Cheney level of effort to get me to even notice you are trying ... so, I don't do offended... :)

And so, one of the reasons I don't understand why others do ... thankfully you don't ... :)

Now if others would be so reasonable ...
ID: 8942

MarkJ
Volunteer moderator
Volunteer tester
Joined: 24 Dec 08
Posts: 738
Credit: 200,909,904
RAC: 0
Message 8943 - Posted: 26 Apr 2009, 2:26:58 UTC - in response to Message 8942.  

Then there is the scheduling system, which is where the current discussions are at the moment. I don't quite share Paul's pessimism regarding 6.6.23 (or 6.6.24). It has improved since 6.6.20, though not substantially. Now if they can fix the remaining issues it could once again become reliable.

Don't worry, I'm not offended - I didn't write BOINC. Paul and I make suggestions, but they usually get ignored by the developers anyway.

Um, did not think I was being pessimistic ... I thought rational was more like it ... but Ok ... :)

If 6.6.23 or .24 works for you ... cool ... .23 *IS* better than .20 in my opinion, though if you run a single project as I do it seems to have the debt problem. If you don't mind resetting debts on occasion, then go for it.

The main improvements in .23 had to do with initialization crashes and CUDA task switches that were not handled properly. What I saw on .20 was that at times tasks took twice as long to run. I have not seen that at all on .23 ... and I have been running the heck out of .23 on the i7 ... but, 24-48 hours later, I can't get 4 queued tasks from GPU Grid ... reset debts and I am good to go ...

In .24 there is a huge mistake of some kind and my second of 4 GPUs is suddenly not the same as the others ... it is always the second of the GPUs, which sounds like a bug to me ... not sure where ... I suggested a change to print out the exact error; let's see if they pick that up ... and/or find the real problem (I looked and saw nothing that leaped out at me ... but I am not a C programmer).

For me to notice someone trying to offend me you have to be at Dick Cheney level of effort to get me to even notice you are trying ... so, I don't do offended... :)

And so, one of the reasons I don't understand why others do ... thankfully you don't ... :)

Now if others would be so reasonable ...


I haven't had to reset debts on any of my machines, but I don't run a single project. I usually have 3 (or, when Einstein went offline last week, 4) running.

.23 seemed to have fixed the never-ending GPUgrid WU bug.

Apart from the debugging messages, .24 doesn't seem to correct anything. But then I've only got it installed on a single-GPU machine because of the "can't find 2nd gpu" bug.
BOINC blog
ID: 8943

Profile Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 8963 - Posted: 26 Apr 2009, 21:22:36 UTC - in response to Message 8943.  

I haven't had to reset debts on any of my machines, but I don't run a single project. I usually have 3 (or, when Einstein went offline last week, 4) running.

.23 seemed to have fixed the never-ending GPUgrid WU bug.

Apart from the debugging messages, .24 doesn't seem to correct anything. But then I've only got it installed on a single-GPU machine because of the "can't find 2nd gpu" bug.

It is not a problem of running a single project; it is running a single project of a particular resource class. And I also think the speed of the system plays a part in how fast the debts get out of whack.

I run 6.6.20 on the Q9300, which has a single GPU, and it does not seem to get into trouble that fast. The i7, on the other hand, only lasts a day or so before the GPU Grid debt is so out of whack that I have to reset it to keep 4 tasks in the queue. If I don't reset the debts, pretty soon all I have is the tasks running on the 4 GPUs. It is possible that if I had only one or two GPUs in the system it would not get out of whack so fast ... but ...

The change in .24 was in response to some discussions on the lists about the asymmetry of GPUs ... I think the decision was wrong and hope we can get some reasonableness going ... but so far there has been no acknowledgement that this is a bad choice ... hopefully Dr. Korpela at SaH will speak up, and the PM types here too ... if they don't, the chances of getting the change backed out are lower (note they can also send silent e-mails directly to Dr. A) ...
ID: 8963

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 9020 - Posted: 27 Apr 2009, 21:21:35 UTC - in response to Message 8940.  

It's supposed to maintain 2 sets of debts (i.e. one for CPU and one for GPU). With projects like SETI, which use both types of resource, that is useful. GPUgrid causes it grief because it only uses one resource type. There is supposed to be a check on a resource's debt growing too much, but it doesn't seem to work.


Thanks for explaining. It still looks stupid: if someone has a CUDA device and is attached to 50 CPU projects, then 6.6.2x will continue to request GPU work from all of them? I really hope the new versions of the server software feature some flag to tell the clients which work they can expect from them..

MrS
Scanning for our furry friends since Jan 2002
ID: 9020

Profile Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 9034 - Posted: 27 Apr 2009, 21:51:25 UTC - in response to Message 9020.  

It's supposed to maintain 2 sets of debts (i.e. one for CPU and one for GPU). With projects like SETI, which use both types of resource, that is useful. GPUgrid causes it grief because it only uses one resource type. There is supposed to be a check on a resource's debt growing too much, but it doesn't seem to work.


Thanks for explaining. It still looks stupid: if someone has a CUDA device and is attached to 50 CPU projects, then 6.6.2x will continue to request GPU work from all of them? I really hope the new versions of the server software feature some flag to tell the clients which work they can expect from them..

And that is exactly what happens.

In the last 26 hours I have hit the 50-some projects I am attached to with 800-some requests; most of them are probably asking for CUDA work, because my GPU debt is high and climbing, since I am only attached to GPU Grid for GPU work.

What they are relying on is the "back-off" mechanism, with the assumption that the number of requests is nominal. The problem is that a DoS attack is also made of very small requests, just lots of them. Multiply my 800 requests by 250,000 participants and pretty soon you are talking some real numbers.
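Taking those figures at face value (800 requests per host over 26 hours, scaled to 250,000 hosts), a back-of-the-envelope estimate of the aggregate load works out to:

```latex
\frac{800 \times 250{,}000\ \text{requests}}{26 \times 3600\ \text{s}}
  \approx \frac{2 \times 10^{8}}{93{,}600\ \text{s}}
  \approx 2{,}100\ \text{requests per second, sustained}
```

This is an illustration of the scale Paul is describing, not a measured figure; real hosts vary widely in request rate.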

I have TWO threads going on the alpha list right now about this type of lunacy, where, surprisingly, John McLeod VII is arguing for policies that waste time and lead to system instability because the cost of the policy is "low". The trouble is that, in reality, the numbers are not as low as he insists they are ...

Worse, the repetitive nature of this obsessive checking (as often as once every 10 seconds or faster) means the logs get so full of stuff that you cannot find meaningful instances of the problems you are trying to cure.

Just because you can do something does not mean that you should. A lesson the BOINC developers have chosen not to learn yet. I would point out that the latest spate of server outages took place shortly after 6.6.20 was made the standard ... coincidence? Maybe, maybe not ...

But why they are so blasé about adding to the load on the scheduler is beyond me ...

My latest post on DoS and 6.6.20+:
Ok,

Related to the debt issue of these later clients with GPU work
shortfall there is a side issue.

I turned on sched_op_debug and have been watching my i7 mount a DoS
attack on most projects, trying to get CUDA work from projects that
don't have any and are not likely to have any anytime soon.

So, my GPU debt is climbing for AI ... but the AI project does not
have a CUDA application. So I ping the server, it backs me off, I
ping again 7 seconds later, and so on, slowly backing off ... the
problem is that with sufficient 6.6.x clients all doing this ...
well ... DoS attack ...

This is another case of too much of a good thing being bad for the
system as a whole.

The assumption that the rates are low and the cost is low ignores the
fact that it is not necessary ... and things that are not necessary
should not be done, regardless of how low we think the cost might be ...

I suggested earlier this week, well last calendar week, that we add a
flag from project preferences that would block the request of CUDA
work unless it was explicitly set by the project.

For the moment that keeps the number of projects that have to make a
server-side change low (SaH, SaH Beta, GPU Grid and The Lattice
Project). These would add a project "preference" indicating that GPU
work is allowed, much like the preference setting on SaH ... GPU Grid
would not SHOW the setting, as it is meaningless to do so ... but the
client would NOT issue GPU work requests to projects without this
flag set.

This will stop the mounting DoS attacks on the servers, and lower the
frequency of mindless CPU scheduling events ...



And my prior:

Perhaps we should make the flags explicit in the system-side revision, where:

<cpu>1</cpu>
<gpu>1</gpu>

have to be set specifically (default: assume <cpu>1</cpu>).

Then these flags could be used to control debt allocation. GPU Grid would of course be:

<cpu>0</cpu>
<gpu>1</gpu>

Prime Grid (at the moment):

<cpu>1</cpu>
<gpu>0</gpu>

and so on ...

If not explicitly set by the project, the assumption would be:

<cpu>1</cpu>
<gpu>0</gpu>


Of course the most depressing thing is that as John explicitly said, if I keep saying things that "they" don't want to hear, "they" are going to keep ignoring me ... my reply was, of course, just because I am saying things that he, and others, might not want to hear does not make me wrong ... nor will ignoring problems make them go away ...
ID: 9034

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 9082 - Posted: 28 Apr 2009, 20:50:05 UTC - in response to Message 9034.  
Last modified: 28 Apr 2009, 20:53:23 UTC

Wow, now that makes me want to scream.

Many of our far too many software issues happen because people don't plan properly: they don't plan to include features which later become necessary, and everything becomes a mess when those features are "hacked in". In our case the BOINC devs actually have the benefit of knowing what they will need: the ability to handle a heterogeneous landscape with different coprocessors.

Do any of their current changes factor in ATIs and Larrabees, even remotely? Or the different CUDA hardware capabilities? It doesn't look like it, judging by the way the term "the GPU" is used, as if there were only one kind.

Just imagine what 10 possible coprocessor types would do to the DoS attacks: each host issuing 10 requests every ~10 s to each project it's attached to? How can anyone even remotely like this idea? Sure, currently the requests can be handled, but why take the risk of letting this grow out of hand and then invest time struggling to fix the side effects on the local scheduler?!

I'm not a real software developer, but I know that if you do things the quick & dirty way, many of them *are* going to bite you in the a**...

MrS
Scanning for our furry friends since Jan 2002
ID: 9082

Profile Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 9092 - Posted: 28 Apr 2009, 23:35:32 UTC - in response to Message 9082.  

Wow, now that makes me want to scream.

It does not help me much either. Being suicidally depressed as a normal state with medication not being effective, I really don't need the aggravation.

When they were first asking about the 6.6.x to 6.8.x versions, we (Richard Hasslegrove, Nicholas Alveres (sp?), and a few others - sorry guys, I forgot the list) made a lot of suggestions ... as I said before, none of them were considered.

Now we see issues with work fetch and resource scheduling, to the point where my system is bordering on chaos ... I cannot imagine what a 16-CPU system with 6 GPU cores will look like. Though there is some glimmer of recognition that there is an issue, the approach is the same: let's tinker with the rules and not make any big changes.

Sadly, I know from experience that this will not work. Yes, they may be able to fake it for some more time, but it would be better and cleaner to start anew.

Theory says that they left room for future GPU and other coprocessor types in the mix. Nick does not think they virtualized enough, and though I cannot read the code well enough to know for sure (I don't do C well, and C++ even less), it sure does not look like he is wrong.

The issue is that none of them are systems engineers (I was), and they don't really consider, or know, the issues; they charge on with the courage of their convictions that because they can hack together code they know what they are doing. The courage and skill of amateurs.

At one point I specialized in database design, and most people don't know that there are three types of DBAs or database specialists. The logical designer is the one interested in the data life-cycle and data models (that's what I did) and is generally not concerned with speed or efficiency (I mean it is not a primary concern, though you do know what will make the system fast or slow).

Completed database models are implemented and tuned by a systems DBA (a class of DBA most people have never met; there just are not that many of them around). This guy tunes the hardware and system software (he may even select and buy it specifically for the data model to be implemented), creates things like table spaces, and lays the data out on the physical media. Backups and all the system stuff are designed by this guy.

The third is the type of DBA most people know about. He knows a lot of stuff but is mostly concerned with the day-to-day operation of the database. Though he may know about making tables and putting them on disks ... well ... it is an art, and few do it well ...

What is the point of all this? BOINC's database was put together by the third kind of DBA and amateurs ... it is one of the reasons the databases are so fragile ... and crash so often ... while I was still working I showed the data model to a systems DBA I knew, and he thought it was as poor a design as I did ...

Anyway, the study of logical database design for relational databases has a point to it ... ignore the "rules" at your peril ... and we can see the result of the choices made ...

Anyway, I sent in a pseudo-code outline of what I think should be done for resource scheduling, so we can solve that problem that is coming up on 5 years old now ... I will tackle the work-fetch and DoS issues 5 years from now when they finally agree that they are an issue ... if history is a guide ... RH and I, though, are trying to bring them up along with other work-fetch issues in 6.6.23, .24, and now .25 ...

ID: 9092


©2025 Universitat Pompeu Fabra