Message boards :
Graphics cards (GPUs) :
Really low Run Times, but still Completed and Successful?
Joined: 11 Oct 08, Posts: 1127, Credit: 1,901,927,545, RAC: 0
> WCG's HCC on GPU WU's began by using a full CPU, then used the GPU (high GPU utilization), and then used a full CPU thread again. When the CPU was being used the GPU was not, and when the GPU was being used the CPU was not.

That is not what I saw. On my machine, on my nVidia GTX 460, each of the WCG HCC GPU tasks consumed a CPU core fully for the entire time. Also, for each task, there were 2 times within the duration of a task where the task would not use the GPU for about 20 seconds. I watched in Task Manager, Process Explorer, and GPU-Z.

> In this situation, you could therefore be running two HCC tasks without using any CPU (depending on how far along the tasks were) or be fully using two CPU threads.

That never happened for me. Again, for me, each WCG HCC task always used a full CPU core, and except for the 2 20-second non-GPU periods per task, each task utilized the GPU.

In my opinion, when using the app_config and specifying that GPUGrid tasks use 0.001 CPUs, this meant that when some GPUGrid tasks started, two CPU threads would already be allocated and fully used by HCC tasks. Along with 7 CPU tasks, this meant that at that critical time for a GPUGrid task (the start), the CPU was already saturated, and Boinc was being told (by the app_config file) that GPUGrid apps hardly needed any CPU (0.001). They failed as a result of CPU starvation.

When running 1 GPUGrid task... Do you have any proof that the system or BOINC does anything different when using an app_config of 0.001, as compared to not using an app_config (where it uses 0.729 CPUs by default)? I'm looking for proof, not opinion or conjecture. My claim is that the number is only used when determining how many tasks to start, and nothing else. But I have not yet been able to conclusively prove it.

They were incorrectly granted credit because the system didn't catch the error, but app_config is new and the server hasn't been equipped to catch such unknown errors.
While this might be a problem for the researchers, perhaps future server updates will help protect against such problems. The admins/researchers are capable, right now, of putting the necessary checks into the validator to invalidate these types of results, without any server updates. So far, they have not made this a priority. So, I've given up on them helping the situation.
Beyond (Joined: 23 Nov 08, Posts: 1112, Credit: 6,162,416,256, RAC: 0)
> WCG's HCC on GPU WU's began by using a full CPU, then used the GPU (high GPU utilization), and then used a full CPU thread again. When the CPU was being used the GPU was not, and when the GPU was being used the CPU was not.

Most of my ATI/AMD cards are on WCG HCC1, and generally they do best with 4 WUs/GPU running concurrently. Along with an NVidia GPU running GPUGrid, it works best here to reserve 3 X6 CPU cores for the 5 GPU WUs. BTW, the HCC project is winding down fast; they've collected all the data they need for now.
skgiven (Joined: 23 Apr 09, Posts: 3968, Credit: 1,995,359,260, RAC: 0)
Just checked, and it looks like my GTX660Ti continuously uses a full thread running HCC; it's my ATI that only uses a full thread at the start and end of the run. I don't have a GTX460 to check with, but I'll take your word that it's the same. No wonder the NVidia's are so poor there; they're running something on the CPU and probably hammering the PCIE. Might be down to NVidia's OpenCL 1.1 driver limitations, but it's winding up, so it's no odds now.

Jacob, exactly how do you propose proving or disproving the effect of using 0.001 or any other number in the app_config file? Put an acceptable test together with reasoning and I'll test it, but anything that starts killing tasks is a no-no.

FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help
Joined: 11 Oct 08, Posts: 1127, Credit: 1,901,927,545, RAC: 0
skgiven,

Your claim is that the CPU is somehow being "overloaded", and that using 0.001 CPU_Usage in an app_config.xml file could be the cause. I don't believe that's possible, but let's have you test it! So, a suitable test (for you to do) would be:

- Tell BOINC to "use at most 100% of the processors"
- Close BOINC
- Add an app_config.xml file to your GPUGrid.net project directory, and make it look like the one referenced here: http://www.gpugrid.net/forum_thread.php?id=3332&nowrap=true#29520
- Restart BOINC
- Confirm that any running GPUGrid tasks show up as "Running (0.001 CPU + 1 NVIDIA GPU (device x))"
- Confirm that the CPU is either fully saturated or over-saturated. The more saturation, the better for our test!
- Run that for 3 weeks, monitoring your GPUGrid.net task results
- Let me know if any results have really low run times but are still completed successfully, validated, and granted credit

Does that sound suitable?
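For reference, the linked file isn't reproduced in this thread. A minimal app_config.xml of the kind under discussion might look like the sketch below; the app name `acemdlong` is an assumption, so check your client_state.xml for the actual GPUGrid app name:

```xml
<!-- Hypothetical app_config.xml for the GPUGrid project directory.
     Note: cpu_usage only informs BOINC's task scheduling; it does NOT
     limit how much CPU the science app actually consumes. -->
<app_config>
  <app>
    <name>acemdlong</name>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>0.001</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```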
skgiven (Joined: 23 Apr 09, Posts: 3968, Credit: 1,995,359,260, RAC: 0)
Firstly, I use my systems, so I don't want the CPU to be saturated. I'm also not keen on deliberately trying to banjax tasks! I think that over-saturating the CPU is a fundamentally bad setup, one that is basically designed to cause failures and certainly not one I would recommend using. So I would also be concerned that such a setup would generate other failure types. What is run on the CPU could be significant too.

So far only you have experienced such failures, and only when using 0.001 cpu_usage in an app_config.xml file. This strongly suggests that the 0.001 configuration, or the use of the app_config file, is causing the problem for you. I know it's supposed to be a scheduling parameter, but its terminology suggests that it stipulates the CPU usage of the app in question. If it does just apply scheduling parameters, it shouldn't ordinarily limit the GPU app's usage of the CPU; if I set it to 0.001 or 0.99 it shouldn't make any difference - the app will use whatever amount of CPU it requires. On that theory, yes, perhaps during the start of a task Boinc incorrectly uses this parameter to apportion CPU usage to the loading apps, and only if the CPU becomes overly saturated does having such a low cpu_usage restrict the GPU app's access to the CPU, to the extent that it causes tasks to fail.

To determine that, however, you would be better setting flags and looking at the logs to get some indication that this is happening. I don't think there are any specific flags for app_config, but perhaps <coproc_debug>, <cpu_sched> and <cpu_sched_debug> would be useful?

The problem with this test is that you can't determine whether it's the 0.001 setting or just using the app_config file when the CPU is already over-saturated. To determine that, you would have to test, as you suggested, with normal CPU usage assigned (1.0, not Boinc's 0.729 estimate) and also without the cpu_usage line (to show that it's not something else to do with using app_config), if possible?
That's 9 weeks testing something that could turn out to be a peculiarity specific to your individual setup.
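The flags mentioned above go in cc_config.xml in the BOINC data directory; these are standard BOINC client log flags. A sketch:

```xml
<!-- cc_config.xml in the BOINC data directory (e.g. C:\ProgramData\BOINC).
     Enables scheduler and coprocessor logging; apply via
     Advanced -> Read config file(s), or restart the client. -->
<cc_config>
  <log_flags>
    <coproc_debug>1</coproc_debug>
    <cpu_sched>1</cpu_sched>
    <cpu_sched_debug>1</cpu_sched_debug>
  </log_flags>
</cc_config>
```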
Joined: 17 Aug 08, Posts: 2705, Credit: 1,311,122,549, RAC: 0
@SK: this is not about recommending a setup, it's about helping debugging. And to track the error down we've got to find what triggers it. If this makes you more comfortable to run the test Jacob suggested: I've been doing exactly what he said since last weekend. No problems so far.. of course.

And we all know modern PCs can be strange beasts at times.. but this entire argument of "not overloading the CPU, as it might lead to errors" sounds bollux to me. The OS task scheduler is there to handle this. As I said: I'm running overcommitted now, with 1 more thread than logical cores, yet GPU-Grid typically still gets about 12.5% overall CPU time, i.e. one logical core. It's the lower-priority (Einstein) jobs which get less CPU time, as Jacob already pointed out several times.

One could conceive a scenario where, due to some CPU stalling, the GPU isn't fed properly and some timeout triggers, causing the error. But this would result in a "display driver stopped responding" message. Jacob.. did you get such errors? Do they appear in the event log in general?

And this CPU utilization number is really just a guideline for the BOINC task scheduling. Actually, if the developers wanted it to be anything else, they'd be in trouble, as BOINC can only start the tasks. What these do afterwards is out of their control. One could even put a multi-threaded app in there, stealing CPU time from other projects while still claiming to use 1 core. Of course, no real project would be dumb enough to try any such thing.

MrS
Scanning for our furry friends since Jan 2002
Joined: 11 Oct 08, Posts: 1127, Credit: 1,901,927,545, RAC: 0
skgiven,

I thought you were serious about wanting to help test. But it sounds like you don't want to perform the test I recommended. That's unfortunate.

But I'm trying to channel my frustrations into solving this. It's taken me 1 month thus far, with very little guidance/help from the admins. I feel I've made some progress, and if it takes 9 more weeks, or longer, to solve it, then so be it. You're welcome to help out any time. But it will involve stepping outside your comfort zone.

Regards, Jacob

PS: Thanks for the advice about the logging flags.
Joined: 17 Aug 08, Posts: 2705, Credit: 1,311,122,549, RAC: 0
> Jacob wrote: I didn't say what you think I said.

Sorry, you're right: I actually get 8 Einsteins and 1 GPU-Grid task with either config, so the same as you. I thought it would have been 7 + 1. Might have been with some prior BOINC version, or who knows!

MrS
Joined: 11 Oct 08, Posts: 1127, Credit: 1,901,927,545, RAC: 0
For reference, I believe the BOINC scheduling rule is (and has been for a long time):

- While CPU_Used <= (#CPUs + 1), schedule GPU tasks to run
- While CPU_Used < (#CPUs + 1), schedule CPU tasks to run
- If you have any high-priority tasks, they get scheduled first

So, if you're running 1 GPUGrid task, on an 8-CPU machine, with BOINC set to use 100% CPUs, and the GPUGrid task's CPU_Usage is < 1, then I would expect you to also be running 8 CPU tasks, and that is normal.

Also, you should be running BOINC v7.0.64, which was recently publicly released.

Kind Regards, Jacob
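The rule as described can be sketched in a few lines. This is a model of the rule stated above, not the actual BOINC client code, and it assumes every CPU task declares exactly 1.0 CPUs:

```python
def count_cpu_tasks(n_cpus, gpu_task_cpu_usages):
    """Model of the rule above: keep starting 1-CPU tasks while the
    total declared CPU usage stays under #CPUs + 1."""
    used = sum(gpu_task_cpu_usages)  # GPU tasks' declared cpu_usage values
    cpu_tasks = 0
    while used + 1 < n_cpus + 1:  # one more 1-CPU task must stay under the cap
        cpu_tasks += 1
        used += 1.0
    return cpu_tasks

# 8-CPU machine, one GPUGrid task declaring 0.729 CPUs:
# 0.729 + 8 < 9, so 8 CPU tasks run alongside it
print(count_cpu_tasks(8, [0.729]))  # 8
```

Note the declared value only matters once the total approaches the cap; with a single GPU task, 0.729 and 0.001 both leave room for 8 CPU tasks.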
Joined: 17 Aug 08, Posts: 2705, Credit: 1,311,122,549, RAC: 0
I thought it would be (#CPU) instead of (#CPU+1), but apparently you're right!

> Also, you should be running BOINC v7.0.64, which was recently publicly released.

That's what BOINC keeps telling me as well. But regarding BOINC I like to play it safe and only upgrade if there's anything to be gained for me.

MrS
skgiven (Joined: 23 Apr 09, Posts: 3968, Credit: 1,995,359,260, RAC: 0)
> @SK: this is not about recommending a setup, it's about helping debugging.

Fine, so long as the project is happy with people using it to test and debug Boinc code.

> And to track the error down we've got to find what triggers it. If this makes you more comfortable to run the test Jacob suggested: I'm doing exactly what he said since last weekend. No problems so far.. of course.

Hence the suggested flags. The point of running without saturating the CPU was to perform a basic trial. I've done this, and not experienced any similar issues. Unfortunately, running with 100% CPU usage has resulted in driver and app crashes for me. While it does make me a bit more comfortable knowing that you are testing, I still have my concerns.

> And we all know modern PCs can be strange beasts at times.. but this entire argument of "not overloading CPU, might lead to errors" sounds bollux to me.

The hypothetical argument was to try and understand what's going on; state something and then prove or disprove it. Anyway, the argument was that overloading the CPU leads to errors, and it does for me, just not the same problem. Not overloading the CPU, while using app_config, hasn't so far resulted in any problems, including the problem Jacob described (and I've been testing with app_config for a month now). My present concern is that there are important setup differences: I don't have a 3GB card or a GTX460 to test with, and it might be important to crunch the same WU's on the second GPU (and possibly the same CPU WU's). Can we agree on something here?

> The OS task scheduler is there to handle this. As I said: I'm running overcommitted now with 1 more thread than logical cores, yet GPU-Grid typically still gets about 12.5% overall CPU time, i.e. one logical core. It's the lower priority (Einstein) jobs which get less CPU time, as Jacob already pointed out several times.

That's nice, but I got an app crash within 5 min of running overcommitted.
There is no point running with such a setup if you get lots of other failures; it defeats the purpose.

> One could conceive a scenario that due to some CPU stalling the GPU isn't fed properly and some timeout triggers, causing the error. But this would result in a "display driver stopped responding" message. Jacob.. did you get such errors? Do they appear in the event log in general?

I realize this. I doubt that my driver and app crashes are in any way related to Jacob's special error, but they are a show stopper.

> And this CPU utilization number is really just a guideline for the BOINC task scheduling. Actually, if the developers wanted it to be anything else, they'd be in trouble as BOINC can only start the tasks. What these do afterwards is out of their control.

I'm aware of what the app_config is supposed to do, but because Jacob has only experienced the problems when it's in use, what it's actually doing is open to debate. We just need to bear in mind that this could still be a client bug, or related to the GTX460.

> One could even put a multi-threaded app in there, stealing CPU time from other projects while still claiming to use 1 core. Of course, no real project would be dumb enough to try any such thing.

Plenty of MT apps have been used, including Aqua and more recently SimOne, and yes, they stopped other tasks from running, including GPU work (and that's post 7.0.40). For that reason I ran SimOne in a VM. Between driver, app and system crashes I still try to use a VM for other projects. This makes it more difficult to commit to testing with a setup that has already and repeatedly resulted in crashes for me (and that's without using the VM).

Is it not the case that without app_config you have 7 CPU tasks and 1 GPU task running, but when using app_config it's 8?

Jacob has 2 GPU's in his setup (both NVidia). My setup primarily differs in that I presently have an ATI and an NVidia. Your setup differs in that it just has one NVidia - what I initially tested with.
I had fewer errors with the one GPU when trying to use 100% CPU, but I still had some, and I wasn't around enough to run like that. Adding the ATI was to facilitate POEM, which as you know doesn't like two NVidia's, but I also thought it would be closer to Jacob's setup.

Jacob, the version of Boinc you were using when you first encountered the credit-rich 'errors' was pre-7.0.64, probably .58, .59 or .60. The release date of .64 was 16-Apr, so the issue seems to span several release versions. While I agree it's best to test on a common Boinc version, I don't think it's essential. I've used 7.0.60 and 7.0.64 and possibly 7.0.62 (albeit for only a few days) - not that I've managed to replicate the problem.

Do you think the operating system or GPU types in the system would make any difference in determining whether the problem is CPU or app_config related? If not, I could test for a while on a different system (XP and a GTX470). At present we still haven't eliminated the possibility of the issue being related to the GTX460. If you pull it, you might find out. It does look like the issue is related to app_config, but seeing as no-one else has managed to replicate the problem, with or without an identical config file and CPU usage profile, it's still looking like it's rig-specific.
skgiven (Joined: 23 Apr 09, Posts: 3968, Credit: 1,995,359,260, RAC: 0)
OK, well here's a sort-of bug! Running with Jacob's app_config, I exited Boinc and removed the app_config file from the GPUGrid project folder. Start Boinc, and click on the GPUGrid WU properties. It says using 0.001 CPUs! It should say what it's actually using, which is 1 CPU.

Re-read the config file(s), and the Boinc logs say Boinc isn't using any app_config file:

    04/05/2013 21:47:32 | | Re-reading cc_config.xml
    04/05/2013 21:47:32 | | No config file found - using defaults

The WU properties still say it's using 0.001 CPUs. Even after a system restart, it's still saying that the WU is using 0.001 CPUs. The WU is clearly using a full CPU, but saying that it's using 0.001. That could only have come from the app_config file - which has now been removed. So why does Boinc still think it's using 0.001 CPUs?

The setting is being saved and retained (for future use) in the init_data.xml file in the Boinc slot associated with the GPUGrid WU:

    <project_dir>C:\ProgramData\BOINC/projects/www.gpugrid.net</project_dir>
    <boinc_dir>C:\ProgramData\BOINC</boinc_dir>
    <wu_name>I94R8-NATHAN_dhfr36_6-8-32-RND5174</wu_name>
    <result_name>I94R8-NATHAN_dhfr36_6-8-32-RND5174_0</result_name>
    <comm_obj_name>boinc_0</comm_obj_name>
    <slot>1</slot>
    <client_pid>3508</client_pid>
    <wu_cpu_time>10328.080000</wu_cpu_time>
    <starting_elapsed_time>10350.163047</starting_elapsed_time>
    <using_sandbox>0</using_sandbox>
    <user_total_credit>278316356.517260</user_total_credit>
    <user_expavg_credit>164632.657628</user_expavg_credit>
    <host_total_credit>35865525.000000</host_total_credit>
    <host_expavg_credit>81185.211218</host_expavg_credit>
    <resource_share_fraction>0.083333</resource_share_fraction>
    <checkpoint_period>600.000000</checkpoint_period>
    <fraction_done_start>0.000000</fraction_done_start>
    <fraction_done_end>1.000000</fraction_done_end>
    <gpu_type>NVIDIA</gpu_type>
    <gpu_device_num>0</gpu_device_num>
    <gpu_opencl_dev_index>0</gpu_opencl_dev_index>
    <ncpus>0.001000</ncpus>
    <rsc_fpops_est>5000000000000000.000000</rsc_fpops_est>
    <rsc_fpops_bound>250000000000000000.000000</rsc_fpops_bound>
    <rsc_memory_bound>100000000.000000</rsc_memory_bound>
    <rsc_disk_bound>300000000.000000</rsc_disk_bound>
    <computation_deadline>1368103379.000000</computation_deadline>
    <vbox_window>0</vbox_window>
    <host_info>

This means that removing app_config doesn't affect running tasks - they keep using the parameters already set. The problem with this is that Boinc thinks it's using 0.001 CPUs while it's actually using 1 CPU, and if I have 100% CPU usage selected, Boinc will be running 9 WU's (8 CPU tasks and 1 GPUGrid WU). If I go in and set app_config to use 1 CPU and re-read the config file(s), Boinc rethinks the situation and starts running 8 tasks (7 CPU tasks and 1 GPU task). Of course, the WU's properties still say 0.001 CPU is being used, but at least Boinc isn't actually trying to run 9 tasks and is only running 8 (including the GPU WU).
Joined: 11 Oct 08, Posts: 1127, Credit: 1,901,927,545, RAC: 0
> Fine so long as the project is happy with people using it to test and debug Boinc code.

We're not debugging BOINC code here. We are simply trying to find out what triggers the problem specified in this thread. And, for some of us trying to do that, it's okay with us if we get a few errors along the way. At this time, I'm not prepared to blame BOINC code.

> My present concern is that there are important setup differences, I don't have a 3GB card or a GTX460 to test with, and it might be important to crunch the same WU's on the second GPU (and possibly the same CPU WU's). Can we agree on something here?

I understand that a setup difference could be an issue. That is why it is so important for someone other than me to replicate the problem, on their machine.

> That's nice, but I got an app crash within 5min of running over committed. There is no point running with such a setup if you get lots of other failures, it defeats the purpose.

The 'purpose' of running fully committed, or over-committed, would be to try to replicate the issue in this thread. So, even if you get lots of other failures (which would probably be some other issue), it does NOT defeat the purpose of trying to replicate the issue in this thread. You only think it does.

> Is it not the case that without app_config you have 7 CPU tasks and 1 GPU task running, but when using app_config it's 8?

You're wrong. If he is running a GPUGrid task that says it takes < 1.000 core (either 0.729 CPU default without an app_config, or 0.001 CPU with an app_config), BOINC will still schedule 8 CPU tasks either way, because [0.729 + 8] < [8 + 1], and because [0.001 + 8] < [8 + 1]. This is normal behavior by BOINC, because BOINC's goal is to always keep all resources busy, and if a GPU task says it will only use the CPU part of the time, BOINC makes sure that a CPU task can keep the resource busy for the times when the GPU task doesn't.
So, when my system was running WCG HCC GPU tasks, with 2 WCG tasks at a time on the GTX 460, my setup was:

- If I set BOINC to use 100% CPUs, and reset the GPUGrid.net project (so that it doesn't have any app_config), BOINC will run 1 GPUGrid task on the 660 Ti (which says it uses 0.729 CPU, but actually uses a full core), 2 WCG GPU tasks on the GTX 460 (each saying 1.000 CPU, and each actually taking a full core), and 6 other CPU tasks (because [0.729 + 1.000 + 1.000 + 6] < [8 + 1]).
- If I set BOINC to use 100% CPUs, and use an app_config file of 0.001, BOINC will run 1 GPUGrid task on the 660 Ti (which says it uses 0.001 CPU, but actually uses a full core), 2 WCG GPU tasks on the GTX 460 (each saying 1.000 CPU, and each actually taking a full core), and 6 other CPU tasks (because [0.001 + 1.000 + 1.000 + 6] < [8 + 1]).
- So, here in this case, using an app_config file should NOT have made any significant difference at all in terms of CPU scheduling, yet so far my tests seem to indicate that using the file DID make a difference. I cannot understand how/why.

Let's apply that logic to my current setup, now that WCG HCC GPU is done. Listen carefully, because the math now involves running GPUGrid on 2 GPUs, and it's a bit different -- here goes:

- If I set BOINC to use 100% CPUs, and reset the GPUGrid.net project (so that it doesn't have any app_config), BOINC will run 1 GPUGrid task on the 660 Ti (which says it uses 0.729 CPU, but actually uses a full core), 1 GPUGrid task on the 460 (which says it uses 0.729 CPU, but actually only uses about 1/8th of a core), and 7 other CPU tasks (because [0.729 + 0.729 + 7] < [8 + 1]).
- If I set BOINC to use 100% CPUs, and use an app_config file of 0.001, BOINC will run 1 GPUGrid task on the 660 Ti (which says it uses 0.001 CPU, but actually uses a full core), 1 GPUGrid task on the 460 (which says it uses 0.001 CPU, but actually only uses about 1/8th of a core), and 8 other CPU tasks (because [0.001 + 0.001 + 8] < [8 + 1]).
- I'm actually running the ([0.001 + 0.001 + 8] < [8 + 1]) setup right now, trying to recreate the problem without any WCG HCC GPU tasks.

> Jacob, the version of Boinc you were using when you first encountered the credit rich 'errors' was pre-7.0.64, probably .58, .59 or .60.

I don't think the BOINC version has anything to do with this bug. I'm of the opinion that, in order to pick up all of the other small bug fixes that a user might not be aware of, they should always use the latest publicly-released version of BOINC. That is why I recommended 7.0.64.

Regarding OS and GPU types... I'm not sure. If you could please run at 100% CPUs, using an app_config.xml file with 0.001, on as many systems as possible, for the sole purpose of trying to recreate the issue, that would be awesome. I challenge you to recreate the issue.

> Running with Jacob's app_config.

First of all, maybe you are finally realizing that that number is just a number that is used to schedule tasks. You said "It should say what it's actually using", and that is simply not true. Heck, imagine a scenario where you have a GTX 660 Ti (which actually uses a full core) and a GTX 460 (which actually uses 1/8th of a core). BOINC doesn't use that number to "say what it's actually using", because if it did, then it would show "1.000" for my GTX 660 Ti and "0.125" for my GTX 460. BOINC can't do that! The "number" is a number that must be the same for all tasks of a particular app.
And, by default, the GPUGrid developers have specified 0.729 as the default value (at least for me) for every GPU task that runs, regardless of how much CPU the GPU task actually uses, and regardless of GPU architecture. See, that number will always be the same for all tasks of a particular app that are running. It gets set when a task is started/restarted, and can be updated when an app_config file is changed and the file is reread. If no app_config file is ever used, BOINC will use whatever default value the project gave, which for me is 0.729. Once an app_config file is used, the only way for me to get BOINC to start re-using the default value is to reset the project. That may be a bug, not sure, but that's just the way it works for now.

Is this hopefully making sense now? Stop thinking of that number as "somehow limit the CPU usage to this amount"; BOINC and the OS don't limit the CPU usage using that number! Instead, think of that number as just a number that's added into the total CPU usage amount when determining how many additional CPU tasks to process.

Kind regards, Jacob Klein
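The bracketed comparisons in the scenarios above can be checked mechanically. This is a sketch of the stated rule (total declared CPU usage of everything started must stay under #CPUs + 1), not actual BOINC client code:

```python
cap = 8 + 1  # 8-CPU machine; scheduler cap on total declared CPU usage

# WCG HCC era: either declared GPUGrid value leaves room for 6 CPU tasks
assert 0.729 + 1.000 + 1.000 + 6 < cap      # 8.729 < 9
assert 0.001 + 1.000 + 1.000 + 6 < cap      # 8.001 < 9

# GPUGrid on both GPUs: 0.729 declared leaves room for 7 CPU tasks,
# while 0.001 declared leaves room for 8
assert 0.729 + 0.729 + 7 < cap              # 8.458 < 9
assert not (0.729 + 0.729 + 8 < cap)        # an 8th CPU task would exceed the cap
assert 0.001 + 0.001 + 8 < cap              # 8.002 < 9

print("all scheduling comparisons hold")
```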
skgiven (Joined: 23 Apr 09, Posts: 3968, Credit: 1,995,359,260, RAC: 0)
Boinc says that my GTX470 uses 0.583 CPUs when running a long Nathan WU. In reality it uses next to nothing. The reason it uses so little CPU is that it's CC2.0 and performs something on the GPU that the Keplers perform on the CPU. The GTX660Ti is CC3.0 and uses a full CPU when running the same task types, despite Boinc saying it uses 0.687 CPUs. I think this 1-full-CPU usage requirement is stipulated by the app (akin to swan). In this case it appeared that Boinc recognized this and didn't over-commit the CPU. When I didn't use app_config, and had the CPU set to be used 100% by Boinc apps, it ran 7 CPU WU's and one GPUGrid WU on the GTX660Ti system (ATI not in use).

I didn't perform a project reset; I just deleted the app_config file from the project directory, and when the new GPUGrid WU started it used a full CPU core and reported whatever fraction of the CPU Boinc thinks it uses (0.685), but Boinc didn't over-commit the CPU.

So is using app_config the source of your problems? It seems to cause the CPU over-commitment, as when it's not used Boinc runs fewer apps; however, it's lying about what it's using and what it had actually read. I had set cpu_usage to 1, then removed the app_config and re-read the config files, and Boinc pretended it was using 0.685 CPUs when in fact it was still using the 1-CPU setting. When the WU crashed and a new one ran, it then said it was using 1 CPU. After a project reset, like you suggested, it then said 0.685 again, but ran 8 CPU WU's and one GPU WU.

Note that if I use both GPU's on my main system and try to use 100% CPU with 9 apps or more, it often crashes inside 5 min. My errors are different to yours because you have a different setup.
Joined: 11 Oct 08, Posts: 1127, Credit: 1,901,927,545, RAC: 0
> When I don't use app_config and when I have the CPU set to be used 100% by Boinc apps I run 7 CPU WU's and one GPUGrid WU on the GTX660Ti system (ATI not in use).

I require proof in the form of screenshots. Please provide this proof, as it goes against everything I've learned about how BOINC functions.
skgiven (Joined: 23 Apr 09, Posts: 3968, Credit: 1,995,359,260, RAC: 0)
Re-read my previous post; I edited it with some findings.

The screen grab below shows 7 POGS CPU WU's and one GPU WU running, with Task Manager showing 100% CPU usage. At the time, the WU properties showed the following. However, this misled me; it's either not correct or was subsequently overwritten. I think the app_config was still being applied and was setting the CPU usage to 1, and Boinc was applying that wrt task scheduling, but just incorrectly displaying 0.685, which (along with the log file following a read of the config files) led me to believe app_config was no longer being applied. I think it was, because after that GPUGrid WU crashed (probably due to opening GPU-Z) the next GPUGrid WU said it was using 1 CPU. Where did it get that from, and why did it start running 8 CPU WU's after resetting? Something isn't being updated, at least until Boinc is restarted, or the config file is read several times, or the project is reset? I don't think having to reset the project to un-apply app_config settings was part of the plan. So that part isn't working.

I'm seeing other anomalies in the WU properties too. The POGS WU's say they have used 1 to 4 minutes of CPU time and yet have run for 32 min. Does that indicate that we can't go by any info being displayed, or is it POGS-specific?

This WU suffered an error when I opened GPU-Z:

    I4R13-NATHAN_dhfr36_5-20-32-RND3729_0 4423712 4 May 2013 | 20:57:09 UTC 5 May 2013 | 1:24:20 UTC Error while computing 7,166.53 7,166.53 --- Long runs (8-12 hours on fastest card) v6.18 (cuda42)
|
Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]()
|
This is interesting info, but not exactly conclusive, as I believe there are certain bugs whenever you: - Switch the CPU_Usage in an app_config, without restarting BOINC - Remove the app_config, without restarting BOINC So... Let me pose it to you this way. With BOINC set to use 100% CPUs, and GPUGrid's CPU_Usage set to a value that is less than 1.000 (either because it is its default value, or because you have an app_config specifying a CPU_Usage value less than 1).... on a restart of BOINC, has BOINC ever started fewer than 8 CPU tasks? From how I understand BOINC task scheduling, the answer should be "no, it always starts 8 CPU tasks." If that's not the case, please provide screenshots again, clearer if possible. I appreciate your help; I do want to make sure the information I give about an app_config's CPU_Usage setting, and about BOINC task scheduling, is all accurate, and I still claim it is. Thanks, Jacob |
skgivenSend message Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
This is interesting info, but not exactly conclusive, as I believe there are certain bugs whenever you: Yes, there are reporting anomalies, and what is being reported by Boinc Manager isn't reliable under certain conditions (without a Boinc restart). Obviously if you change app_config settings, you want BM to correctly tell you what the settings are without having to restart Boinc or go to the app_config file and work them out. However, while this misreporting has been rather unhelpful, app_config is just a set of scheduler settings. Boinc is reading the file and applying what settings it can, but not to running WU's, just to new WU's, because scheduling settings are not retroactive; if a WU has already started, the previous settings have already been applied. To apply new settings a Boinc restart seems to be required. I would call this an undesirable feature. This might also depend on what slots WU's start/restart in? In my case the number of slots Boinc has at its disposal has risen. At present I have 12, because at one stage I was trying to run 4 WU's across the GPU's. Think about this: if you don't use any app_config files and try to run 1 WU on each of 2 GPU's, Boinc will launch 9 WU's (dropping one of the CPU WU's if it was running 8). Agreed? If you stipulate in an app_config file to run 2 WU's on each of 2 GPU's and that these WU's each use 1 CPU, Boinc will again launch 9 WU's (4 GPU WU's and 5 CPU WU's). Agreed? When you stipulate in an app_config file to run 2 WU's on each of 2 GPU's but that these WU's each only use 0.001 CPU's, Boinc will launch 12 WU's (4 GPU WU's and 8 CPU WU's). Do you agree? In your initial setup (running 2 GPUGrid WU's) you were doing just that, running two WU's on each of your GPU's. You said your second GPU ran (at least at one time) 2 WCG tasks, and these used a full CPU core each, and we know the Long GPUGrid WU's also use 1 core on your GTX660Ti. Now that's Over-Saturation of the CPU, in a big way! 
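For readers following along, the file being discussed is app_config.xml in the project's directory under the BOINC data folder. A sketch matching the third scenario above (2 WU's per GPU, 0.001 CPUs each); the app name shown here (acemdlong) is an assumption and would need to match the project's actual application name:

```xml
<app_config>
  <app>
    <name>acemdlong</name>            <!-- assumed; must match the project's app name -->
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>      <!-- 0.5 of a GPU per task = 2 tasks per GPU -->
      <cpu_usage>0.001</cpu_usage>    <!-- the scheduler-only CPU reservation under debate -->
    </gpu_versions>
  </app>
</app_config>
```

As discussed above, these values only affect how many tasks Boinc schedules; they don't throttle what a running task actually consumes.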
I think this could be the source of your problems. If the CPU is that over-committed, failures are almost inevitable. However, what failures occur depends on everything that is running, and on your setup. So in my opinion that would need to be replicated; there is no point in randomly running CPU and GPU projects. Again, we need to agree to crunch specific WU's. Even then, having different setups might mean replicating the problem is impossible. I also think there is no point testing with one GPU and using app_config; it's not going to run any more WU's and thus it's not going to test anything. You need to test with 2 GPU's, and one of them would need to be running at least 2 WU's. Possibly running two apps on the one GPU would be similar enough, but I don't know. If these issues only occurred when running HCC on your second GPU, then you will never replicate the problem, nor will anyone else. So... Let me pose it to you this way. With BOINC set to use 100% CPUs, and GPUGrid CPU_Usage is set to a value that is less than 1.000 (either because it is its default value, or because you have an app_config specifying a CPU_Usage value less than 1).... on a restart of BOINC, has BOINC ever started less than 8 CPU tasks? From how I understand BOINC task scheduling, the answer should be "no, it always starts 8 CPU tasks." If that's not the case, please provide screenshots again, clearer if possible. Under these conditions it's behaving the way you would expect. If the issues are due to CPU over-saturation, then using app_config to run 2 GPUGrid WU's on the 3GB card won't cause problems so long as the CPU isn't over-saturated. Thus I would recommend testing that (seeing as the original goal was to do more work), with cpu_usage of 1.0. However, I would suggest limiting the CPU in BM to 99%. That way you are guaranteed not to over-commit the CPU any other way, and are thus testing whether the cause is using app_config rather than CPU saturation. 
If you get any repeat of the same problem then stop, and report it. I think there is no point in anyone else trying this, unless they have a 3GB Kepler. I already tried, and while I could run the short WU's (just without seeing any real advantage), the Long tasks caused a massive slowdown in task turnover and really bad interface problems (GPU downclocking, system lockups, driver crashes). FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help |
|
Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]()
|
This is interesting info, but not exactly conclusive, as I believe there are certain bugs whenever you: Yep. That's why I usually recommend a restart anytime app_config is changed. If you don't use any app_config files and try to run 1 WU on each of 2 GPU's, Boinc will launch 9 WU's (dropping one of the CPU WU's if it was running 8). Agreed? Agreed. Without WCG HCC GPU, and GPUGrid using the project default values, my PC will run: - 1 GPUGrid.net task on the GTX 660 Ti (0.729 CPU + 1 NVIDIA) - 1 GPUGrid.net task on the GTX 460 (0.729 CPU + 1 NVIDIA) - 7 CPU tasks (because [0.729 + 0.729 + 7] < [8 + 1]) If you stipulate in an app_config file to run 2 WU's on each of 2 GPU's and that these WU's each use 1 CPU, Boinc will again launch 9 WU's (4 GPU WU's and 5 CPU WU's). Agreed? Disagree. Without WCG HCC GPU, and GPUGrid set up as you suggest, my PC will run: - 2 GPUGrid.net tasks on the GTX 660 Ti (each: 1 CPU + 0.500 NVIDIA) - 2 GPUGrid.net tasks on the GTX 460 (each: 1 CPU + 0.500 NVIDIA) - 4 CPU tasks (because [2 + 2 + 4] < [8 + 1], but [2 + 2 + 5] !< [8 + 1]) When you stipulate in an app_config file to run 2 WU's on each of 2 GPU's but that these WU's each only use 0.001 CPU's, Boinc will launch 12 WU's (4 GPU WU's and 8 CPU WU's). Do you agree? Agree. Without WCG HCC GPU, and GPUGrid set up as you suggest, my PC will run: - 2 GPUGrid.net tasks on the GTX 660 Ti (each: 0.001 CPU + 0.500 NVIDIA) - 2 GPUGrid.net tasks on the GTX 460 (each: 0.001 CPU + 0.500 NVIDIA) - 8 CPU tasks (because [0.002 + 0.002 + 8] < [8 + 1]) In your initial setup (running 2 GPUGrid WU's) you were doing just that, running two WU's on each of your GPU's. You said your second GPU ran (at least at one time) 2 WCG tasks, and these used a full CPU core each, and we know the Long GPUGrid WU's also use 1 core on your GTX660Ti. Now that's Over-Saturation of the CPU, in a big way! I think this could be the source of your problems. If the CPU is that over-committed, failures are almost inevitable. 
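The arithmetic in the three scenarios above can be sketched in code. This is a toy model, not BOINC's actual scheduler: it assumes the rule described in this thread, namely that GPU tasks always run and CPU tasks keep starting while the total of declared cpu_usage values would stay below ncpus + 1.

```python
def tasks_started(ncpus, gpu_task_cpu_usages, cpu_tasks_queued):
    """Return (gpu_tasks, cpu_tasks) started under the assumed rule.

    GPU tasks run unconditionally, committing their declared cpu_usage.
    CPU tasks (1.0 CPU each) then start while the committed total
    would remain below ncpus + 1.
    """
    committed = sum(gpu_task_cpu_usages)  # GPU tasks' declared CPU reservations
    cpu_started = 0
    while cpu_started < cpu_tasks_queued and committed + 1 < ncpus + 1:
        committed += 1
        cpu_started += 1
    return len(gpu_task_cpu_usages), cpu_started

# The three scenarios above, on an 8-thread CPU:
print(tasks_started(8, [0.729, 0.729], 8))  # defaults, 1 WU/GPU -> (2, 7)
print(tasks_started(8, [1, 1, 1, 1], 8))    # cpu_usage=1.0, 2 WU/GPU -> (4, 4)
print(tasks_started(8, [0.001] * 4, 8))     # cpu_usage=0.001, 2 WU/GPU -> (4, 8)
```

Note the model reproduces the "Disagree" above: with cpu_usage=1.0 on 4 GPU tasks it starts only 4 CPU tasks (8 WU's total), not 5.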
However what failures occur depends on everything that is running, and your setup. So in my opinion that would need to be replicated; there is no point in randomly running CPU and GPU projects. Again, we need to agree to crunch specific WU's. Even then having different setups might mean replicating the problem is impossible. We can only try to replicate the issue. I still claim that "CPU oversaturation" is not the cause of this issue. Right now I'm running 0.001 on both GPUs, 1 task per GPU, and 8 CPU tasks... and not having a problem [yet]. What'll really make you cringe is that I'm considering using the cc_config option for <ncpus> to make BOINC think I have 12 or 16 CPUs. If your theory is correct, I'd see more problems, but if my theory is correct, tasks will still complete but there will be increased resource contention. Do you think that seems like a good test for me to do? If these issues only occurred when running HCC on your second GPU, then you will never replicate the problem, nor will anyone else. Yeah, that bugs me, but that may be the case. I hope to recreate the problem soon, but I'm not sure I can. So... Let me pose it to you this way. With BOINC set to use 100% CPUs, and GPUGrid CPU_Usage is set to a value that is less than 1.000 (either because it is its default value, or because you have an app_config specifying a CPU_Usage value less than 1).... on a restart of BOINC, has BOINC ever started less than 8 CPU tasks? From how I understand BOINC task scheduling, the answer should be "no, it always starts 8 CPU tasks." If that's not the case, please provide screenshots again, clearer if possible. Under these conditions it's behaving the way you would expect. That's what I expected. Believe it or not, I really do know what I'm talking about. I try not to guess. If the issues are due to CPU over-saturation, then using app_config to run 2 GPUGrid WU's on the 3GB card won't cause problems so long as the CPU isn't over-saturated. 
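For anyone wanting to run the same over-commitment test, the `<ncpus>` override mentioned above lives in cc_config.xml in the BOINC data directory; it takes effect after a client restart (or a "Read config file" from the Advanced menu). A minimal sketch:

```xml
<cc_config>
  <options>
    <!-- Pretend the host has 12 logical CPUs; -1 restores the detected count -->
    <ncpus>12</ncpus>
  </options>
</cc_config>
```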
Thus I would recommend testing that (seeing as the original goal was to do more work), with cpu_usage of 1.0. However, I would suggest limiting the CPU in BM to 99%. That way you are guaranteed not to over-commit the CPU any other way, and are thus testing whether the cause is using app_config rather than CPU saturation. If you get any repeat of the same problem then stop, and report it. Thanks for the suggestions; you and I sure do think differently. Right now, because WCG HCC GPU tasks are done, I need to re-test to see if I can reproduce the issue. And, if the re-test will run GPUGrid on both GPUs, then the re-test cannot involve 2-on-1-card, since the GTX 460 doesn't have enough memory. So, I've got WCG set for No-New-Tasks (just to make sure I don't get any rogue WCG HCC GPU tasks), and I want TONS of stress on the system. I'm using 0.001 CPU, with BOINC set to use 100% CPUs, such that right now I'm running 2 GPUGrid tasks alongside 8 CPU tasks, which is already over-committed. But, as I said, I want TONS of stress. So, I'm going to use <ncpus> and see if I can add additional stress, probably setting it to 12, to run 12 CPU tasks. My goal right now is to recreate the problem. I think there is no point in anyone else trying this, unless they have a 3GB Kepler. I already tried, and while I could run the short WU's (just without seeing any real advantage), the Long tasks caused a massive slowdown in task turnover and really bad interface problems (GPU downclocking, system lockups, driver crashes). Right now, I still claim that CPU oversaturation is not a constraint. In fact, I claim that the only real constraint on 2-at-a-time is memory usage, as documented in the Performance thread. 
So, I would say that: - if the user will be running a particular GPUGrid application on a card that is less than 2GB, then I'd recommend against 2-at-a-time for that GPUGrid application - if the user will be running a particular GPUGrid application only on cards that are 2GB or higher, it would be worth testing 2-at-a-time on each GPU, for increased GPU Usage and a higher throughput. I'm sure you disagree with my recommendations, but any future discussion on it should be in the other thread. Kind regards, Jacob |
|
Send message Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
To me it looks like the settings in the app_config are applied to the current WU and remain in effect even if the app_config is removed and BOINC restarted (no project reset). That's exactly what I did the other day, and I still got the "0.001 CPUs used" as you did. However, the next morning I was back to 0.7 without doing anything except letting my PC finish some work. For me this is all I need to know about this issue. And incidentally, I made this switch to check for myself if I had 8 Einsteins running along GPU-Grid even without the app_config. I had actually written a paragraph stating that it wasn't so (expecting to get 7+1), then added a disclaimer that this was out of my head and that I couldn't test right now.. and then thought "oh what the heck, let's do the bloody test..". And, much to my surprise, found I was also getting 8+1 tasks in this case. I suspect the issue Jacob found might have something to do with his setup (hardware, OS, drivers, BIOS and whatever) or the other CUDA tasks he was running along GPU-Grid. It could, for example, happen that some buffer in some library/CUDA-function being used by both projects wasn't properly (re-)initialized. Ideally such things should not happen, but we all know there are still quite a few bugs being found in CUDA. What I suggest as the next steps: - I keep testing with the app_config, just to make sure - SK doesn't test, at least not on his main rig (the instability you mentioned shouldn't happen due to more CPU tasks alone.. but I guess your rig wouldn't listen to me) - Jacob tries to reproduce the error without the WCG HCC GPU units. Throw everything into the mix except 2 concurrent WUs, including more "logical" CPUs. I doubt this will make any difference.. but let's try. CPU throughput will degrade somewhat due to more context switches and cache contention.. but that shouldn't be a major issue unless you currently run the Pentathlon. If neither of us can provoke the error within.. 
say a few weeks, we're left to conclude that it was the combination with WCG HCC, and maybe also the generational jump between Jacob's GPUs. MrS Scanning for our furry friends since Jan 2002 |
©2025 Universitat Pompeu Fabra