Problem with PABLO tasks
| Author | Message |
|---|---|
robertmiles Joined: 16 Apr 09 Posts: 503 Credit: 769,991,668 RAC: 0
|
I'm seeing a problem with tasks that have PABLO in their names. More than half complete without problems, but the remainder seem to go into an endless loop that stops them from writing any more checkpoints or making any more progress. Estimated remaining time eventually drops to zero without the progress percentage changing. Workaround that lets progress resume: suspend the task for at least one minute, then resume it. Expect most of the elapsed time to be lost when this is done, but progress then resumes. A task where this happened: http://www.gpugrid.net/result.php?resultid=16421650 Computer where this happened: http://www.gpugrid.net/show_host_detail.php?hostid=422382 A wingmate for this workunit got this error: "The simulation has become unstable. Terminating to avoid lock-up (1)" I've seen the problem on one or two tasks before, but did not save enough information about those tasks to tell you which ones. |
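The suspend-and-resume workaround described above could be scripted. The following is only a sketch, assuming the `boinccmd` command-line tool that ships with BOINC is on the PATH and that its `--task <project URL> <task name> suspend|resume` form applies; the project URL and task name in the usage note are placeholders, not values confirmed in this thread.

```python
import subprocess
import time


def build_task_command(project_url, task_name, operation):
    """Build the boinccmd argument list for one task operation."""
    if operation not in ("suspend", "resume"):
        raise ValueError("operation must be 'suspend' or 'resume'")
    return ["boinccmd", "--task", project_url, task_name, operation]


def suspend_then_resume(project_url, task_name, pause_seconds=60,
                        runner=subprocess.run):
    """Suspend a task, wait, then resume it (the workaround from the post).

    `runner` is injectable so the sequence can be checked without BOINC.
    """
    runner(build_task_command(project_url, task_name, "suspend"), check=True)
    time.sleep(pause_seconds)  # the post suggests waiting at least one minute
    runner(build_task_command(project_url, task_name, "resume"), check=True)
```

A call might look like `suspend_then_resume("http://www.gpugrid.net/", "EXAMPLE_TASK_NAME")`, where both arguments are hypothetical and would need to match what the local client reports.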
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 318
|
I think this is more common on Windows 10. I haven't encountered it on Windows 7 yet (or on my single Windows 10 machine, come to think of it). |
robertmiles Joined: 16 Apr 09 Posts: 503 Credit: 769,991,668 RAC: 0
|
Another PABLO task that seems to go into an endless loop unless suspended, losing nearly a full day of compute time: http://www.gpugrid.net/result.php?resultid=16435516 http://www.gpugrid.net/workunit.php?wuid=12651837 Do these tasks have enough debugging enabled to show the cause of the endless loop? The slot directory does not appear to contain a text file showing anything relevant. Running under 64-bit Windows 10. The problem does not happen on all of the PABLO tasks. |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
This may be a shot in the dark, but ... Do you have anything BOINC-related, in the following folder: C:\Users\{username}\AppData\Local\VirtualStore\ I have seen some GPUGrid strangeness at one time, where I was playing with compatibility modes, and Windows created files in that "VirtualStore" folder that ... get used, instead of the normal (C:\Program Files\BOINC\) files. Worse yet, BOINC-related "VirtualStore" folders won't get properly cleaned by BOINC! In my case, at that time, my tasks were erroneously insta-completing. Anyway .. So, do you have anything in that "VirtualStore" folder? If you do have BOINC-related stuff in there, try closing BOINC then removing the BOINC-related stuff then restarting BOINC. |
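The VirtualStore check suggested above can be done by hand in Explorer; a scripted version might look like the sketch below. It assumes that "BOINC-related" can be approximated by a case-insensitive match on "boinc" in a file or folder name, and that `%LOCALAPPDATA%` points at `C:\Users\{username}\AppData\Local`. Both are assumptions, and the function only lists matches; it deletes nothing.

```python
import os
from pathlib import Path


def find_boinc_entries(virtual_store):
    """Return files and folders under virtual_store whose name contains 'boinc'."""
    root = Path(virtual_store)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("*") if "boinc" in p.name.lower())


if __name__ == "__main__":
    local_app_data = os.environ.get("LOCALAPPDATA")  # normally set on Windows only
    if local_app_data:
        for entry in find_boinc_entries(Path(local_app_data) / "VirtualStore"):
            print(entry)
```

If anything is listed, the advice in the post applies: close BOINC before removing those entries, then restart it.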
robertmiles Joined: 16 Apr 09 Posts: 503 Credit: 769,991,668 RAC: 0
|
Quote: "This may be a shot in the dark, but ..." There are a number of files there, but none appear to be BOINC-related. |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
What setting are you using for: "Use at most X% of CPU time" If you're not using 100%, can you try it and see if it fixes the problem? |
robertmiles Joined: 16 Apr 09 Posts: 503 Credit: 769,991,668 RAC: 0
|
Quote: "What setting are you using for:" 100% |
robertmiles Joined: 16 Apr 09 Posts: 503 Credit: 769,991,668 RAC: 0
|
Another task with the problem: http://www.gpugrid.net/result.php?resultid=16442636 http://www.gpugrid.net/workunit.php?wuid=12657605 I've bought a GTX 1080, and have told BOINC not to download any more GPU workunits for now so all of them can finish before I install the new graphics board tomorrow. |
robertmiles Joined: 16 Apr 09 Posts: 503 Credit: 769,991,668 RAC: 0
|
Now using the GTX 1080, which appears to have stopped the problem. Some PABLO tasks run for an unexpectedly long time now, but they finish and verify properly. |
robertmiles Joined: 16 Apr 09 Posts: 503 Credit: 769,991,668 RAC: 0
|
Another PABLO task that apparently went into an endless loop: http://www.gpugrid.net/workunit.php?wuid=12708240 http://www.gpugrid.net/result.php?resultid=16504434 It reached 55.160% progress, 1d 10:29:13 elapsed, --- remaining, with no change to these numbers other than elapsed for at least 12 hours. Using a GTX 1080 with the 385.28 driver, an i7-5980X CPU, and BOINC 7.6.33. I suspended it and installed the 385.41 driver (without the 3D sections). It now indicates 55.160% progress, 03:38:48 elapsed, 04:03:15 remaining. Running has not resumed yet; BOINC appears to be catching up on other GPU work. This suggests that the problem occurs soon after writing a checkpoint, but before whatever produces the next progress increase. Resuming from a checkpoint does not appear to trigger the problem. No error messages appeared on the screen, or in any file in the slot directory that looked likely to be a non-empty text file. |
robertmiles Joined: 16 Apr 09 Posts: 503 Credit: 769,991,668 RAC: 0
|
Quote: "Another PABLO task that apparently went into an endless loop:" It finally resumed from the checkpoint and started updating the progress percentage about once a second. The task completed within a few hours, was reported, and has now validated. This suggests that enabling debug output between resuming from the checkpoint file and the first update of the progress percentage would allow comparing the failed first try to the second try that worked better. |
|
Joined: 20 Nov 13 Posts: 21 Credit: 480,846,415 RAC: 0
|
I have also had this problem using my 1070, on Win7. GPUgrid tasks will compute to halfway or so and then stop for hours. The card is dedicated to GPUgrid only so no other projects are competing with it for compute time. The core load of the card indicates it is just sitting idle. Exiting BOINC and restarting it will resume computation on the task, as will suspending it and then resuming. It's happening pretty frequently, every 1-2 days. I have the most recent BOINC version, but I have not updated graphics drivers in a while so my next step is to try that. |
|
Joined: 22 Nov 09 Posts: 114 Credit: 589,114,683 RAC: 0
|
Quote: "What setting are you using for:" Not sure whether this will help, but I just got a 1060 up last night on Win 10 x64, and noted that while the elapsed time kept incrementing, the percent done stopped. This was after it had crunched about 2.5 percent of the WU. I looked into the BOINC client log and there was a message in there that said "CPU busy, suspending work" or something like that. I was using my computer at the time for non-CPU-intensive things like running my web browser. I checked "Suspend work when non-BOINC CPU usage is above" in "When and how BOINC uses your computer" under "Preferences" on my account page and noted it was set to 80%. I then set it to 0, which means to run BOINC projects all the time regardless of host CPU usage. I suggest checking that setting. I then did an Update on GPUGrid, and it still did not restart. So I exited BOINC and restarted, and the problem seemed to disappear: I was still using my computer, but the task did not suspend and ran to completion in a timely manner. Perhaps this is coincidental; I don't know. However, I have a 6-core processor and there is no way that non-BOINC total core/thread usage was 80% or above at the time, unless the browser briefly spun up 11 or 12 threads. The next time I get a PABLO, I will watch for this again. If it is not coincidental and the job of checking that setting is in each individual client's code, then perhaps there is a bug in that code with PABLO units. If the code is in BOINC, then perhaps there is a bug there. |
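As a rough check on the 80% figure above, the arithmetic can be written out. This sketch assumes the threshold is measured against the combined capacity of all logical CPUs, and that the 6-core processor exposes 12 hardware threads; neither is confirmed in the thread.

```python
def busy_threads_needed(logical_cpus, threshold_pct):
    """Fully loaded threads needed for non-BOINC usage to reach the threshold."""
    return logical_cpus * threshold_pct / 100.0


# Assumed: 6 cores with two hardware threads each, threshold set to 80%.
needed = busy_threads_needed(12, 80)
print(f"about {needed:.1f} fully loaded threads")
```

Under those assumptions, roughly ten threads would have to be saturated by non-BOINC programs, which supports the point that ordinary web browsing should not trip the threshold.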
robertmiles Joined: 16 Apr 09 Posts: 503 Credit: 769,991,668 RAC: 0
|
[snip] I don't see a "Suspend work when non-BOINC CPU usage is above" setting, but I would have set it to off. My computer has 8 physical cores plus hyperthreading, which makes it behave like it has 16 cores, and I've found that limiting the number of cores BOINC can use gives better results than limiting the percentage of CPU time it can use on each core. 14 cores are allowed for CPU tasks, leaving one for GPU tasks and one for non-BOINC programs. Using BOINC 7.6.33 under Windows 10. |
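BOINC's computing preferences express the core limit as a percentage ("Use at most N% of the CPUs"), so the 14-of-16 arrangement described above translates as sketched below. The helper name is hypothetical, and how the client rounds fractional results is not covered here.

```python
def percent_of_cpus(allowed, total):
    """Percentage of logical CPUs corresponding to `allowed` out of `total`."""
    if not 0 < allowed <= total:
        raise ValueError("allowed must be between 1 and total")
    return 100.0 * allowed / total


# 8 physical cores with hyperthreading = 16 logical CPUs; 14 allowed for CPU tasks.
print(percent_of_cpus(14, 16))
```

For this host the value works out to 87.5%, leaving two logical CPUs free for GPU task support and non-BOINC programs.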
|
Joined: 22 Nov 09 Posts: 114 Credit: 589,114,683 RAC: 0
|
Interesting to know of your experience, and it is also interesting that you do not see this setting. For me, it is under: 1. My Account 2. Preferences section 3. "When and how BOINC uses your computer" 4. Click on "Computing Preferences", which is on the same line as "When and how BOINC uses your computer" 5. "Processor Usage" 6. In the "Processor Usage" section there is an entry "Suspend work when non-BOINC CPU usage is above". If you observe this problem again, I suggest checking BOINC's Activity Log for a message similar to the one I found. To me, finding the same message would be a strong indicator that the same thing is happening on your system. |
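The same setting can also end up in the client's local preference files. The sketch below parses XML shaped like BOINC's `global_prefs_override.xml` and flags a non-zero "suspend when non-BOINC CPU usage is above" value. The tag names `suspend_cpu_usage`, `cpu_usage_limit`, and `max_ncpus_pct` are the ones BOINC uses for these preferences as far as I know; the sample values are illustrative, not taken from any poster's machine.

```python
import xml.etree.ElementTree as ET

# Illustrative content in the shape of global_prefs_override.xml.
SAMPLE_PREFS = """<global_preferences>
  <suspend_cpu_usage>80.0</suspend_cpu_usage>
  <cpu_usage_limit>100.0</cpu_usage_limit>
  <max_ncpus_pct>87.5</max_ncpus_pct>
</global_preferences>"""

TAGS = ("suspend_cpu_usage", "cpu_usage_limit", "max_ncpus_pct")


def read_cpu_prefs(xml_text):
    """Return the CPU-related preferences as floats (None when a tag is absent)."""
    root = ET.fromstring(xml_text)
    prefs = {}
    for tag in TAGS:
        node = root.find(tag)
        prefs[tag] = float(node.text) if node is not None and node.text else None
    return prefs


def warnings_for(prefs):
    """List the settings this thread suggests double-checking."""
    notes = []
    suspend = prefs.get("suspend_cpu_usage")
    if suspend:  # 0 (or absent) means never suspend for non-BOINC CPU usage
        notes.append(f"work suspends when non-BOINC CPU usage exceeds {suspend}%")
    limit = prefs.get("cpu_usage_limit")
    if limit is not None and limit < 100.0:
        notes.append(f"CPU time is throttled to {limit}%")
    return notes


for note in warnings_for(read_cpu_prefs(SAMPLE_PREFS)):
    print(note)
```

Reading the real file would mean passing its contents to `read_cpu_prefs`; where the file lives depends on the local BOINC data directory.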
robertmiles Joined: 16 Apr 09 Posts: 503 Credit: 769,991,668 RAC: 0
|
Quote: "Interesting to know of your experience, and it is also interesting that you do not see this setting." I finally found it. It was turned on, so I turned it off. |
|
Joined: 22 Nov 09 Posts: 114 Credit: 589,114,683 RAC: 0
|
Quote: "Interesting to know of your experience, and it is also interesting that you do not see this setting." Awesome! So we may have found the problem where these tasks seem to suspend and then not resume. I was reading another thread, and it seems like it may not be specific to PABLO tasks. It will be interesting to know if you see it again. If I do, I will post to this thread. |
|
Joined: 5 Jul 15 Posts: 2 Credit: 135,260,724 RAC: 0
|
I have had this problem many times. Win 10, latest Nvidia driver, etc. The GPU runs out of work while the process is still active. I ticked the task_debug option in the message (event log) options. The output is:
12.11.2017 16:11:47 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
12.11.2017 16:11:54 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
12.11.2017 16:12:01 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
12.11.2017 16:12:08 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
12.11.2017 16:12:15 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
12.11.2017 16:12:24 | | Suspending computation - CPU is busy
12.11.2017 16:12:24 | GPUGRID | [cpu_sched] Preempting e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 (left in memory)
12.11.2017 16:12:24 | GPUGRID | [task] task_state=SUSPENDED for e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 from suspend
12.11.2017 16:12:44 | | Resuming computation
12.11.2017 16:12:44 | GPUGRID | [cpu_sched] Resuming e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0
12.11.2017 16:12:44 | GPUGRID | [task] task_state=EXECUTING for e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 from unsuspend
12.11.2017 16:12:48 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
12.11.2017 16:13:04 | | Suspending computation - CPU is busy
12.11.2017 16:13:04 | GPUGRID | [cpu_sched] Preempting e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 (left in memory)
12.11.2017 16:13:04 | GPUGRID | [task] task_state=SUSPENDED for e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 from suspend
12.11.2017 16:13:06 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
12.11.2017 16:13:12 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
12.11.2017 16:13:14 | | Resuming computation
12.11.2017 16:13:14 | GPUGRID | [cpu_sched] Resuming e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0
12.11.2017 16:13:14 | GPUGRID | [task] task_state=EXECUTING for e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 from unsuspend
12.11.2017 16:19:54 | | Suspending GPU computation - user request
12.11.2017 16:19:54 | GPUGRID | [cpu_sched] Preempting e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 (removed from memory)
12.11.2017 16:19:54 | GPUGRID | [task] task_state=QUIT_PENDING for e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 from request_exit()
12.11.2017 16:19:54 | | request_exit(): PID 13004 has 1 descendants
12.11.2017 16:19:54 | | PID 5096
12.11.2017 16:20:02 | | Resuming GPU computation
12.11.2017 16:20:55 | GPUGRID | [task] quit request timed out, killing task e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0
12.11.2017 16:20:56 | GPUGRID | [task] Process for e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 exited, exit code 0, task state 8
12.11.2017 16:20:56 | GPUGRID | [task] task_state=UNINITIALIZED for e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 from handle_exited_app
12.11.2017 16:20:56 | GPUGRID | [task] task_state=EXECUTING for e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 from start
12.11.2017 16:20:56 | GPUGRID | [cpu_sched] Restarting task e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 using acemdlong version 918 (cuda80) in slot 0
12.11.2017 16:21:06 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
12.11.2017 16:21:13 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
12.11.2017 16:21:20 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
12.11.2017 16:21:27 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
12.11.2017 16:21:35 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
12.11.2017 16:21:42 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
At the second resume, at 16:13, the GPU performance was lost. I stopped GPU usage at 16:19 and resumed it at 16:20, and the job worked again. The difference I found was that the task was removed from memory (16:19). Too bad that there is no solution. There is also a checkpoint every 15 seconds; I think it's usually every 5-15 min. |
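A stall like the one in this log shows up as a long gap between consecutive "checkpointed" messages. The sketch below scans event-log text in the `DD.MM.YYYY HH:MM:SS | project | message` layout seen above and reports such gaps. The sample reuses timestamps from the log with the task name shortened to the placeholder `TASK`, and the 60-second threshold is an arbitrary assumption.

```python
from datetime import datetime

# Abbreviated excerpt: timestamps from the log above, task name shortened to TASK.
SAMPLE_LOG = """\
12.11.2017 16:13:06 | GPUGRID | [task] result TASK checkpointed
12.11.2017 16:13:12 | GPUGRID | [task] result TASK checkpointed
12.11.2017 16:13:14 | | Resuming computation
12.11.2017 16:19:54 | | Suspending GPU computation - user request
12.11.2017 16:21:06 | GPUGRID | [task] result TASK checkpointed
12.11.2017 16:21:13 | GPUGRID | [task] result TASK checkpointed
"""


def checkpoint_times(log_text):
    """Return the timestamps of all 'checkpointed' lines, in log order."""
    times = []
    for line in log_text.splitlines():
        parts = [part.strip() for part in line.split("|", 2)]
        if len(parts) == 3 and parts[2].endswith("checkpointed"):
            times.append(datetime.strptime(parts[0], "%d.%m.%Y %H:%M:%S"))
    return times


def checkpoint_gaps(log_text, threshold_seconds=60):
    """Return (start, end, seconds) for checkpoint gaps above the threshold."""
    times = checkpoint_times(log_text)
    gaps = []
    for earlier, later in zip(times, times[1:]):
        seconds = (later - earlier).total_seconds()
        if seconds > threshold_seconds:
            gaps.append((earlier, later, seconds))
    return gaps


for start, end, seconds in checkpoint_gaps(SAMPLE_LOG):
    print(f"no checkpoint from {start:%H:%M:%S} to {end:%H:%M:%S} ({seconds:.0f} s)")
```

On this excerpt the only gap above the threshold runs from 16:13:12 to 16:21:06, which matches the period during which the poster reports the GPU sat idle.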
robertmiles Joined: 16 Apr 09 Posts: 503 Credit: 769,991,668 RAC: 0
|
hsdecalc, I haven't seen the problem lately. My recent changes include installing BOINC 7.8.3, and setting the number of CPU cores BOINC is allowed to use to two less than the total number present. |
robertmiles Joined: 16 Apr 09 Posts: 503 Credit: 769,991,668 RAC: 0
|
hsdecalc, another recent change, probably after the last time I saw the problem. Note that you must be in the advanced view, not the simple view, to follow the directions below. Click on View to start changing which view you have. Quote: "Interesting to know of your experience, and it is also interesting that you do not see this setting." |
©2025 Universitat Pompeu Fabra