Message boards : Number crunching : Long runs stopping for no apparent reason

Azmodes
Joined: 7 Jan 17
Posts: 34
Credit: 1,371,429,518
RAC: 0
Message 47900 - Posted: 20 Sep 2017 | 18:10:06 UTC
Last modified: 20 Sep 2017 | 18:56:27 UTC

I'm running long runs on this machine and they just cease all activity after 15% or so. It's not a computation error and the task isn't paused for another higher-priority task. It just stops, but the run time keeps ticking and the task is still listed as running. The progress bar doesn't change, and there is no GPU or CPU activity. Suspending and restarting the task seems to help temporarily, but that's obviously not a good solution if I have to micro-manage each run a hundred times. EDIT: Okay, it resumes by itself after 5-10 minutes, only to stop again shortly after.

This one runs the project on a GTX 1050, all going smoothly.

This is the output on the task page for a start-stop WU I ended up aborting; it doesn't seem helpful:

<core_client_version>7.6.33</core_client_version>
<![CDATA[
<message>
aborted by user
</message>
<stderr_txt>
# GPU [GeForce GTX 1060 6GB] Platform [Windows] Rev [3212] VERSION [80]
# SWAN Device 0 :
# Name : GeForce GTX 1060 6GB
# ECC : Disabled
# Global mem : 6144MB
# Capability : 6.1
# PCI ID : 0000:01:00.0
# Device clock : 1708MHz
# Memory clock : 4004MHz
# Memory width : 192bit
# Driver version : r384_00 : 38494
# GPU 0 : 48C
# GPU 0 : 53C
# GPU 0 : 54C
# GPU 0 : 56C
# GPU 0 : 59C
# GPU 0 : 61C
# GPU 0 : 63C
# GPU 0 : 64C
# GPU 0 : 65C
# GPU 0 : 66C

</stderr_txt>
]]>


The event log has this:
20/09/2017 20:12:24 | GPUGRID | Task e33s10_e19s1p0f163-ADRIA_FOLDA3D_crystal_ss_contacts_50_a3D_1-0-1-RND9573_0 exited with zero status but no 'finished' file
20/09/2017 20:12:24 | GPUGRID | If this happens repeatedly you may need to reset the project.

I'm pretty sure I already tried resetting it the last time I had this issue (and just switched to other projects for this computer, because it didn't work).

Maybe related or identical to this? In my case, however, progress neither drops after resuming nor gets stuck at a fixed value. It is going forward, just with weird pauses. Is this normal? I don't think I've ever seen it with my other Pascal card.

wiyosaya
Joined: 22 Nov 09
Posts: 114
Credit: 589,114,683
RAC: 0
Message 47902 - Posted: 20 Sep 2017 | 20:20:02 UTC - in response to Message 47900.

This may be related to the thread "Problem with Pablo Tasks". Specifically, the post linked at the end of that thread may provide a solution for you.
Azmodes
Joined: 7 Jan 17
Posts: 34
Credit: 1,371,429,518
RAC: 0
Message 47903 - Posted: 20 Sep 2017 | 21:42:58 UTC
Last modified: 20 Sep 2017 | 21:44:27 UTC

Thanks. I did try this earlier, though, in that I checked whether the task would freeze after being suspended due to CPU use. But it continued normally after everything else resumed.

Still, I've set that to no restrictions and will check if it occurs again. The task is now at around 61% and running normally.

Azmodes
Joined: 7 Jan 17
Posts: 34
Credit: 1,371,429,518
RAC: 0
Message 47908 - Posted: 21 Sep 2017 | 13:19:50 UTC
Last modified: 21 Sep 2017 | 13:21:43 UTC

Mkay, it hasn't happened since, and the last task with this issue ran to completion. I now have two tasks running on the same GPU and have set up an alert that tells me whenever the GPU load drops to near zero for a longer period. Nothing so far. Perhaps telling BOINC never to suspend tasks automatically really was the solution. Or maybe it was running multiple WUs.
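
For anyone wanting a similar alert, here is a rough Python sketch of one way to do it. It is not the tool used above; it assumes an NVIDIA driver with `nvidia-smi` on the PATH, and the function names, the 5% threshold, and the sample counts are made up for illustration.

```python
import subprocess


def find_idle_stretches(samples, threshold=5, min_length=3):
    """Return (start, end) index pairs, end exclusive, for every run of at
    least `min_length` consecutive samples at or below `threshold` percent."""
    samples = list(samples)
    stretches = []
    start = None
    for i, value in enumerate(samples):
        if value <= threshold:
            if start is None:
                start = i  # a low-load run begins here
        else:
            if start is not None and i - start >= min_length:
                stretches.append((start, i))
            start = None
    # Close a run that lasts until the final sample.
    if start is not None and len(samples) - start >= min_length:
        stretches.append((start, len(samples)))
    return stretches


def read_gpu_utilization(gpu_index=0):
    """Read the current GPU load in percent via nvidia-smi.
    Assumption: nvidia-smi is installed and reachable on the PATH."""
    result = subprocess.run(
        [
            "nvidia-smi",
            f"--id={gpu_index}",
            "--query-gpu=utilization.gpu",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip().splitlines()[0])
```

Calling `read_gpu_utilization()` once a minute and passing the collected values to `find_idle_stretches` with `min_length=5` would flag stalls of roughly the 5-10 minute length described in the first post.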

By the way, what's the current stance on running multiple WUs per GPU? I was only getting a GPU usage of ~70% with one task (the 1050 Ti on my Linux machine gets ~95%), so I played around with the project config file. Now with two tasks it's up to ~87%, and extrapolating the durations, there does seem to be some improvement. I browsed a few threads here, but most of them were quite old, so I'm not sure if it's recommended nowadays. Thanks.
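
For reference, a minimal Python sketch that generates the kind of config this refers to, assuming "project config file" means BOINC's `app_config.xml` in the project directory. The helper name is hypothetical, and the app name `acemdlong` is an assumption; check the `<app>` entries in `client_state.xml` for the real name on your host.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, tasks_per_gpu, cpu_usage=1.0):
    """Build app_config.xml text that lets `tasks_per_gpu` tasks share one GPU.
    BOINC schedules 1 / gpu_usage tasks per GPU, so 2 tasks means 0.5."""
    if tasks_per_gpu < 1:
        raise ValueError("tasks_per_gpu must be at least 1")
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    gpu_versions = ET.SubElement(app, "gpu_versions")
    ET.SubElement(gpu_versions, "gpu_usage").text = str(1 / tasks_per_gpu)
    ET.SubElement(gpu_versions, "cpu_usage").text = str(cpu_usage)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # "acemdlong" is an assumed app name, used only for illustration.
    print(build_app_config("acemdlong", tasks_per_gpu=2))
```

The printed text would go into `app_config.xml` in the GPUGRID project folder, and the client has to re-read its config files (or be restarted) before the change applies.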
