Early WU Downloads

tomba
Message 34675 - Posted: 15 Jan 2014, 14:52:48 UTC - in response to Message 34670.  


15/01/2014 11:51:50 | | log flags: file_xfer, sched_ops, task, file_xfer_debug
15/01/2014 11:51:50 | | Libraries: libcurl/7.25.0 OpenSSL/1.0.1 zlib/1.2.6

The red line (the first one above, beginning "log flags:") lists the logging flags currently turned on. It looks like you turned on <file_xfer_debug> instead of <work_fetch_debug>.

cc_config.xml fixed. I really must listen to instructions!!
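
For anyone following along, here is a minimal sketch of generating a cc_config.xml that switches on the flag discussed here. The element names (cc_config, log_flags, work_fetch_debug) are the ones BOINC reads; the helper name build_cc_config is made up for illustration, and writing the result into your BOINC data directory is left to you.

```python
import xml.etree.ElementTree as ET


def build_cc_config(flags):
    """Return cc_config.xml text that switches on the given log flags."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        # Each flag is its own element; a value of 1 turns it on.
        ET.SubElement(log_flags, name).text = "1"
    ET.indent(root)  # pretty-print; needs Python 3.9 or later
    return ET.tostring(root, encoding="unicode")


print(build_cc_config(["work_fetch_debug"]))
```

Passing the wrong name (file_xfer_debug, say) produces an equally valid file, which is why the "log flags:" line at client start-up is worth checking.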

tomba
Message 34676 - Posted: 15 Jan 2014, 14:57:31 UTC

Here's the log of a work_fetch cycle. Is the line in red normal?

15/01/2014 15:56:42 | | [work_fetch] entering choose_project()
15/01/2014 15:56:42 | | [work_fetch] ------- start work fetch state -------
15/01/2014 15:56:42 | | [work_fetch] target work buffer: 180.00 + 0.00 sec
15/01/2014 15:56:42 | | [work_fetch] --- project states ---
15/01/2014 15:56:42 | GPUGRID | [work_fetch] REC 71611.135 prio -48.201093 can req work
15/01/2014 15:56:42 | | [work_fetch] --- state for CPU ---
15/01/2014 15:56:42 | | [work_fetch] shortfall 540.00 nidle 3.00 saturated 0.00 busy 0.00
15/01/2014 15:56:42 | GPUGRID | [work_fetch] fetch share 0.000 (no apps)
15/01/2014 15:56:42 | | [work_fetch] --- state for NVIDIA ---
15/01/2014 15:56:42 | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 30261.82 busy 0.00
15/01/2014 15:56:42 | GPUGRID | [work_fetch] fetch share 1.000
15/01/2014 15:56:42 | | [work_fetch] ------- end work fetch state -------
15/01/2014 15:56:42 | | [work_fetch] No project chosen for work fetch

Richard Haselgrove
Message 34677 - Posted: 15 Jan 2014, 15:22:14 UTC - in response to Message 34676.  

Here's the log of a work_fetch cycle. Is the line in red normal?

It's normal when

--- state for NVIDIA --- saturated 30261.82 [seconds]

is larger than

target work buffer: 180.00 + 0.00 sec[onds]

- in other words, you have enough work for now, and don't need any more.
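
The comparison above can be written down as a rule of thumb. This is a simplified sketch of the reasoning in this post, not BOINC's actual work-fetch code; the function name and the idle_instances parameter are assumptions made for illustration.

```python
def needs_more_work(saturated_secs, min_buffer_secs, idle_instances=0):
    """Rule of thumb from the post above, not BOINC's real implementation."""
    # Work is wanted if something sits idle, or if the queued work runs out
    # before the minimum buffer ("low water mark") is covered.
    return idle_instances > 0 or saturated_secs < min_buffer_secs


# Values from the NVIDIA section of tomba's log.
saturated = 30261.82
print(f"queued GPU work lasts about {saturated / 3600:.1f} hours")
print("needs more work:", needs_more_work(saturated, 180.00))
```

With 30261.82 seconds queued against a 180-second target, the function returns False, which matches the "No project chosen for work fetch" line.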

Jacob Klein
Message 34678 - Posted: 15 Jan 2014, 15:22:25 UTC - in response to Message 34676.  
Last modified: 15 Jan 2014, 15:33:24 UTC

Here's the log of a work_fetch cycle. Is the line in red normal?

Let's teach you how to read this.

target work buffer:
...says you need enough work to keep busy for at least 180 seconds. (That's the 3 minutes I was talking about earlier: even if you set min_buffer to 0, BOINC intentionally uses 3 minutes, since it could take around 3 minutes to ask projects for work.) This line also equates to "when getting work, try not to get much more than 180.00 + 0.00 seconds", which takes your max_addition_buffer setting into account. For reference, I use 0.1 days and 0.5 days for my buffer settings, so my line says: target work buffer: 8640.00 + 43200.00 sec

project states:
... GPUGrid is listed as "can req work". If you had it set for no new tasks, or suspended, it would be noted here, and then excluded from work fetch operations.

state for CPU:
shortfall 540 means that, in order to keep all your CPUs busy for that min_buffer setting, you'd need 540 instance-seconds of CPU work. nidle 3 means that you have 3 CPUs that are currently completely idle. (Note: this saddens me; it might prove beneficial to put those to work with some CPU projects.) Notice that the GPUGrid entry in that block says (no apps), meaning that the project told BOINC it doesn't have CPU apps, so BOINC won't ever request CPU work from it.

state for NVIDIA:
shortfall 0 means that, in order to keep all your NVIDIA GPUs busy for that min_buffer setting, you'd need 0 seconds. In fact, you have saturation, meaning that all instances are projected to be busy for 30261.82 seconds (8.4 hours).

end work fetch state:
Here is where it makes a decision, based on the info above, of whether to request work from a project or not. You have no CPU projects available for your idle CPUs, so they get left idle :sadface: You have no idle NVIDIA devices, and also your saturation level (30261.82 seconds) is greater than your low water mark (180 seconds), so you don't need NVIDIA work either. So it correctly says "No project chosen for work fetch", and doesn't request work.

Does that help? You should now be ready to read these log messages on your own, I'd think. Feel free to change some buffer values, or set GPUGrid for No New Tasks, to see the effects on this work_fetch_debug output.
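
If you would rather not read these dumps by eye, the walkthrough above can be sketched as a small parser. The regular expressions are written against the line layout shown in this thread only, and the wants_work test is the same simplification used in the explanation (an idle instance, or saturation below the minimum buffer); BOINC's real decision also depends on which projects have apps for the resource.

```python
import re

LOG = """\
15/01/2014 15:56:42 | | [work_fetch] target work buffer: 180.00 + 0.00 sec
15/01/2014 15:56:42 | | [work_fetch] --- state for CPU ---
15/01/2014 15:56:42 | | [work_fetch] shortfall 540.00 nidle 3.00 saturated 0.00 busy 0.00
15/01/2014 15:56:42 | | [work_fetch] --- state for NVIDIA ---
15/01/2014 15:56:42 | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 30261.82 busy 0.00
"""

BUFFER_RE = re.compile(r"target work buffer: ([\d.]+) \+ ([\d.]+) sec")
SECTION_RE = re.compile(r"--- state for (\w+) ---")
STATE_RE = re.compile(
    r"shortfall ([\d.]+) nidle ([\d.]+) saturated ([\d.]+) busy ([\d.]+)"
)


def parse_fetch_state(text):
    """Return (min_buffer, extra_buffer, {resource: state dict})."""
    min_buf = extra_buf = None
    resources, current = {}, None
    for line in text.splitlines():
        if m := BUFFER_RE.search(line):
            min_buf, extra_buf = float(m[1]), float(m[2])
        elif m := SECTION_RE.search(line):
            current = m[1]  # remember which resource the next line describes
        elif (m := STATE_RE.search(line)) and current:
            keys = ("shortfall", "nidle", "saturated", "busy")
            resources[current] = dict(zip(keys, map(float, m.groups())))
    return min_buf, extra_buf, resources


min_buf, extra_buf, resources = parse_fetch_state(LOG)
for name, st in resources.items():
    wants = st["nidle"] > 0 or st["saturated"] < min_buf
    print(f"{name}: idle={st['nidle']:.0f} "
          f"saturated={st['saturated'] / 3600:.1f} h wants_work={wants}")
```

On this sample the CPU section reports that it wants work and the NVIDIA section does not. Nothing is fetched for the CPUs only because the one attached project has no CPU apps.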

tomba
Message 34680 - Posted: 15 Jan 2014, 16:58:18 UTC - in response to Message 34678.  

Let's teach you how to read this.

Thanks for that, Jacob. I shall study it carefully.

You have no CPU projects available for your idle CPUs, so they get left idle :sadface:

Yes. I've been feeling guilty about that, so I'm now running six Rosettas too. I'm a bit worried that the CPU fan has gone from 3700 rpm to 4300 rpm and the CPU temperature from 55 °C to 64 °C, but I guess that's a question for my other thread.

Jacob Klein
Message 34681 - Posted: 15 Jan 2014, 17:08:56 UTC - in response to Message 34680.  
Last modified: 15 Jan 2014, 17:15:13 UTC

Note 1: Richard's previous post in this thread is likely correct.
Note 2: REC is Recent estimated credit, and is used by BOINC in the "prio" priority calculation when choosing which project to ask for work. The projects are listed in "prio" order, such that you can easily see which would be "next in line" in a request for work.
Note 3: In case you're curious to see a more-involved work fetch cycle, or might be wanting a list of projects that I'm attached to, below is a work_fetch_debug that shows the projects I'm attached to. I run a lot of various CPU and NVIDIA projects on this machine.
Note 4: Further information about work_fetch can be found in this slightly outdated, but highly useful, document: http://boinc.berkeley.edu/trac/wiki/ClientSched

------------------------------
A cycle of my work fetch:
------------------------------

1/15/2014 12:09:59 PM | | [work_fetch] entering choose_project()

1/15/2014 12:09:59 PM | | [work_fetch] ------- start work fetch state -------
1/15/2014 12:09:59 PM | | [work_fetch] target work buffer: 8640.00 + 43200.00 sec

1/15/2014 12:09:59 PM | | [work_fetch] --- project states ---
1/15/2014 12:09:59 PM | DrugDiscovery | [work_fetch] REC 0.000 prio -0.000000 can req work
1/15/2014 12:09:59 PM | The Lattice Project | [work_fetch] REC 0.000 prio -0.000000 can req work
1/15/2014 12:09:59 PM | superlinkattechnion | [work_fetch] REC 0.126 prio -0.000000 can't req work: master URL fetch pending (backoff: 43700.11 sec)
1/15/2014 12:09:59 PM | pogs | [work_fetch] REC 0.000 prio -0.000000 can't req work: "no new tasks" requested via Manager
1/15/2014 12:09:59 PM | Quake-Catcher Network | [work_fetch] REC 0.000 prio 0.000000 can't req work: non CPU intensive
1/15/2014 12:09:59 PM | ralph@home | [work_fetch] REC 0.000 prio -0.000000 can req work
1/15/2014 12:09:59 PM | DNA@Home | [work_fetch] REC 0.000 prio -0.000000 can req work
1/15/2014 12:09:59 PM | correlizer | [work_fetch] REC 0.014 prio -0.000000 can req work
1/15/2014 12:09:59 PM | WUProp@Home | [work_fetch] REC 0.014 prio -0.000002 can't req work: non CPU intensive
1/15/2014 12:09:59 PM | MindModeling@Beta | [work_fetch] REC 110.106 prio -0.002449 can req work
1/15/2014 12:09:59 PM | LHC@home 1.0 | [work_fetch] REC 221.382 prio -0.004923 can req work
1/15/2014 12:09:59 PM | Test4Theory@Home | [work_fetch] REC 221.453 prio -0.004925 can req work
1/15/2014 12:09:59 PM | boincsimap | [work_fetch] REC 248.218 prio -0.005520 can req work
1/15/2014 12:09:59 PM | World Community Grid | [work_fetch] REC 1063.070 prio -0.006005 can req work
1/15/2014 12:09:59 PM | rosetta@home | [work_fetch] REC 287.694 prio -0.007551 can req work
1/15/2014 12:09:59 PM | climateprediction.net | [work_fetch] REC 304.466 prio -0.007763 can req work
1/15/2014 12:09:59 PM | Cosmology@Home | [work_fetch] REC 306.047 prio -0.010929 can req work
1/15/2014 12:09:59 PM | Docking | [work_fetch] REC 288.875 prio -0.011575 can req work
1/15/2014 12:09:59 PM | Poem@Home | [work_fetch] REC 590.890 prio -0.013141 can req work
1/15/2014 12:09:59 PM | climateathome | [work_fetch] REC 251.853 prio -0.022648 can req work
1/15/2014 12:09:59 PM | Milkyway@Home | [work_fetch] REC 229.825 prio -0.058248 can req work
1/15/2014 12:09:59 PM | RNA World | [work_fetch] REC 446.313 prio -0.070412 can req work
1/15/2014 12:09:59 PM | GPUGRID | [work_fetch] REC 662565.023 prio -14.773922 can req work
1/15/2014 12:09:59 PM | SETI@home | [work_fetch] REC 76065.475 prio -169.162250 can req work
1/15/2014 12:09:59 PM | Einstein@Home | [work_fetch] REC 80649.354 prio -179.356354 can req work
1/15/2014 12:09:59 PM | SETI@home Beta Test | [work_fetch] REC 80680.357 prio -179.425301 can req work
1/15/2014 12:09:59 PM | Albert@Home | [work_fetch] REC 86519.931 prio -192.412500 can req work

1/15/2014 12:09:59 PM | | [work_fetch] --- state for CPU ---
1/15/2014 12:09:59 PM | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 302655.24 busy 0.00
1/15/2014 12:09:59 PM | DrugDiscovery | [work_fetch] fetch share 0.048
1/15/2014 12:09:59 PM | The Lattice Project | [work_fetch] fetch share 0.048
1/15/2014 12:09:59 PM | superlinkattechnion | [work_fetch] fetch share 0.000
1/15/2014 12:09:59 PM | pogs | [work_fetch] fetch share 0.000
1/15/2014 12:09:59 PM | ralph@home | [work_fetch] fetch share 0.048
1/15/2014 12:09:59 PM | DNA@Home | [work_fetch] fetch share 0.048
1/15/2014 12:09:59 PM | correlizer | [work_fetch] fetch share 0.048
1/15/2014 12:09:59 PM | MindModeling@Beta | [work_fetch] fetch share 0.048
1/15/2014 12:09:59 PM | LHC@home 1.0 | [work_fetch] fetch share 0.048
1/15/2014 12:09:59 PM | Test4Theory@Home | [work_fetch] fetch share 0.048
1/15/2014 12:09:59 PM | boincsimap | [work_fetch] fetch share 0.048
1/15/2014 12:09:59 PM | World Community Grid | [work_fetch] fetch share 0.190
1/15/2014 12:09:59 PM | rosetta@home | [work_fetch] fetch share 0.048
1/15/2014 12:09:59 PM | climateprediction.net | [work_fetch] fetch share 0.048
1/15/2014 12:09:59 PM | Cosmology@Home | [work_fetch] fetch share 0.048
1/15/2014 12:09:59 PM | Docking | [work_fetch] fetch share 0.048
1/15/2014 12:09:59 PM | Poem@Home | [work_fetch] fetch share 0.048
1/15/2014 12:09:59 PM | climateathome | [work_fetch] fetch share 0.048
1/15/2014 12:09:59 PM | Milkyway@Home | [work_fetch] fetch share 0.048
1/15/2014 12:09:59 PM | RNA World | [work_fetch] fetch share 0.048
1/15/2014 12:09:59 PM | GPUGRID | [work_fetch] fetch share 0.000 (no apps)
1/15/2014 12:09:59 PM | SETI@home | [work_fetch] fetch share 0.000
1/15/2014 12:09:59 PM | Einstein@Home | [work_fetch] fetch share 0.000
1/15/2014 12:09:59 PM | SETI@home Beta Test | [work_fetch] fetch share 0.000
1/15/2014 12:09:59 PM | Albert@Home | [work_fetch] fetch share 0.000

1/15/2014 12:09:59 PM | | [work_fetch] --- state for NVIDIA ---
1/15/2014 12:09:59 PM | | [work_fetch] shortfall 68058.62 nidle 0.00 saturated 8942.84 busy 0.00
1/15/2014 12:09:59 PM | DrugDiscovery | [work_fetch] fetch share 0.111
1/15/2014 12:09:59 PM | The Lattice Project | [work_fetch] fetch share 0.000 (no apps)
1/15/2014 12:09:59 PM | superlinkattechnion | [work_fetch] fetch share 0.000
1/15/2014 12:09:59 PM | pogs | [work_fetch] fetch share 0.000 (no apps)
1/15/2014 12:09:59 PM | ralph@home | [work_fetch] fetch share 0.111
1/15/2014 12:09:59 PM | DNA@Home | [work_fetch] fetch share 0.111
1/15/2014 12:09:59 PM | correlizer | [work_fetch] fetch share 0.111
1/15/2014 12:09:59 PM | MindModeling@Beta | [work_fetch] fetch share 0.000 (no apps)
1/15/2014 12:09:59 PM | LHC@home 1.0 | [work_fetch] fetch share 0.000 (no apps)
1/15/2014 12:09:59 PM | Test4Theory@Home | [work_fetch] fetch share 0.000 (no apps)
1/15/2014 12:09:59 PM | boincsimap | [work_fetch] fetch share 0.000 (no apps)
1/15/2014 12:09:59 PM | World Community Grid | [work_fetch] fetch share 0.000 (no apps)
1/15/2014 12:09:59 PM | rosetta@home | [work_fetch] fetch share 0.111
1/15/2014 12:09:59 PM | climateprediction.net | [work_fetch] fetch share 0.000 (no apps)
1/15/2014 12:09:59 PM | Cosmology@Home | [work_fetch] fetch share 0.111
1/15/2014 12:09:59 PM | Docking | [work_fetch] fetch share 0.111
1/15/2014 12:09:59 PM | Poem@Home | [work_fetch] fetch share 0.111
1/15/2014 12:09:59 PM | climateathome | [work_fetch] fetch share 0.000 (no apps)
1/15/2014 12:09:59 PM | Milkyway@Home | [work_fetch] fetch share 0.000 (blocked by configuration file)
1/15/2014 12:09:59 PM | RNA World | [work_fetch] fetch share 0.000 (no apps)
1/15/2014 12:09:59 PM | GPUGRID | [work_fetch] fetch share 0.111
1/15/2014 12:09:59 PM | SETI@home | [work_fetch] fetch share 0.001
1/15/2014 12:09:59 PM | Einstein@Home | [work_fetch] fetch share 0.001
1/15/2014 12:09:59 PM | SETI@home Beta Test | [work_fetch] fetch share 0.001
1/15/2014 12:09:59 PM | Albert@Home | [work_fetch] fetch share 0.001

1/15/2014 12:09:59 PM | | [work_fetch] ------- end work fetch state -------
1/15/2014 12:09:59 PM | | [work_fetch] No project chosen for work fetch
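
With this many projects, it can help to pull the "project states" lines into a table. The sketch below parses a few lines copied from the dump above and lists the projects that can request work, highest (least negative) prio first. The regular expression is fitted to the lines shown in this thread, and the function name is made up for illustration.

```python
import re

SAMPLE = """\
1/15/2014 12:09:59 PM | pogs | [work_fetch] REC 0.000 prio -0.000000 can't req work: "no new tasks" requested via Manager
1/15/2014 12:09:59 PM | rosetta@home | [work_fetch] REC 287.694 prio -0.007551 can req work
1/15/2014 12:09:59 PM | GPUGRID | [work_fetch] REC 662565.023 prio -14.773922 can req work
1/15/2014 12:09:59 PM | SETI@home | [work_fetch] REC 76065.475 prio -169.162250 can req work
"""

STATE_RE = re.compile(
    r"\[work_fetch\] REC ([\d.]+) prio (-?[\d.]+) (can(?:'t)? req work)(?:: (.*))?$"
)


def parse_project_states(text):
    """Return one dict per project-state line; other lines are skipped."""
    projects = []
    for line in text.splitlines():
        # Layout: timestamp | project name | message
        parts = line.split(" | ", 2)
        if len(parts) != 3:
            continue
        m = STATE_RE.search(parts[2].strip())
        if m:
            projects.append({
                "project": parts[1],
                "rec": float(m[1]),
                "prio": float(m[2]),
                "can_request": m[3] == "can req work",
                "reason": m[4],  # None when the project can request work
            })
    return projects


eligible = [p for p in parse_project_states(SAMPLE) if p["can_request"]]
for p in sorted(eligible, key=lambda p: p["prio"], reverse=True):
    print(f"{p['project']:<14} prio {p['prio']:>12.6f}")
```

Splitting on " | " rather than on whitespace keeps project names with spaces or dots (LHC@home 1.0, climateprediction.net) in one piece.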

tomba
Message 34757 - Posted: 22 Jan 2014, 7:58:47 UTC

Just had an early download; eight CPUs and two GPUs busy and no sign of a GPUGrid WU stopping:

22/01/2014 08:07:49 | GPUGRID | [work_fetch] fetch share 1.000
22/01/2014 08:07:49 | | [work_fetch] ------- end work fetch state -------
22/01/2014 08:07:49 | | [work_fetch] No project chosen for work fetch
22/01/2014 08:08:29 | | [work_fetch] Request work fetch: application exited
22/01/2014 08:08:29 | GPUGRID | [work_fetch] REC 272425.128 prio -2.579669 can req work
22/01/2014 08:08:29 | | [work_fetch] --- state for CPU ---
22/01/2014 08:08:29 | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 2030.85 busy 0.00
22/01/2014 08:08:29 | GPUGRID | [work_fetch] fetch share 0.000 (no apps)
22/01/2014 08:08:29 | | [work_fetch] --- state for NVIDIA ---
22/01/2014 08:08:29 | | [work_fetch] shortfall 1044.00 nidle 1.00 saturated 0.00 busy 0.00
22/01/2014 08:08:29 | GPUGRID | [work_fetch] fetch share 0.000
22/01/2014 08:08:29 | | [work_fetch] ------- end work fetch state -------
22/01/2014 08:08:29 | GPUGRID | [work_fetch] set_request() for NVIDIA: ninst 2 nused_total 1.000000 nidle_now 1.000000 fetch share 0.000000 req_inst 1.000000 req_secs 1044.000000
22/01/2014 08:08:29 | GPUGRID | [work_fetch] request: CPU (0.00 sec, 0.00 inst) NVIDIA (1044.00 sec, 1.00 inst)
22/01/2014 08:08:29 | GPUGRID | Sending scheduler request: To fetch work.
22/01/2014 08:08:29 | GPUGRID | Requesting new tasks for NVIDIA
22/01/2014 08:08:32 | GPUGRID | Scheduler request completed: got 1 new tasks
22/01/2014 08:08:32 | | [work_fetch] Request work fetch: RPC complete
22/01/2014 08:08:34 | GPUGRID | Started download of 72x-SANTI_MAR420cap310-29-LICENSE

Jacob Klein
Message 34764 - Posted: 22 Jan 2014, 14:04:15 UTC

Work fetch asked GPUGRID for NVIDIA work because it detected 1 idle NVIDIA instance. Is it possible that it was correct?

If you think it was incorrect, then the next steps might be to turn on cpu_sched and coproc_debug, to "prove" that work fetch made a bad call.

Richard Haselgrove
Message 34765 - Posted: 22 Jan 2014, 15:52:52 UTC - in response to Message 34764.  

Work fetch asked GPUGRID for NVIDIA work because it detected 1 idle NVIDIA instance. Is it possible that it was correct?

If you think it was incorrect, then the next steps might be to turn on cpu_sched and coproc_debug, to "prove" that work fetch made a bad call.

It might be possible to work out what happened from the message log entries immediately before and after the section Tomba posted. Did a task restart, for example?

The trouble with the extra log flags is that you can't use them retrospectively to diagnose a problem which has already happened - you have to set them anyway, and wait for it to happen again.

tomba
Message 34769 - Posted: 22 Jan 2014, 17:38:45 UTC - in response to Message 34764.  

Work fetch asked GPUGRID for NVIDIA work because it detected 1 idle NVIDIA instance. Is it possible that it was correct?

As I said in my post, "no sign of a GPUGrid WU stopping", and I did check back for 30 minutes.

If you think it was incorrect, then the next steps might be to turn on cpu_sched and coproc_debug, to "prove" that work fetch made a bad call.

OK. I'll check 'em out.

Richard Haselgrove
Message 34771 - Posted: 22 Jan 2014, 18:27:56 UTC - in response to Message 34769.  

Work fetch asked GPUGRID for NVIDIA work because it detected 1 idle NVIDIA instance. Is it possible that it was correct?

As I said in my post, "no sign of a GPUGrid WU stopping", and I did check back for 30 minutes.

Unfortunately, a task stopping isn't necessarily logged with normal settings - I think you'd need to add <task_debug> to be sure of seeing that.

But you should see the restart afterwards, in the normal logs (I think).

tomba
Message 34793 - Posted: 23 Jan 2014, 18:11:30 UTC

We start with two WUs confirmed running followed by a couple of nidles of 0.

Then - oops - only one WU is running.

One more 0 nidle then a 1 nidle, and one more 0 nidle!!

Then we have a 1 "nidle_now", followed by an early WU fetch....

23/01/2014 12:17:43 | GPUGRID | [coproc] NVIDIA instance 0: confirming for I91R1-NATHAN_KIDc22_full2-3-10-RND2112_0
23/01/2014 12:17:43 | GPUGRID | [coproc] NVIDIA instance 1: confirming for 75x-SANTI_MARwtcap310-22-32-RND0081_0

23/01/2014 12:18:10 | | [work_fetch] entering choose_project()
23/01/2014 12:18:10 | | [work_fetch] ------- start work fetch state -------
23/01/2014 12:18:10 | | [work_fetch] target work buffer: 180.00 + 864.00 sec
23/01/2014 12:18:10 | | [work_fetch] --- project states ---
23/01/2014 12:18:10 | GPUGRID | [work_fetch] REC 275978.848 prio -3.498618 can req work
23/01/2014 12:18:10 | | [work_fetch] --- state for CPU ---
23/01/2014 12:18:10 | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 2568.97 busy 0.00
23/01/2014 12:18:10 | GPUGRID | [work_fetch] fetch share 0.000 (no apps)
23/01/2014 12:18:10 | | [work_fetch] --- state for NVIDIA ---
23/01/2014 12:18:10 | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 4775.59 busy 0.00
23/01/2014 12:18:10 | GPUGRID | [work_fetch] fetch share 0.500
23/01/2014 12:18:10 | | [work_fetch] ------- end work fetch state -------
23/01/2014 12:18:10 | | [work_fetch] No project chosen for work fetch
23/01/2014 12:18:13 | | [work_fetch] Request work fetch: application exited
23/01/2014 12:18:13 | GPUGRID | [coproc] NVIDIA instance 0: confirming for I91R1-NATHAN_KIDc22_full2-3-10-RND2112_0
23/01/2014 12:18:15 | | [work_fetch] entering choose_project()
23/01/2014 12:18:15 | | [work_fetch] ------- start work fetch state -------
23/01/2014 12:18:15 | | [work_fetch] target work buffer: 180.00 + 864.00 sec
23/01/2014 12:18:15 | | [work_fetch] --- project states ---
23/01/2014 12:18:15 | GPUGRID | [work_fetch] REC 275979.860 prio -3.498092 can req work
23/01/2014 12:18:15 | | [work_fetch] --- state for CPU ---
23/01/2014 12:18:15 | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 2560.84 busy 0.00
23/01/2014 12:18:15 | GPUGRID | [work_fetch] fetch share 0.000 (no apps)
23/01/2014 12:18:15 | | [work_fetch] --- state for NVIDIA ---
23/01/2014 12:18:15 | | [work_fetch] shortfall 1044.00 nidle 1.00 saturated 0.00 busy 0.00
23/01/2014 12:18:15 | GPUGRID | [work_fetch] fetch share 0.000
23/01/2014 12:18:15 | | [work_fetch] ------- end work fetch state -------
23/01/2014 12:18:19 | | [work_fetch] Request work fetch: RPC complete
23/01/2014 12:18:24 | | [work_fetch] entering choose_project()
23/01/2014 12:18:24 | | [work_fetch] ------- start work fetch state -------
23/01/2014 12:18:24 | | [work_fetch] target work buffer: 180.00 + 864.00 sec
23/01/2014 12:18:24 | | [work_fetch] --- project states ---
23/01/2014 12:18:24 | GPUGRID | [work_fetch] REC 275979.860 prio -2.505735 can req work
23/01/2014 12:18:24 | | [work_fetch] --- state for CPU ---
23/01/2014 12:18:24 | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 2546.51 busy 0.00
23/01/2014 12:18:24 | GPUGRID | [work_fetch] fetch share 0.000 (no apps)
23/01/2014 12:18:24 | | [work_fetch] --- state for NVIDIA ---
23/01/2014 12:18:24 | | [work_fetch] shortfall 1044.00 nidle 1.00 saturated 0.00 busy 0.00
23/01/2014 12:18:24 | GPUGRID | [work_fetch] fetch share 0.000
23/01/2014 12:18:24 | | [work_fetch] ------- end work fetch state -------
23/01/2014 12:18:24 | GPUGRID | [work_fetch] set_request() for NVIDIA: ninst 2 nused_total 1.000000 nidle_now 1.000000 fetch share 0.000000 req_inst 1.000000 req_secs 1044.000000
23/01/2014 12:18:24 | GPUGRID | [work_fetch] request: CPU (0.00 sec, 0.00 inst) NVIDIA (1044.00 sec, 1.00 inst)
23/01/2014 12:18:24 | GPUGRID | Sending scheduler request: To fetch work.
23/01/2014 12:18:24 | GPUGRID | Requesting new tasks for NVIDIA
23/01/2014 12:18:27 | GPUGRID | Scheduler request completed: got 1 new tasks
23/01/2014 12:18:27 | | [work_fetch] Request work fetch: RPC complete
23/01/2014 12:18:29 | GPUGRID | Started download of 98x-SANTI_MARwtcap310-30-LICENSE

Jacob Klein
Message 34794 - Posted: 23 Jan 2014, 18:40:43 UTC

23/01/2014 12:18:13 | | [work_fetch] Request work fetch: application exited

Any idea what application exited, causing the work fetch request?
Also, are you using CPU Throttling (The "Use at most X% CPU Time" setting)?
Also, can you please include the first messages at the beginning of the event log, so we can see what version you are using?

I agree this looks a bit suspicious, but it sounds like a GPU task got unloaded, and work fetch decided to fill an idle spot, even if the timing isn't exactly perfect.


Richard Haselgrove
Message 34795 - Posted: 23 Jan 2014, 19:12:50 UTC - in response to Message 34793.  

I think that log sequence is pretty definitive. There are two sets of nidle:

The --- state for CPU --- remains at zero throughout. No problems there.

The --- state for NVIDIA --- starts at 0, jumps to 1, and then drops to 0 again.

At the point of the jump, we can see

23/01/2014 12:18:13 | | [work_fetch] Request work fetch: application exited

and NVIDIA instance 1: confirming for 75x-SANTI_MARwtcap310-22-32-RND0081_0 disappears from the record.

That's result 7689115, which you can see has a pause in the middle:

# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 6109000)

I imagine that if you look a bit further down, you'd see, perhaps first a 'restarting' entry for 75x-SANTI_MARwtcap310-22-32-RND0081_0, and then two task instances being confirmed again at each [coproc] step.

The good news is that 75x-SANTI_MARwtcap310-22-32-RND0081_0 completed successfully and validated, despite the pause in the middle.
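
The nidle sequence described above can also be pulled out of the log mechanically. This is a minimal sketch that assumes the dd/mm/yyyy timestamp format and line layout of tomba's log; the function name is hypothetical, and the sample is a shortened excerpt of the lines posted earlier.

```python
import re
from datetime import datetime

LOG = """\
23/01/2014 12:18:10 | | [work_fetch] --- state for CPU ---
23/01/2014 12:18:10 | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 2568.97 busy 0.00
23/01/2014 12:18:10 | | [work_fetch] --- state for NVIDIA ---
23/01/2014 12:18:10 | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 4775.59 busy 0.00
23/01/2014 12:18:13 | | [work_fetch] Request work fetch: application exited
23/01/2014 12:18:15 | | [work_fetch] --- state for NVIDIA ---
23/01/2014 12:18:15 | | [work_fetch] shortfall 1044.00 nidle 1.00 saturated 0.00 busy 0.00
"""


def nidle_timeline(text, resource="NVIDIA"):
    """Return (timestamp, label) pairs: nidle for one resource, plus app exits."""
    section = None
    events = []
    for line in text.splitlines():
        if not line.strip():
            continue
        stamp, _, rest = line.partition(" | ")
        when = datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S")
        if m := re.search(r"--- state for (\w+) ---", rest):
            section = m[1]
        elif (m := re.search(r"nidle ([\d.]+)", rest)) and section == resource:
            events.append((when, f"nidle {float(m[1]):.0f}"))
        elif "application exited" in rest:
            events.append((when, "application exited"))
    return events


for when, label in nidle_timeline(LOG):
    print(when.time(), label)
```

On this excerpt the output shows nidle 0, then the application exit, then nidle 1 two seconds later, which is the jump described in this post.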

Jacob Klein
Message 34796 - Posted: 23 Jan 2014, 19:16:35 UTC - in response to Message 34795.  

And so, for that brief pause, work fetch correctly tried to fill an idle GPU, right?

Richard Haselgrove
Message 34797 - Posted: 23 Jan 2014, 19:51:32 UTC - in response to Message 34796.  

And so, for that brief pause, work fetch correctly tried to fill an idle GPU, right?

That's my guess. And I'm also guessing that BOINC restarted the missing 75x-SANTI_MARwtcap310-22-32-RND0081_0 (allowing it to run to completion and report success), before the file downloads for the replacement - probably result 7690601 - had completed and allowed it to be started on the idle GPU.

Dagorath
Message 34799 - Posted: 23 Jan 2014, 21:55:29 UTC - in response to Message 34797.  

Makes sense to me. Thanks for the lesson in debug message interpretation, Richard and Jacob, I swear I'll get it eventually. So what's causing the simulation to become unstable and pause to catch its breath? Clocks too high?

Shouldn't the client recognize the pause as a temporary suspend and not request more work?

BOINC <<--- credit whores, pedants, alien hunters

Stefan
Project administrator
Project developer
Project tester
Project scientist
Message 34802 - Posted: 24 Jan 2014, 10:11:27 UTC

Is this maybe the pause that Matt implemented in the latest versions, where after a crash the app pauses to avoid consecutive crashes, or something like that?
It rings a bell.

MJH
Message 34803 - Posted: 24 Jan 2014, 10:29:50 UTC - in response to Message 34802.  


Is this maybe the pause that Matt implemented in the latest versions, where after a crash the app pauses to avoid consecutive crashes, or something like that?



Quite likely. If the task crashes having made some progress, it will be restarted after a delay of at least 60 sec. BOINC will try and start some other work during this idle time, which may involve downloading another task.

In the worst case, a very unreliable machine may be continuously cycling between two or three tasks, each time making just enough progress to merit a restart attempt.

MJH
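
To put numbers on that, here is a small sketch using two timestamps from tomba's log of 23 Jan: the "application exited" line and the "Sending scheduler request" line. The 60-second figure is the minimum restart delay mentioned above; the variable names are only illustrative.

```python
from datetime import datetime, timedelta

FMT = "%d/%m/%Y %H:%M:%S"
RESTART_DELAY = timedelta(seconds=60)  # "at least 60 sec", per the post above

# Timestamps copied from the log earlier in this thread.
exited = datetime.strptime("23/01/2014 12:18:13", FMT)     # application exited
requested = datetime.strptime("23/01/2014 12:18:24", FMT)  # Sending scheduler request

gap = requested - exited
print(f"work requested {gap.total_seconds():.0f} s after the exit")
print("inside the restart delay:", gap < RESTART_DELAY)
```

The request went out 11 seconds after the exit, well inside the restart delay, so the client saw an idle GPU and asked for a replacement task before the crashed one came back.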

Retvari Zoltan
Message 34814 - Posted: 25 Jan 2014, 13:55:44 UTC - in response to Message 34803.  


Is this maybe the pause that Matt implemented in the latest versions, where after a crash the app pauses to avoid consecutive crashes, or something like that?

Quite likely. If the task crashes having made some progress, it will be restarted after a delay of at least 60 sec. BOINC will try and start some other work during this idle time, which may involve downloading another task.

In the worst case, a very unreliable machine may be continuously cycling between two or three tasks, each time making just enough progress to merit a restart attempt.

MJH

Exactly that happened when my Gigabyte GTX 780Ti OC was unreliable.

©2026 Universitat Pompeu Fabra