acemdbeta application - discussion

Message boards : News : acemdbeta application - discussion
ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Message 32974 - Posted: 15 Sep 2013, 13:43:34 UTC

Matt, thanks for the explanations some weeks ago! Going back to my original question: is there a better way to communicate the number of task restarts than "when in doubt, take a close look at the task outputs"?

This is the best idea I have so far: it would be nice if the BOINC server software could be changed (1) to show additional "custom" columns in the task overview. Projects could configure what this custom column should actually display (2). If this was used for e.g. the number of task recoveries we could just browse the task lists for anomalies (which I think many of us are already doing anyway).

(1) I know this isn't your job. It's just a suggestion I'm throwing into the discussion, which could be implemented in the future if enough projects wanted it.

(2) I'm sure there would be some use of this to output trouble-shooting info, performance data, results (pulsars or aliens found etc.) or other new ideas.

Zoltan wrote:
2. for the client side: now that you can monitor the GPU temperature, you should throttle the client if the GPU it's running on became too hot (for example above 80°C, and a warning should be present in the stderr.out)

I don't think throttling by GPU-Grid itself would be a good idea. Titans and newer cards with Boost 2 are set for a target temperature of 80°C, which the user can change. The fan control of older cards often targets "<90°C" under load. And GPU-Grid would have only one lever available: switching calculations on or off. That is not efficient at all if a GPU boosts into a high-voltage state (because its boost settings say that's OK), which then triggers an app-internal throttle that pauses computations for a moment, only for the card to run into the inefficient high-performance state again. In this case it would be much better if the user simply adjusted the temperature target to a good value, so that the card could choose a more efficient lower-voltage state which allows sustained load at the target temperature.

I agree, though, that a notification system could help users who are unfamiliar with these things. On the other hand, these users would probably not look into the stderr.out either.

SK suggests using e-mail or the BOINC notification system for this. I'd caution against overusing these: users could easily feel spammed if they have read the warning and have reasons to ignore it, yet keep receiving it. Also, the notifications are pushed out to all users' machines connected to the project (or could this be changed? I don't think it's intended for individual messages), which could be badgering. I'm already getting quite a few messages through this system repeatedly which.. ehm, I don't like getting. Let's leave it at that ;)

MJH wrote:
Unsurprising - it's an inevitable consequence of the way the BOINC client library (which we build into our application) goes about doing suspend-resume [1]. I've re-plumbed the whole thing entirely, using a much more reliable method.

Sounds like an improvement which should find its way back into the main branch :)

MrS
Scanning for our furry friends since Jan 2002
skgiven
Volunteer moderator
Volunteer tester
Message 32979 - Posted: 15 Sep 2013, 22:39:44 UTC - in response to Message 32974.  
Last modified: 15 Sep 2013, 22:41:16 UTC

Communication is a conundrum. Some want it, some don't. In the BOINC Announcements system (which isn't a per-user system) you can opt out: Tools, Options, Notice Reminder Interval (Never). You can also opt out of project messages (again not per user, but the option would be good) or into PMs, so they go straight to e-mail - which I like and would favor in the case of critical announcements (your card fails every WU because it has memory issues/is too hot/the clocks have dropped/a fan has failed...).

If we had a Personal Notices area, the server could post warning messages to alert users and make suggestions. An opt-out and auto-delete after x days would keep it under control. Maybe in MyAccount, but ideally just a link from there to a new page (which could be linked to from a warnings button in BOINC Manager, under project web pages). This is more of a web dev's area than an app dev's, and it would be nice to see such things added by the BOINC server and site devs, but Matt's rather capable and could easily add a web page and a button to BOINC Manager.

While most newbies wouldn't know to look in the BOINC logs, some would, and a message there (in bold red) would help many non-newbies too.

When advertising you don't put one sign up!
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help
Carlesa25
Message 33000 - Posted: 16 Sep 2013, 16:00:34 UTC

Hello: the Beta task MJHARVEY_CRASH2-110-18-25-RND5104_0 is running without problems on Linux (Ubuntu 13.10), and the behavior is normal: a CPU load of 90-99% on its dedicated core.

The CRASHNPT Betas (two completed), by contrast, loaded the four CPUs at <25%.
ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Message 33012 - Posted: 16 Sep 2013, 20:33:42 UTC - in response to Message 32979.  

Thanks SK, that's certainly worth considering and could take care of all my concerns about automated messages. An implementation would require some serious work. I think it would be best done in the main BOINC code base, so that all projects could benefit from it, but setting it up here could serve as a demonstration / showcase, which could motivate the main BOINC developers to include it.

MrS
Scanning for our furry friends since Jan 2002
skgiven
Volunteer moderator
Volunteer tester
Message 33059 - Posted: 18 Sep 2013, 16:44:04 UTC - in response to Message 33012.  
Last modified: 18 Sep 2013, 16:48:46 UTC

I would add that it would be nice to know if a WU is taking exceptionally long to complete - say 2 or 3 times what is normal for any given type of WU.

- just spotted a WU that had been pretending to run for the last 5 days!
The WU cache on that system is only up to 0.2 days.
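
A rough sketch of the "2 or 3 times what is normal" check suggested above, in Python. The function name, the default factor of 2.5, and the use of the median of completed runtimes as "normal" are all assumptions made for illustration; nothing here is taken from the BOINC or ACEMD code.

```python
from statistics import median


def is_overlong(elapsed_s, typical_runtimes_s, factor=2.5):
    """Return True when a task has run longer than `factor` times the
    median runtime of comparable, already completed tasks.

    `typical_runtimes_s` is a hypothetical list of runtimes (in seconds)
    for the same WU type on the same host.
    """
    if not typical_runtimes_s:
        return False  # no baseline yet, so do not raise a false alarm
    return elapsed_s > factor * median(typical_runtimes_s)


# A WU "running" for 5 days where similar WUs take 6 to 7 hours
baseline = [6 * 3600, 6.5 * 3600, 7 * 3600]
print(is_overlong(5 * 86400, baseline))  # True
print(is_overlong(7 * 3600, baseline))   # False
```

Using the median rather than the mean keeps a single earlier runaway task from inflating the baseline.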

- Maybe a Danger message to alert you when you are running a NOELIA WU!
ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Message 33071 - Posted: 18 Sep 2013, 19:34:38 UTC - in response to Message 33059.  

We may already have a mechanism for that. I can't remember the exact wording.. time limit exceeded? It mostly triggers unintentionally and results in a straight error, though.

Assuming this is not actually a time limit, as the naming suggests, but something related to the number of flops expected for a WU, it might be possible to fine-tune this limit and use it to catch hanging WUs. One would have to be very careful not to generate false positives, though: the harm done would probably outweigh the gain.

Another thought regarding a hanging app: there's this error message "no heartbeat from client" that sometimes pops up, which implies there's some heartbeat checking going on. I assume BOINC would throw an error if it didn't receive heartbeats from an app any more. If this is true, then your hanging WU was still generating a heartbeat and hence was not totally stuck. At this point the GPU-Grid app could monitor itself and trigger Matt's new recovery mode if it detects no progress from the GPU after some time.
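
The self-monitoring idea could look roughly like the Python sketch below. It is only an illustration of a progress watchdog: the class, its method names, and the notion of a "step counter" are hypothetical and do not describe how the ACEMD app or its recovery mode is actually implemented.

```python
import time


class ProgressWatchdog:
    """Flags a stall when the reported step counter stops changing."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock            # injectable so the logic can be tested
        self.last_step = None
        self.last_change = clock()

    def report(self, step):
        """Record the latest completed simulation step."""
        if step != self.last_step:
            self.last_step = step
            self.last_change = self.clock()

    def stalled(self):
        """True if no new step has been seen for more than timeout_s."""
        return self.clock() - self.last_change > self.timeout_s


# Usage with a fake clock: the step counter stays at 1 for over a minute
now = [0.0]
watchdog = ProgressWatchdog(timeout_s=60, clock=lambda: now[0])
watchdog.report(1)
now[0] = 61.0
watchdog.report(1)           # still the same step, so no progress
print(watchdog.stalled())    # True
```

The timeout would have to be generous compared with the normal time per step, for the same false-positive reasons mentioned above.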

MrS
Scanning for our furry friends since Jan 2002
skgiven
Volunteer moderator
Volunteer tester
Message 33082 - Posted: 18 Sep 2013, 21:39:41 UTC - in response to Message 33071.  

The "no heartbeat from client" problem is a bit of an anchor for BOINC; it trawls around the seabeds ripping up reefs - as old-school as my ideas on task timeouts. I presume I experienced the consequences of such murmurs today when my CPU apps failed, compliments of a N* WU.
Is the recovery mechanism confined to the WDDM timeout limits (which can be changed)?
MJH

Message 33871 - Posted: 14 Nov 2013, 9:55:37 UTC - in response to Message 32974.  


Matt, thanks for the explanations some weeks ago! Going back to my original question: is there a better way to communicate the number of task restarts than "when in doubt, take a close look at the task outputs"?


Unfortunately not. I agree, it would be really nice to be able to push a message to the BOINC Mangler from the client. I should make a request to DA.

Matt
MJH

Message 33872 - Posted: 14 Nov 2013, 9:56:45 UTC

New BETA coming later today.
Will include a fix for the repeated crashing on restart that some of you have seen.

Matt
Jacob Klein

Message 33875 - Posted: 14 Nov 2013, 14:05:13 UTC
Last modified: 14 Nov 2013, 14:06:00 UTC

Huzzah! [Thanks, looking forward to testing it!]
MJH

Message 33876 - Posted: 14 Nov 2013, 14:39:48 UTC - in response to Message 33875.  

815 is live now.
Richard Haselgrove

Message 33880 - Posted: 14 Nov 2013, 16:16:45 UTC - in response to Message 33876.  

815 is live now.

Got one. I think you might have sent out 10-minute jobs with a full-length runtime estimate again.
Richard Haselgrove

Message 33927 - Posted: 18 Nov 2013, 19:57:51 UTC

Just had a batch of 'KLAUDE' beta tasks all error with

ERROR: file pme.cpp line 85: PME NX too small 
skgiven
Volunteer moderator
Volunteer tester
Message 33928 - Posted: 18 Nov 2013, 20:16:11 UTC - in response to Message 33927.  
Last modified: 18 Nov 2013, 20:18:22 UTC

ditto, all failed in 2 or 3 seconds:

Name 8-KLAUDE_6426-0-3-RND4620_2
Workunit 4932250
Created 18 Nov 2013 | 20:05:37 UTC
Sent 18 Nov 2013 | 20:08:06 UTC
Received 18 Nov 2013 | 20:12:46 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -98 (0xffffffffffffff9e) Unknown error number

Computer ID 139265
Report deadline 23 Nov 2013 | 20:08:06 UTC
Run time 2.52
CPU time 0.45
Validate state Invalid
Credit 0.00
Application version ACEMD beta version v8.15 (cuda55)
Stderr output

<core_client_version>7.2.28</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -98 (0xffffff9e)
</message>
<stderr_txt>
# GPU [GeForce GTX 770] Platform [Windows] Rev [3203M] VERSION [55]
# SWAN Device 0 :
# Name : GeForce GTX 770
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:01:00.0
# Device clock : 1084MHz
# Memory clock : 3505MHz
# Memory width : 256bit
# Driver version : r331_00 : 33140
ERROR: file pme.cpp line 85: PME NX too small
20:10:02 (1592): called boinc_finish

</stderr_txt>
]]>
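
A side note on the two hex values in the report above: 0xffffffffffffff9e and 0xffffff9e are the same exit code, -98, written as a 64-bit and a 32-bit two's-complement pattern. The small Python sketch below shows the conversion; the helper name is made up for illustration, and why the task page and the stderr use different widths is not something this thread explains.

```python
def as_unsigned_hex(value, bits):
    """Format a signed integer as its two's-complement bit pattern."""
    return hex(value & ((1 << bits) - 1))


print(as_unsigned_hex(-98, 32))  # 0xffffff9e
print(as_unsigned_hex(-98, 64))  # 0xffffffffffffff9e
```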
(retired account)

Message 33929 - Posted: 18 Nov 2013, 20:44:39 UTC
Last modified: 18 Nov 2013, 20:54:06 UTC

Yep, all thrashed within seconds, no matter if GTX Titan or GT 650M... Hope it helps as I'm only consuming up-/download here and it won't even heat my room. ;)

EDIT: There's a good one, at least for the 10 and a half minutes it ran so far.
Mark my words and remember me. - 11th Hour, Lamb of God

(retired account)

Message 33930 - Posted: 18 Nov 2013, 21:53:55 UTC - in response to Message 33929.  


EDIT: There's a good one, at least for the 10 and a half minutes it ran so far.


Still running, but estimation of remaining time is way off: 18.730%, runtime 01:07:40, remaining time 00:10:18.
skgiven
Volunteer moderator
Volunteer tester
Message 33931 - Posted: 18 Nov 2013, 22:06:32 UTC - in response to Message 33930.  
Last modified: 18 Nov 2013, 22:08:24 UTC


EDIT: There's a good one, at least for the 10 and a half minutes it ran so far.


Still running, but estimation of remaining time is way off: 18.730%, runtime 01:07:40, remaining time 00:10:18.

The Run Time and % Complete are accurate, so you can estimate the overall time from them: 18.73% in 67 2/3 min suggests it will take a total of 6 h and 2 minutes (+/- a couple) to complete.
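
The same back-of-the-envelope estimate as a small Python sketch. It assumes that progress is linear in time, which is an assumption and not something the app guarantees; the function names are made up for illustration.

```python
def estimate_total_runtime(elapsed_s, fraction_done):
    """Extrapolate the total runtime, assuming progress is linear in time."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_s / fraction_done


def format_hm(seconds):
    """Render a duration in seconds as hours and whole minutes."""
    hours, rem = divmod(int(seconds), 3600)
    return f"{hours}h {rem // 60:02d}m"


elapsed = 1 * 3600 + 7 * 60 + 40                   # runtime 01:07:40
total = estimate_total_runtime(elapsed, 0.18730)   # 18.730% complete
print(format_hm(total))             # 6h 01m
print(format_hm(total - elapsed))   # 4h 53m
```

That is about six hours in total, a long way from the 00:10:18 remaining that the client displayed.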

I have two 8.15 Betas running on a GTX660Ti and a GTX770 (W7) that look like taking 9h 12min and 6h 32min respectively.
(retired account)

Message 33937 - Posted: 19 Nov 2013, 6:37:49 UTC

Yes, it took 5 hrs. and 58 min., validated ok.
Damaraland

Message 33945 - Posted: 20 Nov 2013, 19:52:28 UTC - in response to Message 33937.  

Not very sure if you still want this info. Maybe you could be more precise.

CUDA: NVIDIA GPU 0: GeForce GTX 260 (driver version 331.65, CUDA version 6.0, compute capability 1.3, 896MB, 818MB available, 912 GFLOPS peak)

ACEMD beta version v8.15 (cuda55)
77-KLAUDE_6429-0-2-RND1641_1: expected to finish in 22 h. 83% processed correctly so far.
Richard Haselgrove

Message 33949 - Posted: 20 Nov 2013, 23:58:27 UTC

I've finally had a crash while beta tasks were running ;)

I would need to examine the logs more carefully to be certain of the sequence of events, but it seems likely that these two tasks were running (one on each GTX 670) at around 16:50 tonight when the computer froze: I restarted it (hard power off) some 15 minutes later.

1-KLAUDE_6429-1-2-RND1937_0 did not survive the experience.

95-KLAUDE_6429-0-2-RND2489_1 was luckier with its restart.

©2025 Universitat Pompeu Fabra