acemdbeta application

Author	Message
ExtraTerrestrial Apes Volunteer moderator Volunteer tester Send message Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 Level Scientific publications	Message 32974 - Posted: 15 Sep 2013, 13:43:34 UTC Matt, thanks for the explanations some weeks ago! Going back to my original question: is there a better way to communicate the number of task restarts than "when in doubt take a close look at the task outputs"? This is the best idea I have so far: it would be nice if the BOINC server software could be changed (1) to show additional "custom" columns in the task overview. Projects could configure what this custom column should actually display (2). If this was used for e.g. the number of task recoveries we could just browse the task lists for anomalies (which I think many of us are already doing anyway). (1) I know this isn't your job. It's just a suggestion I'm throwing into the discussion, which could be implemented in the future if enough projects wanted it. (2) I'm sure there would be some use of this to output trouble-shooting info, performance data, results (pulsars or aliens found etc.) or other new ideas. Zoltan wrote: 2. for the client side: now that you can monitor the GPU temperature, you should throttle the client if the GPU it's running on became too hot (for example above 80°C, and a warning should be present in the stderr.out) I don't think throttling by GPU-Grid itself would be a good idea. Titans and newer cards with Boost 2 are set for a target temperature of 80°C, which could be changed by the user. Older cards' fan control often targets "<90°C" under load. And GPU-Grid would only have one lever available: switching calculations on or off. That is not efficient at all if a GPU boosts into a high-voltage state (because its boost settings say that's OK), which then triggers an app-internal throttle, pausing computations for a moment.. only to run again into the inefficient high-performance state. 
In this case it would be much better if the user simply adjusted the temperature target to a good value, so that the card could choose a more efficient lower voltage state which allows sustained load at the target temperature. I agree, though, that a notification system could help users who are unfamiliar with these things. On the other hand: these users would probably not look into the stderr.out either. SK suggests using e-mail or the BOINC notification system for this. I'd caution against overusing these - users could easily feel spammed if they read the warning but have reasons to ignore it, yet keep receiving them. Also the notifications are pushed out to all of the user's machines connected to the project (Or could this be changed? I don't think it's intended for individual messages), which could be badgering. I'm already getting quite a few messages through this system repeatedly which.. ehm, I don't like getting. Let's leave it at that ;) MJH wrote: Unsurprising - it's an inevitable consequence of the way the BOINC client library (which we build into our application) goes about doing suspend-resume [1] I've re-plumbed the whole thing entirely, using a much more reliable method. Sounds like an improvement which should find its way back into the main branch :) MrS Scanning for our furry friends since Jan 2002 ID: 32974 · Rating: 0 · rate: / Reply Quote

skgiven Volunteer moderator Volunteer tester Send message Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level Scientific publications	Message 32979 - Posted: 15 Sep 2013, 22:39:44 UTC - in response to Message 32974. Last modified: 15 Sep 2013, 22:41:16 UTC Communication is a conundrum. Some want it, some don't. In the Boinc Announcements system (which isn't a per user system) you can opt out; Tools, Options, Notice Reminder Interval (Never). You can also opt out of project messages (again not per user, but the option would be good) or into PM's, so they go straight to email - which I like and would favor in the case of critical announcements (your card fails every WU because it has memory issues/is too hot/the clocks have dropped/a fan has failed...). If we had a Personal Notices area, the server could post warning messages to alert users and make suggestions. An opt out and auto delete after x days would keep it controlled. Maybe in MyAccount, but ideally just a link from there to a new page (which can be linked to from a warnings button in Boinc Manager, under project web pages). More of a web dev's area than an app dev's and would be nice to see such things added by the Boinc server and site devs, but Matt's rather capable and could easily add a web page and a button to Boinc Manager. While most newbies wouldn't know to look in the Boinc Logs, some would and a message there (in Bold Red) would help many non newbies too. When advertising you don't put one sign up! FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help ID: 32979 · Rating: 0 · rate: / Reply Quote

Carlesa25 Send message Joined: 13 Nov 10 Posts: 328 Credit: 72,619,453 RAC: 0 Level Scientific publications	Message 33000 - Posted: 16 Sep 2013, 16:00:34 UTC Hello: Beta task MJHARVEY_CRASH2-110-18-25-RND5104_0 is running without problems on Linux (Ubuntu 13.10) and the behavior is normal: a CPU load of 90-99% on its dedicated core. The CRASHNPT betas (two completed), by contrast, loaded the four CPUs at <25%. ID: 33000 · Rating: 0 · rate: / Reply Quote

ExtraTerrestrial Apes Volunteer moderator Volunteer tester Send message Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 Level Scientific publications	Message 33012 - Posted: 16 Sep 2013, 20:33:42 UTC - in response to Message 32979. Thanks SK, that's certainly worth considering and could take care of all my concerns against automated messages. An implementation would require some serious work. I think it would be best done in the BOINC main code base, so that projects could benefit from it, but setting it up here could be seen as a demonstration / showcase, which could motivate the main BOINC developers to include it. MrS Scanning for our furry friends since Jan 2002 ID: 33012 · Rating: 0 · rate: / Reply Quote

skgiven Volunteer moderator Volunteer tester Send message Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level Scientific publications	Message 33059 - Posted: 18 Sep 2013, 16:44:04 UTC - in response to Message 33012. Last modified: 18 Sep 2013, 16:48:46 UTC I would add that it would be nice to know if a WU is taking exceptionally long to complete - say 2 or 3 times what is normal for any given type of WU. - just spotted a WU that had been pretending to run for the last 5 days! WU cache on that system is up to 0.2days only. - Maybe a Danger message to alert you when you are running a NOELIA WU! FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help ID: 33059 · Rating: 0 · rate: / Reply Quote
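
The "2 or 3 times what is normal" check suggested above could be sketched roughly as follows. This is only an illustration: the function name `is_overdue`, the default factor, and the sample runtimes are hypothetical, and nothing here reflects how the BOINC server or client actually tracks run times.

```python
from statistics import median
from typing import Sequence


def is_overdue(elapsed_hours: float,
               typical_hours: Sequence[float],
               factor: float = 3.0) -> bool:
    """Return True if a task has run longer than `factor` times the median
    runtime of previously completed tasks of the same type.

    Hypothetical helper: the runtime history is assumed to be available.
    """
    if not typical_hours:
        # Without a history there is no baseline, so do not raise an alarm.
        return False
    return elapsed_hours > factor * median(typical_hours)


# Made-up history for one WU type, in hours.
history = [6.0, 6.5, 9.0]
print(is_overdue(5 * 24, history))  # a task "running" for 5 days
print(is_overdue(10.0, history))    # slow, but within 3x the median
```

Using the median rather than the mean keeps one unusually slow host from shifting the baseline, which matters given the concern about false positives.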

ExtraTerrestrial Apes Volunteer moderator Volunteer tester Send message Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 Level Scientific publications	Message 33071 - Posted: 18 Sep 2013, 19:34:38 UTC - in response to Message 33059. We may already have a mechanism for that. I can't remember the exact wording.. time limit exceeded? It mostly triggers unintentionally and results in a straight error, though. Assuming this is not actually a time limit, as the naming suggests, but something related to the amount of flops expected for a WU, it might be possible to fine tune and use this limit to catch hanging WUs. One would have to be very careful about not generating false positives.. the harm done would probably outweigh the gain. Another thought regarding a hanging app: there's this error message "no heartbeat from client" sometimes popping up. Which implies there's some heartbeat-checking going on. I assume BOINC would throw an error if it doesn't receive heartbeats from an app any more. If this is true then your hanging WU was still generating a heartbeat and hence was not totally stuck. At this point the GPU-Grid app could monitor itself and trigger Matt's new recovery mode if it detects no progress from the GPU after some time. MrS Scanning for our furry friends since Jan 2002 ID: 33071 · Rating: 0 · rate: / Reply Quote
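
The self-monitoring idea in the post above (detect that progress has stopped, then trigger recovery) can be sketched as a small watchdog. This is a minimal sketch under stated assumptions, not the ACEMD application's actual recovery code: the class name `ProgressWatchdog`, the `stall_seconds` threshold, and the idea of feeding it a progress fraction are all hypothetical.

```python
import time
from typing import Callable, Optional


class ProgressWatchdog:
    """Reports a stall when the progress value has not advanced for a while.

    Hypothetical helper for illustration only.
    """

    def __init__(self, stall_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.stall_seconds = stall_seconds
        self._clock = clock
        self._last_progress: Optional[float] = None
        self._last_change = clock()

    def update(self, progress: float) -> bool:
        """Record the latest progress fraction; return True if stalled."""
        now = self._clock()
        if self._last_progress is None or progress > self._last_progress:
            # Progress advanced (or first sample): reset the stall timer.
            self._last_progress = progress
            self._last_change = now
            return False
        # No advance: stalled once the quiet period reaches the threshold.
        return (now - self._last_change) >= self.stall_seconds


# Usage sketch with a fake clock so the example runs instantly.
fake_now = [0.0]
watchdog = ProgressWatchdog(stall_seconds=600, clock=lambda: fake_now[0])
watchdog.update(0.10)
fake_now[0] = 700.0
if watchdog.update(0.10):
    print("no progress for 10 minutes: a recovery attempt would start here")
```

The threshold would need to be generous, since a short pause should not be mistaken for a hang.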

skgiven Volunteer moderator Volunteer tester Send message Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level Scientific publications	Message 33082 - Posted: 18 Sep 2013, 21:39:41 UTC - in response to Message 33071. The "no heartbeat from client" problem is a bit of an anchor for Boinc; it trawls around the seabeds ripping up reefs - as old-school as my ideas on task timeouts. I presume I experienced the consequences of such murmurs today when my CPU apps failed, compliments of a N* WU. Is the recovery mechanism confined to the WDDM timeout limits (which can be changed)? FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help ID: 33082 · Rating: 0 · rate: / Reply Quote

MJH Send message Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 Level Scientific publications	Message 33871 - Posted: 14 Nov 2013, 9:55:37 UTC - in response to Message 32974. Matt, thanks for the explanations some weeks ago! Going back to my original question: is there a better way to communicate the number of task restarts than "when in doubt take a close look at the task outputs"? Unfortunately not. I agree, it would be really nice to be able to push a message to the BOINC Mangler from the client. I should make a request to DA. Matt ID: 33871 · Rating: 0 · rate: / Reply Quote

MJH Send message Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 Level Scientific publications	Message 33872 - Posted: 14 Nov 2013, 9:56:45 UTC New BETA coming later today. Will include a fix for the repeated crashing on restart that some of you have seen. Matt ID: 33872 · Rating: 0 · rate: / Reply Quote

Jacob Klein Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 Level Scientific publications	Message 33875 - Posted: 14 Nov 2013, 14:05:13 UTC Last modified: 14 Nov 2013, 14:06:00 UTC Huzzah! [Thanks, looking forward to testing it!] ID: 33875 · Rating: 0 · rate: / Reply Quote

MJH Send message Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 Level Scientific publications	Message 33876 - Posted: 14 Nov 2013, 14:39:48 UTC - in response to Message 33875. 815 is live now. ID: 33876 · Rating: 0 · rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0 Level Scientific publications	Message 33880 - Posted: 14 Nov 2013, 16:16:45 UTC - in response to Message 33876. 815 is live now. Got one. I think you might have sent out 10-minute jobs with a full-length runtime estimate again. ID: 33880 · Rating: 0 · rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0 Level Scientific publications	Message 33927 - Posted: 18 Nov 2013, 19:57:51 UTC Just had a batch of 'KLAUDE' beta tasks all error with ERROR: file pme.cpp line 85: PME NX too small ID: 33927 · Rating: 0 · rate: / Reply Quote

skgiven Volunteer moderator Volunteer tester Send message Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level Scientific publications	Message 33928 - Posted: 18 Nov 2013, 20:16:11 UTC - in response to Message 33927. Last modified: 18 Nov 2013, 20:18:22 UTC ditto, all failed in 2 or 3 seconds:
Name: 8-KLAUDE_6426-0-3-RND4620_2
Workunit: 4932250
Created: 18 Nov 2013 | 20:05:37 UTC
Sent: 18 Nov 2013 | 20:08:06 UTC
Received: 18 Nov 2013 | 20:12:46 UTC
Server state: Over
Outcome: Computation error
Client state: Compute error
Exit status: -98 (0xffffffffffffff9e) Unknown error number
Computer ID: 139265
Report deadline: 23 Nov 2013 | 20:08:06 UTC
Run time: 2.52
CPU time: 0.45
Validate state: Invalid
Credit: 0.00
Application version: ACEMD beta version v8.15 (cuda55)
Stderr output:
<core_client_version>7.2.28</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -98 (0xffffff9e)
</message>
<stderr_txt>
# GPU [GeForce GTX 770] Platform [Windows] Rev [3203M] VERSION [55]
# SWAN Device 0 :
# Name : GeForce GTX 770
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:01:00.0
# Device clock : 1084MHz
# Memory clock : 3505MHz
# Memory width : 256bit
# Driver version : r331_00 : 33140
ERROR: file pme.cpp line 85: PME NX too small
20:10:02 (1592): called boinc_finish
</stderr_txt>
]]>
FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help ID: 33928 · Rating: 0 · rate: / Reply Quote
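
For anyone scanning many failed results, the application error line in a stderr dump like the one above can be pulled out with a short script. This is a rough sketch that assumes the `ERROR: file <name> line <n>: <message>` layout seen in this one report; the function names are hypothetical and other failures may be formatted differently. The second helper shows why the same exit status -98 is printed as 0xffffff9e in one place and 0xffffffffffffff9e in another: it is the same negative number shown as a 32-bit or 64-bit two's-complement value.

```python
import re
from typing import Optional, Tuple

# Pattern assumed from the single error line quoted in this thread.
ERROR_LINE = re.compile(
    r"ERROR: file (?P<file>\S+) line (?P<line>\d+): (?P<message>.+)"
)


def find_app_error(stderr_text: str) -> Optional[Tuple[str, int, str]]:
    """Return (source file, line number, message) of the first error line,
    or None if no matching line is present."""
    match = ERROR_LINE.search(stderr_text)
    if match is None:
        return None
    return match["file"], int(match["line"]), match["message"].strip()


def as_unsigned_hex(code: int, bits: int = 32) -> str:
    """Show a signed exit code as an unsigned two's-complement hex value."""
    return hex(code & ((1 << bits) - 1))


sample = (
    "# Driver version : r331_00 : 33140\n"
    "ERROR: file pme.cpp line 85: PME NX too small\n"
    "20:10:02 (1592): called boinc_finish\n"
)
print(find_app_error(sample))
print(as_unsigned_hex(-98), as_unsigned_hex(-98, bits=64))
```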

(retired account) Send message Joined: 22 Dec 11 Posts: 38 Credit: 28,606,255 RAC: 0 Level Scientific publications	Message 33929 - Posted: 18 Nov 2013, 20:44:39 UTC Last modified: 18 Nov 2013, 20:54:06 UTC Yep, all thrashed within seconds, no matter if GTX Titan or GT 650M... Hope it helps as I'm only consuming up-/download here and it won't even heat my room. ;) EDIT: There's a good one, at least for the 10 and a half minutes it ran so far. Mark my words and remember me. - 11th Hour, Lamb of God ID: 33929 · Rating: 0 · rate: / Reply Quote

(retired account) Send message Joined: 22 Dec 11 Posts: 38 Credit: 28,606,255 RAC: 0 Level Scientific publications	Message 33930 - Posted: 18 Nov 2013, 21:53:55 UTC - in response to Message 33929. EDIT: There's a good one, at least for the 10 and a half minutes it ran so far. Still running, but estimation of remaining time is way off: 18.730%, runtime 01:07:40, remaining time 00:10:18. ID: 33930 · Rating: 0 · rate: / Reply Quote

skgiven Volunteer moderator Volunteer tester Send message Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level Scientific publications	Message 33931 - Posted: 18 Nov 2013, 22:06:32 UTC - in response to Message 33930. Last modified: 18 Nov 2013, 22:08:24 UTC EDIT: There's a good one, at least for the 10 and a half minutes it ran so far. Still running, but estimation of remaining time is way off: 18.730%, runtime 01:07:40, remaining time 00:10:18. The Run Time and % Complete are accurate, so you can estimate the overall time from that; 18.73% in 67 2/3min suggests it will take a total of 6h and 2minutes (+/- a couple) to complete. I have two 8.15 Betas running on a GTX660Ti and a GTX770 (W7) that look like taking 9h 12min and 6h 32min respectively. FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help ID: 33931 · Rating: 0 · rate: / Reply Quote
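
The estimate in the post above is a simple proportion: total time is elapsed time divided by the fraction completed. A minimal sketch of that arithmetic, with hypothetical function names and assuming progress advances roughly linearly:

```python
def estimate_total_seconds(elapsed_seconds: float, fraction_done: float) -> float:
    """Extrapolate total runtime from elapsed time and fraction completed.

    Assumes progress is roughly linear in time.
    """
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in the range (0, 1]")
    return elapsed_seconds / fraction_done


def format_hms(seconds: float) -> str:
    """Format a duration in seconds as HH:MM:SS."""
    total = round(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


# Figures from the post: 18.730% done after a runtime of 01:07:40.
elapsed = 1 * 3600 + 7 * 60 + 40
print(format_hms(estimate_total_seconds(elapsed, 0.1873)))
```

With those figures the result comes out at roughly six hours, in line with the estimate given in the post.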

(retired account) Send message Joined: 22 Dec 11 Posts: 38 Credit: 28,606,255 RAC: 0 Level Scientific publications	Message 33937 - Posted: 19 Nov 2013, 6:37:49 UTC Yes, it took 5 hrs. and 58 min., validated ok. ID: 33937 · Rating: 0 · rate: / Reply Quote

Damaraland Send message Joined: 7 Nov 09 Posts: 152 Credit: 16,181,924 RAC: 0 Level Scientific publications	Message 33945 - Posted: 20 Nov 2013, 19:52:28 UTC - in response to Message 33937. Not very sure if you still want this info. Maybe you could be more precise. CUDA: NVIDIA GPU 0: GeForce GTX 260 (driver version 331.65, CUDA version 6.0, compute capability 1.3, 896MB, 818MB available, 912 GFLOPS peak) ACEMD beta version v8.15 (cuda55) 77-KLAUDE_6429-0-2-RND1641_1 Expected to finish in 22h. 83% processed right so far. ID: 33945 · Rating: 0 · rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0 Level Scientific publications	Message 33949 - Posted: 20 Nov 2013, 23:58:27 UTC I've finally had a crash while beta tasks were running ;) I would need to examine the logs more carefully to be certain of the sequence of events, but it seems likely that these two tasks were running (one on each GTX 670) at around 16:50 tonight when the computer froze: I restarted it (hard power off) some 15 minutes later. 1-KLAUDE_6429-1-2-RND1937_0 did not survive the experience. 95-KLAUDE_6429-0-2-RND2489_1 was luckier with its restart. ID: 33949 · Rating: 0 · rate: / Reply Quote
