acemdbeta application

Author	Message
MJH · Joined: 12 Nov 07 · Posts: 696 · Credit: 27,266,655 · RAC: 0
Message 32790 - Posted: 6 Sep 2013, 12:58:31 UTC - in response to Message 32788.
Thanks Richard,
By coincidence I was just looking into the suspend/resume mechanism. I'm going to put out a new beta shortly that should allow more graceful termination, and also make suspend/resume to memory safer.
MJH

TJ · Joined: 26 Jun 09 · Posts: 815 · Credit: 1,470,385,294 · RAC: 0
Message 32791 - Posted: 6 Sep 2013, 13:01:01 UTC - in response to Message 32780.
It was as Jacob explained in post 32786. No error message, and the ACEMD app did not stop. BOINC Manager said "running" and the time kept ticking, but there was no progress. It seems to have been in that state for almost 2 hours in my case.
Nice test from Jacob as well, showing that it has been happening from 8.13 onwards.
Greetings from TJ

Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 351
Message 32799 - Posted: 6 Sep 2013, 15:09:32 UTC
My most recent one (task 7253807) shows a crash and recovery from a
SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1963.
# SWAN swan_assert 0
which is pretty impressive.
In case you're puzzled by the high frequency of restarts at the beginning of the task: at the moment, I'm restricting BOINC to running only one GPUGrid task at a time ('<max_concurrent>'). If the running task suffers a failure, the next in line gets called forward, and runs for a few seconds. But when the original task is ready to resume, 'high priority' (EDF) forces it to run immediately, and the second task to be swapped out. So, a rather stuttering start, but not the fault of the application.
The previous task (7253208) shows a number of
# The simulation has become unstable. Terminating to avoid lock-up (1)
which account for the false starts.

Jacob Klein · Joined: 11 Oct 08 · Posts: 1127 · Credit: 1,901,927,545 · RAC: 0
Message 32800 - Posted: 6 Sep 2013, 15:46:58 UTC
Last modified: 6 Sep 2013, 15:47:14 UTC
The 8.13 app is still spitting out too much temperature data. On this task, I can't see which GPU it started on :(
http://www.gpugrid.net/result.php?resultid=7253930
Are the temperature readings that important? If so, then maybe only output temp changes on the current-running-GPU, and even then, condense the text to just say "67*C" instead of "# GPU 0 Current Temp: 67 C" each line? It may even be more ideal to not have each reading on its own line; instead, maybe have a single long line that has temperature fluctuations for the current GPU?
I just want to be able to always see what GPU it started on, and which GPUs it was restarted on. The temps are irrelevant to me, but if you want/need them, please find a way to consolidate further.
Thanks,
Jacob

MJH · Joined: 12 Nov 07 · Posts: 696 · Credit: 27,266,655 · RAC: 0
Message 32801 - Posted: 6 Sep 2013, 16:03:40 UTC
New beta 8.14. Suspend and resume, of either flavour, should now be working without problems.
MJH

MJH · Joined: 12 Nov 07 · Posts: 696 · Credit: 27,266,655 · RAC: 0
Message 32802 - Posted: 6 Sep 2013, 16:13:43 UTC - in response to Message 32800.
"The 8.13 app is still spitting out too much temperature data."
Only maxima printed now.

Jacob Klein · Joined: 11 Oct 08 · Posts: 1127 · Credit: 1,901,927,545 · RAC: 0
Message 32803 - Posted: 6 Sep 2013, 16:15:24 UTC - in response to Message 32801.
Last modified: 6 Sep 2013, 16:16:56 UTC
8.14 appears to be resuming appropriately from running CPU benchmarks.
And I think you should keep the "event notifications" that are in the stderr.txt; they are very, very helpful.
# BOINC suspending at user request (thread suspend)
# BOINC resuming at user request (thread suspend)
# BOINC suspending at user request (exit)
Great job!
I also see you've done some work to condense the temp readings. Thanks for that.
"The 8.13 app is still spitting out too much temperature data."
"Only maxima printed now"
If that means "Only printing a temperature reading if it has increased since the start of the run", then that is a GREAT compromise.
Do you think you need them for all GPUs? Or could you maybe just limit to the running GPU?
# GPU [GeForce GTX 460] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:08:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : r325_00 : 32680
# GPU 0 : 67C
# GPU 1 : 66C
# GPU 2 : 76C
# GPU 1 : 67C
# GPU 2 : 77C
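
The "only maxima printed" behaviour discussed above can be illustrated with a minimal sketch. This is not the acemd code; the function name `new_maxima` and the dictionary-based bookkeeping are assumptions made for illustration, and only the output line format is taken from the stderr excerpt in the post.

```python
def new_maxima(readings, max_seen):
    """Return log lines only for GPUs whose temperature exceeds the
    highest value recorded so far (hypothetical helper, not from acemd)."""
    lines = []
    for gpu, temp in sorted(readings.items()):
        if temp > max_seen.get(gpu, float("-inf")):
            max_seen[gpu] = temp  # remember the new maximum for this GPU
            lines.append(f"# GPU {gpu} : {temp}C")
    return lines


# Two polling rounds using the temperatures from the stderr excerpt.
max_seen = {}
first = new_maxima({0: 67, 1: 66, 2: 76}, max_seen)
second = new_maxima({0: 67, 1: 67, 2: 77}, max_seen)
print("\n".join(first + second))
```

In the second round GPU 0 is unchanged at 67C, so it produces no line, which matches the five temperature lines shown in the post.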

MJH · Joined: 12 Nov 07 · Posts: 696 · Credit: 27,266,655 · RAC: 0
Message 32804 - Posted: 6 Sep 2013, 16:17:56 UTC - in response to Message 32803.
"Do you think you need them for all GPUs? Or could you maybe just limit to the running GPU?"
The GPU numbering doesn't necessarily correspond to the numbering that the rest of the app uses, so I'm going to leave them all in.
MJH

Jacob Klein · Joined: 11 Oct 08 · Posts: 1127 · Credit: 1,901,927,545 · RAC: 0
Message 32806 - Posted: 6 Sep 2013, 16:19:18 UTC - in response to Message 32804.
Last modified: 6 Sep 2013, 16:19:26 UTC
Thanks Matt.
The work you've done here, especially the suspend/resume work, will greatly improve the stability of people's machines, and the ability to diagnose problems. It is very much appreciated!

Zarck · Joined: 16 Aug 08 · Posts: 145 · Credit: 328,473,995 · RAC: 0
Message 32807 - Posted: 6 Sep 2013, 17:01:08 UTC - in response to Message 32806.
Last modified: 6 Sep 2013, 17:01:42 UTC
Despite the GPUGRID "CRASH" test units, my machine continues to produce blue screens and reboot. I need this machine to work with, so I am forced to stop GPUGRID and replace it with SETI Beta.
Sorry.
@+

MJH · Joined: 12 Nov 07 · Posts: 696 · Credit: 27,266,655 · RAC: 0
Message 32808 - Posted: 6 Sep 2013, 17:05:18 UTC - in response to Message 32807.
Zarck, hello! Don't give up now - I've been watching your tasks and 8.13 has a fix especially for you!
MJH

MJH · Joined: 12 Nov 07 · Posts: 696 · Credit: 27,266,655 · RAC: 0
Message 32810 - Posted: 6 Sep 2013, 18:18:42 UTC
Ok folks - last call for feature/mod requests for the beta.
Next week I'm moving on to other things.
MJH

skgiven · Volunteer moderator · Volunteer tester · Joined: 23 Apr 09 · Posts: 3968 · Credit: 1,995,359,260 · RAC: 0
Message 32811 - Posted: 6 Sep 2013, 18:33:25 UTC - in response to Message 32808.
Performed a CPU benchmark (with LAIM off). The WU running on the 8.13 app stopped and didn't resume, but the WU on the 8.14 app resumed normally (also with LAIM on).
The WU on the 8.13 app resumed when I exited BOINC (and running tasks) and reopened it (not that it's an issue any more with 8.14).
BTW, I've experienced this 'task not resuming' issue before, so it wasn't a new one, and benchmarks run periodically (just not often enough to have associated it with a task issue, especially when the tasks had plenty of other issues).
FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help

Jim1348 · Joined: 28 Jul 12 · Posts: 819 · Credit: 1,591,285,971 · RAC: 0
Message 32812 - Posted: 6 Sep 2013, 18:34:28 UTC - in response to Message 32810.
"Ok folks - last call for feature/mod requests for the beta."
Since you asked. There have been a number of comments about monitoring temperature, which is good. But I have found that cards can crash due to overclocking while still running relatively cool (less than 70C, for example). I don't know if BOINC allows you to monitor the actual GPU core speed, but if so, that would be worthwhile to report in some form. I don't know that it is high priority for this beta, but maybe the next one.

MJH · Joined: 12 Nov 07 · Posts: 696 · Credit: 27,266,655 · RAC: 0
Message 32813 - Posted: 6 Sep 2013, 18:37:49 UTC - in response to Message 32811.
Last modified: 6 Sep 2013, 18:41:53 UTC
"BTW. I've experienced this 'task not resuming' issue before, so it wasn't a new one, and benchmarks run periodically (just not often enough to have associated it with a task issue, especially when the tasks had plenty of other issues)."
Unsurprising - it's an inevitable consequence of the way the BOINC client library (which we build into our application) goes about doing suspend-resume.[1] I've re-plumbed the whole thing entirely, using a much more reliable method.
MJH
[1] To paraphrase the old saying - 'Some people, when confronted with a problem, think "I'll use threads". Now they have two problems'.
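
The post does not say which method 8.14 uses, so the following is only a generic sketch of one common alternative to suspending a worker thread from outside: the compute loop itself checks a flag at safe points and blocks there until it is allowed to continue. The class name `CooperativeWorker` and all of its methods are hypothetical and are not part of the BOINC client library or of acemd.

```python
import threading


class CooperativeWorker:
    """Compute loop that pauses only at well-defined safe points
    (illustrative sketch; not the BOINC or acemd implementation)."""

    def __init__(self):
        self._run_allowed = threading.Event()
        self._run_allowed.set()          # start in the "running" state
        self._stop = threading.Event()
        self.steps_done = 0

    def suspend(self):
        self._run_allowed.clear()        # loop blocks at its next safe point

    def resume(self):
        self._run_allowed.set()

    def stop(self):
        self._stop.set()
        self._run_allowed.set()          # wake a suspended loop so it can exit

    def run(self, total_steps):
        while self.steps_done < total_steps and not self._stop.is_set():
            self._run_allowed.wait()     # safe point: no work is in flight here
            if self._stop.is_set():
                break
            self.steps_done += 1         # stand-in for one simulation step
```

Because the loop only pauses between steps, it never stops while holding a lock or in the middle of a device call, which is the usual hazard when a thread is frozen from outside.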

Jacob Klein · Joined: 11 Oct 08 · Posts: 1127 · Credit: 1,901,927,545 · RAC: 0
Message 32814 - Posted: 6 Sep 2013, 19:06:53 UTC - in response to Message 32810.
Last modified: 6 Sep 2013, 19:07:07 UTC
"Ok folks - last call for feature/mod requests for the beta. Next week I'm moving on to other things. MJH"
Can you make it print an ascii rainbow at the end of a successful task?
Seriously, though, can't think of much, except maybe:
- Format the driver version to say 326.80 instead of 32680
- Add a timestamp with every start/restart block
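
Both requests above are small formatting changes, sketched here for illustration. The sketch assumes the raw driver value always carries the minor version in its last two digits, which is inferred only from the single 32680 example; the function names and the wording of the restart line are hypothetical and do not come from the application.

```python
from datetime import datetime, timezone


def format_driver_version(raw):
    """Turn a raw value such as 32680 into '326.80'.
    Assumes the last two digits are the minor version."""
    digits = str(raw)
    if not digits.isdigit() or len(digits) < 3:
        return digits                    # leave unexpected values untouched
    return f"{digits[:-2]}.{digits[-2:]}"


def restart_header(now=None):
    """Build a timestamp line for a start/restart block (hypothetical wording)."""
    now = now or datetime.now(timezone.utc)
    return f"# Restarted at {now.strftime('%Y-%m-%d %H:%M:%S')} UTC"


print(format_driver_version(32680))
print(restart_header())
```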

MJH · Joined: 12 Nov 07 · Posts: 696 · Credit: 27,266,655 · RAC: 0
Message 32875 - Posted: 10 Sep 2013, 22:39:45 UTC
Last modified: 10 Sep 2013, 22:54:41 UTC
There's a new batch of Beta WUs - "MJHARVEY-CRASHNPT". These test an important feature of the application that we've not been able to use much in the past because it seemed to be contributing towards crashes. The last series of CRASH units has given me good stats on the underlying error rates for control, so this batch should reveal whether there is in fact a bug with the feature. Please report here, particularly if you have a failure mode unlike any you have seen with 8.14 and earlier CRASH units.
MJH

Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 351
Message 32887 - Posted: 11 Sep 2013, 13:37:37 UTC - in response to Message 32875.
First NPT processed with no errors at all - task 7269244. If I get any more, I'll try running them on the 'hot' GPU.

Jacob Klein · Joined: 11 Oct 08 · Posts: 1127 · Credit: 1,901,927,545 · RAC: 0
Message 32947 - Posted: 14 Sep 2013, 1:04:21 UTC
Last modified: 14 Sep 2013, 2:03:07 UTC
I have not had any problems processing the "MJHARVEY-CRASHNPT" units on my stable machine that runs GPUGrid on my GTX 660 Ti and GTX 460. :)
I kinda wish the server would stop sending me beta units, but alas, I'm going to keep my settings set at "Give me any unit you think I should do" (aka: all apps checked). It just seems that lately it wants me to do beta!
Just wanted to report that it is running smoothly for me.

Carlesa25 · Joined: 13 Nov 10 · Posts: 328 · Credit: 72,619,453 · RAC: 0
Message 32951 - Posted: 14 Sep 2013, 14:00:08 UTC
Last modified: 14 Sep 2013, 14:23:36 UTC
Hello: I am about to finish, without problems, the Beta task "102-MJHARVEY_CRASHNPT-7-25-RND3270_0", and what I've noticed is a different CPU usage behaviour, at least on Linux.
The four cores I have enabled for BOINC show an average load of 23-25% (with no other processes running), although the task indicates the use of 1 CPU + 1 NVIDIA GPU. Clearly the task is executing in multi-threaded form on the CPU, even with app_config.xml set to use 1 CPU and 1 GPU per task.
Note: Completed without problem.
