acemdbeta application - discussion
Author | Message |
---|---|
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 |
Thanks Richard, By coincidence I was just looking into the suspend resume mechanism. I'm going to put out a new beta shortly that should allow more graceful termination, and also make suspend/resume to memory safer. MJH |
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0 |
It was as Jacob explained in post 32786. No error message and not stopping the ACEMD app. BOINC manager said running and the time kept ticking, but no progress. Seems to be almost 2 hours in that state in my case. Nice test from Jacob as well that it is happening from 8.13 onwards. Greetings from TJ |
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 326,008 |
My most recent one (task 7253807) shows a crash and recovery from a `SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1963.` `# SWAN swan_assert 0`, which is pretty impressive. In case you're puzzled by the high frequency of restarts at the beginning of the task: at the moment, I'm restricting BOINC to running only one GPUGrid task at a time (`<max_concurrent>`). If the running task suffers a failure, the next in line gets called forward, and runs for a few seconds. But when the original task is ready to resume, 'high priority' (EDF) forces it to run immediately, and the second task to be swapped out. So, a rather stuttering start, but not the fault of the application. The previous task (7253208) shows a number of `# The simulation has become unstable. Terminating to avoid lock-up (1)` messages, which account for the false starts. |
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 |
The 8.13 app is still spitting out too much temperature data. On this task, I can't see which GPU it started on :( http://www.gpugrid.net/result.php?resultid=7253930 Are the temperature readings that important? If so, then maybe only output temp changes on the current-running-GPU, and even then, condense the text to just say "67*C" instead of "# GPU 0 Current Temp: 67 C" each line? It may even be more ideal to not have each reading on its own line; instead, maybe have a single long line that has temperature fluctuations for the current GPU? I just want to be able to always see what GPU it started on, and which GPUs it was restarted on. The temps are irrelevant to me, but if you want/need them, please find a way to consolidate further. Thanks, Jacob |
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 |
New beta 8.14. Suspend and resume, of either flavour, should now be working without problems. MJH |
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 |
Only maxima printed now |
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 |
8.14 appears to be resuming appropriately from running CPU benchmarks. And I think you should keep the "event notifications" that are in the stderr.txt, they are very very helpful: `# BOINC suspending at user request (thread suspend)` `# BOINC resuming at user request (thread suspend)` `# BOINC suspending at user request (exit)` Great job! I also see you've done some work to condense the temp readings. Thanks for that. If that means "Only printing a temperature reading if it has increased since the start of the run", then that is a GREAT compromise. Do you think you need them for all GPUs? Or could you maybe just limit to the running GPU? `# GPU [GeForce GTX 460] Platform [Windows] Rev [3203] VERSION [55]` `# SWAN Device 1 :` `# Name : GeForce GTX 460` `# ECC : Disabled` `# Global mem : 1024MB` `# Capability : 2.1` `# PCI ID : 0000:08:00.0` `# Device clock : 1526MHz` `# Memory clock : 1900MHz` `# Memory width : 256bit` `# Driver version : r325_00 : 32680` `# GPU 0 : 67C` `# GPU 1 : 66C` `# GPU 2 : 76C` `# GPU 1 : 67C` `# GPU 2 : 77C` |
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 |
The GPU numbering doesn't necessarily correspond to the numbering that the rest of the app uses, so I'm going to leave them all in. MJH |
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 |
Thanks Matt. The work you've done here, especially the suspend/resume work, will greatly improve the stability of people's machines, and the ability to diagnose problems. It is very much appreciated! |
Joined: 16 Aug 08 Posts: 145 Credit: 328,473,995 RAC: 0 |
Despite the GPUGRID "CRASH" test units, my machine continues to produce blue screens and reboot. I need this machine to work with, so I am forced to stop GPUGRID and replace it with SETI Beta. Sorry. @+ *_* |
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 |
Zarck, hello! Don't give up now - I've been watching your tasks and 8.13 has a fix especially for you! MJH |
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 |
Ok folks - last call for feature/mod requests for the beta. Next week I'm moving on to other things. MJH |
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
Performed a CPU Benchmark (with LAIM off). The WU running on the 8.13 app stopped and didn't resume, but the WU on the 8.14 app resumed normally (also with LAIM on). The WU on the 8.13 app resumed when I exited BOINC (and running tasks) and reopened it (not that it's an issue any more with 8.14). BTW, I've experienced this 'task not resuming' issue before, so it wasn't a new one, and benchmarks run periodically (just not often enough to have associated it with a task issue, especially when the tasks had plenty of other issues). FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help |
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 |
"Ok folks - last call for feature/mod requests for the beta." Since you asked. There have been a number of comments about monitoring temperature, which is good. But I have found that cards can crash due to overclocking while still running relatively cool (less than 70C for example). I don't know if BOINC allows you to monitor the actual GPU core speed, but if so that would be worthwhile to report in some form. I don't know that it is high priority for this beta, but maybe the next one. |
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 |
Unsurprising - it's an inevitable consequence of the way the BOINC client library (which we build into our application) goes about doing suspend-resume.[1] I've re-plumbed the whole thing entirely, using a much more reliable method. MJH [1] To paraphrase the old saying: 'Some people, when confronted with a problem, think "I'll use threads". Now they have two problems.' |
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 |
"Ok folks - last call for feature/mod requests for the beta." Can you make it print an ascii rainbow at the end of a successful task? Seriously, though, can't think of much, except maybe: - Format the driver version to say 326.80 instead of 32680 - Add a timestamp with every start/restart block |
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 |
There's a new batch of Beta WUs - "MJHARVEY-CRASHNPT". These test an important feature of the application that we've not been able to use much in the past because it seemed to be contributing towards crashes. The last series of CRASH units has given me good stats on the underlying error rates for control, so this batch should reveal whether there is in fact a bug with the feature. Please report here, particularly if you have a failure mode unlike any you have seen with 8.14 and earlier CRASH units. MJH |
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 326,008 |
First NPT processed with no errors at all - task 7269244. If I get any more, I'll try running them on the 'hot' GPU. |
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 |
I have not had any problems processing the "MJHARVEY-CRASHNPT" units on my stable machine that runs GPUGrid on my GTX 660 Ti and GTX 460. :) I kinda wish the server would stop sending me beta units, but alas, I'm going to keep my settings set at "Give me any unit you think I should do" (aka: all apps checked). It just seems that lately it wants me to do beta! Just wanted to report that it is running smoothly for me. |
Joined: 13 Nov 10 Posts: 328 Credit: 72,619,453 RAC: 0 |
Hello: I am about to finish the Beta task "102-MJHARVEY_CRASHNPT-7-25-RND3270_0" without problems, and what I've noticed is a different behaviour of the CPU usage, at least on Linux. The four cores I have enabled for BOINC show an average load of 23-25% (with no other processes running), although the task indicates the use of 1 CPU + 1 NVIDIA GPU. Clearly the task is executing multi-threaded on the CPU, even with app_config.xml set to use 1 CPU and 1 GPU per task. Note: Completed without problem. |
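
Two of the stderr logging ideas discussed earlier in the thread (printing a GPU temperature only when it is a new maximum, and showing the driver version as 326.80 rather than 32680) can be sketched roughly as below. This is not the ACEMD application's code: the function names `format_driver_version` and `new_maxima` are hypothetical, the sample readings are made up to resemble the log excerpt quoted above, and the sketch assumes the last two digits of the raw driver number are the minor version.

```python
from typing import Dict, Iterable, Iterator, Tuple


def format_driver_version(raw: int) -> str:
    """Render a raw driver number such as 32680 as '326.80'.

    Assumption: the last two digits are the minor version.
    """
    major, minor = divmod(raw, 100)
    return f"{major}.{minor:02d}"


def new_maxima(readings: Iterable[Tuple[int, int]]) -> Iterator[str]:
    """Yield a log line only when a GPU exceeds its previous maximum.

    Each reading is a (gpu_index, temperature_in_celsius) pair. The first
    reading for a GPU always counts as a new maximum.
    """
    highest: Dict[int, int] = {}
    for gpu, temp_c in readings:
        if gpu not in highest or temp_c > highest[gpu]:
            highest[gpu] = temp_c
            yield f"# GPU {gpu} : {temp_c}C"


if __name__ == "__main__":
    # Hypothetical samples: repeats and drops are suppressed.
    samples = [(0, 67), (1, 66), (2, 76), (0, 67), (1, 67), (2, 75), (2, 77)]
    for line in new_maxima(samples):
        print(line)
    print(format_driver_version(32680))
```

Keeping one running maximum per GPU index means a task that moves between GPUs still logs each card separately, which fits MJH's point that the temperature numbering may not match the numbering the rest of the app uses.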
©2025 Universitat Pompeu Fabra