acemdbeta application - discussion

Message boards : News : acemdbeta application - discussion

MJH
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 32790 - Posted: 6 Sep 2013, 12:58:31 UTC - in response to Message 32788.  

Thanks Richard,

By coincidence I was just looking into the suspend/resume mechanism. I'm going to put out a new beta shortly that should allow more graceful termination, and also make suspend/resume to memory safer.

MJH

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Message 32791 - Posted: 6 Sep 2013, 13:01:01 UTC - in response to Message 32780.  

It was as Jacob explained in post 32786: no error message, and the ACEMD app did not stop. BOINC Manager showed the task as running and the elapsed time kept ticking, but there was no progress. In my case it seems to have been in that state for almost 2 hours.
Nice test from Jacob as well, showing that it happens from 8.13 onwards.
Greetings from TJ

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 295,172
Message 32799 - Posted: 6 Sep 2013, 15:09:32 UTC

My most recent one (task 7253807) shows a crash and recovery from a

SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1963.
# SWAN swan_assert 0

which is pretty impressive.

In case you're puzzled by the high frequency of restarts at the beginning of the task: at the moment, I'm restricting BOINC to running only one GPUGrid task at a time ('<max_concurrent>'). If the running task suffers a failure, the next in line gets called forward, and runs for a few seconds. But when the original task is ready to resume, 'high priority' (EDF) forces it to run immediately, and the second task to be swapped out. So, a rather stuttering start, but not the fault of the application.

The previous task (7253208) shows a number of

# The simulation has become unstable. Terminating to avoid lock-up (1)

which account for the false starts.
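
For readers unfamiliar with the setting Richard mentions: BOINC reads a per-project app_config.xml file, in which <max_concurrent> caps how many tasks of one application run at the same time. The Python sketch below only builds such a file as text. The application name "acemdbeta" is an assumption taken from the thread title, and build_app_config is a hypothetical helper, not part of BOINC.

```python
from xml.sax.saxutils import escape


def build_app_config(app_name, max_concurrent):
    """Return app_config.xml text that limits one app to max_concurrent tasks.

    build_app_config is a hypothetical helper written for this sketch.
    """
    return (
        "<app_config>\n"
        "  <app>\n"
        f"    <name>{escape(app_name)}</name>\n"
        f"    <max_concurrent>{int(max_concurrent)}</max_concurrent>\n"
        "  </app>\n"
        "</app_config>\n"
    )


# "acemdbeta" is an assumed application name; check the real short name
# in your own client before using a file like this.
config_text = build_app_config("acemdbeta", 1)
print(config_text)
```

The resulting text would be saved as app_config.xml in the project's folder inside the BOINC data directory, after which the client has to re-read its config files or be restarted.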

Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 32800 - Posted: 6 Sep 2013, 15:46:58 UTC
Last modified: 6 Sep 2013, 15:47:14 UTC

The 8.13 app is still spitting out too much temperature data.

On this task, I can't see which GPU it started on :(
http://www.gpugrid.net/result.php?resultid=7253930

Are the temperature readings that important?
If so, then maybe only output temperature changes for the currently running GPU, and even then, condense the text to just say "67*C" instead of "# GPU 0 Current Temp: 67 C" on each line? It might even be better not to put each reading on its own line; instead, maybe have a single long line that records the temperature fluctuations for the current GPU?

I just want to be able to always see what GPU it started on, and which GPUs it was restarted on. The temps are irrelevant to me, but if you want/need them, please find a way to consolidate further.

Thanks,
Jacob
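
Until the output is condensed, the GPU a task started and restarted on can be pulled out of a long stderr listing with a small filter. This is a rough sketch, assuming each start block contains a line of the form "# SWAN Device N :" as in the stderr excerpts posted in this thread. The function name devices_used and the second device block in the sample text are made up for illustration.

```python
import re

# Matches start-block lines such as "# SWAN Device 1<TAB>:" and captures
# the device index. The line format is assumed from excerpts in this thread.
DEVICE_LINE = re.compile(r"^#\s*SWAN Device\s+(\d+)\s*:", re.MULTILINE)


def devices_used(stderr_text):
    """Return the device index from every '# SWAN Device N :' line, in order."""
    return [int(index) for index in DEVICE_LINE.findall(stderr_text)]


# Hypothetical stderr text: one start on device 1, then a restart on device 0.
sample = (
    "# SWAN Device 1\t:\n"
    "#\tName\t\t: GeForce GTX 460\n"
    "# GPU 0 : 67C\n"
    "# BOINC suspending at user request (exit)\n"
    "# SWAN Device 0\t:\n"
)
print(devices_used(sample))
```

Temperature lines such as "# GPU 0 : 67C" are ignored because they do not contain "SWAN Device".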

MJH
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 32801 - Posted: 6 Sep 2013, 16:03:40 UTC

New beta 8.14. Suspend and resume, of either flavour, should now be working without problems.

MJH

MJH
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 32802 - Posted: 6 Sep 2013, 16:13:43 UTC - in response to Message 32800.  


The 8.13 app is still spitting out too much temperature data.


Only maxima printed now

Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 32803 - Posted: 6 Sep 2013, 16:15:24 UTC - in response to Message 32801.  
Last modified: 6 Sep 2013, 16:16:56 UTC

8.14 appears to be resuming appropriately from running CPU benchmarks.

And I think you should keep the "event notifications" that are in stderr.txt; they are very, very helpful.
# BOINC suspending at user request (thread suspend)
# BOINC resuming at user request (thread suspend)
# BOINC suspending at user request (exit)


Great job!

I also see you've done some work to condense the temp readings. Thanks for that.


The 8.13 app is still spitting out too much temperature data.


Only maxima printed now

If that means "Only printing a temperature reading if it has increased since the start of the run", then that is a GREAT compromise.
Do you think you need them for all GPUs? Or could you maybe just limit to the running GPU?

# GPU [GeForce GTX 460] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 1	:
#	Name		: GeForce GTX 460
#	ECC		: Disabled
#	Global mem	: 1024MB
#	Capability	: 2.1
#	PCI ID		: 0000:08:00.0
#	Device clock	: 1526MHz
#	Memory clock	: 1900MHz
#	Memory width	: 256bit
#	Driver version	: r325_00 : 32680
# GPU 0 : 67C
# GPU 1 : 66C
# GPU 2 : 76C
# GPU 1 : 67C
# GPU 2 : 77C
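
To make the "only maxima" reading concrete, here is a minimal sketch of a filter that prints a temperature only when it exceeds the highest value seen so far for that GPU. This is an interpretation of the behaviour discussed above, not the application's actual code. The function name new_maxima and the list of readings are hypothetical; the readings were chosen so that the output has the same shape as the excerpt above.

```python
def new_maxima(readings):
    """Yield a log line each time a GPU's temperature sets a new maximum.

    readings is an iterable of (gpu_index, temperature_in_celsius) pairs.
    """
    highest = {}  # gpu_index -> highest temperature seen so far
    for gpu, temp_c in readings:
        if gpu not in highest or temp_c > highest[gpu]:
            highest[gpu] = temp_c
            yield f"# GPU {gpu} : {temp_c}C"


# Hypothetical polling results; repeated or lower values produce no output.
readings = [(0, 67), (1, 66), (2, 76), (0, 66), (1, 67), (2, 77), (0, 67)]
for line in new_maxima(readings):
    print(line)
```

With these readings the filter emits five lines for seven polls: the first reading for each GPU, then one line each when GPU 1 and GPU 2 rise by a degree.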

MJH
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 32804 - Posted: 6 Sep 2013, 16:17:56 UTC - in response to Message 32803.  


Do you think you need them for all GPUs? Or could you maybe just limit to the running GPU?


The GPU numbering doesn't necessarily correspond to the numbering that the rest of the app uses, so I'm going to leave them all in.

MJH

Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 32806 - Posted: 6 Sep 2013, 16:19:18 UTC - in response to Message 32804.  
Last modified: 6 Sep 2013, 16:19:26 UTC

Thanks Matt.
The work you've done here, especially the suspend/resume work, will greatly improve the stability of people's machines, and the ability to diagnose problems. It is very much appreciated!

Zarck
Joined: 16 Aug 08
Posts: 145
Credit: 328,473,995
RAC: 0
Message 32807 - Posted: 6 Sep 2013, 17:01:08 UTC - in response to Message 32806.  
Last modified: 6 Sep 2013, 17:01:42 UTC

Despite the GPUGRID "Crash" test units, my machine continues to produce blue screens and reboot. I need this machine for work, so I am forced to stop GPUGRID and replace it with Seti Beta.

Sorry.

@+
*_*

MJH
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 32808 - Posted: 6 Sep 2013, 17:05:18 UTC - in response to Message 32807.  

Zarck, hello! Don't give up now - I've been watching your tasks and 8.13 has a fix especially for you!

MJH

MJH
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 32810 - Posted: 6 Sep 2013, 18:18:42 UTC

Ok folks - last call for feature/mod requests for the beta.
Next week I'm moving on to other things.

MJH

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 32811 - Posted: 6 Sep 2013, 18:33:25 UTC - in response to Message 32808.  

Performed a CPU Benchmark (with LAIM off). The WU running on the 8.13 app stopped and didn't resume, but the WU on the 8.14 app resumed normally (also with LAIM on).

The WU resumed on the 8.13 app when I exited BOINC (and running tasks) and reopened it (not that it's an issue any more with 8.14).

BTW. I've experienced this 'task not resuming' issue before, so it wasn't a new one, and benchmarks run periodically (just not often enough to have associated it with a task issue, especially when the tasks had plenty of other issues).

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 32812 - Posted: 6 Sep 2013, 18:34:28 UTC - in response to Message 32810.  

Ok folks - last call for feature/mod requests for the beta.

Since you asked. There have been a number of comments about monitoring temperature, which is good. But I have found that cards can crash due to overclocking while still running relatively cool (less than 70C for example). I don't know if BOINC allows you to monitor the actual GPU core speed, but if so that would be worthwhile to report in some form. I don't know that it is high priority for this beta, but maybe the next one.

MJH
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 32813 - Posted: 6 Sep 2013, 18:37:49 UTC - in response to Message 32811.  
Last modified: 6 Sep 2013, 18:41:53 UTC


BTW. I've experienced this 'task not resuming' issue before, so it wasn't a new one, and benchmarks run periodically (just not often enough to have associated it with a task issue, especially when the tasks had plenty of other issues).


Unsurprising - it's an inevitable consequence of the way the BOINC client library (which we build into our application) goes about doing suspend/resume [1]. I've re-plumbed the whole thing, using a much more reliable method.

MJH

[1] To paraphrase the old saying - 'Some people, when confronted with a problem, think "I'll use threads". Now they have two problems'.
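
MJH does not say what the replacement method is, so the following is not a description of the application. It is only a generic sketch of one common alternative to suspending a worker thread from outside: the worker checks a flag at safe points between steps and pauses or stops itself there, so it is never frozen in the middle of an operation. The names RunControl and simulate are invented for this example.

```python
import threading


class RunControl:
    """Flags set by a controller and polled by the worker at safe points."""

    def __init__(self):
        self._runnable = threading.Event()
        self._runnable.set()           # start in the running state
        self._quit = threading.Event()

    def suspend(self):
        self._runnable.clear()

    def resume(self):
        self._runnable.set()

    def quit(self):
        self._quit.set()
        self._runnable.set()           # wake a suspended worker so it can exit

    def wait_until_runnable(self):
        """Block while suspended; return False if the worker should exit."""
        self._runnable.wait()
        return not self._quit.is_set()


def simulate(control, steps, log):
    """Run up to `steps` steps, checking for suspend/quit before each one."""
    for step in range(steps):
        if not control.wait_until_runnable():
            log.append(f"stopped cleanly before step {step}")
            return step
        log.append(f"step {step}")     # stand-in for one unit of real work
    return steps


control = RunControl()
log = []
control.suspend()                      # worker will block at its first check
worker = threading.Thread(target=simulate, args=(control, 3, log))
worker.start()
control.resume()                       # worker proceeds from the safe point
worker.join()
print(log)
```

The trade-off is latency: a suspend request only takes effect at the next check, so the checks have to sit between units of work that are short enough.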

Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 32814 - Posted: 6 Sep 2013, 19:06:53 UTC - in response to Message 32810.  
Last modified: 6 Sep 2013, 19:07:07 UTC

Ok folks - last call for feature/mod requests for the beta.
Next week I'm moving on to other things.

MJH

Can you make it print an ascii rainbow at the end of a successful task?

Seriously, though, can't think of much, except maybe
- Format the driver version to say 326.80 instead of 32680
- Add a timestamp with every start/restart block
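
Both requests are small formatting changes. As an illustration only, here is how they might look in Python, assuming the raw driver number always carries the minor version in its last two digits (32680 becomes 326.80). The function names and the exact header layout are hypothetical, not taken from the application.

```python
from datetime import datetime, timezone


def format_driver_version(raw):
    """Format an integer such as 32680 as '326.80'.

    Assumes the last two digits are the minor part of the version.
    """
    major, minor = divmod(int(raw), 100)
    return f"{major}.{minor:02d}"


def start_block_header(raw_driver, now=None):
    """Build a hypothetical start/restart header line with a UTC timestamp."""
    if now is None:
        now = datetime.now(timezone.utc)
    stamp = now.strftime("%Y-%m-%d %H:%M:%S UTC")
    return f"# [{stamp}] Driver version : {format_driver_version(raw_driver)}"


print(format_driver_version(32680))
print(start_block_header(32680))
```

Padding the minor part to two digits matters: a raw value of 32005 should read 320.05, not 320.5.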

MJH
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 32875 - Posted: 10 Sep 2013, 22:39:45 UTC
Last modified: 10 Sep 2013, 22:54:41 UTC

There's a new batch of Beta WUs - "MJHARVEY-CRASHNPT". These test an important feature of the application that we've not been able to use much in the past because it seemed to be contributing towards crashes. The last series of CRASH units has given me good stats on the underlying error rates for control, so this batch should reveal whether there is in fact a bug with the feature.

Please report here, particularly if you have a failure mode unlike any you have seen with 8.14 and earlier CRASH units.

MJH

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 295,172
Message 32887 - Posted: 11 Sep 2013, 13:37:37 UTC - in response to Message 32875.  

First NPT processed with no errors at all - task 7269244. If I get any more, I'll try running them on the 'hot' GPU.

Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 32947 - Posted: 14 Sep 2013, 1:04:21 UTC
Last modified: 14 Sep 2013, 2:03:07 UTC

I have not had any problems processing the "MJHARVEY-CRASHNPT" units on my stable machine that runs GPUGrid on my GTX 660 Ti and GTX 460.

:) I kinda wish the server would stop sending me beta units, but alas, I'm going to keep my settings set at "Give me any unit you think I should do" (aka: all apps checked). It just seems that lately it wants me to do beta!

Just wanted to report that it is running smoothly for me.

Carlesa25
Joined: 13 Nov 10
Posts: 328
Credit: 72,619,453
RAC: 0
Message 32951 - Posted: 14 Sep 2013, 14:00:08 UTC
Last modified: 14 Sep 2013, 14:23:36 UTC

Hello: the Beta task "102-MJHARVEY_CRASHNPT-7-25-RND3270_0" is about to finish without problems, and what I've noticed is a different behaviour of the CPU usage, at least on Linux.

The four cores I have enabled for BOINC show an average load of 23-25% (with no other processes running), although the task indicates the use of 1 CPU and 1 NVIDIA GPU. Clearly the task is being executed multi-threaded on the CPU, even with app_config.xml set to use 1 CPU and 1 GPU per task.

Note: Completed without problem.

©2025 Universitat Pompeu Fabra