acemdshort application 8.15 - discussion

dskagcommunity
Message 32294 - Posted: 26 Aug 2013, 15:48:07 UTC

Wow, even on Fermi the new app is ~120 secs faster ^^

So, seems good from this front.
DSKAG Austria Research Team: http://www.research.dskag.at



5pot
Message 32295 - Posted: 26 Aug 2013, 16:50:21 UTC

Shouldn't there be a tick box for CUDA 5.5 under short runs, or am I missing something?

MJH
Message 32301 - Posted: 26 Aug 2013, 18:21:13 UTC - in response to Message 32295.  

It's selected automatically, based on client driver version.

MJH

5pot
Message 32303 - Posted: 26 Aug 2013, 18:55:57 UTC

Looking like 90 min for the short runs.

Retvari Zoltan
Message 32305 - Posted: 26 Aug 2013, 19:31:17 UTC

I think that renaming threads is not nice.
Anyway, it's good news that you put the new app to the short queue.
When do you plan to put it in the long queue too?

MJH
Message 32306 - Posted: 26 Aug 2013, 19:55:46 UTC - in response to Message 32305.  


When do you plan to put it in the long queue too?


In a week or so.

MJH

dskagcommunity
Message 32310 - Posted: 26 Aug 2013, 23:30:43 UTC

OK, it's good for cc 1.3 cards too. I tried the GTX 285, which I had normally retired from GPUGrid, but as a test it's good enough ^^ 26,241.59 secs = 7.3 h
DSKAG Austria Research Team: http://www.research.dskag.at



Zarck
Message 32466 - Posted: 29 Aug 2013, 15:07:42 UTC - in response to Message 32310.  
Last modified: 29 Aug 2013, 15:08:05 UTC

GPU load was at 0% and progress stayed at 0%, so I abandoned the unit after 45 minutes.

http://www.gpugrid.net/result.php?resultid=7221739

@+
*_*

TJ
Message 32469 - Posted: 29 Aug 2013, 15:17:36 UTC - in response to Message 32466.  

You need app 8.01, and then the NOELIAs run as smoothly as ever.
Greetings from TJ

Jacob Klein
Message 32483 - Posted: 29 Aug 2013, 17:40:03 UTC - in response to Message 32469.  

8.02 makes NOELIA tasks run even smootherer

Retvari Zoltan
Message 32559 - Posted: 31 Aug 2013, 0:00:54 UTC

There is an 8.04 app in the Beta queue.
I've received it along with some TEST14 workunits.

MJH
Message 32681 - Posted: 4 Sep 2013, 10:39:58 UTC

There is a new acemdshort application, version 8.11 (Windows only). Since this is (hopefully!) the last app revision, now's a good time to summarise the changes in the 800 series over the older app:

* SM3.5 support for Titan, Geforce 780, etc.
* CUDA 4.2 and CUDA 5.5 builds, automatically assigned based on client driver version. This represents the first step in deprecating CUDA 4.2 and moving exclusively to CUDA 5.5.
* Improved stability: Fixed several bugs that caused significant rates of compute errors.
* Reduced driver crashes: Reduced incidence of driver hangs on suspend. The problem is not yet eliminated totally.
* Improved reporting: GPU stats and temperatures now reported in the stderr. Error codes cleaned up to give better data on failure modes.
* Application bug-fixes: Many fixes and enhancements for the science.

MJH

MJH
Message 32700 - Posted: 4 Sep 2013, 15:29:41 UTC - in response to Message 32681.  
Last modified: 5 Sep 2013, 16:18:31 UTC

Here is a list of compute error codes for the 8xx series applications and their meanings. If you encounter a new one, or have a question or observation about the circumstances of an error, please PM me.

* 255 See -1

* 247 See -9

* 212 See -44

* 197 The WU took much longer to complete than the client was expecting and so it was terminated. Indicates a WU misconfiguration. If recurrent, try re-attaching to the project.

* 194 Unknown. ("finish file present too long")

* 193 Unknown. (Segfault on Linux)

* 159 See -97

* 98 See -9

* -1 Unknown

* -9 The GPU compute capability is not supported by the application; for example a pre cc 1.3 G80.

* -44 The computer's date is wrong.

* -80 Failed to recover after an access violation (Win32)

* -97 "Simulation has become unstable". This indicates that the scientific simulation that the application performs has gone wrong. If this happens as soon as the WU starts, there may be a problem with the WU. If it happens frequently or after the WU has made some progress, a hardware problem is strongly indicated. Check GPU temperatures (now reported in stderr) and for the presence of other GPU-using programs (eg games).

* -185 Application doesn't start, "Access denied" error in the stderr. Check that the client has downloaded the application correctly - if unsure re-attach to the project. Could also be caused by antivirus preventing BOINC starting new processes.

* -226 "Too many exits" The app repeatedly exited without indicating to BOINC that the WU was complete and was restarted, until a limit was reached. Cause unclear.

* -1073741515 (Windows only) The application failed to initialize properly. Indicates missing DLLs. Re-attach to the project to force the application and its support DLLs to be re-downloaded. Ensure that the VS2008 redistributables are installed: http://www.microsoft.com/en-us/download/details.aspx?id=29

* -1073741819 (Windows only) Access violation (Segmentation fault). The application made an illegal memory access. These seem mostly to come from inside the Nvidia driver but root cause(s) unknown. If occurring repeatedly, reboot machine.
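
The "See" pairs in the list above (255/-1, 247/-9, 212/-44, 159/-97) are consistent with the same exit code being shown once as a signed value and once as an unsigned 8-bit value. A minimal sketch of that arithmetic, assuming plain 8-bit wrap-around (the function name is only for illustration, and the 98 entry does not fit this pattern):

```python
def as_signed_8bit(status: int) -> int:
    """Interpret an exit status in 0..255 as a signed 8-bit value."""
    if not 0 <= status <= 255:
        raise ValueError("expected a value in 0..255")
    # Values of 128 and above wrap around to negative numbers.
    return status - 256 if status >= 128 else status


# The positive codes from the list that point at a negative code.
for unsigned in (255, 247, 212, 159):
    print(unsigned, "->", as_signed_8bit(unsigned))
```

So a status of 159 on one platform and -97 on another would point to the same "Simulation has become unstable" condition.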

TJ
Message 32719 - Posted: 5 Sep 2013, 9:49:00 UTC

The betas are gone and the SANTIs have started to error again on my GTX 660.
Greetings from TJ

skgiven
Volunteer moderator
Volunteer tester
Message 32720 - Posted: 5 Sep 2013, 10:17:26 UTC - in response to Message 32719.  
Last modified: 5 Sep 2013, 10:27:08 UTC

Short runs (2-3 hours on fastest card) v8.11 (cuda55)

Outcome Computation error
Client state Compute error
Exit status -97 (0xffffffffffffff9f) Unknown error number

* -97 "Simulation has become unstable". This indicates that the scientific simulation that the application performs has gone wrong. If this happens as soon as the WU starts, there may be a problem with the WU. If it happens frequently or after the WU has made some progress, a hardware problem is strongly indicated. Check GPU temperatures (now reported in stderr) and for the presence of other GPU-using programs (eg games) - MJH

Your temps look reasonable (mostly around 66°C). I suggest you restart the system, and if errors continue to occur look into what else might be causing this problem (games, video programs, antivirus scans, updates...). You might want to note the failure time and check your logs to see what was happening at that time or just before.

Both times you had the error, the stderr log ends with:
# GPU 0 Current Temp: 64 C
# The simulation has become unstable. Terminating to avoid lock-up (1)

The slight GPU temperature drop from 66°C to 64°C might indicate resource consumption by something else on your system just before the WU was ended, or the GPU temperature might just have dropped as the WU was ended?
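
To compare the temperature just before a failure with the rest of the run, the temperature lines can be pulled out of the stderr text. A rough sketch, assuming every temperature line looks exactly like the one quoted above; the helper name and the regular expression are my own and not part of the app:

```python
import re

# Matches lines such as "# GPU 0 Current Temp: 64 C".
TEMP_LINE = re.compile(r"^#\s*GPU\s+(\d+)\s+Current Temp:\s*(\d+)\s*C\s*$")


def gpu_temps(stderr_text: str) -> list:
    """Return (gpu_index, temperature_in_C) pairs in the order they appear."""
    temps = []
    for line in stderr_text.splitlines():
        match = TEMP_LINE.match(line.strip())
        if match:
            temps.append((int(match.group(1)), int(match.group(2))))
    return temps


# Sample text: the first line is made up for illustration,
# the last two are the lines quoted from the failed task.
sample = """# GPU 0 Current Temp: 66 C
# GPU 0 Current Temp: 64 C
# The simulation has become unstable. Terminating to avoid lock-up (1)"""

print(gpu_temps(sample))
```

With the full stderr of a task, the last few pairs show whether the temperature was already falling before the "unstable" message appeared.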

MJH
Message 32721 - Posted: 5 Sep 2013, 10:56:05 UTC - in response to Message 32720.  


* -97 "Simulation has become unstable".


The new app does a much better job of determining when a WU has gone bad and aborting it. Previously this might have manifested itself as a non-specific crash/driver reset.

In some circumstances it may be possible to attempt recovery from this failure. Expect a new beta trying an idea out later today.

MJH

TJ
Message 32723 - Posted: 5 Sep 2013, 11:21:32 UTC - in response to Message 32720.  

Thanks for the information skgiven (and moving my post).

No games or anything like that on this machine, only 6 Rosetta WUs and 2 cores free for GPUGRID. It has two Xeon processors, so these are real cores; no HT on these oldies.
In the rare event that I am at the system and see a WU fail, I did notice a temperature drop of the GPU. This is normal, of course, as it is no longer working hard. It will not drop much, as a new WU starts again.

AV is F-Secure and this is no longer a problem for BOINC. I have been in very close contact with someone from their main office for a few months to get everything working after their update. I did a lot of testing for them and they even ran Rosetta themselves for a month to get it working. So that is no issue. I have it free with my ISP subscription and have been using it for a little longer than 2 years now, and for the last 8 months it has never been an issue on any PC for any project.

The system was restarted just before night as a WU had downclocked my GPU.

I have set this system to beta to test the new Harveys that Matt will bring in shortly.
Greetings from TJ

Carlesa25
Message 32735 - Posted: 5 Sep 2013, 15:39:28 UTC
Last modified: 5 Sep 2013, 15:40:43 UTC

Hello: I have finished an 8.11 task successfully, but I want to comment on some weird stuff.

Task: 9x2-SANTI_MAR4222-4-25-RND9976_0 - Runtime 8317.47 seconds, on my GTX 770 and FX8350 CPU at 4.4 GHz.

stderr output
<core_client_version>7.2.11</core_client_version>
<![CDATA[
<stderr_txt>
mp: 59 C
# GPU 0 Current Temp: 60 C
# GPU 0 Current Temp: 59 C
# GPU 0 Current Temp: 59 C
# GPU 0 Current Temp: 59 ....... and more.. more GPU temperature records ...?? ending with:

# GPU 0 Current Temp: 61 C
# Time per step (avg over 3000000 steps): 2.771 ms
# Approximate elapsed time for Entire WU: 8313.984 s
called boinc_finish

</stderr_txt>
]]>

The execution times seem too long to me compared with other short tasks. I think something is not working very well. Greetings.
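
As a quick cross-check, the elapsed time in the log follows from the step count and the average step time reported in the same stderr. A small sketch of the arithmetic, using only the two numbers from the log:

```python
steps = 3_000_000      # "avg over 3000000 steps"
ms_per_step = 2.771    # "Time per step ... 2.771 ms"

# Total time in seconds, then in hours.
elapsed_s = steps * ms_per_step / 1000.0
print(f"{elapsed_s:.0f} s = {elapsed_s / 3600:.2f} h")
```

That gives about 8313 s, roughly 2.3 hours, which agrees with the logged 8313.984 s up to the rounding of the average step time.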

TJ
Message 32737 - Posted: 5 Sep 2013, 17:27:37 UTC - in response to Message 32735.  

The execution times seem too long to me compared with other short tasks. I think something is not working very well. Greetings.


That is correct, Carlesa25, there are problems with it. There is a lengthy thread about this. It's this one: http://www.gpugrid.net/forum_thread.php?id=3450
If you start reading at the first post, you will quickly understand what is wrong.
Greetings from TJ

Jacob Klein
Message 32743 - Posted: 5 Sep 2013, 21:09:25 UTC - in response to Message 32700.  
Last modified: 5 Sep 2013, 21:14:38 UTC

I just had an 8.13 task result in:
-1073741819 (0xffffffffc0000005)
Full details below.

I think the cause was that an 8.11 NOELIA_KLEBEbeta (which I was suspending to test) TDR'd the drivers, and this new task segfaulted trying to get started.

We still have some work to do on resolving the suspending of tasks. I've noticed that they continue running for up to 15 seconds even after being suspended. I think I remember Einstein having a problem that might be related, via
http://einstein.phys.uwm.edu/forum_thread.php?id=10141
... I'm trying to dig up the BOINC API fix (from early June 2013) that solved it.

EDIT:
I found it. Check out this fix.

Would it be applicable towards helping us (GPUGrid project and users) suspend more correctly?
http://boinc.berkeley.edu/trac/changeset/b98bc309cceccf95b9fac578c47cbea06a8b8150/boinc-v2
If so, can you implement it?



http://www.gpugrid.net/result.php?resultid=7250888

Name 178-MJHARVEY_CRASH1-0-25-RND2676_0
Workunit 4754442
Created 5 Sep 2013 | 14:05:52 UTC
Sent 5 Sep 2013 | 16:39:18 UTC
Received 5 Sep 2013 | 21:02:06 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -1073741819 (0xffffffffc0000005) Unknown error number
Computer ID 153764
Report deadline 10 Sep 2013 | 16:39:18 UTC
Run time 8.18
CPU time 0.00
Validate state Invalid
Credit 0.00
Application version ACEMD beta version v8.13 (cuda55)
Stderr output

<core_client_version>7.2.11</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -1073741819 (0xc0000005)
</message>
]]>
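
The two hexadecimal forms shown for this exit status (0xffffffffc0000005 in the result table and 0xc0000005 in the message) are the same negative number viewed as a 64-bit and as a 32-bit two's-complement value. A small sketch of the conversion, using plain Python integer arithmetic (the function name is only illustrative):

```python
def twos_complement_hex(value: int, bits: int) -> str:
    """Render a signed integer as the hex of its two's-complement bit pattern."""
    return hex(value & ((1 << bits) - 1))


status = -1073741819
print(twos_complement_hex(status, 32))   # 32-bit view of the exit status
print(twos_complement_hex(status, 64))   # 64-bit view of the same value
print(twos_complement_hex(-97, 64))      # the -97 status, for comparison
```

The 64-bit forms are just the sign-extended versions, which is why the -97 status earlier in the thread appears as 0xffffffffffffff9f.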

©2025 Universitat Pompeu Fabra