Message boards : Number crunching : Short runs: trypsin_lig_1161_3-NOELIA_RL3_run-0-1- crashed

Profile (retired account)
Joined: 22 Dec 11
Posts: 38
Credit: 28,606,255
RAC: 0
Message 33685 - Posted: 30 Oct 2013 | 20:37:26 UTC

Hello,

just had a trypsin_lig_1161_3-NOELIA_RL3_run-0-1-RND4573_0 crashing immediately, here's the stderr output:


<core_client_version>7.0.64</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -98 (0xffffff9e)
</message>
<stderr_txt>
# GPU [GeForce GT 650M] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 0 :
# Name : GeForce GT 650M
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:01:00.0
# Device clock : 950MHz
# Memory clock : 900MHz
# Memory width : 128bit
# Driver version : r325_00 : 32723
ERROR: file pme.cpp line 85: PME NX too small
21:24:17 (4392): called boinc_finish

</stderr_txt>
]]>


After that I got a 6x9-SANTI_MARwtdim-16-25- which is running fine so far.
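For anyone comparing several of these dumps, here is a minimal Python sketch of pulling the useful bits out of a pasted stderr block. It also checks that the decimal exit code and the hex value in the message are the same number read as a signed 32-bit integer. The function names `to_signed32` and `summarize` are made up for this illustration; they are not part of BOINC or the GPUGRID app.

```python
import re

# Trimmed copy of the stderr dump posted above.
STDERR_SAMPLE = """<message>
(unknown error) - exit code -98 (0xffffff9e)
</message>
<stderr_txt>
# GPU [GeForce GT 650M] Platform [Windows] Rev [3203] VERSION [55]
ERROR: file pme.cpp line 85: PME NX too small
21:24:17 (4392): called boinc_finish
</stderr_txt>"""


def to_signed32(value):
    """Interpret an unsigned 32-bit value as a two's-complement signed int."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def summarize(stderr_text):
    """Return (exit_code, hex_value_as_signed, error_text) from a pasted dump."""
    code = re.search(r"exit code (-?\d+) \((0x[0-9a-fA-F]+)\)", stderr_text)
    error = re.search(r"^ERROR: (.+)$", stderr_text, re.MULTILINE)
    exit_code = int(code.group(1)) if code else None
    hex_signed = to_signed32(int(code.group(2), 16)) if code else None
    return exit_code, hex_signed, error.group(1) if error else None


print(summarize(STDERR_SAMPLE))
```

The first two values should agree, which is just a sanity check that 0xffffff9e is -98 and not a second, different error code.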

John
Joined: 15 Oct 11
Posts: 17
Credit: 81,085,378
RAC: 0
Message 33689 - Posted: 30 Oct 2013 | 23:18:03 UTC
Last modified: 30 Oct 2013 | 23:26:23 UTC

Same issue here.... x2 workunits

trypsin_lig_1316_4-NOELIA_RL3_run-0-1-RND5763_0

trypsin_lig_1097_4-NOELIA_RL3_run-0-1-RND4667_0

Stderr output

<core_client_version>7.0.64</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -98 (0xffffff9e)
</message>
<stderr_txt>
# GPU [GeForce GTS 450] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 0 :
# Name : GeForce GTS 450
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:01:00.0
# Device clock : 1760MHz
# Memory clock : 1840MHz
# Memory width : 128bit
# Driver version : r331_54 : 33158
ERROR: file pme.cpp line 85: PME NX too small
19:11:34 (4304): called boinc_finish

</stderr_txt>

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 33692 - Posted: 31 Oct 2013 | 13:31:24 UTC - in response to Message 33689.

The aforementioned batch of WUs contains a bug. All 13 WUs I received failed (after 2 or 3 seconds) on all my systems (Linux and Windows).

Example Stderr output

<core_client_version>7.2.23</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -98 (0xffffff9e)
</message>
<stderr_txt>
# GPU [GeForce GTX 660] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 2 :
# Name : GeForce GTX 660
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:02:00.0
# Device clock : 1032MHz
# Memory clock : 3004MHz
# Memory width : 192bit
# Driver version : r331_00 : 33140
ERROR: file pme.cpp line 85: PME NX too small
09:38:11 (6728): called boinc_finish

</stderr_txt>
]]>

The same WU also failed on other systems,

    Task     Computer  Sent                        Time reported               Status                 Run time (s)  CPU time (s)  Credit  Application
    7412562  151335    30 Oct 2013 | 23:00:12 UTC  30 Oct 2013 | 23:06:23 UTC  Error while computing  2.27          0.08          ---     Short runs (2-3 hours on fastest card) v8.14 (cuda42)
    7415514  143807    31 Oct 2013 | 0:50:38 UTC   31 Oct 2013 | 0:56:44 UTC   Error while computing  2.43          0.14          ---     Short runs (2-3 hours on fastest card) v8.14 (cuda42)
    7415906  152255    31 Oct 2013 | 2:16:17 UTC   31 Oct 2013 | 2:22:13 UTC   Error while computing  2.02          0.08          ---     Short runs (2-3 hours on fastest card) v8.14 (cuda42)
    7416291  131405    31 Oct 2013 | 4:21:57 UTC   31 Oct 2013 | 4:26:02 UTC   Error while computing  1.62          0.11          ---     Short runs (2-3 hours on fastest card) v8.14 (cuda42)
    7416888  127801    31 Oct 2013 | 6:33:24 UTC   31 Oct 2013 | 6:42:57 UTC   Error while computing  2.09          0.42          ---     Short runs (2-3 hours on fastest card) v8.14 (cuda55)
    7417449  139265    31 Oct 2013 | 8:00:12 UTC   31 Oct 2013 | 9:43:26 UTC   Error while computing  2.22          0.20          ---     Short runs (2-3 hours on fastest card) v8.14 (cuda55)
    7418231  139502    31 Oct 2013 | 11:44:17 UTC  31 Oct 2013 | 11:50:05 UTC  Error while computing  2.03          0.11          ---     Short runs (2-3 hours on fastest card) v8.14 (cuda55)
    7418699  ---       ---                         ---                         Unsent                 ---           ---           ---



http://www.gpugrid.net/result.php?resultid=7417225
http://www.gpugrid.net/result.php?resultid=7416175...

All failed with: ERROR: file pme.cpp line 85: PME NX too small
Workunit status: Too many errors (may have bug)
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Profile dskagcommunity
Joined: 28 Apr 11
Posts: 456
Credit: 817,865,789
RAC: 0
Message 33693 - Posted: 31 Oct 2013 | 15:14:23 UTC
Last modified: 31 Oct 2013 | 15:15:24 UTC

Oh, and I thought it could be my old card, which doesn't want to run this batch O.o 30 errors :/ I didn't look into the stderr, but now I see they all fail on multiple machines.
____________
DSKAG Austria Research Team: http://www.research.dskag.at



Tin Man
Joined: 1 Sep 09
Posts: 2
Credit: 214,365,451
RAC: 0
Message 33694 - Posted: 31 Oct 2013 | 15:51:05 UTC

Same error message for me as well!!

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 33703 - Posted: 1 Nov 2013 | 8:31:54 UTC - in response to Message 33694.

The latest effort seems to work well.

Linux system (GTX670 FOC and GTX770):

trypsin_lig_1298_4x1-NOELIA_RL3run-0-1-RND2976_0 4888061 1 Nov 2013 | 4:27:41 UTC 1 Nov 2013 | 6:07:34 UTC Completed and validated 1,716.63 1,658.44 1,500.00 Short runs (2-3 hours on fastest card) v8.00 (cuda42)

trypsin_lig_1089_2x1-NOELIA_RL3run-0-1-RND9336_0 4887950 31 Oct 2013 | 21:55:42 UTC 1 Nov 2013 | 0:47:14 UTC Completed and validated 1,485.76 1,429.36 1,500.00 Short runs (2-3 hours on fastest card) v8.00 (cuda55)

Windows 7 (GTX770):

trypsin_lig_1706_1x1-NOELIA_RL3run-0-1-RND8824_0 4888310 1 Nov 2013 | 5:15:45 UTC 1 Nov 2013 | 7:32:23 UTC Completed and validated 2,028.98 2,007.25 1,500.00 Short runs (2-3 hours on fastest card) v8.14 (cuda55)

They seem to be much faster on Linux (about 35% faster, rather than the typical 11%). I expect they are small simulations.
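As a rough check of that Linux advantage, here is a small Python sketch using only the three run times listed above. It is indicative at best: these are three different tasks, and the post does not say which of the two Linux cards (GTX670 FOC or GTX770) ran which task. The helper name `percent_slower` is invented for this example.

```python
# Run times in seconds, copied from the three "Completed and validated" rows.
LINUX_RUN_TIMES = [1716.63, 1485.76]
WINDOWS_RUN_TIME = 2028.98


def percent_slower(slow_seconds, fast_seconds):
    """How much longer the slow run took, as a percentage of the fast run."""
    return (slow_seconds / fast_seconds - 1.0) * 100.0


for linux_seconds in LINUX_RUN_TIMES:
    gap = percent_slower(WINDOWS_RUN_TIME, linux_seconds)
    print(f"Windows {WINDOWS_RUN_TIME:.2f}s vs Linux {linux_seconds:.2f}s: "
          f"{gap:.1f}% longer")
```

The gap depends a lot on which Linux task is used as the baseline, so the 35% figure is best read as the upper end of the range rather than a measured average.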

Profile X1900AIW
Joined: 12 Sep 08
Posts: 74
Credit: 23,566,124
RAC: 0
Message 33704 - Posted: 1 Nov 2013 | 11:28:01 UTC

32 crashes between 30 and 31 Oct 2013. But my GPU is overclocked, so maybe these workunits were more sensitive to the overclock.

Werkstatt
Joined: 23 May 09
Posts: 121
Credit: 333,451,807
RAC: 357,200
Message 33706 - Posted: 1 Nov 2013 | 17:32:20 UTC

I added my new GT630 a few days ago. Many, many bad results, always failing after a few seconds.
Host: http://www.gpugrid.net/results.php?userid=25200
Error -52, SWAN : FATAL Unable to load module .mshake_kernel.cu. (702)
Error -98, ERROR: file pme.cpp line 85: PME NX too small

One fault caused the system to crash.

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 33720 - Posted: 2 Nov 2013 | 12:31:21 UTC - in response to Message 33706.
Last modified: 2 Nov 2013 | 12:34:09 UTC

Werkstatt, other crunchers, mods and even the scientists get 'No access' on 'your' link. You need to link to the individual system instead (it's a BOINC server/site template security thing):
http://www.gpugrid.net/results.php?hostid=161299&offset=0&show_names=1&state=0&appid=

Fortunately you haven't hidden your systems.

You had the unfortunate experience of encountering a bad batch of WU's. These failed on everyone's cards.

trypsin_lig_1298_1-NOELIA_RL3_run-0-1-RND8370_6 - Already failed 6 times before being sent to you. As they failed in about 2 or 3 seconds, numerous tasks would have failed for everyone before anyone noticed.
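The "failed 6 times" count can be read off the task name itself. Here is a minimal Python sketch, assuming the usual BOINC convention that a task (result) name is the workunit name plus `_N`, with N counting copies from 0. The helper `split_result_name` is hypothetical, written only for this illustration.

```python
def split_result_name(result_name):
    """Split 'WORKUNIT_N' into (workunit_name, copy_index)."""
    workunit, _, suffix = result_name.rpartition("_")
    if not workunit or not suffix.isdigit():
        raise ValueError(f"no numeric copy suffix in {result_name!r}")
    return workunit, int(suffix)


workunit, copy_index = split_result_name(
    "trypsin_lig_1298_1-NOELIA_RL3_run-0-1-RND8370_6"
)
# With copies numbered from 0, index 6 means six earlier copies (0 to 5)
# were issued before this one.
print(workunit, copy_index)
```

So a task ending in `_0` is the first attempt, and a high suffix on a task that dies within seconds is a fair hint that the workunit, not the card, is at fault.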

You seem to be having success now,

trypsin_lig_9_4x1-NOELIA_RC3run-0-1-RND4317_0 4889572 2 Nov 2013 | 0:13:18 UTC 2 Nov 2013 | 6:28:49 UTC Completed and validated 22,149.52 22,125.89 1,500.00 Short runs (2-3 hours on fastest card) v8.14 (cuda55)

trypsin_lig_1355_3x1-NOELIA_RC3run-0-1-RND4109_0 4889060 1 Nov 2013 | 17:10:25 UTC 2 Nov 2013 | 0:19:32 UTC Completed and validated 22,070.07 22,049.27 1,500.00 Short runs (2-3 hours on fastest card) v8.14 (cuda55)

trypsin_lig_458_2x1-NOELIA_RL3run-0-1-RND5559_0 4888617 1 Nov 2013 | 15:50:44 UTC 1 Nov 2013 | 17:11:01 UTC Completed and validated 4,769.82 4,769.82 1,500.00 Short runs (2-3 hours on fastest card) v8.14 (cuda55)

but that credit!

Either it's a new batch with more complicated molecules (8.827 ms vs 1.9ms per step) and poor credit, or your GPU downclocked, or your system was busy doing other things.
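To put numbers on that, here is a rough Python sketch using the run times, credit and per-step figures quoted in this post. One loud assumption: the post does not say which task the 1.9 ms figure belongs to, so pairing it with the 4,769.82 s task is a guess. The function names are invented for this example.

```python
# Figures quoted in the post. Pairing 1.9 ms/step with the 4,769.82 s task
# is an assumption; the post does not state it.
SLOW_TASK = {"run_time_s": 22149.52, "ms_per_step": 8.827, "credit": 1500.0}
FAST_TASK = {"run_time_s": 4769.82, "ms_per_step": 1.9, "credit": 1500.0}


def credit_per_hour(task):
    """Credit earned per hour of run time."""
    return task["credit"] / task["run_time_s"] * 3600.0


def estimated_steps(task):
    """Run time divided by time per step; ignores startup and checkpoint overhead."""
    return task["run_time_s"] * 1000.0 / task["ms_per_step"]


for label, task in (("slow", SLOW_TASK), ("fast", FAST_TASK)):
    print(f"{label}: {credit_per_hour(task):.1f} credit/hour, "
          f"about {estimated_steps(task):.0f} steps")
```

If the pairing holds, both tasks come out at nearly the same step count, so the long run time comes from each step being slower (for any of the three reasons listed), while the fixed 1,500 credit is spread over several times as many hours.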

Noelia, if it's a new batch perhaps you could up the credits for the next batch?

Werkstatt
Joined: 23 May 09
Posts: 121
Credit: 333,451,807
RAC: 357,200
Message 33727 - Posted: 2 Nov 2013 | 16:23:36 UTC - in response to Message 33720.
Last modified: 2 Nov 2013 | 16:24:03 UTC


You had the unfortunate experience of encountering a bad batch of WU's. These failed on everyone's cards.

Noelia, if it's a new batch perhaps you could up the credits for the next batch?


Hi skgiven,

Thanks for the reply. WUs failing right after they start is not really a problem for me, as I have a flat-rate internet connection. I just wanted to keep the admins informed that there is a problem somewhere. I was prompted by a discussion at Einstein about a fault in CUDA 5 that they ran into, which is why they still use cuda32.

I want to test my new card with different projects. It's a 'Kepler' card, driven by the GK208 chip. It adds < 19W to the power budget and it's a passively cooled, single-slot, slim-size card.

Cheers

Alexander

Profile dskagcommunity
Joined: 28 Apr 11
Posts: 456
Credit: 817,865,789
RAC: 0
Message 33728 - Posted: 2 Nov 2013 | 16:52:17 UTC

Wow, I haven't seen credits this low before, that's harsh O.o



Werkstatt
Joined: 23 May 09
Posts: 121
Credit: 333,451,807
RAC: 357,200
Message 33731 - Posted: 2 Nov 2013 | 20:39:52 UTC

last one was much better ... :))
