GERARD_CXCL12LOCKMONO
| Author | Message |
|---|---|
Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0
|
Just received 5 of these: GERARD_CXCL12LOCKMONO. Haven't seen this type before. All failed in about 1 second. |
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
Most are failing on Linux too. Errors:

Stderr output
<core_client_version>7.6.31</core_client_version>
<![CDATA[
<message>
process exited with code 176 (0xb0, -80)
</message>
<stderr_txt>
# SWAN Device 0 :
# Name : GeForce GTX 970
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:01:00.0
# Device clock : 1215MHz
# Memory clock : 3505MHz
# Memory width : 256bit
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 5000)
</stderr_txt>
]]>

Stderr output
<core_client_version>7.6.31</core_client_version>
<![CDATA[
<message>
process exited with code 199 (0xc7, -57)
</message>
<stderr_txt>
# SWAN Device 0 :
# Name : GeForce GTX 970
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:01:00.0
# Device clock : 1215MHz
# Memory clock : 3505MHz
# Memory width : 256bit
SWAN : FATAL : Cuda driver error 700 in file 'swanlibnv2.cpp' in line 1963.
# SWAN swan_assert -57
</stderr_txt>
]]>

Stderr output
<core_client_version>7.6.31</core_client_version>
<![CDATA[
<message>
process exited with code 176 (0xb0, -80)
</message>
<stderr_txt>
# SWAN Device 0 :
# Name : GeForce GTX 970
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:01:00.0
# Device clock : 1215MHz
# Memory clock : 3505MHz
# Memory width : 256bit
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 5000)
</stderr_txt>
]]>

Stderr output
<core_client_version>7.6.31</core_client_version>
<![CDATA[
<message>
process exited with code 199 (0xc7, -57)
</message>
<stderr_txt>
# SWAN Device 0 :
# Name : GeForce GTX 970
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:01:00.0
# Device clock : 1215MHz
# Memory clock : 3505MHz
# Memory width : 256bit
SWAN : FATAL : Cuda driver error 700 in file 'swanlibnv2.cpp' in line 1963.
# SWAN swan_assert -57
</stderr_txt>
]]>

Maybe these are designed to 'fail early' if they are likely to fail at all?

Task 15166493 has reached 5.5% after 1h on my Linux system, so the odd one appears to be running normally: 1x39-GERARD_CXCL12LOCKMONO-0-3-RND8941_0 |
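The exit codes in these logs are each printed in several forms, for example `176 (0xb0, -80)` and `199 (0xc7, -57)`, and the Windows results in this thread show `-97 (0xffffffffffffff9f)`. The following is a minimal sketch, assuming these are plain two's-complement reinterpretations of the same value at different widths; the helper names `as_signed` and `as_unsigned_hex` are made up for illustration and are not part of BOINC or the GPUGRID application.

```python
def as_signed(value: int, bits: int) -> int:
    """Interpret an unsigned integer as a two's-complement signed value."""
    mask = (1 << bits) - 1
    value &= mask
    # If the top bit is set, the value represents a negative number.
    return value - (1 << bits) if value >> (bits - 1) else value


def as_unsigned_hex(value: int, bits: int) -> str:
    """Render a (possibly negative) integer as unsigned two's-complement hex."""
    return hex(value & ((1 << bits) - 1))


# Linux exit codes quoted in the stderr output: 8-bit process exit status.
print(as_signed(176, 8), as_unsigned_hex(176, 8))
print(as_signed(199, 8), as_unsigned_hex(199, 8))

# Exit status -97 as reported on the Windows hosts in this thread,
# shown at 64-bit and 32-bit width (assumed widths).
print(as_unsigned_hex(-97, 64))
print(as_unsigned_hex(-97, 32))
```

Under that assumption, 199 and -57 are the same byte, which lines up with the `swan_assert -57` line in the second log.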
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
Server says: Detailed computing status

GERARD_CXCL12LOCKMON: 234 unsent, 170 in progress, 0 success, 100% error rate

http://www.gpugrid.net/server_status.php - scroll down

If it's a new batch this is to be expected: lots of immediate failures will return before completed tasks. The first completed task might not report until ~7h after it was sent.

The first healthy-looking LOCKMONO task I'm running is now at 10%, and I have another on W10 at 1.3% after 11 mins. GPU usage 80% @ 73% power with temp throttling enabled, using 1GB GDDR at present; that all looks quite normal. |
|
Joined: 28 Mar 09 Posts: 490 Credit: 11,739,145,728 RAC: 116,723
|
I had a couple of these WUs fail as well, before getting a couple of good WUs, which are crunching well right now. I hope they finish successfully.

Name 0x20-GERARD_CXCL12LOCKMONO-0-3-RND0335_2
Workunit 11645809
Created 22 Jun 2016 | 6:51:54 UTC
Sent 22 Jun 2016 | 8:20:34 UTC
Received 22 Jun 2016 | 9:06:57 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -97 (0xffffffffffffff9f) Unknown error number
Computer ID 263612
Report deadline 27 Jun 2016 | 8:20:34 UTC
Run time 0.00
CPU time 0.00
Validate state Invalid
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v8.48 (cuda65)

Stderr output
<core_client_version>7.6.22</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -97 (0xffffff9f)
</message>
<stderr_txt>
# GPU [GeForce GTX 980 Ti] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 1 :
# Name : GeForce GTX 980 Ti
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:02:00.0
# Device clock : 1190MHz
# Memory clock : 3505MHz
# Memory width : 384bit
# Driver version : r358_00 : 35906
# GPU 0 : 63C
# GPU 1 : 48C
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 5000)
# GPU [GeForce GTX 980 Ti] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 1 :
# Name : GeForce GTX 980 Ti
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:02:00.0
# Device clock : 1190MHz
# Memory clock : 3505MHz
# Memory width : 384bit
# Driver version : r358_00 : 35906
# The simulation has become unstable. Terminating to avoid lock-up (1)
</stderr_txt>
]]>

Name 0x23-GERARD_CXCL12LOCKMONO-0-3-RND1112_0
Workunit 11645812
Created 21 Jun 2016 | 15:15:19 UTC
Sent 22 Jun 2016 | 5:09:24 UTC
Received 22 Jun 2016 | 5:53:14 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -97 (0xffffffffffffff9f) Unknown error number
Computer ID 263612
Report deadline 27 Jun 2016 | 5:09:24 UTC
Run time 0.00
CPU time 0.00
Validate state Invalid
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v8.48 (cuda65)

Stderr output
<core_client_version>7.6.22</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -97 (0xffffff9f)
</message>
<stderr_txt>
# GPU [GeForce GTX 980 Ti] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 980 Ti
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:01:00.0
# Device clock : 1266MHz
# Memory clock : 3505MHz
# Memory width : 384bit
# Driver version : r358_00 : 35906
# GPU 0 : 61C
# GPU 1 : 57C
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 5000)
# GPU [GeForce GTX 980 Ti] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 980 Ti
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:01:00.0
# Device clock : 1266MHz
# Memory clock : 3505MHz
# Memory width : 384bit
# Driver version : r358_00 : 35906
# The simulation has become unstable. Terminating to avoid lock-up (1)
</stderr_txt>
]]> |
Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0
|
"If it's a new batch this is to be expected, lots of immediate failures will return before completed tasks. The first completed task might not report until ~7h after it was sent."

On all our machines (yours, Bedrich's and mine), the failed ones all begin with "0x". Some are up to 5 errors with no successes. I have 4 more GERARD_CXCL12LOCKMONO tasks running now that begin with "1x" and seem to be progressing normally (one is over 30%). |
Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0
|
"Server says: Detailed computing status"

Looks like 4 have now been completed. Bet they all have the "1x" prefix, not "0x":

GERARD_CXCL12LOCKMON: 36 unsent, 368 in progress, 4 success, 98.46% error rate

At least we know that some of them are OK. |
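The server status line gives a success count and an error rate but not the number of errors. As a rough sketch, assuming the error rate is computed as errors / (errors + successes) (the thread does not say how the server derives it), the implied error count can be backed out as follows. The function name `infer_errors` is hypothetical.

```python
def infer_errors(successes: int, error_rate_percent: float) -> int:
    """Estimate the error count from a success count and an error rate.

    Assumes error_rate = errors / (errors + successes); this formula is an
    assumption, not something stated on the server status page.
    """
    rate = error_rate_percent / 100.0
    if not 0.0 <= rate < 1.0:
        raise ValueError("error rate must be below 100% to infer a count")
    return round(successes * rate / (1.0 - rate))


successes = 4
errors = infer_errors(successes, 98.46)
print(errors)
# Recompute the rate from the inferred count as a consistency check.
print(f"{100 * errors / (errors + successes):.2f}%")
```

With 4 successes and 98.46%, this gives an estimate in the mid-200s of failed results, which fits the picture of many 0x tasks failing within seconds on several hosts.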
|
Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0
|
"If it's a new batch this is to be expected, lots of immediate failures will return before completed tasks. The first completed task might not report until ~7h after it was sent."

My host #208061:

0x83-GERARD_CXCL12LOCKMONO-0-3-RND0285_1
-97 (0xffffffffffffff9f) Unknown error number
Attempting restart (step 5000)
Run time 1.00 CPU time 0.00

Paradoxically, a -97 error is thought of as an overclocking problem.

3x96-GERARD_CXCL12LOCKMONO-0-3-RND9182_0
-97 (0xffffffffffffff9f) Unknown error number
Attempting restart (step 3845000)
Run time 10,433.28 CPU time 3,276.63

A note about e6s8_e5s7p0f230-GIANNI_MORC36bCHL1-0-1-RND7755_0: it faulted at 99.992% WU completion a couple of days ago on my host.
ERROR: file force.cpp line 513: TCL evaluation of [calcforces]
-98 (0xffffffffffffff9e) Unknown error number
Run time 68,745.34 CPU time 20,137.64 |
|
Joined: 28 Mar 09 Posts: 490 Credit: 11,739,145,728 RAC: 116,723
|
I had 2 of these WUs complete successfully. They were 1x and 2x WUs.

Name 1x28-GERARD_CXCL12LOCKMONO-0-3-RND5689_0
Workunit 11645918
Created 21 Jun 2016 | 15:17:47 UTC
Sent 22 Jun 2016 | 5:59:44 UTC
Received 22 Jun 2016 | 14:21:20 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 263612
Report deadline 27 Jun 2016 | 5:59:44 UTC
Run time 29,436.96
CPU time 29,315.81
Validate state Valid
Credit 294,750.00
Application version Long runs (8-12 hours on fastest card) v8.48 (cuda65)

Name 2x67-GERARD_CXCL12LOCKMONO-0-3-RND9322_0
Workunit 11646058
Created 21 Jun 2016 | 15:21:10 UTC
Sent 22 Jun 2016 | 9:12:36 UTC
Received 22 Jun 2016 | 17:48:52 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 263612
Report deadline 27 Jun 2016 | 9:12:36 UTC
Run time 30,274.28
CPU time 30,138.36
Validate state Valid
Credit 294,750.00
Application version Long runs (8-12 hours on fastest card) v8.48 (cuda65)

I also had another 0x WU fail, which makes 3 for me.

Name 0x54-GERARD_CXCL12LOCKMONO-0-3-RND3534_2
Workunit 11645843
Created 22 Jun 2016 | 10:05:17 UTC
Sent 22 Jun 2016 | 13:11:49 UTC
Received 22 Jun 2016 | 13:33:01 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -97 (0xffffffffffffff9f) Unknown error number
Computer ID 30790
Report deadline 27 Jun 2016 | 13:11:49 UTC
Run time 1.31
CPU time 1.31
Validate state Invalid
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v8.48 (cuda65)

Looks like the 0x WUs are all bad, and should be canceled. |
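The "0x" observation can be checked mechanically by grouping reported results on the part of the task name before the "x". The sketch below uses only task names and outcomes posted in this thread up to this point; the regular expression and the `tally_by_prefix` helper are illustrative assumptions about the naming scheme, not a documented GPUGRID format.

```python
import re
from collections import Counter

# Assumed name layout: <prefix>x<index>-GERARD_CXCL12LOCKMONO-...
NAME_RE = re.compile(r"^(?P<prefix>\d+)x(?P<index>\d+)-GERARD_CXCL12LOCKMONO-")

# (task name, outcome) pairs as reported by posters in this thread.
REPORTED = [
    ("0x20-GERARD_CXCL12LOCKMONO-0-3-RND0335_2", "error"),
    ("0x23-GERARD_CXCL12LOCKMONO-0-3-RND1112_0", "error"),
    ("0x83-GERARD_CXCL12LOCKMONO-0-3-RND0285_1", "error"),
    ("0x54-GERARD_CXCL12LOCKMONO-0-3-RND3534_2", "error"),
    ("1x28-GERARD_CXCL12LOCKMONO-0-3-RND5689_0", "success"),
    ("2x67-GERARD_CXCL12LOCKMONO-0-3-RND9322_0", "success"),
    # Failed much later (restart at step 3845000), unlike the 0x tasks.
    ("3x96-GERARD_CXCL12LOCKMONO-0-3-RND9182_0", "error"),
]


def tally_by_prefix(results):
    """Count outcomes per name prefix, skipping names that do not match."""
    counts = Counter()
    for name, outcome in results:
        match = NAME_RE.match(name)
        if match is None:
            continue
        counts[(match.group("prefix") + "x", outcome)] += 1
    return counts


counts = tally_by_prefix(REPORTED)
for (prefix, outcome), n in sorted(counts.items()):
    print(prefix, outcome, n)
```

On this small sample every 0x task errored and none succeeded, while the one non-0x error (3x96) ran for hours first, so it looks like a different kind of failure.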
|
Joined: 28 Mar 09 Posts: 490 Credit: 11,739,145,728 RAC: 116,723
|
Enough with these 0x WUs already; I had 2 more fail on me.

Name 0x0-GERARD_CXCL12LOCKMONO-0-3-RND6293_6
Workunit 11645789
Created 23 Jun 2016 | 17:22:43 UTC
Sent 23 Jun 2016 | 18:54:17 UTC
Received 23 Jun 2016 | 19:40:13 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -97 (0xffffffffffffff9f) Unknown error number
Computer ID 263612
Report deadline 28 Jun 2016 | 18:54:17 UTC
Run time 13.19
CPU time 13.19
Validate state Invalid
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v8.48 (cuda65)

Name 0x47-GERARD_CXCL12LOCKMONO-0-3-RND0003_4
Workunit 11645836
Created 24 Jun 2016 | 5:30:48 UTC
Sent 24 Jun 2016 | 7:04:56 UTC
Received 24 Jun 2016 | 7:31:29 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -97 (0xffffffffffffff9f) Unknown error number
Computer ID 30790
Report deadline 29 Jun 2016 | 7:04:56 UTC
Run time 1.22
CPU time 1.22
Validate state Invalid
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v8.48 (cuda65)

These WUs are bad. Please cancel them. |
©2026 Universitat Pompeu Fabra