hERG: information and issues

Author	Message
Siegfried Niklas Joined: 23 Feb 09 Posts: 39 Credit: 144,654,294 RAC: 0	Message 15060 - Posted: 7 Feb 2010, 17:43:34 UTC - in response to Message 14981.
Quote: The new HERGqext are out (note the middle "q"). I'm trying a variation of the FFT parameters, using a slightly longer computation than necessary, to see if they run more stably on more cards. Thanks for your support and patience...
I'm seeing computation times of 11 h to 14.5 h for the HERGqext on heavily overclocked GTX295 (700 MHz) / GTX265 (750 MHz) cards. Time per step: 62.932 ms (Example)
The TONI_HERGext ran for only ~6.5 h. Time per step: 37.026 ms (Example)
"slightly? :-) longer computation"
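Editor's note: the two per-step timings quoted in the post above can be turned into a rough slowdown factor. The sketch below assumes both batches run the same number of steps on the same card, which the thread does not confirm; the variable names are illustrative only.

```python
# Per-step timings and runtime as reported in the post (same host assumed).
step_ms_qext = 62.932   # HERGqext, ms per step
step_ms_ext = 37.026    # TONI_HERGext, ms per step
runtime_ext_h = 6.5     # observed TONI_HERGext runtime, hours

# Assumption: equal step counts, so runtime scales with time per step.
slowdown = step_ms_qext / step_ms_ext
projected_qext_h = runtime_ext_h * slowdown

print(f"slowdown: {slowdown:.2f}x")                      # slowdown: 1.70x
print(f"projected HERGqext runtime: {projected_qext_h:.1f} h")  # 11.0 h
```

Under that assumption the projection matches the lower end of the reported 11 h to 14.5 h range.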

Toni Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0	Message 15061 - Posted: 7 Feb 2010, 19:11:32 UTC - in response to Message 15060. Last modified: 7 Feb 2010, 19:26:25 UTC
I also noticed the increase, and that was higher than expected. This is what I was trying to fix... The new ones should be back to the norm.

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 15735 - Posted: 13 Mar 2010, 18:10:42 UTC
I've had a run of three successive failures from the current batch of TONI_HERG with ACEMD v6.03, Windows XP32:
a43-TONI_HERG77a-1-100-RND4354_0
a317-TONI_HERG79a-0-100-RND8649_1
a268-TONI_HERG79a-1-100-RND6278_1
Three different machines, three CUDA cards - two 9800GT at stock, one 9800GTX+ factory overclocked.

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 15812 - Posted: 18 Mar 2010, 12:43:10 UTC
Does anyone have any idea about these? Since reporting these errors, all three cards have worked full time on GPUGrid (another refugee from SETI!), with around 30 tasks completed and a 100% success rate - including a couple of the long-running TONI_GA. But I've continued to abort TONI_HERG on sight (apologies once again to the researchers on that project) until the situation is clearer.

skgiven Volunteer moderator Volunteer tester Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0	Message 15814 - Posted: 18 Mar 2010, 13:13:11 UTC - in response to Message 15812.
I think the only way round this sort of problem is for the server to identify each card's ability to complete the various types of work unit and allocate accordingly. If there is more than, say, a 25% chance of failure, then don't allocate the task unless there are no other tasks.
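Editor's note: the allocation rule suggested above could look roughly like the sketch below. This is not how the GPUGrid or BOINC scheduler works; `TaskStats`, `pick_task`, and the per-card history dictionary are hypothetical names, and estimating failure chance from past completed/failed counts is an assumption.

```python
from dataclasses import dataclass

FAILURE_THRESHOLD = 0.25  # the "say 25%" figure from the post

@dataclass
class TaskStats:
    """Hypothetical per-(card model, task type) outcome counters."""
    completed: int = 0
    failed: int = 0

    @property
    def failure_rate(self) -> float:
        total = self.completed + self.failed
        return self.failed / total if total else 0.0  # no history: assume OK

def pick_task(card_model, available_types, history):
    """Return a task type for this card, or None if nothing is available.

    history maps (card_model, task_type) -> TaskStats.
    """
    if not available_types:
        return None

    def rate(task_type):
        return history.get((card_model, task_type), TaskStats()).failure_rate

    # Prefer task types at or below the failure threshold for this card.
    safe = [t for t in available_types if rate(t) <= FAILURE_THRESHOLD]
    if safe:
        return min(safe, key=rate)
    # No safe task left: allocate the least risky one anyway.
    return min(available_types, key=rate)

history = {
    ("9800GT", "TONI_HERG"): TaskStats(completed=1, failed=3),
    ("9800GT", "TONI_GA"): TaskStats(completed=30, failed=0),
}
print(pick_task("9800GT", ["TONI_HERG", "TONI_GA"], history))
print(pick_task("9800GT", ["TONI_HERG"], history))
```

The fallback branch implements the "unless there are no other tasks" clause: a risky task is still handed out when it is the only work available.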

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 15823 - Posted: 19 Mar 2010, 8:23:45 UTC
Another slipped in while I was asleep: a8-TONI_HERG77a-9-100-RND1351_1

X-Files 27 Joined: 11 Oct 08 Posts: 95 Credit: 68,023,693 RAC: 0	Message 15966 - Posted: 24 Mar 2010, 21:37:34 UTC
Here's a bad WU: http://www.gpugrid.net/workunit.php?wuid=1282907

Toni Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0	Message 15969 - Posted: 25 Mar 2010, 11:29:44 UTC - in response to Message 15966.
This bad one seems to have been created by some file transfer error. It should fail immediately.

mwgiii Joined: 22 Jan 09 Posts: 8 Credit: 988,332,833 RAC: 0	Message 16173 - Posted: 5 Apr 2010, 14:36:26 UTC Last modified: 5 Apr 2010, 14:38:51 UTC
I have now had 3 in a row fail.
1st: http://www.gpugrid.net/result.php?resultid=2093496
2nd: http://www.gpugrid.net/result.php?resultid=2096024
3rd: http://www.gpugrid.net/result.php?resultid=2103499
I rebooted after the 1st failure. The 2nd failed after 523 seconds and the 3rd after 9.1 seconds. The failures are also putting random sparkles on my screen.
Looking back at my history, I also had one fail on April 1st: http://www.gpugrid.net/result.php?resultid=2082136
All 4 have the same error message:
MDIO ERROR: cannot open file "restart.coor"
SWAN : FATAL : Failure executing kernel sync [M_shake_position_kernel_step_1] [999]
Assertion failed: 0, file ../swan/swanlib_nv.cpp, line 203
Intel Q9450 quad with Windows Vista Premium x64. Nvidia 9800 GTX+ with driver 197.13. Boinc 6.10.18

JStateson Joined: 31 Oct 08 Posts: 186 Credit: 3,578,903,157 RAC: 0	Message 16195 - Posted: 7 Apr 2010, 15:55:24 UTC
I had one similar error after crunching for 13 hours.
MDIO ERROR: cannot open file "restart.coor"
SWAN : FATAL : Failure executing kernel sync [frc_sum_kernel] [999]
Assertion failed: 0, file ../swan/swanlib_nv.cpp, line 203
I do not know which GPU it failed on, either the GTS250 with 1 GB of memory or the 9800 GTX+ with 0.5 GB of memory.

skgiven Volunteer moderator Volunteer tester Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0	Message 16197 - Posted: 7 Apr 2010, 17:04:58 UTC - in response to Message 16195.
A GTS250 is very similar (almost identical) to a 9800 GTX+, so it is probably not that important which one failed, unless you are getting lots of failures. The 1 GB vs 500 MB memory difference does not matter here.

ftpd Joined: 6 Jun 08 Posts: 152 Credit: 328,250,382 RAC: 0	Message 16220 - Posted: 9 Apr 2010, 6:57:56 UTC
Crashed after 14 hours 25 minutes. GTS 250 - driver 197.13 - Windows XP. Task 2120270.
Ton (ftpd) Netherlands

Siegfried Niklas Joined: 23 Feb 09 Posts: 39 Credit: 144,654,294 RAC: 0	Message 16313 - Posted: 15 Apr 2010, 18:12:32 UTC Last modified: 15 Apr 2010, 18:33:47 UTC
Over the past weeks I had some hERG WUs on my four 9800GT (Vista64) that stopped with an "acemd... error bubble". About 4 weeks ago I tried not clicking "OK" and instead restarted the PC with the error bubble still open. After the restart, the WU resumed from its checkpoint and finished valid! I verified this behavior with 5 further WUs. Every (valid) result shows a similar "stderr out":
.....................................................................
# There is 1 device supporting CUDA
# Device 0: "GeForce 9800 GT"
# Clock rate: 1.52 GHz
# Total amount of global memory: 519634944 bytes
# Number of multiprocessors: 14
# Number of cores: 112
MDIO ERROR: cannot open file "restart.coor"
SWAN : FATAL : Failure executing kernel sync [frc_sum_kernel] [999]
Assertion failed: 0, file ../swan/swanlib_nv.cpp, line 203
This application has requested the Runtime to terminate it in an unusual way. Please contact the application's support team for more information.
# There is 1 device supporting CUDA
# Device 0: "GeForce 9800 GT"
# Clock rate: 1.52 GHz
# Total amount of global memory: 519634944 bytes
# Number of multiprocessors: 14
# Number of cores: 112
# Time per step: 69.189 ms
# Approximate elapsed time for entire WU: 43242.851 s
called boinc_finish
Validate state Valid
..........................................................................
Last example: http://www.gpugrid.net/result.php?resultid=2158139
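Editor's note: the crash-then-recover pattern in the stderr output above can be spotted mechanically. The sketch below is a plain text scan over the log lines shown in this thread; `summarize_stderr` is a hypothetical helper, not part of ACEMD or BOINC, and the "implied steps" figure assumes the reported elapsed time is simply step count times time per step.

```python
import re

# Patterns taken from the log lines quoted in this thread.
FATAL_RE = re.compile(r"SWAN : FATAL : Failure executing kernel sync \[(\w+)\]")
STEP_RE = re.compile(r"# Time per step:\s*([\d.]+) ms")
ELAPSED_RE = re.compile(r"# Approximate elapsed time for entire WU:\s*([\d.]+) s")

def summarize_stderr(text):
    """Summarize one task's stderr: failed kernels, recovery, timings."""
    kernels = FATAL_RE.findall(text)
    step = STEP_RE.search(text)
    elapsed = ELAPSED_RE.search(text)
    finished = "called boinc_finish" in text
    summary = {
        "failed_kernels": kernels,
        # A fatal kernel error followed by a normal finish means the task
        # crashed once and then completed after a restart.
        "recovered": bool(kernels) and finished,
        "time_per_step_ms": float(step.group(1)) if step else None,
        "elapsed_s": float(elapsed.group(1)) if elapsed else None,
        "implied_steps": None,
    }
    if step and elapsed:
        # Assumption: elapsed time = steps * time per step.
        summary["implied_steps"] = round(
            summary["elapsed_s"] / (summary["time_per_step_ms"] / 1000.0)
        )
    return summary

sample = """\
MDIO ERROR: cannot open file "restart.coor"
SWAN : FATAL : Failure executing kernel sync [frc_sum_kernel] [999]
Assertion failed: 0, file ../swan/swanlib_nv.cpp, line 203
# Time per step: 69.189 ms
# Approximate elapsed time for entire WU: 43242.851 s
called boinc_finish
"""
result = summarize_stderr(sample)
print(result["failed_kernels"], result["recovered"], result["implied_steps"])
```

For the sample above, the two timing lines imply roughly 625,000 steps for the whole work unit, under the stated assumption.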