hERG: information and issues
| Author | Message |
|---|---|
|
Joined: 23 Feb 09 Posts: 39 Credit: 144,654,294 RAC: 0
|
The new HERGqext are out (note the middle "q"). I'm trying a variation of the FFT parameters, using a slightly longer computation than necessary, to see if they run more stably on more cards. Thanks for your support and patience... I see computation times of 11 h to 14.5 h on heavily overclocked GTX295 (700 MHz)/GTX265 (750 MHz) cards for the HERGqext (time per step: 62.932 ms), whereas the TONI_HERGext ran in only ~6.5 h (time per step: 37.026 ms). "Slightly" :-) longer computation? |
|
Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 |
I also noticed the increase, which was higher than expected. This is what I was trying to fix... The new ones should be back to the norm. |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 2
|
I've had a run of three successive failures from the current batch of TONI_HERG with ACEMD v6.03, Windows XP32: a43-TONI_HERG77a-1-100-RND4354_0, a317-TONI_HERG79a-0-100-RND8649_1, a268-TONI_HERG79a-1-100-RND6278_1. Three different machines, three CUDA cards: two 9800GT at stock, one 9800GTX+ factory overclocked. |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 2
|
Does anyone have any idea about these? Since reporting these errors, all three cards have worked full time on GPUGrid (another refugee from SETI!), with around 30 tasks completed and a 100% success rate, including a couple of the long-running TONI_GA. But I've continued to abort TONI_HERG on sight (apologies once again to the researchers on that project) until the situation is clearer. |
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
I think the only way round this sort of problem is for the server to identify each card's ability to complete the various types of work unit and allocate accordingly. If there is more than, say, a 25% chance of failure, then don't allocate the task, unless there are no other tasks. |
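The allocation rule suggested in the post above could be sketched roughly as follows. This is only an illustration: the function name `pick_task`, the per-card failure-rate table, and the use of 25% as a hard cutoff are assumptions made for the example, and nothing here reflects how the GPUGrid or BOINC scheduler actually works.

```python
from typing import Dict, List, Optional

# Hypothetical threshold, taken from the "say 25%" figure in the post above.
MAX_FAILURE_RATE = 0.25


def pick_task(
    card: str,
    available: List[str],
    failure_rate: Dict[str, Dict[str, float]],
) -> Optional[str]:
    """Return a task type for `card`, preferring types it rarely fails on.

    `failure_rate[card][task_type]` is an assumed server-side estimate of how
    often that card model fails that task type (0.0 to 1.0). Unknown
    combinations are treated as 0.0, i.e. assumed safe.
    """
    if not available:
        return None
    rates = failure_rate.get(card, {})
    # Keep only task types at or below the failure threshold.
    safe = [t for t in available if rates.get(t, 0.0) <= MAX_FAILURE_RATE]
    if safe:
        return safe[0]
    # No safe task is left: hand out the least risky one rather than nothing.
    return min(available, key=lambda t: rates.get(t, 0.0))


# Made-up failure rates, for illustration only.
rates = {"9800GT": {"TONI_HERG": 0.60, "TONI_GA": 0.02}}
print(pick_task("9800GT", ["TONI_HERG", "TONI_GA"], rates))
print(pick_task("9800GT", ["TONI_HERG"], rates))
```

The fallback branch covers the "unless there are no other tasks" part of the suggestion: a risky task is still handed out when it is the only work available.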
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 2
|
|
X-Files 27 Joined: 11 Oct 08 Posts: 95 Credit: 68,023,693 RAC: 0
|
Here's a bad WU: http://www.gpugrid.net/workunit.php?wuid=1282907
|
|
Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 |
This bad one seems to have been created by some file transfer error. It should fail immediately. |
|
Joined: 22 Jan 09 Posts: 8 Credit: 988,332,833 RAC: 0
|
I have now had 3 in a row fail. 1st: http://www.gpugrid.net/result.php?resultid=2093496 2nd: http://www.gpugrid.net/result.php?resultid=2096024 3rd: http://www.gpugrid.net/result.php?resultid=2103499 I rebooted after the 1st failure. The 2nd failed after 523 seconds and the 3rd after 9.1 seconds. The failures are also putting random sparkles on my screen. Looking back at my history, I also had one fail on April 1st: http://www.gpugrid.net/result.php?resultid=2082136 All 4 have the same error message: MDIO ERROR: cannot open file "restart.coor"; SWAN : FATAL : Failure executing kernel sync [M_shake_position_kernel_step_1] [999]; Assertion failed: 0, file ../swan/swanlib_nv.cpp, line 203. System: Intel Q9450 quad with Windows Vista Premium x64, Nvidia 9800 GTX+ with driver 197.13, BOINC 6.10.18.
|
JStateson Joined: 31 Oct 08 Posts: 186 Credit: 3,578,903,157 RAC: 0
|
I had one similar error after crunching for 13 hours: MDIO ERROR: cannot open file "restart.coor"; SWAN : FATAL : Failure executing kernel sync [frc_sum_kernel] [999]; Assertion failed: 0, file ../swan/swanlib_nv.cpp, line 203. I do not know which GPU it failed on, either the GTS250 with 1 GB of memory or the 9800 GTX+ with 0.5 GB of memory. |
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
A GTS250 is very similar (almost identical) to a 9800 GTX+, so which card it was is probably not that important, unless you are getting lots of failures. The 1 GB vs 500 MB of memory does not make any difference here. |
|
Joined: 6 Jun 08 Posts: 152 Credit: 328,250,382 RAC: 0
|
Crashed after 14 hours 25 minutes. GTS 250, driver 197.13, Windows XP. Task 2120270. Ton (ftpd) Netherlands |
|
Joined: 23 Feb 09 Posts: 39 Credit: 144,654,294 RAC: 0
|
During the past weeks I had some hERG WUs on my four 9800GT (Vista64) that stopped with an "acemd..." error bubble. About 4 weeks ago I tried restarting the PC with the error bubble still open instead of clicking "OK". After the restart, the WU resumed from its checkpoint and finished valid! I verified this behavior with 5 further WUs. Every (valid) result shows a similar "stderr out": # There is 1 device supporting CUDA # Device 0: "GeForce 9800 GT" # Clock rate: 1.52 GHz # Total amount of global memory: 519634944 bytes # Number of multiprocessors: 14 # Number of cores: 112 MDIO ERROR: cannot open file "restart.coor" SWAN : FATAL : Failure executing kernel sync [frc_sum_kernel] [999] Assertion failed: 0, file ../swan/swanlib_nv.cpp, line 203 This application has requested the Runtime to terminate it in an unusual way. Please contact the application's support team for more information. # There is 1 device supporting CUDA # Device 0: "GeForce 9800 GT" # Clock rate: 1.52 GHz # Total amount of global memory: 519634944 bytes # Number of multiprocessors: 14 # Number of cores: 112 # Time per step: 69.189 ms # Approximate elapsed time for entire WU: 43242.851 s called boinc_finish. Validate state: Valid. Last example: http://www.gpugrid.net/result.php?resultid=2158139 |
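For anyone who wants to check their own results for the same crash-then-resume pattern, here is a minimal sketch in Python. The marker strings are copied from the stderr output quoted in this thread. The function name `summarize_stderr` and the idea of scanning a saved copy of the stderr text are assumptions for illustration, not part of any GPUGrid or BOINC tool.

```python
import re
from typing import Optional, Tuple

# Marker strings copied from the stderr output quoted in this thread.
CRASH_MARKER = "SWAN : FATAL : Failure executing kernel sync"
FINISH_MARKER = "called boinc_finish"
STEP_RE = re.compile(r"Time per step:\s*([\d.]+)\s*ms")


def summarize_stderr(text: str) -> Tuple[bool, bool, Optional[float]]:
    """Return (crashed, finished_after_crash, time_per_step_ms)."""
    crash_pos = text.find(CRASH_MARKER)
    crashed = crash_pos != -1
    # "Finished after crash" means the finish marker appears later in the log.
    finished_after_crash = crashed and text.find(FINISH_MARKER, crash_pos) != -1
    match = STEP_RE.search(text)
    step_ms = float(match.group(1)) if match else None
    return crashed, finished_after_crash, step_ms


# Shortened copy of the log lines quoted above.
sample = """# Number of cores: 112
MDIO ERROR: cannot open file "restart.coor"
SWAN : FATAL : Failure executing kernel sync [frc_sum_kernel] [999]
Assertion failed: 0, file ../swan/swanlib_nv.cpp, line 203
# Number of cores: 112
# Time per step: 69.189 ms
# Approximate elapsed time for entire WU: 43242.851 s
called boinc_finish
"""
print(summarize_stderr(sample))
```

A result that reports a crash followed by a finish matches the behavior described above, where the task restarted from its checkpoint and still validated.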
©2026 Universitat Pompeu Fabra