hERG: information and issues

Message boards : Graphics cards (GPUs) : hERG: information and issues
Siegfried Niklas
Message 15060 - Posted: 7 Feb 2010, 17:43:34 UTC - in response to Message 14981.  

The new HERGqext are out (note the middle "q"). I'm trying a variation of the FFT parameters, using a slightly longer computation than necessary, to see if they run more stably on more cards. Thanks for your support and patience...



I'm seeing computation times of 11 h to 14.5 h on heavily overclocked GTX295 (700 MHz) / GTX265 (750 MHz) cards for the HERGqext tasks.

Time per step: 62.932 ms

Example

The TONI_HERGext tasks run for only ~6.5 h.

Time per step: 37.026 ms

Example

"slightly? :-) longer computation"
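
For reference, the slowdown implied by the two per-step timings above can be worked out directly. This is only a back-of-the-envelope sketch: it assumes both task types run a comparable number of steps, which the thread does not confirm, and the variable names are made up for illustration.

```python
# Per-step timings quoted above, in milliseconds.
hergqext_ms = 62.932  # HERGqext
hergext_ms = 37.026   # TONI_HERGext

# Relative cost of one HERGqext step compared with one TONI_HERGext step.
slowdown = hergqext_ms / hergext_ms
print(f"HERGqext is about {slowdown:.2f}x slower per step")
```

A factor of roughly 1.7 is in line with the ~6.5 h versus 11 h run times reported above.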

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Message 15061 - Posted: 7 Feb 2010, 19:11:32 UTC - in response to Message 15060.  
Last modified: 7 Feb 2010, 19:26:25 UTC

I also noticed the increase, and that was higher than expected. This is what I was trying to fix...

The new ones should be back to the norm.
Richard Haselgrove

Message 15735 - Posted: 13 Mar 2010, 18:10:42 UTC

I've had a run of three successive failures from the current batch of TONI_HERG with ACEMD v6.03, Windows XP32:

a43-TONI_HERG77a-1-100-RND4354_0
a317-TONI_HERG79a-0-100-RND8649_1
a268-TONI_HERG79a-1-100-RND6278_1

Three different machines, three CUDA cards: two 9800GT at stock, one 9800GTX+ factory overclocked.
Richard Haselgrove

Message 15812 - Posted: 18 Mar 2010, 12:43:10 UTC

Does anyone have any idea on these?

Since reporting these errors, all three cards have worked full time on GPUGrid (another refugee from SETI!), around 30 tasks completed, and with 100% success rate - including a couple of the long-running TONI_GA.

But I've continued to abort TONI_HERG on sight (apologies once again to the researchers on that project) until the situation is clearer.
skgiven
Volunteer moderator
Volunteer tester
Message 15814 - Posted: 18 Mar 2010, 13:13:11 UTC - in response to Message 15812.  

I think the only way round this sort of problem is for the server to identify each card's ability to complete the various types of work unit and allocate accordingly. If there is more than, say, a 25% chance of failure, then don't allocate the task unless there are no other tasks.
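
A minimal sketch of the allocation rule suggested above, purely for illustration. The function name, its parameters, and the idea of tracking failures per card model and work-unit type are assumptions; this is not how the BOINC or GPUGrid scheduler is actually implemented.

```python
def should_allocate(failures: int, attempts: int, other_tasks_available: bool,
                    max_failure_rate: float = 0.25) -> bool:
    """Decide whether to send a work-unit type to a given card model.

    failures/attempts are the observed history of this card model on this
    work-unit type (hypothetical bookkeeping, not a real scheduler field).
    """
    if attempts == 0:
        return True  # no history yet, so give the card a chance
    failure_rate = failures / attempts
    if failure_rate <= max_failure_rate:
        return True
    # Too unreliable: only allocate when there is nothing else to send.
    return not other_tasks_available


# A card that failed 3 of 4 tasks is skipped while other work exists.
print(should_allocate(failures=3, attempts=4, other_tasks_available=True))
```

The fallback branch keeps the "unless there are no other tasks" clause: a risky task is still better than an idle card.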
Richard Haselgrove

Message 15823 - Posted: 19 Mar 2010, 8:23:45 UTC

Another slipped in while I was asleep:

a8-TONI_HERG77a-9-100-RND1351_1
X-Files 27
Message 15966 - Posted: 24 Mar 2010, 21:37:34 UTC

here's a bad WU:
http://www.gpugrid.net/workunit.php?wuid=1282907
Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Message 15969 - Posted: 25 Mar 2010, 11:29:44 UTC - in response to Message 15966.  

This bad one seems to have been created by some file transfer error. It should fail immediately.
mwgiii

Message 16173 - Posted: 5 Apr 2010, 14:36:26 UTC
Last modified: 5 Apr 2010, 14:38:51 UTC

I have now had 3 in a row fail.

1st: http://www.gpugrid.net/result.php?resultid=2093496
2nd: http://www.gpugrid.net/result.php?resultid=2096024
3rd: http://www.gpugrid.net/result.php?resultid=2103499

I rebooted after the 1st fail. The 2nd failed after 523 seconds and the 3rd after 9.1 seconds. The failures are also putting random sparkles on my screen.

Looking back at my history, I also had one fail on April 1st: http://www.gpugrid.net/result.php?resultid=2082136

All 4 have the same error message:
MDIO ERROR: cannot open file "restart.coor"
SWAN : FATAL : Failure executing kernel sync [M_shake_position_kernel_step_1] [999]
Assertion failed: 0, file ../swan/swanlib_nv.cpp, line 203
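
When comparing several failed results, it can help to pull the failing kernel name out of each stderr dump. The sketch below assumes the SWAN fatal line always has the exact shape shown above; the function name and regular expression are illustrative only and not part of ACEMD or BOINC.

```python
import re

# Matches lines such as:
#   SWAN : FATAL : Failure executing kernel sync [frc_sum_kernel] [999]
FATAL_RE = re.compile(
    r"SWAN : FATAL : Failure executing kernel sync \[(\w+)\] \[(\d+)\]"
)


def failing_kernels(stderr_text: str) -> list[tuple[str, int]]:
    """Return (kernel_name, error_code) for every SWAN fatal line found."""
    return [(name, int(code)) for name, code in FATAL_RE.findall(stderr_text)]


sample = '''MDIO ERROR: cannot open file "restart.coor"
SWAN : FATAL : Failure executing kernel sync [M_shake_position_kernel_step_1] [999]
Assertion failed: 0, file ../swan/swanlib_nv.cpp, line 203
'''
print(failing_kernels(sample))
```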


Intel Q9450 quad with Windows Vista Premium x64.
Nvidia 9800 GTX+ with driver 197.13.
Boinc 6.10.18
JStateson
Message 16195 - Posted: 7 Apr 2010, 15:55:24 UTC

I had a similar error after crunching for 13 hours.

MDIO ERROR: cannot open file "restart.coor"
SWAN : FATAL : Failure executing kernel sync [frc_sum_kernel] [999]
Assertion failed: 0, file ../swan/swanlib_nv.cpp, line 203


I do not know which GPU it failed on: either the GTS250 with 1 GB of memory or the 9800 GTX+ with 0.5 GB.
skgiven
Volunteer moderator
Volunteer tester
Message 16197 - Posted: 7 Apr 2010, 17:04:58 UTC - in response to Message 16195.  

A GTS250 is very similar (almost identical) to a 9800 GTX+,
so it is probably not that important unless you are getting lots of failures. The difference between 1 GB and 500 MB of memory does not matter here.
ftpd

Message 16220 - Posted: 9 Apr 2010, 6:57:56 UTC

Crashed after 14 hours 25 minutes. GTS 250, driver 197.13, Windows XP.
Task 2120270.
Ton (ftpd) Netherlands
Siegfried Niklas
Message 16313 - Posted: 15 Apr 2010, 18:12:32 UTC
Last modified: 15 Apr 2010, 18:33:47 UTC

Over the past few weeks, some hERG WUs on my four 9800GT cards (Vista64) stopped with an "acemd... error bubble".
About 4 weeks ago I tried not clicking "OK" and instead restarted the PC (with the error bubble still open). After the restart, the WU resumed from the checkpoint and finished valid!

I verified this behavior with 5 further WUs. Every (valid) result shows a similar stderr output:

.....................................................................
# There is 1 device supporting CUDA
# Device 0: "GeForce 9800 GT"
# Clock rate: 1.52 GHz
# Total amount of global memory: 519634944 bytes
# Number of multiprocessors: 14
# Number of cores: 112
MDIO ERROR: cannot open file "restart.coor"
SWAN : FATAL : Failure executing kernel sync [frc_sum_kernel] [999]
Assertion failed: 0, file ../swan/swanlib_nv.cpp, line 203

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
# There is 1 device supporting CUDA
# Device 0: "GeForce 9800 GT"
# Clock rate: 1.52 GHz
# Total amount of global memory: 519634944 bytes
# Number of multiprocessors: 14
# Number of cores: 112
# Time per step: 69.189 ms
# Approximate elapsed time for entire WU: 43242.851 s
called boinc_finish

Validate state Valid
..........................................................................

Last example: http://www.gpugrid.net/result.php?resultid=2158139
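
As a rough cross-check of the stderr output above, the step count and run time of the work unit can be estimated from the two reported figures. This sketch assumes the "approximate elapsed time" is simply time per step multiplied by the number of steps, which the log does not state explicitly.

```python
# Values copied from the stderr output above.
time_per_step_ms = 69.189
elapsed_s = 43242.851

# Estimated number of simulation steps in the whole WU.
steps = round(elapsed_s / (time_per_step_ms / 1000.0))

# The same elapsed time expressed in hours.
hours = elapsed_s / 3600.0

print(f"~{steps} steps, ~{hours:.1f} h on this card")
```

That works out to roughly 625,000 steps and about 12 hours on this card.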
