KASHIF_??? workunits fixed
| Author | Message |
|---|---|
GDF Joined: 14 Mar 07 Posts: 1958 Credit: 629,356 RAC: 0 |
The newly submitted workunits called KASHIF_??? should now work even on G90 cards. The large KASHIF_dim workunits have been halved in length, and their data upload has been reduced by a factor of 4. There may still be old workunits around with the same name; you can check the creation date on the web site. The changes were applied on 20 May at 16:44 CEST. Hope it works. gdf |
Zydor Joined: 8 Feb 09 Posts: 252 Credit: 1,309,451 RAC: 0 |
For the Devs Regards Zy |
|
Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 |
That's music to my ears :) Can you tell us a little about the problem and its fix? EDIT: GDF wrote: "So, it seems that there is a bug in the compiler/hardware which appears only on pre-G200 cards." Seems like the ball is in nVidia's court now. MrS Scanning for our furry friends since Jan 2002 |
GDF Joined: 14 Mar 07 Posts: 1958 Credit: 629,356 RAC: 0 |
A bug in a routine of cuda FFT. gdf |
Hydropower Joined: 3 Apr 09 Posts: 70 Credit: 6,003,024 RAC: 0 |
I just had a couple of these: "Cuda error in file '..\cuda/cutil.h' in line 968 : out of memory. Memory usage: host: bytes device: bytes Assertion failed: 0, file ..\cuda/cutil.h, line 968 This application has requested the Runtime to terminate it in an unusual way. Please contact the application's support team for more information." Is that similar to the error we're talking about? WUs: 482275 and 482302 (IBUCHs). I notice these are all on GPU1, which may indicate a local problem. |
GDF Joined: 14 Mar 07 Posts: 1958 Credit: 629,356 RAC: 0 |
Some time ago this was an Nvidia driver problem, which was sorted out with the latest drivers. gdf |
GDF Joined: 14 Mar 07 Posts: 1958 Credit: 629,356 RAC: 0 |
Nvidia is looking into the bug. gdf |
|
Joined: 24 Dec 08 Posts: 738 Credit: 200,909,904 RAC: 0 |
I had this WU die. It was run on a GTS250 (G92 chipset, I believe) using the 185.85 drivers. It will get reported later tonight, so I won't know what the actual error is until then. BOINC blog |
|
Joined: 7 Mar 09 Posts: 12 Credit: 1,254,285 RAC: 0 |
Had a few tasks crashing with a similar error message. The latest was a KASHIF one:
# Encounter 10-12 H-bond term
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
MDIO ERROR: cannot open file "restart.coor"
The card has been running overclocked but fairly cool. This should be safe, but then again, overclocking never is. Hot spring weather may play a role (hot attic)... |
|
Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 |
Actually your error message is "Incorrect function. (0x1) - exit code 1 (0x1)", which is quite a generic one. It's not "the nasty bug" and might be related to OC and temperature. MrS Scanning for our furry friends since Jan 2002 |
|
Joined: 24 Dec 08 Posts: 738 Credit: 200,909,904 RAC: 0 |
"I had this WU die. It was run on a GTS250 (G92 chipset I believe) using 185.85 drivers. It will get reported later tonight, so I'm not sure what the actual error is until then." Turns out it's the CUDA fft_data_swizzle_in error. So they don't appear to work on GTS250s with the 185.85 drivers. BOINC blog |
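Several distinct failure texts have been quoted in the thread so far, and only one of them is the CUDA FFT bug. As a rough aid for telling them apart, here is a minimal Python sketch that buckets a task's stderr text by the substrings quoted in the posts above. The bucket labels and the `classify` helper are hypothetical names made up for illustration; they are not project terminology or part of any BOINC or ACEMD tool.

```python
# Substrings copied from the error texts quoted in this thread.
# The labels on the right are hypothetical, chosen only for this sketch.
SIGNATURES = [
    ("fft_data_swizzle_in", "cuda_fft_bug"),
    ("out of memory", "gpu_out_of_memory"),
    ("MDIO ERROR: cannot open file", "mdio_open_failed"),
    ("Incorrect function. (0x1)", "generic_exit_code_1"),
]


def classify(stderr_text):
    """Return the labels whose signature occurs in stderr_text.

    Matching is a plain case-insensitive substring test, so this only
    recognises the exact wordings seen in the thread (assumption).
    """
    lowered = stderr_text.lower()
    labels = [label for needle, label in SIGNATURES if needle.lower() in lowered]
    return labels or ["unknown"]
```

Calling `classify` on the stderr text of a failed task returns a list such as `["gpu_out_of_memory"]`, or `["unknown"]` when none of the quoted wordings is present, in which case the full output still needs a human look.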
|
Joined: 7 Mar 09 Posts: 12 Credit: 1,254,285 RAC: 0 |
"It's not 'the nasty bug' and might be related to OC and temperature." OK, I will throttle the GPU back to half the OC. The core ran at about 65°C on hot days (high 50s during the night). I suspect it will be the memory chips, but to be safe I'll throttle down the CPU likewise. |
Zydor Joined: 8 Feb 09 Posts: 252 Credit: 1,309,451 RAC: 0 |
Had a really strange one with the new WUs: I've "lost" one (!). The sequence is below, copying the key parts of the BOINC Manager messages:

25/05/2009 22:37:15 GPUGRID Computation for task p730000-IBUCH_pYEpYVk1_2105-3-10-RND7622_0 finished
25/05/2009 22:37:15 GPUGRID Starting 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0
25/05/2009 22:37:16 GPUGRID Starting task 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0 using acemd version 664

So far so good: the RND7622 message correlates with my task list. Here comes the next one, downloaded automatically in readiness for the completion of RND3111 (cache set to 0.1):

26/05/2009 10:09:30 GPUGRID Sending scheduler request: To fetch work.
26/05/2009 10:09:30 GPUGRID Requesting new tasks
26/05/2009 10:09:35 GPUGRID Scheduler request completed: got 1 new tasks
26/05/2009 10:09:37 GPUGRID Started download of p1480000-IBUCH_pYEpYIk1_2105-3-LICENSE
26/05/2009 10:09:37 GPUGRID Started download of p1480000-IBUCH_pYEpYIk1_2105-3-COPYRIGHT
26/05/2009 10:09:37 GPUGRID Started download of p1480000-IBUCH_pYEpYIk1_2105-3-p1480000-IBUCH_pYEpYIk1_2105-2-10-RND5345_1

Still good, this correlates with the task list. Bear with me...

26/05/2009 12:13:53 climateprediction.net Scheduler request completed: got 0 new tasks
26/05/2009 12:14:43 GPUGRID Computation for task 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0 finished
26/05/2009 12:14:48 GPUGRID Starting p1480000-IBUCH_pYEpYIk1_2105-3-10-RND5345_0
26/05/2009 12:14:48 GPUGRID Starting task p1480000-IBUCH_pYEpYIk1_2105-3-10-RND5345_0 using acemd version 664
26/05/2009 12:14:50 GPUGRID Started upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_0
26/05/2009 12:14:50 GPUGRID Started upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_1
26/05/2009 12:14:50 GPUGRID Started upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_2
26/05/2009 12:14:50 GPUGRID Started upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_3
26/05/2009 12:14:50 GPUGRID Started upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_4
26/05/2009 12:14:57 GPUGRID Finished upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_0
26/05/2009 12:15:32 GPUGRID Finished upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_3
26/05/2009 12:15:44 GPUGRID Finished upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_1
26/05/2009 12:15:44 GPUGRID Finished upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_2
26/05/2009 12:28:49 GPUGRID Finished upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_4

Kashif RND3111 has now finished and uploaded, and RND5345 has started. All good, the latter is still crunching away. The problem is that the Kashif has disappeared from sight: there is no record of it being downloaded as a task in the first place, nor of it being uploaded when it finished. There is nothing in my task list at all. If I had "blinked" and not seen it coming through at this end, it would have come and gone without me knowing. Something stopped it being recorded as issued, and something stopped it being recorded in the task list as complete. I suspect the credit side went wonky as well, but the key issue is that the WU which was "never issued" was crunched and returned, yet according to the task list it never existed and was never returned.

I have no doubt it lurks on the server somewhere right now, and server side all is probably normal, but it's not normal at this end. It was crunched and uploaded, and the thought occurred that since these are "new" WUs, maybe an unknown bug lurks. I don't know, but it's weird enough to report. If that makes sense, rofl :) Looks like Hollywood released Gremlins 5 and we were a secret Alpha for the pesky critters, and they ate my WU :) Regards Zy |
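The manual cross-check described in this post (matching "Starting task", "Computation ... finished" and upload messages against the web task list) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the regular expressions cover just the message wordings quoted above, other BOINC client versions may phrase them differently, and the `summarize` and `fully_uploaded` helpers are hypothetical names, not part of BOINC.

```python
import re
from collections import defaultdict

# "DD/MM/YYYY HH:MM:SS <project> <message>", as in the quoted messages.
LINE = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) (\S+) (.*)$")

# Message shapes seen in the post; each captures the task name.
# Upload lines name a file "<task>_<n>", so the trailing index is dropped.
EVENTS = [
    ("started", re.compile(r"^Starting task (\S+) using")),
    ("finished", re.compile(r"^Computation for task (\S+) finished$")),
    ("upload_started", re.compile(r"^Started upload of (\S+)_\d+$")),
    ("upload_finished", re.compile(r"^Finished upload of (\S+)_\d+$")),
]


def summarize(lines, project="GPUGRID"):
    """Count lifecycle events per task name for one project."""
    tasks = defaultdict(lambda: defaultdict(int))
    for line in lines:
        m = LINE.match(line.strip())
        if not m or m.group(2) != project:
            continue  # other projects, or lines in another format
        for name, pattern in EVENTS:
            hit = pattern.match(m.group(3))
            if hit:
                tasks[hit.group(1)][name] += 1
                break
    return {task: dict(counts) for task, counts in tasks.items()}


def fully_uploaded(counts):
    """True if the task finished and every started upload also finished."""
    started = counts.get("upload_started", 0)
    return (
        counts.get("finished", 0) > 0
        and started > 0
        and started == counts.get("upload_finished", 0)
    )
```

Feeding the client messages into `summarize` gives a local list of task names that were started, finished and uploaded, which can then be compared by hand with the task list on the web site; a name that is `fully_uploaded` locally but missing on the site is the situation reported here.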
|
Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 |
"I suspect it will be the memory chips" Memory almost never fails due to higher temperatures, unless pushed really hard. (That's because, in contrast to the CPU and GPU, the memory frequency is not limited by temperature to begin with.) MrS Scanning for our furry friends since Jan 2002 |
©2025 Universitat Pompeu Fabra