KASHIF_??? workunits fixed

Author	Message
GDF Volunteer moderator Project administrator Project developer Project tester Volunteer developer Volunteer tester Project scientist Joined: 14 Mar 07 Posts: 1958 Credit: 629,356 RAC: 0	Message 10003 - Posted: 20 May 2009, 14:45:24 UTC The newly submitted workunits called KASHIF_??? should now work even on G90 cards. The large KASHIF_dim workunits have been cut to half their previous length, and their data upload has been reduced by a factor of four. Some old workunits with the same name may still be around; you can tell them apart by the creation date on the web site. The changes were applied at 20 May 16:44 CEST. Hope it works. gdf

Zydor Joined: 8 Feb 09 Posts: 252 Credit: 1,309,451 RAC: 0	Message 10006 - Posted: 20 May 2009, 15:43:52 UTC - in response to Message 10003. For the Devs Regards Zy

ExtraTerrestrial Apes Volunteer moderator Volunteer tester Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0	Message 10016 - Posted: 21 May 2009, 9:35:58 UTC Last modified: 21 May 2009, 10:02:06 UTC That's music to my ears :) Can you tell us a little about the problem and its fix? EDIT: GDF wrote: So, it seems that there is a bug in the compiler/hardware which appears only on pre-G200 cards. We found a way to avoid it for now, but it limits what we can do, so it is not a solution. Seems like the ball is in nVidia's court now. MrS Scanning for our furry friends since Jan 2002

GDF Volunteer moderator Project administrator Project developer Project tester Volunteer developer Volunteer tester Project scientist Joined: 14 Mar 07 Posts: 1958 Credit: 629,356 RAC: 0	Message 10050 - Posted: 21 May 2009, 20:54:26 UTC - in response to Message 10016. A bug in a CUDA FFT routine. gdf

Hydropower Joined: 3 Apr 09 Posts: 70 Credit: 6,003,024 RAC: 0	Message 10056 - Posted: 22 May 2009, 0:20:33 UTC - in response to Message 10050. Last modified: 22 May 2009, 0:27:46 UTC I just had a couple of these: "Cuda error in file '..\cuda/cutil.h' in line 968 : out of memory. Memory usage: host: bytes device: bytes Assertion failed: 0, file ..\cuda/cutil.h, line 968 This application has requested the Runtime to terminate it in an unusual way. Please contact the application's support team for more information." Is that similar to the error we're talking about? WU: 482275 and 482302 (IBUCHs). I notice these are all on GPU1, which may indicate a local problem.

GDF Volunteer moderator Project administrator Project developer Project tester Volunteer developer Volunteer tester Project scientist Joined: 14 Mar 07 Posts: 1958 Credit: 629,356 RAC: 0	Message 10062 - Posted: 22 May 2009, 7:18:24 UTC - in response to Message 10056. Some time ago this was an Nvidia driver problem, which was sorted out with the latest drivers. gdf

GDF Volunteer moderator Project administrator Project developer Project tester Volunteer developer Volunteer tester Project scientist Joined: 14 Mar 07 Posts: 1958 Credit: 629,356 RAC: 0	Message 10093 - Posted: 23 May 2009, 16:50:21 UTC - in response to Message 10062. Nvidia is looking into the bug. gdf

MarkJ Volunteer moderator Volunteer tester Joined: 24 Dec 08 Posts: 738 Credit: 200,909,904 RAC: 0	Message 10157 - Posted: 25 May 2009, 12:19:32 UTC I had this WU die. It was run on a GTS250 (G92 chipset, I believe) using the 185.85 drivers. It will get reported later tonight, so I won't know what the actual error is until then. BOINC blog

SkyeHunter Joined: 7 Mar 09 Posts: 12 Credit: 1,254,285 RAC: 0	Message 10158 - Posted: 25 May 2009, 12:43:58 UTC - in response to Message 10157. Had a few tasks crashing with a similar error message. The latest was a KASHIF one. # Encounter 10-12 H-bond term WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term. WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term. MDIO ERROR: cannot open file "restart.coor" The card has been running overclocked but fairly cool. This should be safe, but then again, overclocking never is. Hot spring weather may play a role (hot attic)...

ExtraTerrestrial Apes Volunteer moderator Volunteer tester Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0	Message 10171 - Posted: 25 May 2009, 21:24:56 UTC - in response to Message 10158. Actually your error message is "Incorrect function. (0x1) - exit code 1 (0x1)", which is quite a generic one. It's not "the nasty bug" and might be related to OC and temperature. MrS Scanning for our furry friends since Jan 2002

MarkJ Volunteer moderator Volunteer tester Joined: 24 Dec 08 Posts: 738 Credit: 200,909,904 RAC: 0	Message 10173 - Posted: 25 May 2009, 21:38:00 UTC - in response to Message 10157. "I had this WU die. It was run on a GTS250 (G92 chipset I believe) using 185.85 drivers. It will get reported later tonight, so i'm not sure what the actual error is until then." Turns out it's the cuda fft_data_swizzle_in error. So these workunits don't appear to work on GTS250s with the 185.85 drivers. BOINC blog

SkyeHunter Joined: 7 Mar 09 Posts: 12 Credit: 1,254,285 RAC: 0	Message 10180 - Posted: 26 May 2009, 8:36:06 UTC - in response to Message 10171. "It's not 'the nasty bug' and might be related to OC and temperature. MrS" OK, will throttle back the GPU to half the OC. The core ran at about 65°C on hot days (high 50s during the night). I suspect it will be the memory chips, but to be safe I'll throttle down the CPU likewise.

Zydor Joined: 8 Feb 09 Posts: 252 Credit: 1,309,451 RAC: 0	Message 10190 - Posted: 26 May 2009, 13:39:18 UTC - in response to Message 10171. Last modified: 26 May 2009, 13:40:14 UTC Had a really strange one, and with the new WUs. I've "lost" one (!) The sequence is below, copying the key parts of the BOINC Manager messages:

25/05/2009 22:37:15 GPUGRID Computation for task p730000-IBUCH_pYEpYVk1_2105-3-10-RND7622_0 finished
25/05/2009 22:37:15 GPUGRID Starting 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0
25/05/2009 22:37:16 GPUGRID Starting task 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0 using acemd version 664

So far so good .... the RND7622 message correlates with my task list. Here comes the next one, getting ready for completion of RND3111, downloaded automatically (cache set to 0.1):

26/05/2009 10:09:30 GPUGRID Sending scheduler request: To fetch work.
26/05/2009 10:09:30 GPUGRID Requesting new tasks
26/05/2009 10:09:35 GPUGRID Scheduler request completed: got 1 new tasks
26/05/2009 10:09:37 GPUGRID Started download of p1480000-IBUCH_pYEpYIk1_2105-3-LICENSE
26/05/2009 10:09:37 GPUGRID Started download of p1480000-IBUCH_pYEpYIk1_2105-3-COPYRIGHT
26/05/2009 10:09:37 GPUGRID Started download of p1480000-IBUCH_pYEpYIk1_2105-3-p1480000-IBUCH_pYEpYIk1_2105-2-10-RND5345_1

Doing good, this correlates with the Task list...... bear with me...

26/05/2009 12:13:53 climateprediction.net Scheduler request completed: got 0 new tasks
26/05/2009 12:14:43 GPUGRID Computation for task 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0 finished
26/05/2009 12:14:48 GPUGRID Starting p1480000-IBUCH_pYEpYIk1_2105-3-10-RND5345_0
26/05/2009 12:14:48 GPUGRID Starting task p1480000-IBUCH_pYEpYIk1_2105-3-10-RND5345_0 using acemd version 664
26/05/2009 12:14:50 GPUGRID Started upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_0
26/05/2009 12:14:50 GPUGRID Started upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_1
26/05/2009 12:14:50 GPUGRID Started upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_2
26/05/2009 12:14:50 GPUGRID Started upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_3
26/05/2009 12:14:50 GPUGRID Started upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_4
26/05/2009 12:14:57 GPUGRID Finished upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_0
26/05/2009 12:15:32 GPUGRID Finished upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_3
26/05/2009 12:15:44 GPUGRID Finished upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_1
26/05/2009 12:15:44 GPUGRID Finished upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_2
26/05/2009 12:28:49 GPUGRID Finished upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_4

Kashif RND3111 has now finished and uploaded, and RND5345 has now started. All good, the latter is still crunching away. The problem is that the Kashif has disappeared from sight: there is no record of it either being downloaded as a task in the first place or uploaded when it was finished, nothing in my Task list at all. If I hadn't seen it coming through at this end, and had "blinked", it would have come and gone without me knowing .... something stopped it being recorded as issued, and something stopped it being recorded in the Task list as complete. I suspect the credit side went wonky as well, but the key issue is that the WU which was "never issued" was crunched and returned, yet according to the Task list it never existed and was never returned.

I have no doubt it lurks on the server somewhere right now, and server side all is probably normal, but it's not normal at this end. It was crunched and uploaded, and the thought occurred that since these are "new" WUs, maybe an unknown bug lurks .... don't know, but it's weird enough to report. If that makes sense rofl :) Looks like Hollywood released Gremlins 5 and we were a secret Alpha for the pesky critters, and they ate my WU :) Regards Zy
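For anyone who wants to do the same cross-check against their own message log without reading it by eye, here is a minimal sketch in Python. It assumes the lines look exactly like the ones pasted above (a dd/mm/yyyy hh:mm:ss timestamp, then the project name, then the message) and that upload file names are the task name plus "_N". The function name `summarize` and the regular expressions are illustrative only; none of this is part of BOINC or the GPUGRID application.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed layout: "dd/mm/yyyy hh:mm:ss PROJECT message text"
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) (\S+) (.+)$")

# Message patterns taken from the log excerpt in the post above.
EVENTS = [
    ("started", re.compile(r"^Starting task (\S+) using ")),
    ("finished", re.compile(r"^Computation for task (\S+) finished$")),
    ("upload_started", re.compile(r"^Started upload of (\S+)$")),
    ("upload_finished", re.compile(r"^Finished upload of (\S+)$")),
]


def summarize(lines, project="GPUGRID"):
    """Collect start/finish times and upload file indices per task."""
    tasks = defaultdict(lambda: {
        "started": None,
        "finished": None,
        "upload_started": set(),
        "upload_finished": set(),
    })
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or m.group(2) != project:
            continue  # skip other projects and unrecognised lines
        when = datetime.strptime(m.group(1), "%d/%m/%Y %H:%M:%S")
        for kind, pattern in EVENTS:
            hit = pattern.match(m.group(3))
            if not hit:
                continue
            name = hit.group(1)
            if kind.startswith("upload"):
                # Assumption: upload files are named "<task>_<index>".
                task, _, index = name.rpartition("_")
                tasks[task][kind].add(index)
            else:
                tasks[name][kind] = when
            break
    return dict(tasks)


if __name__ == "__main__":
    sample = [
        "25/05/2009 22:37:16 GPUGRID Starting task 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0 using acemd version 664",
        "26/05/2009 12:14:43 GPUGRID Computation for task 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0 finished",
    ]
    for name, info in summarize(sample).items():
        print(name, info["finished"] - info["started"])
```

Any task that shows up here as started, finished and fully uploaded but is missing from the Task list on the web site is a candidate for the kind of "lost" workunit described above.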

ExtraTerrestrial Apes Volunteer moderator Volunteer tester Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0	Message 10214 - Posted: 26 May 2009, 21:38:30 UTC - in response to Message 10180. "I suspect it will be the memory chips" Memory almost never fails due to higher temperatures, unless it is pushed really hard. That's because, in contrast to the CPU and GPU, the memory frequency is not limited by temperature to begin with. MrS Scanning for our furry friends since Jan 2002