New project in long queue

Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 295,172
Message 28950 - Posted: 4 Mar 2013, 9:41:40 UTC - in response to Message 28948.  

So I think the size of the output file directly affects the run time (as usual). They may have to pull the plug on this batch and rework them; we'll have to wait and see what they decide.

Far more likely that the tasks which run - by design - for a long time generate a large output file.

After the last NOELIA failure (which triggered a driver restart), I ran a couple of small BOINC tasks from another project. The first one errored, the second ran correctly. After that, I ran a long TONI - successful completion, no computer restart needed. I'm running the 314.07 driver.
Retvari Zoltan

Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 1
Message 28951 - Posted: 4 Mar 2013, 10:49:23 UTC

My systems haven't been changed since the application upgrade.
I've had no problems with these new NOELIA tasks until now. (I've received a couple of tasks with names ending in _4 and _5.)
They exhibit every strange behavior a workunit can:
- 95-100% GPU usage with no increase in the progress indicator (even after hours of processing)
- the same as above, but with 0% GPU usage
- causing the following workunit (a TONI, for example) to show the same strange behavior (a system restart can fix this)
- a significant change in GPU usage (from 75-80% to 95-100%) after a couple of minutes, but no progress
- the progress indicator staying at 0% when I abort a stuck task
GPUGRID

Joined: 12 Dec 11
Posts: 91
Credit: 2,730,095,033
RAC: 0
Message 28952 - Posted: 4 Mar 2013, 11:18:33 UTC

I'm having some new weird issues, but only on my AMD 3x690 rig. Three times now: BSODs and system restarts. They only go away if all the workunits (and the cache!) are aborted. I don't have a clue why this happens, but this AMD rig is rock solid in normal crunching and it's doing more than 2M per day on its own.
Bedrich Hajek

Joined: 28 Mar 09
Posts: 490
Credit: 11,731,645,728
RAC: 47,738
Message 28953 - Posted: 4 Mar 2013, 11:41:00 UTC - in response to Message 28951.  

My systems haven't been changed since the application upgrade.
I've had no problems with these new NOELIA tasks until now. (I've received a couple of tasks with names ending in _4 and _5.)
They exhibit every strange behavior a workunit can:
- 95-100% GPU usage with no increase in the progress indicator (even after hours of processing)
- the same as above, but with 0% GPU usage
- causing the following workunit (a TONI, for example) to show the same strange behavior (a system restart can fix this)
- a significant change in GPU usage (from 75-80% to 95-100%) after a couple of minutes, but no progress
- the progress indicator staying at 0% when I abort a stuck task


I have had the same issues, and on top of that I got an error message saying that acemd.2865.exe had crashed, and the video card ends up running at a slower speed.

I have had more errors with this application than the last time I did beta testing.


Hans Sveen

Joined: 29 Oct 08
Posts: 3
Credit: 493,308,259
RAC: 8,072
Message 28954 - Posted: 4 Mar 2013, 11:55:32 UTC

Hello!
I just want to add my experience with the latest batch:
Until late yesterday / early this morning, my capable PCs ran just fine!

The two Windows PCs (ID: 67760 and ID: 145297) started to crash after running for about 4 minutes. Looking at the BOINC messages, they told me that output files were missing, and no checkpointing was done during the short run before the crash. I also took a look at my wingmen; most errors were "The system cannot find the path specified. (0x3)" - exit code 3 (0x3). Sometimes exit codes -1 and -9 also occurred.

To eliminate a Windows driver or other Windows error, I loaded some WUs onto this host (ID: 132991) running Ubuntu. Oh yes, after running for about 5 minutes they crashed, telling me (in the BOINC Messages tab, of course) "exited with zero status but no 'finished' file"; this happened several times before they finally failed with "Output file absent".

Looking at the outcome after upload, this is what I got:
Stderr output
<core_client_version>7.0.27</core_client_version>
<![CDATA[
<message>
process exited with code 255 (0xff, -1)
</message>
<stderr_txt>
MDIO: cannot open file "restart.coor"
SWAN : FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1841.
MDIO: cannot open file "restart.coor"
SWAN : FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1841.
MDIO: cannot open file "restart.coor"
SWAN : FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1841.
MDIO: cannot open file "restart.coor"
MDIO: cannot open file "restart.coor"

</stderr_txt>
]]>
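
(A side note on that "Cuda driver error 702": the numeric code can be turned into its symbolic name with a few lines against the CUDA driver API. This is only an illustrative sketch - it assumes a CUDA toolkit recent enough (6.0 or later) to provide cuGetErrorName, and it is not part of the GPUGRID application:)

#include <cuda.h>
#include <stdio.h>

int main(void) {
    /* The code reported in the stderr output above. */
    CUresult code = (CUresult)702;

    const char *name = NULL;
    const char *desc = NULL;
    cuGetErrorName(code, &name);    /* symbolic name, e.g. CUDA_ERROR_LAUNCH_TIMEOUT for 702 */
    cuGetErrorString(code, &desc);  /* short human-readable description */

    printf("driver error %d: %s - %s\n", (int)code,
           name ? name : "unknown", desc ? desc : "unknown");
    return 0;
}

(Compile with something like nvcc err702.c -o err702 -lcuda; the file name is just an example.)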

Hope this can help with debugging the batch!

PS:
All three PCs are now running "TONI" WUs without the need to restart!

With regards,

Hans Sveen
Oslo, Norway
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 295,172
Message 28955 - Posted: 4 Mar 2013, 12:32:29 UTC - in response to Message 28953.  

I have had more errors with this application than the last time I did beta testing.

I think we need to try and distinguish between 'application' problems and 'task' problems - or 'project' (as in research project), as Noelia called it in starting this thread.
Bedrich Hajek

Joined: 28 Mar 09
Posts: 490
Credit: 11,731,645,728
RAC: 47,738
Message 28957 - Posted: 4 Mar 2013, 13:18:10 UTC - in response to Message 28955.  

I have had more errors with this application than the last time I did beta testing.

I think we need to try and distinguish between 'application' problems and 'task' problems - or 'project' (as in research project), as Noelia called it in starting this thread.


So do you think that the fact that we are getting these errors after we changed to application version 6.18 is mere coincidence?

Maybe you are right. There is a way to prove this: run the failed units under application 6.17. If they fail, it's the units, but if they don't fail, it's the new application.



Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 295,172
Message 28958 - Posted: 4 Mar 2013, 13:51:39 UTC - in response to Message 28957.  

I have had more errors with this application than the last time I did beta testing.

I think we need to try and distinguish between 'application' problems and 'task' problems - or 'project' (as in research project), as Noelia called it in starting this thread.

So do you think that the fact that we are getting these errors after we changed to application version 6.18 is mere coincidence?

Maybe you are right. There is a way to prove this: run the failed units under application 6.17. If they fail, it's the units, but if they don't fail, it's the new application.

In my personal experience, all TONI tasks, and 50% of NOELIA tasks, have run correctly under application version 6.18.
Jozef J

Joined: 7 Jun 12
Posts: 112
Credit: 1,140,895,172
RAC: 0
Message 28959 - Posted: 4 Mar 2013, 14:31:08 UTC

041px48x2-NOELIA_041p-1-2-RND9263: after 15 hours, when the work on this task ended, the NVIDIA driver crashed and the work was marked as faulty. Another one, nn016_r2-TONI_AGGd8-38-100-RND3157_0, was marked as correct. But these problems have been going on for more than a week already; it's insane. The NVIDIA driver crashes even on a proper shutdown of BOINC Manager, for example.
Now these tasks are coming in: Ann166_r2-TONI_AGGd8-11-100-RND7649_0, nn137_r2-TONI_AGGd8-20-100-RND8105_0 and Ann027_r2-TONI_AGGd8-19-100-RND9134_3.
But I'm skeptical, and I don't think they will end well.
Two weeks of crunching without anything to show for it, like many volunteers now.
Beyond

Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 28961 - Posted: 4 Mar 2013, 14:51:54 UTC - in response to Message 28958.  

I have had more errors with this application than the last time I did beta testing.

I think we need to try and distinguish between 'application' problems and 'task' problems - or 'project' (as in research project), as Noelia called it in starting this thread.

So do you think that the fact that we are getting these errors after we changed to application version 6.18 is mere coincidence?
Maybe you are right. There is a way to prove this: run the failed units under application 6.17. If they fail, it's the units, but if they don't fail, it's the new application.

In my personal experience, all TONI tasks, and 50% of NOELIA tasks, have run correctly under application version 6.18

Richard, this is my experience exactly. All TONIs run fine and 50% of NOELIAs crash. TONI should maybe give a clinic to the others. I don't think it has much to do with 6.18 either; it's just that the new NOELIAs were released at the same time as 6.18.
microchip

Joined: 4 Sep 11
Posts: 110
Credit: 326,102,587
RAC: 0
Message 28962 - Posted: 4 Mar 2013, 14:59:41 UTC

aborted a NOELIA one after it began crunching in circles...

Team Belgium
Ken_g6

Joined: 6 Aug 11
Posts: 8
Credit: 76,046,994
RAC: 0
Message 28969 - Posted: 4 Mar 2013, 17:25:34 UTC
Last modified: 4 Mar 2013, 17:31:53 UTC

The first Noelia
(the angels did say...)
took over 48 hours (on a GTX 460 768MB that's completed most work in 25 hours or less, but it finally...)
completed today.

The second one I got, which many have apparently had a different problem with, kept restarting on my machine with:
SWAN : FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1841.

That seems to be an out-of-GPU-memory error. So maybe someone should set stricter minimum memory limits on these Noelia tasks?
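
Checking how much GPU memory is actually free before a big task starts only takes a few lines with the CUDA runtime API. This is a minimal sketch - the 384 MB threshold below is just an arbitrary example, not a figure from the project:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    size_t free_bytes = 0, total_bytes = 0;

    /* Queries the default GPU; creates a context as a side effect. */
    cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("GPU memory: %zu MB free of %zu MB total\n",
           free_bytes >> 20, total_bytes >> 20);

    /* Hypothetical policy: skip the task if less than 384 MB is free. */
    if (free_bytes < ((size_t)384 << 20)) {
        printf("Not enough free GPU memory for a large NOELIA-style task.\n");
        return 2;
    }
    return 0;
}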

Edit: Technically, that wasn't my first Noelia; just the first one of this batch. I got at least one, probably more, in February, and they took 25 hours but were otherwise fine.
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 295,172
Message 28970 - Posted: 4 Mar 2013, 17:43:12 UTC - in response to Message 28969.  

The first Noelia
The second one I got...

I see that both WUs are marked

errors WU cancelled

Something may be happening behind the scenes.
MJH

Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 28971 - Posted: 4 Mar 2013, 18:02:38 UTC
Last modified: 4 Mar 2013, 18:25:11 UTC

These NOELIA WUs have been cancelled. Their successors will have a slightly different configuration that will hopefully be more stable.

Note that with this app, GPUs of compute capability 1.0, 1.1 and 1.2 are no longer supported. This means that only GeForce GTX 260s and higher will get Long WUs.
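
For reference, a host's compute capability can be read with a short CUDA runtime snippet like the one below. This is only an illustrative sketch of the check, not GPUGRID code - the project enforces the requirement on the server side:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("No CUDA-capable GPU found.\n");
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);

        /* Compute capability 1.3 or higher is needed now that
           1.0, 1.1 and 1.2 are no longer supported. */
        int cc = prop.major * 10 + prop.minor;
        printf("GPU %d: %s, compute capability %d.%d -> %s\n",
               dev, prop.name, prop.major, prop.minor,
               cc >= 13 ? "supported" : "no longer supported for Long WUs");
    }
    return 0;
}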

MJH
nate

Joined: 6 Jun 11
Posts: 124
Credit: 2,928,865
RAC: 0
Message 28972 - Posted: 4 Mar 2013, 18:03:17 UTC

We're looking at the issue. The problematic WUs have been cancelled for now. The problem was clearly on our end, but it seems that there were multiple reasons they were having issues, and mostly not Noelia's fault. She'll resend new simulations that avoid the problems in the next day or so. The large upload sizes will also be fixed.

As always, thanks for making your concerns known and alerting us to the issue.

Nate
MJH

Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 28974 - Posted: 4 Mar 2013, 18:09:50 UTC - in response to Message 28971.  

Be aware also that these and subsequent WUs will fail if you have overridden the application version and are not running the latest.

MJH
Beyond

Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 28978 - Posted: 4 Mar 2013, 19:46:04 UTC - in response to Message 28972.  

We're looking at the issue. The problematic WUs have been cancelled for now.

Were the TONI WUs cancelled too? They ran fine..
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 295,172
Message 28980 - Posted: 4 Mar 2013, 20:01:21 UTC - in response to Message 28978.  

We're looking at the issue. The problematic WUs have been cancelled for now.

Were the TONI WUs cancelled too? They ran fine..

And the two I have in progress are still fine, and shown as viable on the website.
Beyond

Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 28981 - Posted: 4 Mar 2013, 20:32:27 UTC - in response to Message 28980.  

We're looking at the issue. The problematic WUs have been cancelled for now.

Were the TONI WUs cancelled too? They ran fine..

And the two I have in progress are still fine, and shown as viable on the website.

Just got a couple new ones. Seems the queue coincidentally ran dry for a while:

GPUGRID 03-04-13 13:45 Requesting new tasks for NVIDIA
GPUGRID 03-04-13 13:45 Scheduler request completed: got 0 new tasks
GPUGRID 03-04-13 13:45 No tasks sent
GPUGRID 03-04-13 13:45 No tasks are available for Long runs (8-12 hours on fastest card)
GPUGRID

Joined: 12 Dec 11
Posts: 91
Credit: 2,730,095,033
RAC: 0
Message 28984 - Posted: 4 Mar 2013, 20:44:46 UTC - in response to Message 28972.  

We're looking at the issue. The problematic WUs have been cancelled for now. The problem was clearly on our end, but it seems that there were multiple reasons they were having issues, and mostly not Noelia's fault. She'll resend new simulations that avoid the problems in the next day or so. The large upload sizes will also be fixed.

As always, thanks for making your concerns known and alerting us to the issue.

Nate

Thank you, guys. Another thing that I really appreciate about this project is your awesome and fast support - which didn't happen on the project I ran for the past 13 years... sadly.