New project in long queue

Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 295,172
Message 28950 - Posted: 4 Mar 2013, 9:41:40 UTC - in response to Message 28948.  

So I think the size of the output file directly affects the run time (as usual). They may have to pull the plug on this batch and rework them; we'll have to wait and see what they decide.

Far more likely that the tasks which run - by design - for a long time generate a large output file.

After the last NOELIA failure (which triggered a driver restart), I ran a couple of small BOINC tasks from another project. The first one errored, the second ran correctly. After that, I ran a long TONI - successful completion, no computer restart needed. I'm running the 314.07 driver.
Retvari Zoltan

Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 1
Message 28951 - Posted: 4 Mar 2013, 10:49:23 UTC

My systems haven't been changed since the application upgrade.
I've had no problems with these new NOELIA tasks until now. (I've received a couple of tasks with names ending in _4 and _5.)
They exhibit every strange behavior a workunit can:
- 95-100% GPU usage with no increase in the progress indicator (even after hours of processing)
- the same as above, but with 0% GPU usage
- causing the following workunit (a TONI, for example) to show the same strange behavior (a system restart can fix this)
- a significant change in GPU usage (from 75-80% to 95-100%) after a couple of minutes, but no progress
- the progress indicator staying at 0% when I abort a stuck task
GPUGRID

Joined: 12 Dec 11
Posts: 91
Credit: 2,730,095,033
RAC: 0
Message 28952 - Posted: 4 Mar 2013, 11:18:33 UTC

I'm having some new weird issues, but only on my AMD 3x690 rig. Three times now: BSODs and system restarts. They only go away if all the workunits (and the cache!) are aborted. I don't have a clue why this happens, but this AMD rig is rock solid in normal crunching and it's doing more than 2M per day on its own.
Bedrich Hajek

Joined: 28 Mar 09
Posts: 490
Credit: 11,731,645,728
RAC: 47,738
Message 28953 - Posted: 4 Mar 2013, 11:41:00 UTC - in response to Message 28951.  

My systems haven't been changed since the application upgrade.
I've had no problems with these new NOELIA tasks until now. (I've received a couple of tasks with names ending in _4 and _5.)
They exhibit every strange behavior a workunit can:
- 95-100% GPU usage with no increase in the progress indicator (even after hours of processing)
- the same as above, but with 0% GPU usage
- causing the following workunit (a TONI, for example) to show the same strange behavior (a system restart can fix this)
- a significant change in GPU usage (from 75-80% to 95-100%) after a couple of minutes, but no progress
- the progress indicator staying at 0% when I abort a stuck task


I have had the same issues, and on top of that I got an error message saying that acemd.2865.exe had crashed, and the video card ends up running at a slower speed.

I have had more errors with this application than the last time I did beta testing.


Hans Sveen

Joined: 29 Oct 08
Posts: 3
Credit: 493,308,259
RAC: 8,072
Message 28954 - Posted: 4 Mar 2013, 11:55:32 UTC

Hello!
I just want to add my experience with the latest batch:
Until late yesterday / early this morning, my capable PCs ran just fine!

The two Windows PCs (ID: 67760 and ID: 145297) started to crash after running for about 4 minutes. Looking at the BOINC messages, they told me that output files were missing, and no checkpointing was done during the short run before the crash. I also took a look at my wingmen; most errors were "The system cannot find the path specified. (0x3)" - exit code 3 (0x3). Sometimes exit codes -1 and -9 also occurred.

To eliminate a Windows driver or other Windows error, I loaded some WUs onto this host (ID: 132991) running Ubuntu. Oh yes, after running for about 5 minutes they crashed, telling me (in the BOINC Messages tab, of course) "exited with zero status but no 'finished' file"; this happened several times before they finally failed with "Output file absent".

Looking at the outcome after upload, this is what I got:
Stderr output
<core_client_version>7.0.27</core_client_version>
<![CDATA[
<message>
process exited with code 255 (0xff, -1)
</message>
<stderr_txt>
MDIO: cannot open file "restart.coor"
SWAN : FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1841.
MDIO: cannot open file "restart.coor"
SWAN : FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1841.
MDIO: cannot open file "restart.coor"
SWAN : FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1841.
MDIO: cannot open file "restart.coor"
MDIO: cannot open file "restart.coor"

</stderr_txt>
]]>
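
(A side note on that "Cuda driver error 702": the numeric code can be turned into its symbolic name with a few lines against the CUDA driver API. This is only an illustrative sketch - it assumes a CUDA toolkit recent enough (6.0 or later) to provide cuGetErrorName, and it is not part of the GPUGRID application:)

#include <cuda.h>
#include <stdio.h>

int main(void) {
    /* The code reported in the stderr output above. */
    CUresult code = (CUresult)702;

    const char *name = NULL;
    const char *desc = NULL;
    cuGetErrorName(code, &name);    /* symbolic name, e.g. CUDA_ERROR_LAUNCH_TIMEOUT for 702 */
    cuGetErrorString(code, &desc);  /* short human-readable description */

    printf("driver error %d: %s - %s\n", (int)code,
           name ? name : "unknown", desc ? desc : "unknown");
    return 0;
}

(Compile with something like nvcc err702.c -o err702 -lcuda; the file name is just an example.)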

Hope this can help with debugging the batch!

PS:
All three PCs are now running "TONI" WUs without the need to restart!

With regards,

Hans Sveen
Oslo, Norway
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 295,172
Message 28955 - Posted: 4 Mar 2013, 12:32:29 UTC - in response to Message 28953.  

I have had more errors with this application than the last time I did beta testing.

I think we need to try and distinguish between 'application' problems and 'task' problems - or 'project' (as in research project), as Noelia called it in starting this thread.
Bedrich Hajek

Joined: 28 Mar 09
Posts: 490
Credit: 11,731,645,728
RAC: 47,738
Message 28957 - Posted: 4 Mar 2013, 13:18:10 UTC - in response to Message 28955.  

I have had more errors with this application than the last time I did beta testing.

I think we need to try and distinguish between 'application' problems and 'task' problems - or 'project' (as in research project), as Noelia called it in starting this thread.


So do you think that the fact that we are getting these errors after we changed to application version 6.18 is mere coincidence?

Maybe you are right. There is a way to prove this: run the failed units under application 6.17. If they fail, it's the units, but if they don't fail, it's the new application.



Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 295,172
Message 28958 - Posted: 4 Mar 2013, 13:51:39 UTC - in response to Message 28957.  

I have had more errors with this application than the last time I did beta testing.

I think we need to try and distinguish between 'application' problems and 'task' problems - or 'project' (as in research project), as Noelia called it in starting this thread.

So do you think that the fact that we are getting these errors after we changed to application version 6.18 is mere coincidence?

Maybe you are right. There is a way to prove this: run the failed units under application 6.17. If they fail, it's the units, but if they don't fail, it's the new application.

In my personal experience, all TONI tasks, and 50% of NOELIA tasks, have run correctly under application version 6.18.
Jozef J

Joined: 7 Jun 12
Posts: 112
Credit: 1,140,895,172
RAC: 0
Message 28959 - Posted: 4 Mar 2013, 14:31:08 UTC

041px48x2-NOELIA_041p-1-2-RND9263: after 15 hours, when the work on this task ended, the NVIDIA driver crashed and the work was marked as faulty. Another one, nn016_r2-TONI_AGGd8-38-100-RND3157_0, was marked as correct. But these problems have been going on for more than a week already; it's insane. The NVIDIA driver crashes even on a proper shutdown of BOINC Manager, for example.
Now these tasks are coming in: Ann166_r2-TONI_AGGd8-11-100-RND7649_0, nn137_r2-TONI_AGGd8-20-100-RND8105_0 and Ann027_r2-TONI_AGGd8-19-100-RND9134_3.
But I'm skeptical, and I don't think they will end well.
Two weeks of crunching without anything to show for it, like many volunteers now.
Beyond

Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 28961 - Posted: 4 Mar 2013, 14:51:54 UTC - in response to Message 28958.  

I have had more errors with this application than the last time I did beta testing.

I think we need to try and distinguish between 'application' problems and 'task' problems - or 'project' (as in research project), as Noelia called it in starting this thread.

So do you think that the fact that we are getting these errors after we changed to application version 6.18 is mere coincidence?
Maybe you are right. There is a way to prove this: run the failed units under application 6.17. If they fail, it's the units, but if they don't fail, it's the new application.

In my personal experience, all TONI tasks, and 50% of NOELIA tasks, have run correctly under application version 6.18

Richard, this is my experience exactly. All TONIs run fine and 50% of NOELIAs crash. TONI should maybe give a clinic to the others. I don't think it has much to do with 6.18 either; it's just that the new NOELIAs were released at the same time as 6.18.
microchip

Joined: 4 Sep 11
Posts: 110
Credit: 326,102,587
RAC: 0
Message 28962 - Posted: 4 Mar 2013, 14:59:41 UTC

aborted a NOELIA one after it began crunching in circles...

Team Belgium
Ken_g6

Joined: 6 Aug 11
Posts: 8
Credit: 76,046,994
RAC: 0
Message 28969 - Posted: 4 Mar 2013, 17:25:34 UTC
Last modified: 4 Mar 2013, 17:31:53 UTC

The first Noelia
(the angels did say...)
took over 48 hours (on a GTX 460 768MB that's completed most work in 25 hours or less, but it finally...)
completed today.

The second one I got, which many have apparently had a different problem with, kept restarting on my machine with:
SWAN : FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1841.

That seems to be an out-of-GPU-memory error. So maybe someone should set stricter minimum memory limits on these Noelia tasks?
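
Checking how much GPU memory is actually free before a big task starts only takes a few lines with the CUDA runtime API. This is a minimal sketch - the 384 MB threshold below is just an arbitrary example, not a figure from the project:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    size_t free_bytes = 0, total_bytes = 0;

    /* Queries the default GPU; creates a context as a side effect. */
    cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("GPU memory: %zu MB free of %zu MB total\n",
           free_bytes >> 20, total_bytes >> 20);

    /* Hypothetical policy: skip the task if less than 384 MB is free. */
    if (free_bytes < ((size_t)384 << 20)) {
        printf("Not enough free GPU memory for a large NOELIA-style task.\n");
        return 2;
    }
    return 0;
}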

Edit: Technically, that wasn't my first Noelia; just the first one of this batch. I got at least one, probably more, in February, and they took 25 hours but were otherwise fine.
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 295,172
Message 28970 - Posted: 4 Mar 2013, 17:43:12 UTC - in response to Message 28969.  

The first Noelia
The second one I got...

I see that both WUs are marked

errors WU cancelled

Something may be happening behind the scenes.
MJH

Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 28971 - Posted: 4 Mar 2013, 18:02:38 UTC
Last modified: 4 Mar 2013, 18:25:11 UTC

These NOELIA WUs have been cancelled. Their successors will have a slightly different configuration that will hopefully be more stable.

Note that with this app, GPUs of compute capability 1.0, 1.1 and 1.2 are no longer supported. This means that only GeForce GTX 260s and higher will get Long WUs.
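
For reference, a host's compute capability can be read with a short CUDA runtime snippet like the one below. This is only an illustrative sketch of the check, not GPUGRID code - the project enforces the requirement on the server side:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("No CUDA-capable GPU found.\n");
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);

        /* Compute capability 1.3 or higher is needed now that
           1.0, 1.1 and 1.2 are no longer supported. */
        int cc = prop.major * 10 + prop.minor;
        printf("GPU %d: %s, compute capability %d.%d -> %s\n",
               dev, prop.name, prop.major, prop.minor,
               cc >= 13 ? "supported" : "no longer supported for Long WUs");
    }
    return 0;
}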

MJH
nate

Joined: 6 Jun 11
Posts: 124
Credit: 2,928,865
RAC: 0
Message 28972 - Posted: 4 Mar 2013, 18:03:17 UTC

We're looking at the issue. The problematic WUs have been cancelled for now. The problem was clearly on our end, but it seems that there were multiple reasons they were having issues, and mostly not Noelia's fault. She'll resend new simulations that avoid the problems in the next day or so. The large upload sizes will also be fixed.

As always, thanks for making your concerns known and alerting us to the issue.

Nate
MJH

Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 28974 - Posted: 4 Mar 2013, 18:09:50 UTC - in response to Message 28971.  

Be aware also that these and subsequent WUs will fail if you have overridden the application version and are not running the latest.

MJH
Beyond

Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 28978 - Posted: 4 Mar 2013, 19:46:04 UTC - in response to Message 28972.  

We're looking at the issue. The problematic WUs have been cancelled for now.

Were the TONI WUs cancelled too? They ran fine..
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 295,172
Message 28980 - Posted: 4 Mar 2013, 20:01:21 UTC - in response to Message 28978.  

We're looking at the issue. The problematic WUs have been cancelled for now.

Were the TONI WUs cancelled too? They ran fine..

And the two I have in progress are still fine, and shown as viable on the website.
Beyond

Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 28981 - Posted: 4 Mar 2013, 20:32:27 UTC - in response to Message 28980.  

We're looking at the issue. The problematic WUs have been cancelled for now.

Were the TONI WUs cancelled too? They ran fine..

And the two I have in progress are still fine, and shown as viable on the website.

Just got a couple new ones. Seems the queue coincidentally ran dry for a while:

GPUGRID 03-04-13 13:45 Requesting new tasks for NVIDIA
GPUGRID 03-04-13 13:45 Scheduler request completed: got 0 new tasks
GPUGRID 03-04-13 13:45 No tasks sent
GPUGRID 03-04-13 13:45 No tasks are available for Long runs (8-12 hours on fastest card)
GPUGRID

Joined: 12 Dec 11
Posts: 91
Credit: 2,730,095,033
RAC: 0
Message 28984 - Posted: 4 Mar 2013, 20:44:46 UTC - in response to Message 28972.  

We're looking at the issue. The problematic WUs have been cancelled for now. The problem was clearly on our end, but it seems that there were multiple reasons they were having issues, and mostly not Noelia's fault. She'll resend new simulations that avoid the problems in the next day or so. The large upload sizes will also be fixed.

As always, thanks for making your concerns known and alerting us to the issue.

Nate

Thank you, guys. Another thing that I really appreciate about this project is your awesome and fast support - which didn't happen on the project I ran for the past 13 years... sadly.