acemdlong application 815 updated for Maxwell

MJH
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Message 36074 - Posted: 3 Apr 2014, 15:10:19 UTC - in response to Message 36070.  

Should be fixed now.

Matt
ID: 36074 · Rating: 0
Mumak
Joined: 7 Dec 12
Posts: 92
Credit: 225,897,225
RAC: 0
Level
Leu
Message 36238 - Posted: 11 Apr 2014, 21:44:30 UTC

I don't receive any long tasks on my 750 Ti. Should I be getting them?
Win7 x64, driver 337.50
ID: 36238 · Rating: 0
Jozef J
Joined: 7 Jun 12
Posts: 112
Credit: 1,140,895,172
RAC: 0
Level
Met
Message 36239 - Posted: 11 Apr 2014, 21:54:38 UTC - in response to Message 35889.  
Last modified: 11 Apr 2014, 22:05:12 UTC

short and long..
ID: 36239 · Rating: 0
Peter_M
Joined: 25 Feb 14
Posts: 15
Credit: 23,570,837
RAC: 0
Level
Pro
Message 36262 - Posted: 13 Apr 2014, 14:55:46 UTC - in response to Message 36238.  

@Mumak, I haven't received any either (Ubuntu 12.04.4, GTX 750 Ti) for about two days. The server status only ever shows 15, 21 or 27 available, so I ***guess*** there are simply far fewer long runs available than there is demand for.
ID: 36262 · Rating: 0
Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 36389 - Posted: 18 Apr 2014, 15:32:17 UTC

Doesn't seem that long cuda60 WUs are being sent out any more. Any particular reason? Has the bug with the cuda55 app that causes WU crashes when a machine is rebooted or BOINC is restarted been fixed?
ID: 36389 · Rating: 0
Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Message 36402 - Posted: 18 Apr 2014, 21:37:55 UTC - in response to Message 36389.  
Last modified: 18 Apr 2014, 21:41:02 UTC

My research indicates that I am still being given 8.15 apps that still infuriatingly crash. It seems that they never updated the non-cuda6 applications to the fixed 8.20 version :(

I continue to lose massive amounts of work every 3 days or so, as my computer usage habits require that I shut down BOINC for a couple of hours, and then the 8.15 tasks fail. It sucks.
ID: 36402 · Rating: 0
TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Level
Met
Message 36436 - Posted: 19 Apr 2014, 14:54:37 UTC - in response to Message 36402.  

Please try suspending all work first and then rebooting. That works for me: no errors when I start all projects again after booting. I am only getting 8.15 apps as I haven't upgraded my 331.82 drivers yet.
Greetings from TJ
ID: 36436 · Rating: 0
GPUGRID Role account
Joined: 15 Feb 07
Posts: 134
Credit: 1,349,535,983
RAC: 0
Level
Met
Message 36437 - Posted: 19 Apr 2014, 15:30:54 UTC - in response to Message 36389.  
Last modified: 19 Apr 2014, 15:31:06 UTC


Doesn't seem that long cuda60 WUs are being sent out any more. Any particular reason?


That's correct for Linux. Too many clients were not correctly reporting the Nvidia driver version, which makes correct scheduling difficult.

Matt
ID: 36437 · Rating: 0
Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Message 36443 - Posted: 19 Apr 2014, 16:37:52 UTC - in response to Message 36437.  

For Windows, I am still regularly getting 8.15 tasks. It's almost as if the non-cuda60 app versions were not rebuilt for 8.20. Any hopes of seeing it get fixed (since the 8.15's have the restart/resume bug?)
ID: 36443 · Rating: 0
Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 326,008
Level
Trp
Message 36447 - Posted: 19 Apr 2014, 17:12:44 UTC - in response to Message 36443.  

For Windows, I am still regularly getting 8.15 tasks. It's almost as if the non-cuda60 app versions were not rebuilt for 8.20. Any hopes of seeing it get fixed (since the 8.15's have the restart/resume bug?)

You can always check the current build status on the applications page.

We do appear to be in a transitional state at the moment, with a number of imbalances between the long and short queues again.
ID: 36447 · Rating: 0
Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Message 36448 - Posted: 19 Apr 2014, 17:21:34 UTC - in response to Message 36447.  
Last modified: 19 Apr 2014, 17:22:53 UTC

To clarify what I meant: There was an 8.20 cuda60 app on the Long Queue (proof pasted below), but now that app is gone, leaving only the buggy 8.15 apps. The applications page doesn't even indicate 8.20 on Long at all, and is a bit misleading.

I don't know what's going on, really. I just know that 8.20 seems more stable, yet I'm still being given 8.15's on the Long Queue, and they end up wasting work. :(

Yes, we appear to be in a transitional state. It just doesn't make sense why we are.

Proof:
Name I1100R11-SDOERR_BARNA-3-4-RND7766_0
Workunit 6406602
Created 9 Apr 2014 | 11:43:56 UTC
Sent 9 Apr 2014 | 14:54:40 UTC
Received 10 Apr 2014 | 8:49:06 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 153764
Report deadline 14 Apr 2014 | 14:54:40 UTC
Run time 48,792.57
CPU time 7,842.47
Validate state Valid
Credit 137,700.00
Application version Long runs (8-12 hours on fastest card) v8.20 (cuda60)
ID: 36448 · Rating: 0
Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Message 36449 - Posted: 19 Apr 2014, 17:24:04 UTC
Last modified: 19 Apr 2014, 17:24:43 UTC

Worse yet, I think I managed, at some point, to get an 8.20 task to error out with the "The file exists. (0x50) - exit code 80 (0x50)" error I had been seeing with the 8.15's. Saddening and maddening.

Name e5s7_e3s9f68-GIANNI_trypben1MCavg09-0-1-RND4755_1
Workunit 6406263
Created 9 Apr 2014 | 10:04:32 UTC
Sent 9 Apr 2014 | 12:34:36 UTC
Received 9 Apr 2014 | 14:53:42 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 80 (0x50) Unknown error number
Computer ID 153764
Report deadline 14 Apr 2014 | 12:34:36 UTC
Run time 7,386.87
CPU time 1,743.23
Validate state Invalid
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v8.20 (cuda60)
Stderr output

<core_client_version>7.3.15</core_client_version>
<![CDATA[
<message>
The file exists.
(0x50) - exit code 80 (0x50)
</message>
<stderr_txt>
# GPU [GeForce GTX 460] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:08:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : r337_00 : 33750
# GPU 0 : 67C
# GPU 1 : 57C
# GPU 2 : 74C
# GPU 0 : 68C
# GPU 1 : 60C
# GPU 0 : 69C
# GPU 1 : 63C
# GPU 1 : 65C
# GPU 0 : 70C
# GPU 1 : 66C
# GPU 0 : 71C
# GPU 1 : 68C
# GPU 1 : 69C
# GPU 0 : 72C
# GPU 1 : 70C
# GPU 1 : 71C
# GPU 2 : 75C
# GPU 1 : 72C
# GPU 2 : 76C
# GPU 2 : 77C
# BOINC suspending at user request (exit)
# GPU [GeForce GTX 460] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:08:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : r337_00 : 33750
# GPU 0 : 66C
# GPU 1 : 58C
# GPU 2 : 70C
# GPU 0 : 67C
# GPU 1 : 61C
# GPU 2 : 72C
# GPU 0 : 68C
# GPU 1 : 63C
# GPU 2 : 73C
# BOINC suspending at user request (exit)

</stderr_txt>
]]>
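
For anyone comparing runs, here is a rough sketch of how a stderr dump like the one above could be split into sessions at each "BOINC suspending at user request" line, keeping the highest temperature seen per GPU in each session. The function name and the regular expression are my own assumptions based only on the line layout shown in this post; they are not part of BOINC or the acemd app, and the thread does not say exactly when a temperature line gets written.

```python
import re
from collections import defaultdict

# Matches lines such as "# GPU 0 : 67C" (layout assumed from the post above).
# The "# GPU [GeForce GTX 460] ..." banner line does not match, because the
# pattern requires a numeric GPU index.
TEMP_LINE = re.compile(r"^#\s*GPU\s+(\d+)\s*:\s*(\d+)C\s*$")
SUSPEND_MARKER = "BOINC suspending at user request"


def peak_temps_per_session(stderr_text):
    """Return one {gpu_index: peak_temp_C} dict per session.

    A session is assumed to end at each suspend marker line.
    """
    sessions = []
    current = defaultdict(int)
    for raw in stderr_text.splitlines():
        line = raw.strip()
        match = TEMP_LINE.match(line)
        if match:
            gpu, temp = int(match.group(1)), int(match.group(2))
            current[gpu] = max(current[gpu], temp)
        elif SUSPEND_MARKER in line:
            sessions.append(dict(current))
            current = defaultdict(int)
    if current:
        # Trailing readings with no suspend marker after them.
        sessions.append(dict(current))
    return sessions
```

Feeding it the text between the stderr_txt tags gives one dictionary per run segment, which makes it easier to see whether a failure followed a suspend/exit.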
ID: 36449 · Rating: 0
skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 36458 - Posted: 20 Apr 2014, 10:29:47 UTC - in response to Message 36447.  

At present the demand for all types of WU outstrips supply (see server status). The project's current GigaFLOPs is 1,359,099. With 3450 GPU WUs in the wild, and a maximum of 2 per GPU, that means there are at least 1725 GPUs attached to the project. A mere 1,401 CPU WUs is even more limiting. Clearly the project is struggling to maintain WU supply, never mind honing the apps, developing new research models and introducing server-side fixes.
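
The GPU count follows from simple division. As a quick sketch (the function name is made up for illustration, and it assumes no GPU ever holds more than the per-GPU cap):

```python
import math


def min_gpus_attached(wus_in_progress, max_wus_per_gpu):
    """Lower bound on the number of GPUs holding work.

    Assumes every GPU holds at most max_wus_per_gpu tasks; GPUs holding
    fewer tasks would push the real count higher.
    """
    return math.ceil(wus_in_progress / max_wus_per_gpu)


print(min_gpus_attached(3450, 2))  # 1725
```

Any GPU holding only one task raises the real figure, so 1725 is a floor rather than an exact count.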

On 22 Mar the CUDA6 Long app was suspended/removed. Matt later said he doesn’t want to put it on the Long queue until he’s happy with the results from the short queue. This makes sense as it's primarily there to support Maxwells, which are entry level for GPUGrid (the top GPUs are 2.4 times faster).

max_compute_capability was applied a while ago, to override the scheduler (which tests other apps). Presumably this was only done for the short queue CUDA6 apps.

If all the apps need to be rebuilt (for this and other reasons) it will take time...
ID: 36458 · Rating: 0
Peter_M
Joined: 25 Feb 14
Posts: 15
Credit: 23,570,837
RAC: 0
Level
Pro
Message 36464 - Posted: 20 Apr 2014, 16:52:32 UTC

Thanks skgiven,
finally I know why I don't get any long WUs anymore :)
ID: 36464 · Rating: 0
MJH
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Message 36466 - Posted: 20 Apr 2014, 21:55:41 UTC - in response to Message 36402.  

Jacob,

There will be an update for the older versions of the Windows application coming tomorrow.

Matt
ID: 36466 · Rating: 0
Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Message 36468 - Posted: 20 Apr 2014, 22:56:28 UTC - in response to Message 36466.  
Last modified: 20 Apr 2014, 22:56:43 UTC

Hurray!! Thanks!! [I'll be sure to test the normal scenarios of exiting/restarting BOINC, and suspending/resuming tasks without restarting... as I rely on those scenarios all of the time!]
ID: 36468 · Rating: 0
Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 36485 - Posted: 21 Apr 2014, 16:26:22 UTC - in response to Message 36466.  

There will be an update for the older versions of the windows application coming tomorrow.

Matt

Received one GERRARD cuda60 long WU about 4 hours ago. Hopeful that the app results will be good. The Maxwells will be happy and so will the rest of us :-)
ID: 36485 · Rating: 0
MJH
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Message 36488 - Posted: 21 Apr 2014, 16:37:09 UTC - in response to Message 36468.  

cuda-42 and cuda-55 are updated now.

As an aside, I'll be deprecating cuda-55 soon. Since we've had to deploy a cuda-60 app for the Maxwells, there's not much point keeping it around: it doesn't offer any performance benefit over cuda-42[1], and there are still plenty of hosts that need the older version.

Matt

[1] Any performance benefit you think you've seen is actually from using a more modern driver.

ID: 36488 · Rating: 0
Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Message 36490 - Posted: 21 Apr 2014, 16:41:58 UTC - in response to Message 36488.  

Thank you so much, Matt. I know this will help to improve stability, although I think there is still some lingering issue, even in 8.20. :) I'm glad we are moving forward. Would you consider including additional debug output in the stderr.txt, especially during suspends/resumes, so we might be better able to figure out why a task might abruptly just quit/exit with an error?
ID: 36490 · Rating: 0
MJH
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Message 36491 - Posted: 21 Apr 2014, 16:51:13 UTC - in response to Message 36490.  
Last modified: 21 Apr 2014, 16:51:52 UTC

Jacob,

It's on the todo list! I'm going to leave 8.20 to settle for a week or so to get some good failure stats, before making any more changes.

Right now I am doing work on the CPU app, and also on our internal infrastructure, improving the tools that the researchers here in the lab use to put work on GPUGRID.

The latter's relevant to you guys as it should i) reduce the number of bad WUs we put out and ii) reduce the variation in WU runtime and credit allocation.

Matt
ID: 36491 · Rating: 0

©2025 Universitat Pompeu Fabra