acemdlong application 815 updated for Maxwell
Author | Message |
---|---|
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 |
Should be fixed now. Matt |
Joined: 7 Dec 12 Posts: 92 Credit: 225,897,225 RAC: 0 |
I don't receive any long tasks on my 750 Ti. Should I be getting them? Win7 x64, driver 337.50 |
Joined: 7 Jun 12 Posts: 112 Credit: 1,140,895,172 RAC: 0 |
short and long.. |
Joined: 25 Feb 14 Posts: 15 Credit: 23,570,837 RAC: 0 |
@Mumak, I haven't received any either (Ubuntu 12.04.4, GTX 750 Ti) for about two days, but the server status always shows only 15, 21, 27 available, so I ***guess*** there are simply far fewer long runs available than there is demand. |
Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0 |
Doesn't seem that long cuda60 WUs are being sent out any more. Any particular reason? Has the bug with the cuda55 app that causes WU crashes when a machine is rebooted or BOINC is restarted been fixed? |
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 |
My research indicates that I am still being given 8.15 apps that still infuriatingly crash. It seems that they never updated the non-cuda6 applications to the fixed 8.20 version :( I continue to lose massive amounts of work every 3 days or so, as my computer usage habits require that I shut down BOINC for a couple of hours, and then the 8.15 tasks fail. It sucks. |
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0 |
Please try to suspend all work first and then reboot. That works for me: no errors when I start all projects again after booting. I am only getting 8.15 apps as I haven't upgraded my 331.82 drivers yet. Greetings from TJ |
Joined: 15 Feb 07 Posts: 134 Credit: 1,349,535,983 RAC: 0 |
That's correct for Linux. Too many clients were not correctly reporting the Nvidia driver version, which makes correct scheduling difficult. Matt |
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 |
For Windows, I am still regularly getting 8.15 tasks. It's almost as if the non-cuda60 app versions were not rebuilt for 8.20. Any hopes of seeing it get fixed (since the 8.15's have the restart/resume bug?) |
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 295,172 |
*"For Windows, I am still regularly getting 8.15 tasks. It's almost as if the non-cuda60 app versions were not rebuilt for 8.20. Any hopes of seeing it get fixed (since the 8.15's have the restart/resume bug?)"* You can always check the current build status on the applications page. We do appear to be in a transitional state at the moment, with a number of imbalances between the long and short queues again. |
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 |
To clarify what I meant: There was an 8.20 cuda60 app on the Long Queue (proof pasted below), but now that app is gone, leaving only the buggy 8.15 apps. The applications page doesn't even indicate 8.20 on Long at all, and is a bit misleading. I don't know what's going on, really. I just know that 8.20 seems more stable, yet I'm still being given 8.15's on the Long Queue, and they end up wasting work. :( Yes, we appear to be in a transitional state. It just doesn't make sense why we are. Proof: Name: I1100R11-SDOERR_BARNA-3-4-RND7766_0; Workunit: 6406602; Created: 9 Apr 2014, 11:43:56 UTC; Sent: 9 Apr 2014, 14:54:40 UTC; Received: 10 Apr 2014, 8:49:06 UTC; Server state: Over; Outcome: Success; Client state: Done; Exit status: 0 (0x0); Computer ID: 153764; Report deadline: 14 Apr 2014, 14:54:40 UTC; Run time: 48,792.57; CPU time: 7,842.47; Validate state: Valid; Credit: 137,700.00; Application version: Long runs (8-12 hours on fastest card) v8.20 (cuda60) |
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 |
Worse yet, I think I managed, at some point, to get an 8.20 task to error out with the "The file exists. (0x50) - exit code 80 (0x50)" error I had been seeing with the 8.15's. Saddening and maddening. Name: e5s7_e3s9f68-GIANNI_trypben1MCavg09-0-1-RND4755_1; Workunit: 6406263; Created: 9 Apr 2014, 10:04:32 UTC; Sent: 9 Apr 2014, 12:34:36 UTC; Received: 9 Apr 2014, 14:53:42 UTC; Server state: Over; Outcome: Computation error; Client state: Compute error; Exit status: 80 (0x50) Unknown error number; Computer ID: 153764; Report deadline: 14 Apr 2014, 12:34:36 UTC; Run time: 7,386.87; CPU time: 1,743.23; Validate state: Invalid; Credit: 0.00; Application version: Long runs (8-12 hours on fastest card) v8.20 (cuda60). Stderr output: <core_client_version>7.3.15</core_client_version> <![CDATA[ <message> The file exists. (0x50) - exit code 80 (0x50) </message> <stderr_txt> # GPU [GeForce GTX 460] Platform [Windows] Rev [3301M] VERSION [60] # SWAN Device 1 : # Name : GeForce GTX 460 # ECC : Disabled # Global mem : 1024MB # Capability : 2.1 # PCI ID : 0000:08:00.0 # Device clock : 1526MHz # Memory clock : 1900MHz # Memory width : 256bit # Driver version : r337_00 : 33750 # GPU 0 : 67C # GPU 1 : 57C # GPU 2 : 74C # GPU 0 : 68C # GPU 1 : 60C # GPU 0 : 69C # GPU 1 : 63C # GPU 1 : 65C # GPU 0 : 70C # GPU 1 : 66C # GPU 0 : 71C # GPU 1 : 68C # GPU 1 : 69C # GPU 0 : 72C # GPU 1 : 70C # GPU 1 : 71C # GPU 2 : 75C # GPU 1 : 72C # GPU 2 : 76C # GPU 2 : 77C # BOINC suspending at user request (exit) # GPU [GeForce GTX 460] Platform [Windows] Rev [3301M] VERSION [60] # SWAN Device 1 : # Name : GeForce GTX 460 # ECC : Disabled # Global mem : 1024MB # Capability : 2.1 # PCI ID : 0000:08:00.0 # Device clock : 1526MHz # Memory clock : 1900MHz # Memory width : 256bit # Driver version : r337_00 : 33750 # GPU 0 : 66C # GPU 1 : 58C # GPU 2 : 70C # GPU 0 : 67C # GPU 1 : 61C # GPU 2 : 72C # GPU 0 : 68C # GPU 1 : 63C # GPU 2 : 73C # BOINC suspending at user request (exit) </stderr_txt> ]]> |
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
At present the demand for all types of WU outstrips supply (see the server status page). The project's current GigaFLOPs is 1,359,099. With 3,450 GPU WUs in the wild, and a maximum of 2 per GPU, there must be at least 3,450 / 2 = 1,725 GPUs attached to the project. A mere 1,401 CPU WUs is even more limiting. Clearly the project is struggling to maintain WU supply, never mind honing the apps, developing new research models and introducing server-side fixes. On 22 Mar the CUDA6 Long app was suspended/removed. Matt later said he doesn't want to put it on the Long queue until he's happy with the results from the short queue. This makes sense as it's primarily there to support Maxwells, which are entry level for GPUGrid (the top GPUs are 2.4 times faster). max_compute_capability was applied a while ago, to override the scheduler (which tests other apps). Presumably this was only done for the short queue CUDA6 apps. If all the apps need to be rebuilt (for this and other reasons) it will take time... |
Joined: 25 Feb 14 Posts: 15 Credit: 23,570,837 RAC: 0 |
Thanks skgiven, finally I know why I don't get any long WUs anymore :) |
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 |
Jacob, There will be an update for the older versions of the Windows application coming tomorrow. Matt |
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 |
Hurray!! Thanks!! [I'll be sure to test the normal scenarios of exiting/restarting BOINC, and suspending/resuming tasks without restarting... as I rely on those scenarios all of the time!] |
Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0 |
*"There will be an update for the older versions of the windows application coming tomorrow."* Received one GERRARD cuda60 long WU about 4 hours ago. Hopeful that the app results will be good. The Maxwells will be happy and so will the rest of us :-) |
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 |
cuda-42 and cuda-55 are updated now. As an aside, I'll be deprecating cuda-55 soon. Since we've had to deploy a cuda-60 app for the Maxwells, there's not much point keeping it around: it doesn't offer any performance benefit over cuda-42[1], and there are still plenty of hosts that need the older version. Matt [1] Any performance benefit you think you've seen is actually from using a more modern driver. |
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 |
Thank you so much Matt. I know this will help to improve stability, although I think there is still some lingering issue, even in 8.20. :) I'm glad we are moving forward. Is there any way you would consider including additional debug output in the stderr.txt, especially during suspends/resumes, so we might be better able to figure out why a task might abruptly quit/exit with an error? |
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 |
Jacob, It's on the todo list! I'm going to leave 8.20 to settle for a week or so to get some good failure stats, before making any more changes. Right now I am doing work on the CPU app, and also on our internal infrastructure, improving the tools that the researchers here in the lab use to put work on GPUGRID. The latter's relevant to you guys as it should i) reduce the number of bad WUs we put out and ii) reduce the variation in WU runtime and credit allocation. Matt |
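
Until extra debug output lands in stderr.txt, the existing log can still be summarised by hand. Below is a rough Python sketch, not anything from the project itself, that counts app starts and user-requested suspends and records the peak temperature per GPU. It assumes the real stderr.txt has one `#` entry per line (the copy pasted in this thread is flattened onto a single line); the function name `summarize_stderr` and the abridged `SAMPLE` text are made up for illustration, with the sample lines taken from the failed 8.20 task quoted above.

```python
import re
from collections import defaultdict

# Abridged lines from the stderr output quoted in the thread.
# Assumption: in the real stderr.txt each "#" entry sits on its own line.
SAMPLE = """\
# GPU [GeForce GTX 460] Platform [Windows] Rev [3301M] VERSION [60]
# Driver version : r337_00 : 33750
# GPU 0 : 67C
# GPU 1 : 57C
# GPU 2 : 74C
# GPU 2 : 77C
# BOINC suspending at user request (exit)
# GPU [GeForce GTX 460] Platform [Windows] Rev [3301M] VERSION [60]
# Driver version : r337_00 : 33750
# GPU 0 : 66C
# GPU 2 : 73C
# BOINC suspending at user request (exit)
"""

START_RE = re.compile(r"^# GPU \[.*\] Platform \[.*\]")    # app (re)start banner
SUSPEND_RE = re.compile(r"^# BOINC suspending at user request")
TEMP_RE = re.compile(r"^# GPU (\d+) : (\d+)C$")            # temperature reading


def summarize_stderr(text):
    """Count starts/suspends and track the peak temperature seen per GPU."""
    starts = 0
    suspends = 0
    max_temp = defaultdict(int)
    for raw in text.splitlines():
        line = raw.strip()
        if START_RE.match(line):
            starts += 1
        elif SUSPEND_RE.match(line):
            suspends += 1
        else:
            m = TEMP_RE.match(line)
            if m:
                gpu, temp = int(m.group(1)), int(m.group(2))
                max_temp[gpu] = max(max_temp[gpu], temp)
    return {"starts": starts, "suspends": suspends, "max_temp_c": dict(max_temp)}


if __name__ == "__main__":
    print(summarize_stderr(SAMPLE))
```

A log where every start is matched by a suspend, as in the sample, suggests the task exited on request each time and the failure happened on the next restart, which is the scenario described earlier in the thread.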
©2025 Universitat Pompeu Fabra