acemdbeta application - discussion
Author | Message |
---|---|
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 |
The Beta application may be somewhat volatile for the next few days, as we try to understand and fix the remaining common failure modes. This will ultimately lead to a more stable production application, so please do continue to take WUs from there. Your help's appreciated. Thanks, Matt |
Joined: 28 Mar 09 Posts: 490 Credit: 11,731,645,728 RAC: 52,725 |
These new betas reduced the clock speed of the 690 video cards on my Windows 7 computer from 1097 MHz to 914 MHz while they were running on that GPU. If a non-beta task was running alongside the beta on another GPU, the non-beta ran at the higher speed. When the beta finished and a non-beta started running on that GPU, the speed returned to 1097 MHz. This did not happen on Windows XP with the 690 video card. The driver on the Windows 7 computer is the beta 326.80, with EVGA Precision X 4.0, while the Windows XP computer is running 314.22, with EVGA Precision X 4.0. |
Joined: 11 Dec 11 Posts: 21 Credit: 145,887,858 RAC: 0 |
How do we get the beta apps? I didn't receive any. My settings: ACEMD short runs (2-3 hours on fastest card) for CUDA 4.2: no; ACEMD short runs (2-3 hours on fastest card) for CUDA 3.1: no; ACEMD beta: yes; ACEMD long runs (8-12 hours on fastest GPU) for CUDA 4.2: yes |
Joined: 16 Jul 07 Posts: 209 Credit: 5,496,673,456 RAC: 9,375,500 |
"How do we get the beta apps? I didn't receive any. My settings:" Also select "Run test applications?" Reno, NV Team: SETI.USA |
Joined: 8 Mar 12 Posts: 411 Credit: 2,083,882,218 RAC: 0 |
I hope my massive quantity of failed beta WUs is helping ;) |
Joined: 11 Dec 11 Posts: 21 Credit: 145,887,858 RAC: 0 |
"How do we get the beta apps? I didn't receive any. My settings:" OK, I have selected it now. May I ask something else? I see a lot of beta WUs with run times of about 60 seconds that pay 1,500 credits. Is that right? Normally WUs take a long time to crunch here. All my GPUs are cuda55 capable; do I need to change anything else in the settings? Thanks for the help. |
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 |
5pot - that's exactly the problem that I am trying to fix. Your machine seems one of the worst affected. Could you PM more details about its setup please? In particular if you have any AV or GPU-related utilities installed. MJH |
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
# GPU 0 Current Temp: 66 C Target Temp 1 # GPU 1 Current Temp: 57 C Target Temp 1 6653000 1116.8328 2531.7110 3339.7214 -271975.1685 33633.0608 -231353.8424 46210.2440 0.0000 -185143.5984 296.9487 0.0000 0.0000 # GPU 0 Current Temp: 66 C Target Temp 1 # GPU 1 Current Temp: 56 C Target Temp 1 6654000 1.#QNB 1.#QNB 1.#QNB 1.#QNB 0.0000 1.#QNB 1.#QNB 0.0000 1.#QNB 1.#QNB 0.0000 0.0000 # The simulation has become unstable. Terminating to avoid lock-up (1) Snippet from an ACEMD beta version v8.10 (cuda55) WU that failed when I started using the system. It ran for 8.8h before becoming unstable :( - Using MSI Afterburner to control GPU temps. FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help |
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0 |
I got 3 ACEMD v8.07 betas overnight and all failed. I got 3 short ones with v8.09 that ran longer. And now I have a NOELIA_KLEBEbeta-2-3 running on v8.09 that has done 76% in 16h32m on my 660. This is way longer than on versions 8.00 to 8.04. Greetings from TJ |
Joined: 5 Jun 09 Posts: 38 Credit: 2,880,758,878 RAC: 0 |
"May I ask something else? I see a lot of beta WUs with run times of about 60 seconds that pay 1,500 credits. Is that right? Normally WUs take a long time to crunch here." That is OK. I received a lot of small ones too: 7242845 4748357 3 Sep 2013 4:51:19 UTC 3 Sep 2013 9:11:28 UTC Completed and validated 14,906.95 7,437.96 15,000.00 ACEMD beta version v8.10 (cuda55); 7242760 4748284 3 Sep 2013 4:35:37 UTC 3 Sep 2013 4:48:40 UTC Completed and validated 152.31 76.89 1,500.00 ACEMD beta version v8.10 (cuda55); 7242759 4748283 3 Sep 2013 4:35:37 UTC 3 Sep 2013 4:42:40 UTC Completed and validated 152.00 74.30 1,500.00 ACEMD beta version v8.10 (cuda55); 7242751 4748277 3 Sep 2013 4:35:37 UTC 3 Sep 2013 4:51:19 UTC Completed and validated 151.85 74.69 1,500.00 ACEMD beta version v8.10 (cuda55) |
Joined: 11 Dec 11 Posts: 21 Credit: 145,887,858 RAC: 0 |
But that kind of WU produces a lot of 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED errors, since the initial estimate is very short but some of them run for a long time. Maybe a bug, who knows? E.g. http://www.gpugrid.net/result.php?resultid=7244185 |
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 326,008 |
"But that kind of WU produces a lot of:" Number crunching knows. |
Joined: 11 Dec 11 Posts: 21 Credit: 145,887,858 RAC: 0 |
"But that kind of WU produces a lot of:" Thanks for the info, but editing client_state.xml is dangerous territory, at least for me. I will wait for the fix. |
Joined: 28 Mar 09 Posts: 490 Credit: 11,731,645,728 RAC: 52,725 |
On my computers, the beta versions 8.05 to 8.10 are significantly slower than version 8.04. The MJHARVEY_TEST14, for example, ran about 70 seconds with version 8.04, but takes about 4 minutes to complete on versions 8.05 through 8.10. I ran NOELIA_KLEBEbeta WUs in 10 to 12 hours on version 8.04, while currently I am running 4 of these NOELIA units on versions 8.05 and 8.10, and it looks like they will finish in about 16 to 20 hours. These results are typical for Windows 7 and XP, CUDA 4.2 and 5.5, Nvidia drivers 314.22 and 326.80. The only difference is that Windows 7 downclocks but XP doesn't. Please don't cancel the units; they seem to be running okay, and I want to finish them to prove that point. I hope the next beta version is faster and better. |
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
Had another Blue Screen crash! The culprit was a NOELIA_KLEBEbeta. It had run for 16h 30min on a GTX 660 Ti (which sounds a bit too long). I cold started the system and the same WU restarted from zero. I aborted the WU, 063px68-NOELIA_KLEBEbeta-0-3-RND7563_1. Unfortunately 3 WUs from other projects errored: a WUProp task and two climate models (330h lost)! The same WU had already completed on a GTX 560 Ti using v8.02: 7221577 114293 29 Aug 2013 15:39:46 UTC 3 Sep 2013 18:13:38 UTC Completed and validated 64,854.86 2,489.65 95,200.00 ACEMD beta version v8.02 (cuda42); 7244112 139265 3 Sep 2013 15:40:26 UTC 4 Sep 2013 9:25:31 UTC Aborted by user 60,387.27 15,443.53 --- ACEMD beta version v8.10 (cuda55). Obviously the changes have made the WU run slower; a GTX 660 Ti should be much faster than a GTX 560 Ti. Are you trying to stabilize WUs by temperature-targeted control of the GPU, or do you just want to see if there is a temperature issue? The NOELIA_KLEBE WUs are still causing driver restarts and occasionally blue-screen crashing the system, which kills other work. The WU below might not have been running properly/using the GPU (seen previously with GPU load at 0). 
# GPU 0 Current Temp: 32 C Target Temp 1 # GPU 1 Current Temp: 54 C Target Temp 1 4269000 1119.7984 2489.9390 3374.5652 -270800.3207 33500.5416 -230315.4765 46055.2623 0.0000 -184260.2142 295.9528 20749.7732 20749.7732 #SWAN : Running in DEBUG mode # CUDA Synchronisation mode: BLOCKING # SWAN Device 0 : # Name : GeForce GTX 660 Ti # ECC : Disabled # Global mem : 2048MB # Capability : 3.0 # PCI ID : 0000:05:00.0 # Device clock : 1110MHz # Memory clock : 3004MHz # Memory width : 192bit # Driver version : r325_00 : 32641 #SWAN NVAPI Version: NVidia Complete Version 1.10 # GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3192M] VERSION [55] # SWAN Device 0 : # Name : GeForce GTX 660 Ti # ECC : Disabled # Global mem : 2048MB # Capability : 3.0 # PCI ID : 0000:05:00.0 # Device clock : 1110MHz # Memory clock : 3004MHz # Memory width : 192bit # Driver version : r325_00 : 32641 # SWAN : Attempt to malloc 1234688. 2049331200 free # SWAN : Attempt to malloc 144640. 2048020480 free # SWAN : Attempt to malloc 317056. 2046971904 free # SWAN : Attempt to malloc 80128. 2045923328 free # SWAN : Attempt to malloc 1152. 2045923328 free # SWAN : Attempt to malloc 1152. 2044874752 free # SWAN : Attempt to malloc 1152. 2044874752 free # SWAN : Attempt to malloc 1152. 2044874752 free # SWAN : Attempt to malloc 2816. 2044874752 free # SWAN : Attempt to malloc 1792. 2044874752 free # SWAN : Attempt to malloc 1152. 2044874752 free ... # swanReallocHost: new allocation of 4 # swanRealloc: new allocation of 2580480 # SWAN : Attempt to malloc 2581504. 1747210240 free ... Unfortunately, the problems I've experienced with the NOELIA_KLEBE WU's have been too severe. While I don't mind testing a WU, and getting WU failures, testing to the point of self-destruction isn't for me. FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help |
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 |
Oh man, skgiven! You lost 330 hours of work? That's very unfortunate. It sounds like the beta app needs even more critical sections added to avoid more driver crashes and BSODs. MJH: Are there any other sections within the app that are missing the BOINC critical section logic? We need you to solve this. |
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
It's not the first time, or the second - Maybe I'll never learn! Fortunately only 2 climate WU's were running this time (I had others suspended, but was hoping I could get a couple done without any issues; my system has been very stable of late - close, but no cigar). Hopefully I won't see too many more... It seems WUProp and Climate have tasks that are particularly vulnerable to such failures. C'est la vie, FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help |
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 |
Highly surprising that the driver reset would cause two non-GPU-using apps (I presume) to crash. Were their graphical screen-savers enabled? Are you sure the BOINC client didn't get confused and terminate them? MJH |
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 |
8.11 (now on beta and short) represents the best that can be done using that method. MJH |
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 |
Note: A Blue-Screen is worse than a driver reset. And I have sometimes seen GPUGrid tasks, when being suspended (looking at the NOELIA ones again)... give me blue screens in the past, with error DPC_WATCHDOG_VIOLATION. My experience has been: - A driver reset can cause GPU tasks to fail, even if they are on other GPUs working on other projects. - A BSOD can cause any task to fail, including CPU ones, if they aren't robust enough to handle resuming after restarting Windows from the abrupt BSOD. Make sense?
|
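
A note on the failed v8.10 task quoted earlier in the thread: the `1.#QNB` entries are what older Microsoft C runtimes print for a NaN once it is rounded to four decimal places, so by the record starting 6654000 nearly every reported value had already become not-a-number. Below is a minimal Python sketch of finding that point in a saved log. It assumes one record per line with the first column acting as a step counter (the thread does not label the columns), and the function names are made up for illustration; they are not part of ACEMD or BOINC.

```python
import math

# Text that older Microsoft C runtimes emit for non-finite doubles,
# e.g. "1.#QNAN", "1.#INF", "-1.#IND". Rounded to four decimals,
# "1.#QNAN" shows up as "1.#QNB".
MSVC_NONFINITE_PREFIXES = ("1.#", "-1.#")


def parse_value(token):
    """Convert a log token to float; MSVC-style NaN/Inf text becomes nan."""
    if token.startswith(MSVC_NONFINITE_PREFIXES):
        return math.nan  # enough for a finite/non-finite check
    return float(token)


def first_unstable_step(log_text):
    """Return the first-column value of the first record holding a
    non-finite number, or None if every record is finite."""
    for line in log_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and "# ..." status lines
        tokens = line.split()
        try:
            step = int(tokens[0])
            values = [parse_value(t) for t in tokens[1:]]
        except ValueError:
            continue  # not a numeric record
        if any(not math.isfinite(v) for v in values):
            return step
    return None


# Records copied from the post above, one per line (layout assumed).
SAMPLE_LOG = """\
# GPU 0 Current Temp: 66 C Target Temp 1
# GPU 1 Current Temp: 57 C Target Temp 1
6653000 1116.8328 2531.7110 3339.7214 -271975.1685 33633.0608 -231353.8424 46210.2440 0.0000 -185143.5984 296.9487 0.0000 0.0000
# GPU 0 Current Temp: 66 C Target Temp 1
# GPU 1 Current Temp: 56 C Target Temp 1
6654000 1.#QNB 1.#QNB 1.#QNB 1.#QNB 0.0000 1.#QNB 1.#QNB 0.0000 1.#QNB 1.#QNB 0.0000 0.0000
# The simulation has become unstable. Terminating to avoid lock-up (1)
"""

print(first_unstable_step(SAMPLE_LOG))
```

The prefix check is needed because Python's `float()` does not accept the MSVC spellings; without it the unstable record would be skipped as non-numeric instead of being flagged.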
©2025 Universitat Pompeu Fabra
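
On the 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED errors discussed earlier in the thread: as far as I understand the BOINC client, it aborts a task once its elapsed time passes the workunit's `rsc_fpops_bound` divided by the speed (in FLOPS) that the client expects from the application version. If a run of very short test tasks has pushed that speed estimate too high, a healthy but genuinely long task can run into the limit. The sketch below only illustrates that arithmetic. The numbers are invented, not taken from any GPUGRID workunit, and the function names are hypothetical.

```python
def elapsed_time_limit_seconds(rsc_fpops_bound, estimated_flops):
    """Elapsed-time limit under the assumption described above:
    operation bound divided by the expected speed of the app version."""
    return rsc_fpops_bound / estimated_flops


def would_be_aborted(actual_runtime_seconds, rsc_fpops_bound, estimated_flops):
    """True if a task needing actual_runtime_seconds would pass the limit."""
    limit = elapsed_time_limit_seconds(rsc_fpops_bound, estimated_flops)
    return actual_runtime_seconds > limit


# Invented example values (assumptions, for illustration only).
FPOPS_BOUND = 5e15          # hypothetical rsc_fpops_bound of a workunit
REALISTIC_FLOPS = 1e12      # speed estimate that matches the hardware
INFLATED_FLOPS = 1e13       # estimate skewed upward by very short tasks
ACTUAL_RUNTIME = 4000.0     # seconds the task really needs

print(elapsed_time_limit_seconds(FPOPS_BOUND, REALISTIC_FLOPS))
print(would_be_aborted(ACTUAL_RUNTIME, FPOPS_BOUND, REALISTIC_FLOPS))
print(elapsed_time_limit_seconds(FPOPS_BOUND, INFLATED_FLOPS))
print(would_be_aborted(ACTUAL_RUNTIME, FPOPS_BOUND, INFLATED_FLOPS))
```

With the realistic estimate the task fits inside the limit; with the tenfold inflated estimate the same task would be aborted, which matches the pattern reported above of short estimates and long actual runs.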