acemdbeta application - discussion

MJH
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 32645 - Posted: 2 Sep 2013, 22:34:54 UTC

The Beta application may be somewhat volatile for the next few days, as we try to understand and fix the remaining common failure modes. This will ultimately lead to a more stable production application, so please do continue to take WUs from there. Your help's appreciated.

Thanks,

Matt

Bedrich Hajek
Joined: 28 Mar 09
Posts: 490
Credit: 11,731,645,728
RAC: 52,725
Message 32652 - Posted: 3 Sep 2013, 6:40:28 UTC - in response to Message 32645.  

These new betas reduced the clock speed of the 690 video cards on my Windows 7 computer from 1097 MHz to 914 MHz while they were running on that GPU. If a non-beta task was running alongside the beta on another GPU, the non-beta ran at the higher speed. When the beta finished and a non-beta task started on that GPU, the speed returned to 1097 MHz. This did not happen on Windows XP with the 690 video card. The driver on the Windows 7 computer is the beta 326.80, with EVGA Precision X 4.0, while the Windows XP computer is running 314.22, with EVGA Precision X 4.0.





juan BFP
Joined: 11 Dec 11
Posts: 21
Credit: 145,887,858
RAC: 0
Message 32653 - Posted: 3 Sep 2013, 13:21:22 UTC

How do we get the beta apps? I didn't receive any. My settings:

ACEMD short runs (2-3 hours on fastest card) for CUDA 4.2: no
ACEMD short runs (2-3 hours on fastest card) for CUDA 3.1: no
ACEMD beta: yes
ACEMD long runs (8-12 hours on fastest GPU) for CUDA 4.2: yes

zombie67 [MM]
Joined: 16 Jul 07
Posts: 209
Credit: 5,496,860,456
RAC: 8,582,660
Message 32654 - Posted: 3 Sep 2013, 13:22:36 UTC - in response to Message 32653.  

> How do we get the beta apps? I didn't receive any. My settings:
>
> ACEMD short runs (2-3 hours on fastest card) for CUDA 4.2: no
> ACEMD short runs (2-3 hours on fastest card) for CUDA 3.1: no
> ACEMD beta: yes
> ACEMD long runs (8-12 hours on fastest GPU) for CUDA 4.2: yes


Also select "Run test applications?"
Reno, NV
Team: SETI.USA

5pot
Joined: 8 Mar 12
Posts: 411
Credit: 2,083,882,218
RAC: 0
Message 32655 - Posted: 3 Sep 2013, 13:24:16 UTC

I hope my massive quantity of failed beta WUs is helping ;)

juan BFP
Message 32657 - Posted: 3 Sep 2013, 14:04:47 UTC - in response to Message 32654.  
Last modified: 3 Sep 2013, 14:43:03 UTC

> > How do we get the beta apps? I didn't receive any. My settings:
> >
> > ACEMD short runs (2-3 hours on fastest card) for CUDA 4.2: no
> > ACEMD short runs (2-3 hours on fastest card) for CUDA 3.1: no
> > ACEMD beta: yes
> > ACEMD long runs (8-12 hours on fastest GPU) for CUDA 4.2: yes
>
> Also select "Run test applications?"

OK, I've selected it now.

May I ask something else? I see a lot of beta WUs with run times of about 60 seconds that paid 1,500 credits. Is that right? Normally the WUs take a long time to crunch here.

All my GPUs are CUDA 5.5 capable. Do I need to change anything else in the settings?

Thanks for the help

MJH
Message 32660 - Posted: 3 Sep 2013, 15:10:31 UTC - in response to Message 32655.  


> I hope my massive quantity of failed beta WUs is helping ;)



5pot - that's exactly the problem that I am trying to fix. Your machine seems to be one of the worst affected. Could you PM more details about its setup, please? In particular, whether you have any AV or GPU-related utilities installed.

MJH

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 32661 - Posted: 3 Sep 2013, 15:56:51 UTC - in response to Message 32598.  
Last modified: 3 Sep 2013, 16:01:33 UTC

# GPU 0 Current Temp: 66 C Target Temp 1
# GPU 1 Current Temp: 57 C Target Temp 1
6653000 1116.8328 2531.7110 3339.7214 -271975.1685 33633.0608 -231353.8424 46210.2440 0.0000 -185143.5984 296.9487 0.0000 0.0000
# GPU 0 Current Temp: 66 C Target Temp 1
# GPU 1 Current Temp: 56 C Target Temp 1
6654000 1.#QNB 1.#QNB 1.#QNB 1.#QNB 0.0000 1.#QNB 1.#QNB 0.0000 1.#QNB 1.#QNB 0.0000 0.0000
# The simulation has become unstable. Terminating to avoid lock-up (1)

Snippet from an ACEMD beta version v8.10 (cuda55) WU that failed when I started using the system. It ran for 8.8h before becoming unstable :(
- Using MSI Afterburner to control GPU temps.
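
To find the step where a run like this goes bad, something along these lines could scan the stderr output. It is only a sketch, assuming the layout in the snippet above: lines starting with # are comments, and data lines start with a step number followed by numeric columns. 1.#QNB appears to be how the Windows C runtime prints a NaN at limited precision. The function names are my own, not part of ACEMD or BOINC.

```python
from typing import Iterable, Optional


def is_nan_token(token: str) -> bool:
    """Return True for NaN spellings such as '1.#QNB', '1.#QNAN', '-1.#IND' or 'nan'."""
    lowered = token.lower()
    return "#qn" in lowered or "#ind" in lowered or lowered in ("nan", "-nan")


def first_unstable_step(lines: Iterable[str]) -> Optional[int]:
    """Return the step number of the first data line containing a NaN, or None."""
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments such as temperature reports
        fields = line.split()
        if not fields[0].isdigit():
            continue  # not a data line (guard for any other output)
        if any(is_nan_token(field) for field in fields[1:]):
            return int(fields[0])
    return None


# Shortened copy of the lines posted above.
sample = [
    "# GPU 0 Current Temp: 66 C Target Temp 1",
    "6653000 1116.8328 2531.7110 3339.7214 -271975.1685 33633.0608",
    "# GPU 0 Current Temp: 66 C Target Temp 1",
    "6654000 1.#QNB 1.#QNB 1.#QNB 1.#QNB 0.0000",
    "# The simulation has become unstable. Terminating to avoid lock-up (1)",
]
print(first_unstable_step(sample))
```

On the sample this reports step 6654000, the first line where the energies turn into NaN.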
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Message 32662 - Posted: 3 Sep 2013, 16:01:46 UTC

I got 3 ACEMD v8.07 betas overnight and all failed. I got 3 short ones with v8.09 that ran longer. Now I have a NOELIA_KLEBEbeta-2-3 running on v8.09 that has done 76% in 16h32m on my 660. This is way longer than on versions 8.00 to 8.04.
Greetings from TJ

Carlos Augusto Engel
Joined: 5 Jun 09
Posts: 38
Credit: 2,880,758,878
RAC: 0
Message 32664 - Posted: 3 Sep 2013, 17:21:59 UTC - in response to Message 32657.  

> May I ask something else? I see a lot of beta WUs with run times of about 60 seconds that paid 1,500 credits. Is that right? Normally the WUs take a long time to crunch here.
>
> All my GPUs are CUDA 5.5 capable. Do I need to change anything else in the settings?
>
> Thanks for the help


That is ok. I received a lot of small ones too:

Task     Work unit  Sent                      Reported                  Status                   Run time (s)  CPU time (s)  Credit     Application
7242845  4748357    3 Sep 2013 | 4:51:19 UTC  3 Sep 2013 | 9:11:28 UTC  Completed and validated  14,906.95     7,437.96      15,000.00  ACEMD beta version v8.10 (cuda55)
7242760  4748284    3 Sep 2013 | 4:35:37 UTC  3 Sep 2013 | 4:48:40 UTC  Completed and validated  152.31        76.89         1,500.00   ACEMD beta version v8.10 (cuda55)
7242759  4748283    3 Sep 2013 | 4:35:37 UTC  3 Sep 2013 | 4:42:40 UTC  Completed and validated  152.00        74.30         1,500.00   ACEMD beta version v8.10 (cuda55)
7242751  4748277    3 Sep 2013 | 4:35:37 UTC  3 Sep 2013 | 4:51:19 UTC  Completed and validated  151.85        74.69         1,500.00   ACEMD beta version v8.10 (cuda55)

juan BFP
Message 32665 - Posted: 3 Sep 2013, 19:46:47 UTC

But that kind of WU produces a lot of:

197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED

Their initial estimate is very short, but some run for a long time. Maybe a bug, who knows?

E.g. http://www.gpugrid.net/result.php?resultid=7244185

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 326,008
Message 32666 - Posted: 3 Sep 2013, 21:16:58 UTC - in response to Message 32665.  

> But that kind of WU produces a lot of:
>
> 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED
>
> Their initial estimate is very short, but some run for a long time. Maybe a bug, who knows?

Number crunching knows.
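
The gist, as I understand the BOINC client: it aborts a task with this error once the elapsed time passes a limit of roughly the workunit's rsc_fpops_bound divided by the speed the client assumes for the app version. If that assumed speed is too high, or the bound too low, a long WU can run past the limit even though nothing is wrong with it. A rough sketch with made-up numbers follows; the values and the function names are illustrative assumptions, not taken from any real WU.

```python
def max_elapsed_seconds(rsc_fpops_bound: float, assumed_flops: float) -> float:
    """Time limit in seconds: operation bound divided by the assumed speed."""
    if assumed_flops <= 0:
        raise ValueError("assumed_flops must be positive")
    return rsc_fpops_bound / assumed_flops


def exceeds_time_limit(elapsed_s: float, rsc_fpops_bound: float, assumed_flops: float) -> bool:
    """True if a task that has run for elapsed_s seconds is past the limit."""
    return elapsed_s > max_elapsed_seconds(rsc_fpops_bound, assumed_flops)


# Made-up values for illustration only.
FPOPS_BOUND = 5.0e15     # hypothetical rsc_fpops_bound of the workunit
ASSUMED_FLOPS = 1.0e12   # hypothetical speed assumed for the app version

limit = max_elapsed_seconds(FPOPS_BOUND, ASSUMED_FLOPS)
print(f"limit: {limit:.0f} s")
print("150 s task aborted?", exceeds_time_limit(150, FPOPS_BOUND, ASSUMED_FLOPS))
print("15000 s task aborted?", exceeds_time_limit(15000, FPOPS_BOUND, ASSUMED_FLOPS))
```

With these invented numbers the limit is 5000 seconds, so a 150-second task passes and a 15,000-second task is aborted.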

juan BFP
Message 32667 - Posted: 3 Sep 2013, 21:25:45 UTC - in response to Message 32666.  

> > But that kind of WU produces a lot of:
> >
> > 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED
> >
> > Their initial estimate is very short, but some run for a long time. Maybe a bug, who knows?
>
> Number crunching knows.

Thanks for the info, but editing client_state.xml is dangerous territory, at least for me. I'll wait for the fix.

Bedrich Hajek
Message 32668 - Posted: 4 Sep 2013, 1:18:27 UTC

On my computers, the beta versions 8.05 to 8.10 are significantly slower than version 8.04. The MJHARVEY_TEST14, for example, ran about 70 seconds with version 8.04, but takes about 4 minutes to complete on versions 8.05 through 8.10. I ran NOELIA_KLEBEbeta WUs in 10 to 12 hours on version 8.04, while currently I am running 4 of these NOELIA units on versions 8.05 and 8.10, and it looks like they will finish in about 16 to 20 hours. These results are typical for Windows 7 and XP, CUDA 4.2 and 5.5, and Nvidia drivers 314.22 and 326.80. The only difference is that Windows 7 downclocks and XP doesn't. Please don't cancel the units; they seem to be running okay, and I want to finish them to prove that point. I hope the next beta version is faster and better.
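
Putting rough numbers on that, here is a quick back-of-the-envelope calculation in Python. It uses only the figures quoted above and takes the midpoints of the ranges, which is my own simplification.

```python
def slowdown(old_seconds: float, new_seconds: float) -> float:
    """Ratio of new run time to old run time (values above 1 mean slower)."""
    return new_seconds / old_seconds


# MJHARVEY_TEST14: about 70 s on 8.04, about 4 minutes on 8.05 to 8.10.
test_wu = slowdown(70, 4 * 60)

# NOELIA_KLEBEbeta: 10 to 12 h on 8.04, 16 to 20 h now (midpoints used).
noelia_old_h = (10 + 12) / 2
noelia_new_h = (16 + 20) / 2
noelia = slowdown(noelia_old_h * 3600, noelia_new_h * 3600)

print(f"MJHARVEY_TEST14: about {test_wu:.1f}x slower")
print(f"NOELIA_KLEBEbeta: about {noelia:.1f}x slower")
```

By these figures the short test WU slowed down proportionally more than the long NOELIA units.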



skgiven
Message 32675 - Posted: 4 Sep 2013, 10:08:39 UTC - in response to Message 32661.  

Had another Blue Screen crash!
The culprit was a NOELIA_KLEBEbeta. It had run for 16h 30min on a GTX660Ti (which sounds a bit too long).

Cold started the system and the same WU restarted from zero. I aborted the WU, 063px68-NOELIA_KLEBEbeta-0-3-RND7563_1.

Unfortunately 3 WU’s from other projects erred, a WUProp task and two climate models (330h lost)!

The same WU had already completed on a GTX 560 Ti using v8.02:
Task     Computer  Sent                        Reported                   Status                   Run time (s)  CPU time (s)  Credit     Application
7221577  114293    29 Aug 2013 | 15:39:46 UTC  3 Sep 2013 | 18:13:38 UTC  Completed and validated  64,854.86     2,489.65      95,200.00  ACEMD beta version v8.02 (cuda42)
7244112  139265    3 Sep 2013 | 15:40:26 UTC   4 Sep 2013 | 9:25:31 UTC   Aborted by user          60,387.27     15,443.53     ---        ACEMD beta version v8.10 (cuda55)

Obviously the changes have made the WU run slower; a GTX660Ti should be much faster than a GTX 560 Ti.

Are you trying to stabilize WU's by Temp targeted control of the GPU or do you just want to see if there is a temp issue?

The NOELIA_KLEBE WU's are still causing driver restarts and occasionally blue screen crashing the system, which kills other work. The WU below might not have been running properly/using the GPU (seen previously with GPU load at 0).


# GPU 0 Current Temp: 32 C Target Temp 1
# GPU 1 Current Temp: 54 C Target Temp 1
4269000 1119.7984 2489.9390 3374.5652 -270800.3207 33500.5416 -230315.4765 46055.2623 0.0000 -184260.2142 295.9528 20749.7732 20749.7732
#SWAN : Running in DEBUG mode
# CUDA Synchronisation mode: BLOCKING
# SWAN Device 0 :
# Name : GeForce GTX 660 Ti
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:05:00.0
# Device clock : 1110MHz
# Memory clock : 3004MHz
# Memory width : 192bit
# Driver version : r325_00 : 32641
#SWAN NVAPI Version: NVidia Complete Version 1.10
# GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3192M] VERSION [55]
# SWAN Device 0 :
# Name : GeForce GTX 660 Ti
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:05:00.0
# Device clock : 1110MHz
# Memory clock : 3004MHz
# Memory width : 192bit
# Driver version : r325_00 : 32641
# SWAN : Attempt to malloc 1234688. 2049331200 free
# SWAN : Attempt to malloc 144640. 2048020480 free
# SWAN : Attempt to malloc 317056. 2046971904 free
# SWAN : Attempt to malloc 80128. 2045923328 free
# SWAN : Attempt to malloc 1152. 2045923328 free
# SWAN : Attempt to malloc 1152. 2044874752 free
# SWAN : Attempt to malloc 1152. 2044874752 free
# SWAN : Attempt to malloc 1152. 2044874752 free
# SWAN : Attempt to malloc 2816. 2044874752 free
# SWAN : Attempt to malloc 1792. 2044874752 free
# SWAN : Attempt to malloc 1152. 2044874752 free
...
# swanReallocHost: new allocation of 4
# swanRealloc: new allocation of 2580480
# SWAN : Attempt to malloc 2581504. 1747210240 free
...

Unfortunately, the problems I've experienced with the NOELIA_KLEBE WU's have been too severe. While I don't mind testing a WU, and getting WU failures, testing to the point of self-destruction isn't for me.

Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 32680 - Posted: 4 Sep 2013, 10:32:58 UTC - in response to Message 32675.  
Last modified: 4 Sep 2013, 10:35:40 UTC

Oh man, skgiven! You lost 330 hours of work? That's very unfortunate.
It sounds like the beta app needs even more critical sections added to avoid more driver crashes and BSODs.

MJH:
Are there any other sections within the app that are missing the BOINC critical section logic? We need you to solve this.
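
For anyone unfamiliar with the term: in the BOINC API, an application brackets work that must not be interrupted with boinc_begin_critical_section() and boinc_end_critical_section(), and suspend or quit requests are held back until the section ends. The real API is C/C++; the Python below is only a toy model of that idea, and the class and method names are invented for illustration.

```python
from contextlib import contextmanager


class SuspendGate:
    """Toy model: suspend requests are deferred while a critical section is open."""

    def __init__(self) -> None:
        self._depth = 0          # nesting depth of open critical sections
        self._pending = False    # a suspend request arrived inside a section
        self.suspended = False

    @contextmanager
    def critical_section(self):
        self._depth += 1
        try:
            yield
        finally:
            self._depth -= 1
            if self._depth == 0 and self._pending:
                self._pending = False
                self.suspended = True   # honour the deferred request now

    def request_suspend(self) -> None:
        if self._depth > 0:
            self._pending = True        # defer until the section closes
        else:
            self.suspended = True


gate = SuspendGate()
with gate.critical_section():
    # Stand-in for uninterruptible work, e.g. a GPU kernel launch plus result copy.
    gate.request_suspend()
    inside = gate.suspended   # request deferred while the section is open
after = gate.suspended        # request honoured once the section has ended
print(inside, after)
```

The point of the model is that a suspend arriving in the middle of GPU work takes effect only once that work has finished, which is the behaviour I am asking about.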

skgiven
Message 32686 - Posted: 4 Sep 2013, 12:57:22 UTC - in response to Message 32680.  

It's not the first time, or the second - Maybe I'll never learn!

Fortunately only 2 climate WU's were running this time (I had others suspended, but was hoping I could get a couple done without any issues; my system has been very stable of late - close, but no cigar). Hopefully I won't see too many more...

It seems WUProp and Climate have tasks that are particularly vulnerable to such failures. C'est la vie,




MJH
Message 32690 - Posted: 4 Sep 2013, 13:42:53 UTC - in response to Message 32686.  


> It seems WUProp and Climate have tasks that are particularly vulnerable to such failures. C'est la vie,


Highly surprising that the driver reset would cause two non-GPU-using apps (I presume) to crash. Were their graphical screen-savers enabled? Are you sure the BOINC client didn't get confused and terminate them?

MJH

MJH
Message 32691 - Posted: 4 Sep 2013, 13:44:11 UTC - in response to Message 32680.  


> Are there any other sections within the app that are missing the BOINC critical section logic? We need you to solve this.


8.11 (now on beta and short) represents the best that can be done using that method.

MJH

Jacob Klein
Message 32696 - Posted: 4 Sep 2013, 14:29:53 UTC - in response to Message 32690.  

Note: A blue screen is worse than a driver reset. In the past I have sometimes seen GPUGrid tasks (looking at the NOELIA ones again) give me blue screens when being suspended, with error DPC_WATCHDOG_VIOLATION.

My experience has been:

- A driver reset can cause GPU tasks to fail, even if they are on other GPUs working on other projects.
- A BSOD can cause any task to fail, including CPU ones, if they aren't robust enough to handle resuming after restarting Windows from the abrupt BSOD.

Make sense?




> > It seems WUProp and Climate have tasks that are particularly vulnerable to such failures. C'est la vie,
>
> Highly surprising that the driver reset would cause two non-GPU-using apps (I presume) to crash. Were their graphical screen-savers enabled? Are you sure the BOINC client didn't get confused and terminate them?
>
> MJH


©2025 Universitat Pompeu Fabra