Author |
Message |
Toni (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist)
Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 Level
Scientific publications
|
We made another release of the acemd3 version, which should support CUDA 10.1 and higher (all cards with the corresponding driver, including RTX family).
Please check whether the Windows WUs named DHFR207c support stop/restart and generally work as expected. (Under Linux, and with older CUDA versions on Windows, they already seem OK.)
Thanks! |
|
|
Toni (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist)
Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 Level
Scientific publications
|
BTW a known problem - suspend-restart between different cards will fail. |
|
|
Aurum
Joined: 12 Jul 17 Posts: 399 Credit: 13,193,814,882 RAC: 11,258,327 Level
Scientific publications
|
Ok so stop unneeded testing with Linux and just test with Win7.
I got one on Win7 and tried the Suspend-Resume and it failed on a 1080 Ti.
|
|
|
|
I got one of the new cuda101 test units. It took almost a minute after the unit started for the GPU to start crunching. GPU usage was approximately 55% (lower than before), but power usage was between 70% and 80%, according to Afterburner. It ran fine. I suspended it and resumed after about 30 seconds, and it crashed within a few seconds after that.
It was running on a Windows 7 computer with an RTX 2080 Ti card.
See link:
http://www.gpugrid.net/result.php?resultid=21391121
I ran one successfully, which I did not suspend and resume.
http://www.gpugrid.net/result.php?resultid=21391156
BTW, I also received a cuda 100 unit. Is this a new unit or a leftover from before? It has higher GPU usage (65%) and power usage (85%).
http://www.gpugrid.net/result.php?resultid=21391213
|
|
|
Toni (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist)
Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 Level
Scientific publications
|
Cuda 100 are leftovers. They are actually mislabeled 10.1.
More precisely, I'd like to investigate whether the following processes are still in the task manager after suspension:
- wrapper*.exe
- acemd3
Thanks! |
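For anyone who prefers to check this from a script rather than by eye, here is a minimal sketch. It assumes the process list comes from the Windows `tasklist /FO CSV /NH` command and that the image names match the two patterns from the post; the helper names (`matching_processes`, `snapshot`) are made up for illustration and are not part of the project's tooling.

```python
import csv
import fnmatch
import io
import subprocess

# Name patterns taken from the post above; compared case-insensitively.
PATTERNS = ("wrapper*.exe", "acemd3*")


def matching_processes(tasklist_csv, patterns=PATTERNS):
    """Return image names in `tasklist /FO CSV /NH` output that match a pattern."""
    names = []
    for row in csv.reader(io.StringIO(tasklist_csv)):
        if not row:
            continue  # skip blank lines
        image = row[0]  # first CSV column is the image name
        if any(fnmatch.fnmatchcase(image.lower(), p.lower()) for p in patterns):
            names.append(image)
    return names


def snapshot():
    """Windows only: capture the current process list as CSV text."""
    result = subprocess.run(
        ["tasklist", "/FO", "CSV", "/NH"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

On a Windows host, calling `matching_processes(snapshot())` once before suspending and once after gives the two lists to compare; an empty list after suspension means both processes are gone.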
|
|
|
Both processes are gone from the task manager.
|
|
|
|
It is still happening. The unit starts running and runs well, and then I suspend it. Both processes (wrapper*.exe and acemd3) disappear from the task manager. When I resume the task, both processes reappear briefly, then disappear again, and the unit crashes.
http://www.gpugrid.net/result.php?resultid=21405696 |
|
|
rod4x4
Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0 Level
Scientific publications
|
Received first v2.07 (cuda101) work unit - a36-TONI_TESTDHFR207c-10-30-RND9893_0 on Win10 GTX 1060 host.
Could not test suspend / resume as it was received / processed overnight.
One observation: the runtime is shorter than for the v2.06 (cuda 100) test work unit.
v2.07 cuda 101 runtime - 2897 seconds
http://www.gpugrid.net/result.php?resultid=21409526
v2.06 cuda 100 runtime - 3974 seconds
http://www.gpugrid.net/result.php?resultid=21404652
Assuming the work units are comparable, that is a 27% reduction in runtime (about 1.37x the processing speed). |
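As a quick check on the numbers, a small sketch that derives both figures from the two runtimes quoted above. The function name is just for illustration, and the comparison only means something if the two work units really are comparable.

```python
def runtime_comparison(old_seconds, new_seconds):
    """Return (fractional runtime reduction, speedup factor) of new vs. old."""
    reduction = (old_seconds - new_seconds) / old_seconds
    speedup = old_seconds / new_seconds
    return reduction, speedup


# v2.06 (cuda 100) took 3974 s, v2.07 (cuda 101) took 2897 s.
reduction, speedup = runtime_comparison(3974, 2897)
print(f"runtime reduction: {reduction:.1%}, speedup: {speedup:.2f}x")
# prints "runtime reduction: 27.1%, speedup: 1.37x"
```

The runtime reduction and the speedup are not the same percentage: the task takes 27% less time, which corresponds to roughly 37% more work per unit time.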
|
|
|
I managed to get 4 of these units on my Windows 10 computer, which has 2 video cards (a GTX 980 Ti and an RTX 2080 Ti). I decided to run the units simultaneously on both cards. The 3 units that ran on the 980ti card all crashed within about half an hour (not a suspend/resume error).
http://www.gpugrid.net/result.php?resultid=21410390
http://www.gpugrid.net/result.php?resultid=21410391
http://www.gpugrid.net/result.php?resultid=21410403
The unit that ran on 2080ti, finished successfully, while this was going on.
http://www.gpugrid.net/result.php?resultid=21410392
This also caused Afterburner to become unresponsive.
In the past week or so, I was able to successfully run a long unit on the 980ti and a new-version unit on the 2080ti simultaneously, or a single new-version unit on either card while running an Einstein or Milkyway unit on the other card.
BTW, 2080ti is more than twice as fast as the 980ti, on this computer.
|
|
|
|
New version of ACEMD v2.07 (cuda101)
New version of ACEMD v2.07 (cuda101)
Both tasks were downloaded while the RTX 2080 was otherwise occupied. When started, both tasks errored out in sequence after each was paused and then resumed.
Machine: i7, Windows 10, RTX 2080
Obviously very disappointing. |
|
|
|
I received 2 more of these this morning. They ran on the 980ti card. Both crashed without any suspend and resume. This is a new observation; previously I was able to finish them when I was running Einstein units on the other card.
http://www.gpugrid.net/result.php?resultid=21411671
http://www.gpugrid.net/result.php?resultid=21411672
The long units are running well on this card, with only one recent exception, which was caused by an abrupt computer shutdown.
|
|
|
|
I had one of these units today finish successfully on the 980ti card (no suspend/resume was done on this unit):
http://www.gpugrid.net/result.php?resultid=21412313
It took more than double the time to complete compared with the same unit running on the 2080ti card.
Another interesting observation is that the new ACEMD version seems to be more CPU dependent. A unit running on a 2080ti with an Intel(R) Core(TM) i7-5820K CPU will finish in about a fourth less time than a unit running on a 2080ti with an AuthenticAMD AMD Phenom(tm) II X6 1090T.
|
|
|
|
I got 2 more of these units today. I decided to run them both simultaneously on one card (1 CPU w/ 0.5 GPU each). They ran slowly together at a rate of about 16% per hour each, versus about 54% per hour running at 1 CPU w/ 1 GPU.
After running them for a few minutes, I decided to suspend one of them. Before the suspension, I had 2 wrapper tasks and 2 acemd3 tasks running in the task manager. After suspending 1 unit, the task manager showed 1 wrapper and 1 acemd3 running. After resuming the unit, 2 acemd3 tasks and only 1 wrapper were running, then the unit crashed. It looks like the problem may be with the wrapper.
See links:
http://www.gpugrid.net/result.php?resultid=21420201
http://www.gpugrid.net/result.php?resultid=21420184
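To put the two progress rates side by side, here is a rough back-of-the-envelope sketch. It uses only the approximate rates quoted in this post and assumes progress is linear over the run, which may not hold for a real work unit.

```python
def total_rate(per_task_percent_per_hour, concurrent_tasks):
    """Combined progress rate of the card, in task-percent per hour."""
    return per_task_percent_per_hour * concurrent_tasks


shared = total_rate(16, 2)      # two tasks sharing one GPU
exclusive = total_rate(54, 1)   # one task with the GPU to itself
loss = 1 - shared / exclusive   # fraction of throughput lost by sharing

print(f"shared: {shared}%/h, exclusive: {exclusive}%/h, loss: {loss:.0%}")
# prints "shared: 32%/h, exclusive: 54%/h, loss: 41%"
```

On these numbers, running two units at once gives lower total throughput than running them one after the other on this card.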
|
|
|
rod4x4
Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0 Level
Scientific publications
|
Received e23s10_e19s1p1f205-ADRIA_FOLDUBQ_BANDIT_crystal_ss_contacts_50_ubiquitin_4-1-2-RND4012_0 TEST v2.06 (Cuda100) work unit on Win10 Host with GTX1060.
Let it run for 6 hours 16 minutes (50 minutes run time left)
Suspended for 2 minutes.
Failed on restart.
Wrapper and ACEMD3 tasks disappeared in Task Manager on suspend.
These tasks briefly reappeared in Task manager before the Work unit failed.
Link to Work Unit here:
http://gpugrid.net/result.php?resultid=21422582
The observations for all users testing Suspend/Resume on these TEST work units seem to be consistent with the above pattern.
Are there any other symptoms you would like us to monitor when testing? |
|
|
Toni (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist)
Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 Level
Scientific publications
|
That's sufficient, thanks. We are investigating. Sorry for the failed wus.
Looks like Windows apps fail on restart :(
The restart function itself (=process expected to disappear) seems correct. |
|
|
|
Is the restart function the same as the initial start function (which doesn't crash)? Could the work files saved before the suspension have been corrupted, or are they not interacting properly with the other files?
|
|
|
Toni (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist)
Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 Level
Scientific publications
|
It restarts from a checkpoint file, which is written periodically. There is a bug, possibly not in our code. |
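For readers unfamiliar with the pattern, a generic sketch of periodic checkpointing and restart follows. This is not acemd3's or the wrapper's actual code: the JSON file format, the function names, and the `stop_at` parameter (which simulates a suspend that kills the process) are all assumptions made for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a crash mid-write leaves the old checkpoint intact."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic rename over the previous checkpoint


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"step": 0}
    with open(path) as f:
        return json.load(f)


def run(path, total_steps, checkpoint_every, stop_at=None):
    """Advance a toy simulation from the last checkpoint; return the step reached."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if stop_at is not None and state["step"] >= stop_at:
            return state["step"]  # simulated suspend: unsaved progress is lost
        state["step"] += 1  # stand-in for one unit of real work
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["step"]
```

In this scheme a restart goes through the same `run` entry point as the initial start; the only difference is that `load_checkpoint` finds a file, so a restart-only failure points at the checkpoint contents or at whatever reads them back.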
|
|