Please check: windows workunits
| Author | Message |
|---|---|
|
Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 |
We made another release of acemd3, which should support CUDA 10.1 and higher (all cards with the corresponding driver, including the RTX family). Please check whether the WUs named DHFR207c for Windows support stop/restart and generally work as expected. (Under Linux, and with older CUDA versions on Windows, they already seem OK.) Thanks! |
|
Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 |
BTW, a known problem: suspend/restart between different cards will fail. |
|
Joined: 12 Jul 17 Posts: 404 Credit: 17,412,649,587 RAC: 8,996
|
OK, so stop the unneeded Linux testing and just test with Win7. I got one on Win7, tried suspend/resume, and it failed on a 1080 Ti.
|
|
Joined: 28 Mar 09 Posts: 490 Credit: 11,850,145,728 RAC: 301,281
|
I got one of the new cuda101 test units. It took almost a minute after the unit started for the GPU to start crunching. GPU usage was approximately 55% (lower than before), but power usage was between 70% and 80%, according to Afterburner. It ran fine. I suspended it and resumed after about 30 seconds, and it crashed within a few seconds. It was running on a Windows 7 computer with an RTX 2080 Ti card. See link: http://www.gpugrid.net/result.php?resultid=21391121 I ran one successfully, which I did not suspend and resume: http://www.gpugrid.net/result.php?resultid=21391156 BTW, I also received a cuda100 unit. Is this a new unit or a leftover old unit from before? It has higher GPU usage (65%) and power usage (85%): http://www.gpugrid.net/result.php?resultid=21391213 |
|
Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 |
Cuda 100 are leftovers. They are actually mislabeled 10.1. More precisely, I'd like to investigate whether, after suspension, the following processes are still in the Task Manager: `wrapper*.exe` and `acemd3`. Thanks! |
|
Joined: 28 Mar 09 Posts: 490 Credit: 11,850,145,728 RAC: 301,281
|
"Cuda 100 are leftovers. They are actually mislabeled 10.1." Both processes are gone from the Task Manager. |
|
Joined: 28 Mar 09 Posts: 490 Credit: 11,850,145,728 RAC: 301,281
|
"Cuda 100 are leftovers. They are actually mislabeled 10.1." It is still happening. The unit starts and runs well, and then I suspend it. Both processes listed above disappear from the Task Manager. I then resume the task; both processes reappear briefly, then disappear again, and the unit crashes. http://www.gpugrid.net/result.php?resultid=21405696 |
|
Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0
|
Received my first v2.07 (cuda101) work unit, a36-TONI_TESTDHFR207c-10-30-RND9893_0, on a Win10 GTX 1060 host. Could not test suspend/resume as it was received and processed overnight. One comment: the runtime is shorter than for the v2.06 (cuda100) test work unit. v2.07 cuda101 runtime: 2897 seconds http://www.gpugrid.net/result.php?resultid=21409526 v2.06 cuda100 runtime: 3974 seconds http://www.gpugrid.net/result.php?resultid=21404652 Assuming the work units are comparable, that is a 27% reduction in runtime (3974 s down to 2897 s). |
|
Joined: 28 Mar 09 Posts: 490 Credit: 11,850,145,728 RAC: 301,281
|
I managed to get 4 of these units on my Windows 10 computer, which has 2 video cards (a GTX 980 Ti and an RTX 2080 Ti). I decided to run the units simultaneously on both cards. The 3 units that ran on the 980 Ti all crashed within about half an hour (not a suspend/resume error). http://www.gpugrid.net/result.php?resultid=21410390 http://www.gpugrid.net/result.php?resultid=21410391 http://www.gpugrid.net/result.php?resultid=21410403 The unit that ran on the 2080 Ti finished successfully while this was going on. http://www.gpugrid.net/result.php?resultid=21410392 This also caused Afterburner to become unresponsive. In the past week or so, I was able to successfully run a long unit on the 980 Ti and a new-version unit on the 2080 Ti simultaneously, or a single new-version unit on either card while running an Einstein or Milkyway unit on the other card. BTW, the 2080 Ti is more than twice as fast as the 980 Ti on this computer. |
|
Joined: 22 Oct 10 Posts: 42 Credit: 1,808,300,315 RAC: 123,524
|
New version of ACEMD v2.07 (cuda101) New version of ACEMD v2.07 (cuda101) Both tasks were downloaded while the RTX 2080 was otherwise occupied. When started, both tasks errored out in sequence after each was paused and then resumed. Machine: i7, Windows 10, RTX 2080. Obviously very disappointing. |
|
Joined: 28 Mar 09 Posts: 490 Credit: 11,850,145,728 RAC: 301,281
|
"I managed to get 4 of these units on my Windows 10 computer, which has 2 video cards (a GTX 980 Ti and an RTX 2080 Ti). I decided to run the units simultaneously on both cards. The 3 units that ran on the 980 Ti all crashed within about half an hour (not a suspend/resume error)." I received 2 more of these this morning. They ran on the 980 Ti. Both crashed without any suspend and resume. This is a new observation; previously I was able to finish them when I was running Einstein units on the other card. http://www.gpugrid.net/result.php?resultid=21411671 http://www.gpugrid.net/result.php?resultid=21411672 The long units are running well on this card, with only one recent exception, which was caused by an abrupt computer shutdown. |
|
Joined: 28 Mar 09 Posts: 490 Credit: 11,850,145,728 RAC: 301,281
|
"I managed to get 4 of these units on my Windows 10 computer, which has 2 video cards (a GTX 980 Ti and an RTX 2080 Ti). I decided to run the units simultaneously on both cards. The 3 units that ran on the 980 Ti all crashed within about half an hour (not a suspend/resume error)." I had one of these units finish successfully today on the 980 Ti (no suspend/resume was done on this unit): http://www.gpugrid.net/result.php?resultid=21412313 It took more than double the time to complete than the same kind of unit running on the 2080 Ti. Another interesting observation is that the new ACEMD version seems to be more CPU dependent. A unit running on a 2080 Ti with an Intel(R) Core(TM) i7-5820K CPU will finish in about a fourth less time than a unit running on a 2080 Ti with an AMD Phenom(tm) II X6 1090T. |
|
Joined: 28 Mar 09 Posts: 490 Credit: 11,850,145,728 RAC: 301,281
|
I got 2 more of these units today. I decided to run them both simultaneously on one card (1 CPU w/ 0.5 GPU). They ran slowly together, at a rate of about 16% per hour each, versus about 54% per hour running at 1 CPU w/ 1 GPU. After running them for a few minutes, I decided to suspend one of them. Before the suspension, I had 2 wrapper tasks and 2 acemd3 tasks running in the Task Manager; after suspending 1 unit, the Task Manager showed 1 wrapper and 1 acemd3 running. After resuming the unit, 2 acemd3 tasks and only 1 wrapper were running, then the unit crashed. Looks like the problem may be with the wrapper. See links: http://www.gpugrid.net/result.php?resultid=21420201 http://www.gpugrid.net/result.php?resultid=21420184 |
|
Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0
|
Received the e23s10_e19s1p1f205-ADRIA_FOLDUBQ_BANDIT_crystal_ss_contacts_50_ubiquitin_4-1-2-RND4012_0 TEST v2.06 (cuda100) work unit on a Win10 host with a GTX 1060. Let it run for 6 hours 16 minutes (50 minutes of run time left), then suspended it for 2 minutes. It failed on restart. The wrapper and acemd3 tasks disappeared from Task Manager on suspend. These tasks briefly reappeared in Task Manager before the work unit failed. Link to the work unit: http://gpugrid.net/result.php?resultid=21422582 The observations of all users testing suspend/resume on these TEST work units seem to be consistent with the above pattern. Are there any other symptoms you would like us to monitor when testing? |
|
Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 |
That's sufficient, thanks. We are investigating. Sorry for the failed WUs. It looks like the Windows apps fail on restart :( The restart function itself (i.e., the process is expected to disappear) seems correct. |
|
Joined: 28 Mar 09 Posts: 490 Credit: 11,850,145,728 RAC: 301,281
|
"That's sufficient, thanks. We are investigating. Sorry for the failed wus." Is the restart function the same as the initial start function (which doesn't crash)? Have the saved work files from before the suspension been corrupted, or are they not interacting properly with the other files? |
|
Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 |
It restarts from a checkpoint file, which is written periodically. There is a bug, possibly not in our code. |
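
For readers unfamiliar with the restart mechanism mentioned in the last post (resuming from a checkpoint file that is written periodically), here is a minimal generic sketch in Python. It is not acemd3's or the BOINC wrapper's code, and nothing in it is taken from the project: the file name `state.chk`, the JSON state layout, and the functions `save_checkpoint`, `load_checkpoint`, and `run` are all hypothetical, chosen only to illustrate the general pattern. The `stop_after` argument simulates a suspend in which the process exits and everything since the last checkpoint is lost.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temporary file, then atomically swap it into place.

    Writing to a temp file first means a crash mid-write cannot leave a
    half-written checkpoint under the real name.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as fh:
            json.dump(state, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)  # atomic rename over the old checkpoint
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise


def load_checkpoint(path):
    """Return the saved state, or None if no usable checkpoint exists."""
    try:
        with open(path) as fh:
            return json.load(fh)
    except (FileNotFoundError, json.JSONDecodeError):
        return None


def run(path, total_steps, checkpoint_every, stop_after=None):
    """Toy 'simulation': resume from the checkpoint if present, else start fresh."""
    state = load_checkpoint(path) or {"step": 0, "value": 0}
    executed = 0
    while state["step"] < total_steps:
        if stop_after is not None and executed >= stop_after:
            return state  # simulated suspend: in-memory progress is discarded
        state["value"] += state["step"]  # stand-in for one unit of real work
        state["step"] += 1
        executed += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

In this sketch a resumed run repeats only the steps done since the last checkpoint, so an interrupted run and an uninterrupted one end in the same state. A restart failure like the one reported in this thread would correspond to something going wrong on the load-and-continue path rather than on the fresh-start path, though the thread does not say where the actual bug lies.
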
©2026 Universitat Pompeu Fabra