Message boards :
Number crunching :
Pablo, When is enough enough???
Joined: 12 Jul 17 | Posts: 404 | Credit: 17,408,899,587 | RAC: 0
Two days ago I completed a WU that ran for 21 days, the next one took 14 days, and today there are two running at 5 and 8 days (below):

Computer: Rig-26
Project: GPUGRID
Name: e7s25_e1s146p0f12-PABLO_v2O00512_MOR_5_IDP-1-2-RND0561_8
Application: Long runs (8-12 hours on fastest card) 9.19 (cuda80)
Workunit name: e7s25_e1s146p0f12-PABLO_v2O00512_MOR_5_IDP-1-2-RND0561
State: Running High P.
Received: 12/20/2018 7:51:57 AM
Report deadline: 12/25/2018 7:51:56 AM
Estimated app speed: 193.79 GFLOPs/sec
Estimated task size: 5,000,000 GFLOPs
Resources: 1 CPU + 0.5 NVIDIA GPUs (device 1)
CPU time at last checkpoint: 00:00:00
CPU time: 08d,19:21:05
Elapsed time: 08d,19:23:43
Estimated time remaining: 00:00:00
Fraction done: 100.000%
Virtual memory size: 22,284.77 MB
Working set size: 77.79 MB
Directory: slots/31
Process ID: 2411

Computer: Rig-12
Project: GPUGRID
Name: e2s18_e1s312p2f159-PABLO_V3_p27_sj403_IDP-3-4-RND6569_0
Application: Long runs (8-12 hours on fastest card) 9.19 (cuda80)
Workunit name: e2s18_e1s312p2f159-PABLO_V3_p27_sj403_IDP-3-4-RND6569
State: Running High P.
Received: 12/18/2018 12:36:04 PM
Report deadline: 12/23/2018 12:36:02 PM
Estimated app speed: 121.06 GFLOPs/sec
Estimated task size: 5,000,000 GFLOPs
Resources: 1 CPU + 0.5 NVIDIA GPUs (device 0)
CPU time at last checkpoint: 00:00:00
CPU time: 05d,12:06:03
Elapsed time: 05d,12:38:21
Estimated time remaining: 00:00:00
Fraction done: 99.999%
Virtual memory size: 30,588.04 MB
Working set size: 81.13 MB
Directory: slots/27
Process ID: 1794

I would think there should be a built-in rule that says:

IF ElapsedTime > 2*(PredictedTime) THEN Abort&Flag ELSE WTF
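The rule proposed above can also be approximated on the client side with a short script. This is only a sketch: the `name:` and `elapsed task time:` field labels are assumptions about typical `boinccmd --get_tasks` output and may differ by client version, and the abort command is left as a comment to verify first.

```python
def should_abort(elapsed_s: float, predicted_s: float, factor: float = 2.0) -> bool:
    """The rule from the post: abort once elapsed time exceeds factor * predicted time."""
    return predicted_s > 0 and elapsed_s > factor * predicted_s

def parse_tasks(boinccmd_text: str) -> list:
    """Rough parse of `boinccmd --get_tasks`-style text into per-task dicts.
    The field labels here are assumptions; check your client's actual output."""
    tasks, current = [], None
    for raw in boinccmd_text.splitlines():
        line = raw.strip()
        if line.startswith("name: "):
            current = {"name": line[len("name: "):]}
            tasks.append(current)
        elif line.startswith("elapsed task time: ") and current is not None:
            current["elapsed"] = float(line[len("elapsed task time: "):])
    return tasks

# Actually aborting would then shell out to something like:
#   boinccmd --task <project_url> <task_name> abort
# (verify that syntax against your BOINC client before relying on it)
```

With a predicted run time of 12 hours, a task at 9 days elapsed would trip the 2x threshold immediately.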
Joined: 26 Feb 14 | Posts: 211 | Credit: 4,496,324,562 | RAC: 0
> Two days ago I completed a WU that ran for 21 days, next one took 14 days and today there's two running at 5 and 8 days (below):

When running GPUGrid, you really should only run one work unit per GPU. Without knowing what the other work unit on that same GPU is, you are probably starving the GPUGrid work unit of time on the GPU. What GPUs are in those machines?
Joined: 12 Jul 17 | Posts: 404 | Credit: 17,408,899,587 | RAC: 0
So gpuGRID has been shown to be incapable of sharing a GPU???

Both of these computers have SWAN_SYNC enabled, but their app_config.xml was changed after enabling SWAN_SYNC, so they still show 0.5 GPU even though each task's mate completed days ago and it has been running alone. So I bet you a GRC that is not the problem here. These WUs could not converge on a solution for any number of reasons that only Pablo can know.

Rig-12: http://www.gpugrid.net/show_host_detail.php?hostid=484061
Rig-26: http://www.gpugrid.net/show_host_detail.php?hostid=484035
Joined: 20 Apr 15 | Posts: 285 | Credit: 1,102,216,607 | RAC: 0
> When running GPUGrid, you really should only run 1 work unit per GPU. Without knowing what the other work unit is on that same GPU, you are probably starving the GPUGrid work unit with time on the GPU. What GPUs are in those machines?

I don't have any problems with running two tasks per GPU. My Ryzen 1700 + 2x GTX 1070 system runs 4 long jobs in parallel and needs 30,000-50,000 seconds for completion. This config keeps the GPU working at an approximately constant temperature (-> longer lifetime), whereas a single-job config would put more thermal stress on the card, especially when jobs are rare (many idle/zero-load periods in between).

I would love to see HCF1 protein folding and interaction simulations to help my little boy... someday.
Joined: 26 Feb 14 | Posts: 211 | Credit: 4,496,324,562 | RAC: 0
> This config keeps the GPU working and at approximately constant temperature (-> longer lifetime) whereas a singe job config would put more thermal stress on the card especially when jobs are rare (many down/zero load times inbetween).

My temps are never very high. All my cards are hybrids, so they never go above 50C. Extended warranties for a total of 10 years (longer than most people keep their cards), so no worries about them failing and not being able to replace them.

It is troubling that his machines are taking anywhere from 5 to 21 days to do what most do in a few hours. The only way to know would be to run those exact work units on another machine and see if the results are the same.
Retvari Zoltan
Joined: 20 Jan 09 | Posts: 2380 | Credit: 16,897,957,044 | RAC: 0
> So gpuGRID has been shown to be incapable of sharing a GPU???

It's capable of sharing the GPU, but enabling SWAN_SYNC means that you dedicate your GPU to the GPUGrid app to make it as fast as possible (= maximize GPU utilization). In this case you should not share it with other project(s).

> Both of these computers have SWAN_SYNC enabled but their app_config.xml was changed after enabling SWAN_SYNC so they still show 0.5 GPU even though its mate completed days ago and it's been running alone.

The app_config tells the BOINC manager what to expect from the given app; it does *not* instruct the app on how much of each resource (GPU, CPU) to use. (There's no way to configure the GPUGrid app to a given GPU utilization percentage.) So if you enable SWAN_SYNC, you should set 1.0 GPU and 1.0 CPU in app_config.xml like this:

```xml
<app_config>
  <app>
    <name>acemdlong</name>
    <fraction_done_exact/>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>acemdshort</name>
    <fraction_done_exact/>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>acemdbeta</name>
    <fraction_done_exact/>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```

BTW, the deadline is 5 days, so if a workunit takes much more than that you should abort it and check for a solution, as it should take 5-6 hours on a GTX 1070.
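A mis-edited app_config.xml is easy to miss by eye, so here is a minimal sketch that flags any app still requesting less than a full GPU or CPU. It assumes the standard BOINC app_config layout shown above; the function name is made up for illustration.

```python
import xml.etree.ElementTree as ET

def underprovisioned_apps(app_config_xml: str) -> list:
    """Return names of <app> entries whose gpu_usage or cpu_usage is below 1.0.
    Assumes the standard BOINC app_config.xml structure."""
    root = ET.fromstring(app_config_xml)
    bad = []
    for app in root.findall("app"):
        name = app.findtext("name", default="?")
        gv = app.find("gpu_versions")
        gpu = float(gv.findtext("gpu_usage", "0")) if gv is not None else 0.0
        cpu = float(gv.findtext("cpu_usage", "0")) if gv is not None else 0.0
        if gpu < 1.0 or cpu < 1.0:
            bad.append(name)
    return bad
```

Run it against the file in your project directory after editing, then tell the client to re-read its config files so the change takes effect.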
Joined: 12 Jul 17 | Posts: 404 | Credit: 17,408,899,587 | RAC: 0
Pablo et alia,

Please provide some feedback on these WUs that take days and weeks to complete.

Do you learn something from them if we let them run so long???
Do you notice them if we abort them before they complete???
If we should just abort them when they take too long, can you automate the process as I suggested in the lead post???

IF ElapsedTime > 2*PredictedTime THEN Abort&Flag ELSE WTF

If I abort them, will they just queue up and go out again??? I don't want you to miss the best binding pocket the universe has ever seen.

{BTW, this thread is not about SWAN_SYNCing.}
Joined: 8 May 18 | Posts: 190 | Credit: 104,426,808 | RAC: 0
On my SuSE Linux host, with a GTX 750 Ti graphics board, I find the following line in stderr.txt:

CUDA Synchronization mode SPIN

What does it mean?
Tullio
ServicEnginIC
Joined: 24 Sep 10 | Posts: 592 | Credit: 11,972,186,510 | RAC: 1,187
> On my SuSE Linux host, with a GTX 750 Ti graphic board, I find in the stderr.txt the following line:

This means that SWAN_SYNC is successfully enabled on your system. When it is not, it should read:

CUDA Synchronization mode BLOCKING

{But this matter better fits in the SWAN_SYNC in Linux client thread http://www.gpugrid.net/forum_thread.php?id=4813, as Aurum suggests.}
Joined: 12 Jul 17 | Posts: 404 | Credit: 17,408,899,587 | RAC: 0
Today's long-running WU, to abort or not to abort:

Computer: Rig-26
Project: GPUGRID
Name: e7s25_e1s146p0f12-PABLO_v2O00512_MOR_5_IDP-1-2-RND0561_8
Application: Long runs (8-12 hours on fastest card) 9.19 (cuda80)
Workunit name: e7s25_e1s146p0f12-PABLO_v2O00512_MOR_5_IDP-1-2-RND0561
State: Running High P.
Received: 12/20/2018 7:51:57 AM
Report deadline: 12/25/2018 7:51:56 AM
Estimated app speed: 199.45 GFLOPs/sec
Estimated task size: 5,000,000 GFLOPs
Resources: 1 CPU + 0.5 NVIDIA GPUs (device 1)
CPU time at last checkpoint: 00:00:00
CPU time: 09d,18:08:21
Elapsed time: 09d,18:11:16
Estimated time remaining: 00:00:00
Fraction done: 100.000%
Virtual memory size: 22,284.77 MB
Working set size: 77.79 MB
Directory: slots/31
Process ID: 2411
ServicEnginIC
Joined: 24 Sep 10 | Posts: 592 | Credit: 11,972,186,510 | RAC: 1,187
> If we should just abort them if they take too long can you automate the process as I suggested in the lead post???

There is a clue in your never-ending task 15602360: http://www.gpugrid.net/workunit.php?wuid=15602360
This task has failed or not finished on many different systems, so you can freely abort a task like this. It is clearly a defective one. As you suggest, some in-task protection for avoiding such problems would be appreciated.

> If I abort them will they just queue up and go out again???

As seen with this particular defective task, it has been resent to many systems. This task has now been automatically retired by the project, as it has reached a total of 10 resends with no successful result.
Joined: 12 Jul 17 | Posts: 404 | Credit: 17,408,899,587 | RAC: 0
Wow! If the other nine spent 9 days each, that's a quarter of a GPU-year that could've been better used. OK, I'm convinced to abort this one, but I don't want to toss them out if there's something that Team GDF might learn from them.
Joined: 21 Mar 16 | Posts: 513 | Credit: 4,673,458,277 | RAC: 0
> Wow! If the other nine spent 9 days each that's a quarter of a GPU-year that could've been better used.

Even when you abort, I'm pretty sure the results are still uploaded for their analysis.
Joined: 3 Sep 13 | Posts: 53 | Credit: 1,533,531,731 | RAC: 0
BOINC will sometimes report run times that aren't correct, although not usually off by days. I would first try to make sure these problem tasks are really running that long. If they are, that's almost certainly a problem with your host and not the app; problems like that usually get reported pretty quickly by multiple users. You may have a problem similar to what's reported here: https://www.gpugrid.net/forum_thread.php?id=4868

Rig-12 (http://www.gpugrid.net/show_host_detail.php?hostid=484061) definitely has a card or cards not working well; you shouldn't have that many errors.

Team USA forum | Team USA page
Join us and #crunchforcures. We are now also folding: join team ID 236370!
Joined: 4 Jun 15 | Posts: 19 | Credit: 8,813,058,416 | RAC: 93
> I would think there should be a builtin rule that says: IF ElapsedTime > 2(PredictedTime) THEN Abort&Flag ELSE WTF

I second that approach. Right now some extremely long-running (or endlessly looping?) tasks keep timing out and so waste many days of GPU time. For example: e1s363_8_gen-PABLO_V3_p27_sj403_IDP-1-4-RND1764 timed out after running five days on a GTX 1080 Ti (single GPU task for that computer, id: 274120). It looks like that task was further assigned to others after being errored out or cancelled by operators.
Joined: 12 Jul 17 | Posts: 404 | Credit: 17,408,899,587 | RAC: 0
> ...almost certainly a problem with your host and not the app, problems like that usually get reported pretty quickly by multiple users.

Thanks for alerting me to that; I'm taking Rig-12 offline. From its stderr.txt file:

# CUDA Synchronisation mode: SPIN
# SWAN Device 2 :
# Name : GeForce GTX 1070
# ECC : Disabled
# Global mem : 8117MB
# Capability : 6.1
# PCI ID : 0000:03:00.0
# Device clock : 1784MHz
# Memory clock : 4004MHz
# Memory width : 256bit
# GPU [GeForce GTX 1070] Platform [Linux] Rev [3212] VERSION [80]
# SWAN Device 2 :
# Name : GeForce GTX 1070
# ECC : Disabled
# Global mem : 8117MB
# Capability : 6.1
# PCI ID : 0000:03:00.0
# Device clock : 1784MHz
# Memory clock : 4004MHz
# Memory width : 256bit
# Simulation unstable. Flag 5 value 11
# Simulation unstable. Flag 6 value 28
# Simulation unstable. Flag 7 value 11
# Simulation unstable. Flag 9 value 18654
# Simulation unstable. Flag 10 value 20896
# The simulation has become unstable. Terminating to avoid lock-up
# The simulation has become unstable. Terminating to avoid lock-up (2)
# Attempting restart (step 5000)

Those 3 Gigabyte 1070s are probably my oldest cards. May be time to retire them.

Nick, how did you spot this???
{It's so hard for me to use this web site. It took me 15 minutes to get that link to appear and another few minutes to get its Tasks list to appear.}
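Instability events like the ones in that log can be tallied across a slot directory with a few lines of script. A sketch, with the matched phrases copied verbatim from the excerpt above; the function name is made up:

```python
def summarize_acemd_stderr(stderr_text: str) -> dict:
    """Tally instability flags, terminations, and restart attempts
    in an ACEMD stderr.txt excerpt."""
    lines = stderr_text.splitlines()
    return {
        "unstable_flags": sum("Simulation unstable. Flag" in l for l in lines),
        "terminations": sum("Terminating to avoid lock-up" in l for l in lines),
        "restarts": sum("Attempting restart" in l for l in lines),
    }
```

A host whose counts climb across several work units is a stronger candidate for a failing card than one with a single bad task.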
Joined: 3 Sep 13 | Posts: 53 | Credit: 1,533,531,731 | RAC: 0
> Nick, How did you spot this???

I also have problems getting the site to load, but once it does it usually works OK. I guess I had a little better luck than you. 200+ errors on a single machine stood out pretty quickly, since I hadn't experienced nor read reports here about a bad batch of tasks.
©2025 Universitat Pompeu Fabra