Message boards :
Number crunching :
Pablo, When is enough enough???
Joined: 12 Jul 17 | Posts: 404 | Credit: 17,408,899,587 | RAC: 0
Two days ago I completed a WU that ran for 21 days, the next one took 14 days, and today there are two running at 5 and 8 days (below):

Computer: Rig-26
Project: GPUGRID
Name: e7s25_e1s146p0f12-PABLO_v2O00512_MOR_5_IDP-1-2-RND0561_8
Application: Long runs (8-12 hours on fastest card) 9.19 (cuda80)
Workunit name: e7s25_e1s146p0f12-PABLO_v2O00512_MOR_5_IDP-1-2-RND0561
State: Running High P.
Received: 12/20/2018 7:51:57 AM
Report deadline: 12/25/2018 7:51:56 AM
Estimated app speed: 193.79 GFLOPs/sec
Estimated task size: 5,000,000 GFLOPs
Resources: 1 CPU + 0.5 NVIDIA GPUs (device 1)
CPU time at last checkpoint: 00:00:00
CPU time: 08d,19:21:05
Elapsed time: 08d,19:23:43
Estimated time remaining: 00:00:00
Fraction done: 100.000%
Virtual memory size: 22,284.77 MB
Working set size: 77.79 MB
Directory: slots/31
Process ID: 2411

Computer: Rig-12
Project: GPUGRID
Name: e2s18_e1s312p2f159-PABLO_V3_p27_sj403_IDP-3-4-RND6569_0
Application: Long runs (8-12 hours on fastest card) 9.19 (cuda80)
Workunit name: e2s18_e1s312p2f159-PABLO_V3_p27_sj403_IDP-3-4-RND6569
State: Running High P.
Received: 12/18/2018 12:36:04 PM
Report deadline: 12/23/2018 12:36:02 PM
Estimated app speed: 121.06 GFLOPs/sec
Estimated task size: 5,000,000 GFLOPs
Resources: 1 CPU + 0.5 NVIDIA GPUs (device 0)
CPU time at last checkpoint: 00:00:00
CPU time: 05d,12:06:03
Elapsed time: 05d,12:38:21
Estimated time remaining: 00:00:00
Fraction done: 99.999%
Virtual memory size: 30,588.04 MB
Working set size: 81.13 MB
Directory: slots/27
Process ID: 1794

I would think there should be a built-in rule that says:

IF ElapsedTime > 2*(PredictedTime) THEN Abort&Flag ELSE WTF
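The rule proposed above can also be approximated on the client side with a short script. This is only a sketch: the `name:` and `elapsed task time:` field labels are assumptions about typical `boinccmd --get_tasks` output and may differ by client version, and the abort command is left as a comment to verify first.

```python
def should_abort(elapsed_s: float, predicted_s: float, factor: float = 2.0) -> bool:
    """The rule from the post: abort once elapsed time exceeds factor * predicted time."""
    return predicted_s > 0 and elapsed_s > factor * predicted_s

def parse_tasks(boinccmd_text: str) -> list:
    """Rough parse of `boinccmd --get_tasks`-style text into per-task dicts.
    The field labels here are assumptions; check your client's actual output."""
    tasks, current = [], None
    for raw in boinccmd_text.splitlines():
        line = raw.strip()
        if line.startswith("name: "):
            current = {"name": line[len("name: "):]}
            tasks.append(current)
        elif line.startswith("elapsed task time: ") and current is not None:
            current["elapsed"] = float(line[len("elapsed task time: "):])
    return tasks

# Actually aborting would then shell out to something like:
#   boinccmd --task <project_url> <task_name> abort
# (verify that syntax against your BOINC client before relying on it)
```

With a predicted run time of 12 hours, a task at 9 days elapsed would trip the 2x threshold immediately.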
Joined: 26 Feb 14 | Posts: 211 | Credit: 4,496,324,562 | RAC: 0
> Two days ago I completed a WU that ran for 21 days, next one took 14 days and today there's two running at 5 and 8 days (below):

When running GPUGrid, you really should only run one work unit per GPU. Without knowing what the other work unit on that same GPU is, you are probably starving the GPUGrid work unit of time on the GPU. What GPUs are in those machines?
Joined: 12 Jul 17 | Posts: 404 | Credit: 17,408,899,587 | RAC: 0
So gpuGRID has been shown to be incapable of sharing a GPU???

Both of these computers have SWAN_SYNC enabled, but their app_config.xml was changed after enabling SWAN_SYNC, so they still show 0.5 GPU even though each task's mate completed days ago and it has been running alone. So I bet you a GRC that is not the problem here. These WUs could not converge on a solution for any number of reasons that only Pablo can know.

Rig-12: http://www.gpugrid.net/show_host_detail.php?hostid=484061
Rig-26: http://www.gpugrid.net/show_host_detail.php?hostid=484035
Joined: 20 Apr 15 | Posts: 285 | Credit: 1,102,216,607 | RAC: 0
> When running GPUGrid, you really should only run 1 work unit per GPU. Without knowing what the other work unit is on that same GPU, you are probably starving the GPUGrid work unit with time on the GPU. What GPUs are in those machines?

I don't have any problems with running two tasks per GPU. My Ryzen 1700 + 2x GTX 1070 system runs 4 long jobs in parallel and needs 30,000-50,000 seconds for completion. This config keeps the GPU working at an approximately constant temperature (-> longer lifetime), whereas a single-job config would put more thermal stress on the card, especially when jobs are rare (many idle/zero-load periods in between).

I would love to see HCF1 protein folding and interaction simulations to help my little boy... someday.
Joined: 26 Feb 14 | Posts: 211 | Credit: 4,496,324,562 | RAC: 0
> This config keeps the GPU working and at approximately constant temperature (-> longer lifetime) whereas a singe job config would put more thermal stress on the card especially when jobs are rare (many down/zero load times inbetween).

My temps are never very high. All my cards are hybrids, so they never go above 50C. Extended warranties for a total of 10 years (longer than most people keep their cards), so no worries about them failing and not being able to replace them.

It is troubling that his machines are taking anywhere from 5 to 21 days to do what most do in a few hours. The only way to know would be to run those exact work units on another machine and see if the results are the same.
Retvari Zoltan
Joined: 20 Jan 09 | Posts: 2380 | Credit: 16,897,957,044 | RAC: 0
> So gpuGRID has been shown to be incapable of sharing a GPU???

It's capable of sharing the GPU, but enabling SWAN_SYNC means that you dedicate your GPU to the GPUGrid app to make it as fast as possible (= maximize GPU utilization). In this case you should not share it with other project(s).

> Both of these computers have SWAN_SYNC enabled but their app_config.xml was changed after enabling SWAN_SYNC so they still show 0.5 GPU even though its mate completed days ago and it's been running alone.

The app_config tells the BOINC manager what to expect from the given app; it does *not* instruct the app on how much of each resource (GPU, CPU) to use. (There's no way to configure the GPUGrid app to a given GPU utilization percentage.) So if you enable SWAN_SYNC, you should set 1.0 GPU and 1.0 CPU in app_config.xml like this:

```xml
<app_config>
  <app>
    <name>acemdlong</name>
    <fraction_done_exact/>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>acemdshort</name>
    <fraction_done_exact/>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>acemdbeta</name>
    <fraction_done_exact/>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```

BTW, the deadline is 5 days, so if a workunit takes much more than that you should abort it and check for a solution, as it should take 5-6 hours on a GTX 1070.
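A mis-edited app_config.xml is easy to miss by eye, so here is a minimal sketch that flags any app still requesting less than a full GPU or CPU. It assumes the standard BOINC app_config layout shown above; the function name is made up for illustration.

```python
import xml.etree.ElementTree as ET

def underprovisioned_apps(app_config_xml: str) -> list:
    """Return names of <app> entries whose gpu_usage or cpu_usage is below 1.0.
    Assumes the standard BOINC app_config.xml structure."""
    root = ET.fromstring(app_config_xml)
    bad = []
    for app in root.findall("app"):
        name = app.findtext("name", default="?")
        gv = app.find("gpu_versions")
        gpu = float(gv.findtext("gpu_usage", "0")) if gv is not None else 0.0
        cpu = float(gv.findtext("cpu_usage", "0")) if gv is not None else 0.0
        if gpu < 1.0 or cpu < 1.0:
            bad.append(name)
    return bad
```

Run it against the file in your project directory after editing, then tell the client to re-read its config files so the change takes effect.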
Joined: 12 Jul 17 | Posts: 404 | Credit: 17,408,899,587 | RAC: 0
Pablo et alia,

Please provide some feedback on these WUs that take days and weeks to complete.

Do you learn something from them if we let them run so long???
Do you notice them if we abort them before they complete???
If we should just abort them when they take too long, can you automate the process as I suggested in the lead post???

IF ElapsedTime > 2*PredictedTime THEN Abort&Flag ELSE WTF

If I abort them, will they just queue up and go out again??? I don't want you to miss the best binding pocket the universe has ever seen.

{BTW, this thread is not about SWAN_SYNCing.}
Joined: 8 May 18 | Posts: 190 | Credit: 104,426,808 | RAC: 0
On my SuSE Linux host, with a GTX 750 Ti graphics board, I find the following line in stderr.txt:

CUDA Synchronization mode SPIN

What does it mean?
Tullio
ServicEnginIC
Joined: 24 Sep 10 | Posts: 592 | Credit: 11,972,186,510 | RAC: 1,187
> On my SuSE Linux host, with a GTX 750 Ti graphic board, I find in the stderr.txt the following line:

This means that SWAN_SYNC is successfully enabled on your system. When it is not, it should read:

CUDA Synchronization mode BLOCKING

{But this matter better fits in the SWAN_SYNC in Linux client thread http://www.gpugrid.net/forum_thread.php?id=4813, as Aurum suggests.}
Joined: 12 Jul 17 | Posts: 404 | Credit: 17,408,899,587 | RAC: 0
Today's long-running WU, to abort or not to abort:

Computer: Rig-26
Project: GPUGRID
Name: e7s25_e1s146p0f12-PABLO_v2O00512_MOR_5_IDP-1-2-RND0561_8
Application: Long runs (8-12 hours on fastest card) 9.19 (cuda80)
Workunit name: e7s25_e1s146p0f12-PABLO_v2O00512_MOR_5_IDP-1-2-RND0561
State: Running High P.
Received: 12/20/2018 7:51:57 AM
Report deadline: 12/25/2018 7:51:56 AM
Estimated app speed: 199.45 GFLOPs/sec
Estimated task size: 5,000,000 GFLOPs
Resources: 1 CPU + 0.5 NVIDIA GPUs (device 1)
CPU time at last checkpoint: 00:00:00
CPU time: 09d,18:08:21
Elapsed time: 09d,18:11:16
Estimated time remaining: 00:00:00
Fraction done: 100.000%
Virtual memory size: 22,284.77 MB
Working set size: 77.79 MB
Directory: slots/31
Process ID: 2411
ServicEnginIC
Joined: 24 Sep 10 | Posts: 592 | Credit: 11,972,186,510 | RAC: 1,187
> If we should just abort them if they take too long can you automate the process as I suggested in the lead post???

There is a clue in your never-ending task 15602360: http://www.gpugrid.net/workunit.php?wuid=15602360
This task has failed or not finished on many different systems, so you can freely abort a task like this. It is clearly a defective one. As you suggest, some in-task protection for avoiding such problems would be appreciated.

> If I abort them will they just queue up and go out again???

As seen with this particular defective task, it has been resent to many systems. This task has now been automatically retired by the project, as it has reached a total of 10 resends with no successful result.
Joined: 12 Jul 17 | Posts: 404 | Credit: 17,408,899,587 | RAC: 0
Wow! If the other nine spent 9 days each, that's a quarter of a GPU-year that could've been better used. OK, I'm convinced to abort this one, but I don't want to toss them out if there's something that Team GDF might learn from them.
Joined: 21 Mar 16 | Posts: 513 | Credit: 4,673,458,277 | RAC: 0
> Wow! If the other nine spent 9 days each that's a quarter of a GPU-year that could've been better used.

Even when you abort, I'm pretty sure the results are still uploaded for their analysis.
Joined: 3 Sep 13 | Posts: 53 | Credit: 1,533,531,731 | RAC: 0
BOINC will sometimes report run times that aren't correct, although not usually off by days. I would first try to make sure these problem tasks are really running that long. If they are, that's almost certainly a problem with your host and not the app; problems like that usually get reported pretty quickly by multiple users. You may have a problem similar to what's reported here: https://www.gpugrid.net/forum_thread.php?id=4868

Rig-12 (http://www.gpugrid.net/show_host_detail.php?hostid=484061) definitely has a card or cards not working well; you shouldn't have that many errors.

Team USA forum | Team USA page
Join us and #crunchforcures. We are now also folding: join team ID 236370!
Joined: 4 Jun 15 | Posts: 19 | Credit: 8,813,058,416 | RAC: 93
> I would think there should be a builtin rule that says: IF ElapsedTime > 2(PredictedTime) THEN Abort&Flag ELSE WTF

I second that approach. Right now some extremely long-running (or endlessly looping?) tasks keep timing out and so waste many days of GPU time. For example: e1s363_8_gen-PABLO_V3_p27_sj403_IDP-1-4-RND1764 timed out after running five days on a GTX 1080 Ti (single GPU task for that computer, id: 274120). It looks like that task was further assigned to others after being errored out or cancelled by operators.
Joined: 12 Jul 17 | Posts: 404 | Credit: 17,408,899,587 | RAC: 0
> ...almost certainly a problem with your host and not the app, problems like that usually get reported pretty quickly by multiple users.

Thanks for alerting me to that; I'm taking Rig-12 offline. From its stderr.txt file:

# CUDA Synchronisation mode: SPIN
# SWAN Device 2 :
# Name : GeForce GTX 1070
# ECC : Disabled
# Global mem : 8117MB
# Capability : 6.1
# PCI ID : 0000:03:00.0
# Device clock : 1784MHz
# Memory clock : 4004MHz
# Memory width : 256bit
# GPU [GeForce GTX 1070] Platform [Linux] Rev [3212] VERSION [80]
# SWAN Device 2 :
# Name : GeForce GTX 1070
# ECC : Disabled
# Global mem : 8117MB
# Capability : 6.1
# PCI ID : 0000:03:00.0
# Device clock : 1784MHz
# Memory clock : 4004MHz
# Memory width : 256bit
# Simulation unstable. Flag 5 value 11
# Simulation unstable. Flag 6 value 28
# Simulation unstable. Flag 7 value 11
# Simulation unstable. Flag 9 value 18654
# Simulation unstable. Flag 10 value 20896
# The simulation has become unstable. Terminating to avoid lock-up
# The simulation has become unstable. Terminating to avoid lock-up (2)
# Attempting restart (step 5000)

Those 3 Gigabyte 1070s are probably my oldest cards. May be time to retire them.

Nick, how did you spot this???
{It's so hard for me to use this web site. It took me 15 minutes to get that link to appear and another few minutes to get its Tasks list to appear.}
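Instability events like the ones in that log can be tallied across a slot directory with a few lines of script. A sketch, with the matched phrases copied verbatim from the excerpt above; the function name is made up:

```python
def summarize_acemd_stderr(stderr_text: str) -> dict:
    """Tally instability flags, terminations, and restart attempts
    in an ACEMD stderr.txt excerpt."""
    lines = stderr_text.splitlines()
    return {
        "unstable_flags": sum("Simulation unstable. Flag" in l for l in lines),
        "terminations": sum("Terminating to avoid lock-up" in l for l in lines),
        "restarts": sum("Attempting restart" in l for l in lines),
    }
```

A host whose counts climb across several work units is a stronger candidate for a failing card than one with a single bad task.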
Joined: 3 Sep 13 | Posts: 53 | Credit: 1,533,531,731 | RAC: 0
> Nick, How did you spot this???

I also have problems getting the site to load, but once it does it usually works OK. I guess I had a little better luck than you. 200+ errors on a single machine stood out pretty quickly, since I hadn't experienced nor read reports here about a bad batch of tasks.
©2025 Universitat Pompeu Fabra