Windows GPU Applications broken


Erich56

Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Message 50102 - Posted: 27 Jul 2018, 19:11:54 UTC

What I noticed so far is that with the Cuda_80 app the GPU load now is about 75%, whereas for the same type of task, crunched on WinXP with the Cuda_65 app, the GPU load is between 96% and 98% (like it was with the former Cuda_80 app, too).

This seems interesting, not to say strange.

Any explanation for this?
Zalster

Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Message 50103 - Posted: 27 Jul 2018, 19:16:22 UTC

Just had 2 cuda 8 GPU work units error out due to the time limit being exceeded on my 1080 Tis on Win 7.

Reset the project and will see if that fixes the problem.
3de64piB5uZAS6SUNt1GFDU9dRhY

Joined: 20 Apr 15
Posts: 285
Credit: 1,102,216,607
RAC: 0
Message 50104 - Posted: 27 Jul 2018, 20:25:09 UTC
Last modified: 27 Jul 2018, 20:27:04 UTC

Happens to me also:

https://www.gpugrid.net/result.php?resultid=18277793

I have already reset the project and re-run the benchmarks; it apparently didn't change anything. Shall we power down our machines again?
I would love to see HCF1 protein folding and interaction simulations to help my little boy... someday.
Retvari Zoltan

Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Message 50105 - Posted: 27 Jul 2018, 22:27:22 UTC - in response to Message 50098.  
Last modified: 27 Jul 2018, 22:51:17 UTC

Thinking about it in the shower, that's the wrong way round - apps faster than expected shouldn't cause a problem.

Jacob, while I'm drinking/eating/drinking/sleeping/travelling/sleeping, can you pull the guts out of the <app_version> for 9.22 and a matching WU&task - just the BOINC metadata, not the file references - and post them for me to look at before I get home. Even better if you could subsequently run it and point me to the outcome online. I'm wondering if the project might have slipped half-a-dozen orders of magnitude in <rsc_fpops_est>.

Now I've got a bus to catch.

Do you need this?
<app_version>
    <app_name>acemdlong</app_name>
    <version_num>922</version_num>
    <platform>windows_intelx86</platform>
    <avg_ncpus>1.000000</avg_ncpus>
    <flops>43004890022276.586000</flops>
    <plan_class>cuda80</plan_class>
    <api_version>6.7.0</api_version>
    ...
    <coproc>
        <type>NVIDIA</type>
        <count>1.000000</count>
    </coproc>
    <gpu_ram>512.000000</gpu_ram>
    <dont_throttle/>
</app_version>
<workunit>
    <name>e38s4_e29s9p0f212-ADRIA_FOLDT1015_v2_predicted_pred_ss_contacts_50_T1015s1_3-0-1-RND4166</name>
    <app_name>acemdlong</app_name>
    <version_num>922</version_num>
    <rsc_fpops_est>5000000000000000.000000</rsc_fpops_est>
    <rsc_fpops_bound>250000000000000000000.000000</rsc_fpops_bound>
    <rsc_memory_bound>300000000.000000</rsc_memory_bound>
    <rsc_disk_bound>4000000000.000000</rsc_disk_bound>
...
</workunit>

I've got lost in that many zeroes, so I've cut 12 of them:
App flops: 43e12
rsc_fpops_est: 5 000e12
rsc_fpops_bound: 250 000 000e12

The outcome will be here.
I think it will succeed.
Judging by the previous error message:

exceeded elapsed time limit 5659.12 (250000000.00G/43782.46G)

the rsc_fpops_bound was only 250 000e12 before.
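That reverse calculation can be sketched in a few lines of Python. The values are the ones quoted above; the exact flops figure BOINC used when it killed the task may have differed slightly, so the result only approximates the 5659.12 s in the error message:

```python
# BOINC's elapsed-time kill limit is rsc_fpops_bound / device flops.
# Figures quoted in this thread; illustrative only.
flops = 43_782.46e9            # 43782.46 GFLOPS, from the error message
old_bound = 250_000e12         # 2.5e17 - the suspected previous rsc_fpops_bound
new_bound = 250_000_000e12     # 2.5e20 - the corrected rsc_fpops_bound

old_limit = old_bound / flops  # ~5,710 s, close to the ~5,659 s kill time observed
new_limit = new_bound / flops  # ~5,710,000 s, far beyond any real runtime

print(f"old limit ~ {old_limit:,.0f} s, new limit ~ {new_limit:,.0f} s")
```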
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 351
Message 50106 - Posted: 27 Jul 2018, 23:03:58 UTC - in response to Message 50105.  

That's a good start - thanks.

<flops>43004890022276.586000</flops>

is 43,004,890,022,276 or 43 teraflops

Your host 113852 had an APR (processing speed) of 519 GigaFlops under v9.18

At the new speed, the tasks would run for (flops/(flops/sec)) seconds

rsc_fpops_est/flops

5,000,000,000,000,000 / 43,004,890,022,276

116 seconds (initial runtime estimate)

and be deemed to have 'run too long' after

rsc_fpops_bound/flops

250,000,000,000,000,000,000 / 43,004,890,022,276

5,813,292 seconds

The errors on that machine earlier today were after ~5,670 seconds - maybe they took my advice while I was out, and upped it by three orders of magnitude?

OK, that's as far as I can go here. I need to watch it on one of my own machines. But the problem seems to be that absurd 43 teraflop speed rating.

The only cause of that I can think of might be if they put through some shortened test units WITHOUT CHANGING <rsc_fpops_est>. Anybody see anything like that?

(which machine did that data come from, please?)
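The arithmetic above can be checked with a short Python sketch (values copied from the <app_version> and <workunit> posted earlier):

```python
# Runtime estimate and kill limit as BOINC computes them:
#   estimate = rsc_fpops_est   / flops
#   limit    = rsc_fpops_bound / flops
flops = 43_004_890_022_276                     # the absurd 43 TFLOPS rating
rsc_fpops_est = 5_000_000_000_000_000          # 5e15
rsc_fpops_bound = 250_000_000_000_000_000_000  # 2.5e20

estimate = rsc_fpops_est / flops               # ~116 s initial runtime estimate
limit = rsc_fpops_bound / flops                # ~5,813,292 s 'run too long' threshold
```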
Retvari Zoltan

Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Message 50107 - Posted: 27 Jul 2018, 23:07:30 UTC - in response to Message 50106.  

The errors on that machine earlier today were after ~5,670 seconds - maybe they took my advice while I was out, and upped it by three orders of magnitude?
That is my conclusion too (see the end of my previous post).
robertmiles

Joined: 16 Apr 09
Posts: 503
Credit: 769,991,668
RAC: 0
Message 50108 - Posted: 27 Jul 2018, 23:36:54 UTC - in response to Message 50107.  
Last modified: 27 Jul 2018, 23:40:31 UTC

The errors on that machine earlier today were after ~5,670 seconds - maybe they took my advice while I was out, and upped it by three orders of magnitude?
That is my conclusion too (see the end of my previous post).

How much longer will they need to let tasks run before they get enough information to fix the problem?

It looks like one more order of magnitude for run time should at least give them more information.

Also, users might help by mentioning whether their tasks were able to write a checkpoint and then continue after it.
bcavnaugh

Joined: 8 Nov 13
Posts: 56
Credit: 1,002,640,163
RAC: 0
Message 50109 - Posted: 28 Jul 2018, 0:05:17 UTC

No help here:
http://www.gpugrid.net/result.php?resultid=18294177

Stderr output
<core_client_version>7.12.1</core_client_version>
<![CDATA[
<message>
exceeded elapsed time limit 3487.29 (250000000.00G/71688.95G)</message>
<stderr_txt>
# GPU [GeForce GTX 1080 Ti] Platform [Windows] Rev [3212] VERSION [80]
# SWAN Device 0 :
# Name : GeForce GTX 1080 Ti
# ECC : Disabled
# Global mem : 11264MB
# Capability : 6.1
# PCI ID : 0000:01:00.0
# Device clock : 1683MHz
# Memory clock : 5505MHz
# Memory width : 352bit
# Driver version : r391_33 : 39135
# GPU 0 : 28C
# GPU 0 : 29C
# GPU 0 : 30C
# GPU 0 : 31C
# GPU 0 : 32C
# GPU 0 : 33C
# GPU 0 : 34C
# GPU 0 : 35C
# GPU 0 : 36C
# Access violation : progress made, try to restart
called boinc_finish

</stderr_txt>
]]>

Crunching@EVGA The Number One Team in the BOINC Community. Folding@EVGA The Number One Team in the Folding@Home Community.
Michael

Joined: 30 Nov 10
Posts: 4
Credit: 485,756,438
RAC: 0
Message 50110 - Posted: 28 Jul 2018, 0:17:33 UTC - in response to Message 50106.  

It's currently running on my Windows 10 w/ 1080 Ti: 86.4% complete in 14:31 (m:s). It's an ADRIA job, so the jobs are running much faster than they did before. I leave that up to your interpretation. That job took 16:30 to complete, about the same as the job that ran before it. Now starting on job 3. This one's a PABLO and took 2:01 to reach 1%. At 2% done the estimate is 2:07:20 (and falling) to completion.

Specifically now running e22s56_e25s16p0f231-PABLO_2IDP_P01106_1_GLUP33P_IDP-0-1-RND9574_1

Retvari Zoltan

Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Message 50111 - Posted: 28 Jul 2018, 0:20:43 UTC - in response to Message 50109.  
Last modified: 28 Jul 2018, 0:25:45 UTC

No help here;
http://www.gpugrid.net/result.php?resultid=18294177
The workunits generated with the improper <rsc_fpops_bound> will be around until they error out.
I have two such workunits on my host, so I've manually edited the client_state.xml file to have the right <rsc_fpops_bound> value.

The method of this fix:
1. exit BOINC manager
2. windows key + r
3. type or copy and paste:
notepad c:\ProgramData\BOINC\client_state.xml
4. press <ENTER>
5. CTRL + H
6. search field:
<rsc_fpops_bound>250000000000000000.000000</rsc_fpops_bound>
7. replace field:
<rsc_fpops_bound>250000000000000000000.000000</rsc_fpops_bound>
8. it should replace as many times as the number of GPUGrid tasks on the given host
9. save and exit notepad
10. restart BOINC manager
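For anyone with several hosts, the same edit could be scripted. A minimal sketch using the path and values from the steps above (this assumes a default BOINC install; stop the client and back up client_state.xml first):

```python
from pathlib import Path

# The bad and corrected values from the manual steps above.
OLD = "<rsc_fpops_bound>250000000000000000.000000</rsc_fpops_bound>"
NEW = "<rsc_fpops_bound>250000000000000000000.000000</rsc_fpops_bound>"

def patch_bound(state_file: str) -> int:
    """Replace every bad bound in client_state.xml; return the hit count."""
    path = Path(state_file)
    text = path.read_text()
    count = text.count(OLD)  # one hit per affected GPUGrid task on this host
    if count:
        path.write_text(text.replace(OLD, NEW))
    return count

# Example (run only while the BOINC client is stopped):
# patch_bound(r"c:\ProgramData\BOINC\client_state.xml")
```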
bcavnaugh

Joined: 8 Nov 13
Posts: 56
Credit: 1,002,640,163
RAC: 0
Message 50112 - Posted: 28 Jul 2018, 1:01:33 UTC
Last modified: 28 Jul 2018, 1:24:00 UTC

Thanks
<workunit>
<name>e17s86_e4s46p0f53-PABLO_2IDP_P01106_4_LEUP23P_IDP-0-1-RND2735</name>
<app_name>acemdlong</app_name>
<version_num>922</version_num>
<rsc_fpops_est>5000000000000000.000000</rsc_fpops_est>
<rsc_fpops_bound>250000000000000000.000000</rsc_fpops_bound>
<rsc_memory_bound>300000000.000000</rsc_memory_bound>
<rsc_disk_bound>4000000000.000000</rsc_disk_bound>
<file_ref>

Testing now with
250000000000000000000.000000

Do we have to do this from now on, on each GPU Task?

tullio

Joined: 8 May 18
Posts: 190
Credit: 104,426,808
RAC: 0
Message 50113 - Posted: 28 Jul 2018, 1:27:42 UTC
Last modified: 28 Jul 2018, 1:28:56 UTC

No go
Stderr output

<core_client_version>7.12.1</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -80 (0xffffffb0)</message>
<stderr_txt>
# GPU [GeForce GTX 1050 Ti] Platform [Windows] Rev [3212] VERSION [80]
# SWAN Device 0 :
# Name : GeForce GTX 1050 Ti
# ECC : Disabled
# Global mem : 4096MB
# Capability : 6.1
# PCI ID : 0000:01:00.0
# Device clock : 1392MHz
# Memory clock : 3504MHz
# Memory width : 128bit
# Driver version : r397_05 : 39764
# GPU 0 : 64C
# GPU 0 : 67C
# GPU 0 : 68C
# GPU 0 : 70C
# GPU 0 : 72C
# GPU 0 : 73C
# GPU 0 : 74C
# GPU 0 : 75C
# GPU 0 : 76C
# GPU 0 : 77C
# GPU 0 : 78C
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 1755000)
called boinc_finish

</stderr_txt>
]]>
bcavnaugh

Joined: 8 Nov 13
Posts: 56
Credit: 1,002,640,163
RAC: 0
Message 50114 - Posted: 28 Jul 2018, 1:40:04 UTC - in response to Message 50113.  
Last modified: 28 Jul 2018, 1:43:42 UTC

No go
Stderr output

<core_client_version>7.12.1</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -80 (0xffffffb0)</message>
<stderr_txt>
# GPU [GeForce GTX 1050 Ti] Platform [Windows] Rev [3212] VERSION [80]
# SWAN Device 0 :
# Name : GeForce GTX 1050 Ti
# GPU 0 : 78C
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 1755000)
called boinc_finish

</stderr_txt>
]]>


Exit code -80 is a driver issue (OpenCL missing); it can also be a C++ runtime issue - the runtimes may even be missing entirely. You need both the x86 (32-bit) and the x64 (64-bit) versions. It can also be caused by an unstable GPU and/or CPU.

This is not the same issue as "Exit status 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED"

tullio

Joined: 8 May 18
Posts: 190
Credit: 104,426,808
RAC: 0
Message 50115 - Posted: 28 Jul 2018, 2:12:13 UTC - in response to Message 50114.  

SETI@home tasks complete using opencl_nvidia_SoG. Temperature using Thundermaster is 66 C.
Tullio
Jacob Klein

Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 50116 - Posted: 28 Jul 2018, 2:17:11 UTC

Please start a new thread for the Simulation Unstable issue, if you must. It typically means your GPU is overclocked too much, and this project pushes it harder than other projects. If you want help determining a max stable overclock, PM me and be patient.
tullio

Joined: 8 May 18
Posts: 190
Credit: 104,426,808
RAC: 0
Message 50117 - Posted: 28 Jul 2018, 2:28:40 UTC - in response to Message 50116.  

My GPU is not overclocked. I never overclock.
Tullio
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 351
Message 50119 - Posted: 28 Jul 2018, 6:43:28 UTC

How much longer will they need to let tasks run before they get enough information to fix the problem?

Do we have to do this from now on, on each GPU Task?

From what I saw yesterday, somehow the system got itself into a state where it thought our machines were much faster than they really are.

'machine speed' comes from one of two places: either the aggregate returns across the whole project, or the actual behaviour of each individual computer.

The speed of the individual computer takes over in the end - after 11 tasks have made it all the way through and been validated. So "11 times per computer" should be the maximum number of manual interventions required.

But since they seem to have put in a workaround for the faulty kill-switch, you may not have to do it that many times, or even at all.

Because work is now being completed properly, the system-wide speed assessment will be correcting itself at the same time, so that machines which have been inactive while waiting for the new app may never even see the problem. But it's hard to predict when that will kick in: I may find out when I get home.

As Retvari has pointed out, there will be faulty workunits circulating around the system for a while yet, and they are a problem because they waste resources for a significant length of time. Those are the ones it is most helpful to patch via the file edit: once they have been completed and validated, they won't come back to haunt us again.
Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Message 50123 - Posted: 28 Jul 2018, 7:57:39 UTC - in response to Message 50119.  

To summarize: the problem AFAIK was the test WUs, sent without changing the ops estimate. I have now cancelled them all, and temporarily raised the OPS bound by 10^3.
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 351
Message 50124 - Posted: 28 Jul 2018, 8:19:33 UTC - in response to Message 50123.  

To summarize: the problem AFAIK was the test WUs, sent without changing the ops estimate. I have now cancelled them all, and temporarily raised the OPS bound by 10^3.

That sounds good. I agree with you about the cause, and the workaround will let the system clean itself out with no further intervention.

Just one final task: buy a 2019 calendar, and put a big red circle round the next licence expiry date! (or perhaps a month before...)

I think you once said that the rsc_fpops_est was fixed by the workunit generation script: it might be a good idea to start thinking about making it easier to vary that. But not this weekend - take some time off!
Retvari Zoltan

Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Message 50125 - Posted: 28 Jul 2018, 12:04:16 UTC - in response to Message 50123.  
Last modified: 28 Jul 2018, 12:16:59 UTC

To summarize: the problem AFAIK was the test WUs, sent without changing the ops estimate. I have now cancelled them all, and temporarily raised the OPS bound by 10^3.
I still received a task which has the lower rsc_fpops_bound value. So we should watch these workunits carefully (and fix those which have the lower rsc_fpops_bound) until they've cleared out of the scheduler.


©2025 Universitat Pompeu Fabra