Message boards : News : Windows GPU Applications broken

Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1

What I noticed so far is that with the Cuda_80 app the GPU load is now about 75%, whereas for the same type of task, crunched on WinXP with the Cuda_65 app, the GPU load is between 96% and 98% (as it was with the former Cuda_80 app, too). This seems interesting, not to say strange. Any explanation for this?

Joined: 26 Feb 14 · Posts: 211 · Credit: 4,496,324,562 · RAC: 0

Just had two CUDA 8 GPU work units error out because the time limit was exceeded, on my 1080 Tis on Win 7. Reset the project and will see if that fixes the problem.

Joined: 20 Apr 15 · Posts: 285 · Credit: 1,102,216,607 · RAC: 0

Happens to me also: https://www.gpugrid.net/result.php?resultid=18277793

I have already reset the project and re-run the benchmarks; it apparently didn't change anything. Shall we power down our machines again?

I would love to see HCF1 protein folding and interaction simulations to help my little boy... someday.

Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0

> Thinking about it in the shower, that's the wrong way round - apps faster than expected shouldn't cause a problem.

Do you need this?

```
<app_version>
    <app_name>acemdlong</app_name>
    <version_num>922</version_num>
    <platform>windows_intelx86</platform>
    <avg_ncpus>1.000000</avg_ncpus>
    <flops>43004890022276.586000</flops>
    <plan_class>cuda80</plan_class>
    <api_version>6.7.0</api_version>
    ...
    <coproc>
        <type>NVIDIA</type>
        <count>1.000000</count>
    </coproc>
    <gpu_ram>512.000000</gpu_ram>
    <dont_throttle/>
</app_version>

<workunit>
    <name>e38s4_e29s9p0f212-ADRIA_FOLDT1015_v2_predicted_pred_ss_contacts_50_T1015s1_3-0-1-RND4166</name>
    <app_name>acemdlong</app_name>
    <version_num>922</version_num>
    <rsc_fpops_est>5000000000000000.000000</rsc_fpops_est>
    <rsc_fpops_bound>250000000000000000000.000000</rsc_fpops_bound>
    <rsc_memory_bound>300000000.000000</rsc_memory_bound>
    <rsc_disk_bound>4000000000.000000</rsc_disk_bound>
    ...
</workunit>
```
I've got lost in that many zeroes, so I've cut 12 of them:

App flops: 43e12
rsc_fpops_est: 5 000e12
rsc_fpops_bound: 250 000 000e12

The outcome will be here. I think it will succeed. Judging by the previous error message:

exceeded elapsed time limit 5659.12 (250000000.00G/43782.46G)

the rsc_fpops_bound was only 250 000e12 before.

Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 351

That's a good start - thanks.

<flops>43004890022276.586000</flops> is 43,004,890,022,276, or 43 teraflops.
Your host 113852 had an APR (processing speed) of 519 GigaFlops under v9.18.

At the new speed, the tasks would run for (flops / (flops/sec)) seconds:

rsc_fpops_est / flops = 5,000,000,000,000,000 / 43,004,890,022,276 = 116 seconds (initial runtime estimate)

and be deemed to have 'run too long' after:

rsc_fpops_bound / flops = 250,000,000,000,000,000,000 / 43,004,890,022,276 = 5,813,292 seconds

The errors on that machine earlier today were after ~5,670 seconds - maybe they took my advice while I was out, and upped it by three orders of magnitude?

OK, that's as far as I can go here. I need to watch it on one of my own machines. But the problem seems to be that absurd 43 teraflop speed rating. The only cause of that I can think of might be if they put through some shortened test units WITHOUT CHANGING <rsc_fpops_est>. Anybody see anything like that?

(Which machine did that data come from, please?)
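To make the arithmetic above easier to follow, here is a minimal sketch (my own illustration, not anything from the project) that reproduces the two numbers from the client_state.xml values quoted earlier. The division by the app_version <flops> rating is the same relation shown in the "exceeded elapsed time limit" messages, i.e. bound/flops.

```python
# Sketch of the runtime arithmetic above, using the values quoted from client_state.xml.
flops = 43_004_890_022_276                      # app_version <flops>: the suspect ~43 teraflops rating
rsc_fpops_est = 5_000_000_000_000_000           # workunit <rsc_fpops_est>
rsc_fpops_bound = 250_000_000_000_000_000_000   # workunit <rsc_fpops_bound> (after the 10^3 raise)

estimate_s = rsc_fpops_est / flops              # initial runtime estimate shown by the client
limit_s = rsc_fpops_bound / flops               # "exceeded elapsed time limit" threshold

print(f"initial estimate: {estimate_s:,.0f} s")  # ~116 s
print(f"time limit:       {limit_s:,.0f} s")     # ~5,813,292 s
```

With the earlier bound of 250 000e12, the same division gives on the order of 5,800 seconds, consistent with the ~5,670-second failures reported earlier (the small difference comes from the slightly different flops rating shown in each host's error message).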

Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0

> The errors on that machine earlier today were after ~5,670 seconds - maybe they took my advice while I was out, and upped it by three orders of magnitude?

That is my conclusion too (see the end of my previous post).

robertmiles · Joined: 16 Apr 09 · Posts: 503 · Credit: 769,991,668 · RAC: 0

> The errors on that machine earlier today were after ~5,670 seconds - maybe they took my advice while I was out, and upped it by three orders of magnitude?

> That is my conclusion too (see the end of my previous post).

How much longer will they need to let tasks run before they get enough information to fix the problem? It looks like one more order of magnitude for run time should at least give them more information.

Also, users might help by mentioning whether their tasks were able to write a checkpoint, and then continue after it.

bcavnaugh · Joined: 8 Nov 13 · Posts: 56 · Credit: 1,002,640,163 · RAC: 0

No help here: http://www.gpugrid.net/result.php?resultid=18294177

Stderr output:

```
<core_client_version>7.12.1</core_client_version>
<![CDATA[
<message>
exceeded elapsed time limit 3487.29 (250000000.00G/71688.95G)</message>
<stderr_txt>
# GPU [GeForce GTX 1080 Ti] Platform [Windows] Rev [3212] VERSION [80]
# SWAN Device 0 :
# Name : GeForce GTX 1080 Ti
# ECC : Disabled
# Global mem : 11264MB
# Capability : 6.1
# PCI ID : 0000:01:00.0
# Device clock : 1683MHz
# Memory clock : 5505MHz
# Memory width : 352bit
# Driver version : r391_33 : 39135
# GPU 0 : 28C
# GPU 0 : 29C
# GPU 0 : 30C
# GPU 0 : 31C
# GPU 0 : 32C
# GPU 0 : 33C
# GPU 0 : 34C
# GPU 0 : 35C
# GPU 0 : 36C
# Access violation : progress made, try to restart
called boinc_finish
</stderr_txt>
]]>
```

Crunching@EVGA - The Number One Team in the BOINC Community. Folding@EVGA - The Number One Team in the Folding@Home Community.

Joined: 30 Nov 10 · Posts: 4 · Credit: 485,756,438 · RAC: 0

It's currently running on my Windows 10 machine w/ a 1080 Ti: 86.4% complete in 14:31 (m:s). It's an ADRIA job. So the jobs are running much faster than they did before. I leave that up to your interpretation.

That job took 16:30 to complete, about the same as the job that ran before it. Now starting on job 3. This one's a PABLO and took 2:01 to reach 1%. 2% done and the estimate is 2:07:20 (and falling) to completion. Specifically, now running e22s56_e25s16p0f231-PABLO_2IDP_P01106_1_GLUP33P_IDP-0-1-RND9574_1.

> That's a good start - thanks.

Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0

> No help here;

The workunits generated with the improper <rsc_fpops_bound> will be around until they error out. I have two such workunits on my host, so I've manually edited the client_state.xml file to have the right <rsc_fpops_bound> value.

The method of this fix (a scripted version of the same replacement is sketched after the steps):

1. Exit BOINC Manager.
2. Press Windows key + R.
3. Type or copy and paste: notepad c:\ProgramData\BOINC\client_state.xml
4. Press <ENTER>.
5. Press CTRL + H.
6. Search field: <rsc_fpops_bound>250000000000000000.000000</rsc_fpops_bound>
7. Replace field: <rsc_fpops_bound>250000000000000000000.000000</rsc_fpops_bound>
8. It should replace as many times as the number of GPUGrid tasks on the given host.
9. Save and exit Notepad.
10. Restart BOINC Manager.
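For hosts with several affected tasks, the same replacement can be scripted instead of done by hand in Notepad. This is only a minimal sketch under the assumptions already stated in the steps above (default Windows data directory, BOINC fully exited before editing); the backup filename is my own addition, not part of the original instructions.

```python
# Minimal sketch: script the find-and-replace from steps 6-7 above.
# Assumes BOINC has been exited first and the default Windows data directory.
from pathlib import Path
import shutil

state_file = Path(r"c:\ProgramData\BOINC\client_state.xml")  # path from step 3
old = "<rsc_fpops_bound>250000000000000000.000000</rsc_fpops_bound>"
new = "<rsc_fpops_bound>250000000000000000000.000000</rsc_fpops_bound>"

text = state_file.read_text(encoding="utf-8")
count = text.count(old)
if count:
    # Keep a backup copy before touching the client state (illustrative name).
    shutil.copy2(state_file, state_file.with_name(state_file.name + ".bak"))
    state_file.write_text(text.replace(old, new), encoding="utf-8")
print(f"Replaced {count} occurrence(s); restart BOINC Manager afterwards.")
```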

bcavnaugh · Joined: 8 Nov 13 · Posts: 56 · Credit: 1,002,640,163 · RAC: 0

Thanks.

```
<workunit>
    <name>e17s86_e4s46p0f53-PABLO_2IDP_P01106_4_LEUP23P_IDP-0-1-RND2735</name>
    <app_name>acemdlong</app_name>
    <version_num>922</version_num>
    <rsc_fpops_est>5000000000000000.000000</rsc_fpops_est>
    <rsc_fpops_bound>250000000000000000.000000</rsc_fpops_bound>
    <rsc_memory_bound>300000000.000000</rsc_memory_bound>
    <rsc_disk_bound>4000000000.000000</rsc_disk_bound>
    <file_ref>
```

Testing now with 250000000000000000000.000000.

Do we have to do this from now on, on each GPU task?

Crunching@EVGA - The Number One Team in the BOINC Community. Folding@EVGA - The Number One Team in the Folding@Home Community.

Joined: 8 May 18 · Posts: 190 · Credit: 104,426,808 · RAC: 0

No go.

Stderr output:

```
<core_client_version>7.12.1</core_client_version>
<![CDATA[
<message>
 (unknown error) - exit code -80 (0xffffffb0)</message>
<stderr_txt>
# GPU [GeForce GTX 1050 Ti] Platform [Windows] Rev [3212] VERSION [80]
# SWAN Device 0 :
# Name : GeForce GTX 1050 Ti
# ECC : Disabled
# Global mem : 4096MB
# Capability : 6.1
# PCI ID : 0000:01:00.0
# Device clock : 1392MHz
# Memory clock : 3504MHz
# Memory width : 128bit
# Driver version : r397_05 : 39764
# GPU 0 : 64C
# GPU 0 : 67C
# GPU 0 : 68C
# GPU 0 : 70C
# GPU 0 : 72C
# GPU 0 : 73C
# GPU 0 : 74C
# GPU 0 : 75C
# GPU 0 : 76C
# GPU 0 : 77C
# GPU 0 : 78C
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 1755000)
called boinc_finish
</stderr_txt>
]]>
```

bcavnaugh · Joined: 8 Nov 13 · Posts: 56 · Credit: 1,002,640,163 · RAC: 0

> No go

Exit code -80 is a driver issue (OpenCL missing); it can also be a C++ runtime issue - the runtimes may even be missing. You need both the x86 (32-bit) and the x64 (64-bit) versions. It can also point to an unstable GPU and/or CPU. This is not the same issue as "Exit status 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED".

Crunching@EVGA - The Number One Team in the BOINC Community. Folding@EVGA - The Number One Team in the Folding@Home Community.

Joined: 8 May 18 · Posts: 190 · Credit: 104,426,808 · RAC: 0

SETI@home tasks complete using opencl_nvidia_SoG. Temperature using Thundermaster is 66 C.

Tullio

Joined: 11 Oct 08 · Posts: 1127 · Credit: 1,901,927,545 · RAC: 0

Please start a new thread for the Simulation Unstable issue, if you must. It typically means your GPU is overclocked too much, and this project pushes it harder than other projects. If you want help determining a max stable overclock, PM me and be patient.

Joined: 8 May 18 · Posts: 190 · Credit: 104,426,808 · RAC: 0

My GPU is not overclocked. I never overclock.

Tullio

Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 351

> How much longer will they need to let tasks run before they get enough information to fix the problem?

> Do we have to do this from now on, on each GPU task?

From what I saw yesterday, somehow the system got itself into a state where it thought our machines were much faster than they really are. 'Machine speed' comes from one of two places: either the aggregate returns across the whole project, or the actual behaviour of each individual computer. The speed of the individual computer takes over in the end - after 11 tasks have made it all the way through and been validated. So "11 times per computer" should be the maximum number of manual interventions required.

But since they seem to have put in a workaround for the faulty kill-switch, you may not have to do it that many times, or even at all. Because work is now being completed properly, the system-wide speed assessment will be correcting itself at the same time, so machines which have been inactive while waiting for the new app may never even see the problem. But it's hard to predict when that will kick in: I may find out when I get home.

As Retvari has pointed out, there will be faulty workunits circulating around the system for a while yet, and they are a problem because they waste resources for a significant length of time. Those are the ones it is most helpful to patch via the file edit: once they have been completed and validated, they won't come back to haunt us again.

Joined: 9 Dec 08 · Posts: 1006 · Credit: 5,068,599 · RAC: 0

To summarize: the problem, AFAIK, was the test WUs, sent without changing the ops estimate. I have now cancelled them all, and temporarily raised the OPS bound by 10^3.

Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 351

> To summarize: the problem, AFAIK, was the test WUs, sent without changing the ops estimate. I have now cancelled them all, and temporarily raised the OPS bound by 10^3.

That sounds good. I agree with you about the cause, and the workaround will let the system clean itself out with no further intervention.

Just one final task: buy a 2019 calendar, and put a big red circle round the next licence expiry date! (Or perhaps a month before...)

I think you once said that the rsc_fpops_est was fixed by the workunit generation script: it might be a good idea to start thinking about making it easier to vary that. But not this weekend - take some time off!

Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0

> To summarize: the problem, AFAIK, was the test WUs, sent without changing the ops estimate. I have now cancelled them all, and temporarily raised the OPS bound by 10^3.

I still received a task which has the lower rsc_fpops_bound value, so we should watch these workunits carefully (and fix those which have the lower rsc_fpops_bound) until they've cleared out of the scheduler.