Windows GPU Applications broken


Erich56

Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Message 50102 - Posted: 27 Jul 2018, 19:11:54 UTC

What I noticed so far is that with the Cuda_80 app the GPU load now is about 75%, whereas for the same type of task, crunched on WinXP with the Cuda_65 app, the GPU load is between 96% and 98% (like it was with the former Cuda_80 app, too).

This seems interesting, not to say strange.

Any explanation for this?
Zalster

Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Message 50103 - Posted: 27 Jul 2018, 19:16:22 UTC

Just had 2 cuda 8 GPU work units error out due to the time limit being exceeded on my 1080 Tis on Win 7.

Reset the project and will see if that fixes the problem.
3de64piB5uZAS6SUNt1GFDU9dRhY

Joined: 20 Apr 15
Posts: 285
Credit: 1,102,216,607
RAC: 0
Message 50104 - Posted: 27 Jul 2018, 20:25:09 UTC
Last modified: 27 Jul 2018, 20:27:04 UTC

Happens to me also:

https://www.gpugrid.net/result.php?resultid=18277793

I have already reset the project and re-run the benchmarks; it apparently didn't change anything. Shall we power down our machines again?
I would love to see HCF1 protein folding and interaction simulations to help my little boy... someday.
Retvari Zoltan

Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Message 50105 - Posted: 27 Jul 2018, 22:27:22 UTC - in response to Message 50098.  
Last modified: 27 Jul 2018, 22:51:17 UTC

Thinking about it in the shower, that's the wrong way round - apps faster than expected shouldn't cause a problem.

Jacob, while I'm drinking/eating/drinking/sleeping/travelling/sleeping, can you pull the guts out of the <app_version> for 9.22 and a matching WU&task - just the BOINC metadata, not the file references - and post them for me to look at before I get home. Even better if you could subsequently run it and point me to the outcome online. I'm wondering if the project might have slipped half-a-dozen orders of magnitude in <rsc_fpops_est>.

Now I've got a bus to catch.

Do you need this?
<app_version>
    <app_name>acemdlong</app_name>
    <version_num>922</version_num>
    <platform>windows_intelx86</platform>
    <avg_ncpus>1.000000</avg_ncpus>
    <flops>43004890022276.586000</flops>
    <plan_class>cuda80</plan_class>
    <api_version>6.7.0</api_version>
    ...
    <coproc>
        <type>NVIDIA</type>
        <count>1.000000</count>
    </coproc>
    <gpu_ram>512.000000</gpu_ram>
    <dont_throttle/>
</app_version>
<workunit>
    <name>e38s4_e29s9p0f212-ADRIA_FOLDT1015_v2_predicted_pred_ss_contacts_50_T1015s1_3-0-1-RND4166</name>
    <app_name>acemdlong</app_name>
    <version_num>922</version_num>
    <rsc_fpops_est>5000000000000000.000000</rsc_fpops_est>
    <rsc_fpops_bound>250000000000000000000.000000</rsc_fpops_bound>
    <rsc_memory_bound>300000000.000000</rsc_memory_bound>
    <rsc_disk_bound>4000000000.000000</rsc_disk_bound>
...
</workunit>

I've got lost in that many zeroes, so I've cut 12 of them:
App flops: 43e12
rsc_fpops_est: 5 000e12
rsc_fpops_bound: 250 000 000e12

The outcome will be here.
I think it will succeed.
Judging by the previous error message:

exceeded elapsed time limit 5659.12 (250000000.00G/43782.46G)

the rsc_fpops_bound was only 250 000e12 before.
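That reverse calculation can be sketched in a few lines of Python. The values are the ones quoted above; the exact flops figure BOINC used when it killed the task may have differed slightly, so the result only approximates the 5659.12 s in the error message:

```python
# BOINC's elapsed-time kill limit is rsc_fpops_bound / device flops.
# Figures quoted in this thread; illustrative only.
flops = 43_782.46e9            # 43782.46 GFLOPS, from the error message
old_bound = 250_000e12         # 2.5e17 - the suspected previous rsc_fpops_bound
new_bound = 250_000_000e12     # 2.5e20 - the corrected rsc_fpops_bound

old_limit = old_bound / flops  # ~5,710 s, close to the ~5,659 s kill time observed
new_limit = new_bound / flops  # ~5,710,000 s, far beyond any real runtime

print(f"old limit ~ {old_limit:,.0f} s, new limit ~ {new_limit:,.0f} s")
```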
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 351
Message 50106 - Posted: 27 Jul 2018, 23:03:58 UTC - in response to Message 50105.  

That's a good start - thanks.

<flops>43004890022276.586000</flops>

is 43,004,890,022,276 or 43 teraflops

Your host 113852 had an APR (processing speed) of 519 GigaFlops under v9.18

At the new speed, the tasks would run for (flops/(flops/sec)) seconds

rsc_fpops_est/flops

5,000,000,000,000,000 / 43,004,890,022,276

116 seconds (initial runtime estimate)

and be deemed to have 'run too long' after

rsc_fpops_bound/flops

250,000,000,000,000,000,000 / 43,004,890,022,276

5,813,292 seconds

The errors on that machine earlier today were after ~5,670 seconds - maybe they took my advice while I was out, and upped it by three orders of magnitude?

OK, that's as far as I can go here. I need to watch it on one of my own machines. But the problem seems to be that absurd 43 teraflop speed rating.

The only cause of that I can think of might be if they put through some shortened test units WITHOUT CHANGING <rsc_fpops_est>. Anybody see anything like that?

(which machine did that data come from, please?)
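The arithmetic above can be checked with a short Python sketch (values copied from the <app_version> and <workunit> posted earlier):

```python
# Runtime estimate and kill limit as BOINC computes them:
#   estimate = rsc_fpops_est   / flops
#   limit    = rsc_fpops_bound / flops
flops = 43_004_890_022_276                     # the absurd 43 TFLOPS rating
rsc_fpops_est = 5_000_000_000_000_000          # 5e15
rsc_fpops_bound = 250_000_000_000_000_000_000  # 2.5e20

estimate = rsc_fpops_est / flops               # ~116 s initial runtime estimate
limit = rsc_fpops_bound / flops                # ~5,813,292 s 'run too long' threshold
```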
Retvari Zoltan

Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Message 50107 - Posted: 27 Jul 2018, 23:07:30 UTC - in response to Message 50106.  

The errors on that machine earlier today were after ~5,670 seconds - maybe they took my advice while I was out, and upped it by three orders of magnitude?
That is my conclusion too (see the end of my previous post).
robertmiles

Joined: 16 Apr 09
Posts: 503
Credit: 769,991,668
RAC: 0
Message 50108 - Posted: 27 Jul 2018, 23:36:54 UTC - in response to Message 50107.  
Last modified: 27 Jul 2018, 23:40:31 UTC

The errors on that machine earlier today were after ~5,670 seconds - maybe they took my advice while I was out, and upped it by three orders of magnitude?
That is my conclusion too (see the end of my previous post).

How much longer will they need to let tasks run before they get enough information to fix the problem?

It looks like one more order of magnitude for run time should at least give them more information.

Also, users might help by mentioning whether their tasks were able to write a checkpoint and then continue after it.
bcavnaugh

Joined: 8 Nov 13
Posts: 56
Credit: 1,002,640,163
RAC: 0
Message 50109 - Posted: 28 Jul 2018, 0:05:17 UTC

No help here:
http://www.gpugrid.net/result.php?resultid=18294177

Stderr output
<core_client_version>7.12.1</core_client_version>
<![CDATA[
<message>
exceeded elapsed time limit 3487.29 (250000000.00G/71688.95G)</message>
<stderr_txt>
# GPU [GeForce GTX 1080 Ti] Platform [Windows] Rev [3212] VERSION [80]
# SWAN Device 0 :
# Name : GeForce GTX 1080 Ti
# ECC : Disabled
# Global mem : 11264MB
# Capability : 6.1
# PCI ID : 0000:01:00.0
# Device clock : 1683MHz
# Memory clock : 5505MHz
# Memory width : 352bit
# Driver version : r391_33 : 39135
# GPU 0 : 28C
# GPU 0 : 29C
# GPU 0 : 30C
# GPU 0 : 31C
# GPU 0 : 32C
# GPU 0 : 33C
# GPU 0 : 34C
# GPU 0 : 35C
# GPU 0 : 36C
# Access violation : progress made, try to restart
called boinc_finish

</stderr_txt>
]]>

Crunching@EVGA The Number One Team in the BOINC Community. Folding@EVGA The Number One Team in the Folding@Home Community.
Michael

Joined: 30 Nov 10
Posts: 4
Credit: 485,756,438
RAC: 0
Message 50110 - Posted: 28 Jul 2018, 0:17:33 UTC - in response to Message 50106.  

It's currently running on my Windows 10 w/ 1080 Ti: 86.4% complete in 14:31 (m:s). It's an ADRIA job, so the jobs are running much faster than they did before. I leave that up to your interpretation. That job took 16:30 to complete, about the same as the job that ran before it. Now starting on job 3. This one's a PABLO and took 2:01 to reach 1%. At 2% done the estimate is 2:07:20 (and falling) to completion.

Specifically now running e22s56_e25s16p0f231-PABLO_2IDP_P01106_1_GLUP33P_IDP-0-1-RND9574_1

Retvari Zoltan

Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Message 50111 - Posted: 28 Jul 2018, 0:20:43 UTC - in response to Message 50109.  
Last modified: 28 Jul 2018, 0:25:45 UTC

No help here;
http://www.gpugrid.net/result.php?resultid=18294177
The workunits generated with the improper <rsc_fpops_bound> will be around until they error out.
I have two such workunits on my host, so I've manually edited the client_state.xml file to have the right <rsc_fpops_bound> value.

The method of this fix:
1. exit BOINC manager
2. windows key + r
3. type or copy and paste:
notepad c:\ProgramData\BOINC\client_state.xml
4. press <ENTER>
5. CTRL + H
6. search field:
<rsc_fpops_bound>250000000000000000.000000</rsc_fpops_bound>
7. replace field:
<rsc_fpops_bound>250000000000000000000.000000</rsc_fpops_bound>
8. it should replace as many times as the number of GPUGrid tasks on the given host
9. save and exit notepad
10. restart BOINC manager
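For anyone with several hosts, the same edit could be scripted. A minimal sketch using the path and values from the steps above (this assumes a default BOINC install; stop the client and back up client_state.xml first):

```python
from pathlib import Path

# The bad and corrected values from the manual steps above.
OLD = "<rsc_fpops_bound>250000000000000000.000000</rsc_fpops_bound>"
NEW = "<rsc_fpops_bound>250000000000000000000.000000</rsc_fpops_bound>"

def patch_bound(state_file: str) -> int:
    """Replace every bad bound in client_state.xml; return the hit count."""
    path = Path(state_file)
    text = path.read_text()
    count = text.count(OLD)  # one hit per affected GPUGrid task on this host
    if count:
        path.write_text(text.replace(OLD, NEW))
    return count

# Example (run only while the BOINC client is stopped):
# patch_bound(r"c:\ProgramData\BOINC\client_state.xml")
```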
bcavnaugh

Joined: 8 Nov 13
Posts: 56
Credit: 1,002,640,163
RAC: 0
Message 50112 - Posted: 28 Jul 2018, 1:01:33 UTC
Last modified: 28 Jul 2018, 1:24:00 UTC

Thanks
<workunit>
<name>e17s86_e4s46p0f53-PABLO_2IDP_P01106_4_LEUP23P_IDP-0-1-RND2735</name>
<app_name>acemdlong</app_name>
<version_num>922</version_num>
<rsc_fpops_est>5000000000000000.000000</rsc_fpops_est>
<rsc_fpops_bound>250000000000000000.000000</rsc_fpops_bound>
<rsc_memory_bound>300000000.000000</rsc_memory_bound>
<rsc_disk_bound>4000000000.000000</rsc_disk_bound>
<file_ref>

Testing now with
250000000000000000000.000000

Do we have to do this from now on, on each GPU Task?

tullio

Joined: 8 May 18
Posts: 190
Credit: 104,426,808
RAC: 0
Message 50113 - Posted: 28 Jul 2018, 1:27:42 UTC
Last modified: 28 Jul 2018, 1:28:56 UTC

No go
Stderr output

<core_client_version>7.12.1</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -80 (0xffffffb0)</message>
<stderr_txt>
# GPU [GeForce GTX 1050 Ti] Platform [Windows] Rev [3212] VERSION [80]
# SWAN Device 0 :
# Name : GeForce GTX 1050 Ti
# ECC : Disabled
# Global mem : 4096MB
# Capability : 6.1
# PCI ID : 0000:01:00.0
# Device clock : 1392MHz
# Memory clock : 3504MHz
# Memory width : 128bit
# Driver version : r397_05 : 39764
# GPU 0 : 64C
# GPU 0 : 67C
# GPU 0 : 68C
# GPU 0 : 70C
# GPU 0 : 72C
# GPU 0 : 73C
# GPU 0 : 74C
# GPU 0 : 75C
# GPU 0 : 76C
# GPU 0 : 77C
# GPU 0 : 78C
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 1755000)
called boinc_finish

</stderr_txt>
]]>
bcavnaugh

Joined: 8 Nov 13
Posts: 56
Credit: 1,002,640,163
RAC: 0
Message 50114 - Posted: 28 Jul 2018, 1:40:04 UTC - in response to Message 50113.  
Last modified: 28 Jul 2018, 1:43:42 UTC

No go
Stderr output

<core_client_version>7.12.1</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -80 (0xffffffb0)</message>
<stderr_txt>
# GPU [GeForce GTX 1050 Ti] Platform [Windows] Rev [3212] VERSION [80]
# SWAN Device 0 :
# Name : GeForce GTX 1050 Ti
# GPU 0 : 78C
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 1755000)
called boinc_finish

</stderr_txt>
]]>


Exit code -80 is a driver issue (OpenCL missing); it can also be a C++ runtime issue - the runtimes may even be missing entirely. You need both the x86 (32-bit) and the x64 (64-bit) versions. It can also be caused by an unstable GPU and/or CPU.

This is not the same issue as "Exit status 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED"

tullio

Joined: 8 May 18
Posts: 190
Credit: 104,426,808
RAC: 0
Message 50115 - Posted: 28 Jul 2018, 2:12:13 UTC - in response to Message 50114.  

SETI@home tasks complete using opencl_nvidia_SoG. Temperature using Thundermaster is 66 C.
Tullio
Jacob Klein

Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 50116 - Posted: 28 Jul 2018, 2:17:11 UTC

Please start a new thread for the Simulation Unstable issue, if you must. It typically means your GPU is overclocked too much, and this project pushes it harder than other projects. If you want help determining a max stable overclock, PM me and be patient.
tullio

Joined: 8 May 18
Posts: 190
Credit: 104,426,808
RAC: 0
Message 50117 - Posted: 28 Jul 2018, 2:28:40 UTC - in response to Message 50116.  

My GPU is not overclocked. I never overclock.
Tullio
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 351
Message 50119 - Posted: 28 Jul 2018, 6:43:28 UTC

How much longer will they need to let tasks run before they get enough information to fix the problem?

Do we have to do this from now on, on each GPU Task?

From what I saw yesterday, somehow the system got itself into a state where it thought our machines were much faster than they really are.

'machine speed' comes from one of two places: either the aggregate returns across the whole project, or the actual behaviour of each individual computer.

The speed of the individual computer takes over in the end - after 11 tasks have made it all the way through and been validated. So "11 times per computer" should be the maximum number of manual interventions required.

But since they seem to have put in a workaround for the faulty kill-switch, you may not have to do it that many times, or even at all.

Because work is now being completed properly, the system-wide speed assessment will be correcting itself at the same time, so that machines which have been inactive while waiting for the new app may never even see the problem. But it's hard to predict when that will kick in: I may find out when I get home.

As Retvari has pointed out, there will be faulty workunits circulating around the system for a while yet, and they are a problem because they waste resources for a significant length of time. Those are the ones it is most helpful to patch via the file edit: once they have been completed and validated, they won't come back to haunt us again.
Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Message 50123 - Posted: 28 Jul 2018, 7:57:39 UTC - in response to Message 50119.  

To summarize: the problem AFAIK was the test WUs, sent without changing the ops estimate. I have now cancelled them all, and temporarily raised the OPS bound by 10^3.
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 351
Message 50124 - Posted: 28 Jul 2018, 8:19:33 UTC - in response to Message 50123.  

To summarize: the problem AFAIK was the test WUs, sent without changing the ops estimate. I have now cancelled them all, and temporarily raised the OPS bound by 10^3.

That sounds good. I agree with you about the cause, and the workaround will let the system clean itself out with no further intervention.

Just one final task: buy a 2019 calendar, and put a big red circle round the next licence expiry date! (or perhaps a month before...)

I think you once said that the rsc_fpops_est was fixed by the workunit generation script: it might be a good idea to start thinking about making it easier to vary that. But not this weekend - take some time off!
Retvari Zoltan

Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Message 50125 - Posted: 28 Jul 2018, 12:04:16 UTC - in response to Message 50123.  
Last modified: 28 Jul 2018, 12:16:59 UTC

To summarize: the problem AFAIK was the test WUs, sent without changing the ops estimate. I have now cancelled them all, and temporarily raised the OPS bound by 10^3.
I still received a task which has the lower rsc_fpops_bound value. So we should watch these workunits carefully (and fix those which have the lower rsc_fpops_bound) until they've cleared out of the scheduler.


©2025 Universitat Pompeu Fabra