New workunits

Keith Myers
Message 53175 - Posted: 28 Nov 2019, 0:49:23 UTC - in response to Message 53171.  

Thanks, but I believe you misread me. The CPU is fine. The measurement is wrong.

No, I believe the measurement is incorrect, but the real temperature is still going to be rather high. The Ryzen 3600 ships with the Wraith Stealth cooler, which is just the normal Intel solution of a copper plug embedded in an aluminum casting. It just doesn't have the ability to move heat away from the IHS quickly.

You would see much better temps if you switched to the Wraith MAX or Wraith Prism cooler which have real heat pipes and normal sized fans.

The temps are correct for the Ryzen and Ryzen+ cpus, but the k10temp driver that is stock in Ubuntu didn't get the change needed to accommodate the Ryzen 2 cpus with the correct 0 temp offset. That change only ships in the 5.3.4 or 5.4 kernels.

https://www.phoronix.com/scan.php?page=news_item&px=AMD-Zen2-k10temp-Patches

There are other solutions you could use in the meantime, like the ASUS temp driver if you have a compatible motherboard, or the zenpower driver, which can report the proper temp as well as the cpu power.

https://github.com/ocerman/zenpower

Greg _BE
Message 53176 - Posted: 28 Nov 2019, 0:56:03 UTC - in response to Message 53154.  
Last modified: 28 Nov 2019, 1:05:06 UTC

Damn! Wishful thinking!

How about 4.75. Too many numbers on my screen.
It's because it shows 4075 and then I automatically drop in the . at 2 places, not realizing my mistake!

As far as temperature goes, I am only reporting the CPU temp at the sensor point.
I have sent a webform to Arctic asking them what the temp would be at the radiator after passing by the CPU heatsink. The air temp of the exhaust air does not feel anywhere near 80. I would put it down around 40C or less.

I will see what they say and let you know.

Greg _BE
Message 53177 - Posted: 28 Nov 2019, 1:06:19 UTC
Last modified: 28 Nov 2019, 1:07:59 UTC

Tony - I keep getting this on random tasks
unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
13:11:40 (25792): wrapper (7.9.26016): starting
13:11:40 (25792): wrapper: running acemd3.exe (--boinc input --device 1)
# Engine failed: Particle coordinate is nan
13:37:25 (25792): acemd3.exe exited; CPU time 1524.765625
13:37:25 (25792): app exit status: 0x1
13:37:25 (25792): called boinc_finish(195)

It runs 1524 seconds and bombs.
What's up with that?

It also appears that BOINC or the task is ignoring the appconfig command to use only my 1050. I see another task that is starting on the 1050 and then jumping to the 650 since the 1050 is tied up with another project.

Jim1348
Message 53178 - Posted: 28 Nov 2019, 3:06:52 UTC - in response to Message 53175.  

The temps are correct for the Ryzen and Ryzen+ cpus, but the k10temp driver that is stock in Ubuntu didn't get the change needed to accommodate the Ryzen 2 cpus with the correct 0 temp offset. That change only ships in the 5.3.4 or 5.4 kernels.

Then it is probably reading 20C too high, and the CPU is really at 75C.
Yes, I can improve on that. Thanks.

rod4x4
Message 53179 - Posted: 28 Nov 2019, 3:47:00 UTC - in response to Message 53177.  

Tony - I keep getting this on random tasks
unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
13:11:40 (25792): wrapper (7.9.26016): starting
13:11:40 (25792): wrapper: running acemd3.exe (--boinc input --device 1)
# Engine failed: Particle coordinate is nan
13:37:25 (25792): acemd3.exe exited; CPU time 1524.765625
13:37:25 (25792): app exit status: 0x1
13:37:25 (25792): called boinc_finish(195)

It runs 1524 seconds and bombs.
What's up with that?

It also appears that BOINC or the task is ignoring the appconfig command to use only my 1050. I see another task that is starting on the 1050 and then jumping to the 650 since the 1050 is tied up with another project.


# Engine failed: Particle coordinate is nan

Two issues can cause this error:
1. Error in the Task. This would mean all Hosts fail the task. See this link for details: https://github.com/openmm/openmm/issues/2308
2. If other Hosts do not fail the task, the error could be in the GPU Clock rate. I have tested this on one of my hosts and am able to produce this error when I Clock the GPU too high.

It also appears that BOINC or the task is ignoring the appconfig command to use only my 1050.

One setting to try....In Boinc Manager, Computer Preferences, set the "Switch between tasks every xxx minutes" to between 800 - 9999. This should allow the task to finish on the same GPU it started on.
Can you post your app_config.xml file contents?
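
For anyone who prefers a file over the Manager dialog, here is a minimal sketch of the same setting. It assumes a global_prefs_override.xml in the BOINC data directory, which is where the client keeps local preference overrides, and it uses 800 only because that is the low end of the range suggested above. If the file already exists, the element would be added to it rather than replacing the other entries.

```xml
<global_preferences>
   <!-- "Switch between tasks every N minutes"; 800 is an illustrative value -->
   <cpu_scheduling_period_minutes>800</cpu_scheduling_period_minutes>
</global_preferences>
```

Restarting the client afterwards should make sure the new value is picked up.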

Keith Myers
Message 53180 - Posted: 28 Nov 2019, 4:34:21 UTC

I've had a couple of the NaN errors. One where everyone errors out the task and another recently where it errored out after running through to completion. I had already removed all overclocking on the card but it still must have been too hot for the stock clockrate. It is my hottest card, being sandwiched in the middle of the gpu stack with very little airflow. I think I am going to have to start putting a negative clock offset on it to get the temps down and avoid any further NaN errors on that card.

rod4x4
Message 53181 - Posted: 28 Nov 2019, 5:52:47 UTC - in response to Message 53180.  
Last modified: 28 Nov 2019, 5:55:02 UTC

I've had a couple of the NaN errors. One where everyone errors out the task and another recently where it errored out after running through to completion. I had already removed all overclocking on the card but it still must have been too hot for the stock clockrate. It is my hottest card, being sandwiched in the middle of the gpu stack with very little airflow. I think I am going to have to start putting a negative clock offset on it to get the temps down and avoid any further NaN errors on that card.

Would be interested to hear if the Under Clocking / Heat reduction fixes the issue.
I am fairly confident this is the issue, but need validation / more data from fellow volunteers to be sure.

Erich56
Message 53183 - Posted: 28 Nov 2019, 6:32:03 UTC - in response to Message 53163.  

http://www.gpugrid.net/show_host_detail.php?hostid=147723
http://www.gpugrid.net/show_host_detail.php?hostid=482132
...
Might it be that the Wrapper is slower on slower CPUs and therefore slows down the GPUs? Is this the experience from other users as well?


That's really interesting: comparing the two tasks above shows that the host with the GTX1660ti yields lower GFLOP figures (single as well as double) than the host with the GTX1650.
In both hosts, the CPU is the same: Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz.

And now the even more surprising fact: by coincidence, exactly the same CPU is running in one of my hosts (http://www.gpugrid.net/show_host_detail.php?hostid=205584) with a GTX750ti, and here the GFLOP figures are markedly higher still than in the above-cited hosts with more modern GPUs.

So, is the conclusion now: the weaker the GPU, the higher the number of GFLOPs generated by the system?

Retvari Zoltan
Message 53184 - Posted: 28 Nov 2019, 9:28:26 UTC - in response to Message 53183.  
Last modified: 28 Nov 2019, 9:34:47 UTC

http://www.gpugrid.net/show_host_detail.php?hostid=147723
http://www.gpugrid.net/show_host_detail.php?hostid=482132
...
Might it be that the Wrapper is slower on slower CPUs and therefore slows down the GPUs? Is this the experience from other users as well?
That's really interesting: comparing the two tasks above shows that the host with the GTX1660ti yields lower GFLOP figures (single as well as double) than the host with the GTX1650.
In both hosts, the CPU is the same: Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz.

And now the even more surprising fact: by coincidence, exactly the same CPU is running in one of my hosts (http://www.gpugrid.net/show_host_detail.php?hostid=205584) with a GTX750ti, and here the GFLOP figures are markedly higher still than in the above-cited hosts with more modern GPUs.

So, is the conclusion now: the weaker the GPU, the higher the number of GFLOPs generated by the system?

The "Integer" speed (I hope that's what it's called in English) is measured much higher under Linux than under Windows.
(The 1st and 2nd hosts use Linux; the 3rd uses Windows.)
See the stats of my dual boot host:
Linux 139876.18 - Windows 12615.42
There's more than one order of magnitude difference between the two OS on the same hardware, one of them must be wrong.

Greg _BE
Message 53185 - Posted: 28 Nov 2019, 13:27:26 UTC - in response to Message 53176.  

Damn! Wishful thinking!

How about 4.75. Too many numbers on my screen.
It's because it shows 4075 and then I automatically drop in the . at 2 places, not realizing my mistake!

As far as temperature goes, I am only reporting the CPU temp at the sensor point.
I have sent a webform to Arctic asking them what the temp would be at the radiator after passing by the CPU heatsink. The air temp of the exhaust air does not feel anywhere near 80. I would put it down around 40C or less.

I will see what they say and let you know.

------------------------

Hi Greg

I talked to my colleague who is in the Liquid Freezer II Dev. Team and he said that these temps are normal with this kind of load.
Installation sounds good to me.


With kind regards


Your ARCTIC Team,
Stephan
Arctic/Service Manager

Greg _BE
Message 53186 - Posted: 28 Nov 2019, 13:30:27 UTC - in response to Message 53179.  

Tony - I keep getting this on random tasks
unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
13:11:40 (25792): wrapper (7.9.26016): starting
13:11:40 (25792): wrapper: running acemd3.exe (--boinc input --device 1)
# Engine failed: Particle coordinate is nan
13:37:25 (25792): acemd3.exe exited; CPU time 1524.765625
13:37:25 (25792): app exit status: 0x1
13:37:25 (25792): called boinc_finish(195)

It runs 1524 seconds and bombs.
What's up with that?

It also appears that BOINC or the task is ignoring the appconfig command to use only my 1050. I see another task that is starting on the 1050 and then jumping to the 650 since the 1050 is tied up with another project.


# Engine failed: Particle coordinate is nan

Two issues can cause this error:
1. Error in the Task. This would mean all Hosts fail the task. See this link for details: https://github.com/openmm/openmm/issues/2308
2. If other Hosts do not fail the task, the error could be in the GPU Clock rate. I have tested this on one of my hosts and am able to produce this error when I Clock the GPU too high.

It also appears that BOINC or the task is ignoring the appconfig command to use only my 1050.

One setting to try....In Boinc Manager, Computer Preferences, set the "Switch between tasks every xxx minutes" to between 800 - 9999. This should allow the task to finish on the same GPU it started on.
Can you post your app_config.xml file contents?


--------------------

<?xml version="1.0"?>
<app_config>
<exclude_gpu>
<url>www.gpugrid.net</url>
<device_num>1</device_num>
<type>NVIDIA</type>
</exclude_gpu>
</app_config>

I was having some issues with LHC ATLAS and was in the process of putting the tasks on pause and then disconnecting the client. In this process I discovered that another instance popped up right after I closed the one I was looking at and then I got another instance popping up with a message saying that there were two running. I shut that down and it shut down the last instance. This is a first for me.

I have restarted my computer and now will wait and see what's going on.

computezrmle
Message 53187 - Posted: 28 Nov 2019, 16:52:29 UTC - in response to Message 53186.  

What you posted is a mix of app_config.xml and cc_config.xml.
Be so kind as to strictly follow the hints and examples on this page:
https://boinc.berkeley.edu/wiki/Client_configuration
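
For contrast, here is a minimal sketch of what an app_config.xml normally holds: per-application settings such as GPU and CPU usage, kept in the project's own folder. The app name acemd3 is an assumption inferred from the acemd3.exe in the logs earlier in the thread, and the usage values are only illustrative. Device exclusion is not one of these settings.

```xml
<app_config>
   <app>
      <!-- app name is assumed; check the project's application list -->
      <name>acemd3</name>
      <gpu_versions>
         <!-- one task per GPU, one full CPU core per task (illustrative) -->
         <gpu_usage>1.0</gpu_usage>
         <cpu_usage>1.0</cpu_usage>
      </gpu_versions>
   </app>
</app_config>
```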

Greg _BE
Message 53189 - Posted: 28 Nov 2019, 19:24:12 UTC - in response to Message 53187.  
Last modified: 28 Nov 2019, 19:26:52 UTC

What you posted is a mix of app_config.xml and cc_config.xml.
Be so kind as to strictly follow the hints and examples on this page:
https://boinc.berkeley.edu/wiki/Client_configuration



You give me a page on CC config. I jumped down to what appears to be stuff related to app_config and copied this

<exclude_gpu>
<url>project_URL</url>
[<device_num>N</device_num>]
[<type>NVIDIA|ATI|intel_gpu</type>]
[<app>appname</app>]
</exclude_gpu>

project id is the gpugrid.net
device = 1
type is nvidia
removed app name since app name changes so much

*****GPUGRID: Notice from BOINC
Missing <app_config> in app_config.xml
11/28/2019 8:24:51 PM***

This is why I had it in the text.

Greg _BE
Message 53190 - Posted: 28 Nov 2019, 19:57:02 UTC

What the heck now???!!!

A slew of Exit child errors! What is this? Is this speed problems with OC?
Also getting restart on different device errors!!!
Now this...is that because something is not right in the app_config?

Zalster
Message 53191 - Posted: 28 Nov 2019, 20:05:23 UTC - in response to Message 53189.  



You give me a page on CC config. I jumped down to what appears to be stuff related to app_config and copied this

<exclude_gpu>
<url>project_URL</url>
[<device_num>N</device_num>]
[<type>NVIDIA|ATI|intel_gpu</type>]
[<app>appname</app>]
</exclude_gpu>

project id is the gpugrid.net
device = 1
type is nvidia
removed app name since app name changes so much

*****GPUGRID: Notice from BOINC
Missing <app_config> in app_config.xml
11/28/2019 8:24:51 PM***

This is why I had it in the text.



<cc_config>
<options>
<exclude_gpu>
<url>project_URL</url>
[<device_num>N</device_num>]
[<type>NVIDIA|ATI|intel_gpu</type>]
[<app>appname</app>]
</exclude_gpu>
</options>
</cc_config>

This needs to go into the BOINC folder, not the GPUGrid project folder.

Keith Myers
Message 53192 - Posted: 28 Nov 2019, 20:56:21 UTC - in response to Message 53190.  

If you are going to use an exclude, then you need to exclude every device that differs from the one you want to use. That is how to get rid of restart on different device errors. Or just set the switch between tasks to 360 minutes or greater and don't exit BOINC while the task is running.

The device number you use in the exclude statement is defined by how BOINC enumerates the cards in the Event Log at startup.

The exclude_gpu statement goes into cc_config.xml in the main BOINC directory, not a project directory.
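
As a rough sketch of such a file, assuming the card to be excluded is the one BOINC lists as device 1 in the Event Log (the numbering may differ on your host, so check the startup lines) and that the URL matches the one the client shows for GPUGRID:

```xml
<cc_config>
   <options>
      <exclude_gpu>
         <!-- URL is assumed; copy it exactly as the client shows it -->
         <url>http://www.gpugrid.net/</url>
         <!-- device number is assumed; take it from the Event Log at startup -->
         <device_num>1</device_num>
         <type>NVIDIA</type>
      </exclude_gpu>
   </options>
</cc_config>
```

Leaving out the app element applies the exclusion to every application of that project. Restarting the client is the safe way to make sure the change is picked up.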

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Message 53193 - Posted: 28 Nov 2019, 21:18:18 UTC - in response to Message 53190.  

What the heck now???!!!

A slew of Exit child errors! What is this? Is this speed problems with OC?
Also getting restart on different device errors!!!
Now this...is that because something is not right in the app_config?


I see two types of errors:

ERROR: src\mdsim\context.cpp line 322: Cannot use a restart file on a different device!


as the name says, exclusion not working. And

# Engine failed: Particle coordinate is nan


this usually indicates mathematical errors in the operations performed, memory corruption, or similar (or a faulty wu, unlikely in this case). Maybe a reboot will solve it.

computezrmle
Message 53194 - Posted: 28 Nov 2019, 21:28:01 UTC - in response to Message 53189.  

You give me a page on CC config.

I posted the official documentation for more than just cc_config.xml:
cc_config.xml
nvc_config.xml
app_config.xml

It's worth reading this page carefully a couple of times, as it provides all you need to know.

Long ago the page had a direct link to the app_config.xml section.
Unfortunately that link is not available any more but you may use your browser's find function.

Greg _BE
Message 53195 - Posted: 28 Nov 2019, 22:58:46 UTC - in response to Message 53192.  
Last modified: 28 Nov 2019, 23:08:50 UTC

If you are going to use an exclude, then you need to exclude every device that differs from the one you want to use. That is how to get rid of restart on different device errors. Or just set the switch between tasks to 360 minutes or greater and don't exit BOINC while the task is running.

The device number you use in the exclude statement is defined by how BOINC enumerates the cards in the Event Log at startup.

The exclude_gpu statement goes into cc_config.xml in the main BOINC directory, not a project directory.



Ok, on point 1, it was set to 360 already because that's a good time for LHC ATLAS to run to completion. I moved it up to 480 now to try and deal with this stuff in GPUGRID.

Point 2 - Going to try a cc_config with a triple exclude gpu code block for here and for 2 other projects. From what I read this should be possible.

Greg _BE
Message 53196 - Posted: 28 Nov 2019, 23:00:43 UTC - in response to Message 53193.  
Last modified: 28 Nov 2019, 23:01:58 UTC

What the heck now???!!!

A slew of Exit child errors! What is this? Is this speed problems with OC?
Also getting restart on different device errors!!!
Now this...is that because something is not right in the app_config?


I see two types of errors:

ERROR: src\mdsim\context.cpp line 322: Cannot use a restart file on a different device!


as the name says, exclusion not working. And

# Engine failed: Particle coordinate is nan


this usually indicates mathematical errors in the operations performed, memory corruption, or similar (or a faulty wu, unlikely in this case). Maybe a reboot will solve it.



One of these days I will get this problem solved. Driving me nuts!