New workunits

Keith Myers · Joined: 13 Dec 17 · Posts: 1424 · Credit: 9,189,946,190 · RAC: 0
Message 53175 - Posted: 28 Nov 2019, 0:49:23 UTC - in response to Message 53171.

Quote: Thanks, but I believe you misread me. The CPU is fine. The measurement is wrong.

No, I believe the measurement is incorrect, but the actual temperature is still going to be rather high. The Ryzen 3600 ships with the Wraith Stealth cooler, which is just the normal Intel solution of a copper plug embedded in an aluminum casting. It just doesn't have the ability to move heat away from the IHS quickly. You would see much better temps if you switched to the Wraith MAX or Wraith Prism cooler, which have real heat pipes and normal-sized fans.

The temps are correct for the Ryzen and Ryzen+ CPUs, but the k10temp driver that is stock in Ubuntu didn't get the change needed to accommodate the Ryzen 2 CPUs with the correct 0 temp offset. That is only shipping in the 5.3.4 or 5.4 kernels.
https://www.phoronix.com/scan.php?page=news_item&px=AMD-Zen2-k10temp-Patches

There are other solutions you could use in the meantime, like the ASUS temp driver if you have a compatible motherboard, or the zenpower driver, which can report the proper temp as well as the CPU power.
https://github.com/ocerman/zenpower
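Editor's sketch: the kernel-version point above can be checked with a few lines of Python. This only compares version numbers against the 5.3.4 threshold named in the post. It is an assumption that the running kernel's version string maps onto upstream versions; distribution kernels (Ubuntu included) may backport or omit patches, so treat the result as a hint, not proof. The function names are hypothetical.

```python
import re

# Threshold taken from the post above: the Zen 2 k10temp fix is said to ship in 5.3.4 / 5.4.
FIRST_FIXED_KERNEL = (5, 3, 4)

def parse_kernel_release(release: str) -> tuple:
    """Turn a release string such as '5.3.0-24-generic' into (5, 3, 0)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))

def has_zen2_k10temp_fix(release: str) -> bool:
    """True if the version number is at or above the threshold (ignores distro backports)."""
    return parse_kernel_release(release) >= FIRST_FIXED_KERNEL
```

On a live system the release string could be taken from `platform.release()` in the standard library, for example `has_zen2_k10temp_fix(platform.release())`.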

Greg _BE · Joined: 30 Jun 14 · Posts: 154 · Credit: 131,154,684 · RAC: 11
Message 53176 - Posted: 28 Nov 2019, 0:56:03 UTC - in response to Message 53154. Last modified: 28 Nov 2019, 1:05:06 UTC

Damn! Wishful thinking! How about 4.75. Too many numbers on my screen. It's because it shows 4075 and then I automatically drop in the . at 2 places, not realizing my mistake!

As far as temperature goes, I am only reporting the CPU temp at the sensor point. I have sent a webform to Arctic asking them what the temp would be at the radiator after passing by the CPU heatsink. The exhaust air does not feel anywhere near 80. I would put it at around 40C or less. I will see what they say and let you know.

Greg _BE · Joined: 30 Jun 14 · Posts: 154 · Credit: 131,154,684 · RAC: 11
Message 53177 - Posted: 28 Nov 2019, 1:06:19 UTC Last modified: 28 Nov 2019, 1:07:59 UTC

Toni - I keep getting this on random tasks:
unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
13:11:40 (25792): wrapper (7.9.26016): starting
13:11:40 (25792): wrapper: running acemd3.exe (--boinc input --device 1)
# Engine failed: Particle coordinate is nan
13:37:25 (25792): acemd3.exe exited; CPU time 1524.765625
13:37:25 (25792): app exit status: 0x1
13:37:25 (25792): called boinc_finish(195)

It runs 1524 seconds and bombs. What's up with that?

It also appears that BOINC or the task is ignoring the appconfig command to use only my 1050. I see another task that is starting on the 1050 and then jumping to the 650 since the 1050 is tied up with another project.

Jim1348 · Joined: 28 Jul 12 · Posts: 819 · Credit: 1,591,285,971 · RAC: 0
Message 53178 - Posted: 28 Nov 2019, 3:06:52 UTC - in response to Message 53175.

Quote: The temps are correct for the Ryzen and Ryzen+ cpus, but the k10temp driver which is stock in Ubuntu didn't get the change needed to accommodate the Ryzen 2 cpus with the correct 0 temp offset. That only is shipping in the 5.3.4 or 5.4 kernels.

Then it is probably reading 20C too high, and the CPU is really at 75C. Yes, I can improve on that. Thanks.

rod4x4 · Joined: 4 Aug 14 · Posts: 266 · Credit: 2,219,935,054 · RAC: 0
Message 53179 - Posted: 28 Nov 2019, 3:47:00 UTC - in response to Message 53177.

Quote: Tony - I keep getting this on random tasks
unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
13:11:40 (25792): wrapper (7.9.26016): starting
13:11:40 (25792): wrapper: running acemd3.exe (--boinc input --device 1)
# Engine failed: Particle coordinate is nan
13:37:25 (25792): acemd3.exe exited; CPU time 1524.765625
13:37:25 (25792): app exit status: 0x1
13:37:25 (25792): called boinc_finish(195)
It runs 1524 seconds and bombs. What's up with that? It also appears that BOINC or the task is ignoring the appconfig command to use only my 1050. I see another task that is starting on the 1050 and then jumping to the 650 since the 1050 is tied up with another project.

Quote: # Engine failed: Particle coordinate is nan

Two issues can cause this error:
1. An error in the task. This would mean all hosts fail the task. See this link for details: https://github.com/openmm/openmm/issues/2308
2. If other hosts do not fail the task, the error could be in the GPU clock rate. I have tested this on one of my hosts and am able to produce this error when I clock the GPU too high.

Quote: It also appears that BOINC or the task is ignoring the appconfig command to use only my 1050.

One setting to try: in BOINC Manager, Computing Preferences, set "Switch between tasks every xxx minutes" to between 800 and 9999. This should allow the task to finish on the same GPU it started on.

Can you post your app_config.xml file contents?
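Editor's sketch: the two-way triage described above (task fault versus host fault) can be written down as a small decision function. This is only an illustration of the reasoning in the post, assuming you have copied the outcome of each host's attempt from the work unit page by hand; `HostResult` and `triage_nan_failure` are hypothetical names, not part of BOINC or GPUGRID.

```python
from dataclasses import dataclass

@dataclass
class HostResult:
    host_id: int      # host that attempted the work unit
    succeeded: bool   # True if that host returned a valid result

def triage_nan_failure(results: list[HostResult], my_host_id: int) -> str:
    """Guess whether a 'Particle coordinate is nan' failure points at the task or at this host."""
    others = [r for r in results if r.host_id != my_host_id]
    if not others:
        return "unknown: no other host has returned this task yet"
    if all(not r.succeeded for r in others):
        # Cause 1 in the post: every host fails, so the task itself is suspect.
        return "task: every other host failed too, so the work unit is the likely culprit"
    # Cause 2 in the post: someone else finished it, so look at this host's GPU clock.
    return "host: another host succeeded, so check GPU clocks and temperatures here"
```

For example, `triage_nan_failure([HostResult(1, False), HostResult(2, True)], my_host_id=1)` returns the "host" verdict, which matches the overclocking case rod4x4 reproduced.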

Keith Myers · Joined: 13 Dec 17 · Posts: 1424 · Credit: 9,189,946,190 · RAC: 0
Message 53180 - Posted: 28 Nov 2019, 4:34:21 UTC

I've had a couple of the NaN errors: one where everyone errored out on the task, and another recently where the task errored out after running through to completion. I had already removed all overclocking on the card, but it must still have been too hot for the stock clock rate. It is my hottest card, being sandwiched in the middle of the GPU stack with very little airflow. I think I am going to have to start putting a negative clock offset on it to get the temps down and avoid any further NaN errors on that card.

rod4x4 · Joined: 4 Aug 14 · Posts: 266 · Credit: 2,219,935,054 · RAC: 0
Message 53181 - Posted: 28 Nov 2019, 5:52:47 UTC - in response to Message 53180. Last modified: 28 Nov 2019, 5:55:02 UTC

Quote: I've had a couple of the NaN errors. One where everyone errors out the task and another recently where it errored out after running through to completion. I had already removed all overclocking on the card but it still must have been too hot for the stock clockrate. It is my hottest card being sandwiched in the middle of the gpu stack with very little airflow. I am going to have to start putting in negative clock offset on it to get the temps down I think to avoid any further NaN errors on that card.

I would be interested to hear whether the underclocking / heat reduction fixes the issue. I am fairly confident this is the cause, but I need validation and more data from fellow volunteers to be sure.

Erich56 · Joined: 1 Jan 15 · Posts: 1171 · Credit: 12,662,148,501 · RAC: 4,374
Message 53183 - Posted: 28 Nov 2019, 6:32:03 UTC - in response to Message 53163.

Quote: http://www.gpugrid.net/show_host_detail.php?hostid=147723 http://www.gpugrid.net/show_host_detail.php?hostid=482132 ... Might it be that the Wrapper is slower on slower CPUs and therefore slows down the GPUs? Is this the experience from other users as well?

That's really interesting: the comparison of the two tasks above shows that the host with the GTX1660ti yields lower GFLOP figures (single as well as double) than the host with the GTX1650. In both hosts, the CPU is the same: Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz.

And now the even more surprising fact: by coincidence, exactly the same CPU is running in one of my hosts (http://www.gpugrid.net/show_host_detail.php?hostid=205584) with a GTX750ti, and here the GFLOP figures are even markedly higher than in the above-cited hosts with more modern GPUs.

So, is the conclusion now: the weaker the GPU, the higher the number of GFLOPs generated by the system?

Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0
Message 53184 - Posted: 28 Nov 2019, 9:28:26 UTC - in response to Message 53183. Last modified: 28 Nov 2019, 9:34:47 UTC

Quote: http://www.gpugrid.net/show_host_detail.php?hostid=147723 http://www.gpugrid.net/show_host_detail.php?hostid=482132 ... Might it be that the Wrapper is slower on slower CPUs and therefore slows down the GPUs? Is this the experience from other users as well?

Quote: that's really interesting: the comparison of above two tasks shows that the host with the GTX1660ti yields lower GFLOP figures (single as well as double) as the host with the GTX1650. In both hosts, the CPU ist the same: Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz. And now the even more surprising fact: by coincidence, exactly the same CPU is running in one of my hosts (http://www.gpugrid.net/show_host_detail.php?hostid=205584) with a GTX750ti - and here the GFLOP figures are even markedly higher than in the abeove cited hosts with more modern GPUs. So, is the conclusion now: the weaker the GPU, the higher the number of GFLOPs generated by the system?

The "Integer" (I hope it's called this way in English) speed measured is much higher under Linux than under Windows (the 1st and 2nd hosts use Linux, the 3rd uses Windows).
See the stats of my dual-boot host: Linux 139876.18 - Windows 12615.42
There's more than one order of magnitude difference between the two OSes on the same hardware, so one of them must be wrong.
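Editor's sketch: the "more than one order of magnitude" claim can be checked directly from the two figures quoted in the post. Nothing here is specific to BOINC; it is plain arithmetic on the posted numbers.

```python
import math

# Benchmark figures quoted in the post for the same dual-boot host.
linux_score = 139876.18
windows_score = 12615.42

ratio = linux_score / windows_score          # how many times faster Linux reports
orders_of_magnitude = math.log10(ratio)      # values above 1.0 mean more than 10x

print(f"ratio: {ratio:.2f}x, log10(ratio): {orders_of_magnitude:.2f}")
```

The ratio comes out at roughly 11x, so the Linux figure is indeed a little over one order of magnitude above the Windows figure.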

Greg _BE · Joined: 30 Jun 14 · Posts: 154 · Credit: 131,154,684 · RAC: 11
Message 53185 - Posted: 28 Nov 2019, 13:27:26 UTC - in response to Message 53176.

Quote: Damn! Wishful thinking! How about 4.75. Too many numbers on my screen. It's because it shows 4075 and then I automatically drop in the . at 2 places, not realizing my mistake! As far as temperature goes, I am only reporting the CPU temp at the sensor point. I have sent a webform to Arctic asking them what the temp would be at the radiator after passing by the CPU heatsink. The air temp of the exhaust air does not feel anywhere near 80. I would put it down around 40C or less. I will see what they say and let you know.

------------------------
Hi Greg

I talked to my colleague who is in the Liquid Freezer II Dev. Team and he said that these temps are normal with this kind of load. Installation sounds good to me.

With kind regards
Your ARCTIC Team,
Stephan
Arctic/Service Manager

Greg _BE · Joined: 30 Jun 14 · Posts: 154 · Credit: 131,154,684 · RAC: 11
Message 53186 - Posted: 28 Nov 2019, 13:30:27 UTC - in response to Message 53179.

Quote: Tony - I keep getting this on random tasks
unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
13:11:40 (25792): wrapper (7.9.26016): starting
13:11:40 (25792): wrapper: running acemd3.exe (--boinc input --device 1)
# Engine failed: Particle coordinate is nan
13:37:25 (25792): acemd3.exe exited; CPU time 1524.765625
13:37:25 (25792): app exit status: 0x1
13:37:25 (25792): called boinc_finish(195)
It runs 1524 seconds and bombs. What's up with that? It also appears that BOINC or the task is ignoring the appconfig command to use only my 1050. I see another task that is starting on the 1050 and then jumping to the 650 since the 1050 is tied up with another project.

Quote: # Engine failed: Particle coordinate is nan
Two issues can cause this error:
1. Error in the Task. This would mean all Hosts fail the task. See this link for details: https://github.com/openmm/openmm/issues/2308
2. If other Hosts do not fail the task, the error could be in the GPU Clock rate. I have tested this on one of my hosts and am able to produce this error when I Clock the GPU too high.
It also appears that BOINC or the task is ignoring the appconfig command to use only my 1050.
One setting to try....In Boinc Manager, Computer Preferences, set the "Switch between tasks every xxx minutes" to between 800 - 9999. This should allow the task to finish on the same GPU it started on.
Can you post your app_config.xml file contents?

--------------------
<?xml version="1.0"?>
<app_config>
<exclude_gpu>
<url>www.gpugrid.net</url>
<device_num>1</device_num>
<type>NVIDIA</type>
</exclude_gpu>
</app_config>

I was having some issues with LHC ATLAS and was in the process of pausing the tasks and then disconnecting the client. In the process I discovered that another instance popped up right after I closed the one I was looking at, and then yet another instance popped up with a message saying that there were two running. I shut that down and it shut down the last instance. This is a first for me. I have restarted my computer and will now wait and see what's going on.

computezrmle · Joined: 10 Jun 13 · Posts: 9 · Credit: 295,692,471 · RAC: 0
Message 53187 - Posted: 28 Nov 2019, 16:52:29 UTC - in response to Message 53186.

What you posted is a mix of app_config.xml and cc_config.xml.
Be so kind as to strictly follow the hints and examples on this page:
https://boinc.berkeley.edu/wiki/Client_configuration
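Editor's sketch: the mix-up described above (a cc_config.xml option placed inside an app_config.xml file) can be detected mechanically. The snippet below uses only Python's standard `xml.etree.ElementTree`. The `CC_CONFIG_ONLY` set is a deliberately tiny, illustrative list containing just the option discussed in this thread, not a complete inventory of BOINC options, and `misplaced_elements` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Options discussed in this thread that belong in cc_config.xml, not app_config.xml.
# Illustrative only; extend it from the BOINC client configuration page as needed.
CC_CONFIG_ONLY = {"exclude_gpu"}

def misplaced_elements(xml_text: str) -> list[str]:
    """Return cc_config-only tags found inside a file whose root is <app_config>."""
    root = ET.fromstring(xml_text)
    if root.tag != "app_config":
        return []
    return sorted({el.tag for el in root.iter() if el.tag in CC_CONFIG_ONLY})

# The file as posted earlier in the thread, with the browser's "-" markers removed.
posted = (
    '<?xml version="1.0"?>'
    "<app_config><exclude_gpu>"
    "<url>www.gpugrid.net</url><device_num>1</device_num><type>NVIDIA</type>"
    "</exclude_gpu></app_config>"
)
print(misplaced_elements(posted))  # prints ['exclude_gpu']
```

A non-empty result means the block should move to cc_config.xml in the main BOINC data directory, which is what later replies in the thread recommend.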

Greg _BE · Joined: 30 Jun 14 · Posts: 154 · Credit: 131,154,684 · RAC: 11
Message 53189 - Posted: 28 Nov 2019, 19:24:12 UTC - in response to Message 53187. Last modified: 28 Nov 2019, 19:26:52 UTC

Quote: What you posted is a mix of app_config.xml and cc_config.xml. Be so kind as to strictly follow the hints and examples on this page: https://boinc.berkeley.edu/wiki/Client_configuration

You give me a page on CC config. I jumped down to what appears to be stuff related to app_config and copied this:
<exclude_gpu>
<url>project_URL</url>
[<device_num>N</device_num>]
[<type>NVIDIA|ATI|intel_gpu</type>]
[<app>appname</app>]
</exclude_gpu>
Project ID is gpugrid.net
Device = 1
Type is NVIDIA
Removed the app name since the app name changes so much.

***GPUGRID: Notice from BOINC
Missing <app_config> in app_config.xml
11/28/2019 8:24:51 PM*

This is why I had it in the text.

Greg _BE · Joined: 30 Jun 14 · Posts: 154 · Credit: 131,154,684 · RAC: 11
Message 53190 - Posted: 28 Nov 2019, 19:57:02 UTC

What the heck now???!!! A slew of "Exit child" errors! What is this? Is this a speed problem with the OC?
I am also getting "restart on different device" errors!!! Is that because something is not right in the app_config?

Zalster · Joined: 26 Feb 14 · Posts: 211 · Credit: 4,496,324,562 · RAC: 0
Message 53191 - Posted: 28 Nov 2019, 20:05:23 UTC - in response to Message 53189.

Quote: You give me a page on CC config. I jumped down to what appears to be stuff related to app_config and copied this
<exclude_gpu>
<url>project_URL</url>
[<device_num>N</device_num>]
[<type>NVIDIA|ATI|intel_gpu</type>]
[<app>appname</app>]
</exclude_gpu>
project id is the gpugrid.net device = 1 type is nvidia removed app name since app name changes so much
***GPUGRID: Notice from BOINC Missing <app_config> in app_config.xml 11/28/2019 8:24:51 PM*
This is why I had it in the text.

<cc_config>
<exclude_gpu>
<url>project_URL</url>
[<device_num>N</device_num>]
[<type>NVIDIA|ATI|intel_gpu</type>]
[<app>appname</app>]
</exclude_gpu>
</cc_config>

This needs to go into the BOINC folder, not the GPUGrid project folder.

Keith Myers · Joined: 13 Dec 17 · Posts: 1424 · Credit: 9,189,946,190 · RAC: 0
Message 53192 - Posted: 28 Nov 2019, 20:56:21 UTC - in response to Message 53190.

If you are going to use an exclude, then you need to exclude every device that differs from the one you want to use. That is how to get rid of the "restart on different device" errors. Or just set "switch between tasks" to 360 minutes or greater and don't exit BOINC while the task is running.

The device number you use in the exclude statement is defined by how BOINC enumerates the cards in the Event Log at startup.

The exclude_gpu statement goes into cc_config.xml in the main BOINC directory, not a project directory.
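Editor's sketch: a cc_config.xml with one `exclude_gpu` block per unwanted device can be generated with Python's standard `xml.etree.ElementTree`. Several things here are assumptions and not taken from the thread: the `<options>` wrapper follows the layout on the BOINC client configuration page as best I recall it (the blocks posted in this thread omit it), the project URL must match the one the client shows for the project, and the device number must be read from the Event Log at startup. `build_cc_config` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

def build_cc_config(project_url: str, excluded_devices: list[int], gpu_type: str = "NVIDIA") -> str:
    """Build cc_config.xml text that excludes the given GPU device numbers for one project."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")   # assumed wrapper, see the lead-in
    for device in excluded_devices:
        block = ET.SubElement(options, "exclude_gpu")
        ET.SubElement(block, "url").text = project_url
        ET.SubElement(block, "device_num").text = str(device)
        ET.SubElement(block, "type").text = gpu_type
    ET.indent(root)                            # pretty-print; requires Python 3.9+
    return ET.tostring(root, encoding="unicode")

# Hypothetical values: substitute the URL and device number from your own Event Log.
print(build_cc_config("http://www.gpugrid.net/", [1]))
```

Passing several device numbers, or calling the function logic once per project, covers the "exclude every other device" advice above. The client needs to re-read its config files (or be restarted) before the change takes effect.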

Toni (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist) · Joined: 9 Dec 08 · Posts: 1006 · Credit: 5,068,599 · RAC: 0
Message 53193 - Posted: 28 Nov 2019, 21:18:18 UTC - in response to Message 53190.

Quote: What the heck now???!!! A slew of Exit child errors! What is this? Is this speed problems with OC? Also getting restart on different device errors!!! Now this...is that because something is not right in the app_config?

I see two types of errors:

ERROR: src\mdsim\context.cpp line 322: Cannot use a restart file on a different device!
As the name says, the exclusion is not working.

And

# Engine failed: Particle coordinate is nan
This usually indicates mathematical errors in the operations performed, memory corruption, or similar (or a faulty WU, unlikely in this case). Maybe a reboot will solve it.
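Editor's sketch: the two error signatures named above can be picked out of a task's stderr text with a simple substring lookup. The marker strings are copied from the messages in this thread; the advice strings paraphrase the replies given here, and `classify_stderr` is a hypothetical helper, not a BOINC or ACEMD function.

```python
# Error markers quoted in the thread, mapped to the explanation given by the posters.
KNOWN_ERRORS = {
    "Cannot use a restart file on a different device":
        "task resumed on a different GPU; fix the GPU exclusion or lengthen the task-switch interval",
    "Particle coordinate is nan":
        "numerical failure; check GPU clocks and temperature, or whether other hosts fail the same task",
}

def classify_stderr(stderr_text: str) -> list[str]:
    """Return the advice for every known error marker that appears in the stderr text."""
    return [advice for marker, advice in KNOWN_ERRORS.items() if marker in stderr_text]

sample = (
    "13:11:40 (25792): wrapper: running acemd3.exe (--boinc input --device 1)\n"
    "# Engine failed: Particle coordinate is nan\n"
    "13:37:25 (25792): called boinc_finish(195)\n"
)
for advice in classify_stderr(sample):
    print(advice)
```

Stderr text that contains neither marker yields an empty list, so unknown failures are simply left unclassified.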

computezrmle · Joined: 10 Jun 13 · Posts: 9 · Credit: 295,692,471 · RAC: 0
Message 53194 - Posted: 28 Nov 2019, 21:28:01 UTC - in response to Message 53189.

Quote: You give me a page on CC config.

I posted the official documentation, which covers more than just cc_config.xml:
cc_config.xml
nvc_config.xml
app_config.xml

It's worth reading this page carefully a couple of times, as it provides all you need to know.

Long ago the page had a direct link to the app_config.xml section. Unfortunately that link is no longer available, but you can use your browser's find function.

Greg _BE · Joined: 30 Jun 14 · Posts: 154 · Credit: 131,154,684 · RAC: 11
Message 53195 - Posted: 28 Nov 2019, 22:58:46 UTC - in response to Message 53192. Last modified: 28 Nov 2019, 23:08:50 UTC

Quote: If you are going to use an exclude, then you need to exclude all dissimilar devices than the one you want to use. That is how to get rid of restart on different device errors. Or just set the switch between tasks to 360minutes or greater and don't exit BOINC while the task is running. The device number you use in the exclude statement is defined by how BOINC enumerates the cards in the Event Log at startup. The gpu_exclude statement goes into cc_config.xml in the main BOINC directory, not a project directory.

OK, on point 1, it was set to 360 already because that's a good time for LHC ATLAS to run to completion. I have moved it up to 480 now to try to deal with this stuff in GPUGRID.

Point 2 - I am going to try a cc_config with a triple exclude_gpu code block, for here and for 2 other projects. From what I read, this should be possible.

Greg _BE · Joined: 30 Jun 14 · Posts: 154 · Credit: 131,154,684 · RAC: 11
Message 53196 - Posted: 28 Nov 2019, 23:00:43 UTC - in response to Message 53193. Last modified: 28 Nov 2019, 23:01:58 UTC

Quote: What the heck now???!!! A slew of Exit child errors! What is this? Is this speed problems with OC? Also getting restart on different device errors!!! Now this...is that because something is not right in the app_config?

Quote: I see two types of errors: ERROR: src\mdsim\context.cpp line 322: Cannot use a restart file on a different device! as the name says, exclusion not working. And # Engine failed: Particle coordinate is nan this usually indicates mathematical errors in the operations performed, memory corruption, or similar (or a faulty wu, unlikely in this case). Maybe a reboot will solve it.

One of these days I will get this problem solved. Driving me nuts!