Message boards : Multicore CPUs : Updates to the QMML app

Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 9 Dec 08
Posts: 747
Credit: 4,285,282
RAC: 0
Message 48835 - Posted: 6 Feb 2018 | 10:47:20 UTC

Two changes were made yesterday:

* CPU threads are limited to 4 (you should still be able to crunch multiple WUs at once, please check)
* Credits should be in line with other projects'

Let us know.

biodoc
Joined: 26 Aug 08
Posts: 121
Credit: 847,311,950
RAC: 89,086
Message 48841 - Posted: 6 Feb 2018 | 15:28:23 UTC

I have a 2600K processor with 6 logical cores available for the QC app.

This app_config.xml starts 3 concurrent QC apps with 2 cores each. We'll see how it goes.

<app_config>
  <app>
    <name>QC</name>
    <max_concurrent>3</max_concurrent>
  </app>
  <app_version>
    <app_name>QC</app_name>
    <plan_class>mt</plan_class>
    <avg_ncpus>2</avg_ncpus>
    <cmdline>--nthreads 2</cmdline>
  </app_version>
</app_config>
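
As a rough check of the arithmetic behind this config (3 concurrent tasks x 2 cores = 6 logical cores), here is a minimal Python sketch that reads the two values out of an app_config.xml string and multiplies them. The helper name `cores_requested` is made up for illustration and is not part of BOINC; the sketch only parses text and does not talk to the client.

```python
import xml.etree.ElementTree as ET

# The app_config.xml from the post above, as a string.
APP_CONFIG = """\
<app_config>
  <app>
    <name>QC</name>
    <max_concurrent>3</max_concurrent>
  </app>
  <app_version>
    <app_name>QC</app_name>
    <plan_class>mt</plan_class>
    <avg_ncpus>2</avg_ncpus>
    <cmdline>--nthreads 2</cmdline>
  </app_version>
</app_config>
"""


def cores_requested(xml_text, app_name):
    """Return max_concurrent * avg_ncpus for one app (hypothetical helper)."""
    root = ET.fromstring(xml_text)

    max_concurrent = None
    for app in root.findall("app"):
        if app.findtext("name") == app_name:
            max_concurrent = int(app.findtext("max_concurrent"))

    avg_ncpus = None
    for version in root.findall("app_version"):
        if version.findtext("app_name") == app_name:
            avg_ncpus = float(version.findtext("avg_ncpus"))

    if max_concurrent is None or avg_ncpus is None:
        raise ValueError(f"incomplete settings for app {app_name!r}")
    return max_concurrent * avg_ncpus


if __name__ == "__main__":
    available = 6  # logical cores set aside for QC, as stated in the post
    needed = cores_requested(APP_CONFIG, "QC")
    print(f"QC may use up to {needed:g} of {available} logical cores")
```

If the product is larger than the cores you want to give to QC, lower either `max_concurrent` or `avg_ncpus` (and the matching `--nthreads` value).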

biodoc
Joined: 26 Aug 08
Posts: 121
Credit: 847,311,950
RAC: 89,086
Message 48844 - Posted: 6 Feb 2018 | 16:48:38 UTC

It seems one task of 3 is progressing much more slowly than the other 2 so I'll reduce max concurrent tasks to 2 in the app_config.

Is there a limit to using 4 logical cores per processor?

klepel
Joined: 23 Dec 09
Posts: 153
Credit: 2,290,296,388
RAC: 1,522,493
Message 48846 - Posted: 6 Feb 2018 | 17:57:45 UTC - in response to Message 48835.

Toni, there is another reason I have abstained from the QMML app on the AMD 1700X computer, apart from failing WUs and the computer freezing:

The QMML app clogs the BOINC scheduler and therefore blocks the other projects from downloading additional WUs automatically when required (I run only one instance of the QMML app on this computer). So the computer ends up running only one instance of the QMML app and a GPU app. When I manually ask for additional WUs from the other projects, they do get downloaded. So the choice is either to run only GPUGRID on this computer or to run GPUGRID only on the GPU.

* CPU threads are limited to 4 (you should still be able to crunch multiple WUs at once, please check)

This would not have been necessary, as it could easily be changed with app_config, and a working app_config is already circulating in the forums. It now prevents power users from making their own adjustments, i.e. letting a single QMML app WU run on all threads. See the problem above.

Keith Myers
Joined: 13 Dec 17
Posts: 136
Credit: 105,816,438
RAC: 780,687
Message 48848 - Posted: 6 Feb 2018 | 21:54:16 UTC

I haven't had any issue with the QC tasks running alongside my usual SETI tasks. I reduce the number of Seti cpu tasks when I run the QC task set to use 4 cores. The Seti and Einstein projects download and run tasks normally without any manual intervention.

I did have to increase my disk space for the QC task I ran this morning when GPUGrid complained it needed an additional 600 MB of space. I have a larger than normal amount of Seti work on board today to make it through the scheduled outage.

Maybe the disk space needs to be increased or the resource share changed.

I have an AMD 1800X in my Ryzen cruncher.

klepel
Joined: 23 Dec 09
Posts: 153
Credit: 2,290,296,388
RAC: 1,522,493
Message 48849 - Posted: 6 Feb 2018 | 22:20:29 UTC

Thanks Keith for your comments.

I do not run SETI on the CPU, I run PRIMEGRID and SETI on the GPU only.

I do run ODLK1 on the CPU, as this is the only project that does not frequently freeze my computer/CPU (once a day).

The ODLK1 tasks, however, do not download when I am crunching the QMML app.

The freezing might also be because I have mildly overclocked the CPU (3770 MHz) and run the RAM at 2966 MHz (RAM is rated at 3000 MHz), but I have not had the time to investigate further.


Keith Myers
Joined: 13 Dec 17
Posts: 136
Credit: 105,816,438
RAC: 780,687
Message 48852 - Posted: 7 Feb 2018 | 0:38:42 UTC - in response to Message 48849.

I've gotten pretty good at tuning Ryzen for 24/7 distributed computing. I have had the 1700X since launch in March of last year.

I run the 1700X at 3.9 GHz and the memory at 3333 MHz CL14 with fast timings.

The 1800X, being newer and better made, runs at 3.95 GHz and 3333 MHz CL14 with fast timings.

Do you get BSODs or black screens? BSODs are almost invariably due to aggressive memory clocks, memory instability or IMC weakness.

Black screens, (computer appears frozen, no display or keyboard or mouse input recognized) with no error logs generated are caused by cpu lockup because of insufficient VDDCR cpu voltage for the desired cpu clocks.

Both my Ryzens run for weeks without errors. Only reason uptime is not longer is because of OS updates or whatever.

klepel
Joined: 23 Dec 09
Posts: 153
Credit: 2,290,296,388
RAC: 1,522,493
Message 48854 - Posted: 7 Feb 2018 | 4:07:03 UTC - in response to Message 48852.

Black screens, (computer appears frozen, no display or keyboard or mouse input recognized) “with no error logs generated” are caused by cpu lockup because of insufficient VDDCR cpu voltage for the desired cpu clocks.

It is a black screen with the symptoms you describe. I would even say I am not overclocking at all: I have an ASUS Prime X370-Pro motherboard, and the BIOS settings ask me whether I am on water cooling, which I am (Corsair Liquid CPU Cooler H60), and then it gives me 3770 MHz. That is all I did. Similarly with the RAM: I just set the frequency in the BIOS to the frequency in the RAM specification, nothing else.

So if you could help with overclocking or with stabilizing the system at these frequencies, it would be highly appreciated.

Then I will try to switch back to the QMML app.

Keith Myers
Joined: 13 Dec 17
Posts: 136
Credit: 105,816,438
RAC: 780,687
Message 48855 - Posted: 7 Feb 2018 | 5:06:23 UTC - in response to Message 48854.

We probably should converse via PM so as to not pollute or hijack the thread.

Keith Myers
Joined: 13 Dec 17
Posts: 136
Credit: 105,816,438
RAC: 780,687
Message 48856 - Posted: 7 Feb 2018 | 5:34:48 UTC

Wow, how did you manage to get 1200-1300 credits for your QC tasks today? What's your secret?

NUCCpod_NAPTIMELABS_01
Joined: 18 Aug 17
Posts: 6
Credit: 112,376,348
RAC: 540,840
Message 48860 - Posted: 7 Feb 2018 | 7:54:45 UTC - in response to Message 48835.

So far with testing, I have only been able to run QMML work units on systems with up to 4 cores.

On any of my systems with 8 or 16 cores, attempting to run multiple QMMLs, they all end prematurely with a computational error.

mmonnin
Joined: 2 Jul 16
Posts: 181
Credit: 360,305,389
RAC: 1,074,774
Message 48861 - Posted: 7 Feb 2018 | 11:28:32 UTC - in response to Message 48860.

So far with testing, I have only been able to run QMML work units on systems with up to 4 cores.

On any of my systems with 8 or 16 cores, attempting to run multiple QMMLs, they all end prematurely with a computational error.


Probably because there are issues if two tasks start up at the same time. You'll have to limit QC tasks to 1 concurrent task at a time.

klepel
Joined: 23 Dec 09
Posts: 153
Credit: 2,290,296,388
RAC: 1,522,493
Message 48865 - Posted: 7 Feb 2018 | 16:32:31 UTC

@Keith: Thanks for your PM.
@Keith: I did nothing! Toni changed the credits and my two computers have higher credits.

@NUCCpod_NAPTIMELABS_01 and mmonnin: It is correct that you have to limit the QMML app to only one concurrent task at a time. Then it works on my AMDs. There is an app_config circulating in the forums that works.

Keith Myers
Joined: 13 Dec 17
Posts: 136
Credit: 105,816,438
RAC: 780,687
Message 48868 - Posted: 7 Feb 2018 | 17:03:50 UTC - in response to Message 48865.

@klepel: Well Toni didn't change the credits for me it appears.

Keith Myers
Joined: 13 Dec 17
Posts: 136
Credit: 105,816,438
RAC: 780,687
Message 48875 - Posted: 7 Feb 2018 | 19:42:17 UTC

I just downloaded a couple more QC tasks hoping that the one I did yesterday was a fluke or carryover from the "old" tasks with the tiny credit.

Nope. Still getting very little credit for these QC tasks and not worth tying up 4 cores.

Haven't a clue why I get such little credit and others are getting 24 times more for the same cpu elapsed times.

Run time (s): 1,854.06
CPU time (s): 7,224.95
Validate state: Valid
Credit: 47.13

Jim1348
Joined: 28 Jul 12
Posts: 640
Credit: 1,207,282,694
RAC: 96,054
Message 48878 - Posted: 7 Feb 2018 | 23:01:29 UTC - in response to Message 48875.

Haven't a clue why I get such little credit and others are getting 24 times more for the same cpu elapsed times.

Not consistently. They vary all over the place. Your values are a little low for the moment, but you need more data points to draw much of a conclusion. Mine vary a lot too (Ryzen 1700, not overclocked).
http://www.gpugrid.net/results.php?hostid=452287&offset=0&show_names=0&state=3&appid=

Note that those are with two cores per work unit, but that should not affect the credit per work unit, in a perfect world at least. And note that the longer work units often get less credit than the shorter ones, so the credit system is strange in any case.

I think the points are a little more consistent on my Intel machines, and probably a little higher than on the Ryzen machine on average, though I have not tried to calculate it yet.
i7-3770: http://www.gpugrid.net/results.php?hostid=433866&offset=0&show_names=0&state=3&appid=
i7-4790: http://www.gpugrid.net/results.php?hostid=334241&offset=0&show_names=0&state=3&appid=

However, I normally pay no attention to credits, and the Ryzen seems to run about as fast as the Intels as far as I can see at the moment, which is the only thing that matters to me.

Keith Myers
Joined: 13 Dec 17
Posts: 136
Credit: 105,816,438
RAC: 780,687
Message 48882 - Posted: 8 Feb 2018 | 1:13:22 UTC - in response to Message 48878.

Just give me one QC task that gets as much credit as yours or klepel's and I would have hope. Alas, the 110 credits I got yesterday for Task 16998146 is the most I've ever seen. My credits have ranged from 6 to 47 over 35 tasks so far, with the one above the only outlier.

Keith Myers
Joined: 13 Dec 17
Posts: 136
Credit: 105,816,438
RAC: 780,687
Message 48885 - Posted: 8 Feb 2018 | 20:29:14 UTC

OK, so I once again crunched some more QC tasks. This time I reduced the core count to two to see if it made any difference. Nope. Still extremely low credit compared to everyone else who has posted in these threads.

Run time (s): 4,362.76
CPU time (s): 8,635.24
Validate state: Valid
Credit: 91.52

Run time (s): 4,270.13
CPU time (s): 8,440.52
Validate state: Valid
Credit: 90.31

Task 17003892
Task 17003905
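
To make the numbers in these two posts easier to compare, here is a small Python sketch that turns each result into credit per CPU-hour and into an effective thread count (CPU time divided by run time). It assumes the run and CPU times are in seconds; the function names are made up for illustration, and the figures are copied from the results quoted in Messages 48875 and 48885.

```python
def credit_per_cpu_hour(credit, cpu_seconds):
    """Credit granted per hour of CPU time (times assumed to be seconds)."""
    return credit / (cpu_seconds / 3600.0)


def effective_threads(cpu_seconds, run_seconds):
    """Average number of busy threads: CPU time divided by wall-clock time."""
    return cpu_seconds / run_seconds


# (run time s, CPU time s, credit) copied from the posted results.
RESULTS = [
    (1854.06, 7224.95, 47.13),  # 4-core run, Message 48875
    (4362.76, 8635.24, 91.52),  # 2-core run, Message 48885
    (4270.13, 8440.52, 90.31),  # 2-core run, Message 48885
]

if __name__ == "__main__":
    for run_s, cpu_s, credit in RESULTS:
        rate = credit_per_cpu_hour(credit, cpu_s)
        threads = effective_threads(cpu_s, run_s)
        print(f"credit={credit:7.2f}  threads~{threads:.2f}  credit/CPU-hour~{rate:.1f}")
```

Comparing credit per CPU-hour rather than credit per task removes the effect of task length and thread count, which makes it easier to tell whether a host really is being paid less.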

DRSMT
Joined: 23 Feb 17
Posts: 15
Credit: 296,276,818
RAC: 207,206
Message 48886 - Posted: 9 Feb 2018 | 9:05:03 UTC

The problem with multiple WUs starting at the same time should be fixed (otherwise a lot of calculation errors will be produced).

Keith Myers
Joined: 13 Dec 17
Posts: 136
Credit: 105,816,438
RAC: 780,687
Message 48889 - Posted: 9 Feb 2018 | 18:42:29 UTC - in response to Message 48886.

Or just set max_concurrent to 1 and avoid the issue entirely until the problem with the application gets resolved.

mmonnin
Joined: 2 Jul 16
Posts: 181
Credit: 360,305,389
RAC: 1,074,774
Message 48890 - Posted: 9 Feb 2018 | 21:58:26 UTC - in response to Message 48889.
Last modified: 9 Feb 2018 | 22:03:54 UTC

Or just set max_concurrent to 1 and avoid the issue entirely until the problem with the application gets resolved.


Until the BM queue fills up with just QC tasks and all other cores go idle. Better to just avoid the entire application in the first place. If it's not worth the admins' time to fix known issues that cause errors 100% of the time in known situations, then it's not worth the donors' time to run.

Keith Myers
Joined: 13 Dec 17
Posts: 136
Credit: 105,816,438
RAC: 780,687
Message 48891 - Posted: 9 Feb 2018 | 22:44:25 UTC - in response to Message 48890.

Every donor is different. I don't have GPUGrid as my sole project so the crunchers never go idle, there is always work being done for someone.

As with most projects, there is always a shortage of manpower, money or time for keeping applications current and working.

Dayle Diamond
Joined: 5 Dec 12
Posts: 77
Credit: 1,341,620,139
RAC: 548,107
Message 48925 - Posted: 13 Feb 2018 | 3:27:21 UTC

I'll be watching for updates, but for now I'm also turning off CPU tasks.

Scheduler is telling batches to activate all at once. Rows of errors. I acknowledge that user fixes have been recommended but I don't want to program a solution that's beyond my capacity to correct once circumstances change.

Hope to be back soon!

biodoc
Joined: 26 Aug 08
Posts: 121
Credit: 847,311,950
RAC: 89,086
Message 48939 - Posted: 14 Feb 2018 | 15:48:12 UTC

I set up the quantum chem app on a second computer. I started with the following app_config.xml file:

<app_config>
  <app>
    <name>QC</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app_version>
    <app_name>QC</app_name>
    <plan_class>mt</plan_class>
    <avg_ncpus>4</avg_ncpus>
    <cmdline>--nthreads 4</cmdline>
  </app_version>
</app_config>


Once the Work Units downloaded and 1 WU started w/4 threads, I edited the app_config.xml to change <max_concurrent>1</max_concurrent> to <max_concurrent>2</max_concurrent>. Then I restarted the boinc client and now I have 2 work units running simultaneously with 4 threads each. So far so good.
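
For anyone who would rather script the edit than change the file by hand, the step described above can be sketched in Python roughly as follows. This is only an illustration: `set_max_concurrent` is a made-up helper, it works on the XML text rather than on a real file path, and the BOINC client still has to be restarted afterwards, as described in the post.

```python
import xml.etree.ElementTree as ET

# Starting configuration from the post: one task at a time, 4 threads.
EXAMPLE = """\
<app_config>
  <app>
    <name>QC</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app_version>
    <app_name>QC</app_name>
    <plan_class>mt</plan_class>
    <avg_ncpus>4</avg_ncpus>
    <cmdline>--nthreads 4</cmdline>
  </app_version>
</app_config>
"""


def set_max_concurrent(xml_text, app_name, value):
    """Return the XML text with <max_concurrent> set for one app (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    for app in root.findall("app"):
        if app.findtext("name") == app_name:
            node = app.find("max_concurrent")
            if node is None:
                # Add the element if the <app> entry did not have one yet.
                node = ET.SubElement(app, "max_concurrent")
            node.text = str(value)
            return ET.tostring(root, encoding="unicode")
    raise ValueError(f"no <app> entry named {app_name!r}")


if __name__ == "__main__":
    # Raise the limit from 1 to 2 once the first task is already running.
    print(set_max_concurrent(EXAMPLE, "QC", 2))
```

The other settings (`avg_ncpus`, `cmdline`) are left untouched, so each task keeps its 4 threads.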

Keith Myers
Joined: 13 Dec 17
Posts: 136
Credit: 105,816,438
RAC: 780,687
Message 48948 - Posted: 14 Feb 2018 | 21:53:11 UTC

But there is still the issue that whenever your QC tasks finish and more than 2 tasks are downloaded, your system can try to start two tasks at the same time, and then both will fail. There is no guaranteed method to stagger the start of multiple QC tasks on an unattended system running on auto.

The only way to get around this situation is to do exactly what you posted. But that requires your intervention at each new task startup.

Or just set max_concurrent to 1 and be done with it.

Conan
Joined: 25 Mar 09
Posts: 24
Credit: 461,418
RAC: 825
Message 48950 - Posted: 14 Feb 2018 | 23:10:33 UTC
Last modified: 14 Feb 2018 | 23:12:52 UTC

I am not seeing this situation at all.
My AMD Linux Fedora systems download more than one at a time but only ever try to run one at a time.
Even on my 8 core (+ 8 HT) only one starts and other projects keep running.

I am also running other applications but on a 4 core computer when GPU Grid starts it is the only thing that runs and only one at a time.

Not trying to control how things are run.

So I don't know what could be causing problems on your computers.

BOINC versions are 7.4.25 and 7.6.22

Conan

Dayle Diamond
Joined: 5 Dec 12
Posts: 77
Credit: 1,341,620,139
RAC: 548,107
Message 48954 - Posted: 15 Feb 2018 | 1:19:37 UTC

Quick update: I checked my logs and this thread again to see if I could resume tasks. To my surprise, tasks were resumed while I had still opted out.

ACEMD short runs (2-3 hours on fastest card): yes
ACEMD long runs (8-12 hours on fastest GPU): yes
ACEMD Beta: yes
Quantum Chemistry (Linux, CPU): no
Python Runtime : yes

I don't know what's broken but it's not cool.

Keith Myers
Joined: 13 Dec 17
Posts: 136
Credit: 105,816,438
RAC: 780,687
Message 48955 - Posted: 15 Feb 2018 | 2:55:35 UTC - in response to Message 48950.

I am not seeing this situation at all.
My AMD Linux Fedora systems download more than one at a time but only ever try to run one at a time.
Even on my 8 core (+ 8 HT) only one starts and other projects keep running.

I am also running other applications but on a 4 core computer when GPU Grid starts it is the only thing that runs and only one at a time.

Not trying to control how things are run.

So I don't know what could be causing problems on your computers.

BOINC versions are 7.4.25 and 7.6.22

Conan

Several posters have reported that starting two QC tasks at the same time, or within 5 seconds of each other, causes the first task to error out.

See Message 48589
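
If you want to check whether this is what hit your own host, one option is to note the start times of your QC tasks and look for pairs that began within 5 seconds of each other. The Python sketch below does only that comparison; the function name and the sample timestamps are invented for illustration, and the 5-second window is simply the figure reported in this thread.

```python
def starts_too_close(start_times, min_gap=5.0):
    """Return consecutive start-time pairs (in seconds) at most min_gap apart."""
    ordered = sorted(start_times)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a <= min_gap]


if __name__ == "__main__":
    # Made-up start times, in seconds since some reference point.
    starts = [0.0, 2.0, 60.0, 130.0, 133.5]
    for first, second in starts_too_close(starts):
        print(f"tasks started {second - first:.1f} s apart (at {first} and {second})")
```

An empty result means no two tasks in the list started inside the window.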

captainjack
Joined: 9 May 13
Posts: 145
Credit: 1,004,138,167
RAC: 1,157,808
Message 48956 - Posted: 15 Feb 2018 | 3:20:37 UTC

Dayle Diamond said:

Quick update: I checked my logs and this thread again to see if I could resume tasks. To my surprise, tasks were resumed while I had still opted out.


Dayle, do you perchance have the preference box checked for
If no work for selected applications is available, accept work from other applications?


mmonnin
Joined: 2 Jul 16
Posts: 181
Credit: 360,305,389
RAC: 1,074,774
Message 48959 - Posted: 15 Feb 2018 | 3:50:20 UTC

If one has more than 4 cores and is running max_concurrent = 1, there is still the chance that BOINC Manager will flood your queue with QC tasks only and nothing from the other projects, especially when just starting up that setup, before BM gets a better handle on the resource share, which is a long-term resource share.

I've personally had this happen using max concurrent on another project. 4 threads working, 28 idle. It's far from a perfect solution.

Dayle Diamond
Joined: 5 Dec 12
Posts: 77
Credit: 1,341,620,139
RAC: 548,107
Message 48982 - Posted: 18 Feb 2018 | 20:19:20 UTC - in response to Message 48956.

Dayle, do you perchance have the preference box checked for
If no work for selected applications is available, accept work from other applications?


Oops. Thank you ><.

Happy Crunching, I'll be back with these tasks once things have stabilized.

biodoc
Joined: 26 Aug 08
Posts: 121
Credit: 847,311,950
RAC: 89,086
Message 48993 - Posted: 19 Feb 2018 | 12:45:44 UTC - in response to Message 48939.

I set up the quantum chem app on a second computer. I started with the following app_config.xml file:

<app_config>
  <app>
    <name>QC</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app_version>
    <app_name>QC</app_name>
    <plan_class>mt</plan_class>
    <avg_ncpus>4</avg_ncpus>
    <cmdline>--nthreads 4</cmdline>
  </app_version>
</app_config>


Once the Work Units downloaded and 1 WU started w/4 threads, I edited the app_config.xml to change <max_concurrent>1</max_concurrent> to <max_concurrent>2</max_concurrent>. Then I restarted the boinc client and now I have 2 work units running simultaneously with 4 threads each. So far so good.


This approach, although a bit cumbersome, works. I've completed 189 Work Units with no errors.

I think special badges for Quantum Chemistry contribution would be a draw for more users.

NUCCpod_NAPTIMELABS_01
Joined: 18 Aug 17
Posts: 6
Credit: 112,376,348
RAC: 540,840
Message 49065 - Posted: 22 Feb 2018 | 2:23:34 UTC

So, forgive me if this has been answered.

Is this planned to be fixed at some point?

Keith Myers
Joined: 13 Dec 17
Posts: 136
Credit: 105,816,438
RAC: 780,687
Message 49066 - Posted: 22 Feb 2018 | 2:41:29 UTC - in response to Message 49065.

No, it hasn't been answered, or even addressed by the developer as far as I can tell. It seems the resources lately have gone into deploying and debugging the WSL QC app.
