Message boards : Multicore CPUs : New batch of QC tasks (QMML)

Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 9 Dec 08
Posts: 619
Credit: 4,273,184
RAC: 0
Message 48356 - Posted: 13 Dec 2017 | 17:30:24 UTC

These are called QMML and are rather experimental (more dependencies). Let's see how they work.

captainjack
Joined: 9 May 13
Posts: 116
Credit: 852,083,647
RAC: 923,632
Message 48358 - Posted: 13 Dec 2017 | 21:21:51 UTC

Toni,

I have one of these that has been running for over 2 hours and so far looks like it is only using one CPU (thread). It has 4 threads allocated to each task.

There are also a number of warning messages in the stderr.txt that look like this:

/var/lib/boinc-client/projects/www.gpugrid.net/miniconda/envs/qmml/lib/python3.6/site-packages/tables/path.py:112: NaturalNameWarning: object name is not a valid Python identifier: '122'; it does not match the pattern ``^[a-zA-Z_][a-zA-Z0-9_]*$``; you will not be able to use natural naming to access this object; using ``getattr()`` will still work, though
NaturalNameWarning)


You should be able to see all of them when the task uploads.

Let me know if you need more info.

Toni
Message 48359 - Posted: 13 Dec 2017 | 22:10:31 UTC - in response to Message 48358.
Last modified: 13 Dec 2017 | 22:14:22 UTC

Thanks, the warnings are expected and harmless.
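
For anyone wondering why the warning fires at all: the pattern quoted in the message only accepts names that look like Python identifiers, and '122' starts with a digit. Here is a minimal sketch using only the standard library. The helper name is made up for illustration, and this is not the PyTables code itself, just the same regular expression applied by hand.

```python
import re

# Pattern copied from the NaturalNameWarning text quoted earlier in the thread.
NATURAL_NAME = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*$")


def is_natural_name(name: str) -> bool:
    """Hypothetical helper: True if `name` matches the quoted pattern."""
    return NATURAL_NAME.match(name) is not None


# '122' fails because it starts with a digit; the other two pass.
for candidate in ("122", "mol_122", "_122"):
    print(candidate, is_natural_name(candidate))
```

As the warning text itself says, such objects can still be reached with getattr(), which fits with the warnings being harmless here.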

The thread allocation has a bug. I would have expected it to use more threads than allocated, not fewer, but hey. I'll be debugging.

Also I hope suspend/resume and the progress bar work (more or less), unlike the old "plain" QC tasks.

STARBASEn
Joined: 17 Feb 09
Posts: 25
Credit: 349,637,412
RAC: 1,347,315
Message 48362 - Posted: 14 Dec 2017 | 0:44:20 UTC

Subj: Observation of QC CPU WU's Core Utilization

Regarding the last go around several weeks ago with the QC CPU WU's that ran successfully on my 8 and 4 core cpu's, the following was observed.

4-core (Phenom II 3 GHz) cpu: core utilization was consistently near 100% on all 4 cores. Work completed in about 40 minutes of calendar time per WU.

8-core (FX-8350 4 GHz) cpu's (2): 4-cores (alternatively) were utilized at 100% with the remaining cores at much lesser utilization. Work completed with about 20 minutes real time per WU.

With the most recent QC WU's (12/13/2017), one FX-8350 errors out consistently and will not run the WU's (maybe it requires software or drivers not currently installed), and the other FX-8350 runs them but at a substantially lower core utilization rate than the prior WU batches. So far, the first WU in progress is less than 40% complete with about 2 hours of calendar time invested. I am not sure whether these recent WU's are of the same length as the earlier ones, however.

I have uploaded photos of ksysguard graphic cpu utilization that can be observed if interested with addresses below.

I am sure that this is a work in progress and that bugs need to be resolved, but I would conclude so far that these WU's process efficiently on a 4-core system. On an 8-core system, all 8 cores should be fully utilized to make it worthwhile sacrificing 8 cores to a single WU, when 8 WU's from other projects use all 8 at 100% and are thereby much more efficient. I haven't tried suspend/resume yet but will when the opportunity arises. The previous poster appears correct that thread utilization is one issue.

Screenshots:
http://members.toast.net/obc/computing/grid_computing/images/QC_4-core_cpu.png
http://members.toast.net/obc/computing/grid_computing/images/QC_8-core_cpu.png
http://members.toast.net/obc/computing/grid_computing/images/QC-12-13-2017.png

(Sorry, crude way to present photos but wanted to get them up for anyone interested)

mmonnin
Joined: 2 Jul 16
Posts: 75
Credit: 93,757,492
RAC: 1,426,180
Message 48363 - Posted: 14 Dec 2017 | 0:52:44 UTC
Last modified: 14 Dec 2017 | 0:54:04 UTC

Mine is also just a bit less than 1 thread with 7 available for CPU usage. Good thing I have another client available to keep the CPU busy. If the progress bar is correct it will take about 9.5 hours on 1950x at 3.75 GHz.

Toni
Message 48364 - Posted: 14 Dec 2017 | 8:44:09 UTC - in response to Message 48363.

The next batch (QMML313a) should respect the number of threads requested by your client.

Conan
Joined: 25 Mar 09
Posts: 19
Credit: 89,737
RAC: 51
Message 48365 - Posted: 14 Dec 2017 | 11:26:38 UTC

The Multiple threaded work units that were sent out last month worked fine for me with no issues.

These new ones however are all failing (so far on two different 64bit Linux machines) with this error

ERROR conda.core.link:_execute_actions(337): An error occurred while installing package 'psi4::gcc-5-5.2.0-1'.
LinkError: post-link script failed for package psi4::gcc-5-5.2.0-1
running your command again with `-v` will provide additional information
location of failed script: /home/Conan/BOINC/projects/www.gpugrid.net/miniconda/envs/qmml/bin/.gcc-5-post-link.sh

When checking this path I found that there is nothing in the /envs/ folder, which is probably where the job is failing.

Conan

mmonnin
Message 48366 - Posted: 14 Dec 2017 | 11:58:44 UTC

Hmm 1st one completed for me and the 2nd one is at around 86%.

captainjack
Message 48370 - Posted: 14 Dec 2017 | 19:36:39 UTC

Toni said,

The next batch (QMML313a) should respect the number of threads requested by your client.


It looks like this one does respect the number of threads requested by my client. My app_config specifies 4 threads and it looks to be using 4 threads.
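
For readers who have not set up an app_config before, here is a rough sketch of how such a file could be generated. The element names (app_config, app_version, app_name, plan_class, avg_ncpus) are the ones BOINC's client configuration uses. The short application name "QC" is only a placeholder assumption, so check the real name in your client's event log or client_state.xml; the "mt" plan class is taken from the "Quantum Chemistry v3.14 (mt)" label seen in this thread.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, plan_class: str, ncpus: int) -> str:
    """Return app_config.xml text asking for `ncpus` CPUs per task."""
    root = ET.Element("app_config")
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    ET.SubElement(version, "avg_ncpus").text = str(ncpus)
    if hasattr(ET, "indent"):  # pretty-printing exists on Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# "QC" is a hypothetical short name used for illustration only.
print(build_app_config("QC", "mt", 4))
```

The resulting text would go into a file named app_config.xml in the project's directory (the projects/www.gpugrid.net folder that shows up in the paths above), after which the client has to re-read its config files. Whether the application then really runs with that many threads depends on the application itself, as the earlier batch showed.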

Let me know if you need more info.

Jim1348
Joined: 28 Jul 12
Posts: 490
Credit: 1,131,369,187
RAC: 11,122
Message 48371 - Posted: 14 Dec 2017 | 19:56:33 UTC

I am not able to get QC on my Ryzen 1700 machine running Ubuntu 17.10. I just get a "No tasks are available for Quantum Chemistry" message when I request them. However, I am able to get QC on my i7 3770 machine running Ubuntu 16.04 (both machines have BOINC 7.8.3). Both machines are set to the same profile (work), so they should be treated identically.

But I see that some people with AMD machines get work. Is this a bug or a feature?

Toni
Message 48373 - Posted: 14 Dec 2017 | 23:43:10 UTC - in response to Message 48371.

The reason why otherwise similar machines get/do not get work completely baffles me. I don't think it's related to the maker of the CPU. Perhaps it has to do with the history of tasks, host reliability or some such. In this respect BOINC is of no help.

bcavnaugh
Joined: 8 Nov 13
Posts: 42
Credit: 347,451,274
RAC: 6,693
Message 48374 - Posted: 15 Dec 2017 | 0:11:25 UTC - in response to Message 48356.
Last modified: 15 Dec 2017 | 0:12:32 UTC

These are called QMML and are rather experimental (more dependencies). Let's see how they work.


I would really like to get some tasks but seems they are not being given out ATM

http://www.gpugrid.net/show_host_detail.php?hostid=457056
Been trying for a while now. Intel(R) Core(TM) i7-3970X
____________
Crunching@EVGA The Number One Team in the BOINC Community.
Folding@EVGA The Number One Team in the Folding@Home Community.

Jim1348
Message 48375 - Posted: 15 Dec 2017 | 2:10:38 UTC - in response to Message 48373.

The reason why otherwise similar machines get/do not get work completely baffles me.

I have seen many instances of it myself over the years, but had hoped the latest BOINC clients were past that. Unfortunately not.

mmonnin
Message 48376 - Posted: 15 Dec 2017 | 2:54:01 UTC
Last modified: 15 Dec 2017 | 3:00:20 UTC

My 1950x also on 17.10 is getting tasks.

Were the old 3.13 tasks producing bad data? My processing time was just wasted by the server canceling them in the middle of processing. That's the absolute worst thing a project admin can do. Cancel ones not started but don't ever take a crap on donated resources.

It's still not working right. In 22.5 min of the task running it has used 1:37 (h:mm) of CPU time when the task is limited to 3 cores. That's over 4 cores of CPU usage. And they just had a computation error.
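
The "over 4 cores" figure is just CPU time divided by elapsed time. A small sketch of that arithmetic follows, under the assumption that the 1:37 of CPU time means 1 hour 37 minutes (that reading is what makes the numbers agree). The function name is made up for illustration.

```python
def effective_cores(cpu_seconds: float, elapsed_seconds: float) -> float:
    """Average number of cores kept busy: CPU time divided by wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return cpu_seconds / elapsed_seconds


# Figures from this post, reading the CPU time as 1 h 37 min (an assumption).
elapsed = 22.5 * 60          # 22.5 minutes of run time, in seconds
cpu = (1 * 60 + 37) * 60     # 97 minutes of CPU time, in seconds

print(f"{effective_cores(cpu, elapsed):.2f} cores on average")
```

The same ratio can be read off any task page that lists run time and CPU time side by side.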

At least 3.13 worked.

http://www.gpugrid.net/result.php?resultid=16767178

Prob a good thing tasks are being sent to some AMD CPUs. Damn seg fault.

klepel
Joined: 23 Dec 09
Posts: 143
Credit: 1,869,862,543
RAC: 1,083,757
Message 48378 - Posted: 15 Dec 2017 | 4:25:33 UTC

Most of the Quantum Chemistry v3.14 (mt) fail on my AMD 1700x. v3.13 (mt) worked more or less.

As an example:
http://www.gpugrid.net/result.php?resultid=16767309

I use an app_config to limit each work unit (WU) to 4 cores and run 3 WUs in parallel. Two cores are reserved for the GPU.

I had to change the configuration to accept only GPU work units, as the computer crashed twice today.

Hope this helps.

Toni
Message 48379 - Posted: 15 Dec 2017 | 8:58:27 UTC - in response to Message 48378.
Last modified: 15 Dec 2017 | 8:59:36 UTC

I understand that seeing cancelled WUs is not nice, but it saves future crunching and network bandwidth (both server and client) that would otherwise be lost. Also, I thought that the function we use only cancelled unsent or unstarted WUs.

JoergF
Joined: 20 Apr 15
Posts: 216
Credit: 340,297,861
RAC: 1,410,528
Message 48380 - Posted: 15 Dec 2017 | 15:20:57 UTC
Last modified: 15 Dec 2017 | 15:24:44 UTC

Hey friends, in case you need some more machines for testing, I can set up another one that is Linux based. As I have seen in the below conversation, there might be some issues with the CPU type. So .. I have both brands available for testing. Which one would help you most, Intel or AMD?

If you want me to, I could even let you choose the generation, from older Sandy/Ivy Bridge to new Skylake to Ryzen. Just let me know and I will make one available on short notice.

Edit: I can even contribute a very slow Celeron or Pentium, if that would give you some information on how older and slower systems will perform later... as there still are many older units out there.
____________
I would love to see HCF1 protein folding and interaction simulations to help my little boy... someday.

mmonnin
Message 48381 - Posted: 15 Dec 2017 | 16:15:44 UTC

Quite a few errors on 3.14 so I stopped running QC. Seg faults on AMD and Intel machines.

bcavnaugh
Message 48383 - Posted: 15 Dec 2017 | 16:43:14 UTC

So I did get some Tasks but it seems that only AMD Processors can run them.
http://www.gpugrid.net/show_host_detail.php?hostid=179948

A nice setting in the preferences for these tasks would be to allow us to set the number of cores.
For example, on a 32-core host you could run 2 tasks with 16 cores each,
or even 2 tasks with 8 cores each.

el_gallo_azul
Joined: 14 Jun 14
Posts: 9
Credit: 28,094,797
RAC: 24
Message 48384 - Posted: 16 Dec 2017 | 8:20:48 UTC
Last modified: 16 Dec 2017 | 8:21:40 UTC

I received ~60 new WUs yesterday, but I didn't see what happened with them. I was surprised when I went back to the computer an hour or so later and they had all disappeared.

I received another batch of ~60 WUs today, and this time I see that they all resulted in "Computation error".

Intel Xeon E5-2680 x 2 (i.e. 32 hyperthreaded cores).

bormolino
Joined: 16 May 13
Posts: 22
Credit: 18,763,866
RAC: 21,753
Message 48386 - Posted: 16 Dec 2017 | 11:53:50 UTC

I'm not getting any WUs, neither on my AMD, nor on my Intel CPUs.

Toni
Message 48388 - Posted: 16 Dec 2017 | 17:22:23 UTC - in response to Message 48386.

Let me summarize the current status. We are making tests in view of a large production run.

The WUs which are out now are called QMML314long and last several hours. These longer tests have a couple of new failure modes which I think are related to restarts, and can be fixed.

A different problem is task distribution by the BOINC scheduler. First of all, as said above, some hosts are ignored for no reason I can fathom. Another is that some hosts are "soaking up" dozens of WUs, which means they are not available to others. I am hoping that both problems will sort themselves out with a sufficiently large batch.

Final notes: (a) CPU maker is irrelevant. (b) disappeared WUs were tests which I cancelled from the server.

Conan
Message 48390 - Posted: 17 Dec 2017 | 2:57:50 UTC - in response to Message 48365.
Last modified: 17 Dec 2017 | 2:58:31 UTC

The Multiple threaded work units that were sent out last month worked fine for me with no issues.

These new ones however are all failing (so far on two different 64bit Linux machines) with this error

ERROR conda.core.link:_execute_actions(337): An error occurred while installing package 'psi4::gcc-5-5.2.0-1'.
LinkError: post-link script failed for package psi4::gcc-5-5.2.0-1
running your command again with `-v` will provide additional information
location of failed script: /home/xxxxxxxx/BOINC/projects/www.gpugrid.net/miniconda/envs/qmml/bin/.gcc-5-post-link.sh

When checking this path I found that there is nothing in the /envs/ folder, which is probably where the job is failing.

Conan


Just an update, as I am still getting these errors. This is the full error:

CondaValueError: prefix already exists: /home/xxxxxxxx/BOINC/projects/www.gpugrid.net/miniconda/envs/qmml

ERROR conda.core.link:_execute_actions(337): An error occurred while installing package 'psi4::gcc-5-5.2.0-1'.
LinkError: post-link script failed for package psi4::gcc-5-5.2.0-1
running your command again with `-v` will provide additional information
location of failed script: /home/xxxxxxxx/BOINC/projects/www.gpugrid.net/miniconda/envs/qmml/bin/.gcc-5-post-link.sh
==> script messages <==
<None>

Attempting to roll back.


LinkError: post-link script failed for package psi4::gcc-5-5.2.0-1
running your command again with `-v` will provide additional information
location of failed script: /home/xxxxxxxx/BOINC/projects/www.gpugrid.net/miniconda/envs/qmml/bin/.gcc-5-post-link.sh
==> script messages <==
<None>



Traceback (most recent call last):
  File "pre_script.py", line 20, in <module>
    raise Exception("Error installing psi4 dev")
Exception: Error installing psi4 dev
10:18:33 (23979): $PROJECT_DIR/miniconda/bin/python exited; CPU time 69.668408
10:18:33 (23979): app exit status: 0x1
10:18:33 (23979): called boinc_finish(195)

Conan

Toni
Message 48392 - Posted: 17 Dec 2017 | 14:04:51 UTC - in response to Message 48390.

@conan: do you have "gcc" installed in your system? If not, can you try to install it?

Conan
Message 48393 - Posted: 17 Dec 2017 | 15:15:43 UTC - in response to Message 48392.
Last modified: 17 Dec 2017 | 15:18:09 UTC

@conan: do you have "gcc" installed in your system? If not, can you try to install it?


It was installed on one computer with Fedora 25, but was not installed on the other two with Fedora 16 and Fedora 21, all 64 bit.
I have installed it now and am waiting to see what happens.
Versions range from 4.6.3-2 (Fedora 16), 4.9.2-6 (Fedora 21) to 6.4.1-1 (Fedora 25).

Thanks
Conan

Jim1348
Message 48394 - Posted: 17 Dec 2017 | 19:28:37 UTC - in response to Message 48388.

A different problem is task distribution by the BOINC scheduler. First of all, as said above, some hosts are ignored for no reason I can fathom. Another is that some hosts are "soaking up" dozens of WUs, which means they are not available to others. I am hoping that both problems will sort themselves out with a sufficiently large batch.

Final notes: (a) CPU maker is irrelevant. (b) disappeared WUs were tests which I cancelled from the server.

Yes! I just got some QC on my Ryzen 1700. All good things come to those that wait. (The first four errored out after a couple of minutes, but the fifth one is running fine after 50 minutes and I think it will fly, running two cores on each WU.)

Toni
Message 48395 - Posted: 17 Dec 2017 | 19:36:10 UTC - in response to Message 48394.
Last modified: 17 Dec 2017 | 19:37:00 UTC

I may have understood the problem of hosts not getting WUs. I was sending tasks at a high priority, which means they crossed the threshold to only go to "reliable hosts" -- a questionable heuristic.

100 tasks named "s*-QMML314long" I made at a lower priority seem to have been sent quickly.

bormolino
Message 48396 - Posted: 17 Dec 2017 | 20:07:46 UTC

I got my first WU today. Unfortunately the WU needs 4.7 GB of RAM. Can you optimise that?

Conan
Message 48397 - Posted: 17 Dec 2017 | 23:41:55 UTC - in response to Message 48393.

@conan: do you have "gcc" installed in your system? If not, can you try to install it?


It was installed on one computer with Fedora 25, but was not installed on the other two with Fedora 16 and Fedora 21, all 64 bit.
I have installed it now and am waiting to see what happens.
Versions range from 4.6.3-2 (Fedora 16), 4.9.2-6 (Fedora 21) to 6.4.1-1 (Fedora 25).

Thanks
Conan


The Fedora 16 host still has the same error, but the Fedora 21 host is processing a work unit now for the last 8 hours 21 minutes and 68% done, so it looks good at this point.
My Fedora 25 host has not received any work yet so can't say about that one.

My WU is using 1.5 GB of RAM.

Thanks
Conan

klepel
Message 48398 - Posted: 18 Dec 2017 | 2:58:48 UTC - in response to Message 48395.

I may have understood the problem of hosts not getting WUs. I was sending tasks at a high priority, which means they crossed the threshold to only go to "reliable hosts" -- a questionable heuristic.

100 tasks named "s*-QMML314long" I made at a lower priority seem to have been sent quickly.

This solved it for my second computer. It works from a USB stick with Lubuntu 17.04. Unfortunately, it crashed:
http://www.gpugrid.net/result.php?resultid=16776102

Conan
Message 48400 - Posted: 18 Dec 2017 | 5:31:13 UTC - in response to Message 48397.

@conan: do you have "gcc" installed in your system? If not, can you try to install it?


It was installed on one computer with Fedora 25, but was not installed on the other two with Fedora 16 and Fedora 21, all 64 bit.
I have installed it now and am waiting to see what happens.
Versions range from 4.6.3-2 (Fedora 16), 4.9.2-6 (Fedora 21) to 6.4.1-1 (Fedora 25).

Thanks
Conan


The Fedora 16 host still has the same error, but the Fedora 21 host is processing a work unit now for the last 8 hours 21 minutes and 68% done, so it looks good at this point.
My Fedora 25 host has not received any work yet so can't say about that one.

My WU is using 1.5 GB of RAM.

Thanks
Conan


This WU on the Fedora 21 host worked and completed successfully, my first of this batch.

Conan

Toni
Message 48401 - Posted: 18 Dec 2017 | 7:48:05 UTC - in response to Message 48398.

@klepel - can you try installing gcc (if not already there)?

tks

Dayle Diamond
Joined: 5 Dec 12
Posts: 73
Credit: 1,095,624,936
RAC: 814,126
Message 48402 - Posted: 18 Dec 2017 | 9:56:27 UTC

https://drive.google.com/file/d/1bKmSXT4IAVTR8b-fpiGdC6Gm4szduk0X/view?usp=sharing

Running one now. 1950x, Linux 17.10.
Average time taken is two hours, fifteen minutes per task.

At the time of the screenshot, the work unit is around forty percent done. I'm watching my CPU usage hit 100%, stay there for a while, then...waves.
I don't think it's thermal throttling. It's not overclocked and WCG tasks only make those patterns when tasks are starting/finishing.

If it's working as intended, ok.


Conan
Message 48403 - Posted: 18 Dec 2017 | 10:10:17 UTC - in response to Message 48401.
Last modified: 18 Dec 2017 | 10:20:18 UTC

What are the requirements for the running of Psi4?

My Fedora 16 computer after installing "gcc" is still getting the same error that it failed to install, but the Fedora 21 computer is now running fine.

Is there a certain "glibc", "gcc" or Linux kernel that is required to install this programme?
My older Fedora 16 install may not meet the requirements perhaps?

I still can't get any work on my Intel Xeon running Fedora 25, keeps saying that there is no work available when in fact there is, but that is another issue.

Thanks
Conan

Toni
Message 48405 - Posted: 18 Dec 2017 | 12:44:38 UTC - in response to Message 48403.

@dayle - oscillating CPU% is expected and due to the parts of the calculation which are not parallelized. Thermal throttling is unlikely imho (and I imagine it would manifest as a decrease in CPU clock, not CPU%).

@conan - in principle a system with GLIBC>=2.14 should be capable of running these; Fedora 16 seems to have it, but it is old, so probably something else is missing. Sorry.
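
If you want to check the GLIBC side yourself, here is a small sketch using only the Python standard library. The 2.14 threshold is the figure given above; the helper names are made up for illustration.

```python
import platform

MIN_GLIBC = (2, 14)  # threshold quoted in this post


def parse_version(text: str) -> tuple:
    """Turn '2.17' into (2, 17); non-numeric parts are ignored."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def glibc_ok(version_text: str, minimum: tuple = MIN_GLIBC) -> bool:
    """True if the given glibc version string is at least `minimum`."""
    parsed = parse_version(version_text)
    return bool(parsed) and parsed >= minimum


# platform.libc_ver() returns (lib, version), e.g. ('glibc', '2.26') on Linux.
lib, version = platform.libc_ver()
if lib == "glibc" and version:
    print(f"glibc {version}:", "OK" if glibc_ok(version) else "too old")
else:
    print("could not detect glibc on this system")
```

Passing this check is necessary but clearly not sufficient, since the Fedora 16 host seems to meet it and still fails.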

I've made another thread with information which may be useful.

bormolino
Message 48406 - Posted: 18 Dec 2017 | 13:18:30 UTC - in response to Message 48397.

My WU is using 1.5 GB of RAM.


18-Dec-2017 14:07:22 [GPUGRID] Quantum Chemistry needs 4768.37 MB RAM but only 3469.24 MB is available for use.

Toni
Message 48407 - Posted: 18 Dec 2017 | 13:33:47 UTC - in response to Message 48406.
Last modified: 18 Dec 2017 | 13:34:25 UTC

We are talking about 3 different memory use figures:

A. The amount of memory actually used (which varies with time), which Conan measured as 1.5 GB
B. The amount "requested" by the workunit, currently 4 GB
C. The maximum amount your boinc client allows to use (you can configure this to some extent)

The following should hold: A < B < C

In your case, B>C and therefore the WU was not allowed to start (I guess).
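
The three figures can be plugged into the inequality directly. This sketch uses the numbers reported in this thread (1.5 GB measured, and the 4768.37 MB and 3469.24 MB from the client log line); the function name is made up, and it only illustrates the comparison, not the BOINC client's actual logic.

```python
def can_start(requested_mb: float, allowed_mb: float) -> bool:
    """B < C: the workunit's request must fit within what the client allows."""
    return requested_mb < allowed_mb


used_mb = 1.5 * 1024      # A: Conan's measurement, taking 1.5 GB as 1536 MB
requested_mb = 4768.37    # B: "needs 4768.37 MB RAM" from the client log
allowed_mb = 3469.24      # C: "only 3469.24 MB is available for use"

print("A < B:", used_mb < requested_mb)
print("B < C:", can_start(requested_mb, allowed_mb))
```

With these figures A < B holds but B < C does not, which matches the task not being allowed to start even though the real usage would have fit.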

Jim1348
Message 48409 - Posted: 18 Dec 2017 | 14:23:14 UTC - in response to Message 48405.

@dayle - oscillating CPU% is expected and due to the parts of the calculation which are not parallelized. Thermal throttling is unlikely imho (and I imagine it would manifest as a decrease in CPU clock, not CPU%).

That one is a little confusing. The remaining time also increases as a consequence. I think I aborted some unnecessarily when it appeared that they were stuck. I now just let them run. Maybe you should make a big sticky on it to catch people's attention?

langfod
Joined: 15 Dec 17
Posts: 2
Credit: 5,577,735
RAC: 61,331
Message 48413 - Posted: 18 Dec 2017 | 17:50:23 UTC

Did tasks just get aborted by the system?

Name s51-TONI_QMML314long-0-1-RND1523_1
Workunit 12930407
Exit status 202 (0xca) EXIT_ABORTED_BY_PROJECT

<core_client_version>7.8.3</core_client_version>
<![CDATA[
<message>
aborted by project - no longer usable</message>
<stderr_txt>

Jim1348
Message 48414 - Posted: 18 Dec 2017 | 18:02:01 UTC - in response to Message 48413.
Last modified: 18 Dec 2017 | 18:03:03 UTC

Did tasks just get aborted by the system?

Yes. I just had a bunch aborted at 13:44 UTC.
But there are now new ones in the pipeline.

Toni
Message 48415 - Posted: 18 Dec 2017 | 18:32:13 UTC - in response to Message 48414.
Last modified: 18 Dec 2017 | 18:32:27 UTC

Can you please confirm that those WUs were cancelled while running and not just while waiting to start?

langfod
Message 48416 - Posted: 18 Dec 2017 | 18:41:43 UTC - in response to Message 48415.

All I can tell is that the one task I had running was a couple hours from completion the last I looked.

Then I checked the task list and saw the cancellation:

16776352 12930407 457243
18 Dec 2017 | 6:16:48 UTC 18 Dec 2017 | 14:17:56 UTC
Cancelled by server 28,813.53 238,856.00 ---
Quantum Chemistry v3.14 (mt)

Jim1348
Message 48417 - Posted: 18 Dec 2017 | 19:11:56 UTC - in response to Message 48415.
Last modified: 18 Dec 2017 | 19:17:36 UTC

Can you please confirm that those WUs were cancelled while running and not just while waiting to start?

On my i7-4770 machine, there were 13 aborted at 13:43:51 UTC. Twelve of them show 0 elapsed time, but the other one shows 05:02:06 (19:52:01) elapsed time. They are all listed as "cancelled by server".

And on an i7-3770 machine, three of them completed just after that, at 13:45:09 UTC, after running for around 24 hours or more each, and all show "cancelled by server".

Finally, on my Ryzen 1700 machine, two of them completed at 13:52:55 UTC and show "cancelled by server" after running about 18 to 19 hours.

So it works.

EDIT: But BoincTasks shows the i7-4770 and the Ryzen 1700 machines as "Reported: OK+", so it is only on the GPUGrid status page that the true story is told apparently.

Toni
Message 48418 - Posted: 18 Dec 2017 | 19:30:51 UTC - in response to Message 48417.

It's true that running tasks are being killed. This is not what I expected.

By the way: these WUs should not run 10+ hours on modern CPUs. That's strange.

Jim1348
Message 48419 - Posted: 18 Dec 2017 | 20:59:20 UTC - in response to Message 48418.

By the way: these WUs should not run 10+ hours on modern CPUs. That's strange.

The i7-3770 machine and the Ryzen 1700 machine were running only 2 cores per work unit, while the i7-4770 was running 4 cores per work unit.

Dayle Diamond
Message 48420 - Posted: 18 Dec 2017 | 22:21:25 UTC

It's true that running tasks are being killed. This is not what I expected.


So far 76 tasks on my machine have been canceled.
Please continue killing any task in progress if you don't want the data.
No point squandering precious CPU cycles when the science/programming has moved on to a newer revision.

Happy Holidays!

Conan
Message 48421 - Posted: 19 Dec 2017 | 3:17:38 UTC - in response to Message 48415.

Can you please confirm that those WUs were cancelled while running and not just while waiting to start?


Yes I had one that had been running for 33,977 seconds (CPU time 140,002 seconds) and it was cancelled, as well as 2 that had not started.

As an aside on my Fedora 16 host, where these work units are all failing: it is running 'gcc' 4.6.
I did some reading on Psi4 and found that it seems to need gcc 4.9 or later in order to run.
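
For anyone else wanting to check their own host, this is roughly how the installed 'gcc' could be compared against that 4.9 figure. The threshold is only what I gathered from my reading and is not confirmed by the project; the helper names are made up for illustration. It relies on "gcc -dumpversion", which prints the compiler version.

```python
import shutil
import subprocess

MIN_GCC = (4, 9)  # figure from my reading on Psi4, not confirmed by the project


def parse_version(text: str) -> tuple:
    """Turn '4.9.2' into (4, 9, 2); non-numeric parts are ignored."""
    return tuple(int(p) for p in text.strip().split(".") if p.isdigit())


def gcc_new_enough(version_text: str, minimum: tuple = MIN_GCC) -> bool:
    """True if the version string is at least `minimum`."""
    parsed = parse_version(version_text)
    return bool(parsed) and parsed >= minimum


gcc_path = shutil.which("gcc")
if gcc_path is None:
    print("gcc not found on PATH")
else:
    result = subprocess.run([gcc_path, "-dumpversion"],
                            capture_output=True, text=True)
    version = result.stdout.strip()
    verdict = "looks new enough" if gcc_new_enough(version) else "older than 4.9"
    print(f"gcc {version}: {verdict}")
```

By this test the 4.6.3 on Fedora 16 falls short, while the 4.9.2 on Fedora 21 and 6.4.1 on Fedora 25 pass.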

I have since installed this 'gcc' version on that computer and am awaiting a work unit to see if it works or not. There may still be something missing.

I may just have to update Fedora 16 to something more recent.

Conan

Petr Kriz
Joined: 22 Feb 09
Posts: 3
Credit: 72,992
RAC: 3,449
Message 48422 - Posted: 19 Dec 2017 | 8:05:25 UTC

I still have problems with tasks failing with "miniconda-installer reached time limit 360". I tried 4 tasks today with the same result (the other 2 tasks I cancelled). I have a standard Fedora 26, nothing special.
I don't think the problem is the firewall or a slow connection (as suggested in another thread). Is miniconda-installer really downloading something? I rather think that there is something wrong with the installation of files already downloaded to the hard drive.
I think I will now wait some time and come back later (maybe a month or two). Hopefully, it will be resolved.

[VENETO] boboviz
Send message
Joined: 10 Sep 10
Posts: 45
Credit: 80,281
RAC: 0
Level

Scientific publications
wat
Message 48424 - Posted: 19 Dec 2017 | 9:16:06 UTC

I downloaded 3 v3.14 WUs on my VirtualBox Linux machine.
They don't start: the status is "Waiting to run", and there is no message in BOINC Manager.

Profile Conan
Send message
Joined: 25 Mar 09
Posts: 19
Credit: 89,737
RAC: 51
Level

Scientific publications
wat
Message 48425 - Posted: 19 Dec 2017 | 10:28:30 UTC - in response to Message 48422.

I still have problems with tasks failing with "miniconda-installer reached time limit 360". I tried 4 tasks today with the same result (I cancelled 2 other tasks). I have a standard Fedora 26 install, nothing special.
I don't think the problem is the firewall or a slow connection (as suggested in another thread). Is miniconda-installer really downloading something? I rather think that something is wrong with the installation of files already downloaded to the hard drive.
I think I will wait for a while and come back later (maybe in a month or two). Hopefully it will be resolved by then.


Check that SELINUX is not blocking any files from running. I had this problem on my Fedora 25 install and had to create an exception for it.

Also make sure your 'gcc' packages are up to date
dnf install gcc, or dnf install gcc-c++, should help if you haven't already done so.

Conan

Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Send message
Joined: 9 Dec 08
Posts: 619
Credit: 4,273,184
RAC: 0
Level
Ala
Scientific publications
watwatwatwat
Message 48426 - Posted: 19 Dec 2017 | 10:32:19 UTC - in response to Message 48424.
Last modified: 19 Dec 2017 | 10:33:32 UTC

@petr - miniconda is downloaded from our servers (~50 MB). After that, at the beginning, psi4 and other packages are downloaded from Anaconda's servers (only the first time). If you suspect a mixup, feel free to "reset" the GPUGRID project and everything should be deleted (and downloaded again at the next WU). Beware that it would kill running tasks!

@conan - in principle part of the run is indeed to download its dependencies from Anaconda, including a GCC 5 version which is installed in the project's directory. However, to complete its installation, a library is needed which is generally shipped with... the system's GCC. It's indeed a bit confusing. Maybe your tweak solves the problem, maybe not.

Profile Conan
Send message
Joined: 25 Mar 09
Posts: 19
Credit: 89,737
RAC: 51
Level

Scientific publications
wat
Message 48428 - Posted: 19 Dec 2017 | 12:16:24 UTC
Last modified: 19 Dec 2017 | 12:22:40 UTC

@ Toni, when these work units run do they get to a certain point then just idle along on a single core for hours on end?

I finally got a work unit to download to my Fedora 25 host, and it ran fine up to about 6 hours or so run time and 78.698% completed.
After this it has been running on a single core for almost 8 hours now and the progress is still locked at 78.698%.
The time to completion has increased from 1 hour 39 minutes to 3 hours 44 minutes and counting.

What happened to the Multi-Threading I thought these work units were supposed to do?

Run time is now approaching 14 hours and the % done has not moved, this is on a 16 core computer.

Conan

Jim1348
Send message
Joined: 28 Jul 12
Posts: 490
Credit: 1,131,369,187
RAC: 11,122
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwat
Message 48429 - Posted: 19 Dec 2017 | 12:47:20 UTC - in response to Message 48428.

Run time is now approaching 14 hours and the % done has not moved, this is on a 16 core computer.

I seem to recall problems on LHC/ATLAS when running on more than 8 cores, though I was not involved with the problem myself as I run only 7 cores there anyway. But you could try an app_config.xml to limit it to 8 cores.

Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Send message
Joined: 9 Dec 08
Posts: 619
Credit: 4,273,184
RAC: 0
Level
Ala
Scientific publications
watwatwatwat
Message 48430 - Posted: 19 Dec 2017 | 13:14:27 UTC - in response to Message 48428.

The computation loops over several molecules (~60, if I remember correctly). A checkpoint is written after each loop iteration. Inside each iteration there is a part which is multithreaded and a part which is not, and their relative sizes vary. So it's not strange that thread occupancy oscillates. Limiting the number of cores to, say, 4 via the client is OK.
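
As a rough illustration of the loop structure described above (this is not the project's actual code), a per-molecule loop that writes a checkpoint after each iteration might look like the sketch below. The names `run_all`, `compute_molecule`, and the JSON checkpoint file are hypothetical and chosen only for the example.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved results dict, or an empty one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    return {}


def save_checkpoint(path, results):
    """Write results to a temp file, then rename, so a kill mid-write
    cannot leave a half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(results, fh)
    os.replace(tmp, path)


def run_all(molecules, compute_molecule, path):
    """Process each molecule once, skipping those already in the checkpoint."""
    results = load_checkpoint(path)
    for name in molecules:
        if name in results:
            continue  # already finished before a restart
        results[name] = compute_molecule(name)  # hypothetical per-molecule work
        save_checkpoint(path, results)  # one checkpoint per loop iteration
    return results
```

With this layout, a restart loses at most the molecule that was in progress, which is the behaviour one would expect if the per-loop checkpoints are picked up correctly.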

mmonnin
Send message
Joined: 2 Jul 16
Posts: 75
Credit: 93,757,492
RAC: 1,426,180
Level
Thr
Scientific publications
wat
Message 48431 - Posted: 19 Dec 2017 | 14:25:37 UTC - in response to Message 48429.

Run time is now approaching 14 hours and the % done has not moved, this is on a 16 core computer.

I seem to recall problems on LHC/ATLAS when running on more than 8 cores, though I was not involved with the problem myself as I run only 7 cores there anyway. But you could try an app_config.xml to limit it to 8 cores.


Someone did tests there. Atlas runs best around 3/4/5 threads. More threads are not utilized very well. I don't recall any mt BOINC app utilizing all threads at 8+. It would probably have to be a straight math project that calculates more #s in parallel to do that.

Petr Kriz
Send message
Joined: 22 Feb 09
Posts: 3
Credit: 72,992
RAC: 3,449
Level

Scientific publications
wat
Message 48433 - Posted: 19 Dec 2017 | 16:20:43 UTC

@Toni, Conan: Thanks to both of you. It looks like SELinux was blocking it. It's kind of a black box for me. I found this procedure, which I applied on my system. I hope that I didn't open Pandora's box instead :).
After this, the WU started to download additional packages and now the computation is running. I will see whether it succeeds, but so far all 6 cores (12 threads) are running at full speed (with some small slowdowns from time to time).
So again, thanks for the help.

Profile Conan
Send message
Joined: 25 Mar 09
Posts: 19
Credit: 89,737
RAC: 51
Level

Scientific publications
wat
Message 48438 - Posted: 19 Dec 2017 | 22:18:59 UTC - in response to Message 48428.
Last modified: 19 Dec 2017 | 22:36:38 UTC

@ Toni, when these work units run do they get to a certain point then just idle along on a single core for hours on end?

I finally got a work unit to download to my Fedora 25 host, and it ran fine up to about 6 hours or so run time and 78.698% completed.
After this it has been running on a single core for almost 8 hours now and the progress is still locked at 78.698%.
The time to completion has increased from 1 hour 39 minutes to 3 hours 44 minutes and counting.

What happened to the Multi-Threading I thought these work units were supposed to do?

Run time is now approaching 14 hours and the % done has not moved, this is on a 16 core computer.

Conan


Well I got sick of waiting for this one to finish (it had now been running for over 22 hours still on 1 core) so I created an "app_config.xml" file and inserted that in the project folder and restarted the BOINC Client and Manager.

The WU reset itself to 1.098% completed, 5 hours run time and 18 days 23 hours to completion.
So 17 to 18 hours run time disappeared and all processing went as well, so apparently no checkpoints.

It is now running on 8 cpus instead of 16 which had stopped other work for a day.
Will now see what happens.

EDIT: Just after I posted, the WU jumped to 81.741% done and the time to completion has dropped to 1 day 11 minutes.
So appears to be working heaps better.


Conan

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 1857
Credit: 11,005,828,094
RAC: 10,829,540
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 48440 - Posted: 20 Dec 2017 | 1:28:12 UTC - in response to Message 48438.

It is now running on 8 cpus instead of 16 which had stopped other work for a day.
Will now see what happens.

Your host has 2 CPUs, both have 4 cores hyperthreaded, so the performance scaling will drop rapidly if you run more than 8 threads of Floating Point calculations (most of the science projects are using FP).
To all multi-threaded CPU crunchers: Hyperthreaded CPUs have half as many cores as BOINC reports, so you should limit the threads utilized by the app to obtain optimal performance / reliability.
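
To see how many physical cores sit behind the logical CPUs that BOINC reports, one option on Linux is to count distinct (package, core) pairs in /proc/cpuinfo. The sketch below assumes the x86 field names `processor`, `physical id` and `core id`; the function name `count_cores` is made up for this example, and other architectures may not expose these fields.

```python
def count_cores(cpuinfo_text):
    """Return (logical, physical) core counts from /proc/cpuinfo-style text."""
    logical = 0
    physical = set()
    package = core = None
    # The extra "" makes sure the last processor entry is also closed.
    for line in cpuinfo_text.splitlines() + [""]:
        if not line.strip():
            # A blank line ends one processor entry.
            if package is not None and core is not None:
                physical.add((package, core))
            package = core = None
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "processor":
            logical += 1
        elif key == "physical id":
            package = value
        elif key == "core id":
            core = value
    return logical, len(physical)
```

On a Linux host this could be called as `count_cores(open("/proc/cpuinfo").read())`; the second number is then a reasonable upper bound for the thread limit discussed above.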

Dayle Diamond
Send message
Joined: 5 Dec 12
Posts: 73
Credit: 1,095,624,936
RAC: 814,126
Level
Met
Scientific publications
watwatwatwatwatwatwatwat
Message 48441 - Posted: 20 Dec 2017 | 1:32:50 UTC

If these workunits are going to take an average of fifty-five hours of CPU time, they really shouldn't crash when I reboot the system.

https://www.gpugrid.net/workunit.php?wuid=12932734

Needed to apply updates. Waited until one finished uploading before risking it. Glad I waited. This one was active for less than five minutes.


Name c448-TONI_QMML314rst-0-1-RND2050_0
Workunit 12932734
Created 18 Dec 2017 | 17:32:38 UTC
Sent 18 Dec 2017 | 21:17:59 UTC
Received 20 Dec 2017 | 1:26:41 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 195 (0xc3) EXIT_CHILD_FAILED
Computer ID 453935
Report deadline 23 Dec 2017 | 21:17:59 UTC
Run time 10.36
CPU time 6.37
Validate state Invalid
Credit 0.00
Application version Quantum Chemistry v3.14 (mt)
Stderr output

<core_client_version>7.8.3</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
17:21:56 (70558): wrapper (7.7.26016): starting
17:21:56 (70558): wrapper (7.7.26016): starting
17:21:56 (70558): wrapper: running ../../projects/www.gpugrid.net/Miniconda3-4.3.30-Linux-x86_64.sh (-b -f -p /var/lib/boinc-client/projects/www.gpugrid.net/miniconda)
Python 3.6.3 :: Anaconda, Inc.
17:22:04 (70558): miniconda-installer exited; CPU time 6.370986
17:22:04 (70558): wrapper: running /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/python (pre_script.py)

CondaValueError: prefix already exists: /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/envs/qmml

forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libpcm.so.1 00007F3B35B1D725 Unknown Unknown Unknown
libpcm.so.1 00007F3B35B1B347 Unknown Unknown Unknown
libpcm.so.1 00007F3B35A32AA2 Unknown Unknown Unknown
libpcm.so.1 00007F3B35A328F6 Unknown Unknown Unknown
libpcm.so.1 00007F3B35A00EFD Unknown Unknown Unknown
libpcm.so.1 00007F3B35A04298 Unknown Unknown Unknown
libpthread.so.0 00007F3B48D8E150 Unknown Unknown Unknown
libmkl_def.so 00007F3B2293D916 Unknown Unknown Unknown
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
libpcm.so.1 00007F3B35A04B9A Unknown Unknown Unknown
libpthread.so.0 00007F3B48D8E150 Unknown Unknown Unknown

Stack trace terminated abnormally.
SIGSEGV: segmentation violation
Stack trace (11 frames):
../../projects/www.gpugrid.net/wrapper_26198_x86_64-pc-linux-gnu[0x457672]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x13150)[0x7fa252998150]
../../projects/www.gpugrid.net/wrapper_26198_x86_64-pc-linux-gnu[0x494313]
../../projects/www.gpugrid.net/wrapper_26198_x86_64-pc-linux-gnu[0x4905b5]
/lib/x86_64-linux-gnu/libc.so.6(+0x37140)[0x7fa2525dc140]
/lib/x86_64-linux-gnu/libc.so.6(nanosleep+0x58)[0x7fa25267db98]
/lib/x86_64-linux-gnu/libc.so.6(usleep+0x44)[0x7fa2526b0134]
../../projects/www.gpugrid.net/wrapper_26198_x86_64-pc-linux-gnu[0x467f2f]
../../projects/www.gpugrid.net/wrapper_26198_x86_64-pc-linux-gnu[0x40b1a1]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7fa2525c61c1]
../../projects/www.gpugrid.net/wrapper_26198_x86_64-pc-linux-gnu[0x407ca2]

Exiting...
17:24:57 (1324): wrapper (7.7.26016): starting
17:24:57 (1324): wrapper (7.7.26016): starting
17:24:57 (1324): wrapper: running /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/python (pre_script.py)

CondaValueError: prefix already exists: /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/envs/qmml


CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://repo.continuum.io/pkgs/main/linux-64/repodata.json.bz2>
Elapsed: -

An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.
ConnectionError(MaxRetryError("HTTPSConnectionPool(host='repo.continuum.io', port=443): Max retries exceeded with url: /pkgs/main/linux-64/repodata.json.bz2 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f19ff5e8898>: Failed to establish a new connection: [Errno -2] Name or service not known',))",),)


Traceback (most recent call last):
File "pre_script.py", line 13, in <module>
raise Exception("Error installing h5py")
Exception: Error installing h5py
17:24:58 (1324): $PROJECT_DIR/miniconda/bin/python exited; CPU time 0.257581
17:24:58 (1324): app exit status: 0x1
17:24:58 (1324): called boinc_finish(195)

</stderr_txt>
]]>


Keith Myers
Send message
Joined: 13 Dec 17
Posts: 30
Credit: 2,797,089
RAC: 85,230
Level
Ala
Scientific publications
wat
Message 48444 - Posted: 20 Dec 2017 | 4:23:12 UTC

I finally finished a task so I can post now.

Can someone explain what the QC app shows for Status in the BOINC Manager? I had an app_config.xml loaded to limit the number of CPU cores it was supposed to use to 4. However, the Status column showed 16C for the number of cores allotted.

Is that normal? Is that just how it describes itself to BOINC, or was it really using all 16 cores?

This was my app_config.xml:


<app_config>
  <app>
    <name>acemdlong</name>
    <max_concurrent>1</max_concurrent>
    <gpu_versions>
      <gpu_usage>1</gpu_usage>
      <cpu_usage>1</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>acemdshort</name>
    <max_concurrent>1</max_concurrent>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>1</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>QC</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app_version>
    <app_name>QC</app_name>
    <plan_class>mt</plan_class>
    <avg_ncpus>4</avg_ncpus>
    <cmdline>--nthreads 4</cmdline>
  </app_version>
</app_config>


Does anyone see anything wrong with the app_config?

Profile Conan
Send message
Joined: 25 Mar 09
Posts: 19
Credit: 89,737
RAC: 51
Level

Scientific publications
wat
Message 48445 - Posted: 20 Dec 2017 | 9:07:10 UTC - in response to Message 48444.
Last modified: 20 Dec 2017 | 9:09:36 UTC

I finally finished a task so I can post now.

Can someone explain what the QC app shows for Status in the BOINC Manager? I had an app_config.xml loaded to limit the number of CPU cores it was supposed to use to 4. However, the Status column showed 16C for the number of cores allotted.

Is that normal? Is that just how it describes itself to BOINC, or was it really using all 16 cores?

This was my app_config.xml:

<app_config>
  <app>
    <name>acemdlong</name>
    <max_concurrent>1</max_concurrent>
    <gpu_versions>
      <gpu_usage>1</gpu_usage>
      <cpu_usage>1</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>acemdshort</name>
    <max_concurrent>1</max_concurrent>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>1</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>QC</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app_version>
    <app_name>QC</app_name>
    <plan_class>mt</plan_class>
    <avg_ncpus>4</avg_ncpus>
    <cmdline>--nthreads 4</cmdline>
  </app_version>
</app_config>


Does anyone see anything wrong with the app_config?


Try <avg_ncpus>4.000000</avg_ncpus> in place of the <avg_ncpus>4</avg_ncpus> line above.

It works for me.
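
For anyone who would rather generate the file than edit it by hand, here is a minimal Python sketch that builds just the `<app_version>` part of the configuration quoted above, with `avg_ncpus` written in the decimal form suggested here. The function name `build_app_config` is made up for this example; the element names and the `--nthreads` option are copied from the quoted file, not checked against the BOINC documentation, and whether the decimal form is what actually makes the difference is not established in this thread.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, plan_class, ncpus):
    """Return a minimal app_config.xml string limiting one app to ncpus threads."""
    root = ET.Element("app_config")
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    # Written with six decimals, e.g. 4 -> "4.000000".
    ET.SubElement(version, "avg_ncpus").text = f"{ncpus:.6f}"
    # Option copied from the config quoted in the thread.
    ET.SubElement(version, "cmdline").text = f"--nthreads {ncpus}"
    return ET.tostring(root, encoding="unicode")
```

Calling `build_app_config("QC", "mt", 4)` gives a one-line XML string that can be saved as app_config.xml in the project folder; the client then has to re-read its config files before the change can take effect.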

Conan

Jim1348
Send message
Joined: 28 Jul 12
Posts: 490
Credit: 1,131,369,187
RAC: 11,122
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwat
Message 48446 - Posted: 20 Dec 2017 | 9:08:52 UTC - in response to Message 48444.
Last modified: 20 Dec 2017 | 9:21:33 UTC

Can someone explain what the QC app shows for Status in the BOINC Manager? I had an app_config.xml loaded to limit the number of CPU cores it was supposed to use to 4. However, the Status column showed 16C for the number of cores allotted.

Is that normal? Is that just how it describes itself to BOINC or was it really using all 16 cores?

The app_config looks the same as mine. Did you reboot in order to activate it?

In some of these multi-core projects, the Status is not updated until the next group of work units comes in after you have set the app_config. But a reboot usually fixes it.

Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Send message
Joined: 9 Dec 08
Posts: 619
Credit: 4,273,184
RAC: 0
Level
Ala
Scientific publications
watwatwatwat
Message 48449 - Posted: 20 Dec 2017 | 10:46:22 UTC - in response to Message 48446.

To clarify run times: all the QMML314rst wus are the same length. Even on a single core, they should not take longer than 20h maximum (on a relatively modern PC). The HTTP messages indicate a connectivity problem of course. I hope they cause a failure soon rather than remaining stuck. Re SElinux... I hope it leaves us in peace.
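
The stderr output quoted earlier in the thread ends in "[Errno -2] Name or service not known", which is what a failed host name lookup looks like, so a quick DNS check can narrow down this kind of connectivity problem. The sketch below is only an illustration: the function name `can_resolve` and the injectable `resolver` argument are assumptions made for the example, and it tests name resolution only, not whether the server is reachable.

```python
import socket


def can_resolve(host, port=443, resolver=socket.getaddrinfo):
    """Return True if the host name resolves, False on a DNS failure."""
    try:
        resolver(host, port)
    except socket.gaierror:
        # Raised for lookup failures such as "Name or service not known".
        return False
    return True
```

For example, `can_resolve("repo.continuum.io")` run on the affected host would show whether the name from the error message can be looked up at all.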

Sebastian M. Bobrecki
Send message
Joined: 4 Oct 09
Posts: 6
Credit: 110,798,797
RAC: 0
Level
Cys
Scientific publications
watwatwatwatwatwatwat
Message 48454 - Posted: 20 Dec 2017 | 13:27:54 UTC
Last modified: 20 Dec 2017 | 13:38:38 UTC

After about 10h and reaching 69.568%, the app started to use only one core. What's worse, it has stayed in that state for another 10h, and perf indicates that it's in an OMP spinlock:

83.49% python libiomp5.so [.] __kmp_wait_yield_4
6.76% python libiomp5.so [.] __kmp_eq_4
5.74% python libiomp5.so [.] __kmp_yield
0.66% python [kernel.vmlinux] [k] entry_SYSCALL_64
...

Edit: On a second machine it looks similar: after 6h and 78.698%, it has stayed in that state for about 11h now. Perf:

84.60% python libiomp5.so [.] __kmp_wait_yield_4
6.80% python libiomp5.so [.] __kmp_eq_4
5.77% python libiomp5.so [.] __kmp_yield
0.59% python [kernel.vmlinux] [k] entry_SYSCALL_64
0.37% python [kernel.vmlinux] [k] __schedule
...
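
To put a number on the listings above, the overhead column can be summed per shared object. This is a small sketch that assumes each line has the same whitespace-separated layout as shown here (overhead, command, shared object, symbol type, symbol); the function name `overhead_by_object` is made up for the example.

```python
from collections import defaultdict


def overhead_by_object(perf_text):
    """Sum the overhead percentages of a perf listing per shared object."""
    totals = defaultdict(float)
    for line in perf_text.splitlines():
        fields = line.split()
        # Expected fields: overhead%, command, shared object, [type], symbol.
        if len(fields) < 5 or not fields[0].endswith("%"):
            continue  # skips blank lines and the trailing "..."
        totals[fields[2]] += float(fields[0].rstrip("%"))
    return dict(totals)
```

For the first listing, the three libiomp5.so symbols add up to about 96% of the samples, which fits the picture of a thread spinning in the OpenMP runtime rather than doing useful work.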

Jim1348
Send message
Joined: 28 Jul 12
Posts: 490
Credit: 1,131,369,187
RAC: 11,122
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwat
Message 48455 - Posted: 20 Dec 2017 | 13:33:18 UTC - in response to Message 48449.

Even on a single core, they should not take longer than 20h maximum (on a relatively modern PC).

They are not behaving that well at all. I did not have any work units complete yesterday on four machines. That was running two cores each on two i7-3770s and four cores each on an i7-4770 and a Ryzen 1700. These machines are all Ubuntu 16/17, and run 24/7.
http://www.gpugrid.net/results.php?userid=90514

They must loop back at some point, but I will let them run for another couple of days.

By the way, posting is difficult as the website is often inaccessible for a few minutes at a time. Maybe that is related to some of the problems some people are having, but I have not looked into it further.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 490
Credit: 1,131,369,187
RAC: 11,122
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwat
Message 48458 - Posted: 20 Dec 2017 | 14:54:10 UTC - in response to Message 48455.

Two have just completed on my Ryzen 1700 (4 cores each). The elapsed time shows as 4 hours 10 minutes, but the CPU time is over two days.
http://www.gpugrid.net/results.php?hostid=452287

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 30
Credit: 2,797,089
RAC: 85,230
Level
Ala
Scientific publications
wat
Message 48460 - Posted: 20 Dec 2017 | 17:10:57 UTC - in response to Message 48446.

Can someone explain what the QC app shows for Status in the BOINC Manager? I had an app_config.xml loaded to limit the number of CPU cores it was supposed to use to 4. However, the Status column showed 16C for the number of cores allotted.

Is that normal? Is that just how it describes itself to BOINC or was it really using all 16 cores?

The app_config looks the same as mine. Did you reboot in order to activate it?

In some of these multi-core projects, the Status is not updated until the next group of work units comes in after you have set the app_config. But a reboot usually fixes it.

I reloaded the app_config via the Manager. I was afraid to reboot the machine because I had read earlier in the thread that the tasks would restart and I would lose the processing up to that point. It is normal for BOINC to identify downloaded tasks with the existing cpu/gpu resource usage at time of download. But I could swear I had the app_config in place before I finally snagged my first two tasks. Will wait and see when I can get my next task.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 490
Credit: 1,131,369,187
RAC: 11,122
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwat
Message 48464 - Posted: 20 Dec 2017 | 21:02:56 UTC - in response to Message 48460.

But I could swear I had the app_config in place before I finally snagged my first two tasks. Will wait and see when I can get my next task.

You have to activate the app_config. If you have BoincTasks, there is a way to read all the cc_config and app_config files for any connected machine. (I don't have it in front of me at the moment). Otherwise, a reboot will be necessary. I am having all sorts of problems with the work units, and a reboot is probably not worse than anything else at the moment.

mmonnin
Send message
Joined: 2 Jul 16
Posts: 75
Credit: 93,757,492
RAC: 1,426,180
Level
Thr
Scientific publications
wat
Message 48465 - Posted: 20 Dec 2017 | 23:11:28 UTC - in response to Message 48464.

The official BOINC Manager has an option to reread config files as well. I use BOINC Tasks and a new/updated file is picked up without a reboot or client restart.

Profile Conan
Send message
Joined: 25 Mar 09
Posts: 19
Credit: 89,737
RAC: 51
Level

Scientific publications
wat
Message 48467 - Posted: 21 Dec 2017 | 3:40:51 UTC

There is also an option in the BOINC code that lets you set the number of CPUs you want to use per host.
Each host can have a different setting.

Ask over at Amicable Numbers as they have done that there.
Then you won't need app_config.xml files at all.

Conan

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 30
Credit: 2,797,089
RAC: 85,230
Level
Ala
Scientific publications
wat
Message 48469 - Posted: 21 Dec 2017 | 6:26:09 UTC - in response to Message 48467.

Setting CPU % in BOINC is system and project wide. Not very good for fine tuning per project. The app_config was specifically introduced for specific project tuning and is the preferred method to control gpu and cpu usage per application.

I have the cpu cores limited in my app_config for both the ACEMD and QC apps. I was wondering why it wasn't picked up after all config files were re-read.

Profile Conan
Send message
Joined: 25 Mar 09
Posts: 19
Credit: 89,737
RAC: 51
Level

Scientific publications
wat
Message 48471 - Posted: 21 Dec 2017 | 12:02:09 UTC - in response to Message 48467.
Last modified: 21 Dec 2017 | 12:03:40 UTC

There is also an option in the BOINC code that lets you set the number of CPUs you want to use per host.
Each host can have a different setting.

Ask over at Amicable Numbers as they have done that there.
Then you won't need app_config.xml files at all.

Conan


Setting CPU % in BOINC is system and project wide. Not very good for fine tuning per project. The app_config was specifically introduced for specific project tuning and is the preferred method to control gpu and cpu usage per application.

I have the cpu cores limited in my app_config for both the ACEMD and QC apps. I was wondering why it wasn't picked up after all config files were re-read.
Keith Myers


I am not referring to the Boinc Client on your personal computer.
My comments were aimed at the BOINC Server Code and therefore are relevant to fine tuning per project.
The option I am referring to is meant for Multiple Threading, so you can set the number of cores that you want to run a MT work unit on.

Over at Amicable Numbers I use the normal setting for my 4-core host, 5 cores for my 6-core host and 8 cores for my 16-core host (allowing 2 work units to run at the same time), so none of the computers has the same setting, though they could if I wanted them to.

Had the same issue I found here with the 16 core machine and that is why I set it to 8 cores.

App_config.xml works and works well, I was offering an option especially for those of us that are not too good at creating these xml files, and to show that BOINC does have an option in its code to cover this situation.

Conan

Jim1348
Send message
Joined: 28 Jul 12
Posts: 490
Credit: 1,131,369,187
RAC: 11,122
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwat
Message 48472 - Posted: 21 Dec 2017 | 12:35:13 UTC - in response to Message 48471.

I am not referring to the Boinc Client on your personal computer.
My comments were aimed at the BOINC Server Code and therefore are relevant to fine tuning per project.
The option I am referring to is meant for Multiple Threading, so you can set the number of cores that you want to run a MT work unit on.

They have that at LHC too, for the ATLAS project. And they used to do something similar at WCG for the CEP2 project (though that was not mt), in order to limit the high number of writes to the disk drive.

I think it would be very valuable here, since it appears that limiting the number of cores will be needed for many people, and not everyone will be willing to use app_config.xml files.

Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Send message
Joined: 9 Dec 08
Posts: 619
Credit: 4,273,184
RAC: 0
Level
Ala
Scientific publications
watwatwatwat
Message 48473 - Posted: 21 Dec 2017 | 13:04:51 UTC - in response to Message 48472.

We'll try to limit the number of cores indeed. It requires server-side changes so may not be soon & may not work.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 30
Credit: 2,797,089
RAC: 85,230
Level
Ala
Scientific publications
wat
Message 48476 - Posted: 21 Dec 2017 | 19:11:10 UTC - in response to Message 48471.

I understood what you are saying. The only way that works is if you set up different venues for different projects. I work primarily at SETI and Einstein. The venue mechanism does not work correctly and will likely never be updated. Very low chance that any major rework of the BOINC server code happens in the future with the lack of developers.

I am very comfortable with writing and editing app_info and app_config. Been doing it for a very long while. App_config is the simplest way to tune for individual projects as long as you are using a later version of the Client.

I also run more than one project simultaneously which makes your solution unworkable.

ETQuestor
Send message
Joined: 11 Jul 09
Posts: 27
Credit: 794,392,589
RAC: 487,722
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 48477 - Posted: 21 Dec 2017 | 19:41:07 UTC
Last modified: 21 Dec 2017 | 19:44:41 UTC

I just had to restart the BOINC client and the QMML work unit started back from 0% "fraction done" even though it had a checkpoint time of ~130000 and was at ~65%. Boo.

name: c457-TONI_QMML314rst-0-1-RND3080_4
WU name: c457-TONI_QMML314rst-0-1-RND3080
project URL: http://www.gpugrid.net/
received: Thu Dec 21 01:43:45 2017
report deadline: Tue Dec 26 01:43:44 2017
ready to report: no
got server ack: no
final CPU time: 0.000000
state: downloaded
scheduler state: scheduled
exit_status: 0
signal: 0
suspended via GUI: no
active_task_state: EXECUTING
app version num: 314
checkpoint CPU time: 135211.300000
current CPU time: 138876.400000
fraction done: 0.010989
swap size: 1236 MB
working set size: 304 MB
estimated CPU time remaining: 747697.676876

Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Send message
Joined: 9 Dec 08
Posts: 619
Credit: 4,273,184
RAC: 0
Level
Ala
Scientific publications
watwatwatwat
Message 48483 - Posted: 22 Dec 2017 | 21:45:12 UTC - in response to Message 48477.

Can anybody please test whether a couple of tasks can run simultaneously?

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 30
Credit: 2,797,089
RAC: 85,230
Level
Ala
Scientific publications
wat
Message 48500 - Posted: 24 Dec 2017 | 17:15:46 UTC - in response to Message 48483.

If another one would be made available, I could try. I only ever see one task ready to be snagged. Just got one. Happy to report the change in allowed cores limit was properly applied after I changed 4 to 4.0000.

Profile bormolino
Send message
Joined: 16 May 13
Posts: 22
Credit: 18,763,866
RAC: 21,753
Level
Pro
Scientific publications
watwat
Message 48511 - Posted: 26 Dec 2017 | 13:56:46 UTC

Why is there such a big credit difference?



16794335 12932706 25 Dec 2017 | 10:46:18 UTC 26 Dec 2017 | 5:21:03 UTC Completed and validated 66,830.38 199,504.30 440.38 Quantum Chemistry v3.14 (mt)

16792051 12932673 24 Dec 2017 | 10:37:41 UTC 25 Dec 2017 | 5:22:21 UTC Completed and validated 67,429.09 200,424.90 1,627.77 Quantum Chemistry v3.14 (mt)

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 30
Credit: 2,797,089
RAC: 85,230
Level
Ala
Scientific publications
wat
Message 48512 - Posted: 27 Dec 2017 | 3:04:31 UTC - in response to Message 48483.

I just grabbed 2 QC tasks and I will attempt to run them simultaneously tomorrow during the SETI outage.

STARBASEn
Avatar
Send message
Joined: 17 Feb 09
Posts: 25
Credit: 349,637,412
RAC: 1,347,315
Level
Asp
Scientific publications
watwatwatwatwat
Message 48514 - Posted: 27 Dec 2017 | 18:20:32 UTC

These two WU's ran concurrently on one of my FX8350 using 4 cores each. They were started within about twenty minutes of each other and finished about 4 minutes apart.

16795026 12932509 426610 25 Dec 2017 | 15:47:53 UTC 27 Dec 2017 | 12:23:12 UTC Completed and validated 53,395.91 188,995.80 3,034.10 Quantum Chemistry v3.14 (mt)

16795025 12932492 426610 25 Dec 2017 | 15:47:53 UTC 27 Dec 2017 | 12:19:03 UTC Completed and validated 53,957.92 193,510.10 3,066.04 Quantum Chemistry v3.14 (mt)

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 30
Credit: 2,797,089
RAC: 85,230
Level
Ala
Scientific publications
wat
Message 48520 - Posted: 28 Dec 2017 | 9:07:14 UTC - in response to Message 48512.

I just grabbed 2 QC tasks and I will attempt to run them simultaneously tomorrow during the SETI outage.


I just finished these two QC tasks run concurrently.

16798843 12932712 456812 27 Dec 2017 | 3:00:54 UTC 28 Dec 2017 | 7:57:47 UTC Completed and validated 26,162.47 103,683.40 4,277.76 Quantum Chemistry v3.14 (mt)
16798838 12932759 456812 27 Dec 2017 | 2:58:51 UTC 28 Dec 2017 | 7:57:47 UTC Completed and validated 26,201.39 103,805.90 4,284.12 Quantum Chemistry v3.14 (mt)

I started them within a minute of each other using 4 cores each. I also had 3 Einstein GPU tasks running concurrently with them. The system is an AMD Ryzen 1800X (8 cores, 16 threads) with three Nvidia GTX 970s.

Didn't appear to have any problems. Tasks ran right through with about 70% CPU utilization.
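
As a quick sanity check on the figures above, dividing CPU time by run time gives the average number of cores a task kept busy. The sketch assumes the two numeric columns before the credit are run time and CPU time in seconds (the usual layout of the results table); the function name `average_busy_cores` is made up for this example.

```python
def average_busy_cores(run_time_s, cpu_time_s):
    """Average number of cores kept busy: CPU seconds per wall-clock second."""
    if run_time_s <= 0:
        raise ValueError("run time must be positive")
    return cpu_time_s / run_time_s


# Values taken from the two result lines above.
for run_time, cpu_time in [(26162.47, 103683.40), (26201.39, 103805.90)]:
    print(round(average_busy_cores(run_time, cpu_time), 2))
```

Both tasks come out just under 4, which matches the 4 cores allotted to each.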

Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Send message
Joined: 9 Dec 08
Posts: 619
Credit: 4,273,184
RAC: 0
Level
Ala
Scientific publications
watwatwatwat
Message 48521 - Posted: 28 Dec 2017 | 9:11:41 UTC - in response to Message 48520.

Thanks @keith, thanks @starbase!

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 30
Credit: 2,797,089
RAC: 85,230
Level
Ala
Scientific publications
wat
Message 48527 - Posted: 28 Dec 2017 | 17:31:07 UTC - in response to Message 48521.
Last modified: 28 Dec 2017 | 17:31:41 UTC

Toni, well there's your answer if you discount the low population sample. I doubt that two successful runs were due to the brand of cpu, but could be wrong. As long as you are using a Client later than 7.0.40 you can use an app_config.xml file to tune the number of cores you allow the task to run on.

Jim1348
Joined: 28 Jul 12
Posts: 490
Credit: 1,131,369,187
RAC: 11,122
Level
Met
Message 48532 - Posted: 29 Dec 2017 | 0:16:02 UTC - in response to Message 48527.
Last modified: 29 Dec 2017 | 0:19:27 UTC

My Ryzen 1700 machine has certainly done better than my two i7-3770 PCs (all machines on Ubuntu, run 24/7 and otherwise set up the same):

Ryzen 1700: http://www.gpugrid.net/results.php?hostid=452287

i7-3770:
http://www.gpugrid.net/results.php?hostid=433866
http://www.gpugrid.net/results.php?hostid=448995

EDIT: My i7-4770 PC also tended to hang or otherwise fail.
http://www.gpugrid.net/results.php?hostid=357332

I have often gotten hung work units on the Intel machines, but seldom or never on the AMD. And looking around at the other users who fail the work units, they seem to be predominantly Intel, while the ones that succeed seem to be AMD, though I have not done a count myself. Presumably Toni can get those figures.

Keith Myers
Joined: 13 Dec 17
Posts: 30
Credit: 2,797,089
RAC: 85,230
Level
Ala
Message 48534 - Posted: 29 Dec 2017 | 2:22:44 UTC

Thanks for the post. Interesting. I could discount the fact that the Ryzens have 8 actual physical cores, and so on paper a good head start over the 4-core Intel CPUs. But the FX-9350 earlier in the thread also had a good result, with only 4 physical cores and much more handicapped FFT registers in its modules compared to Ryzen and Intel. Need a lot more samples to definitively clarify I think.

Jim1348
Joined: 28 Jul 12
Posts: 490
Credit: 1,131,369,187
RAC: 11,122
Level
Met
Message 48535 - Posted: 29 Dec 2017 | 2:38:32 UTC - in response to Message 48534.

Need a lot more samples to definitively clarify I think.

Yes. I am not looking at the output, but really only the error rate. All the machines are now running two cores per work unit, and only one work unit per machine, though earlier I had been running four cores on the AMD machine. And both the Intel and AMD cores have about the same speed, so the output should be comparable anyway now.

I have changed one of the i7-3770 machines (GTX-1070-PC) from Ubuntu 17.10 (and BOINC 7.8.3) back to Ubuntu 16.04 (and BOINC 7.6.31).
I doubt that it will make much difference, but I will let it run for a couple of weeks. If I continue to get more errors on Intel, I think I will go with just the Ryzen PC.

Keith Myers
Joined: 13 Dec 17
Posts: 30
Credit: 2,797,089
RAC: 85,230
Level
Ala
Message 48537 - Posted: 29 Dec 2017 | 8:07:09 UTC - in response to Message 48535.

Good experiment. I was just thinking that spreading the compute load over 4 of a Ryzen's cores is less demanding than a 100% workload over all 4 cores on Intel. If the test on two cores and on one core is equally stable, just slower, it might suggest something bothersome about the Intel architecture. The different OS platform could have a big effect too.

STARBASEn
Joined: 17 Feb 09
Posts: 25
Credit: 349,637,412
RAC: 1,347,315
Level
Asp
Message 48550 - Posted: 29 Dec 2017 | 20:52:48 UTC
Last modified: 29 Dec 2017 | 21:05:30 UTC

It appears one cannot start two or more WUs of this type simultaneously. One of the two errored, as shown below.

16803962 12952984 426610 29 Dec 2017 | 16:16:51 UTC 29 Dec 2017 | 17:15:51 UTC Completed and validated 916.51 6,109.03 35.20 Quantum Chemistry v3.14 (mt)
16803949 12952971 426610 29 Dec 2017 | 16:16:51 UTC 29 Dec 2017 | 17:26:30 UTC Error while computing 5.04 0.00 --- Quantum Chemistry v3.14 (mt)

So I started two more, but with a five-second delay, and both are now happily processing together on one of my FX-8350s, 4 cores each. With that in mind, the BOINC client cannot be relied upon to run these CPU jobs unattended in a split-core configuration (several WUs at once), since two or more could start simultaneously and cause at least one WU to fail.

Edit: Will try a simultaneous start with my last two WU's to see if this is repeatable.

Edit2: Yep, happened again. Will copy the pair to this post when the one in progress finishes.

STARBASEn
Joined: 17 Feb 09
Posts: 25
Credit: 349,637,412
RAC: 1,347,315
Level
Asp
Message 48551 - Posted: 29 Dec 2017 | 22:02:47 UTC

These are the last two, one of which errored:

16804044 12953066 426610 29 Dec 2017 | 17:28:46 UTC 29 Dec 2017 | 21:02:26 UTC Error while computing 6.06 0.00 --- Quantum Chemistry v3.14 (mt)
16804026 12953048 426610 29 Dec 2017 | 17:28:46 UTC 29 Dec 2017 | 21:24:44 UTC Completed and validated 1,393.35 5,255.89 60.62 Quantum Chemistry v3.14 (mt)

Jim1348
Joined: 28 Jul 12
Posts: 490
Credit: 1,131,369,187
RAC: 11,122
Level
Met
Message 48552 - Posted: 30 Dec 2017 | 12:14:42 UTC - in response to Message 48550.

It appears one cannot start two or more WUs of this type simultaneously.

That is curious. Thanks for the report.

The new DOMINIKs run fine on my two i7-3770s thus far, probably since they are shorter than the TONIs and don't get to the point of hanging up.

And the Ryzen 1700 continues to do well. But there must be some selection process going on at the server, since it is getting only the TONIs. They are all reissues now, but it has handled them all thus far, even a _8. That is a good idea, since it makes optimum use of each CPU type.

If things continue this way, I will just let all the machines run.


Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 9 Dec 08
Posts: 619
Credit: 4,273,184
RAC: 0
Level
Ala
Message 48553 - Posted: 30 Dec 2017 | 14:22:06 UTC - in response to Message 48552.

It looks like running multiple WUs together is OK, but starting them at exactly the same time is not. I hope it is a relatively rare occurrence. In principle I could add a locking mechanism, but I am not enthusiastic, because that would invite more failure modes (e.g. stale locks) to solve a relatively rare case.
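
For illustration only, here is a minimal sketch of one kind of startup lock that sidesteps the stale-lock failure mode: an advisory `flock()` lock, which the operating system releases when the holding process exits or crashes. This is not the project's code. It assumes a Unix host and a local filesystem, and `LOCK_PATH`, `startup_lock`, and `fragile_startup` are hypothetical names standing in for whatever step actually fails on simultaneous starts.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager

# Hypothetical lock file location; the real app may use no lock at all.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "qc_startup.lock")


@contextmanager
def startup_lock(path=LOCK_PATH):
    """Serialize a startup phase across processes with flock()."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        # Blocks until no other process holds the lock. The kernel drops
        # the lock if the holder dies, so a leftover file is harmless.
        fcntl.flock(fd, fcntl.LOCK_EX)
        yield fd
    finally:
        os.close(fd)  # closing the descriptor releases the lock


def fragile_startup():
    """Placeholder for the step that fails on simultaneous starts."""
    return "started"


if __name__ == "__main__":
    with startup_lock():
        print(fragile_startup())
```

The design choice here is that the lock is tied to an open file descriptor rather than to the existence of a file, so a crashed task cannot leave the next one locked out. Whether the startup failure in this thread is something a lock would cure is not established.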

mmonnin
Joined: 2 Jul 16
Posts: 75
Credit: 93,757,492
RAC: 1,426,180
Level
Thr
Message 48554 - Posted: 30 Dec 2017 | 14:27:07 UTC - in response to Message 48553.

It looks like running multiple WUs together is OK, but starting them at exactly the same time is not. I hope it is a relatively rare occurrence. In principle I could add a locking mechanism, but I am not enthusiastic, because that would invite more failure modes (e.g. stale locks) to solve a relatively rare case.


So every new member crunching these will get an error on their 1st task. Genius. It absolutely should be fixed.

Jim1348
Joined: 28 Jul 12
Posts: 490
Credit: 1,131,369,187
RAC: 11,122
Level
Met
Message 48555 - Posted: 30 Dec 2017 | 15:02:08 UTC - in response to Message 48554.
Last modified: 30 Dec 2017 | 15:18:46 UTC

It looks like running multiple WUs together is OK, but starting them at exactly the same time is not. I hope it is a relatively rare occurrence. In principle I could add a locking mechanism, but I am not enthusiastic, because that would invite more failure modes (e.g. stale locks) to solve a relatively rare case.


So every new member crunching these will get an error on their 1st task. Genius. It absolutely should be fixed.

I got two errors at first also, but none since and I did not look at the reason. But it is over and done with, and not a problem.

I think if you look hard at the logic of lock mechanisms, it is a logical impossibility to fix simultaneity problems. That is, any delay you put in will match some other starting situation, and result in an error also. You can try, but I don't think it is worth the effort either.

EDIT: I would look to see if it happens again with the next batch. If so, then I would investigate something, whatever it is. But the errors were very short for me, and no real time lost.

STARBASEn
Joined: 17 Feb 09
Posts: 25
Credit: 349,637,412
RAC: 1,347,315
Level
Asp
Message 48556 - Posted: 30 Dec 2017 | 16:19:59 UTC

I would speculate that the probability of two or more WUs starting at exactly the same time in an unattended environment is low. However, those running multiple projects that compete for the same GPU/CPU resources usually have them switched at a specified time interval to give each project processing time. With that in mind, it is possible for two QC jobs to finish as the BOINC client nears the end of a switch interval, for the client to change to the other project, and then, when it is time to switch back to the GPUGRID CPU WUs with a queue full of QC jobs, for two or more to start simultaneously.

The only way I can see avoiding the possibility of such an error completely would be to keep all cores dedicated to one QC job on unattended machines (headless crunchers in my case). When the next batch is available, I will cut my switch app time down to say 5 minutes and see if I can get a feel for how the client handles things and the possibility of simultaneous starts with the QC jobs. The two jobs that failed due to simultaneous starts both ended with exit code 195 EXIT_CHILD_FAILED. Not sure if that means the app failed to spawn a child thread or close one for one of the two WU's.

Jim1348
Joined: 28 Jul 12
Posts: 490
Credit: 1,131,369,187
RAC: 11,122
Level
Met
Message 48557 - Posted: 30 Dec 2017 | 16:30:49 UTC - in response to Message 48556.

You are right that BOINC is not really a random environment, and if it happens in a more-or-less predictable manner, it should be possible to prevent it. We will see how often that is.

STARBASEn
Joined: 17 Feb 09
Posts: 25
Credit: 349,637,412
RAC: 1,347,315
Level
Asp
Message 48569 - Posted: 31 Dec 2017 | 19:02:23 UTC - in response to Message 48557.

You are right that BOINC is not really a random environment, and if it happens in a more-or-less predictable manner, it should be possible to prevent it. We will see how often that is.


Agreed, predictable is the key. I really need to find out how the projects I crunch for work unattended, because all my little headless crunchers are in the process of being converted to diskless/headless cluster nodes, with one of the FX systems being the master. This should all prove interesting, as this is the first project I've worked with that uses multiple cores for a single WU.

mmonnin
Joined: 2 Jul 16
Posts: 75
Credit: 93,757,492
RAC: 1,426,180
Level
Thr
Message 48571 - Posted: 1 Jan 2018 | 0:34:29 UTC

When the GPUGrid servers send out work, the tasks come in batches. Two tasks end up starting at once since they have short deadlines. I also said new members, so again, multiple tasks start at once. Not everyone runs the same project all the time, which is what would give tasks a chance to get out of sequence.

Jim1348
Joined: 28 Jul 12
Posts: 490
Credit: 1,131,369,187
RAC: 11,122
Level
Met
Message 48578 - Posted: 2 Jan 2018 | 19:34:34 UTC - in response to Message 48535.

I have changed one of the i7-3770 machines (GTX-1070-PC) from Ubuntu 17.10 (and BOINC 7.8.3) back to Ubuntu 16.04 (and BOINC 7.6.31).
I doubt that it will make much difference, but I will let it run for a couple of weeks. If I continue to get more errors on Intel, I think I will go with just the Ryzen PC.

I just completed two TONI work units, one on each of my i7-3770 PCs (2 cores per work unit):
http://www.gpugrid.net/workunit.php?wuid=12932866
http://www.gpugrid.net/workunit.php?wuid=12932333

They had each errored out on other PCs, and I don't know why they worked on mine. But I do know that they each got stuck at 78.698% until I rebooted, and then they completed normally. However, the total Run time shown does not include the time they were stuck, which was about two hours in each case.

This is no way to get work done; I can't be rebooting for each work unit. So I will have to just stop on the i7-3770 machines and continue only with the Ryzen 1700, which continues to work fine.

Note that the new DOMINIK work units are no problem - if they could send only those to the Intel machines, I think the problem would be solved.

Keith Myers
Joined: 13 Dec 17
Posts: 30
Credit: 2,797,089
RAC: 85,230
Level
Ala
Message 48586 - Posted: 3 Jan 2018 | 23:31:07 UTC

I just ran a DOMINIK QC task and it ran very fast.
Task 16815024

Anybody else run one of these yet? I see that there are a TON of them available.

Server Status

PappaLitto
Joined: 21 Mar 16
Posts: 285
Credit: 1,472,261,531
RAC: 4,701,386
Level
Met
Message 48587 - Posted: 3 Jan 2018 | 23:39:37 UTC

50,000 WUs... Holy ****. If only those were GPU WUs

mmonnin
Joined: 2 Jul 16
Posts: 75
Credit: 93,757,492
RAC: 1,426,180
Level
Thr
Message 48588 - Posted: 4 Jan 2018 | 0:14:48 UTC - in response to Message 48587.

50,000 WUs... Holy ****. If only those were GPU WUs


I saw that too and downloaded 4 on a computer I haven't run any of these on. Started at the same time and boom all went to crap. I even tried to pause some to stop the error.

http://www.gpugrid.net/results.php?hostid=458003

STARBASEn
Joined: 17 Feb 09
Posts: 25
Credit: 349,637,412
RAC: 1,347,315
Level
Asp
Message 48589 - Posted: 4 Jan 2018 | 1:02:26 UTC
Last modified: 4 Jan 2018 | 1:05:56 UTC

From what I have learned so far about starting more than one multi-CPU job at a time in a split-core scenario, the jobs need not be started at exactly the same time for one to error. When the starts are not exactly simultaneous, the one started first always errors, and the second one runs to successful completion, as does the next WU started from the queue. In four of four tries with near-simultaneous starts, the one started first ended up failing. Second, the window in which a second start still triggers the failure extends to at least 5 seconds after the first, as tested so far. Third, when the BOINC client switches between projects, the QC WUs observed so far are completed in pairs, leaving no single job in progress (suspended) to stagger the start times. This unfortunate characteristic means that simultaneous (or nearly simultaneous) starts cause an error whenever the client switches back to the GPUGRID CPU jobs with more than one WU in the queue. I guess the only way to prevent this behavior is to not split cores, especially on unattended clients.

Edit: correct my lousy spelling, stupid keyboard :)

Keith Myers
Joined: 13 Dec 17
Posts: 30
Credit: 2,797,089
RAC: 85,230
Level
Ala
Message 48590 - Posted: 4 Jan 2018 | 1:15:38 UTC - in response to Message 48589.

I didn't have that experience with the two TONI tasks I started simultaneously. Or within the 5 second window you described. Both completed successfully. I am limiting core usage to four with an app_config file. Limiting the max_concurrent to 1 now since I also crunch SETI cpu tasks on that computer. I ran the two concurrent jobs when Toni requested users to try that experiment.
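
A minimal sketch of the kind of app_config.xml described here (four cores per task, one task at a time), generated with Python's standard library. The app name `QC` and plan class `mt` are taken from the client log wording "using QC version 314 (mt)" posted in this thread; treating them as the correct values for app_config.xml is an assumption, and `build_app_config` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name="QC", plan_class="mt", ncpus=4, max_concurrent=1):
    """Return app_config.xml content limiting cores and concurrent tasks."""
    root = ET.Element("app_config")

    # <app>: cap how many tasks of this app run at the same time.
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)

    # <app_version>: tell the client how many CPUs to budget per task.
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    ET.SubElement(version, "avg_ncpus").text = str(ncpus)

    return ET.tostring(root, encoding="unicode")


config_xml = build_app_config()
print(config_xml)
```

The output would be saved as app_config.xml in the project's folder under the BOINC data directory (projects/www.gpugrid.net) and picked up after the client re-reads its config files or restarts. Note that `avg_ncpus` only tells the client how many CPUs to budget; whether the app itself then limits its thread count to match is app-specific and not confirmed here.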

Keith Myers
Joined: 13 Dec 17
Posts: 30
Credit: 2,797,089
RAC: 85,230
Level
Ala
Message 48591 - Posted: 4 Jan 2018 | 2:02:50 UTC

Wow, the credit awarded is all over the place for these DOMINIK tasks. Obviously NOT tied to computation time or resources used for compute.

Task 16862848
3715 seconds CPU time Credit awarded 21

Task 16815159
3687 seconds CPU time Credit awarded 161

FredoGuan
Joined: 29 Dec 16
Posts: 2
Credit: 1,397
RAC: 38
Level

Message 48592 - Posted: 4 Jan 2018 | 2:12:21 UTC

I just got a task and it finished fine on a dual E5-2450L server with 32 GB of RAM.
resultid=16815237
Keep developing this, please. This is quite nice and I would really like to see this as part of GPUGRID permanently.

FredoGuan
Joined: 29 Dec 16
Posts: 2
Credit: 1,397
RAC: 38
Level

Message 48593 - Posted: 4 Jan 2018 | 2:57:04 UTC - in response to Message 48592.

resultid=16815470

Dominik
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 15 Dec 17
Posts: 8
Credit: 0
RAC: 0
Level

Message 48597 - Posted: 4 Jan 2018 | 11:10:04 UTC

Hello Keith,

Wow, the credit awarded is all over the place for these DOMINIK tasks. Obviously NOT tied to computation time or resources used for compute.


Did you observe this behavior multiple times? Really strange to be honest.


Thanks for helping out everyone!

mmonnin
Joined: 2 Jul 16
Posts: 75
Credit: 93,757,492
RAC: 1,426,180
Level
Thr
Message 48603 - Posted: 4 Jan 2018 | 13:07:24 UTC - in response to Message 48597.

All tasks were erroring out within 2 minutes because the app uses gcc 5.5.

Installing it got a task to go farther and start running multithreaded. We'll see if it actually completes.

sudo apt-get install gcc-5 g++-5

mmonnin
Joined: 2 Jul 16
Posts: 75
Credit: 93,757,492
RAC: 1,426,180
Level
Thr
Message 48608 - Posted: 4 Jan 2018 | 14:08:09 UTC - in response to Message 48603.

Yup, it completed.
http://www.gpugrid.net/result.php?resultid=16817115

Dominik
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 15 Dec 17
Posts: 8
Credit: 0
RAC: 0
Level

Message 48610 - Posted: 4 Jan 2018 | 14:14:33 UTC

Great! Thank you very much

klepel
Joined: 23 Dec 09
Posts: 143
Credit: 1,869,862,543
RAC: 1,083,757
Level
His
Message 48613 - Posted: 4 Jan 2018 | 16:01:40 UTC - in response to Message 48401.

@klepel - can you try installing gcc (if not already there)?

tks

I tried it yesterday. I installed gcc-5 and gcc-6. And it worked on the computer http://www.gpugrid.net/results.php?hostid=452211

STARBASEn
Joined: 17 Feb 09
Posts: 25
Credit: 349,637,412
RAC: 1,347,315
Level
Asp
Message 48616 - Posted: 4 Jan 2018 | 16:49:17 UTC - in response to Message 48590.

These are the two WU's that were started about 5 seconds apart on an AMD FX-8350 with the first one started failing.

stdoutdae.txt:
03-Jan-2018 17:20:14 [GPUGRID] [css] running e113s22_e86s4p0f123-PABLO_p53_PHEX10P_IDP-0-1-RND2720_0 (0.987 CPUs + 1 NVIDIA GPU)
03-Jan-2018 17:20:14 [GPUGRID] Starting task c00000_00024-DOMINIK_QMML2_m0000000055-0-1-RND3244_0
03-Jan-2018 17:20:14 [GPUGRID] [cpu_sched] Starting task c00000_00024-DOMINIK_QMML2_m0000000055-0-1-RND3244_0 using QC version 314 (mt) in slot 9
03-Jan-2018 17:20:14 [GPUGRID] [css] running c00000_00024-DOMINIK_QMML2_m0000000055-0-1-RND3244_0 (4 CPUs)
03-Jan-2018 17:20:20 [GPUGRID] task c06475_06499-DOMINIK_QMML2_m0000000054-0-1-RND3067_0 resumed by user
03-Jan-2018 17:20:21 [GPUGRID] [css] running e113s22_e86s4p0f123-PABLO_p53_PHEX10P_IDP-0-1-RND2720_0 (0.987 CPUs + 1 NVIDIA GPU)
03-Jan-2018 17:20:21 [GPUGRID] [css] running c00000_00024-DOMINIK_QMML2_m0000000055-0-1-RND3244_0 (4 CPUs)
03-Jan-2018 17:20:21 [GPUGRID] Starting task c06475_06499-DOMINIK_QMML2_m0000000054-0-1-RND3067_0
03-Jan-2018 17:20:21 [GPUGRID] [cpu_sched] Starting task c06475_06499-DOMINIK_QMML2_m0000000054-0-1-RND3067_0 using QC version 314 (mt) in slot 10
03-Jan-2018 17:20:21 [GPUGRID] [css] running c06475_06499-DOMINIK_QMML2_m0000000054-0-1-RND3067_0 (4 CPUs)
03-Jan-2018 17:20:25 [GPUGRID] [sched_op] Deferring communication for 00:01:39
03-Jan-2018 17:20:25 [GPUGRID] [sched_op] Reason: Unrecoverable error for task c00000_00024-DOMINIK_QMML2_m0000000055-0-1-RND3244_0
03-Jan-2018 17:20:25 [GPUGRID] Computation for task c00000_00024-DOMINIK_QMML2_m0000000055-0-1-RND3244_0 finished
03-Jan-2018 17:20:25 [GPUGRID] [css] running e113s22_e86s4p0f123-PABLO_p53_PHEX10P_IDP-0-1-RND2720_0 (0.987 CPUs + 1 NVIDIA GPU)
03-Jan-2018 17:20:25 [GPUGRID] [css] running c06475_06499-DOMINIK_QMML2_m0000000054-0-1-RND3067_0 (4 CPUs)
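
As a rough check of the timing in the log above, the sketch below parses three of the posted lines and measures the gap between the two "Starting task" events and the time from the first start to its failure. The helper name `timestamp` is hypothetical, and the parsing assumes an English locale for the month abbreviation.

```python
from datetime import datetime

# Three lines copied from the client log posted above.
LOG = """\
03-Jan-2018 17:20:14 [GPUGRID] Starting task c00000_00024-DOMINIK_QMML2_m0000000055-0-1-RND3244_0
03-Jan-2018 17:20:21 [GPUGRID] Starting task c06475_06499-DOMINIK_QMML2_m0000000054-0-1-RND3067_0
03-Jan-2018 17:20:25 [GPUGRID] Computation for task c00000_00024-DOMINIK_QMML2_m0000000055-0-1-RND3244_0 finished
"""


def timestamp(line):
    """Parse the leading 'DD-Mon-YYYY HH:MM:SS' stamp of a log line."""
    return datetime.strptime(line[:20], "%d-%b-%Y %H:%M:%S")


lines = LOG.splitlines()
starts = [timestamp(l) for l in lines if "] Starting task " in l]
finished = [timestamp(l) for l in lines if "] Computation for task " in l]

# Gap between the two task starts, and time from first start to its failure.
gap_seconds = (starts[1] - starts[0]).total_seconds()
seconds_to_failure = (finished[0] - starts[0]).total_seconds()

print(f"gap between starts: {gap_seconds:.0f} s")
print(f"first task failed {seconds_to_failure:.0f} s after its start")
```

By these log timestamps the two starts are a few seconds apart, and the first task is reported as failed shortly after the second one starts.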


I didn't have that experience with the two TONI tasks I started simultaneously. Or within the 5 second window you described. Both completed successfully. I am limiting core usage to four with an app_config file. Limiting the max_concurrent to 1 now since I also crunch SETI cpu tasks on that computer. I ran the two concurrent jobs when Toni requested users to try that experiment.


That this hasn't been the case with your Intels implies it could be a CPU-related phenomenon (architecture/scheduling differences). Perhaps the Intels can handle the initial start-up processes faster than the FX-series AMD (I might spring for a Ryzen 7 soon just to check them as well). Regardless, the issue is resolved on my systems by limiting concurrent QC jobs to one and using the other four cores to run WCG, as to date I have not experienced a concurrency issue with the WCG WUs.

Keith Myers
Joined: 13 Dec 17
Posts: 30
Credit: 2,797,089
RAC: 85,230
Level
Ala
Message 48617 - Posted: 4 Jan 2018 | 18:14:01 UTC - in response to Message 48597.

Hello Keith,

Wow, the credit awarded is all over the place for these DOMINIK tasks. Obviously NOT tied to computation time or resources used for compute.


Did you observe this behavior multiple times? Really strange to be honest.


Thanks for helping out everyone!

Yes, the first completed tasks got reasonable credit. Then when I downloaded more, all the credit for them nosedived. Once I saw that they weren't worth running I set NNT.




    16862848 12962574 456812 3 Jan 2018 | 23:38:10 UTC 4 Jan 2018 | 1:20:46 UTC Completed and validated 1,000.69 3,715.60 21.12 Quantum Chemistry v3.14 (mt)
    16815333 12963093 456812 4 Jan 2018 | 1:00:00 UTC 4 Jan 2018 | 2:57:45 UTC Completed and validated 1,020.00 3,812.63 27.38 Quantum Chemistry v3.14 (mt)
    16815332 12963092 456812 4 Jan 2018 | 0:59:23 UTC 4 Jan 2018 | 2:40:47 UTC Completed and validated 990.38 3,697.44 25.78 Quantum Chemistry v3.14 (mt)
    16815320 12963080 456812 4 Jan 2018 | 1:00:37 UTC 4 Jan 2018 | 3:14:28 UTC Completed and validated 997.66 3,731.61 27.28 Quantum Chemistry v3.14 (mt)
    16815307 12963067 456812 4 Jan 2018 | 1:07:01 UTC 4 Jan 2018 | 4:04:01 UTC Completed and validated 1,009.28 3,642.48 26.40 Quantum Chemistry v3.14 (mt)
    16815275 12963035 456812 4 Jan 2018 | 1:07:38 UTC 4 Jan 2018 | 4:21:09 UTC Completed and validated 1,033.63 3,668.31 26.52 Quantum Chemistry v3.14 (mt)
    16815264 12963024 456812 4 Jan 2018 | 1:06:24 UTC 4 Jan 2018 | 3:46:54 UTC Completed and validated 969.15 3,592.69 25.72 Quantum Chemistry v3.14 (mt)
    16815248 12963008 456812 4 Jan 2018 | 0:58:46 UTC 4 Jan 2018 | 2:24:17 UTC Completed and validated 935.31 3,503.52 23.59 Quantum Chemistry v3.14 (mt)
    16815234 12962994 456812 4 Jan 2018 | 1:05:48 UTC 4 Jan 2018 | 3:30:44 UTC Completed and validated 981.30 3,616.45 26.43 Quantum Chemistry v3.14 (mt)
    16815171 12962931 456812 3 Jan 2018 | 23:37:35 UTC 4 Jan 2018 | 1:04:10 UTC Completed and validated 929.82 3,523.44 18.45 Quantum Chemistry v3.14 (mt)
    16815159 12962919 456812 3 Jan 2018 | 23:35:43 UTC 4 Jan 2018 | 0:16:22 UTC Completed and validated 986.88 3,687.96 161.36 Quantum Chemistry v3.14 (mt)
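
To put a number on the spread, here is a small sketch that converts three of the rows above into credit per CPU-hour. The task IDs, CPU times, and credits are copied from the list; `credit_per_cpu_hour` is a hypothetical helper, and CPU time is used as the denominator only because it is the figure quoted earlier in the thread.

```python
# (task id, CPU time in seconds, credit) copied from the rows above
TASKS = [
    ("16862848", 3715.60, 21.12),
    ("16815171", 3523.44, 18.45),
    ("16815159", 3687.96, 161.36),
]


def credit_per_cpu_hour(cpu_seconds, credit):
    """Normalize awarded credit to one hour of CPU time."""
    return credit / cpu_seconds * 3600.0


rates = {task: credit_per_cpu_hour(cpu, credit) for task, cpu, credit in TASKS}
for task, rate in rates.items():
    print(f"task {task}: {rate:.1f} credits per CPU-hour")

# Ratio between the best-paid and worst-paid task in this sample.
spread = max(rates.values()) / min(rates.values())
print(f"highest rate is {spread:.1f}x the lowest")
```

With nearly identical CPU times, the difference between tasks comes almost entirely from the credit column.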

klepel
Joined: 23 Dec 09
Posts: 143
Credit: 1,869,862,543
RAC: 1,083,757
Level
His
Message 48618 - Posted: 4 Jan 2018 | 19:31:13 UTC
Last modified: 4 Jan 2018 | 19:36:13 UTC

I have to report back on the AMD Ryzen 1700x Computer: http://www.gpugrid.net/results.php?hostid=420971

If I run 3 instances, the WUs crash after about 200 seconds, and after that the computer crashes completely. If I run a single instance (one WU), the computer runs without any problem.

However, as BOINC downloads several of these Quantum Chemistry v3.14 (mt) WUs, BOINC thinks my CPU cache is full and refuses to download additional CPU WUs from PrimeGrid. So after a while the CPU is only loaded with one QC WU (4 threads) and the rest of the cores are idle - not very efficient.

Sorry, Dominik, under these circumstances I cannot download any more QC WUs for this computer. I will gladly come back once we can process several multicore WUs simultaneously without problems.

STARBASEn
Joined: 17 Feb 09
Posts: 25
Credit: 349,637,412
RAC: 1,347,315
Level
Asp
Message 48619 - Posted: 4 Jan 2018 | 19:45:44 UTC - in response to Message 48618.
Last modified: 4 Jan 2018 | 20:02:00 UTC

However, as BOINC downloads several of these Quantum Chemistry v3.14 (mt) WUs, BOINC thinks my CPU cache is full and refuses to download additional CPU WUs from PrimeGrid. So after a while the CPU is only loaded with one QC WU (4 threads) and the rest of the cores are idle - not very efficient.


If you temporarily suspend the QC jobs (except perhaps the one in progress), you should download more work from your other projects and once downloaded, resume the QC jobs and let boinc take over running the various projects as you have them configured. You may need to "update" the projects you want more work from under the "projects" tab to initiate the downloads right away.

Edit: Close quote and add last sentence.

Keith Myers
Joined: 13 Dec 17
Posts: 30
Credit: 2,797,089
RAC: 85,230
Level
Ala
Message 48624 - Posted: 5 Jan 2018 | 2:50:48 UTC

Seti@home has been down all day so ran out of work. Decided to give the QC Dominik tasks another try. Thought that possibly the last batch were unique or one-offs or something different about the computer from when I first ran them.

Nope. Even worse credit for the batch I ran this afternoon. Credits awarded = 6.

If you expect to get anybody to want to run these, you are going to have to make them more appealing, credit-wise at least. For me, not worth the electricity to run them. Would rather let the computer go cold and give my power bill a temporary reprieve.

mmonnin
Joined: 2 Jul 16
Posts: 75
Credit: 93,757,492
RAC: 1,426,180
Level
Thr
Message 48632 - Posted: 5 Jan 2018 | 11:58:50 UTC
Last modified: 5 Jan 2018 | 12:07:27 UTC

Yeah credit took a dump on the last two I completed.

Run Time-----CPU Time-----Credit
2,389.43-----26,206.90-----549.62
3,034.92-----36,710.33-----94.32

bormolino
Joined: 16 May 13
Posts: 22
Credit: 18,763,866
RAC: 21,753
Level
Pro
Message 48639 - Posted: 5 Jan 2018 | 20:01:49 UTC

I got 5.97 points for 32 minutes calculation on my fx 6100.

That's ridiculous!

Keith Myers
Joined: 13 Dec 17
Posts: 30
Credit: 2,797,089
RAC: 85,230
Level
Ala
Message 48642 - Posted: 5 Jan 2018 | 21:16:29 UTC

If you care to learn about why the low credit or why certain tasks get hi-middle-low credit, read my post here

Keith Myers
Joined: 13 Dec 17
Posts: 30
Credit: 2,797,089
RAC: 85,230
Level
Ala
Message 48643 - Posted: 5 Jan 2018 | 21:21:52 UTC

The project is using the "old" BOINC credit algorithm. If I remember how it works, the very first few tasks on a new application that BOINC sees gives very high credit. Then when more tasks are returned the algorithm tunes the credit downwards. The APR for the application stabilizes after 11 valid tasks have been returned. This project utilizes the application APR function of BOINC. It is up to each project to decide whether to follow BOINC standards or write their own.

So far I have seen credit awards of 160, 21 and 7. That must be because those were 3 different molecules.

STARBASEn
Joined: 17 Feb 09
Posts: 25
Credit: 349,637,412
RAC: 1,347,315
Level
Asp
Message 48653 - Posted: 7 Jan 2018 | 1:31:32 UTC
Last modified: 7 Jan 2018 | 1:37:35 UTC

Seems to me that credit should be based on CPU time only. One WU should be worth a specific amount of credit regardless of how fast or slowly processed. Probably require a large bureaucratic committee to attempt to quantify such a value but that's my 2 cents and maybe it would be worth it if we get an answer within a few years :).

Edit: Added last sentence, sorry, couldn't resist.

Keith Myers
Joined: 13 Dec 17
Posts: 30
Credit: 2,797,089
RAC: 85,230
Level
Ala
Message 48654 - Posted: 7 Jan 2018 | 1:47:05 UTC - in response to Message 48653.

Seems to me that credit should be based on CPU time only. One WU should be worth a specific amount of credit regardless of how fast or slowly processed. Probably require a large bureaucratic committee to attempt to quantify such a value but that's my 2 cents and maybe it would be worth it if we get an answer within a few years :).

Edit: Added last sentence, sorry, couldn't resist.

That's how Einstein handles credit. They don't use the BOINC algorithm and decide how much credit any application awards independent of run_time.
