
Message boards : Number crunching : something changed linux: chemestry all erroring out

Author Message
Profile BeemerBiker
Joined: 31 Oct 08
Posts: 102
Credit: 1,029,856,253
RAC: 1,802,153
Message 49426 - Posted: 10 May 2018 | 7:01:19 UTC

I re-enabled quantum chemistry on 3 of my Linux systems for the first time in over a month and it appears all are going to error out.

    Stderr output
    <core_client_version>7.8.3</core_client_version>
    <![CDATA[
    <message>
    process exited with code 195 (0xc3, -61)</message>
    <stderr_txt>
    01:21:33 (24790): wrapper (7.7.26016): starting
    01:21:33 (24790): wrapper (7.7.26016): starting
    01:21:33 (24790): wrapper: running ../../projects/www.gpugrid.net/Miniconda3-4.3.30-Linux-x86_64.sh (-b -u -p /var/lib/boinc-client/projects/www.gpugrid.net/miniconda)
    Python 3.6.3 :: Anaconda, Inc.
    cat: /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/.messages.txt: No such file or directory
    01:21:45 (24790): miniconda-installer exited; CPU time 10.905773
    01:21:45 (24790): wrapper: running /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/python (pre_script.py)

    CondaError: FileNotFoundError(2, 'No such file or directory')
    CondaError: FileNotFoundError(2, 'No such file or directory')
    CondaError: FileNotFoundError(2, 'No such file or directory')


    Traceback (most recent call last):
    File "pre_script.py", line 13, in <module>
    raise Exception("Error installing h5py")
    Exception: Error installing h5py
    01:25:43 (24790): $PROJECT_DIR/miniconda/bin/python exited; CPU time 90.551737
    01:25:43 (24790): app exit status: 0x1
    01:25:43 (24790): called boinc_finish(195)

    </stderr_txt>
    ]]>



This should have worked. These are 16-thread dual-Xeon systems, and at one time I recall having 14 threads assigned. Now I see only 4, and something is not working as before. These are small systems with 32 GB flash drives running 16.10, but that was not a problem before.
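
For anyone triaging similar failures, the stderr output above can be reduced to an exit code and a short list of distinct errors. The following is only a minimal sketch in Python: the function name `summarize_stderr`, the regular expressions, and the embedded sample (copied from the log above) are illustrative assumptions, not part of BOINC or GPUGRID.

```python
import re

# Patterns are assumptions based on the log lines shown in this thread.
EXIT_RE = re.compile(r"process exited with code (\d+)")
ERROR_RE = re.compile(r"^\s*(CondaError|Exception): (.*)$", re.MULTILINE)


def summarize_stderr(text: str) -> dict:
    """Pull the exit code and the distinct error lines out of a task's stderr text."""
    exit_match = EXIT_RE.search(text)
    errors = []
    for kind, detail in ERROR_RE.findall(text):
        line = f"{kind}: {detail.strip()}"
        if line not in errors:  # drop repeats, keep first-seen order
            errors.append(line)
    return {
        "exit_code": int(exit_match.group(1)) if exit_match else None,
        "errors": errors,
    }


# Excerpt of the stderr output posted above.
SAMPLE = """\
process exited with code 195 (0xc3, -61)</message>
CondaError: FileNotFoundError(2, 'No such file or directory')
CondaError: FileNotFoundError(2, 'No such file or directory')
    raise Exception("Error installing h5py")
    Exception: Error installing h5py
"""

if __name__ == "__main__":
    print(summarize_stderr(SAMPLE))
```

In this log the summary points at the Miniconda step (the h5py install in pre_script.py) rather than at the science application itself.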

tullio
Joined: 8 May 18
Posts: 36
Credit: 5,145,667
RAC: 119,840
Message 49431 - Posted: 10 May 2018 | 15:40:15 UTC

All QC tasks error on my 2 Linux boxes, one with Opteron 1210 and the other AMD E-450, both running SuSE Linux Leap 42.3.
Tullio

PappaLitto
Joined: 21 Mar 16
Posts: 370
Credit: 2,642,682,385
RAC: 3,450,697
Message 49434 - Posted: 10 May 2018 | 16:50:56 UTC - in response to Message 49431.

All QC tasks error on my 2 Linux boxes, one with Opteron 1210 and the other AMD E-450, both running SuSE Linux Leap 42.3.
Tullio

They might error at first but if you let it run it might end up working.

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 1911
Credit: 12,167,696,719
RAC: 1,864,395
Message 49435 - Posted: 10 May 2018 | 16:58:24 UTC - in response to Message 49426.
Last modified: 10 May 2018 | 16:59:26 UTC

These are 16-thread dual-Xeon systems, and at one time I recall having 14 threads assigned. Now I see only 4, and something is not working as before. These are small systems with 32 GB flash drives running 16.10, but that was not a problem before.

There's a bug in the app that prevents more than one task from starting simultaneously. Once a task has started successfully, you can start another task manually. Since there's no automated way to do this, you should suspend all but one task before you shut down your computer, then resume them one by one. The other option is to limit the number of concurrently running QC tasks to 1. Since this app uses only 4 threads (cores), you should use your other CPU cores for a different project.
To do this, create or modify the app_config.xml file in the projects\www.gpugrid.net folder.

    <app_config>
        <app>
            <name>QC</name>
            <max_concurrent>1</max_concurrent>
        </app>
        <app_version>
            <app_name>QC</app_name>
            <plan_class>mt</plan_class>
            <avg_ncpus>4</avg_ncpus>
        </app_version>
    </app_config>
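
If you would rather generate the file than type it by hand, here is a minimal sketch in Python. The element names come from the XML above; the helper name `build_app_config` and its parameters are made up for illustration, and `ET.indent` needs Python 3.9 or newer. Writing the result into the project folder and asking the client to re-read its config files is left to you.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, max_concurrent: int, avg_ncpus: int,
                     plan_class: str = "mt") -> str:
    """Return app_config.xml text that limits concurrency and CPU count for one app."""
    root = ET.Element("app_config")

    # <app> block: cap how many tasks of this app run at once.
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)

    # <app_version> block: CPU count budgeted per task for the given plan class.
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    ET.SubElement(version, "avg_ncpus").text = str(avg_ncpus)

    ET.indent(root, space="    ")  # pretty-print; Python 3.9+
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_app_config("QC", max_concurrent=1, avg_ncpus=4))
```

Calling `build_app_config("QC", max_concurrent=1, avg_ncpus=4)` reproduces the settings suggested above; changing `avg_ncpus` is how you would try a different thread count.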

tullio
Joined: 8 May 18
Posts: 36
Credit: 5,145,667
RAC: 119,840
Message 49436 - Posted: 10 May 2018 | 17:17:33 UTC - in response to Message 49434.
Last modified: 10 May 2018 | 17:23:14 UTC

All QC tasks error on my 2 Linux boxes, one with Opteron 1210 and the other AMD E-450, both running SuSE Linux Leap 42.3.
Tullio

They might error at first but if you let it run it might end up working.

I've run nine of them, all failed.
Tullio

PappaLitto
Joined: 21 Mar 16
Posts: 370
Credit: 2,642,682,385
RAC: 3,450,697
Message 49438 - Posted: 10 May 2018 | 17:49:15 UTC - in response to Message 49436.

All QC tasks error on my 2 Linux boxes, one with Opteron 1210 and the other AMD E-450, both running SuSE Linux Leap 42.3.
Tullio

They might error at first but if you let it run it might end up working.

I've run nine of them, all failed.
Tullio

I've had 9 fail consecutively and eventually they start running without error. Leave it running for say an hour and see if it works. It all has to do with the consecutive start bug.

Profile BeemerBiker
Joined: 31 Oct 08
Posts: 102
Credit: 1,029,856,253
RAC: 1,802,153
Message 49443 - Posted: 10 May 2018 | 20:42:13 UTC - in response to Message 49435.

These are 16-thread dual-Xeon systems, and at one time I recall having 14 threads assigned. Now I see only 4, and something is not working as before. These are small systems with 32 GB flash drives running 16.10, but that was not a problem before.

There's a bug in the app that prevents more than one task from starting simultaneously. Once a task has started successfully, you can start another task manually. Since there's no automated way to do this, you should suspend all but one task before you shut down your computer, then resume them one by one. The other option is to limit the number of concurrently running QC tasks to 1. Since this app uses only 4 threads (cores), you should use your other CPU cores for a different project.
To do this, create or modify the app_config.xml file in the projects\www.gpugrid.net folder.

    <app_config>
        <app>
            <name>QC</name>
            <max_concurrent>1</max_concurrent>
        </app>
        <app_version>
            <app_name>QC</app_name>
            <plan_class>mt</plan_class>
            <avg_ncpus>4</avg_ncpus>
        </app_version>
    </app_config>


I wonder if this is similar to a problem I have at MilkyWay. My AMD S9100 runs FP64 at 2600 GFLOPS, triple what an HD7950 can do (717), but the more concurrent work units I run, the more "invalid" work units are generated, and the increase looks exponential. I keep the number of concurrent work units to only 3 to prevent too many invalids from having to be farmed out to other wingmen. OTOH, my genuine HD7950s can process 5 with not a single invalid. Other users report the exact same problem with the S9150.

I will try your app_config with 14 threads instead of 4 and see what happens. I have 14 available.

Profile BeemerBiker
Joined: 31 Oct 08
Posts: 102
Credit: 1,029,856,253
RAC: 1,802,153
Message 49444 - Posted: 10 May 2018 | 21:16:33 UTC - in response to Message 49435.

    <app_config>
        <app>
            <name>QC</name>
            <max_concurrent>1</max_concurrent>
        </app>
        <app_version>
            <app_name>QC</app_name>
            <plan_class>mt</plan_class>
            <avg_ncpus>4</avg_ncpus>
        </app_version>
    </app_config>


Nope, did not work for me. I suspended all projects, then allowed only one 14-CPU task to run, and it quit with an error. I then allowed another: still the same problem. I changed the 4 above to 14, as that had worked before, and issued a "read app_config"; I saw the 14 show up, but it crashed. Looks like the same error as before on this computer:
http://www.gpugrid.net/results.php?hostid=472211

tullio
Joined: 8 May 18
Posts: 36
Credit: 5,145,667
RAC: 119,840
Message 49445 - Posted: 11 May 2018 | 0:12:18 UTC

I've downloaded 2 tasks on the HP laptop which has no GPU board, so it won't download GPU tasks. The first erred after 1'51". It is running a single core Atlas task with VirtualBox.
Tullio

Profile BeemerBiker
Joined: 31 Oct 08
Posts: 102
Credit: 1,029,856,253
RAC: 1,802,153
Message 49446 - Posted: 11 May 2018 | 1:45:27 UTC - in response to Message 49445.

I've downloaded 2 tasks on the HP laptop which has no GPU board, so it won't download GPU tasks. The first erred after 1'51". It is running a single core Atlas task with VirtualBox.
Tullio


Is VirtualBox required for this project? Looking around, I don't see a list of applications with requirements like at other projects. I might have overlooked the app list. Maybe that is my problem.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 848
Credit: 1,686,015,745
RAC: 882,860
Message 49447 - Posted: 11 May 2018 | 6:52:33 UTC - in response to Message 49446.

https://www.gpugrid.net/apps.php

Profile BeemerBiker
Joined: 31 Oct 08
Posts: 102
Credit: 1,029,856,253
RAC: 1,802,153
Message 49452 - Posted: 11 May 2018 | 13:03:47 UTC - in response to Message 49447.
Last modified: 11 May 2018 | 13:08:04 UTC

https://www.gpugrid.net/apps.php


Thanks Richard!

The hyperlink at "... has two types of apps: ACEMD ...." is not obvious in Edge on the system I am using. I even looked in the "Donate" page for the application list. When I looked for it using Firefox on my Linux system, which has a better monitor, the hyperlink showed up immediately.

Earlier I had found gpugrid was listed as using VirtualBox, but this problem system was Linux and not Windows.

Looking through that list you found for me, I do not see VirtualBox listed explicitly. Maybe the failures are because I enabled "test apps"?

The last time I ran QC successfully was when I was using GRCPOOL, which manages configurations. I no longer use GRCPOOL and am setting parameters myself through a venue, which is not possible when using that pool manager.

tullio
Joined: 8 May 18
Posts: 36
Credit: 5,145,667
RAC: 119,840
Message 49456 - Posted: 11 May 2018 | 16:11:36 UTC

All LHC@home projects except SixTrack use VirtualBox. I am running only Atlas@home on my 3 PCs (two Linux and one Windows 10), since all other VirtualBox projects fail, for reasons nobody has explained to me. I was one of the alpha testers at Test4Theory@home, at the invitation of Dr. Ben Segal of CERN, a member of the Internet Hall of Fame.
Tullio

Profile Michael H.W. Weber
Joined: 9 Feb 16
Posts: 37
Credit: 358,628,328
RAC: 771,301
Message 49559 - Posted: 28 May 2018 | 12:29:30 UTC
Last modified: 28 May 2018 | 12:31:34 UTC

Same problem over here.
I ran QC tasks successfully for several months. Then, they suddenly all errored out. The system was unchanged. I checked a second system: Same issue.
The .xml file shown above was there all the time; it does not prevent the problem.

My suspicion is:
If another BOINC CPU project is allowed to fetch and compute tasks on the same machine, it appears that GPUGRID occupies just one core and then fails to free up the other cores from the other project's tasks to start MT computation.
Bafflingly, this worked without any problems around 4 weeks ago (probably the auxiliary apps were different than they are at present).
So, to me there seems to be an issue in the app, maybe with task priorities?
Please correct that. I can't support the project as long as this bug hasn't been resolved, and for now I have withdrawn my machines.

Michael.
____________
President of Rechenkraft.net - Germany's first and largest distributed computing organization.

Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 9 Dec 08
Posts: 696
Credit: 4,285,282
RAC: 20
Message 49561 - Posted: 28 May 2018 | 14:11:45 UTC - in response to Message 49559.

Please provide the id of a failed task. We have been debugging one specific issue (see the other thread).

Profile Michael H.W. Weber
Joined: 9 Feb 16
Posts: 37
Credit: 358,628,328
RAC: 771,301
Message 49562 - Posted: 28 May 2018 | 16:00:58 UTC - in response to Message 49561.
Last modified: 28 May 2018 | 16:02:19 UTC

Please provide the id of a failed task.

3 examples from an i5 machine:
http://www.gpugrid.net/result.php?resultid=17650661
http://www.gpugrid.net/result.php?resultid=17650648
http://www.gpugrid.net/result.php?resultid=17650310

3 examples from an i7 machine:
http://www.gpugrid.net/result.php?resultid=17465735
http://www.gpugrid.net/result.php?resultid=17465712
http://www.gpugrid.net/result.php?resultid=17464261

We have been debugging one specific issue (see the other thread).

Well, these errors occurred this morning around 10:30 am (i5 machine). So, if you haven't fixed anything thereafter, the bug is still alive...

Please note that most wingmen's attempts to complete these tasks failed as well. In two cases, however, the task was completed by another machine (often after multiple failures), but I could not identify anything these machines have in common that distinguishes them from my own setup or from the other failing computers.

Which of the many other threads relevant to this issue exactly do you mean?

Michael.

captainjack
Joined: 9 May 13
Posts: 135
Credit: 943,558,285
RAC: 3,302
Message 49563 - Posted: 28 May 2018 | 18:28:32 UTC

Michael asked

Which of the many other threads relevant to this issue exactly do you mean?


https://www.gpugrid.net/forum_thread.php?id=4750&nowrap=true#49560

Profile Michael H.W. Weber
Joined: 9 Feb 16
Posts: 37
Credit: 358,628,328
RAC: 771,301
Message 49564 - Posted: 28 May 2018 | 19:28:32 UTC

So, where is the ultimate solution to the problem?
Unfortunately, I couldn't find it in the linked thread.

Michael.

mmonnin
Joined: 2 Jul 16
Posts: 163
Credit: 261,660,789
RAC: 624,968
Message 49565 - Posted: 29 May 2018 | 12:49:49 UTC - in response to Message 49564.

So, where is the ultimate solution to the problem?
Unfortunately, I couldn't find it in the linked thread.

Michael.


Sounds like it's on the server/admin side and not something a user can fix, since tasks were canceled.

Profile Michael H.W. Weber
Joined: 9 Feb 16
Posts: 37
Credit: 358,628,328
RAC: 771,301
Message 49610 - Posted: 6 Jun 2018 | 9:50:48 UTC

Looks like the problem is solved. Both of my machines now return valid CPU QC results.

Michael.
