something changed linux: chemestry all erroring out

Message boards : Number crunching : something changed linux: chemestry all erroring out

Author	Message
JStateson Send message Joined: 31 Oct 08 Posts: 186 Credit: 3,331,546,800 RAC: 0 Level Scientific publications	Message 49426 - Posted: 10 May 2018 \| 7:01:19 UTC
	I re-enabled quantum chemistry on 3 of my Linux systems for the first time in over a month and it appears all are going to error out. Stderr output <core_client_version>7.8.3</core_client_version> <![CDATA[ <message> process exited with code 195 (0xc3, -61)</message> <stderr_txt> 01:21:33 (24790): wrapper (7.7.26016): starting 01:21:33 (24790): wrapper (7.7.26016): starting 01:21:33 (24790): wrapper: running ../../projects/www.gpugrid.net/Miniconda3-4.3.30-Linux-x86_64.sh (-b -u -p /var/lib/boinc-client/projects/www.gpugrid.net/miniconda) Python 3.6.3 :: Anaconda, Inc. cat: /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/.messages.txt: No such file or directory 01:21:45 (24790): miniconda-installer exited; CPU time 10.905773 01:21:45 (24790): wrapper: running /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/python (pre_script.py) CondaError: FileNotFoundError(2, 'No such file or directory') CondaError: FileNotFoundError(2, 'No such file or directory') CondaError: FileNotFoundError(2, 'No such file or directory') Traceback (most recent call last): File "pre_script.py", line 13, in <module> raise Exception("Error installing h5py") Exception: Error installing h5py 01:25:43 (24790): $PROJECT_DIR/miniconda/bin/python exited; CPU time 90.551737 01:25:43 (24790): app exit status: 0x1 01:25:43 (24790): called boinc_finish(195) </stderr_txt> ]]> This should have worked. These were 16 thread dual Xeon and at one time I recall having 14 threads assigned. Now I see only 4 and something is not working as before. These are small systems with 32gb flash drives and running 16.10 but that was not a problem before.
	ID: 49426 \| Rating: 0 \| rate: / Reply Quote

tullio Send message Joined: 8 May 18 Posts: 190 Credit: 104,426,808 RAC: 0 Level Scientific publications	Message 49431 - Posted: 10 May 2018 \| 15:40:15 UTC
	All QC tasks error on my 2 Linux boxes, one with Opteron 1210 and the other AMD E-450, both running SuSE Linux Leap 42.3. Tullio
	ID: 49431 \| Rating: 0 \| rate: / Reply Quote

PappaLitto Send message Joined: 21 Mar 16 Posts: 511 Credit: 4,652,742,755 RAC: 2,555,673 Level Scientific publications	Message 49434 - Posted: 10 May 2018 \| 16:50:56 UTC - in response to Message 49431.
	All QC tasks error on my 2 Linux boxes, one with Opteron 1210 and the other AMD E-450, both running SuSE Linux Leap 42.3. Tullio They might error at first but if you let it run it might end up working.
	ID: 49434 \| Rating: 0 \| rate: / Reply Quote

Retvari Zoltan Send message Joined: 20 Jan 09 Posts: 2343 Credit: 16,201,255,749 RAC: 851 Level Scientific publications	Message 49435 - Posted: 10 May 2018 \| 16:58:24 UTC - in response to Message 49426. Last modified: 10 May 2018 \| 16:59:26 UTC
	These were 16 thread dual Xeon and at one time I recall having 14 threads assigned. Now I see only 4 and something is not working as before. These are small systems with 32gb flash drives and running 16.10 but that was not a problem before. There's a bug in the app, which prevents more than 1 task starting simultaneously. When the task started successfully, you can start another task manually. Since there's no automated way for this, you should pause all but one task before you shut down your computer, then start them one by one. The other option is to limit the concurrently running QC apps to 1. Since this app uses only 4 threads (cores) you should utilize your other CPU cores with a different project. To do this you should create / modify your app_config.xml file in the projects\www.gpugrid.net folder. <app_config> <app> <name>QC</name> <max_concurrent>1</max_concurrent> </app> <app_version> <app_name>QC</app_name> <plan_class>mt</plan_class> <avg_ncpus>4</avg_ncpus> </app_version> </app_config>
	ID: 49435 \| Rating: 0 \| rate: / Reply Quote

tullio Send message Joined: 8 May 18 Posts: 190 Credit: 104,426,808 RAC: 0 Level Scientific publications	Message 49436 - Posted: 10 May 2018 \| 17:17:33 UTC - in response to Message 49434. Last modified: 10 May 2018 \| 17:23:14 UTC
	All QC tasks error on my 2 Linux boxes, one with Opteron 1210 and the other AMD E-450, both running SuSE Linux Leap 42.3. Tullio They might error at first but if you let it run it might end up working. I've run nine of them, all failed. Tullio
	ID: 49436 \| Rating: 0 \| rate: / Reply Quote

PappaLitto Send message Joined: 21 Mar 16 Posts: 511 Credit: 4,652,742,755 RAC: 2,555,673 Level Scientific publications	Message 49438 - Posted: 10 May 2018 \| 17:49:15 UTC - in response to Message 49436.
	All QC tasks error on my 2 Linux boxes, one with Opteron 1210 and the other AMD E-450, both running SuSE Linux Leap 42.3. Tullio They might error at first but if you let it run it might end up working. I've run nine of them, all failed. Tullio I've had 9 fail consecutively and eventually they start running without error. Leave it running for say an hour and see if it works. It all has to do with the consecutive start bug.
	ID: 49438 \| Rating: 0 \| rate: / Reply Quote

JStateson Send message Joined: 31 Oct 08 Posts: 186 Credit: 3,331,546,800 RAC: 0 Level Scientific publications	Message 49443 - Posted: 10 May 2018 \| 20:42:13 UTC - in response to Message 49435.
	These were 16 thread dual Xeon and at one time I recall having 14 threads assigned. Now I see only 4 and something is not working as before. These are small systems with 32gb flash drives and running 16.10 but that was not a problem before. There's a bug in the app, which prevents more than 1 task starting simultaneously. When the task started successfully, you can start another task manually. Since there's no automated way for this, you should pause all but one task before you shut down your computer, then start them one by one. The other option is to limit the concurrently running QC apps to 1. Since this app uses only 4 threads (cores) you should utilize your other CPU cores with a different project. To do this you should create / modify your app_config.xml file in the projects\www.gpugrid.net folder. <app_config> <app> <name>QC</name> <max_concurrent>1</max_concurrent> </app> <app_version> <app_name>QC</app_name> <plan_class>mt</plan_class> <avg_ncpus>4</avg_ncpus> </app_version> </app_config> I wonder if this is similar to a problem I have at MilkyWay. My AMD S9100 runs FP64 at 2600 flops, triple what an HD7950 can do (717), but the more concurrent programs I run, exponentially more "invalid" work units are generated. I keep the number of concurent work units to only 3 to prevent too many invalids from haveing to be farmed out to other wingmen. OTOH, my genuine HD7950s can process 5 with not a single invalid. Other users report the same exact problem with S9150. I will try your app_config with 14 thread instead of 4 and see what happens. I have 14 available.
	ID: 49443 \| Rating: 0 \| rate: / Reply Quote

JStateson Send message Joined: 31 Oct 08 Posts: 186 Credit: 3,331,546,800 RAC: 0 Level Scientific publications	Message 49444 - Posted: 10 May 2018 \| 21:16:33 UTC - in response to Message 49435.
	<app_config> <app> <name>QC</name> <max_concurrent>1</max_concurrent> </app> <app_version> <app_name>QC</app_name> <plan_class>mt</plan_class> <avg_ncpus>4</avg_ncpus> </app_version> </app_config> Nope, did not work for me. I suspended all projects then allowed only one 14 cpu task to run and it quit with an error. I then allowed another still same problem. I changed the 4 above to 14 as that had worked before and issued a "read app_config", saw the 14 show up but it crashed. Looks like same error as before, this computer http://www.gpugrid.net/results.php?hostid=472211
	ID: 49444 \| Rating: 0 \| rate: / Reply Quote

tullio Send message Joined: 8 May 18 Posts: 190 Credit: 104,426,808 RAC: 0 Level Scientific publications	Message 49445 - Posted: 11 May 2018 \| 0:12:18 UTC
	I've downloaded 2 tasks on the HP laptop which has no GPU board, so it won't download GPU tasks. The first erred after 1'51". It is running a single core Atlas task with VirtualBox. Tullio
	ID: 49445 \| Rating: 0 \| rate: / Reply Quote

JStateson Send message Joined: 31 Oct 08 Posts: 186 Credit: 3,331,546,800 RAC: 0 Level Scientific publications	Message 49446 - Posted: 11 May 2018 \| 1:45:27 UTC - in response to Message 49445.
	I've downloaded 2 tasks on the HP laptop which has no GPU board, so it won't download GPU tasks. The first erred after 1'51". It is running a single core Atlas task with VirtualBox. Tullio Is VirtualBox required for this project? Looking around I don't see a list of applications with requirements like at other projects. I might have overlooked the app list. Maybe that is my problem
	ID: 49446 \| Rating: 0 \| rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1576 Credit: 5,797,161,851 RAC: 9,340,312 Level Scientific publications	Message 49447 - Posted: 11 May 2018 \| 6:52:33 UTC - in response to Message 49446.
	https://www.gpugrid.net/apps.php
	ID: 49447 \| Rating: 0 \| rate: / Reply Quote

JStateson Send message Joined: 31 Oct 08 Posts: 186 Credit: 3,331,546,800 RAC: 0 Level Scientific publications	Message 49452 - Posted: 11 May 2018 \| 13:03:47 UTC - in response to Message 49447. Last modified: 11 May 2018 \| 13:08:04 UTC
	https://www.gpugrid.net/apps.php Thanks Richard! The hyperlink at "... has two types of apps: ACEMD ...." is not obvious in Edge on the system I am using. I even looked in the "Donate" page for the application list. When I looked for it useing Firefox on my Linux system that has a better monitor the hyperlink showed up immediately. Earlier I had found gpugrid was listed as using VirtualBox but this problem system was Linux and not Windows. Looking through that list you found for me, I do not see VirtualBox listed explicitly. Maybe the failures are because I enabled "test apps" ? The last time I ran QC successfully was when i was using GRCPOOL which manages configurations. I no longer use GRCPOOL and am setting parameters myself through venue which is not possible when using that pool manager.
	ID: 49452 \| Rating: 0 \| rate: / Reply Quote

tullio Send message Joined: 8 May 18 Posts: 190 Credit: 104,426,808 RAC: 0 Level Scientific publications	Message 49456 - Posted: 11 May 2018 \| 16:11:36 UTC
	All LHC@home projects use VirtualBox save SixTrack. I am running only Atlas@home on my 3 PCs. two Linux and one Windows 10 since all other VirtualBox projects fail, for reasons nobody explained to me. I was one of the Alpha tester at Test4Theory@home on invitation by dr.Ben Segal of CERN, a member of the Internet Hall of Fame. Tullio
	ID: 49456 \| Rating: 0 \| rate: / Reply Quote

Michael H.W. Weber Send message Joined: 9 Feb 16 Posts: 71 Credit: 607,916,391 RAC: 0 Level Scientific publications	Message 49559 - Posted: 28 May 2018 \| 12:29:30 UTC Last modified: 28 May 2018 \| 12:31:34 UTC
	Same problem over here. I ran QC tasks successfully for several months. Then, they suddenly all errored out. The system was unchanged. I checked a second system: Same issue. The .xml file shown above was there all the time, it does not prevent the problem. My suspicion is: If another BOINC CPU project is allowed to fetch and compute tasks on the same machine, it appears that GPUGRID occupies just one core and then fails to free-up the other cores from the tasks of the other project to start MT computation. Bafflingly, this worked without any problems around 4 weeks ago (probably the auxiliary apps were different than they are at present). So, to me there seems to be an issue in the app - maybe with task priorities? Please correct that, I can't support the project as long as this bug hasn't been resolved and for now I have withdrawn my machines. Michael. ____________ President of Rechenkraft.net - Germany's first and largest distributed computing organization.
	ID: 49559 \| Rating: 0 \| rate: / Reply Quote

Toni Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 Level Scientific publications	Message 49561 - Posted: 28 May 2018 \| 14:11:45 UTC - in response to Message 49559.
	Please provide the id of a failed task. We have been debugging one specific issue (see the other thread).
	ID: 49561 \| Rating: 0 \| rate: / Reply Quote

Michael H.W. Weber Send message Joined: 9 Feb 16 Posts: 71 Credit: 607,916,391 RAC: 0 Level Scientific publications	Message 49562 - Posted: 28 May 2018 \| 16:00:58 UTC - in response to Message 49561. Last modified: 28 May 2018 \| 16:02:19 UTC
	Please provide the id of a failed task. 3 examples of an i5 machine: http://www.gpugrid.net/result.php?resultid=17650661 http://www.gpugrid.net/result.php?resultid=17650648 http://www.gpugrid.net/result.php?resultid=17650310 3 examples of an i7 machine: http://www.gpugrid.net/result.php?resultid=17465735 http://www.gpugrid.net/result.php?resultid=17465712 http://www.gpugrid.net/result.php?resultid=17464261 We have been debugging one specific issue (see the other thread). Well, these errors occurred this morning around 10:30 am (i5 machine). So, if you haven't fixed anything thereafter, the bug is still alive... Please note that most wing-men attempts to complete these tasks successfully failed as well. In two cases, however, the task was completed by another machine (often after multiple failures), but I could not identify a similarity in these machines which is making them unique compared to my own settings or those of the other failing computers. Which of the many other threads relevant to this issue exactly do you mean? Michael. ____________ President of Rechenkraft.net - Germany's first and largest distributed computing organization.
	ID: 49562 \| Rating: 0 \| rate: / Reply Quote

captainjack Send message Joined: 9 May 13 Posts: 171 Credit: 2,370,379,288 RAC: 2,399,417 Level Scientific publications	Message 49563 - Posted: 28 May 2018 \| 18:28:32 UTC
	Michael asked Which of the many other threads relevant to this issue exactly do you mean? https://www.gpugrid.net/forum_thread.php?id=4750&nowrap=true#49560
	ID: 49563 \| Rating: 0 \| rate: / Reply Quote

Michael H.W. Weber Send message Joined: 9 Feb 16 Posts: 71 Credit: 607,916,391 RAC: 0 Level Scientific publications	Message 49564 - Posted: 28 May 2018 \| 19:28:32 UTC
	So, where is the ultimate solution to the problem? Unfortunately, I couldn't find it in the linked thread. Michael. ____________ President of Rechenkraft.net - Germany's first and largest distributed computing organization.
	ID: 49564 \| Rating: 0 \| rate: / Reply Quote

mmonnin Send message Joined: 2 Jul 16 Posts: 332 Credit: 4,049,421,065 RAC: 13,256,134 Level Scientific publications	Message 49565 - Posted: 29 May 2018 \| 12:49:49 UTC - in response to Message 49564.
	So, where is the ultimate solution to the problem? Unfortunately, I couldn't find it in the linked thread. Michael. Sounds like its on server/admin side and nothing something a user can fix since tasks were canceled.
	ID: 49565 \| Rating: 0 \| rate: / Reply Quote

Michael H.W. Weber Send message Joined: 9 Feb 16 Posts: 71 Credit: 607,916,391 RAC: 0 Level Scientific publications	Message 49610 - Posted: 6 Jun 2018 \| 9:50:48 UTC
	Looks like the problem is solved. Both of my machines now return valid CPU QC results. Michael. ____________ President of Rechenkraft.net - Germany's first and largest distributed computing organization.
	ID: 49610 \| Rating: 0 \| rate: / Reply Quote

Aurum Send message Joined: 12 Jul 17 Posts: 399 Credit: 13,037,436,882 RAC: 1,314,112 Level Scientific publications	Message 51193 - Posted: 6 Jan 2019 \| 13:47:51 UTC - in response to Message 49435.
	There's a bug in the app, which prevents more than 1 task starting simultaneously. To do this you should create / modify your app_config.xml file in the projects\www.gpugrid.net folder. <app_config> <app> <name>QC</name> <max_concurrent>1</max_concurrent> </app> <app_version> <app_name>QC</app_name> <plan_class>mt</plan_class> <avg_ncpus>4</avg_ncpus> </app_version> </app_config> Have you ever tried using??? <fetch_minimal_work>1</fetch_minimal_work> Fetch one job per device (see --fetch_minimal_work). --fetch_minimal_work Fetch only 1 job per device (CPU, GPU). Used with --exit_when_idle, the client will process one job per device, then exit. [/code]
	ID: 51193 \| Rating: 0 \| rate: / Reply Quote

Post to thread

Message boards : Number crunching : something changed linux: chemestry all erroring out

	About	Science	Volunteers	Performance	Forum	Join us	Donate