Message boards :
Number crunching :
something changed linux: chemestry all erroring out
Message board moderation
| Author | Message |
|---|---|
JStatesonSend message Joined: 31 Oct 08 Posts: 186 Credit: 3,578,903,157 RAC: 0 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]()
|
I re-enabled quantum chemistry on 3 of my Linux systems for the first time in over a month and it appears all are going to error out.
<core_client_version>7.8.3</core_client_version> <![CDATA[ <message> process exited with code 195 (0xc3, -61)</message> <stderr_txt> 01:21:33 (24790): wrapper (7.7.26016): starting 01:21:33 (24790): wrapper (7.7.26016): starting 01:21:33 (24790): wrapper: running ../../projects/www.gpugrid.net/Miniconda3-4.3.30-Linux-x86_64.sh (-b -u -p /var/lib/boinc-client/projects/www.gpugrid.net/miniconda) Python 3.6.3 :: Anaconda, Inc. cat: /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/.messages.txt: No such file or directory 01:21:45 (24790): miniconda-installer exited; CPU time 10.905773 01:21:45 (24790): wrapper: running /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/python (pre_script.py) CondaError: FileNotFoundError(2, 'No such file or directory') CondaError: FileNotFoundError(2, 'No such file or directory') CondaError: FileNotFoundError(2, 'No such file or directory') Traceback (most recent call last): File "pre_script.py", line 13, in <module> raise Exception("Error installing h5py") Exception: Error installing h5py 01:25:43 (24790): $PROJECT_DIR/miniconda/bin/python exited; CPU time 90.551737 01:25:43 (24790): app exit status: 0x1 01:25:43 (24790): called boinc_finish(195) </stderr_txt> ]]>
|
|
Send message Joined: 8 May 18 Posts: 190 Credit: 104,426,808 RAC: 0 Level ![]() Scientific publications
|
All QC tasks error on my 2 Linux boxes, one with Opteron 1210 and the other AMD E-450, both running SuSE Linux Leap 42.3. Tullio |
|
Send message Joined: 21 Mar 16 Posts: 513 Credit: 4,673,458,277 RAC: 0 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]()
|
All QC tasks error on my 2 Linux boxes, one with Opteron 1210 and the other AMD E-450, both running SuSE Linux Leap 42.3. They might error at first but if you let it run it might end up working. |
Retvari ZoltanSend message Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]()
|
These were 16 thread dual Xeon and at one time I recall having 14 threads assigned. Now I see only 4 and something is not working as before. These are small systems with 32gb flash drives and running 16.10 but that was not a problem before. There's a bug in the app, which prevents more than 1 task starting simultaneously. When the task started successfully, you can start another task manually. Since there's no automated way for this, you should pause all but one task before you shut down your computer, then start them one by one. The other option is to limit the concurrently running QC apps to 1. Since this app uses only 4 threads (cores) you should utilize your other CPU cores with a different project. To do this you should create / modify your app_config.xml file in the projects\www.gpugrid.net folder. <app_config>
<app>
<name>QC</name>
<max_concurrent>1</max_concurrent>
</app>
<app_version>
<app_name>QC</app_name>
<plan_class>mt</plan_class>
<avg_ncpus>4</avg_ncpus>
</app_version>
</app_config> |
|
Send message Joined: 8 May 18 Posts: 190 Credit: 104,426,808 RAC: 0 Level ![]() Scientific publications
|
All QC tasks error on my 2 Linux boxes, one with Opteron 1210 and the other AMD E-450, both running SuSE Linux Leap 42.3. I've run nine of them, all failed. Tullio |
|
Send message Joined: 21 Mar 16 Posts: 513 Credit: 4,673,458,277 RAC: 0 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]()
|
All QC tasks error on my 2 Linux boxes, one with Opteron 1210 and the other AMD E-450, both running SuSE Linux Leap 42.3. I've had 9 fail consecutively and eventually they start running without error. Leave it running for say an hour and see if it works. It all has to do with the consecutive start bug. |
JStatesonSend message Joined: 31 Oct 08 Posts: 186 Credit: 3,578,903,157 RAC: 0 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]()
|
These were 16 thread dual Xeon and at one time I recall having 14 threads assigned. Now I see only 4 and something is not working as before. These are small systems with 32gb flash drives and running 16.10 but that was not a problem before. I wonder if this is similar to a problem I have at MilkyWay. My AMD S9100 runs FP64 at 2600 flops, triple what an HD7950 can do (717), but the more concurrent programs I run, exponentially more "invalid" work units are generated. I keep the number of concurent work units to only 3 to prevent too many invalids from haveing to be farmed out to other wingmen. OTOH, my genuine HD7950s can process 5 with not a single invalid. Other users report the same exact problem with S9150. I will try your app_config with 14 thread instead of 4 and see what happens. I have 14 available. |
JStatesonSend message Joined: 31 Oct 08 Posts: 186 Credit: 3,578,903,157 RAC: 0 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]()
|
Nope, did not work for me. I suspended all projects then allowed only one 14 cpu task to run and it quit with an error. I then allowed another still same problem. I changed the 4 above to 14 as that had worked before and issued a "read app_config", saw the 14 show up but it crashed. Looks like same error as before, this computer http://www.gpugrid.net/results.php?hostid=472211 |
|
Send message Joined: 8 May 18 Posts: 190 Credit: 104,426,808 RAC: 0 Level ![]() Scientific publications
|
I've downloaded 2 tasks on the HP laptop which has no GPU board, so it won't download GPU tasks. The first erred after 1'51". It is running a single core Atlas task with VirtualBox. Tullio |
JStatesonSend message Joined: 31 Oct 08 Posts: 186 Credit: 3,578,903,157 RAC: 0 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]()
|
I've downloaded 2 tasks on the HP laptop which has no GPU board, so it won't download GPU tasks. The first erred after 1'51". It is running a single core Atlas task with VirtualBox. Is VirtualBox required for this project? Looking around I don't see a list of applications with requirements like at other projects. I might have overlooked the app list. Maybe that is my problem |
|
Send message Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]()
|
|
JStatesonSend message Joined: 31 Oct 08 Posts: 186 Credit: 3,578,903,157 RAC: 0 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]()
|
https://www.gpugrid.net/apps.php Thanks Richard! The hyperlink at "... has two types of apps: ACEMD ...." is not obvious in Edge on the system I am using. I even looked in the "Donate" page for the application list. When I looked for it useing Firefox on my Linux system that has a better monitor the hyperlink showed up immediately. Earlier I had found gpugrid was listed as using VirtualBox but this problem system was Linux and not Windows. Looking through that list you found for me, I do not see VirtualBox listed explicitly. Maybe the failures are because I enabled "test apps" ? The last time I ran QC successfully was when i was using GRCPOOL which manages configurations. I no longer use GRCPOOL and am setting parameters myself through venue which is not possible when using that pool manager. |
|
Send message Joined: 8 May 18 Posts: 190 Credit: 104,426,808 RAC: 0 Level ![]() Scientific publications
|
All LHC@home projects use VirtualBox save SixTrack. I am running only Atlas@home on my 3 PCs. two Linux and one Windows 10 since all other VirtualBox projects fail, for reasons nobody explained to me. I was one of the Alpha tester at Test4Theory@home on invitation by dr.Ben Segal of CERN, a member of the Internet Hall of Fame. Tullio |
Michael H.W. WeberSend message Joined: 9 Feb 16 Posts: 78 Credit: 656,229,684 RAC: 0 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]()
|
Same problem over here. I ran QC tasks successfully for several months. Then, they suddenly all errored out. The system was unchanged. I checked a second system: Same issue. The .xml file shown above was there all the time, it does not prevent the problem. My suspicion is: If another BOINC CPU project is allowed to fetch and compute tasks on the same machine, it appears that GPUGRID occupies just one core and then fails to free-up the other cores from the tasks of the other project to start MT computation. Bafflingly, this worked without any problems around 4 weeks ago (probably the auxiliary apps were different than they are at present). So, to me there seems to be an issue in the app - maybe with task priorities? Please correct that, I can't support the project as long as this bug hasn't been resolved and for now I have withdrawn my machines. Michael. President of Rechenkraft.net - Germany's first and largest distributed computing organization. |
|
Send message Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 Level ![]() Scientific publications ![]() ![]() ![]() ![]() |
Please provide the id of a failed task. We have been debugging one specific issue (see the other thread). |
Michael H.W. WeberSend message Joined: 9 Feb 16 Posts: 78 Credit: 656,229,684 RAC: 0 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]()
|
Please provide the id of a failed task. 3 examples of an i5 machine: http://www.gpugrid.net/result.php?resultid=17650661 http://www.gpugrid.net/result.php?resultid=17650648 http://www.gpugrid.net/result.php?resultid=17650310 3 examples of an i7 machine: http://www.gpugrid.net/result.php?resultid=17465735 http://www.gpugrid.net/result.php?resultid=17465712 http://www.gpugrid.net/result.php?resultid=17464261 We have been debugging one specific issue (see the other thread). Well, these errors occurred this morning around 10:30 am (i5 machine). So, if you haven't fixed anything thereafter, the bug is still alive... Please note that most wing-men attempts to complete these tasks successfully failed as well. In two cases, however, the task was completed by another machine (often after multiple failures), but I could not identify a similarity in these machines which is making them unique compared to my own settings or those of the other failing computers. Which of the many other threads relevant to this issue exactly do you mean? Michael. President of Rechenkraft.net - Germany's first and largest distributed computing organization. |
|
Send message Joined: 9 May 13 Posts: 171 Credit: 4,594,296,466 RAC: 171 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]()
|
Michael asked Which of the many other threads relevant to this issue exactly do you mean? https://www.gpugrid.net/forum_thread.php?id=4750&nowrap=true#49560 |
Michael H.W. WeberSend message Joined: 9 Feb 16 Posts: 78 Credit: 656,229,684 RAC: 0 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]()
|
So, where is the ultimate solution to the problem? Unfortunately, I couldn't find it in the linked thread. Michael. President of Rechenkraft.net - Germany's first and largest distributed computing organization. |
|
Send message Joined: 2 Jul 16 Posts: 338 Credit: 7,987,341,558 RAC: 259 Level ![]() Scientific publications ![]() ![]() ![]() ![]()
|
So, where is the ultimate solution to the problem? Sounds like its on server/admin side and nothing something a user can fix since tasks were canceled. |
Michael H.W. WeberSend message Joined: 9 Feb 16 Posts: 78 Credit: 656,229,684 RAC: 0 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]()
|
Looks like the problem is solved. Both of my machines now return valid CPU QC results. Michael. President of Rechenkraft.net - Germany's first and largest distributed computing organization. |
©2025 Universitat Pompeu Fabra