Message boards : Number crunching : process exited with code 212 (0xd4, -44)

valterc
Message 48771 - Posted: 1 Feb 2018 | 10:03:29 UTC
Last modified: 1 Feb 2018 | 10:15:12 UTC

Hi all, starting a couple of days ago, on one of my computers (http://www.gpugrid.net/show_host_detail.php?hostid=178360) all the GPUGrid tasks error out with this message:

<core_client_version>7.2.47</core_client_version>
<![CDATA[
<message>
process exited with code 212 (0xd4, -44)
</message>
<stderr_txt>

</stderr_txt>
]]>

The same computer has no problems crunching collatz or primegrid tasks.
Any hints?

[Edit] Running the program manually, I got the following:
./acemd.914-80.bin
# ACEMD Molecular Dynamics Version [3212u2]
# Basic license will expire soon. Contact info@acellera.com for licensing details
# Basic license has expired. Contact info@acellera.com for licensing details
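
For reference, the three numbers in the error line appear to be the same one-byte exit status written three ways. A minimal Python sketch, assuming the third value is simply the code reinterpreted as a signed 8-bit integer (the function name is made up for illustration):

```python
from typing import Tuple

def exit_code_forms(code: int) -> Tuple[int, str, int]:
    """Return the (unsigned, hex, signed 8-bit) views of a process exit code."""
    unsigned = code & 0xFF  # keep only the low byte of the status
    # Two's complement: values of 128 and above wrap around to negatives.
    signed = unsigned - 256 if unsigned >= 128 else unsigned
    return unsigned, hex(unsigned), signed

print(exit_code_forms(212))  # (212, '0xd4', -44)
```

So "212 (0xd4, -44)" carries one piece of information, not three separate error codes.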

Richard Haselgrove
Message 48772 - Posted: 1 Feb 2018 | 10:36:42 UTC - in response to Message 48771.
Last modified: 1 Feb 2018 | 10:50:17 UTC

Looking at the 'workunit' column for those errors, every machine which has attempted them has failed right at startup. That looks like a bad batch of work, rather than any problem with your machine.

Whether it's related to the licence warnings, I can't say.

Edit - having said that, the machine in front of me has just started running gpcrmd_3nya_complex_withNA_rep_3-GIANNI_CPXB-8-10-RND9513. It's running fine, 2.5% in after half an hour. The first attempt at that workunit (under Linux) failed with the same error as you: I'm running Windows 7, and I'm careful to stick with a known good driver.

Edit again - the vast majority of the failures among your wingmates are also running some flavour of Linux. That feels like a Clue.

mmonnin
Message 48773 - Posted: 1 Feb 2018 | 12:37:49 UTC
Last modified: 1 Feb 2018 | 12:39:28 UTC

I've had 7 of these in a row since the 30th. Before that I'd had only 1 bad WU, against nearly 100 million points' worth of good tasks. The system has been crunching away at E@H in the meantime.

System is a 1950X with a GTX 1070 and a 1070 Ti on Ubuntu 17.04.

All others who received my error'd tasks next are running Windows.

biodoc
Message 48774 - Posted: 1 Feb 2018 | 13:16:59 UTC
Last modified: 1 Feb 2018 | 13:18:36 UTC

After a long absence from GPUGrid, I tried getting some work yesterday for my Linux rig with two GTX 1080 GPUs. After several hours of trying to get work, I did get one work unit, and it failed with the same error code as posted above. Another computer running Windows finished that WU successfully.

http://www.gpugrid.net/workunit.php?wuid=13112686

I also notice at the bottom of the server status page that the failure rate for project work units varies from 13 to 20%. Is that typical, and are most of the failures on Linux?

http://www.gpugrid.net/server_status.php

My computer:

3930K w/ 32 GB RAM
2-GTX1080
Linux Mint 18.3
Nvidia driver version 384.111

What is the recommended driver version for linux?

I've been folding successfully on this computer for a couple of months.

mmonnin
Message 48775 - Posted: 1 Feb 2018 | 16:03:54 UTC - in response to Message 48774.

biodoc wrote: "I also notice at the bottom of the server status page that the failure rate for project work units varies from 13 to 20%. Is that typical, and are most of the failures on Linux?"


If the overclocks are reasonable and not on the bleeding edge, the error rates are basically nothing (excluding errors caused by the project, like this one). I have not had any tasks that crashed due to my own hardware, only these ones that don't even start, and one that failed after about 3 seconds.

klepel
Message 48776 - Posted: 1 Feb 2018 | 20:54:29 UTC

Here as well! The last 4 WUs have errored out after a few seconds with:
process exited with code 212 (0xd4, -44)

Linux Lubuntu 17.10, AMD Ryzen 1700x and GTX1070. This system has worked very well before.

Retvari Zoltan
Message 48777 - Posted: 1 Feb 2018 | 23:45:17 UTC - in response to Message 48772.

Edit again - the vast majority of the failures among your wingmates are also running some flavour of Linux. That feels like a Clue.
Sure! The license of the Linux app has expired. It should be renewed; most probably there should be a new (at least in its version number) app for Linux as soon as possible!
The same thing happened in April 2017 with the Windows XP app (and the license of the present Windows XP app will expire again, permanently, in April 2018).

Perhaps the Linux users should try setting their clocks to an earlier date to fool the app's license expiration check.

STARBASEn
Message 48778 - Posted: 2 Feb 2018 | 0:06:20 UTC
Last modified: 2 Feb 2018 | 0:12:40 UTC

I've had 27 consecutive failures since Jan 30 on three different Linux machines (GTX 1060). Reviewing the wingmen's results shows that all the failures are from v9.14 (the ACEMD Linux version?) and the successful completions are from v9.18, which I assume is the Windows version.

Edit: spelling

Richard Haselgrove
Message 48779 - Posted: 2 Feb 2018 | 8:46:38 UTC

Has anyone sent a PM to Gianni to suggest he deprecates the Linux app (to save wasted bandwidth) until the licence problem has been sorted out?

Retvari Zoltan
Message 48780 - Posted: 2 Feb 2018 | 12:28:18 UTC - in response to Message 48779.

Has anyone sent a PM to Gianni to suggest he deprecates the Linux app (to save wasted bandwidth) until the licence problem has been sorted out?
I haven't.
The better way to address this issue is to put the question this way:
Who will send a PM to Gianni to suggest he deprecates the Linux app?
Perhaps they have already noticed this issue on their own Linux machines?

mmonnin
Message 48781 - Posted: 2 Feb 2018 | 12:48:42 UTC

Up to 17 failed tasks for me. The project output is really low. Looks like a test to see if the admins actually give 2 cents.

http://stats.free-dc.org/stats.php?page=proj&proj=ps3

Richard Haselgrove
Message 48782 - Posted: 2 Feb 2018 | 13:18:37 UTC - in response to Message 48780.
Last modified: 2 Feb 2018 | 13:28:59 UTC

Has anyone sent a PM to Gianni to suggest he deprecates the Linux app (to save wasted bandwidth) until the licence problem has been sorted out?
I haven't.
The better way to address this issue is to put the question this way:
Who will send a PM to Gianni to suggest he deprecates the Linux app?
Perhaps they have already noticed this issue on their own Linux machines?

Not from what people are saying here.

In general, and from speaking with BOINC developers/administrators, we crunchers are much more aware of the small details of how a project is running than the admins are. Although Gianni's failure rate on the server status page has crept up from 23% to 24% overnight, it's still in the green, and work is being returned (my Windows resend went up this morning) - so, nothing obvious to ring alarm bells.

I'll send the PM...

...done

Hi Gianni,

We're discussing (in the thread "process exited with code 212 (0xd4, -44)") that all tasks run on Linux machines are failing. It started with your current tasks, but it may be more widespread than that.

One user (opening message 48771) has reported error messages relating to licensing expiry.

Windows applications are unaffected by this problem.

biodoc
Message 48783 - Posted: 2 Feb 2018 | 15:53:01 UTC - in response to Message 48782.

Thanks for sending the PM Richard.

Keith Myers
Message 48785 - Posted: 2 Feb 2018 | 17:53:23 UTC - in response to Message 48782.

This is the reply I got from the Acellera Help Desk

| Dear Keith,
|
| Thanks for your email and our apologies for the inconvenience.
|
| First, I would like to understand correctly what happened.
| May I ask you if you were running on GPU Grid?
| ACEMD basic license does not allow to run on GPUGrid, it is a different compilation.
| Then, We should be able to answer your request.
| Thank you for your understanding!
|
| I take advantage to ask you the following:
| Could you please let me know if you work for non profit entity?
|
| Finally, you can download free of charge the latest version of ACEMD at: https://software.acellera.com
| If you use more than one GPU, you may be interested by the PRO license.
| This license allows to run on e.g. cluster.
| Would it be useful for you ?
|
| Expecting your answer, I wish you a great day,
|
| Franck
`---


So apparently Gianni doesn't have a license to run Acellera software at GPUGrid.net. Thus the message.

biodoc
Message 48786 - Posted: 2 Feb 2018 | 18:24:22 UTC

I believe Gianni is the scientific founder of Acellera so I would think he would get a steep discount on licenses.

Keith Myers
Message 48787 - Posted: 2 Feb 2018 | 19:37:59 UTC - in response to Message 48786.

I believe Gianni is the scientific founder of Acellera so I would think he would get a steep discount on licenses.

That's choice. Ha ha. Didn't know.

STARBASEn
Message 48793 - Posted: 2 Feb 2018 | 22:01:42 UTC

Somebody must be listening as I haven't had a gpu wu since my last post here on any of my 3 gpu capable machines. Just crunching QC, WCG and E@H right now. Also noticed I haven't received an upgrade to ACEMD v 9.14 yet (original download date 5/6/17 still on file).

Richard Haselgrove
Message 48794 - Posted: 2 Feb 2018 | 22:34:30 UTC - in response to Message 48793.

Keep an eye on https://www.gpugrid.net/apps.php for new versions. That standard BOINC url works here as on all BOINC projects, even though there isn't an obvious link.

STARBASEn
Message 48795 - Posted: 3 Feb 2018 | 0:17:34 UTC - in response to Message 48794.

Keep an eye on https://www.gpugrid.net/apps.php for new versions. That standard BOINC url works here as on all BOINC projects, even though there isn't an obvious link.


Thanks Richard. Very helpful info and link, added it to my favorites.

STARBASEn
Message 48807 - Posted: 3 Feb 2018 | 18:19:53 UTC

Rats... I am getting more GPU work and the tasks are all failing. Since I don't have any Windows machines, I am out of luck until this is resolved.

Easton West
Message 48809 - Posted: 4 Feb 2018 | 3:19:33 UTC

Got same error code on my Linux box for several tasks:

http://gpugrid.net/results.php?hostid=447186

For the most recent task, I suspended it before it began. I rolled my computer's date back a month (February to January), then resumed the task to let it begin.

It didn't stop with an error immediately, that's a good sign.

I waited a minute, then returned my computer's date to real time. It's still running, hope it finishes okay.

http://gpugrid.net/workunit.php?wuid=13116635

Richard Haselgrove
Message 48810 - Posted: 4 Feb 2018 | 10:30:41 UTC

More than half the current research projects on the server status page are now showing amber for the error rate. That's the sort of signal that might attract the admins' attention.

I've had no response to my PM, but if there's no sign of change by tomorrow morning I might try again.

Easton West
Message 48811 - Posted: 4 Feb 2018 | 14:46:48 UTC - in response to Message 48809.

http://gpugrid.net/result.php?resultid=16993864

It was going so well, but suspending and resuming with the date set to February caused the error again. The app seems to check the date at the beginning, and upon resuming as well. Ten hours of work wiped out, d'oh.

Trying again, starting with a January date and then trying to never suspend. Maybe it'll work this time.

http://www.gpugrid.net/workunit.php?wuid=13117301

Richard Haselgrove
Message 48812 - Posted: 4 Feb 2018 | 18:49:11 UTC - in response to Message 48811.

BOINC never keeps a task scheduled for a GPU in memory (there's no equivalent of paging or a swap file for graphics memory), so every 'resume' is actually a complete application relaunch. Try to avoid letting it swap out.
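
That would explain why the error came back on resume. A toy Python model, assuming the app compares the system date against some fixed cut-off at every launch; the cut-off date and the `launch` function are hypothetical and only illustrate the behaviour reported in this thread:

```python
from datetime import date

# Hypothetical cut-off, chosen only for illustration; the thread does not
# state the real date the app checks against.
ASSUMED_CUTOFF = date(2018, 1, 30)

def launch(today: date, cutoff: date = ASSUMED_CUTOFF) -> int:
    """Model one application launch: return 0 if it may run, else 212."""
    if today > cutoff:
        return 212  # the failure code reported in this thread
    return 0

# A suspended GPU task is relaunched from scratch, so the check runs again.
first_start = launch(date(2018, 1, 4))   # clock rolled back: task starts
after_resume = launch(date(2018, 2, 4))  # clock corrected, then resumed: fails
print(first_start, after_resume)  # 0 212
```

Under this model a task only survives if it is started with an early date and is never relaunched afterwards.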

Easton West
Message 48813 - Posted: 5 Feb 2018 | 8:41:46 UTC
Last modified: 5 Feb 2018 | 9:00:11 UTC

Success! So fiddling with the date is a possible workaround for Linux users.

http://www.gpugrid.net/workunit.php?wuid=13117301

Edit: P.S. Suspend other tasks before rolling the date back. It's okay if they stay in memory, but if they're running, the rollback will lock them up. There are probably other ways to do it, but to be safe, this is what I did:

1. Suspended everything.
2. Rolled back the date.
3. Started the GPUGRID task (the manager won't show it running, but it is running).
4. Corrected the date (the manager now shows GPUGRID running).
5. Resumed the other tasks, and made sure the GPUGRID task never suspended.
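
The same sequence can be written down as a dry-run plan. This is an untested sketch, not a supported procedure: it only builds and prints command strings, assuming `boinccmd --task <url> <name> suspend|resume` is used for task control and GNU `date -s` (run as root) for the clock. The task names, the second project URL, and the 60-second wait are placeholders.

```python
from typing import List, Tuple

GPUGRID_URL = "http://www.gpugrid.net/"

def workaround_plan(gpugrid_task: str,
                    other_tasks: List[Tuple[str, str]],
                    past_date: str,
                    real_date: str) -> List[str]:
    """Return the shell commands, in order, without executing any of them."""
    plan = []
    # 1. Suspend everything else so the clock change cannot lock it up.
    for url, name in other_tasks:
        plan.append(f"boinccmd --task {url} {name} suspend")
    # 2. Roll the clock back to before the failures began (needs root).
    plan.append(f'date -s "{past_date}"')
    # 3. Start the GPUGRID task while the date is in the past.
    plan.append(f"boinccmd --task {GPUGRID_URL} {gpugrid_task} resume")
    # 4. Let the app get past its start-up check, then correct the clock.
    plan.append("sleep 60")
    plan.append(f'date -s "{real_date}"')
    # 5. Resume the other tasks; the GPUGRID task must never be suspended.
    for url, name in other_tasks:
        plan.append(f"boinccmd --task {url} {name} resume")
    return plan

if __name__ == "__main__":
    for cmd in workaround_plan(
        "EXAMPLE_GPUGRID_TASK",                                   # hypothetical
        [("http://example.org/project/", "EXAMPLE_OTHER_TASK")],  # hypothetical
        "2018-01-05 09:00:00",
        "2018-02-05 09:00:00",
    ):
        print(cmd)
```

Printing the plan first makes it easy to check the order before touching the system clock; anything that resets the time automatically (NTP, for instance) would need to be paused separately.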

Keith Myers
Message 48817 - Posted: 5 Feb 2018 | 19:21:25 UTC
Last modified: 5 Feb 2018 | 19:35:41 UTC

Got a response to my trouble ticket at acellera.com about the app error from Gianni.

the error message is misleading, there is no license to request. We are already fixing the application in gpugrid.


So I hope that we see a new Linux application from him shortly that doesn't error out.

[Edit] There is also a new News item mentioning the issue.

[Edit2] Question from me:

Thanks for the reply Gianni. Any estimate when that will be made available?


Gianni's reply

not yet, but it will be posted on gpugrid forum. We have a problem that the person that builds it is out until the end of the week. We have to see if the others can do it without him.

Keith Myers
Message 48869 - Posted: 7 Feb 2018 | 17:11:20 UTC

Looks like the same issue with the Short Acemd tasks. Thought it might use a different application, guess not. Same error in Linux as before.

Will have to wait for the developer to fix the application.

[AF>Libristes]Maeda
Message 48923 - Posted: 12 Feb 2018 | 21:51:48 UTC

Should we turn off GPUs in the meantime to avoid failed WUs, or doesn't it matter?

Richard Haselgrove
Message 48924 - Posted: 12 Feb 2018 | 22:11:24 UTC - in response to Message 48923.

Should we turn off GPUs in the meantime to avoid failed WUs, or doesn't it matter?

Perhaps better to help out another project temporarily until this one is ready for you again?

Easton West
Message 48966 - Posted: 16 Feb 2018 | 1:06:17 UTC

Just completed a WU for the new version 9.19, seems to be working fine.

Keith Myers
Message 48967 - Posted: 16 Feb 2018 | 2:08:15 UTC

I've already completed and validated two new tasks with the 9.19 application. The original problem is resolved.
