All tasks failed with Exit status 195 (0xc3) EXIT_CHILD_FAILED

Greg _BE

Message 59018 - Posted: 23 Jul 2022, 21:57:32 UTC
Last modified: 23 Jul 2022, 22:01:54 UTC

There seems to be a bug in these tasks.
I'm seeing a 100% failure on my system and the wingmen behind me.
Windows 10 or 11 does not make a difference.
A linux user also has this.
One of my tasks had 4-5 failures behind me.
On another task my first wingman failed, but he runs a 780, and that card lacks something in its firmware/software that would allow it to run these tasks.
I have a 1080 and it failed. The last person had a 1050 and it ran ok.

I don't get what is going on and why this was not picked up in testing.

I find this to be a common error message in the stderr file: OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "D:\data\slots\1\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll" or one of its dependencies.

Keith Myers

Message 59019 - Posted: 23 Jul 2022, 23:18:25 UTC - in response to Message 59018.  

That is a problem with Windows and memory reservation allocation when loading all the Python dll's.

Linux does not have the issue.

See this message of mine. https://www.gpugrid.net/forum_thread.php?id=5322&nowrap=true#58908


The solution is to increase the size of your paging file.
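
For anyone who wants to confirm that a failed task hit this particular error before changing the paging file, here is a minimal sketch that scans a task's stderr text for WinError 1455 and pulls out the DLL that failed to load. The function name `find_paging_errors` and the sample text are only for illustration (the sample is assembled from the error quoted in the first post); this is not part of BOINC or the GPUGrid app.

```python
import re

WINERROR_1455 = "[WinError 1455]"
# Captures the quoted path that follows 'Error loading' on the same line.
DLL_PATTERN = re.compile(r'Error loading "([^"]+)"')


def find_paging_errors(stderr_text):
    """Return one entry per line reporting WinError 1455.

    Each entry is the DLL path named on that line, or None if no path is given.
    """
    hits = []
    for line in stderr_text.splitlines():
        if WINERROR_1455 not in line:
            continue
        match = DLL_PATTERN.search(line)
        hits.append(match.group(1) if match else None)
    return hits


# Sample stderr assembled from the messages quoted in this thread.
sample = (
    'OSError: [WinError 1455] The paging file is too small for this operation '
    'to complete. Error loading '
    '"D:\\data\\slots\\1\\lib\\site-packages\\torch\\lib\\cudnn_cnn_infer64_8.dll" '
    'or one of its dependencies.\n'
    '09:46:01 (13124): wrapper: running python.exe (run.py)\n'
)

for dll in find_paging_errors(sample):
    print("Paging file too small while loading:", dll)
```

If the list comes back empty, the task probably failed for a different reason and a bigger paging file alone may not help.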

jjch

Message 59020 - Posted: 24 Jul 2022, 2:19:47 UTC - in response to Message 59019.  
Last modified: 24 Jul 2022, 2:29:42 UTC

I had to go back 6 tasks to find the one that failed with the paging file error. More recent tasks are having a different problem running out of memory somewhere.

Your system looks like it has 48GB of physical memory, so that should be sufficient to run the GPUgrid Python tasks unless something else is conflicting with them.

I have a Server running Win Server 2012 with the same amount of physical memory. The swap file is still set at "Automatically manage paging file size for all drives"

I left this one that way since it was working OK. With one GPUgrid Python task running, it shows Currently allocated at 12800 MB, which is typical.

Check the free space available on your swap drive and make sure it has a minimum of 16GB available. If you have plenty of space there then I would suggest you set the swap space separately.

I have found that sometimes it seems the Automatic isn't fast enough so try setting it to System managed size first. If that doesn't help then set it to Custom size.

You might need to play with the sizing a bit, but you can try Initial size 16384 and Maximum size 24576 or more.

The last 5 tasks are failing with various not enough memory errors but the first traceback is something I have been seeing with a lot of the tasks failing.

Just make sure you are not running anything that is tying up too much memory and not leaving enough available for GPUgrid.

Other than that, this could be an internal error in the GPUgrid Python tasks.

Greg _BE

Message 59025 - Posted: 24 Jul 2022, 8:12:21 UTC - in response to Message 59020.  
Last modified: 24 Jul 2022, 8:17:11 UTC

I have a whole HDD set aside for BOINC with 303GB of space left.
All the data files are there.
I run FAH plus all the projects you see in my profile here.
I am just around 73% memory usage.

Disk setting is leave 20GB free
Memory setting is computer in use 90%
Not in use 98%
Leave non GPU in memory (yes)
Page/Swap use at most 90%

You would think with these settings it has more than enough space to do what it needs to do.

According to BOINC tasks the current task uses 1932 physical and 3632 virtual.
BOINC says virtual size is 3.55 and working set is 1.89


Checked again after maxing everything out and this error keeps repeating:
OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "D:\data\slots\1\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll" or one of its dependencies.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "D:\data\slots\1\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "D:\data\slots\1\lib\multiprocessing\spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)

As for paging size: this seems to be an error in the code, since I've opened up BOINC to the max. I think there was a similar teething error with Python CPU at RAH, but that one was not about paging size.


And after adjustments I get this:
Detected memory leaks!
Dumping objects ->
..\api\boinc_api.cpp(309) : {13199} normal block at 0x000001B0A0972890, 8 bytes long.
Data: < > 00 00 94 A0 B0 01 00 00
..\lib\diagnostics_win.cpp(417) : {11918} normal block at 0x000001B0A0998B40, 1080 bytes long.
Data: <<j 4 > 3C 6A 00 00 CD CD CD CD 34 01 00 00 00 00 00 00
..\zip\boinc_zip.cpp(122) : {397} normal block at 0x000001B0A09708F0, 260 bytes long.
Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
{383} normal block at 0x000001B0A096AA80, 52 bytes long.
Data: < r > 01 00 00 00 72 00 CD CD 00 00 00 00 00 00 00 00
{378} normal block at 0x000001B0A096ABD0, 43 bytes long.
Data: < p > 01 00 00 00 70 00 CD CD 00 00 00 00 00 00 00 00
{373} normal block at 0x000001B0A096AD90, 44 bytes long.
Data: < > 01 00 00 00 00 00 CD CD B1 AD 96 A0 B0 01 00 00
{368} normal block at 0x000001B0A096AD20, 44 bytes long.
Data: < A > 01 00 00 00 00 00 CD CD 41 AD 96 A0 B0 01 00 00
Object dump complete.
09:46:01 (13124): wrapper (7.9.26016): starting
09:46:01 (13124): wrapper: running python.exe (run.py)
Detected memory leaks!
Dumping objects ->
..\api\boinc_api.cpp(309) : {13134} normal block at 0x0000023C80BA32A0, 8 bytes long.
Data: < R < > 00 00 52 82 3C 02 00 00
..\lib\diagnostics_win.cpp(417) : {11853} normal block at 0x0000023C80BCF400, 1080 bytes long.
Data: <$2 P > 24 32 00 00 CD CD CD CD 50 01 00 00 00 00 00 00
..\zip\boinc_zip.cpp(122) : {397} normal block at 0x0000023C80BA3C60, 260 bytes long.
Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
{383} normal block at 0x0000023C80B9AA70, 52 bytes long.
Data: < r > 01 00 00 00 72 00 CD CD 00 00 00 00 00 00 00 00
{378} normal block at 0x0000023C80B9AC30, 43 bytes long.
Data: < p > 01 00 00 00 70 00 CD CD 00 00 00 00 00 00 00 00
{373} normal block at 0x0000023C80B9A840, 44 bytes long.
Data: < a < > 01 00 00 00 00 00 CD CD 61 A8 B9 80 3C 02 00 00
{368} normal block at 0x0000023C80B9A990, 44 bytes long.
Data: < < > 01 00 00 00 00 00 CD CD B1 A9 B9 80 3C 02 00 00
Object dump complete.

But then it goes on to start running.

Richard Haselgrove

Message 59026 - Posted: 24 Jul 2022, 10:12:46 UTC

I posted some screenshots of paging file settings in message 58934. I'd had similar failures with only 8 GB system RAM installed: with 16 GB and those settings, the Python app ran, though it's not a very efficient use of that particular machine.

Greg _BE

Message 59027 - Posted: 24 Jul 2022, 11:01:44 UTC - in response to Message 59026.  

I've searched windows and the net on how to do that and nothing matches those screen shots and nothing from the net matches my win 10 64bit software.

Can you tell me how to get to the tabs you did the screenshot of?

Greg _BE

Message 59028 - Posted: 24 Jul 2022, 12:41:23 UTC

Found this info in boinc_task_state.xml

<project_master_url>https://www.gpugrid.net/</project_master_url>
<result_name>e00028a00502-ABOU_rnd_ppod_expand_demos6_again2-0-1-RND4470_2</result_name>
<checkpoint_cpu_time>31287.720000</checkpoint_cpu_time>
<checkpoint_elapsed_time>15281.828158</checkpoint_elapsed_time>
<fraction_done>0.059200</fraction_done>
<peak_working_set_size>2470195200</peak_working_set_size>
<peak_swap_size>6816833536</peak_swap_size>
<peak_disk_usage>17117387104</peak_disk_usage>

I am assuming these huge values are in bytes?
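
A quick way to make those numbers readable, reading them as bytes (which is how the later reply in this thread treats them). This is only a sketch: the helper name `peaks_in_gib` and the `<task>` wrapper element are made up for the example, because the pasted lines have no root element of their own.

```python
import xml.etree.ElementTree as ET

GIB = 1024 ** 3  # bytes per GiB

# The three peak values pasted above from boinc_task_state.xml.
fragment = """
<peak_working_set_size>2470195200</peak_working_set_size>
<peak_swap_size>6816833536</peak_swap_size>
<peak_disk_usage>17117387104</peak_disk_usage>
"""


def peaks_in_gib(xml_fragment):
    """Parse sibling elements holding byte counts and return them in GiB."""
    # Wrap the fragment in a dummy root so it parses as well-formed XML.
    root = ET.fromstring(f"<task>{xml_fragment}</task>")
    return {child.tag: int(child.text) / GIB for child in root}


for name, gib in peaks_in_gib(fragment).items():
    print(f"{name}: {gib:.2f} GiB")
```

Note that 17117387104 bytes is about 17.1 GB in decimal units but a little under 16 GiB in binary units, so the figure depends on which convention a tool uses.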

Richard Haselgrove

Message 59029 - Posted: 24 Jul 2022, 13:59:55 UTC - in response to Message 59027.  

Can you tell me how to get to the tabs you did the screenshot of?

All these low-level Windows management tools have barely changed since Windows NT 4 days, but the roadmap for finding them changes every time. The ones I posted were from Windows 7, but here's the routing for Windows 11 - split the difference...

[Screenshots: Windows 11 navigation path to the paging file (virtual memory) settings dialog]

For the final one, unset the first and third ('Automatic' and 'System' management), and set 'Custom' to open up all the options.

Greg _BE

Message 59030 - Posted: 24 Jul 2022, 15:25:56 UTC - in response to Message 59029.  

after a little trial and error I found a way to that location.
Set it to 144MB 3x physical to start and gave it 154MB max
See if this helps anything.

Keith Myers

Message 59031 - Posted: 24 Jul 2022, 16:57:00 UTC - in response to Message 59030.  

after a little trial and error I found a way to that location.
Set it to 144MB 3x physical to start and gave it 154MB max
See if this helps anything.

That's way undersized. It should be GB's . . . . not MB's

From your task data . . . <peak_disk_usage>17117387104</peak_disk_usage>

That is 17GB's of disk usage.

I would set 17GB or 17000MB for initial size and double it for max size.
or
34GB or 34000MB
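
The rule of thumb above (initial size about equal to the peak figure, maximum double that) can be sketched as a small calculation. The function name `suggest_paging_file_mb` is hypothetical, and the sketch assumes the Windows dialog's "MB" means 1024*1024 bytes; it is an illustration of the arithmetic, not an official sizing formula.

```python
import math

MIB = 1024 ** 2  # bytes per MB as assumed for the Windows paging file dialog


def suggest_paging_file_mb(peak_bytes, max_factor=2):
    """Return (initial_mb, maximum_mb) from a peak usage figure in bytes.

    Initial size is the peak rounded up to whole MB; the maximum is that
    value times max_factor (2 by default, matching the advice above).
    """
    initial_mb = math.ceil(peak_bytes / MIB)
    return initial_mb, initial_mb * max_factor


# Peak value taken from the task data quoted above.
initial, maximum = suggest_paging_file_mb(17117387104)
print(f"Initial size: {initial} MB, Maximum size: {maximum} MB")
```

With binary megabytes this gives 16325 and 32650, so the rounder 17000 and 34000 suggested above sit slightly on the safe side.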

Greg _BE

Message 59032 - Posted: 24 Jul 2022, 21:54:03 UTC - in response to Message 59031.  
Last modified: 24 Jul 2022, 21:58:31 UTC

after a little trial and error I found a way to that location.
Set it to 144MB 3x physical to start and gave it 154MB max
See if this helps anything.

That's way undersized. It should be GB's . . . . not MB's

From your task data . . . <peak_disk_usage>17117387104</peak_disk_usage>

That is 17GB's of disk usage.

I would set 17GB or 17000MB for initial size and double it for max size.
or
34GB or 34000MB


oh! thanks...will make the change
170000 and 340000

Greg _BE

Message 59033 - Posted: 25 Jul 2022, 20:37:44 UTC

Well that seems to have solved the problem on my Win10 machine.
2 tasks run and completed ok.

Thanks Keith!

Curious, though: why, if it has too much space, does it error out, but only here and not in other projects?

Keith Myers

Message 59034 - Posted: 26 Jul 2022, 2:17:38 UTC - in response to Message 59033.  

Go back and read this post of mine.

https://www.gpugrid.net/forum_thread.php?id=5322&nowrap=true#58908

Only affects projects that use pytorch in Windows that have large DLL's that Windows MUST reserve a lot of memory for.

Don't think there are any other BOINC projects that use pytorch.

So not affected.

Greg _BE

Message 59035 - Posted: 26 Jul 2022, 18:57:46 UTC - in response to Message 59034.  
Last modified: 26 Jul 2022, 19:50:53 UTC

Go back and read this post of mine.

https://www.gpugrid.net/forum_thread.php?id=5322&nowrap=true#58908

Only affects projects that use pytorch in Windows that have large DLL's that Windows MUST reserve a lot of memory for.

Don't think there are any other BOINC projects that use pytorch.

So not affected.



I have never heard of that. I wondered what that was.
So after reading that, it explains why no Python GPU, or anything else on GPU, is used at my oldest project, RAH. They have Python CPU to run, generated by an external client, but that's about it for us BOINC users. They keep all the really interesting stuff in-house for the AI system.

Keith Myers

Message 59036 - Posted: 27 Jul 2022, 0:40:49 UTC
Last modified: 27 Jul 2022, 0:55:40 UTC

Once again, GPUGrid is on the cutting edge of GPU science for BOINC projects with its machine learning and AI development. They were the first BOINC project to use GPUs. I like that they are still pushing the envelope.

The only other machine learning BOINC project I know about is MLC@home, and they only use CPUs now. They had a GPU app a few years ago, but I don't think they are producing any tasks for GPUs currently.

Greg _BE

Message 59037 - Posted: 27 Jul 2022, 19:36:56 UTC

I like projects that push the boundaries. Look for stuff that has not been done before either in code or in ideas of what to send out for crunching.

Jim1348

Message 59038 - Posted: 27 Jul 2022, 21:19:42 UTC

Today, it is impossible for a human to take into account the results, even limited to the most important data, for millions of known molecules. The second objective of this project is to radically change the approach developing artificial intelligence and optimization methods in order to explore efficiently the highly combinatorial molecular space.

https://quchempedia.univ-angers.fr/athome/about.php

QuChemPedIA is an AI project, though CPU only. And it works best with Linux. You can use Windows with VirtualBox, but there are a lot of stuck work units you have to deal with.

Greg _BE

Message 59045 - Posted: 28 Jul 2022, 18:02:08 UTC - in response to Message 59038.  
Last modified: 28 Jul 2022, 18:02:28 UTC

Today, it is impossible for a human to take into account the results, even limited to the most important data, for millions of known molecules. The second objective of this project is to radically change the approach developing artificial intelligence and optimization methods in order to explore efficiently the highly combinatorial molecular space.

https://quchempedia.univ-angers.fr/athome/about.php

QuChemPedIA is an AI project, though CPU only. And it works best with Linux. You can use Windows with VirtualBox, but there are a lot of stuck work units you have to deal with.



I know it and due to that exact reason and other technical errors, I gave up.
I can't get it to run stable on my windows system, so forget it.

GPU's get enough action with this project and primegrid and FAH as well as Eisenstein.

I think I am attached to enough projects to keep this system busy all the time it runs (16 hours a day).

Greg _BE

Message 59046 - Posted: 29 Jul 2022, 8:46:48 UTC

so...a new wrinkle.
I have two tasks running at the same time and RAH is complaining about disk space with the CPU Python.
I've maxed out the upper value.
rosetta python projects needs 3624.20MB more disk space. You currently have 15449.28 MB available and it needs 19073.49 MB.

So what do I have to do? I suppose I will have to restrict this project to 1 GPU in order to solve this disk space problem?

Richard Haselgrove

Message 59047 - Posted: 29 Jul 2022, 9:21:50 UTC - in response to Message 59046.  

Disk space limits can be solved by tweaking BOINC's limits.

They're quite separate and distinct from the memory (RAM) problems you were having here earlier.

©2025 Universitat Pompeu Fabra