Message boards : Number crunching : All tasks failed with Exit status 195 (0xc3) EXIT_CHILD_FAILED
**Joined: 30 Jun 14 · Posts: 153 · Credit: 129,654,684 · RAC: 0**
There seems to be a bug in these tasks. I'm seeing a 100% failure rate on my system and on the wingmen behind me; Windows 10 or 11 makes no difference, and a Linux user is seeing this too. One of my tasks had 4-5 failures behind me. On another task my first wingman failed, but he runs a 780, which doesn't have the firmware/software support needed to run these tasks. I have a 1080 and it failed; the last person had a 1050 and it ran OK. I don't get what is going on, or why this was not picked up in testing. I find this to be a common error message in the stderr file:

```
OSError: [WinError 1455] The paging file is too small for this operation to complete.
Error loading "D:\data\slots\1\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll" or one of its dependencies.
```
**Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 891**
That is a problem with how Windows reserves and allocates memory when loading all the Python DLLs; Linux does not have the issue. See this message of mine: https://www.gpugrid.net/forum_thread.php?id=5322&nowrap=true#58908 The solution is to increase the size of your paging file.
**Joined: 10 Nov 13 · Posts: 101 · Credit: 15,773,211,122 · RAC: 0**
I had to go back 6 tasks to find the one that failed with the paging-file error; more recent tasks are having a different problem, running out of memory somewhere. Your system looks like it has 48 GB of physical memory, so that should be sufficient to run the GPUGrid Python tasks unless there is a conflict with something else.

I have a server running Win Server 2012 with the same amount of physical memory. The swap file is still set to "Automatically manage paging file size for all drives"; I left that one alone since it was working OK. With one GPUGrid Python task running, it shows "Currently allocated" at 12800 MB, which is typical.

Check the free space available on your swap drive and make sure it has a minimum of 16 GB available. If you have plenty of space there, then I would suggest you set the swap space separately. I have found that sometimes the Automatic setting isn't fast enough, so try setting it to "System managed size" first. If that doesn't help, set it to "Custom size". You might need to play with the sizing a bit, but you can try an Initial size of 16384 MB and a Maximum size of 24576 MB or more.

The last 5 tasks are failing with various not-enough-memory errors, but the first traceback is something I have been seeing in a lot of the failing tasks. Just make sure you are not running anything that ties up too much memory and leaves too little available for GPUGrid. Other than that, it could be an internal error in the GPUGrid Python tasks causing it.
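The free-space check suggested above can be scripted. A minimal Python sketch (the drive path is an assumption, point it at whichever drive holds your page file):

```python
import shutil

# Advice from the post above: keep at least 16 GB free on the swap drive.
MIN_FREE_GB = 16

# shutil.disk_usage returns (total, used, free) in bytes.
# "/" is a placeholder; on Windows use e.g. "C:\\" or "D:\\".
free_gb = shutil.disk_usage("/").free / 1024**3
status = "OK" if free_gb >= MIN_FREE_GB else "low, consider freeing space"
print(f"free on swap drive: {free_gb:.1f} GB ({status})")
```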
**Joined: 30 Jun 14 · Posts: 153 · Credit: 129,654,684 · RAC: 0**
I have a whole HDD set aside for BOINC with 303 GB of space left; all the data files are there. I run FAH plus all the projects you see in my profile here, and I am at around 73% memory usage. My BOINC settings:

- Disk: leave 20 GB free
- Memory: 90% when the computer is in use, 98% when not in use
- Leave non-GPU tasks in memory: yes
- Page/swap: use at most 90%

You would think with these settings it has more than enough space to do what it needs to do. According to BOINC the current task uses 1932 MB physical and 3632 MB virtual; BOINC says the virtual size is 3.55 GB and the working set is 1.89 GB.

I checked again after maxing everything out, and this error keeps repeating:

```
OSError: [WinError 1455] The paging file is too small for this operation to complete.
Error loading "D:\data\slots\1\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll" or one of its dependencies.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "D:\data\slots\1\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "D:\data\slots\1\lib\multiprocessing\spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
```

This seems to be an error in the code rather than the paging size; I've opened BOINC up to the max. I think this was also a teething error in the Python CPU app at RAH, and not paging size there either. And after adjustments I get this:

```
Detected memory leaks!
Dumping objects ->
..\api\boinc_api.cpp(309) : {13199} normal block at 0x000001B0A0972890, 8 bytes long.
 Data: < > 00 00 94 A0 B0 01 00 00
..\lib\diagnostics_win.cpp(417) : {11918} normal block at 0x000001B0A0998B40, 1080 bytes long.
 Data: <<j 4 > 3C 6A 00 00 CD CD CD CD 34 01 00 00 00 00 00 00
..\zip\boinc_zip.cpp(122) : {397} normal block at 0x000001B0A09708F0, 260 bytes long.
 Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
{383} normal block at 0x000001B0A096AA80, 52 bytes long.
 Data: < r > 01 00 00 00 72 00 CD CD 00 00 00 00 00 00 00 00
{378} normal block at 0x000001B0A096ABD0, 43 bytes long.
 Data: < p > 01 00 00 00 70 00 CD CD 00 00 00 00 00 00 00 00
{373} normal block at 0x000001B0A096AD90, 44 bytes long.
 Data: < > 01 00 00 00 00 00 CD CD B1 AD 96 A0 B0 01 00 00
{368} normal block at 0x000001B0A096AD20, 44 bytes long.
 Data: < A > 01 00 00 00 00 00 CD CD 41 AD 96 A0 B0 01 00 00
Object dump complete.
09:46:01 (13124): wrapper (7.9.26016): starting
09:46:01 (13124): wrapper: running python.exe (run.py)
Detected memory leaks!
Dumping objects ->
..\api\boinc_api.cpp(309) : {13134} normal block at 0x0000023C80BA32A0, 8 bytes long.
 Data: < R < > 00 00 52 82 3C 02 00 00
..\lib\diagnostics_win.cpp(417) : {11853} normal block at 0x0000023C80BCF400, 1080 bytes long.
 Data: <$2 P > 24 32 00 00 CD CD CD CD 50 01 00 00 00 00 00 00
..\zip\boinc_zip.cpp(122) : {397} normal block at 0x0000023C80BA3C60, 260 bytes long.
 Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
{383} normal block at 0x0000023C80B9AA70, 52 bytes long.
 Data: < r > 01 00 00 00 72 00 CD CD 00 00 00 00 00 00 00 00
{378} normal block at 0x0000023C80B9AC30, 43 bytes long.
 Data: < p > 01 00 00 00 70 00 CD CD 00 00 00 00 00 00 00 00
{373} normal block at 0x0000023C80B9A840, 44 bytes long.
 Data: < a < > 01 00 00 00 00 00 CD CD 61 A8 B9 80 3C 02 00 00
{368} normal block at 0x0000023C80B9A990, 44 bytes long.
 Data: < < > 01 00 00 00 00 00 CD CD B1 A9 B9 80 3C 02 00 00
Object dump complete.
```

But then it goes on to start running.
**Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 428**
I posted some screenshots of paging-file settings in message 58934. I'd had similar failures with only 8 GB of system RAM installed; with 16 GB and those settings, the Python app ran, though it's not a very efficient use of that particular machine.
**Joined: 30 Jun 14 · Posts: 153 · Credit: 129,654,684 · RAC: 0**
I've searched Windows and the net on how to do that, and nothing matches those screenshots; nothing from the net matches my Win 10 64-bit software either. Can you tell me how to get to the tabs you took the screenshots of?
**Joined: 30 Jun 14 · Posts: 153 · Credit: 129,654,684 · RAC: 0**
Found this info in boinc_task_state.xml:

```
<project_master_url>https://www.gpugrid.net/</project_master_url>
<result_name>e00028a00502-ABOU_rnd_ppod_expand_demos6_again2-0-1-RND4470_2</result_name>
<checkpoint_cpu_time>31287.720000</checkpoint_cpu_time>
<checkpoint_elapsed_time>15281.828158</checkpoint_elapsed_time>
<fraction_done>0.059200</fraction_done>
<peak_working_set_size>2470195200</peak_working_set_size>
<peak_swap_size>6816833536</peak_swap_size>
<peak_disk_usage>17117387104</peak_disk_usage>
```

I am assuming these huge values are in bytes?
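The peak_* values are indeed plain bytes. A small Python sketch converting the fragment quoted above to GB (the fragment is wrapped in a dummy root element so it parses on its own):

```python
import xml.etree.ElementTree as ET

# Fragment from boinc_task_state.xml, wrapped in a dummy root for parsing.
fragment = """<root>
  <peak_working_set_size>2470195200</peak_working_set_size>
  <peak_swap_size>6816833536</peak_swap_size>
  <peak_disk_usage>17117387104</peak_disk_usage>
</root>"""

root = ET.fromstring(fragment)
for tag in ("peak_working_set_size", "peak_swap_size", "peak_disk_usage"):
    gb = int(root.findtext(tag)) / 1024**3  # bytes -> GiB
    print(f"{tag}: {gb:.1f} GB")
```

So this task peaked at roughly 2.3 GB working set, 6.3 GB swap, and 16 GB of disk.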
**Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 428**
> Can you tell me how to get to the tabs you did the screenshot of?

All these low-level Windows management tools have barely changed since the Windows NT 4 days, but the route to finding them changes every time. The screenshots I posted were from Windows 7, but here's the routing for Windows 11 - split the difference... For the final dialog, unset the first and third options ('Automatic' and 'System' management) and set 'Custom' to open up all the options.
**Joined: 30 Jun 14 · Posts: 153 · Credit: 129,654,684 · RAC: 0**
After a little trial and error I found my way to that location. I set it to 144MB (3x physical) to start and gave it 154MB max. We'll see if this helps anything.
**Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 891**
> after a little trial and error I found a way to that location.

That's way undersized. It should be GBs, not MBs. From your task data:

```
<peak_disk_usage>17117387104</peak_disk_usage>
```

That is 17 GB of disk usage. I would set 17 GB (17000 MB) for the initial size and double it, 34 GB (34000 MB), for the max size.
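That rule of thumb can be written down as a short sketch (my own arithmetic, not an official BOINC formula): round the task's peak footprint up to whole GiB for the initial page-file size and double it for the maximum.

```python
import math

def page_file_sizes_mb(peak_bytes):
    """Round peak usage up to whole GiB for the initial page-file size,
    and double that for the maximum; both values returned in MB."""
    initial_gb = math.ceil(peak_bytes / 1024**3)
    return initial_gb * 1024, initial_gb * 2 * 1024

# peak_disk_usage from the task quoted above
initial_mb, max_mb = page_file_sizes_mb(17117387104)
print(initial_mb, max_mb)  # 16384 32768, i.e. 16 GB initial / 32 GB max
```

This lands slightly under the 17000/34000 MB suggested above because 17117387104 bytes is just under 16 GiB; either setting leaves ample headroom.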
**Joined: 30 Jun 14 · Posts: 153 · Credit: 129,654,684 · RAC: 0**
> after a little trial and error I found a way to that location.

Oh! Thanks... I will make the change: 170000 and 340000.
**Joined: 30 Jun 14 · Posts: 153 · Credit: 129,654,684 · RAC: 0**
Well, that seems to have solved the problem on my Win10 machine: 2 tasks ran and completed OK. Thanks, Keith! Curious, though, why it errors out when it has so much space, and only here, not in other projects?
**Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 891**
Go back and read this post of mine: https://www.gpugrid.net/forum_thread.php?id=5322&nowrap=true#58908 It only affects projects that use PyTorch on Windows, which has large DLLs that Windows MUST reserve a lot of memory for. I don't think there are any other BOINC projects that use PyTorch, so they're not affected.
**Joined: 30 Jun 14 · Posts: 153 · Credit: 129,654,684 · RAC: 0**
> Go back and read this post of mine.

I had never heard of that; I wondered what it was. Having read it, it explains why no Python GPU work, or anything GPU at all, is used at my oldest project, RAH. They have Python CPU tasks to run, generated by an external client, but that's about it for us BOINC users; they keep all the really interesting stuff in-house for the AI system.
**Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 891**
Once again, GPUGrid is on the cutting edge of GPU science among BOINC projects with its machine learning and AI development. They were the first BOINC project to use GPUs, and I like that they are still pushing the envelope. The only other machine-learning BOINC project I know about is MLC@Home, and they only use CPUs now; they had a GPU app a few years ago, but I don't think they are producing any tasks for GPUs currently.
**Joined: 30 Jun 14 · Posts: 153 · Credit: 129,654,684 · RAC: 0**
I like projects that push the boundaries and look for things that have not been done before, either in the code or in the ideas of what to send out for crunching.
**Joined: 28 Jul 12 · Posts: 819 · Credit: 1,591,285,971 · RAC: 0**
> Today, it is impossible for a human to take into account the results, even limited to the most important data, for millions of known molecules. The second objective of this project is to radically change the approach by developing artificial intelligence and optimization methods in order to efficiently explore the highly combinatorial molecular space.

https://quchempedia.univ-angers.fr/athome/about.php

QuChemPedIA is an AI project, though CPU-only, and it works best with Linux. You can use Windows with VirtualBox, but there are a lot of stuck work units you have to deal with.
**Joined: 30 Jun 14 · Posts: 153 · Credit: 129,654,684 · RAC: 0**
> Today, it is impossible for a human to take into account the results, even limited to the most important data, for millions of known molecules.

I know it, and for that exact reason, plus other technical errors, I gave up; I can't get it to run stably on my Windows system, so forget it. My GPUs get enough action with this project, PrimeGrid, and FAH, as well as Einstein. I think I am attached to enough projects to keep this system busy all the time it runs (16 hours a day).
**Joined: 30 Jun 14 · Posts: 153 · Credit: 129,654,684 · RAC: 0**
So... a new wrinkle. I have two tasks running at the same time, and RAH is complaining about disk space for its Python CPU tasks, even though I've maxed out the upper value:

```
rosetta python projects needs 3624.20MB more disk space. You currently have 15449.28 MB available and it needs 19073.49 MB.
```

So what do I have to do? I suppose I will have to restrict this project to 1 GPU in order to solve this disk-space problem?
**Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 428**
Disk-space limits can be solved by tweaking BOINC's limits. They're quite separate and distinct from the memory (RAM) problems you were having here earlier.
©2025 Universitat Pompeu Fabra