Message boards : Number crunching : All tasks failed with Exit status 195 (0xc3) EXIT_CHILD_FAILED
**Joined: 30 Jun 14 · Posts: 153 · Credit: 129,654,684 · RAC: 0**
There seems to be a bug in these tasks. I'm seeing a 100% failure rate on my system and on the wingmen behind me; Windows 10 or 11 makes no difference, and a Linux user is seeing this too. One of my tasks had 4-5 failures behind me. On another task my first wingman failed, but he runs a 780, which doesn't have the firmware/software support needed to run these tasks. I have a 1080 and it failed; the last person had a 1050 and it ran OK. I don't get what is going on, or why this was not picked up in testing. I find this to be a common error message in the stderr file:

```
OSError: [WinError 1455] The paging file is too small for this operation to complete.
Error loading "D:\data\slots\1\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll" or one of its dependencies.
```
**Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 891**
That is a problem with how Windows reserves and allocates memory when loading all the Python DLLs; Linux does not have the issue. See this message of mine: https://www.gpugrid.net/forum_thread.php?id=5322&nowrap=true#58908 The solution is to increase the size of your paging file.
**Joined: 10 Nov 13 · Posts: 101 · Credit: 15,773,211,122 · RAC: 0**
I had to go back 6 tasks to find the one that failed with the paging-file error; more recent tasks are having a different problem, running out of memory somewhere. Your system looks like it has 48 GB of physical memory, so that should be sufficient to run the GPUGrid Python tasks unless there is a conflict with something else.

I have a server running Win Server 2012 with the same amount of physical memory. The swap file is still set to "Automatically manage paging file size for all drives"; I left that one alone since it was working OK. With one GPUGrid Python task running, it shows "Currently allocated" at 12800 MB, which is typical.

Check the free space available on your swap drive and make sure it has a minimum of 16 GB available. If you have plenty of space there, then I would suggest you set the swap space separately. I have found that sometimes the Automatic setting isn't fast enough, so try setting it to "System managed size" first. If that doesn't help, set it to "Custom size". You might need to play with the sizing a bit, but you can try an Initial size of 16384 MB and a Maximum size of 24576 MB or more.

The last 5 tasks are failing with various not-enough-memory errors, but the first traceback is something I have been seeing in a lot of the failing tasks. Just make sure you are not running anything that ties up too much memory and leaves too little available for GPUGrid. Other than that, it could be an internal error in the GPUGrid Python tasks causing it.
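The free-space check suggested above can be scripted. A minimal Python sketch (the drive path is an assumption, point it at whichever drive holds your page file):

```python
import shutil

# Advice from the post above: keep at least 16 GB free on the swap drive.
MIN_FREE_GB = 16

# shutil.disk_usage returns (total, used, free) in bytes.
# "/" is a placeholder; on Windows use e.g. "C:\\" or "D:\\".
free_gb = shutil.disk_usage("/").free / 1024**3
status = "OK" if free_gb >= MIN_FREE_GB else "low, consider freeing space"
print(f"free on swap drive: {free_gb:.1f} GB ({status})")
```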
**Joined: 30 Jun 14 · Posts: 153 · Credit: 129,654,684 · RAC: 0**
I have a whole HDD set aside for BOINC with 303 GB of space left; all the data files are there. I run FAH plus all the projects you see in my profile here, and I am at around 73% memory usage. My BOINC settings:

- Disk: leave 20 GB free
- Memory: 90% when the computer is in use, 98% when not in use
- Leave non-GPU tasks in memory: yes
- Page/swap: use at most 90%

You would think with these settings it has more than enough space to do what it needs to do. According to BOINC the current task uses 1932 MB physical and 3632 MB virtual; BOINC says the virtual size is 3.55 GB and the working set is 1.89 GB.

I checked again after maxing everything out, and this error keeps repeating:

```
OSError: [WinError 1455] The paging file is too small for this operation to complete.
Error loading "D:\data\slots\1\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll" or one of its dependencies.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "D:\data\slots\1\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "D:\data\slots\1\lib\multiprocessing\spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
```

This seems to be an error in the code rather than the paging size; I've opened BOINC up to the max. I think this was also a teething error in the Python CPU app at RAH, and not paging size there either. And after adjustments I get this:

```
Detected memory leaks!
Dumping objects ->
..\api\boinc_api.cpp(309) : {13199} normal block at 0x000001B0A0972890, 8 bytes long.
 Data: < > 00 00 94 A0 B0 01 00 00
..\lib\diagnostics_win.cpp(417) : {11918} normal block at 0x000001B0A0998B40, 1080 bytes long.
 Data: <<j 4 > 3C 6A 00 00 CD CD CD CD 34 01 00 00 00 00 00 00
..\zip\boinc_zip.cpp(122) : {397} normal block at 0x000001B0A09708F0, 260 bytes long.
 Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
{383} normal block at 0x000001B0A096AA80, 52 bytes long.
 Data: < r > 01 00 00 00 72 00 CD CD 00 00 00 00 00 00 00 00
{378} normal block at 0x000001B0A096ABD0, 43 bytes long.
 Data: < p > 01 00 00 00 70 00 CD CD 00 00 00 00 00 00 00 00
{373} normal block at 0x000001B0A096AD90, 44 bytes long.
 Data: < > 01 00 00 00 00 00 CD CD B1 AD 96 A0 B0 01 00 00
{368} normal block at 0x000001B0A096AD20, 44 bytes long.
 Data: < A > 01 00 00 00 00 00 CD CD 41 AD 96 A0 B0 01 00 00
Object dump complete.
09:46:01 (13124): wrapper (7.9.26016): starting
09:46:01 (13124): wrapper: running python.exe (run.py)
Detected memory leaks!
Dumping objects ->
..\api\boinc_api.cpp(309) : {13134} normal block at 0x0000023C80BA32A0, 8 bytes long.
 Data: < R < > 00 00 52 82 3C 02 00 00
..\lib\diagnostics_win.cpp(417) : {11853} normal block at 0x0000023C80BCF400, 1080 bytes long.
 Data: <$2 P > 24 32 00 00 CD CD CD CD 50 01 00 00 00 00 00 00
..\zip\boinc_zip.cpp(122) : {397} normal block at 0x0000023C80BA3C60, 260 bytes long.
 Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
{383} normal block at 0x0000023C80B9AA70, 52 bytes long.
 Data: < r > 01 00 00 00 72 00 CD CD 00 00 00 00 00 00 00 00
{378} normal block at 0x0000023C80B9AC30, 43 bytes long.
 Data: < p > 01 00 00 00 70 00 CD CD 00 00 00 00 00 00 00 00
{373} normal block at 0x0000023C80B9A840, 44 bytes long.
 Data: < a < > 01 00 00 00 00 00 CD CD 61 A8 B9 80 3C 02 00 00
{368} normal block at 0x0000023C80B9A990, 44 bytes long.
 Data: < < > 01 00 00 00 00 00 CD CD B1 A9 B9 80 3C 02 00 00
Object dump complete.
```

But then it goes on to start running.
**Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 428**
I posted some screenshots of paging-file settings in message 58934. I'd had similar failures with only 8 GB of system RAM installed; with 16 GB and those settings, the Python app ran, though it's not a very efficient use of that particular machine.
**Joined: 30 Jun 14 · Posts: 153 · Credit: 129,654,684 · RAC: 0**
I've searched Windows and the net on how to do that, and nothing matches those screenshots; nothing from the net matches my Win 10 64-bit software either. Can you tell me how to get to the tabs you took the screenshots of?
**Joined: 30 Jun 14 · Posts: 153 · Credit: 129,654,684 · RAC: 0**
Found this info in boinc_task_state.xml:

```
<project_master_url>https://www.gpugrid.net/</project_master_url>
<result_name>e00028a00502-ABOU_rnd_ppod_expand_demos6_again2-0-1-RND4470_2</result_name>
<checkpoint_cpu_time>31287.720000</checkpoint_cpu_time>
<checkpoint_elapsed_time>15281.828158</checkpoint_elapsed_time>
<fraction_done>0.059200</fraction_done>
<peak_working_set_size>2470195200</peak_working_set_size>
<peak_swap_size>6816833536</peak_swap_size>
<peak_disk_usage>17117387104</peak_disk_usage>
```

I am assuming these huge values are in bytes?
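The peak_* values are indeed plain bytes. A small Python sketch converting the fragment quoted above to GB (the fragment is wrapped in a dummy root element so it parses on its own):

```python
import xml.etree.ElementTree as ET

# Fragment from boinc_task_state.xml, wrapped in a dummy root for parsing.
fragment = """<root>
  <peak_working_set_size>2470195200</peak_working_set_size>
  <peak_swap_size>6816833536</peak_swap_size>
  <peak_disk_usage>17117387104</peak_disk_usage>
</root>"""

root = ET.fromstring(fragment)
for tag in ("peak_working_set_size", "peak_swap_size", "peak_disk_usage"):
    gb = int(root.findtext(tag)) / 1024**3  # bytes -> GiB
    print(f"{tag}: {gb:.1f} GB")
```

So this task peaked at roughly 2.3 GB working set, 6.3 GB swap, and 16 GB of disk.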
**Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 428**
> Can you tell me how to get to the tabs you did the screenshot of?

All these low-level Windows management tools have barely changed since the Windows NT 4 days, but the route to finding them changes every time. The screenshots I posted were from Windows 7, but here's the routing for Windows 11 - split the difference... For the final dialog, unset the first and third options ('Automatic' and 'System' management) and set 'Custom' to open up all the options.
**Joined: 30 Jun 14 · Posts: 153 · Credit: 129,654,684 · RAC: 0**
After a little trial and error I found my way to that location. I set it to 144MB (3x physical) to start and gave it 154MB max. We'll see if this helps anything.
**Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 891**
> after a little trial and error I found a way to that location.

That's way undersized. It should be GBs, not MBs. From your task data:

```
<peak_disk_usage>17117387104</peak_disk_usage>
```

That is 17 GB of disk usage. I would set 17 GB (17000 MB) for the initial size and double it, 34 GB (34000 MB), for the max size.
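That rule of thumb can be written down as a short sketch (my own arithmetic, not an official BOINC formula): round the task's peak footprint up to whole GiB for the initial page-file size and double it for the maximum.

```python
import math

def page_file_sizes_mb(peak_bytes):
    """Round peak usage up to whole GiB for the initial page-file size,
    and double that for the maximum; both values returned in MB."""
    initial_gb = math.ceil(peak_bytes / 1024**3)
    return initial_gb * 1024, initial_gb * 2 * 1024

# peak_disk_usage from the task quoted above
initial_mb, max_mb = page_file_sizes_mb(17117387104)
print(initial_mb, max_mb)  # 16384 32768, i.e. 16 GB initial / 32 GB max
```

This lands slightly under the 17000/34000 MB suggested above because 17117387104 bytes is just under 16 GiB; either setting leaves ample headroom.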
**Joined: 30 Jun 14 · Posts: 153 · Credit: 129,654,684 · RAC: 0**
> after a little trial and error I found a way to that location.

Oh! Thanks... I will make the change: 170000 and 340000.
**Joined: 30 Jun 14 · Posts: 153 · Credit: 129,654,684 · RAC: 0**
Well, that seems to have solved the problem on my Win10 machine: 2 tasks ran and completed OK. Thanks, Keith! Curious, though, why it errors out when it has so much space, and only here, not in other projects?
**Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 891**
Go back and read this post of mine: https://www.gpugrid.net/forum_thread.php?id=5322&nowrap=true#58908 It only affects projects that use PyTorch on Windows, which has large DLLs that Windows MUST reserve a lot of memory for. I don't think there are any other BOINC projects that use PyTorch, so they're not affected.
**Joined: 30 Jun 14 · Posts: 153 · Credit: 129,654,684 · RAC: 0**
> Go back and read this post of mine.

I had never heard of that; I wondered what it was. Having read it, it explains why no Python GPU work, or anything GPU at all, is used at my oldest project, RAH. They have Python CPU tasks to run, generated by an external client, but that's about it for us BOINC users; they keep all the really interesting stuff in-house for the AI system.
**Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 891**
Once again, GPUGrid is on the cutting edge of GPU science among BOINC projects with its machine learning and AI development. They were the first BOINC project to use GPUs, and I like that they are still pushing the envelope. The only other machine-learning BOINC project I know about is MLC@Home, and they only use CPUs now; they had a GPU app a few years ago, but I don't think they are producing any tasks for GPUs currently.
**Joined: 30 Jun 14 · Posts: 153 · Credit: 129,654,684 · RAC: 0**
I like projects that push the boundaries and look for things that have not been done before, either in the code or in the ideas of what to send out for crunching.
**Joined: 28 Jul 12 · Posts: 819 · Credit: 1,591,285,971 · RAC: 0**
> Today, it is impossible for a human to take into account the results, even limited to the most important data, for millions of known molecules. The second objective of this project is to radically change the approach by developing artificial intelligence and optimization methods in order to efficiently explore the highly combinatorial molecular space.

https://quchempedia.univ-angers.fr/athome/about.php

QuChemPedIA is an AI project, though CPU-only, and it works best with Linux. You can use Windows with VirtualBox, but there are a lot of stuck work units you have to deal with.
**Joined: 30 Jun 14 · Posts: 153 · Credit: 129,654,684 · RAC: 0**
> Today, it is impossible for a human to take into account the results, even limited to the most important data, for millions of known molecules.

I know it, and for that exact reason, plus other technical errors, I gave up; I can't get it to run stably on my Windows system, so forget it. My GPUs get enough action with this project, PrimeGrid, and FAH, as well as Einstein. I think I am attached to enough projects to keep this system busy all the time it runs (16 hours a day).
**Joined: 30 Jun 14 · Posts: 153 · Credit: 129,654,684 · RAC: 0**
So... a new wrinkle. I have two tasks running at the same time, and RAH is complaining about disk space for its Python CPU tasks, even though I've maxed out the upper value:

```
rosetta python projects needs 3624.20MB more disk space. You currently have 15449.28 MB available and it needs 19073.49 MB.
```

So what do I have to do? I suppose I will have to restrict this project to 1 GPU in order to solve this disk-space problem?
**Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 428**
Disk-space limits can be solved by tweaking BOINC's limits. They're quite separate and distinct from the memory (RAM) problems you were having here earlier.
©2025 Universitat Pompeu Fabra