LLMs crashing

WPrion · Joined: 30 Apr 13 · Posts: 109 · Credit: 3,977,737,860 · RAC: 6,051
Message 62417 - Posted: 11 May 2025, 16:50:50 UTC · Last modified: 11 May 2025, 17:14:27 UTC

After a decent run of mostly successes, I've had a couple of LLMs error out, e.g.:

Workunit 31492014
Outcome: Computation error
Client state: Compute error
Exit status: 195 (0x000000C3) EXIT_CHILD_FAILED
<message> The operating system cannot run (null). (0xc3) - exit code 195 (0xc3)</message>

The behavior was strange: the task loaded 12 GB into GPU memory, but the GPU remained basically idle until the task errored. The CPU was used for a while, but not the GPU.

I ran an Einstein task just to satisfy myself that the GPU works normally on that project. Of course, LLMs are different.

Help appreciated.

WPrion
Message 62419 - Posted: 12 May 2025, 10:57:02 UTC - in response to Message 62417 · Last modified: 12 May 2025, 10:59:27 UTC

After many more errors, one completed successfully, and another ~~looks promising~~ also completed.

98J_SSG · Joined: 15 Jun 09 · Posts: 12 · Credit: 845,727,756 · RAC: 3,137
Message 62420 - Posted: 13 May 2025, 12:45:01 UTC

Good morning! I've received 3 of the small LLM tasks, and all of them failed. For 2 of them, the stderr output included the following:

C:\ProgramData\BOINC\slots\17\Lib\site-packages\huggingface_hub\file_download.py:144: UserWarning: `huggingface_hub` cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\ProgramData\BOINC\slots\.cache\hub\models--Acellera--gpugrid. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
  warnings.warn(message)

The Hub Python library reference (https://huggingface.co/docs/huggingface_hub/v0.31.2/guides/manage-cache#limitations) says the same thing: symlinks on Windows require either Developer Mode or running Python as an administrator.

I am not a programmer, and I don't know whether I've opened a whole new can of worms, but I set my PC to Developer Mode to see if it makes any difference should I get more of the small LLM tasks. Since I don't know what is actually happening and am only reacting to that one detail, I might not be helping anything at all.

Any insight from those who know what's happening here would be greatly appreciated.
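For anyone who wants to test the symlink question directly instead of waiting for the next task, here is a minimal Python sketch. It only tries to create a symlink in a given directory and reports whether that worked, which is the capability the warning above is about. The function name `symlinks_supported` is made up for this example, and the value `"1"` for `HF_HUB_DISABLE_SYMLINKS_WARNING` is an assumption (the warning names the variable but not the value). Note that the warning itself says caching still works without symlinks, so this check may turn out to be unrelated to the task failures.

```python
import os
import tempfile


def symlinks_supported(directory: str) -> bool:
    """Return True if a symlink can be created inside `directory`.

    Hypothetical helper for illustration; not part of huggingface_hub.
    """
    # Work in a throwaway subdirectory so nothing is left behind.
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        target = os.path.join(tmp, "target.txt")
        link = os.path.join(tmp, "link.txt")
        with open(target, "w") as fh:
            fh.write("probe")
        try:
            os.symlink(target, link)
        except (OSError, NotImplementedError):
            # On Windows this typically fails without Developer Mode
            # or administrator rights.
            return False
        return os.path.islink(link)


if __name__ == "__main__":
    # Assumption: probing the system temp directory; point this at the
    # directory you actually care about.
    print("symlinks supported:", symlinks_supported(tempfile.gettempdir()))

    # Assumption: "1" is accepted as a truthy value. This only hides the
    # warning for this process; it does not enable symlinks.
    os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"
```

Running it once before and once after enabling Developer Mode, with the directory set to the cache location from the warning, should show whether the setting changed anything on that machine.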

98J_SSG
Message 62421 - Posted: 13 May 2025, 23:49:57 UTC · Last modified: 13 May 2025, 23:57:16 UTC

It might have been luck, but the last small LLM task I got completed successfully. It's the only task I've received since the change, so I don't know whether enabling Developer Mode helped or I was just lucky. If more are successful, then it might be more than luck.

Does anyone else know if Developer Mode increases or frees up the GPU memory available for these tasks?

Thanks!