LLMs crashing

Message boards : Number crunching : LLMs crashing
WPrion

Joined: 30 Apr 13
Posts: 106
Credit: 3,805,237,860
RAC: 65
Message 62417 - Posted: 11 May 2025, 16:50:50 UTC
Last modified: 11 May 2025, 17:14:27 UTC

After a decent run of mostly successful tasks, I've had a couple of LLM tasks error out, e.g.:
Workunit 31492014
Outcome Computation error
Client state Compute error
Exit status 195 (0x000000C3) EXIT_CHILD_FAILED

<message>
The operating system cannot run (null).
(0xc3) - exit code 195 (0xc3)</message>

The behavior was strange in that it loaded 12 GB into the GPU memory, but the GPU remained basically idle until the task errored. The CPU was used for a while, but not the GPU.
I ran an Einstein task just to satisfy myself that the GPU works normally on that project. Of course, LLMs are different.
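
One way to confirm the "memory loaded but GPU idle" pattern is to poll `nvidia-smi` while a task runs. The following is only a minimal sketch, assuming an NVIDIA GPU with `nvidia-smi` on the PATH; the helper names and the thresholds (5% utilization, 1024 MiB) are made up for illustration and are not part of BOINC or GPUGRID.

```python
import subprocess

# Ask nvidia-smi for GPU utilization (%) and used memory (MiB) as bare CSV.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]

def parse_gpu_line(line):
    """Parse one CSV line such as '3, 12288' into (utilization %, memory MiB)."""
    util, mem = (field.strip() for field in line.split(","))
    return int(util), int(mem)

def looks_idle_but_loaded(util_pct, mem_mib, util_max=5, mem_min=1024):
    """True when a lot of memory is allocated but the GPU is barely busy.

    The default thresholds are arbitrary assumptions, not measured values.
    """
    return util_pct <= util_max and mem_mib >= mem_min

def sample_gpus():
    """Run nvidia-smi once and return one (utilization, memory) tuple per GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]
```

Calling `sample_gpus()` every few seconds during a task and passing each tuple to `looks_idle_but_loaded` would show whether the GPU stays idle for the whole run or only during model loading.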

Help appreciated.
WPrion

Message 62419 - Posted: 12 May 2025, 10:57:02 UTC - in response to Message 62417.  
Last modified: 12 May 2025, 10:59:27 UTC

After many more errors, one task completed successfully, and another that looked promising also completed.
98J_SSG

Joined: 15 Jun 09
Posts: 12
Credit: 729,477,756
RAC: 84
Message 62420 - Posted: 13 May 2025, 12:45:01 UTC

Good morning!

I've received 3 of the small LLM tasks, and all of them failed. For 2 of them, the stderr output included the following:

C:\ProgramData\BOINC\slots\17\Lib\site-packages\huggingface_hub\file_download.py:144: UserWarning: `huggingface_hub` cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\ProgramData\BOINC\slots\.cache\hub\models--Acellera--gpugrid. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
  warnings.warn(message)

According to the Hub Python library reference (https://huggingface.co/docs/huggingface_hub/v0.31.2/guides/manage-cache#limitations), supporting symlinks on Windows requires either enabling Developer Mode or running Python as an administrator.

I am not a programmer, and I don't know whether I've opened a whole new can of worms, but I set my PC to Developer Mode to see whether it makes any difference if I get more of the small LLM tasks. Since I don't know what is actually happening and am only reacting to that one detail, I might not be helping anything at all.
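
Whether Developer Mode actually changed anything for symlinks can be checked directly. The sketch below is a hypothetical helper (not part of `huggingface_hub` or BOINC) that simply tries to create a symlink in a given directory. Note that the warning itself says caching still works in a degraded mode without symlinks, so this check may not explain the task failures.

```python
import os
import tempfile

def symlinks_supported(directory):
    """Try to create a symlink inside `directory`; return True on success."""
    # Work in a throwaway subdirectory so nothing is left behind.
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        target = os.path.join(tmp, "target.txt")
        link = os.path.join(tmp, "link.txt")
        with open(target, "w") as fh:
            fh.write("x")
        try:
            os.symlink(target, link)
        except (OSError, NotImplementedError):
            # On Windows this typically means the account lacks the privilege.
            return False
        return os.path.islink(link)

if __name__ == "__main__":
    print(symlinks_supported(tempfile.gettempdir()))
```

To test the cache location named in the warning, pass that directory instead of the temp directory. The result can differ between accounts and between elevated and non-elevated sessions, so it is only meaningful when run under the same account that runs the BOINC tasks.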

Any insight from those who know what's happening here would be greatly appreciated.
98J_SSG

Message 62421 - Posted: 13 May 2025, 23:49:57 UTC
Last modified: 13 May 2025, 23:57:16 UTC

It might have been luck, but the latest small LLM task I got completed successfully. It's the only task I've received so far, so I don't know whether enabling Developer Mode helped or I was just lucky. If more succeed, it might be more than luck.

Does anyone else know if Developer mode increases or frees up GPU memory available for these tasks to use?

Thanks!


©2025 Universitat Pompeu Fabra