Name | wu_19b8feca-GIANNI_GPROTO7-0-1-RND9683_0 |
Workunit | 31544496 |
Created | 29 Sep 2025, 21:54:52 UTC |
Sent | 29 Sep 2025, 21:55:15 UTC |
Report deadline | 4 Oct 2025, 21:55:15 UTC |
Received | 30 Sep 2025, 1:06:42 UTC |
Server state | Over |
Outcome | Computation error |
Client state | Compute error |
Exit status | 195 (0x000000C3) EXIT_CHILD_FAILED |
Computer ID | 616787 |
Run time | 3 min 10 sec |
CPU time | 34 sec |
Validate state | Invalid |
Credit | 0.00 |
Device peak FLOPS | 16,318.25 GFLOPS |
Application version | LLM: LLMs for chemistry v1.00 (cuda124L) x86_64-pc-linux-gnu |
Peak working set size | 1.52 GB |
Peak swap size | 29.44 GB |
Peak disk usage | 8.21 GB |
<core_client_version>8.2.5</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
2025-09-29 20:00:34 (35159): wrapper (8.1.26018): starting
2025-09-29 20:02:51 (35159): wrapper: running bin/python (bin/conda-unpack)
2025-09-29 20:02:51 (35159): wrapper: created child process 35230
2025-09-29 20:02:58 (35159): bin/python exited; CPU time 1.144690
2025-09-29 20:02:58 (35159): wrapper: running bin/tar (xjvf input.tar.bz2)
2025-09-29 20:02:58 (35159): wrapper: created child process 35231
2025-09-29 20:02:59 (35159): bin/tar exited; CPU time 0.025372
2025-09-29 20:02:59 (35159): wrapper: running bin/bash (run.sh)
2025-09-29 20:02:59 (35159): wrapper: created child process 35233
+ echo 'Setup environment'
+ source bin/activate
++ _conda_pack_activate
++ local _CONDA_SHELL_FLAVOR
++ '[' -n x ']'
++ _CONDA_SHELL_FLAVOR=bash
++ local script_dir
++ case "$_CONDA_SHELL_FLAVOR" in
+++ dirname bin/activate
++ script_dir=bin
+++ cd bin
+++ pwd
++ local full_path_script_dir=/var/lib/boinc-client/slots/2/bin
+++ dirname /var/lib/boinc-client/slots/2/bin
++ local full_path_env=/var/lib/boinc-client/slots/2
+++ basename /var/lib/boinc-client/slots/2
++ local env_name=2
++ '[' -n '' ']'
++ export CONDA_PREFIX=/var/lib/boinc-client/slots/2
++ CONDA_PREFIX=/var/lib/boinc-client/slots/2
++ export _CONDA_PACK_OLD_PS1=
++ _CONDA_PACK_OLD_PS1=
++ PATH=/var/lib/boinc-client/slots/2/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
++ PS1='(2) '
++ case "$_CONDA_SHELL_FLAVOR" in
++ hash -r
++ local _script_dir=/var/lib/boinc-client/slots/2/etc/conda/activate.d
++ '[' -d /var/lib/boinc-client/slots/2/etc/conda/activate.d ']'
+ export PATH=/var/lib/boinc-client/slots/2:/var/lib/boinc-client/slots/2/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
+ PATH=/var/lib/boinc-client/slots/2:/var/lib/boinc-client/slots/2/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
+ echo 'Create a temporary directory'
+ export TMP=/var/lib/boinc-client/slots/2/tmp
+ TMP=/var/lib/boinc-client/slots/2/tmp
+ mkdir -p /var/lib/boinc-client/slots/2/tmp
+ which python
+ pip install main_generation-0.1.0-py3-none-any.whl -v --no-deps
+ export CUDA_VISIBLE_DEVICES=0
+ CUDA_VISIBLE_DEVICES=0
+ export HF_HOME=../.cache
+ HF_HOME=../.cache
+ export VLLM_ASSETS_CACHE=../.cache
+ VLLM_ASSETS_CACHE=../.cache
+ export VLLM_CACHE_ROOT=../.cache
+ VLLM_CACHE_ROOT=../.cache
+ echo RUNNING
+ pythonbinary=/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/aiengine/main_generation.pyc
+ python /var/lib/boinc-client/slots/2/lib/python3.12/site-packages/aiengine/main_generation.pyc --conf conf.yaml
Generating train split: 0 examples [00:00, ? examples/s]
Generating train split: 2500 examples [00:00, 159513.20 examples/s]
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:19<00:19, 19.79s/it]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:19<00:19, 19.84s/it]
[rank0]: Traceback (most recent call last):
[rank0]: File "wheel_contents/aiengine/main_generation.py", line 87, in <module>
[rank0]: File "wheel_contents/aiengine/model.py", line 36, in __init__
[rank0]: File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/utils.py", line 1096, in inner
[rank0]: return fn(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 243, in __init__
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 521, in from_engine_args
[rank0]: return engine_cls.from_vllm_config(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 497, in from_vllm_config
[rank0]: return cls(
[rank0]: ^^^^
[rank0]: File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 281, in __init__
[rank0]: self.model_executor = executor_class(vllm_config=vllm_config, )
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 52, in __init__
[rank0]: self._init_executor()
[rank0]: File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
[rank0]: self.collective_rpc("load_model")
[rank0]: File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
[rank0]: answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/utils.py", line 2347, in run_method
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/worker/worker.py", line 183, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1113, in load_model
[rank0]: self.model = get_model(vllm_config=self.vllm_config)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
[rank0]: return loader.load_model(vllm_config=vllm_config)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/model_executor/model_loader/loader.py", line 1280, in load_model
[rank0]: self._load_weights(model_config, model)
[rank0]: File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/model_executor/model_loader/loader.py", line 1188, in _load_weights
[rank0]: loaded_weights = model.load_weights(qweight_iterator)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/model_executor/models/qwen2.py", line 490, in load_weights
[rank0]: return loader.load_weights(weights)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 261, in load_weights
[rank0]: autoloaded_weights = set(self._load_module("", self.module, weights))
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 213, in _load_module
[rank0]: for child_prefix, child_weights in self._groupby_prefix(weights):
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 103, in _groupby_prefix
[rank0]: for prefix, group in itertools.groupby(weights_by_parts,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 101, in <genexpr>
[rank0]: for weight_name, weight_data in weights)
[rank0]: ^^^^^^^
[rank0]: File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/model_executor/model_loader/loader.py", line 964, in _quantized_4bit_generator
[rank0]: ) in weight_iterator:
[rank0]: ^^^^^^^^^^^^^^^
[rank0]: File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/model_executor/model_loader/loader.py", line 862, in _hf_weight_iter
[rank0]: for org_name, param in iterator:
[rank0]: ^^^^^^^^
[rank0]: File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/model_executor/model_loader/weight_utils.py", line 443, in safetensors_weights_iterator
[rank0]: param = f.get_tensor(name)
[rank0]: ^^^^^^^^^^^^^^^^^^
[rank0]: File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/torch/utils/_device.py", line 104, in __torch_function__
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.45 GiB. GPU 0 has a total capacity of 11.77 GiB of which 729.62 MiB is free. Including non-PyTorch memory, this process has 11.04 GiB memory in use. Of the allocated memory 9.24 GiB is allocated by PyTorch, and 1.44 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]:[W929 20:03:59.876860295 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
2025-09-29 20:04:01 (35159): bin/bash exited; CPU time 36.639700
2025-09-29 20:04:01 (35159): app exit status: 0x1
2025-09-29 20:04:01 (35159): called boinc_finish(195)
</stderr_txt>
]]>
©2025 Universitat Pompeu Fabra