Task 38579024

Name wu_30b3e334-GIANNI_GPROTO7-0-1-RND4278_0
Workunit 31544438
Created 29 Sep 2025, 21:42:16 UTC
Sent 29 Sep 2025, 21:42:20 UTC
Report deadline 4 Oct 2025, 21:42:20 UTC
Received 30 Sep 2025, 1:00:05 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 195 (0x000000C3) EXIT_CHILD_FAILED
Computer ID 616787
Run time 3 min 33 sec
CPU time 28 sec
Validate state Invalid
Credit 0.00
Device peak FLOPS 16,318.25 GFLOPS
Application version LLM: LLMs for chemistry v1.00 (cuda124L) x86_64-pc-linux-gnu
Peak working set size 1.19 GB
Peak swap size 29.29 GB
Peak disk usage 15.72 GB

Stderr output

<core_client_version>8.2.5</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
2025-09-29 19:56:05 (34630): wrapper (8.1.26018): starting
2025-09-29 19:58:39 (34630): wrapper: running bin/python (bin/conda-unpack)
2025-09-29 19:58:39 (34630): wrapper: created child process 34815
2025-09-29 19:58:48 (34630): bin/python exited; CPU time 1.389815
2025-09-29 19:58:48 (34630): wrapper: running bin/tar (xjvf input.tar.bz2)
2025-09-29 19:58:48 (34630): wrapper: created child process 34817
2025-09-29 19:58:49 (34630): bin/tar exited; CPU time 0.026518
2025-09-29 19:58:49 (34630): wrapper: running bin/bash (run.sh)
2025-09-29 19:58:49 (34630): wrapper: created child process 34819
+ echo 'Setup environment'
+ source bin/activate
++ _conda_pack_activate
++ local _CONDA_SHELL_FLAVOR
++ '[' -n x ']'
++ _CONDA_SHELL_FLAVOR=bash
++ local script_dir
++ case "$_CONDA_SHELL_FLAVOR" in
+++ dirname bin/activate
++ script_dir=bin
+++ cd bin
+++ pwd
++ local full_path_script_dir=/var/lib/boinc-client/slots/2/bin
+++ dirname /var/lib/boinc-client/slots/2/bin
++ local full_path_env=/var/lib/boinc-client/slots/2
+++ basename /var/lib/boinc-client/slots/2
++ local env_name=2
++ '[' -n '' ']'
++ export CONDA_PREFIX=/var/lib/boinc-client/slots/2
++ CONDA_PREFIX=/var/lib/boinc-client/slots/2
++ export _CONDA_PACK_OLD_PS1=
++ _CONDA_PACK_OLD_PS1=
++ PATH=/var/lib/boinc-client/slots/2/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
++ PS1='(2) '
++ case "$_CONDA_SHELL_FLAVOR" in
++ hash -r
++ local _script_dir=/var/lib/boinc-client/slots/2/etc/conda/activate.d
++ '[' -d /var/lib/boinc-client/slots/2/etc/conda/activate.d ']'
+ export PATH=/var/lib/boinc-client/slots/2:/var/lib/boinc-client/slots/2/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
+ PATH=/var/lib/boinc-client/slots/2:/var/lib/boinc-client/slots/2/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
+ echo 'Create a temporary directory'
+ export TMP=/var/lib/boinc-client/slots/2/tmp
+ TMP=/var/lib/boinc-client/slots/2/tmp
+ mkdir -p /var/lib/boinc-client/slots/2/tmp
+ which python
+ pip install main_generation-0.1.0-py3-none-any.whl -v --no-deps
+ export CUDA_VISIBLE_DEVICES=0
+ CUDA_VISIBLE_DEVICES=0
+ export HF_HOME=../.cache
+ HF_HOME=../.cache
+ export VLLM_ASSETS_CACHE=../.cache
+ VLLM_ASSETS_CACHE=../.cache
+ export VLLM_CACHE_ROOT=../.cache
+ VLLM_CACHE_ROOT=../.cache
+ echo RUNNING
+ pythonbinary=/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/aiengine/main_generation.pyc
+ python /var/lib/boinc-client/slots/2/lib/python3.12/site-packages/aiengine/main_generation.pyc --conf conf.yaml

Generating train split: 0 examples [00:00, ? examples/s]
Generating train split: 2500 examples [00:00, 146422.58 examples/s]

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]

Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:14<00:14, 14.30s/it]

Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:14<00:14, 14.34s/it]

[rank0]: Traceback (most recent call last):
[rank0]:   File "wheel_contents/aiengine/main_generation.py", line 87, in <module>
[rank0]:   File "wheel_contents/aiengine/model.py", line 36, in __init__
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/utils.py", line 1096, in inner
[rank0]:     return fn(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 243, in __init__
[rank0]:     self.llm_engine = LLMEngine.from_engine_args(
[rank0]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 521, in from_engine_args
[rank0]:     return engine_cls.from_vllm_config(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 497, in from_vllm_config
[rank0]:     return cls(
[rank0]:            ^^^^
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 281, in __init__
[rank0]:     self.model_executor = executor_class(vllm_config=vllm_config, )
[rank0]:                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 52, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
[rank0]:     self.collective_rpc("load_model")
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
[rank0]:     answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/utils.py", line 2347, in run_method
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/worker/worker.py", line 183, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1113, in load_model
[rank0]:     self.model = get_model(vllm_config=self.vllm_config)
[rank0]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
[rank0]:     return loader.load_model(vllm_config=vllm_config)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/model_executor/model_loader/loader.py", line 1280, in load_model
[rank0]:     self._load_weights(model_config, model)
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/model_executor/model_loader/loader.py", line 1188, in _load_weights
[rank0]:     loaded_weights = model.load_weights(qweight_iterator)
[rank0]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/model_executor/models/qwen2.py", line 490, in load_weights
[rank0]:     return loader.load_weights(weights)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 261, in load_weights
[rank0]:     autoloaded_weights = set(self._load_module("", self.module, weights))
[rank0]:                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 213, in _load_module
[rank0]:     for child_prefix, child_weights in self._groupby_prefix(weights):
[rank0]:                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 103, in _groupby_prefix
[rank0]:     for prefix, group in itertools.groupby(weights_by_parts,
[rank0]:                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 101, in <genexpr>
[rank0]:     for weight_name, weight_data in weights)
[rank0]:                                     ^^^^^^^
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/model_executor/model_loader/loader.py", line 964, in _quantized_4bit_generator
[rank0]:     ) in weight_iterator:
[rank0]:          ^^^^^^^^^^^^^^^
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/model_executor/model_loader/loader.py", line 862, in _hf_weight_iter
[rank0]:     for org_name, param in iterator:
[rank0]:                            ^^^^^^^^
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/model_executor/model_loader/weight_utils.py", line 443, in safetensors_weights_iterator
[rank0]:     param = f.get_tensor(name)
[rank0]:             ^^^^^^^^^^^^^^^^^^
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/torch/utils/_device.py", line 104, in __torch_function__
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.45 GiB. GPU 0 has a total capacity of 11.77 GiB of which 729.62 MiB is free. Including non-PyTorch memory, this process has 11.04 GiB memory in use. Of the allocated memory 9.24 GiB is allocated by PyTorch, and 1.44 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]:[W929 19:59:46.067061360 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
2025-09-29 19:59:47 (34630): bin/bash exited; CPU time 37.271689
2025-09-29 19:59:47 (34630): app exit status: 0x1
2025-09-29 19:59:47 (34630): called boinc_finish(195)

</stderr_txt>
]]>


©2025 Universitat Pompeu Fabra