Task 38579082

Name	wu_19b8feca-GIANNI_GPROTO7-0-1-RND9683_0
Workunit	31544496
Created	29 Sep 2025, 21:54:52 UTC
Sent	29 Sep 2025, 21:55:15 UTC
Report deadline	4 Oct 2025, 21:55:15 UTC
Received	30 Sep 2025, 1:06:42 UTC
Server state	Over
Outcome	Computation error
Client state	Compute error
Exit status	195 (0x000000C3) EXIT_CHILD_FAILED
Computer ID	616787
Run time	3 min 10 sec
CPU time	34 sec
Validate state	Invalid
Credit	0.00
Device peak FLOPS	16,318.25 GFLOPS
Application version	LLM: LLMs for chemistry v1.00 (cuda124L) x86_64-pc-linux-gnu
Peak working set size	1.52 GB
Peak swap size	29.44 GB
Peak disk usage	8.21 GB
Stderr output

<core_client_version>8.2.5</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
2025-09-29 20:00:34 (35159): wrapper (8.1.26018): starting
2025-09-29 20:02:51 (35159): wrapper: running bin/python (bin/conda-unpack)
2025-09-29 20:02:51 (35159): wrapper: created child process 35230
2025-09-29 20:02:58 (35159): bin/python exited; CPU time 1.144690
2025-09-29 20:02:58 (35159): wrapper: running bin/tar (xjvf input.tar.bz2)
2025-09-29 20:02:58 (35159): wrapper: created child process 35231
2025-09-29 20:02:59 (35159): bin/tar exited; CPU time 0.025372
2025-09-29 20:02:59 (35159): wrapper: running bin/bash (run.sh)
2025-09-29 20:02:59 (35159): wrapper: created child process 35233
+ echo 'Setup environment'
+ source bin/activate
++ _conda_pack_activate
++ local _CONDA_SHELL_FLAVOR
++ '[' -n x ']'
++ _CONDA_SHELL_FLAVOR=bash
++ local script_dir
++ case "$_CONDA_SHELL_FLAVOR" in
+++ dirname bin/activate
++ script_dir=bin
+++ cd bin
+++ pwd
++ local full_path_script_dir=/var/lib/boinc-client/slots/2/bin
+++ dirname /var/lib/boinc-client/slots/2/bin
++ local full_path_env=/var/lib/boinc-client/slots/2
+++ basename /var/lib/boinc-client/slots/2
++ local env_name=2
++ '[' -n '' ']'
++ export CONDA_PREFIX=/var/lib/boinc-client/slots/2
++ CONDA_PREFIX=/var/lib/boinc-client/slots/2
++ export _CONDA_PACK_OLD_PS1=
++ _CONDA_PACK_OLD_PS1=
++ PATH=/var/lib/boinc-client/slots/2/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
++ PS1='(2) '
++ case "$_CONDA_SHELL_FLAVOR" in
++ hash -r
++ local _script_dir=/var/lib/boinc-client/slots/2/etc/conda/activate.d
++ '[' -d /var/lib/boinc-client/slots/2/etc/conda/activate.d ']'
+ export PATH=/var/lib/boinc-client/slots/2:/var/lib/boinc-client/slots/2/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
+ PATH=/var/lib/boinc-client/slots/2:/var/lib/boinc-client/slots/2/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
+ echo 'Create a temporary directory'
+ export TMP=/var/lib/boinc-client/slots/2/tmp
+ TMP=/var/lib/boinc-client/slots/2/tmp
+ mkdir -p /var/lib/boinc-client/slots/2/tmp
+ which python
+ pip install main_generation-0.1.0-py3-none-any.whl -v --no-deps
+ export CUDA_VISIBLE_DEVICES=0
+ CUDA_VISIBLE_DEVICES=0
+ export HF_HOME=../.cache
+ HF_HOME=../.cache
+ export VLLM_ASSETS_CACHE=../.cache
+ VLLM_ASSETS_CACHE=../.cache
+ export VLLM_CACHE_ROOT=../.cache
+ VLLM_CACHE_ROOT=../.cache
+ echo RUNNING
+ pythonbinary=/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/aiengine/main_generation.pyc
+ python /var/lib/boinc-client/slots/2/lib/python3.12/site-packages/aiengine/main_generation.pyc --conf conf.yaml

Generating train split: 0 examples [00:00, ? examples/s]
Generating train split: 2500 examples [00:00, 159513.20 examples/s]

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]

Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:19<00:19, 19.79s/it]

Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:19<00:19, 19.84s/it]

[rank0]: Traceback (most recent call last):
[rank0]:   File "wheel_contents/aiengine/main_generation.py", line 87, in <module>
[rank0]:   File "wheel_contents/aiengine/model.py", line 36, in __init__
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/utils.py", line 1096, in inner
[rank0]:     return fn(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 243, in __init__
[rank0]:     self.llm_engine = LLMEngine.from_engine_args(
[rank0]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 521, in from_engine_args
[rank0]:     return engine_cls.from_vllm_config(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 497, in from_vllm_config
[rank0]:     return cls(
[rank0]:            ^^^^
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 281, in __init__
[rank0]:     self.model_executor = executor_class(vllm_config=vllm_config, )
[rank0]:                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 52, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
[rank0]:     self.collective_rpc("load_model")
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
[rank0]:     answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/utils.py", line 2347, in run_method
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/worker/worker.py", line 183, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1113, in load_model
[rank0]:     self.model = get_model(vllm_config=self.vllm_config)
[rank0]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
[rank0]:     return loader.load_model(vllm_config=vllm_config)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/model_executor/model_loader/loader.py", line 1280, in load_model
[rank0]:     self._load_weights(model_config, model)
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/model_executor/model_loader/loader.py", line 1188, in _load_weights
[rank0]:     loaded_weights = model.load_weights(qweight_iterator)
[rank0]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/model_executor/models/qwen2.py", line 490, in load_weights
[rank0]:     return loader.load_weights(weights)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 261, in load_weights
[rank0]:     autoloaded_weights = set(self._load_module("", self.module, weights))
[rank0]:                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 213, in _load_module
[rank0]:     for child_prefix, child_weights in self._groupby_prefix(weights):
[rank0]:                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 103, in _groupby_prefix
[rank0]:     for prefix, group in itertools.groupby(weights_by_parts,
[rank0]:                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 101, in <genexpr>
[rank0]:     for weight_name, weight_data in weights)
[rank0]:                                     ^^^^^^^
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/model_executor/model_loader/loader.py", line 964, in _quantized_4bit_generator
[rank0]:     ) in weight_iterator:
[rank0]:          ^^^^^^^^^^^^^^^
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/model_executor/model_loader/loader.py", line 862, in _hf_weight_iter
[rank0]:     for org_name, param in iterator:
[rank0]:                            ^^^^^^^^
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/vllm/model_executor/model_loader/weight_utils.py", line 443, in safetensors_weights_iterator
[rank0]:     param = f.get_tensor(name)
[rank0]:             ^^^^^^^^^^^^^^^^^^
[rank0]:   File "/var/lib/boinc-client/slots/2/lib/python3.12/site-packages/torch/utils/_device.py", line 104, in __torch_function__
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.45 GiB. GPU 0 has a total capacity of 11.77 GiB of which 729.62 MiB is free. Including non-PyTorch memory, this process has 11.04 GiB memory in use. Of the allocated memory 9.24 GiB is allocated by PyTorch, and 1.44 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]:[W929 20:03:59.876860295 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
2025-09-29 20:04:01 (35159): bin/bash exited; CPU time 36.639700
2025-09-29 20:04:01 (35159): app exit status: 0x1
2025-09-29 20:04:01 (35159): called boinc_finish(195)

</stderr_txt>
]]>