GPUGRID NVIDIA related crashing (looks like random appearances)

Jari Kosonen

Joined: 5 May 22
Posts: 24
Credit: 12,458,305
RAC: 0
Message 58885 - Posted: 6 Jun 2022, 10:55:14 UTC
Last modified: 6 Jun 2022, 10:59:30 UTC

The BOINC application GPUGRID seems to crash the system.
I think it is NVIDIA related, since no other application causes it.
In some cases the system was overheating, and in other cases it crashed without
overheating at all.


System info:
System:    Kernel: 5.10.0-14-amd64 x86_64 bits: 64 compiler: gcc v: 10.2.1 
           parameters: BOOT_IMAGE=/vmlinuz-5.10.0-14-amd64 root=UUID=<filter> ro quiet 
           splash 
           Desktop: Xfce 4.16.0 tk: Gtk 3.24.24 info: xfce4-panel wm: xfwm 4.16.1 vt: 7 dm: LightDM 1.26.0 
           Distro: MX-21.1_ahs_x64 Wildflower April 9  2022 base: Debian GNU/Linux 11 (bullseye) 
Machine:   Type: Laptop System: HP product: HP ENVY Laptop 17-ce0xxx v: Type1ProductConfigId serial: <filter> 
           Chassis: type: 10 serial: <filter> 
           Mobo: HP model: 85E5 v: 30.32 serial: <filter> UEFI: Insyde v: F.13 date: 08/09/2021 
Battery:   ID-1: BAT0 charge: 0% condition: 53.2/53.2 Wh (100.0%) volts: 10.5 min: 11.6 model: 333-54-2C-A LK03055XL 
           type: Li-ion serial: <filter> status: Unknown 
           Device-1: hidpp_battery_0 model: Logitech Wireless Mouse serial: <filter> charge: 55% (should be ignored) 
           rechargeable: yes status: Discharging 
           Device-2: hidpp_battery_1 model: Logitech Wireless Keyboard serial: <filter> 
           charge: 55% (should be ignored) rechargeable: yes status: Discharging 
CPU:       Info: Quad Core model: Intel Core i7-8565U bits: 64 type: MT MCP arch: Kaby Lake note: check family: 6 
           model-id: 8E (142) stepping: C (12) microcode: EC cache: L2: 8 MiB 
           flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx bogomips: 31999 
           Speed: 2200 MHz min/max: 400/4600 MHz Core speeds (MHz): 1: 2200 2: 2200 3: 2200 4: 2200 5: 2200 6: 2200 
           7: 2200 8: 2200 
           Vulnerabilities: Type: itlb_multihit status: KVM: VMX disabled 
           Type: l1tf status: Not affected 
           Type: mds status: Not affected 
           Type: meltdown status: Not affected 
           Type: spec_store_bypass mitigation: Speculative Store Bypass disabled via prctl and seccomp 
           Type: spectre_v1 mitigation: usercopy/swapgs barriers and __user pointer sanitization 
           Type: spectre_v2 mitigation: Enhanced IBRS, IBPB: conditional, RSB filling 
           Type: srbds mitigation: TSX disabled 
           Type: tsx_async_abort status: Not affected 
Graphics:  Device-1: Intel WhiskeyLake-U GT2 [UHD Graphics 620] vendor: Hewlett-Packard driver: i915 v: kernel 
           bus-ID: 00:02.0 chip-ID: 8086:3ea0 class-ID: 0300 
           Device-2: NVIDIA GP108M [GeForce MX250] vendor: Hewlett-Packard driver: nvidia v: 515.48.07 
           alternate: nouveau,nvidia_drm bus-ID: 02:00.0 chip-ID: 10de:1d13 class-ID: 0302 
           Display: x11 server: X.Org 1.20.13 compositor: xfwm4 v: 4.16.1 driver: loaded: modesetting,nvidia 
           unloaded: fbdev,nouveau,vesa alternate: nv display-ID: :0.0 screens: 1 
           Screen-1: 0 s-res: 3840x1080 s-dpi: 96 s-size: 1016x286mm (40.0x11.3") s-diag: 1055mm (41.6") 
           Monitor-1: eDP-1 res: 1920x1080 hz: 60 dpi: 128 size: 382x215mm (15.0x8.5") diag: 438mm (17.3") 
           Monitor-2: DP-1 res: 1920x1080 hz: 60 dpi: 85 size: 575x323mm (22.6x12.7") diag: 660mm (26") 
           OpenGL: renderer: Mesa Intel UHD Graphics 620 (WHL GT2) v: 4.6 Mesa 21.2.5 direct render: Yes 
Audio:     Device-1: Intel Cannon Point-LP High Definition Audio vendor: Hewlett-Packard driver: sof-audio-pci 
           alternate: snd_hda_intel,snd_soc_skl,snd_sof_pci bus-ID: 00:1f.3 chip-ID: 8086:9dc8 class-ID: 0401 
           Sound Server-1: ALSA v: k5.10.0-14-amd64 running: yes 
           Sound Server-2: PulseAudio v: 14.2 running: yes 
Network:   Device-1: Intel Cannon Point-LP CNVi [Wireless-AC] driver: iwlwifi v: kernel modules: wl port: 5000 
           bus-ID: 00:14.3 chip-ID: 8086:9df0 class-ID: 0280 
           IF: wlan0 state: down mac: <filter> 
           Device-2: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet vendor: Hewlett-Packard driver: r8169 
           v: kernel port: 3000 bus-ID: 03:00.0 chip-ID: 10ec:8168 class-ID: 0200 
           IF: eth0 state: up speed: 1000 Mbps duplex: full mac: <filter> 
Bluetooth: Device-1: Intel Bluetooth 9460/9560 Jefferson Peak (JfP) type: USB driver: btusb v: 0.8 bus-ID: 1-10:4 
           chip-ID: 8087:0aaa class-ID: e001 
           Report: hciconfig ID: hci0 rfk-id: 1 state: up address: <filter> bt-v: 3.0 lmp-v: 5.1 sub-v: 100 
           hci-v: 5.1 rev: 100 
           Info: acl-mtu: 1021:4 sco-mtu: 96:6 link-policy: rswitch sniff link-mode: slave accept 
           service-classes: rendering, capturing, object transfer, audio 
RAID:      Hardware-1: Intel 82801 Mobile SATA Controller [RAID mode] driver: ahci v: 3.0 port: 5060 bus-ID: 00:17.0 
           chip-ID: 8086.282a rev: 30 class-ID: 0104 
Drives:    Local Storage: total: 1.59 TiB used: 357.94 GiB (21.9%) 
           SMART Message: Unable to run smartctl. Root privileges required. 
           ID-1: /dev/nvme0n1 maj-min: 259:0 vendor: Western Digital model: PC SN520 SDAPNUW-256G-1006 
           size: 238.47 GiB block-size: physical: 512 B logical: 512 B speed: 15.8 Gb/s lanes: 2 type: SSD 
           serial: <filter> rev: 20110006 temp: 36.9 C scheme: GPT 
           ID-2: /dev/sda maj-min: 8:0 vendor: Seagate model: ST1000LM049-2GH172 size: 931.51 GiB block-size: 
           physical: 4096 B logical: 512 B speed: 6.0 Gb/s type: HDD rpm: 7200 serial: <filter> rev: RXM3 
           scheme: GPT 
           ID-3: /dev/sdb maj-min: 8:16 type: USB vendor: HP model: x796w size: 462.32 GiB block-size: 
           physical: 512 B logical: 512 B type: N/A serial: <filter> rev: PMAP scheme: MBR 
           SMART Message: Unknown USB bridge. Flash drive/Unsupported enclosure? 
Partition: ID-1: / raw-size: 47.6 GiB size: 46.55 GiB (97.80%) used: 11.56 GiB (24.8%) fs: ext4 dev: /dev/nvme0n1p8 
           maj-min: 259:8 
           ID-2: /boot raw-size: 1024 MiB size: 989.4 MiB (96.62%) used: 267.7 MiB (27.1%) fs: ext4 
           dev: /dev/nvme0n1p7 maj-min: 259:7 
           ID-3: /boot/efi raw-size: 260 MiB size: 256 MiB (98.46%) used: 97.7 MiB (38.2%) fs: vfat 
           dev: /dev/nvme0n1p1 maj-min: 259:1 
           ID-4: /home raw-size: 488.28 GiB size: 479.54 GiB (98.21%) used: 346.03 GiB (72.2%) fs: ext4 
           dev: /dev/sda2 maj-min: 8:2 
Swap:      Alert: No swap data was found. 
Sensors:   System Temperatures: cpu: 70.0 C mobo: 57.0 C 
           Fan Speeds (RPM): N/A 
Repos:     Packages: note: see --pkg apt: 2309 lib: 1231 flatpak: 0 
           No active apt repos in: /etc/apt/sources.list 
           Active apt repos in: /etc/apt/sources.list.d/debian-stable-updates.list 
           1: deb http://deb.debian.org/debian bullseye-updates main contrib non-free
           Active apt repos in: /etc/apt/sources.list.d/debian.list 
           1: deb http://deb.debian.org/debian bullseye main contrib non-free
           2: deb http://security.debian.org/debian-security bullseye-security main contrib non-free
           Active apt repos in: /etc/apt/sources.list.d/mx.list 
           1: deb http://mirror.rise.ph/mxlinux-pkg/mx/repo/ bullseye main non-free
           2: deb http://mirror.rise.ph/mxlinux-pkg/mx/repo/ bullseye ahs
           Active apt repos in: /etc/apt/sources.list.d/teams.list 
           1: deb [arch=amd64] https://packages.microsoft.com/repos/ms-teams stable main
Info:      Processes: 330 Uptime: 23m wakeups: 10 Memory: 15.37 GiB used: 3.4 GiB (22.1%) Init: SysVinit v: 2.96 
           runlevel: 5 default: 5 tool: systemctl Compilers: gcc: 10.2.1 alt: 10 Shell: bash 
           default: Bash v: 5.1.4 running-in: quick-system-info-mx inxi: 3.3.06 
Boot Mode: UEFI



NVIDIA-driver specific info:
Mon Jun  6 18:54:20 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:02:00.0 Off |                  N/A |
| N/A   60C    P8    N/A /  N/A |      4MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2663      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+


BOINC log file output:
Mon Jun  6 18:33:49 2022 |  | cc_config.xml not found - using defaults
Mon Jun  6 18:33:50 2022 |  | Starting BOINC client version 7.16.16 for x86_64-pc-linux-gnu
Mon Jun  6 18:33:50 2022 |  | log flags: file_xfer, sched_ops, task
Mon Jun  6 18:33:50 2022 |  | Libraries: libcurl/7.74.0 OpenSSL/1.1.1n zlib/1.2.11 brotli/1.0.9 libidn2/2.3.0 libpsl/0.21.0 (+libidn2/2.3.0) libssh2/1.9.0 nghttp2/1.43.0 librtmp/2.3
Mon Jun  6 18:33:50 2022 |  | Data directory: /home/jari/BOINC
Mon Jun  6 18:33:50 2022 |  | CUDA: NVIDIA GPU 0: NVIDIA GeForce MX250 (driver version 515.48, CUDA version 11.7, compute capability 6.1, 4042MB, 3982MB available, 1215 GFLOPS peak)
Mon Jun  6 18:33:50 2022 |  | OpenCL: NVIDIA GPU 0: NVIDIA GeForce MX250 (driver version 515.48.07, device version OpenCL 3.0 CUDA, 4042MB, 3982MB available, 1215 GFLOPS peak)
Mon Jun  6 18:33:50 2022 |  | libc: Debian GLIBC 2.31-13+deb11u3 version 2.31
Mon Jun  6 18:33:50 2022 |  | Host name: mx
Mon Jun  6 18:33:50 2022 |  | Processor: 8 GenuineIntel Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz [Family 6 Model 142 Stepping 12]
Mon Jun  6 18:33:50 2022 |  | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities
Mon Jun  6 18:33:50 2022 |  | OS: Linux Debian: Debian GNU/Linux 11 (bullseye) [5.10.0-14-amd64|libc 2.31 (Debian GLIBC 2.31-13+deb11u3)]
Mon Jun  6 18:33:50 2022 |  | Memory: 15.37 GB physical, 0 bytes virtual
Mon Jun  6 18:33:50 2022 |  | Disk: 479.54 GB total, 109.10 GB free
Mon Jun  6 18:33:50 2022 |  | Local time is UTC +8 hours
Mon Jun  6 18:33:50 2022 | GPUGRID | General prefs: from GPUGRID (last modified 01-Jun-2022 19:37:34)
Mon Jun  6 18:33:50 2022 | GPUGRID | Computer location: home
Mon Jun  6 18:33:50 2022 | GPUGRID | General prefs: no separate prefs for home; using your defaults
Mon Jun  6 18:33:50 2022 |  | Reading preferences override file
Mon Jun  6 18:33:50 2022 |  | Preferences:
Mon Jun  6 18:33:50 2022 |  | max memory usage when active: 7869.12 MB
Mon Jun  6 18:33:50 2022 |  | max memory usage when idle: 7869.12 MB
Mon Jun  6 18:33:55 2022 |  | max disk usage: 50.00 GB
Mon Jun  6 18:33:55 2022 |  | max CPUs used: 6
Mon Jun  6 18:33:55 2022 |  | (to change preferences, visit a project web site or select Preferences in the Manager)
Mon Jun  6 18:33:55 2022 |  | Setting up project and slot directories
Mon Jun  6 18:33:55 2022 |  | Checking active tasks


I could not find anything saved in the logs related to the system crash,
so I cannot give any specific reason why it occurs.
Maybe it is a driver-related issue.
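
Since nothing was saved at the time of the crash, one option is to record GPU readings continuously so the last line before a crash survives on disk. The following is only a minimal sketch, assuming `nvidia-smi` is on the PATH and that the GPU reports numeric values for the queried fields (some laptop GPUs return N/A for certain fields, which this sketch does not handle). The function names and the log file name `gpu_watch.log` are made up for illustration.

```python
import csv
import io
import os
import subprocess
import time

# Fields requested from nvidia-smi via --query-gpu.
QUERY = "timestamp,temperature.gpu,utilization.gpu"


def parse_sample(line):
    """Parse one 'csv,noheader,nounits' line into (timestamp, temp_c, util_pct)."""
    row = next(csv.reader(io.StringIO(line), skipinitialspace=True))
    timestamp, temp, util = (field.strip() for field in row)
    return timestamp, int(temp), int(util)


def read_sample():
    """Ask nvidia-smi for a single reading of GPU 0."""
    out = subprocess.run(
        ["nvidia-smi", "--id=0", f"--query-gpu={QUERY}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_sample(out.strip().splitlines()[0])


def log_forever(path="gpu_watch.log", interval_s=5):
    """Append one reading every interval_s seconds, forcing each line to disk."""
    with open(path, "a") as log:
        while True:
            timestamp, temp_c, util_pct = read_sample()
            log.write(f"{timestamp} temp={temp_c}C util={util_pct}%\n")
            log.flush()
            os.fsync(log.fileno())  # so the last line survives a hard crash
            time.sleep(interval_s)
```

Calling `log_forever()` in a terminal before starting a GPUGRID task, then reading the tail of the log after a reboot, would show whether the temperature was climbing just before the crash or not.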
Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 891
Message 58887 - Posted: 6 Jun 2022, 20:27:00 UTC - in response to Message 58885.  

I believe the 515 series drivers are the short-term branch.

The long-term stable series is the 510 drivers.

I'd drop back from the cutting edge for a run and try the 510 series.

I'm not having any issues with the 510 series on any of my hosts.
Jari Kosonen

Joined: 5 May 22
Posts: 24
Credit: 12,458,305
RAC: 0
Message 58900 - Posted: 11 Jun 2022, 13:04:44 UTC - in response to Message 58887.  

It looks like the 19.5V/65W charger of the HP ENVY Laptop 17-ce0xxx is either broken
or too small for the heavy load caused by GPUGRID.
So I will try to buy a larger 19.5V/90W charger to support the high
load caused by GPUGRID and the NVIDIA GPU.
The battery was always nearly empty, because the 65W charger could not supply
enough current for this application to run.
Thus even the smallest load peak may have caused a CPU/GPU undervoltage that crashed the system.
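
The reasoning can be put into rough numbers. This is only a sketch: the 53.2 Wh battery capacity comes from the inxi output in the first post, but the 80 W total system load is a hypothetical figure chosen for illustration, not a measurement, and conversion losses are ignored. The function names are made up for the example.

```python
def power_deficit_w(charger_w, load_w):
    """Watts the battery must supply when the load exceeds the charger rating."""
    return max(0.0, load_w - charger_w)


def hours_until_empty(charger_w, load_w, battery_wh):
    """Hours a full battery lasts while covering the deficit, or None if no deficit."""
    deficit = power_deficit_w(charger_w, load_w)
    if deficit == 0:
        return None
    return battery_wh / deficit


BATTERY_WH = 53.2      # from the inxi output in the first post
ASSUMED_LOAD_W = 80.0  # hypothetical figure, not measured

for charger_w in (65.0, 90.0):
    hours = hours_until_empty(charger_w, ASSUMED_LOAD_W, BATTERY_WH)
    if hours is None:
        print(f"{charger_w:.0f} W charger: covers the assumed load")
    else:
        print(f"{charger_w:.0f} W charger: battery empty after about {hours:.1f} h")
```

Under that assumed load, the 65 W charger leaves a 15 W shortfall that the battery has to cover until it runs empty, while a 90 W charger would have headroom. The real load would need to be measured to confirm this.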
Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 891
Message 58901 - Posted: 11 Jun 2022, 16:52:50 UTC

Glad to hear you figured out the problem was not enough power for the system.

©2025 Universitat Pompeu Fabra