Description
System information
Type | Version/Name |
---|---|
Distribution Name | Ubuntu |
Distribution Version | 24.04 LTS |
Kernel Version | 6.8.0-64-generic |
Architecture | x86_64 |
OpenZFS Version | 2.3.3-1 |
Describe the problem you're observing
During a `zpool scrub` on a raidz1 pool with 8 SATA HDDs, the system experiences a kernel crash. The crash consistently occurs early in the scrub process and leads to the `txg_sync` thread becoming blocked indefinitely. The system becomes partially unresponsive and requires a hard reboot to recover.
Describe how to reproduce the problem
- Boot into a system with ZFS 2.3.3 and Linux 6.8.0.
- Have a pool configured with raidz1 using 8 physical drives (WWN-based paths).
- Start a `zpool scrub` on the pool: `zpool scrub storage`
- Monitor with: `watch zpool status -v`
- After a few GB have been scanned, observe the system lock up and crash in the kernel (the same commands are consolidated in the sketch below).
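For convenience, a minimal shell sequence covering the same steps (pool name `storage` as in this report; the 5-second watch interval is just an illustrative choice):

```sh
# Start the scrub on the affected pool (returns immediately; the scrub runs in the background).
zpool scrub storage

# Watch progress; the crash typically appears within the first few GB scanned.
watch -n 5 zpool status -v storage

# In a second terminal, follow kernel messages to capture the backtrace live.
journalctl -k -f
```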
Include any warning/errors/backtraces from the system logs
🔧 Live kernel messages captured:
jul 21 13:07:03 gresint-server kernel: BUG: unable to handle page fault for address: 00007970a8dcc605
jul 21 13:07:03 gresint-server kernel: RIP: 0010:zio_vdev_io_done+0x6e/0x240 [zfs]
jul 21 13:07:03 gresint-server kernel: ? zio_vdev_io_done+0x6e/0x240 [zfs]
jul 21 13:07:03 gresint-server kernel: ? zio_vdev_io_done+0x4e/0x240 [zfs]
jul 21 13:07:03 gresint-server kernel: zio_execute+0x94/0x170 [zfs]
jul 21 13:07:03 gresint-server kernel: ? __pfx_zio_execute+0x10/0x10 [zfs]
jul 21 13:07:03 gresint-server kernel: RIP: 0010:zio_vdev_io_done+0x6e/0x240 [zfs]
Additional Information
zpool status -v
pool: storage
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
scan: scrub in progress since Sun Jul 20 23:49:14 2025
7.19T / 125T scanned at 1.13G/s, 2.96T / 125T issued at 474M/s
884K repaired, 2.37% done, 3 days 02:55:32 to go
config:
    NAME                        STATE     READ WRITE CKSUM
    storage                     ONLINE       0     0     0
      raidz1-0                  ONLINE       0     0     0
        wwn-0x5000c500e8a8c504  ONLINE       0     0     9  (repairing)
        wwn-0x5000c500e8a8a1f6  ONLINE       0     0     4  (repairing)
        wwn-0x5000c500f6e5f0ab  ONLINE       0     0     7  (repairing)
        wwn-0x5000c500e8496d51  ONLINE       0     0     2  (repairing)
        wwn-0x5000c500f6ee1532  ONLINE       0     0     3  (repairing)
        wwn-0x5000c500e8b3a6f9  ONLINE       0     0     7  (repairing)
        wwn-0x5000c500e88ed746  ONLINE       0     0     8  (repairing)
        wwn-0x5000c500e8a8a0aa  ONLINE       0     0     6  (repairing)
    logs
      ubuntu-vg/slog-lv         ONLINE       0     0     0
    cache
      ubuntu--vg-l2arc--lv      ONLINE       0     0     0
errors: No known data errors
zpool get all storage
NAME PROPERTY VALUE SOURCE
storage size 146T -
storage capacity 85% -
storage altroot - default
storage health ONLINE -
storage guid 13753470766290521828 -
storage version - default
storage bootfs - default
storage delegation on default
storage autoreplace off default
storage cachefile - default
storage failmode wait default
storage listsnapshots off default
storage autoexpand on local
storage dedupratio 1.00x -
storage free 20.7T -
storage allocated 125T -
storage readonly off -
storage ashift 12 local
storage comment - default
storage expandsize - -
storage freeing 0 -
storage fragmentation 36% -
storage leaked 0 -
storage multihost off default
storage checkpoint - -
storage load_guid 4324850870775088562 -
storage autotrim off default
storage compatibility off default
storage bcloneused 0 -
storage bclonesaved 0 -
storage bcloneratio 1.00x -
storage dedup_table_size 0 -
storage dedup_table_quota auto default
storage last_scrubbed_txg 0 -
storage feature@async_destroy enabled local
storage feature@empty_bpobj enabled local
storage feature@lz4_compress active local
storage feature@multi_vdev_crash_dump enabled local
storage feature@spacemap_histogram active local
storage feature@enabled_txg active local
storage feature@hole_birth active local
storage feature@extensible_dataset active local
storage feature@embedded_data active local
storage feature@bookmarks enabled local
storage feature@filesystem_limits enabled local
storage feature@large_blocks enabled local
storage feature@large_dnode enabled local
storage feature@sha512 enabled local
storage feature@skein enabled local
storage feature@edonr enabled local
storage feature@userobj_accounting active local
storage feature@encryption enabled local
storage feature@project_quota active local
storage feature@device_removal enabled local
storage feature@obsolete_counts enabled local
storage feature@zpool_checkpoint enabled local
storage feature@spacemap_v2 active local
storage feature@allocation_classes enabled local
storage feature@resilver_defer enabled local
storage feature@bookmark_v2 enabled local
storage feature@redaction_bookmarks enabled local
storage feature@redacted_datasets enabled local
storage feature@bookmark_written enabled local
storage feature@log_spacemap active local
storage feature@livelist enabled local
storage feature@device_rebuild enabled local
storage feature@zstd_compress enabled local
storage feature@draid enabled local
storage feature@zilsaxattr enabled local
storage feature@head_errlog active local
storage feature@blake3 enabled local
storage feature@block_cloning enabled local
storage feature@vdev_zaps_v2 active local
storage feature@redaction_list_spill enabled local
storage feature@raidz_expansion enabled local
storage feature@fast_dedup enabled local
storage feature@longname enabled local
storage feature@large_microzap enabled local
SMART data (excerpt)
Getting all disks in the ZFS pool...
The following disks were found:
- wwn-0x5000c500e8a8c504
- wwn-0x5000c500e8a8a1f6
- wwn-0x5000c500f6e5f0ab
- wwn-0x5000c500e8496d51
- wwn-0x5000c500f6ee1532
- wwn-0x5000c500e8b3a6f9
- wwn-0x5000c500e88ed746
- wwn-0x5000c500e8a8a0aa
---------------------------------------------------------
SMART info for wwn-0x5000c500e8a8c504 (/dev/sda):
---------------------------------------------------------
Device Model: ST20000NM007D-3DJ103
Serial Number:
SMART overall-health self-assessment test result: PASSED
Reallocated_Sector_Ct: 0
Current_Pending_Sector: 0
Offline_Uncorrectable: 0
Temperature: 36°C
---------------------------------------------------------
SMART info for wwn-0x5000c500e8a8a1f6 (/dev/sdb):
---------------------------------------------------------
Device Model: ST20000NM007D-3DJ103
Serial Number:
SMART overall-health self-assessment test result: PASSED
Reallocated_Sector_Ct: 0
Current_Pending_Sector: 0
Offline_Uncorrectable: 0
Temperature: 36°C
---------------------------------------------------------
SMART info for wwn-0x5000c500f6e5f0ab (/dev/sdc):
---------------------------------------------------------
Device Model: ST20000NM007D-3DJ103
Serial Number:
SMART overall-health self-assessment test result: PASSED
Reallocated_Sector_Ct: 2
Current_Pending_Sector: 0
Offline_Uncorrectable: 0
Temperature: 36°C
⚠️ Reallocated sectors detected
-----------------------------------
... (truncated)
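The excerpt above comes from a small helper script. A minimal sketch of the approach, assuming the script resolves the pool's WWN links under `/dev/disk/by-id` and filters a few SMART attributes per drive (the attribute filter and loop are assumptions, not the exact script used):

```sh
#!/bin/sh
# Resolve every wwn-* device in the pool to its /dev/sdX node and print a short SMART summary.
for wwn in $(zpool status -P storage | grep -o '/dev/disk/by-id/wwn-[^ ]*'); do
    dev=$(readlink -f "$wwn")
    echo "SMART info for $(basename "$wwn") ($dev):"
    smartctl -a "$dev" | grep -E 'Device Model|overall-health|Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|Temperature_Celsius'
done
```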
Troubleshooting steps attempted:
- Tested with `zfs_deadman_failmode` set to `panic`, `wait`, and `continue` (see the sketch after this list)
- Adjusted `zfs_vdev_scrub_max_active` and `zfs_vdev_scrub_min_active`
- Monitored `journalctl -k` live during the scrub
- Verified all disks are SMART clean
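These tunables are ZFS module parameters, so a plausible way to apply them at runtime is via `/sys/module/zfs/parameters` (the specific values below are illustrative, not the exact ones used in the failing runs):

```sh
# Switch the deadman behaviour between panic / wait / continue.
echo panic > /sys/module/zfs/parameters/zfs_deadman_failmode

# Throttle scrub I/O by lowering the per-vdev scrub queue depths.
echo 1 > /sys/module/zfs/parameters/zfs_vdev_scrub_min_active
echo 2 > /sys/module/zfs/parameters/zfs_vdev_scrub_max_active

# Follow kernel messages live while the scrub runs.
journalctl -k -f
```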
Final Notes
This appears to be a kernel-space memory access bug triggered by ZIO completion under scrub load.
I'm available to test debug builds or apply custom patches if required.