
ZFS 2.3.3 crash: kernel panic in scrub path (zio_*) #17559

@xaimepardal

Description

System information

Type                  Version/Name
Distribution Name     Ubuntu
Distribution Version  24.04 LTS
Kernel Version        6.8.0-64-generic
Architecture          x86_64
OpenZFS Version       2.3.3-1

Describe the problem you're observing

During a zpool scrub on a raidz1 pool of 8 SATA HDDs, the kernel crashes with a page fault in the ZFS I/O (zio) pipeline.
The crash consistently occurs early in the scrub.
Afterwards the txg_sync thread blocks indefinitely, the system becomes partially unresponsive, and a hard reboot is required to recover.
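When the hang occurs, the stacks of the blocked tasks (including txg_sync) can be dumped before the hard reboot; a minimal sketch using the magic sysrq key (assumes sysrq is available; this was not part of the captured logs above):

    # Allow sysrq, then dump all tasks in uninterruptible sleep (e.g. txg_sync)
    echo 1 | sudo tee /proc/sys/kernel/sysrq
    echo w | sudo tee /proc/sysrq-trigger
    # Read the resulting stack traces from the kernel ring buffer
    sudo journalctl -k -n 200 --no-pager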


Describe how to reproduce the problem

  1. Boot into a system with ZFS 2.3.3 and Linux 6.8.0.
  2. Have a pool configured with raidz1 using 8 physical drives (WWN-based paths).
  3. Start a zpool scrub on the pool:
    zpool scrub storage
  4. Monitor with:
    watch zpool status -v
  5. After a few GB have been scanned, observe the system lock up and crash in the kernel (a sketch for preserving the kernel log across the forced reboot follows below).
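Because the machine needs a hard reboot, the full oops can be lost unless the journal is persistent. A minimal sketch for preserving it (assumes systemd-journald with its default Storage=auto setting; this was not part of the original reproduction steps):

    # Make the journal persistent so the crash messages survive the forced reboot
    sudo mkdir -p /var/log/journal
    sudo systemctl restart systemd-journald
    # After rebooting, pull the kernel messages from the previous boot
    journalctl -k -b -1 --no-pager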

Include any warning/errors/backtraces from the system logs

🔧 Live kernel messages captured:

jul 21 13:07:03 gresint-server kernel: BUG: unable to handle page fault for address: 00007970a8dcc605
jul 21 13:07:03 gresint-server kernel: RIP: 0010:zio_vdev_io_done+0x6e/0x240 [zfs]
jul 21 13:07:03 gresint-server kernel:  ? zio_vdev_io_done+0x6e/0x240 [zfs]
jul 21 13:07:03 gresint-server kernel:  ? zio_vdev_io_done+0x4e/0x240 [zfs]
jul 21 13:07:03 gresint-server kernel:  zio_execute+0x94/0x170 [zfs]
jul 21 13:07:03 gresint-server kernel:  ? __pfx_zio_execute+0x10/0x10 [zfs]
jul 21 13:07:03 gresint-server kernel: RIP: 0010:zio_vdev_io_done+0x6e/0x240 [zfs]

Additional Information

zpool status -v

  pool: storage
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub in progress since Sun Jul 20 23:49:14 2025
	7.19T / 125T scanned at 1.13G/s, 2.96T / 125T issued at 474M/s
	884K repaired, 2.37% done, 3 days 02:55:32 to go
config:

	NAME                        STATE     READ WRITE CKSUM
	storage                     ONLINE       0     0     0
	  raidz1-0                  ONLINE       0     0     0
	    wwn-0x5000c500e8a8c504  ONLINE       0     0     9  (repairing)
	    wwn-0x5000c500e8a8a1f6  ONLINE       0     0     4  (repairing)
	    wwn-0x5000c500f6e5f0ab  ONLINE       0     0     7  (repairing)
	    wwn-0x5000c500e8496d51  ONLINE       0     0     2  (repairing)
	    wwn-0x5000c500f6ee1532  ONLINE       0     0     3  (repairing)
	    wwn-0x5000c500e8b3a6f9  ONLINE       0     0     7  (repairing)
	    wwn-0x5000c500e88ed746  ONLINE       0     0     8  (repairing)
	    wwn-0x5000c500e8a8a0aa  ONLINE       0     0     6  (repairing)
	logs	
	  ubuntu-vg/slog-lv         ONLINE       0     0     0
	cache
	  ubuntu--vg-l2arc--lv      ONLINE       0     0     0

errors: No known data errors

zpool get all storage

NAME     PROPERTY                       VALUE                          SOURCE
storage  size                           146T                           -
storage  capacity                       85%                            -
storage  altroot                        -                              default
storage  health                         ONLINE                         -
storage  guid                           13753470766290521828           -
storage  version                        -                              default
storage  bootfs                         -                              default
storage  delegation                     on                             default
storage  autoreplace                    off                            default
storage  cachefile                      -                              default
storage  failmode                       wait                           default
storage  listsnapshots                  off                            default
storage  autoexpand                     on                             local
storage  dedupratio                     1.00x                          -
storage  free                           20.7T                          -
storage  allocated                      125T                           -
storage  readonly                       off                            -
storage  ashift                         12                             local
storage  comment                        -                              default
storage  expandsize                     -                              -
storage  freeing                        0                              -
storage  fragmentation                  36%                            -
storage  leaked                         0                              -
storage  multihost                      off                            default
storage  checkpoint                     -                              -
storage  load_guid                      4324850870775088562            -
storage  autotrim                       off                            default
storage  compatibility                  off                            default
storage  bcloneused                     0                              -
storage  bclonesaved                    0                              -
storage  bcloneratio                    1.00x                          -
storage  dedup_table_size               0                              -
storage  dedup_table_quota              auto                           default
storage  last_scrubbed_txg              0                              -
storage  feature@async_destroy          enabled                        local
storage  feature@empty_bpobj            enabled                        local
storage  feature@lz4_compress           active                         local
storage  feature@multi_vdev_crash_dump  enabled                        local
storage  feature@spacemap_histogram     active                         local
storage  feature@enabled_txg            active                         local
storage  feature@hole_birth             active                         local
storage  feature@extensible_dataset     active                         local
storage  feature@embedded_data          active                         local
storage  feature@bookmarks              enabled                        local
storage  feature@filesystem_limits      enabled                        local
storage  feature@large_blocks           enabled                        local
storage  feature@large_dnode            enabled                        local
storage  feature@sha512                 enabled                        local
storage  feature@skein                  enabled                        local
storage  feature@edonr                  enabled                        local
storage  feature@userobj_accounting     active                         local
storage  feature@encryption             enabled                        local
storage  feature@project_quota          active                         local
storage  feature@device_removal         enabled                        local
storage  feature@obsolete_counts        enabled                        local
storage  feature@zpool_checkpoint       enabled                        local
storage  feature@spacemap_v2            active                         local
storage  feature@allocation_classes     enabled                        local
storage  feature@resilver_defer         enabled                        local
storage  feature@bookmark_v2            enabled                        local
storage  feature@redaction_bookmarks    enabled                        local
storage  feature@redacted_datasets      enabled                        local
storage  feature@bookmark_written       enabled                        local
storage  feature@log_spacemap           active                         local
storage  feature@livelist               enabled                        local
storage  feature@device_rebuild         enabled                        local
storage  feature@zstd_compress          enabled                        local
storage  feature@draid                  enabled                        local
storage  feature@zilsaxattr             enabled                        local
storage  feature@head_errlog            active                         local
storage  feature@blake3                 enabled                        local
storage  feature@block_cloning          enabled                        local
storage  feature@vdev_zaps_v2           active                         local
storage  feature@redaction_list_spill   enabled                        local
storage  feature@raidz_expansion        enabled                        local
storage  feature@fast_dedup             enabled                        local
storage  feature@longname               enabled                        local
storage  feature@large_microzap         enabled                        local

SMART data (excerpt)

Retrieving all disks in the ZFS pool...
The following disks were found:
  - wwn-0x5000c500e8a8c504
  - wwn-0x5000c500e8a8a1f6
  - wwn-0x5000c500f6e5f0ab
  - wwn-0x5000c500e8496d51
  - wwn-0x5000c500f6ee1532
  - wwn-0x5000c500e8b3a6f9
  - wwn-0x5000c500e88ed746
  - wwn-0x5000c500e8a8a0aa

---------------------------------------------------------
SMART info for wwn-0x5000c500e8a8c504 (/dev/sda):
---------------------------------------------------------
Device Model:     ST20000NM007D-3DJ103
Serial Number:    
SMART overall-health self-assessment test result: PASSED
Reallocated_Sector_Ct: 0
Current_Pending_Sector: 0
Offline_Uncorrectable: 0
Temperature: 36°C

---------------------------------------------------------
SMART info for wwn-0x5000c500e8a8a1f6 (/dev/sdb):
---------------------------------------------------------
Device Model:     ST20000NM007D-3DJ103
Serial Number:    
SMART overall-health self-assessment test result: PASSED
Reallocated_Sector_Ct: 0
Current_Pending_Sector: 0
Offline_Uncorrectable: 0
Temperature: 36°C

---------------------------------------------------------
SMART info for wwn-0x5000c500f6e5f0ab (/dev/sdc):
---------------------------------------------------------
Device Model:     ST20000NM007D-3DJ103
Serial Number:    
SMART overall-health self-assessment test result: PASSED
Reallocated_Sector_Ct: 2
Current_Pending_Sector: 0
Offline_Uncorrectable: 0
Temperature: 36°C
⚠️  Reallocated sectors detected

-----------------------------------
... (truncated)
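For reference, the SMART excerpt above was produced by a small helper loop along these lines (hypothetical reconstruction, not the exact script used; assumes smartctl is installed and the wwn-* links exist under /dev/disk/by-id):

    # Sketch of the SMART collection loop (illustrative only)
    for id in $(zpool status storage | grep -o 'wwn-0x[0-9a-f]*'); do
        dev=$(readlink -f /dev/disk/by-id/"$id")
        echo "---------------------------------------------------------"
        echo "SMART info for $id ($dev):"
        echo "---------------------------------------------------------"
        sudo smartctl -i -H -A "$dev" | grep -E \
            'Device Model|Serial Number|overall-health|Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|Temperature_Celsius'
    done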

Troubleshooting steps attempted:

  • Tested with zfs_deadman_failmode set to panic, wait, and continue
  • Adjusted zfs_vdev_scrub_max_active and zfs_vdev_scrub_min_active (see the sketch after this list)
  • Monitored journalctl -k -f live during the scrub
  • Verified every disk passes its SMART overall-health self-assessment (one drive reports 2 reallocated sectors; see the excerpt above)
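For reference, a sketch of how those tunables were toggled between runs (these are the standard OpenZFS module parameters; the specific values shown are illustrative, not a recommendation):

    # Switch the deadman behaviour (wait | continue | panic)
    echo panic | sudo tee /sys/module/zfs/parameters/zfs_deadman_failmode
    # Reduce scrub I/O concurrency per vdev
    echo 1 | sudo tee /sys/module/zfs/parameters/zfs_vdev_scrub_min_active
    echo 2 | sudo tee /sys/module/zfs/parameters/zfs_vdev_scrub_max_active
    # Follow kernel messages live while the scrub runs
    sudo journalctl -k -f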

Final Notes

This appears to be a kernel-space memory-access bug triggered in the ZIO completion path (zio_vdev_io_done) under scrub load.
I'm available to test debug builds or apply custom patches if required.
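If it helps triage, the internal ZFS debug log could also be enabled before the next attempt; a minimal sketch using standard OpenZFS knobs (suggested here as an option, not something already tried above):

    # Enable the in-kernel ZFS debug message buffer
    echo 1 | sudo tee /sys/module/zfs/parameters/zfs_dbgmsg_enable
    # Reproduce the scrub crash, then inspect the buffer (before the hang completes, if possible)
    sudo cat /proc/spl/kstat/zfs/dbgmsg | tail -n 200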
