pcd1193182
Contributor

Sponsored by: [Wasabi, Inc.; Klara, Inc.]

Motivation and Context

As disk sector sizes increase, we are able to store fewer and fewer uberblocks on a disk. This makes it increasingly difficult to recover from issues by rolling back to earlier TXGs. Eventually, sector sizes may become large enough that not even a single uberblock can be stored without having to do a partial write. In addition, new ZFS features often need space to store metadata (see, for example, the buffer used by RAIDZ expansion). This space is highly limited with the current disk layout.

Description

This patch contains the logic for a new, larger label format intended to support disks with large sector sizes. A larger label lets us store more uberblocks and other critical pool metadata, and the extra space can be used to enable new ZFS features in the future. This initial commit does not add new capabilities; it only provides the framework for them.

It also contains zdb and zhack support for the new label type, as well as tests that verify basic functionality of the new label. Currently, the size of the disk is the criterion for whether or not to enable the new label type, but that is open to change.

How Has This Been Tested?

In addition to the tests added in this PR, I also ran the ZFS test suite with the tunable set below the size of the disks in use, so that the new label format was used. Some tests failed, but only for space estimation reasons, which could be corrected with fixes to the tests. I also did some ztest runs with the new label format.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

pcd1193182 force-pushed the new_label branch 4 times, most recently from e039970 to c20fcf4 (July 31, 2025 19:27)
behlendorf added the "Status: Design Review Needed" label (Architecture or design is under discussion) on Jul 31, 2025
#define VDEV_RESERVE_OFFSET (VDEV_LARGE_LABEL_SIZE * 2)
#define VDEV_RESERVE_SIZE (1 << 29) // 512MiB

#define VDEV_TOC_SIZE (1 << 13)

Member

This looks like 8KB to me, not 32KB as said above. BTW, does 32KB have any meaning, or was it chosen arbitrarily? At the moment we have an ASHIFT_MAX of 16, which means 64KB, but I guess it might be increased with the new label format.

Contributor Author

32KiB is semi-arbitrary; I wanted it to be large enough that we would never need to worry about the ToC running out of space. It doesn't need to be as large as the sector size implied by ASHIFT_MAX, since the logic should work for disks with an ashift larger than 15: the writes should get padded up to the sector size, and then we'll read the actual size of the write from the ToC itself.
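To illustrate the padding described above, here is a minimal sketch, not code from the patch. The helper name `pad_to_sector` is hypothetical, and the 32KiB value is the ToC size mentioned in this thread rather than a constant taken from the diff.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): round a write length up to
 * the next multiple of the sector size, where the sector size is
 * 1 << ashift.
 */
static uint64_t
pad_to_sector(uint64_t len, unsigned ashift)
{
	assert(ashift < 64);
	uint64_t sector = UINT64_C(1) << ashift;

	/* Works because the sector size is always a power of two. */
	return ((len + sector - 1) & ~(sector - 1));
}
```

Under these assumptions, a 32KiB ToC write is unchanged for any ashift up to 15 and grows to one full sector beyond that, e.g. `pad_to_sector(32768, 16)` gives 65536. The reader would then have to take the real ToC length from the ToC contents rather than from the size of the write.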

Comment on lines 578 to 580
* Each sub-section is protected with an embedded checksum. In the event that a
* sub-section is larger than 16MiB, it will be split in 16MiB - sizeof
* (zio_eck_t) chunks, which will each have their own checksum.

Member

Since we don't know what the sections will be, I am not sure it is appropriate to speak about checksums universally. Some RAIDZ expansion data may not have any checksums. BTW, is the TOC checksummed?

Contributor Author

My intent here was to convey that all the current sections use embedded checksums, but you're right that future ones need not and this is a little confusing as is. I'll update the language to make that clearer. The TOC is checksummed with an embedded checksum.
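For readers following the chunking arithmetic in the quoted comment, here is a rough sketch under stated assumptions, not the patch's implementation. It assumes each chunk occupies 16MiB on disk including its checksum trailer, and that `sizeof (zio_eck_t)` is 40 bytes (an 8-byte magic plus a 32-byte checksum). `ECK_SIZE` and all function names are hypothetical.

```c
#include <stdint.h>

/* Assumed on-disk size of one chunk, checksum trailer included. */
#define	CHUNK_SIZE	(UINT64_C(16) << 20)
/* Assumed sizeof (zio_eck_t): 8-byte magic + 32-byte checksum. */
#define	ECK_SIZE	UINT64_C(40)

/* Payload bytes that fit in one chunk next to its checksum trailer. */
static uint64_t
chunk_payload(void)
{
	return (CHUNK_SIZE - ECK_SIZE);
}

/* Number of chunks needed to hold payload_len bytes of section data. */
static uint64_t
chunk_count(uint64_t payload_len)
{
	uint64_t per_chunk = chunk_payload();

	return ((payload_len + per_chunk - 1) / per_chunk);
}

/* Total bytes on disk: the payload plus one trailer per chunk. */
static uint64_t
on_disk_size(uint64_t payload_len)
{
	return (payload_len + chunk_count(payload_len) * ECK_SIZE);
}
```

With these numbers, a payload of exactly 16MiB minus 40 bytes fills one chunk, and one more byte of payload spills into a second chunk with its own checksum.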

/* The size of the section that stores the pool config */
#define VDEV_TOC_POOL_CONFIG "pool_config"
/* The size of the section that stores auxiliary uberblocks */
#define VDEV_TOC_AUX_UBERBLOCK "aux_uberblock"

Member

With just a list of arbitrary values, we'll have to know the names of all regions and their order to be able to read anything. Will adding any new type require a read-incompatible pool feature? BTW, what is AUX_UBERBLOCK?

Contributor Author

pcd1193182, Aug 25, 2025

AUX_UBERBLOCK was intended to be a label section that stored auxiliary uberblocks (like checkpoints & MMP). But this ultimately doesn't seem that useful:

  1. MMP is actually more complicated if you store the aux uberblock outside of the ring, since the activity check works by just finding the newest uberblock using the normal uberblock load logic.
  2. Checkpoints don't need to store an uberblock anywhere in the label, because they can access the MOS (where the checkpointed uberblock is stored). The only case where that would be especially useful is if the MOS were inaccessible due to corruption or a read-incompatible feature; but even then, with the number of uberblocks we can store now, we don't really need to store a special one in the label that's a copy of the checkpoint uberblock. That can be a future feature, if people want it.

pcd1193182 force-pushed the new_label branch 2 times, most recently from df3e905 to 821000e (August 26, 2025 21:22)
Paul Dagnelie added 3 commits August 28, 2025 16:49
This patch contains the logic for a new larger label format. This format
is intended to support disks with large sector sizes. By using a larger
label we can store more uberblocks and other critical pool metadata. We
can also use the extra space to enable new features in ZFS going
forwards. This initial commit does not add new capabilities, but
provides the framework for them going forwards.

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Sponsored-by: Wasabi, Inc.
Sponsored-by: Klara, Inc.
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>