Implement new label format for large disks #17573
include/sys/vdev_impl.h
#define VDEV_RESERVE_OFFSET (VDEV_LARGE_LABEL_SIZE * 2)
#define VDEV_RESERVE_SIZE (1 << 29) // 512MiB

#define VDEV_TOC_SIZE (1 << 13)
This looks like 8KB to me, not 32KB as stated above. BTW, does 32KB have any meaning, or was it chosen arbitrarily? At least for now we have an ASHIFT_MAX of 16, which means 64KB, but I guess it might be increased with the new label format.
32KiB is semi-arbitrary; I wanted it to be large enough that we would never need to worry about the ToC running out of space. It doesn't need to be as large as the sector size implied by ASHIFT_MAX, since the logic should work for disks with an ashift larger than 15: the writes should get padded up to the sector size, and then we'll read the actual size of the write from the ToC itself.
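For readers following the size discussion, here is a minimal sketch of the arithmetic in this thread. The macro and helper names (`TOC_SIZE_IN_PATCH`, `TOC_SIZE_INTENDED`, `sector_size`, `padded_write_size`) are hypothetical and are not taken from the patch; the sketch only assumes that a write is rounded up to a whole number of `1 << ashift` sectors, as described above.

```c
#include <assert.h>
#include <stdint.h>

/* Values quoted in the review thread (names are illustrative only). */
#define	TOC_SIZE_IN_PATCH	(1ULL << 13)	/* 8 KiB, as VDEV_TOC_SIZE is written */
#define	TOC_SIZE_INTENDED	(1ULL << 15)	/* 32 KiB, as the discussion describes */

/* Hypothetical helper: bytes in one sector for a given ashift. */
static uint64_t
sector_size(unsigned ashift)
{
	assert(ashift < 64);
	return (1ULL << ashift);
}

/* Hypothetical helper: round a write length up to whole sectors. */
static uint64_t
padded_write_size(uint64_t len, unsigned ashift)
{
	uint64_t sector = sector_size(ashift);

	return ((len + sector - 1) & ~(sector - 1));
}
```

Under these assumptions, a 32 KiB ToC on an ashift-16 disk would occupy a single 64 KiB write, while on an ashift-12 disk it would need no padding.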
include/sys/vdev_impl.h
 * Each sub-section is protected with an embedded checksum. In the event that a
 * sub-section is larger than 16MiB, it will be split in 16MiB - sizeof
 * (zio_eck_t) chunks, which will each have their own checksum.
Since we don't know what the sections will be, I am not sure it is appropriate to speak about checksums universally. Some RAIDZ expansion data may not have any checksums. BTW, is the TOC checksummed?
My intent here was to convey that all the current sections use embedded checksums, but you're right that future ones need not and this is a little confusing as is. I'll update the language to make that clearer. The TOC is checksummed with an embedded checksum.
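As an illustration of the chunking rule in the quoted comment, the sketch below counts how many checksummed chunks a sub-section payload would need. It is not code from the patch: `eck_sketch_t` is a stand-in for `zio_eck_t` (assumed here to be a 64-bit magic number plus a 256-bit checksum), and `chunk_count` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

#define	CHUNK_SIZE	(16ULL << 20)	/* 16 MiB per chunk, checksum included */

/* Stand-in for zio_eck_t; the layout is an assumption for this sketch. */
typedef struct eck_sketch {
	uint64_t magic;
	uint64_t cksum[4];
} eck_sketch_t;

/* Payload bytes that fit in one chunk next to its embedded checksum. */
#define	CHUNK_PAYLOAD	(CHUNK_SIZE - sizeof (eck_sketch_t))

/* Hypothetical helper: number of chunks needed for payload_len bytes. */
static uint64_t
chunk_count(uint64_t payload_len)
{
	if (payload_len == 0)
		return (0);
	return ((payload_len + CHUNK_PAYLOAD - 1) / CHUNK_PAYLOAD);
}
```

With this layout, a payload of exactly 16 MiB does not fit in one chunk, because the embedded checksum takes part of the 16 MiB, so it would be split into two.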
include/sys/vdev_impl.h
/* The size of the section that stores the pool config */
#define VDEV_TOC_POOL_CONFIG "pool_config"
/* The size of the section that stores auxilliary uberblocks */
#define VDEV_TOC_AUX_UBERBLOCK "aux_uberblock"
With just a list of arbitrary values, we'll have to know the names of all regions and their order to be able to read anything. Will the addition of any new type require a read-incompatible pool feature? BTW, what is AUX_UBERBLOCK?
AUX_UBERBLOCK was intended to be a label section that stored auxiliary uberblocks (like checkpoints & MMP). But this ultimately doesn't seem that useful:
- MMP is actually more complicated if you store the aux uberblock outside of the ring, since the activity check works by just finding the newest uberblock using the normal uberblock load logic.
- Checkpoints don't need to store an uberblock anywhere in the label, because they can access the MOS (where the checkpointed uberblock is stored). The only way that would be especially useful is if the MOS was inaccessible due to corruption or a read-incompatible feature; but even then, with the number of uberblocks we can store now, we don't really need to store a special one in the label that's a copy of the checkpoint uberblock. That can be a future feature, if people want it.
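To make the MMP point concrete, here is a simplified sketch of "find the newest uberblock in the ring". The struct and function names (`ub_sketch_t`, `ub_newer`, `newest_uberblock`) are hypothetical, and the ordering (higher txg first, then later timestamp) is a simplification; the real load logic also validates each uberblock and considers more fields, which this sketch omits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for an uberblock; only the ordering fields are kept. */
typedef struct ub_sketch {
	uint64_t txg;
	uint64_t timestamp;
} ub_sketch_t;

/* Returns nonzero if a is newer than b: higher txg, then later timestamp. */
static int
ub_newer(const ub_sketch_t *a, const ub_sketch_t *b)
{
	if (a->txg != b->txg)
		return (a->txg > b->txg);
	return (a->timestamp > b->timestamp);
}

/* Scan every slot of the ring and return the newest entry, or NULL if empty. */
static const ub_sketch_t *
newest_uberblock(const ub_sketch_t *ring, size_t nslots)
{
	const ub_sketch_t *best = NULL;

	assert(ring != NULL || nslots == 0);
	for (size_t i = 0; i < nslots; i++) {
		if (best == NULL || ub_newer(&ring[i], best))
			best = &ring[i];
	}
	return (best);
}
```

Because the scan only looks at slots inside the ring, an auxiliary uberblock stored in a separate label section would need its own lookup path, which is the extra complexity described above.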
This patch contains the logic for a new larger label format. This format is intended to support disks with large sector sizes. By using a larger label we can store more uberblocks and other critical pool metadata. We can also use the extra space to enable new features in ZFS going forwards. This initial commit does not add new capabilities, but provides the framework for them going forwards. Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Sponsored-by: Wasabi, Inc. Sponsored-by: Klara, Inc.
Sponsored by: [Wasabi, Inc.; Klara, Inc.]
Motivation and Context
As disk sector sizes increase, we are able to store fewer and fewer uberblocks on a disk. This makes it increasingly difficult to recover from issues by rolling back to earlier TXGs. Eventually, sector sizes may become large enough that not even a single uberblock can be stored without having to do a partial write. In addition, new ZFS features often need space to store metadata (see, for example, the buffer used by RAIDZ expansion). This space is highly limited with the current disk layout.
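A rough sketch of why larger sectors leave room for fewer uberblocks: it assumes the 128 KiB uberblock ring of the existing label layout, a 1 KiB minimum slot, and one full sector per slot so that no partial-sector writes are needed. The names `RING_SIZE`, `MIN_SLOT_SHIFT`, and `whole_sector_slots` are hypothetical, and the actual slot-size rules in the code may differ from this simplification.

```c
#include <assert.h>
#include <stdint.h>

#define	RING_SIZE	(128ULL << 10)	/* assumed 128 KiB uberblock ring */
#define	MIN_SLOT_SHIFT	10		/* assumed 1 KiB minimum slot size */

/*
 * Hypothetical helper: how many uberblock slots fit in the ring if every
 * slot must cover at least one whole sector of size 1 << ashift.
 */
static uint64_t
whole_sector_slots(unsigned ashift)
{
	unsigned shift;

	assert(ashift < 64);
	shift = ashift > MIN_SLOT_SHIFT ? ashift : MIN_SLOT_SHIFT;
	return (RING_SIZE >> shift);
}
```

Under these assumptions the ring holds 32 slots at ashift 12, 2 slots at ashift 16, and none once a sector is larger than the ring itself.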
Description
This patch contains the logic for a new, larger label format. The format is intended to support disks with large sector sizes. By using a larger label we can store more uberblocks and other critical pool metadata. We can also use the extra space to enable new ZFS features in the future. This initial commit does not add new capabilities, but provides the framework for them.
It also contains zdb and zhack support for the new label type, as well as tests that verify basic functionality of the new label. Currently, the size of the disk is used as the criterion for whether or not to enable the new label type, but that is open to change.
How Has This Been Tested?
In addition to the tests added in this PR, I also ran the ZFS test suite with the tunable set below the size of the disks in use. Some tests failed, but only for space estimation reasons, which could be corrected by fixing the tests. I also did some ztest runs with the new label format.