ZFS
From Wikipedia:
- ZFS is a combined file system and logical volume manager designed by Sun Microsystems. The features of ZFS include protection against data corruption, support for high storage capacities, efficient data compression, integration of the concepts of filesystem and volume management, snapshots and copy-on-write clones, continuous integrity checking and automatic repair, RAID-Z and native NFSv4 ACLs. The ZFS name is registered as a trademark of Oracle Corporation.
FreeBSD, and thus by extension PacBSD, has had native ZFS support since FreeBSD 7.0.
Choosing between UFS and ZFS
ZFS is one of the most advanced file systems available, and PacBSD has built-in support for it. ZFS has several technological advantages over the more traditional UFS:
- Built in support for multiple device-based storage layouts
- Support for various RAID configurations
- Support for mirrored disks
- End-to-end checksums
- True live filesystem integrity checks covering both metadata and data, whereas fsck only checks metadata and requires the filesystem to be unmounted
- Transparent data compression
- ZFS is designed to be a high capacity filesystem
- Native support for encryption (encryption happens after compression, and before checksumming and deduplication)
- Various cache support and management to speed up read and write operations
- Copy-on-write transactional model
- Snapshots and clones
- Send and receive snapshots between multiple computers
- Dynamic striping
- Variable block sizes
- Lightweight filesystem creation
- Data deduplication
The trade-off for some of these features is higher CPU and RAM usage, which makes ZFS a less than ideal choice for older computers with slow CPUs or little RAM. A good rule of thumb is to have 1GB of RAM plus an additional 1GB of RAM for each 1TB of storage space. If data deduplication is enabled, the requirement becomes 5GB of RAM for every 1TB of storage. It is also highly recommended to use ZFS only on 64-bit systems; while it may work on 32-bit systems, there may be some stability issues.
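The rule of thumb above can be turned into a quick calculation. The following is only a sketch: the function name zfs_ram_gb is hypothetical, storage is given in whole terabytes, and the deduplication figure is read as a flat 5GB per 1TB that replaces the base formula, which is one possible interpretation of the guideline.

```shell
#!/bin/sh
# zfs_ram_gb STORAGE_TB [dedup]
# Hypothetical helper: prints the suggested RAM in GB for a pool of the
# given size, following the rule of thumb from the text.
zfs_ram_gb() {
    tb=$1
    if [ "${2:-}" = "dedup" ]; then
        # Assumption: with deduplication, 5GB per 1TB replaces the formula.
        echo $((tb * 5))
    else
        # 1GB base plus 1GB per 1TB of storage.
        echo $((1 + tb))
    fi
}
```

For example, `zfs_ram_gb 4` suggests 5GB for a 4TB pool, while `zfs_ram_gb 4 dedup` suggests 20GB.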
Compression
ZFS has built-in support for transparently compressing data. Not only does enabling this save space in the pool, but in some cases it drastically improves performance. This is because compressed data is smaller, and the time spent compressing or decompressing it is often less than the time saved by reading or writing fewer blocks on disk.
Supported compression options are:
- lz4
- lzjb
- gzip
LZ4 is the recommended compression algorithm as it offers the best balance of compression ratio and performance of the three. LZJB makes for a good second choice as it provides a good trade-off between speed and space. Gzip is no longer recommended but is still supported. As with other tools that offer gzip support, the compression level is configurable, here from gzip-1 (fastest, least compression) to gzip-9 (slowest, most compression); by default ZFS uses gzip compression level 6.
To see which datasets, if any, use compression:
# zfs get -r compression tank
NAME                PROPERTY     VALUE  SOURCE
tank                compression  off    default
tank/HOME           compression  lz4    local
tank/HOME/root      compression  off    local
tank/PORTS          compression  lz4    local
tank/ROOT           compression  off    default
tank/ROOT/pacbsd-0  compression  lz4    local
The -r flag tells zfs get to work recursively, returning the property not only for the tank pool but also for all datasets under it.
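On a pool with many datasets it can help to filter that listing. The sketch below assumes the script-friendly form of the command, `zfs get -H -o name,value`, which prints tab-separated fields without a header; the function name uncompressed_datasets is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads "name<TAB>value" lines on stdin and prints
# the names of datasets whose compression property is "off".
uncompressed_datasets() {
    awk -F '\t' '$2 == "off" { print $1 }'
}

# Intended use (requires a ZFS pool named tank):
#   zfs get -H -r -o name,value compression tank | uncompressed_datasets
```

Reading from standard input keeps the filter usable on saved output as well as on a live system.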
To enable compression on a dataset use:
# zfs set compression=lz4 tank/HOME/root
To see the ratio of space saved with compression use:
# zfs get compressratio tank
NAME  PROPERTY       VALUE  SOURCE
tank  compressratio  2.62x  -
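A ratio such as 2.62x can be converted into a percentage of space saved with (1 - 1/ratio) * 100. The following is a small sketch of that arithmetic; the function name saved_percent is hypothetical.

```shell
#!/bin/sh
# saved_percent RATIO
# Hypothetical helper: converts a compressratio value such as "2.62x"
# into the percentage of space saved, rounded to one decimal place.
saved_percent() {
    ratio=${1%x}    # drop the trailing "x"
    awk -v r="$ratio" 'BEGIN { printf "%.1f\n", (1 - 1 / r) * 100 }'
}
```

With the value shown above, `saved_percent 2.62x` prints 61.8, meaning the data occupies a little under 40% of its uncompressed size.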
Data Integrity
All data and metadata written in ZFS is checksummed to ensure that the data has not become corrupted over time. These checksums are used to validate the integrity of the file by checking for things like data rot or early stage drive failure. When a block is accessed, regardless of whether it is data or metadata, its checksum is calculated and compared with the stored checksum value of what it should be. If the checksums match, the data is processed normally. If they do not match, ZFS tries to repair the block by fetching a copy with a valid checksum from a mirrored disk or by reconstructing it from the RAID-Z parity.
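The verify-then-repair logic described above can be illustrated with ordinary files. This is only a conceptual sketch, not how ZFS is implemented: two files stand in for the two sides of a mirror, cksum(1) stands in for the real checksum algorithms, and the function names are hypothetical.

```shell
#!/bin/sh
# checksum_of FILE: print a checksum for FILE.
# cksum(1) is a CRC, used here only as a portable stand-in.
checksum_of() {
    cksum < "$1"
}

# verified_read EXPECTED PRIMARY MIRROR
# Print the first copy whose checksum matches EXPECTED. If only the
# mirror matches, overwrite the primary with it (the repair step).
verified_read() {
    expected=$1
    primary=$2
    mirror=$3
    if [ "$(checksum_of "$primary")" = "$expected" ]; then
        cat "$primary"
        return 0
    fi
    if [ "$(checksum_of "$mirror")" = "$expected" ]; then
        cp "$mirror" "$primary"    # repair the damaged copy
        cat "$primary"
        return 0
    fi
    echo "checksum error: no valid copy found" >&2
    return 1
}
```

If neither copy matches the stored checksum, the read fails instead of returning corrupt data, which mirrors the behaviour described in the text.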
To see which checksum algorithm is in use run:
# zfs get checksum tank
NAME  PROPERTY  VALUE   SOURCE
tank  checksum  sha256  local
While the checksum algorithm can be changed after the fact, existing blocks keep their old checksums until the file(s) are rewritten. The easiest way to do this is with the zfs send and zfs receive commands: the data can either be sent to an intermediate machine that also uses ZFS, or be written straight back to the same pool, cutting out the need for a second computer. If a second computer is used, it must have a ZFS pool big enough to hold all the data that will be rechecksummed; otherwise the local pool must be large enough to hold a second copy of all the data. Either way, depending on the amount of data on the pool, this process may take some time to complete.
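As a rough sketch of the same-pool variant, the helper below only prints the commands involved so they can be reviewed before being run as root. The function name rechecksum_plan, the @rechecksum snapshot name and the -new suffix are all hypothetical, and tank/data is just an example dataset.

```shell
#!/bin/sh
# rechecksum_plan DATASET
# Hypothetical helper: prints (does not run) the commands that copy a
# dataset within the same pool so the new copy is written with the
# currently active checksum algorithm.
rechecksum_plan() {
    src=$1
    snap="$src@rechecksum"
    cat <<EOF
zfs snapshot $snap
zfs send $snap | zfs receive ${src}-new
EOF
}
```

Running `rechecksum_plan tank/data` prints the snapshot and send/receive steps for tank/data. Once the copy has been checked, the old dataset still has to be retired and the new one moved into its place, which is left out of this sketch.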
Setting the active checksum algorithm:
# zfs set checksum=<checksum> tank
Where <checksum> is one of fletcher2, fletcher4 or sha256.
While the checksums are automatically checked when accessing data, the system administrator can also manually trigger checking the checksums for the entire pool:
# zpool scrub tank
This starts the scan/scrub in the background and no information is presented to the user.
To see the status of the most recent scrub for a pool use zpool status:
# zpool status tank
  pool: tank
 state: ONLINE
  scan: scrub repaired 0 in 4h28m with 0 errors on Sun Mar 27 07:28:54 2016
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            ada0p3  ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0

errors: No known data errors
This shows that across all drives there are no read/write errors and all checksums match.
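For periodic checks, the final line of that report can be tested from a script. The sketch below assumes the wording "errors: No known data errors" shown above; the function name pool_is_clean is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads `zpool status` output on stdin and succeeds
# only if it contains the "No known data errors" summary line.
pool_is_clean() {
    grep -q '^[[:space:]]*errors: No known data errors'
}

# Intended use (requires a ZFS pool named tank):
#   zpool status tank | pool_is_clean || echo "tank needs attention"
```

This only looks at the summary line, so the per-device READ, WRITE and CKSUM counters should still be reviewed by hand when the check fails.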
Boot Environments
ZFS supports booting from multiple different root datasets, commonly referred to as boot environments. This allows for rolling back from various mistakes or problems, such as a failed system update or the deletion of an important file or directory.
Data Deduplication
On top of built-in compression support, ZFS can save a lot of disk space by using data deduplication, at the cost of higher RAM requirements. In short, deduplication allows storing the same data multiple times while only taking up the space of a single copy. Depending on the system and the kind of data being written, this can lead to substantial differences in the storage space used. ZFS deduplicates at the block level, so identical blocks are shared even between files that differ elsewhere. An example use of deduplication would be storing multiple copies of virtual machine images where the data is fairly consistent between them.
# zfs create tank/VMs
# zfs set dedup=on tank/VMs
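To get a feel for how much block-level deduplication might save on a given file, the blocks can be hashed and counted outside of ZFS. This is only a conceptual sketch: it assumes a sha256sum(1) utility is installed, uses a fixed block size, and the function name count_blocks is hypothetical. It is not a substitute for ZFS's own accounting.

```shell
#!/bin/sh
# count_blocks FILE [BLOCK_SIZE]
# Hypothetical helper: splits FILE into fixed-size blocks and reports how
# many blocks there are and how many of them are unique by SHA-256 hash.
# Assumes sha256sum(1) is available and FILE is not empty.
count_blocks() {
    file=$1
    bs=${2:-131072}    # 128 KiB, chosen only for illustration
    tmp=$(mktemp -d) || return 1
    split -b "$bs" "$file" "$tmp/blk."
    total=$(find "$tmp" -type f | wc -l | tr -d ' ')
    unique=$(sha256sum "$tmp"/blk.* | awk '{ print $1 }' | sort -u | wc -l | tr -d ' ')
    rm -rf "$tmp"
    echo "$total blocks, $unique unique"
}
```

The larger the gap between the total and unique counts, the more a deduplicated dataset could save for that kind of data.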
Optionally, deduplication can be set to use extra verification on top of the checksums to help avoid potential hash collisions. The downside is that this adds extra overhead to both the checksum and deduplication process. If the pool is set to use SHA-256 as the checksum hashing algorithm, the chance of a hash collision is low enough that this probably isn't needed.
# zfs set dedup=verify tank/VMs
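If the pool is set to use SHA-256 and collision verification is wanted on the dataset, it is possible to tell ZFS to use a faster but weaker checksum for deduplication on only this dataset to lessen the performance hit. Note that verify is a value of the dedup property rather than of checksum, and that not every ZFS version accepts fletcher4 here:
# zfs set dedup=fletcher4,verify tank/VMs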
File system creation
This guide is only an example of setting up a very basic ZFS pool. While this may be good enough for most users, it is by no means a recommended setup.
Creating the pool
ZFS can be used either on a single disk or across multiple disks in either a mirror or RAID-Z setup. Mirroring two or more drives offers the greatest redundancy, as everything written to one drive is duplicated to the others, while RAID-Z setups stripe data across multiple drives and store parity information so that the pool survives drive failures. With enough drives it is also possible to use both mirroring and RAID-Z.
Single disk pool creation:
# zpool create tank /dev/ada0p2
This creates a pool called tank on /dev/ada0p2.
Multiple disk pool creation:
# zpool create media raidz1 /dev/ada1 /dev/ada2 /dev/ada3 /dev/ada4
This creates a pool that spans across four drives, with enough redundancy for one drive failure, called media. Valid options are raidz1 which allows for one drive failure, raidz2 which allows for two drive failures, and raidz3 which allows for three drive failures.
To create mirrored pools, where one drive is an exact copy of another drive, use:
# zpool create media mirror /dev/ada1 /dev/ada2 mirror /dev/ada3 /dev/ada4
This makes the pool so that /dev/ada2 is a mirror of /dev/ada1 and /dev/ada4 is a mirror of /dev/ada3.
To create a striped pool, where there is no data redundancy, omit mirror and raidzn from the zpool create command:
# zpool create media /dev/ada1 /dev/ada2 /dev/ada3 /dev/ada4
This creates one large pool that contains the cumulative free space from all the vdevs, minus the space reserved for ZFS to store metadata.
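The space cost of each layout above can be compared with a little arithmetic. The sketch below is an approximation under stated assumptions: all disks are the same size, mirrors are two-way pairs, the RAID-Z disks form a single vdev, and metadata overhead is ignored. The function name usable_disks is hypothetical.

```shell
#!/bin/sh
# usable_disks LAYOUT DISKS
# Hypothetical helper: prints roughly how many disks' worth of space is
# usable for a pool built from DISKS equally sized drives.
usable_disks() {
    layout=$1
    disks=$2
    case $layout in
        stripe) echo "$disks" ;;
        mirror) echo $((disks / 2)) ;;    # assumes two-way mirrors
        raidz1) echo $((disks - 1)) ;;    # one disk's worth of parity
        raidz2) echo $((disks - 2)) ;;    # two disks' worth of parity
        raidz3) echo $((disks - 3)) ;;    # three disks' worth of parity
        *) echo "unknown layout: $layout" >&2; return 1 ;;
    esac
}
```

With the four drives used in the examples, a stripe yields about 4 disks of space, raidz1 about 3, and two mirrored pairs about 2.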
Listing all ZFS pools
It is possible to list all ZFS datasets known to the system, starting with each pool's root dataset, and get a brief overview of them with zfs list (zpool list gives a per-pool summary instead):
# zfs list
NAME  USED   AVAIL  REFER  MOUNTPOINT
tank  30.5G  419G   25.3K  /tank
Export and Importing the pool
If this pool is going to be used for installation then it will need to be remounted from / to /mnt. Before remounting (exporting and reimporting), create a small ramdisk for /boot/zfs:
# mdmfs -s 128m md /boot/zfs
Remount the pool so it is mounted to /mnt:
# zpool export tank
# zpool import -o altroot=/mnt -o cachefile=/boot/zfs/zpool.cache -f tank
Setting the checksum algorithm
Currently there are three supported checksum algorithms in ZFS: the fletcher2 and fletcher4 checksums and the sha256 hash. Fletcher4 is the default because SHA-256 is more CPU intensive to calculate. On a fairly recent machine that is not constantly under heavy load, choosing SHA-256 over Fletcher4 should be fine.
To change the checksum algorithm on the pool run:
# zfs set checksum=sha256 tank
Creating datasets
One of the advantages ZFS has to offer is support for multiple datasets (subvolumes or file systems within a file system). Datasets are created with zfs create and can be passed arguments similar to what one would pass to mount.
# zfs create -o canmount=off -o mountpoint=legacy tank/ROOT
# zfs create -o canmount=on -o compression=lz4 -o mountpoint=/ tank/ROOT/pacbsd
# zfs create -o compression=lz4 -o mountpoint=/home tank/HOME
# zfs create -o compression=off -o mountpoint=/root tank/HOME/root
This creates four different datasets. The first, tank/ROOT, is a container dataset that is not directly mountable by the system (canmount=off) and whose mounting is left to the administrator (mountpoint=legacy). After that, separate datasets for / and /home are created, and all data written to either dataset is automatically compressed using the LZ4 algorithm (compression=lz4).
There is also a separate dataset created for /root which doesn't use compression. As using the root account is highly discouraged, there is little point in setting compression on root's home directory.
Swap on ZFS
It is possible to create a dataset under ZFS to use as swap space. For directions on how to set this up, see Swap#ZFS_Swap_Volume.