Reduce SSD Wear When Running ZFS: Plus Extra Tips
A few tricks to reduce SSD (NOT NVMe/M.2) wear when running ZFS:
- Remember to enable the autotrim option on the pool. You should also set up a cron job to run zpool trim tank0 weekly or bi-weekly. Replace tank0 with your pool name.
zpool get autotrim tank0 # check trim
zpool set autotrim=on tank0 # enable trim on tank0
zpool trim tank0 # run trim manually
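For example, a weekly cron entry could look like this (the schedule and the path to zpool are placeholders; adjust them for your distro):
# trim tank0 every Sunday at 03:00 (crontab entry)
0 3 * * 0 /usr/sbin/zpool trim tank0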
- Use a large ashift of at least 12, though 13 is better; it will reduce write amplification. Jim Salter recommends going higher rather than lower when choosing ashift: a too-low ashift can cripple performance, whereas a too-high ashift won't have much impact on most normal workloads. (You cannot change a vdev's ashift once it has been set.)
- ashift 9 = 512B sectors (Old drives)
- ashift 12 = 4K sectors (Modern HDDs)
- ashift 13 = 8K sectors (Most SSDs)
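For example, you can force the ashift when creating the pool (the device names below are just placeholders):
zpool create -o ashift=13 tank0 mirror /dev/sda /dev/sdb # mirrored pool with 8K sectors
zdb -C tank0 | grep ashift # verify the ashift actually used by the vdevs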
- Use a large recordsize on the filesystem, preferably something between 128K and 1M; it will also reduce write amplification. A database dataset on ZFS should be tuned to match the page size used by the application: MySQL (InnoDB 16K, MyISAM 8K) or PostgreSQL (8K by default).
- Small files: use a high-speed SLOG (the larger the SLOG device, the more write-endurance headroom it has)
- Big files: the default recordsize is 128K, but it can be raised as high as 1M (the large_blocks feature). A 1M recordsize is ideal for workstation loads; 5-8 MB JPEG images or 100 GB movie files are better served by 1M
- Virtual machine hosts: 64K (QEMU's QCOW2 default cluster size is 64K)
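For example (the dataset names here are hypothetical):
zfs set recordsize=1M tank0/media # large media files
zfs set recordsize=16K tank0/mysql # match InnoDB's 16K page size
zfs get recordsize tank0/media # verify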
- Remember to disable atime on your filesystems (noatime in /etc/fstab for legacy mounts, or the ZFS atime property), so your file reads do not result in metadata writes
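Using the ZFS property, for example (tank0 is a placeholder pool name; child datasets inherit the setting):
zfs set atime=off tank0 # stop updating access times on reads
zfs get atime tank0 # verify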
What is your workload?
If you have synchronous writes, you will benefit from a SLOG. It allows the main pool to be written in a more efficient, less fragmented fashion, and keeps intent-log records from being mixed in with your data.
SLOG stands for **Separate intent LOG**.
ZIL stands for **ZFS Intent Log** which resides on the pool itself unless a SLOG is used.
Asynchronous writes - writes that are acknowledged while still pending in RAM and flushed to disk when ZFS is ready.
Synchronous writes - guaranteed-integrity writes where the data must be committed to stable storage before the application can move on to the next one. Used by critical applications such as databases, VMs, and NFS, but a SLOG is absolutely wasted if your system doesn't generate this kind of workload.
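You can check (and, per dataset, override) how sync writes are handled with the sync property; the dataset name below is hypothetical:
zfs get sync tank0 # standard (default), always, or disabled
zfs set sync=disabled tank0/scratch # only for data you can afford to lose after a crash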
When creating a SLOG though, keep these things in mind:
- The SLOG requires at least one dedicated physical device that is only used by the ZIL
- It is recommended to mirror SLOG devices
- Not every pool has to have a SLOG. Again, it depends on the workload.
- Even if you don’t build a SLOG for a pool, each pool still requires its own ZIL. By creating a SLOG, the ZIL is moved from pool storage to the dedicated device
- Neither ZIL nor SLOG make a difference with pending asynchronous writes which are always stored in RAM
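A minimal sketch of adding a mirrored SLOG to an existing pool (device names are placeholders):
zpool add tank0 log mirror /dev/sdy /dev/sdz # attach a mirrored log vdev
zpool status tank0 # the new devices show up under 'logs'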
A SLOG would be best used in a system with redundant battery backup and mirrored on multiple disks.
Mirrored SSDs are preferable to RAID. Make sure ashift is set to the native block size of your storage (for SSDs that usually means 13, i.e. 8K sectors).
L2ARC should only be added after careful consideration of your I/O - it is rarely beneficial. You would need something like hundreds of NFS users or a hot database to benefit from L2ARC.
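If your workload does justify it, adding a cache device is a one-liner (the device name below is a placeholder):
zpool add tank0 cache /dev/sdx # add an L2ARC device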
Again, it all comes down to workload.
In the context of someone using 24 × 1 TB SSDs (best to use 4 vdevs, mirrored).
For storing VHD files with those drives: 4 RAIDz1 vdevs of 6 drives each.
For simple math, rounding 960 GB up to 1 TB, each 6-drive RAIDz1 vdev gives you about 5 TB of usable space, so 4 vdevs come to roughly 20 TB. Four vdevs will also give better performance than one large vdev.
But with RAIDz1, if you lose 2 drives in any one vdev, all data is gone. So although you should always back up your data, it's even more important with a RAIDz1 config. The probability of 2 drives failing in one vdev is still very, very low... but not impossible. Also, with 960 GB SSD drives, rebuilds will still be fast. Keep a spare on hand.
Although raid10 offers max performance and faster rebuild times, it comes at the expense of usable space.
One or two RAIDz2 vdevs would be very resilient. Meaning you would need to lose 3 drives in a single vdev to experience data loss. But that comes at the expense of performance.
With 24 960GB SSD drives, imo, 4 RAIDz1 vdevs is close to the sweet spot for performance and resilience…
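A sketch of what that 4 × 6 RAIDz1 layout could look like at creation time (the sdX names are placeholders; in practice you would use stable /dev/disk/by-id paths, and ashift=13 assumes 8K-sector SSDs as discussed above):
zpool create -o ashift=13 tank0 \
  raidz1 sda sdb sdc sdd sde sdf \
  raidz1 sdg sdh sdi sdj sdk sdl \
  raidz1 sdm sdn sdo sdp sdq sdr \
  raidz1 sds sdt sdu sdv sdw sdx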