ZFS Replication

Note : This page may contain outdated information and/or broken links; some of the formatting may be mangled due to the many different code-bases this site has been through in over 20 years; my opinions may have changed etc. etc.

As I’ve been investigating ZFS for use on production systems, I’ve been making a great many notes and jotting down little "cookbook recipes" for various tasks. One of the coolest systems I’ve created recently utilised the zfs send and receive commands, along with incremental snapshots, to create a replicated ZFS environment across two different systems. True, all this is present in the zfs manual page, but sometimes a quick demonstration makes things easier to understand and follow.

While this isn’t true filesystem replication (you’d have to look at something like StorageTek AVS for that), it does provide periodic snapshots and incremental updates; these can be run every minute if you’re driving this from cron, or at even more granular intervals if you write your own daemon. Nonetheless, this suffices for disaster recovery and redundancy if you don’t need up-to-the-second replication between systems.

I’ve typed up my notes in blog format so you can follow along with this example yourself; all you’ll need is a Solaris system running ZFS.

First, as with my last walkthrough, I’ll create a couple of files to use for testing purposes. In a real-life scenario, these would most likely be pools of disks in a RAIDZ configuration, and the two pools would also be on physically separate systems. I’m only using 100 MB files for each, as that’s all I need for this proof of concept.

[root@solaris]$ mkfile 100m master
[root@solaris]$ mkfile 100m slave
[root@solaris]$ zpool create master $PWD/master
[root@solaris]$ zpool create slave $PWD/slave
[root@solaris]$ zpool list
NAME    SIZE   USED   AVAIL  CAP  HEALTH  ALTROOT
master  95.5M  84.5K  95.4M  0%   ONLINE  -
slave   95.5M  52.5K  95.4M  0%   ONLINE  -
[root@solaris]$ zfs list
NAME    USED   AVAIL  REFER  MOUNTPOINT
master  77K    63.4M  24.5K  /master
slave   52.5K  63.4M  1.50K  /slave

There we go. The naming should be pretty self-explanatory : The "master" is the primary storage pool, which will replicate and push data through to the backup "slave" pool. Now, I’ll create a ZFS filesystem and add something to it. I had a few source tarballs knocking around, so I just unpacked one (GNU grep) to give me a set of files to use as a test :

[root@solaris]$ zfs create master/data
[root@solaris]$ cd /master/data/
[root@solaris]$ gtar xzf ~/grep-2.5.1.tar.gz
[root@solaris]$ ls
grep-2.5.1

We can also see from "zfs list" we’ve now taken up some space :

[root@solaris]$ zfs list
NAME         USED   AVAIL  REFER  MOUNTPOINT
master       3.24M  60.3M  25.5K  /master
master/data  3.15M  60.3M  3.15M  /master/data
slave        75.5K  63.4M  24.5K  /slave

Now, we’ll transfer all this over to the "slave", and start the replication going. We first need to take an initial snapshot of the filesystem, as that’s what "zfs send" works on. It’s also worth noting here that in order to transfer the data to the slave, I simply piped it to "zfs receive". If you’re doing this between two physically separate systems, you’d most likely just pipe this through SSH between the systems and set up keys to avoid the need for passwords. Anyway, enough talk :

[root@solaris]$ zfs snapshot master/data@1
[root@solaris]$ zfs send master/data@1 | zfs receive slave/data
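Between two physically separate machines, the same pipe can be run through SSH. The following is only a sketch: the function name send_full and the host name backuphost are my own inventions for illustration, and it assumes key-based SSH login to the remote side is already working.

```shell
# Hypothetical helper: send a full snapshot stream to a pool on another host.
# Assumes key-based SSH login to the remote host is already set up.
send_full() {
    snapshot=$1     # e.g. master/data@1
    remote=$2       # e.g. root@backuphost (hypothetical host name)
    target=$3       # e.g. slave/data
    # The stream travels over SSH and is unpacked by "zfs receive" remotely.
    zfs send "$snapshot" | ssh "$remote" zfs receive "$target"
}

# Usage (not run here):
#   send_full master/data@1 root@backuphost slave/data
```

The only difference from the local example is the "ssh" hop in the middle of the pipe; the zfs commands themselves are unchanged.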

The snapshot has now been sent through to the slave. It’s also worth pointing out that I didn’t have to recreate the exact same pool or zfs structure on the slave (which may be useful if you are replicating between dissimilar systems), but I chose to keep the filesystem layout the same for the sake of legibility in this example. I also simply used a numeric identifier for each snapshot; in a production system, timestamps may be more appropriate. Anyway, let’s take a quick look at "zfs list", where we’ll see the slave has now gained a filesystem and snapshot using exactly the same amount of space as on the master :

[root@solaris]$ zfs list
NAME           USED   AVAIL  REFER  MOUNTPOINT
master         3.25M  60.3M  25.5K  /master
master/data    3.15M  60.3M  3.15M  /master/data
master/data@1  0      -      3.15M  -
slave          3.25M  60.3M  24.5K  /slave
slave/data     3.15M  60.3M  3.15M  /slave/data
slave/data@1   0      -      3.15M  -
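For the timestamp-based snapshot names mentioned above, something along these lines might do. This is a sketch, not what I ran in this walkthrough; the function name timestamped_snapshot and the exact date format are assumptions of mine.

```shell
# Hypothetical helper: build a snapshot name from the current UTC time,
# in the form dataset@YYYYMMDD-HHMMSS.
# UTC avoids ambiguous names around daylight-saving changes, and this
# format sorts alphabetically in the same order as chronologically.
timestamped_snapshot() {
    dataset=$1
    printf '%s@%s\n' "$dataset" "$(date -u +%Y%m%d-%H%M%S)"
}

# Usage (not run here):
#   zfs snapshot "$(timestamped_snapshot master/data)"
```

One-second resolution is plenty if snapshots are taken from cron, which cannot fire more than once a minute anyway.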

Now, here comes a big "gotcha". You now have to set the "readonly" attribute on the slave. I discovered that if this was not set, even just cd-ing into the slave’s mountpoints would cause subsequent replication operations to fail. An incremental receive expects the target filesystem to be unmodified since its most recent snapshot, and browsing a writable filesystem presumably alters metadata (access times and the like).

[root@solaris]$ zfs set readonly=on slave/data
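In a scripted setup it may be worth checking the property before every transfer rather than trusting that it was set once. A minimal sketch, with ensure_readonly being a name I made up:

```shell
# Hypothetical helper: make sure the receiving filesystem is read-only
# before an incremental stream is applied to it.
ensure_readonly() {
    dataset=$1
    # -H drops the header line, -o value prints just the property value.
    current=$(zfs get -H -o value readonly "$dataset") || return 1
    if [ "$current" != "on" ]; then
        zfs set readonly=on "$dataset"
    fi
}

# Usage (not run here):
#   ensure_readonly slave/data
```

Where your version of ZFS supports it, "zfs receive -F" is an alternative: it rolls the target back to its most recent snapshot before receiving. Bear in mind that this throws away anything changed on the target since then, so I would still treat readonly=on as the safer default.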

So, let’s look in the slave to see if our files are there :

[root@solaris]$ ls /slave/data
grep-2.5.1

Excellent stuff! However, the real coolness starts with the incremental transfers: instead of transferring the whole lot again, we can send only the data that actually changed. This will drastically reduce bandwidth and the time taken to replicate data, making a "cron" based system of periodic snapshots and transfers feasible. To demonstrate this, I’ll unpack another tarball (this time, GNU bison) on the master so I have some more data to send :

[root@solaris]$ cd /master/data
[root@solaris]$ gtar xzf ~/bison-2.3.tar.gz

And we’ll now make a second snapshot, and transfer the differences between this one and the last :

[root@solaris]$ zfs snapshot master/data@2
[root@solaris]$ zfs send -i master/data@1 master/data@2 | zfs receive slave/data

Checking to see what’s happened, we see the slave has gained another snapshot:

[root@solaris]$ zfs list
NAME           USED   AVAIL  REFER  MOUNTPOINT
master         10.2M  53.3M  25.5K  /master
master/data    10.1M  53.3M  10.1M  /master/data
master/data@1  32.5K  -      3.15M  -
master/data@2  0      -      10.1M  -
slave          10.2M  53.3M  25.5K  /slave
slave/data     10.1M  53.3M  10.1M  /slave/data
slave/data@1   32.5K  -      3.15M  -
slave/data@2   0      -      10.1M  -

And our new data is now there as well :

[root@solaris]$ ls /slave/data/
bison-2.3 grep-2.5.1
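In the incremental step I typed the snapshot names by hand. A script needs to work out the base snapshot itself, and one way is to ask the slave for its newest snapshot and send everything since then. The sketch below assumes both pools are on the same machine, as in this walkthrough, and that the base snapshot still exists on the master; the function names latest_snapshot and send_incremental are hypothetical.

```shell
# Print the newest snapshot name (the part after '@') of one dataset,
# or nothing if it has no snapshots. "-s creation" sorts oldest first,
# so the last matching line is the most recent snapshot.
latest_snapshot() {
    dataset=$1
    zfs list -H -t snapshot -o name -s creation |
        awk -F@ -v ds="$dataset" '$1 == ds { last = $2 } END { if (last != "") print last }'
}

# Take a new snapshot on the source and send only the changes made
# since the target's newest snapshot.
send_incremental() {
    source_fs=$1    # e.g. master/data
    target_fs=$2    # e.g. slave/data
    new=$3          # name for the new snapshot, e.g. 3
    base=$(latest_snapshot "$target_fs")
    if [ -z "$base" ]; then
        echo "no base snapshot on $target_fs; send a full stream first" >&2
        return 1
    fi
    zfs snapshot "$source_fs@$new" &&
        zfs send -i "$source_fs@$base" "$source_fs@$new" | zfs receive "$target_fs"
}

# Usage (not run here):
#   send_incremental master/data slave/data 3
```

Asking the target, rather than the source, for the base snapshot means a failed transfer simply gets retried from the last snapshot that actually arrived.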

And that’s it. All that remains to turn this into a production system between two hosts is for a periodic cron job to be written that runs at the appropriate intervals (daily, or even every minute if need be) and snapshots the filesystem before transferring it. You’ll also likely want to have another job that clears out old snapshots, or maybe archives them off somewhere.
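As a starting point for the clean-up job, here is a sketch that keeps only the newest few snapshots of a filesystem. The name prune_snapshots, the script path in the crontab comment and the retention count are all assumptions of mine, and I have not run this in production.

```shell
# Hypothetical helper: destroy all but the newest $keep snapshots of one
# dataset. Snapshots are listed oldest first, so the ones to destroy are
# the first (total - keep) lines.
prune_snapshots() {
    dataset=$1
    keep=$2         # should be at least 1, see the note below
    zfs list -H -t snapshot -o name -s creation |
        awk -F@ -v ds="$dataset" '$1 == ds' |
        awk -v keep="$keep" '{ names[NR] = $0 }
            END { for (i = 1; i <= NR - keep; i++) print names[i] }' |
        while read -r snap; do
            zfs destroy "$snap"
        done
}

# Usage (not run here):
#   prune_snapshots master/data 24
#
# Example crontab entry running a wrapper script every hour
# (the script path is hypothetical):
#   0 * * * * /usr/local/bin/zfs-replicate.sh
```

Whatever retention you choose, the most recent snapshot that exists on both sides has to survive, otherwise the next incremental send has no base to work from.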