Building a redundant iSCSI and NFS cluster with Debian - Part 5


Note : This page may contain outdated information and/or broken links; some of the formatting may be mangled due to the many different code-bases this site has been through in over 20 years; my opinions may have changed etc. etc.

This is part 5 of a series on building a redundant iSCSI and NFS SAN with Debian.

Part 1 - Overview, network layout and DRBD installation
Part 2 - DRBD and LVM
Part 3 - Heartbeat and automated failover
Part 4 - iSCSI and IP failover
Part 5 - Multipathing and client configuration
Part 6 - Anything left over!

In this part of the series, we’ll configure an iSCSI client ("initiator"), connect it to the storage servers and set up multipathing. Note : Debian Lenny has been released since this series of articles started, so that’s the version we’ll use for the client.

If you refer back to part one to refresh your memory of the network layout, you can see that the storage client ("badger" in that diagram) should have 3 network interfaces :

  • eth0 : 172.16.7.x for the management interface - this is what you’ll use to SSH into the client.

And two storage interfaces. As the storage servers ("targets") are using 192.168.x.1 and 2, I’ve given this client the following addresses :

  • eth1: 192.168.1.10
  • eth2: 192.168.2.10

Starting at .10 on each range keeps things clear - I’ve found it can help to have a policy of servers being in a range of, say, 1 to 10, and clients being above this. Before we continue, make sure that these interfaces are configured, and you can ping the storage server over both interfaces, e.g. try pinging 192.168.1.1 and 192.168.2.1.
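If you need a reminder, a minimal /etc/network/interfaces stanza for the two storage interfaces might look something like the following (the interface names and the /24 netmask are assumptions based on the layout above - adjust them to suit your setup) :

auto eth1
iface eth1 inet static
        address 192.168.1.10
        netmask 255.255.255.0

auto eth2
iface eth2 inet static
        address 192.168.2.10
        netmask 255.255.255.0

You can then bring the interfaces up with "ifup eth1 eth2" and test with ping as above.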

Assuming the underlying networking is configured and working, the first thing we need to do is install open-iscsi (which is the "initiator" - the iSCSI client). This is done by a simple :

# aptitude install open-iscsi

You should see the package get installed, and the service started :

Setting up open-iscsi (2.0.870~rc3-0.4) ...
Starting iSCSI initiator service: iscsid.
Setting up iSCSI targets:
iscsiadm: No records found!

At this point, we have all we need to start setting up some connections.  There are two ways we can "discover" targets on a server (well, three actually, if you include iSNS, but that’s beyond the scope of this article).

  • We can use "send targets" - this logs into an iSCSI target server, and asks it to send the initiator a list of all the available targets.
  • We can use manual discovery, where we tell the initiator explicitly what targets to connect to.

For this exercise, I’ll first show how "send targets" works, then we’ll delete the records so we can add them back manually later. Sendtargets can be useful if you’re not sure what targets your storage server offers, but if you don’t trim the records you’re not using, you can end up with a lot of stale entries.

So, to get things rolling, we’ll query the targets available on one of the interfaces we’re going to use (192.168.1.1) - we’ll set up multipathing later. Run the following as root :

iscsiadm -m discovery -t st -p 192.168.1.1

And you should see the following output returned :

192.168.1.1:3260,1 iqn.2009-02.com.example:test

This shows that your initiator has successfully queried the storage server, and the server has returned a list of targets - which, if you haven’t changed anything since the last article, should just be the single "iqn.2009-02.com.example:test" target. You can always see which nodes are available to your initiator at any time by simply running :

iscsiadm -m node 

A few things have happened behind the scenes that it’s worth checking out at this point. After discovering an available target, the initiator will have created a node record for it under /etc/iscsi/nodes. If you take a look in that directory, you’ll see the following file :

/etc/iscsi/nodes/iqn.2009-02.com.example:test/192.168.1.1,3260,1/default

Which is a file that contains specific configuration details for that iSCSI node. Some of these settings are influenced by the contents of /etc/iscsi/iscsid.conf, which governs the overall behaviour of the iSCSI initiator (e.g. settings in iscsid.conf apply to all nodes). We’ll investigate a few of these settings later.
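You don’t have to read that file directly - asking iscsiadm for the node record should print the same settings, e.g. :

iscsiadm -m node -p 192.168.1.1 -T iqn.2009-02.com.example:test

Which dumps the various "node.*" settings that will be used when connecting to that target.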

For now though, all your initiator has done is discover a set of available targets - we can’t actually make use of them without "logging in". So, now run the following as root :

iscsiadm -m node -p 192.168.1.1 -T iqn.2009-02.com.example:test -l

The arguments to this command are largely self-explanatory - we’re performing an operation on a node ("-m node"), are using the portal we queried earlier ("-p 192.168.1.1"), are running the operation on a specific target ("-T iqn.2009-02.com.example:test") and are logging in to it ("-l").

You can use the longer form of these arguments if you want - for instance, you could use "--login" instead of "-l" if you feel it makes things clearer (see the man page for iscsiadm for more details). Anyway, you should see the following output after running that command :

Logging in to [iface: default, target: iqn.2009-02.com.example:test, portal: 192.168.1.1,3260]
Login to [iface: default, target: iqn.2009-02.com.example:test, portal: 192.168.1.1,3260]: successful

If you now check the output from "dmesg", you’ll see output similar to the following in your logs :

[3688756.079470] scsi0 : iSCSI Initiator over TCP/IP
[3688756.463218] scsi 0:0:0:0: Direct-Access     IET      VIRTUAL-DISK     0    PQ: 0 ANSI: 4
[3688756.580379]  sda: unknown partition table
[3688756.581606] sd 0:0:0:0: [sda] Attached SCSI disk

The last line is important - it tells us which device node the iSCSI disk has been attached as. You can also query this information by running :

iscsiadm -m session -P3

Which will display a lot of information about your iSCSI session, including the device it has created for you.
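If all you’re after is the device name, you can pull the relevant line out of that rather verbose output with something like the following (the exact wording may vary a little between open-iscsi versions) :

iscsiadm -m session -P3 | grep "Attached scsi disk"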

If you go back to your storage server now, you can see your client has connected and logged in to the target :

# cat /proc/net/iet/session
tid:1 name:iqn.2009-02.com.example:test
        sid:562949974196736 initiator:iqn.1993-08.org.debian:01:16ace3ba949f
                cid:0 ip:192.168.1.10 state:active hd:none dd:none

You now have a device on your iSCSI client that you can partition and format, just as if it were a locally attached disk. Give it a try: fire up fdisk on it, create some partitions, format and mount them. You should find it behaves just the same as a local disk, although the speed will be limited by the bandwidth of your link to the storage server.
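For example, assuming the disk appeared as /dev/sda as in the dmesg output above (check yours first - device names aren’t guaranteed), you could create a partition with fdisk and then do something like :

mke2fs -j /dev/sda1
mount /dev/sda1 /mnt
df -h /mnt
umount /mnt

Which formats the new /dev/sda1 partition with ext3, mounts it, checks it’s visible, and unmounts it again.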

Once you’ve finished, make sure any filesystem you have created on the volume is unmounted, and we’ll then log out of the node and delete its record :

# iscsiadm -m node -p 192.168.1.1 -T iqn.2009-02.com.example:test --logout
Logging out of session [sid: 1, target: iqn.2009-02.com.example:test, portal: 192.168.1.1,3260]
Logout of [sid: 1, target: iqn.2009-02.com.example:test, portal: 192.168.1.1,3260]: successful
# iscsiadm -m node -p 192.168.1.1 -T iqn.2009-02.com.example:test -o delete

You should now find that the record for it has been removed from /etc/iscsi/nodes.
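You can double-check by listing the node records again :

iscsiadm -m node

With nothing left to show, this should just complain along the lines of the "iscsiadm: No records found!" message we saw when the package was first installed.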

Multipathing

We’ll now manually log into the target on both paths to our storage server, and combine the two devices into one multipathed, fault-tolerant device that can handle the failure of one path.

Before we start, you’ll want to change a few of the default settings in /etc/iscsi/iscsid.conf - if you want the initiator to log back into any nodes you’ve added automatically when the client reboots, you’ll want to change

node.startup = manual

to

node.startup = automatic

The default timeouts are also far too high when we’re using multipathing - you’ll want to set the following values :

node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 10
node.session.timeo.replacement_timeout = 15

Make sure you restart open-iscsi so these changes get picked up.
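On Lenny, the open-iscsi package ships an init script, so a restart should just be a matter of :

/etc/init.d/open-iscsi restart

We can then manually log into both paths to the storage server :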

iscsiadm -m node -p 192.168.1.1 -T iqn.2009-02.com.example:test -o new
iscsiadm -m node -p 192.168.1.1 -T iqn.2009-02.com.example:test -l
iscsiadm -m node -p 192.168.2.1 -T iqn.2009-02.com.example:test -o new
iscsiadm -m node -p 192.168.2.1 -T iqn.2009-02.com.example:test -l

Note the use of "-o new" to manually specify and add the node, instead of using sendtargets discovery. After this, you should find that you have two devices created - in my case, these were /dev/sda and /dev/sdb. We now need to combine these using multipathing.
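Before going any further, it’s worth confirming that both sessions are up and that the kernel has created a block device for each path - something like :

iscsiadm -m session
cat /proc/partitions

The first command should list two sessions, one per portal, and the iSCSI disks should show up in the partition list as "sd" devices.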

First, install "multipath-tools" :

aptitude install multipath-tools

And then create a configuration file at /etc/multipath.conf with the following contents :

defaults {
        udev_dir                /dev
        polling_interval        10
        selector                "round-robin 0"
        path_grouping_policy    multibus
        getuid_callout          "/lib/udev/scsi_id -g -u -s /block/%n"
        prio_callout            /bin/true
        path_checker            readsector0
        rr_min_io               100
        rr_weight               priorities
        failback                immediate
        no_path_retry           fail
        user_friendly_names     no
}
blacklist {
        devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
        devnode "^hd[a-z][[0-9]*]"
}

The first section sets some defaults for the multipath daemon, including how it should identify devices. The blacklist section lists devices that should not be multipathed so the daemon can ignore them - you can see it’s using regular expressions to exclude a number of entries under /dev, including anything starting with "hd". This will exclude internal IDE devices, for instance. You may need to tune this to your needs, but it should work OK for this example.
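Incidentally, if you’re curious where multipath gets its device identity from, you can run the getuid_callout from the config above by hand against each component device (assuming they’re sda and sdb, as on my client) - both paths should return the same ID, which is how multipath knows they are the same underlying disk :

/lib/udev/scsi_id -g -u -s /block/sda
/lib/udev/scsi_id -g -u -s /block/sdb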

Restart the daemon with

/etc/init.d/multipath-tools restart

And check what it can see with the command "multipath -ll":

# multipath -ll
149455400000000000000000001000000c332000011000000 dm-0 IET     ,VIRTUAL-DISK
[size=1.0G][features=0][hwhandler=0]
\_ round-robin 0 [prio=1][active]
 \_ 1:0:0:0 sda  8:0    [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 2:0:0:0 sdb  8:16   [active][ready]

That long number on the first line of output is the WWID of the multipathed device, which is similar to a MAC address in networking. It’s a unique identifier for this device, and you can see the components below it. You’ll also have a new device created under /dev/mapper :

/dev/mapper/149455400000000000000000001000000c332000011000000 

Which is the multipathed device. You can access this the same as you would the individual devices, but I always find that long WWID a little too cumbersome. Fortunately, you can assign short names to multipathed devices. Just edit /etc/multipath.conf, and add the following section (replacing the WWID with your value) :

multipaths {
        multipath {
                wwid 149455400000000000000000001000000c332000011000000
                alias mpio
        }
}

And restart multipath-tools. When you next run "multipath -ll", you should see the following :

mpio (149455400000000000000000001000000c332000011000000) dm-0 IET     ,VIRTUAL-DISK

And you can now access your volume through /dev/mapper/mpio.

Failing a path

To see what happens when a path fails, try creating a filesystem on your multipathed device (you may wish to partition it first, or you can use the whole device) and then mounting it, e.g. :

mke2fs -j /dev/mapper/mpio
mount /dev/mapper/mpio /mnt

While the volume is mounted, try unplugging one of the storage switches - in this case, I tried pulling the power supply from the switch on the 192.168.2.x network. I then ran "multipath -ll", which paused for a short time (the timeout values set above), and then I saw the following :

sdb: checker msg is "directio checker reports path is down"
mpio (149455400000000000000000001000000c332000011000000) dm-0 IET     ,VIRTUAL-DISK
[size=1.0G][features=0][hwhandler=0]
\_ round-robin 0 [prio=1][active]
 \_ 3:0:0:0 sda  8:0    [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 4:0:0:0 sdb  8:16   [active][faulty]

So, one path to our storage is unavailable - you can see it marked above as faulty. However, as the 192.168.1.x network path is still available, IO can continue to the remaining "sda" component of the device. The volume was still mounted, and I could carry on copying data to and from it. I then plugged the switch back in, and after a short pause, multipathd shows both paths as active again :

# multipath -ll
mpio (149455400000000000000000001000000c332000011000000) dm-0 IET     ,VIRTUAL-DISK
[size=1.0G][features=0][hwhandler=0]
\_ round-robin 0 [prio=1][active]
 \_ 3:0:0:0 sda  8:0    [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 4:0:0:0 sdb  8:16   [active][ready]

You now have a resilient, fault-tolerant iSCSI SAN!

That’s it for this part - in the next part, I’ll add an NFS server to the mix, tie off a few loose ends, and discuss some performance tuning issues, as well as post some scripts I’ve written to automate some of this.