ZFS (Part 1)

Over the last year I have been getting more and more curious and excited about OpenSolaris. Specifically, I got interested in ZFS, Sun’s new filesystem/volume manager.

So I finally got my act together and gave it a whirl.

Test system: Pentium 4, 3.0 GHz, in an MSI P4N SLI motherboard. Three ATA Seagate ST3300831A hard drives, one Maxtor 6L300R0 ATA drive (all are nominally 300 gigs; see previous post on slight capacity differences). One Western Digital WDC WD800JD-60LU SATA 80 gig hard drive. Solaris Express Community Release (SXCR) 51.

Originally I started this project running SXCR 41, but back then I only had three 300 gig drives, and that was interfering with my plans for RAID 5 greatness. In the end the wait was worth it, as ZFS has been revved since.

A bit about the MSI motherboard. I like it. For a PC system I like it a lot. It has two PCI slots, two full-length PCIe slots (16x), and one PCIe 1x slot. Technically it supports SLI with two ATI CrossFire or Nvidia SLI capable cards, however in that case both full-length slots will run at 8x. With a single card, the slot runs at 16x. Two dual channel IDE connectors, four SATA connectors, built-in high end audio with SPDIF, built-in GigE NIC based on a Marvell chipset/PHY, serial, parallel, built-in IEEE1394 (iLink/Firewire) with 3 ports (one on the back of the board, two more can be brought out). Plenty of USB 2.0 connectors (4 brought out on the back of the board, 6 more can be brought out from connector banks on the motherboard). Overall, pretty shiny.

My setup consists of four IDE hard drives on the IDE bus, and an 80 gig WD on the SATA bus for the OS. The motherboard BIOS allowed me to specify that I want to boot from the SATA drive first, so I took advantage of the offer.

Installation of SXCR was from an IDE DVD drive (a pair of hard drives was unplugged for the time).
SXCR recognized pretty much everything in the system, except the built-in Marvell GigE NIC. Shit happens, I tossed in a PCI 3Com 3c509C NIC that I had kicking around, and restarted. There was a bit of a hold up with the SATA drive: Solaris didn’t recognize it, and wanted the geometry (number of cylinders, heads and sectors) so that it could create an appropriate volume label. Luckily WD made an identical drive in an IDE configuration, for which it actually provided the cylinders/heads/sectors information, so I plugged those numbers in, and format and fdisk cheered up.

Other than that, a normal Solaris install. I did a console/text install just because I am a lot more familiar with them, however the Radeon Sapphire X550 PCIe video card was recognized, and the system happily boots into OpenWindows/CDE if you want it to.

So I proceeded to create a ZFS pool.
The first thing I wanted to check is how portable ZFS is. Specifically, Sun claims that it’s endianness neutral (i.e. I can connect the same drives to a little endian PC, or a big endian SPARC system, and as long as both run an OS that recognizes ZFS, things will work). I wondered how it deals with device numbers. Traditionally Solaris is very picky about device IDs, and changing things like controllers or SCSI IDs on a system can be tricky.

Here I wanted to know if I could just create, say, a “travelling zfs pool”, where I’d have an external enclosure with a few SATA drives and an internal PCI SATA controller card, and if things went wrong in a particular system, I could always unplug the drives, move them to a different system, and things would work. So I wanted to find out if ZFS can deal with changes in device IDs.

ZFS works best when it is given whole drives. In that case it writes an EFI disk label on each drive, with a unique identifier. Note that certain PC motherboards choke on EFI disk labels, and refuse to boot. Luckily most of the time this is fixable with a BIOS update.

root@dara:/[03:00 AM]# uname -a
SunOS dara.NotBSD.org 5.11 snv_51 i86pc i386 i86pc
root@dara:/[03:00 AM]# zpool create raid1 raidz c0d0 c0d1 c1d0 c1d1
root@dara:/[03:01 AM]# zpool status
  pool: raid1
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        raid1       ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c0d0    ONLINE       0     0     0
            c0d1    ONLINE       0     0     0
            c1d0    ONLINE       0     0     0
            c1d1    ONLINE       0     0     0

errors: No known data errors
root@dara:/[03:02 AM]# zpool list
NAME                    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
raid1                  1.09T    238K   1.09T     0%  ONLINE     -
root@dara:/[03:02 AM]# df -h /raid1 
Filesystem             size   used  avail capacity  Mounted on
raid1                  822G    37K   822G     1%    /raid1
root@dara:/[03:02 AM]# 

Here I created a raidz1 pool, the ZFS equivalent of RAID 5 with one parity disk, giving me (N-1) * [capacity of one drive] of usable space. A raidz1 pool can survive the death of one hard drive. A ZFS pool can also be created with the raidz2 keyword, giving the equivalent of RAID 5 with two parity disks; such a configuration can survive the death of two disks.

Note the difference in size that zpool list and df report. zpool list shows the raw capacity of the pool, including the space that parity will take up. df shows the more traditional usable disk space, with parity already subtracted. Using df will likely cause less confusion in normal operation.
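
The gap between the two numbers can be sanity-checked with a bit of arithmetic. The sketch below assumes four drives of exactly 300 * 10^9 bytes each (the real drives differ slightly) and ignores any filesystem overhead, so it only approximates the figures above.

```python
# Rough capacity check for a 4-disk raidz1 pool.
# Assumption: every drive is exactly 300 * 10**9 bytes; real drives differ a bit.
DRIVE_BYTES = 300 * 10**9
DRIVES = 4
PARITY = 1  # raidz1 gives up one drive's worth of space; raidz2 would give up 2

TIB = 2**40
GIB = 2**30

# Raw size: every drive counted, parity space included.
raw_tib = DRIVES * DRIVE_BYTES / TIB

# Usable size: (N - parity) drives' worth of space.
usable_gib = (DRIVES - PARITY) * DRIVE_BYTES / GIB

print(f"raw pool size: {raw_tib:.2f} TiB")
print(f"usable space:  {usable_gib:.0f} GiB")
```

The raw figure comes out at about 1.09 TiB, matching zpool list, and the usable figure at about 838 GiB, in the neighbourhood of the 822G that df reports. The remaining difference presumably comes from the drives' real capacities and filesystem overhead.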

So far so good.

Then I proceeded to create a large file on the ZFS pool:

root@dara:/raid1[03:04 AM]# time mkfile 10g reely_beeg_file

real    2m8.943s
user    0m0.062s
sys     0m5.460s
root@dara:/raid1[03:06 AM]# ls -la /raid1/reely_beeg_file 
-rw------T   1 root     root     10737418240 Nov 10 03:06 /raid1/reely_beeg_file
root@dara:/raid1[03:06 AM]#

While this was running, I had zpool iostat -v raid1 10 going in a different window.

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
raid1        211M  1.09T      0    187      0  18.7M
  raidz1     211M  1.09T      0    187      0  18.7M
    c1d0        -      -      0    110      0  6.26M
    c1d1        -      -      0    110      0  6.27M
    c0d0        -      -      0    110      0  6.25M
    c0d1        -      -      0     94      0  6.23M
----------  -----  -----  -----  -----  -----  -----

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
raid1       1014M  1.09T      0    601      0  59.5M
  raidz1    1014M  1.09T      0    601      0  59.5M
    c1d0        -      -      0    364      0  20.0M
    c1d1        -      -      0    363      0  20.0M
    c0d0        -      -      0    355      0  19.9M
    c0d1        -      -      0    301      0  19.9M
----------  -----  -----  -----  -----  -----  -----

[...]
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
raid1       8.78G  1.08T      0    778    363  91.1M
  raidz1    8.78G  1.08T      0    778    363  91.1M
    c1d0        -      -      0    412      0  30.4M
    c1d1        -      -      0    411  5.68K  30.4M
    c0d0        -      -      0    411  5.68K  30.4M
    c0d1        -      -      0    383  5.68K  30.4M
----------  -----  -----  -----  -----  -----  -----
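
One thing worth noticing in this output: the per-disk write bandwidth adds up to more than the pool-level figure. A plausible reading (an assumption, not something the output itself states) is that the per-disk numbers include parity while the pool line counts data only. With four disks in raidz1 that would put the pool figure at about 3/4 of the per-disk sum. A quick check against the last sample:

```python
# Write bandwidth from the last iostat sample above, in the units zpool prints (M).
per_disk_write = [30.4, 30.4, 30.4, 30.4]
pool_write = 91.1

disks = len(per_disk_write)
parity_disks = 1  # raidz1

# Assumption: per-disk figures include parity, the pool-level figure does not.
data_fraction = (disks - parity_disks) / disks
estimated_pool_write = sum(per_disk_write) * data_fraction

print(f"sum of per-disk writes: {sum(per_disk_write):.1f}M")
print(f"estimated data-only:    {estimated_pool_write:.1f}M")
print(f"reported for the pool:  {pool_write:.1f}M")
```

The estimate lands within rounding distance of the reported pool bandwidth, which fits the assumption but does not prove it.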

10 gigabytes written over 128 seconds. About 80 megabytes a second on continuous writes. I think I can live with that.
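
For the record, here is that arithmetic spelled out, using the file size from ls and the "real" time from the transcript above. The helper name parse_real_time is made up for this sketch.

```python
import re

def parse_real_time(text):
    """Convert a shell `time` value such as '2m8.943s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", text)
    if match is None:
        raise ValueError(f"unrecognized time format: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)

file_bytes = 10737418240               # size of reely_beeg_file as shown by ls -la
elapsed = parse_real_time("2m8.943s")  # 'real' time reported for mkfile

# Same rate in binary megabytes (2**20 bytes) and decimal megabytes (10**6 bytes).
mib_per_s = file_bytes / elapsed / 2**20
mb_per_s = file_bytes / elapsed / 10**6

print(f"{mib_per_s:.1f} MiB/s, {mb_per_s:.1f} MB/s")
```

Depending on whether a megabyte is counted as 2^20 or 10^6 bytes, that works out to roughly 79 or 83 megabytes a second.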

Next I wanted to run md5 digests of some files on /raid1, then export the pool, shut the system down, switch around the IDE cables, boot the system back up, reimport the pool, and re-run the md5 digests. This would simulate moving a disk pool to a different system, screwing up the disk ordering in the process.

root@dara:/[12:20 PM]# digest -a md5 /raid1/*
(/raid1/reely_beeg_file) = 2dd26c4d4799ebd29fa31e48d49e8e53
(/raid1/sunstudio11-ii-20060829-sol-x86.tar.gz) = e7585f12317f95caecf8cfcf93d71b3e
root@dara:/[12:23 PM]# zpool status
  pool: raid1
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        raid1       ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c0d0    ONLINE       0     0     0
            c0d1    ONLINE       0     0     0
            c1d0    ONLINE       0     0     0
            c1d1    ONLINE       0     0     0

errors: No known data errors
root@dara:/[12:23 PM]# zpool export raid1
root@dara:/[12:23 PM]# zpool status
no pools available
root@dara:/[12:23 PM]#

The system was shut down, the IDE cables were switched around, and the system was rebooted.

root@dara:/[02:09 PM]# zpool status
no pools available
root@dara:/[02:09 PM]# zpool import raid1
root@dara:/[02:11 PM]# zpool status
  pool: raid1
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        raid1       ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c1d0    ONLINE       0     0     0
            c1d1    ONLINE       0     0     0
            c0d0    ONLINE       0     0     0
            c0d1    ONLINE       0     0     0

errors: No known data errors
root@dara:/[02:11 PM]# 

Notice that the order of the drives changed. It was c0d0 c0d1 c1d0 c1d1, and now it’s c1d0 c1d1 c0d0 c0d1.

root@dara:/[02:22 PM]# digest -a md5 /raid1/*
(/raid1/reely_beeg_file) = 2dd26c4d4799ebd29fa31e48d49e8e53
(/raid1/sunstudio11-ii-20060829-sol-x86.tar.gz) = e7585f12317f95caecf8cfcf93d71b3e
root@dara:/[02:25 PM]#

Same digests.
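
With only two files, comparing the digests by eye is easy. For a bigger pool the same before/after check could be scripted. What follows is a minimal sketch, not what was run here: the function names are invented for illustration, and the demo works on a throwaway temporary directory rather than /raid1.

```python
import hashlib
import os
import tempfile

def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def digest_directory(directory):
    """Map each regular file name in `directory` to its MD5 digest."""
    return {
        name: md5_of_file(os.path.join(directory, name))
        for name in sorted(os.listdir(directory))
        if os.path.isfile(os.path.join(directory, name))
    }

def changed_files(before, after):
    """Return names whose digest differs, or that exist on only one side."""
    return sorted(
        name for name in before.keys() | after.keys()
        if before.get(name) != after.get(name)
    )

# Demo on a temporary directory. On the real system, `before` would be taken
# prior to the export and `after` once the pool has been imported again.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "sample.bin"), "wb") as handle:
        handle.write(b"\0" * 4096)
    before = digest_directory(demo_dir)
    after = digest_directory(demo_dir)
    unchanged = changed_files(before, after) == []

print("all digests match" if unchanged else "digests differ")
```

An empty list from changed_files means every file hashed the same before and after the move.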

Oh, and a very neat feature... Want to know what has been happening with your disk pools?

root@dara:/[02:12 PM]# zpool history raid1
History for 'raid1':
2006-11-10.03:01:56 zpool create raid1 raidz c0d0 c0d1 c1d0 c1d1
2006-11-10.12:19:47 zpool export raid1
2006-11-10.12:20:07 zpool import raid1
2006-11-10.12:39:49 zpool export raid1
2006-11-10.12:46:14 zpool import raid1
2006-11-10.14:09:54 zpool export raid1
2006-11-10.14:11:00 zpool import raid1

Yes, ZFS logs the last bunch of commands onto the pool devices themselves. So even if you move the pool to a different system, the command history will still be with you.

Lastly, some versioning history for ZFS:

root@dara:/[02:19 PM]# zpool upgrade raid1 
This system is currently running ZFS version 3.

Pool 'raid1' is already formatted using the current version.
root@dara:/[02:19 PM]# zpool upgrade -v
This system is currently running ZFS version 3.

The following versions are suppored:

VER  DESCRIPTION
---  --------------------------------------------------------
 1   Initial ZFS version
 2   Ditto blocks (replicated metadata)
 3   Hot spares and double parity RAID-Z

For more information on a particular version, including supported releases, see:

http://www.opensolaris.org/os/community/zfs/version/N

Where 'N' is the version number.
root@dara:/[02:19 PM]#