Mac ZFS Gotcha #1: Failmode defaults to Kernel Panic!

Mac ZFS Kernel Panic Crashing

Badness level, on a scale of 1 (not bad) - 100 (WTF?): 100

First up is the most egregious behavior I've seen so far with Mac ZFS. As I've been experimenting, I've noticed that it fails in the most ungraceful way possible (kernel panic and immediate Grey Screen of Death) when any of these conditions occurs:

  • You're using an external USB/FW drive for ZFS and you unplug the drive before you unmount its ZFS volumes and export the pool on that drive with "zpool export -f poolname".
  • A vdev/drive in a zpool fails in a way that puts the pool into a FAULTED state. On a RAIDZ (single-parity), losing 2 drives causes the kernel panic. On a RAIDZ2 (double-parity), losing 3 drives does the same. So does losing both halves of a 2-way mirror.
  • A drive goes to sleep or gets spun down (like on a laptop, or if you have spin-down enabled in your Energy Saver settings).
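
For the first case, the safe sequence is easy to script. Here is a minimal sketch, assuming a pool named "tank" (a made-up name) and a hypothetical helper called export_before_unplug. The only real command in it is "zpool export", which unmounts the pool's datasets before marking the pool as exported. The helper defaults to a dry run that just prints the command, so nothing touches your pools unless you set DRY_RUN=0.

```shell
# Hypothetical helper: print commands by default, run them only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: export a pool so its external drive can be unplugged.
# "zpool export" unmounts the pool's datasets as part of the export.
export_before_unplug() {
    pool=$1
    if [ -z "$pool" ]; then
        echo "usage: export_before_unplug POOLNAME" >&2
        return 2
    fi
    run zpool export "$pool"
}
```

Calling "export_before_unplug tank" in dry-run mode only prints the command it would run. With DRY_RUN=0 it runs the real export, and you should only pull the cable once that returns without an error.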

Apparently ZFS panics (literally!) when there are asynchronous writes that have already returned "success" but can no longer reach the disk, which would leave the on-disk state inconsistent. The supposed reason for defaulting to panics is "maintaining data integrity" because "ZFS cannot guarantee that the information in the cache, ZIL, and media will be consistent." [original post].

The obvious problem with this is that the second any of your zpools loses more vdevs than its redundancy can absorb, it takes down the whole system that's hosting that zpool, and whatever 10 or 15 other apps you had open get unceremoniously dumped on their asses when ZFS decides to panic. You would pretty much have to run a separate dedicated Mac ZFS fileserver to keep these kernel panics from taking down your workstation every time.

The first few times I thought it was just my older Firewire enclosure, or the old drives I was testing with, or just a bug in the ZFS build. But after seeing this happen even on internal drives and with the latest build-119, I found that lots of others have run into it as well, per the discussion lists for Mac ZFS. An un-exported hot unplug is about the best guaranteed kernel panic you can find. I've verified this on both Intel and PowerPC Leopard Macs myself (too many times, unfortunately).

The semi-good news is that in more advanced versions of ZFS (like in Solaris 10 10/08) there is a new zpool property called "failmode":

(from a Solaris 10 10/08 server)
# zpool get all rpool

rpool  size         67.5G             -
rpool  used         5.78G             -
rpool  available    61.7G             -
rpool  capacity     8%                -
rpool  altroot      -                 default
rpool  health       ONLINE            -
rpool  guid         5262979625119216723  -
rpool  version      10                default
rpool  bootfs       rpool/ROOT/rpool  local
rpool  delegation   on                default
rpool  autoreplace  off               default
rpool  cachefile    -                 default
rpool  failmode     continue          local

The failmode can be set to "continue", "panic", or "wait". Using "continue" or "wait" sure sounds nicer than "panic", doesn't it?
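
On a system whose ZFS actually has this property (like the Solaris box above), changing it is a one-liner with "zpool set". Here is a minimal sketch, assuming the pool name "rpool" from the listing above. The helper name set_failmode and the dry-run wrapper are my own inventions for illustration, not part of ZFS. It defaults to printing the commands instead of running them.

```shell
# Hypothetical helper: print commands by default, run them only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: set a pool's failmode, then read it back.
#   wait     - block I/O until the device comes back
#   continue - return an error on new writes instead of blocking
#   panic    - crash the system
set_failmode() {
    pool=$1
    mode=$2
    case "$mode" in
        wait|continue|panic) ;;
        *)
            echo "invalid failmode: $mode (use wait, continue, or panic)" >&2
            return 1
            ;;
    esac
    run zpool set "failmode=$mode" "$pool" &&
        run zpool get failmode "$pool"
}
```

For example, "set_failmode rpool continue" in dry-run mode prints the "zpool set" and "zpool get" commands it would issue. With DRY_RUN=0 it runs them for real, and it needs root just like the "zpool get all" above.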

As a veteran systems administrator and long-time Mac user, though, I find it mind-boggling that Mac ZFS made a kernel panic the first option when vdevs become unreachable. Unless it's your boot drive, shouldn't a kernel panic be the LAST resort?

Not only that, given the worsening failure rates of the new 1TB+ drives out there now, I think a drive failure is more likely these days than ever, and far more data is at stake as we cram more onto these drives. Why do they think there's such interest in ZFS and other data protection technology?

The even worse news is that Apple is apparently focusing its ZFS efforts on whatever is going to be included in Snow Leopard, and not much will be done to backport those features and bug fixes to the Leopard ZFS. [original post].

Sure, Apple always does whatever they want, and the whole Mac ZFS project is kind of beta anyway. But requiring Snow Leopard for a robust ZFS implementation will basically strand all the PowerPC ZFS users since Snow Leopard will be Intel-only. I don't have any problems with Snow Leopard being Intel-only, but there are still a lot of PowerPC Macs in use out there.

I know Apple hates legacy baggage (serial ports, floppy drives, mouse buttons, SCSI, ADB) and has a relentless push towards the latest-and-greatest. However, Apple is also trying to be a "green" company, right? What could be more green than keeping tens or hundreds of thousands of older systems off of a barge to China where all the toxic waste is spewed out during "recycling"? One thing that would help for sure, and could be done entirely in software, is keeping ZFS viable on PowerPC/Leopard.