ZFS RaidZ Striping

Recently on the ZFS mailing list (see http://wiki.illumos.org/display/illumos/illumos+Mailing+Lists), there was some discussion about how ZFS distributes data across disks. I thought I might show some work I've done to better understand this.

The disk blocks for a raidz/raidz2/raidz3 vdev are striped across all of the disks in the vdev. So, for instance, the data for block offset 0xc00000 with size 0x20000 (as reported by zdb(1M)) could be striped at different locations and in various sizes on the individual disks within the raidz volume. In other words, the offsets and sizes are absolute with respect to the volume, not to the individual disks.
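As a concrete sketch of that arithmetic (a simplified extract for illustration, not the actual ZFS code, and with the helper name made up here), the first few lines of vdev_raidz_map_alloc() turn a volume-absolute block offset into a starting column (disk) and a per-disk byte offset by dealing sectors out round-robin:

```c
#include <stdint.h>

/*
 * Sketch: given a block's absolute byte offset within a raidz vdev,
 * compute the child-disk column the block starts in and the byte
 * offset on that disk, as the opening lines of vdev_raidz_map_alloc()
 * do.  unit_shift is the sector shift (9 for 512-byte sectors) and
 * dcols is the number of disks in the vdev, parity included.
 */
static void
raidz_start_of_block(uint64_t offset, uint64_t unit_shift, uint64_t dcols,
    uint64_t *first_col, uint64_t *disk_offset)
{
	uint64_t b = offset >> unit_shift;	/* absolute sector number */

	*first_col = b % dcols;			  /* column the block starts in */
	*disk_offset = (b / dcols) << unit_shift; /* byte offset on that disk */
}
```

For the 5-disk example later in this post, the block at volume offset 0x14c00 starts in column 1 at disk offset 0x4200.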

The work of mapping a raidz pool offset and size to individual disks within the pool is done by vdev_raidz_map_alloc(). (Note that this routine has been changed in SmartOS to support writing system crash dumps to raidz volumes, a change that will eventually be pushed upstream to illumos.)

Let's go through an example. First, we'll set up a raidz pool and put some data into it.

# mkfile 100m /var/tmp/f0 /var/tmp/f1 /var/tmp/f2 /var/tmp/f3 /var/tmp/f4
# zpool create rzpool raidz /var/tmp/f0 /var/tmp/f1 /var/tmp/f2 /var/tmp/f3 /var/tmp/f4
# cp /usr/dict/words /rzpool/words
#

And now let's see the blocks assigned to the /rzpool/words file.

# zdb -dddddddd rzpool
...
    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         8    2    16K   128K   259K   256K  100.00  ZFS plain file (K=inherit) (Z=inherit)
                                        168   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED
        dnode maxblkid: 1
        path    /words
        uid     0
        gid     0
        atime   Thu Apr 11 11:46:09 2013
        mtime   Thu Apr 11 11:46:09 2013
        ctime   Thu Apr 11 11:46:09 2013
        crtime  Thu Apr 11 11:46:09 2013
        gen     7
        mode    100444
        size    206674
        parent  4
        links   1
        pflags  40800000004
Indirect blocks:
               0 L1  0:64c00:800 0:5c14c00:800 4000L/400P F=2 B=7/7
               0  L0 0:14c00:28000 20000L/20000P F=1 B=7/7
           20000  L0 0:3cc00:28000 20000L/20000P F=1 B=7/7

                segment [0000000000000000, 0000000000040000) size  256K

So, there are two blocks, one at offset 0x14c00 and the other at offset 0x3cc00, both of them 0x28000 bytes. Of the 0x28000 bytes, 0x8000 is parity; the real data size is 0x20000. The question is, where does this data physically reside, i.e., on which disk(s), and where on those disks?
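That 0x8000 of parity follows directly from the arithmetic in vdev_raidz_map_alloc(): with 512-byte sectors, 0x20000 bytes is 0x100 sectors; spread over the 4 data disks of a 5-wide raidz1, that is q = 0x40 full rows, and each full row carries one parity sector, for a total of 0x100 + 0x40 = 0x140 sectors, i.e. 0x28000 bytes. A quick check of that computation (a sketch with a made-up function name, mirroring the q/r/tot lines of the kernel routine):

```c
#include <stdint.h>

/*
 * Total (data + parity) sector count for a block, following the
 * q/r/tot computation in vdev_raidz_map_alloc().
 */
static uint64_t
raidz_total_sectors(uint64_t size, uint64_t unit_shift,
    uint64_t dcols, uint64_t nparity)
{
	uint64_t s = size >> unit_shift;	/* data sectors */
	uint64_t q = s / (dcols - nparity);	/* full rows */
	uint64_t r = s - q * (dcols - nparity);	/* leftover data sectors */

	/* each full row, plus a partial row if any, gets parity */
	return (s + nparity * (q + (r == 0 ? 0 : 1)));
}
```

For size 0x20000, shift 9, 5 disks, and 1 parity this returns 0x140 sectors, which is 0x28000 bytes.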

I wrote a small program that uses the code from the vdev_raidz_map_alloc() routine to tell me the mapping that is set up. Here it is:

/*
 * Given an offset, size, number of disks in the raidz pool,
 * the number of parity "disks" (1, 2, or 3 for raidz, raidz2, raidz3),
 * and the sector size (shift),
 * print a set of stripes.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/sysmacros.h>	/* MIN, roundup */

/*
 * The following are taken straight from usr/src/uts/common/fs/zfs/vdev_raidz.c
 * If they change there, they need to be changed here.
 *
 * a map of columns returned for a given offset and size
 */
typedef struct raidz_col {
	uint64_t rc_devidx;		/* child device index for I/O */
	uint64_t rc_offset;		/* device offset */
	uint64_t rc_size;		/* I/O size */
	void *rc_data;			/* I/O data */
	void *rc_gdata;			/* used to store the "good" version */
	int rc_error;			/* I/O error for this device */
	uint8_t rc_tried;		/* Did we attempt this I/O column? */
	uint8_t rc_skipped;		/* Did we skip this I/O column? */
} raidz_col_t;

typedef struct raidz_map {
	uint64_t rm_cols;		/* Regular column count */
	uint64_t rm_scols;		/* Count including skipped columns */
	uint64_t rm_bigcols;		/* Number of oversized columns */
	uint64_t rm_asize;		/* Actual total I/O size */
	uint64_t rm_missingdata;	/* Count of missing data devices */
	uint64_t rm_missingparity;	/* Count of missing parity devices */
	uint64_t rm_firstdatacol;	/* First data column/parity count */
	uint64_t rm_nskip;		/* Skipped sectors for padding */
	uint64_t rm_skipstart;		/* Column index of padding start */
	void *rm_datacopy;		/* rm_asize-buffer of copied data */
	uintptr_t rm_reports;		/* # of referencing checksum reports */
	uint8_t rm_freed;		/* map no longer has referencing ZIO */
	uint8_t rm_ecksuminjected;	/* checksum error was injected */
	raidz_col_t rm_col[1];		/* Flexible array of I/O columns */
} raidz_map_t;

/*
 * vdev_raidz_map_get() is hacked from vdev_raidz_map_alloc() in
 * usr/src/uts/common/fs/zfs/vdev_raidz.c.  If that routine changes,
 * this might also need changing.
 */
raidz_map_t *
vdev_raidz_map_get(uint64_t size, uint64_t offset, uint64_t unit_shift,
    uint64_t dcols, uint64_t nparity)
{
	raidz_map_t *rm;
	uint64_t b = offset >> unit_shift;
	uint64_t s = size >> unit_shift;
	uint64_t f = b % dcols;
	uint64_t o = (b / dcols) << unit_shift;
	uint64_t q, r, c, bc, col, acols, scols, coff, devidx, asize, tot;

	q = s / (dcols - nparity);
	r = s - q * (dcols - nparity);
	bc = (r == 0 ? 0 : r + nparity);
	tot = s + nparity * (q + (r == 0 ? 0 : 1));

	if (q == 0) {
		acols = bc;
		scols = MIN(dcols, roundup(bc, nparity + 1));
	} else {
		acols = dcols;
		scols = dcols;
	}

	rm = malloc(offsetof(raidz_map_t, rm_col[scols]));
	if (rm == NULL) {
		fprintf(stderr, "malloc failed\n");
		exit(1);
	}

	rm->rm_cols = acols;
	rm->rm_scols = scols;
	rm->rm_bigcols = bc;
	rm->rm_skipstart = bc;
	rm->rm_missingdata = 0;
	rm->rm_missingparity = 0;
	rm->rm_firstdatacol = nparity;
	rm->rm_datacopy = NULL;
	rm->rm_reports = 0;
	rm->rm_freed = 0;
	rm->rm_ecksuminjected = 0;

	asize = 0;

	for (c = 0; c < scols; c++) {
		col = f + c;
		coff = o;
		if (col >= dcols) {
			col -= dcols;
			coff += 1ULL << unit_shift;
		}
		rm->rm_col[c].rc_devidx = col;
		rm->rm_col[c].rc_offset = coff;
		rm->rm_col[c].rc_data = NULL;
		rm->rm_col[c].rc_gdata = NULL;
		rm->rm_col[c].rc_error = 0;
		rm->rm_col[c].rc_tried = 0;
		rm->rm_col[c].rc_skipped = 0;

		if (c >= acols)
			rm->rm_col[c].rc_size = 0;
		else if (c < bc)
			rm->rm_col[c].rc_size = (q + 1) << unit_shift;
		else
			rm->rm_col[c].rc_size = q << unit_shift;

		asize += rm->rm_col[c].rc_size;
	}

	rm->rm_asize = roundup(asize, (nparity + 1) << unit_shift);
	rm->rm_nskip = roundup(tot, nparity + 1) - tot;

	if (rm->rm_firstdatacol == 1 && (offset & (1ULL << 20))) {
		devidx = rm->rm_col[0].rc_devidx;
		o = rm->rm_col[0].rc_offset;
		rm->rm_col[0].rc_devidx = rm->rm_col[1].rc_devidx;
		rm->rm_col[0].rc_offset = rm->rm_col[1].rc_offset;
		rm->rm_col[1].rc_devidx = devidx;
		rm->rm_col[1].rc_offset = o;
		if (rm->rm_skipstart == 0)
			rm->rm_skipstart = 1;
	}
	return (rm);
}

int
main(int argc, char *argv[])
{
	uint64_t offset = 0;
	uint64_t size = 0;
	uint64_t dcols = 0;
	uint64_t nparity = 1;
	uint64_t unit_shift = 9;	/* shouldn't be hard-coded.  sector size */
	raidz_map_t *rzm;
	raidz_col_t *cols;
	int i;

	if (argc < 4) {
		fprintf(stderr, "Usage: %s offset size ndisks [nparity [ashift]]\n", argv[0]);
		fprintf(stderr, "  ndisks is number of disks in raid pool, including parity\n");
		fprintf(stderr, "  nparity defaults to 1 (raidz1)\n");
		fprintf(stderr, "  ashift defaults to 9 (512-byte sectors)\n");
		exit(1);
	}
	/* XXX - check return values */
	offset = strtoull(argv[1], NULL, 16);
	size = strtoull(argv[2], NULL, 16);
	dcols = strtoull(argv[3], NULL, 16);
	if (size == 0 || dcols == 0) {	/* should check size multiple of ashift... */
		fprintf(stderr, "size and/or number of columns must be > 0\n");
		exit(1);
	}
	if (argc > 4)
		nparity = strtoull(argv[4], NULL, 16);
	if (argc == 6)
		unit_shift = strtoull(argv[5], NULL, 16);

	rzm = vdev_raidz_map_get(size, offset, unit_shift, dcols, nparity);

	printf("cols = %lu, firstdatacol = %lu\n", rzm->rm_cols, rzm->rm_firstdatacol);
	for (i = 0, cols = &rzm->rm_col[0]; i < rzm->rm_cols; i++, cols++)
		printf("%lu:%lx:%lx\n", cols->rc_devidx, cols->rc_offset, cols->rc_size);
	exit(0);
}

The program takes an offset, size, and number of disks in the pool, and optionally the number of parity disks (1 for raidz, 2 for raidz2, and 3 for raidz3) and the sector size (shift), and outputs the location of the blocks on the underlying disks. Let's try it.

# gcc -m64 raidzdump.c -o raidzdump
# ./raidzdump 14c00 20000 5
cols = 5, firstdatacol = 1
1:4200:8000
2:4200:8000
3:4200:8000
4:4200:8000
0:4400:8000

The parity for the block is on disk 1 (/var/tmp/f1), the first 32k of data is on disk 2 (/var/tmp/f2), the second 32k on disk 3, etc. We could use zdb(1M) to check this, except there is a bug (see Bug #3659). But the following works on older versions of illumos and on Solaris 11.

# zdb -R rzpool 0.2:4200:8000:r
Found vdev: /var/tmp/f2
assertion failed for thread 0xfffffd7fff172a40, thread-id 1: vd->vdev_parent == (pio->io_vd ? pio->io_vd : pio->io_spa->spa_root_vdev), file ../../../uts/common/fs/zfs/zio.c, line 827
Abort (core dumped)
#

This says to go to vdev 2 (/var/tmp/f2, a child of the root vdev 0), at location 0x4200, read 0x8000 (32k) of data, and display it.

Since zdb(1M) is currently broken for this, let's try a different way. We'll add 0x4200 to the 4MB of disk label at the beginning of every disk to get an absolute byte offset within the disk, and then use dd to look at the data.
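In code form, the conversion is just one addition (the helper name here is made up; 0x400000 is the 4MB label/boot region that precedes the allocatable space on each leaf vdev):

```c
#include <stdint.h>

/* 4MB of label/boot area at the front of every leaf vdev */
#define	VDEV_LABEL_START	0x400000ULL

/*
 * Turn a column offset reported by raidzdump into an absolute byte
 * offset within the backing file or disk.
 */
static uint64_t
disk_byte_offset(uint64_t col_offset)
{
	return (col_offset + VDEV_LABEL_START);
}
```

So column offset 0x4200 lands at absolute byte 0x404200, which is 4211200 in decimal, the value we pass to dd below.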

# mdb
> 4200+400000=E
                4211200
> $q
# dd if=/var/tmp/f2 bs=1 iseek=4211200 count=32k
10th
1st
2nd
3rd
4th
5th
6th
7th
8th
9th
a
AAA
AAAS
Aarhus
...

And there is the first 32k of the words file. To get the next 32k, use the offset from the third line of output from the raidzdump utility, and so on. This gets more interesting with smaller blocks, but that is left as an exercise for the reader. For instance, a write of a 512-byte file will stripe across 2 disks, one for the data, the other for the parity. Note that I have not tested with raidz2 or raidz3. I expect the first data column to be 2 and 3 respectively, but the code should work... You need to specify the parity (2 or 3) to raidzdump as an argument.
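The 512-byte case can be checked with the same column arithmetic: with s = 1 data sector, q = 0 and r = 1, so the map has bc = r + nparity = 2 columns, one parity and one data. A sketch of that count (made-up function name, following the q/r/bc lines of vdev_raidz_map_alloc()):

```c
#include <stdint.h>

/*
 * Number of columns (disks touched) for a block, per the q/r/bc
 * computation in vdev_raidz_map_alloc().
 */
static uint64_t
raidz_column_count(uint64_t size, uint64_t unit_shift,
    uint64_t dcols, uint64_t nparity)
{
	uint64_t s = size >> unit_shift;
	uint64_t q = s / (dcols - nparity);
	uint64_t r = s - q * (dcols - nparity);
	uint64_t bc = (r == 0 ? 0 : r + nparity);

	/* small blocks (no full row) only touch bc disks */
	return (q == 0 ? bc : dcols);
}
```

On the 5-disk raidz1 pool, a 0x200-byte block touches 2 disks, while the 0x20000-byte blocks of our words file touch all 5.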

Have fun!



Post written by Mr. Max Bruning