chunkfs... or how you can use divide-and-conquer to keep fsck times in bound
----------------------------------------------------------------------------

At the filesystem workshop we discussed the idea of chunkfs. But what is it
exactly? This document tries to answer this question both on a high level
and on a more detailed technical level.

First of all, in the text below I'm assuming a set of numbers and sizes that
are just a very wild guess, and a lot more investigation needs to be done to
get actual good values for those parameters. However, to keep the
explanation readable I'm not going to mention this caveat every time; this
document is about the concept, not about the exact numbers.

Prime goals of chunkfs
----------------------

1) Deal with fsck times on huge filesystems.
   If you think you'll never need to fsck because you use a journalling
   filesystem, you are in for a rough ride. Hardware is not perfect, and
   with increasing disk sizes, the probability of hitting an IO error or
   any other kind of corruption goes up to the point that these events are
   no longer just theory. Losing a filesystem is bad, so you need a fsck.
   But having to wait 2 weeks for fsck to finish is nearly as bad; you
   could have restored from tape quicker than that.
   The current generation filesystems, including even XFS, suffer from
   really long fsck times on large disks. On current generation Big
   Arrays(tm), doing a parallel fsck can save you a lot of time (by using
   all spindles in your Big Array(tm) in parallel), but in a few years a
   single consumer disk will be 8 or more terabytes in size.
   Parallelization is not going to save enough time there.

2) Contain faults
   If things go down the drain, try to contain the damage done to a limited
   subset and try to keep a large portion of the filesystem alive, rather
   than having to throw away the entire filesystem. This requires that you
   can know which part you lost in a reasonably short time.
3) Allow online fsck
   Ideally, you want to be able to quickly verify if (a portion of) the
   filesystem is still correct. Today you would need to make a COW snapshot
   in the volume manager for this, but it would be really nice if
   filesystems had an automatic, periodic fsck without having to offline
   the whole filesystem for a long time.

The high level idea
-------------------

Chunkfs is a very simple idea at its core: just split the 4 terabyte
filesystem up into pieces (chunks) of, say, 1 gigabyte. Each piece is
individually fsck-able and has its own block number space, allocation
bitmaps, superblock, etc. The namespace of the filesystem (e.g. the
directories and filenames) is still global of course, so that to the user
it really still feels like one filesystem.

In ASCII art:

+------------------------------ whole device ------------------------------->
|+---------+---------+---------+---------+---------+---------+---------+---
|| Chunk 1 | Chunk 2 | Chunk 3 | Chunk 4 | Chunk 5 | Chunk 6 | Chunk 7 |
||  Meta-  |  Meta-  |  Meta-  |  Meta-  |  Meta-  |  Meta-  |  Meta-  |
||  data   |  data   |  data   |  data   |  data   |  data   |  data   |
|+---------+---------+---------+---------+---------+---------+---------+---
||  Data   |  Data   |  Data   |  Data   |  Data   |  Data   |  Data   |
||         |         |         |         |         |         |         |
||         |         |         |         |         |         |         |
||         |         |         |         |         |         |         |
||         |         |         |         |         |         |         |
|+---------+---------+---------+---------+---------+---------+---------+---
+--------------------------------------------------------------------------->

This sounds almost too simple, and, well, it is. Before going into some of
the problems that this has, and the solutions we have in mind for those
problems, I'd like to first look at some of the good things that come from
this approach:

Limited fsck time
-----------------

If an event happens that requires a filesystem check/repair pass (for
example a bad sector in an important data area on the disk), only the chunk
or chunks that contain the damaged area need to be checked.
And since the chunks are relatively small, the assumption is that each chunk
can be checked in a short time. This assumption appears to be justified
based on the existing fsck implementations, for which the disk bandwidth,
disk seeks, memory usage, and processor and wall-clock time all seem to
scale linearly or worse with the disk size.

Containment
-----------

To the extent that each chunk has a fully standalone set of metadata, damage
within a chunk will be limited to that chunk, and should not affect files
that are entirely outside this chunk. The Achilles heel here is
directories; if a directory that has many subdirectories is corrupted, all
those subdirectories instantly get orphaned. While the data will not be
lost (it'll end up in lost+found), there can still be disruption to the
system.

Online fsck
-----------

If each chunk gets its own dirty flag (as per Val Henson's paper that will
be presented at the OLS 2006 conference:
http://www.linuxsymposium.org/2006/view_abstract.php?content_key=112 )
then it'll be quite common for many of the chunks in the filesystem to
actually be marked clean. Running the diagnostics part of fsck on such
chunks is then entirely trivial even when the filesystem remains mounted
and active (the only gotcha is that fsck may need to abort if the chunk
becomes dirty while fsck is in progress).

Online repair is going to be a much harder issue: there are many kernel
data structures that cache the on-disk layout, and the locking issues that
would surround such repair will be highly complex if not impossible. This
doesn't mean that online fsck has no value. Quite often the knowledge that
you don't need to do repairs is worth a lot. In addition, if you do find
an issue, the chunk can be marked for repair so that during the offline
repair only the chunks actually in need of repair have to be looked at,
which minimizes downtime.
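
As a rough illustration of the three properties above, the following sketch
models a device as fixed-size chunks, each with its own dirty flag. It is
only a toy model under stated assumptions: the names Chunk, ChunkedDevice,
chunks_needing_fsck and online_check, the "needs_repair" marker and the
verify callback are all made up for this example, and the 1 gigabyte chunk
size is the same wild guess used in the rest of this document.

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 1 << 30  # assumed chunk size: 1 gigabyte, as in the running example


@dataclass
class Chunk:
    index: int
    dirty: bool = False         # per-chunk dirty flag
    needs_repair: bool = False  # set by an online check, used by offline repair


@dataclass
class ChunkedDevice:
    size_bytes: int
    chunks: list = field(default_factory=list)

    def __post_init__(self):
        # Round up so that a partial trailing chunk is still represented.
        count = (self.size_bytes + CHUNK_SIZE - 1) // CHUNK_SIZE
        self.chunks = [Chunk(i) for i in range(count)]

    def chunk_of(self, byte_offset):
        """Map a device byte offset to the chunk that contains it."""
        return self.chunks[byte_offset // CHUNK_SIZE]

    def chunks_needing_fsck(self, bad_offsets):
        """Only chunks that contain a damaged offset need a check/repair pass."""
        return sorted({self.chunk_of(off).index for off in bad_offsets})


def online_check(chunk, verify):
    """Read-only check of one chunk while the filesystem stays mounted.

    `verify` is a hypothetical callable that returns True when the chunk's
    metadata is consistent. The check is skipped for dirty chunks and is
    aborted if the chunk became dirty while the check was running.
    """
    if chunk.dirty:
        return "skipped"
    consistent = verify(chunk)
    if chunk.dirty:
        return "aborted"
    if not consistent:
        chunk.needs_repair = True  # offline repair only visits marked chunks
        return "needs_repair"
    return "ok"
```

With these assumptions a 4 terabyte device has 4096 chunks, and a handful of
bad sectors translates into a handful of chunk indexes to check instead of a
full-device fsck.
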

There are also some side effects of the chunkfs approach that are positive:

32 bit block numbers are enough
-------------------------------

Because block numbers are local to the chunk, and the chunks are quite
small, block numbers can remain 32 bit as they are in ext2/ext3 today. This
is an advantage both in terms of metadata density on disk (and thus higher
performance while reading metadata) and in terms of reduced memory and CPU
usage.

Online growth is easy
---------------------

Online growth of the filesystem is very easy: just add new chunks at the
end of the filesystem. Due to the independence of the chunks, none of the
other chunks need adjusting for the new size (with the possible exception
of the overall filesystem superblock that describes the number of chunks in
the filesystem).

Essential property
------------------

When evaluating all these advantages, one thing becomes clear: it is
*essential* to the entire chunkfs concept that the filesystem metadata
remains local. While it may be unavoidable to have SOME cross-chunk links,
these should be kept to the absolute minimum.

The Problems
============

Now on to the problems, and the proposed solutions:

Big files
---------

With chunks being 1 gigabyte in size, I'm sure many people will have only
one question on their mind: "How do I store my DVD images then?". Storing
big files is a challenge to the "keep everything local to the chunk"
mantra, but there is a reasonably clean solution to this problem:
continuation inodes.

To describe how a continuation inode works, let's take the example of a
file that gets created and then grows to 1.5 gigabytes, on an empty chunkfs
filesystem with a 1 gigabyte chunk size.

At first, the file starts in chunk 0 and gets its inode there (and its
directory information, but let's ignore that aspect for now). Then data
starts coming in. This data gets put into chunk 0 as well of course
(remember: everything is local to the chunk).
This goes on for a while, until chunk 0 is full and no more data can be
stored. At this point, the filesystem puts another inode on the filesystem,
in chunk 1. This new inode contains a magic marker that indicates its
"follow up" status and a back pointer that indicates "I'm the follow up
from THAT inode over there", while the original inode in chunk 0 gets a
marker "for the rest of the data, look over there".

With this new continuation inode in chunk 1, all new file data can be
stored in chunk 1, with all allocation info stored in this chunk 1 inode:
everything is local again. To fsck, apart from the special marker, this
continuation inode just looks like a normal sparse file that has data in
the 1 GB to 1.5 GB range (in this example), and fsck can check the local
chunk as if it were a standalone filesystem again.

The only check that cannot be done on continuation inodes is the orphan
inode check (the orphan inode check checks if a file exists on disk but
there is not a single directory on the filesystem that points to this file,
i.e. it's dead weight). But due to the magic marker and the pointer to the
original inode, the fsck can refer to the orphan inode check in the chunk
of the "real" inode.

This is the one big exception to the "everything local" paradigm. The
damage to the advantages isn't as large as it may sound though:

* continuation inodes should be rare
  Large files are relatively rare on a filesystem (by definition, you
  cannot store very many of them before you run out of space).
* fsck can trust the original chunk
  If the "real" inode thinks this continuation inode is indeed a real
  continuation for it, then there is no need to do the orphan inode check;
  the fsck can trust the orphan inode check done in the chunk that contains
  the real inode without any danger. If the inode there is incorrectly
  seen as not an orphan, you lose some space but you'll regain it once fsck
  runs on that chunk.
  In the opposite scenario, if the original inode no longer sees the
  continuation inode as part of the file, nothing can ever get to the data,
  so it's safe to consider it an orphan.

Hardlinks
---------

As the previous section already alluded to: it is essential for fsck that
the filename reference and the inode are in the same chunk, so that fsck
can check for orphaned inodes. When you normally create a file, this isn't
that hard to achieve: the directory is in a certain chunk, and the
filesystem just creates both the filename and the inode in this same chunk
and everything works out fine automatically.

It gets more tricky when hardlinks get involved; in the hardlink case the
inode is already allocated, so you can no longer choose to place it in the
same chunk as the directory. This is where, again, continuation inodes
come to the rescue: to make sure the directory entry and the inode are in
the same chunk, a continuation inode _for the directory_ is placed in the
same chunk as the inode, and the new directory entry is stored via this
continuation inode, so that the directory entry and the inode end up in
the same chunk.

This introduces a special failure case that needs to be solved/investigated:
if the chunk with the inode has run out of space for allocating new inodes,
the hardlink cannot be created even when the filesystem as a whole still
has plenty of space. A way to mitigate this failure mode is to reserve a
certain amount of space in each chunk just for the continuation inodes
needed for such hardlinks.

Cross-directory renames
-----------------------

Cross-directory renames effectively have the same problem as hardlinks: the
inode already exists and a new directory entry needs to be created. In
fact, except for some standard nitpicking, a cross-directory rename is
effectively the same as creating a hardlink to the file in the new
directory, and then unlinking the old filename from the old directory.
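
The hardlink handling and the "rename is link plus unlink" observation above
can be sketched as a small in-memory model. This is only an illustration
under assumptions of my own: the Inode and ChunkFS classes, the
dir_part_in() helper and the fixed per-chunk inode budget are invented for
this example and say nothing about an actual on-disk format.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Inode:
    number: int
    chunk: int
    is_dir: bool = False
    continuation_of: "Inode" = None                    # back pointer, set on continuation inodes
    continuations: dict = field(default_factory=dict)  # chunk -> continuation inode
    entries: dict = field(default_factory=dict)        # name -> Inode (directories only)


class ChunkFS:
    def __init__(self, inodes_per_chunk):
        self.inodes_per_chunk = inodes_per_chunk  # hypothetical fixed inode budget
        self.free_inodes = {}                     # chunk -> remaining inode slots
        self.next_ino = 1

    def alloc_inode(self, chunk, **kw):
        free = self.free_inodes.setdefault(chunk, self.inodes_per_chunk)
        if free == 0:
            # The failure case from the text: this chunk is out of inodes,
            # even if other chunks still have plenty of room.
            raise OSError("chunk %d has no free inodes" % chunk)
        self.free_inodes[chunk] = free - 1
        inode = Inode(self.next_ino, chunk, **kw)
        self.next_ino += 1
        return inode

    def dir_part_in(self, directory, chunk):
        """Return the piece of `directory` living in `chunk`, creating a
        directory continuation inode there if none exists yet."""
        if directory.chunk == chunk:
            return directory
        if chunk not in directory.continuations:
            cont = self.alloc_inode(chunk, is_dir=True, continuation_of=directory)
            directory.continuations[chunk] = cont
        return directory.continuations[chunk]

    def link(self, directory, name, target):
        # The entry is stored in the same chunk as the target inode.
        part = self.dir_part_in(directory, target.chunk)
        part.entries[name] = target
        return part

    def lookup(self, directory, name):
        for part in [directory, *directory.continuations.values()]:
            if name in part.entries:
                return part.entries[name]
        raise FileNotFoundError(name)

    def unlink(self, directory, name):
        for part in [directory, *directory.continuations.values()]:
            if name in part.entries:
                return part.entries.pop(name)
        raise FileNotFoundError(name)

    def rename(self, old_dir, old_name, new_dir, new_name):
        # Cross-directory rename only: hardlink into the new directory,
        # then unlink the old name.
        target = self.lookup(old_dir, old_name)
        self.link(new_dir, new_name, target)
        self.unlink(old_dir, old_name)
```

In this model every directory entry lives in the same chunk as the inode it
names, so a per-chunk orphan check never has to look outside its own chunk.
Reserving a few inode slots per chunk for directory continuation inodes
would be one way to model the mitigation mentioned above; the sketch leaves
that out.
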
As a result, the solution to the cross-directory rename problem is identical
to the solution to the hardlink problem.

Subdirectory refcounts in fsck
------------------------------

Linux does not support directory hardlinks, which makes the subdirectory
refcount issue several orders of magnitude easier. There is still an issue
though: the allocation strategy of the filesystem will want to move
directories to a different chunk than their parent (just think of the /
directory; you don't want all data in the same chunk as the / directory,
since this chunk would fill up far too quickly and unevenly). This poses a
problem for fsck, since fsck needs to be able to determine whether a
directory has become an orphan.