Saturday, August 01, 2015

Notes about Filesystems

Filesystems may sound like a very simple concept, especially in Unix environments, but there are some interesting limitations and tricks that a Unix admin must keep in mind when working with them.
I'm taking some notes about filesystems, with very simple definitions, specifically for the common filesystems in Linux. I'll also include my references here. Some are interesting.



A bit of reading about Inodes and Unix filesystem features

Unix Inodes
 
Many Unix filesystems (Berkeley Fast Filesystem, Linux ext2fs, Sun ufs, ...) take an approach that combines several disk allocation ideas:







  • each file is indexed by an inode
  • inodes are special disk blocks set aside just for this purpose (see df -i to see how many of these exist on your favorite Unix filesystem)
  • they are created when the filesystem is created
  • the number of inodes limits the total number of files/directories that can be stored in the filesystem
  • the inode itself consists of
      • administrative information (permissions, timestamps, etc.)
      • a number of direct blocks (typically 12) that contain pointers to the first 12 blocks of the file
      • a single indirect pointer that points to a disk block which in turn is used as an index block, if the file is too big to be indexed entirely by the direct blocks
      • a double indirect pointer that points to a disk block which is a collection of pointers to disk blocks which are index blocks, used if the file is too big to be indexed by the direct and single indirect blocks
      • a triple indirect pointer that points to an index block of index blocks of index blocks...
  • interesting reading on your favorite FreeBSD system: /sys/ufs/ufs/dinode.h
  • small files need only the direct blocks, so there is little waste in space or extra disk reads in those cases
  • medium sized files may use indirect blocks
  • only large files make use of (and incur the overhead of) the double or triple indirect blocks, and that is reasonable since those files are large anyway
  • since the disk is now broken into two different types of blocks - inodes and data blocks, there must be some way to determine where the inodes are, and to keep track of free inodes and disk blocks. This is done by a superblock, located at a fixed position in the filesystem. The superblock is usually replicated on the disk to avoid catastrophic failure in case of corruption of the main superblock 
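
The direct/indirect pointer scheme above puts an upper bound on file size. A rough sketch of that arithmetic, assuming 4 KiB blocks and 4-byte block pointers (both values are illustrative assumptions, not taken from any particular filesystem, and real filesystems impose other limits too):

```python
def max_file_size(block_size=4096, pointer_size=4, direct=12):
    """Upper bound on file size for a classic direct/indirect inode layout."""
    # Number of block pointers that fit in one index block.
    ptrs = block_size // pointer_size
    # Data blocks reachable through direct, single, double and triple indirect pointers.
    blocks = direct + ptrs + ptrs**2 + ptrs**3
    return blocks * block_size


if __name__ == "__main__":
    size = max_file_size()
    print(f"{size} bytes, about {size / 2**40:.2f} TiB")
```

With these assumed values almost all of the capacity comes from the triple indirect pointer, which is why only very large files ever pay for it.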

  • Disk Allocation Considerations
  • limitations on file size, total partition size
  • internal, external fragmentation
  • overhead to store and access index blocks
  • layout of files, inodes, directories, etc, as they affect performance - disk head movement, rotational latency - many unix filesystems keep clusters of inodes at a variety of locations throughout the file system, to allow inodes and the disk blocks they reference to be close together
  • may want to reorganize files occasionally to improve layout
  • Free Space Management
    With any of these methods of allocation, we need some way to keep track of free disk blocks.
    Two main options:



  • bit vector - keep a vector, one bit per disk block
      • 0 means the corresponding block is free, 1 means it is in use
      • search for a free block requires search for the first 0 bit, can be efficient given hardware support
      • vector is too big to keep in main memory, so it must be on disk, which makes traversal slow
      • with a block size of 2^12 bytes (4 KB) and a disk size of 2^33 bytes (8 GB), there are 2^21 blocks, so the bit vector needs 2^21 bits (256 KB)
      • easy to allocate contiguous space for files
  • free list - keep a linked list of free blocks
      • with linked allocation, can just use existing links to form a free list
      • with FAT, use FAT entries for unallocated blocks to store free list
      • no wasted space
      • can be difficult to allocate contiguous blocks
      • allocate from head of list, deallocated blocks added to tail, both O(1) operations
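
To make the bit vector option concrete, here is a minimal in-memory sketch. The class and function names are made up for illustration, and a real filesystem keeps this structure on disk and searches it far more cleverly than a linear scan:

```python
class BlockBitmap:
    """Toy free-space bit vector: bit i is 1 when block i is in use."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.bits = bytearray((num_blocks + 7) // 8)  # all blocks start free

    def is_used(self, block):
        return bool(self.bits[block // 8] & (1 << (block % 8)))

    def allocate(self):
        """Mark the first free block as used and return its number, or None if full."""
        for block in range(self.num_blocks):
            if not self.is_used(block):
                self.bits[block // 8] |= 1 << (block % 8)
                return block
        return None

    def free(self, block):
        """Clear the bit for a block so it can be allocated again."""
        self.bits[block // 8] &= ~(1 << (block % 8)) & 0xFF


def bitmap_bytes(disk_bytes, block_size):
    """Size of the bit vector in bytes: one bit per disk block."""
    return (disk_bytes // block_size + 7) // 8
```

For example, `bitmap_bytes(2**33, 2**12)` gives the bit vector size for an 8 GB disk with 4 KB blocks.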
  • Performance Optimization
    Caching is an important optimization for disk accesses.
    A disk cache may be located:



  • main memory
  • disk controller
  • internal to disk drive



  • Safety and Recovery

    When a disk cache is used, there could be data in memory that has been "written" by programs, but which has not yet been physically written to the disk. This can cause problems in the event of a system crash or power failure.
    If the system detects this situation, typically on bootup after such a failure, a consistency checker is run. In Unix, this is usually the fsck program, and in Windows, scandisk or some variant. This checks for and repairs, if possible, inconsistencies in the filesystem.
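
A program that cannot afford to lose a write to the cache can ask the operating system to flush it. A minimal sketch, where `durable_write` and the demo file name are hypothetical names chosen for illustration. Note that a cache inside the drive itself may still hold the data afterwards, depending on the hardware:

```python
import os
import tempfile


def durable_write(path, data):
    """Write data and ask the OS to push it to the storage device before returning."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # drain Python's userspace buffer into the kernel
        os.fsync(f.fileno())  # ask the kernel to flush its cached copy of this file


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.gettempdir(), "durable_write_demo.bin")  # hypothetical name
    durable_write(demo_path, b"hello")
```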

    Journaling Filesystems
    One way to avoid data loss when a filesystem is left in an inconsistent state is to move to a log-structured or journaling filesystem.



  • record updates to the filesystem as transactions
  • transactions are written immediately to a log, though the actual filesystem may not yet be updated
  • transactions in the log are asynchronously applied to the actual filesystem, at which time the transaction is removed from the log
  • if the system crashes, any pending transactions can be applied to the filesystem - main benefits are less chance of significant inconsistencies, and that those inconsistencies can be corrected from the unfinished transactions, avoiding the long consistency check
  • Examples:
      • ReiserFS, a linux journaling filesystem - I recommend reading this page
      • ext3fs, also for linux
      • jfs, IBM journaling filesystem, available for AIX, Linux
      • Related idea in FreeBSD's filesystem: Soft Updates
      • Journaling extensions to Macintosh HFS disks, called Elvis, supposedly coming in OS X 10.2.2
      • NTFS does some journaling, but some claim it is not "fully journaled"
  • the term "journaling" may also refer to systems that maintain the transaction log for a longer time, giving the ability to "undo" changes and retrieve a previous state of a filesystem
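
The log-then-apply idea above can be sketched with a toy key/value "filesystem". Everything here (the class name, the in-memory log that is simply assumed to survive a crash) is an illustration of the concept, not how any real journaling filesystem is implemented:

```python
class ToyJournalFS:
    """Key/value 'filesystem' with a write-ahead log. All names are illustrative."""

    def __init__(self):
        self.log = []   # pending transactions; assumed to survive a crash
        self.data = {}  # stands in for the actual filesystem

    def write(self, key, value):
        # Record the transaction in the log first; the filesystem is updated later.
        self.log.append((key, value))

    def checkpoint(self, max_transactions=None):
        """Apply logged transactions in order, removing each one only after it is applied."""
        count = len(self.log) if max_transactions is None else max_transactions
        for _ in range(min(count, len(self.log))):
            key, value = self.log[0]
            self.data[key] = value  # applying twice is harmless, so replay is safe
            del self.log[0]

    def recover(self):
        """After a crash, replay whatever is still pending in the log."""
        self.checkpoint()


if __name__ == "__main__":
    fs = ToyJournalFS()
    fs.write("a", 1)
    fs.write("b", 2)
    fs.checkpoint(1)  # pretend the system crashed after applying one transaction
    fs.recover()
    print(fs.data)
```

Removing a transaction from the log only after it has been applied means a crash in between leaves it pending, so recovery replays it rather than losing it.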


  • From: Ext4 filesystem layout

    Overview


        An ext4 file system is split into a series of block groups. To reduce performance difficulties due to fragmentation, the block allocator tries very hard to keep each file's blocks within the same group, thereby reducing seek times. The number of blocks in a block group can be calculated as 8 * block_size_in_bytes. With the default block size of 4KiB, each group will contain 32,768 blocks, for a length of 128MiB. ( It's good to group things. )
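
The block group arithmetic can be checked with a few lines. This is only a sketch of the formula quoted above, and the function names are made up for illustration:

```python
def blocks_per_group(block_size):
    """Blocks in one group: a one-block bitmap holds 8 bits per byte, one bit per block."""
    return 8 * block_size


def group_bytes(block_size):
    """Length of one block group in bytes."""
    return blocks_per_group(block_size) * block_size


if __name__ == "__main__":
    for bs in (1024, 2048, 4096):
        print(bs, blocks_per_group(bs), group_bytes(bs) // 2**20, "MiB")
```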

    Blocks

        ext4 allocates storage space in units of "blocks". A block is a group of sectors between 1KiB and 64KiB, and the number of sectors must be an integral power of 2. Blocks are in turn grouped into larger units called block groups. Block size is specified at mkfs time and typically is 4KiB. You may experience mounting problems if block size is greater than page size (i.e. 64KiB blocks on an i386 which only has 4KiB memory pages). By default a filesystem can contain 2^32 blocks; if the '64bit' feature is enabled, then a filesystem can have 2^64 blocks.


    Mode               32-bit  32-bit  32-bit  32-bit  64-bit  64-bit  64-bit  64-bit
    Block size         1KiB    2KiB    4KiB    64KiB   1KiB    2KiB    4KiB    64KiB
    Blocks             2^32    2^32    2^32    2^32    2^64    2^64    2^64    2^64
    Inodes             2^32    2^32    2^32    2^32    2^32    2^32    2^32    2^32
    File System Size   4TiB    8TiB    16TiB   256TiB  16ZiB   32ZiB   64ZiB   1YiB
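
The "File System Size" row is just the number of addressable blocks multiplied by the block size. A small sketch to recompute it, with helper names that are hypothetical and chosen only for this example:

```python
UNITS = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB", "ZiB", "YiB"]


def human(n):
    """Format a power-of-two byte count using binary units."""
    i = 0
    while n >= 1024 and i < len(UNITS) - 1:
        n //= 1024
        i += 1
    return f"{n}{UNITS[i]}"


def max_fs_size(block_size, block_number_bits):
    """Largest filesystem addressable with the given block size and block-number width."""
    return (2 ** block_number_bits) * block_size


if __name__ == "__main__":
    for bits in (32, 64):
        for bs in (1024, 2048, 4096, 65536):
            print(f"{bits}-bit, {human(bs)} blocks: {human(max_fs_size(bs, bits))}")
```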

     ( Nice, so the maximum filesystem size is dictated by the width of the block numbers: 2 to the power of 32, or 64 with the '64bit' feature. You have some room to play with the block size, but you should stick to blocks no bigger than your memory page size. But can't I mount a filesystem created on a machine with bigger pages on one with smaller pages? It seems we may have difficulties, so don't be sure unless you've tried it. )


    Layout

    The layout of a standard block group is approximately as follows (each of these fields is discussed in a separate section below):

    Field                  Size
    Group 0 Padding        1024 bytes
    ext4 Super Block       1 block
    Group Descriptors      many blocks
    Reserved GDT Blocks    many blocks
    Data Block Bitmap      1 block
    inode Bitmap           1 block
    inode Table            many blocks
    Data Blocks            many more blocks

    For the special case of block group 0, the first 1024 bytes are unused, to allow for the installation of x86 boot sectors and other oddities. The superblock will start at offset 1024 bytes, whichever block that happens to be (usually 0). However, if for some reason the block size = 1024, then block 0 is marked in use and the superblock goes in block 1. For all other block groups, there is no padding.
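
A short sketch of how the superblock could be located in a raw filesystem image. The byte offsets used for `s_log_block_size` (0x18) and `s_magic` (0x38, value 0xEF53) are the ones I know from the ext4 on-disk format and do not appear in the text quoted above, so double-check them against the layout documentation. The function names are hypothetical:

```python
import struct

SUPERBLOCK_OFFSET = 1024  # bytes from the start of the filesystem
SUPERBLOCK_SIZE = 1024    # the superblock structure is 1024 bytes long
EXT4_MAGIC = 0xEF53       # expected value of s_magic (assumption, see lead-in)


def superblock_block(block_size):
    """Block number that contains byte offset 1024."""
    return SUPERBLOCK_OFFSET // block_size


def read_block_size(image):
    """Return the block size recorded in a filesystem image given as bytes."""
    sb = image[SUPERBLOCK_OFFSET:SUPERBLOCK_OFFSET + SUPERBLOCK_SIZE]
    (magic,) = struct.unpack_from("<H", sb, 0x38)  # little-endian 16-bit magic
    if magic != EXT4_MAGIC:
        raise ValueError("no ext2/3/4 superblock magic found")
    (log_block_size,) = struct.unpack_from("<I", sb, 0x18)
    return 1024 << log_block_size  # block size is 2^(10 + s_log_block_size)
```

`superblock_block(4096)` is 0 and `superblock_block(1024)` is 1, matching the two cases described above.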




    Still to continue.....





