_This is a summary based on the [transactional.blog post "Userland Disk I/O"](https://transactional.blog/how-to-learn/disk-io)._

A while ago, I read this [blog](https://transactional.blog), which presented an interesting concept and [philosophy](https://transactional.blog/how-to-learn/philosophy). It caught my attention, so I dug into one of the articles, which was exceptionally well-written and explained complex topics in the simple, accessible way that only a true [SME](https://en.wikipedia.org/wiki/Subject-matter_expert) can achieve.

The article categorizes userland disk I/O into four main topics:

- File I/O
- Durability
- Filesystems
- Kernel Things

Each topic is discussed briefly, comparing the available methods across operating systems. The article also includes a curated list of related reading on disk I/O; if you're interested in exploring further, I've compiled those links at the end of this summary.

I've bullet-pointed the topics and removed extra content for clarity. However, if you enjoy this summary, make sure to read the original blog post as well.

### Summary: File I/O, Durability, Filesystems, and Kernel Considerations

- **File I/O**
  - Many databases use `O_DIRECT` for unbuffered I/O so they can manage their own page cache.
  - `O_SYNC`/`O_DSYNC` ensure data reaches the disk reliably.
  - `O_DIRECT` skips the kernel page cache, which is useful when the database modifies data on disk directly.
  - Buffered I/O is beneficial in embedded databases (e.g., RocksDB, LMDB), letting the kernel manage memory allocation.
  - Use `fallocate(2)` to extend files in chunks and minimize filesystem metadata costs.
  - Calling `write()` directly performs synchronous I/O.
  - For asynchronous I/O, prefer `io_uring` for the best performance; the alternatives are `io_submit`, `aio`, `epoll`, and `select`, each with its own limitations. (A minimal `io_uring` sketch follows the Durability notes below.)
- **Durability**
  - `fsync()` ensures data persistence, which is what you need to recover from power loss.
  - There is a distinction between `O_SYNC` (file integrity) and `O_DSYNC` (data integrity); the trade-off depends on which metadata you need persisted.
    - Example: after a write to a file, two pieces of metadata may change: the last-modification timestamp and the file length.
      - Every write updates the last-modification timestamp.
      - Only writes that add data past the end of the file update the file length.
    - `O_DSYNC` guarantees only that the data and the metadata needed for a later read to complete successfully (e.g., the file length) are flushed; the modification timestamp may not be.
  - `fsync()` errors are hard to handle correctly and can cause application-level data loss (e.g., fsyncgate in PostgreSQL).
  - Platform-specific notes:
    - **macOS**: Limited async I/O and no `O_DIRECT`; use `fcntl(F_NOCACHE)` instead.
    - **Windows**: `FlushFileBuffers()` is the equivalent of `fsync()`, `NtFlushBuffersFileEx(FLUSH_FLAGS_FILE_DATA_SYNC_ONLY)` is the equivalent of `fdatasync()`, and `_commit()` is the equivalent for files opened with `_open()`; there are reliability concerns on older drivers.
  - Force Unit Access (FUA)
    - **Purpose**: Ensures data is written to non-volatile storage, bypassing volatile caches, for durability in power-loss scenarios.
    - **Buffered I/O use case**: `pwritev2()` includes support for multi-block atomic writes, contingent on FUA-capable drives and compatible filesystems.
    - **Limitations**:
      - Few drives support FUA, making it unreliable for guaranteed multi-block atomicity.
      - On NVMe drives, an `Atomic Write Unit Power Fail` (AWUPF) value > 0 indicates support, though this is rarely observed in practice.
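To make the async-I/O point above concrete, here is a minimal sketch of a single write submitted through `io_uring` using liburing. It is illustrative only: it assumes Linux 5.1+ with liburing installed (link with `-luring`), and the file path is made up for the example.

```c
/* Minimal io_uring sketch: submit one write and wait for its completion.
 * Assumes Linux 5.1+ and liburing; build with: cc demo.c -luring */
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0) {          /* 8-entry SQ/CQ */
        perror("io_uring_queue_init");
        return 1;
    }

    int fd = open("/tmp/uring-demo.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    static const char msg[] = "hello from io_uring\n";
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);  /* grab a submission slot */
    if (!sqe) { fprintf(stderr, "submission queue full\n"); return 1; }
    io_uring_prep_write(sqe, fd, msg, sizeof msg - 1, 0);
    io_uring_submit(&ring);                               /* hand it to the kernel */

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);                       /* block until it completes */
    printf("write returned %d\n", cqe->res);              /* bytes written or -errno */
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}
```

In a real engine you would keep many SQEs in flight and reap completions in batches; the single submit-and-wait here is only meant to show the lifecycle.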
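And since the File I/O flags and the Durability flags interact, here is a small Linux-only sketch combining them: one `O_DIRECT` write of an aligned block followed by `fdatasync()`. The 4 KiB alignment and the path are assumptions for the example; real code should query the device's logical block size, and `O_DIRECT` may fail with `EINVAL` on filesystems such as tmpfs.

```c
/* Minimal sketch: unbuffered write plus explicit flush on Linux.
 * Assumes a 4 KiB logical block size; the path is made up. */
#define _GNU_SOURCE   /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    /* O_DIRECT bypasses the kernel page cache; buffer, offset, and
     * length must all be aligned to the logical block size. */
    int fd = open("/tmp/direct-demo.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    void *buf = NULL;
    if (posix_memalign(&buf, 4096, 4096) != 0) { perror("posix_memalign"); return 1; }
    memset(buf, 'x', 4096);

    /* Write one aligned block at offset 0. */
    if (pwrite(fd, buf, 4096, 0) != 4096) { perror("pwrite"); return 1; }

    /* fdatasync() flushes the data and only the metadata needed to read it
     * back (e.g. the new file length); fsync() would also flush timestamps. */
    if (fdatasync(fd) != 0) { perror("fdatasync"); return 1; }

    free(buf);
    close(fd);
    return 0;
}
```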
- **Filesystems**
  - XFS is generally preferred for databases due to its performance and its handling of special cases.
  - ext4 and XFS aggregate contiguous blocks to reduce metadata overhead, which encourages extending or appending files in large chunks.
  - Device type limits how many operations can run in parallel:
    - SATA NCQ supports 32 outstanding requests.
    - NVMe supports up to 65k.
  - Raw block device access bypasses the filesystem entirely, but it requires 4k alignment and gives up all filesystem features.
- **Kernel Things**
  - For SSDs, use the `mq-deadline` or `none` I/O scheduler to minimize overhead.
  - `vm.dirty_ratio` controls when Linux starts writing modified pages to disk under buffered I/O.
  - Disk metrics can be self-monitored via `/proc/diskstats` (see the parsing sketch at the very end of this post, after the links).

Links:

- [Linus rants about O_DIRECT](https://yarchive.net/comp/linux/o_direct.html)
- [fsyncgate](https://danluu.com/fsyncgate/)
- [RFC: Clarifying Direct I/O Semantics](https://lwn.net/Articles/348739/)
- [Windows I/O Completion Ports](https://learn.microsoft.com/en-us/windows/win32/fileio/i-o-completion-ports)
- [Files are Hard by Dan Luu](https://danluu.com/file-consistency/)
- [Notes on disk flush commands and why they are important (complains about UFS on BSD) by Matthew Dillon](https://lists.dragonflybsd.org/pipermail/kernel/2010-January/317935.html)
- [XFS / EXT4 / Btrfs / F2FS / NILFS2 Filesystems Performance Benchmark On Linux 5.8](https://www.phoronix.com/review/linux-58-filesystems)
- [Qualifying Filesystems for Seastar and ScyllaDB, or "Why filesystems are important for databases?"](https://www.scylladb.com/2016/02/09/qualifying-filesystems/)
- [IOSchedulers Wiki Page](https://wiki.ubuntu.com/Kernel/Reference/IOSchedulers)

#filesystem #linux #database
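As promised in the Kernel Things notes, here is a rough sketch of self-monitoring disk metrics from `/proc/diskstats`. It only pulls out completed reads and writes per device, following the field order documented in the kernel's iostats documentation; everything else in each line is skipped.

```c
/* Minimal sketch: print completed reads/writes per device from /proc/diskstats.
 * Field layout follows Documentation/admin-guide/iostats.rst on Linux. */
#include <stdio.h>

int main(void) {
    FILE *f = fopen("/proc/diskstats", "r");
    if (!f) { perror("fopen"); return 1; }

    char line[512];
    while (fgets(line, sizeof line, f)) {
        unsigned major, minor;
        unsigned long long reads, writes;
        char dev[64];
        /* Fields: major minor name reads_completed reads_merged sectors_read
         * ms_reading writes_completed ...; only a few are extracted here. */
        if (sscanf(line, "%u %u %63s %llu %*u %*u %*u %llu",
                   &major, &minor, dev, &reads, &writes) == 5)
            printf("%-12s reads=%llu writes=%llu\n", dev, reads, writes);
    }
    fclose(f);
    return 0;
}
```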