_This is a summary based on the [transactional.blog post "Userland Disk I/O"](https://transactional.blog/how-to-learn/disk-io)._
A while ago, I came across this [blog](https://transactional.blog), which presents an interesting concept and [philosophy](https://transactional.blog/how-to-learn/philosophy). Intrigued, I dug into one of its articles, which was exceptionally well written and explained complex topics in the simple, accessible way that only a true [SME](https://en.wikipedia.org/wiki/Subject-matter_expert) can achieve.
The article categorizes Userland disk I/O into four main topics:
- File I/O
- Durability
- Filesystems
- Kernel Things
Each topic is discussed briefly, comparing the different methods available on each operating system. The article also includes a curated list of related reading on disk I/O; if you'd like to explore further, I've compiled all of those links at the end of this summary.
I've bullet-pointed the topics and trimmed extra content for clarity, but if you enjoy this summary, make sure to read the original blog post as well.
### Summary: File I/O, Durability, Filesystems, and Kernel Considerations
- **File I/O**
- Many databases use `O_DIRECT` for unbuffered IO to manage their own page cache.
- `O_SYNC`/`O_DSYNC` ensure that each write reaches the disk before the call completes.
- `O_DIRECT` bypasses the kernel page cache, moving data directly between user buffers and the device.
- Buffered IO is beneficial for embedded databases (e.g., RocksDB, LMDB), since the kernel manages the cache memory instead of the library allocating it inside the host process.
- Use `fallocate(2)` to extend files in large chunks and minimize filesystem metadata costs (see the sketch below).
- Calling `write()` directly performs blocking (synchronous) IO.
- For asynchronous IO, prefer `io_uring` for the best performance.
- Other options are `io_submit`, `aio`, `epoll`, and `select`, each with its own limitations.
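To make the file-IO bullets concrete, here is a minimal C sketch (Linux-only, error handling abbreviated) that preallocates a file with `fallocate(2)` and then issues an aligned, blocking `pwrite()` through `O_DIRECT | O_DSYNC`. The file name, 64 MiB preallocation size, and 4 KiB alignment are illustrative assumptions; real alignment requirements depend on the device and filesystem.

```c
// Minimal sketch: direct IO with preallocation (Linux; error handling abbreviated).
// Assumes the filesystem supports O_DIRECT and that 4 KiB alignment is sufficient.
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.bin", O_RDWR | O_CREAT | O_DIRECT | O_DSYNC, 0644);
    if (fd < 0) return 1;

    // Extend the file in a large chunk up front to keep extent metadata cheap.
    if (fallocate(fd, 0, 0, 64 * 1024 * 1024) != 0) { close(fd); return 1; }

    // O_DIRECT requires the buffer, offset, and length to be aligned
    // (typically to the logical block size; 4 KiB is a common safe choice).
    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) { close(fd); return 1; }
    memset(buf, 0xAB, 4096);

    // Blocking write; with O_DSYNC it completes only once the data
    // (and the metadata needed to read it back) is durable.
    ssize_t n = pwrite(fd, buf, 4096, 0);

    free(buf);
    close(fd);
    return n == 4096 ? 0 : 1;
}
```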
- **Durability**
- `fsync()` ensures data is persisted to stable storage, so it survives power loss.
- Distinction between `O_SYNC` (file integrity) and `O_DSYNC` (data integrity); which one you need depends on which metadata must be durable.
- As an example, after modifying a file, two pieces of metadata may be updated: the last-modification timestamp and the file length.
- All write operations update the last-modification timestamp.
- Only writes that append data to the end of the file update the file length.
- `O_DSYNC` only guarantees that the metadata needed for a subsequent read to succeed (such as the file length) is flushed; `O_SYNC` also flushes the remaining metadata, like timestamps.
- Mishandling `fsync()` errors can lead to data loss (e.g., fsyncgate in PostgreSQL); see the sketch below.
- Specific platform notes:
- **macOS**: Limited async IO and no `O_DIRECT`; use `fcntl(F_NOCACHE)` instead.
- **Windows**: `FlushFileBuffers` is the equivalent of `fsync()`, `NtFlushBuffersFileEx(FLUSH_FLAGS_FILE_DATA_SYNC_ONLY)` is the equivalent of `fdatasync()`, and `_commit()` is the equivalent for files opened with `_open()`; there are reliability concerns on older drivers.
- Force Unit Access (FUA)
- **Purpose**: Ensures data is written to non-volatile storage, bypassing volatile caches for durability in power-loss scenarios.
- **Buffered IO Use Case**:
- `pwritev2()` includes support for multi-block atomic writes, contingent on FUA-capable drives and compatible filesystems.
- **Limitations**:
- Limited hardware support: Few drives support FUA, making it unreliable for guaranteed multi-block atomicity.
- For NVMe drives, a nonzero `Atomic Write Unit Power Fail` (AWUPF) value indicates that multi-block writes up to that size are preserved atomically across power failure, though nonzero values are rarely seen in practice.
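As a companion to the durability notes above, here is a minimal sketch of the append-then-`fdatasync()` pattern. The `append_durably()` helper and the `wal.log` file name are hypothetical; the point worth copying is the fsyncgate lesson of treating a failed `fdatasync()` as data loss for the affected writes instead of retrying it.

```c
// Minimal sketch: append a record and make it durable with fdatasync().
// Per fsyncgate, a failed fsync/fdatasync must be treated as data loss for
// the affected writes, not silently retried.
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

// Hypothetical helper: returns 0 on success, -1 if the record cannot be
// considered durable.
int append_durably(int fd, const void *rec, size_t len) {
    ssize_t n = write(fd, rec, len);          // may land only in the page cache
    if (n < 0 || (size_t)n != len) return -1;

    if (fdatasync(fd) != 0) {                 // flush data + length metadata
        // The kernel may already have dropped the dirty pages; retrying
        // fdatasync() can falsely report success. Fail hard instead.
        perror("fdatasync");
        return -1;
    }
    return 0;
}

int main(void) {
    int fd = open("wal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) return 1;
    const char rec[] = "record-1\n";
    int rc = append_durably(fd, rec, strlen(rec));
    close(fd);
    return rc == 0 ? 0 : 1;
}
```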
- **Filesystems**
- XFS is generally preferred for databases due to its performance and its handling of edge cases (see the sketch below).
- Ext4/XFS track contiguous blocks as extents to reduce metadata overhead, which encourages extending or appending files in large chunks.
- Device type impacts parallel operations:
- SATA NCQ supports a queue depth of 32 requests.
- NVMe supports up to ~65K outstanding requests per queue, with up to ~65K queues.
- Raw block device access can bypass the filesystem, but requires 4k alignment and lacks filesystem features.
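Because the filesystem choice matters, a database can sanity-check at startup which filesystem its data directory lives on. Below is a small Linux-only sketch using `statfs(2)` and the magic numbers from `<linux/magic.h>`; the `/var/lib/mydb` path is a made-up example.

```c
// Minimal sketch: detect which filesystem the data directory lives on
// (Linux-only; magic numbers come from <linux/magic.h>).
#include <linux/magic.h>
#include <stdio.h>
#include <sys/vfs.h>

int main(void) {
    struct statfs fs;
    if (statfs("/var/lib/mydb", &fs) != 0) {   // hypothetical data directory
        perror("statfs");
        return 1;
    }
    switch (fs.f_type) {
    case XFS_SUPER_MAGIC:   puts("xfs");   break;
    case EXT4_SUPER_MAGIC:  puts("ext4");  break;
    case BTRFS_SUPER_MAGIC: puts("btrfs"); break;
    default: printf("other (0x%lx)\n", (unsigned long)fs.f_type);
    }
    return 0;
}
```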
- **Kernel Things**
- For SSDs, use `mq-deadline` or `none` IO schedulers to minimize overhead.
- `vm.dirty_ratio` controls how much dirty page-cache data can accumulate before Linux forces writeback for buffered IO.
- Applications can monitor disk metrics themselves via `/proc/diskstats` (see the sketch below).
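To tie these kernel knobs together, here is a small sketch that prints the active IO scheduler for one device and its raw counters from `/proc/diskstats`. The device name `nvme0n1` is an assumption; in practice you would sample `/proc/diskstats` periodically and diff the counters to get rates.

```c
// Minimal sketch: inspect the active IO scheduler and raw disk counters
// for one device (device name "nvme0n1" is an assumption; adjust to yours).
#include <stdio.h>
#include <string.h>

static void dump_file(const char *path) {
    FILE *f = fopen(path, "r");
    if (!f) { perror(path); return; }
    char line[512];
    while (fgets(line, sizeof line, f)) fputs(line, stdout);
    fclose(f);
}

int main(void) {
    // The scheduler shown in [brackets] is the active one, e.g. "[none] mq-deadline".
    dump_file("/sys/block/nvme0n1/queue/scheduler");

    // /proc/diskstats has one line per device with read/write/IO-time counters.
    FILE *f = fopen("/proc/diskstats", "r");
    if (!f) { perror("/proc/diskstats"); return 1; }
    char line[512];
    while (fgets(line, sizeof line, f))
        if (strstr(line, " nvme0n1 ")) fputs(line, stdout);
    fclose(f);
    return 0;
}
```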
Links:
- [Linus rants about O_DIRECT](https://yarchive.net/comp/linux/o_direct.html)
- [fsyncgate](https://danluu.com/fsyncgate/)
- [RFC: Clarifying Direct I/O Semantics](https://lwn.net/Articles/348739/)
- [Windows I/O Completion Ports](https://learn.microsoft.com/en-us/windows/win32/fileio/i-o-completion-ports)
- [Files are Hard by Dan Luu](https://danluu.com/file-consistency/)
- [Notes on disk flush commands and why they are important (Complains about UFS on BSD) by Matthew Dillon](https://lists.dragonflybsd.org/pipermail/kernel/2010-January/317935.html)
- [XFS / EXT4 / Btrfs / F2FS / NILFS2 Filesystems Performance Benchmark On Linux 5.8](https://www.phoronix.com/review/linux-58-filesystems#google_vignette)
- [Qualifying Filesystems for Seastar and ScyllaDB Or "Why filesystems are important for databases?"](https://www.scylladb.com/2016/02/09/qualifying-filesystems/)
- [IOSchedulers Wiki Page](https://wiki.ubuntu.com/Kernel/Reference/IOSchedulers)
#filesystem #linux #database