The Filesystem: How the Kernel Reads a File

When you open a file, you give the kernel a string — a path like /etc/passwd. The kernel does not store files by name. It stores them by number. The path is resolved, step by step, to a number called an inode. The inode points to the actual data. The filename is only a label pointing at that number, and a file can have multiple labels. Deleting a label does not delete the file — not until the last label is gone and no process has the file open.

This model is the foundation everything else in the filesystem is built on.

The VFS Layer

Linux supports dozens of filesystem types: ext4, Btrfs, XFS, tmpfs, procfs, sysfs, FAT32, and more. A program that opens a file does not need to know which one it is using. The open() system call works the same way on all of them.

This uniformity comes from the Virtual Filesystem Switch — VFS. The VFS is a kernel abstraction layer that defines a common interface: inodes, directory entries, file operations. Every filesystem driver implements this interface. When a program calls open(), the kernel calls into the VFS, which dispatches to the appropriate filesystem driver based on where the file lives. The program sees none of this.

The same abstraction extends to things that are not filesystems in the traditional sense. /proc and /sys have no disk backing — the kernel generates their contents in memory on demand. From a program's perspective, they behave like any other directory tree. The VFS makes this possible: a filesystem driver just has to implement the interface, not store data on disk.
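The uniformity can be sketched from user space. The following is a minimal illustration, assuming a Linux system; the helper name read_first_line is made up for this example. It uses the same open/read/close calls on a regular temporary file and, when present, on /proc/self/status, whose contents the kernel generates on demand.

```python
import os
import tempfile


def read_first_line(path):
    """Read the first line of `path` using plain open/read/close calls.

    Nothing here depends on which filesystem driver backs the path.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        data = os.read(fd, 4096)
    finally:
        os.close(fd)
    return data.split(b"\n", 1)[0].decode(errors="replace")


if __name__ == "__main__":
    # A regular file on whatever filesystem backs the temp directory.
    with tempfile.NamedTemporaryFile("w", delete=False) as tmp:
        tmp.write("hello from a regular file\n")
    print(read_first_line(tmp.name))
    os.unlink(tmp.name)

    # A procfs file: no disk backing, same calls (Linux only).
    if os.path.exists("/proc/self/status"):
        print(read_first_line("/proc/self/status"))
```

The function never asks which filesystem it is talking to; that dispatch happens inside the kernel.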

Inodes

An inode is a data structure the filesystem maintains for every file and directory. It contains the file's metadata — everything about the file except its name.

A typical inode contains:

File type — regular file, directory, symbolic link, character device, block device, socket, or named pipe
Permissions — the nine permission bits (owner read/write/execute, group, others) plus the setuid, setgid, and sticky bits
Ownership — the UID and GID of the file's owner and group
Timestamps — three of them: mtime (last data modification), ctime (last inode change — permissions, ownership, link count), atime (last access). Note: ctime is not creation time — it is the last time the inode itself changed. Birth time (creation time) is stored by ext4 and accessible via statx(), but not all filesystems support it.
Size — the file's size in bytes
Link count — the number of directory entries pointing to this inode
Block pointers — the addresses of the data blocks on disk that hold the file's content

What the inode does not contain: the filename. The name lives in the directory entry, not the inode. This separation is what makes hard links possible.

Every inode has a number — unique within its filesystem. To see inode numbers:

ls -i /etc/passwd

The number at the left is the inode number. Now check the inode's metadata directly:

stat /etc/passwd

stat displays the full inode: inode number, size, permissions, ownership, all three timestamps, link count, and block allocation. The filename is shown at the top, but it is not stored in the inode — stat simply reports the path you gave it alongside the inode data.
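The same fields are available programmatically. As a rough sketch, the hypothetical helper describe_inode below wraps Python's os.lstat, which returns the inode metadata for a path without following symlinks. Note that the result has no name field: the name is only the lookup key.

```python
import os
import stat


def describe_inode(path):
    """Return selected inode fields for `path` (symlinks are not followed)."""
    st = os.lstat(path)
    if stat.S_ISDIR(st.st_mode):
        kind = "directory"
    elif stat.S_ISLNK(st.st_mode):
        kind = "symlink"
    elif stat.S_ISREG(st.st_mode):
        kind = "regular"
    else:
        kind = "other"  # devices, sockets, named pipes
    return {
        "inode": st.st_ino,
        "type": kind,
        "mode": stat.filemode(st.st_mode),  # e.g. '-rw-r--r--'
        "uid": st.st_uid,
        "gid": st.st_gid,
        "size": st.st_size,
        "links": st.st_nlink,
        "mtime": st.st_mtime,  # last data modification
        "ctime": st.st_ctime,  # last inode change on Linux, not creation time
        "atime": st.st_atime,  # last access
    }


if __name__ == "__main__":
    for field, value in describe_inode("/etc/passwd").items():
        print(f"{field:6} {value}")
```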

Filenames and Directories

A directory is not a special container. It is a file whose content is a list of name-to-inode mappings — directory entries. Each entry maps a filename (a string) to an inode number. The filename exists only here, in the directory entry. The inode itself has no knowledge of what names point to it.

When you create a file, the kernel allocates an inode and creates a directory entry in the target directory mapping the filename to the new inode number; data blocks are allocated as content is written. When you rename a file within the same filesystem, the kernel updates the directory entry, and the inode number and data blocks do not change. When you delete a file with rm, the kernel removes the directory entry and decrements the inode's link count. The inode and its data are freed only when the link count reaches zero and no process has the file open.
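The rename case is easy to check from user space. This is a small sketch using a temporary directory; the function name and file names are invented for the example. It records the inode number before and after os.rename.

```python
import os
import tempfile


def rename_keeps_inode():
    """Create a file, rename it, and report the inode number before and after."""
    with tempfile.TemporaryDirectory() as d:
        old = os.path.join(d, "old-name")
        new = os.path.join(d, "new-name")
        with open(old, "w") as f:
            f.write("payload\n")
        before = os.stat(old).st_ino
        os.rename(old, new)  # only the directory entry changes
        after = os.stat(new).st_ino
        old_gone = not os.path.exists(old)
        return before, after, old_gone


if __name__ == "__main__":
    before, after, old_gone = rename_keeps_inode()
    print("same inode:", before == after, "| old name gone:", old_gone)
```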

Hard links are a direct consequence of this model. A hard link is simply another directory entry pointing to the same inode number: the same inode, the same data blocks, a different name (possibly in a different directory, but always on the same filesystem, because inode numbers are only meaningful within one filesystem). Both names are equally the file. Neither is the original. The link count in the inode reflects how many directory entries point to it.

echo "test" > /tmp/testfile
ln /tmp/testfile /tmp/testfile-link
ls -i /tmp/testfile /tmp/testfile-link

Both lines show the same inode number. stat on either path shows link count 2. Remove one with rm and the link count drops to 1. The inode and data survive until the last link is removed.
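The same experiment can be scripted. The sketch below mirrors the shell commands above inside a temporary directory (the function name is made up for illustration) and returns what it observes rather than printing it.

```python
import os
import tempfile


def link_counts():
    """Hard-link a file, then remove the first name; report what was observed."""
    with tempfile.TemporaryDirectory() as d:
        first = os.path.join(d, "testfile")
        second = os.path.join(d, "testfile-link")
        with open(first, "w") as f:
            f.write("test\n")

        os.link(first, second)  # second directory entry, same inode
        same_inode = os.stat(first).st_ino == os.stat(second).st_ino
        count_with_both = os.stat(first).st_nlink

        os.unlink(first)  # remove one name
        count_after_rm = os.stat(second).st_nlink
        with open(second) as f:
            data = f.read()  # data is still reachable via the other name
        return same_inode, count_with_both, count_after_rm, data


if __name__ == "__main__":
    print(link_counts())
```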

Symbolic links are different. A symbolic link is a file whose content is a path string. When the kernel resolves a symlink, it reads that path and continues resolution from there. The symlink has its own inode, separate from the target. If the target is deleted, the symlink becomes dangling — it points to a path that no longer resolves.
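A short sketch of the difference, again in a temporary directory with invented names: os.lstat inspects the symlink's own inode, os.readlink returns the stored path string, and once the target is removed the link still exists but no longer resolves.

```python
import os
import tempfile


def symlink_demo():
    """Create a symlink, inspect it, then delete its target."""
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "target")
        link = os.path.join(d, "link")
        with open(target, "w") as f:
            f.write("data\n")
        os.symlink(target, link)

        # lstat looks at the link itself; stat follows it to the target.
        own_inode = os.lstat(link).st_ino != os.stat(target).st_ino
        stores_path = os.readlink(link) == target

        os.unlink(target)
        # lexists: the link's directory entry is still there.
        # exists: following the link now fails.
        dangling = os.path.lexists(link) and not os.path.exists(link)
        return own_inode, stores_path, dangling


if __name__ == "__main__":
    print(symlink_demo())
```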

Path Resolution

When a process calls open("/etc/passwd", O_RDONLY), the kernel resolves the path to an inode through a series of directory lookups.

Starting from the root inode (for an absolute path) or the process's current working directory inode (for a relative path), the kernel reads the directory's content and looks for the next component of the path. For /etc/passwd:

Look up etc in the root directory: find its inode number and confirm it is a directory
Look up passwd in the etc directory: find its inode number (the final component may be any file type; here it is a regular file)
At each directory along the way, check that the calling process has execute permission on it. Execute permission on a directory means the right to traverse it, not to list it
Return the inode

Each step is a directory read. On a cold filesystem with nothing cached, resolving a deep path requires multiple disk reads. On a warm system, the kernel caches the results.
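The walk can be imitated from user space. The following is only an illustration, not how the kernel implements it: the hypothetical helper walk_components calls os.lstat on each successively longer prefix of an absolute path and records the inode found at each step. It does not follow symlinks or treat ".." specially, both of which the real resolution handles.

```python
import os
import stat


def walk_components(path):
    """Resolve an absolute path one component at a time.

    Returns a list of (prefix, inode_number, is_directory) tuples.
    """
    if not os.path.isabs(path):
        raise ValueError("this sketch handles absolute paths only")
    current = "/"
    st = os.lstat(current)
    steps = [(current, st.st_ino, stat.S_ISDIR(st.st_mode))]
    for name in [part for part in path.split("/") if part]:
        # Every component before the last must be a directory.
        if not stat.S_ISDIR(st.st_mode):
            raise NotADirectoryError(current)
        current = os.path.join(current, name)
        st = os.lstat(current)
        steps.append((current, st.st_ino, stat.S_ISDIR(st.st_mode)))
    return steps


if __name__ == "__main__":
    for prefix, inode, is_dir in walk_components("/etc/passwd"):
        print(f"{inode:>10}  {'dir ' if is_dir else 'file'}  {prefix}")
```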

The dentry cache (directory entry cache) stores recent path component lookups in memory. A dentry maps a name within a parent directory to an inode. After /etc has been looked up once, the dentry cache holds the result, and the next lookup of anything under /etc skips the disk read for that component. On a busy system, the dentry cache absorbs most path resolution work entirely in memory.

The Three-Layer Model

When open() succeeds, the kernel returns a file descriptor — an integer. Behind that integer are three distinct layers.

File descriptor — The integer the process holds. An index into the process's file descriptor table, stored in its files_struct. This is the process-private layer: file descriptors are per-process.

Open file description — The entry in the kernel's system-wide open file table that the file descriptor points to. This structure holds the current file offset (the position from which the next read or write will happen), the access mode the file was opened with (O_RDONLY, O_RDWR, etc.), and flags. Multiple file descriptors in the same or different processes can point to the same open file description — this is what happens when fork() duplicates a file descriptor. They share the offset.

Inode — The filesystem object itself. Multiple open file descriptions can point to the same inode — when two processes independently open the same file, they each get their own open file description with their own offset, but both point to the same inode. Writes from one are visible to the other because they are writing to the same underlying data.

This three-layer structure explains several behaviours that otherwise seem inconsistent. If two processes open the same file independently, they have independent offsets — reads do not interfere. If a process duplicates a file descriptor with dup(), both descriptors share an offset — a read from one advances the position seen by the other. If a process deletes a file while another has it open, the directory entry is removed but the inode and data survive until the last file descriptor is closed.
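These three behaviours can be observed directly. The sketch below (function and file names are made up) opens a six-byte file twice, duplicates one descriptor with os.dup, and then unlinks the file while it is still open.

```python
import os
import tempfile


def offset_sharing():
    """Compare a dup()ed descriptor with an independent open() of the same file."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "data")
        with open(path, "wb") as f:
            f.write(b"abcdef")

        fd1 = os.open(path, os.O_RDONLY)
        fd2 = os.dup(fd1)                 # same open file description as fd1
        fd3 = os.open(path, os.O_RDONLY)  # separate open file description
        try:
            os.read(fd1, 3)               # moves the shared offset to 3
            via_dup = os.read(fd2, 3)     # continues from offset 3
            via_open = os.read(fd3, 3)    # has its own offset, starts at 0

            os.unlink(path)               # remove the only directory entry
            os.lseek(fd3, 0, os.SEEK_SET)
            after_unlink = os.read(fd3, 6)  # inode and data still alive
        finally:
            for fd in (fd1, fd2, fd3):
                os.close(fd)
        return via_dup, via_open, after_unlink


if __name__ == "__main__":
    print(offset_sharing())
```

The duplicated descriptor picks up where the first read stopped, the independent open starts from the beginning, and the data remains readable after the name is gone.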

The Page Cache

Reading from disk on every file access would be prohibitively slow. The kernel caches file data in RAM in a structure called the page cache. Pages here are managed by the same virtual memory infrastructure that backs process address spaces.

When a process reads from a file, the kernel checks the page cache first. If the relevant pages are present — a cache hit — the data is copied from RAM to the process's buffer with no disk access. If not — a cache miss — the kernel reads the pages from disk into the page cache, then copies to the process's buffer. Subsequent reads of the same data hit the cache.

The page cache is shared across all processes. Once any process has read part of a file, a later read of the same data by any other process hits the cache. The kernel also performs read-ahead: when it detects a sequential read, it prefetches pages beyond what was requested, anticipating the next read. This is one reason reading a large file sequentially is much faster than random access: read-ahead keeps the cache warm ahead of the process.

Writes work differently depending on how the file was opened. By default, write() copies data into the page cache and marks those pages dirty: modified but not yet written to disk. The kernel flushes dirty pages to disk asynchronously using background kernel threads, a mechanism called writeback. This makes writes appear fast to the application, but introduces a window where data exists only in RAM. If the machine loses power during that window, the writes are lost.

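For applications that cannot afford this risk, such as databases, there are two tools. Opening the file with O_SYNC makes each write() wait until the data has reached disk, and calling fsync() flushes a file's dirty pages to disk before the call returns. The cost is latency: the application waits for the disk write to complete.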
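A minimal sketch of the fsync() approach follows. The function name durable_write is hypothetical, and the extra fsync on the containing directory is an addition not covered in the text above: it is a common precaution so that the new directory entry itself is flushed, not just the file's contents.

```python
import os


def durable_write(path, data):
    """Write `data` (bytes) to `path`, returning only after it has been flushed."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)  # lands in the page cache as dirty pages
            view = view[written:]
        os.fsync(fd)  # block until the kernel reports the data flushed
    finally:
        os.close(fd)

    # Flush the directory as well so the new name survives a crash.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


if __name__ == "__main__":
    import tempfile
    with tempfile.TemporaryDirectory() as d:
        durable_write(os.path.join(d, "record"), b"committed\n")
        print("written and flushed")
```

Each call pays the latency of at least one real disk flush, which is why applications usually batch several writes before a single fsync().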

You can see current page cache usage:

grep -E "Cached|Dirty|Writeback" /proc/meminfo

Cached is the amount of RAM used by the page cache. Dirty is data written by applications but not yet flushed to disk. Writeback is data currently being written to disk by the kernel's writeback mechanism.
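For monitoring scripts it can be handy to read these values programmatically. The helpers below are a small sketch (their names are invented) that parse /proc/meminfo-style text into a dictionary and pick out the three fields discussed above.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number}.

    Most fields are reported in kB; a few (such as huge page counts)
    are plain counts and appear without a unit.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def page_cache_summary(text):
    """Return the Cached, Dirty and Writeback values (None if a field is missing)."""
    info = parse_meminfo(text)
    return {key: info.get(key) for key in ("Cached", "Dirty", "Writeback")}


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        print(page_cache_summary(f.read()))
```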

Journaling

A filesystem write is rarely a single operation. Creating a file involves updating the directory entry, allocating an inode, allocating data blocks, and updating the free space bitmap. If the machine loses power after some of these operations but not all, the filesystem is left in an inconsistent state — a directory entry pointing to an uninitialized inode, or allocated blocks not referenced by any inode.

Older filesystems recovered from this using fsck — a full consistency check that scanned every inode and block at boot time. On a large disk, this could take hours.

Journaling eliminates the wait. Before making any changes to the filesystem, the kernel writes a description of the intended operations to a reserved area called the journal. This is the intent record. Only after the journal entry is safely on disk does the kernel apply the changes to the filesystem proper. If power is lost mid-operation, the journal survives. On next boot, the filesystem driver replays the journal: any committed but incomplete operations are redone, any uncommitted operations are discarded. The filesystem is consistent within seconds.
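The recovery logic can be illustrated with a toy model. The class below is purely an in-memory sketch under simplifying assumptions: it does not reflect the on-disk format or internals of ext4 or any real filesystem, and every name in it is made up. It records intended updates in a journal, applies them with an optional simulated crash, and replays committed entries on recovery.

```python
class ToyJournaledStore:
    """Toy write-ahead journal: a dict stands in for the filesystem proper."""

    def __init__(self):
        self.state = {}    # the "filesystem"
        self.journal = []  # the reserved journal area

    def begin(self, txid, updates):
        """Record the intended updates without applying them."""
        self.journal.append(
            {"txid": txid, "updates": dict(updates), "committed": False}
        )

    def commit(self, txid):
        """Mark the journal entry as safely recorded."""
        for entry in self.journal:
            if entry["txid"] == txid:
                entry["committed"] = True

    def apply(self, txid, crash_after=None):
        """Apply a committed transaction; stop early to simulate power loss."""
        for entry in self.journal:
            if entry["txid"] == txid and entry["committed"]:
                for i, (key, value) in enumerate(entry["updates"].items()):
                    if crash_after is not None and i >= crash_after:
                        return False  # "crash" with the operation half done
                    self.state[key] = value
                return True
        return False

    def replay(self):
        """Recovery: redo committed entries, discard uncommitted ones."""
        for entry in self.journal:
            if entry["committed"]:
                self.state.update(entry["updates"])
        self.journal.clear()


if __name__ == "__main__":
    store = ToyJournaledStore()
    store.begin(1, {"dirent:newfile": 12, "inode:12": "initialized"})
    store.commit(1)
    store.apply(1, crash_after=1)  # directory entry written, inode not yet
    print("after crash: ", store.state)
    store.replay()
    print("after replay:", store.state)
```

Replay is safe to run more than once in this model because each update sets an absolute value rather than a change relative to the current state.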

The trade-off is write amplification: every journaled operation is written twice, first to the journal and then to the filesystem. To limit this, ext4 and XFS journal only metadata by default (directory entries, inode updates, block allocations), not file data. This covers the consistency problem at lower overhead. Btrfs takes a different route and relies on copy-on-write rather than a conventional journal. Full data journaling is available on ext4 with data=journal but is rarely used outside environments where data integrity is more important than write throughput.

To see the filesystem type and journal status of mounted filesystems:

findmnt -o TARGET,FSTYPE,OPTIONS

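Look for data=ordered (ext4's default: metadata journaled, data written before journal commit), data=writeback (metadata journaled, data may be written after), or data=journal (full journaling). ext4 may leave the data= option out of the listing when the default mode is in effect, so its absence usually indicates data=ordered.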
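The same check can be scripted against /proc/self/mounts, which lists one mount per line as device, mount point, filesystem type, and options. The helpers below are a sketch with invented names; they only parse text and make no claim about which options a given kernel chooses to display.

```python
def parse_mounts(text):
    """Parse /proc/self/mounts-style text into (target, fstype, options) tuples."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 4:
            mounts.append((fields[1], fields[2], fields[3]))
    return mounts


def data_mode(options):
    """Return the value of a data= mount option, or None if it is not listed."""
    for opt in options.split(","):
        key, sep, value = opt.partition("=")
        if key == "data" and sep:
            return value
    return None


if __name__ == "__main__":
    with open("/proc/self/mounts") as f:
        for target, fstype, options in parse_mounts(f.read()):
            if fstype == "ext4":
                print(target, data_mode(options) or "(not listed)")
```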

Inode Exhaustion

A filesystem can run out of inodes independently of disk space. Some filesystems, ext4 among them, allocate a fixed number of inodes at format time; ext4 defaults to one inode per 16 kilobytes of space. (XFS and Btrfs allocate inodes dynamically and are much less prone to this problem.) A fixed-inode filesystem used to store millions of small files can exhaust all inodes while disk blocks remain available. When that happens, no new files can be created despite the available space.

df -i

This shows inode usage per mounted filesystem: total inodes, used, free, and percentage used. A filesystem approaching 100% inode usage needs attention — either files need to be cleaned up or the filesystem needs to be reformatted with a higher inode density.
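The same numbers are available through statvfs. The sketch below uses Python's os.statvfs, with two illustrative helper names: one reports inode usage for the filesystem containing a path, the other works out how many inodes a given bytes-per-inode ratio yields at format time.

```python
import os


def inode_usage(path="/"):
    """Return (total, used, free, percent_used) inodes for the filesystem at `path`.

    percent_used is None when the filesystem reports no fixed inode total.
    """
    vfs = os.statvfs(path)
    total, free = vfs.f_files, vfs.f_ffree
    used = total - free
    percent = (100.0 * used / total) if total else None
    return total, used, free, percent


def inodes_at_ratio(size_bytes, bytes_per_inode=16 * 1024):
    """Number of inodes created for a filesystem of `size_bytes` at the given ratio."""
    return size_bytes // bytes_per_inode


if __name__ == "__main__":
    print("inode usage for /:", inode_usage("/"))
    # 1 TiB at one inode per 16 KiB: 2**40 // 2**14 = 67,108,864 inodes.
    print("1 TiB at 16 KiB per inode:", inodes_at_ratio(2**40))
```

At the default ratio, a workload whose average file is much smaller than 16 KiB will run out of inodes before it runs out of blocks.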

References

Linux man page: stat(2) — inode metadata accessible from user space
Linux man page: open(2) — the full open() interface including flags
Linux man page: inode(7) — inode fields and their meaning
Linux kernel source: fs/namei.c — path resolution in the kernel
Linux kernel source: include/linux/fs.h — VFS inode and file structures