How Operating Systems Work · Part 4

Virtual Memory: The Illusion of Private RAM

Every process on a Linux system believes it has the machine's entire address space to itself. That belief is constructed entirely by the kernel and the CPU's memory management unit — and it is one of the most consequential abstractions in systems software.

Open two terminals and run echo $$ in each. Two different PIDs. Both processes have a stack at roughly the same virtual address. Both have their C library mapped at roughly the same virtual address. Neither is aware the other exists, and neither can read the other's memory.

One machine. One physical RAM. Two completely isolated views of it.

This is virtual memory.


The Problem It Solves

Physical RAM is a shared resource. Every process on the system uses the same physical chips. Without some mechanism for isolation, any process could read or write any other process's memory — accidentally through a bug, or deliberately through an exploit. The kernel itself would be readable and writable by any process. Nothing would be safe.

The naive solution, giving each process its own dedicated region of physical RAM, does not scale. A machine with 16 GB of RAM cannot run a hundred processes each expecting gigabytes of address space. It also breaks on addressing: the addresses in a traditionally linked binary are fixed at build time, so if two copies of the same program ran simultaneously against physical memory, their addresses would collide.

Virtual memory solves both problems. Each process operates on virtual addresses — addresses that mean nothing to the hardware until translated. The CPU translates every virtual address to a physical address before accessing RAM, using a per-process mapping that the kernel controls. Two processes can have the same virtual address map to entirely different physical memory. A process can have a virtual address space far larger than physical RAM. The kernel can revoke access to any region at any time by changing the mapping.


Virtual Address Spaces

On x86-64 Linux with standard 4-level paging, virtual addresses are 48 bits wide, so the total virtual address space is 256 terabytes, split between user space and kernel space. The lower half, addresses from 0x0000000000000000 to 0x00007fffffffffff, belongs to the process: 128 terabytes of user-space virtual address capacity, most of which is unmapped. Accessing an unmapped address faults immediately. The upper half, starting at 0xffff800000000000, is reserved for the kernel. Kernel memory is only accessible at ring 0, and on kernels running with page table isolation most of it is not even present in the page tables used while in user mode. Either way, a user process that tries to read a kernel address gets a page fault, not kernel data.
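
The sizes above can be checked with a few lines of arithmetic. 4-level paging on x86-64 translates 48 bits of virtual address, which is where the numbers come from. This is a minimal sketch that assumes a shell with 64-bit integer arithmetic (bash and dash both qualify); the variable names are only illustrative.

```shell
# 4-level paging translates 48 bits of virtual address.
total_tib=$(( (1 << 48) >> 40 ))   # whole address space, in TiB (2^40 bytes each)
user_tib=$(( (1 << 47) >> 40 ))    # lower (user) half only
user_top=$(( (1 << 47) - 1 ))      # highest user-space address

echo "total: ${total_tib} TiB, user: ${user_tib} TiB"
printf 'highest user address: 0x%016x\n' "$user_top"
```

The last line should print the same upper bound quoted in the text, 0x00007fffffffffff.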

Within the user portion, different regions serve different purposes. The text segment holds the program's executable code. The data segment holds global and static variables. The heap grows upward from a base address as the program allocates memory. The stack grows downward from a high address. Shared libraries are mapped into the address space at load time. These regions are not adjacent — there are large unmapped gaps between them. That is deliberate: accessing a gap produces a fault, which catches bugs like NULL pointer dereferences and stack overflows before they silently corrupt memory.


Page Tables

The CPU does not translate addresses one byte at a time. It works in fixed-size chunks called pages. On x86-64 Linux, the default page size is 4 kilobytes — 4096 bytes. Every virtual address belongs to a page, and the translation unit is the page, not the individual byte.

The mapping from virtual pages to physical pages is stored in a data structure called a page table. Each process has its own page table, maintained by the kernel in physical memory. The page table is a multi-level tree — on x86-64 Linux it is four levels deep (five on systems with five-level paging enabled for very large address spaces). Each level is an array of entries, each entry pointing either to the next level or, at the final level, to the physical page frame that holds the actual data.

When the CPU needs to translate a virtual address, it walks this tree. The virtual address is split into index fields — bits that select an entry at each level — and a page offset that identifies the specific byte within the final physical page. The walk produces a physical address, and the CPU uses that to access RAM.
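
To make the split concrete, here is a small sketch of how a 48-bit address breaks down under 4-level paging with 4 KiB pages: four 9-bit indices, one per level, followed by a 12-bit page offset. The function name split_va and the sample address are made up for illustration; the bit positions are the standard x86-64 layout.

```shell
# Split a 48-bit virtual address into four 9-bit table indices
# and a 12-bit page offset (x86-64, 4-level paging, 4 KiB pages).
split_va() {
    va=$(( $1 ))
    l4=$(( (va >> 39) & 0x1ff ))   # index into the top-level table
    l3=$(( (va >> 30) & 0x1ff ))
    l2=$(( (va >> 21) & 0x1ff ))
    l1=$(( (va >> 12) & 0x1ff ))   # index into the last-level table
    off=$(( va & 0xfff ))          # byte within the 4 KiB page
    echo "$l4 $l3 $l2 $l1 $off"
}

split_va 0x00007ffd12345678   # prints: 255 500 145 325 1656
```

Nine bits per level means each table holds 512 entries. At 8 bytes per entry, every table is exactly 4096 bytes, one page.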

The kernel updates page tables to control what each process can see. Mapping a page makes it accessible. Unmapping it removes access immediately — the next access faults. Changing a mapping's permissions — marking a page read-only, for example — takes effect on the next access to that page. This is how the kernel enforces memory isolation: not by moving data, but by changing what addresses map to.

The physical address of the current process's top-level page table is stored in the CR3 register. Switching between processes means writing a new value to CR3 — the CPU immediately begins using the new process's page tables for all subsequent address translations. This is the memory component of a context switch.


The TLB

Walking a four-level page table on every memory access would be prohibitively slow. A single load instruction requires one memory access for the data, but up to four more to walk the page table tree, one per level. (Page table entries hold physical addresses, so the walk itself needs no further translation.) Without caching, every memory operation could cost up to five memory accesses instead of one.

The CPU caches recent translations in a hardware structure called the Translation Lookaside Buffer (TLB). The TLB stores a small number of virtual-to-physical mappings. When the CPU translates an address, it checks the TLB first. If the mapping is there (a TLB hit), the translation completes almost for free, typically within a cycle or so, with no page table walk. If not (a TLB miss), the CPU walks the page table, stores the result in the TLB for future use, and proceeds.
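
A rough model shows why the hit rate matters so much. If a hit costs nothing extra and a miss costs a full four-level walk, the average number of memory accesses per load is 1 + miss rate × 4. The sketch below is deliberately simplified: the hit rates are illustrative rather than measured, the helper name avg_accesses is invented here, and real CPUs also cache parts of the walk, which this ignores.

```shell
# Average memory accesses per load = 1 (the data) + miss_rate * 4 (the walk).
# The hit rates below are illustrative, not measurements.
avg_accesses() {
    awk -v hit="$1" 'BEGIN { printf "%.2f\n", 1 + (1 - hit) * 4 }'
}

for hit in 0 0.90 0.99; do
    echo "hit rate $hit -> $(avg_accesses "$hit") accesses per load"
done
```

With no TLB at all the model gives 5.00 accesses per load. At a 99% hit rate it drops to 1.04.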

On a warm workload, the TLB hit rate is high. Most programs access a relatively small working set of pages repeatedly, and those translations stay cached. Performance degrades when a workload accesses many distinct pages — a large random-access data structure, for example — because the TLB fills up and misses become frequent.

Without further help from the hardware, context switches invalidate the TLB. When the CPU switches to a different process, the virtual-to-physical mappings in the TLB belong to the previous process and are no longer valid. In the basic scheme, a CR3 write flushes every TLB entry that is not marked global. Kernel page table entries have traditionally been marked global, so the CPU retains them across CR3 writes and the kernel's own mappings do not need to be rebuilt on every context switch. This is one reason context switches have a cost beyond just saving and restoring registers: the user-space TLB must be rebuilt from scratch for the new process, causing a burst of page table walks until the working set warms up again.

Modern x86-64 CPUs support Process Context Identifiers (PCIDs), which tag TLB entries with a process identifier. With PCIDs, a context switch does not flush the entire TLB — entries from the previous process are retained but tagged, so the new process does not accidentally use them. The kernel enables PCID support on hardware that provides it.
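
Whether your CPU advertises PCID can be checked from /proc/cpuinfo, where it appears as the pcid flag. This sketch only reports hardware support on x86; whether the running kernel actually uses the feature depends on its configuration. The function name check_pcid is just for illustration.

```shell
# Look for the "pcid" CPU flag (x86 only).
check_pcid() {
    if grep -qw pcid /proc/cpuinfo 2>/dev/null; then
        echo "CPU advertises PCID"
    else
        echo "no pcid flag found"
    fi
}

check_pcid
```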


Demand Paging

Mapping a virtual address does not mean the corresponding physical page exists in RAM. Linux uses demand paging: physical pages are allocated only when they are actually accessed.

When a process calls mmap() to map a file or allocate anonymous memory, the kernel records the new region in its description of the process's address space but, by default, allocates no physical memory and leaves the page table entries for that range not-present. When the process first accesses any address in the mapped region, the CPU finds a not-present entry in the page table and raises a page fault, a hardware exception handled by the kernel's page fault handler.

The handler determines what kind of fault it is. If the address does not belong to any mapping, or the access violates the mapping's permissions, the kernel delivers SIGSEGV to the process. If the address is validly mapped but the page has not been loaded yet, the handler allocates a physical page frame, reads the data into it if the mapping is file-backed, updates the page table entry to point to the new frame and mark it present, and returns. The CPU retries the instruction that caused the fault, which now succeeds.
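
The kernel counts these faults per process, and the counters are readable from /proc. Linux separates minor faults, resolved without disk I/O, from major faults, which had to read from disk. The sketch below pulls both out of /proc/<pid>/stat. The helper name fault_counts is made up, and it assumes the field layout documented in proc(5).

```shell
# Print minor and major fault counts for a PID (default: this shell).
# Fields 10 and 12 of /proc/<pid>/stat are minflt and majflt. Stripping
# everything up to ") " first keeps a command name containing spaces from
# shifting the columns, which makes them fields 8 and 10 of the remainder.
fault_counts() {
    sed 's/^.*) //' "/proc/${1:-$$}/stat" |
        awk '{ print "minor=" $8, "major=" $10 }'
}

fault_counts
```

Run it twice with some work in between and the minor count should climb as the shell touches new pages.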

Demand paging has two significant consequences. First, a process can have a very large virtual address space without consuming physical RAM proportional to that size — only the pages it actually touches consume memory. Second, starting a program is fast: the kernel maps the executable and its libraries into the address space without reading them all into RAM. Pages are loaded as execution reaches them.
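
The first consequence is easy to observe. /proc/<pid>/status reports both the total size of a process's mappings (VmSize) and the part currently resident in RAM (VmRSS). A minimal sketch, with the helper name vm_usage chosen only for this example:

```shell
# VmSize: total mapped virtual memory. VmRSS: pages currently resident in RAM.
vm_usage() {
    awk '/^VmSize:/ { size = $2 }
         /^VmRSS:/  { rss = $2 }
         END { printf "virtual=%d kB resident=%d kB\n", size, rss }' \
        "/proc/${1:-$$}/status"
}

vm_usage
```

For most processes the resident figure is a fraction of the virtual one. The gap is memory that has been mapped but never touched, or touched and since evicted.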

The mmap() system call is the interface. When a dynamic linker loads a shared library, it calls mmap() to map the library file into the process's address space. The library's code pages are not read into RAM until the program actually calls a function in that library — demand paging loads them on first access. This is why programs start faster when their libraries are already in the page cache from a previous run: the page fault handler finds the data already in RAM and avoids a disk read.


Swap

Physical RAM is finite. When the total demand from all running processes exceeds available RAM, the kernel must make room. It does this by paging out — writing infrequently accessed pages from RAM to a swap partition or swap file on disk, freeing their physical frames for other use.

The page table entry for a swapped-out page is marked not-present. When a process accesses that address, the CPU raises a page fault. The kernel's handler detects that the page is on swap, allocates a physical frame, reads the page back from disk, updates the page table, and returns. The process resumes, unaware that the access took milliseconds instead of nanoseconds.

Swap exists as a pressure valve, not a performance feature. A machine that swaps heavily under normal load is a machine that needs more RAM. Disk access is orders of magnitude slower than RAM — a page read from an NVMe SSD takes microseconds; from a spinning disk, milliseconds. When the kernel is spending significant time moving pages between RAM and disk, application performance degrades visibly.

You can see current swap usage:

grep -i swap /proc/meminfo

And which processes are consuming the most memory, making swap more likely:

ps aux --sort=-%mem | head -10
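
To see which processes actually have pages sitting in swap right now, the VmSwap field of /proc/<pid>/status is a more direct signal. This is a sketch rather than a polished tool: the function name swap_by_process is invented, only the first word of each process name is shown, and processes that exit mid-scan are silently skipped.

```shell
# List processes with pages in swap, largest first.
# Kernel threads have no VmSwap line, so they print nothing.
swap_by_process() {
    for f in /proc/[0-9]*/status; do
        awk '/^Name:/ { name = $2 }
             /^VmSwap:/ && $2 > 0 { printf "%8d kB  %s\n", $2, name }' "$f" 2>/dev/null
    done | sort -rn | head -10
}

swap_by_process
```

Empty output means nothing is swapped out, which is the healthy case.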


The vDSO

Some kernel operations are read-only and do not require a privilege level crossing to be safe. Getting the current time is the common example. A call to gettimeofday() needs only data the kernel maintains — the current time — not any privileged operation. But without an optimisation, every call would require a full syscall instruction: the ring 3 to ring 0 transition, the kernel handler, and sysret back. On a program that reads the time frequently, this overhead accumulates.

The vDSO (virtual Dynamic Shared Object) is the kernel's solution. The kernel carries a small shared library, built as part of the kernel image, containing implementations of a few frequently called functions. It maps this library into the address space of every process automatically when the program is loaded, at a random address chosen by ASLR. The dynamic linker resolves the symbols the vDSO exposes at process startup, and the C library (glibc on most Linux systems) routes calls to gettimeofday(), clock_gettime(), and a small number of others to the vDSO implementation instead of the real syscall.
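
The kernel tells each new process where it mapped the vDSO through the auxiliary vector, in an entry called AT_SYSINFO_EHDR. On glibc systems the dynamic loader prints that vector when LD_SHOW_AUXV=1 is set. The sketch below assumes glibc and a dynamically linked /bin/true; on other C libraries it prints a fallback message instead. The function name vdso_base is hypothetical.

```shell
# Ask glibc's dynamic loader to print the auxiliary vector of a child process.
# AT_SYSINFO_EHDR holds the address where the kernel mapped the vDSO.
vdso_base() {
    LD_SHOW_AUXV=1 /bin/true 2>/dev/null | grep AT_SYSINFO_EHDR \
        || echo "AT_SYSINFO_EHDR not reported (loader may not be glibc)"
}

vdso_base
vdso_base   # a second process usually reports a different address
```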

The vDSO implementation reads from a page of kernel data also mapped into every process's address space: read-only, accessible at ring 3. The kernel keeps its timekeeping data on this page up to date, and the vDSO code combines it with a hardware counter (the TSC on most x86-64 machines) to compute the current time. The process gets the time with no syscall, no ring crossing, and no kernel involvement at the moment of the call. The call completes in nanoseconds instead of hundreds of nanoseconds.

This is only possible because of virtual memory. The kernel can map its own data into every process's address space as a read-only page. The process can read that page, normally only through the vDSO functions because the layout of the data is a kernel-internal detail, but it cannot write to it: the page table entry is read-only, and any write attempt faults. The process sees exactly the data the kernel chooses to expose and nothing more.


Reading Your Own Address Space

The complete virtual memory map of any process is visible in /proc:

cat /proc/$$/maps

Each line is one mapped region. The columns are: virtual address range, permissions, offset, device, inode, and pathname. A line with no pathname is an anonymous mapping — heap memory, stack, or a mapping created with mmap() without a file. A line with a pathname is a file-backed mapping — a shared library, the executable itself, or a mapped file.
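
The address ranges are hexadecimal, which makes region sizes hard to judge by eye. Here is a small sketch that converts each range to a size in KiB. The function name region_sizes is made up for this example, and it uses plain shell arithmetic, so it skips the [vsyscall] line some systems show, whose kernel-half address does not fit in a signed 64-bit integer.

```shell
# Print size, permissions and label for each region in a maps file
# (default: this shell's own map).
region_sizes() {
    while read -r range perms offset dev inode name; do
        # [vsyscall] sits in the kernel half, beyond signed 64-bit shell arithmetic.
        [ "$name" = "[vsyscall]" ] && continue
        start=${range%-*}
        end=${range#*-}
        kib=$(( (0x$end - 0x$start) / 1024 ))
        printf '%8d KiB  %s  %s\n' "$kib" "$perms" "${name:-[anonymous]}"
    done < "${1:-/proc/$$/maps}"
}

region_sizes | sort -rn | head -5
```

The pipeline at the end lists the five largest regions. Every size is a multiple of 4 KiB, because mappings are made of whole pages.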

Look for these in the output:

The executable itself — a line with your shell's path, permissions r-xp (read, execute, private).

The stack — a line labelled [stack] near the top of the address space, permissions rw-p.

The heap — a line labelled [heap], permissions rw-p, address just above the executable's data segment.

Shared libraries — multiple lines for each .so file: one r--p segment for read-only data, one r-xp for executable code, one rw-p for writable data.

The vDSO — a line labelled [vdso], a small region mapped by the kernel into every process. No file path, because it is not a file — it is the kernel-generated library described above.

The vvar page — a line labelled [vvar], the read-only kernel data page the vDSO reads from. Also mapped by the kernel, also present in every process.

grep -E "\[stack\]|\[heap\]|\[vdso\]|\[vvar\]" /proc/$$/maps

This shows the four kernel-managed regions directly. The addresses will differ from run to run — ASLR randomises them at process creation.
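
The randomisation is easy to watch. In the sketch below each grep is a fresh process reading its own map through /proc/self, so with ASLR enabled the three stack ranges should differ. The function name stack_ranges is only for illustration.

```shell
# Each grep is a new process, so each line is a different process's stack.
stack_ranges() {
    for i in 1 2 3; do
        grep '\[stack\]' /proc/self/maps
    done
}

stack_ranges

# The system-wide ASLR setting: 0 is off, 2 is full randomisation.
cat /proc/sys/kernel/randomize_va_space 2>/dev/null
```

The addresses vary but stay in the same neighbourhood near the top of the user address space, which is why two processes have their stacks at roughly, not exactly, the same virtual address.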

