How Operating Systems Work · Part 10

Threads: One Process, Multiple Execution Paths

A thread is a second instruction pointer inside the same address space. The kernel creates threads the same way it creates processes — with clone() — but with flags that share memory instead of copying it. Here is what that means in practice.

A process has one instruction pointer — one position in the code it is currently executing. A thread is a second instruction pointer inside the same process, running independently, sharing the same memory.

Two threads in the same process can read and write the same variables. They share the same file descriptor table, the same heap, the same global data. What they do not share is execution state: each thread has its own stack, its own registers, and its own program counter. From the kernel's perspective, each thread is a schedulable unit. From the program's perspective, two things appear to happen at once.


How Linux Implements Threads

The Linux kernel does not have a separate concept of a thread. It has one primitive: clone() — the same system call that creates processes, called with different flags.

clone() is the system call that creates new schedulable entities. On Linux, glibc's fork() is implemented on top of clone() with none of the sharing flags set, so the kernel gives the child its own copy of the parent's address space (copied lazily, copy-on-write), file descriptor table, and signal handlers. A thread creation call passes different flags: flags that tell the kernel to share those resources instead of copying them.

The key flags for thread creation:

  • CLONE_VM — share the virtual memory address space. Both the parent and the new thread see the same page tables. A write by one is immediately visible to the other.
  • CLONE_FILES — share the file descriptor table. Both threads use the same set of open file descriptors.
  • CLONE_SIGHAND — share the signal handler table. Signal dispositions set by one thread apply to all threads in the process.
  • CLONE_THREAD — place the new thread in the same thread group as the caller, giving it the same TGID (Thread Group ID). The TGID is what getpid() returns — all threads in a process report the same PID to user space, even though each has a unique kernel-level TID (Thread ID).

The result is a new task_struct — a full kernel-level schedulable entity — that happens to point to the same mm_struct, the same files_struct, and the same sighand_struct as its creator. The kernel schedules it independently. It can run on a different CPU core simultaneously.

In practice, application code almost never calls clone() directly. On Linux, the standard threading interface is POSIX threads (pthreads). The Native POSIX Thread Library (NPTL), part of glibc, provides pthread_create(), which allocates the new thread's stack, sets up its thread-local storage, chooses the appropriate flags, and makes the system call (clone(), or the newer clone3() where recent glibc versions can use it). Application code calls pthread_create(); the kernel sees clone().


Thread vs Process: What Is Shared, What Is Private

When a thread is created, the following are shared with the creating thread:

  • Virtual address space — heap, global variables, code, mapped files
  • File descriptor table
  • Signal handlers
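  • Working directory, root directory, and umask (shared through CLONE_FS, which pthread_create() also passes)
  • User and group credentials (the kernel tracks these per task; glibc keeps them in step across all threads of a process, as POSIX requires)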

Each thread has its own private copy of:

  • Stack: each thread needs its own call stack. pthread_create() maps a new stack region in the shared address space for each thread. With glibc the default size follows the RLIMIT_STACK soft limit, commonly 8 MB. Each thread runs on its own stack, but because the stacks live in the shared address space, a pointer to one thread's local variable is still valid in another thread. The 8 MB is a reservation of virtual address space; physical pages are committed only as the stack is touched. Even so, one thread per connection scales poorly: a million threads would reserve roughly 8 TB of address space (a 32-bit process runs out after a few hundred threads), and every thread also costs a kernel stack, a task_struct, and scheduler work. This per-thread cost is a core reason why the one-thread-per-connection model does not scale to high concurrency.
  • Registers: each thread has its own set of CPU register values, including the instruction pointer and stack pointer. These are saved and restored on each context switch, exactly as with processes.
  • Signal mask: each thread can independently block or unblock signals. The signal handler table is shared, but which signals are currently blocked is per-thread.
  • Thread-local storage (TLS): a mechanism that allows a variable to appear global within a thread but be independent across threads. Declaring a variable with _Thread_local (C11), thread_local (C++11), or the older GCC extension __thread causes the compiler to allocate it in a per-thread segment. Each thread gets its own copy of the variable at its own address. errno is the most common example: it is thread-local so that system call errors in one thread do not corrupt the errno seen by another.

To see the threads of a running process (pgrep -o picks the oldest matching process, since a browser usually runs several processes):

ls /proc/$(pgrep -o firefox)/task/

Each directory in /proc/[pid]/task/ is a TID — a kernel-level thread. The count of directories is the number of threads in the process. A single-threaded process has one entry, its own PID.

To see all threads system-wide with their process context:

ps -eLf

The LWP column is the thread ID (Light Weight Process — the historical term for kernel-level threads). The PID column is the thread group ID — the same for all threads in a process. A process with N threads appears N times in this output.


Race Conditions

Sharing memory between threads introduces a fundamental problem: two threads can read and write the same memory simultaneously, and without coordination, the outcome is undefined.

Consider two threads both incrementing a shared counter:

counter = counter + 1;

This looks like one operation. At the machine level it is three: read the value from memory into a register, add 1 to the register, write the register back to memory. If two threads execute this sequence concurrently without coordination, the following can happen:

  1. Thread A reads counter (value: 5)
  2. Thread B reads counter (value: 5)
  3. Thread A writes counter + 1 (value: 6)
  4. Thread B writes counter + 1 (value: 6)

Both threads incremented the counter, but the result is 6 instead of 7. Thread B's read happened before Thread A's write, so Thread B incremented a stale value. The increment was lost.

This is a race condition. It is not a bug that always reproduces — it depends on the exact interleaving of instructions across threads, which varies between runs, between machines, and under different load conditions. Race conditions are among the hardest bugs to diagnose precisely because they are intermittent.

The solution is synchronisation — preventing threads from concurrently accessing shared data in ways that produce inconsistent results.


Mutexes and Futexes

The basic synchronisation primitive is a mutex — mutual exclusion lock. A mutex is either locked or unlocked. Only one thread can hold it at a time. Before accessing shared data, a thread locks the mutex. When it is done, it unlocks it. A thread that tries to lock an already-locked mutex blocks until the mutex is released.

The counter example with a mutex:

pthread_mutex_lock(&counter_mutex);
counter = counter + 1;
pthread_mutex_unlock(&counter_mutex);

Only one thread can execute the increment at a time. The race is eliminated.

A naive mutex would make a system call on every lock and unlock operation. A system call typically costs on the order of a hundred nanoseconds to a microsecond, and a full context switch costs several microseconds once cache effects are counted. A mutex that is rarely contended (locked by one thread while no other thread is waiting) would pay that cost on every access for no benefit.

Linux solves this with futexes — Fast Userspace Mutexes. A futex is a 32-bit integer in shared memory. Locking it uses an atomic compare-and-swap operation in user space: if the integer is 0 (unlocked), set it to 1 (locked) in a single atomic instruction. No system call, no kernel involvement. The lock takes nanoseconds.

The kernel is only involved when there is contention. If a thread tries to lock a futex that is already locked, it records in the futex word that a waiter exists and calls the futex() system call with FUTEX_WAIT. The kernel rechecks that the word still holds the expected value and, if so, puts the thread to sleep, removing it from the scheduler's runqueue until it is woken. When the holding thread unlocks the futex and sees that a waiter was recorded, it calls futex() with FUTEX_WAKE to wake one or more waiting threads. If no waiter was recorded, unlocking is a single atomic operation in user space. Kernel involvement is proportional to contention: an uncontended mutex pays zero kernel cost.

The pthreads mutex (pthread_mutex_t) is implemented on top of futexes. Application code calls pthread_mutex_lock(); glibc performs the atomic user-space operation and calls the kernel only if it needs to sleep. The fast path is entirely in user space.


Kernel Threads

Not all threads have a user-space counterpart. The kernel creates its own threads — kernel threads, or kthreads — to perform background work entirely within kernel space.

A kernel thread is a task_struct with no mm_struct of its own: its mm pointer is NULL. It has no user-space address space, no user-space stack, no user-space memory mappings. It runs exclusively in kernel mode (ring 0 on x86), executing kernel code directly. It is scheduled like any other task.

Common kernel threads:

  • kworker: processes work items queued by kernel subsystems (bottom halves, deferred work)
  • ksoftirqd: processes softirqs when they accumulate faster than they are handled in interrupt context
  • kswapd: reclaims memory when physical RAM is under pressure, dropping clean page-cache pages and moving anonymous pages out to swap
  • jbd2: the journaling thread for ext4 filesystems, writing journal commits to disk
  • migration: per-CPU thread that migrates tasks between runqueues during load balancing

You can see all kernel threads:

ps -eo pid,args | grep -E '^ *[0-9]+ \['

Kernel thread names appear in square brackets in ps output — [kworker/0:1], [ksoftirqd/0], [kswapd0]. The number after the slash or at the end often indicates which CPU core the thread is bound to.

Kernel threads are created with kthread_create(), which leaves the new thread stopped until wake_up_process() is called on it; the kthread_run() macro performs both steps. A kernel thread runs until its thread function returns, typically after kthread_stop() asks it to exit and it notices through kthread_should_stop(), or until the kernel shuts down. Kernel threads ignore signals by default, so kill from user space has no effect on them.


The CPython GIL

The kernel provides all the primitives for true parallelism: multiple threads, multiple cores, per-thread scheduling. Whether a program actually runs in parallel depends on what the language runtime does with those primitives.

CPython — the reference implementation of Python — imposes a constraint that the kernel knows nothing about: the Global Interpreter Lock, or GIL.

The GIL is a mutex inside the CPython interpreter that must be held to execute any Python bytecode. Only one thread can hold the GIL at a time. All other Python threads, regardless of how many CPU cores are available, wait.

The GIL exists because CPython's memory management is not thread-safe. CPython uses reference counting to track object lifetimes — every object has a counter that is incremented when a new reference to it is created and decremented when a reference is dropped. When the count reaches zero, the object is freed. Reference count updates are not atomic. Without the GIL, two threads concurrently manipulating the same object's reference count would race, potentially freeing an object still in use or leaking memory.

The consequence: a CPU-bound Python program using multiple threads does not run faster on a multi-core machine. The GIL serialises execution: threads take turns holding it rather than running in parallel. A running thread is asked to give up the GIL after a switch interval (5 ms by default, adjustable with sys.setswitchinterval()), and the interpreter releases it around blocking I/O calls, so I/O-bound programs (network servers, for example) benefit from threads because threads yield the GIL while waiting for I/O. CPU-bound programs do not.

The standard workaround for CPU-bound Python is the multiprocessing module, which creates separate processes instead of threads. Separate processes have separate address spaces and separate interpreter instances — no shared GIL. The cost is inter-process communication overhead instead of shared memory.

Python 3.13 introduced an experimental free-threaded build (configured with --disable-gil, specified in PEP 703) in which finer-grained, per-object locking replaces the single global lock. Python 3.14 promoted it to officially supported status via PEP 779: the free-threaded build is available as a separate interpreter but is not yet the default. The long-term plan is to make it the default in a future release, but the migration is careful because removing the GIL changes the memory model that decades of Python extension code was written against. Many C extension packages still need updates before they work correctly without the GIL.

