Processes: How the Kernel Tracks Everything

When you run a command, the kernel does not simply start executing its code. It creates a data structure, fills it with everything needed to manage that program's execution, and adds it to the list of things it is responsible for. Every process on the system — from PID 1 to the one you just launched — exists in the kernel's memory as an instance of this structure.

That structure is called the task_struct.

The task_struct

The task_struct is defined in include/linux/sched.h in the kernel source. It is large, with several hundred fields, but the core of it covers a small set of concerns: identity, state, memory, files, scheduling, and signal handling.

Identity — The process's PID, its parent's PID (PPID), its user and group credentials. The PID is the integer the kernel assigned when the process was created. The PPID identifies who created it. Credentials determine what the process is allowed to do: which files it can open, which system calls it can make with elevated access, which signals it can send to other processes.

State — The current execution state of the process: running, sleeping, zombie, or stopped. The kernel uses this to decide whether to give the process CPU time. A process in a sleeping state is not considered for scheduling until whatever it is waiting for arrives.

Memory — A pointer to the process's mm_struct — the memory descriptor that holds the page table reference and the list of all virtual memory areas (VMAs) mapped into the process's address space. This is the per-process view of virtual memory. Every mmap() call, every mapped library, every stack and heap region is recorded here.

Files — A pointer to the process's files_struct — the file descriptor table. This is the table that maps integers (file descriptors 0, 1, 2, and so on) to open files, sockets, pipes, and devices. Each process has its own copy of this table, which is why closing a file in one process does not affect another.

Scheduling — The process's scheduling class, its priority, and the accounting data the scheduler uses to decide when to run it next. This includes the virtual runtime used by CFS and the deadline used by EEVDF (introduced in Linux 6.6).

Signal handling — The signal mask (which signals are currently blocked), pending signals, and the table of signal handlers the process has registered.

You can read a summary of the kernel's live task_struct for any process through /proc:

cat /proc/$$/status

The fields map directly: Pid, PPid, State, VmRSS (physical memory in use), SigBlk (blocked signal mask), Uid and Gid (credentials). These are not a copy of the struct — they are generated in real time from it, every time the file is read.

Process States

The state field in the task_struct is one of a small set of values. Understanding them explains a large portion of what ps and top output actually means.

Running (R) — The process is either executing on a CPU core right now or is runnable and queued in the scheduler's runqueue, waiting for one. State R does not distinguish between the two.

Sleeping, interruptible (S) — The process is waiting for something: a read to return data, a lock to be released, a timer to fire. It is not consuming CPU. The scheduler skips it until whatever it is waiting for wakes it. This state can be interrupted by a signal — if a signal arrives while the process is sleeping interruptibly, the kernel wakes it and delivers the signal.

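Sleeping, uninterruptible (D) — The process is waiting for something and cannot be interrupted by ordinary signals. This state exists for operations where waking early would leave the kernel in an inconsistent state, typically waiting for disk I/O to complete or for certain kernel locks. Brief visits to D are normal under I/O load. A process stuck in D for more than a few seconds usually points to failing hardware, an unresponsive network filesystem such as a hung NFS mount, or a kernel bug. A signal sent to a process in D state, SIGKILL included, stays pending and is acted on only once the process leaves D. (The kernel also has a killable variant of this sleep, TASK_KILLABLE, which ps also reports as D but which a fatal signal can interrupt.)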

Zombie (Z) — The process has exited but its task_struct has not yet been removed from the kernel's process table. It is waiting for its parent to call wait() and collect its exit status. A zombie consumes no CPU and negligible memory — only its task_struct entry in the process table remains. A system accumulating large numbers of zombies has a parent process that is not calling wait().

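Stopped (T) — The process has been paused by a stop signal: SIGSTOP, which cannot be caught or ignored, or a job-control signal such as SIGTSTP, which is what the terminal sends when you press Ctrl-Z in a shell. It is not running and will not run until it receives SIGCONT. A process halted by a debugger is in the closely related tracing stop state, which current versions of ps show as a lowercase t.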

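ps -eo state= | sort | uniq -c | sort -rn

This prints a count of processes in each state on the running system. (The state column is the bare one-letter code; the stat column appends modifier characters such as s, l, or + and would split the counts.) On a healthy machine, the vast majority will be S, usually alongside I for idle kernel threads. A count of processes in D above single digits is worth investigating.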

fork()

New processes come into existence through one underlying mechanism, and fork() is its classic interface. Under the hood, Linux implements fork() via the more general clone() system call, which accepts flags that control exactly what the child shares with the parent. fork() is clone() called with no sharing flags, so the child gets its own copy of everything, and with SIGCHLD as the signal sent to the parent when the child exits. This distinction matters when we get to threads: threads are also created via clone(), but with flags that share the address space instead of copying it. For now, fork() is the baseline.

When a process calls fork(), the kernel creates a new task_struct for the child, copying most fields from the parent. The child gets a new PID. Its PPID is set to the parent's PID. Its credential, signal handler, and scheduling fields are copied. Its file descriptor table is duplicated — the child starts with the same open files as the parent, but its own copy of the table, so closing a descriptor in one does not affect the other.

The memory mapping is handled differently. Copying the parent's entire address space would be expensive — the parent might have gigabytes of heap. Instead, the kernel uses copy-on-write (COW). Both parent and child initially share the same physical pages, with the page table entries marked read-only for both. Neither process knows this. They each see their own virtual address space as normal.

When either process writes to a shared page, the CPU raises a page fault — a write to a read-only page. The kernel's page fault handler detects this is a copy-on-write fault, allocates a new physical page, copies the content, updates the page table for the writing process to point to the new page, marks it writable, and returns. The other process continues using the original page. The copy happens only when — and only for the pages — that actually diverge.

After fork(), both parent and child return from the call — in different processes, at the same instruction. The return value distinguishes them: the parent receives the child's PID; the child receives 0. A typical use in C:

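#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();
    if (pid == 0) {
        // child process: fork() returned 0
    } else if (pid > 0) {
        // parent process: pid holds the child's PID
    } else {
        perror("fork");  // fork() failed: resource limits or out of memory
        return 1;
    }
    return 0;
}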

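This fork-and-diverge pattern is how every user-space process on a Unix system comes into existence. Every such process, including all system daemons, every shell, and every server, descends from PID 1 through an unbroken chain of fork() calls going back to boot. (PID 1 itself, and kernel threads, which descend from kthreadd at PID 2, are created inside the kernel rather than by fork().)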

execve()

fork() creates a copy of the calling process. execve() replaces the calling process with a new program.

When execve() is called with a path, an argument vector, and an environment, the kernel loads the specified executable, discards the current process's address space entirely, and replaces it with the new program's text, data, and stack. The PID does not change: the process keeps its identity. The file descriptor table is preserved by default, though descriptors carrying the close-on-exec flag (FD_CLOEXEC, which is what opening with O_CLOEXEC sets) are closed. Signals that had handlers installed are reset to their default actions, since the handler code no longer exists in the new address space; signals that were being ignored stay ignored. Execution then begins at the new executable's entry point.

execve() does not return on success. There is nothing to return to — the code that called it no longer exists in the process's address space.

The fork-exec sequence is the standard pattern for running a new program:

Parent calls fork() — a child process is created
Child calls execve() — the child's address space is replaced with the new program
Parent calls wait() or waitpid() — the parent blocks until the child exits

This is how your shell runs every external command you type (builtins such as cd run inside the shell itself). The shell forks itself, the child calls execve() with the path of the program you named, resolved through PATH if necessary, and the shell waits for it to finish. The program runs with the shell's PID as its PPID. When it exits, the shell wakes from wait() and prompts you again.

The Process Tree

Every process has a parent. The parent is the process that called fork() to create it. The child's task_struct records it as a pointer to the parent's task_struct, and the PPID shown by ps or /proc is the parent's PID read through that pointer.

At the root of this tree is PID 1, which is systemd on most modern Linux systems. Every other user-space process on the machine is a descendant of PID 1, connected through a chain of parent-child relationships going back to boot. The kernel created PID 1 itself during boot, at the end of start_kernel(), and that task then executed the init program. Every subsequent user-space process was created by a fork() call somewhere in that tree. Kernel threads form a separate, smaller tree under kthreadd (PID 2).

pstree -p

This prints the full process tree with PIDs. The root is PID 1. Below it are the processes systemd started directly. Below those are their children, and so on. Every process you have launched in your current session appears somewhere in this tree, descended from your shell, which is typically descended from your terminal emulator, which in turn descends from your display manager, login service, or per-user service manager, and ultimately from PID 1.

When a process exits and still has living children, the kernel reparents them. It walks up the exiting process's ancestors to the nearest one marked as a subreaper, meaning a process that has called prctl(PR_SET_CHILD_SUBREAPER) to signal it will take responsibility for orphaned descendants, and makes that process the new parent. If no subreaper exists among the ancestors, the orphan falls back to PID 1. On systemd-based systems, orphans typically end up under a systemd instance: either PID 1 itself or a per-user manager that has registered as a subreaper.

Reparenting is also why long-running background daemons use a specific sequence to fully detach. The standard pattern: fork once and let the parent exit, so the child is orphaned and, not being a process group leader, is allowed to call setsid(); the child calls setsid() to create a new session, severing itself from the original terminal and process group; it then forks a second time and exits itself. The grandchild continues as the daemon. The second fork is necessary because a session leader can accidentally acquire a controlling terminal if it opens a terminal device. The grandchild is definitively not a session leader, so it cannot.

File Descriptors and fork()

Each process has its own file descriptor table stored in its files_struct. The table maps integers to open file descriptions maintained by the kernel. When a process opens a file, the kernel creates an entry in its table and returns the lowest unused integer — the file descriptor.

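File descriptor 0 is standard input. File descriptor 1 is standard output. File descriptor 2 is standard error. This numbering is a convention rather than something the kernel enforces: a process normally has all three open from the moment it is created only because it inherited them from its parent.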

When fork() creates a child, the file descriptor table is duplicated. The child gets its own table, but the entries in that table point to the same underlying open file descriptions as the parent. This matters: the open file description includes the file offset — the current position in the file. Parent and child share that offset. If the parent reads 100 bytes, moving the offset forward, and then the child reads, it picks up where the parent left off, not at the beginning.

This sharing is deliberate. It is what makes shell pipes work. Before forking to run two commands in a pipeline, the shell creates a pipe, a kernel-managed byte stream with a read end and a write end. Both ends are file descriptors. The shell then forks once per command, and each child inherits both ends. Each child keeps only the end it needs, attaching the write end to its standard output or the read end to its standard input, and closes the other; the shell closes both of its own copies. The data flows through the kernel without either process touching the other's memory.

When execve() runs, the file descriptor table is preserved — unless descriptors were opened with O_CLOEXEC, which marks them for automatic closure on exec. This flag exists precisely to prevent file descriptors from accidentally leaking into child processes that should not inherit them.

References

Linux man page: fork(2)
Linux man page: execve(2)
Linux man page: proc(5) — full documentation of /proc/[pid]/status and related files
Linux kernel source: include/linux/sched.h — the task_struct definition