The Scheduler: How the Kernel Decides Who Runs Next

A typical Linux system runs hundreds of processes. A typical laptop has four to sixteen CPU cores. The mismatch is large and permanent — there will always be more work than cores to run it on.

The scheduler's job is to make this invisible.

The Problem

Every process that is ready to run but not currently running is waiting for a CPU core. The scheduler must decide, continuously, which of these processes gets to run next, on which core, and for how long.

A bad scheduler produces a system that feels sluggish — interactive tasks wait behind batch jobs, a single busy process starves everything else, CPU cores sit idle while work queues up on others. A good scheduler is invisible: processes get CPU time proportional to their priority weight, interactive tasks respond quickly, and cores stay busy without any single process monopolising them.

The Linux scheduler does not make one global decision. It makes many small decisions, driven by a hardware timer that fires hundreds of times per second on every CPU core.

The Runqueue

Each CPU core has its own runqueue — a data structure holding all runnable tasks assigned to that core. A task in a runqueue is ready to run: it is not sleeping, not waiting for I/O, not blocked on a lock. It is eligible to be scheduled.

When the scheduler selects the next task to run, it picks from the runqueue of the core it is running on. The runqueue is not a simple list. Its internal structure depends on the scheduling class of the tasks it contains — the data structure used by the Completely Fair Scheduler is different from the one used by real time tasks. Each scheduling class has its own runqueue embedded within the per-CPU structure.

The Timer Interrupt: The Trigger

The scheduler does not run continuously. It runs when something gives it control — a process making a system call, a process blocking on I/O, or most commonly, a timer interrupt.

IRQ 0 fires on every CPU core at a regular interval — 250 times per second on most Linux systems. Every tick invokes the scheduler's tick function. The tick updates the current task's accounting: how much CPU time it has consumed since it last ran. It then checks whether the current task should be preempted — replaced by a higher-priority or more deserving task.

If preemption is warranted, the scheduler sets a flag. The actual switch does not happen inside the interrupt handler — that would require saving and restoring full task state inside an interrupt context, which is expensive and complex. Instead, the flag is checked when the interrupt handler returns. At that point, if the flag is set, the scheduler runs and performs the context switch before returning to user space.

CFS and EEVDF

For most tasks on a Linux system, scheduling decisions are made by one of two algorithms: the Completely Fair Scheduler (CFS), which dominated Linux scheduling from kernel 2.6.23 (2007) until recently, and the Earliest Eligible Virtual Deadline First scheduler (EEVDF), which replaced it as the default in Linux 6.6 (2023).

CFS

CFS tracks a value called virtual runtime for each task — a measure of how much CPU time the task has received, adjusted by its priority weight. A task with lower priority accumulates virtual runtime faster than a high priority task, even if both receive the same wall clock CPU time. The scheduler always picks the task with the lowest virtual runtime — the task that has received the least CPU time relative to its weight.

CFS stores runnable tasks in a red-black tree sorted by virtual runtime. The task with the lowest virtual runtime is always at the leftmost node — a constant time lookup. Inserting a task and removing it are O(log n) operations. The tree is self-balancing, so the structure remains efficient regardless of how many tasks are runnable.

EEVDF

EEVDF extends the virtual runtime model with the concept of deadlines. Each task has an eligible time — the earliest point at which it may be scheduled — and a virtual deadline — the latest point by which it should have been scheduled. The scheduler picks the eligible task with the earliest virtual deadline.

The key improvement over CFS is latency control. CFS could produce unbounded latency for tasks that were repeatedly preempted — a task with low virtual runtime might still wait longer than expected if the scheduler's granularity was too coarse. EEVDF's deadline model gives the scheduler a target: each task has a point by which it must run, and the algorithm enforces it.

In practice, EEVDF behaves similarly to CFS for most workloads. The difference is visible in latency sensitive workloads — audio processing, interactive applications, real time-adjacent tasks — where EEVDF's tighter deadline enforcement produces more consistent response times.

Context Switching

When the scheduler decides to replace the running task with another, it performs a context switch.

A context switch saves the complete CPU state of the current task and restores the saved state of the next task. What gets saved:

General-purpose registers — rax, rbx, rcx, and all others. The values currently in these registers belong to the current task and must be preserved.
Instruction pointer — where the task will resume when it next runs
Stack pointer — the current task's stack location
Floating-point and SIMD state — the contents of the FPU and SSE/AVX registers, if the task uses them. These are large. On modern x86-64 hardware with XSAVEOPT support, the kernel saves them on every context switch using an optimised instruction that skips registers that have not changed since the last save.

The kernel stores this state in the task's task_struct. Restoring it means loading the saved values back into the hardware registers. When the incoming task's instruction pointer is restored and execution resumes, the task continues from exactly the instruction it was about to execute when it was last preempted.

The memory component of a context switch is a CR3 write — loading the incoming task's page table root into the register the MMU uses for all address translations. This flushes the user space TLB — the CPU's cache of recent virtual-to-physical address translations. The TLB must rebuild its cache for the new task, which adds a burst of page table walks after every context switch.

Context switches are not free. On modern hardware a context switch costs roughly 1–10 microseconds depending on the workload — register saves, TLB flush, pipeline flush, cache warming. A system performing tens of thousands of context switches per second pays a measurable overhead. This is why well written servers try to minimise unnecessary thread creation: each thread adds scheduling overhead even when idle.

You can see the scheduler's view of your current process:

cat /proc/$$/sched

The output includes the task's virtual runtime, the number of times it has been scheduled, how many times it has been involuntarily preempted (the timer said time's up) versus voluntarily descheduled (the task blocked on I/O or a lock), and its current scheduling policy.

The Idle Task

When a CPU core's runqueue is empty — no runnable tasks assigned to it — the core does not halt. It runs a special task: the idle task, also called swapper, with PID 0.

The idle task is not a process in the normal sense. It never appears in ps output. It has one job: execute the hlt instruction or an equivalent low-power instruction that tells the CPU to stop executing until an interrupt arrives. This puts the core into a hardware sleep state — a C-state — reducing power consumption until new work arrives.

When a timer interrupt fires on an idle core, the interrupt handler runs normally. If the interrupt was a timer tick and nothing new is runnable, the idle task resumes. If new work has arrived — a wakeup from another core, a device interrupt that unblocked a task — the scheduler sees a runnable task and switches to it, ending the idle period.

The idle task is the reason top shows CPU usage less than 100% on a lightly loaded system. Time attributed to idle is time the core spent in a sleep state waiting for work.

Priority and Niceness

Linux gives every task a scheduling weight that influences how much CPU time it receives relative to other tasks. The user facing interface for this weight is the nice value.

Nice values range from -20 (highest priority, most CPU time) to +19 (lowest priority, least CPU time). The default is 0. The name comes from the idea of being "nice" to other processes — a task with a high nice value voluntarily takes less CPU time, leaving more for others.

Nice values are not direct time allocations. They are weights. A task at nice -20 does not get 100% of the CPU; it gets more relative to other tasks. On a lightly loaded system all tasks run freely regardless of nice value. Under contention, nice value determines the share.

To check the scheduling policy and priority of your current shell:

chrt -p $$

To run a command at a lower priority:

nice -n 10 make -j8

To change the priority of a running process:

renice -n 5 -p <pid>

Raising a process's priority (lowering its nice value) requires root. Lowering it (raising the nice value) can be done by the process owner.

Real Time Scheduling Classes

Normal tasks — those managed by CFS or EEVDF — can be preempted by higher-priority tasks and by the timer interrupt. Real time tasks operate under different rules.

Linux provides two real time scheduling policies:

SCHED_FIFO — First in, first out. A SCHED_FIFO task runs until it voluntarily yields, blocks, or is preempted by a higher-priority real time task. The timer interrupt does not preempt it. If a SCHED_FIFO task enters an infinite loop, it holds its CPU core indefinitely.

SCHED_RR — Round-robin. Like SCHED_FIFO but with a time quantum. After the quantum expires, the task is moved to the back of the queue of tasks at the same priority level.

Both policies preempt any normal (CFS/EEVDF) task. A real time task at any priority runs before any normal task. real time priorities range from 1 (lowest) to 99 (highest).

real time scheduling is used in audio servers, industrial control systems, and any application where bounded latency matters more than fairness. Misconfigured real time tasks are a common cause of unresponsive systems — a runaway SCHED_FIFO task can starve the entire system. Linux 3.14 introduced a mechanism to mitigate this: by default, real time tasks are limited to 95% of CPU time per period, leaving 5% for normal tasks. This is configurable via /proc/sys/kernel/sched_rt_runtime_us.

Cgroups and CPU Quotas

Individual task priorities are not sufficient for managing CPU allocation in multi-tenant environments — containers, virtual machines, or shared servers where multiple applications must be isolated from each other.

Linux control groups (cgroups) extend scheduling to groups of tasks. A cgroup is a hierarchical grouping of processes. The CPU controller within cgroups allows the kernel to enforce CPU time allocation at the group level.

Two relevant controls in cgroups v2:

cpu.weight — analogous to nice values but for the entire group. A group with cpu.weight=200 receives twice the CPU share of a group with cpu.weight=100 under contention.

cpu.max — a hard quota. The format is quota period, both in microseconds. A value of 50000 100000 means the group may use at most 50ms of CPU time per 100ms period — 50% of one core. When the quota is exhausted, all tasks in the group are throttled until the next period begins.

Docker and Kubernetes use cpu.max to enforce the CPU limits set in container and pod specifications. A container configured with --cpus=0.5 translates to a cpu.max of 50000 100000 in the cgroup the container's tasks belong to. The scheduler enforces this transparently — the container's processes are unaware they are being throttled, and the host system's other processes are unaffected.

Multicore Scheduling and NUMA

The per-CPU runqueue model scales well — each core makes local scheduling decisions without global locks — but it creates an imbalance problem. If one core has ten runnable tasks and another has none, the loaded core is doing all the work while the idle core sits unused.

The kernel addresses this with load balancing. At regular intervals, and when a core becomes idle, the scheduler looks at the runqueues of nearby cores and migrates tasks if the imbalance exceeds a threshold. Task migration is not free — moving a task to a different core means its working set is no longer in that core's L1 or L2 cache, causing a cache cold period after migration. The scheduler weighs the cost of imbalance against the cost of migration.

On systems with Non Uniform Memory Access (NUMA) topology — servers with multiple CPU sockets, each with its own local RAM — the cost of migrating a task across NUMA nodes is significantly higher than migrating within a node. A task that allocated most of its memory on NUMA node 0 runs faster on a core attached to node 0. Moving it to node 1 means all its memory accesses cross the inter-socket interconnect, increasing latency by a factor of two or more depending on the hardware.

The Linux scheduler is NUMA aware. It prefers to keep tasks on cores local to their memory allocations and treats cross node migrations as a last resort. The kernel's NUMA balancing feature (/proc/sys/kernel/numa_balancing) also periodically migrates memory pages toward the cores that access them most frequently, keeping data and compute co-located over time.

References

Linux man page: sched(7) — scheduling policies, priorities, and the full interface
Linux man page: nice(2) — the nice system call
Linux man page: sched_setscheduler(2) — setting real time scheduling policies
Linux kernel source: kernel/sched/fair.c — CFS and EEVDF implementation
EEVDF paper: "Earliest Eligible Virtual Deadline First" — Stoica & Abdel-Wahab, 1995 — the original algorithm the Linux implementation is based on