How ptrace Works

ptrace is the primary mechanism Linux provides for one process to observe, interrupt, and modify the execution of another. Debuggers and syscall tracers are built on it, and some container sandboxes use it to intercept syscalls. Here is how the kernel implements it and where the access model breaks down.

Every time you run strace or attach gdb to a running process, you are using ptrace. The system call was inherited from early Unix and has been part of the Linux kernel since its first releases. Debuggers, security tools, container runtimes, and exploit primitives all use the same interface.

Most people who use ptrace have never read its kernel implementation. Most people who have read the implementation still find it surprising.

What ptrace Does

The fundamental operation is attachment: a tracer process attaches to a tracee, and from that point forward the kernel routes specific events — system call entries and exits, signal deliveries, instruction steps — through the tracer before they reach the tracee. The tracer can inspect the tracee's registers and memory, modify them, inject synthetic signals, and decide whether execution continues.

ptrace requests fall into four categories. Attach and detach establish and terminate the tracing relationship. Read and write requests access the tracee's memory, registers, and floating-point state. Execution control requests — PTRACE_CONT, PTRACE_SINGLESTEP, PTRACE_SYSCALL — determine how the tracee resumes after a stop. Information requests retrieve process state: signal information, the current system call number, the exit status.
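The read and write category can be illustrated with a minimal sketch that peeks and pokes one word of a stopped child's memory. The variable demo_word and the function peek_and_poke() are hypothetical names invented for this example; the sketch relies on fork() giving the child a copy of the variable at the same virtual address.

```c
#include <errno.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo variable. After fork() the child holds its own copy at
 * the same virtual address, so &demo_word is valid as a remote address. */
static long demo_word = 42;

/* Reads demo_word from a stopped child, overwrites it with 99, and reads it
 * back. Returns 0 on success, -1 on failure. */
int peek_and_poke(long *before, long *after)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(1);
        raise(SIGSTOP);              /* stop so the parent can inspect us */
        _exit(0);
    }

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (!WIFSTOPPED(status))
        return -1;                   /* child exited and is already reaped */

    int rc = -1;
    void *addr = (void *)&demo_word;

    /* PEEKDATA returns the word itself, so errors are signalled via errno. */
    errno = 0;
    *before = ptrace(PTRACE_PEEKDATA, pid, addr, NULL);
    if (errno == 0 &&
        ptrace(PTRACE_POKEDATA, pid, addr, (void *)99L) == 0) {
        errno = 0;
        *after = ptrace(PTRACE_PEEKDATA, pid, addr, NULL);
        if (errno == 0)
            rc = 0;
    }

    kill(pid, SIGKILL);              /* the demo child is no longer needed */
    waitpid(pid, &status, 0);
    return rc;
}
```

The write lands only in the child's copy of the page; the parent's own demo_word is untouched.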

There is also PTRACE_TRACEME, which inverts the relationship. A process calls PTRACE_TRACEME to volunteer itself for tracing by its parent. This is how debuggers launch the programs they debug: the child calls PTRACE_TRACEME, then execve(). When the exec completes, the kernel stops the child with SIGTRAP before the new program executes its first instruction, and the parent observes that stop through wait().
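The launch sequence described above can be sketched in a few lines. The function name exec_stop_signal() is hypothetical; the sketch starts a program under tracing, reports the signal seen at the post-exec stop, and kills the child before it runs.

```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Launches `path` under tracing and returns the signal reported at the
 * post-exec stop, or -1 on failure. The child never gets to run. */
int exec_stop_signal(const char *path)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: volunteer for tracing, then replace the process image. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(126);
        execl(path, path, (char *)NULL);
        _exit(127);                  /* only reached if exec failed */
    }

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (!WIFSTOPPED(status))
        return -1;                   /* child exited instead of stopping */

    int sig = WSTOPSIG(status);
    kill(pid, SIGKILL);
    waitpid(pid, &status, 0);
    return sig;
}
```

A real debugger would set breakpoints at this stop and then resume the child with PTRACE_CONT instead of killing it.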

The ptrace Stop

The ptrace stop is the central event in the protocol. When the tracee hits a traced event — a syscall boundary, a signal, a breakpoint — the kernel does not deliver the event immediately. Instead it places the tracee in TASK_TRACED state, which suspends its execution and notifies the tracer via wait(). The tracer wakes, inspects or modifies the tracee's state, then calls PTRACE_CONT or another resume request to let the tracee proceed.

This makes ptrace synchronous and blocking by design. The tracer blocks in wait() until the tracee stops. The tracee blocks in TASK_TRACED until the tracer resumes it. Every ptrace stop therefore costs at least two context switches: one from the tracee to the tracer when the stop is reported, and one back when the tracer resumes it. This is why strace on a busy process is expensive: with PTRACE_SYSCALL the tracee stops at both syscall entry and syscall exit, so every system call pays that round trip twice.
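The stop and resume protocol is easiest to see as a loop. The following is a minimal sketch of a syscall-counting tracer, roughly the skeleton of an strace-like tool, not a complete one. The function name count_syscall_stops() is hypothetical, and the sketch assumes a single-threaded tracee that does not fork.

```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs argv[0] under tracing and counts syscall stops (entries plus exits).
 * Returns the count, or -1 if the child could not be traced. */
long count_syscall_stops(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(126);
        raise(SIGSTOP);              /* initial stop so options can be set */
        execv(argv[0], argv);
        _exit(127);
    }

    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFSTOPPED(status))
        return -1;

    /* TRACESYSGOOD marks syscall stops as SIGTRAP|0x80, which separates
     * them from ordinary SIGTRAP stops such as the one after exec. */
    ptrace(PTRACE_SETOPTIONS, pid, NULL, (void *)(long)PTRACE_O_TRACESYSGOOD);

    long stops = 0;
    long sig = 0;                    /* signal to deliver on resume, if any */
    for (;;) {
        if (ptrace(PTRACE_SYSCALL, pid, NULL, (void *)sig) == -1)
            break;
        if (waitpid(pid, &status, 0) != pid)   /* tracer blocks here */
            break;
        if (!WIFSTOPPED(status))
            break;                   /* tracee exited or was killed */

        int s = WSTOPSIG(status);
        sig = 0;
        if (s == (SIGTRAP | 0x80))
            stops++;                 /* syscall entry or exit */
        else if (s != SIGTRAP)
            sig = s;                 /* pass real signals through */
    }
    return stops;
}
```

Each loop iteration is one blocking round trip between tracer and tracee, which is where the per-syscall cost comes from.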

The signals involved are not incidental. SIGTRAP is the stop signal the tracer sees in the wait() status for most ptrace stops; the tracer itself is notified with SIGCHLD. SIGSTOP, which PTRACE_ATTACH uses to produce the initial stop, cannot be caught, blocked, or ignored by any process, so a tracee cannot escape the tracer's control by handling the stop signal itself. The interaction between ptrace stops and signal delivery is one of the more complex parts of the kernel's signal handling path.

The Kernel Implementation

ptrace is implemented as a single system call with a request code that dispatches to the appropriate handler. The entry point is sys_ptrace(), which validates the request and calls the architecture's arch_ptrace(). That function handles architecture-specific requests such as register access and falls back to the generic ptrace_request() for everything else.

The kernel tracks the tracing relationship inside each process's task_struct. The relevant fields are ptrace (a bitmask of ptrace flags), parent (the process that will receive ptrace notifications — normally the real parent, but replaced by the tracer on attach), real_parent (the true parent, preserved across attach), and ptracer_cred (the credentials of the attaching tracer, used for access checks after credential changes).
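These fields are internal to the kernel, but the tracing relationship is visible from user space through the TracerPid line of /proc/<pid>/status. A minimal sketch of reading it follows; the function name tracer_pid_of() is hypothetical, and the sketch assumes /proc is mounted.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns the TracerPid recorded for `pid` (0 means not traced),
 * or -1 if the status file cannot be read or parsed. */
long tracer_pid_of(pid_t pid)
{
    char path[64];
    char line[256];
    long tracer = -1;

    snprintf(path, sizeof path, "/proc/%ld/status", (long)pid);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    while (fgets(line, sizeof line, f)) {
        /* The line looks like "TracerPid:<tab>1234". */
        if (sscanf(line, "TracerPid: %ld", &tracer) == 1)
            break;
    }
    fclose(f);
    return tracer;
}
```

Calling tracer_pid_of(getpid()) is a common, easily defeated way for a program to notice that a debugger is attached.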

When a tracer calls ptrace(PTRACE_ATTACH, pid, ...), the kernel runs ptrace_attach(). This function performs the access check, sets the PT_PTRACED flag on the tracee's ptrace field, replaces the tracee's parent pointer with the tracer's task_struct, and sends SIGSTOP to the tracee to produce an initial stop. The tracer then calls wait() and receives the stop notification.

Detach runs the reverse: ptrace_detach() clears PT_PTRACED, restores parent to real_parent, and resumes the tracee if it is currently stopped.
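The attach and detach sequence can be sketched from the tracer's side. The function name attach_stop_signal() is hypothetical. The sketch attaches to its own child, which Yama scope 1 permits, and would fail under scope 2 or 3 without the required capability.

```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Attaches to a freshly forked child, returns the signal reported at the
 * initial stop, then detaches. Returns -1 if the attach fails. */
int attach_stop_signal(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        for (;;)
            pause();                 /* child never asks to be traced */
    }

    int status;
    int sig = -1;
    if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) == 0 &&
        waitpid(pid, &status, 0) == pid && WIFSTOPPED(status)) {
        sig = WSTOPSIG(status);      /* the SIGSTOP sent by the attach */
        /* Detach with data 0: the pending SIGSTOP is not delivered. */
        ptrace(PTRACE_DETACH, pid, NULL, NULL);
    }

    kill(pid, SIGKILL);
    waitpid(pid, &status, 0);
    return sig;
}
```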

__ptrace_may_access() — The Access Check

Before an attach proceeds, the kernel runs __ptrace_may_access() to determine whether the calling process has the right to trace the target. Other interfaces that expose one process to another, such as /proc/<pid>/mem and process_vm_readv(), are gated by the same function. It implements a three-part check.

First, same-user check: the caller's user ID (real or filesystem UID, depending on the access mode) must match the tracee's real, effective, and saved user IDs, and the same must hold for the group IDs. This is the common case: a developer attaching a debugger to their own process.

Second, capability check: if the IDs do not match, the tracer must hold CAP_SYS_PTRACE in the tracee's user namespace. This is how root processes attach to any process on the system.

Third, dumpable check: the kernel examines the tracee's MM_DUMPABLE flag, which lives in its memory descriptor. When a process undergoes a privilege change (executing a setuid binary being the primary case), the kernel resets this flag to the value of fs.suid_dumpable, which is 0 by default. A non-dumpable process cannot be attached by a tracer that lacks CAP_SYS_PTRACE, even when the user IDs match. The purpose is to prevent an unprivileged process from using ptrace to read the memory of a setuid process that has elevated its privileges.
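A process can read and change its own dumpable flag through prctl(2), which is a convenient way to see the flag from user space. The sketch below uses the real PR_GET_DUMPABLE and PR_SET_DUMPABLE operations; the function name clear_dumpable_roundtrip() is hypothetical.

```c
#include <sys/prctl.h>

/* Clears the caller's dumpable flag, returns the value read back, and then
 * restores the previous setting. Returns -1 on error. */
int clear_dumpable_roundtrip(void)
{
    int original = prctl(PR_GET_DUMPABLE, 0, 0, 0, 0);
    if (original < 0)
        return -1;

    if (prctl(PR_SET_DUMPABLE, 0, 0, 0, 0) == -1)
        return -1;
    int cleared = prctl(PR_GET_DUMPABLE, 0, 0, 0, 0);

    /* PR_SET_DUMPABLE accepts only 0 or 1, so any other original value
     * is restored as 0. */
    prctl(PR_SET_DUMPABLE, original == 1 ? 1 : 0, 0, 0, 0);
    return cleared;
}
```

Programs that hold secrets in memory sometimes clear the flag deliberately so that same-user processes cannot attach to them.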

On top of these three checks sits the Yama Linux Security Module, which adds a fourth layer via kernel.yama.ptrace_scope. Scope 1 restricts attachment to descendants: a process can be traced only by one of its ancestors, or by a process it has explicitly nominated with prctl(PR_SET_PTRACER). Scope 2 requires CAP_SYS_PTRACE for all attach operations. Scope 3 disables ptrace attach entirely and cannot be lowered again until reboot. Scope 0 disables Yama's restrictions and falls back to the kernel's own checks. It is not a security configuration and should not be used as one.
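The active scope can be read from procfs. A minimal sketch follows, with the hypothetical function name yama_ptrace_scope(); it treats a missing file as "Yama not enabled" and returns -1 in that case.

```c
#include <stdio.h>

/* Returns the current Yama ptrace_scope (0 to 3), or -1 if the sysctl
 * file is missing or unreadable, for example when Yama is not built in. */
int yama_ptrace_scope(void)
{
    FILE *f = fopen("/proc/sys/kernel/yama/ptrace_scope", "r");
    if (!f)
        return -1;

    int scope = -1;
    if (fscanf(f, "%d", &scope) != 1)
        scope = -1;
    fclose(f);
    return scope;
}
```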

The Race Window

CVE-2026-46333 lives in a gap that __ptrace_may_access() does not cover. When a process exits, the kernel tears down its resources in sequence: the memory descriptor (mm_struct) is detached and released, then the file descriptor table is closed. Between those two events, the memory descriptor is NULL but the file descriptors are still open.

The dumpable check reads the MM_DUMPABLE flag from the memory descriptor. If the memory descriptor is NULL, there is nothing to read. Before the fix, the kernel skipped the check in that case rather than treating a NULL memory descriptor as non-dumpable, so the access check passed for a process it should have rejected.

An unprivileged process can exploit this window using pidfd_getfd(2), introduced in Linux 5.6. pidfd_getfd duplicates a file descriptor from another process, identified by a PID file descriptor, into the caller. It is gated by the same __ptrace_may_access() check. If the check passes, as it does during the NULL mm window, the calling process receives a copy of the target's open file descriptor. The target is already exiting, but its open descriptors are still valid.
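For readers who have not used pidfd_getfd, here is a minimal sketch of its legitimate use: a parent copies a pipe descriptor out of its own cooperating child. The function name borrow_child_fd() is hypothetical, and the fallback syscall numbers are an assumption that holds for the generic syscall table; the raw syscall() interface is used because older C libraries have no wrappers.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434           /* assumption: generic syscall number */
#endif
#ifndef SYS_pidfd_getfd
#define SYS_pidfd_getfd 438          /* assumption: generic syscall number */
#endif

/* Copies a pipe read end out of a child and reads from it. Returns 1 if
 * the expected bytes arrive, 0 if not, -1 on error with errno set. */
int borrow_child_fd(void)
{
    int ctl[2];
    if (pipe(ctl) == -1)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        int data[2];
        close(ctl[0]);
        if (pipe(data) == -1 || write(data[1], "hi", 2) != 2)
            _exit(1);
        /* Tell the parent which descriptor number to ask for. */
        if (write(ctl[1], &data[0], sizeof data[0]) != (ssize_t)sizeof data[0])
            _exit(1);
        for (;;)
            pause();
    }
    close(ctl[1]);

    int result = -1, saved = 0, remote_fd, status;
    if (read(ctl[0], &remote_fd, sizeof remote_fd) == (ssize_t)sizeof remote_fd) {
        int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
        if (pidfd >= 0) {
            /* Subject to the same access check as PTRACE_ATTACH. */
            int local = (int)syscall(SYS_pidfd_getfd, pidfd, remote_fd, 0);
            if (local >= 0) {
                char buf[2] = { 0, 0 };
                result = read(local, buf, 2) == 2 && memcmp(buf, "hi", 2) == 0;
                close(local);
            } else {
                saved = errno;
            }
            close(pidfd);
        } else {
            saved = errno;
        }
    }

    close(ctl[0]);
    kill(pid, SIGKILL);
    waitpid(pid, &status, 0);
    errno = saved;
    return result;
}
```

On kernels older than 5.6, or under a sandbox that filters these syscalls, the function returns -1 with errno set to ENOSYS or EPERM.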

The targets are setuid binaries that open root-owned files during their normal operation and have not yet closed those files when the race window opens. ssh-keysign keeps the SSH host private key files open through its exit path. chage keeps /etc/shadow open. The exploit races against their exit, wins the pidfd_getfd check during the NULL mm window, and reads files the unprivileged process has no business accessing.

The fix — commit 31e62c2ebbfd, "ptrace: slightly saner get_dumpable() logic" — caches the last user-dumpable state before the memory descriptor is released, so the check has a value to read even after mm is NULL. The exploit window closes.
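The logic of the bug and the fix can be captured in a toy model. This is not kernel code: toy_mm, toy_task, cached_dumpable, and the function names are all hypothetical, and the user ID check is assumed to have passed already. Only the SUID_DUMP_* names mirror the kernel's constants.

```c
#include <stdbool.h>
#include <stddef.h>

enum { SUID_DUMP_DISABLE = 0, SUID_DUMP_USER = 1 };

struct toy_mm {
    int dumpable;
};

struct toy_task {
    struct toy_mm *mm;       /* NULL once the exiting task drops its mm */
    int cached_dumpable;     /* hypothetical: last dumpable value seen  */
};

/* Pre-fix shape: with no mm, the dumpable test is skipped entirely. */
bool may_access_unpatched(const struct toy_task *t, bool has_cap)
{
    if (t->mm && t->mm->dumpable != SUID_DUMP_USER && !has_cap)
        return false;
    return true;
}

/* Post-fix shape: fall back to the cached value when the mm is gone. */
bool may_access_patched(const struct toy_task *t, bool has_cap)
{
    int dumpable = t->mm ? t->mm->dumpable : t->cached_dumpable;
    if (dumpable != SUID_DUMP_USER && !has_cap)
        return false;
    return true;
}

/* Models the exit path: remember the flag, then drop the mm. */
void toy_exit_mm(struct toy_task *t)
{
    if (t->mm)
        t->cached_dumpable = t->mm->dumpable;
    t->mm = NULL;
}
```

In the model, a non-dumpable task is refused by both versions while it is alive. After toy_exit_mm(), the unpatched check lets an unprivileged caller through and the patched one still refuses.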

Why ptrace Is a Recurring Attack Surface

A process with ptrace access to another process has nearly unlimited control over it. It can read and write arbitrary memory, modify register state, redirect system calls, inject synthetic signals, and change the program counter. ptrace is not a narrow interface — it is a general-purpose process control mechanism, and that breadth is the source of its security cost.

Container runtimes restrict ptrace for exactly this reason. A process inside a container that can ptrace the container runtime has a reliable path to escape the container entirely: it can modify the runtime's memory, redirect its system calls, and reach the host kernel. Common hardening is Yama scope 1 or higher, or a seccomp-bpf profile that blocks the ptrace system call entirely.

seccomp-bpf covers one of ptrace's use cases, syscall filtering, without requiring a tracer process. A seccomp filter runs entirely in kernel space on every syscall entry, so there is no tracer to wake, no context switch to user space, and no wait queue round-trip. That removes the per-syscall overhead that makes ptrace-based tracing expensive at scale. A filter also gives the filtering agent no access to the filtered process's memory. That makes it strictly less powerful than ptrace, but for syscall filtering specifically, that limitation is the point.
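A seccomp profile that blocks ptrace can be sketched as a four-instruction BPF program. The function name ptrace_denied_by_seccomp() is hypothetical, and the sketch deliberately omits the architecture check on seccomp_data.arch that a production filter needs. It installs the filter in a forked child so the calling process is left unfiltered.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if ptrace() fails with EPERM under the filter, 0 if it does
 * not, and -1 if the filter could not be installed. */
int ptrace_denied_by_seccomp(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct sock_filter insns[] = {
            /* Load the syscall number. */
            BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                     offsetof(struct seccomp_data, nr)),
            /* If it is ptrace, fall through; otherwise skip one insn. */
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_ptrace, 0, 1),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        };
        struct sock_fprog prog = {
            .len = sizeof insns / sizeof insns[0],
            .filter = insns,
        };

        /* Unprivileged processes must set no_new_privs before filtering. */
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1)
            _exit(2);
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) == -1)
            _exit(2);

        errno = 0;
        long r = ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        _exit(r == -1 && errno == EPERM ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    switch (WEXITSTATUS(status)) {
    case 0:  return 1;
    case 1:  return 0;
    default: return -1;
    }
}
```

The filter decides inside the kernel on syscall entry, so no tracer process is ever scheduled.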

The pattern behind CVE-2026-46333 is not unique to ptrace. The gap between credential drop and resource release is a recurring kernel security problem: a process in transition between two security states, where the check reads the old state but the resources belong to the new one, or vice versa. ptrace's access check is one instance. It will not be the last.