How Linux System Calls Work: The Kernel Crossing Explained

When you run a program, that program cannot do much on its own. It cannot read a file, send a network packet, or even write a line to the terminal without asking the operating system for help. The mechanism it uses to ask is a system call.

System calls are the only way a program can deliberately ask the kernel to act on its behalf. Every read, every write, every new process, every request for more memory from the kernel goes through one. Understanding them explains a large portion of what Linux is actually doing when software runs.

The Privilege Boundary

Linux splits execution into two worlds: user space, where programs run with strict limits, and kernel space, where the OS runs with unrestricted hardware access. This boundary is enforced by the CPU in hardware. A user space program that tries to execute a privileged instruction is not ignored; the CPU raises a fault before the instruction completes and hands control to the kernel. The full mechanism behind this is covered in Privilege Rings: How the CPU Enforces the Boundary.

What matters here: programs need to do things that require kernel access, and the system call is the one controlled path across.

What a System Call Actually Is

Registers: The CPU's Working Memory

Before explaining how a system call works at the hardware level, it helps to understand one concept: CPU registers.

A register is a small, named slot of storage built directly into the processor. Unlike RAM, which is a separate chip the CPU accesses over a bus, registers are inside the CPU itself. Reading or writing a register takes a single clock cycle. Reading RAM takes hundreds. Registers are where the CPU does its immediate work — they hold the values currently being operated on.

On x86-64 processors (the architecture in most of today's desktops, laptops, and servers), the general-purpose registers are named rax, rbx, rcx, rdx, rsi, rdi, rsp, rbp, and r8 through r15. Each holds 64 bits (8 bytes) of data. When the CPU executes an instruction like "add these two numbers," those numbers come from registers. When a function returns a value, that value is placed in rax by convention, and the caller reads it from rax.

This matters for system calls because the program and the kernel communicate through registers. There is no shared memory they can safely write to — the whole point is that they are isolated from each other. Registers are the channel the hardware defines for this.

The Mechanism

A system call is a controlled transfer of execution from user space to kernel space. The word "controlled" is important. The program does not jump to an arbitrary kernel address — it cannot, because kernel memory is not mapped into user space in a way programs can execute. Instead, the program signals its intent, and the CPU transitions to kernel mode at a single, fixed entry point that the kernel itself defined at boot time. The kernel then decides what to do with the request.

The sequence for a typical system call on x86-64 Linux:

Step 1. The program places a number in the rax register. This number is the system call number — a fixed integer that identifies which operation is being requested. On x86-64 Linux, read is 0, write is 1, open is 2, close is 3. These numbers are defined in the kernel source at arch/x86/entry/syscalls/syscall_64.tbl and never change between kernel versions — existing software depends on them remaining stable.

Step 2. Arguments go into the registers rdi, rsi, rdx, r10, r8, and r9, in that order. A write call takes three arguments: a file descriptor, a pointer to the data to write, and the number of bytes. A file descriptor is an integer the kernel assigns when a program opens a file, socket, or any other resource — it is how the program refers back to that resource in future calls. So rdi gets the file descriptor, rsi gets the memory address of the data, rdx gets the byte count. You might notice the fourth argument slot is r10 rather than rcx, which is what the standard C calling convention uses. The reason is hardware: when the CPU executes the syscall instruction, it saves the return address — the point in user space to resume after the call — directly into rcx, overwriting whatever was there. The kernel developers had no choice but to use r10 for the fourth argument instead.

Step 3. The program executes the syscall instruction. This is a single CPU instruction with a specific job: it saves the return address in rcx and the flags register in r11 (so the program can resume exactly where it left off), switches the CPU from ring 3 to ring 0, and jumps to the kernel's system call entry point, a function called entry_SYSCALL_64 whose address the kernel stored in a model-specific register at boot. The CPU does all three things as one indivisible operation: there is no intermediate state where the CPU is at ring 0 but still running the program's code. Everything else, including switching to a kernel stack and saving the remaining registers, is done by the kernel's entry code in its first few instructions.

Step 4. The kernel is now running. It reads the value in rax to know which system call was requested. It looks up the corresponding handler function in a table. It reads the arguments from rdi, rsi, rdx and the rest. It validates everything — checking that the arguments are legitimate, that the process has permission, that the memory addresses it was given actually belong to the calling process. Then it executes the operation.

Step 5. The kernel places the return value in rax and executes the sysret instruction, which is the mirror image of syscall: it switches the CPU back to ring 3 and restores execution at the instruction immediately after the syscall instruction in the program. On success, rax holds zero or a positive number — a byte count, a file descriptor, a PID. On failure, the kernel places a negative error code in rax, such as -2 for ENOENT. The program itself never sees this directly. The C library wrapper checks rax, and if negative, stores the absolute value in the errno variable and returns -1 to the caller. This is why C code checks for -1 and reads errno separately — that two-step is the C library's doing, not the kernel's.

The entire round trip — from the program executing syscall to the kernel completing the operation and returning — takes on the order of 100 to 300 nanoseconds on modern hardware. For comparison, accessing RAM takes around 60 to 100 nanoseconds. A system call is fast, but not free. Programs that call the kernel millions of times per second feel the cost.

What You Never See

In practice, you almost never write any of this yourself. The C standard library (glibc on most Linux systems) wraps nearly every system call in a normal-looking function, and offers the generic syscall() function for the few it does not wrap. When your C program calls write(1, "hello\n", 6), the library's write() function loads the registers, executes syscall, reads the return value from rax, and hands it back to you. The system call number, the register protocol, the kernel entry point: all invisible.

Higher-level languages add more layers. A Python print() call eventually reaches the C library's write(), which reaches the write syscall. A Go fmt.Println() reaches the same syscall through Go's own runtime, which on Linux issues system calls directly rather than going through the C library. A Node.js console.log() goes through the V8 engine and then libuv, which eventually calls the kernel. Each language adds its own layers, but the system call is always at the bottom.

Common System Calls

You do not need to memorize system call numbers. What matters is recognizing the categories and knowing what operations map to them. When you see these names in tool output or error messages, you will know what they represent.

File I/O

Syscall	What it does
`open` / `openat`	Opens a file and returns a file descriptor — an integer handle the kernel uses to track the open file
`read`	Reads bytes from a file descriptor into a buffer in the program's memory
`write`	Writes bytes from a buffer in the program's memory to a file descriptor
`close`	Releases a file descriptor
`stat` / `fstat`	Returns metadata about a file: size, permissions, timestamps
`lseek`	Moves the read/write position within a file

A file descriptor is just an integer: 0, 1, 2, 7, 23. The kernel maintains a table per process mapping these integers to open files, sockets, pipes, and devices. File descriptor 0 is standard input. File descriptor 1 is standard output. File descriptor 2 is standard error. By convention these three are already open before your program's first line runs, inherited from the process that launched it.

Note on open vs openat: x86-64 Linux keeps both for historical reasons, but newer architectures like ARM64 have dropped open entirely from their syscall tables — openat is the only way to open a file. The older open call exists on x86-64 purely so old software does not break.

Process Management

Syscall	What it does
`fork` / `clone`	Creates a new process by duplicating the calling process
`execve`	Replaces the current process image with a new program
`wait4` / `waitpid`	Waits for a child process to change state
`exit` / `exit_group`	Terminates the process
`getpid`	Returns the process's own ID
`kill`	Sends a signal to a process

When you run a command in your shell, the shell calls fork() to create a copy of itself, then calls execve() in the child to replace it with the program you asked for. The shell calls wait4() to pause until the child finishes. This fork-exec-wait sequence is how nearly every process on a Unix system comes into existence; the main exception is the very first process, which the kernel starts itself at boot. On modern Linux, glibc's fork() is actually a wrapper around the clone syscall, which is more general and supports threads as well as processes, but the concept is the same.

Memory

Syscall	What it does
`mmap`	Maps a region of memory — can be anonymous (new RAM) or file-backed
`munmap`	Unmaps a region
`brk`	Moves the end of the heap up or down
`mprotect`	Changes the permissions on a memory region

When a C program calls malloc(), the C library uses either brk() (for small allocations) or mmap() (for large ones) to request memory from the kernel. malloc() is not itself a system call — it is a library function that manages a pool of memory obtained through system calls. The kernel knows nothing about malloc. It only sees brk and mmap.

Networking

Syscall	What it does
`socket`	Creates a new socket and returns a file descriptor
`bind`	Assigns an address to a socket
`listen`	Marks a socket as passive — ready to accept incoming connections
`accept`	Accepts an incoming connection, returns a new file descriptor
`connect`	Initiates a connection to a remote address
`sendto` / `recvfrom`	Sends and receives data

A network connection in Linux is a file descriptor. Once a connection is established, you read from and write to it using the same read and write system calls used for files. The kernel handles the difference. This is the Unix "everything is a file" model in practice — the same interface works for disk files, terminal input, pipes between processes, and TCP connections over the network.

Seeing System Calls in Real Time

strace

strace is a tool that attaches to a process and logs every system call it makes. It uses a kernel facility called ptrace, itself a system call, to stop the traced process at each system call, once on entry and once on exit, so it can record the arguments and the return value.

First, check if it is installed:

strace --version

If it is not installed, on Debian or Ubuntu:

sudo apt install strace

Trace the system calls of a simple command:

strace echo "hello"

The output will be longer than expected. echo "hello" feels instant and trivial, but before it can print anything, the dynamic linker has to load shared libraries, the C library has to initialize, and the process has to set itself up. All of that involves system calls. You will see openat calls loading library files, mmap calls mapping them into memory, and finally a write call sending hello\n to file descriptor 1.

Trim the output to only the calls you care about:

strace -e trace=write echo "hello"

This shows only write calls. You will see one call that writes hello\n to file descriptor 1 (standard output).

Trace an already-running process by its PID:

strace -p <pid>

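This attaches to the process without restarting it, though the process runs slower while it is being traced. You will see its system calls in real time until you press Ctrl-C, which detaches strace and leaves the process running. Attaching to a process you did not start may require root, depending on the system's ptrace restrictions.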

Count system calls and show totals:

strace -c ls /tmp

This runs the command, suppresses the per-call output, and prints a summary table showing every system call ls made, how many times, how long each took, and what fraction of total time was spent in each. It is often the fastest way to understand what a program actually spends its time doing.

Follow child processes as well:

strace -f bash -c "ls /tmp"

The -f flag tells strace to follow fork() calls and trace child processes too. Without it, you only see the parent. With it, each line is prefixed with the PID of the process that made the call.

Reading strace Output

A typical line looks like this:

openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3

The format is always: syscall_name(arguments) = return_value.

Breaking this line down:

openat is the system call being made.
AT_FDCWD is the first argument — it means "relative to the current working directory." openat is the modern version of open; it can open files relative to a directory file descriptor, or relative to the current directory when given AT_FDCWD. Internally, AT_FDCWD is just the integer -100 — a value the kernel reserves specifically for this purpose. strace translates it to the macro name so the output is readable.
"/etc/ld.so.cache" is the file being opened.
O_RDONLY|O_CLOEXEC are flags controlling how the file is opened. O_RDONLY means open for reading only. O_CLOEXEC means automatically close this file descriptor when the process calls execve — a common safety measure to prevent file descriptors leaking into child processes.
= 3 is the return value. The kernel created a new file descriptor and assigned it the number 3. The program can now pass 3 to read to read from this file.

When a call fails, strace shows -1 followed by the error code and a description:

openat(AT_FDCWD, "/etc/missing-file", O_RDONLY) = -1 ENOENT (No such file or directory)

ENOENT is the kernel's integer error code for "no such file or directory." The human-readable string in parentheses is added by strace for convenience — the kernel only returns the integer. The C library converts the negative return value in rax to -1 and stores the error code in errno.

/proc

Every running process has a directory under /proc named after its PID. This filesystem is not stored on disk anywhere — the kernel generates its contents in memory on demand, in response to reads. It is a window into the kernel's live view of each process.

See what files your shell has open:

ls -la /proc/$$/fd

$$ is a variable that expands to the PID of your current shell. The fd directory contains one entry per open file descriptor, each a symlink to whatever the descriptor points to: a file path, a socket, a pipe. File descriptors 0, 1, and 2 — stdin, stdout, stderr — are always present.

See which system call your shell is currently executing or blocked inside:

cat /proc/$$/syscall

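If the file exists and the process is blocked inside a system call, it shows the system call number followed by the call's arguments, the stack pointer, and the program counter in hexadecimal. Run from an interactive shell, the number is usually 61 (wait4), because the shell is waiting for cat to finish. If the process is blocked but not inside a system call, the number is -1. If it is not blocked at all, the file contains just the word running. This is useful when a process appears frozen: you can check exactly what it is waiting for without attaching a debugger. Note: this requires CONFIG_HAVE_ARCH_TRACEHOOK in the kernel build, which is present on most desktop and server kernels and absent on some minimal or container-optimized builds.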

See the full memory map of your current shell:

cat /proc/$$/maps

Each line is one memory region: its virtual address range, its permissions (read, write, execute), and what backs it, such as a file path for shared libraries, [heap] for the anonymous region managed by malloc, or [stack] for the call stack. Regions labeled with a shared library's path were mapped in via mmap by the dynamic linker; the program's own executable was mapped by the kernel during execve. The [heap] region was grown via brk.

Common Errors and Blocking States

The Syscall Fails

System calls signal failure by returning a negative error code in rax. The C library converts this to -1 and stores the absolute value of that code in errno. Common ones:

Error	Meaning
`ENOENT`	File or directory does not exist
`EACCES`	Permission denied
`EBADF`	Invalid file descriptor — not owned by this process, or already closed
`ENOMEM`	The kernel could not allocate memory to fulfill the request
`EAGAIN`	Resource temporarily unavailable — the operation would block, but the descriptor is non-blocking
`EINTR`	The call was interrupted by a signal before it completed
`EFAULT`	The memory address provided is invalid — the kernel refused to use it

EINTR deserves attention. If a signal arrives while a process is blocked inside a slow system call — read waiting for keyboard input, accept waiting for a network connection — the kernel interrupts the call early and returns EINTR instead of completing it. Well-written programs detect this and retry the call. Many programs do not, which causes subtle bugs where operations silently fail when a signal happens to arrive at the wrong moment.

Too Many Open Files

Each process has a limit on how many file descriptors it can hold open simultaneously. On most Linux systems the default soft limit is 1024. The hard limit, the ceiling up to which a process may raise its own soft limit, varies: 4096 is the kernel's built-in default, while recent systemd-based distributions set it much higher. When a process hits the soft limit, open() returns -1 with errno set to EMFILE. This is a common failure mode for servers that accept many concurrent connections: if the code does not call close() when it is done with a connection, the file descriptor count climbs until it hits the ceiling, and new connections start failing.

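Check the current limits for a running process (pgrep -o selects only the oldest matching process, so the path stays valid when several nginx processes are running):

cat /proc/$(pgrep -o nginx)/limits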

Check the limit for your current shell session:

ulimit -n

The Syscall Blocks

Some system calls return immediately. getpid() looks up the PID and returns in well under a microsecond. Others block until something external happens: read() on a terminal waits until the user presses Enter. accept() on a server socket waits until a client connects. wait4() waits until a child process exits.

When a process is blocked inside a system call, it shows as state S (sleeping, interruptible) in process listings. You can see exactly which call it is waiting in:

cat /proc/<pid>/syscall

This is the first thing to check when a process appears hung. If it is blocked in read on a file descriptor, you can look up that descriptor in /proc/<pid>/fd to see what it is reading from. If it is blocked in accept, it is waiting for a network connection. The information is there without needing a debugger, and for your own processes usually without elevated access, though hardened systems may require root.

Kernel-Side Validation and Safety

When the kernel receives a system call, it does not simply execute the operation. It treats the calling process as untrusted and validates every part of the request before touching anything.

Validation

Take a read() call. The program is asking the kernel to read data from file descriptor 3 into a memory buffer at address 0x7fff2000, for 512 bytes.

The kernel checks each part independently:

Does file descriptor 3 exist? The kernel looks up the process's file descriptor table. If there is no entry for 3 — because the process never opened that many files, or because it closed descriptor 3 earlier — the kernel returns EBADF.

Is descriptor 3 open for reading? A file can be opened read-only, write-only, or read-write. If descriptor 3 was opened with O_WRONLY, the kernel returns EBADF. The kernel does not assume that because a descriptor exists it can be used for any operation.

Does the memory address 0x7fff2000 belong to this process? This is the critical check. The address must fall within a region that is mapped into this process and writable, because the kernel is about to write into it; these are the same regions visible in /proc/[pid]/maps. In practice the kernel first rejects any range that reaches outside the user portion of the address space, then lets the copy itself detect unmapped or read-only pages through the page-fault machinery. If the check fails at either stage (the program passed an address from kernel space, an address it simply made up, or an address in a region it has already unmapped), the kernel returns EFAULT and nothing outside the process's own memory is touched. For a write() call, the same check applies in reverse: the source buffer only needs to be readable by the process; it does not need to be writable.

This check is what makes the boundary meaningful. The process cannot trick the kernel into operating outside its own address space by passing a crafted address. Every address the process provides is checked before it is used.

Copying Data Across the Boundary

Once validation passes, the kernel needs to move data between kernel memory and the process's memory. It does not simply dereference the address the process provided: that pointer is untrusted, and the page behind it may be missing, swapped out, or protected. Instead, it uses two internal functions specifically designed for this transfer: copy_to_user and copy_from_user.

copy_to_user(destination, source, count) copies count bytes from a kernel buffer into the process's buffer — used when a syscall returns data to the process, as in read().

copy_from_user(destination, source, count) copies in the opposite direction — from the process's buffer into a kernel buffer — used when a syscall receives data from the process, as in write().

These functions do more than copy bytes. Linux uses demand paging, which means a page the process owns might be on swap disk. If the kernel tried to dereference that address directly, it would fault in kernel mode — a serious problem. copy_to_user and copy_from_user handle page faults gracefully: if a page needs to be loaded from swap during the copy, the kernel handles it safely rather than crashing.

Why This Design Matters

The combination of validation and safe copying lets the kernel treat every system call as coming from an untrusted process. A buggy process that accidentally passes the wrong address gets EFAULT. A compromised process that deliberately passes a kernel memory address gets EFAULT. The outcome is the same: the process receives an error and the kernel's internal state is untouched.

This is the practical meaning of the user space / kernel space separation. Not just an organisational boundary — an enforced one, re-verified on every crossing.

Why This Matters

The system call is not an implementation detail; it is the boundary that makes everything else possible. strace works by using ptrace, which is a system call that lets one process observe another. When the Linux OOM Killer sends SIGKILL to a process, the signal is not delivered instantly: it is delivered the next time the kernel returns execution to user space, whether that is after a system call or after a hardware interrupt like a timer tick. When dmesg reads the kernel ring buffer, it does so by reading /dev/kmsg or by calling syslog(), and both routes go through system calls.

The system call boundary is the seam where software and operating system meet. Most of the time it is invisible — hidden by libraries, runtimes, and frameworks that handle the details. When something goes wrong — a process hangs, a file cannot be opened, memory runs out, a connection is refused — the failure is almost always visible at that seam, if you know where to look.

References

Linux man page: syscalls(2) — the full list of Linux system calls with brief descriptions
Linux man page: errno(3) — all error codes and what they mean
strace documentation — full flag reference and examples
Linux kernel source browser — search and read kernel source in the browser, no download required. The syscall table for x86-64 is at arch/x86/entry/syscalls/syscall_64.tbl