System Calls: How Your Code Talks to the Kernel

Every time a program reads a file, opens a socket, or allocates memory, it makes a system call. Here is what that means, how it works at the hardware level, and how to see it happening in real time.

When you run a program, that program cannot do much on its own. It cannot read a file, send a network packet, or even write a line to the terminal without asking the operating system for help. The mechanism it uses to ask is a system call.

System calls are the only legitimate way for a program to cross from user space into the kernel. Every read, every write, every new process, every memory allocation goes through one. Understanding them explains a large portion of what Linux is actually doing when software runs — and why software built for one operating system does not run on another.


The Privilege Boundary

User Space and Kernel Space

Linux splits memory into two regions: user space and kernel space.

User space is where programs run. Your shell, your browser, your Python scripts — all of these live in user space. Code running in user space operates with strict limits. It cannot access hardware directly. It cannot read another process's memory. It cannot modify the kernel's internal state. If it tries, the CPU raises a fault and the kernel terminates it.

Kernel space is where the operating system itself runs. The kernel has unrestricted access to hardware, to all physical memory, and to every process on the system. Code running in kernel space can do anything. That power is also why a bug in kernel space can bring down the entire machine — there is nothing above it to catch the mistake.

This separation is enforced by the CPU, not by convention. Modern CPUs operate in different privilege levels — sometimes called rings. The kernel runs at ring 0, which means full privilege: every instruction is permitted. User programs run at ring 3, which means restricted: certain instructions are forbidden entirely. The hardware itself enforces the boundary. A user space program that tries to execute a privileged instruction does not just get ignored — the CPU raises an exception immediately, before the instruction completes.

This design predates Linux. It goes back to the 1960s, when engineers working on early time-sharing systems realized that if every program could touch every part of memory, one buggy program could corrupt every other program running on the same machine. The ring model was the solution. Linux inherits it from Unix, which inherited it from that earlier era.

The Problem This Creates

The separation solves a security and stability problem, but it creates a practical one: programs need to do things that require kernel access. A text editor needs to read files. A web server needs to open network connections. A shell needs to start new processes.

None of those things can be done from user space alone. The program needs to ask the kernel to do them on its behalf. The system call is that request — a controlled crossing of the boundary between the two worlds.


What a System Call Actually Is

Registers: The CPU's Working Memory

Before explaining how a system call works at the hardware level, it helps to understand one concept: CPU registers.

A register is a small, named slot of storage built directly into the processor. Unlike RAM, which is a separate chip the CPU accesses over a bus, registers are inside the CPU itself. Reading or writing a register takes a single clock cycle. Reading RAM takes hundreds. Registers are where the CPU does its immediate work — they hold the values currently being operated on.

On x86-64 processors (the architecture in most desktops, laptops, and servers), the general-purpose registers are named rax, rbx, rcx, rdx, rsi, rdi, rsp, rbp, and r8 through r15. Each holds 64 bits (8 bytes) of data. When the CPU executes an instruction like "add these two numbers," those numbers come from registers. When a function returns a value, that value is placed in rax by convention, and the caller reads it from rax.

This matters for system calls because the program and the kernel communicate through registers. There is no shared memory they can safely write to — the whole point is that they are isolated from each other. Registers are the channel the hardware defines for this.

The Mechanism

A system call is a controlled transfer of execution from user space to kernel space. The word "controlled" is important. The program does not jump to an arbitrary kernel address — it cannot, because kernel memory is not mapped into user space in a way programs can execute. Instead, the program signals its intent, and the CPU transitions to kernel mode at a single, fixed entry point that the kernel itself defined at boot time. The kernel then decides what to do with the request.

The sequence for a typical system call on x86-64 Linux:

Step 1. The program places a number in the rax register. This number is the system call number — a fixed integer that identifies which operation is being requested. On x86-64 Linux, read is 0, write is 1, open is 2, close is 3. These numbers are defined in the kernel source at arch/x86/entry/syscalls/syscall_64.tbl and never change between kernel versions — existing software depends on them remaining stable.

Step 2. Arguments go into the registers rdi, rsi, rdx, r10, r8, and r9, in that order. A write call takes three arguments: a file descriptor, a pointer to the data to write, and the number of bytes. A file descriptor is an integer the kernel assigns when a program opens a file, socket, or any other resource — it is how the program refers back to that resource in future calls. So rdi gets the file descriptor, rsi gets the memory address of the data, rdx gets the byte count. You might notice the fourth argument slot is r10 rather than rcx, which is what the standard C calling convention uses. The reason is hardware: when the CPU executes the syscall instruction, it saves the return address — the point in user space to resume after the call — directly into rcx, overwriting whatever was there. The kernel developers had no choice but to use r10 for the fourth argument instead.

Step 3. The program executes the syscall instruction. This is a single CPU instruction with a specific job: it saves the return address in rcx and the flags register in r11 (so the program can resume exactly where it left off), switches the CPU from ring 3 to ring 0, and jumps to the kernel's system call entry point, a function called entry_SYSCALL_64. The CPU does all three things as one indivisible operation. There is no intermediate state where the CPU is at ring 0 but still running the program's code. The kernel's entry code then saves the remaining registers itself.

Step 4. The kernel is now running. It reads the value in rax to know which system call was requested. It looks up the corresponding handler function in a table. It reads the arguments from rdi, rsi, rdx and the rest. It validates everything — checking that the arguments are legitimate, that the process has permission, that the memory addresses it was given actually belong to the calling process. Then it executes the operation.

Step 5. The kernel places the return value in rax and executes the sysret instruction, which is the mirror image of syscall: it switches the CPU back to ring 3 and restores execution at the instruction immediately after the syscall instruction in the program. On success, rax holds zero or a positive number — a byte count, a file descriptor, a PID. On failure, the kernel places a negative error code in rax, such as -2 for ENOENT. The program itself never sees this directly. The C library wrapper checks rax, and if negative, stores the absolute value in the errno variable and returns -1 to the caller. This is why C code checks for -1 and reads errno separately — that two-step is the C library's doing, not the kernel's.
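The five steps above can be approximated from Python without writing assembly. This is a minimal sketch, assuming x86-64 Linux (where write is number 1, as stated above) and a C library that exports the generic syscall() helper. The name raw_write is made up for illustration, and on any other platform the function simply falls back to os.write.

```python
import ctypes
import os
import platform

SYS_WRITE_X86_64 = 1  # from the text: write is syscall number 1 on x86-64 Linux


def raw_write(fd: int, data: bytes) -> int:
    """Request write(fd, data, len(data)) by syscall number (hypothetical helper)."""
    if platform.system() != "Linux" or platform.machine() != "x86_64":
        # The assumption does not hold here; let Python's own wrapper make the call.
        return os.write(fd, data)

    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    # libc's syscall() puts the number in rax and the three arguments in
    # rdi, rsi, and rdx, then executes the syscall instruction.
    result = libc.syscall(
        ctypes.c_long(SYS_WRITE_X86_64),
        ctypes.c_long(fd),
        ctypes.c_char_p(data),
        ctypes.c_size_t(len(data)),
    )
    if result < 0:
        # The kernel returned a negative code; libc stored its absolute value
        # in errno and handed back -1.
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return result


if __name__ == "__main__":
    raw_write(1, b"hello\n")  # file descriptor 1 is standard output
```

Passing an invalid descriptor such as -1 raises OSError with errno set to EBADF, which is the errno conversion from Step 5 seen from the caller's side.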

The entire round trip — from the program executing syscall to the kernel completing the operation and returning — takes on the order of 100 to 300 nanoseconds on modern hardware. For comparison, accessing RAM takes around 60 to 100 nanoseconds. A system call is fast, but not free. Programs that call the kernel millions of times per second feel the cost.

What You Never See

In practice, you almost never write any of this yourself. The C standard library — glibc on most Linux systems — wraps every system call in a normal-looking function. When your C program calls write(1, "hello\n", 6), the library's write() function loads the registers, executes syscall, reads the return value from rax, and hands it back to you. The system call number, the register protocol, the kernel entry point — all invisible.

Higher-level languages add more layers. A Python print() statement eventually reaches the C library's write(), which reaches the write syscall. A Go fmt.Println() does the same through Go's own runtime. A Node.js console.log() goes through the V8 engine and then libuv, which eventually calls the kernel. Each language adds a layer on top of the one below it, but the system call is always at the bottom. Every layer above it exists to make the kernel's raw interface easier to use.

The Exception: vDSO

There is one case where Linux sidesteps the syscall mechanism entirely for performance reasons.

Operations like reading the current time with gettimeofday() are called so frequently that the 100 to 300 nanosecond cost of switching to ring 0 adds up. To avoid it, Linux uses a mechanism called vDSO (virtual dynamic shared object). When a program starts, the kernel maps a small region of memory into the process's address space. This region holds a few kernel-provided routines together with read-only data, including the current time, that the kernel keeps up to date. When a program calls gettimeofday(), the C library calls into this region and gets the answer without executing a syscall instruction. No ring switch. No kernel entry. The answer is already in user space.

If you run strace on a program that reads the time, you will usually not see a gettimeofday syscall in the output, because none was made. vDSO calls are invisible to strace because they never cross the ring 0 boundary that strace monitors.
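One way to confirm that the mapping exists is to look for the region Linux labels [vdso] in a process's memory map. The sketch below is illustrative only: find_vdso is a made-up helper that scans text in the /proc/<pid>/maps format and returns the address range of that region, if any.

```python
from typing import Optional


def find_vdso(maps_text: str) -> Optional[str]:
    """Return the address range of the [vdso] mapping, or None if absent."""
    for line in maps_text.splitlines():
        fields = line.split()
        # The label, when present, is the last field on the line.
        if fields and fields[-1] == "[vdso]":
            return fields[0]
    return None


if __name__ == "__main__":
    # Linux only: /proc/self/maps describes the current process.
    with open("/proc/self/maps") as maps:
        print(find_vdso(maps.read()))
```

The address printed typically changes from run to run, since the kernel usually randomizes where the region is placed.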


System Calls Are Not Portable

The Same Concept, Different Implementations

The concept of a system call — a controlled privilege transition through a fixed interface — is universal. Every modern operating system has it. Linux, Windows, macOS, FreeBSD — all of them enforce user space and kernel space, and all of them require programs to cross the boundary through a defined gate.

What is not universal is the implementation. The specific system calls, their numbers, their arguments, and the hardware instruction used to invoke them are different on every operating system. This difference is why software built for Linux does not run on Windows, and why a binary compiled for macOS does not run on Linux.

Linux vs. Windows

Linux defines around 300 to 400 system calls depending on the architecture. Windows defines its own set, internally called NT system calls — NtReadFile, NtWriteFile, NtCreateProcess. The names are different, the numbers are different, the argument conventions are different, and the philosophy is different.

Linux follows the Unix model: almost everything is a file descriptor. A socket is a file descriptor. A pipe is a file descriptor. A timer is a file descriptor. Once you have one, you can read and write it with the same read and write system calls you use for files. The kernel handles the difference underneath.

Windows does not follow this model. It has separate handle types for files, sockets, threads, processes, and registry keys, each with its own family of API calls. There is no single "write to this thing" call. You use WriteFile for files, send for sockets, and so on. Neither model is wrong — they reflect different design decisions made in the 1970s and 1980s that every subsequent version has had to maintain for compatibility.

The calling convention also differs. Both Linux and Windows on x86-64 use the syscall instruction at the hardware level, but the register assignments are different, and the entry points are different. A Linux binary that executes syscall with rax=1 is asking Linux for write. The same instruction on Windows means something entirely different or maps to nothing at all.

Linux vs. macOS

macOS shares Unix heritage. It is built on a kernel called XNU, which itself is built on Mach and parts of BSD. Many macOS system calls have the same names as Linux system calls — open, read, write, fork all exist. The semantics are often similar.

But the numbers are different, and the difference goes deeper than counting from a different starting point. On macOS, write is system call number 4, inherited from BSD. macOS also groups its system calls into classes using the upper bits of the number. BSD calls use a class offset of 0x2000000, so the actual value placed in rax for a write call is 0x2000004, not just 4. Passing 4 alone selects class 0, which is not a valid class; Mach traps, for comparison, sit in their own class at 0x1000000. A Linux binary that places 1 in rax therefore does not reach write at all. macOS also adds its own syscalls for Apple-specific features that have no Linux equivalent.
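The class arithmetic is easy to check by hand. A small sketch, with made-up helper names, and with the 24-bit shift inferred from the 0x2000000 offset rather than taken from the text:

```python
BSD_CLASS_OFFSET = 0x2000000  # class offset for BSD calls, as given in the text
CLASS_SHIFT = 24              # assumption: 0x2000000 is 2 << 24, so the class sits above bit 23


def macos_bsd_syscall(number: int) -> int:
    """Combine a BSD syscall number with the BSD class offset."""
    return BSD_CLASS_OFFSET | number


def syscall_class(value: int) -> int:
    """Extract the class field from a combined syscall value."""
    return value >> CLASS_SHIFT


if __name__ == "__main__":
    print(hex(macos_bsd_syscall(4)))         # 0x2000004
    print(syscall_class(0x2000004))          # 2
    print(syscall_class(4))                  # 0
```

A bare 4 carries class 0, which is why the number by itself does not select the BSD write call.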

On Apple Silicon Macs (ARM processors), the hardware instruction for a system call is svc, not syscall. A Linux binary compiled for x86-64 would not even contain the right instruction, let alone the right number.

Why WSL Exists

Windows Subsystem for Linux is Microsoft's answer to this incompatibility. WSL 1 intercepted Linux system calls at runtime and translated them into Windows kernel calls — when a Linux binary executed syscall with rax=1 (Linux write), WSL caught it, translated it into NtWriteFile, and returned the result in the format Linux expected.

This translation layer proved too difficult to maintain with full accuracy. Some Linux syscalls had no clean Windows equivalent, and edge cases accumulated. WSL 2 abandoned the translation approach entirely. It runs a real Linux kernel inside a lightweight Hyper-V virtual machine. Linux binaries make real Linux syscalls to a real Linux kernel — no translation required. The compatibility is near-perfect because it is genuine.

What This Means for Software

When someone says software is "cross-platform," they usually mean the source code can be compiled for multiple operating systems. The same C source file that calls write() will compile and run on Linux, macOS, and Windows — because the C library on each platform provides a write() function that calls the local kernel the right way. The abstraction the C library provides is what enables portability, not the syscall interface itself.

When software is distributed as a compiled binary, it is tied to a specific operating system and architecture. A Linux x86-64 binary contains Linux syscall numbers for x86-64. It will not run on macOS, Windows, or even Linux on ARM without translation.


Common System Calls

You do not need to memorize system call numbers. What matters is recognizing the categories and knowing what operations map to them. When you see these names in tool output or error messages, you will know what they represent.

File I/O

Syscall             What it does
open / openat       Opens a file and returns a file descriptor, an integer handle the kernel uses to track the open file
read                Reads bytes from a file descriptor into a buffer in the program's memory
write               Writes bytes from a buffer in the program's memory to a file descriptor
close               Releases a file descriptor
stat / fstat        Returns metadata about a file: size, permissions, timestamps
lseek               Moves the read/write position within a file

A file descriptor is just an integer: 0, 1, 2, 7, 23. The kernel maintains a table per process mapping these integers to open files, sockets, pipes, and devices. File descriptor 0 is standard input. File descriptor 1 is standard output. File descriptor 2 is standard error. By convention these three are already open when your program starts, inherited from the parent process before your program's first line runs.

Note on open vs openat: x86-64 Linux keeps both for historical reasons, but newer architectures like ARM64 have dropped open entirely from their syscall tables — openat is the only way to open a file. The older open call exists on x86-64 purely so old software does not break.
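Python's os module exposes thin wrappers over most of these calls, which makes the table easy to try out. A minimal sketch; the function name file_io_roundtrip and the sample text are invented for illustration:

```python
import os
import tempfile


def file_io_roundtrip(path: str):
    """Exercise open, write, fstat, lseek, read, and close on one file."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)  # open / openat
    try:
        written = os.write(fd, b"hello kernel\n")  # write: returns the byte count
        size = os.fstat(fd).st_size                # fstat: metadata for the open file
        os.lseek(fd, 6, os.SEEK_SET)               # lseek: move to byte offset 6
        tail = os.read(fd, 64)                     # read: up to 64 bytes from there
    finally:
        os.close(fd)                               # close: release the descriptor
    return fd, written, size, tail


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        print(file_io_roundtrip(os.path.join(directory, "demo.txt")))
```

The descriptor returned by os.open is a plain integer, usually 3 in a fresh process because 0, 1, and 2 are already taken.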

Process Management

Syscall             What it does
fork / clone        Creates a new process by duplicating the calling process
execve              Replaces the current process image with a new program
wait4 / waitpid     Waits for a child process to change state
exit / exit_group   Terminates the process
getpid              Returns the process's own ID
kill                Sends a signal to a process

When you run a command in your shell, the shell calls fork() to create a copy of itself, then calls execve() in the child to replace it with the program you asked for. The shell calls wait4() to pause until the child finishes. This fork-exec-wait sequence is how almost every process on a Unix system comes into existence; the main exception is the first process, init, which the kernel starts itself. On modern Linux, glibc's fork() is actually a wrapper around the clone syscall, which is more general and supports threads as well as processes, but the concept is the same.
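The fork-exec-wait sequence can be reproduced in a few lines. This is a sketch of what a shell does in spirit, not how any particular shell is implemented; run_and_wait is a hypothetical name, and the code assumes a Unix-like system where os.fork is available.

```python
import os
import sys
from typing import List


def run_and_wait(argv: List[str]) -> int:
    """Run a program the way a shell would and return its exit status."""
    pid = os.fork()                    # fork: the child gets 0, the parent gets the child's PID
    if pid == 0:
        # Child: replace this process image with the requested program.
        try:
            os.execvp(argv[0], argv)   # execve, after a PATH search
        finally:
            os._exit(127)              # only reached if the exec failed
    _, status = os.waitpid(pid, 0)     # wait: block until the child changes state
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)        # killed by a signal: report it as a negative number


if __name__ == "__main__":
    print(run_and_wait([sys.executable, "-c", "raise SystemExit(7)"]))  # 7
```

The exit code 127 for a failed exec follows the convention shells use for "command not found".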

Memory

Syscall             What it does
mmap                Maps a region of memory; can be anonymous (new RAM) or file-backed
munmap              Unmaps a region
brk                 Moves the end of the heap up or down
mprotect            Changes the permissions on a memory region

When a C program calls malloc(), the C library uses either brk() (for small allocations) or mmap() (for large ones) to request memory from the kernel. malloc() is not itself a system call — it is a library function that manages a pool of memory obtained through system calls. The kernel knows nothing about malloc. It only sees brk and mmap.
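An anonymous mapping can be requested directly, bypassing malloc. A minimal sketch using Python's mmap module, which wraps the mmap and munmap calls; the function name is made up for illustration:

```python
import mmap


def anonymous_page_demo() -> bytes:
    """Map one anonymous page, write to it, and return its first 8 bytes."""
    region = mmap.mmap(-1, mmap.PAGESIZE)  # fileno -1 asks for an anonymous mapping
    try:
        region[:5] = b"hello"              # ordinary memory writes, no syscall involved
        return bytes(region[:8])           # untouched bytes read back as zero
    finally:
        region.close()                     # unmaps the region


if __name__ == "__main__":
    print(anonymous_page_demo())  # b'hello\x00\x00\x00'
```

Once the mapping exists, reading and writing it involves no further system calls; the kernel is only consulted again when the region is unmapped or a page has to be faulted in.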

Networking

Syscall             What it does
socket              Creates a new socket and returns a file descriptor
bind                Assigns an address to a socket
listen              Marks a socket as passive, ready to accept incoming connections
accept              Accepts an incoming connection, returns a new file descriptor
connect             Initiates a connection to a remote address
sendto / recvfrom   Sends and receives data

A network connection in Linux is a file descriptor. Once a connection is established, you read from and write to it using the same read and write system calls used for files. The kernel handles the difference. This is the Unix "everything is a file" model in practice — the same interface works for disk files, terminal input, pipes between processes, and TCP connections over the network.
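The claim that a connection is just a file descriptor can be tested over the loopback interface. The sketch below assumes 127.0.0.1 is usable and keeps the payload short enough to arrive in a single read; loopback_roundtrip is an invented name. It sets up a TCP connection with the calls from the table, then moves data with plain os.write and os.read rather than socket-specific methods.

```python
import os
import socket


def loopback_roundtrip(payload: bytes) -> bytes:
    """Send payload over a local TCP connection using plain read and write."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # socket
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        server.bind(("127.0.0.1", 0))         # bind: port 0 lets the kernel pick one
        server.listen(1)                      # listen: mark the socket as passive
        client.connect(server.getsockname())  # connect: queued until accepted
        conn, _ = server.accept()             # accept: returns a new descriptor
        try:
            os.write(client.fileno(), payload)           # the same write used for files
            return os.read(conn.fileno(), len(payload))  # the same read used for files
        finally:
            conn.close()
    finally:
        client.close()
        server.close()


if __name__ == "__main__":
    print(loopback_roundtrip(b"ping"))
```

A production reader would loop until it has all the bytes it expects, since a TCP read may return less than was asked for.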


Seeing System Calls in Real Time

strace

strace is a tool that attaches to a process and logs every system call it makes. It uses a kernel facility called ptrace — itself a system call — to intercept execution at each syscall instruction before and after the kernel handles it.

First, check if it is installed:

strace --version

If it is not installed, on Debian or Ubuntu:

sudo apt install strace

Trace the system calls of a simple command:

strace echo "hello"

The output will be longer than expected. echo "hello" feels instant and trivial, but before it can print anything, the dynamic linker has to load shared libraries, the C library has to initialize, and the process has to set itself up. All of that involves system calls. You will see openat calls loading library files, mmap calls mapping them into memory, and finally a write call sending hello\n to file descriptor 1.

Trim the output to only the calls you care about:

strace -e trace=write echo "hello"

This shows only write calls. You will see one call that writes hello\n to file descriptor 1 (standard output).

Trace an already-running process by its PID:

strace -p <pid>

This attaches to the process without interrupting it. You will see its system calls in real time until you press Ctrl-C.

Count system calls and show totals:

strace -c ls /tmp

This runs the command, suppresses the per-call output, and prints a summary table showing every system call ls made, how many times, how long each took, and what fraction of total time was spent in each. It is often the fastest way to understand what a program actually spends its time doing.

Follow child processes as well:

strace -f bash -c "ls /tmp"

The -f flag tells strace to follow fork() calls and trace child processes too. Without it, you only see the parent. With it, each line is prefixed with the PID of the process that made the call.

Reading strace Output

A typical line looks like this:

openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3

The format is always: syscall_name(arguments) = return_value.

Breaking this line down:

  • openat is the system call being made.
  • AT_FDCWD is the first argument — it means "relative to the current working directory." openat is the modern version of open; it can open files relative to a directory file descriptor, or relative to the current directory when given AT_FDCWD. Internally, AT_FDCWD is just the integer -100 — a value the kernel reserves specifically for this purpose. strace translates it to the macro name so the output is readable.
  • "/etc/ld.so.cache" is the file being opened.
  • O_RDONLY|O_CLOEXEC are flags controlling how the file is opened. O_RDONLY means open for reading only. O_CLOEXEC means automatically close this file descriptor when the process calls execve — a common safety measure to prevent file descriptors leaking into child processes.
  • = 3 is the return value. The kernel created a new file descriptor and assigned it the number 3. The program can now pass 3 to read to read from this file.

When a call fails, strace shows -1 followed by the error code and a description:

openat(AT_FDCWD, "/etc/missing-file", O_RDONLY) = -1 ENOENT (No such file or directory)

ENOENT is the kernel's integer error code for "no such file or directory." The human-readable string in parentheses is added by strace for convenience — the kernel only returns the integer. The C library converts the negative return value in rax to -1 and stores the error code in errno.
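Because the line format is so regular, simple strace output can be picked apart with a regular expression. This is a rough sketch that handles only complete lines with a decimal return value, like the two shown above. Real strace output also contains hexadecimal return values, unfinished and resumed calls, and signal lines, all of which this ignores. The names LINE, StraceCall, and parse_strace_line are made up for illustration.

```python
import re
from typing import NamedTuple, Optional

# name(args) = ret, optionally followed by ERRNAME (description)
LINE = re.compile(
    r"^(?P<name>\w+)\((?P<args>.*)\)\s+=\s+(?P<ret>-?\d+)"
    r"(?:\s+(?P<err>[A-Z]+)\s+\((?P<desc>.*)\))?$"
)


class StraceCall(NamedTuple):
    name: str
    args: str
    ret: int
    err: Optional[str]   # symbolic error name, e.g. ENOENT
    desc: Optional[str]  # human-readable text that strace appends


def parse_strace_line(line: str) -> Optional[StraceCall]:
    """Parse one simple strace line; return None if it does not fit the pattern."""
    match = LINE.match(line.strip())
    if match is None:
        return None
    return StraceCall(
        match.group("name"),
        match.group("args"),
        int(match.group("ret")),
        match.group("err"),
        match.group("desc"),
    )
```

Filtering for entries whose err field is set is a quick way to list only the calls that failed.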

/proc

Every running process has a directory under /proc named after its PID. This filesystem is not stored on disk anywhere — the kernel generates its contents in memory on demand, in response to reads. It is a window into the kernel's live view of each process.

See what files your current shell currently has open:

ls -la /proc/$$/fd

$$ is a shell variable that expands to the PID of your current shell, so the path always refers to the shell process you are typing in. The fd directory contains one entry per open file descriptor, each a symlink to whatever the descriptor points to: a file path, a socket, a pipe. File descriptors 0, 1, and 2 (stdin, stdout, stderr) are normally present.
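The same directory can be read from code. A minimal, Linux-only sketch; open_descriptors is a made-up helper, and "self" is the /proc shortcut for the calling process:

```python
import os
from typing import Dict


def open_descriptors(pid: str = "self") -> Dict[int, str]:
    """Map each open file descriptor of a process to what it points at."""
    fd_dir = f"/proc/{pid}/fd"
    result = {}
    for name in os.listdir(fd_dir):
        try:
            result[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except FileNotFoundError:
            # The descriptor was closed between listing and reading, for
            # example the one os.listdir itself used to scan the directory.
            continue
    return result


if __name__ == "__main__":
    for fd, target in sorted(open_descriptors().items()):
        print(fd, "->", target)
```

Sockets and pipes appear with targets such as socket:[12345] or pipe:[67890] rather than file paths.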

See which system call your shell is currently executing or blocked inside (available on most desktop and server kernels; absent on some minimal or container-optimized builds):

cat /proc/$$/syscall

If the file exists, it shows the number of the system call the process is currently in, followed by its arguments in hexadecimal. If the process is running rather than blocked, it shows running; if it is blocked but not inside a system call, the number is -1. If the file is missing entirely, the kernel was compiled without this feature, which requires CONFIG_HAVE_ARCH_TRACEHOOK. This is useful when a process appears frozen: you can check exactly what it is waiting for without attaching a debugger.
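The contents of this file are terse, so a small parser helps when reading them. The sketch below assumes the layout described in the proc manual page: a decimal syscall number, six argument values in hexadecimal, then the stack pointer and program counter. The names SyscallState and parse_proc_syscall are invented for illustration.

```python
from typing import List, NamedTuple, Optional


class SyscallState(NamedTuple):
    running: bool           # True if the process was not blocked at all
    number: Optional[int]   # syscall number, or None if not inside a syscall
    args: List[int]         # the six argument registers, if available


def parse_proc_syscall(text: str) -> SyscallState:
    """Interpret the contents of /proc/<pid>/syscall."""
    fields = text.split()
    if not fields or fields[0] == "running":
        return SyscallState(True, None, [])
    number = int(fields[0])
    if number < 0:
        # -1 means blocked, but not inside a system call.
        return SyscallState(False, None, [])
    # The next six fields are the argument registers; the final two
    # (stack pointer and program counter) are ignored here.
    args = [int(field, 16) for field in fields[1:7]]
    return SyscallState(False, number, args)
```

On x86-64 Linux a number of 0 means the process is inside read, and the first argument is then the file descriptor it is reading from.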

See the full memory map of your current shell:

cat /proc/$$/maps

Each line is one memory region: its virtual address range, its permissions (read, write, execute), and what backs it. That can be a file path for shared libraries, [heap] for the anonymous region managed by malloc, or [stack] for the call stack. Every region labeled with a file path is a file-backed mapping: shared libraries are mapped in by the dynamic linker via mmap, and the executable itself is mapped by the kernel during execve. The [heap] region was grown via brk.
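Each line of the map splits cleanly into fields. A minimal sketch of a parser, with made-up names, that keeps the address range, the permission string, and the label:

```python
from typing import NamedTuple


class Region(NamedTuple):
    start: int   # first address in the region
    end: int     # first address past the region
    perms: str   # e.g. "r-xp" or "rw-p"
    label: str   # file path, [heap], [stack], or "" for unlabeled anonymous memory


def parse_maps_line(line: str) -> Region:
    """Parse one line of /proc/<pid>/maps."""
    # Fields: address range, permissions, offset, device, inode, optional label.
    fields = line.split(None, 5)
    start, end = (int(part, 16) for part in fields[0].split("-"))
    label = fields[5].strip() if len(fields) > 5 else ""
    return Region(start, end, fields[1], label)


if __name__ == "__main__":
    # Linux only: print the size and label of every writable region.
    with open("/proc/self/maps") as maps:
        for raw in maps:
            region = parse_maps_line(raw)
            if "w" in region.perms:
                print(region.end - region.start, region.label)
```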


Common Errors and Blocking States

The Syscall Fails

System calls signal failure by returning a negative error code in rax. The C library converts this to -1 and stores the absolute value of that code in errno. Common ones:

Error     Meaning
ENOENT    File or directory does not exist
EACCES    Permission denied: the process does not have the rights to perform this operation
EBADF     Invalid file descriptor: the process passed a descriptor it does not own, or one that is already closed
ENOMEM    The kernel could not allocate memory to fulfill the request
EAGAIN    Resource temporarily unavailable: the operation would block, but the file descriptor is set to non-blocking mode
EINTR     The call was interrupted by a signal before it completed
EFAULT    The memory address the program provided is invalid, so the kernel refused to use it

EINTR deserves attention. If a signal arrives while a process is blocked inside a slow system call — read waiting for keyboard input, accept waiting for a network connection — the kernel interrupts the call early and returns EINTR instead of completing it. Well-written programs detect this and retry the call. Many programs do not, which causes subtle bugs where operations silently fail when a signal happens to arrive at the wrong moment.
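In Python the errno value arrives attached to an OSError exception rather than in a global variable, but the codes are the same ones listed above. A small sketch, with a made-up helper name, that triggers a failing call and reports the symbolic name of the error:

```python
import errno
import os
from typing import Callable, Optional


def errno_name(call: Callable[[], object]) -> Optional[str]:
    """Run call() and return the symbolic errno name if it fails, else None."""
    try:
        call()
    except OSError as exc:
        return errno.errorcode.get(exc.errno)
    return None


if __name__ == "__main__":
    print(errno_name(lambda: os.open("/nonexistent-dir/missing-file", os.O_RDONLY)))
```

Python 3.5 and later retry most system calls automatically when they fail with EINTR (PEP 475), so the manual retry loop matters mainly in C and in older runtimes.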

Too Many Open Files

Each process has a limit on how many file descriptors it can hold open simultaneously. On many Linux systems the default soft limit is 1024; the hard limit varies, from 4096 on older setups to far higher values on recent distributions. When a process hits its soft limit, open() returns -1 with errno set to EMFILE. This is a common failure mode for servers that accept many concurrent connections: if the code does not call close() when it is done with a connection, the file descriptor count climbs until it hits the ceiling, and new connections start failing.

Check the current limits for a running process (pgrep -o picks the oldest matching process, since nginx usually runs several):

cat /proc/$(pgrep -o nginx)/limits

Check the limit for your current shell session:

ulimit -n
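A program can read its own limits as well. A minimal sketch using Python's resource module, which wraps the getrlimit call; descriptor_limits is a made-up name:

```python
import resource
from typing import Tuple


def descriptor_limits() -> Tuple[int, int]:
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = descriptor_limits()
    # resource.RLIM_INFINITY stands for "no limit".
    print("soft:", soft, "hard:", hard)
```

The soft value is the one that triggers EMFILE; a process may raise it up to the hard value without special privileges.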

The Syscall Blocks

Some system calls return immediately. getpid() looks up the PID and returns in well under a microsecond. Others block until something external happens: read() on a terminal waits until the user presses Enter. accept() on a server socket waits until a client connects. wait4() waits until a child process exits.

When a process is blocked inside a system call, it shows as state S (sleeping, interruptible) in process listings. You can see exactly which call it is waiting in:

cat /proc/<pid>/syscall

This is the first thing to check when a process appears hung. If it is blocked in read on a file descriptor, you can look up that descriptor in /proc/<pid>/fd to see what it is reading from. If it is blocked in accept, it is waiting for a network connection. For processes you own, the information is there without needing a debugger or root access; inspecting another user's process requires elevated privileges.


Kernel-Side Validation and Safety

When the kernel receives a system call, it does not simply execute the operation. It treats the calling process as untrusted and validates every part of the request before touching anything.

Validation

Take a read() call. The program is asking the kernel to read data from file descriptor 3 into a memory buffer at address 0x7fff2000, for 512 bytes.

The kernel checks each part independently:

Does file descriptor 3 exist? The kernel looks up the process's file descriptor table. If there is no entry for 3 — because the process never opened that many files, or because it closed descriptor 3 earlier — the kernel returns EBADF. The process receives an error. Nothing else happens.

Is descriptor 3 open for reading? A file can be opened read-only, write-only, or read-write. If descriptor 3 was opened with O_WRONLY, the kernel returns EBADF. The kernel does not assume that because a descriptor exists it can be used for any operation.

Does the memory address 0x7fff2000 belong to this process? This is the critical check. The program told the kernel: "write the data you read into memory at this address." If the kernel wrote to that address without checking, a malicious program could pass the address of kernel memory and have the kernel overwrite its own data structures with bytes the program controls.

The kernel checks the address against the process's memory map, the same map visible in /proc/<pid>/maps. The address must fall within a region that belongs to this process and is writable. If it does not (because the program passed an address from kernel space, an address it simply made up, or an address in a region it has already unmapped), the kernel returns EFAULT. No data is read. No memory is written.

This check is what makes the boundary meaningful. The process cannot trick the kernel into operating outside its own address space by passing a crafted address. Every address the process provides is checked before it is used.
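Both rejections can be provoked deliberately. The sketch below assumes a Unix-like system where ctypes can reach the C library's write() directly; the function names are made up for illustration. The first helper hands the kernel a buffer address of 1, which no normal process has mapped, and the second tries to read from the write-only end of a pipe.

```python
import ctypes
import errno
import os


def write_from_bad_address() -> int:
    """Ask write() to copy 8 bytes from address 1; return the resulting errno."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.write.argtypes = [ctypes.c_int, ctypes.c_void_p, ctypes.c_size_t]
    libc.write.restype = ctypes.c_ssize_t
    r, w = os.pipe()
    try:
        ctypes.set_errno(0)
        result = libc.write(w, 1, 8)  # the pointer is bogus; the kernel must refuse it
        return ctypes.get_errno() if result < 0 else 0
    finally:
        os.close(r)
        os.close(w)


def read_from_write_only_fd() -> int:
    """Try to read from the write end of a pipe; return the resulting errno."""
    r, w = os.pipe()
    try:
        os.read(w, 1)  # the descriptor exists but is not open for reading
        return 0
    except OSError as exc:
        return exc.errno
    finally:
        os.close(r)
        os.close(w)


if __name__ == "__main__":
    print(errno.errorcode[write_from_bad_address()])   # EFAULT
    print(errno.errorcode[read_from_write_only_fd()])  # EBADF
```

In both cases the process simply gets an error back and keeps running, which is the behavior the validation is meant to guarantee.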

Copying Data Across the Boundary

Once validation passes, the kernel needs to move data between kernel memory and the process's memory. It does not simply dereference the address the process provided: the pointer comes from an untrusted source, and the page behind it may not be present in RAM. Instead, it uses two internal functions specifically designed for this transfer: copy_to_user and copy_from_user.

copy_to_user(destination, source, count) copies count bytes from a kernel buffer at source to a user space address at destination. It is used when a syscall returns data to the process — for example, a read() call copies file data from a kernel buffer into the process's buffer.

copy_from_user(destination, source, count) copies in the opposite direction — from a user space address into a kernel buffer. It is used when a syscall receives data from the process — for example, a write() call copies the data the process wants to write from the process's buffer into a kernel buffer before sending it to the file or device.

These functions do more than copy bytes. They handle a specific problem: the process's memory address might be valid — it passed the validation check — but the page of memory it refers to might not currently be in RAM. Linux uses demand paging, which means physical memory is allocated only when first accessed. A page the process owns might be sitting on swap disk. If the kernel tries to dereference that address directly, it would cause a fault in kernel mode, which is a serious problem.

copy_to_user and copy_from_user are written to handle page faults gracefully. If a fault occurs during the copy — because a page needs to be loaded from swap — the kernel handles it safely rather than crashing. When the transfer is complete, both sides have their data, and the call proceeds normally.

Why This Design Matters

The combination of validation and safe copying means that from the kernel's perspective, every system call arrives from a process the kernel treats as untrusted. Not because processes are malicious, but because the kernel cannot assume they are not. A buggy process might pass garbage arguments. A compromised process might pass crafted arguments designed to reach memory it should not see.

The validation catches both. A bug that accidentally passes the wrong address gets EFAULT. A deliberate attempt to pass a kernel memory address gets EFAULT. The outcome is the same: the process receives an error and the kernel's internal state is untouched.

This is the practical meaning of the user space / kernel space separation. It is not just an organizational boundary — it is an enforced one, re-verified on every single crossing.


Why This Matters

The system call is not an implementation detail; it is the boundary that makes everything else possible. strace works by using ptrace, which is a system call that lets one process observe another. When the Linux OOM Killer sends SIGKILL to a process, the signal is not delivered instantly. It is delivered the next time the kernel returns execution to user space, whether that is after a system call or after a hardware interrupt like a timer tick. When dmesg reads the kernel ring buffer, it does so by reading /dev/kmsg or by calling syslog(); either way, it goes through system calls.

The system call boundary is the seam where software and operating system meet. Most of the time it is invisible — hidden by libraries, runtimes, and frameworks that handle the details. When something goes wrong — a process hangs, a file cannot be opened, memory runs out, a connection is refused — the failure is almost always visible at that seam, if you know where to look.

