> ## Documentation Index
> Fetch the complete documentation index at: https://docs.pebchip.top/docs/llms.txt
> Use this file to discover all available pages before exploring further.

# Operating Systems: Processes, Memory, and Concurrency

> Key operating system concepts including process vs thread, virtual memory, scheduling algorithms, and synchronization primitives for engineers.

Operating system concepts are the foundation beneath every application you write. When you debug a deadlocked service, tune a high-concurrency server, or investigate memory pressure, you are reasoning directly about processes, virtual memory, scheduling, and I/O models. This page covers the core OS concepts that come up most often in backend engineering and system design, drawn from both academic foundations and practical Linux internals.

## Processes vs Threads

A **process** is an independent execution unit with its own virtual address space, file descriptor table, and OS-level resources. A **thread** is a lightweight execution unit that shares the address space and resources of its parent process.

| Property            | Process                                             | Thread                                           |
| ------------------- | --------------------------------------------------- | ------------------------------------------------ |
| Memory              | Private address space                               | Shared with other threads in the same process    |
| File descriptors    | Private (inherited at fork, then diverge)           | Shared                                           |
| Communication       | IPC (pipes, sockets, shared memory, signals)        | Shared memory, global variables                  |
| Context switch cost | High — must switch page tables, TLB flush           | Lower — only registers and stack pointer change  |
| Crash isolation     | A crashing process does not corrupt other processes | A crashing thread can corrupt the entire process |
| Creation cost       | High (`fork` + `exec`)                              | Lower (`pthread_create` or `std::thread`)        |

### When to Use Each

Use **separate processes** when you need strong isolation—for example, sandboxed plugins, multi-tenant workers, or any scenario where one failure must not bring down the rest of the application.

Use **threads** within a process when tasks share state naturally (e.g., a web server handling concurrent requests against the same in-memory cache) and when the context-switch overhead of processes would be prohibitive.

### Context Switching Cost

During a context switch the OS saves the current execution context (registers, program counter, stack pointer) and loads the saved context of the next process or thread. For threads within the same process, the virtual address space does not change, so the expensive TLB flush and page-table swap are avoided. That said, even thread context switches carry overhead from saving/restoring registers and flushing CPU caches, which accumulates under high thread counts.

### Inter-Process Communication

Linux provides several IPC mechanisms:

| Mechanism             | Description                                                               | Suitable for                                    |
| --------------------- | ------------------------------------------------------------------------- | ----------------------------------------------- |
| **Anonymous pipe**    | Byte stream between parent and child via `fork`                           | Parent-child communication                      |
| **Named pipe (FIFO)** | Named file in the filesystem; any process can open it                     | Unrelated processes                             |
| **Message queue**     | Kernel-managed message list; structured data; lifetime tied to kernel     | Asynchronous, structured IPC                    |
| **Shared memory**     | Two processes map the same physical pages into their address spaces       | High-throughput IPC (fastest mechanism)         |
| **Semaphore**         | Integer counter for mutual exclusion and synchronization (P/V operations) | Protecting shared memory                        |
| **Signal**            | Asynchronous notification (e.g., `SIGINT`, `SIGKILL`)                     | Exception handling, process control             |
| **Socket**            | Full-duplex byte stream; works cross-host                                 | Any two processes, including across the network |

***

## Virtual Memory

Every process sees a flat, private **virtual address space** (0 to 2^48 on 64-bit Linux). The OS and MMU (Memory Management Unit) transparently map virtual addresses to physical RAM.

### Address Spaces and Page Tables

Virtual memory is divided into fixed-size **pages** (typically 4 KB). Physical RAM is divided into **frames** of the same size. The OS maintains a **page table** for each process that maps page numbers to frame numbers.

A virtual address is split into:

```
[ Page number | Page offset ]
```

The MMU looks up the page number in the page table to find the frame number, then combines it with the page offset to produce the physical address.

### TLB (Translation Lookaside Buffer)

Walking the page table on every memory access would double or triple memory latency. The **TLB** is a small, fast hardware cache inside the CPU that stores recent virtual-to-physical translations.

* A **TLB hit** resolves the translation in 1–2 cycles.
* A **TLB miss** triggers a page-table walk (multiple memory accesses) or a hardware page-table walker.

On a process context switch the TLB must be flushed (or tagged with an address space ID), which is part of why process switches are more expensive than thread switches.

### Page Faults

When a process accesses a virtual page that has no current mapping (the page is not in RAM), the CPU raises a **page fault**. The OS handles it by:

1. Checking whether the access is valid (within the process's allocated virtual space).
2. Finding the page on disk (swap space or a memory-mapped file).
3. Allocating a free frame, loading the page, updating the page table.
4. Resuming the faulting instruction.

**Minor page faults** (page exists but not yet mapped for this process, e.g., a copy-on-write page) are fast. **Major page faults** (page must be read from disk) are expensive—many microseconds to milliseconds.

***

## CPU Scheduling

The scheduler decides which ready process or thread runs next. Every algorithm makes trade-offs between throughput, fairness, response time, and starvation avoidance.

| Algorithm       | How it works                                                                     | Strength                              | Weakness                                                  |
| --------------- | -------------------------------------------------------------------------------- | ------------------------------------- | --------------------------------------------------------- |
| **FCFS**        | Run in arrival order; non-preemptive                                             | Simple                                | Long jobs block short ones (convoy effect)                |
| **SJF**         | Run shortest job first; non-preemptive                                           | Optimal average turnaround time       | Requires knowing runtime in advance; long jobs may starve |
| **STCF / SRT**  | Preemptive SJF; new arrival can preempt current job                              | Better response time than SJF         | Long jobs can starve indefinitely                         |
| **Round Robin** | Each job runs for a fixed time quantum, then preempts                            | Fair; no starvation                   | Turnaround time can be worse than FCFS                    |
| **Priority**    | Higher-priority jobs run first                                                   | Flexible                              | Low-priority jobs may starve                              |
| **MLFQ**        | Multiple queues with decreasing priority; new jobs start high, demote on CPU use | Balances response time and throughput | Complex; gaming by malicious processes                    |

**Key metrics:**

* **Turnaround time** = completion time − arrival time
* **Response time** = first run time − arrival time
* **Weighted turnaround** = turnaround / run time (normalized fairness)

Round Robin with a very small quantum behaves like a fair-share scheduler but generates high context-switch overhead. The time quantum must be large enough that context-switch cost is a small fraction of quantum length but small enough that response time stays acceptable.

***

## Synchronization

When threads share mutable state, you need synchronization primitives to prevent data races and enforce ordering.

### Mutex (Mutual Exclusion Lock)

A mutex has two states: locked and unlocked. Only one thread can hold the lock at a time; all others block until it is released.

```cpp theme={null}
#include <pthread.h>

pthread_mutex_t count_mutex;
volatile long long cnt = 0;

void *thread(void *vargp) {
    long long niters = *((long long *)vargp);
    for (long long i = 0; i < niters; i++) {
        pthread_mutex_lock(&count_mutex);   // enter critical section
        cnt++;                              // critical section
        pthread_mutex_unlock(&count_mutex); // exit critical section
    }
    return NULL;
}
```

### Semaphore

A semaphore is an integer counter supporting two atomic operations:

* **P (wait/down)**: decrement; block if the value would go negative.
* **V (signal/up)**: increment; wake one blocked thread.

Initialized to **1**, a semaphore acts as a mutex. Initialized to **0**, it enforces ordering (process A must run before process B).

### Condition Variables

A condition variable allows a thread to sleep until a specific condition is true, releasing the associated mutex atomically while sleeping.

```cpp theme={null}
pthread_cond_wait(&cond, &mutex);    // atomically release mutex + sleep
pthread_cond_signal(&cond);          // wake one waiting thread
pthread_cond_broadcast(&cond);       // wake all waiting threads
```

This is the standard pattern for producer-consumer queues: the consumer sleeps on the condition variable when the queue is empty; the producer signals after adding an item.

### Spin Lock

A spin lock busy-waits (polls in a loop) instead of sleeping. It is faster than a mutex when the expected wait time is very short (less than a context switch), but wastes CPU cycles if the wait is long. Use spin locks for very short critical sections in kernel or real-time code; prefer mutexes in application code.

### Deadlock

A **deadlock** occurs when a set of processes each wait for a resource held by another member of the set—none can ever proceed.

The four **Coffman conditions** must all hold simultaneously for deadlock to occur:

| Condition            | Meaning                                                         |
| -------------------- | --------------------------------------------------------------- |
| **Mutual exclusion** | A resource can be held by only one process at a time            |
| **Hold and wait**    | A process holds at least one resource while waiting for another |
| **No preemption**    | Resources cannot be forcibly taken from a process               |
| **Circular wait**    | A cycle exists in the resource-allocation graph                 |

**Prevention**: break any one condition.

* Eliminate hold-and-wait by requiring processes to request all resources at once.
* Allow preemption for resources that can be safely reclaimed.
* Impose a global ordering on resource types and require processes to request in that order (breaks circular wait).

**Avoidance**: use the Banker's Algorithm to admit a resource request only if the resulting state is provably safe.

**Detection and recovery**: let deadlocks happen, detect them with a resource-allocation graph, then resolve by killing or rolling back one process.

***

## I/O Models

How your application interacts with the OS during I/O determines its concurrency and performance characteristics.

### Blocking I/O

The simplest model. A call like `read()` blocks the thread until data is available. One thread per connection is required, which limits scalability.

### Non-blocking I/O

`read()` returns immediately with `EAGAIN` if no data is available. The application must poll repeatedly ("busy-wait"). CPU usage is high when data is sparse.

### I/O Multiplexing: select / poll / epoll

A single thread monitors many file descriptors simultaneously and only acts when one is ready.

| Mechanism | Storage                              | Max FDs            | Per-call copy                                    | Find ready FDs                 |
| --------- | ------------------------------------ | ------------------ | ------------------------------------------------ | ------------------------------ |
| `select`  | Bitmap                               | 1024 (FD\_SETSIZE) | Full bitmap copied to kernel each call           | O(n) scan                      |
| `poll`    | Array (linked list)                  | System limit       | Full array copied each call                      | O(n) scan                      |
| `epoll`   | Red-black tree (kernel) + ready list | System limit       | Register once with `epoll_ctl`; no per-call copy | O(1) — only ready FDs returned |

`epoll` is the Linux workhorse for high-concurrency servers (nginx, Redis, Node.js). It uses **event-driven** notification:

```
epoll_create()   → creates the epoll instance
epoll_ctl()      → registers/modifies/removes a socket interest
epoll_wait()     → blocks until one or more FDs are ready; returns only ready ones
```

`epoll` supports two triggering modes:

* **Level-triggered (LT)**: `epoll_wait` keeps returning the fd until its buffer is drained. Easier to use correctly; default mode.
* **Edge-triggered (ET)**: `epoll_wait` fires once per state change. Requires you to drain the buffer completely in one shot. More efficient (fewer syscalls) but harder to implement correctly.

### Asynchronous I/O

The process submits an I/O request and immediately continues. The OS completes the I/O entirely in the background and notifies the process (via signal or callback) when done. Unlike I/O multiplexing, the application does not perform the actual `read`/`write`—the OS does. Linux AIO and `io_uring` implement this model.

### Zero-Copy

Traditional file transfer requires four context switches and four data copies (disk → kernel buffer → user buffer → socket buffer → NIC). **Zero-copy** techniques reduce this:

* **`mmap + write`**: maps the kernel buffer into user space; eliminates the kernel→user copy. Still two syscalls, four context switches, three data copies.
* **`sendfile`**: one syscall copies directly from kernel buffer to socket buffer. Two context switches, three data copies.
* **`sendfile` with SG-DMA**: on kernels ≥ 2.4, the NIC DMA engine reads directly from the kernel buffer. Two context switches, **two** data copies (both DMA, no CPU involvement in data movement).

Zero-copy is why tools like Kafka and Nginx can sustain very high file-serving throughput with minimal CPU overhead.