Liam White
Operating systems part 0: Hardware building blocks

Here I catalog the parts of a modern system necessary to bootstrap an operating system. These devices are generally called "Systems on a Chip" due to the wide variety of functionality they include.

Cores

At a high level, a core executes a single stream of instructions, which direct it to operate on registers, operate on memory, perform arithmetic, or change the next instruction. The core's operation is periodic and occurs in synchronization with its internal clock.

Cores are often categorized as "application cores" (high performance, suitable for general use) or "microcontroller cores" (low performance, niche uses which require fixed latency). This will be discussed in more detail below. Among application cores, there are often further distinctions made, such as ARM's big.LITTLE heterogeneous architecture.

System on a Chip (SoC)

Modern SoCs have at least one core, a large number of peripherals, and some amount of internal memory, into which the peripherals are "mapped". This mapping gives software a simple interface for interacting with peripherals: reading and writing memory. Systems designated as multi-core have multiple cores, which share the same memory but process instructions independently of each other.
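
The mapping can be sketched in C. The register layout and the idea of a UART at a fixed address are hypothetical; here the "registers" are backed by ordinary memory so the sketch can run anywhere, but on real hardware the pointer would name a fixed physical address.

```c
#include <stdint.h>

/* Sketch: a hypothetical UART peripheral whose registers are mapped at a
 * fixed address. The layout and bit assignments are illustrative only. */
typedef struct {
    volatile uint32_t data;    /* write: byte to transmit */
    volatile uint32_t status;  /* bit 0: transmitter busy */
} uart_regs;

/* On real hardware this would be a fixed physical address, e.g.
 *   #define UART0 ((uart_regs *)0x40001000)
 * Here a static variable stands in so the example is runnable. */
static uart_regs fake_uart;
#define UART0 (&fake_uart)

void uart_putc(char c) {
    while (UART0->status & 1u) { /* wait until the transmitter is idle */ }
    UART0->data = (uint8_t)c;    /* a plain store; the bus routes it to the device */
}
```

The `volatile` qualifier is essential: it tells the compiler each access has a side effect and must not be reordered or elided.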

Oscillators

Crystal oscillators produce an accurate fixed frequency output when a specific voltage range is applied. The most common frequency values used for crystal oscillators in computers are 25MHz and 100MHz. (Crystal oscillators may be replaced in temperature-sensitive applications by MEMS oscillators.)

However, SoCs often operate at internal clock speeds measured in hundreds of MHz, or even GHz. This is done with a phase-locked loop, typically built around a ring oscillator: a programmable number of NOT gates chained in a loop, whose charge and discharge delays produce edges offset from the rising edge of the crystal's output. Designed carefully, this can be used to produce a much higher-frequency output clock from a lower-frequency input clock.

SoCs typically include a large number of derived clocks called a clock tree, where certain faster clocks are hierarchically derived from slower clocks. Enabling one component of the clock tree enables a peripheral or set of peripherals attached to it, called a clock domain. These are selectively enabled to save power. On reset, only essential clock domains are enabled, and on multi-core systems, only one core will be enabled.
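
Enabling a clock domain usually amounts to setting a gate bit in a clock controller register. A minimal sketch, assuming a hypothetical register layout and bit position (the stand-in variable replaces a real MMIO register):

```c
#include <stdint.h>

/* Stand-in for a clock controller's enable register; on hardware this
 * would be a volatile pointer to a fixed address. */
static volatile uint32_t clk_enable_reg;

#define CLK_UART0_EN (1u << 4)  /* hypothetical gate bit for a UART's clock domain */

void clk_enable(uint32_t domain_bit) {
    /* read-modify-write so the gates of other domains are untouched */
    clk_enable_reg |= domain_bit;
}
```

A driver would call this before touching the peripheral's registers; accessing a peripheral whose clock domain is gated off typically faults or returns garbage.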

Memory Management Unit (MMU)

The MMU is one of the most important foundations of modern multiprocessing systems. It allows different running programs to remain effectively isolated from each other (each with its own "address space"), to share memory when needed, and to set different permissions for different regions of memory (no permissions, read-only, read-execute, read-write).

The MMU is configured through page tables: an in-memory data structure that hierarchically maps the addresses seen by code running on the core to specific ranges of physical memory, along with permissions for that memory.

Each core can have one (or sometimes two) "page table base registers" configured, which allows each core to have a view of the current address space which depends on the currently active page table.
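
The translation the MMU performs in hardware can be illustrated in software. This is a deliberately simplified single-level walk with 4 KiB pages and a hypothetical entry format (frame address in the high bits, flags in the low bits); real MMUs walk several levels and have richer permission bits.

```c
#include <stdint.h>

#define PAGE_SHIFT 12          /* 4 KiB pages */
#define PTE_VALID  (1u << 0)   /* entry maps something */
#define PTE_WRITE  (1u << 1)   /* entry is writable (unused in this sketch) */

/* Each entry: physical frame address in the high bits, flags in the low 12. */
typedef uint32_t pte_t;

/* Translate a virtual address using a flat table indexed by page number.
 * Returns 0 on an unmapped page (where hardware would raise a page fault). */
uint32_t translate(const pte_t *table, uint32_t vaddr) {
    pte_t pte = table[vaddr >> PAGE_SHIFT];
    if (!(pte & PTE_VALID))
        return 0;                                /* fault */
    return (pte & ~0xFFFu) | (vaddr & 0xFFFu);   /* frame base + page offset */
}
```

Switching the page table base register effectively swaps out `table` wholesale, which is how each process gets its own address space.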

Microcontrollers

Modern application cores are intensely power-hungry, consuming anywhere from single-digit watts to hundreds of watts. Efficiency matters, so we generally want to be able to turn most of the system off when it is not in use. This creates a big problem: when the user requests use of the system again, how do we turn things back on?

The general answer is that there is a smaller core which uses a microcontroller design, and is responsible for powering the clock tree on the application core(s). It may even be embedded onto the same package. The microcontroller generally runs a small amount of code in an infinite loop and never sleeps, but its sparse and efficient design means its static power consumption is on the order of microwatts, so it can be left running while the main application cores are powered down, or even entirely powered off.

When a SoC comes back from a powered-off mode, it may need to run special BootROM (see below) code (often called "warm boot") to load its registers and begin executing code from DRAM again.

Interrupt generators and controllers

In order to operate efficiently, cores should not waste cycles polling for completion of operations they have started. Interrupt generators solve this problem. A core requests an interrupt on completion of an operation, then starts that operation and enters a low-power state. Once the operation has completed, the interrupt causes the core to exit the low-power state and resume its work.

Cores will generally have hardware-defined interrupts, which are triggered by the peripherals attached to them, and software-defined interrupts, which can be generated by writing to a register unique to the core.

Interrupt controllers are attached to each core and redirect execution based on the current interrupt state and the current state of the core. For standard edge-triggered interrupt signals, the controller will latch the signal on the next clock cycle. The controller will cause the core to immediately store all of its internal state and begin executing an interrupt service routine to process the signal instead of the next instruction. Generally, the operating system should first clear the interrupt before processing the condition that caused it; if it is not cleared, then the interrupt handler will simply run again until the interrupt condition is no longer met. Once the interrupt service routine returns and there are no more interrupts to handle, the original state and control flow is restored.
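
The "clear first, then handle" pattern described above can be sketched as follows. The pending register and the device-drain step are hypothetical stand-ins for real hardware:

```c
#include <stdint.h>

/* Stand-in for the interrupt controller's pending register. */
static volatile uint32_t irq_pending;
static int bytes_handled;

static void drain_device(void) { bytes_handled++; }

void device_isr(void) {
    irq_pending &= ~(1u << 0);  /* 1. clear/acknowledge the interrupt first */
    drain_device();             /* 2. then process the condition that raised it */
    /* If the device raises the line again while we are handling it, the ISR
     * simply runs once more, rather than losing the event. */
}
```

Clearing before handling is what makes re-raised interrupts safe: any event arriving during the handler sets the pending bit again and the routine re-runs.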

Interrupt controllers usually have a concept of priority - if multiple types of interrupts are ready to be handled, the controller will deliver them in a specific order.

An operating system may choose to mask an interrupt (prevent the service routine from executing) while the current core is in a critical section. Interrupt controllers will also generally not redirect control flow while a core is already handling an interrupt of the same type as one pending delivery, allowing the current interrupt to be handled fully first.

Timers

Timers count the cycles of their clock domain when they are enabled. There is usually one per core. When cleared, the timer simply counts cycles, and does not generate interrupts. When set, the timer generates a hardware timer interrupt after a programmed cycle count has been reached, and then waits to be cleared again. This can be used to interrupt a core after a delay precise to the frequency of the clock domain.
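
Arming a timer is typically just two register writes: a compare value and an enable bit. A minimal sketch with a hypothetical register layout (stand-in variables replace the real MMIO registers):

```c
#include <stdint.h>

/* Stand-ins for a per-core timer's registers. */
static volatile uint32_t timer_count;    /* free-running cycle counter */
static volatile uint32_t timer_compare;  /* interrupt fires when count reaches this */
static volatile uint32_t timer_ctrl;     /* bit 0: interrupt enable */

/* Request a timer interrupt `cycles` ticks from now. */
void timer_arm(uint32_t cycles) {
    timer_compare = timer_count + cycles;  /* unsigned wraparound is intentional */
    timer_ctrl |= 1u;
}
```

For example, on a hypothetical 19.2 MHz timer clock, `timer_arm(192000)` would request an interrupt 10 ms from now, precise to one clock period.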

Serial ports

Modern systems may include a number of components which need to communicate, but which normally operate at incompatible clock speeds. Communication is often accomplished using serial ports. The sending and receiving serial ports synchronize their internal clocks with each other, independently of the sender's and receiver's domain clocks, which likely operate at different frequencies. There are many kinds of serial ports. RS-232 UART is an old but still-common standard, and likely the first you'll use to get "Hello World" from an operating system.

When sending data, the sender will write a fixed packet of data into the sending serial port's send buffer memory, then a register will be set on the sending serial port to begin sending bits. The sending serial port latches the send register on its next clock cycle and begins sending. Once sending is complete, the serial port will clear the send register and interrupt the sender to notify it of completion.

When receiving data, the receiver will set a register on the receiving serial port to begin receiving. The receiving serial port latches the receive register on its next clock cycle, waits for the sender, and then begins writing into the receive buffer memory. Once receiving is complete, the receiving serial port will interrupt the receiver to notify it that new data can be read.

The send and receive buffers and transmit bits of serial ports are usually mapped at unique physical addresses on SoCs, but need to be enabled in the clock tree to become usable.
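The send sequence above (fill the buffer, set the send bit, let hardware latch it and clear it on completion) can be sketched like this. The buffer size, register names, and layout are hypothetical, and ordinary memory stands in for the mapped buffers:

```c
#include <stdint.h>
#include <string.h>

#define TX_FIFO_LEN 16
static volatile uint8_t  tx_fifo[TX_FIFO_LEN]; /* stand-in for the send buffer memory */
static volatile uint32_t tx_ctrl;              /* bit 0: start sending / busy */

int serial_send(const uint8_t *data, size_t len) {
    if (len > TX_FIFO_LEN)
        return -1;                      /* packet too large for the buffer */
    memcpy((void *)tx_fifo, data, len); /* 1. write the packet into the send buffer */
    tx_ctrl |= 1u;                      /* 2. set the send bit; hardware latches it */
    /* 3. hardware clears bit 0 and raises an interrupt once the bits are out */
    return 0;
}
```

A real driver would then either sleep until the completion interrupt or, in early boot before interrupts work, poll the busy bit.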

There are lots of variations in serial data architectures. Common older standards include RS-232, RS-485, SPI, I2C, and I2S. Common newer standards include Ethernet, PCIe, SATA, and USB. These standards work in conceptually similar ways, but are very complicated and somewhat beyond the scope of this post. Some (not all) of these are self-clocking: the data bus also encodes the bit clock of the sender, and the receiver is responsible for automatically synchronizing to it. This allows the protocol to operate using fewer wires, and avoids transmission errors related to differences in the length of the clock trace and data traces.

Fuses

Most SoCs include banks of one-time programmable memory called fuses. These operate like conventional fuses on the micro scale—once a fuse has been blown, it cannot be restored to its previous value. These are used to store keys and secrets internal to the SoC which are either not knowable at manufacturing time, or impractical to program directly into the design of the SoC.

For each fuse bank, there is typically an additional fuse which, once blown, instructs the fuse hardware to prevent further fuse programming for that bank. This "seals" the value of the bank and protects it from further changes.

SRAM and DRAM

All useful SoCs require writable memory to operate. This may be in the form of RAM which is internal to the SoC (SRAM, static random access memory), or external to it (DRAM, dynamic random access memory). If a SoC has SRAM, the usable size is typically less than 128KiB. DRAM can be up to 4GiB for 32-bit SoCs, or effectively any size for 64-bit SoCs.

SRAM does not require any initialization to operate, and is permanently mapped at a fixed physical address in the SoC, so once it has come out of reset, it is immediately usable by any running code. However, its small size limits the code which can run on it.

DRAM requires initialization to operate. Processors are generally expected to be used with different models of DRAM, and these models have different physical timing characteristics which must be programmed into the SoC's DRAM controller. The primary characteristics of DRAM are its frequency (the speed at which the memory clock operates) and the latency (the time it takes for a given operation to generate a complete response). DRAM models can typically operate at different standard frequency values, but with a fixed latency in nanoseconds. The latency must be either known in advance, or discovered by the system.

If the type of DRAM is not known, the system may be able to automatically determine the appropriate timing parameters through a process called training. Training tests decreasing values of frequency and increasing values of latency until the memory does not produce any errors. The result of the training process may then be stored into non-volatile memory, so it can be reused on the next boot.

Typically, when DRAM is used, the characteristics of the memory will be known. This may be through SPD, some sort of non-volatile storage, through pull resistors attached to the SoC, or through fuses blown internally to the SoC. This allows one to skip the training process and directly initialize the memory.

Caches

Simpler core designs may not do any caching, but the speed of modern SoCs vastly outpaces the latency of memory available to feed it, so a complex system of caches is used to ensure the processor can be kept busy with work efficiently for most programs. These are banks of SRAM that map address ranges (called "cache lines") to data. Common cache line sizes are 32 bytes, 64 bytes, and 128 bytes. However, the overwhelming majority of systems currently use 64 bytes.

These are divided into different levels: a small, fast L1 cache private to each core (commonly split into instruction and data caches), a larger and slower L2 cache, and often a still larger L3 cache shared between cores. Each level trades capacity for latency.

SoCs generally handle instruction and data cache differently due to their different needs. Writes to instruction memory are expected to be rare, so instruction caches are usually designed in a way that they must be explicitly invalidated when instruction memory is written to.

In contrast, the coherency of data caches with memory is automatically managed by essentially every core design. Cache coherency protocols (like MESI) ensure that each core of a multicore system sees a consistent view of memory by requiring each core to "acquire" the entire cache line when it wants to use addresses within it.

Even though writes could be propagated asynchronously, memory writes generally remain in cache until later evicted. If this is not the desired behavior (such as when interacting with a peripheral), cache flush instructions which explicitly force write-back of a given address range may be needed. It may also be possible to mark a peripheral address range as "uncached" in the MMU so that its addresses are never cached, but this is usually less desirable.

Direct Memory Access (DMA)

Modern SoCs include DMA controllers, which perform asynchronous copying of memory regions (such as writing a large range of host memory to some peripheral address). These controllers are enabled by writing source and destination physical addresses and a size to registers, and they generate an interrupt when the transfer completes.

Delegating this work allows the main CPU cores to avoid wasting time and polluting their caches on memory copies (a simple, high-latency activity), freeing them for the more complex tasks associated with an operating system.
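
Programming a DMA transfer usually reduces to a few register writes. A minimal sketch, assuming a hypothetical register layout; the cache-flush step a real driver would need (per the Caches section) is noted as a comment:

```c
#include <stdint.h>

/* Stand-ins for a DMA controller's registers. */
static volatile uintptr_t dma_src, dma_dst;
static volatile uint32_t  dma_len, dma_ctrl;  /* ctrl bit 0: start */

void dma_start(uintptr_t src_phys, uintptr_t dst_phys, uint32_t len) {
    /* cache_flush_range(src_phys, len);  -- hypothetical: write back dirty
     * cache lines so the device sees the data the core just wrote */
    dma_src = src_phys;
    dma_dst = dst_phys;
    dma_len = len;
    dma_ctrl |= 1u;  /* controller latches this and copies asynchronously */
    /* completion is signaled by an interrupt rather than by polling */
}
```

Note the addresses are physical: the DMA controller sits on the bus and does not (in this simple sketch) translate through the core's MMU.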

SPI NOR flash

Despite its age and low performance, this technology is still widely used today by both embedded processors and IBM-compatible PCs. Because the SPI NOR protocol is extremely simple, it can be easily programmed into a BootROM to fetch the bootloader. SPI NOR flash can also be reliably assumed to be error-free, with no bad block management required. However, the capacity used for SPI NOR memory is typically only a few megabytes, so when this is used, it is typically only used for the bootloader, and possibly also non-volatile parameters used by the bootloader.

Most SoCs and compatible flashes support a feature called XIP ("execute in place"), where SPI NOR contents can be directly executed without requiring a copy into memory, obviating even the need for a dedicated BootROM (the processor can simply run the code on the NOR flash directly out of reset). The feature is ubiquitous, found on almost every processor made in the last 20 years. However, XIP is incompatible with verified boot.

MMC

The MMC protocol can be used for accessing several types of devices: MMC cards, eMMC packages, and SD cards. The most common technology used in production systems is eMMC, since there is typically no desire for the internal storage to be removable or externally rewritable. The MMC protocol is significantly more complicated than the SPI protocol. Even so, many BootROMs support loading bootloaders from MMC devices because of their versatility, high capacity, and high performance. Programming a single eMMC image may also significantly simplify the manufacturing process for product vendors, avoiding the expense of ordering and programming a separate SPI memory device.

BootROM

The BootROM is read-only code which is directly programmed into the SoC design, and cannot be modified by customers (except in very unusual circumstances, such as custom manufacturing agreements). It is mapped at a fixed address and is the first code executed after the SoC has come out of reset. This is the starting point for all code running on the processor.

The first programmable stage of a processor is the bootloader. The BootROM loads the bootloader from SPI flash, MMC, or some other non-volatile storage into memory, optionally verifies it against a key burned into the fuses, and then executes the bootloader.

On older systems, the BootROM is generally also responsible for initializing system memory (DRAM). In this case, the BootROM performs DRAM initialization (and possibly training), the contents of the bootloader are copied into DRAM, verification optionally takes place, and then the bootloader is executed.

On newer systems, multi-stage bootloaders are often used instead. With these systems, the operation of the BootROM is mostly the same, except that the BootROM copies the bootloader binary into SRAM instead, and the bootloader must take the necessary steps to enable DRAM.

When verification (secure boot) is used, the raw public key value is generally not fused into the SoC. For efficiency, a hash of the root key, typically SHA-256, is fused instead. The bootloader is encapsulated in a binary format which has the public key chain and the associated signature attached. The BootROM hashes the root key and verifies that the hash matches the fused value, then performs public-key verification of the bootloader data, before finally executing it.
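
The root-key check can be sketched as follows. A toy FNV-1a hash stands in for SHA-256 so the example stays short and runnable, and the subsequent public-key signature verification is elided; the function names are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Toy 64-bit FNV-1a hash -- a stand-in for SHA-256 in this sketch only. */
static uint64_t hash64(const uint8_t *data, size_t len) {
    uint64_t h = 1469598103934665603ull;
    for (size_t i = 0; i < len; i++)
        h = (h ^ data[i]) * 1099511628211ull;
    return h;
}

/* BootROM-side check: hash the root public key shipped with the boot image
 * and compare it against the value burned into the fuses. */
bool root_key_matches_fuses(const uint8_t *pubkey, size_t len,
                            uint64_t fused_hash) {
    if (hash64(pubkey, len) != fused_hash)
        return false;  /* refuse to boot: untrusted root key */
    /* ...then verify the bootloader's signature against this key... */
    return true;
}
```

Fusing a hash rather than the key itself keeps the fuse bank small while still pinning the entire key chain: any substituted root key hashes to a different value and is rejected.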

Trusted Execution Environment

The TEE is fundamentally a peripheral which stores and generates keys. It may have a large amount of protected SRAM which it uses to perform encryption and decryption. This SRAM is unique to the TEE and generally cannot be accessed by a standard SoC core.

The TEE's code may be self-contained and initialized on reset by the BootROM with a signature value derived from the bootloader, or it may be dynamically programmed to perform specific functions by custom signed microcode, and initialized with a signature value derived from such microcode.

The value the TEE is initialized with is used as a form of seeding for the keys it derives. By tying specific signature values to the TEE's execution, it can be used to attest with high confidence to a remote observer that the system is executing the intended code. Additionally, this tying process can be used to ensure the TEE can only derive the intended keys when it is running secure microcode signed by the key owners.

A common use of a TEE is to implement access paths for Digital Rights Management-protected content. Encrypted data is streamed into the TEE. The TEE then generates a key based on the signed microcode value and a secret shared by all devices with the same type of TEE, decrypts the data, and outputs it directly to a peripheral device, bypassing the cores and system memory, protecting both the key value and the decrypted data from exposure.