Liam White
Operating systems part 0: Hardware building blocks

Here I catalog the parts of a modern system necessary to bootstrap an operating system. These devices are generally called "Systems on a Chip" due to the wide variety of functionality they include.

Cores

At a high level, a core executes a single stream of instructions, which direct it to operate on registers, operate on memory, perform arithmetic, or change the next instruction. The core's operation is periodic and occurs in synchronization with its internal clock.

Cores are often categorized as "application cores" (high performance, suitable for general use) or "microcontroller cores" (low performance, niche uses which require fixed latency). This will be discussed in more detail below. Among application cores, there are often further distinctions made, such as ARM's big.LITTLE heterogeneous architecture.

System on a Chip (SoC)

Modern SoCs have at least one core, a large number of peripherals, and some amount of internal memory, into which the peripherals are "mapped". This mapping gives software a simple interface for interacting with peripherals: reading and writing memory. Systems designated as multi-core have multiple cores, which share the same memory but process instructions independently of each other.
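
The mapping can be sketched in C. The register layout and the idea of a UART at a fixed address are hypothetical; here the "registers" are backed by ordinary memory so the sketch can run anywhere, but on real hardware the pointer would name a fixed physical address.

```c
#include <stdint.h>

/* Sketch: a hypothetical UART peripheral whose registers are mapped at a
 * fixed address. The layout and bit assignments are illustrative only. */
typedef struct {
    volatile uint32_t data;    /* write: byte to transmit */
    volatile uint32_t status;  /* bit 0: transmitter busy */
} uart_regs;

/* On real hardware this would be a fixed physical address, e.g.
 *   #define UART0 ((uart_regs *)0x40001000)
 * Here a static variable stands in so the example is runnable. */
static uart_regs fake_uart;
#define UART0 (&fake_uart)

void uart_putc(char c) {
    while (UART0->status & 1u) { /* wait until the transmitter is idle */ }
    UART0->data = (uint8_t)c;    /* a plain store; the bus routes it to the device */
}
```

The `volatile` qualifier is essential: it tells the compiler each access has a side effect and must not be reordered or elided.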

Oscillators

Crystal oscillators produce an accurate fixed frequency output when a specific voltage range is applied. The most common frequency values used for crystal oscillators in computers are 25MHz and 100MHz. (Crystal oscillators may be replaced in temperature-sensitive applications by MEMS oscillators.)

However, SoCs often operate at internal clock speeds measured in hundreds of MHz, or even GHz. This is done with a phase-locked loop, typically built around a ring oscillator: a programmable number of NOT gates chained in a loop, whose charge and discharge delays produce edges offset from the rising edge of the crystal's output. Designed carefully, this can be used to produce a much higher-frequency output clock from a lower-frequency input clock.

SoCs typically include a large number of derived clocks called a clock tree, where certain faster clocks are hierarchically derived from slower clocks. Enabling one component of the clock tree enables a peripheral or set of peripherals attached to it, called a clock domain. These are selectively enabled to save power. On reset, only essential clock domains are enabled, and on multi-core systems, only one core will be enabled.
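
Enabling a clock domain usually amounts to setting a gate bit in a clock controller register. A minimal sketch, assuming a hypothetical register layout and bit position (the stand-in variable replaces a real MMIO register):

```c
#include <stdint.h>

/* Stand-in for a clock controller's enable register; on hardware this
 * would be a volatile pointer to a fixed address. */
static volatile uint32_t clk_enable_reg;

#define CLK_UART0_EN (1u << 4)  /* hypothetical gate bit for a UART's clock domain */

void clk_enable(uint32_t domain_bit) {
    /* read-modify-write so the gates of other domains are untouched */
    clk_enable_reg |= domain_bit;
}
```

A driver would call this before touching the peripheral's registers; accessing a peripheral whose clock domain is gated off typically faults or returns garbage.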

Memory Management Unit (MMU)

The MMU is one of the most important foundations of modern multiprocessing systems. It allows different running programs to remain effectively isolated from each other (each with its own "address space"), to share memory when needed, and to set different permissions for different regions of memory (no permissions, read-only, read-execute, read-write).

The MMU is configured through page tables: an in-memory data structure that hierarchically maps the addresses seen by code running on the core to specific ranges of physical memory, along with permissions for that memory.

Each core can have one (or sometimes two) "page table base registers" configured, which allows each core to have a view of the current address space which depends on the currently active page table.
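
The translation the MMU performs in hardware can be illustrated in software. This is a deliberately simplified single-level walk with 4 KiB pages and a hypothetical entry format (frame address in the high bits, flags in the low bits); real MMUs walk several levels and have richer permission bits.

```c
#include <stdint.h>

#define PAGE_SHIFT 12          /* 4 KiB pages */
#define PTE_VALID  (1u << 0)   /* entry maps something */
#define PTE_WRITE  (1u << 1)   /* entry is writable (unused in this sketch) */

/* Each entry: physical frame address in the high bits, flags in the low 12. */
typedef uint32_t pte_t;

/* Translate a virtual address using a flat table indexed by page number.
 * Returns 0 on an unmapped page (where hardware would raise a page fault). */
uint32_t translate(const pte_t *table, uint32_t vaddr) {
    pte_t pte = table[vaddr >> PAGE_SHIFT];
    if (!(pte & PTE_VALID))
        return 0;                                /* fault */
    return (pte & ~0xFFFu) | (vaddr & 0xFFFu);   /* frame base + page offset */
}
```

Switching the page table base register effectively swaps out `table` wholesale, which is how each process gets its own address space.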

Microcontrollers

Modern application cores are intensely power-hungry, consuming anywhere from single-digit watts to hundreds of watts. Efficiency matters, so we generally want to be able to turn most of the system off when it is not in use. This creates a big problem: when the user requests use of the system again, how do we turn things back on?

The general answer is that there is a smaller core which uses a microcontroller design, and is responsible for powering the clock tree on the application core(s). It may even be embedded onto the same package. The microcontroller generally runs a small amount of code in an infinite loop and never sleeps, but its sparse and efficient design means its static power consumption is on the order of microwatts, so it can be left running while the main application cores are powered down, or even entirely powered off.

When a SoC comes back from a powered-off mode, it may need to run special BootROM (see below) code (often called "warm boot") to load its registers and begin executing code from DRAM again.

Interrupt generators and controllers

In order to operate efficiently, cores should not waste cycles polling for completion of operations they have started. Interrupt generators solve this problem. A core requests an interrupt on completion of an operation, then starts that operation and enters a low-power state. Once the operation has completed, the interrupt causes the core to exit the low-power state and resume its work.

Cores will generally have hardware-defined interrupts, which are triggered by the peripherals attached to them, and software-defined interrupts, which can be generated by writing to a register unique to the core.

Interrupt controllers are attached to each core and redirect execution based on the current interrupt state and the current state of the core. For standard edge-triggered interrupt signals, the controller will latch the signal on the next clock cycle. The controller will cause the core to immediately store all of its internal state and begin executing an interrupt service routine to process the signal instead of the next instruction. Generally, the operating system should first clear the interrupt before processing the condition that caused it; if it is not cleared, then the interrupt handler will simply run again until the interrupt condition is no longer met. Once the interrupt service routine returns and there are no more interrupts to handle, the original state and control flow is restored.
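
The "clear first, then handle" pattern described above can be sketched as follows. The pending register and the device-drain step are hypothetical stand-ins for real hardware:

```c
#include <stdint.h>

/* Stand-in for the interrupt controller's pending register. */
static volatile uint32_t irq_pending;
static int bytes_handled;

static void drain_device(void) { bytes_handled++; }

void device_isr(void) {
    irq_pending &= ~(1u << 0);  /* 1. clear/acknowledge the interrupt first */
    drain_device();             /* 2. then process the condition that raised it */
    /* If the device raises the line again while we are handling it, the ISR
     * simply runs once more, rather than losing the event. */
}
```

Clearing before handling is what makes re-raised interrupts safe: any event arriving during the handler sets the pending bit again and the routine re-runs.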

Interrupt controllers usually have a concept of priority - if multiple types of interrupts are ready to be handled, the controller will deliver them in a specific order.

An operating system may choose to mask an interrupt (prevent the service routine from executing) while the current core is in a critical section. Interrupt controllers will also generally not redirect control flow while a core is already handling an interrupt of the same type as one pending delivery, allowing the current interrupt to be handled fully first.

Timers

Timers count the cycles of their clock domain when they are enabled. There is usually one per core. When cleared, the timer simply counts cycles, and does not generate interrupts. When set, the timer generates a hardware timer interrupt after a programmed cycle count has been reached, and then waits to be cleared again. This can be used to interrupt a core after a delay precise to the frequency of the clock domain.
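
Arming a timer is typically just two register writes: a compare value and an enable bit. A minimal sketch with a hypothetical register layout (stand-in variables replace the real MMIO registers):

```c
#include <stdint.h>

/* Stand-ins for a per-core timer's registers. */
static volatile uint32_t timer_count;    /* free-running cycle counter */
static volatile uint32_t timer_compare;  /* interrupt fires when count reaches this */
static volatile uint32_t timer_ctrl;     /* bit 0: interrupt enable */

/* Request a timer interrupt `cycles` ticks from now. */
void timer_arm(uint32_t cycles) {
    timer_compare = timer_count + cycles;  /* unsigned wraparound is intentional */
    timer_ctrl |= 1u;
}
```

For example, on a hypothetical 19.2 MHz timer clock, `timer_arm(192000)` would request an interrupt 10 ms from now, precise to one clock period.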

Serial ports

Modern systems may include a number of components which need to communicate, but which normally operate at incompatible clock speeds. Communication is often accomplished using serial ports. The sending and receiving serial ports synchronize their internal clocks with each other, independently of the sender's and receiver's domain clocks, which likely operate at different frequencies. There are many kinds of serial ports. RS-232 UART is an old but still-common standard, and likely the first you'll use to get "Hello World" from an operating system.

When sending data, the sender will write a fixed packet of data into the sending serial port's send buffer memory, then a register will be set on the sending serial port to begin sending bits. The sending serial port latches the send register on its next clock cycle and begins sending. Once sending is complete, the serial port will clear the send register and interrupt the sender to notify it of completion.

When receiving data, the receiver will set a register on the receiving serial port to begin receiving. The receiving serial port latches the receive register on its next clock cycle, waits for the sender, and then begins writing into the receive buffer memory. Once receiving is complete, the receiving serial port will interrupt the receiver to notify it that new data can be read.

The send and receive buffers and transmit bits of serial ports are usually mapped at unique physical addresses on SoCs, but need to be enabled in the clock tree to become usable.
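The send sequence above (fill the buffer, set the send bit, let hardware latch it and clear it on completion) can be sketched like this. The buffer size, register names, and layout are hypothetical, and ordinary memory stands in for the mapped buffers:

```c
#include <stdint.h>
#include <string.h>

#define TX_FIFO_LEN 16
static volatile uint8_t  tx_fifo[TX_FIFO_LEN]; /* stand-in for the send buffer memory */
static volatile uint32_t tx_ctrl;              /* bit 0: start sending / busy */

int serial_send(const uint8_t *data, size_t len) {
    if (len > TX_FIFO_LEN)
        return -1;                      /* packet too large for the buffer */
    memcpy((void *)tx_fifo, data, len); /* 1. write the packet into the send buffer */
    tx_ctrl |= 1u;                      /* 2. set the send bit; hardware latches it */
    /* 3. hardware clears bit 0 and raises an interrupt once the bits are out */
    return 0;
}
```

A real driver would then either sleep until the completion interrupt or, in early boot before interrupts work, poll the busy bit.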

There are lots of variations in serial data architectures. Common older standards include RS-232, RS-485, SPI, I2C, and I2S. Common newer standards include Ethernet, PCIe, SATA, and USB. These standards work in conceptually similar ways, but are very complicated and somewhat beyond the scope of this post. Some (not all) of these are self-clocking: the data bus also encodes the bit clock of the sender, and the receiver is responsible for automatically synchronizing to it. This allows the protocol to operate using fewer wires, and avoids transmission errors related to differences in the length of the clock trace and data traces.

Fuses

Most SoCs include banks of one-time programmable memory called fuses. These operate like conventional fuses on the micro scale—once a fuse has been blown, it cannot be restored to its previous value. These are used to store keys and secrets internal to the SoC which are either not knowable at manufacturing time, or impractical to program directly into the design of the SoC.

For each fuse bank, there is typically an additional fuse which, once blown, instructs the fuse hardware to prevent further fuse programming for that bank. This "seals" the value of the bank and protects it from further changes.

SRAM and DRAM

All useful SoCs require writable memory to operate. This may be in the form of RAM which is internal to the SoC (SRAM, static random access memory), or external to it (DRAM, dynamic random access memory). If a SoC has SRAM, the usable size is typically less than 128KiB. DRAM can be up to 4GiB for 32-bit SoCs, or effectively any size for 64-bit SoCs.

SRAM does not require any initialization to operate, and is permanently mapped at a fixed physical address in the SoC, so once it has come out of reset, it is immediately usable by any running code. However, its small size limits the code which can run on it.

DRAM requires initialization to operate. Processors are generally expected to be used with different models of DRAM, and these models have different physical timing characteristics which must be programmed into the SoC's DRAM controller. The primary characteristics of DRAM are its frequency (the speed at which the memory clock operates) and the latency (the time it takes for a given operation to generate a complete response). DRAM models can typically operate at different standard frequency values, but with a fixed latency in nanoseconds. The latency must be either known in advance, or discovered by the system.

If the type of DRAM is not known, the system may be able to automatically determine the appropriate timing parameters through a process called training. Training tests decreasing values of frequency and increasing values of latency until the memory does not produce any errors. The result of the training process may then be stored into non-volatile memory, so it can be reused on the next boot.

Typically, when DRAM is used, the characteristics of the memory will be known. This may be through SPD, some sort of non-volatile storage, through pull resistors attached to the SoC, or through fuses blown internally to the SoC. This allows one to skip the training process and directly initialize the memory.

Caches

Simpler core designs may not do any caching, but the speed of modern SoCs vastly outpaces the latency of memory available to feed it, so a complex system of caches is used to ensure the processor can be kept busy with work efficiently for most programs. These are banks of SRAM that map address ranges (called "cache lines") to data. Common cache line sizes are 32 bytes, 64 bytes, and 128 bytes. However, the overwhelming majority of systems currently use 64 bytes.

These are divided into different levels: a small, fast L1 cache private to each core (commonly split into instruction and data caches), a larger and slower L2 cache, and often a still larger L3 cache shared between cores. Each level trades capacity for latency.

SoCs generally handle instruction and data cache differently due to their different needs. Writes to instruction memory are expected to be rare, so instruction caches are usually designed in a way that they must be explicitly invalidated when instruction memory is written to.

In contrast, the coherency of data caches with memory is automatically managed by essentially every core design. Cache coherency protocols (like MESI) ensure that each core of a multicore system sees a consistent view of memory by requiring each core to "acquire" the entire cache line when it wants to use addresses within it.

Even though writes could be propagated asynchronously, memory writes generally remain in cache until later evicted. If this is not the desired behavior (such as when interacting with a peripheral), cache flush instructions which explicitly force write-back of a given address range may be needed. It may also be possible to mark a peripheral address range as "uncached" in the MMU so that its addresses are never cached, but this is usually less desirable.

Direct Memory Access (DMA)

Modern SoCs include DMA controllers, which perform asynchronous copying of memory regions (such as writing a large range of host memory to some peripheral address). These controllers are enabled by writing source and destination physical addresses and a size to registers, and they generate an interrupt when the transfer completes.

Delegating this work allows the main CPU cores to avoid wasting time and polluting their caches on memory copies (a simple, high-latency activity), freeing them for the more complex tasks associated with an operating system.
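
Programming a DMA transfer usually reduces to a few register writes. A minimal sketch, assuming a hypothetical register layout; the cache-flush step a real driver would need (per the Caches section) is noted as a comment:

```c
#include <stdint.h>

/* Stand-ins for a DMA controller's registers. */
static volatile uintptr_t dma_src, dma_dst;
static volatile uint32_t  dma_len, dma_ctrl;  /* ctrl bit 0: start */

void dma_start(uintptr_t src_phys, uintptr_t dst_phys, uint32_t len) {
    /* cache_flush_range(src_phys, len);  -- hypothetical: write back dirty
     * cache lines so the device sees the data the core just wrote */
    dma_src = src_phys;
    dma_dst = dst_phys;
    dma_len = len;
    dma_ctrl |= 1u;  /* controller latches this and copies asynchronously */
    /* completion is signaled by an interrupt rather than by polling */
}
```

Note the addresses are physical: the DMA controller sits on the bus and does not (in this simple sketch) translate through the core's MMU.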

SPI NOR flash

Despite its age and low performance, this technology is still widely used today by both embedded processors and IBM-compatible PCs. Because the SPI NOR protocol is extremely simple, it can be easily programmed into a BootROM to fetch the bootloader. SPI NOR flash can also be reliably assumed to be error-free, with no bad block management required. However, the capacity used for SPI NOR memory is typically only a few megabytes, so when this is used, it is typically only used for the bootloader, and possibly also non-volatile parameters used by the bootloader.

Most SoCs and compatible flashes support a feature called XIP ("execute in place"), where SPI NOR contents can be directly executed without requiring a copy into memory, obviating even the need for a dedicated BootROM (the processor can simply run the code on the NOR flash directly out of reset). The feature is ubiquitous, found on almost every processor made in the last 20 years. However, XIP is incompatible with verified boot.

MMC

The MMC protocol can be used for accessing several types of devices: MMC cards, eMMC packages, and SD cards. The most common technology used in production systems is eMMC, since there is typically no desire for the internal storage to be removable or externally rewritable. The MMC protocol is significantly more complicated than the SPI protocol. Even so, many BootROMs support loading bootloaders from MMC devices because of their versatility, high capacity, and high performance. Programming a single eMMC image may also significantly simplify the manufacturing process for product vendors, avoiding the expense of ordering and programming a separate SPI memory device.

BootROM

The BootROM is read-only code which is directly programmed into the SoC design, and cannot be modified by customers (except in very unusual circumstances, such as custom manufacturing agreements). It is mapped at a fixed address and is the first code executed after the SoC has come out of reset. This is the starting point for all code running on the processor.

The first programmable stage of a processor is the bootloader. The BootROM loads the bootloader from SPI flash, MMC, or some other non-volatile storage into memory, optionally verifies it against a key burned into the fuses, and then executes the bootloader.

On older systems, the BootROM is generally also responsible for initializing system memory (DRAM). In this case, the BootROM performs DRAM initialization (and possibly training), the contents of the bootloader are copied into DRAM, verification optionally takes place, and then the bootloader is executed.

On newer systems, multi-stage bootloaders are often used instead. With these systems, the operation of the BootROM is mostly the same, except that the BootROM copies the bootloader binary into SRAM instead, and the bootloader must take the necessary steps to enable DRAM.

When verification (secure boot) is used, the raw public key value is generally not fused into the SoC. For efficiency, a hash of the root key, typically SHA-256, is fused instead. The bootloader is encapsulated in a binary format which has the public key chain and the associated signature attached. The BootROM hashes the root key and verifies that the hash matches the fused value, then performs public-key verification of the bootloader data, before finally executing it.
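
The root-key check can be sketched as follows. A toy FNV-1a hash stands in for SHA-256 so the example stays short and runnable, and the subsequent public-key signature verification is elided; the function names are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Toy 64-bit FNV-1a hash -- a stand-in for SHA-256 in this sketch only. */
static uint64_t hash64(const uint8_t *data, size_t len) {
    uint64_t h = 1469598103934665603ull;
    for (size_t i = 0; i < len; i++)
        h = (h ^ data[i]) * 1099511628211ull;
    return h;
}

/* BootROM-side check: hash the root public key shipped with the boot image
 * and compare it against the value burned into the fuses. */
bool root_key_matches_fuses(const uint8_t *pubkey, size_t len,
                            uint64_t fused_hash) {
    if (hash64(pubkey, len) != fused_hash)
        return false;  /* refuse to boot: untrusted root key */
    /* ...then verify the bootloader's signature against this key... */
    return true;
}
```

Fusing a hash rather than the key itself keeps the fuse bank small while still pinning the entire key chain: any substituted root key hashes to a different value and is rejected.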

Trusted Execution Environment

The TEE is fundamentally a peripheral which stores and generates keys. It may have a large amount of protected SRAM which it uses to perform encryption and decryption. This SRAM is unique to the TEE and generally cannot be accessed by a standard SoC core.

The TEE's code may be self-contained and initialized on reset by the BootROM with a signature value derived from the bootloader, or it may be dynamically programmed to perform specific functions by custom signed microcode, and initialized with a signature value derived from such microcode.

The value the TEE is initialized with is used as a form of seeding for the keys it derives. By tying specific signature values to the TEE's execution, it can be used to attest with high confidence to a remote observer that the system is executing the intended code. Additionally, this tying process can be used to ensure the TEE can only derive the intended keys when it is running secure microcode signed by the key owners.

A common use of a TEE is to implement access paths for Digital Rights Management-protected content. Encrypted data is streamed into the TEE. The TEE then generates a key based on the signed microcode value and a secret shared by all devices with the same type of TEE, decrypts the data, and outputs it directly to a peripheral device, bypassing the cores and system memory, protecting both the key value and the decrypted data from exposure.