Memory Management in osv

Introduction

In osv, user applications use virtual memory, which is implemented by paging hardware. To understand it, we’ll start by describing physical memory. Then we’ll look at address spaces, and finally explore how memory management works in osv.

Physical memory

In general, a PC’s physical address space is hard-wired to have the following layout:

    +------------------+  <- 0xFFFFFFFFFFFFFFFF (18 exabytes)
    |                  |
    |      Unused      |
    |                  |
    +------------------+  <- 0x0000000100000000 (4GB)
    |      32-bit      |
    |  memory mapped   |
    |     devices      |
    +------------------+  <- 0x00000000FE000000 (4GB - 32MB)
    |                  |
    |      Unused      |
    |                  |
    +------------------+  <- depends on amount of RAM
    |                  |
    |                  |
    | Extended Memory  |
    |                  |
    |                  |
    +------------------+  <- 0x0000000000100000 (1MB)
    |     BIOS ROM     |
    +------------------+  <- 0x00000000000F0000 (960KB)
    |  16-bit devices, |
    |  expansion ROMs  |
    +------------------+  <- 0x00000000000C0000 (768KB)
    |   VGA Display    |
    +------------------+  <- 0x00000000000A0000 (640KB)
    |                  |
    |    Low Memory    |
    |                  |
    +------------------+  <- 0x0000000000000000

The first 1 MB

The first PCs, which were based on the 16-bit Intel 8088 processor, were only capable of addressing 1 MB of physical memory. The physical address space of an early PC would therefore start at 0x0000000000000000 (that’s 16 zeros) but end at 0x00000000000FFFFF instead of 0x00000000FFFFFFFF.

The 640 KB area marked “Low Memory” above was the only random-access memory (RAM) that an early PC could use; in fact, the very earliest PCs could only be configured with 16 KB, 32 KB, or 64 KB of RAM.

The 384 KB area from 0x00000000000A0000 through 0x00000000000FFFFF was reserved by the hardware for special uses such as video display buffers and firmware held in non-volatile memory. The most important part of this reserved area is the “Basic Input/Output System” (BIOS), which occupies the 64 KB region from 0x00000000000F0000 through 0x00000000000FFFFF. In early PCs, the BIOS was held in true read-only memory (ROM), but current PCs store the BIOS in updateable flash memory. The BIOS is responsible for performing basic system initialization such as activating the video card and checking the amount of memory installed. After performing this initialization, the BIOS loads the operating system from some appropriate location such as a floppy disk, hard disk, CD-ROM, or the network, and passes control of the machine to the operating system.

Extended memory

When Intel finally “broke the one megabyte barrier” with the 80286 and 80386 processors, which supported 16 MB and 4 GB physical address spaces respectively, the PC architects nevertheless preserved the original layout for the low 1 MB of physical address space in order to ensure backward compatibility with existing software. Modern PCs therefore have a “hole” in physical memory from 0x00000000000A0000 to 0x0000000000100000, dividing RAM into “low” or “conventional” memory (the first 640 KB) and “extended memory” (everything else). In addition, some space at the very top of the PC’s 32-bit physical address space, above all physical RAM, is now commonly reserved by the BIOS for use by 32-bit PCI devices.

In practice

The details of the memory layout may vary depending on your machine. For example, when running on QEMU, osv prints out the following “map” of the physical address space (sometimes referred to as the E820 memory map):

E820: physical memory map [mem 0x121000-0x1FFE0000]
 [0x0 - 0x9FC00] usable
 [0x9FC00 - 0xA0000] reserved
 [0xF0000 - 0x100000] reserved
 [0x100000 - 0x1FFE0000] usable
 [0x1FFE0000 - 0x20000000] reserved
 [0xFFFC0000 - 0x100000000] reserved

From this output, we see that there are two available physical address ranges: the first 639 KB and above 1 MB (both marked “usable”). The other ranges (including those not mentioned in the E820 map) are reserved by the BIOS and osv shouldn’t use them. Note that E820 address ranges may overlap or even have holes; an address range is safe for osv to use only if it shows up as “usable” and not as “reserved”.

When osv is configured for more than 4 GB of physical RAM (i.e., so that RAM can extend further above 0x00000000FFFFFFFF), the BIOS must arrange to leave a second hole in the system’s RAM at the top of the 32-bit addressable region, to leave room for these 32-bit devices to be mapped. In osv only the first 512 MB of a PC’s physical memory is used, though, so this is not an issue.

Address spaces

Each process has its own illusion of having the entire memory, which is called an address space. In osv, page tables (implemented by hardware) are used to give each process its own address space. The x86_64 page table translates (or maps) a virtual address (the address that an x86 instruction manipulates) to a physical address (an address that the processor chip sends to main memory).

x86_64-specific page tables

x86_64 uses a page table to translate each virtual memory address to a physical memory address. An x86_64 page table is logically an array of 2^36 page table entries (PTEs), since virtual addresses are 48 bits wide and each page is 4096 bytes. Each PTE contains a 40-bit physical page frame number (PFN) and some flags (e.g., “present”). Each page table entry controls 4096 bytes of physical memory; such a 4 KB chunk is called a page.

Address translation in x86_64 happens in 4 steps. A page table is stored in physical memory not as a linear array (which would be prohibitively large and mostly full of invalid PTEs), but instead as a 4-level tree. To translate a virtual address into a physical address, the paging hardware in x86_64 walks the 4-level tree. A software implementation of the page walk is find_pte (in arch/x86_64/kernel/mm/vpmap.c), a utility function that helps osv manage the page table.

Each PTE contains flag bits (defined in arch/x86_64/include/arch/mmu.h) that tell the paging hardware how the associated virtual address is allowed to be used:

  • PTE_P: whether the PTE is present
  • PTE_W: whether instructions can issue writes to the page
  • PTE_U: whether user programs can use the page

A process’s address space

In osv, there is a separate page table for each process, defining that process’s address space. As illustrated below, an address space includes the process’s user memory starting at virtual address 0x0. Instructions come first, followed by global variables, a heap region (for malloc), and then the stack. Note that the code and globals together are listed as “Text” below.

We call each region a “memregion”; each memregion has a start, end, and memory permissions. In osv, we cap the maximum user memory per application at USTACK_UPPERBOUND. This should be plenty of memory for a user process.

    +------------------+  <- 0xFFFFFFFFFFFFFFFF (18 exabytes)
    |                  |
    |      Kernel      |
    |                  |
    +------------------+  <- KMAP_BASE = 0xFFFFFFFF80000000
    |                  |
    |      Unused      |
    |                  |
    +------------------+  <- stack memregion end (USTACK_UPPERBOUND = 0xFFFFFF7FFFFFF000)
    |                  |
    |       Stack      |
    |                  |
    +------------------+  <- stack memregion start
    |                  |
    |                  |
    |      Unused      |
    |                  |
    |                  |
    +------------------+  <- heap memregion end
    |       Heap       |
    +------------------+  <- heap memregion start / code memregion end
    |                  |
    |       Text       |
    |                  |
    +------------------+  <- code memregion start

Each process’s address space maps the kernel’s instructions and data as well as the user program’s memory. When a process invokes a system call, the system call executes in the kernel mappings of the process’s address space. This arrangement exists so that the kernel system call code can directly refer to user memory. In order to leave plenty of room for user memory, osv’s address spaces map the kernel at high addresses, starting at 0xFFFFFFFF80100000.

Setting up page tables in osv

When osv boots, vpmap_init (in arch/x86_64/kernel/vpmap.c) sets up the kernel page table. It maps virtual addresses starting at 0xFFFFFFFF80000000 (called KMAP_BASE in arch/x86_64/include/arch/mmu.h) to physical addresses starting at 0x0, as shown below. The function vpmap_init also sets up the kernel text mapping and maps all physical memory and device memory into the kernel’s address space. The kernel page table is copied into every address space in vpmap_copy_kernel_mapping. When a new user process is created, the code, stack, and heap memregions are added to its address space.

    Virtual Memory                          Physical Memory
    
    +---------+
    |  Kernel |
    +---------+  <- KMAP_BASE               |         |
    |  Unused |                             |         |
    +---------+  <- USTACK_UPPERBOUND       +---------+
    |   User  |                             |  Kernel |
    +---------+  <- 0                       +---------+  <- 0

Memory management in osv

The kernel must allocate and free physical memory at run-time for page tables, user process memory, and kernel stacks. In osv, physical memory is allocated and freed in units of 4096-byte pages. osv uses a “buddy allocator” (which allocates in powers of two) to manage physical memory allocation. If you look in pmem.c, you can find more details, e.g.:

  • pmem_alloc() allocates a single page (4096 bytes) of physical memory.
  • pmem_nalloc() allocates n contiguous pages of physical memory.

Each allocated page of physical memory has a corresponding struct page that tracks information about it, such as the number of address spaces that reference the page, the size of the allocated block, etc.:

/*
 * Each physical page has an associated struct page.
 */
struct page {
    struct sleeplock lock;
    Node node;
    struct kmem_cache *kmem_cache;
    struct slab *slab;
    // reverse mapping
    struct rmap *rmap;
    // reference count
    int refcnt;
    // size of the block (power of two number of pages)
    int order;
    // Status of the page. Contains the following flags:
    // - DIRTY
    state_t state;
    // used by bdev to locate block headers
    List blk_headers;
};

Both pmem_alloc() and pmem_nalloc() return a physical memory address. To access it, you need to call kmap_p2v(paddr) to get the corresponding kernel virtual address. This calls into KMAP_P2V(p):

#define KMAP_P2V(p) ((paddr_t)(p) + KMAP_BASE)

All of physical memory is therefore mapped into the kernel address space. You should use pmem_alloc() or pmem_nalloc() to directly allocate physical memory when you need a page of physical memory (in cases like a page table, user process memory, and a kernel stack).

To map a page of physical memory into a process’s address space, you can call vpmap_map() (see the function prototype in include/kernel/vpmap.h and the definition in arch/x86_64/kernel/mm/vpmap.c) with a virtual address in the user address space and the physical address returned from a pmem allocation function.

Sometimes you may want to allocate data structures (like struct proc) or buffers in kernel memory. You can use either kmalloc() or kmem_cache_alloc() to allocate kernel memory. Both calls return a kernel virtual address that you can access directly. Note that kmem_cache_alloc() efficiently allocates memory of a particular size defined during kmem_cache_create(); use it if you expect to allocate a particular struct multiple times (e.g., struct proc). The function kmalloc() is a more generic memory allocator that can handle any allocation size up to 4096 bytes.