Virtual Memory Manager

Overview

The virtual memory (or VM) subsystem is the queen and mother of all Keyronex subsystems. Virtual memory provides for memory as it appears to applications to be abstracted from underlying main memory, so that more memory can be apparently available to applications than there is main memory in the system, and so that things other than main memory can appear to the application as if they are main memory.

The VM subsystem provides an implementation of virtual memory supporting all the features any virtual memory system should have - memory mapped files with page caching, anonymous memory, virtual copy with copy-on-write optimisation, and paging - the retrieval from and putting-back to disk of pages, both pages of memory-mapped files and pages of anonymous memory, which are put back to a swapfile.

Note

“Paging” (and “virtual memory”) are sometimes used online as simple synonyms of “virtual address translation”, the process by which a virtual memory address is translated to a physical address. Keyronex uses these terms in their traditional and accepted sense: paging refers to movement of pages of data between backing store and main memory, while virtual memory refers to the abstraction described in the first paragraph of this section.

The Keyronex virtual memory manager has several responsibilities. Broadly, these are:

Page management: the allocation of pages of physical memory, reading contents of pages in from their backing store, and putting them back to their backing store, according to demand.
Address space management: maintaining an abstract, machine-independent description of virtual memory state, and providing interfaces to alter this.
Physical mapping: Translating that abstract representation to the form demanded by the platform’s memory-management unit.

The Keyronex virtual memory manager is principally derived from the design of NetBSD’s UVM, a design that meets these three responsibilities well.

Data Structures

A number of data structures play a role in the system. This is an overview of the major players:

`vm_page_t`

Represents a physical page of usable memory. These are placed on queues according to their use and state. Queues include the free queue, wired queues, and others - queues (and vm_page_t in greater depth) are described in the section on paging. The set of all vm_page_ts is collectively called the Resident Pagetable (or RPT).

Hint

A vm_page_t is analogous to a struct page in GNU/Linux or a PFN database entry in NT.

If it is being used as a page to store part of anonymous memory, it holds a pointer to the vm_anon_t it belongs to; if it is being used to store data belonging to a vnode VM object (analogous to a page cache in GNU/Linux) it stores a pointer to the VM object it belongs to, linkage for that object’s RB tree of pages, and the offset of this page within that object. In either of these cases, it also contains a pv_map list, which stores, for each mapping of the page, the vm_map_t into which it is mapped and its virtual address within that map.

Note

When talking about a “page owner”, this refers to the vm_anon_t or the vm_object_t to which it belongs, as described above. Page ownership is irrelevant if a page does not belong to either of these, e.g. in the case of kernel wired memory. Anything which says “the page owner must be locked” is ignored in that case.

vm_page_t`s also carry a busy bit which is used in paging. That bit is guarded by the page queues lock. Some other fields of a `vm_page_t are guarded by its owner.

`vm_anon_t`

A virtual page of anonymous memory. It may link to a vm_page_t if it is resident in memory, or if it has been swapped out, it stores an identifier by which its contents can be retrieved by the swap pager, called a drum slot. It also stores a reference count used in copy-on-write logic, and a mutex lock.

It is hoped that vm_anon_ts will themselves become pageable in the future, which will help to dissociate availability of virtual memory from physical memory; currently availability of virtual memory is indirectly constrained by available physical memory, because used virtual memory must be described by vm_anon_ts.

`vm_amap_t`

A map of anons associated with an anonymous VM object. The map is a 2-level sparse array - the first level is contained in the vm_amap_t holding pointers to vm_amap_chunk_ts called chunks; chunks contain a certain number (currently 32) of pointers to vm_anon_ts.

`vm_object_t`

These are mappable VM objects. They may be anonymous or vnode objects. If an anonymous object, it holds a pointer to a vm_amap_t, otherwise it is a vnode object and contains a handle to the vnode it belongs to and the head of the RB tree of vm_page_ts which store currently-resident data of the vnode VM object. They also carry a mutex lock.

`vm_map_t`

Represents a virtual address space, composed of a splay tree of vm_map_entry_ts. There is a special kmap which holds the kernel’s mappings (these are mapped into all processes, but not usermode-accessible). They carry a mutex lock.

A vm_map_entry_t is an entry within an address space map. They refer to a vm_object_t which they map, and also carry an offset (relative to which faults in the map entry are processed by the vm_object_t), a size, and a maximum protection value - this allows for read-only or non-execute mappings (these are not properties of VM objects themselves).

Page management

In any good virtual memory system, main memory is treated as the cache of secondary storage, so the virtual memory manager tries to ensure that a large amount of memory is used at all times, as unused memory is memory wasted. The technnique by which this is done in Keyronex is called paging - the movement of pages of data to and from the backing store with which that page is associated. Pages are associated with backing store according to which VM object they belong to; if they belong to a vnode VM object, their backing store is a file or block device, while if they belong to anonymous memory, their backing store is the swap space.

Note

In the future Keyronex might support user-defined VM object types as well as the built-in vnode and anonymous objects. A custom pager would be required for these to carry out page-in and page-out. One use-case would be to allow for a single-level store for the Oopsilon programming environment.

Pages are paged-in from their backing store in response to page faults, and paged out according to a page replacement policy. Keyronex uses a simple global page replacement algorithm, the FIFO second-chance approach, a variant of the general category of Not Recently Used page replacement algorithms. This involves maintaining two queues of pageable pages - active and inactive.

The page daemon

The page daemon, a kernel thread named vm_pagedaemon, is responsible for maintaining the page replacement policy. It maintains high and low watermarks for number of free pages and number of inactive pages, and spends most of its time sleeping on an event.

The event is signalled under two conditions:

there has been a request to allocate a physical page, but the number of pages on the free queue is less than the low watermark for the free page queue; or
greater than 75% of main memory is in use, and the number of pages on the inactive queue is less than the low watermark for the inactive queue.

The page daemon will wake up and calculate new watermarks for the inactive queue; these aim to keep around 33% of pageable pages - that is, the sum of the number of pages on the active and inactive queues - on the inactive queue. If the number of pages on the inactive queue is less than that of the low watermark, the page daemon will move pages from the tail of the active queue to the head of the inactive queue until the inactive queue high water mark is reached. Pages carry used bits to determine whether they have been accessed or not; this bit is reset when the page is moved to the inactive queue (this may involve a TLB shootdown; see the Physical Mapping section).

If the number of free pages is below the free page low watermark, the pagedaemon will now also take pages from the tail of the inactive queue and check their used bit. If it is set, the pages get a second chance - they are replaced to the head of the active queue. Otherwise, they are put back to their backing store. This is done by invoking the relevant pager according to the VM object to which the page belongs. For vnode VM objects, the vnode pager is used, while for anonymous VM objects, the swap pager is used.

After the pager has completed the put back to backing store, the page is placed on the free queue. This process will continue in a loop until the number of pages on the free page queue reaches the high watermark.

Todo

describe what happens when no pages can be put back to backing store anymore, e.g. when pagefile space is exhausted.

Pagers

Pagers are reponsible for carrying out the actual paging-in and paging-out of pages; page-in requests are generated by page faults, while page-out requests are generated by the page daemon. The interface is simple - just a page-out function to put a page to backing store, and a page-in function to retrieve a page from backing store.

Page-in

Page-in takes two arguments - the vm_page_t to page in, and a drumslot_t - this is only passed for page-ins for anonymous VM object, it’s irrelevant for other VM object types. The fault handler will have allocated a new page already, and have inserted it into the owner, setting the busy bit so that any further page faults will wait on an event which will be signalled when the page-in is complete. The page fault handler unlocks the address space in which which the fault occurred, the owner object, and the page queues before calling the pager. The page-in routine must now carry out any I/O necessary to bring the page into memory, after which the busy bit can be unset. The fault handler now returns with the “retry” status code, causing the fault to be restarted.

Note

Dropping the locks requires page faults to be restarted after page-in, but it has a major advantage: when two threads share an address space map, it allows page faults on separate pages to be handled simultaneously, since the map remains unlocked. For vnode VM objects there are an additional two advantages, which come about because of vnode VM objects relying on just one lock, rather than the one-lock-per-anon of anonymous memory:

It allows for the pagedaemon to simultaneously page-out other pages of that object.
It allows other threads to simultaneously page-in other pages of that object.

Page-out

Page-out also takes only one argument, the vm_page_t to page out. The page daemon will have set the busy bit of the page and unmapped it from all physical maps in which it is mapped. Note that the object is unlocked at the time of making the page-out request. The pager can then do any I/O necessary to put the page to its backing store, after which it can lock the owner and update its state appropriately - that is, remove the vm_page_t from the RB tree of its owning vm_object_t, or set the owning vm_anon_t’s state to ‘nonresident’ and set its drumslot appropriately.

Important

Page fault code paths have a “top-down” lock ordering (they lock the address space, then the object, then for anonymous mappings also the vm_anon_t, then the page queues) while the page daemon code path has a “bottom-up” lock ordering (it locks the page queues, then the vm_anon_t or vm_object_t the page is owned by.)

This lock ordering violation is possible because of the busy bit. When the page daemon wants to page out a page, it first sets the busy bit in a page before it locks the owner. Page faults which find a page to be busy unlock everything then wait on an event which is signalled when the page is unbusied. They then restart the page fault, because the state might have changed enough to make the work done hitherto no longer valid.

Allocation

Pagers may need to allocate memory themselves to carry out page-out even under low-memory conditions. This is why the low watermark for free pages is set to a number higher than zero. When that low watermark is reached, most page allocation attempts will sleep until an event is signalled indicating that there are sufficient free pages to proceed again. Pagers do not make these sleepable allocations.

Anonymous mappings

Anonymous mappings supporting copy-on-write semantics are implemented efficiently with reference-counting. The core principle is that a vm_anon_t with a reference count greater than 1 is always mapped read-only, and if there is a write fault at an address which is represented by that vm_anon_t, it must copy the vm_anon_t and its underlying page before mapping it read-write.

Todo

as well as the example below, fully detail the logic in an anonymous fault?

Consider a region of anonymous memory newly allocated in a process with PID 1. There are no vm_anon_ts yet because they are lazily allocated on first fault. PID 1 writes to the first page of region; a vm_anon_t is allocated with a refcnt of 1. PID 1 also writes to the 2nd page, and the same logic is followed.

Now PID 1 forks into PID 2. PID 2 is given a new anonymous VM object for that region with a copy of the vm_amap_t of that of its parent’s equivalent VM object. The copy refers to the same vm_anon_t, but the copying process has incremented the reference count of the 1st and 2nd vm_anon_t as they are now referenced by two vm_amap_ts. The copying process has also made all the old writeable mappings of these pages read-only again.

PID 2 now writes to the 2nd page of the anonymous region. The fault handler finds the corresponding vm_anon_t and notices that its refcnt is 2. As this is a write fault, it must copy the vm_anon_t and its underlying page. After copying it, it releases its reference to the vm_anon_t that was shared with PID 1, and maps the new copied vm_anon_t’s underlying page read-write. The same thing would happen if PID 1 had tried to do the write.

Anonymous-on-vnode mappings

A special case of mapping is used for MAP_PRIVATE mmap()’s of a vnode. An anonymous VM object is created with a parent pointer; the parent pointer points to the VM object associated with the vnode which is to be mapped MAP_PRIVATE.

Fault handling for this case is modified with respect to handling for faults in simple anonymous memory. A read fault will first check for a vm_anon_t that already exists, but if there is none, it will instead ask the parent vnode object to map the page for the faulting address into the faulting process’ address space.

In the case of a write fault, the page for the faulting address in the parent vnode object will be copied into a new page allocated which will be associated with a vm_anon_t and placed in the anonymous-on-vnode object’s amap. This is then mapped read-write into the faulting process’ address space, and copy-on-write has been achieved.

It should be noted that this is one-directional; that is, once an anonymous-on-vnode mapping is established, if the vnode object’s pages are changed by writes into a mapping of that vnode object, then it doesn’t subject them to copy-on-write (xxx is this clearly written?). This means that an anonymous-on-vnode mapping’s contents, where there have not been writes (which cause the copy-on-write process), the content of the pages will change if they change in the parent vnode object.

Virtual Memory Manager

Overview

Data Structures

vm_page_t

vm_anon_t

vm_amap_t

vm_object_t

vm_map_t

Page management

The page daemon

Pagers

Page-in

Page-out

Allocation

Anonymous mappings

Anonymous-on-vnode mappings

`vm_page_t`

`vm_anon_t`

`vm_amap_t`

`vm_object_t`

`vm_map_t`