Lab 5 Copy-on-Write Fork for xv6
Virtual memory provides a level of indirection: the kernel can intercept memory references by marking PTEs invalid or read-only, leading to page faults, and can change what addresses mean by modifying PTEs. There is a saying in computer systems that any systems problem can be solved with a level of indirection. This lab explores an example: copy-on-write fork.
To start the lab, switch to the cow branch:
c
The problem
The fork() system call in xv6 copies all of the parent process's user-space memory into the child. If the parent is large, copying can take a long time. Worse, the work is often largely wasted: fork() is commonly followed by exec() in the child, which discards the copied memory, usually without using most of it. On the other hand, if both parent and child use a copied page, and one or both writes it, the copy is truly needed.
The solution
Your goal in implementing copy-on-write (COW) fork() is to defer allocating and copying physical memory pages until the copies are actually needed, if ever.
COW fork() creates just a pagetable for the child, with PTEs for user memory pointing to the parent's physical pages. COW fork() marks all the user PTEs in both parent and child as read-only. When either process tries to write one of these COW pages, the CPU will force a page fault. The kernel page-fault handler detects this case, allocates a page of physical memory for the faulting process, copies the original page into the new page, and modifies the relevant PTE in the faulting process to refer to the new page, this time with the PTE marked writeable. When the page fault handler returns, the user process will be able to write its copy of the page.
COW fork() makes freeing of the physical pages that implement user memory a little trickier. A given physical page may be referred to by multiple processes' page tables, and should be freed only when the last reference disappears. In a simple kernel like xv6 this bookkeeping is reasonably straightforward, but in production kernels this can be difficult to get right; see, for example, Patching until the COWs come home.
Implement copy-on-write fork(hard)
Your task is to implement copy-on-write fork in the xv6 kernel. You are done if your modified kernel executes both the cowtest and 'usertests -q' programs successfully.
To help you test your implementation, we've provided an xv6 program called cowtest (source in user/cowtest.c). cowtest runs various tests, but even the first will fail on unmodified xv6. Thus, initially, you will see:
The "simple" test allocates more than half of available physical memory, and then fork()s. The fork fails because there is not enough free physical memory to give the child a complete copy of the parent's memory.
When you are done, your kernel should pass all the tests in both cowtest and usertests -q. That is:
$ cowtest
simple: ok
simple: ok
three: zombie!
ok
three: zombie!
ok
three: zombie!
ok
file: ok
ALL COW TESTS PASSED
$ usertests -q
...
ALL TESTS PASSED
$
Here's a reasonable plan of attack.
- Modify uvmcopy() to map the parent's physical pages into the child, instead of allocating new pages. Clear
PTE_W
in the PTEs of both child and parent for pages that havePTE_W
set. - Modify usertrap() to recognize page faults. When a write page-fault occurs on a COW page that was originally writeable, allocate a new page with kalloc(), copy the old page to the new page, and install the new page in the PTE with
PTE_W
set. Pages that were originally read-only (not mappedPTE_W
, like pages in the text segment) should remain read-only and shared between parent and child; a process that tries to write such a page should be killed. - Ensure that each physical page is freed when the last PTE reference to it goes away -- but not before. A good way to do this is to keep, for each physical page, a "reference count" of the number of user page tables that refer to that page. Set a page's reference count to one when
kalloc()
allocates it. Increment a page's reference count when fork causes a child to share the page, and decrement a page's count each time any process drops the page from its page table.kfree()
should only place a page back on the free list if its reference count is zero. It's OK to to keep these counts in a fixed-size array of integers. You'll have to work out a scheme for how to index the array and how to choose its size. For example, you could index the array with the page's physical address divided by 4096, and give the array a number of elements equal to highest physical address of any page placed on the free list bykinit()
in kalloc.c. Feel free to modify kalloc.c (e.g.,kalloc()
andkfree()
) to maintain the reference counts. - Modify copyout() to use the same scheme as page faults when it encounters a COW page.
Some hints:
- It may be useful to have a way to record, for each PTE, whether it is a COW mapping. You can use the RSW (reserved for software) bits in the RISC-V PTE for this.
usertests -q
explores scenarios thatcowtest
does not test, so don't forget to check that all tests pass for both.- Some helpful macros and definitions for page table flags are at the end of
kernel/riscv.h
. - If a COW page fault occurs and there's no free memory, the process should be killed.
//kernel/vm.c
...
// Given a parent process's page table, copy
// its memory into a child's page table.
// Copies both the page table and the
// physical memory.
// returns 0 on success, -1 on failure.
// frees any allocated pages on failure.
int
uvmcopy(pagetable_t old, pagetable_t new, uint64 sz)
{
pte_t *pte;
uint64 pa, i;
uint flags;
for(i = 0; i < sz; i += PGSIZE){
if((pte = walk(old, i, 0)) == 0)
panic("uvmcopy: pte should exist");
if((*pte & PTE_V) == 0)
panic("uvmcopy: page not present");
pa = PTE2PA(*pte);
// lab5: Copy on write
// father
// 如果该页本身就不可写,那么子进程肯定也不可写,不用对其考虑COW
if(*pte & PTE_W){
*pte &= ~PTE_W;
*pte |= PTE_COW;
}
flags = PTE_FLAGS(*pte);
// child
if(mappages(new, i, PGSIZE, (uint64) pa, flags) != 0){
goto err;
}
incr((void *) pa);
}
return 0;
...
}
...
// Copy from kernel to user.
// Copy len bytes from src to virtual address dstva in a given page table.
// Return 0 on success, -1 on error.
int
copyout(pagetable_t pagetable, uint64 dstva, char *src, uint64 len)
{
uint64 n, va0, pa0;
pte_t *pte;
while(len > 0){
va0 = PGROUNDDOWN(dstva);
if (is_cow_fault(pagetable, va0)){
if (cow_alloc(pagetable, va0) < 0){
printf("copyout: cowalloc fail!\n");
return -1;
}
}
if(va0 >= MAXVA)
return -1;
pte = walk(pagetable, va0, 0);
...
}
...
int cow_alloc(pagetable_t pagetable, uint64 va){
va = PGROUNDDOWN(va);
pte_t *pte = walk(pagetable, va, 0);
uint64 pa = PTE2PA(*pte);
int flag = PTE_FLAGS(*pte);
char *mem = kalloc();
if (mem == 0){
return -1;
}
memmove(mem, (char *) pa, PGSIZE);
uvmunmap(pagetable, va, 1, 1);
flag &= ~(PTE_COW);
flag |= PTE_W;
if(mappages(pagetable, va, PGSIZE, (uint64) mem, flag) < 0){
kfree(mem);
return -1;
}
return 0;
}
//kernel/trap.c
...
} else if((which_dev = devintr()) != 0){
// ok
} else if((r_scause() == 15 || r_scause() == 13)) {
uint64 va = r_stval();
if(is_cow_fault(p->pagetable, va)){
if(cow_alloc(p->pagetable, va) < 0){
printf("usertrap(): cow_alloc failed");
p->killed = 1;
}
}else{
printf("usertrap(): unexpected scause %p pid=%d\n", r_scause(), p->pid);
printf(" sepc=%p stval=%p\n", r_sepc(), r_stval());
setkilled(p);
}
} else {
printf("usertrap(): unexpected scause %p pid=%d\n", r_scause(), p->pid);
printf(" sepc=%p stval=%p\n", r_sepc(), r_stval());
setkilled(p);
}
...
//kernel/riscv.h
...
#define PTE_X (1L << 3)
#define PTE_U (1L << 4) // user can access
#define PTE_COW (1L << 8) // 8 -> cow page
...
//kernel/defs.h
//vm.c
...
int copyinstr(pagetable_t, char *, uint64, uint64);
int is_cow_fault(pagetable_t, uint64);
int cow_alloc(pagetable_t, uint64);
void incr(void *);
...
//kernel/trap.c
...
//
// handle an interrupt, exception, or system call from user space.
// called from trampoline.S
//
void
usertrap(void)
{
...
// ok
} else if((r_scause() == 15 || r_scause() == 13)) {
uint64 va = r_stval();
if(is_cow_fault(p->pagetable, va)){
if(cow_alloc(p->pagetable, va) < 0){
printf("usertrap(): cow_alloc failed");
p->killed = 1;
}
}else{
printf("usertrap(): unexpected scause %p pid=%d\n", r_scause(), p->pid);
printf(" sepc=%p stval=%p\n", r_sepc(), r_stval());
setkilled(p);
}
} else {
...
}
...
// kernel/kalloc.c
// Physical memory allocator, for user processes,
// kernel stacks, page-table pages,
// and pipe buffers. Allocates whole 4096-byte pages.
#include "types.h"
#include "param.h"
#include "memlayout.h"
#include "spinlock.h"
#include "riscv.h"
#include "defs.h"
void freerange(void *pa_start, void *pa_end);
extern char end[]; // first address after kernel.
// defined by kernel.ld.
struct run {
struct run *next;
};
struct {
struct spinlock lock;
struct run *freelist;
char *ref_page;
int page_cnt;
char *end_;
} kmem;
int page_cnt;
int
pagecnt(void *pa_start, void *pa_end){
char *p;
p = (char *) PGROUNDUP((uint64) pa_start);
for(; p + PGSIZE <= (char *) pa_end; p += PGSIZE)
page_cnt++;
return page_cnt;
}
void
kinit()
{
initlock(&kmem.lock, "kmem");
kmem.page_cnt = pagecnt(end, (void *) PHYSTOP);
printf("page_cnt: %d\n", page_cnt);
kmem.ref_page = end;
for (int i = 0; i < kmem.page_cnt; ++i){
kmem.ref_page[i] = 0;
}
kmem.end_ = kmem.ref_page + kmem.page_cnt;
freerange(kmem.end_, (void*)PHYSTOP);
}
int
page_index(uint64 pa){
pa = PGROUNDDOWN(pa);
int res = (pa - (uint64) kmem.end_) / PGSIZE;
if (res < 0 || res >= kmem.page_cnt){
panic("page_index illegal");
}
return res;
}
void
incr(void *pa){
int index = page_index((uint64) pa);
acquire(&kmem.lock);
kmem.ref_page[index]++;
release(&kmem.lock);
}
void
desc(void *pa){
int index = page_index((uint64)pa);
acquire(&kmem.lock);
kmem.ref_page[index]--;
release(&kmem.lock);
}
void
freerange(void *pa_start, void *pa_end)
{
char *p;
p = (char*)PGROUNDUP((uint64)pa_start);
for(; p + PGSIZE <= (char*)pa_end; p += PGSIZE)
kfree(p);
}
// Free the page of physical memory pointed at by pa,
// which normally should have been returned by a
// call to kalloc(). (The exception is when
// initializing the allocator; see kinit above.)
void
kfree(void *pa)
{
int index = page_index((uint64) pa);
if(kmem.ref_page[index] > 1){
desc(pa);
return;
}
if (kmem.ref_page[index] == 1){
desc(pa);
}
struct run *r;
if(((uint64)pa % PGSIZE) != 0 || (char*)pa < end || (uint64)pa >= PHYSTOP)
panic("kfree");
// Fill with junk to catch dangling refs.
memset(pa, 1, PGSIZE);
r = (struct run*)pa;
acquire(&kmem.lock);
r->next = kmem.freelist;
kmem.freelist = r;
release(&kmem.lock);
}
void
print_free_pages_cnt(){
struct run *r;
int cnt = 0;
acquire(&kmem.lock);
r = kmem.freelist;
while(r->next){
r = r->next;
cnt++;
}
release(&kmem.lock);
printf("free_pages_cnt: %d\n", cnt);
}
// Allocate one 4096-byte page of physical memory.
// Returns a pointer that the kernel can use.
// Returns 0 if the memory cannot be allocated.
void *
kalloc(void)
{
struct run *r;
acquire(&kmem.lock);
r = kmem.freelist;
if(r)
kmem.freelist = r->next;
release(&kmem.lock);
if(r){
memset((char*)r, 5, PGSIZE); // fill with junk
// print_free_pages_cnt();
incr(r);
}
return (void*)r;
}
Submit the lab
Time spent
Create a new file, time.txt
, and put in a single integer, the number of hours you spent on the lab. git add and git commit the file.
Answers
If this lab had questions, write up your answers in answers-*.txt
. git add and git commit these files.
Submit
Assignment submissions are handled by Gradescope. You will need an MIT gradescope account. See Piazza for the entry code to join the class. Use this link if you need more help joining.
When you're ready to submit, run make zipball, which will generate lab.zip
. Upload this zip file to the corresponding Gradescope assignment.
If you run make zipball and you have either uncomitted changes or untracked files, you will see output similar to the following:
Inspect the above lines and make sure all files that your lab solution needs are tracked, i.e., not listed in a line that begins with ??
. You can cause git
to track a new file that you create using git add {filename}.
- Please run make grade to ensure that your code passes all of the tests. The Gradescope autograder will use the same grading program to assign your submission a grade.
- Commit any modified source code before running make zipball.
- You can inspect the status of your submission and download the submitted code at Gradescope. The Gradescope lab grade is your final lab grade.
Optional challenge exercise
- Measure how much your COW implementation reduces the number of bytes xv6 copies and the number of physical pages it allocates. Find and exploit opportunities to further reduce those numbers.