Lab 7: Networking

In this lab you will write an xv6 device driver for a network interface card (NIC).

Fetch the xv6 source for the lab and check out the net branch:

$ git fetch
$ git checkout net
$ make clean

Background

Before writing code, you may find it helpful to review "Chapter 5: Interrupts and device drivers" in the xv6 book.

You'll use a network device called the E1000 to handle network communication. To xv6 (and the driver you write), the E1000 looks like a real piece of hardware connected to a real Ethernet local area network (LAN). In fact, the E1000 your driver will talk to is an emulation provided by qemu, connected to a LAN that is also emulated by qemu. On this emulated LAN, xv6 (the "guest") has an IP address of 10.0.2.15. Qemu also arranges for the computer running qemu to appear on the LAN with IP address 10.0.2.2. When xv6 uses the E1000 to send a packet to 10.0.2.2, qemu delivers the packet to the appropriate application on the (real) computer on which you're running qemu (the "host").

You will use QEMU's "user-mode network stack"; QEMU's documentation describes it in more detail. We've updated the Makefile to enable QEMU's user-mode network stack and the E1000 network card.

The Makefile configures QEMU to record all incoming and outgoing packets to the file packets.pcap in your lab directory. It may be helpful to review these recordings to confirm that xv6 is transmitting and receiving the packets you expect. To display the recorded packets:

tcpdump -XXnr packets.pcap

We've added some files to the xv6 repository for this lab. The file kernel/e1000.c contains initialization code for the E1000 as well as empty functions for transmitting and receiving packets, which you'll fill in. kernel/e1000_dev.h contains definitions for registers and flag bits defined by the E1000 and described in the Intel E1000 Software Developer's Manual. kernel/net.c and kernel/net.h contain a simple network stack that implements the IP, UDP, and ARP protocols. These files also contain code for a flexible data structure to hold packets, called an mbuf. Finally, kernel/pci.c contains code that searches for an E1000 card on the PCI bus when xv6 boots.

Your Job (hard)

Your job is to complete e1000_transmit() and e1000_recv(), both in kernel/e1000.c, so that the driver can transmit and receive packets. You are done when make grade says your solution passes all the tests.

While writing your code, you'll find yourself referring to the E1000 Software Developer's Manual. Of particular help may be the following sections:

  • Section 2 is essential and gives an overview of the entire device.

  • Section 3.2 gives an overview of packet receiving.

  • Section 3.3 gives an overview of packet transmission, alongside section 3.4.

  • Section 13 gives an overview of the registers used by the E1000.

  • Section 14 may help you understand the init code that we've provided.

Browse the E1000 Software Developer's Manual. This manual covers several closely related Ethernet controllers. QEMU emulates the 82540EM. Skim Chapter 2 now to get a feel for the device. To write your driver, you'll need to be familiar with Chapters 3 and 14, as well as 4.1 (though not 4.1's subsections). You'll also need to use Chapter 13 as a reference. The other chapters mostly cover components of the E1000 that your driver won't have to interact with. Don't worry about the details at first; just get a feel for how the document is structured so you can find things later. The E1000 has many advanced features, most of which you can ignore. Only a small set of basic features is needed to complete this lab.

The e1000_init() function we provide you in e1000.c configures the E1000 to read packets to be transmitted from RAM, and to write received packets to RAM. This technique is called DMA, for direct memory access, referring to the fact that the E1000 hardware directly writes and reads packets to/from RAM.

Because bursts of packets might arrive faster than the driver can process them, e1000_init() provides the E1000 with multiple buffers into which the E1000 can write packets. The E1000 requires these buffers to be described by an array of "descriptors" in RAM; each descriptor contains an address in RAM where the E1000 can write a received packet. struct rx_desc describes the descriptor format. The array of descriptors is called the receive ring, or receive queue. It's a circular ring in the sense that when the card or driver reaches the end of the array, it wraps back to the beginning. e1000_init() allocates mbuf packet buffers for the E1000 to DMA into, using mbufalloc(). There is also a transmit ring into which the driver should place packets it wants the E1000 to send. e1000_init() configures the two rings to have size RX_RING_SIZE and TX_RING_SIZE.
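
The wrap-around behavior of a descriptor ring can be illustrated with a small host-side sketch. The struct below follows the 16-byte legacy receive descriptor layout as I understand it from the Intel manual, but the names demo_rx_desc, DEMO_RING_SIZE, ring_next, and ring_walk are invented for illustration and are not taken from kernel/e1000_dev.h.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical size for this sketch; the lab uses RX_RING_SIZE and TX_RING_SIZE. */
#define DEMO_RING_SIZE 16

/* Illustrative receive descriptor: one entry of the ring. */
struct demo_rx_desc {
  uint64_t addr;    /* RAM address of the buffer the NIC may DMA into */
  uint16_t length;  /* number of bytes the NIC wrote into the buffer */
  uint16_t csum;    /* packet checksum */
  uint8_t  status;  /* status bits, including "descriptor done" */
  uint8_t  errors;  /* error bits */
  uint16_t special;
};

/* Advance a ring index by one slot, wrapping from the last slot to slot 0. */
static uint32_t ring_next(uint32_t idx) {
  return (idx + 1) % DEMO_RING_SIZE;
}

/* Slot reached after `steps` advances starting from slot `start`. */
static uint32_t ring_walk(uint32_t start, uint32_t steps) {
  uint32_t idx = start;
  for (uint32_t i = 0; i < steps; i++)
    idx = ring_next(idx);
  return idx;
}
```

For example, ring_next(15) yields 0, and walking 16 steps from any slot returns to that same slot, which is why both the card and the driver can keep reusing the same fixed array indefinitely.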

When the network stack in net.c needs to send a packet, it calls e1000_transmit() with an mbuf that holds the packet to be sent. Your transmit code must place a pointer to the packet data in a descriptor in the TX (transmit) ring. struct tx_desc describes the descriptor format. You will need to ensure that each mbuf is eventually freed, but only after the E1000 has finished transmitting the packet (the E1000 sets the E1000_TXD_STAT_DD bit in the descriptor to indicate this).

When the E1000 receives each packet from the ethernet, it DMAs the packet to the memory pointed to by addr in the next RX (receive) ring descriptor. If an E1000 interrupt is not already pending, the E1000 asks the PLIC to deliver one as soon as interrupts are enabled. Your e1000_recv() code must scan the RX ring and deliver each new packet's mbuf to the network stack (in net.c) by calling net_rx(). You will then need to allocate a new mbuf and place it into the descriptor, so that when the E1000 reaches that point in the RX ring again it finds a fresh buffer into which to DMA a new packet.

In addition to reading and writing the descriptor rings in RAM, your driver will need to interact with the E1000 through its memory-mapped control registers, to detect when received packets are available and to inform the E1000 that the driver has filled in some TX descriptors with packets to send. The global variable regs holds a pointer to the E1000's first control register; your driver can get at the other registers by indexing regs as an array. You'll need to use indices E1000_RDT and E1000_TDT in particular.
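
As a rough illustration of "indexing regs as an array", the sketch below models the register window as an array of 32-bit words on the host. The byte offsets 0x02818 (RDT) and 0x03818 (TDT) are the ones I believe the Intel manual's register table gives; the names DEMO_RDT, DEMO_TDT, fake_regs, demo_regs, reg_read, and reg_write are made up for this example. In the kernel, regs points at memory-mapped hardware rather than ordinary RAM.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offsets of the tail registers, divided by 4 to index 32-bit words. */
#define DEMO_RDT (0x02818 / 4)  /* receive descriptor tail */
#define DEMO_TDT (0x03818 / 4)  /* transmit descriptor tail */

/* Stand-in for the device's register window (assumption: plain RAM here). */
static uint32_t fake_regs[0x4000 / 4];

/* volatile: each access must really happen, as it would for device registers. */
static volatile uint32_t *demo_regs = fake_regs;

/* Read one register by word index. */
static uint32_t reg_read(uint32_t index) {
  return demo_regs[index];
}

/* Write one register by word index. */
static void reg_write(uint32_t index, uint32_t value) {
  demo_regs[index] = value;
}
```

With this model, reg_write(DEMO_TDT, 5) stores 5 at byte offset 0x3818 of the window, which corresponds to what the driver does when it writes regs[E1000_TDT].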

To test your driver, run make server in one window, and in another window run make qemu and then run nettests in xv6. The first test in nettests tries to send a UDP packet to the host operating system, addressed to the program that make server runs. If you haven't completed the lab, the E1000 driver won't actually send the packet, and nothing much will happen.

After you've completed the lab, the E1000 driver will send the packet, qemu will deliver it to your host computer, make server will see it, it will send a response packet, and the E1000 driver and then nettests will see the response packet. Before the host sends the reply, however, it sends an "ARP" request packet to xv6 to find out its 48-bit Ethernet address, and expects xv6 to respond with an ARP reply. kernel/net.c will take care of this once you have finished your work on the E1000 driver. If all goes well, nettests will print testing ping: OK, and make server will print a message from xv6!.

tcpdump -XXnr packets.pcap should produce output that starts like this:

reading from file packets.pcap, link-type EN10MB (Ethernet)
15:27:40.861988 IP 10.0.2.15.2000 > 10.0.2.2.25603: UDP, length 19
        0x0000:  ffff ffff ffff 5254 0012 3456 0800 4500  ......RT..4V..E.
        0x0010:  002f 0000 0000 6411 3eae 0a00 020f 0a00  ./....d.>.......
        0x0020:  0202 07d0 6403 001b 0000 6120 6d65 7373  ....d.....a.mess
        0x0030:  6167 6520 6672 6f6d 2078 7636 21         age.from.xv6!
15:27:40.862370 ARP, Request who-has 10.0.2.15 tell 10.0.2.2, length 28
        0x0000:  ffff ffff ffff 5255 0a00 0202 0806 0001  ......RU........
        0x0010:  0800 0604 0001 5255 0a00 0202 0a00 0202  ......RU........
        0x0020:  0000 0000 0000 0a00 020f                 ..........
15:27:40.862844 ARP, Reply 10.0.2.15 is-at 52:54:00:12:34:56, length 28
        0x0000:  ffff ffff ffff 5254 0012 3456 0806 0001  ......RT..4V....
        0x0010:  0800 0604 0002 5254 0012 3456 0a00 020f  ......RT..4V....
        0x0020:  5255 0a00 0202 0a00 0202                 RU........
15:27:40.863036 IP 10.0.2.2.25603 > 10.0.2.15.2000: UDP, length 17
        0x0000:  5254 0012 3456 5255 0a00 0202 0800 4500  RT..4VRU......E.
        0x0010:  002d 0000 0000 4011 62b0 0a00 0202 0a00  .-....@.b.......
        0x0020:  020f 6403 07d0 0019 3406 7468 6973 2069  ..d.....4.this.i
        0x0030:  7320 7468 6520 686f 7374 21              s.the.host!

Your output will look somewhat different, but it should contain the strings "ARP, Request", "ARP, Reply", "UDP", "a.message.from.xv6" and "this.is.the.host".
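
To make the hex dump easier to read, here is a small host-side sketch that decodes the headers of the first packet shown above. The byte array is copied from the first 42 bytes of that dump; the names demo_frame, demo_udp_info, be16, and demo_parse_udp are invented for this example and do not appear in kernel/net.c.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* First 42 bytes of the first packet in the dump:
 * 14-byte Ethernet header, 20-byte IPv4 header, 8-byte UDP header. */
static const uint8_t demo_frame[42] = {
  0xff,0xff,0xff,0xff,0xff,0xff, 0x52,0x54,0x00,0x12,0x34,0x56, 0x08,0x00,
  0x45,0x00, 0x00,0x2f, 0x00,0x00, 0x00,0x00, 0x64, 0x11, 0x3e,0xae,
  0x0a,0x00,0x02,0x0f, 0x0a,0x00,0x02,0x02,
  0x07,0xd0, 0x64,0x03, 0x00,0x1b, 0x00,0x00,
};

struct demo_udp_info {
  uint8_t  src_ip[4];
  uint8_t  dst_ip[4];
  uint16_t sport;
  uint16_t dport;
  uint16_t payload_len;
};

/* Read a 16-bit big-endian (network byte order) value. */
static uint16_t be16(const uint8_t *p) {
  return (uint16_t)((p[0] << 8) | p[1]);
}

/* Parse Ethernet + IPv4 + UDP headers; return 0 on success, -1 otherwise. */
static int demo_parse_udp(const uint8_t *f, size_t n, struct demo_udp_info *out) {
  if (n < 14 + 20 + 8) return -1;           /* too short for all three headers */
  if (be16(f + 12) != 0x0800) return -1;    /* EtherType is not IPv4 */
  const uint8_t *ip = f + 14;
  size_t ihl = (size_t)(ip[0] & 0x0f) * 4;  /* IP header length in bytes */
  if ((ip[0] >> 4) != 4 || ihl < 20) return -1;
  if (ip[9] != 17) return -1;               /* IP protocol is not UDP */
  if (n < 14 + ihl + 8) return -1;
  memcpy(out->src_ip, ip + 12, 4);
  memcpy(out->dst_ip, ip + 16, 4);
  const uint8_t *udp = ip + ihl;
  out->sport = be16(udp);
  out->dport = be16(udp + 2);
  /* The UDP length field counts the 8-byte UDP header as well. */
  out->payload_len = (uint16_t)(be16(udp + 4) - 8);
  return 0;
}
```

Parsing demo_frame gives source 10.0.2.15 port 2000, destination 10.0.2.2 port 25603, and a 19-byte payload, matching the "UDP, length 19" summary line that tcpdump prints for that packet.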

nettests performs some other tests, culminating in a DNS request sent over the (real) Internet to one of Google's name servers. You should ensure that your code passes all these tests, after which you should see this output:

$ nettests
nettests running on port 25603
testing ping: OK
testing single-process pings: OK
testing multi-process pings: OK
testing DNS
DNS arecord for pdos.csail.mit.edu. is 128.52.129.126
DNS OK
all tests passed.

You should ensure that make grade agrees that your solution passes.

Hints

Start by adding print statements to e1000_transmit() and e1000_recv(), and running make server and (in xv6) nettests. You should see from your print statements that nettests generates a call to e1000_transmit.

Some hints for implementing e1000_transmit:

  • First ask the E1000 for the TX ring index at which it's expecting the next packet, by reading the E1000_TDT control register.
  • Then check if the ring is overflowing. If E1000_TXD_STAT_DD is not set in the descriptor indexed by E1000_TDT, the E1000 hasn't finished the corresponding previous transmission request, so return an error.
  • Otherwise, use mbuffree() to free the last mbuf that was transmitted from that descriptor (if there was one).
  • Then fill in the descriptor. m->head points to the packet's content in memory, and m->len is the packet length. Set the necessary cmd flags (look at Section 3.3 in the E1000 manual) and stash away a pointer to the mbuf for later freeing.
  • Finally, update the ring position by adding one to E1000_TDT modulo TX_RING_SIZE.
  • If e1000_transmit() added the mbuf successfully to the ring, return 0. On failure (e.g., there is no descriptor available to transmit the mbuf), return -1 so that the caller knows to free the mbuf.
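
The transmit steps above can be tried out on the host with a simulated ring. This is only a sketch under several assumptions: the "hardware" is a function that sets the done bit, buffers come from malloc() instead of mbufalloc(), the sketch clears the status byte itself to mark a slot as in flight, and every sim_* name and SIM_* constant is hypothetical rather than part of xv6.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SIM_TX_RING_SIZE 16
#define SIM_STAT_DD 0x01  /* "descriptor done": the slot may be reused */
#define SIM_CMD_EOP 0x01  /* end of packet */
#define SIM_CMD_RS  0x08  /* ask the device to report status */

struct sim_buf { char data[64]; uint32_t len; };
struct sim_tx_desc { uint64_t addr; uint16_t length; uint8_t cmd; uint8_t status; };

static struct sim_tx_desc sim_ring[SIM_TX_RING_SIZE];
static struct sim_buf *sim_bufs[SIM_TX_RING_SIZE];
static uint32_t sim_tdt;  /* stand-in for the TDT register */
static int sim_freed;     /* how many old buffers were freed on slot reuse */

/* Start with every slot marked done, i.e. free for the driver to use. */
static void sim_tx_init(void) {
  for (int i = 0; i < SIM_TX_RING_SIZE; i++) {
    sim_ring[i] = (struct sim_tx_desc){ .status = SIM_STAT_DD };
    sim_bufs[i] = NULL;
  }
  sim_tdt = 0;
  sim_freed = 0;
}

/* Queue one buffer; return 0 on success, -1 if the ring is full. */
static int sim_transmit(struct sim_buf *b) {
  uint32_t idx = sim_tdt;
  if ((sim_ring[idx].status & SIM_STAT_DD) == 0)
    return -1;                       /* slot still in flight */
  if (sim_bufs[idx]) {               /* free the buffer sent earlier from this slot */
    free(sim_bufs[idx]);
    sim_freed++;
  }
  sim_bufs[idx] = b;                 /* stash for later freeing */
  sim_ring[idx].addr = (uint64_t)(uintptr_t)b->data;
  sim_ring[idx].length = (uint16_t)b->len;
  sim_ring[idx].cmd = SIM_CMD_RS | SIM_CMD_EOP;
  sim_ring[idx].status = 0;          /* in flight until the "hardware" sets DD */
  sim_tdt = (idx + 1) % SIM_TX_RING_SIZE;
  return 0;
}

/* Pretend the hardware finished sending the packet in slot idx. */
static void sim_hw_complete(uint32_t idx) {
  sim_ring[idx].status |= SIM_STAT_DD;
}

/* Try to send n buffers; return how many the ring accepted. */
static int sim_send_n(int n) {
  int ok = 0;
  for (int i = 0; i < n; i++) {
    struct sim_buf *b = calloc(1, sizeof *b);
    if (!b) break;
    b->len = 19;
    if (sim_transmit(b) == 0) ok++;
    else free(b);                    /* on failure the caller frees the buffer */
  }
  return ok;
}
```

After sim_tx_init(), sixteen sends fill the ring and a seventeenth is rejected with -1. Once sim_hw_complete(0) marks slot 0 done, the next send succeeds and frees the buffer that previously occupied that slot. Buffers still in the ring at the end are never freed in this sketch, which is acceptable for a demo but not for a driver.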

Some hints for implementing e1000_recv:

  • First ask the E1000 for the ring index at which the next waiting received packet (if any) is located, by fetching the E1000_RDT control register and adding one modulo RX_RING_SIZE.
  • Then check if a new packet is available by checking for the E1000_RXD_STAT_DD bit in the status portion of the descriptor. If not, stop.
  • Otherwise, update the mbuf's m->len to the length reported in the descriptor. Deliver the mbuf to the network stack using net_rx().
  • Then allocate a new mbuf using mbufalloc() to replace the one just given to net_rx(). Program its data pointer (m->head) into the descriptor. Clear the descriptor's status bits to zero.
  • Finally, update the E1000_RDT register to be the index of the last ring descriptor processed.
  • e1000_init() initializes the RX ring with mbufs, and you'll want to look at how it does that and perhaps borrow code.
  • At some point the total number of packets that have ever arrived will exceed the ring size (16); make sure your code can handle that.
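
The receive loop, including the case where more packets arrive than the ring has slots, can likewise be exercised against a simulated ring. In this sketch the "hardware" owns the slots from its head index up to, but not including, the tail index, and drops packets when the two meet; buffer replacement is omitted to keep the example short. All sim_* names and SIM_* constants are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SIM_RX_RING_SIZE 16
#define SIM_RXD_STAT_DD 0x01  /* "descriptor done": a packet is waiting */

struct sim_rx_desc { uint64_t addr; uint16_t length; uint8_t status; };

static struct sim_rx_desc sim_rx_ring[SIM_RX_RING_SIZE];
static uint32_t sim_rdt;      /* stand-in for the RDT register */
static uint32_t sim_hw_head;  /* next slot the "hardware" will fill */

/* All slots empty; the tail starts at the last slot, as the lab's init code does. */
static void sim_rx_init(void) {
  for (int i = 0; i < SIM_RX_RING_SIZE; i++)
    sim_rx_ring[i] = (struct sim_rx_desc){ 0 };
  sim_rdt = SIM_RX_RING_SIZE - 1;
  sim_hw_head = 0;
}

/* "Hardware" receives n packets; return how many fit before the ring filled. */
static int sim_hw_burst(int n) {
  int accepted = 0;
  for (int i = 0; i < n; i++) {
    if (sim_hw_head == sim_rdt)
      break;                                  /* no free descriptor: drop the rest */
    sim_rx_ring[sim_hw_head].length = 64;
    sim_rx_ring[sim_hw_head].status = SIM_RXD_STAT_DD;
    sim_hw_head = (sim_hw_head + 1) % SIM_RX_RING_SIZE;
    accepted++;
  }
  return accepted;
}

/* Driver side: drain every waiting packet; return how many were delivered. */
static int sim_recv(void) {
  int delivered = 0;
  for (;;) {
    uint32_t idx = (sim_rdt + 1) % SIM_RX_RING_SIZE;
    if ((sim_rx_ring[idx].status & SIM_RXD_STAT_DD) == 0)
      break;                                  /* nothing new in the next slot */
    delivered++;                              /* a real driver calls net_rx() here */
    sim_rx_ring[idx].status = 0;              /* hand the slot back to the hardware */
    sim_rdt = idx;                            /* tail = last descriptor processed */
  }
  return delivered;
}
```

Alternating sim_hw_burst(10) and sim_recv() a few times pushes the total past 16 packets while the indices keep wrapping correctly. A single burst of 20 is cut off after 15 packets because the hardware head catches up with the tail.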

You'll need locks to cope with the possibility that xv6 might use the E1000 from more than one process, or might be using the E1000 in a kernel thread when an interrupt arrives.

Solution:

kernel/e1000.c:

int
e1000_transmit(struct mbuf *m)
{
  //
  // Your code here.
  //
  // the mbuf contains an ethernet frame; program it into
  // the TX descriptor ring so that the e1000 sends it. Stash
  // a pointer so that it can be freed after sending.
  //

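  acquire(&e1000_lock);

  // Ask the E1000 for the TX ring index of the next descriptor to use.
  int idx = regs[E1000_TDT];

  if ((tx_ring[idx].status & E1000_TXD_STAT_DD) == 0) {
    // The previous transmission from this slot has not finished yet.
    release(&e1000_lock);
    return -1;
  }

  // Free the mbuf that was last transmitted from this descriptor, if any.
  if (tx_mbufs[idx])
    mbuffree(tx_mbufs[idx]);

  // Stash the new mbuf for later freeing and fill in the descriptor.
  tx_mbufs[idx] = m;
  tx_ring[idx].length = m->len;
  tx_ring[idx].addr = (uint64) m->head;
  tx_ring[idx].cmd = E1000_TXD_CMD_RS | E1000_TXD_CMD_EOP;

  // Advance the tail so the E1000 sees the new descriptor.
  regs[E1000_TDT] = (idx + 1) % TX_RING_SIZE;

  release(&e1000_lock);

  return 0;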
}

static void
e1000_recv(void)
{
  //
  // Your code here.
  //
  // Check for packets that have arrived from the e1000
  // Create and deliver an mbuf for each packet (using net_rx()).
  //
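  // Deliver every packet that has arrived to the network stack.
  while (1) {
    // Index of the next descriptor that may hold a received packet.
    int idx = (regs[E1000_RDT] + 1) % RX_RING_SIZE;
    if ((rx_ring[idx].status & E1000_RXD_STAT_DD) == 0) {
      // No more new packets.
      return;
    }

    // Record the packet length and pass the mbuf up to the network stack.
    rx_mbufs[idx]->len = rx_ring[idx].length;
    net_rx(rx_mbufs[idx]);

    // Replace the delivered mbuf with a fresh one and reset the descriptor.
    rx_mbufs[idx] = mbufalloc(0);
    if (!rx_mbufs[idx])
      panic("e1000_recv: mbufalloc");
    rx_ring[idx].status = 0;
    rx_ring[idx].addr = (uint64) rx_mbufs[idx]->head;

    // Tell the E1000 that this descriptor is available again.
    regs[E1000_RDT] = idx;
  }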
}

Submit the lab

Time spent

Create a new file, time.txt, and put in a single integer, the number of hours you spent on the lab. git add and git commit the file.

Answers

If this lab had questions, write up your answers in answers-*.txt. git add and git commit these files.

Submit

Assignment submissions are handled by Gradescope. You will need an MIT gradescope account. See Piazza for the entry code to join the class. Use this link if you need more help joining.

When you're ready to submit, run make zipball, which will generate lab.zip. Upload this zip file to the corresponding Gradescope assignment.

If you run make zipball and you have either uncommitted changes or untracked files, you will see output similar to the following:

M hello.c
?? bar.c
?? foo.pyc
Untracked files will not be handed in.  Continue? [y/N]

Inspect the above lines and make sure all files that your lab solution needs are tracked, i.e., not listed in a line that begins with ??. You can cause git to track a new file that you create using git add {filename}.

  • Please run make grade to ensure that your code passes all of the tests. The Gradescope autograder will use the same grading program to assign your submission a grade.
  • Commit any modified source code before running make zipball.
  • You can inspect the status of your submission and download the submitted code at Gradescope. The Gradescope lab grade is your final lab grade.

Optional Challenges:

Some of the benefits of the challenge exercises below are only measurable/testable on real, high-performance hardware, which means x86-based computers.

  • In this lab, the networking stack uses interrupts to handle ingress packet processing, but not egress packet processing. A more sophisticated strategy would be to queue egress packets in software and only provide a limited number to the NIC at any one time. You can then rely on TX interrupts to refill the transmit ring. Using this technique, it becomes possible to prioritize different types of egress traffic. (easy)
  • The provided networking code only partially supports ARP. Implement a full ARP cache and wire it in to net_tx_eth(). (moderate)
  • The E1000 supports multiple RX and TX rings. Configure the E1000 to provide a ring pair for each core and modify your networking stack to support multiple rings. Doing so has the potential to increase the throughput that your networking stack can support as well as reduce lock contention. (moderate), but difficult to test/measure
  • sockrecvudp() uses a singly-linked list to find the destination socket, which is inefficient. Try using a hash table and RCU instead to increase performance. (easy), but a serious implementation would be difficult to test/measure
  • ICMP can provide notifications of failed networking flows. Detect these notifications and propagate them as errors through the socket system call interface.
  • The E1000 supports several stateless hardware offloads, including checksum calculation, RSC, and GRO. Use one or more of these offloads to increase the throughput of your networking stack. (moderate), but hard to test/measure
  • The networking stack in this lab is susceptible to receive livelock. Using the material in lecture and the reading assignment, devise and implement a solution to fix it. (moderate), but hard to test.
  • Implement a UDP server for xv6. (moderate)
  • Implement a minimal TCP stack and download a web page. (hard)

If you pursue a challenge problem, whether it is related to networking or not, please let the course staff know!