Tuesday 19 February 2013

Raw Sockets, Raw Brain

I've been working for a while on a networking app that involves moving packets at great speed from one interface to another, while doing a little processing on them. Key to the performance we need is to avoid copying data or making kernel calls for each packet. We already did some testing a while back with Netmap, which provides direct access from user-mode code to hardware buffers and buffer rings - perfect for our application. Intel now has something similar called DPDK, but we chose Netmap because it was there when we needed it and also because, being open source, it avoids having to do a secret handshake with Intel.

However, Netmap has some limitations, so we have to modify it before we can fully integrate with it. In the meantime I wanted to do some functional testing, where ultimate performance isn't essential. So it seemed obvious to code a variation which uses sockets, accepting the performance penalty for now.

Little did I know. Getting raw sockets to work, and then getting my code to work correctly with them, was really a major pain. Since I did eventually get it to work, I hope these notes may help anyone else who runs into the same requirement.

What I wanted was a direct interface to the ethernet device driver for a specific interface. Thus I'd get and send raw ethernet packets directly to or from that specific interface. But sockets are really designed to sit on top of the internal IP network stack in the Linux kernel (or Unix generally) - for which they work very nicely. It's pretty much an unnatural act to get them to deal only with one specific interface. It's also, I guess, a pretty uncommon use case. The documentation is scrappy and incomplete, and the common open source universal solution of "just Google it" didn't come up with much either.

Just getting hold of raw packets is quite well documented. You create a socket using:

    my_socket = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL))

and you will get, and be able to send, raw packets, with everything from the ethernet header upwards created by you. (You'll need to run this as root to get it to work).
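
For the record, a minimal compilable version of that looks something like this (the function name is mine, and error handling is reduced to a perror):

    #include <sys/socket.h>
    #include <linux/if_packet.h>   // AF_PACKET, sockaddr_ll et al
    #include <linux/if_ether.h>    // ETH_P_ALL
    #include <arpa/inet.h>         // htons
    #include <cstdio>

    // Open a raw packet socket that sees every protocol, not just IP.
    int open_raw_socket()
    {
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        if (fd < 0)
            perror("socket(AF_PACKET, SOCK_RAW)");
        return fd;
    }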

Since my application runs as a "bump in the wire", picking up all traffic regardless of the MAC address, I also needed to listen promiscuously. This also is reasonably well documented - it's done by reading the interface flags, setting the requisite bit, and writing them back. One detail that some examples omit: the ioctl needs the interface name filled in first:

    ifreq ifr;
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, sys_name.c_str(), IFNAMSIZ - 1);
    ioctl(my_socket, SIOCGIFFLAGS, &ifr);
    ifr.ifr_flags |= IFF_PROMISC;
    ioctl(my_socket, SIOCSIFFLAGS, &ifr);

Production code should check for errors, of course.
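
Incidentally, there's also a packet-socket-specific way to do the same thing, documented in packet(7): the PACKET_ADD_MEMBERSHIP socket option with PACKET_MR_PROMISC. A sketch (packet_mreq and the constants live in linux/if_packet.h, and it needs the interface index, which the bind code below shows how to look up):

    packet_mreq mreq;
    memset(&mreq, 0, sizeof(mreq));
    mreq.mr_ifindex = ifr.ifr_ifindex;   // from SIOCGIFINDEX, as below
    mreq.mr_type = PACKET_MR_PROMISC;
    setsockopt(my_socket, SOL_PACKET, PACKET_ADD_MEMBERSHIP,
               &mreq, sizeof(mreq));

This has the advantage that the kernel reference-counts it, rather than blindly toggling a flag shared with everyone else.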

I got my code running like this, but it was still getting packets destined for all interfaces. This is where it got tricky and had me tearing my hair out. Some documentation suggests using bind for this, while other sources suggest an ioctl whose name I forget. It's all very unclear. Eventually, bind did the trick. First you have to translate the interface name to an interface index, then use that in the call to bind:

    sockaddr_ll sa;
    memset(&sa, 0, sizeof(sa));    // don't pass garbage in the unused fields
    strncpy(ifr.ifr_name, sys_name.c_str(), IFNAMSIZ - 1);
    ioctl(my_socket, SIOCGIFINDEX, &ifr);
    sa.sll_family = AF_PACKET;
    sa.sll_protocol = htons(ETH_P_ALL);
    sa.sll_ifindex = ifr.ifr_ifindex;
    sa.sll_halen = ETH_ALEN;
    my_mac_address.to_msg(sa.sll_addr);
    bind(my_socket, (sockaddr*)&sa, sizeof(sockaddr_ll));

A couple of wrinkles here. It may seem intuitively obvious that sockaddr_ll is in effect a subclass of sockaddr, but it isn't documented that way anywhere that I could find. And finding the header files that define these things, and then the header files they depend upon, and so (almost) ad infinitum, is a nightmare. In the end the best solution I could come up with was to run just the preprocessor on my source files (the one-liner for that is below), and look at the resulting C code. And note the ugly cast in the call to bind, because in the world of C there is no such thing as inheritance - the "common superclass" is really just a macro that pastes the same initial fields into each structure, and as far as the compiler is concerned sockaddr and sockaddr_ll are completely unrelated types.
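
The preprocessor run, for what it's worth (the file names here are placeholders):

    g++ -E my_file.cpp > my_file.i   # my_file.i holds the fully preprocessed source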

Another wrinkle is the bind function itself. I use boost::bind all the time, far too much to want to type or read the fully qualified name, so my common header file contains "using boost::bind". That absolutely wipes out any attempt to use the socket function of the same name. The cleanest way round it that I found is to define a trivial wrapper function called socket_bind (or whatever you prefer), whose definition in its .cpp file studiously avoids including the common header file. It's only a nuisance, but it did take a little thought to come up with a reasonable workaround when I first ran into the problem.
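
The wrapper is about as trivial as code gets - something like this (the names are mine):

    // socket_bind.h - declares the wrapper; no boost anywhere in sight
    #include <sys/socket.h>
    int socket_bind(int fd, const sockaddr* addr, socklen_t len);

    // socket_bind.cpp - compiled without the common header, so the
    // unqualified 'bind' below can only mean the socket call
    #include "socket_bind.h"
    int socket_bind(int fd, const sockaddr* addr, socklen_t len)
    {
        return bind(fd, addr, len);
    }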

So, with all this done, I was receiving raw ethernet frames, doing my thing with them, and sending them on through the paired egress interface. Wonderful.

Except actually, not. The frames I was receiving were way longer than ethernet frames. Since I'm using jumbo-frame sized buffers (9000 bytes), I'd receive them OK but not be able to send them. But sometimes, they were even too large for that, and I wouldn't receive anything at all. And this was where things got really frustrating.

The first move, of course, was to check the MTUs (the maximum packet size an interface will carry) on all the relevant interfaces. They were fine. Then I found a suggestion that TCP will use the MTU of the loopback interface, relying on the driver to straighten things out. So I set that down to 1400 too. It still made no difference.
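
Checking and setting an MTU is the usual drill, by the way - the interface name here is a placeholder:

    ifconfig eth1              # the current MTU appears in the output
    ifconfig eth1 mtu 1400     # set a smaller one

(Or "ip link set dev eth1 mtu 1400" with the newer tools.)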

At that point, my code didn't send ICMP messages for too-large packets, which a router or host is supposed to do. I spent a whole Saturday in a distress-coding binge writing my ICMP implementation, and changing my super-slick multi-threaded lock-free infrastructure to accommodate it. It did make a very small difference. Instead of just blasting away with giant frames, the sender would transmit each packet initially as a giant frame, then retransmit it later in smaller, frame-sized chunks. The data did get through, but at a pitiful rate with all those retransmissions and timeouts.
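
For anyone who ends up writing the same thing: the message in question is ICMP Destination Unreachable with code Fragmentation Needed, which carries the next-hop MTU and echoes back the start of the offending datagram (RFC 1191). A sketch of just the ICMP portion - building and addressing the enclosing IP header is a separate job, and the function names are mine:

    #include <netinet/ip_icmp.h>   // icmphdr, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED
    #include <arpa/inet.h>         // htons
    #include <cstring>
    #include <cstdint>

    // The standard internet checksum over an even number of bytes
    // (the lengths used below are always even).
    static uint16_t icmp_checksum(const void* data, size_t len)
    {
        const uint16_t* p = (const uint16_t*)data;
        uint32_t sum = 0;
        for (; len > 1; len -= 2)
            sum += *p++;
        while (sum >> 16)
            sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
    }

    // Build an ICMP "fragmentation needed" message into out, quoting the
    // original IP header plus the first 8 data bytes, per RFC 1191.
    // Returns the number of bytes written.
    static size_t make_frag_needed(uint8_t* out, uint16_t next_hop_mtu,
                                   const uint8_t* orig_dgram, size_t quote_len)
    {
        icmphdr* icmp = (icmphdr*)out;
        memset(icmp, 0, sizeof(*icmp));
        icmp->type = ICMP_DEST_UNREACH;          // type 3
        icmp->code = ICMP_FRAG_NEEDED;           // code 4
        icmp->un.frag.mtu = htons(next_hop_mtu); // the size we can forward
        memcpy(out + sizeof(*icmp), orig_dgram, quote_len);
        // The checksum is computed with its own field still zero
        icmp->checksum = icmp_checksum(out, sizeof(*icmp) + quote_len);
        return sizeof(*icmp) + quote_len;
    }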

Finally, after much Googling, I discovered the "tcp segmentation offload" parameter. Turning that off made no difference. With more Googling, I also discovered "generic segmentation offload". Turning that off too made things better, though still far from good. I had Wireshark running on all four interfaces - the two test systems running iPerf, and both interfaces on my system in the middle. (All this is running as VMs under VMware, by the way - see my earlier rant (sorry, reasoned discourse) about the problems I had trying to get Xen to work). Wireshark clearly showed packets leaving the first system as correctly sized ethernet frames, yet when they showed up at the second system they'd magically coalesced into jumbo frames.

After much cursing I found the third thing I had to turn off, "generic receive offload". The design assumption here is that practically all network traffic is TCP - which after all is true. So the hardware (emulated in my case) combines smaller TCP packets into huge ones, to reduce the amount of work done in the network stack. It's an excellent idea, since much of the overhead of network processing is in handling each packet rather than the actual data bytes. But of course it completely broke my application.

This is not one of the better documented bits of Linux. There is - of course - a utility to manage all this stuff, but it's so obscure that it's not part of the standard Ubuntu distribution. You have to explicitly install it. So a summary of what is required to solve this problem is:

    sudo -s
    apt-get install ethtool
    ethtool -K ethN tso off    # TCP segmentation offload
    ethtool -K ethN gso off    # generic segmentation offload
    ethtool -K ethN gro off    # generic receive offload

All of this requires root privileges. Whoever wrote ethtool had a sense of humor - the '-K' option sets parameters, the '-k' option (lower case) shows them. It would have been too hard, I suppose, to think of a different letter for such fundamentally different operations.
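
Incidentally, if you'd rather not depend on ethtool being installed at all, the same switches can be flipped from code with the SIOCETHTOOL ioctl. A sketch, with error checking omitted as usual:

    #include <sys/ioctl.h>
    #include <net/if.h>            // ifreq, IFNAMSIZ
    #include <linux/ethtool.h>     // ethtool_value, ETHTOOL_STSO etc.
    #include <linux/sockios.h>     // SIOCETHTOOL
    #include <cstring>

    // Turn one offload feature off on the named interface;
    // any open socket will do for the ioctl.
    static void offload_off(int fd, const char* ifname, __u32 cmd)
    {
        ethtool_value eval;
        eval.cmd = cmd;            // ETHTOOL_STSO, ETHTOOL_SGSO or ETHTOOL_SGRO
        eval.data = 0;             // 0 means off
        ifreq ifr;
        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
        ifr.ifr_data = (char*)&eval;
        ioctl(fd, SIOCETHTOOL, &ifr);
    }

Usage is just offload_off(my_socket, "eth1", ETHTOOL_STSO); and likewise for the other two.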

With that done, my code sees the packets at their normal (no greater than MTU) size. Finally, I could get on with debugging my own code.

2 comments:

Badger said...

Did you ever look at CONFIG_PACKET_MMAP to improve PF_PACKET ?

Badger said...

http://thread.gmane.org/gmane.linux.network/266517