The Programmable Data Plane Reading List

Applications

A main motivation for programmable data planes is the novel applications they enable. In the following, we identify and discuss five main categories: applications related to resilient and efficient forwarding, in-network computation, consensus, telemetry, and load-balancing. One may wonder which aspects of SDN and programmable data planes make these applications possible. There is probably no single perfect answer to this question. Applications related to in-network computation typically leverage new hardware-assisted primitive operations, supported in the data plane, to provide novel functionality and improve performance. Resilient and efficient routing (and to some extent load-balancing) leverage the unprecedented programmatic control over the way traffic flows through the network, e.g., to implement advanced functionality in the data plane (whereas formerly it was handled, e.g., in the control plane). Measurement applications benefit from the improved traffic visibility and/or from the improved latency and throughput at which high-volume and highly variable traffic can be handled if offloaded to the data plane. Reduced latency and improved reaction time are arguably also a key reason for consensus applications. Furthermore, measurement applications benefit from the fact that they can be expressed in terms of simple primitives (e.g., sketches). We also note that such applications are not limited to being "performed (only) in the network": for example, telemetry can (and today often does) occur outside the network. That said, telemetry applications also benefit from the new visibility into the network, e.g., the queue occupancy levels of the switches along the path. Many interesting applications also arise from offloading functionality that was formerly handled in a separate middlebox to programmable switches.
In general, any application designed for a non-programmable device may benefit from the flexibility introduced by a programmable counterpart (e.g., allowing the application to evolve). Also, applications with a strong networking component (e.g., request-response patterns) are more likely to benefit from in-network services, as much communication traffic naturally traverses the network anyway.

Resilient, Robust, and Efficient Forwarding

Data planes often operate much faster than the control plane, which motivates moving functionality for maintaining connectivity and efficient routing under failures to the switches. At the same time, implementing such functionality is non-trivial, as discussed in the following research papers.

Hedera: Dynamic Flow Scheduling for Data Center Networks Mohammad Al-Fares, Sivasankar Radhakrishnan, Barath Raghavan, Nelson Huang, Amin Vahdat — USENIX NSDI '10 (2010) This paper is motivated by the limitations of existing IP multipath protocols relying on per-flow static hashing, which can result in suboptimal throughput and bandwidth losses due to long-term collisions. Hedera is a dynamic flow scheduling system for multi-stage switch topologies as they often appear in data centers. Hedera uses flow information from constituent switches and reroutes traffic to non-conflicting routes accordingly. The authors show that the more global view of routing and traffic demands allows Hedera to see bottlenecks that switch-local schedulers cannot, and to adaptively schedule the switching fabric in a way which significantly improves aggregate network utilization with minimal overheads.
Ensuring Connectivity via Data Plane Mechanisms Junda Liu, Aurojit Panda, Ankit Singla, Brighten Godfrey, Michael Schapira, Scott Shenker — USENIX NSDI '13 (2013) The authors propose to move the responsibility for maintaining basic network connectivity (as opposed to the computation of optimal paths, which requires global control plane knowledge) to the data plane, which operates orders of magnitude faster than the control plane. Their Data-Driven Connectivity (DDC) approach, which can handle arbitrary delays and losses, relies on simple state changes which can be done at packet rates. In particular, DDC relies on link reversal routing, adapted to suit the data plane, e.g., to handle message loss.
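To make the link reversal idea concrete, the following minimal Python sketch implements classic full link reversal on a small directed graph. It is not DDC's data-plane variant, which adapts the idea to cope with message loss. The function name, the edge-set representation, and the `max_steps` guard are assumptions made for illustration.

```python
def full_link_reversal(edges, destination, max_steps=1000):
    """Classic full link reversal (illustrative sketch, not DDC itself).

    edges: iterable of directed (u, v) pairs, meaning u forwards to v.
    A node other than the destination that has no outgoing link
    reverses all of its incoming links. The process terminates if every
    node is still physically connected to the destination; max_steps is
    a hypothetical guard against the disconnected case.
    """
    edges = set(edges)
    nodes = {n for edge in edges for n in edge}
    for _ in range(max_steps):
        has_out = {u for u, _ in edges}
        sinks = sorted(n for n in nodes if n != destination and n not in has_out)
        if not sinks:
            return edges  # every non-destination node has a next hop
        node = sinks[0]
        incoming = {(u, v) for (u, v) in edges if v == node}
        edges = (edges - incoming) | {(v, u) for (u, v) in incoming}
    raise RuntimeError("no convergence: is the destination reachable?")


# Before the failure: a->b, b->d, a->c, c->d. The link b-d fails,
# so b is left without an outgoing link and reverses a->b into b->a.
repaired = full_link_reversal({("a", "b"), ("a", "c"), ("c", "d")}, "d")
```

In this example, b regains a route to d via a and c after a single reversal, using only local state changes.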
The show must go on: Fundamental data plane connectivity services for dependable SDNs Michael Borokhovich, Clement Rault, Liron Schiff, Stefan Schmid — Elsevier Computer Communications 116 (2018) The paper argues that in order to provide high availability, connectivity, and robustness, dependable SDNs must implement functionality for inband network traversals, e.g., to find failover paths in the presence of link failures. Three fundamentally different mechanisms are described: simple stateless mechanisms, efficient mechanisms based on packet tagging, and mechanisms based on dynamic state at the switches.
Blink: Fast Connectivity Recovery Entirely in the Data Plane Thomas Holterbach, Edgar Costa Molero, Maria Apostolaki, Alberto Dainotti, Stefano Vissicchio, Laurent Vanbever — USENIX NSDI '19 (2019) The paper explores new possibilities, created by programmable switches, for fast dataplane-driven rerouting upon signals triggered by traffic disruptions. The proposed method, Blink, exploits TCP-induced signals to detect failures; when compounded over multiple flows, TCP behavior creates a strong and characteristic failure signal. Blink analyzes TCP flows, at line rate, to reliably and quickly detect major traffic disruptions and recover data-plane connectivity. Evaluation results on a P4 implementation of Blink running on a real Tofino switch indicate that it can achieve sub-second rerouting for realistic Internet traffic and scales to protect large fractions of realistic traffic.

In-network Computation

Offloading computation, on‐path aggregation functionalities, caching, or even AI to the network has the potential to significantly improve the efficiency of distributed applications. Accordingly, the study of such mechanisms has recently received much attention.

Camdoop: Exploiting In-network Aggregation for Big Data Applications Paolo Costa, Austin Donnelly, Antony Rowstron, Greg O'Shea — USENIX NSDI '12 (2012) This paper makes the case that many massive-scale information processing and real-time applications may benefit from pushing data-aggregation load from the network edge into the network. This is because in many of these applications data is aggregated during the computation process and the output size is a fraction of the input size. The authors explore a different point in the design space: instead of increasing the network bandwidth, they implement a MapReduce-like system on a cluster design that uses a direct-connect network topology, with servers directly linked to other servers, and let servers perform in-network aggregation of data during the shuffle phase. Camdoop was shown to significantly reduce network traffic and provide a large performance increase.
Netagg: Using middleboxes for application-specific on-path aggregation in data centres Luo Mai, Lukas Rupprecht, Abdul Alim, Paolo Costa, Matteo Migliavacca, Peter Pietzuch, Alexander L. Wolf — ACM CoNEXT '14 (2014) This paper is motivated by the performance challenges faced by data-center applications, such as Hadoop batch processing, during the data aggregation phase: if the network struggles to support many-to-few, high-bandwidth communication between servers, then it can become a bottleneck. Mai et al. propose to depart from performing data aggregation at edge servers and instead to do it more efficiently along network paths. The presented software platform, NETAGG, supports on-path aggregation for network-bound partition/aggregation applications. It is based on a middlebox-like design, in which dedicated servers execute aggregation functions provided by applications. The authors demonstrate that NETAGG can improve throughput substantially.
Scalable hierarchical aggregation protocol (SHArP): a hardware architecture for efficient data reduction Richard L. Graham, Devendar Bureddy, Pak Lui, Hal Rosenstock, Gilad Shainer, Gil Bloch, Dror Goldenberg, Mike Dubman, Sasha Kotchubievsky, Vladimir Koushnir, Lion Levi, Alex Margolin, Tamir Ronen, Alexander Shpiner, Oded Wertheim, Eitan Zahavi — IEEE COMHPC '16 (2016) SHArP is designed to offload computational load to the network, by relying on intelligent network devices manipulating data traversing the datacenter. SHArP is implemented in Mellanox’s SwitchIB-2 ASIC, using in-network trees to reduce data from a group of sources, and to distribute the result. Multiple parallel jobs with several partially overlapping groups are supported, and pipelining is used for improving latency further.
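The common idea behind the aggregation systems above, reducing data along a tree so that each link carries a single partial result, can be sketched in a few lines of Python. This is a toy model under stated assumptions: the tree layout, the node names, and the choice of summation as the reduction are hypothetical and are not taken from Camdoop, NETAGG, or SHArP.

```python
def aggregate_up(tree, values, node):
    """Reduce (here: sum) the values in the subtree rooted at `node`.

    tree:   dict mapping a node to the list of its children.
    values: dict mapping leaf nodes (hosts) to their local value.
    Returns (reduced_value, messages), where `messages` counts one
    message per tree link, since every child sends exactly one
    partial result to its parent.
    """
    total = values.get(node, 0)
    messages = 0
    for child in tree.get(node, []):
        child_total, child_messages = aggregate_up(tree, values, child)
        total += child_total
        messages += child_messages + 1  # the child's single upstream message
    return total, messages


# Hypothetical two-level tree: four hosts below two switches.
tree = {"root": ["s1", "s2"], "s1": ["h1", "h2"], "s2": ["h3", "h4"]}
values = {"h1": 1, "h2": 2, "h3": 3, "h4": 4}
result, link_messages = aggregate_up(tree, values, "root")
```

With on-path aggregation the root receives two partial sums instead of four individual host values; the saving grows with the fan-in of the tree.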
In-Network Computation is a Dumb Idea Whose Time Has Come Amedeo Sapio, Ibrahim Abdelaziz, Abdulla Aldilaijan, Marco Canini, Panos Kalnis — ACM HotNets '17 (2017) The authors ask the question, given that programmable data plane hardware creates new opportunities for infusing intelligence into the network, what kinds of computation should be delegated to the data plane? The paper discusses the opportunities and challenges for co-designing data center distributed systems with their network layer, under the constraints imposed by the limitations of the network machine architecture of programmable devices. They find that, in particular, aggregation functions raise opportunities to exploit the limited computation power of networking hardware to lessen network congestion and improve the overall application performance.
IncBricks: Toward In-Network Computation with an In-Network Cache Ming Liu, Liang Luo, Jacob Nelson, Luis Ceze, Arvind Krishnamurthy, Kishore Atreya — ASPLOS '17 (2017) This paper presents IncBricks, an in-network caching fabric with basic computing primitives. IncBricks is a hardware-software co-designed system that supports caching in the network using a programmable network middlebox. As a key-value store accelerator, the prototype lowers request latency by over 30% and doubles throughput for 1024-byte values in a common cluster configuration. The results demonstrate the effectiveness of in-network computing and that efficient datacenter network request processing is possible if the computation is carefully split across programmable switches, network accelerators, and end hosts.
Evaluating the Power of Flexible Packet Processing for Network Resource Allocation Naveen Kr. Sharma, Antoine Kaufmann, Thomas Anderson, Changhoon Kim, Arvind Krishnamurthy, Jacob Nelson, Simon Peter — NSDI '17 (2017) The main contribution of this work is a set of general building blocks that mask the limitations of programmable switches (limited state, limited types of operations, limited per-packet computation) using approximation techniques, thereby enabling the implementation of realistic network protocols. These building blocks are then used to tackle the network resource allocation problem within datacenters and realize approximate variants of congestion control and load balancing protocols, such as XCP, RCP, and CONGA, that require explicit support from the network. The evaluations show that the proposed approximations are accurate and that they do not exceed the hardware resource limits associated with flexible switches.
Can the Network Be the AI Accelerator? Davide Sanvito, Giuseppe Siracusano, Roberto Bifulco — ACM NetCompute '18 (2018) This paper analyzes the feasibility of, and opportunities for, using programmable network devices (e.g., network cards and switches) as accelerators for Artificial Neural Networks (NNs). In particular, the authors investigate the properties of NN processing on CPUs and find that programmable network devices may indeed be a suitable engine for implementing a CPU’s NN co-processor.
In-network Neural Networks Giuseppe Siracusano, Roberto Bifulco — unpublished manuscript (2018) The paper presents N2Net, a system that implements binary neural networks using commodity switching chips deployed in network switches and routers. N2Net shows that these devices can run simple neural network models, whose input is encoded in the network packets' header, at packet processing speeds (billions of packets per second). Furthermore, the authors' experience highlights that switching chips could support even more complex models, provided that some minor and cheap modifications to the chip's design are applied.

Distributed Consensus

Another interesting application for programmable data planes is related to consensus algorithms: the coordination among controllers or switches may be performed most efficiently directly on the network devices. Over the last years, several interesting first approaches have been reported in the literature, not only to compute consensus but also to provide different notions of consistency more generally.

NetPaxos: Consensus at Network Speed Huynh Tu Dang, Daniele Sciascia, Marco Canini, Fernando Pedone, Robert Soulé — ACM SOSR '15 (2015) This paper explores the possibility of implementing the widely deployed Paxos consensus protocol in network devices. Two different approaches are presented: (1) a detailed design description for implementing the full Paxos logic in SDN switches, which identifies a sufficient set of required OpenFlow extensions, and (2) an alternative, optimistic protocol which can be implemented without changes to the OpenFlow API, but relies on assumptions about how the network orders messages. Although neither of these protocols can be fully implemented without changes to the underlying switch firmware, the authors argue that such changes are feasible in existing hardware.
Paxos Made Switch-y Huynh Tu Dang, Marco Canini, Fernando Pedone, and Robert Soulé — ACM SIGCOMM CCR 46,2 (2016) This paper posits that there are significant performance benefits to be gained by implementing the Paxos protocol, the foundation for building many fault-tolerant distributed systems and services, in network devices. The paper describes an implementation of Paxos in P4.
Be Fast, Cheap and in Control with SwitchKV Xiaozhou Li, Raghav Sethi, Michael Kaminsky, David G. Andersen, Michael J. Freedman — USENIX NSDI '16 (2016) SwitchKV implements a key-value store system that leverages SDN switches to balance the workload of the cache servers by routing traffic based on the content of the network packets. To identify the content of a packet, the key of a key-value entry is encoded in the packet header. A hybrid cache strategy keeps the cache and switch forwarding rules updated, achieving significant improvements in both system throughput and latency.
In-band synchronization for distributed SDN control planes Liron Schiff, Stefan Schmid, Petr Kuznetsov — ACM SIGCOMM CCR 46,1 (2016) The paper considers the design of consistent distributed control planes in which the actions performed on the data plane by different controllers need to be synchronized. The authors propose a synchronization framework based on atomic transactions implemented in the data plane switches and show that their approach makes it possible to realize fundamental consensus primitives in the presence of controller failures. They also discuss applications for consistent policy composition. With a proof-of-concept implementation, it is demonstrated that the framework can be implemented using the standard OpenFlow protocol.
NetCache: Balancing Key-Value Stores with Fast In-Network Caching Xin Jin, Xiaozhou Li, Haoyu Zhang, Robert Soulé, Jeongkeun Lee, Nate Foster, Changhoon Kim, Ion Stoica — ACM SOSP '17 (2017) NetCache implements a small cache for key-value stores in the data plane of a programmable hardware switch. The switch works as a cache at the datacenter's rack level, handling requests directed to the rack's servers. The implementation deals with consistency problems and shows how to overcome the constraints of hardware to provide throughput and latency improvements.
KV-Direct: High-Performance In-Memory Key-Value Store with Programmable NIC Bojie Li, Zhenyuan Ruan, Wencong Xiao, Yuanwei Lu, Yongqiang Xiong, Andrew Putnam, Enhong Chen, Lintao Zhang — ACM SOSP '17 (2017) The paper presents an interesting alternative to NetCache: instead of using in-network programmable switches to cache key-value pairs, it leverages programmable NICs to accelerate key-value stores in an "end-to-end" fashion. In particular, KV-Direct extends RDMA using programmable NICs to enable remote direct key-value access to the main host memory, yielding more than 1.2 billion operations per second using 10 parallel NICs.
NetChain: Scale-Free Sub-RTT Coordination Xin Jin, Xiaozhou Li, Haoyu Zhang, Nate Foster, Jeongkeun Lee, Robert Soulé, Changhoon Kim, Ion Stoica — USENIX NSDI '18 (2018) This paper presents NetChain, a new approach that provides scale-free sub-RTT coordination in data centers. NetChain exploits programmable switches to store data and process queries entirely in the network data plane. This eliminates the query processing at coordination servers and cuts the end-to-end latency to as little as half of an RTT. New protocols and algorithms designed for NetChain guarantee strong consistency and handle switch failures efficiently.

Monitoring, Telemetry, and Measurement

Perhaps the most interesting applications are related to network measurement, monitoring and diagnosis. Indeed, programmable data planes can be a game changer, providing deep insights into the network, even to end-hosts, as we discuss in the following.

Millions of little minions: Using packets for low latency network programming and visibility Vimalkumar Jeyakumar, Mohammad Alizadeh, Yilong Geng, Changhoon Kim, David Mazières — ACM SIGCOMM '14 (2014) Jeyakumar et al. present an approach to give end-hosts visibility into network behavior and to quickly introduce new data plane functionality, via a new Tiny Packet Program (TPP) interface. TPPs are embedded into packets by end-hosts and can actively query and manipulate internal network state. The idea is motivated by a clear division of work: switches forward and execute TPPs in-band at line rate, and end-hosts perform arbitrary (and easily updated) computation on network state. The paper presents a number of use case descriptions motivating In‐band Network Telemetry (INT).
In-band Network Telemetry via Programmable Dataplanes Changhoon Kim, Anirudh Sivaraman, Naga Katta, Antonin Bas, Advait Dixit, Lawrence J Wobker — ACM SOSR '15 Demos (2015) In-band Network Telemetry (INT) is a powerful new network-diagnostics and debug mechanism, which makes it possible, e.g., to diagnose performance problems related to latency spikes. The INT abstraction allows data packets to query switch-internal state (e.g., queue size, link utilization, and queuing latency). The paper reports on a prototype implemented in the P4 language, hence supporting a variety of programmable network devices.
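The INT abstraction can be illustrated with a small Python model in which every switch on the path appends its own state to the packet, and the receiver inspects the collected records. This is only a conceptual sketch: the class names, the record fields, and the `most_congested_hop` helper are hypothetical and do not reflect the actual INT header format or the P4 prototype.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    payload: bytes
    # Per-hop telemetry records, appended in path order (assumed layout).
    int_stack: list = field(default_factory=list)


@dataclass
class Switch:
    switch_id: str
    queue_depth: int  # stand-in for switch-internal state

    def forward(self, packet):
        """Append this hop's state to the packet, then pass it on."""
        packet.int_stack.append(
            {"switch_id": self.switch_id, "queue_depth": self.queue_depth}
        )
        return packet


def most_congested_hop(packet):
    """Receiver-side analysis: the hop that reported the deepest queue."""
    return max(packet.int_stack, key=lambda hop: hop["queue_depth"])["switch_id"]


path = [Switch("s1", 3), Switch("s2", 17), Switch("s3", 5)]
pkt = Packet(b"data")
for switch in path:
    pkt = switch.forward(pkt)
bottleneck = most_congested_hop(pkt)
```

The receiver learns which hop on the path had the longest queue without issuing any separate query to the switches.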
Towards Accurate Online Traffic Matrix Estimation in Software-defined Networks Yanlei Gong, Xiong Wang, Mehdi Malboubi, Sheng Wang, Shizhong Xu, Chen-Nee Chuah — ACM SOSR '15 (2015) The paper seeks accurate, feasible and scalable traffic matrix estimation approaches, by designing feasible traffic measurement rules that can be installed in TCAM entries of SDN switches. The statistics of the measurement rules are collected by the controller to estimate a fine-grained traffic matrix. Two strategies are proposed, called Maximum Load Rule First (MLRF) and Large Flow First (LFF), both of which satisfy the flow aggregation constraints (determined by associated routing policies) and have low complexity.
Heavy-Hitter Detection Entirely in the Data Plane Vibhaalakshmi Sivaraman, Srinivas Narayana, Ori Rottenstreich, S. Muthukrishnan, Jennifer Rexford — ACM SOSR '17 (2017) The paper describes HashPipe, a heavy hitter detection algorithm using programmable data planes. HashPipe implements a pipeline of hash tables which retain counters for heavy flows while evicting lighter flows over time. HashPipe is prototyped in P4 and evaluated with packet traces from an ISP backbone link and a data center.
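The retain-and-evict behavior described for HashPipe can be sketched in Python as follows. This is a simplified software model, not the P4 prototype: the first stage always admits the incoming key, and later stages keep whichever of the carried and the resident entry has the larger counter. The class name, the table sizes, and the SHA-256-based hashing are assumptions for illustration.

```python
import hashlib


class HashPipeSketch:
    """Simplified software model of a pipeline of hash tables."""

    def __init__(self, stages=3, slots=4):
        self.slots = slots
        # Each slot holds None or a (key, count) tuple.
        self.tables = [[None] * slots for _ in range(stages)]

    def _index(self, stage, key):
        # A different hash per stage, derived by salting with the stage id.
        digest = hashlib.sha256(f"{stage}:{key}".encode()).digest()
        return int.from_bytes(digest[:4], "big") % self.slots

    def update(self, key):
        # Stage 0: always insert the new key, evicting any other resident.
        i = self._index(0, key)
        entry = self.tables[0][i]
        if entry is None:
            self.tables[0][i] = (key, 1)
            return
        if entry[0] == key:
            self.tables[0][i] = (key, entry[1] + 1)
            return
        self.tables[0][i] = (key, 1)
        carried = entry
        # Later stages: keep the heavier entry, carry the lighter one on.
        for stage in range(1, len(self.tables)):
            i = self._index(stage, carried[0])
            entry = self.tables[stage][i]
            if entry is None:
                self.tables[stage][i] = carried
                return
            if entry[0] == carried[0]:
                self.tables[stage][i] = (entry[0], entry[1] + carried[1])
                return
            if carried[1] > entry[1]:
                self.tables[stage][i], carried = carried, entry
        # Whatever is still carried after the last stage is dropped.

    def estimate(self, key):
        """Sum the counters held for `key` across all stages."""
        total = 0
        for stage, table in enumerate(self.tables):
            entry = table[self._index(stage, key)]
            if entry is not None and entry[0] == key:
                total += entry[1]
        return total
```

Because counters are only ever credited to the key that produced them, an estimate in this model never exceeds the true count; light flows are the ones most likely to be dropped at the end of the pipeline.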
Dapper: Data plane performance diagnosis of TCP Mojgan Ghasemi, Theophilus Benson, Jennifer Rexford — ACM SOSR '17 (2017) Dapper is a system which leverages emerging edge devices offering flexible and high-speed packet processing on commodity hardware, to diagnose cloud performance problems in a timely manner. In particular, Dapper analyzes TCP performance in real time near the end-hosts, i.e., at the hypervisor, NIC, or top-of-rack switch, by determining whether a connection is limited by the sender, the network, or the receiver. Dapper was prototyped in P4.
SketchVisor: Robust Network Measurement for Software Packet Processing Qun Huang, Xin Jin, Patrick P. C. Lee, Runhui Li, Lu Tang, Yi-Chao Chen, Gong Zhang — ACM SIGCOMM '17 (2017) The paper presents SketchVisor, a robust network measurement framework, which augments sketch-based measurement in the data plane with a fast path that is activated under high traffic load to provide high-performance local measurement with slight accuracy degradation. It further recovers accurate network-wide measurement results via compressive sensing. A SketchVisor prototype is built on top of Open vSwitch; testbed experiments show that SketchVisor achieves high throughput and high accuracy for a wide range of network measurement tasks.
Scaling Hardware Accelerated Network Monitoring to Concurrent and Dynamic Queries With *Flow John Sonchack, Oliver Michel, Adam J. Aviv, Eric Keller, Jonathan M. Smith — USENIX ATC '18 (2018) *Flow is a hardware-accelerated network telemetry system that can export flexible packet-level records that analytics applications can use to calculate a wide range of metrics. Contrary to previous approaches, *Flow partitions hardware and software components such that applications can operate concurrently and dynamically on telemetry streams without impacting each other, enabling parallel and runtime-configurable measurement analytics applications. The system is implemented in P4 and runs on commodity line-rate switching hardware.

Load balancing

Last but not least, and similarly to the above discussion on resilient routing, programmable data planes provide unprecedented flexibilities (and performance) in how traffic can be dynamically load-balanced.

Hula: Scalable load balancing using programmable data planes Naga Katta, Mukesh Hira, Changhoon Kim, Anirudh Sivaraman, Jennifer Rexford — ACM SOSR '16 (2016) HULA is motivated by the shortcomings of ECMP as well as of existing congestion-aware load-balancing techniques such as CONGA, which, due to limited switch memory, can only maintain a limited amount of congestion-tracking state at the edge switches, and hence do not scale. HULA is a more flexible and scalable data-plane load-balancing algorithm in which each switch tracks congestion only for the best path to a destination through a neighboring switch. HULA is designed for programmable switches and is programmed in P4.
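The per-destination state that HULA keeps at each switch can be sketched in Python as below. This is a simplified model of the probe-processing rule as commonly described for HULA; it omits flowlet switching and the aging of stale entries. The class, method, and parameter names are hypothetical.

```python
class HulaSwitchSketch:
    """Tracks only the best next hop (and its utilization) per destination."""

    def __init__(self):
        self.best_hop = {}   # destination -> neighbor on the best path
        self.best_util = {}  # destination -> utilization of that path

    def on_probe(self, dst, from_neighbor, probe_util, link_util):
        """Process a probe for `dst` that arrived from `from_neighbor`.

        probe_util: maximum link utilization on the path beyond the neighbor.
        link_util:  utilization of the local link towards the neighbor.
        Returns the utilization this switch would advertise upstream.
        """
        path_util = max(probe_util, link_util)
        current = self.best_util.get(dst)
        # Update if the path is better, or if the probe refreshes the
        # current best hop (so a worsening best path is noticed).
        if (current is None or path_util < current
                or self.best_hop[dst] == from_neighbor):
            self.best_hop[dst] = from_neighbor
            self.best_util[dst] = path_util
        return self.best_util[dst]

    def next_hop(self, dst):
        return self.best_hop.get(dst)


sw = HulaSwitchSketch()
sw.on_probe("tor9", "s1", probe_util=0.5, link_util=0.2)  # best: s1 at 0.5
sw.on_probe("tor9", "s2", probe_util=0.3, link_util=0.1)  # best: s2 at 0.3
sw.on_probe("tor9", "s1", probe_util=0.4, link_util=0.1)  # ignored, 0.4 > 0.3
```

The state per switch grows with the number of destinations rather than with the number of paths, which is the scalability argument summarized in the entry above.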
SilkRoad: Making Stateful Layer-4 Load Balancing Fast and Cheap Using Switching ASICs Rui Miao, Hongyi Zeng, Changhoon Kim, Jeongkeun Lee, Minlan Yu — ACM SIGCOMM '17 (2017) The paper explores how to use programmable switching ASICs to build much faster load balancers than have been built before. The proposed system, called SilkRoad, is defined in 400 lines of P4 and, when compiled to a state-of-the-art switching ASIC, it can load-balance ten million connections simultaneously at line rate.
Load Balancing Memcached Traffic Using Software Defined Networking Anat Bremler-Barr, David Hay, Idan Moyal, Liron Schiff — IFIP Networking '17 (2017) Memcached is an in-memory key-value distributed caching solution, commonly used by web servers for fast content delivery. In order to deal with skewed distributions of key popularity in key-value stores, the authors propose and implement MBalancer, a switch-based L7 load balancing scheme, which offloads requests from bottleneck Memcached servers. MBalancer runs as an SDN application, identifies the (typically small number of) hot keys, duplicates these hot keys to many (or all) Memcached servers, and adjusts the switches' forwarding tables accordingly. Experiences with an implementation of MBalancer on a hardware-based OpenFlow switch indicate a significant throughput boost and latency reduction.

Control plane acceleration

Hardware-Accelerated Network Control Planes Edgar Costa Molero, Stefano Vissicchio, Laurent Vanbever — ACM HotNets '18 (2018) This seminal paper challenges a fundamental design principle of modern network architecture, namely that the control plane is software-based. With the advent of programmable switch ASICs, which can run complex logic at line rate, the paper revisits this principle and accelerates the control plane by offloading some of its tasks directly to the network hardware. Some simple control plane functionality can already be successfully offloaded to P4 hardware, including failure detection and notification, connectivity retrieval, and even policy-based routing protocols, but complex cases involve several tradeoffs and limitations; the paper outlines these and sketches interesting future research directions towards hardware-software co-design of network control planes.
Precise Time-synchronization in the Data-Plane Using Programmable Switching ASICs Pravein Govindan Kannan, Raj Joshi, Mun Choon Chan — ACM SOSR '19 (2019) Current implementations of time synchronization protocols, like PTP, handle the protocol stack in the control-plane. The paper explores the possibility of using programmable switching ASICs to design and implement a time synchronization protocol, DPTP, with the core logic running in the data-plane. Comprehensive measurement studies of a dataplane-accelerated DPTP, implemented in P4 and running on a Barefoot Tofino switch, show that DPTP can achieve median and 99th percentile synchronization errors of 19 ns and 47 ns, respectively, even under heavy network load.