Article originally posted on InfoQ.
Facebook’s globe-spanning network consists of wide-area backbone networks and edge Points-of-Presence, which support end-user-facing requests and internal traffic, both of which have been growing at a rapid pace. To meet increased network provisioning and maintenance demands, the network engineering team built Vending Machine, a workflow framework that uses Zero Touch Provisioning (ZTP) methods to run code that can perform any kind of configuration on network devices.
Zero Touch Provisioning (ZTP) is a mechanism by which network devices like switches can be configured from their factory-default state without any human intervention. Supported by many networking vendors, it involves the device sending a request on bootup, usually via DHCP, to fetch the location of a central server from which it can download and apply configuration. This process can subsequently involve automation tools like Chef and Puppet. Network devices that have traditionally been configured by a human operator using the CLI can be set up automatically. As switch vendors started supporting ZTP using DHCP, Facebook’s team worked with IP router and optical equipment vendors to support similar features. The existing automation work that the Facebook team had done in DHCP-based auto-provisioning could then be reused.
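To make the DHCP step concrete, here is a minimal sketch of how a provisioning server might pick the file a factory-default device should fetch. It assumes the device identifies itself through the DHCP vendor class identifier (option 60) and that the server answers with a bootfile name (option 67). The vendor prefixes, URLs and the `select_bootfile` helper are hypothetical and are not taken from Facebook's implementation.

```python
from typing import Optional

# Hypothetical mapping from a vendor class prefix (DHCP option 60)
# to the provisioning file the device should download (DHCP option 67).
BOOTFILES = {
    "vendor-a": "http://ztp.example.net/vendor-a/bootstrap.py",
    "vendor-b": "http://ztp.example.net/vendor-b/bootstrap.cfg",
}


def select_bootfile(vendor_class: str) -> Optional[str]:
    """Return the bootfile URL for a known vendor class, or None if unknown."""
    normalized = vendor_class.lower()
    for prefix, url in BOOTFILES.items():
        if normalized.startswith(prefix):
            return url
    return None
```

A device announcing `Vendor-A:switch-9000` would be pointed at the first URL, while an unrecognized device would get no bootfile and stay unprovisioned.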
Facebook has built network automation tools in the past, but most of its network provisioning and configuration was done via Methods of Procedure (MOPs). MOPs were essentially documentation, like runbooks, that engineers had to follow. As increasing deployment demands led to hiring more engineers to run these MOPs, the MOPs became more complicated and error-prone. Facebook’s original provisioning system had its roots in a console-based system. Over the years, new roles, paths and platforms were added, which made the MOP-based system harder to use. The framework called Vending Machine (VM) grew out of these needs. Vending Machine takes “a device role, location, and platform” as input and returns “a freshly provisioned network device, ready to deliver production traffic” as output, according to the article.
Facebook engineers describe the motivation behind building VM in a talk:
In response to a DHCPDISCOVER message, a device is given either a configuration file or a configuration script to execute on the network device. For the scripted option, how the script executes and what it’s capable of varies by each vendor (so far) and by network role. After configuring itself, the device will typically reboot. But in real life we have other things to do before releasing a device to production. We also have had interesting problems of not being able to generate configuration prior to physically installing a device – so if you don’t have configuration pre-generated, how do you respond to a DHCP request with a configuration file? This problem led us to develop a workflow automation system wrapped around ZTP.
According to the talk, Facebook had to make changes to its DHCP stack, which is based on ISC’s open source DHCP server, while building VM. A special piece of Python code is downloaded onto the network device by standard ZTP methods and becomes the starting point for a VM workflow. The Python agent downloads instructions, configuration, firmware and patches from the VM server to the network device, installs them, and sends the output log and exit status back to VM.
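The agent's run-and-report loop can be sketched roughly as follows. This is an illustration under stated assumptions, not Facebook's agent: instructions are modeled as local commands, the network transfer to and from the VM server is left out, and the `StepReport`, `run_instruction` and `run_agent` names are hypothetical. Stopping at the first failing instruction is also an assumption of this sketch.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List


@dataclass
class StepReport:
    """Hypothetical record of what the agent would send back to the server."""
    command: List[str]
    exit_status: int
    output: str


def run_instruction(command: List[str]) -> StepReport:
    """Run one instruction, capturing combined stdout/stderr and the exit status."""
    proc = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return StepReport(command=command, exit_status=proc.returncode, output=proc.stdout)


def run_agent(instructions: List[List[str]]) -> List[StepReport]:
    """Run instructions in order and stop at the first failure (an assumption)."""
    reports = []
    for command in instructions:
        report = run_instruction(command)
        reports.append(report)
        if report.exit_status != 0:
            break
    return reports


if __name__ == "__main__":
    # Stand-in instructions; a real agent would install configuration or firmware.
    for r in run_agent([[sys.executable, "-c", "print('configured')"]]):
        print(r.exit_status, r.output.strip())
```

Capturing the log and exit status per instruction is what lets the server side decide whether a step succeeded or needs to be run again.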
Image courtesy: https://code.facebook.com/posts/166812063987311/scaling-the-facebook-backbone-through-zero-touch-provisioning/
VM follows a workflow model in which engineering teams can run code in a series of standard steps. If a step fails, it is requeued for future execution. Each step can consist of a standalone binary, so steps can be written in any programming language. Gradually, the teams moved away from MOPs as more and more VM steps were developed to replace them. VM sped up the provisioning process further by determining which steps were independent and running them in parallel.
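The workflow model described here, with requeueing on failure and parallel execution of independent steps, can be approximated with a small scheduler. The sketch below is only an illustration: steps are Python callables standing in for standalone binaries, the explicit `depends_on` map and the `max_attempts` cap are assumptions, and `run_workflow` is a hypothetical name rather than part of VM.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Set


def run_workflow(
    steps: Dict[str, Callable[[], bool]],
    depends_on: Dict[str, Set[str]],
    max_attempts: int = 3,
) -> Dict[str, int]:
    """Run steps, returning how many attempts each one needed.

    steps maps a step name to a callable that returns True on success.
    depends_on maps a step name to the steps that must succeed before it runs.
    """
    done: Set[str] = set()
    pending = set(steps)
    attempts = {name: 0 for name in steps}

    with ThreadPoolExecutor() as pool:
        while pending:
            # Steps whose dependencies have all succeeded are independent
            # of each other and can run in the same round.
            ready = sorted(n for n in pending if depends_on.get(n, set()) <= done)
            if not ready:
                raise RuntimeError(f"unsatisfiable dependencies: {sorted(pending)}")

            results = list(pool.map(lambda name: steps[name](), ready))

            for name, succeeded in zip(ready, results):
                attempts[name] += 1
                if succeeded:
                    done.add(name)
                    pending.discard(name)
                elif attempts[name] >= max_attempts:
                    raise RuntimeError(f"step {name!r} failed {max_attempts} times")
                # A failed step stays in pending, i.e. it is requeued.
    return attempts
```

For example, a firmware step and a configuration step with no dependencies would run in the same round, while a verification step that depends on both would only start once each has succeeded.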
VM’s roadmap includes orchestration of groups of VM jobs and fully automated rebuilds of planes in Facebook’s global network. VM is an example of the recent wave of DevOps principles being applied to networking.