Back in 2011, a controversial idea was brought to the table inside Facebook. In the seven years since, it has turned the $178 billion data center hardware industry upside down.
By open-sourcing its hardware via the Open Compute Project (OCP), Facebook has done for hardware what Linux, Android and other projects did for software: made it free and open source. In the software world, Linux now runs in most data centers around the world, and Android has become the most popular smartphone platform worldwide. Powerful companies like Nokia, BlackBerry and Microsoft were severely disrupted along the way, some nearly to the point of collapse.
Seven years after its founding, OCP threatens to wreak similar havoc on incumbent hardware companies like Cisco and Juniper Networks.
The Open Compute Project: History & Vision for OCP
The Open Compute Project was started in 2011 when Facebook decided to open source the designs of servers it had built to run more efficient hyperscale data centers. The social network giant’s goal was to encourage other companies to adopt and adapt its initial hardware designs, thereby pushing down costs, encouraging faster innovation in the field of IT hardware and improving quality. From the start, Facebook partnered with Intel and Rackspace to apply open source principles to hardware intended for use in data centers and telecommunication facilities. Participants are encouraged to use these designs to build, market and further develop their own products.
The vision for OCP has so far played out as planned, and shows no signs of slowing any time soon. Goldman Sachs, Intel, Microsoft and Apple have joined, as has Andy Bechtolsheim, the well-known hardware engineer who co-founded Sun Microsystems and the Cisco rival Arista Networks. Its conferences draw thousands of attendees.
According to a recent study by IHS Markit (commissioned by OCP), sales of hardware built to Open Compute designs rose to over $1.2 billion last year, double the 2016 figure, and are expected to reach $6 billion by 2021. These figures exclude hardware spending by OCP board members Facebook, Intel, Rackspace, Microsoft and Goldman Sachs, all of which also use OCP products. The spend is still relatively minor compared to the overall market for data center systems, which Gartner estimates was worth $178 billion in 2017; however, IHS predicts a five-year compound annual growth rate (CAGR) of 59% for OCP products, against expected overall market growth in the low single digits. Year-on-year OCP growth from non-board-member companies was 103% in 2017. IHS estimates that EMEA revenue from non-board members will surpass $1 billion (US) by 2021, and APAC is predicted to overtake EMEA in adoption as soon as 2020.
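As a sanity check on those projections, the implied growth rate can be computed from the endpoints. A minimal sketch (the roughly $0.6 billion 2016 base is inferred from the article's "double that of 2016" figure, not stated by IHS directly):

```python
def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate between two values."""
    return (end / start) ** (1 / years) - 1

# Growing from ~$0.6B (2016) to $6B (2021) over five years
# implies a CAGR of about 58.5%, consistent with the 59% IHS figure.
print(f"{cagr(0.6, 6.0, 5):.1%}")  # -> 58.5%
```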
In 2016, Business Insider cited an anonymous insider to demonstrate the popularity of OCP. “OCP has a cultlike following,” the anonymous source said. “The whole industry, internet companies, vendors, and enterprises are monitoring OCP.”
Savings from OCP hardware come on three main fronts: energy, materials and money. Facebook has been pursuing these efficiencies since its first data center build in Prineville, Oregon in 2011. Energy efficiency was boosted by cutting waste in the power supply and increasing server height, which allows larger-diameter fans that use less energy to move more air. Stripping out unnecessary components, such as unneeded expansion slots, paint, logos and vanity faceplates, saved six pounds of material per server. Cost reductions followed, not only at the Prineville facility but also at subsequent data center builds.
The Barriers to OCP Adoption
According to IHS, the three main barriers to OCP adoption are security, sourcing and integration. Because the specifications for OCP hardware are open source, there is greater potential for attackers to tamper with hardware before delivery, making supply chain security more of a challenge. However, OCP leaders recently said they plan to address these issues through a new Security Project focused on creating a standard hardware interface and protocols for guaranteeing boot-code integrity in both new products and second-hand hardware.
New OCP Initiatives
OCP is working on a range of other initiatives, including improving the ease of hardware and software integration, an effort that has ramped up since Microsoft joined the OCP board. Another new initiative is the Open System Firmware Project, aimed at open-sourcing the code that initializes server chipsets so that it can be used across a range of platforms and processor types, allowing OCP servers to boot up more easily.
Threats to incumbent networking equipment vendors like Cisco and Juniper continue. The incumbents tightly couple their proprietary software and hardware; OCP intends to offer the same tight integration, but with open software customized to open hardware. OCP is partnering with the Linux Foundation to integrate its hardware with the Open Platform for NFV (OPNFV) software stack, and the two organizations recently announced a renewed commitment to joint testing of hardware and software products.
Data Center Expansion over 2017 and 2018
Facebook opened four new data center locations in 2017 – in Denmark, Nebraska, Ohio and Virginia – bringing its total to twelve locations across Europe and the U.S. One signature of Facebook's data centers is sustainability: all new facilities are powered 100% by clean, renewable energy, and last year the company set a goal of reaching 50% clean, renewable energy across its entire electricity supply mix by the end of 2018.
Meanwhile, across 2018 Facebook has been significantly expanding its cloud campus in Papillion, Nebraska, adding four new data centers there, and is now preparing to expand its Altoona, Iowa data center complex as well. Such giant data center hubs offer economies of scale, allowing companies to add server capacity and electric power as needed.
The Products from Facebook that are Pushing the OCP Envelope
Facebook has been innovating in the hardware space since 2011 and continues to introduce new designs that strengthen its data center position, serving more than 2 billion users each month while keeping its energy footprint low.
Facebook’s stack is disaggregated, allowing the company to replace hardware or software independently whenever improved technology becomes available. This approach has yielded impressive performance gains across the compute, storage and networking layers.
Bryce Canyon
Bryce Canyon is a storage chassis introduced at the 2017 OCP Summit, intended mainly for high-density storage, including video and photos. The platform offers a four-fold increase in compute capability, along with greater efficiency and improved performance, over its predecessor, Open Vault (Knox). It was built to support more powerful processors and additional memory, addressing both current storage needs and future growth: its modular design lets users add new CPU modules to increase performance as new technologies arrive.
The Bryce Canyon storage system can be configured in multiple ways; its flexibility helps streamline data center operations since it shrinks the number of total storage platform configurations needed.
Its design is tool-less: every major field-replaceable unit (FRU) in the system can be swapped without using any kind of tool. The drive retention system also does away with carriers; instead, a latch mechanism retains the bare drive.
Big Basin v.2
The Big Basin AI server was introduced in 2017 as the successor to Big Sur, Facebook’s first GPU server for training machine learning models. Big Basin v.2, introduced at the 2018 OCP Summit, upgrades to eight of Nvidia’s latest Tesla V100 GPUs, with a Tioga Pass CPU unit deployed as the head node. PCIe bandwidth between the CPUs and GPUs has been doubled. Facebook said in a blog post that it has already seen a 66% increase in single-GPU performance compared to the original Big Basin.
Big Basin allows researchers and engineers to more efficiently build and train larger machine learning models, which Facebook uses in its predictive technology to guess what a particular user will want to see. Machine learning is used for many familiar aspects of the social media platform, including ranking the news feed, personalizing ads, speech recognition and suggesting the right tags for friends in uploaded images.
“We believe that collaborating in the open helps foster innovation for future designs and will allow us to build more complex AI systems that will ultimately power more immersive Facebook experiences,” Facebook said.
Tioga Pass
Tioga Pass was introduced at the 2017 OCP Summit as part of an “end-to-end refresh of our server hardware fleet”. It is the successor to Leopard, which is deployed for a range of compute services at Facebook. Tioga Pass brings multiple updates over Leopard, including a dual-socket motherboard that keeps the same 6.5” by 20” form factor but supports both single-sided and double-sided designs, maximising the memory configuration.
Tioga Pass also upgrades the PCIe slot from x24 to x32, allowing two x16 slots, or one x16 slot and two x8 slots, making the server more flexible as the head node for both the Big Basin JBOG and the Lightning JBOF. Available PCIe bandwidth is thereby doubled when accessing either GPUs or flash. A 100G network interface controller (NIC) has also been added, enabling higher-bandwidth access to flash storage when Tioga Pass is used as a head node for Lightning.
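To put those lane counts in perspective, here is a rough back-of-the-envelope calculation of usable bandwidth, assuming PCIe Gen3 signaling (8 GT/s per lane with 128b/130b encoding); the Gen3 assumption is ours, not stated in the source:

```python
# PCIe Gen3: 8 GT/s per lane, 128b/130b encoding.
GT_PER_SEC = 8
ENCODING = 128 / 130  # useful bits per transferred bit

def pcie_gb_per_sec(lanes: int) -> float:
    """Approximate one-directional PCIe Gen3 bandwidth in GB/s."""
    return lanes * GT_PER_SEC * ENCODING / 8  # 8 bits per byte

# Moving from one x16 link to two roughly doubles the head node's
# bandwidth to downstream GPUs (JBOG) or flash (JBOF).
print(f"x16: {pcie_gb_per_sec(16):.1f} GB/s")  # ~15.8 GB/s
print(f"x32: {pcie_gb_per_sec(32):.1f} GB/s")  # ~31.5 GB/s
```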
Yosemite and Twin Lakes
Yosemite v2 is an update of Yosemite, Facebook’s first-generation multi-node compute platform, which holds four one-socket (1S) server cards and offers flexibility and power efficiency for high-density, scale-out data centers. Yosemite v2 uses a new four-OU vCubby chassis design but remains compatible with Open Rack v2. Each cubby supports four 1S server cards, or two servers plus two device cards, and each server can connect to a 50G or 100G multi-host NIC.
The new power design supports hot service: servers no longer need to be powered down when the sled is pulled out of the chassis for components to be serviced.
The Yosemite v2 chassis supports the Mono Lake and Twin Lakes 1S servers. Additionally, it supports device cards such as the Glacier Point SSD carrier card and the Crane Flat device carrier card, letting Facebook augment flash storage in addition to supporting standard PCIe add-in cards. Like the previous generation, Yosemite v2 supports OpenBMC.
Next-Gen Open Switches: Backpack and the Wedge 100
Facebook introduced Backpack in late 2016, the latest version of its second-generation network switch, directly challenging equipment from market leader Cisco. The crucial difference is speed: Backpack is a 100G optical switch that uses fiber optics (light) to move data instead of traditional copper wire.
Backpack is a companion piece to the Wedge 100, a second-generation top-of-rack network switch. The Wedge 100 is used across Facebook’s production environments and is deployed at scale across its data centers. Multiple companies are building their solutions on top of the Wedge 100 platform, including Big Switch Networks and Canonical on the operating system layer; and SnapRoute, FRINX and Apstra on the upper parts of the stack.
The Wedge 100 is available as open-source via OCP, as is the software that runs the switch. Facebook even went the extra mile to arrange for its contract manufacturer, Accton, to mass produce the Wedge 100 devices so that anyone can purchase them.
Back in 2016, Business Insider reported that network industry insiders called Facebook’s 100G work “mind blowing” and said it was bringing costs down significantly. The switch uses less power, generates less heat, and can operate at around 55 degrees Celsius, something no previous switch could do. It all serves the company’s goal of letting us hang out in virtual reality in the future, in addition to live-streaming more video.
“We are now creating more immersive, social, interactive 360 video sorts of experiences and that demands a much more scalable and efficient and quick network,” says Yuval Bachar, a former Facebook OCP engineer.
In our next post, we’ll look at Facebook’s threat to the legacy telecom industry, where its Telecom Infra Project poses a similar kind of challenge. Now that Facebook’s sights are set on the $500 billion telecom equipment market, will it have the same kind of disruptive impact it has had (and continues to have) on the data center equipment industry?