Facebook’s Analysis of Flash-Based SSD in the Field

Facebook has released an analysis of its use of flash-based solid state drives (SSDs) in the field over the last four years, covering many millions of SSD-days of usage. Thanks to recent advances in flash capacity, SSDs are increasingly used for data storage, and Facebook’s study details some of its findings on the efficiency and capabilities of SSDs.

To determine the usefulness of SSDs and how often they fail, the study analyzed measurements of the amount of data written to and read from the chips, how data was mapped, the amount of data copied, erased, and discarded by the controller, and the flash board temperature and bus power. This is one of the first field studies of SSDs to capture and analyze this information. Facebook’s measurement system was designed to give a “snapshot” of each system’s performance, but it cannot store this data as a time series. As a result, the analysis was performed on this “snapshot” of the SSDs’ behavior and lifetime metrics.

Analysts concluded that SSD failure rates pass through distinct periods that reflect how failures emerge and how they are detected: early detection, early failure, usable life, and wearout. They also found that read disturbance errors are not common in field use, that sparse logical data layout plays a strong role in determining SSD failure rate, that higher temperatures lead to higher failure rates (especially for SSDs that do not use throttling), and that the amount of data written by the operating system does not always accurately indicate the amount of wear on the flash cells. Finally, the data suggested a relationship between discarded blocks and failures: SSDs with more discarded blocks tended to have higher failure rates.

The Tradeoffs of Next-Gen IDS/IPS

The popularity of next-generation intrusion detection and prevention systems (IDS/IPS) has greatly increased in recent years. Now, according to Bromium, all of the major security vendors offer one. With these products, some of the network traffic is routed to VMs for threat detection. However, because there are only a few VMs, not all of the traffic can be routed through them. When the product is operated passively (IDS), threat detection notices are sent to the security team; when it is operated in-line (IPS), it blocks any traffic in which malware is detected. Though the IDS/IPS method of threat detection can be effective, it also has several weaknesses that can make it ineffective.

First, if the user is mobile or off-net, the IDS/IPS will not detect their activity. Second, though this method is excellent at detecting malware from known sources, attacks are often encrypted, and malware is sometimes programmed to be sleepy: it does not activate within the VM but waits until the traffic is passed to the endpoint. Third, any attack that executes in a honeypot VM will be detected, but many of these attacks would not have been an issue for the endpoint to which they were destined, as they would not be able to execute properly there. Finally, using a Windows VM may conflict with Microsoft’s license terms.

Many of these IDS/IPS products either create a flood of threat detections when their thresholds are set too low or fail to detect threats at all when the thresholds are set too high. The detector's receiver operating characteristic (ROC) describes this tradeoff. Ideally, the detector is designed so that non-attack traffic and attacks are clearly distinct, which reduces false positives and false negatives while still detecting as many attacks as possible.
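The threshold tradeoff can be illustrated with a minimal sketch. It assumes a detector that assigns each sample a score and raises an alert when the score is at or above a threshold. The function name roc_points and all of the scores are hypothetical, invented purely for illustration, and are not taken from Bromium or any vendor's product.

```python
def roc_points(benign_scores, attack_scores, thresholds):
    """Return (threshold, false_positive_rate, true_positive_rate) tuples."""
    points = []
    for t in thresholds:
        # A sample raises an alert when its score is at or above the threshold.
        false_positives = sum(s >= t for s in benign_scores)
        true_positives = sum(s >= t for s in attack_scores)
        points.append((
            t,
            false_positives / len(benign_scores),
            true_positives / len(attack_scores),
        ))
    return points


# Hypothetical detector scores (assumption: higher means "more suspicious").
benign = [0.1, 0.2, 0.3, 0.4, 0.6]
attack = [0.5, 0.7, 0.8, 0.9]

for t, fpr, tpr in roc_points(benign, attack, [0.0, 0.45, 0.65, 1.0]):
    print(f"threshold={t:.2f}  false-positive rate={fpr:.2f}  detection rate={tpr:.2f}")
```

With these made-up scores, the lowest threshold alerts on everything and the highest alerts on nothing. The thresholds in between trade missed attacks against false alarms, because the benign and attack scores overlap.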

Though it is impossible to build a detector that never produces false negatives, achieving an appropriate signal-to-noise ratio (SNR) can help the security team. Overall, a network-based detection technology is expensive, does not stop the attacks it detects when run passively, may waste time and resources detecting attacks that are not meaningful for the patch levels on the endpoints, and does not stop the first compromise, which may result in costly attacks on the system.

Demands for Increased Facebook Consistency

With 1.35 billion users globally, Facebook finds changes to its network challenging to implement. In 2015, Facebook released an article outlining the difficulties of building a system that ensures stronger consistency, in the hope that these problems could be solved and a more consistent system put into place. Facebook’s primary concerns with more consistency are integrating the consistency across many stateful services, “tolerat[ing] high query amplification, scal[ing] to handle linchpin objects, and provid[ing] a net benefit to users.”

Facebook’s system is highly disaggregated: it must store data consistently across services, ensure access to decentralized services, and handle results from services that store data internally. Because of this, integrating a more consistent system is highly challenging. With query amplification, slowdown cascades and latency outliers stand in the way of widespread changes to the systems, and when linchpin objects are added to the latency outlier concerns, the system may slow dramatically should the highly queried data be on the slowest path.
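The effect of latency outliers under query amplification can be approximated with a short sketch. It assumes that one user request fans out into many internal sub-queries, that the request is only as fast as its slowest sub-query, and that sub-queries are slow independently of one another. The function name prob_slow_request and the 1% outlier rate are hypothetical and are not figures from Facebook’s article.

```python
def prob_slow_request(num_subqueries, p_slow):
    """Probability that at least one sub-query is slow, assuming independence."""
    return 1.0 - (1.0 - p_slow) ** num_subqueries


# Hypothetical outlier rate: 1% of sub-queries are slow.
for n in (1, 10, 100, 1000):
    print(f"{n:>4} sub-queries -> P(slow request) = {prob_slow_request(n, 0.01):.3f}")
```

Under these assumptions, a rare outlier on a single sub-query becomes a common event for a request that touches hundreds of services, which is why outliers matter more as amplification grows.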

If a more consistent network is implemented, Facebook also foresees several operational challenges in addition to the difficulties mentioned above. Because Facebook is driven by its users, every change must benefit the end user, so the primary concern is the risk of very poor throughput. With its always-on network, Facebook has developed many ways to handle failures and recover quickly. Facebook also has a polyglot environment and varying, aggressive deployment schedules, which decrease software engineering risk.

Overall, though Facebook could theoretically create a more consistent system, the benefits of making these changes have decreased over time. As the total number of users and the complexity of the network increase dramatically, changes to the consistency of the network will result in increased latency and likely a poorer overall experience for end users. Until solutions that increase consistency for a system built on sharding and separation into stateful services are designed, Facebook will not attempt to drastically change this aspect of its network.
