The main product of cybersecurity company Distil Networks is bot detection and mitigation. The San Francisco-headquartered company uses a range of mechanisms to detect bad bots – from heuristics deployed by its security team to automated systems that track global traffic for suspicious data center activity. Its primary means of blocking bots is to identify incoming connections and then observe those IDs as the bot continues to browse.
2014 was the first year that bots were said to have outnumbered actual people on the Internet. Malicious behavior that can arise from bots includes web scraping, fraud, security breaches, downtime and spam.
Distil Networks recently undertook various improvements to its platform architecture, including a major engineering effort to rebuild the infrastructure of its legacy Machine Learning (ML) feature; in doing so, Distil has significantly enhanced the speed and capabilities of its platform architecture and established itself as an industry leader in bot detection.
The New ML System: Critical Requirements
The engineering team focused on rebuilding its legacy system with the following requirements:
- Reduced latency “to better react to user and bot behaviors”;
- Making it scalable – allowing Distil’s Data Science Group to focus more on research and less on maintaining, modifying and supporting the system;
- Creating well-defined, yet flexible interfaces between the Data Science parts of the system and the non-Data Science parts, thereby minimizing the engineering demands on the Data Science team and letting the system do more of the daily heavy lifting.
Three Primary Technologies
To satisfy these requirements, Distil focused on upgrading the ML infrastructure from an AWS EMR-based monolith by adding three primary technologies to its existing stack:
- Protocol Buffers – for enforcing datatypes and message structure
- Kafka – for distributed message delivery
- Apache Storm – for distributed and streaming compute
Distil also continues to use Redis for classification features, and has expanded its use to ML.
Protocol Buffers
Protocol buffers mean less work for the engineering team in terms of enforcing datatypes and message field structures at each junction of the CDN process: they allow for easier movement of logs from the network edge nodes (where customer traffic is dealt with) to the Distil data center and onwards into the data warehouse and data science processing infrastructure.
A key benefit of protocol buffers for Distil is that they support the same three main languages (Python, Java and Go) that Distil uses for its logging/classification pipeline. Protobufs are also backwards compatible and allow for optional fields, meaning that when new fields are added to the message structure, downstream consumers built against the older structure can still process the messages.
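The compatibility property described above can be illustrated with a minimal sketch. It does not use the protobuf library or wire format; it only models the idea with plain Python dictionaries, and the schema name `LogMessageV1` and all field names are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class LogMessageV1:
    """Hypothetical older schema known to a downstream consumer."""
    ip: str
    path: str
    user_agent: Optional[str] = None  # optional field; may be absent


def decode(record: dict) -> LogMessageV1:
    """Keep the fields this reader knows about and ignore the rest."""
    known = {f.name for f in fields(LogMessageV1)}
    return LogMessageV1(**{k: v for k, v in record.items() if k in known})


# A newer producer has added a "tls_version" field the old reader never saw,
# and has left the optional "user_agent" field unset.
new_record = {"ip": "203.0.113.7", "path": "/login", "tls_version": "1.3"}
msg = decode(new_record)
```

The old reader still gets a usable message: the unknown field is skipped and the missing optional field falls back to its default.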
Kafka
The next main addition to the stack was Kafka, which is now used to move logs or protocol buffer-based messages in a fault-tolerant, recoverable and distributed manner. Distil uses logs for multiple purposes, from R&D by the Data Science team to customer reporting in the Portal. This means that its system needs to allow for multiple writers and multiple readers; thus Kafka has become an interface between the Storm ML processing system (which the Data Science team operates) and the streaming log aggregation services (which the Data Engineering team looks after).
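The multiple-reader requirement can be sketched with an in-memory stand-in for a single topic. This is not a Kafka client, and the class `TopicLog` and the reader names are hypothetical. It only shows the core idea that messages stay in an append-only log while each reader tracks its own offset.

```python
from collections import defaultdict


class TopicLog:
    """In-memory stand-in for one topic partition (not a Kafka client)."""

    def __init__(self):
        self._messages = []               # append-only list of messages
        self._offsets = defaultdict(int)  # reader name -> next offset to read

    def append(self, message):
        """Any writer can append; returns the offset of the new message."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, reader, max_messages=10):
        """Return unread messages for this reader and advance its offset."""
        start = self._offsets[reader]
        batch = self._messages[start:start + max_messages]
        self._offsets[reader] = start + len(batch)
        return batch


log = TopicLog()
log.append({"ip": "198.51.100.4", "path": "/"})
log.append({"ip": "198.51.100.9", "path": "/pricing"})

# Two independent readers each see every message.
reporting_batch = log.poll("portal-reporting")
research_batch = log.poll("data-science")
```

Because reading does not remove messages, a reporting consumer and a research consumer can work at different speeds without affecting each other.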
Apache Storm
Apache Storm was the third and final key addition to Distil’s stack, used mainly through the Streamparse project from Parse.ly. Storm is at the heart of the Distil Machine Learning system, as it enables the Data Science team to consume log messages and aggregations provided by Data Engineering to construct a distributed graph of actions. The DS team reads aggregated features about Distil’s users and passes them on to Storm bolts for scoring and classification. The resulting scores are written to a new Kafka topic and then sent on to the edge nodes for action.
Distil data scientist William Cox describes this process as “an idempotent scoring mechanism provided by Data Science: aggregated information enters the Storm cluster, and threat scores leave the cluster”. He explains the benefits of this process: “Having defined interfaces at the entrance and exit of the cluster frees Data Science to quickly iterate on new ideas without having to explicitly work with the broader engineering group to create new systems”.
Storm, along with explicit feature engineering by the DE team, has replaced the Python-based orchestration and the Impala-based feature aggregation of the legacy system. Distil’s Python-based ML classifiers can be loaded directly into the Storm bolts, and users can be scored in a similar way to when the team is carrying out R&D. Datadog is also used for recording application metrics and monitoring various algorithm performance metrics. Storm offers three critical benefits:
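A scoring step with this "features in, scores out" shape might look like the following sketch. It uses neither Storm nor Streamparse; the feature names, weights and logistic model are all hypothetical stand-ins for whatever classifier the team actually loads. The point is only that the step is a pure function, so scoring the same message twice gives the same result.

```python
import math

# Hypothetical weights; a real model would be trained offline.
WEIGHTS = {"requests_per_minute": 0.04, "distinct_ips": 0.6, "has_cookies": -1.5}
BIAS = -3.0


def score(features: dict) -> float:
    """Return a threat score in (0, 1) using a logistic function."""
    z = BIAS + sum(WEIGHTS[name] * features.get(name, 0.0) for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def process(message: dict) -> dict:
    """Bolt-style step: aggregated features in, threat score out."""
    return {
        "user_id": message["user_id"],
        "threat_score": score(message["features"]),
    }


incoming = {
    "user_id": "u-123",
    "features": {"requests_per_minute": 120, "distinct_ips": 5, "has_cookies": 0},
}
first = process(incoming)
second = process(incoming)  # reprocessing yields an identical result
```

Keeping the step free of side effects means a replayed or duplicated message cannot change the outcome, which is what makes a fixed entrance and exit interface safe to iterate behind.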
- Simple scaling
- Multi-language support
- Generic distributed streaming compute
Redis, Java and Websockets
Redis and Java are used by the DE team for real-time state storage and computation, meaning that Distil’s high-rate streaming weblogs can be turned into a streaming list of users and the number of IPs they’ve used in the last four hours. This allows the team to monitor all its data centers in “as close to real time as possible”.
WebSockets let Distil send and receive live data between a browser and web server over just one connection, thus reducing the need for long polling with AJAX.
The DE team replaced MySQL with Redis in order to keep the role of the workers as simple as possible. Now “jobs are small and very specific” and it is possible to independently scale out workers as more data centers and servers are added to Distil’s network. This has led to significant performance improvements; so much so that the DE team had to artificially slow down the rate of publishing of data center performance metrics to only once every second.
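The "number of IPs per user in the last four hours" aggregation can be sketched as a sliding window. This version keeps state in ordinary Python structures rather than Redis, the class `IpWindow` is hypothetical, and it assumes events for each user arrive in timestamp order.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 4 * 60 * 60  # the four-hour window mentioned above


class IpWindow:
    """Track distinct IPs per user over a sliding time window."""

    def __init__(self, window_seconds=WINDOW_SECONDS):
        self.window = window_seconds
        self._events = defaultdict(deque)  # user_id -> deque of (timestamp, ip)

    def record(self, user_id, ip, timestamp):
        """Store one request; timestamps are assumed to be non-decreasing."""
        self._events[user_id].append((timestamp, ip))

    def distinct_ips(self, user_id, now):
        """Drop events older than the window, then count unique IPs."""
        events = self._events[user_id]
        while events and events[0][0] <= now - self.window:
            events.popleft()
        return len({ip for _, ip in events})


tracker = IpWindow()
tracker.record("u-123", "203.0.113.7", timestamp=0)
tracker.record("u-123", "203.0.113.8", timestamp=3600)
tracker.record("u-123", "203.0.113.7", timestamp=7200)

count_now = tracker.distinct_ips("u-123", now=7200)
count_later = tracker.distinct_ips("u-123", now=5 * 3600)
```

A user cycling through many addresses in a short period produces a high count, which is the kind of aggregated feature a downstream classifier could consume.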
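Limiting publication to once per second can be done with a simple throttle. The sketch below is one possible approach rather than Distil's implementation, and the class `Throttle` is hypothetical. Time is passed in explicitly so the behavior is easy to follow.

```python
class Throttle:
    """Allow an action at most once per `min_interval` seconds."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last = None  # time of the last allowed action

    def allow(self, now):
        """Return True if enough time has passed since the last allowed action."""
        if self._last is None or now - self._last >= self.min_interval:
            self._last = now
            return True
        return False


throttle = Throttle(min_interval=1.0)
arrivals = [0.0, 0.2, 0.9, 1.0, 1.5, 2.1]  # times at which metrics are ready
published = [t for t in arrivals if throttle.allow(t)]
```

Metrics that arrive between allowed publications are simply skipped here; a production worker might instead keep the latest value and publish it at the next allowed tick.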
Customer Benefits
- Decisions about users can be made significantly more quickly using the new ML system;
- The system’s stability and uptime have increased through the use of scalable systems built to grow and expand with data volume;
- The Data Science team can more rapidly try out and develop new ideas and algorithms: the time it takes to move from research to development to production is significantly pared down, allowing the team to be more flexible in its overall approach.
Other Key Functions in the Distil Platform
Trap Analysis and Statistics Report
The Trap Analysis and Statistics report is a new overview-style report that displays a list of triggered bot traps and the number of violations for each. Whenever a malicious bot tries to access a site that Distil Networks protects, a trap is triggered. The bot is then blocked, monitored or asked to complete a CAPTCHA page to prove it is human and not automated. The report captures the IP address for every violation, giving website owners more granular control than before over how their content is accessed and by whom.
If an IP address shows a large number of violations, the report lets you look up its WHOIS information for clearer insight into the origin of the bot. Each IP address can be blacklisted or whitelisted individually. All requests from a blacklisted IP address are blocked until it is removed from the blacklist. Whitelisted IP addresses can’t be blocked, which is particularly useful for automated internal tools, whose traffic can easily be mistaken for that of a bad bot.
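The list rules above can be expressed as a small decision function. This is a sketch under stated assumptions: the names `AccessLists` and `Decision` are hypothetical, and giving the whitelist precedence over the blacklist is an inference from the statement that whitelisted addresses can’t be blocked.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"        # whitelisted: never blocked
    BLOCK = "block"        # blacklisted: blocked until removed
    EVALUATE = "evaluate"  # on neither list: normal bot detection applies


class AccessLists:
    """Per-IP whitelist and blacklist with whitelist precedence (assumed)."""

    def __init__(self):
        self.whitelist = set()
        self.blacklist = set()

    def decide(self, ip):
        if ip in self.whitelist:
            return Decision.ALLOW
        if ip in self.blacklist:
            return Decision.BLOCK
        return Decision.EVALUATE


lists = AccessLists()
lists.whitelist.add("192.0.2.10")   # e.g. an internal monitoring tool
lists.blacklist.add("203.0.113.50")

internal = lists.decide("192.0.2.10")
offender = lists.decide("203.0.113.50")
unknown = lists.decide("198.51.100.1")

# Removing an address from the blacklist returns it to normal evaluation.
lists.blacklist.discard("203.0.113.50")
after_removal = lists.decide("203.0.113.50")
```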
Go (Golang) Expands the Functionality of Distil’s Data Platform
Another recent change to Distil’s data platform has been the adoption of Go (Golang). Data architect Kyle Bush outlined the process of setting up his Go development environment in a blog post. Bush said he offered his profile settings and process to readers in the hope that “this information will save you a couple of hours!”
Financing & Acquisitions
Part of the reason Distil has been able to rebuild and update its core features for bot detection and mitigation in recent times is its success in raising funds and acquiring other cybersecurity companies within the increasingly consolidated sector. Between its founding in 2011 and 2016, the company raised around $44 million.
In May 2014, the company announced it had raised $10 million in series A financing from the Foundry Group. Distil CEO Rami Essaid said Foundry had backed his company after a series of no’s because it shared his vision for Distil. “Some of the VCs that we talked to were focused on how we make more money doing the same thing,” he said. “Foundry looked beyond what we’re doing today and asked, how do we become a bigger, more holistic security company?”
In early 2016, Distil acquired ScrapeSentry in a stock and cash deal worth around $10 million. In doing so, Distil gained a different angle on its automated, intelligent bot detection trademark by acquiring a company that specializes in human touch and expertise. “Enterprise customers want an analyst they can work with to add a human element to everything we do. It just made sense to combine forces,” Essaid said.
Following its acquisition of the anti-scraping solutions company, Distil announced it had raised $21 million in Series C funding, spearheaded by Silicon Valley Bank.
Distil’s continued success in raising finance has allowed it to keep bolstering its data platform and to establish itself as a leader in bot detection.