Expedia uses NGINX Plus to migrate to the cloud at scale and to route traffic through its front door. Over the years, Expedia has configured NGINX based on three pillars of cloud migration. According to Dave Drinkle, Senior Software Engineer at Expedia, the first of those pillars is multi-region resiliency, also known as cross-regional failover, which ensures that if one region goes down, traffic automatically fails over to another.
The second is “avoiding the knife edge”, essentially a philosophy of rolling out changes (e.g., a new app or microservice) gradually and systematically rather than all at once, and ensuring they can be rolled back as quickly as possible if necessary.
The third pillar determines how Expedia sets up its proxy to respond to errors.
Prior to migrating to the cloud, Expedia used a fairly straightforward traffic routing path: from browsers, through a CDN, to its data centers. When it moved to the cloud, Expedia needed to insert NGINX in the middle and route traffic through it.
Basic Configuration
Expedia’s basic configuration comprises two data centers weighted 70/30, with max_fails and fail_timeout set on each. A resolve parameter is also in place, which ensures that the DNS names for the data centers are re-resolved as they change. Expedia has also implemented a basic server configuration, along with a location block that takes all traffic without a more specific route and sends it to the data centers via a proxy_pass line.
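A minimal sketch of what such a configuration might look like; the upstream name, hostnames, and specific values here are illustrative assumptions, not Expedia’s actual settings:

    # Needed for the resolve parameter to re-resolve upstream DNS names
    resolver 10.0.0.2 valid=30s;

    upstream datacenters {
        zone datacenters 64k;    # shared memory zone, required by resolve (NGINX Plus)

        # 70/30 weighting, with passive failure detection
        server dc1.example.com weight=7 max_fails=3 fail_timeout=30s resolve;
        server dc2.example.com weight=3 max_fails=3 fail_timeout=30s resolve;
    }

    server {
        listen 80;

        # Catch-all: traffic without a more specific route goes to the data centers
        location / {
            proxy_pass http://datacenters;
        }
    }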
Multi-Region Resiliency
The merits of multi-region resiliency are that it ensures fault tolerance and consistent responses to customer requests, and reduces latency, by deploying microservices in as many regions as possible.
In the configuration, traffic is routed through the CDN into regional NGINX clusters, where it then gets routed to the appropriate app. If the app in a region fails, the NGINX cluster in that region stops sending traffic to the failed app, and instead routes it to an app in another region.
To enable cross-regional communication, the networking layer must be configured properly. Expedia designates a secondary server as a backup that takes over if the primary server fails, and uses health checks to determine when a server has failed.
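One way to express this in NGINX Plus, sketched with assumed hostnames and health-check settings (the backup parameter marks the cross-region fallback, and health_check provides active health monitoring):

    upstream app_us_west {
        zone app_us_west 64k;

        server app.us-west.example.com;          # primary: the app in this region
        server app.us-east.example.com backup;   # cross-region fallback if the primary fails
    }

    server {
        listen 80;

        location / {
            proxy_pass http://app_us_west;

            # Active health checks (NGINX Plus): mark a server failed after
            # 2 consecutive failures, healthy again after 2 passes
            health_check interval=5s fails=2 passes=2 uri=/health;
        }
    }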
Removing the Knife Edge
Removing the knife edge means moving traffic from one origin to another in a controlled manner. For instance, traffic to a new microservice can be routed in gradual increments, methodically ramping up from 10% to 100%. This is especially important when you’re operating with the level of traffic that Expedia handles.
Expedia has leveraged two NGINX modules, User ID (userid) and Split Clients (split_clients), which work in tandem. The userid cookie is configured to give every client a globally unique ID (GUID), which ensures an even split. The split_clients configuration inspects $cookie_ourbucket and passes that value through the MurmurHash2 hashing algorithm, which maps it to a percentage. Requests that fall within a configured threshold (e.g., 10%) go to the app_upstream group, while the rest go to the data center, allowing fine-grained control over the split.
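A sketch of the two modules working together; the cookie name, percentage, and upstream names are taken from the description above, while the expiry and other parameters are assumptions:

    # ngx_http_userid_module: issue every client a unique ID cookie
    userid         on;
    userid_name    ourbucket;       # matches the $cookie_ourbucket variable
    userid_expires 365d;
    userid_path    /;

    # ngx_http_split_clients_module: MurmurHash2 the cookie value and
    # bucket clients by percentage
    split_clients $cookie_ourbucket $upstream_variant {
        10%     app_upstream;       # ramp value: raise toward 100% over time
        *       datacenters;        # everyone else stays on the data centers
    }

    server {
        listen 80;

        location / {
            # The variable resolves to one of the upstream group names
            # above (both assumed to be defined elsewhere in the config)
            proxy_pass http://$upstream_variant;
        }
    }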
Reacting to Errors
While hard errors (i.e., 400- and 500-level HTTP responses) are fairly clear-cut, soft errors, where the app can’t process a request, require the request to be reprocessed.
Expedia has configured its proxy to intercept hard errors so that it can present a unified error page to customers. Handling hard errors at the proxy level allows errors to be logged properly within the proxy and, because the proxy handles the error page response, it lets app teams include stack traces in their error responses without fear of them reaching the customer. Finally, the configuration frees app developers to work on other things.
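A minimal sketch of intercepting hard errors at the proxy; the error codes, page path, and upstream name are assumptions:

    server {
        listen 80;

        location / {
            proxy_pass http://datacenters;

            # Let NGINX replace upstream 5xx bodies (including any
            # stack traces) with its own error page
            proxy_intercept_errors on;
            error_page 500 502 503 504 /error.html;
        }

        # Unified, branded error page served by the proxy itself
        location = /error.html {
            root /usr/share/nginx/html;
            internal;
        }
    }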
Soft errors are requests that the app can’t handle, and are usually addressed with a 302 or 307 redirect to a new location, or by sending the request back to the same location with a query-string parameter that signals the app to produce a different response.
Alternatively, soft errors can be handled within the proxy. A request comes in and is routed by the NGINX proxy to the app. If the app can’t handle the request, it responds with a special error code, which NGINX catches and re-requests from the data center, whose response is then sent back to the client.
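A sketch of this flow, assuming the app signals a soft error with a 550 status code (the specific code, hostnames, and upstream names are illustrative):

    upstream app_upstream {
        server app.example.com;
    }

    upstream datacenters {
        server dc1.example.com;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://app_upstream;
            proxy_intercept_errors on;

            # The app's "I can't handle this" signal; re-route the request
            error_page 550 = @datacenter;
        }

        # Replay the original request against the data center; the client
        # only ever sees the data center's response
        location @datacenter {
            proxy_pass http://datacenters;
        }
    }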