FluxNinja announces Aperture, bringing reliability to your web-scale apps with flow control
Today, FluxNinja is emerging from stealth mode to announce Aperture - the first open source flow control and reliability management platform for modern web applications.
Reliability as a competitive advantage
Over the last decade, cloud computing platforms have enabled online businesses to reach massive scale and empowered physical enterprises to bring their business online. But keeping these applications reliable is more challenging than ever. A sudden spike in traffic for an e-commerce giant on Black Friday can trigger customer-facing blank screens and crashing apps. Outages take a heavy toll: lost customer trust, missed revenue targets, and stress for internal DevOps and site reliability engineering (SRE) teams.
Companies such as LinkedIn (Hodor), Google (Handling Overload), Netflix (Prioritized Load Shedding), and Stripe (API Scaling) have made application reliability a competitive advantage with their cutting-edge flow control technologies. Fundamentally, flow control enables graceful degradation - the ability to preserve key user experience pathways, even in the face of application failures.
Graceful degradation with flow control
Modern web-scale apps are a complex network of interconnected microservices that implement features such as account management, search, payments, and more. This decoupled architecture has advantages for rapid feature development, but introduces complex new failure modes. When traffic surges, queues can build up on critical services, which kick-starts a self-reinforcing feedback loop: slower responses lead to more queued requests, and the overload spreads as cascading failures. The application stops serving responses in a timely manner, interrupting critical end-user transactions.
Little's Law governs the relationship between concurrent requests, arrival rate of requests, and response times in applications: the average number of requests in the system equals the average arrival rate multiplied by the average response time. To keep the application stable, one must throttle the concurrent requests in the system. Indirect techniques, such as rate-limiting and auto-scaling, fall short in enabling good user experiences or business outcomes. Rate-limiting individual users does not adequately protect services, while auto-scaling is slow to respond and can be cost-prohibitive. As the number of services grows, deploying these techniques becomes increasingly difficult.
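To make the relationship concrete, here is a small worked example of Little's Law in Python. The traffic numbers and function names are illustrative assumptions, not measurements from any real system.

```python
def concurrency(arrival_rate_rps: float, avg_latency_s: float) -> float:
    """Little's Law: requests in the system = arrival rate * average latency."""
    return arrival_rate_rps * avg_latency_s


def max_arrival_rate(concurrency_limit: float, avg_latency_s: float) -> float:
    """Arrival rate a service can admit while staying under a concurrency limit."""
    return concurrency_limit / avg_latency_s


# Healthy service (assumed numbers): 400 requests/s at 125 ms average latency.
healthy = concurrency(400, 0.125)

# Same arrival rate, but latency has degraded to 1 s: the number of
# in-flight requests grows eightfold even though traffic did not change.
degraded = concurrency(400, 1.0)

# To hold concurrency at the healthy level while latency is 1 s,
# only this many requests per second can be admitted.
admissible = max_arrival_rate(healthy, 1.0)

print(healthy, degraded, admissible)
```

The point of the sketch is that when latency rises, the only ways to keep concurrency bounded are to admit fewer requests or to bring latency back down.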
This is where flow control comes in. Applications that use flow control techniques with Aperture can degrade gracefully in real time by prioritizing high-importance features over others.
The flow control technologies used by teams at LinkedIn, Google, Netflix, Stripe, and others have been years in development. But most companies don’t have the luxury of building these in-house. This is why we are excited to release Aperture as an open source project - just as Kubernetes democratized deploying cloud infrastructure, we hope to democratize building reliable applications with effective flow control.
How Aperture works
At the fundamental level, Aperture enables flow control through observing, analyzing, and actuating, facilitated by agents and a controller.
Aperture Agents live next to your service instances as a sidecar and provide powerful flow control components such as a weighted fair queuing scheduler for prioritized load-shedding and a distributed rate-limiter for abuse prevention. A flow is the fundamental unit of work from the perspective of an Aperture Agent. It could be an API call, a feature, or even a database query.
Graceful degradation of services is achieved by prioritizing critical application features over background workloads. Just as business class passengers board an aircraft ahead of other passengers, every application has workloads with varying priorities. A video streaming service might view a request to play a movie by a customer as a higher priority than running an internal machine learning workload. A SaaS product might prioritize features used by paid users over those being used by free users. Aperture Agents schedule workloads based on their priorities, helping maximize user experience or revenue even during overload scenarios.
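The idea of weighted fair queuing with load shedding can be sketched in a few lines of Python. This is a simplified illustration, not Aperture's scheduler: the class name, workload names, and weights are assumptions, and the virtual-time bookkeeping is reduced to the bare minimum.

```python
import heapq
import itertools


class WeightedFairQueue:
    """Simplified weighted fair queue: higher weight means earlier service."""

    def __init__(self, weights):
        self.weights = weights                      # workload name -> weight
        self.virtual_time = 0.0
        self.last_finish = {name: 0.0 for name in weights}
        self.heap = []                              # (finish, seq, workload, request)
        self.seq = itertools.count()                # tie-breaker for equal finish times

    def enqueue(self, workload, request, cost=1.0):
        # Each request gets a virtual finish time; heavier workloads
        # advance more slowly, so their requests sort earlier.
        start = max(self.virtual_time, self.last_finish[workload])
        finish = start + cost / self.weights[workload]
        self.last_finish[workload] = finish
        heapq.heappush(self.heap, (finish, next(self.seq), workload, request))

    def dequeue(self):
        if not self.heap:
            return None
        finish, _, workload, request = heapq.heappop(self.heap)
        self.virtual_time = finish
        return workload, request


# Assumed scenario: 8 queued requests, but capacity to serve only 5.
wfq = WeightedFairQueue({"checkout": 4.0, "background": 1.0})
for i in range(4):
    wfq.enqueue("background", f"bg-{i}")
    wfq.enqueue("checkout", f"co-{i}")

served = [wfq.dequeue() for _ in range(5)]
shed = len(wfq.heap)  # requests left in the queue would time out or be dropped
print(served, shed)
```

With these weights, the five available slots go mostly to checkout traffic, and the requests left unserved all belong to the background workload.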
Aperture Agents monitor golden signals using a built-in telemetry system and a programmable, high-fidelity flow classifier used to label requests based on attributes such as customer tier or request type. These metrics are analyzed by the controller.
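As a rough illustration of what labeling a request by its attributes could look like, here is a minimal sketch in Python. The function, the `x-customer-tier` header, and the path prefixes are hypothetical and do not reflect Aperture's actual classifier configuration.

```python
def classify(request: dict) -> dict:
    """Derive flow labels from request attributes (hypothetical rules)."""
    headers = request.get("headers", {})
    path = request.get("path", "")

    # Assumed convention: the customer tier arrives in a request header.
    labels = {"customer_tier": headers.get("x-customer-tier", "free")}

    # Assumed convention: the request type is inferred from the URL path.
    if path.startswith("/checkout"):
        labels["request_type"] = "checkout"
    elif path.startswith("/search"):
        labels["request_type"] = "search"
    else:
        labels["request_type"] = "other"
    return labels


print(classify({"path": "/checkout/pay", "headers": {"x-customer-tier": "paid"}}))
print(classify({"path": "/"}))
```

Labels like these can then be attached to metrics, so that latency and error rates are tracked per customer tier or request type rather than only per service.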
The controller is powered by always-on, data flow driven policies that continuously track deviations from service-level objectives (SLOs) and calculate recovery or escalation actions. The policies running in the controller are expressed as circuits, much like circuit networks in the game Factorio.
For example, a gradient control circuit component can be used to implement AIMD (Additive Increase, Multiplicative Decrease) style counter-measures that limit the concurrency on a service when response times deteriorate. Advanced control components like PID can be used to further tune the concurrency limits.
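The AIMD behavior described above can be sketched as a single update step. This is plain AIMD written for illustration, not Aperture's gradient controller; the function name, step sizes, and latency target are assumptions.

```python
def aimd_step(limit, latency_ms, target_ms,
              increase=1.0, decrease=0.5,
              min_limit=1.0, max_limit=1000.0):
    """Return the next concurrency limit given the latest observed latency."""
    if latency_ms > target_ms:
        new_limit = limit * decrease      # multiplicative decrease on overload
    else:
        new_limit = limit + increase      # additive increase while healthy
    return max(min_limit, min(max_limit, new_limit))


# Assumed latency samples (ms) against a 100 ms target.
limit = 100.0
history = []
for latency in [80, 90, 250, 300, 90]:
    limit = aimd_step(limit, latency, target_ms=100)
    history.append(limit)

print(history)  # [101.0, 102.0, 51.0, 25.5, 26.5]
```

The limit creeps up while latency is within target, is halved each time latency exceeds it, and then starts recovering slowly once latency returns to normal.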
Aperture’s Controller is comparable in capabilities to the autopilot in a plane or adaptive cruise control in a Tesla.
Aperture can be inserted into service instances with either Service Meshes or SDKs:
- Service Mesh: Aperture can be deployed with no changes to application code, using Envoy. It latches onto Envoy’s External Authorization API for control purposes and collects access logs for telemetry purposes. On each request, Envoy sends request attributes to the Aperture Agent for a flow control decision. Inside the Aperture Agent, the request traverses classifiers, rate-limiters, and schedulers, before the decision to accept or drop the request is sent back to Envoy. Aperture participates in the OpenTelemetry tracing protocol as it inserts flow classification labels into requests, enabling visualization in tracing tools such as Jaeger.
- Aperture SDKs: In addition to service mesh insertion, Aperture provides SDKs that can be used by developers to achieve fine-grained flow control at the feature level inside service code. For example, an e-commerce app might want to prioritize users in the checkout flow over new sessions when the application is experiencing an overload. The Aperture Controller can be programmed to degrade features as an escalated recovery action when basic load shedding is triggered for several minutes.
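The escalation described in the SDK example, degrading optional features once load shedding has persisted for several minutes, might look roughly like the following. This is a hypothetical sketch, not the Aperture SDK or policy language: the class, its methods, the feature names, and the three-minute threshold are all assumptions.

```python
class DegradationPolicy:
    """Disable optional features once load shedding has lasted long enough."""

    def __init__(self, escalate_after_s, optional_features):
        self.escalate_after_s = escalate_after_s
        self.optional_features = set(optional_features)
        self.shedding_since = None        # time at which shedding started

    def observe(self, load_shedding_active, now):
        # Remember when the current load-shedding episode began.
        if load_shedding_active:
            if self.shedding_since is None:
                self.shedding_since = now
        else:
            self.shedding_since = None

    def feature_enabled(self, feature, now):
        if feature not in self.optional_features:
            return True                   # critical features always stay on
        if self.shedding_since is None:
            return True                   # no overload in progress
        return (now - self.shedding_since) < self.escalate_after_s


# Assumed setup: recommendations are optional, checkout is critical.
policy = DegradationPolicy(escalate_after_s=180,
                           optional_features={"recommendations"})
policy.observe(load_shedding_active=True, now=0)

print(policy.feature_enabled("recommendations", now=60))   # still within grace period
print(policy.feature_enabled("recommendations", now=200))  # escalated: feature degraded
print(policy.feature_enabled("checkout", now=200))         # critical path unaffected
```

Service code would consult a check like `feature_enabled` before doing optional work and fall back to a cheaper response when it returns false.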
Bringing it all together
Our team at FluxNinja is no stranger to the plight of DevOps and SRE teams and the operational challenges they face. We previously built Netsil (acquired by Nutanix) which pioneered a network-centric approach to microservices monitoring. Aperture results from technical insights and customer perspectives gathered over many years of operating directly in the field with large-scale web applications.
Reliability can be a significant competitive advantage and at FluxNinja we believe that the path to reliability at web-scale begins with implementing effective flow control.