In modern engineering organizations, service owners don't just build services; they are also responsible for their uptime and performance. Ideally, each new feature is thoroughly tested in development and staging environments before going live. Load tests that simulate users and traffic patterns are performed to baseline the capacity of the stack. For significant events like product launches, demand is forecasted and resources are allocated to handle it. However, the real world is unpredictable. Despite the best-laid plans, here is a brief glimpse of what can still go wrong (and often does):
- Traffic surges: Virality (Slashdot effect) or sales promotions can trigger sudden and intense traffic spikes, overloading the infrastructure.
- Heavy-hitters and scrapers: Some outlier users can hog a significant portion of a service's capacity, starving regular user requests.
- Unexpected API usage: APIs are occasionally used in ways that weren't anticipated. Such unexpected usage can uncover bugs in end-client code and expose the system to vulnerabilities such as application-level DDoS attacks.
- Expensive queries: Certain queries can be resource-intensive due to their complexity or lack of optimization. Such queries can trigger unexpected edge cases that degrade system performance and push the system toward its vertical scaling limits.
- Infrastructure changes: Routine updates, especially to databases, can sometimes lead to unexpected outcomes, like a reduction in database capacity, creating bottlenecks.
- External API quotas: A backend service might rely on external APIs or third-party services that impose usage quotas. End users are impacted when these quotas are exceeded.
Such failures make headlines all too often, leading to lost revenue and poor user experience. A mismatch between service capacity and demand is at the root of most performance issues; to mitigate it, the service somehow has to do less work. Rate limiting is an effective technique for managing load on a service. However, it is often misunderstood and misapplied. Teams might apply a blanket per-user rate limit in the hope of protecting against service overloads. While per-user rate limits provide a mechanism to prevent abuse, they do not safeguard against service overloads. In the following sections, we demystify rate limiting and provide a fresh framework for reasoning about it.
A framework for rate limiting
Rate limiting is more than just setting caps on user requests. It's a strategic approach to prevent abuse, ensure fairness, and avert overloads, especially at the service level.
Below is a 2×2 framework that presents a concise overview of four distinct rate-limiting strategies. A service operator can implement a combination of these strategies based on their service's requirements:
| | Static | Adaptive |
|---|---|---|
| Per-user limit | Fair access | Abuse prevention |
| Global service limit | Enforcing quotas | Service protection |
Now, let's delve deeper into the significance and use-cases for each strategy.
Per-user limits
Per-user rate limiting is a technique that restricts the number of requests sent by an end-user or device within a time period. These limits help curb abuse and ensure fair access across users. They also act as a security measure to prevent unchecked API access. Typically, a 429 Too Many Requests HTTP response is returned when a user exceeds their rate limit.
Such limits are implemented by tracking request counts at the user level and applying algorithms such as the following (a token-bucket sketch follows the list):
- Token Bucket: Requests consume tokens from the bucket, which refills at a consistent rate. If the bucket runs out of tokens, the request is rejected. This method can accommodate brief surges in request rates that exceed the bucket's refill rate.
- Leaky Bucket: Incoming requests are queued in the bucket. Tokens are steadily drained (or "leaked") at a fixed rate as the server processes requests. If the bucket reaches capacity, new requests are rejected. Unlike the token bucket, this approach ensures the request rate never surpasses the leak rate, preventing sudden bursts.
- Fixed Window: Limits the total requests within specific time intervals.
- Sliding Window: Allows a certain number of requests over a continuously shifting time frame, providing more fluid control compared to the fixed window technique.
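To make the token bucket concrete, here is a minimal, in-memory sketch in Python. The class, parameter names, and the 100-request / 10-per-second numbers are purely illustrative, not taken from any particular library:

```python
import time
from collections import defaultdict


class TokenBucket:
    """In-memory token bucket: holds up to `capacity` tokens, refilled at `refill_rate` per second."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to the elapsed time, capped at capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # Caller should respond with HTTP 429.


# One bucket per user: bursts of up to 100 requests, refilled at 10 requests/second.
buckets = defaultdict(lambda: TokenBucket(capacity=100, refill_rate=10))


def handle_request(user_id: str) -> int:
    return 200 if buckets[user_id].allow() else 429
```

Because the bucket starts full, a quiet user can burst up to its capacity before being throttled to the steady refill rate, which is the behavior described above.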
These algorithms can be implemented either locally (for example, on an API gateway) or globally, using a service like Redis to maintain per-user state. Local implementations offer lower latency, but they don't scale for larger applications because they require all traffic to pass through a single choke point. Global implementations avoid that choke point, yet they are hard to get right because the underlying store, such as Redis, itself becomes a bottleneck at high traffic rates. Sophisticated approaches to global rate limiting typically distribute this state across multiple instances by sharding the per-user rate-limiting keys.
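As a rough sketch of the sharded global approach, assuming a redis-py client and a fixed-window counter (the shard hostnames, window size, and limit below are placeholders):

```python
import hashlib
import time

import redis  # pip install redis

# Illustrative shard list; in practice these would be your Redis endpoints.
SHARDS = [redis.Redis(host=h, port=6379) for h in ("redis-0", "redis-1", "redis-2")]

WINDOW_SECONDS = 60
LIMIT_PER_WINDOW = 600


def _shard_for(user_id: str) -> redis.Redis:
    # Stable hash so a given user's counters always land on the same shard.
    digest = hashlib.sha1(user_id.encode()).digest()
    return SHARDS[int.from_bytes(digest[:4], "big") % len(SHARDS)]


def allow(user_id: str) -> bool:
    window = int(time.time() // WINDOW_SECONDS)
    key = f"ratelimit:{user_id}:{window}"
    pipe = _shard_for(user_id).pipeline()
    pipe.incr(key)
    pipe.expire(key, WINDOW_SECONDS * 2)  # Let stale window keys expire on their own.
    count, _ = pipe.execute()
    return count <= LIMIT_PER_WINDOW
```

Sharding by user key spreads the counter load across Redis instances while keeping each user's count consistent, since all of that user's requests hash to the same shard.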
Per-user static limits
Static rate limiting is like the speed limit on a highway – a set pace that everyone has to follow.
Most public APIs such as GitHub, Twitter, and OpenAI implement static rate limits. These limits are well-known and publicly shared by the API providers, setting the basic expectations for the end-user in terms of fair use of the service.
The use-cases of per-user static limits include:
- Fair access: A single bad actor can degrade performance for all users by overusing resources if no limit is in place. A static limit ensures fair usage per-user. For example, Docker introduced rate limits to keep serving millions of users while curbing abusive pull rates from anonymous users. According to a 2020 article, approximately 30% of all downloads on Docker Hub were attributed to just 1% of their anonymous users.
- Throttling data scrapers: With the rapid adoption of generative AI, training data is increasingly becoming the most sought-after commodity. However, some aggressive scrapers from AI companies can impact normal users. Rate limiting can be applied to keep them in check.
- Blocking misbehaving scripts: User scripts might inadvertently send a flood of requests. In such cases, per-user static limits serve as a safeguard against such abuse.
Per-user adaptive limits
While static limits provide a basic form of fair access and security, there are scenarios where adaptive rate limiting is beneficial, either in response to specific user behaviors or under particular circumstances for certain user tiers.
Use-cases of adaptive rate limits per-user include:
- Reputation-based abuse prevention: Adjusting the rate limit according to a user's reputation can be effective. A user frequently surpassing rate limits might face a temporary reduction in their allowable rate as a preventive measure. For example, GitHub implements secondary rate limits for users that perform actions too quickly.
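To illustrate the reputation-based idea (this is a hypothetical sketch, not GitHub's mechanism), a per-user limit can be tightened after repeated violations and restored after a cooldown; the thresholds below are made up for illustration, and the resulting limit would feed into whichever counting algorithm is in use:

```python
import time
from dataclasses import dataclass


@dataclass
class Reputation:
    violations: int = 0
    last_violation: float = 0.0


BASE_LIMIT = 100          # requests per minute for a user in good standing
PENALTY_FACTOR = 0.5      # halve the allowance for each recent violation
COOLDOWN_SECONDS = 3600   # forgive violations after an hour of good behavior

reputations: dict[str, Reputation] = {}


def effective_limit(user_id: str) -> int:
    rep = reputations.setdefault(user_id, Reputation())
    # Forgive old violations so well-behaved users recover their full limit.
    if rep.violations and time.time() - rep.last_violation > COOLDOWN_SECONDS:
        rep.violations = 0
    return max(1, int(BASE_LIMIT * (PENALTY_FACTOR ** rep.violations)))


def record_violation(user_id: str) -> None:
    rep = reputations.setdefault(user_id, Reputation())
    rep.violations += 1
    rep.last_violation = time.time()
```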
Global service limits
These rate limits regulate the overall load on a service to prevent overloads or conform to service quota agreements. They are usually set at the service level and are not tied to a specific user. State-of-the-art techniques can also prioritize requests based on attributes such as service criticality, user-tiering, and so on.
Such limiters are typically implemented either with a token bucket algorithm for the entire service or by probabilistically shedding a fraction of the requests entering the service.
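A sketch of the probabilistic variant: each replica rejects a fraction of incoming requests according to a shed ratio, so no shared state is needed, and higher-priority requests can be exempted. The class name, priority scheme, and 20% ratio are illustrative:

```python
import random


class ProbabilisticShedder:
    """Drops a fraction of requests; shed_ratio = 0.2 rejects roughly 20% of them."""

    def __init__(self, shed_ratio: float = 0.0):
        self.shed_ratio = shed_ratio

    def allow(self, priority: int = 0) -> bool:
        # Higher-priority requests (e.g. checkout vs. browse) bypass shedding.
        if priority > 0:
            return True
        return random.random() >= self.shed_ratio


shedder = ProbabilisticShedder(shed_ratio=0.2)


def handle_request(priority: int = 0) -> int:
    return 200 if shedder.allow(priority) else 503
```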
Static service limits
Static limits are useful when enforcing an agreed-upon quota. For example, Stripe has a fleet usage load shedder that assigns a fixed quota to non-critical requests (80% of provisioned capacity) to ensure that capacity is always available to process critical requests.
Use-cases of static rate limits include:
- Enforcing capacity limits: If a service has undergone testing and is known to support a specific load, this predefined limit can be enforced to keep the load within the operational boundary.
- Inter-service limits: In a microservices architecture, each client service can be allocated a different quota for a shared service based on its criticality.
- Client-side rate limits: When interfacing with a rate-limited external API or shared resource, well-behaved clients should comply with the limit to avoid being penalized for abuse. For example, YouTube has a client-side rate limiter that voluntarily limits the request rate while accessing a shared resource.
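For the client-side case, here is a minimal sketch of a well-behaved client that paces its own calls and backs off when the upstream API responds with 429, honoring the standard Retry-After header. It uses the requests library, and the pacing and retry values are illustrative:

```python
import time

import requests  # pip install requests

MIN_INTERVAL = 0.1   # self-imposed ceiling: at most 10 requests/second
_last_call = 0.0


def call_api(url: str, max_retries: int = 3) -> requests.Response:
    global _last_call
    for attempt in range(max_retries):
        # Voluntarily pace our own requests before hitting the shared resource.
        wait = MIN_INTERVAL - (time.monotonic() - _last_call)
        if wait > 0:
            time.sleep(wait)
        _last_call = time.monotonic()

        resp = requests.get(url)
        if resp.status_code != 429:
            return resp
        # Honor the server's Retry-After hint, falling back to exponential backoff.
        retry_after = float(resp.headers.get("Retry-After", 2 ** attempt))
        time.sleep(retry_after)
    return resp
```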
Adaptive service limits
Adaptive service limits regulate the overall load on a service based on health signals such as database connections, queue sizes, response latency, and error rates to protect the service against overload. For example, Stripe has a worker utilization load shedder that applies saturation-based feedback control to throttle incoming requests.
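To illustrate the feedback-control idea (a sketch, not Stripe's actual implementation), an AIMD-style concurrency limiter can grow its limit while latency stays healthy and cut it multiplicatively when a latency target is breached. Thread-safety and the specific numbers are omitted for brevity:

```python
class AIMDConcurrencyLimiter:
    """Additive-increase / multiplicative-decrease limit driven by observed latency."""

    def __init__(self, target_latency_ms: float = 200.0,
                 min_limit: int = 10, max_limit: int = 1000):
        self.target_latency_ms = target_latency_ms
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.limit = min_limit
        self.in_flight = 0

    def try_acquire(self) -> bool:
        if self.in_flight >= self.limit:
            return False  # Shed or queue the request.
        self.in_flight += 1
        return True

    def release(self, observed_latency_ms: float) -> None:
        self.in_flight -= 1
        if observed_latency_ms > self.target_latency_ms:
            # Overload signal: back off quickly.
            self.limit = max(self.min_limit, int(self.limit * 0.9))
        else:
            # Healthy signal: probe for more capacity slowly.
            self.limit = min(self.max_limit, self.limit + 1)
```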
Use-cases of adaptive service limits include:
- System stability: An overload in one part of the system often snowballs into an application-wide outage. Adaptive service limits act as a protective barrier that stabilizes the application by eliminating the characteristic self-reinforcing feedback loops that lead to cascading failures.
- Adaptive waiting rooms: Certain web services, such as online ticket booking and online shopping, can experience sudden surges in traffic. In such cases, adaptive service limits can throttle requests and redirect them to a waiting room, keeping the service responsive and preventing it from crashing under overload.
- Performance management of heavy APIs: Heavy APIs such as queries to analytical databases and generative AI can pose challenges, especially when they're overwhelmed with requests. This can lead to reduced throughput, noisy neighbor problems in multi-tenant environments, or even cascading failures in some cases. Adaptive service limits can be used to protect such APIs from overload and ensure that they remain responsive.
- Fault tolerance: Even with rigorous testing and adherence to best practices, unforeseen failures can occur in production. For example, failures in a few read replicas can increase load on healthy instances, leading to an outage. In such scenarios, adaptive service limits can help achieve graceful degradation.
Implementing rate limits with FluxNinja Aperture
Implementing rate limits in a distributed application at scale is a challenging task. A glance at discussions from industry giants like GitHub and Figma reveals the intricacies involved.
FluxNinja Aperture is an open source project that democratizes access to these sophisticated techniques. Out of the box, Aperture provides the tools needed to safeguard against system overloads and heavy-hitters. It includes capabilities such as adaptive control, a request scheduler, and a distributed rate limiter to address the breadth of use-cases discussed above. Aperture's declarative policy language enables practitioners to express rate-limiting policies tailored to their application's requirements.
The open source core offering is augmented by the commercial Aperture Cloud SaaS product that offers a managed control plane, advanced traffic analytics, policy management and so on.
Conclusion
In this blog, we have provided a 2×2 framework for reasoning about rate limits. The framework maps different scenarios (service protection, abuse prevention and so on) to appropriate rate-limiting techniques (adaptive or static, global or per-user). We hope this framework helps practitioners apply optimal techniques for their specific use-cases.
In addition, we briefly introduced FluxNinja Aperture, an open source project for load management. For a more in-depth understanding of Aperture, feel free to explore our Documentation and GitHub repository. Additionally, join our vibrant Discord community to discuss best practices, ask questions, and engage in insightful discussions with like-minded individuals.