Why Your Service Needs Adaptive Concurrency Limits

Sudhanshu Prajapati
Tanveer Gill

Highly available and reliable services are a hallmark of any thriving business in today’s digital economy. As a service owner, you need to ensure that your services stay within their SLAs. But when bugs make it into production or user traffic surges unexpectedly, services can slow down under a large volume of requests and fail. If not addressed in time, such failures tend to cascade across your infrastructure, sometimes resulting in a complete outage.

At FluxNinja, we believe that adaptive concurrency limits are the most effective way to ensure services are protected and continue to perform within SLAs.

What are Concurrency Limits?

Concurrency Limit

Concurrency is the number of requests a service is handling at any given time. It can be estimated using Little’s Law, which states that in the long-term steady state of a system, the average number of items in the system, L, is the product of the average arrival rate λ and the average time W that an item spends in the system, that is, L = λW. A service's concurrency limit is the largest number of in-flight requests it can handle without degrading. Requests that arrive beyond that limit cannot be served immediately and must be queued or rejected, and the resulting queue build-up slows down response times. Queues do not build up as long as a service stays within its concurrency limit.
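As a worked example of Little's Law, the short sketch below computes the average number of in-flight requests from an arrival rate and an average latency. The function name and the numbers are made up for illustration and are not taken from any real service.

```javascript
// Little's Law: L = λ × W
// arrivalRatePerSec (λ) and avgLatencySec (W) are illustrative inputs, not measurements.
function requiredConcurrency(arrivalRatePerSec, avgLatencySec) {
  return arrivalRatePerSec * avgLatencySec;
}

// 200 requests/s at an average latency of 50 ms keeps about 10 requests in flight.
const steady = requiredConcurrency(200, 0.05);

// If latency doubles to 100 ms at the same arrival rate, in-flight requests double too.
const degraded = requiredConcurrency(200, 0.1);

console.log(steady, degraded);
```

The second call shows why latency matters: at a fixed arrival rate, a slower service needs proportionally more concurrency, so a limit that was adequate yesterday can be exceeded today without any change in traffic.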

Service Topology Example

Concurrency limits are hard to estimate, especially when there are numerous interdependent microservices and fast-moving environments.

  • Updates in microservices: Microservices are updated frequently, and whatever concurrency limit you set initially could be outdated by the next release of your microservice, resulting in a performance bottleneck or service outage. Additionally, feature additions and configuration changes make it hard to keep up with changing concurrency limits.
  • High churn environments: Scale-in and scale-out events change concurrency limits - when services scale out, concurrency limits need to be dynamically adjusted to balance out incoming traffic.

This is why dynamically setting concurrency limits (Adaptive Concurrency Limits) based on overall service health is the best way to protect a service and stay within SLAs.

Difference between Adaptive Concurrency Limits and Rate Limits

At first glance, both concurrency limits and rate limits seem to do the same job. But they serve distinct purposes.

Rate limits are a preventive technique - they prevent misuse of a service by a particular user, making sure the service remains available for other users. But this technique does not help if there is a surge in overall traffic, not attributed to any specific user.

Adaptive concurrency limits, by contrast, are a protective reliability technique. With adaptive concurrency limits, it is possible to detect when the number of requests to a service exceeds its concurrency limit and have reliability interventions kick in.
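To make the distinction concrete, here is a minimal sketch of both mechanisms. The class names, the fixed-window algorithm, and the parameters are assumptions chosen for illustration; they are not how Aperture implements either limiter.

```javascript
// Per-user rate limit: caps how many requests one user may start per time window.
class RateLimiter {
  constructor(maxPerWindow, windowMs) {
    this.maxPerWindow = maxPerWindow;
    this.windowMs = windowMs;
    this.windows = new Map(); // userId -> { start, count }
  }

  allow(userId, nowMs) {
    const w = this.windows.get(userId);
    if (!w || nowMs - w.start >= this.windowMs) {
      // First request from this user, or the previous window has expired.
      this.windows.set(userId, { start: nowMs, count: 1 });
      return true;
    }
    if (w.count < this.maxPerWindow) {
      w.count += 1;
      return true;
    }
    return false;
  }
}

// Concurrency limit: caps how many requests are in flight across all users.
class ConcurrencyLimiter {
  constructor(limit) {
    this.limit = limit;
    this.inFlight = 0;
  }

  tryAcquire() {
    if (this.inFlight >= this.limit) return false; // shed load
    this.inFlight += 1;
    return true;
  }

  release() {
    if (this.inFlight > 0) this.inFlight -= 1;
  }
}
```

If many different users each send a single request, every one of them passes the per-user rate limiter, yet the concurrency limiter still rejects requests once the in-flight count reaches its limit. That is the overall-surge case that rate limits alone do not cover.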

Using Aperture For Adaptive Concurrency Limits

Aperture is an open source flow control and reliability platform that can help you set adaptive concurrency limits for your services. At the heart of Aperture is a control loop with three stages:

  1. Observing: Aperture agents monitor the deviation of your service’s current latency from historical trends using Golden Signals and identify load build-up or deterioration.
  2. Analyzing: Aperture Controller, which is the control loop's brain, continuously evaluates deviations from SLAs and communicates flow control decisions back to the agents.
  3. Actuating: Aperture agents sit right next to the service instances, regulating and prioritizing requests through a scheduler.

Aperture Control Loop

To showcase how Adaptive Concurrency Limits can be set in practice, let's deep dive into a demo setup of Aperture agents and controllers.

Demo Setup

Aperture includes a playground, pre-configured with a traffic generator, a sample application, and an instance of Grafana that you can use to see various signals generated by a policy.

Demo Application

The snapshot above shows a demo application with three services and a traffic generator named wavepool-generator.

Service Topology

Service Topology Diagram

The demo application is an example of a microservices topology, where requests flow from service1 to service2 and from service2 to service3. Each service adds a delay with jitter to simulate processing. service3 is the upstream service, configured with an artificial concurrency limit to simulate overload scenarios.

Traffic Pattern

The traffic generator is designed to generate a symmetrical traffic load for two types of users: subscribers and guests. The load generator periodically alternates between regular-traffic and high-traffic scenarios.

// Virtual-user stages shared by both scenarios.
export let vuStages = [
  { duration: "10s", target: 5 },
  { duration: "2m", target: 5 },
  { duration: "1m", target: 30 },
  { duration: "2m", target: 30 },
  { duration: "10s", target: 5 },
  { duration: "2m", target: 5 },
];

export let options = {
  discardResponseBodies: true,
  scenarios: {
    // Guest traffic
    guests: {
      executor: "ramping-vus",
      stages: vuStages,
      env: { USER_TYPE: "guest" },
    },
    // Subscriber traffic, following the same stages
    subscribers: {
      executor: "ramping-vus",
      stages: vuStages,
      env: { USER_TYPE: "subscriber" },
    },
  },
};

This generates the following traffic pattern:

  • Ramp up to 5 concurrent users in 10s.
  • Hold at 5 concurrent users for 2m.
  • Ramp up to 30 concurrent users in 1m (overloads service3).
  • Hold at 30 concurrent users for 2m (overloads service3).
  • Ramp down to 5 concurrent users in 10s.
  • Hold at 5 concurrent users for 2m.
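To see where in the run the overload is expected, the stages can be turned into a rough timeline. The sketch below linearly interpolates the virtual-user target for one scenario at a given elapsed time. The helper names are hypothetical, the `startVUs` default of 1 is an assumption about where the first ramp begins, and real VU counts are whole numbers, so treat the output as an approximation.

```javascript
// Stages as listed in the traffic generator script.
const stages = [
  { duration: "10s", target: 5 },
  { duration: "2m", target: 5 },
  { duration: "1m", target: 30 },
  { duration: "2m", target: 30 },
  { duration: "10s", target: 5 },
  { duration: "2m", target: 5 },
];

// Parse "10s" / "2m" style durations into seconds (only s and m are handled here).
function toSeconds(duration) {
  const match = /^(\d+)(s|m)$/.exec(duration);
  if (!match) throw new Error(`unsupported duration: ${duration}`);
  const value = Number(match[1]);
  return match[2] === "m" ? value * 60 : value;
}

// Linearly interpolate the VU target at elapsed time t (in seconds).
// startVUs is an assumption about the VU count before the first stage.
function targetAt(stageList, t, startVUs = 1) {
  let from = startVUs;
  let elapsed = 0;
  for (const stage of stageList) {
    const length = toSeconds(stage.duration);
    if (t <= elapsed + length) {
      const progress = length === 0 ? 1 : (t - elapsed) / length;
      return from + (stage.target - from) * progress;
    }
    elapsed += length;
    from = stage.target;
  }
  return from; // after the last stage, hold the final target
}

const totalSeconds = stages.reduce((sum, s) => sum + toSeconds(s.duration), 0);
console.log(totalSeconds, targetAt(stages, 160));
```

Because the guest and subscriber scenarios run the same stages in parallel, the combined number of concurrent users is twice the per-scenario value at any point in time.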

Deploying Aperture Policies for Adaptive Concurrency Limits


Aperture includes a pre-packaged list of Aperture Policies and Grafana Dashboards that can be used both as a guide for creating new policies and as ready-to-use Aperture Blueprints for generating policies customized to a Service and the use case. Policies are evaluated periodically, as defined in blueprints. Read more about Aperture Policy generation here.

The playground is configured with a Latency Gradient Policy. This policy is configured to measure the service latency of service3 through Flux Meter, and that signal is used to detect an overloaded state. The concurrency limiter is configured on service1, which is the downstream service (see Service Topology). This ensures that when service3 is overloaded, we stop accepting additional requests at the entry point, i.e., service1, to avoid wasted work.


Aperture includes a dry run mode that can be configured dynamically, allowing us to validate policy behavior without affecting the incoming traffic.

When no protection is set up for services

The latency of service3 can be monitored using the Grafana dashboard provided by Aperture. In this case, the Aperture policy is running in dry run mode, so incoming traffic is not affected.

No Protection Overview

Traffic ramping up

Once the traffic generator starts ramping up the number of users, the latency of service3 (under the Flux Meter panel) climbs to around 140ms, whereas under normal conditions it stays below 60ms. These latency spikes can lead to a bad user experience. If latency keeps increasing, requests will hit client timeouts and the service will become completely unavailable, potentially triggering a cascading failure throughout the application.

It is also worth noting that the subscribed users' workload is not prioritized: if guest users make too many requests, subscribed users suffer the consequences, such as high latency and request timeouts.

When Aperture is protecting the service

Once Aperture becomes active, it will start evaluating all the signals. The Signals Dashboard is available under aperture-controller inside Grafana. These signals are passed through a circuit, converting signals into control decisions.

Golden signal metrics in Prometheus are imported as Signals, and each signal can be plotted to understand how a circuit functions. Examples include:

  • EMA - This is used to calculate Latency Setpoint.
  • IS_OVERLOAD - Tracks whether the policy thinks a service is overloaded.
  • LOAD_MULTIPLIER - Tracks load-shedding decisions being published to Aperture Agents.
  • And so on.
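The relationship between these signals can be sketched roughly as follows. This is a simplified illustration, not Aperture's actual circuit: the function name, the tolerance of 1.1, and the clamping bounds are all assumptions, and only the signal names come from the list above.

```javascript
// Hypothetical helper: field names mirror the signals above, but the arithmetic
// is a simplified assumption and not Aperture's circuit.
function evaluateSignals(latencyMs, emaMs, options = {}) {
  const { tolerance = 1.1, minMultiplier = 0.1, maxMultiplier = 1.0 } = options;

  // Latency setpoint: the learned baseline (EMA) plus some headroom.
  const setpointMs = emaMs * tolerance;

  // IS_OVERLOAD: current latency has drifted above the setpoint.
  const isOverload = latencyMs > setpointMs;

  // LOAD_MULTIPLIER: fraction of load to keep accepting, clamped to sane bounds.
  const gradient = latencyMs > 0 ? setpointMs / latencyMs : maxMultiplier;
  const loadMultiplier = Math.min(maxMultiplier, Math.max(minMultiplier, gradient));

  return { setpointMs, isOverload, loadMultiplier };
}

console.log(evaluateSignals(50, 50));  // latency at baseline
console.log(evaluateSignals(140, 50)); // latency well above baseline
```

With a baseline of 50ms, a reading of 50ms yields no overload and a multiplier of 1, while a reading of 140ms flags an overload and yields a multiplier below 0.4, meaning most of the excess load would be shed.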

Signals Dashboard

After the signals are evaluated through circuits, decisions are made. One of the policy's benefits is that the maximum acceptable latency can be customized based on requirements and SLOs.

When Aperture is protecting the service.

Here, the traffic pattern is the same as earlier. However, this time around, Aperture is using service concurrency limits to decide whether to approve a request for processing or reject it.

Normal traffic scenario

Under normal circumstances, latency hovers around 50ms. During this period, Aperture learns the baseline latency by computing an exponential moving average of latency readings. To track incoming traffic, check the “Incoming Concurrency” panel, and for accepted traffic, check the “Accepted Concurrency” panel, as shown in the snapshot above.
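For readers unfamiliar with exponential moving averages, the sketch below shows the general technique on a handful of made-up latency readings. The smoothing factor `alpha` and the seeding rule are assumptions for illustration; Aperture's own EMA parameters are not described here.

```javascript
// alpha is a hypothetical smoothing factor; higher values react faster to new readings.
function updateEma(previousEma, reading, alpha = 0.2) {
  if (previousEma === null) return reading; // the first reading seeds the average
  return alpha * reading + (1 - alpha) * previousEma;
}

// Made-up latency readings (ms) under normal load.
const readingsMs = [48, 52, 50, 49, 51];

let baselineMs = null;
for (const reading of readingsMs) {
  baselineMs = updateEma(baselineMs, reading);
}

console.log(baselineMs.toFixed(1));
```

Because each new reading contributes only a fraction `alpha` of its value, a single slow request nudges the baseline rather than redefining it, which is what makes the average usable as a reference for detecting sustained deviations.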

Both Guest and Subscriber workloads shown on indices 0 and 1, respectively, have equal acceptance rates in the “Workload Decisions” panel, as there are no traffic drops during normal loads at the start.

In addition, Aperture automatically estimates the cost of admitting the request for each workload, which can be tracked in the “Workload Latency” panel. This estimation helps with prioritization and fair scheduling of requests during overload scenarios. Aperture’s Scheduler can prioritize workloads based on request attributes. For example, in this policy, subscribed user workload is configured to have higher priority than guest user workloads.
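The idea of prioritizing one workload over another when capacity is limited can be illustrated with a deliberately simple strict-priority admission function. Aperture's Scheduler is more sophisticated than this (it also accounts for the estimated cost of each request), so the names, priority values, and algorithm below are illustrative assumptions only.

```javascript
// Hypothetical priorities: a larger number means a more important workload.
const PRIORITY = { subscriber: 2, guest: 1 };

// Admit up to `capacity` requests, highest priority first; arrival order breaks ties.
function admit(requests, capacity) {
  const ranked = requests
    .map((request, index) => ({ request, index }))
    .sort(
      (a, b) =>
        PRIORITY[b.request.userType] - PRIORITY[a.request.userType] ||
        a.index - b.index
    );
  const accepted = ranked.slice(0, capacity).map((entry) => entry.request);
  const rejected = ranked.slice(capacity).map((entry) => entry.request);
  return { accepted, rejected };
}

// Five pending requests but room for only three.
const pending = [
  { id: "g1", userType: "guest" },
  { id: "s1", userType: "subscriber" },
  { id: "g2", userType: "guest" },
  { id: "s2", userType: "subscriber" },
  { id: "g3", userType: "guest" },
];
const decision = admit(pending, 3);
console.log(decision.accepted.map((r) => r.id), decision.rejected.map((r) => r.id));
```

Both subscriber requests are admitted before any guest request, and the remaining slot goes to the earliest guest. Strict priority like this can starve low-priority traffic entirely, which is one reason real schedulers tend to use weighted or fair queuing instead.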

Traffic ramping up

When the traffic generator starts ramping up the number of concurrent users, service3 becomes overloaded, causing latency to go up. As soon as Aperture detects this latency spike, it limits concurrency on service1. Based on the priorities configured in the policy, more traffic from the subscriber workload is accepted than from the guest workload.

During the spike in traffic, you see that the “Incoming Concurrency” graph ramps up, but Aperture Agent tries to automatically adjust the “Accepted Concurrency” by flattening the graph. Eventually, as the traffic ramps down, both graphs return to normal.

In the Flux Meter panel, it's visible that the latency on service3 is being maintained within the configured tolerance level, ensuring the service remains responsive throughout the traffic spike.

Traffic ramping down

The traffic spike is subsiding as the traffic rate goes down to normal levels. In the background, the latency gradient policy will keep load shedding to maintain its safe concurrency limit, a limit where the service is not overloaded.

Once the service is no longer overloaded, the Aperture Latency Gradient Policy will try to increase the concurrency limit of the service periodically, leading to maximum acceptance rates of requests.
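The shed-then-recover behavior described above resembles a familiar control pattern: shrink the limit quickly while the service is overloaded, then probe upward in small steps once it is healthy. The sketch below shows that general pattern under assumed constants. It is not the Latency Gradient Policy itself, and `nextLimit`, its bounds, and its step size are hypothetical.

```javascript
// One control step: shrink the limit in proportion to the latency overshoot while
// overloaded, otherwise probe upward by a fixed step. All constants are illustrative.
function nextLimit(limit, latencyMs, setpointMs, bounds = {}) {
  const { min = 1, max = 100, step = 2 } = bounds;
  if (latencyMs > setpointMs) {
    const gradient = setpointMs / latencyMs; // below 1 when overloaded
    return Math.max(min, Math.floor(limit * gradient));
  }
  return Math.min(max, limit + step);
}

// Made-up latency readings (ms): a spike followed by recovery.
const observedMs = [50, 110, 90, 60, 52, 50, 50];
let limit = 40;
const history = [];
for (const latencyMs of observedMs) {
  limit = nextLimit(limit, latencyMs, 55);
  history.push(limit);
}
console.log(history);
```

The limit drops sharply on the first overloaded readings and then climbs back by a fixed step per evaluation once latency returns below the setpoint, so acceptance rates recover gradually rather than snapping back and re-triggering the overload.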

Across-the-board overview of the service with and without Aperture protection.

Overall, there is an enormous difference when Aperture comes into the picture, controlling the flow of requests and maintaining latency throughout the period when traffic is ramping up. Latency significantly drops when Aperture is protecting the service.


In this blog post, we saw how powerful Adaptive Concurrency Limits can be in protecting services from overloads and how Aperture policies can be used to set them. This helps service owners with:

  • Preventing cascading failures with load shedding at the right place & time.
  • Providing a high-quality user experience with workload prioritization and a high capacity for critical API requests.
  • Keeping services within SLA with adaptive service protection.

In future posts, we will dive deeper into how Aperture enables prevention, escalation, and recovery techniques for reliability management.

To get started with Aperture open source, visit our GitHub repository and documentation site. Join our vibrant Discord community to discuss best practices, ask questions, and engage in insightful discussions with like-minded individuals.