Version: 2.32.2

Load Scheduling with Average Latency Feedback

Introduction

This policy detects traffic overloads and cascading failure build-up by comparing real-time latency with its historical average. A gradient controller calculates a proportional response to limit the accepted token (or request) rate. The token rate is reduced by a multiplicative factor when the service is overloaded, and increased by an additive factor once the service is no longer overloaded.

At a high level, this policy works as follows:

  • Latency trend-based overload detection: A Flux Meter is used to gather latency metrics from a service control point. The historical latency over a large time window (30 minutes by default) is used to establish a long-term trend that can be compared to the current latency to detect overloads.
  • Gradient Controller: Set point latency and current latency signals are fed to the gradient controller that calculates the proportional response to adjust the accepted token rate (Control Variable).
  • Integral Optimizer: When the service is detected to be in the normal state, an integral optimizer is used to additively increase the accepted token rate of the service in each execution cycle of the circuit. This measured approach prevents accepting all the traffic at once after an overload, which can again lead to an overload.
  • Load Scheduler: The accepted token rate at the service is throttled by a weighted-fair queuing scheduler. The adjustments to the accepted token rate made by the gradient controller and the optimizer are translated into a load multiplier, which is synchronized with Aperture Agents through etcd. The load multiplier adjusts (increases or decreases) the token bucket fill rate based on the incoming token rate observed at each agent.
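Putting the steps above together, the per-cycle update of the load multiplier can be sketched as follows. This is an illustrative reading based on the gradient and optimizer defaults listed under Configuration (slope of -1, gradient clamped to [0.1, 1], linear increment of 0.025, load multiplier capped at 2); the exact update rule of the AIMDLoadScheduler component may differ in detail:

$$
g_t = \min\!\Big(\max\Big(\big(\tfrac{\mathrm{signal}_t}{\mathrm{setpoint}}\big)^{\mathrm{slope}},\, g_{\min}\Big),\, g_{\max}\Big)
\qquad
\mathrm{LM}_{t+1} =
\begin{cases}
g_t \cdot \mathrm{LM}_t & \text{if overloaded } (g_t < 1)\\[2pt]
\min\big(\mathrm{LM}_t + \delta,\ \mathrm{LM}_{\max}\big) & \text{otherwise}
\end{cases}
$$

Here the setpoint is the long-term average latency times the tolerance multiplier, $\delta$ is `load_multiplier_linear_increment`, and $\mathrm{LM}_{\max}$ is `max_load_multiplier`. With the default slope of $-1$, the gradient simplifies to $\mathrm{setpoint}/\mathrm{signal}_t$: multiplicative decrease under overload, additive increase during recovery.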

The following PromQL query (with appropriate filters and range selectors applied) is used as the SIGNAL for the load scheduler:

```promql
sum(increase(flux_meter_sum)) / sum(increase(flux_meter_count))
```
info

See the reference for the AIMDLoadScheduler component that is used within this blueprint.

info

See the use case Adaptive Service Protection with Average Latency Feedback for this blueprint in action.

Configuration

Blueprint name: load-scheduling/average-latency
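For orientation, a minimal hypothetical values file for this blueprint might look as follows. Only the `__REQUIRED_FIELD__` entries from the parameter tables below are filled in; the policy name and control point are invented for illustration:

```yaml
# Hypothetical values for the load-scheduling/average-latency blueprint.
policy:
  policy_name: checkout-service-protection # invented policy name
  load_scheduling_core:
    aimd_load_scheduler:
      load_scheduler:
        selectors:
          - control_point: ingress # invented control point
  latency_baseliner:
    flux_meter:
      selectors:
        - control_point: ingress # invented control point
```

All other parameters keep the defaults documented below.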

Parameters

policy

Parameter: policy.components
Description: List of additional circuit components.
Type: Array of Object (aperture.spec.v1.Component)
Default Value: []

Parameter: policy.policy_name
Description: Name of the policy.
Type: string
Default Value: __REQUIRED_FIELD__

Parameter: policy.resources
Description: Additional resources.
Type: Object (aperture.spec.v1.Resources)
Default Value:

```yaml
flow_control:
  classifiers: []
```
policy.load_scheduling_core

Parameter: policy.load_scheduling_core.dry_run
Description: Default configuration for setting dry run mode on the Load Scheduler. In dry run mode, the Load Scheduler acts as a passthrough and does not throttle flows. This config can be updated at runtime without restarting the policy.
Type: Boolean
Default Value: false

Parameter: policy.load_scheduling_core.kubelet_overload_confirmations
Description: Overload confirmation signals from kubelet.
Type: Object (kubelet_overload_confirmations)
Default Value: {}

Parameter: policy.load_scheduling_core.overload_confirmations
Description: List of overload confirmation criteria. The load scheduler can throttle flows when all of the specified overload confirmation criteria are met.
Type: Array of Object (overload_confirmation)
Default Value: []

Parameter: policy.load_scheduling_core.aimd_load_scheduler
Description: Parameters for the AIMD throttling strategy.
Type: Object (aperture.spec.v1.AIMDLoadSchedulerParameters)
Default Value:

```yaml
alerter:
  alert_name: AIMD Load Throttling Event
gradient:
  max_gradient: 1
  min_gradient: 0.1
  slope: -1
load_multiplier_linear_increment: 0.025
load_scheduler:
  selectors:
    - control_point: __REQUIRED_FIELD__
max_load_multiplier: 2
```
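As an example of overriding these defaults, a hypothetical values fragment that makes recovery more gradual and caps the token bucket fill rate at the observed incoming rate could look like this; the numbers are illustrative, not recommendations:

```yaml
policy:
  load_scheduling_core:
    aimd_load_scheduler:
      load_multiplier_linear_increment: 0.01 # slower additive recovery after an overload
      max_load_multiplier: 1 # never admit more than the observed incoming token rate
```

A smaller increment lengthens the recovery ramp, trading slower restoration of full traffic for a lower risk of re-triggering the overload.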
policy.latency_baseliner

Parameter: policy.latency_baseliner.flux_meter
Description: Flux Meter defines the scope of latency measurements.
Type: Object (aperture.spec.v1.FluxMeter)
Default Value:

```yaml
selectors:
  - control_point: __REQUIRED_FIELD__
```

Parameter: policy.latency_baseliner.latency_tolerance_multiplier
Description: Tolerance factor beyond which the service is considered to be in an overloaded state. For example, if the long-term average latency is L and the tolerance is T, the service is considered overloaded when the short-term average latency exceeds L*T.
Type: Number (double)
Default Value: 1.25

Parameter: policy.latency_baseliner.long_term_query_interval
Description: Interval for the long-term latency query, that is, how far back in time the query looks. The value should be a string representing the duration in seconds.
Type: string
Default Value: 1800s

Parameter: policy.latency_baseliner.long_term_query_periodic_interval
Description: Periodic interval for the long-term latency query, that is, how often the query is run. The value should be a string representing the duration in seconds.
Type: string
Default Value: 30s
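A hypothetical values fragment that tightens overload detection and lengthens the baseline window could look like the following; the numbers are illustrative only:

```yaml
policy:
  latency_baseliner:
    latency_tolerance_multiplier: 1.1 # flag overload at 10% above the baseline instead of 25%
    long_term_query_interval: "3600s" # compare against a one-hour latency baseline
```

A lower tolerance makes the policy react earlier but also more susceptible to normal latency jitter; a longer query interval smooths the baseline at the cost of adapting more slowly to genuine shifts in service latency.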

Schemas

driver_criteria

Parameter: enabled
Description: Enables the driver.
Type: Boolean
Default Value: __REQUIRED_FIELD__

Parameter: threshold
Description: Threshold for the driver.
Type: Number (double)
Default Value: __REQUIRED_FIELD__

overload_confirmation_driver

Parameter: pod_cpu
Description: The driver for using CPU usage as overload confirmation.
Type: Object (driver_criteria)
Default Value: {}

Parameter: pod_memory
Description: The driver for using memory usage as overload confirmation.
Type: Object (driver_criteria)
Default Value: {}

kubelet_overload_confirmations

Parameter: criteria
Description: Criteria for overload confirmation.
Type: Object (overload_confirmation_driver)
Default Value: __REQUIRED_FIELD__

Parameter: infra_context
Description: Kubernetes selector for scraping metrics.
Type: Object (aperture.spec.v1.KubernetesObjectSelector)
Default Value: __REQUIRED_FIELD__
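Pieced together from the schemas on this page, a hypothetical kubelet overload confirmation might look as follows. The namespace and pod name are invented, and the field shape of `infra_context` is an assumption that should be checked against the aperture.spec.v1.KubernetesObjectSelector reference:

```yaml
policy:
  load_scheduling_core:
    kubelet_overload_confirmations:
      criteria:
        pod_cpu:
          enabled: true
          threshold: 0.9 # confirm overload only when pod CPU usage exceeds the threshold
      infra_context: # assumed KubernetesObjectSelector fields
        api_version: v1
        kind: Pod
        namespace: checkout # invented
        name: checkout-pod # invented
```

With this in place, latency-based overload detection alone does not throttle traffic; the kubelet CPU signal must confirm the overload as well.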

overload_confirmation

Parameter: operator
Description: The operator for the overload confirmation criteria. One of: gt | lt | gte | lte | eq | neq
Type: string
Default Value:

Parameter: query_string
Description: The Prometheus query to be run. Must return a scalar or a vector with a single element.
Type: string
Default Value:

Parameter: threshold
Description: The threshold for the overload confirmation criteria.
Type: Number (double)
Default Value:
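For instance, a hypothetical PromQL-based confirmation that permits throttling only while average container CPU usage is high; the query, namespace, and threshold are invented for illustration:

```yaml
policy:
  load_scheduling_core:
    overload_confirmations:
      - query_string: "avg(rate(container_cpu_usage_seconds_total{namespace='checkout'}[1m]))" # invented query
        operator: gt
        threshold: 0.9
```

The query must reduce to a single value so the `operator`/`threshold` comparison is well-defined; a multi-element vector cannot be evaluated against one threshold.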

Dynamic Configuration

note

The following configuration parameters can be dynamically configured at runtime, without reloading the policy.

Parameters

Parameter: dry_run
Description: Dynamic configuration for setting dry run mode at runtime without restarting this policy. In dry run mode, the scheduler acts as a passthrough for all flows and does not queue them. This is useful for observing the behavior of the load scheduler without disrupting any real traffic.
Type: Boolean
Default Value: __REQUIRED_FIELD__
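As an illustration, the dynamic-configuration payload that switches the policy into dry run mode can be as small as the following; how the payload is delivered (for example, through your controller tooling) depends on the deployment and is not shown here:

```yaml
# Hypothetical dynamic config: observe scheduler decisions without throttling.
dry_run: true
```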