Load Scheduling with Average Latency Feedback
Introduction
This policy detects traffic overloads and cascading-failure build-up by comparing real-time latency with a historical average. A gradient controller calculates a proportional response to limit the accepted token (or request) rate. The token rate is reduced by a multiplicative factor when the service is overloaded, and increased by an additive factor while the service is not overloaded.
At a high level, this policy works as follows:
- Latency trend-based overload detection: A Flux Meter is used to gather latency metrics from a service control point. The historical latency over a large time window (30 minutes by default) is used to establish a long-term trend that can be compared to the current latency to detect overloads.
- Gradient Controller: Set point latency and current latency signals are fed to the gradient controller that calculates the proportional response to adjust the accepted token rate (Control Variable).
- Integral Optimizer: When the service is detected to be in the normal state, an integral optimizer is used to additively increase the accepted token rate of the service in each execution cycle of the circuit. This measured approach prevents accepting all the traffic at once after an overload, which can again lead to an overload.
- Load Scheduler: The accepted token rate at the service is throttled by a weighted-fair queuing scheduler. The adjustments to the accepted token rate made by the gradient controller and the optimizer logic are translated into a load multiplier that is synchronized with Aperture Agents through etcd. The load multiplier adjusts (increases or decreases) the token bucket fill rates based on the incoming token rate observed at each agent.
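The control loop described above can be sketched roughly as follows. This is an illustrative Python sketch only: the minimum gradient, additive increment, and multiplier cap are assumed values, not the blueprint's actual defaults, and the real logic lives in the AIMDLoadScheduler component.

```python
def aimd_step(load_multiplier, current_latency, setpoint_latency,
              min_gradient=0.1, ai_increment=0.025, max_multiplier=2.0):
    """One circuit execution cycle of the AIMD control loop (sketch).

    - Overloaded (current latency above the setpoint): the gradient
      controller shrinks the load multiplier multiplicatively, in
      proportion to how far latency exceeds the setpoint.
    - Healthy: the integral optimizer grows the multiplier additively,
      so traffic is re-admitted gradually rather than all at once.
    """
    if current_latency > setpoint_latency:
        # Gradient controller: proportional, multiplicative decrease.
        gradient = setpoint_latency / current_latency
        load_multiplier *= max(gradient, min_gradient)  # bound the drop
    else:
        # Integral optimizer: additive increase while not overloaded.
        load_multiplier += ai_increment
    return min(load_multiplier, max_multiplier)

# Overload: 100 ms latency against a 62.5 ms setpoint shrinks the multiplier.
m = aimd_step(1.0, current_latency=0.100, setpoint_latency=0.0625)
# Recovery: each healthy cycle adds capacity back step by step.
m2 = aimd_step(m, current_latency=0.050, setpoint_latency=0.0625)
```

The load multiplier computed this way is what gets synchronized to the Agents, which scale their token bucket fill rates by it.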
The following PromQL query (with appropriate filters) is used as SIGNAL for the load scheduler:

sum(increase(flux_meter_sum)) / sum(increase(flux_meter_count))
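The ratio of the two counter increases is the average latency over the query window: flux_meter_sum accumulates total latency and flux_meter_count the number of measured flows. A quick sanity check with synthetic per-series values:

```python
# Per-series increase() over the window (synthetic numbers):
sum_increases = [4.2, 1.8]    # increase(flux_meter_sum): total latency, seconds
count_increases = [60, 40]    # increase(flux_meter_count): number of flows

# sum(...) / sum(...) -> mean latency over the window:
# 6.0 seconds of accumulated latency across 100 flows = 0.06 s average.
signal = sum(sum_increases) / sum(count_increases)
```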
See the reference for the AIMDLoadScheduler component that is used within this blueprint.
See the use case Adaptive Service Protection with Average Latency Feedback for an example of this blueprint in action.
Configuration
Blueprint name: load-scheduling/average-latency
Parameters
policy
| Parameter | Description | Type | Default Value |
| --- | --- | --- | --- |
| `policy.components` | List of additional circuit components. | Array of Object (aperture.spec.v1.Component) | |
| `policy.policy_name` | Name of the policy. | string | `__REQUIRED_FIELD__` |
| `policy.resources` | Additional resources. | Object (aperture.spec.v1.Resources) | |
policy.load_scheduling_core
| Parameter | Description | Type | Default Value |
| --- | --- | --- | --- |
| `policy.load_scheduling_core.dry_run` | Default configuration for setting dry run mode on the Load Scheduler. In dry run mode, the Load Scheduler acts as a passthrough and does not throttle flows. This config can be updated at runtime without restarting the policy. | Boolean | `false` |
| `policy.load_scheduling_core.kubelet_overload_confirmations` | Overload confirmation signals from kubelet. | Object (kubelet_overload_confirmations) | |
| `policy.load_scheduling_core.overload_confirmations` | List of overload confirmation criteria. The load scheduler can throttle flows when all of the specified overload confirmation criteria are met. | Array of Object (overload_confirmation) | |
| `policy.load_scheduling_core.aimd_load_scheduler` | Parameters for the AIMD throttling strategy. | Object (aperture.spec.v1.AIMDLoadSchedulerParameters) | |
policy.latency_baseliner
| Parameter | Description | Type | Default Value |
| --- | --- | --- | --- |
| `policy.latency_baseliner.flux_meter` | Flux Meter defines the scope of latency measurements. | Object (aperture.spec.v1.FluxMeter) | |
| `policy.latency_baseliner.latency_tolerance_multiplier` | Tolerance factor beyond which the service is considered to be in an overloaded state. For example, if the long-term average latency is L and the tolerance is T, the service is considered overloaded when the short-term average latency exceeds L*T. | Number (double) | `1.25` |
| `policy.latency_baseliner.long_term_query_interval` | Interval for the long-term latency query, that is, how far back in time the query is run. The value should be a string representing the duration in seconds. | string | `1800s` |
| `policy.latency_baseliner.long_term_query_periodic_interval` | Periodic interval for the long-term latency query, that is, how often the query is run. The value should be a string representing the duration in seconds. | string | `30s` |
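The tolerance check these parameters drive can be sketched directly from the description above (a minimal illustration; the real comparison happens inside the policy circuit):

```python
def is_overloaded(short_term_latency, long_term_latency, tolerance=1.25):
    """Overloaded when short-term average latency exceeds L * T."""
    return short_term_latency > long_term_latency * tolerance

# With a 30-minute baseline (L) of 50 ms and the default tolerance (T)
# of 1.25, the effective setpoint is 62.5 ms:
is_overloaded(0.080, 0.050)  # 80 ms > 62.5 ms -> overloaded
is_overloaded(0.060, 0.050)  # 60 ms <= 62.5 ms -> not overloaded
```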
Schemas
driver_criteria
| Parameter | Description | Type | Default Value |
| --- | --- | --- | --- |
| `enabled` | Enables the driver. | Boolean | `__REQUIRED_FIELD__` |
| `threshold` | Threshold for the driver. | Number (double) | `__REQUIRED_FIELD__` |
overload_confirmation_driver
| Parameter | Description | Type | Default Value |
| --- | --- | --- | --- |
| `pod_cpu` | The driver for using CPU usage as overload confirmation. | Object (driver_criteria) | |
| `pod_memory` | The driver for using memory usage as overload confirmation. | Object (driver_criteria) | |
|
kubelet_overload_confirmations
| Parameter | Description | Type | Default Value |
| --- | --- | --- | --- |
| `criteria` | Criteria for overload confirmation. | Object (overload_confirmation_driver) | `__REQUIRED_FIELD__` |
| `infra_context` | Kubernetes selector for scraping metrics. | Object (aperture.spec.v1.KubernetesObjectSelector) | `__REQUIRED_FIELD__` |
overload_confirmation
| Parameter | Description | Type | Default Value |
| --- | --- | --- | --- |
| `operator` | The operator for the overload confirmation criteria. One of: `gt`, `lt`, `gte`, `lte`, `eq`, `neq`. | string | |
| `query_string` | The Prometheus query to be run. Must return a scalar or a vector with a single element. | string | |
| `threshold` | The threshold for the overload confirmation criteria. | Number (double) | |
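Each criterion reads as "query result, operator, threshold", and throttling is confirmed only when every criterion holds. A sketch of that evaluation (the function names here are hypothetical; the real evaluation happens inside the policy engine):

```python
import operator

# Map the documented operator strings to Python comparisons.
_OPS = {
    "gt": operator.gt, "lt": operator.lt,
    "gte": operator.ge, "lte": operator.le,
    "eq": operator.eq, "neq": operator.ne,
}

def confirm(query_result, op, threshold):
    """Evaluate a single overload_confirmation criterion."""
    return _OPS[op](query_result, threshold)

def all_confirmed(criteria, results):
    """Throttling proceeds only when all criteria are met."""
    return all(confirm(r, c["operator"], c["threshold"])
               for c, r in zip(criteria, results))

criteria = [
    {"operator": "gte", "threshold": 0.9},   # e.g. a CPU-usage query
    {"operator": "gt", "threshold": 2.0e9},  # e.g. a memory-usage query
]
all_confirmed(criteria, [0.95, 2.5e9])  # both met -> confirmed
all_confirmed(criteria, [0.95, 1.0e9])  # memory not met -> not confirmed
```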
Dynamic Configuration
The following configuration parameters can be dynamically configured at runtime, without reloading the policy.
Parameters
| Parameter | Description | Type | Default Value |
| --- | --- | --- | --- |
| `dry_run` | Dynamic configuration for setting dry run mode at runtime without restarting this policy. In dry run mode, the scheduler acts as a passthrough for all flows and does not queue them. This is useful for observing the behavior of the load scheduler without disrupting any real traffic. | Boolean | `__REQUIRED_FIELD__` |