Spotinst Drift (Pod Driven Autoscaling)
Spotinst Drift is an infrastructure scaling service for Kubernetes that adjusts infrastructure capacity and size to meet the needs of pods, containers, and applications.
The main purpose of Spotinst Drift is to give pending pods a place to run while dynamically fitting the infrastructure to pod sizes and needs. Spotinst PDA periodically checks whether there are any pending pods and increases the size of the cluster if it makes sense and if the scaled-up cluster is still within the user-provided constraints.
Spotinst Drift is composed of two components:
Spotinst Controller (SPT-CTL)
A pod that lives within the k8s cluster and is responsible for collecting metrics and events. The events are pushed over a one-way secured link to the second component, which handles business logic and capacity scale-up/scale-down activities.
Spotinst Drift SaaS
The SaaS is responsible for aggregating the metrics from the SPT-CTL and building the cluster topology. Using the aggregated metrics, the SaaS component applies further business logic algorithms, such as Spot Instance availability prediction and instance size/type recommendation, to increase performance and optimize costs via workload density instance pricing models (across On-Demand, Reserved, and Spot Instances).
Spotinst Drift vs Metric-based node Autoscaling
Spotinst PDA makes sure that all pods in the cluster have a place to run, no matter if there is any CPU load or not. Moreover, it tries to ensure that there are no unneeded nodes in the cluster.
Metric-based cluster autoscalers don’t care about pods when scaling up and down. As a result, they may add a node that will not have any pods, or remove a node that has some system-critical pods on it, like kube-dns. Usage of these autoscalers with Kubernetes is discouraged.
Changing the size of the Kubernetes Cluster
Spotinst Drift changes the size of the cluster in two cases:
- Scale up: there are pods that failed to schedule on any of the current nodes due to insufficient resources.
- Scale down: some nodes are consistently unneeded for a significant amount of time. A node is unneeded when it has low utilization and all of its important pods can be moved elsewhere.
Ori / Yuval– add more.
Drift checks for any unschedulable pods every 10 seconds. A pod is unschedulable when the Kubernetes scheduler is unable to find a node that can accommodate the pod.
For example, a pod can request more CPU than is available on any of the cluster nodes. Unschedulable pods are recognized by their PodCondition. Whenever the Kubernetes scheduler fails to find a place to run a pod, it sets the “schedulable” PodCondition to false and records the reason on that condition.
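As a minimal sketch of this check, the function below filters pod objects shaped like the Kubernetes API's JSON output. In the Kubernetes API the condition type is `PodScheduled` and the scheduler's reason is `Unschedulable`. The helper names and the sample pods are hypothetical, and this is not Drift's actual implementation.

```python
# Sketch: pick out unschedulable pods from pod objects shaped like the
# Kubernetes API's JSON (for example, the items of `kubectl get pods -o json`).

def is_unschedulable(pod: dict) -> bool:
    """True if the PodScheduled condition is False with reason Unschedulable."""
    conditions = pod.get("status", {}).get("conditions") or []
    for cond in conditions:
        if (cond.get("type") == "PodScheduled"
                and cond.get("status") == "False"
                and cond.get("reason") == "Unschedulable"):
            return True
    return False


def unschedulable_pod_names(pods: list) -> list:
    """Names of all pods the scheduler could not place."""
    return [p["metadata"]["name"] for p in pods if is_unschedulable(p)]


# Hypothetical sample data for illustration only.
EXAMPLE_PODS = [
    {"metadata": {"name": "api-1"},
     "status": {"phase": "Pending",
                "conditions": [{"type": "PodScheduled",
                                "status": "False",
                                "reason": "Unschedulable"}]}},
    {"metadata": {"name": "api-2"},
     "status": {"phase": "Running",
                "conditions": [{"type": "PodScheduled", "status": "True"}]}},
]

print(unschedulable_pod_names(EXAMPLE_PODS))  # ['api-1']
```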
Drift calculates and aggregates the number of unschedulable pods waiting to be placed and finds the optimal distribution of nodes. Drift makes sure that the biggest pod will have enough resources to be placed, and it distributes the pods across the most efficient number of VMs from the desired cloud provider. In some scenarios, it will prefer to provision a mix of machine sizes, such as 8xl and medium, based on the pods’ requirements and the Spot prices in the relevant region.
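To illustrate the idea of fitting nodes to pending pods, here is a rough sketch that uses a first-fit-decreasing heuristic as a stand-in. It is not Drift's actual algorithm. The instance names, sizes, and prices are hypothetical, and the sketch only compares homogeneous plans (one instance type at a time), whereas the text above describes mixed distributions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str
    cpu_m: int           # allocatable CPU in millicores
    mem_mib: int         # allocatable memory in MiB
    hourly_price: float  # hypothetical Spot price


def nodes_needed(pods, itype):
    """First-fit decreasing. Returns the node count, or None if a pod cannot fit."""
    free = []  # remaining [cpu, mem] per simulated node
    for cpu, mem in sorted(pods, reverse=True):  # biggest pods first
        if cpu > itype.cpu_m or mem > itype.mem_mib:
            return None  # the biggest pod must fit on a single node
        for node in free:
            if node[0] >= cpu and node[1] >= mem:
                node[0] -= cpu
                node[1] -= mem
                break
        else:
            free.append([itype.cpu_m - cpu, itype.mem_mib - mem])
    return len(free)


def cheapest_plan(pods, catalog):
    """Return (total_price, instance_name, node_count) for the cheapest feasible type."""
    plans = []
    for itype in catalog:
        count = nodes_needed(pods, itype)
        if count is not None:
            plans.append((count * itype.hourly_price, itype.name, count))
    return min(plans) if plans else None


# Pending pods as (cpu millicores, memory MiB) requests; illustrative values.
PENDING = [(3000, 4096), (500, 512), (500, 512), (250, 256)]
CATALOG = [
    InstanceType("medium", 2000, 4096, 0.02),
    InstanceType("xlarge", 4000, 16384, 0.06),
    InstanceType("2xlarge", 8000, 32768, 0.10),
]

print(cheapest_plan(PENDING, CATALOG))
```

In this example the medium type is rejected because the 3000m pod does not fit on it, the xlarge type needs two nodes, and a single 2xlarge node is the cheapest option.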
[Image, for example, A , Image of example B ]
It may take up to a few minutes before the created nodes appear in Kubernetes. To minimize this time (down to zero), read more about Cluster Headroom and Overprovisioning.
Drift constantly checks which nodes are unneeded in the cluster.
A node is considered for removal when:
- All pods running on the node (except those that run on all nodes by default, like manifest-run pods or pods created by daemonsets) can be moved to other nodes in the cluster.
- The sum of CPU and memory requests of all pods running on the node is smaller than 50% of the node’s allocatable resources (not the node’s capacity).
Drift simulates the cluster’s topology and state after the scale-down activity and decides whether the action can be executed.
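The two removal conditions and the simulation step could be sketched roughly as follows. This is an illustration under stated assumptions, not Drift's implementation: the 50% threshold is applied to CPU and memory separately, daemonset pods are marked with a hypothetical `daemonset` flag, and the "simulation" is a simple first-fit placement onto the free capacity of the other nodes.

```python
def is_removal_candidate(node, other_nodes, threshold=0.5):
    """Check the utilization rule, then simulate moving pods to other nodes."""
    pods = node["pods"]
    alloc = node["allocatable"]

    # Condition: requests of all pods are below the threshold of allocatable.
    cpu_req = sum(p["cpu_m"] for p in pods)
    mem_req = sum(p["mem_mib"] for p in pods)
    if cpu_req >= threshold * alloc["cpu_m"] or mem_req >= threshold * alloc["mem_mib"]:
        return False

    # Condition: every pod that must be moved fits somewhere else.
    # Work on copies so the simulation does not modify the real state.
    free = [dict(n["free"]) for n in other_nodes]
    movable = [p for p in pods if not p.get("daemonset")]
    for pod in sorted(movable, key=lambda p: p["cpu_m"], reverse=True):
        for slot in free:
            if slot["cpu_m"] >= pod["cpu_m"] and slot["mem_mib"] >= pod["mem_mib"]:
                slot["cpu_m"] -= pod["cpu_m"]
                slot["mem_mib"] -= pod["mem_mib"]
                break
        else:
            return False  # no node can take this pod
    return True


# Hypothetical cluster state for illustration.
NODE = {
    "allocatable": {"cpu_m": 4000, "mem_mib": 8192},
    "pods": [
        {"name": "web", "cpu_m": 1000, "mem_mib": 1024},
        {"name": "log-agent", "cpu_m": 200, "mem_mib": 256, "daemonset": True},
    ],
}
ROOMY = [{"free": {"cpu_m": 1500, "mem_mib": 2048}}]
TIGHT = [{"free": {"cpu_m": 500, "mem_mib": 2048}}]

print(is_removal_candidate(NODE, ROOMY))  # True
print(is_removal_candidate(NODE, TIGHT))  # False
```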
Pods & Node Draining Process (Graceful Termination)
Aviv/Yuval/Ori – need your inputs here
Scale down prevention
- Pods with restrictive PodDisruptionBudget. (Read more)
- are not run on the node by default, *
- don’t have PDB or their PDB is too restrictive
- Pods that are not backed by a controller object (so not created by deployment, replica set, job, stateful set etc). *
- Pods with local storage. *
- Pods that cannot be moved elsewhere due to various constraints (lack of resources, non-matching
- Pods that have the following annotation set:
How does Horizontal Pod Autoscaler (HPA) work with Spotinst Drift?
Horizontal Pod Autoscaler changes the deployment’s or replicaset’s number of replicas based on CPU load or other custom metrics. If the load increases, HPA will create new replicas, for which there may or may not be enough space in the cluster.
If there are not enough resources, Spotinst Drift will try to bring up new nodes, so that the HPA-created pods have a place to run. If the load decreases, HPA will stop some of the replicas. As a result, some nodes may become underutilized or completely empty, and then Spotinst Drift will delete such unneeded nodes.
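A small worked example may help. The replica calculation below follows the formula in the Kubernetes HPA documentation, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), ignoring the HPA tolerance and stabilization settings. The `free_slots` figure (how many more replicas the existing nodes can hold) is a hypothetical input used only to show where new nodes would come into play.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric):
    """HPA replica formula from the Kubernetes documentation (tolerance ignored)."""
    return math.ceil(current_replicas * current_metric / target_metric)


def pods_needing_new_nodes(new_replicas, current_replicas, free_slots):
    """Replicas that will stay pending until more nodes are added."""
    added = max(0, new_replicas - current_replicas)
    return max(0, added - free_slots)


# 4 replicas at 90% average CPU with a 50% target.
target = desired_replicas(4, 90, 50)
print(target)                                 # 8
print(pods_needing_new_nodes(target, 4, 3))   # 1
```

Here HPA asks for 8 replicas, the existing nodes can take 3 of the 4 new ones, and the remaining pod stays pending until new capacity is brought up.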
Pod Priority and Preemption
Spotinst Drift takes pod priorities into account.
Yuval/Ori/Aviv – need here your inputs + examples.
The Pod Priority and Preemption feature enables scheduling pods based on priorities when there are not enough resources. Spotinst Drift, on the other hand, makes sure that there are enough resources to run all pods.
To allow users to schedule “best-effort” pods, which shouldn’t trigger Spotinst Drift actions but only run when there are spare resources available, we introduced a priority cutoff to Cluster Autoscaler.
Pods with a priority lower than this cutoff:
- don’t trigger scale-ups: no new node is added in order to run them,
- don’t prevent scale-downs: nodes running such pods can be deleted.
Nothing changes for pods with a priority greater than or equal to the cutoff, or for pods without a priority.
The default priority cutoff is 0. It can be changed using the --expendable-pods-priority-cutoff flag, but we discourage it. Drift also doesn’t trigger a scale-up if an unschedulable pod is already waiting for a lower-priority pod to be preempted.
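The cutoff rules above can be summarized in a few lines. This is a sketch of the decision logic only. The function names and the `waiting_for_preemption` flag are hypothetical, and the default of 0 is the value stated above.

```python
from typing import Optional

PRIORITY_CUTOFF = 0  # default cutoff stated in the text


def is_expendable(priority: Optional[int], cutoff: int = PRIORITY_CUTOFF) -> bool:
    """Pods below the cutoff are best-effort. Pods without a priority are not."""
    return priority is not None and priority < cutoff


def triggers_scale_up(priority: Optional[int],
                      unschedulable: bool,
                      waiting_for_preemption: bool = False,
                      cutoff: int = PRIORITY_CUTOFF) -> bool:
    """An unschedulable pod triggers a scale-up unless it is expendable
    or is already waiting for a lower-priority pod to be preempted."""
    if not unschedulable:
        return False
    if is_expendable(priority, cutoff):
        return False
    return not waiting_for_preemption


def blocks_scale_down(priority: Optional[int], cutoff: int = PRIORITY_CUTOFF) -> bool:
    """Expendable pods never keep a node alive; other pods may."""
    return not is_expendable(priority, cutoff)


print(triggers_scale_up(-1, unschedulable=True))    # False
print(triggers_scale_up(0, unschedulable=True))     # True
print(triggers_scale_up(None, unschedulable=True))  # True
```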
Scale to 0
Ori/Aviv/Twizer – need your inputs here.
It is possible to scale a node group to 0 (and, obviously, from 0), assuming that all scale-down conditions are met.
Overprovisioning can be configured using a deployment running pause pods with a very low assigned priority (see Priority Preemption), which reserves resources that can be used by other pods. If there are not enough resources, the pause pods are preempted and new pods take their place. The pause pods then become unschedulable and force CA to scale up the cluster.
PodDisruptionBudget in scale-down
Before starting to delete a node, Drift makes sure that the PodDisruptionBudgets for pods scheduled there allow for removing at least one replica. Then it deletes all pods from the node through the pod eviction API.
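A rough sketch of the pre-drain check is shown below. It assumes PodDisruptionBudget objects shaped like the Kubernetes API's JSON, where `status.disruptionsAllowed` reports how many pods may currently be evicted. Selector matching is simplified to `matchLabels` only, and the function names and sample objects are hypothetical.

```python
def pdb_covers(pdb: dict, pod: dict) -> bool:
    """Simplified selector match: same namespace and all matchLabels present."""
    if pdb["metadata"]["namespace"] != pod["metadata"]["namespace"]:
        return False
    wanted = pdb["spec"].get("selector", {}).get("matchLabels", {})
    labels = pod["metadata"].get("labels", {})
    return all(labels.get(key) == value for key, value in wanted.items())


def node_can_be_drained(pods_on_node: list, pdbs: list) -> bool:
    """Every PDB covering a pod on the node must allow at least one disruption."""
    for pod in pods_on_node:
        for pdb in pdbs:
            allowed = pdb.get("status", {}).get("disruptionsAllowed", 0)
            if pdb_covers(pdb, pod) and allowed < 1:
                return False
    return True


# Hypothetical objects for illustration.
POD = {"metadata": {"name": "web-1", "namespace": "prod", "labels": {"app": "web"}}}
PDB_OK = {"metadata": {"name": "web-pdb", "namespace": "prod"},
          "spec": {"selector": {"matchLabels": {"app": "web"}}},
          "status": {"disruptionsAllowed": 1}}
PDB_BLOCKED = {"metadata": {"name": "web-pdb", "namespace": "prod"},
               "spec": {"selector": {"matchLabels": {"app": "web"}}},
               "status": {"disruptionsAllowed": 0}}

print(node_can_be_drained([POD], [PDB_OK]))       # True
print(node_can_be_drained([POD], [PDB_BLOCKED]))  # False
```

Once a check like this passes, the pods would be removed through the eviction API rather than deleted directly, so that the PodDisruptionBudgets are still enforced by Kubernetes at eviction time.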
Node Health Check and Auto-healing