Ocean’s pod-driven scaling for Kubernetes clusters serves three main goals:
- Schedule pods that failed to run on any of the current nodes due to insufficient resources.
- Ensure that frequently scaled pods won’t have to wait for instances to launch (see the Headroom section for more details).
- Ensure that cluster resources are optimally utilized.
Spotinst Ocean Vs Metric-Based Node Autoscaling
Spotinst Ocean makes sure that all pods in the cluster have a place and capacity to run, regardless of the current cluster’s load. Moreover, it ensures that there are no underutilized nodes in the cluster.
Metric-based cluster autoscalers are not aware of Pods when scaling up and down. As a result, they may add a node that will not have any Pods, or remove a node that has some system-critical pods on it, like kube-dns. Usage of these autoscalers with Kubernetes is discouraged.
Ocean checks for unschedulable pods every 10 seconds. A pod is unschedulable when the Kubernetes scheduler is unable to find a node that can accommodate it. This can happen due to insufficient CPU, memory, GPU, or custom resources.
For example, a pod may request more CPU than is available on any of the cluster nodes. Unschedulable pods are recognized by their PodCondition: whenever the Kubernetes scheduler fails to find a place to run a pod, it sets the “PodScheduled” PodCondition to “False” with the reason “Unschedulable”.
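The check described above can be sketched in a few lines. This is only an illustration, not Ocean's implementation: it assumes pods are available as plain dictionaries shaped like the JSON returned by the Kubernetes API, where the relevant condition appears with type `PodScheduled`, status `False`, and reason `Unschedulable`. The function name and the two sample pods are hypothetical.

```python
def is_unschedulable(pod: dict) -> bool:
    """Return True if the scheduler has marked this pod as unschedulable."""
    conditions = pod.get("status", {}).get("conditions") or []
    return any(
        c.get("type") == "PodScheduled"
        and c.get("status") == "False"
        and c.get("reason") == "Unschedulable"
        for c in conditions
    )


# Hypothetical sample pods, trimmed to the fields the check reads.
pending_pod = {
    "metadata": {"name": "web-1"},
    "status": {
        "phase": "Pending",
        "conditions": [
            {
                "type": "PodScheduled",
                "status": "False",
                "reason": "Unschedulable",
                "message": "0/3 nodes are available: 3 Insufficient cpu.",
            }
        ],
    },
}
running_pod = {
    "metadata": {"name": "web-0"},
    "status": {
        "phase": "Running",
        "conditions": [{"type": "PodScheduled", "status": "True"}],
    },
}

for pod in (pending_pod, running_pod):
    print(pod["metadata"]["name"], is_unschedulable(pod))
```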
Ocean aggregates the unschedulable pods waiting to be placed and finds the optimal nodes for the job. It makes sure that all the pods will have enough resources to be placed, and it distributes them across the most efficient number of VMs from the desired cloud provider. In some scenarios, it will prefer a mix of machine types and sizes based on the pods’ requirements and the Spot/Preemptible VM prices in the relevant region.
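To make the idea of fitting pending pods onto the cheapest suitable machines concrete, here is a small sketch. It is not Ocean's algorithm, which is not described here. It uses a simple first-fit-decreasing bin-packing heuristic, considers a single instance type per plan, and relies on a made-up catalog with made-up prices. All class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class PodRequest:
    name: str
    cpu_millis: int
    memory_mib: int


@dataclass(frozen=True)
class InstanceType:
    name: str
    cpu_millis: int
    memory_mib: int
    hourly_price: float  # illustrative value, not a real price


def nodes_needed(pods: List[PodRequest], itype: InstanceType) -> Optional[int]:
    """First-fit decreasing: node count for this type, or None if a pod cannot fit."""
    free = []  # remaining [cpu, memory] per simulated node
    ordered = sorted(pods, key=lambda p: (p.cpu_millis, p.memory_mib), reverse=True)
    for pod in ordered:
        if pod.cpu_millis > itype.cpu_millis or pod.memory_mib > itype.memory_mib:
            return None
        for slot in free:
            if slot[0] >= pod.cpu_millis and slot[1] >= pod.memory_mib:
                slot[0] -= pod.cpu_millis
                slot[1] -= pod.memory_mib
                break
        else:
            # No existing node has room, so simulate launching a new one.
            free.append([itype.cpu_millis - pod.cpu_millis,
                         itype.memory_mib - pod.memory_mib])
    return len(free)


def cheapest_plan(pods: List[PodRequest],
                  catalog: List[InstanceType]) -> Optional[Tuple[float, str, int]]:
    """Return (hourly cost, instance type name, node count) of the cheapest plan."""
    plans = []
    for itype in catalog:
        count = nodes_needed(pods, itype)
        if count is not None:
            plans.append((count * itype.hourly_price, itype.name, count))
    return min(plans) if plans else None


pending = [
    PodRequest("a", 1500, 2048),
    PodRequest("b", 1000, 1024),
    PodRequest("c", 500, 512),
    PodRequest("d", 2500, 4096),
]
catalog = [
    InstanceType("small", 2000, 4096, 0.03),
    InstanceType("large", 4000, 8192, 0.05),
    InstanceType("xlarge", 8000, 16384, 0.12),
]
plan = cheapest_plan(pending, catalog)
print(plan)
```

In this example the "small" type is ruled out because pod "d" does not fit on it, and two "large" nodes cost less than one "xlarge" node.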
It may take a few moments before the newly created nodes join the Kubernetes cluster. To minimize this time (down to zero), read more about Cluster Headroom.
Ocean constantly checks which nodes are unneeded in the cluster.
A node is considered for removal when:
- All pods running on the node (except those that run on all nodes by default, such as manifest-run pods or pods created by DaemonSets) can be moved to other nodes in the cluster, based on Pod Disruption Budgets (PDB), Persistent Volume (PV) allocation, node and pod affinity/anti-affinity, and labels.
- The node’s removal won’t reduce the headroom below the target
- Ocean will prefer to downscale the least utilized nodes first
Ocean simulates the cluster’s topology and state as they would be after the scale-down activity and decides whether the action can be executed.
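The simulation idea can be illustrated with a simplified sketch. It only checks CPU and memory, tries the least utilized node first, and skips pods flagged as DaemonSet pods. PDBs, persistent volumes, affinity rules, and the headroom target are deliberately left out. The data layout and function names are assumptions made for this example, not Ocean's internals.

```python
def free_capacity(node: dict) -> list:
    """Remaining [cpu, mem] on a node after subtracting its pods' requests."""
    used_cpu = sum(p["cpu"] for p in node["pods"])
    used_mem = sum(p["mem"] for p in node["pods"])
    return [node["cpu"] - used_cpu, node["mem"] - used_mem]


def can_remove(candidate: dict, nodes: list) -> bool:
    """Simulate moving the candidate's pods onto the remaining nodes."""
    free = [free_capacity(n) for n in nodes if n["name"] != candidate["name"]]
    # DaemonSet pods run on every node, so they do not need a new home.
    movable = [p for p in candidate["pods"] if not p.get("daemonset", False)]
    for pod in sorted(movable, key=lambda p: p["cpu"], reverse=True):
        for slot in free:
            if slot[0] >= pod["cpu"] and slot[1] >= pod["mem"]:
                slot[0] -= pod["cpu"]
                slot[1] -= pod["mem"]
                break
        else:
            return False  # this pod fits nowhere else
    return True


def utilization(node: dict) -> float:
    return sum(p["cpu"] for p in node["pods"]) / node["cpu"]


def pick_node_to_remove(nodes: list):
    """Return the name of the least utilized removable node, or None."""
    for node in sorted(nodes, key=utilization):
        if can_remove(node, nodes):
            return node["name"]
    return None


# Hypothetical cluster: CPU in millicores, memory in MiB.
nodes = [
    {"name": "node-a", "cpu": 4000, "mem": 8192,
     "pods": [{"name": "api", "cpu": 3500, "mem": 4096}]},
    {"name": "node-b", "cpu": 4000, "mem": 8192,
     "pods": [{"name": "cron", "cpu": 500, "mem": 512},
              {"name": "log-agent", "cpu": 100, "mem": 128, "daemonset": True}]},
    {"name": "node-c", "cpu": 4000, "mem": 8192,
     "pods": [{"name": "worker", "cpu": 2000, "mem": 2048}]},
]
print(pick_node_to_remove(nodes))
```

Here node-b is the least utilized node and its only movable pod fits elsewhere, so it is selected. node-a could not be removed because no other node has 3500 millicores free.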
Scale down prevention
Ocean will not scale down a node that runs any of the following:
- Pods with a restrictive PodDisruptionBudget. These pods are evicted gradually; if the scale-down would violate the disruption budget, Ocean will not scale down the node.
- Pods that are not backed by a controller object (that is, not created by a Deployment, ReplicaSet, Job, StatefulSet, etc.).
- Pods with local storage.
- Pods that cannot be moved elsewhere due to various constraints (lack of resources, non-matching node selectors or affinity, matching anti-affinity, etc.).
- Pods that have the following label: “spotinst.io/restrict-scale-down”: “true”
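Some of these conditions can be checked directly from a pod's manifest. The sketch below is illustrative only and works on plain dictionaries shaped like Kubernetes pod JSON. It assumes that "local storage" means `emptyDir` or `hostPath` volumes, which the text above does not specify, and it does not cover the PDB or placement-constraint cases, since those need cluster-wide information. The function name is hypothetical.

```python
RESTRICT_LABEL = "spotinst.io/restrict-scale-down"


def scale_down_blockers(pod: dict) -> list:
    """Return the reasons (possibly none) why this pod would pin its node."""
    reasons = []
    meta = pod.get("metadata", {})
    spec = pod.get("spec", {})

    # Explicit opt-out label.
    if (meta.get("labels") or {}).get(RESTRICT_LABEL) == "true":
        reasons.append("restrict-scale-down label")

    # No owning controller means nothing would recreate the pod elsewhere.
    owners = meta.get("ownerReferences") or []
    if not any(owner.get("controller") for owner in owners):
        reasons.append("no controller")

    # Assumption: emptyDir and hostPath volumes count as local storage.
    for volume in spec.get("volumes") or []:
        if "emptyDir" in volume or "hostPath" in volume:
            reasons.append("local storage")
            break

    return reasons


# Hypothetical pods, trimmed to the fields the check reads.
replica_set_owner = [{"kind": "ReplicaSet", "name": "web-abc", "controller": True}]
bare_pod = {"metadata": {"name": "debug"}, "spec": {}}
cache_pod = {
    "metadata": {"name": "cache", "ownerReferences": replica_set_owner},
    "spec": {"volumes": [{"name": "scratch", "emptyDir": {}}]},
}
pinned_pod = {
    "metadata": {"name": "db", "labels": {RESTRICT_LABEL: "true"},
                 "ownerReferences": replica_set_owner},
    "spec": {},
}
movable_pod = {
    "metadata": {"name": "web", "ownerReferences": replica_set_owner},
    "spec": {"volumes": [{"name": "cfg", "configMap": {"name": "web-cfg"}}]},
}

for pod in (bare_pod, cache_pod, pinned_pod, movable_pod):
    print(pod["metadata"]["name"], scale_down_blockers(pod))
```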
Pods & Nodes Draining Process
Ocean ensures that pods and nodes are gracefully terminated in the case of a scale-down or an instance replacement.
The node termination process is as follows:
- Check for the scale-down restriction label (“spotinst.io/restrict-scale-down”: “true”) on the node’s pods. If found, the node is not eligible for scale-down.
- Scan all the pods and mark the ones that need to be rescheduled.
- Mark all the pods that don’t have PDB configured, and start evicting them in parallel
- For pods with a PDB, Ocean performs the eviction in chunks and makes sure that it won’t interfere with the minimal budget configured. For example, if a PDB’s .spec.minAvailable is 3 and there are 5 pods, 4 of which run on the node that is about to be scaled down, Ocean will evict 2 pods, wait for the health signal, and then move on to the next 2.
- An eviction is not completed until Ocean gets a health signal from the new pod’s readiness/liveness probe (when configured) and the old pod has been successfully terminated (after the grace period or the preStop command).
- Ocean provides a draining timeout of 120 seconds by default (configurable) for every pod before terminating it.
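The chunking arithmetic from the PDB example above can be written out as a short sketch. It assumes that every evicted pod is rescheduled and healthy again before the next chunk starts, so the number of pods that may be down at once stays at the healthy count minus minAvailable. The function name and parameters are hypothetical and only illustrate the calculation.

```python
from typing import List


def eviction_chunks(healthy_pods: int, min_available: int, pods_on_node: int) -> List[int]:
    """Split the node's PDB-covered pods into eviction chunk sizes.

    Returns an empty list when the budget leaves no room for any eviction.
    """
    allowed = healthy_pods - min_available  # pods that may be down at once
    if allowed <= 0:
        return []
    chunks = []
    remaining = pods_on_node
    while remaining > 0:
        size = min(allowed, remaining)
        chunks.append(size)
        remaining -= size
    return chunks


# The example from the text: 5 pods, minAvailable 3, 4 pods on the node.
print(eviction_chunks(healthy_pods=5, min_available=3, pods_on_node=4))  # [2, 2]
```

With 3 healthy pods and a minAvailable of 3, the function returns an empty list, which corresponds to the case where the node cannot be drained without violating the budget.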
Ocean provides the option to include a buffer of spare capacity (vCPU and memory resources) known as headroom. Headroom ensures that the cluster has the capacity to quickly scale more Pods without waiting for new nodes to be provisioned.
Ocean optimally manages the headroom to provide the best possible cost/performance balance. However, headroom may also be manually configured to support any use case.
Customizing scaling configuration
Ocean manages the cluster capacity to ensure all pods are running and that resources are utilized.
If you wish to override the default configuration, you can customize the scaling configuration.
To customize the scaling configuration:
- Navigate to your Ocean cluster
- Click on the ‘Actions’ button on the top-right side of the screen to open the actions menu
- Choose ‘Customize Scaling’
Ocean allows dynamic resource allocation to fit the pods’ needs. Ocean cluster resources are limited to 1000 CPU cores and 4000 GB of memory by default. These limits can be customized via the cluster creation and edit wizards.
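As a rough illustration of how such limits bound scaling, the sketch below checks whether adding capacity would keep a cluster within the default limits quoted above. It assumes the limits apply to the cluster's total CPU cores and memory; how Ocean actually enforces them is not described here, and the function and constant names are hypothetical.

```python
DEFAULT_MAX_VCPU = 1000        # default CPU core limit quoted above
DEFAULT_MAX_MEMORY_GB = 4000   # default memory limit quoted above


def fits_within_limits(current_vcpu: int, current_memory_gb: int,
                       extra_vcpu: int, extra_memory_gb: int,
                       max_vcpu: int = DEFAULT_MAX_VCPU,
                       max_memory_gb: int = DEFAULT_MAX_MEMORY_GB) -> bool:
    """Return True if the cluster stays within both limits after scaling up."""
    return (current_vcpu + extra_vcpu <= max_vcpu
            and current_memory_gb + extra_memory_gb <= max_memory_gb)


# A cluster at 990 cores can take 8 more cores, but not 16.
print(fits_within_limits(990, 2000, 8, 32))
print(fits_within_limits(990, 2000, 16, 64))
```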