# Scaling Kubernetes Clusters

## Overview

Scaling Kubernetes clusters involves adjusting node counts, adding node pools, or configuring autoscaler settings. In Bluebricks, these changes follow the standard Day 2 workflow: update the [blueprint](https://bluebricks.co/docs/core-concepts/packages/blueprints-overview) inputs that control your cluster configuration and start a new run of the [environment](https://bluebricks.co/docs/core-concepts/environments).

## Prerequisites

* An existing [environment](https://bluebricks.co/docs/core-concepts/environments) with a completed run that includes a Kubernetes cluster
* A blueprint that exposes inputs for cluster configuration (e.g., `node_count`, `min_nodes`, `max_nodes`, `node_pool_vm_size`)
* Access to the [collection](https://bluebricks.co/docs/core-concepts/collections) that the environment belongs to

## How to scale a cluster

{% tabs %}
{% tab title="Git" %}
Update the cluster-related input in your environment manifest file and push the change. If you use [GitOps environments](https://bluebricks.co/docs/core-concepts/environments/gitops-environments), Bluebricks triggers a plan automatically. Review the plan to confirm the change scope before merging.

```yaml
# environment manifest (e.g., k8s-prod.yaml)
inputs:
  node_count: 5
```

See [Managing Configuration on Git](https://bluebricks.co/docs/workflows/bluebricks-git-repository-guide/managing-configuration-on-git) for the full manifest format.
{% endtab %}

{% tab title="Bluebricks app" %}

1. Open the **Environments** page and go to the environment you want to update
2. In the three-dot menu, click **Deploy**
3. Update the cluster-related inputs (e.g., change `node_count` from `3` to `5`)
4. Review the plan to confirm the change scope
5. Click **Deploy**
{% endtab %}

{% tab title="CLI" %}

```bash
bricks install k8s-platform \
  --collection=production \
  --env-slug=k8s-prod \
  --props '{"node_count": 5}'
```

To preview the change without applying it, add `--plan-only`:

```bash
bricks install k8s-platform \
  --collection=production \
  --env-slug=k8s-prod \
  --props '{"node_count": 5}' \
  --plan-only
```

{% endtab %}
{% endtabs %}

## Common cluster changes

### Scale node count up or down

Adjusting the number of nodes in an existing node pool is the most common cluster scaling operation. Update the node count input and start a new run:

```bash
bricks install k8s-platform \
  --collection=staging \
  --env-slug=k8s-staging \
  --props '{"node_count": 5}'
```

Scaling down removes nodes from the pool. Kubernetes drains workloads from the removed nodes and reschedules them on remaining nodes.

{% hint style="warning" %}
When scaling down, make sure the remaining nodes have enough capacity for your workloads. Kubernetes evicts pods from removed nodes, and pods without sufficient resources to reschedule will remain in a Pending state.
{% endhint %}
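
Before scaling down, check whether the remaining nodes have headroom. A minimal sketch with `kubectl` (assumes cluster access is configured; `kubectl top` requires metrics-server):

```bash
# Current CPU/memory usage per node (requires metrics-server)
kubectl top nodes

# Requested vs. allocatable resources per node
kubectl describe nodes | grep -A 8 "Allocated resources"
```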

### Add a new node pool

If the blueprint supports multiple node pools (e.g., through a `node_pools` list input), add a new pool by updating the configuration:

```bash
bricks install k8s-platform \
  --collection=production \
  --env-slug=k8s-prod \
  --props-file=./cluster-config.json
```

<details>

<summary>Example cluster-config.json with multiple node pools</summary>

```json
{
  "node_pools": [
    {
      "name": "system",
      "vm_size": "Standard_D4s_v3",
      "node_count": 3,
      "mode": "System"
    },
    {
      "name": "workload",
      "vm_size": "Standard_D8s_v3",
      "node_count": 5,
      "mode": "User"
    },
    {
      "name": "gpu",
      "vm_size": "Standard_NC6s_v3",
      "node_count": 2,
      "mode": "User"
    }
  ]
}
```

</details>

{% hint style="info" %}
The exact input names, VM sizes, and pool configuration vary by cloud provider and how the blueprint author defined the inputs. The example above uses Azure-specific values for illustration.
{% endhint %}

### Configure autoscaling

Many Kubernetes blueprints expose autoscaler inputs. Update the minimum and maximum node counts to enable or adjust autoscaling:

```bash
bricks install k8s-platform \
  --collection=production \
  --env-slug=k8s-prod \
  --props '{"enable_autoscaler": true, "min_nodes": 3, "max_nodes": 10}'
```

When autoscaling is enabled, the cloud provider's cluster autoscaler manages node count within the defined range. Bluebricks sets the boundaries; the autoscaler handles the runtime scaling.

{% hint style="info" %}
Autoscaler settings are declarative. If the current node count is within the new min/max range, no immediate scaling occurs. The autoscaler adjusts capacity based on pod scheduling demand.
{% endhint %}
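
If the cluster runs the open-source cluster autoscaler (the default on EKS; AKS's managed autoscaler writes a similar status object), you can inspect its state directly. A sketch, assuming the conventional `cluster-autoscaler-status` ConfigMap in `kube-system`:

```bash
# Autoscaler health and recent scale activity
# (ConfigMap name and namespace may vary by provider and Helm chart values)
kubectl describe configmap cluster-autoscaler-status -n kube-system
```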

### Change node VM size

Changing the VM size of an existing node pool typically triggers a rolling replacement of nodes. The behavior varies by cloud provider:

```bash
bricks install k8s-platform \
  --collection=staging \
  --env-slug=k8s-staging \
  --props '{"node_pool_vm_size": "Standard_D8s_v3"}'
```

{% hint style="warning" %}
Changing the VM size of a node pool may cause the entire pool to be replaced. On Azure AKS, this requires creating a new node pool and draining the old one. Always run with `--plan-only` first to understand the impact.
{% endhint %}
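
After the run completes, you can confirm the new size rolled out; cloud providers set the well-known `node.kubernetes.io/instance-type` label on each node:

```bash
# List nodes with their instance/VM size as an extra column
kubectl get nodes -L node.kubernetes.io/instance-type
```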

## Cloud provider behavior

Node count changes are generally applied in place across providers, while VM size changes on existing node pools typically require pool replacement.

<details>

<summary>Azure AKS</summary>

* Node count changes are applied in-place. AKS adds or removes nodes from the pool
* VM size changes require creating a new node pool and deleting the old one (the Terraform `azurerm_kubernetes_cluster_node_pool` resource forces replacement on `vm_size` change)
* Autoscaler configuration changes (`min_count`, `max_count`) are applied in-place
* System node pools require at least one node at all times

</details>

<details>

<summary>AWS EKS</summary>

* Managed node group scaling changes are applied in-place through the Auto Scaling Group
* Instance type changes in managed node groups trigger a rolling update (new nodes are created, old nodes are drained and terminated)
* Cluster autoscaler is deployed as a Helm chart, separate from the EKS node group configuration
* Fargate profiles do not use traditional node scaling

</details>

<details>

<summary>GCP GKE</summary>

* Node pool resize operations are in-place
* Machine type changes require creating a new node pool and migrating workloads
* GKE Autopilot manages node provisioning automatically; you configure resource requests instead of node counts
* Node auto-provisioning creates new node pools based on workload requirements

</details>

## Example: scaling for a traffic spike

A common scenario is scaling up before a known traffic event and scaling back down afterward.

**Baseline configuration** (normal traffic):

```json
{
  "node_pools": [
    { "name": "workload", "vm_size": "Standard_D4s_v3", "node_count": 5 }
  ]
}
```

**Scaled-up configuration** (Black Friday, product launch, etc.):

```json
{
  "node_pools": [
    { "name": "workload", "vm_size": "Standard_D4s_v3", "node_count": 12 }
  ]
}
```

Save the scaled-up configuration (for example, as `cluster-config.json`) and start a new run:

```bash
bricks install k8s-platform \
  --collection=production \
  --env-slug=k8s-prod \
  --props-file=./cluster-config.json
```

After the event, restore the baseline by reverting the configuration and starting a new run, as shown below. For recurring events, consider using [Git-managed manifests](https://bluebricks.co/docs/workflows/bluebricks-git-repository-guide/managing-configuration-on-git) so each scaling change is tracked as a commit you can revert.
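
A sketch of the restore run, assuming `cluster-config.json` has been reverted to the baseline values:

```bash
# Scale back down after the event
bricks install k8s-platform \
  --collection=production \
  --env-slug=k8s-prod \
  --props-file=./cluster-config.json
```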

## What to check after cluster scaling

1. **Run status**: confirm the run completed successfully
2. **Node readiness**: verify all nodes are in `Ready` state (`kubectl get nodes`; see the sketch below)
3. **Pod scheduling**: check for pods in `Pending` state that may need additional capacity
4. **Workload health**: confirm your applications are running and serving traffic
5. **Autoscaler status**: if autoscaling is enabled, verify the autoscaler is operating within the configured range
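
A minimal verification pass with `kubectl`, assuming cluster access is already configured:

```bash
# All nodes should report Ready
kubectl get nodes

# Pods stuck in Pending usually indicate insufficient capacity
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# Recent cluster events, newest last
kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp | tail -n 20
```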
