Self-Hosted Runner

Resolve common issues with the Bluebricks Deployments Controller (BDC) and Bricks Runner, including OOMKilled jobs, stuck tasks, and storage errors.

Quick reference

| Symptom | Cause | Fix |
| --- | --- | --- |
| Runner pods OOMKilled | Insufficient memory limits | Increase bricksRunner.cm.resources.limits.memory |
| Tasks stuck in pending | Controller processing bottleneck | Increase --max-concurrent-reconciles (default: 3) |
| Auth errors in runner logs | API rate limiting or expired token | Verify API connectivity; check credentials |
| Storage errors | Persistent volume full | Increase bbxStorageManager.storage.size |
| Image pull failures | Registry unreachable or wrong tag | Verify image references and registry access |
| Helm upgrade fails | Version mismatch or CRD conflict | Check chart version compatibility |


Runner pods OOMKilled

When a Bricks Runner pod exceeds its memory limit, Kubernetes kills it with an OOMKilled status. This typically happens with large Terraform state files or complex plans.

How to diagnose

kubectl get pods -n <bdc-namespace> | grep -i oom
kubectl describe pod <pod-name> -n <bdc-namespace> | grep -A 5 "Last State"
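
The grep filter above only matches pods whose current STATUS column shows OOMKilled; once a container has restarted, the column usually shows Running or CrashLoopBackOff instead. A minimal sketch that reads each container's last termination reason directly, assuming only standard kubectl jsonpath output. The function names are illustrative and not part of BDC.

```shell
# Print "<pod>\t<last termination reasons>" for every pod in a namespace.
# Usage: last_termination_reasons <bdc-namespace>
last_termination_reasons() {
  kubectl get pods -n "$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'
}

# Read that output on stdin and keep only pods whose previous run ended with OOMKilled.
oomkilled_pods() {
  awk -F'\t' '$2 ~ /OOMKilled/ { print $1 }'
}
```

Running last_termination_reasons <bdc-namespace> | oomkilled_pods lists the affected pods, including ones that have already restarted.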

How to fix

Increase the memory limit in your values.yaml:

bricksRunner:
  cm:
    resources:
      requests:
        memory: "1024Mi"
      limits:
        memory: "2048Mi"

Then upgrade the Helm release to apply the new limits.
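
A minimal sketch of the upgrade step. The release name, chart reference, and namespace are passed as arguments because they depend on how BDC was installed; the function name is illustrative, and only standard Helm flags are used.

```shell
# Apply an edited values file to an existing release.
# Usage: upgrade_bdc <release-name> <chart-reference> <bdc-namespace> [values-file]
upgrade_bdc() {
  release="$1"
  chart="$2"
  namespace="$3"
  values_file="${4:-values.yaml}"

  # --wait blocks until the upgraded resources report ready or the timeout expires.
  helm upgrade "$release" "$chart" \
    --namespace "$namespace" \
    --values "$values_file" \
    --wait --timeout 10m
}
```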

For sizing guidelines by cluster size, see Sizing and Tuning.


Tasks stuck in pending

Tasks may remain in a pending state if the controller is overloaded and cannot process the task queue fast enough.

How to diagnose

Check the BDC controller logs for messages indicating task queue backups or reconciliation delays.
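
One way to scan the controller logs, reusing the label selector from the Need more help? section. The keywords in the filter are assumptions, since the exact log messages depend on the BDC version; adjust them to what your logs actually contain.

```shell
# Print recent controller log lines that mention reconciliation, queueing, or timeouts.
# Usage: controller_backlog_lines <bdc-namespace> [line-count]
controller_backlog_lines() {
  kubectl logs \
    -l app.kubernetes.io/name=bluebricks-deployments-controller \
    -n "$1" --tail="${2:-500}" |
    grep -iE 'reconcil|queue|timeout'   # assumed keywords, not a documented log format
}
```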

How to fix

Increase parallelism by raising --max-concurrent-reconciles in your values.yaml. The default is 3.

This allows BDC to process more tasks in parallel. See the sizing guidelines for recommended values.
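
Before changing the value, it can help to confirm what the controller is currently running with. A sketch that pulls the flag out of the controller pod's container arguments. It assumes the flag is passed in --max-concurrent-reconciles=N form and falls back to the documented default of 3 when the flag is absent; the function names are illustrative.

```shell
# Extract the value of --max-concurrent-reconciles from container args on stdin.
# Prints 3 (the documented default) when the flag is not set explicitly.
reconciles_from_args() {
  value="$(sed -n 's/.*--max-concurrent-reconciles=\([0-9][0-9]*\).*/\1/p' | head -n 1)"
  echo "${value:-3}"
}

# Usage: current_max_reconciles <bdc-namespace>
current_max_reconciles() {
  kubectl get pods -n "$1" \
    -l app.kubernetes.io/name=bluebricks-deployments-controller \
    -o jsonpath='{.items[*].spec.containers[*].args}' |
    reconciles_from_args
}
```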


Auth errors in runner logs

Authentication errors in the runner logs usually indicate that the BDC cannot communicate with the Bluebricks API, or that requests are being rate-limited.

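How to diagnose

Inspect the runner pod logs for authentication failures or rate-limit responses from the Bluebricks API.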

How to fix

  1. Verify API connectivity: confirm the runner can reach api.bluebricks.co on port 443.

  2. Rate limiting: if you see rate-limit responses, increase the polling interval to reduce request volume.

  3. Check credentials: verify that the BDC service account token or API key is valid and has not expired.
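
The connectivity check can be sketched with curl. The probe below only establishes whether an HTTPS connection to api.bluebricks.co succeeds; it sends no credentials, so the response code says nothing about token validity. Run it from a pod or host that shares the runner's network path. The function names are illustrative.

```shell
# Print the HTTP status code of an unauthenticated probe, or 000 on connection failure.
probe_api() {
  curl -sS -o /dev/null -w '%{http_code}' --max-time 10 "https://api.bluebricks.co" 2>/dev/null || true
}

# Translate a status code into a coarse diagnosis.
interpret_status() {
  case "$1" in
    000) echo "unreachable: DNS, TCP, or TLS failed" ;;
    429) echo "reachable, but rate limited" ;;
    *)   echo "reachable (HTTP $1)" ;;
  esac
}
```

Calling interpret_status "$(probe_api)" prints the diagnosis for the current network path.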


Storage errors

Storage errors occur when the persistent volume used by the storage manager is full.

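How to diagnose

Check the persistent volume claim used by the storage manager and how much of its capacity is in use.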
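
One way to check how full the volume is, as a sketch. The pod name and mount path are placeholders that you need to look up in your installation (for example with kubectl describe pod), and the check assumes the container image ships the df utility.

```shell
# List the persistent volume claims and their requested capacity.
# Usage: list_pvcs <bdc-namespace>
list_pvcs() {
  kubectl get pvc -n "$1"
}

# Read `df -P` output on stdin and print the use percentage of the first filesystem.
df_use_percent() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Usage: volume_use_percent <bdc-namespace> <storage-manager-pod> <mount-path>
volume_use_percent() {
  kubectl exec -n "$1" "$2" -- df -P "$3" | df_use_percent
}
```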

How to fix

Increase bbxStorageManager.storage.size in your values.yaml, then upgrade the Helm release to apply the change.
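
A minimal sketch of this change that overrides the key on the command line instead of editing values.yaml, using standard Helm flags. The release name, chart reference, and size are placeholders, and the function name is illustrative. Kubernetes can only grow an existing PersistentVolumeClaim when its StorageClass has allowVolumeExpansion enabled, so whether the volume is resized in place depends on your cluster.

```shell
# Usage: resize_bdc_storage <release-name> <chart-reference> <bdc-namespace> <size>
resize_bdc_storage() {
  # --reuse-values keeps the values from the previous release and applies only this override.
  helm upgrade "$1" "$2" \
    --namespace "$3" \
    --reuse-values \
    --set "bbxStorageManager.storage.size=$4"
}
```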


Image pull failures

If runner pods fail to start with ImagePullBackOff or ErrImagePull, the container images cannot be downloaded.

Common causes

  • Registry unreachable: the cluster cannot reach the image registry. The default Helm chart uses europe-docker.pkg.dev/bbx-registry-prod/public-oci/

  • Wrong image tag: the specified version does not exist

  • Private registry without credentials: if you mirror images to a private registry, pull secrets may be missing

How to fix

  1. Check which image the pod is trying to pull. BDC uses two different images: the controller image (bdctl), which runs the operator, and the runner image (bricks), which executes IaC operations. Identify which of the two is failing to pull.

  2. If you use a private registry, configure image pull secrets in your values.yaml.
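
To find out which of the two images is failing, the waiting reason of each container can be listed next to its image reference. A sketch assuming the final path segment of the image reference starts with bdctl for the controller and bricks for the runner, as the image names above suggest; the exact repository paths may differ in your installation, and the function names are illustrative.

```shell
# Print "<image>\t<waiting reason>" for every container in the namespace.
# Usage: container_wait_reasons <bdc-namespace>
container_wait_reasons() {
  kubectl get pods -n "$1" \
    -o jsonpath='{range .items[*]}{range .status.containerStatuses[*]}{.image}{"\t"}{.state.waiting.reason}{"\n"}{end}{end}'
}

# Read that output on stdin, keep images stuck on a pull error, and label each one.
failing_images() {
  awk -F'\t' '$2 == "ImagePullBackOff" || $2 == "ErrImagePull" {
    n = split($1, parts, "/")        # the last path segment holds name:tag
    kind = "unknown"
    if (parts[n] ~ /^bdctl/) kind = "controller"
    else if (parts[n] ~ /^bricks/) kind = "runner"
    print kind "\t" $1
  }'
}
```

Running container_wait_reasons <bdc-namespace> | failing_images prints one line per failing image, prefixed with controller, runner, or unknown.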


Helm upgrade fails

Helm upgrades can fail due to version incompatibilities, CRD conflicts, or invalid values.

Common causes

  • CRD already exists: a previous installation left CRDs that conflict with the new version

  • Invalid values.yaml: a field was renamed or removed in the new chart version

  • Pending Helm release: a previous upgrade was interrupted, leaving the release in a bad state

How to fix

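  1. Check the current release status, for example with helm status.

  2. If the release is stuck in a pending state, roll it back to the last successful revision, for example with helm rollback.

  3. For CRD conflicts, list the installed CRDs, for example with kubectl get crds.

  4. Review the chart's release notes for breaking changes before upgrading.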
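
The checks above map onto standard Helm and kubectl commands. A sketch with the release name and namespace as placeholders; the bluebricks filter on CRD names is an assumption about how the BDC CRDs are named, and the function names are illustrative.

```shell
# Print the STATUS field of a release, e.g. deployed, failed, or pending-upgrade.
# Usage: release_status <release-name> <bdc-namespace>
release_status() {
  helm status "$1" --namespace "$2" | awk -F': ' '/^STATUS:/ { print $2; exit }'
}

# Succeed when a status is one of Helm's pending states.
is_pending() {
  case "$1" in
    pending-install|pending-upgrade|pending-rollback) return 0 ;;
    *) return 1 ;;
  esac
}

# Roll a stuck release back to its previous revision.
# Usage: recover_release <release-name> <bdc-namespace>
recover_release() {
  if is_pending "$(release_status "$1" "$2")"; then
    helm rollback "$1" --namespace "$2"
  fi
}

# List installed CRDs whose name mentions bluebricks (assumed naming).
list_bdc_crds() {
  kubectl get crds | grep -i bluebricks
}
```

A release stuck in pending-install has no earlier revision to roll back to, so the rollback will fail in that case and the release has to be removed and installed again.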

For the full list of configurable values, see the BDC Helm Chart reference.


Sizing guidelines

| Cluster size | Concurrent workers | Runner memory | Storage |
| --- | --- | --- | --- |
| Small (< 10 nodes) | 2-3 | 2048Mi | 10Gi |
| Medium (10-50 nodes) | 4-8 | 2048Mi | 30Gi |
| Large (50+ nodes) | 8-16 | 4096Mi | 100Gi+ |
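
The sizing table can be turned into a small lookup, shown here as a sketch. A cluster of exactly 50 nodes matches both the Medium and Large rows; this helper treats it as Large, which is an assumption rather than documented behavior. The function name and output format are illustrative.

```shell
# Print the recommended settings from the sizing table for a given node count.
# Usage: recommend_sizing <node-count>
recommend_sizing() {
  nodes="$1"
  if [ "$nodes" -lt 10 ]; then
    echo "workers=2-3 runner_memory=2048Mi storage=10Gi"
  elif [ "$nodes" -lt 50 ]; then
    echo "workers=4-8 runner_memory=2048Mi storage=30Gi"
  else
    # Exactly 50 nodes is ambiguous in the table; treated as Large here.
    echo "workers=8-16 runner_memory=4096Mi storage=100Gi+"
  fi
}
```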


Need more help?

  1. Check BDC logs: kubectl logs -f -l app.kubernetes.io/name=bluebricks-deployments-controller -n <bdc-namespace>

  2. Review the Self-Hosted Runner documentation for installation and configuration details

  3. Contact support with your Helm chart version, values.yaml, and relevant pod logs
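
The details listed in step 3 can be gathered in one pass. A sketch using standard Helm and kubectl commands; the release name, namespace, output directory, and function name are placeholders. Review the collected values for secrets before sharing them.

```shell
# Collect chart version, release values, and controller logs into a directory.
# Usage: collect_support_bundle <release-name> <bdc-namespace> [output-dir]
collect_support_bundle() {
  release="$1"
  namespace="$2"
  out_dir="${3:-bdc-support}"
  mkdir -p "$out_dir"

  # The CHART column of helm list includes the chart version.
  helm list --namespace "$namespace" > "$out_dir/helm-list.txt"
  # User-supplied values for the release.
  helm get values "$release" --namespace "$namespace" > "$out_dir/values.yaml"
  # Recent controller logs, using the label selector from step 1.
  kubectl logs -l app.kubernetes.io/name=bluebricks-deployments-controller \
    --namespace "$namespace" --tail=1000 > "$out_dir/controller.log"
}
```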
