GitOps with Flux and Helm on AKS using Azure DevOps

33 minute read

Over the past few months I have been building and operating a GitOps platform for Kubernetes clusters running on Azure. The driving motivation was a shift away from push-based CI/CD pipelines — where pipelines hold cluster credentials and push changes with kubectl apply — toward a pull-based model where an in-cluster operator continuously reconciles the live state against the state declared in Git. This post is a practical walkthrough of how I implemented that shift using Flux CD, the Flux Operator, Helm, and Azure DevOps.

If you are already familiar with basic GitOps concepts and want to see how things connect in a production-grade setup — including image scanning, signing, private registry mirroring, and a multi-environment onboarding model — this is the post for you.

Why GitOps?

GitOps is a way of managing infrastructure and application delivery where Git is the single source of truth. Any desired state change is made via a pull request, and an in-cluster operator continuously reconciles the live state against what Git says it should be. The key properties, as defined by the OpenGitOps principles, are:

  • Declarative — the entire desired state is described declaratively
  • Versioned and immutable — Git history is the canonical audit trail
  • Pulled automatically — an agent in the cluster pulls state from Git (no external push into the cluster)
  • Continuously reconciled — if live state drifts from desired state, the agent corrects it

For Kubernetes this model fits perfectly. Every resource is a YAML manifest, Git provides versioning and review workflows via pull requests, and controllers like Flux watch the repository and apply changes automatically. The alternative — pushing manifests from a CI/CD pipeline with kubectl apply — requires the pipeline to have cluster credentials, creates a push-based model that bypasses drift detection, and gives you no built-in reconciliation if someone applies something directly to the cluster.

With the “why” clear, let me walk through how these principles are put into practice: what the repository looks like, how the pipeline fits in, and where Flux takes over.

Architecture Overview

Flux Architecture

Flux CD is a set of Kubernetes controllers, each responsible for a specific piece of the GitOps loop. Understanding what each controller does makes the rest of the setup much easier to reason about:

  • source-controller — watches Git repositories, Helm repositories, OCI registries, and S3-compatible buckets. It fetches content and makes it available as an in-cluster artifact that other controllers consume. This is the only controller that talks to external sources.
  • kustomize-controller — watches Kustomization CRDs and applies the manifests they point to (plain YAML, Kustomize overlays, or files that happen to contain HelmRelease objects). It handles decryption, dependency ordering, health checks, and pruning of removed resources.
  • helm-controller — watches HelmRelease CRDs and manages the full Helm lifecycle: install, upgrade, rollback, and uninstall. It reads chart artifacts from source-controller and renders them server-side.
  • notification-controller — handles inbound webhooks (to trigger reconciliation) and outbound event notifications (Slack, Teams, GitHub commit status, etc.).
  • image-reflector-controller and image-automation-controller — together they implement image update automation: scanning a registry for new tags and writing the updated tag back to Git.

These controllers communicate through the Kubernetes API only; they never talk directly to each other. source-controller produces Artifact objects; kustomize-controller and helm-controller consume them. This loose coupling means you can enable only the controllers you need.
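
To make the artifact contract concrete, here is roughly what a reconciled GitRepository looks like on the cluster (field values are illustrative — the commit SHA and timestamps are placeholders):

apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: k8s-gitops
  namespace: flux-system
status:
  artifact:
    path: gitrepository/flux-system/k8s-gitops/<commit-sha>.tar.gz
    revision: experimental@sha1:<commit-sha>
    url: http://source-controller.flux-system.svc.cluster.local/gitrepository/flux-system/k8s-gitops/<commit-sha>.tar.gz
    lastUpdateTime: "2025-01-01T00:00:00Z"

kustomize-controller and helm-controller fetch that tarball from the in-cluster URL; they never clone Git themselves.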

Platform Layout

The platform manages multiple Kubernetes environments. The branch-to-environment mapping is straightforward:

Git Branch     Environment
experimental   experimental
dev            dev
master         prod

Any push to these branches triggers an Azure DevOps pipeline that handles the infrastructure layer: scanning and mirroring Flux controller images to a private Azure Container Registry (ACR), deploying the Flux Operator via Helm, and applying the FluxInstance custom resource that tells Flux how to configure itself. Once Flux is running, it takes over and continuously reconciles everything under clusters/<env>/systems/. No pipeline trigger is required for day-to-day system changes.

Git Push / PR
     │
     ├── clusters/*/flux-instance.yaml           clusters/*/systems/** (PR only)
     │   clusters/*/flux-operator-values.yaml              │
     │   flux-operator/**                                  │
     ▼                                                     ▼
azure-pipelines.yaml                        azure-pipelines-systems.yaml
     │                                                     │
     ├── [On PR] Validate stage                [On PR] Validate stage
     │     ├── Scan Flux images (Trivy)            └── kubectl apply --dry-run=server
     │     ├── Helm dry-run of Flux Operator              clusters/<env>/systems/
     │     └── kubectl dry-run of FluxInstance
     │
     └── [On push] Apply stage
           ├── Scan images (Trivy)
           ├── Mirror images to private ACR (crane)
           ├── Sign images (Cosign)
           ├── Helm upgrade --install Flux Operator
           └── kubectl apply FluxInstance
                     │
                     ▼
              Flux Operator reconciles FluxInstance
                     │
                     ▼
              Flux controllers deployed in cluster
                     │
                     ▼
              Flux syncs clusters/<env>/systems/ from Git
                     │
                     ├── Reconciles GitRepository sources (system repos)
                     └── Reconciles Kustomization → HelmRelease or plain YAML

Repository Structure

The k8s-gitops repository is the single source of truth for the platform layer. Its layout reflects the branch-per-environment model directly:

k8s-gitops/
├── azure-pipelines.yaml               # Main pipeline — triggers and environment wiring
├── azure-pipelines-systems.yaml       # Systems manifest validation pipeline — PR-only
├── ci-cd-templates/                   # Reusable Azure DevOps pipeline templates
│   ├── flux.yaml                      # Top-level: validate (on PR) + apply (on push)
│   ├── flux-operator.yaml             # Helm deploy/upgrade of Flux Operator
│   ├── flux-instance.yaml             # kubectl apply of FluxInstance CRD
│   ├── flux-images.yaml               # Mirror → scan → sign each Flux image
│   ├── scan-image.yaml                # Trivy vulnerability scan step
│   ├── sign-image.yaml                # Cosign image signing step
│   └── systems-validate.yaml          # kubectl dry-run validation of systems/ manifests
├── clusters/
│   ├── experimental/
│   │   ├── flux-instance.yaml         # FluxInstance spec for this environment
│   │   ├── flux-operator-values.yaml  # Helm values for Flux Operator
│   │   └── systems/                   # Flux-managed resources (reconciled by Flux, not the pipeline)
│   │       └── <system>.yaml          # One file per system — GitRepository + Kustomization
│   ├── dev/
│   └── prod/
└── flux-operator/                     # Helm chart for the Flux Operator (vendored)

The key separation is between clusters/<env>/flux-instance.yaml and clusters/<env>/systems/:

  • flux-instance.yaml and flux-operator-values.yaml are managed by the Azure DevOps pipeline — they define Flux itself.
  • Everything under systems/ is managed by Flux directly — the pipeline never touches it.

Flux Operator and FluxInstance CRD

One of the first architectural decisions was how to install and lifecycle-manage Flux itself. The traditional approach — flux bootstrap — generates YAML and commits it to your repo. It works, but upgrading Flux means re-running bootstrap or hand-editing generated files. There is no declarative upgrade path.

The Flux Operator solves this. It introduces a FluxInstance CRD that lets you declare Flux’s desired state the same way you declare anything else in Kubernetes. The operator reads the spec and takes care of deploying, configuring, and upgrading the Flux controllers. You never run flux bootstrap again.

Installing the Flux Operator is a single Helm command:

helm upgrade --install flux-operator ./flux-operator \
  --namespace flux-system \
  --create-namespace \
  -f ./clusters/<env>/flux-operator-values.yaml

Once the operator is running, you apply the FluxInstance manifest:

apiVersion: fluxcd.controlplane.io/v1
kind: FluxInstance
metadata:
  name: flux
  namespace: flux-system
spec:
  distribution:
    version: "2.8.3" # Flux version to deploy
    registry: "myacr.azurecr.io/fluxcd" # Private ACR where images are mirrored
    variant: "upstream-alpine" # Required when using a private registry
  components:
    - source-controller
    - kustomize-controller
    - helm-controller
    - notification-controller
    - image-reflector-controller
    - image-automation-controller
  sync:
    kind: GitRepository
    provider: azure # Azure DevOps authentication
    url: "https://dev.azure.com/my-org/my-project/_git/k8s-gitops"
    ref: "refs/heads/experimental"
    path: "clusters/experimental/systems" # Flux watches this path
  kustomize:
    patches: [...] # Workload identity patches (see below)

The spec.distribution block is the key part. When you increment the version number and apply the manifest, the Flux Operator reconciles the change and upgrades all controllers automatically. No manual rollout, no bootstrap re-run. The operator also handles the case where the controllers drift from the declared state and corrects them.

All six Flux controllers are deployed: source-controller, kustomize-controller, helm-controller, notification-controller, image-reflector-controller, and image-automation-controller.

Azure Workload Identity — Secretless Git Authentication

Flux’s source-controller needs to pull from Azure DevOps Git repositories. The naive approach is to create a Personal Access Token (PAT), store it as a Kubernetes secret, and reference it in the GitRepository spec. This works, but PATs expire, need rotation, and leave a long-lived credential sitting in the cluster.

The better approach is Azure Workload Identity. The idea is to federate a Kubernetes service account with an Azure Managed Identity. When the source-controller pod authenticates to Azure DevOps, it uses the federated credential, with no secrets stored anywhere.

The setup involves three steps:

1. Create a Managed Identity with a federated credential:

# Create the managed identity
az identity create \
  --name aks-dev-flux-source-controller \
  --resource-group my-cluster-rg \
  --subscription "my-subscription"

# Create the federated credential linking it to the source-controller service account
az identity federated-credential create \
  --name aks-dev-flux-federated-credential \
  --identity-name aks-dev-flux-source-controller \
  --resource-group my-cluster-rg \
  --issuer <AKS_OIDC_ISSUER_URL> \
  --subject system:serviceaccount:flux-system:source-controller

2. Grant the identity read access to your Azure DevOps project — a project administrator adds it to the Readers group in Project Settings → Permissions.

3. Patch the source-controller service account and deployment via FluxInstance.spec.kustomize.patches:

kustomize:
  patches:
    - patch: |-
        apiVersion: v1
        kind: ServiceAccount
        metadata:
          name: source-controller
          annotations:
            azure.workload.identity/client-id: <AZURE_CLIENT_ID>
            azure.workload.identity/tenant-id: <AZURE_TENANT_ID>
      target:
        kind: ServiceAccount
        name: source-controller
    - patch: |-
        apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: source-controller
        spec:
          template:
            metadata:
              labels:
                azure.workload.identity/use: "true"
      target:
        kind: Deployment
        name: source-controller

The Flux Operator applies these patches when it deploys the controllers, so the workload identity configuration is fully declarative and version-controlled alongside everything else.
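
A quick way to confirm the patches landed once the controllers are up (assuming the standard app: source-controller pod label that Flux ships with):

kubectl -n flux-system get serviceaccount source-controller \
  -o jsonpath='{.metadata.annotations}'
# expect the azure.workload.identity/client-id and tenant-id annotations

kubectl -n flux-system get pod -l app=source-controller \
  -o jsonpath='{.items[0].spec.containers[0].env}'
# the workload identity webhook should have injected AZURE_CLIENT_ID,
# AZURE_TENANT_ID and AZURE_FEDERATED_TOKEN_FILE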

CI/CD Pipeline: Images, Scanning, Signing

The entry point for the whole Flux infrastructure pipeline is azure-pipelines.yaml. It triggers on pushes to the three environment branches (or on PRs touching those paths) and calls the flux.yaml template once per environment, passing the image list and environment-specific parameters:

# azure-pipelines.yaml
name: k8s-gitops-flux

trigger:
  branches:
    include:
      - experimental
      - master
      - dev
  paths:
    include:
      - clusters/*/flux-instance.yaml
      - clusters/*/flux-operator-values.yaml
      - flux-operator/*

pool: my-linux-agents

parameters:
  - name: helmReleaseName
    default: "flux-operator"
  - name: helmReleaseNamespace
    default: "flux-system"
  - name: helmChartPath
    default: "./flux-operator"
  - name: helmCreateNamespace
    type: boolean
    default: true

stages:
  # Experimental cluster CI/CD
  - template: ci-cd-templates/flux.yaml
    parameters:
      environment: "experimental"
      branch: "experimental"
      serviceConnection: "k8s-rbac-experimental-ado-sc"
      helmReleaseName: ${{ parameters.helmReleaseName }}
      helmReleaseNamespace: ${{ parameters.helmReleaseNamespace }}
      helmChartPath: ${{ parameters.helmChartPath }}
      helmValuesFilePath: "./clusters/experimental/flux-operator-values.yaml"
      helmCreateNamespace: ${{ parameters.helmCreateNamespace }}
      containerImages:
        - ghcr.io/controlplaneio-fluxcd/flux-operator:v0.45.1
        - ghcr.io/fluxcd/helm-controller:v1.5.3
        - ghcr.io/fluxcd/image-automation-controller:v1.1.1
        - ghcr.io/fluxcd/image-reflector-controller:v1.1.1
        - ghcr.io/fluxcd/kustomize-controller:v1.8.2
        - ghcr.io/fluxcd/notification-controller:v1.8.2
        - ghcr.io/fluxcd/source-controller:v1.8.1

  # Dev cluster CI/CD
  - template: ci-cd-templates/flux.yaml
    parameters:
      environment: "dev"
      branch: "dev"
      serviceConnection: "k8s-rbac-dev-ado-sc"
      helmReleaseName: ${{ parameters.helmReleaseName }}
      helmReleaseNamespace: ${{ parameters.helmReleaseNamespace }}
      helmChartPath: ${{ parameters.helmChartPath }}
      helmValuesFilePath: "./clusters/dev/flux-operator-values.yaml"
      helmCreateNamespace: ${{ parameters.helmCreateNamespace }}
      containerImages:
        - ghcr.io/controlplaneio-fluxcd/flux-operator:v0.45.1
        - ghcr.io/fluxcd/helm-controller:v1.5.3
        - ghcr.io/fluxcd/image-automation-controller:v1.1.1
        - ghcr.io/fluxcd/image-reflector-controller:v1.1.1
        - ghcr.io/fluxcd/kustomize-controller:v1.8.2
        - ghcr.io/fluxcd/notification-controller:v1.8.2
        - ghcr.io/fluxcd/source-controller:v1.8.1

  # Production cluster CI/CD
  - template: ci-cd-templates/flux.yaml
    parameters:
      environment: "prod"
      branch: "master"
      serviceConnection: "k8s-rbac-prod-ado-sc"
      helmReleaseName: ${{ parameters.helmReleaseName }}
      helmReleaseNamespace: ${{ parameters.helmReleaseNamespace }}
      helmChartPath: ${{ parameters.helmChartPath }}
      helmValuesFilePath: "./clusters/prod/flux-operator-values.yaml"
      helmCreateNamespace: ${{ parameters.helmCreateNamespace }}
      containerImages:
        - ghcr.io/controlplaneio-fluxcd/flux-operator:v0.45.1
        - ghcr.io/fluxcd/helm-controller:v1.5.3
        - ghcr.io/fluxcd/image-automation-controller:v1.1.1
        - ghcr.io/fluxcd/image-reflector-controller:v1.1.1
        - ghcr.io/fluxcd/kustomize-controller:v1.8.2
        - ghcr.io/fluxcd/notification-controller:v1.8.2
        - ghcr.io/fluxcd/source-controller:v1.8.1

The containerImages list is what the flux-images.sh helper script (shown later in this post) generates — on every Flux upgrade you replace this block, and the rest of the pipeline definition stays untouched.

One of the things that took the most thought was the image management story. Flux controller images come from ghcr.io/fluxcd (upstream) or ghcr.io/controlplaneio-fluxcd (enterprise). Pulling images directly from public registries in production clusters is a risk: rate limits, availability dependencies, no supply chain verification. The solution is to mirror all images to a private ACR on every release.

The pipeline handles this for each image:

1. Scan with Trivy — before copying anything, the source image is scanned for vulnerabilities. In audit mode the pipeline logs findings without failing; switching to enforce mode fails the pipeline on HIGH or CRITICAL CVEs. The Trivy database itself is hosted in private ACR to avoid GHCR rate limiting (I wrote about this in a previous post).

2. Mirror with crane — crane copy copies the image directly between registries without pulling it locally. Critically, it preserves the full OCI manifest index, so the digest in ACR is identical to the upstream digest. This matters for Cosign signature verification — the signature is tied to the digest, and the Flux Operator references controller images by digest by default, so the mirrored image must keep the upstream digest for the signature to remain valid. If signature verification fails, the pipeline fails. The same signature verification is enforced in the cluster before a workload is allowed to run, thanks to Kyverno.

crane copy \
  ghcr.io/fluxcd/source-controller:v1.8.1 \
  myacr.azurecr.io/fluxcd/source-controller:v1.8.1
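
Since everything downstream hinges on the digests matching, a quick sanity check with crane makes the guarantee visible (not part of the pipeline templates, just a habit worth having):

SRC_DIGEST=$(crane digest ghcr.io/fluxcd/source-controller:v1.8.1)
DST_DIGEST=$(crane digest myacr.azurecr.io/fluxcd/source-controller:v1.8.1)
if [ "$SRC_DIGEST" = "$DST_DIGEST" ]; then
  echo "Digest preserved: $SRC_DIGEST"
else
  echo "Digest mismatch: the manifest was rewritten during the copy" >&2
  exit 1
fi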

3. Sign with Cosign — after copying, the image is signed using a key stored in Azure Key Vault. The signature is pushed to ACR alongside the image. Signing only runs on push (not on PRs), and after signing the pipeline verifies the signature before proceeding.

All of this runs for every image in the containerImages list defined in azure-pipelines.yaml. When upgrading Flux, a helper script generates the updated image list:

./scripts/flux-images.sh v2.8.3 v0.45.1 ./clusters/experimental/flux-operator-values.yaml

The script uses the flux CLI and yq to produce a ready-to-paste containerImages block. You replace the list in azure-pipelines.yaml, update spec.distribution.version in each FluxInstance, open a PR to experimental, validate, then promote to dev and master.

The pipeline templates are split into focused reusable files:

Template                Purpose
flux.yaml               Top-level orchestration — delegates to validate or apply based on trigger
flux-validate.yaml      PR: scan images, Helm dry-run, kubectl dry-run of FluxInstance
flux-apply.yaml         Push: scan + mirror + sign images, Helm upgrade, apply FluxInstance
flux-images.yaml        Processes each image: Trivy → crane → Cosign
flux-operator.yaml      Helm upgrade/install of the Flux Operator
flux-instance.yaml      Apply FluxInstance, wait for Ready condition
common-tools.yaml       Install kubectl, kubelogin, yq, trivy, crane, cosign, helm — with SHA256 verification
systems-validate.yaml   PR-only: kubectl apply --dry-run=server of systems manifests

Every tool installation in common-tools.yaml downloads from the official release URL and verifies the SHA256 checksum before installing. If the checksum does not match, the step fails immediately. A small but important supply chain control.

Here is how the three core image pipeline templates look in practice.

ci-cd-templates/flux-images.yaml — iterates over the image list and calls scan → copy → sign for each one:

parameters:
  - name: containerImages
    type: object
    default: []
  - name: containerRegistryName
    default: ""
  - name: containerRegistryServiceConnection
    default: ""
  - name: scanMode
    default: "audit" # switch to 'enforce' to fail on HIGH/CRITICAL
  - name: scanImages
    type: boolean
    default: false
  - name: signImages
    type: boolean
    default: false
  - name: environment
    default: ""
  - name: imageSignerServiceConnection
    default: ""

steps:
  - task: Docker@2
    displayName: "Login to Container Registry"
    inputs:
      command: login
      containerRegistry: ${{ parameters.containerRegistryServiceConnection }}

  - ${{ each image in parameters.containerImages }}:
      - bash: |
          IMAGE_PATH=$(echo "${{ image }}" | cut -d'/' -f2-)
          TARGET_IMAGE="${{ parameters.containerRegistryName }}/$IMAGE_PATH"
          echo "##vso[task.setvariable variable=TARGET_IMAGE;]$TARGET_IMAGE"
        displayName: "Prepare target image name ${{ image }}"

      - ${{ if eq(parameters.scanImages, true) }}:
          - template: scan-image.yaml
            parameters:
              image: ${{ image }}
              containerRegistryName: ${{ parameters.containerRegistryName }}
              scanMode: ${{ parameters.scanMode }}

      - bash: |
          crane copy ${{ image }} $(TARGET_IMAGE)
        displayName: "Copy ${{ image }}"
        condition: and(succeeded(), not(startsWith(variables['Build.SourceBranch'], 'refs/pull')))

      - ${{ if eq(parameters.signImages, true) }}:
          - template: sign-image.yaml
            parameters:
              image: $(TARGET_IMAGE)
              serviceConnection: ${{ parameters.imageSignerServiceConnection }}
              environment: ${{ parameters.environment }}

  - task: Docker@2
    displayName: "Logout of Container Registry"
    condition: always()
    inputs:
      command: logout
      containerRegistry: ${{ parameters.containerRegistryServiceConnection }}

ci-cd-templates/scan-image.yaml — runs Trivy with two modes: audit (log only) and enforce (fail on HIGH/CRITICAL):

parameters:
  - name: image
    type: string
    default: ""
  - name: containerRegistryName
    default: ""
  - name: scanMode
    type: string
    default: "audit"

steps:
  - bash: |
      trivy image \
        --scanners vuln \
        --ignore-unfixed \
        --pkg-types os \
        --exit-code 0 \
        --db-repository=${{ parameters.containerRegistryName }}/trivy/trivy-db:2 \
        --java-db-repository=${{ parameters.containerRegistryName }}/trivy/trivy-java-db:1 \
        ${{ parameters.image }}
    condition: eq('${{ parameters.scanMode }}', 'audit')
    displayName: "Scan ${{ parameters.image }} (audit)"

  - bash: |
      # LOW/MEDIUM — informational only
      trivy image --exit-code 0 --severity LOW,MEDIUM \
        --ignore-unfixed --pkg-types os \
        --db-repository=${{ parameters.containerRegistryName }}/trivy/trivy-db:2 \
        ${{ parameters.image }}

      # HIGH/CRITICAL — fail the pipeline
      trivy image --exit-code 1 --severity HIGH,CRITICAL \
        --ignore-unfixed --pkg-types os \
        --db-repository=${{ parameters.containerRegistryName }}/trivy/trivy-db:2 \
        ${{ parameters.image }}
    condition: eq('${{ parameters.scanMode }}', 'enforce')
    displayName: "Scan ${{ parameters.image }} (enforce)"

ci-cd-templates/sign-image.yaml — resolves the digest via az acr, signs with cosign using an Azure Key Vault key, then immediately verifies the signature:

parameters:
  - name: image
    type: string
    default: ""
  - name: serviceConnection
    type: string
    default: ""
  - name: environment
    type: string
    default: ""

steps:
  - task: AzureCLI@2
    displayName: "Sign ${{ parameters.image }}"
    inputs:
      azureSubscription: "${{ parameters.serviceConnection }}"
      scriptType: bash
      scriptLocation: inlineScript
      inlineScript: |
        REGISTRY=$(echo "${{ parameters.image }}" | cut -d'/' -f1)
        REPOSITORY=$(echo "${{ parameters.image }}" | cut -d'/' -f2- | cut -d':' -f1)
        TAG=$(echo "${{ parameters.image }}" | grep -o ':[^@]*$' | cut -d':' -f2)

        IMAGE_DIGEST=$(az acr repository show \
          --name $REGISTRY \
          --image $REPOSITORY:$TAG \
          --query "digest" -o tsv)

        if [ "${{ parameters.environment }}" == "prod" ]; then
          KV_NAME="my-prod-keyvault"
          KV_KEY="my-prod-cosign-key"
        else
          KV_NAME="my-dev-keyvault"
          KV_KEY="my-dev-cosign-key"
        fi

        cosign sign \
          --key azurekms://$KV_NAME.vault.azure.net/$KV_KEY \
          $REGISTRY/$REPOSITORY@$IMAGE_DIGEST \
          --upload=true --yes=true --tlog-upload=false

  - task: AzureCLI@2
    displayName: "Verify ${{ parameters.image }} signature"
    inputs:
      azureSubscription: "${{ parameters.serviceConnection }}"
      scriptType: bash
      scriptLocation: inlineScript
      inlineScript: |
        if [ "${{ parameters.environment }}" == "prod" ]; then
          SECRET_NAME="cosign-public-key-prod"
          KV_NAME="my-prod-keyvault"
        else
          SECRET_NAME="cosign-public-key-dev"
          KV_NAME="my-dev-keyvault"
        fi

        az keyvault secret show \
          --vault-name $KV_NAME --name $SECRET_NAME \
          --query value -o tsv > cosign.pub

        cosign verify \
          --key cosign.pub \
          ${{ parameters.image }} \
          --private-infrastructure=true

ci-cd-templates/flux.yaml — top-level orchestrator: routes to validate on PRs and to apply on push:

parameters:
  - name: branch
    default: ""
  - name: environment
    default: ""
  - name: serviceConnection
    default: ""
  - name: containerImages
    type: object
    default: []
  - name: scanMode
    default: "audit"
    values: [audit, enforce]
  - name: helmReleaseName
    default: ""
  - name: helmReleaseNamespace
    default: ""
  - name: helmChartPath
    default: ""
  - name: helmValuesFilePath
    default: ""
  - name: helmArgs
    type: object
    default: []
  - name: helmCreateNamespace
    type: boolean
    default: true

stages:
  - stage: "ValidateFluxConfiguration_in_${{ parameters.environment }}_cluster"
    displayName: Validate Flux configuration in ${{ parameters.environment }} cluster
    condition: startsWith(variables['Build.SourceBranch'], 'refs/pull')
    jobs:
      - template: flux-validate.yaml
        parameters:
          environment: ${{ parameters.environment }}
          serviceConnection: ${{ parameters.serviceConnection }}
          containerImages: ${{ parameters.containerImages }}
          scanImages: true
          scanMode: ${{ parameters.scanMode }}
          signImages: false
          helmReleaseName: ${{ parameters.helmReleaseName }}
          helmReleaseNamespace: ${{ parameters.helmReleaseNamespace }}
          helmChartPath: ${{ parameters.helmChartPath }}
          helmValuesFilePath: ${{ parameters.helmValuesFilePath }}
          helmArgs: ${{ parameters.helmArgs }}
          helmCreateNamespace: ${{ parameters.helmCreateNamespace }}

  - stage: "ApplyFluxConfiguration_to_${{ parameters.environment }}_cluster"
    displayName: Apply Flux configuration to ${{ parameters.environment }} cluster
    condition: eq(variables['Build.SourceBranchName'], '${{ parameters.branch }}')
    jobs:
      - template: flux-apply.yaml
        parameters:
          environment: ${{ parameters.environment }}
          serviceConnection: ${{ parameters.serviceConnection }}
          containerImages: ${{ parameters.containerImages }}
          scanImages: true
          scanMode: ${{ parameters.scanMode }}
          signImages: true
          helmReleaseName: ${{ parameters.helmReleaseName }}
          helmReleaseNamespace: ${{ parameters.helmReleaseNamespace }}
          helmChartPath: ${{ parameters.helmChartPath }}
          helmValuesFilePath: ${{ parameters.helmValuesFilePath }}
          helmArgs: ${{ parameters.helmArgs }}
          helmCreateNamespace: ${{ parameters.helmCreateNamespace }}

ci-cd-templates/flux-validate.yaml — PR validation job: scan images, Helm dry-run, FluxInstance dry-run. Note the per-environment ACR and service connection variables resolved at job level:

jobs:
  - job: ValidateFluxConfiguration
    displayName: Validate Flux configuration in ${{ parameters.environment }} cluster
    variables:
      - name: containerRegistryName
        ${{ if eq(parameters.environment, 'prod') }}:
          value: myacr-prod.azurecr.io
        ${{ else }}:
          value: myacr-dev.azurecr.io
      - name: containerRegistryServiceConnection
        ${{ if eq(parameters.environment, 'prod') }}:
          value: acr-prod-sc
        ${{ else }}:
          value: acr-dev-sc
      - name: imageSignerServiceConnection
        ${{ if eq(parameters.environment, 'prod') }}:
          value: image-signing-prod-sc
        ${{ else }}:
          value: image-signing-dev-sc
    steps:
      - template: common-tools.yaml

      - template: flux-images.yaml
        parameters:
          environment: ${{ parameters.environment }}
          containerRegistryName: ${{ variables.containerRegistryName }}
          containerRegistryServiceConnection: ${{ variables.containerRegistryServiceConnection }}
          imageSignerServiceConnection: ${{ variables.imageSignerServiceConnection }}
          containerImages: ${{ parameters.containerImages }}
          scanImages: ${{ parameters.scanImages }}
          signImages: ${{ parameters.signImages }} # false on PR
          scanMode: ${{ parameters.scanMode }}

      - template: flux-operator.yaml
        parameters:
          environment: ${{ parameters.environment }}
          serviceConnection: ${{ parameters.serviceConnection }}
          helmReleaseName: ${{ parameters.helmReleaseName }}
          helmReleaseNamespace: ${{ parameters.helmReleaseNamespace }}
          helmChartPath: ${{ parameters.helmChartPath }}
          helmValuesFilePath: ${{ parameters.helmValuesFilePath }}
          helmDryRun: true # dry-run only on PR

      - template: flux-instance.yaml
        parameters:
          environment: ${{ parameters.environment }}
          serviceConnection: ${{ parameters.serviceConnection }}
          waitTimeout: "3m"

ci-cd-templates/flux-apply.yaml — push apply job: same structure as validate but signImages: true and helmDryRun: false:

jobs:
  - job: ApplyFluxConfiguration
    displayName: Apply Flux configuration to ${{ parameters.environment }} cluster
    variables:
      - name: containerRegistryName
        ${{ if eq(parameters.environment, 'prod') }}:
          value: myacr-prod.azurecr.io
        ${{ else }}:
          value: myacr-dev.azurecr.io
      - name: containerRegistryServiceConnection
        ${{ if eq(parameters.environment, 'prod') }}:
          value: acr-prod-sc
        ${{ else }}:
          value: acr-dev-sc
      - name: imageSignerServiceConnection
        ${{ if eq(parameters.environment, 'prod') }}:
          value: image-signing-prod-sc
        ${{ else }}:
          value: image-signing-dev-sc
    steps:
      - template: common-tools.yaml

      - template: flux-images.yaml
        parameters:
          environment: ${{ parameters.environment }}
          containerRegistryName: ${{ variables.containerRegistryName }}
          containerRegistryServiceConnection: ${{ variables.containerRegistryServiceConnection }}
          imageSignerServiceConnection: ${{ variables.imageSignerServiceConnection }}
          containerImages: ${{ parameters.containerImages }}
          scanImages: ${{ parameters.scanImages }}
          signImages: ${{ parameters.signImages }} # true on push
          scanMode: ${{ parameters.scanMode }}

      - template: flux-operator.yaml
        parameters:
          environment: ${{ parameters.environment }}
          serviceConnection: ${{ parameters.serviceConnection }}
          helmReleaseName: ${{ parameters.helmReleaseName }}
          helmReleaseNamespace: ${{ parameters.helmReleaseNamespace }}
          helmChartPath: ${{ parameters.helmChartPath }}
          helmValuesFilePath: ${{ parameters.helmValuesFilePath }}
          helmDryRun: false # real install on push

      - template: flux-instance.yaml
        parameters:
          environment: ${{ parameters.environment }}
          serviceConnection: ${{ parameters.serviceConnection }}
          waitTimeout: "3m"

ci-cd-templates/flux-operator.yaml — Helm template, dry-run, and upgrade of the Flux Operator; credentials are fetched and cleaned up around the Helm steps:

parameters:
  - name: environment
    default: ""
  - name: serviceConnection
    default: ""
  - name: helmReleaseName
    default: ""
  - name: helmReleaseNamespace
    default: ""
  - name: helmChartPath
    default: ""
  - name: helmValuesFilePath
    default: ""
  - name: helmDryRun
    type: boolean
    default: true
  - name: helmCreateNamespace
    type: boolean
    default: true

steps:
  - task: AzureCLI@2
    displayName: "Get AKS credentials"
    inputs:
      azureSubscription: "${{ parameters.serviceConnection }}"
      scriptType: bash
      scriptLocation: inlineScript
      inlineScript: |
        kubelogin remove-cache-dir
        az aks get-credentials \
          --name aks-${{ parameters.environment }} \
          --resource-group aks-rg-${{ parameters.environment }} \
          --overwrite-existing
        kubelogin convert-kubeconfig -l azurecli

  - task: AzureCLI@2
    displayName: "Helm Template ${{ parameters.helmReleaseName }}"
    inputs:
      azureSubscription: "${{ parameters.serviceConnection }}"
      scriptType: bash
      scriptLocation: inlineScript
      inlineScript: |
        helm template ${{ parameters.helmReleaseName }} \
          ./${{ parameters.helmChartPath }} \
          -f ./${{ parameters.helmValuesFilePath }}

  - task: AzureCLI@2
    displayName: "Helm Dry-Run ${{ parameters.helmReleaseName }}"
    condition: eq('${{ parameters.helmDryRun }}', 'true')
    inputs:
      azureSubscription: "${{ parameters.serviceConnection }}"
      scriptType: bash
      scriptLocation: inlineScript
      inlineScript: |
        helm upgrade ${{ parameters.helmReleaseName }} \
          ./${{ parameters.helmChartPath }} \
          -f ./${{ parameters.helmValuesFilePath }} \
          --namespace ${{ parameters.helmReleaseNamespace }} \
          --create-namespace --dry-run --install

  - task: AzureCLI@2
    displayName: "Helm Install ${{ parameters.helmReleaseName }}"
    condition: and(succeeded(), not(startsWith(variables['Build.SourceBranch'], 'refs/pull')))
    inputs:
      azureSubscription: "${{ parameters.serviceConnection }}"
      scriptType: bash
      scriptLocation: inlineScript
      inlineScript: |
        helm upgrade ${{ parameters.helmReleaseName }} \
          ./${{ parameters.helmChartPath }} \
          -f ./${{ parameters.helmValuesFilePath }} \
          --namespace ${{ parameters.helmReleaseNamespace }} \
          --create-namespace --install --wait

  - task: AzureCLI@2
    displayName: "Cleanup AKS credentials"
    condition: always()
    inputs:
      azureSubscription: "${{ parameters.serviceConnection }}"
      scriptType: bash
      scriptLocation: inlineScript
      inlineScript: |
        kubelogin remove-cache-dir
        kubectl config delete-context aks-${{ parameters.environment }} 2>/dev/null || true
        kubectl config delete-cluster aks-${{ parameters.environment }} 2>/dev/null || true
        echo "Credentials cleaned up ✓"

ci-cd-templates/flux-instance.yaml — validates the FluxInstance spec using the flux-operator CLI, applies it, and waits for the Ready condition. Skips the apply step on PRs:

parameters:
  - name: environment
    default: ""
  - name: serviceConnection
    default: ""
  - name: waitTimeout
    default: "3m"

steps:
  - task: AzureCLI@2
    displayName: "Get AKS credentials"
    inputs:
      azureSubscription: "${{ parameters.serviceConnection }}"
      scriptType: bash
      scriptLocation: inlineScript
      inlineScript: |
        kubelogin remove-cache-dir
        az aks get-credentials \
          --name aks-${{ parameters.environment }} \
          --resource-group aks-rg-${{ parameters.environment }} \
          --overwrite-existing
        kubelogin convert-kubeconfig -l azurecli

  - task: AzureCLI@2
    displayName: "Install flux-operator CLI"
    inputs:
      azureSubscription: "${{ parameters.serviceConnection }}"
      scriptType: bash
      scriptLocation: inlineScript
      inlineScript: |
        VERSION=$(yq '.image.tag' ./clusters/${{ parameters.environment }}/flux-operator-values.yaml)
        VERSION_NO_V="${VERSION#v}"
        curl -s -O -L "https://github.com/controlplaneio-fluxcd/flux-operator/releases/download/${VERSION}/flux-operator_${VERSION_NO_V}_linux_amd64.tar.gz"
        curl -s -O -L "https://github.com/controlplaneio-fluxcd/flux-operator/releases/download/${VERSION}/flux-operator_${VERSION_NO_V}_checksums.txt"
        grep "flux-operator_${VERSION_NO_V}_linux_amd64.tar.gz" flux-operator_${VERSION_NO_V}_checksums.txt | sha256sum --check
        tar -xzf flux-operator_${VERSION_NO_V}_linux_amd64.tar.gz -C /tmp flux-operator
        sudo mv /tmp/flux-operator /usr/local/bin/flux-operator
        flux-operator version --client

  - task: AzureCLI@2
    displayName: "Validate FluxInstance on ${{ parameters.environment }}"
    inputs:
      azureSubscription: "${{ parameters.serviceConnection }}"
      scriptType: bash
      scriptLocation: inlineScript
      inlineScript: |
        INSTANCE_FILE="./clusters/${{ parameters.environment }}/flux-instance.yaml"
        flux-operator build instance -f $INSTANCE_FILE
        kubectl apply -f $INSTANCE_FILE --dry-run=server

  - task: AzureCLI@2
    displayName: "Apply FluxInstance on ${{ parameters.environment }}"
    condition: and(succeeded(), not(startsWith(variables['Build.SourceBranch'], 'refs/pull')))
    inputs:
      azureSubscription: "${{ parameters.serviceConnection }}"
      scriptType: bash
      scriptLocation: inlineScript
      inlineScript: |
        INSTANCE_FILE="./clusters/${{ parameters.environment }}/flux-instance.yaml"
        APPLY_OUTPUT=$(kubectl apply -f $INSTANCE_FILE)
        echo "$APPLY_OUTPUT"

        if echo "$APPLY_OUTPUT" | grep -q "unchanged"; then
          READY=$(kubectl get fluxinstance flux -n flux-system \
            -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
          if [ "$READY" != "True" ]; then
            kubectl wait fluxinstance/flux --for=condition=Ready \
              --namespace flux-system --timeout=${{ parameters.waitTimeout }}
          else
            echo "FluxInstance already Ready ✓"
          fi
        else
          kubectl wait fluxinstance/flux --for=condition=Ready \
            --namespace flux-system --timeout=${{ parameters.waitTimeout }}
        fi

        kubectl -n flux-system get fluxinstance flux
        kubectl -n flux-system get pods

  - task: AzureCLI@2
    displayName: "Cleanup AKS credentials"
    condition: always()
    inputs:
      azureSubscription: "${{ parameters.serviceConnection }}"
      scriptType: bash
      scriptLocation: inlineScript
      inlineScript: |
        kubelogin remove-cache-dir
        kubectl config delete-context aks-${{ parameters.environment }} 2>/dev/null || true
        kubectl config delete-cluster aks-${{ parameters.environment }} 2>/dev/null || true
        echo "Credentials cleaned up ✓"

ci-cd-templates/common-tools.yaml — installs every tool the pipeline needs with SHA256 checksum verification on each download:

parameters:
  - name: kubectlVersion
    default: "v1.32.9"
  - name: kubeLoginVersion
    default: "v0.2.10"
  - name: yqVersion
    default: "v4.52.4"
  - name: trivyVersion
    default: "0.69.3"
  - name: cosignVersion
    default: "v2.6.1"
  - name: craneVersion
    default: "v0.20.3"
  - name: helmVersion
    default: "v3.18.4"

steps:
  - task: KubectlInstaller@0
    displayName: Install kubectl ${{ parameters.kubectlVersion }}
    inputs:
      kubectlVersion: ${{ parameters.kubectlVersion }}

  - bash: |
      set -euo pipefail
      curl -fsSL "https://github.com/Azure/kubelogin/releases/download/${{ parameters.kubeLoginVersion }}/kubelogin-linux-amd64.zip" -o kubelogin-linux-amd64.zip
      curl -fsSL "https://github.com/Azure/kubelogin/releases/download/${{ parameters.kubeLoginVersion }}/kubelogin-linux-amd64.zip.sha256" -o kubelogin-linux-amd64.zip.sha256
      sha256sum --check kubelogin-linux-amd64.zip.sha256
      unzip kubelogin-linux-amd64.zip && sudo mv bin/linux_amd64/kubelogin /usr/local/bin
      kubelogin --version
    displayName: Install kubelogin ${{ parameters.kubeLoginVersion }}

  - bash: |
      set -euo pipefail
      # Keep the upstream file name so the checksum line matches
      curl -fsSL "https://github.com/aquasecurity/trivy/releases/download/v${{ parameters.trivyVersion }}/trivy_${{ parameters.trivyVersion }}_Linux-64bit.tar.gz" -o "trivy_${{ parameters.trivyVersion }}_Linux-64bit.tar.gz"
      curl -fsSL "https://github.com/aquasecurity/trivy/releases/download/v${{ parameters.trivyVersion }}/trivy_${{ parameters.trivyVersion }}_checksums.txt" -o trivy_checksums.txt
      grep "trivy_${{ parameters.trivyVersion }}_Linux-64bit.tar.gz" trivy_checksums.txt | sha256sum --check
      tar -xzf "trivy_${{ parameters.trivyVersion }}_Linux-64bit.tar.gz" trivy && sudo mv trivy /usr/local/bin/trivy
      trivy -v
    displayName: Install Trivy ${{ parameters.trivyVersion }}

  - bash: |
      set -euo pipefail
      curl -fsSL "https://github.com/mikefarah/yq/releases/download/${{ parameters.yqVersion }}/yq_linux_amd64" -o yq_linux_amd64
      curl -fsSL "https://github.com/mikefarah/yq/releases/download/${{ parameters.yqVersion }}/checksums-bsd" -o yq_checksums-bsd
      grep "^SHA256 (yq_linux_amd64)" yq_checksums-bsd | awk '{print $NF "  yq_linux_amd64"}' | sha256sum --check
      sudo mv yq_linux_amd64 /usr/local/bin/yq && sudo chmod +x /usr/local/bin/yq
      yq --version
    displayName: Install yq ${{ parameters.yqVersion }}

  - bash: |
      set -euo pipefail
      # Keep the upstream file name so the checksum line matches
      curl -fsSL "https://github.com/google/go-containerregistry/releases/download/${{ parameters.craneVersion }}/go-containerregistry_Linux_x86_64.tar.gz" -o go-containerregistry_Linux_x86_64.tar.gz
      curl -fsSL "https://github.com/google/go-containerregistry/releases/download/${{ parameters.craneVersion }}/checksums.txt" -o crane_checksums.txt
      grep "go-containerregistry_Linux_x86_64.tar.gz" crane_checksums.txt | sha256sum --check
      tar -xzf go-containerregistry_Linux_x86_64.tar.gz crane && sudo mv crane /usr/local/bin/crane
      crane version
    displayName: Install crane ${{ parameters.craneVersion }}

  - bash: |
      set -euo pipefail
      # Keep the upstream file name so the checksum line matches
      curl -fsSL "https://get.helm.sh/helm-${{ parameters.helmVersion }}-linux-amd64.tar.gz" -o "helm-${{ parameters.helmVersion }}-linux-amd64.tar.gz"
      curl -fsSL "https://get.helm.sh/helm-${{ parameters.helmVersion }}-linux-amd64.tar.gz.sha256sum" -o helm.tar.gz.sha256sum
      sha256sum --check helm.tar.gz.sha256sum
      tar -xzf "helm-${{ parameters.helmVersion }}-linux-amd64.tar.gz" linux-amd64/helm && sudo mv linux-amd64/helm /usr/local/bin/helm
      helm version
    displayName: Install Helm ${{ parameters.helmVersion }}

  - bash: |
      set -euo pipefail
      curl -fsSL "https://github.com/sigstore/cosign/releases/download/${{ parameters.cosignVersion }}/cosign-linux-amd64" -o cosign-linux-amd64
      curl -fsSL "https://github.com/sigstore/cosign/releases/download/${{ parameters.cosignVersion }}/cosign_checksums.txt" -o cosign_checksums.txt
      grep "cosign-linux-amd64$" cosign_checksums.txt | sha256sum --check
      sudo mv cosign-linux-amd64 /usr/local/bin/cosign && sudo chmod +x /usr/local/bin/cosign
      cosign version
    displayName: Install Cosign ${{ parameters.cosignVersion }}

Systems Validation Pipeline

There are actually two pipelines. The main azure-pipelines.yaml handles Flux infrastructure (operator, FluxInstance, images). A separate azure-pipelines-systems.yaml validates system manifests on PRs.

When someone opens a PR that touches clusters/<env>/systems/, this pipeline runs kubectl apply --dry-run=server against the live cluster for the target environment. Server-side dry-run is important — it catches errors that client-side validation misses, like referencing a CRD that does not exist in the cluster or a resource that would conflict with an existing one.

Only the stage matching the PR target branch runs. A PR targeting dev validates clusters/dev/systems/ only — not experimental or prod.
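
A quick illustration of the difference, assuming a manifest that targets a namespace nobody has created yet:

kubectl apply -f configmap.yaml --dry-run=client
# configmap/app-config created (dry run)   → client-side validation passes

kubectl apply -f configmap.yaml --dry-run=server
# Error from server (NotFound): namespaces "team-x" not found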

# azure-pipelines-systems.yaml
name: k8s-gitops-systems-validate

trigger: none # no push trigger — PR only via branch policy

pool: my-linux-agents

stages:
  - stage: ValidateExperimental
    displayName: Validate systems manifests in experimental
    condition: eq(variables['System.PullRequest.TargetBranch'], 'refs/heads/experimental')
    jobs:
      - template: ci-cd-templates/systems-validate.yaml
        parameters:
          environment: "experimental"
          serviceConnection: "k8s-rbac-experimental-ado-sc"

  - stage: ValidateDev
    displayName: Validate systems manifests in dev
    condition: eq(variables['System.PullRequest.TargetBranch'], 'refs/heads/dev')
    jobs:
      - template: ci-cd-templates/systems-validate.yaml
        parameters:
          environment: "dev"
          serviceConnection: "k8s-rbac-dev-ado-sc"

  - stage: ValidateProd
    displayName: Validate systems manifests in prod
    condition: eq(variables['System.PullRequest.TargetBranch'], 'refs/heads/master')
    jobs:
      - template: ci-cd-templates/systems-validate.yaml
        parameters:
          environment: "prod"
          serviceConnection: "k8s-rbac-prod-ado-sc"

The trigger: none is intentional — this pipeline is attached as a branch policy in Azure DevOps, not triggered by a push. It only runs when a PR is opened or updated.

No pipeline applies anything under clusters/*/systems/ — those paths are reconciled directly by Flux on its own interval (every 1–5 minutes depending on the resource). The systems pipeline only validates on PRs; it has no push trigger.
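
When you want a change picked up immediately instead of on the next interval — right after merging a registration PR, say — the flux CLI can force a reconciliation from any workstation with cluster access. The names here follow the onboarding section below:

flux reconcile source git k8s-<system> -n flux-system
flux reconcile kustomization <system> -n flux-system --with-source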

ci-cd-templates/systems-validate.yaml — fetches cluster credentials, runs kubectl apply --dry-run=server against all files in clusters/<env>/systems/, then cleans up credentials:

parameters:
  - name: environment
    default: ""
  - name: serviceConnection
    default: ""

jobs:
  - job: ValidateSystemsManifests
    displayName: Validate systems manifests in ${{ parameters.environment }} cluster
    steps:
      - template: common-tools.yaml

      - task: AzureCLI@2
        displayName: "Get AKS credentials"
        inputs:
          azureSubscription: "${{ parameters.serviceConnection }}"
          scriptType: bash
          scriptLocation: inlineScript
          inlineScript: |
            kubelogin remove-cache-dir
            az aks get-credentials \
              --name aks-${{ parameters.environment }} \
              --resource-group aks-rg-${{ parameters.environment }} \
              --overwrite-existing
            kubelogin convert-kubeconfig -l azurecli

      - task: AzureCLI@2
        displayName: "Dry-run systems manifests on ${{ parameters.environment }}"
        inputs:
          azureSubscription: "${{ parameters.serviceConnection }}"
          scriptType: bash
          scriptLocation: inlineScript
          inlineScript: |
            set -euo pipefail
            SYSTEMS_PATH="./clusters/${{ parameters.environment }}/systems"
            FILES=$(find "$SYSTEMS_PATH" -name "*.yaml" | sort)

            if [ -z "$FILES" ]; then
              echo "No system files found — nothing to validate"
              exit 0
            fi

            echo "Files to validate:"
            echo "$FILES" | while read -r f; do echo "  → $f"; done

            kubectl apply -f "$SYSTEMS_PATH" --dry-run=server
            echo "Validation passed ✓"

      - task: AzureCLI@2
        displayName: "Cleanup AKS credentials"
        condition: always()
        inputs:
          azureSubscription: "${{ parameters.serviceConnection }}"
          scriptType: bash
          scriptLocation: inlineScript
          inlineScript: |
            kubelogin remove-cache-dir
            kubectl config delete-context aks-${{ parameters.environment }} 2>/dev/null || true
            kubectl config delete-cluster aks-${{ parameters.environment }} 2>/dev/null || true
            echo "Credentials cleaned up ✓"

Onboarding Systems — Three Patterns

The platform supports three onboarding patterns. The platform team’s registration step in k8s-gitops is identical for all three. The difference is entirely in what the system team places in their own repository.

Platform team registration — one file per system, per environment:

# clusters/<env>/systems/<system>.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: k8s-<system>
  namespace: flux-system
spec:
  interval: 1m
  url: https://dev.azure.com/my-org/my-project/_git/k8s-<system>
  ref:
    branch: <env>
  provider: azure
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: <system>
  namespace: flux-system
spec:
  interval: 5m
  path: ./environments/<env>/flux
  prune: true
  sourceRef:
    kind: GitRepository
    name: k8s-<system>

Flux watches clusters/<env>/systems/ in the k8s-gitops repo. When this file is merged, Flux picks it up on its next reconciliation interval, starts watching the system repo at the specified branch and path, and applies whatever it finds there.

Pattern A — System Owns Its Own Chart and HelmRelease

Used for platform components (nginx, cert-manager, kured, etc.) and application systems with custom charts. Each system repo co-locates its Helm chart and HelmRelease.

k8s-<system>/
├── <chart-dir>/                   # Helm chart co-located in the same repo
│   ├── Chart.yaml
│   ├── templates/
│   └── values.yaml                # base values
└── environments/
    ├── experimental/
    │   ├── values.yaml            # env-specific overrides
    │   └── flux/
    │       └── helmrelease.yaml   # Flux reconciles this path
    ├── dev/
    │   ├── values.yaml
    │   └── flux/
    │       └── helmrelease.yaml
    └── prod/
        ├── values.yaml
        └── flux/
            └── helmrelease.yaml

The HelmRelease in each environment references the chart by relative path in the same GitRepository:

apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: <system>
  namespace: flux-system
spec:
  interval: 10m
  targetNamespace: <system>
  install:
    createNamespace: true
  upgrade:
    cleanupOnFail: true
  chart:
    spec:
      chart: ./<chart-dir>
      reconcileStrategy: Revision
      sourceRef:
        kind: GitRepository
        name: k8s-<system>
        namespace: flux-system
      valuesFiles:
        - ./<chart-dir>/values.yaml
        - ./environments/<env>/values.yaml

The HelmRelease always lives in flux-system. The targetNamespace tells helm-controller where to actually deploy the chart’s resources. Flux auto-creates an internal HelmChart object in flux-system — you never manage that directly.
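
If a release misbehaves, that generated object is easy to find: helm-controller names it after the HelmRelease namespace and name. Output abridged and illustrative:

kubectl -n flux-system get helmcharts
# NAME                   AGE   READY
# flux-system-<system>   2d    True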

Pattern B — Central Shared Chart, System Owns Only Values

Pattern B is the standard pattern for business applications. The platform team maintains a single central Helm chart repo that encapsulates everything a well-behaved application deployment needs: namespace creation, RBAC, NetworkPolicy, ResourceQuota, LimitRange, and default ServiceAccount annotations. Application teams supply only a HelmRelease pointing at that central chart with their own values — they never write RBAC YAML, never define quotas, and never touch chart internals.

The platform team registers the central chart repo once in k8s-gitops as a shared GitRepository, then each business application gets its own GitRepository + Kustomization registration as usual. In the application repo the HelmRelease references the central chart source:

apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: <system>
  namespace: flux-system
spec:
  interval: 10m
  targetNamespace: <system>
  chart:
    spec:
      chart: ./platform-app-chart # chart from the central repo
      sourceRef:
        kind: GitRepository
        name: k8s-platform-charts # central chart GitRepository, registered once
        namespace: flux-system
  values:
    replicaCount: 2
    image:
      repository: myacr.azurecr.io/my-org/<system>
      tag: "1.0.0"
    aadGroupId: "<team-aad-group-id>" # drives RoleBinding generation inside the chart
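
The sourceRef above points at the central chart repository, which the platform team registers once per environment. A sketch of that registration follows (the repository URL is a placeholder); no Kustomization is needed here, because this GitRepository only serves charts to helm-controller:

# clusters/<env>/systems/platform-charts.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: k8s-platform-charts
  namespace: flux-system
spec:
  interval: 5m
  url: https://dev.azure.com/my-org/my-project/_git/k8s-platform-charts
  ref:
    branch: <env>
  provider: azure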

Back in the HelmRelease, the aadGroupId value is the key pattern: the central chart uses it to generate a RoleBinding for the team’s Azure AD group, scoped to their namespace. Teams get access to their own namespace without ever writing an RBAC manifest, and the baseline security posture (NetworkPolicy, LimitRange, default quotas) comes with the chart automatically, with no manual steps or per-namespace platform team involvement.
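
Inside the central chart, the RoleBinding generation could look roughly like this. The template below is a hypothetical sketch (the real chart internals are not shown in this post); it binds the team’s AAD group to the built-in edit ClusterRole in the release namespace:

{{- if .Values.aadGroupId }}
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: {{ .Release.Name }}-team-access
  namespace: {{ .Release.Namespace }}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: {{ .Values.aadGroupId }} # AAD group object ID
{{- end }}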

Platform SKUs — Right-Sized Resource Quotas

One extension of the central chart model is platform SKUs — a set of pre-defined resource profiles that teams choose from when onboarding. Instead of negotiating ResourceQuota and LimitRange values per team, the platform exposes a small menu of named tiers:

SKU          Max Pods   CPU request → limit   Memory request → limit
sku-small    5          100m → 1              128Mi → 1Gi
sku-medium   15         200m → 2              256Mi → 4Gi
sku-large    30         500m → 4              512Mi → 8Gi
sku-batch    20         500m → 8              1Gi → 16Gi

Each SKU is a separate Helm chart (or a named values preset inside the central chart). A team references their chosen SKU alongside their app HelmRelease:

values:
  replicaCount: 2
  image:
    repository: myacr.azurecr.io/my-org/<system>
    tag: "1.0.0"
  aadGroupId: "<team-aad-group-id>"
  sku: sku-medium # injects ResourceQuota + LimitRange for this tier

The platform team owns the SKU definitions centrally — if a tier needs tuning, one chart change propagates to every team using that SKU on the next reconciliation. Teams never touch quota YAML and can request a larger SKU via a PR comment rather than a platform ticket.
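
One way to wire this up inside the central chart, as a hypothetical sketch: the SKU table lives as a preset map in the chart’s values, and a template resolves the chosen tier into a ResourceQuota (numbers taken from the sku-medium row above):

# values.yaml (central chart)
skus:
  sku-medium:
    pods: "15"
    requestsCpu: "200m"
    limitsCpu: "2"
    requestsMemory: "256Mi"
    limitsMemory: "4Gi"

# templates/resourcequota.yaml
{{- $sku := get .Values.skus .Values.sku }}
apiVersion: v1
kind: ResourceQuota
metadata:
  name: {{ .Values.sku }}
spec:
  hard:
    pods: {{ $sku.pods | quote }}
    requests.cpu: {{ $sku.requestsCpu | quote }}
    requests.memory: {{ $sku.requestsMemory | quote }}
    limits.cpu: {{ $sku.limitsCpu | quote }}
    limits.memory: {{ $sku.limitsMemory | quote }}

The same lookup can feed a LimitRange template, so per-container defaults track the tier as well.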

Pattern C — Plain YAML or Kustomize (Exception Only)

Pattern C is not a standard onboarding path — it is a deliberate escape hatch for the small category of resources that genuinely cannot fit inside a Helm chart: CRDs, ClusterRoles, cross-namespace objects, or anything that must exist before any chart can be installed. The Kustomization points at a path containing plain YAML files or a kustomization.yaml overlay, and kustomize-controller applies them directly.

If you find yourself reaching for Pattern C for a regular application workload, that is a signal something is missing from the central chart — fix the chart rather than bypassing it. Pattern C should be rare, reviewed carefully, and always additive. It cannot override what a Pattern B chart already manages, and any Pattern C registration requires explicit platform team approval before merging to k8s-gitops.

Choosing a Pattern

Situation                                                          Pattern
System owns a custom Helm chart                                    A
System uses a platform-provided shared chart                       B
Exceptional resources only (CRDs, ClusterRoles, cross-namespace)   C

Business Application CI Pipeline

The k8s-gitops repo and the Flux infrastructure pipeline are entirely separate from how business applications build and deliver their images. Each application team owns their own Azure DevOps pipeline in their own repo — the platform only defines what must happen before an image is allowed into the cluster.

The contract is simple: every image that lands in production must be scanned, pushed to the private ACR, and signed. How the pipeline is structured internally is up to the team, but these three steps are non-negotiable (a minimal sketch follows the list):

  1. Build or mirror, then scan with Trivy — application images are typically built from source in the team’s own pipeline. Third-party images the application depends on (databases, sidecars, off-the-shelf tools) are mirrored from public registries using crane copy. In both cases the image is scanned for HIGH and CRITICAL vulnerabilities before it is pushed to ACR. The pipeline fails if any are found.
  2. Push to private ACR — all images go to the private registry. HelmRelease values and manifests always reference myacr.azurecr.io/..., never a public registry directly.
  3. Sign with Cosign — the image is signed using the shared Azure Key Vault key after pushing. Admission controls in the cluster can verify the signature before allowing the workload to run.
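
What those steps can look like in practice — a minimal sketch of the delivery tail of an application pipeline; the registry, image, and Key Vault names are placeholders reused from earlier examples:

steps:
  - bash: |
      docker build -t myacr.azurecr.io/my-org/my-app:$(Build.BuildNumber) .
    displayName: Build image

  - bash: |
      # Gate: fail the pipeline on HIGH/CRITICAL findings
      trivy image --exit-code 1 --severity HIGH,CRITICAL --ignore-unfixed \
        myacr.azurecr.io/my-org/my-app:$(Build.BuildNumber)
    displayName: Scan image

  - bash: |
      docker push myacr.azurecr.io/my-org/my-app:$(Build.BuildNumber)
    displayName: Push to private ACR

  - bash: |
      # Sign by digest so the signature stays valid regardless of tags
      DIGEST=$(crane digest myacr.azurecr.io/my-org/my-app:$(Build.BuildNumber))
      cosign sign \
        --key azurekms://my-dev-keyvault.vault.azure.net/my-dev-cosign-key \
        myacr.azurecr.io/my-org/my-app@$DIGEST \
        --upload=true --yes=true --tlog-upload=false
    displayName: Sign with Cosign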

This is where Pattern B closes the loop neatly. The team updates the image.tag value in their HelmRelease after a successful pipeline run, merges to the environment branch, and Flux picks up the change — no pipeline needs cluster credentials, no kubectl is run from CI. The pipeline’s only job is to produce a trusted, signed image in ACR. Flux handles the rest.

Switching from Push Pipelines to GitOps

Before GitOps the deployment model looked like most traditional CI/CD setups: pipelines authenticated to the cluster with a service principal and pushed changes in with kubectl apply. It worked, but over time the friction became hard to ignore:

  • Credentials in the pipeline — every environment required a service connection with cluster-admin or broad RBAC permissions stored in the CI/CD platform. Rotation was manual, scope was wider than needed.
  • No drift detection — if someone applied a change directly to the cluster, the pipeline had no idea. The next pipeline run would overwrite it, or not, depending on whether that path was touched. There was no continuous reconciliation.
  • Deployment only on trigger — the cluster state was only as fresh as the last pipeline run. A failed pipeline meant nothing was deployed, with no retry or self-healing.
  • No supply chain controls on images — images were mirrored or built in the pipeline but without vulnerability scanning or signing. There was no enforcement gate preventing an unscanned or unsigned image from reaching the cluster.
  • Pipeline as the gatekeeper — every change had to flow through a pipeline, but the pipeline could not tell you whether the cluster had drifted from what it last applied.

The GitOps model inverts this. The cluster reaches out to Git, not the other way around. Credentials stay inside the cluster (and in this setup are federated via Workload Identity — no secrets at all). The operator reconciles continuously, so a drift is corrected within minutes without any human or pipeline intervention. Because everything is declared in Git and reviewed via pull requests, you get a full audit trail for free.

The shift is not just operational; it changes how teams think about deployments. Instead of “run the pipeline to deploy”, the mental model becomes “merge to the environment branch and Flux will pick it up”. The pipeline still exists, but its job is narrower: validate the infrastructure layer (Flux Operator, FluxInstance), mirror and sign images, and ensure the config is correct before it lands in Git. Day-to-day system changes bypass the pipeline entirely — Flux handles them on its own interval.

Upgrading Flux

Upgrading Flux in this setup is a four-step process:

  1. Generate the updated image list using the helper script:

    ./scripts/flux-images.sh v2.9.0 v0.46.0 ./clusters/experimental/flux-operator-values.yaml
    

    The script uses the flux CLI to resolve controller images for the requested version and reads the operator image repository from the values file. Output looks like this:

    ======================================================================
    Add to azure-pipelines.yaml containerImages:
    ----------------------------------------------------------------------
      - myacr.azurecr.io/controlplaneio-fluxcd/flux-operator:v0.46.0
  - ghcr.io/fluxcd/helm-controller:v1.6.0
  - ghcr.io/fluxcd/image-automation-controller:v1.2.0
  - ghcr.io/fluxcd/image-reflector-controller:v1.2.0
  - ghcr.io/fluxcd/kustomize-controller:v1.9.0
  - ghcr.io/fluxcd/notification-controller:v1.9.0
  - ghcr.io/fluxcd/source-controller:v1.9.0
    ======================================================================
    

    Here is the full script (scripts/flux-images.sh):

    #!/bin/bash
    # Generates the containerImages list for azure-pipelines.yaml when upgrading
    # Flux controllers or the Flux Operator.
    #
    # Usage:
    #   ./flux-images.sh <FLUX_VERSION> <OPERATOR_VERSION> [VALUES_FILE]
    #
    # Prerequisites: flux CLI, yq
    set -euo pipefail
    
    FLUX_VERSION=${1:-""}
    OPERATOR_VERSION=${2:-""}
    DEFAULT_VALUES=${3:-"values.yaml"}
    
    [[ -z "$FLUX_VERSION" ]]     && { echo "ERROR: FLUX_VERSION required (e.g. v2.9.0)" >&2; exit 1; }
    [[ -z "$OPERATOR_VERSION" ]] && { echo "ERROR: OPERATOR_VERSION required (e.g. v0.46.0)" >&2; exit 1; }
    [[ ! -f "$DEFAULT_VALUES" ]] && { echo "ERROR: Values file '$DEFAULT_VALUES' not found." >&2; exit 1; }
    command -v flux &>/dev/null   || { echo "ERROR: flux CLI not installed." >&2; exit 1; }
    command -v yq   &>/dev/null   || { echo "ERROR: yq not installed." >&2; exit 1; }
    
    OPERATOR_REPO=$(yq '.image.repository' "$DEFAULT_VALUES")
    OPERATOR_IMAGE="$OPERATOR_REPO:$OPERATOR_VERSION"
    
    FLUX_IMAGES=$(flux install \
      --version="$FLUX_VERSION" \
      --components=source-controller,helm-controller,kustomize-controller,notification-controller \
      --components-extra=image-reflector-controller,image-automation-controller \
      --export \
      | grep 'image: ghcr.io/fluxcd/' \
      | awk '{print $2}' \
      | sort -u)
    
    echo "======================================================================"
    echo "Add to azure-pipelines.yaml containerImages:"
    echo "----------------------------------------------------------------------"
    echo "  - $OPERATOR_IMAGE"
    echo "$FLUX_IMAGES" | while IFS= read -r IMAGE; do echo "  - $IMAGE"; done
    echo "======================================================================"
    echo ""
    echo "Next steps:"
    echo "  1. Copy the image list into azure-pipelines.yaml"
    echo "  2. Update spec.distribution.version in clusters/*/flux-instance.yaml to $FLUX_VERSION"
    echo "  3. Update the operator image tag in clusters/*/flux-operator-values.yaml to $OPERATOR_VERSION"
    echo "  4. git commit + push + open PR to experimental first"
    

    Requires the flux CLI and yq to be installed locally.

  2. Update azure-pipelines.yaml — replace the containerImages list for all three environments with the output above.

  3. Update spec.distribution.version in each clusters/<env>/flux-instance.yaml.

  4. Open a PR to experimental — validate, merge, then promote to dev and finally master.

The Flux Operator handles the rollout. You do not interact with the cluster directly. The pipeline validates the configuration before it touches anything, and the FluxInstance apply step waits for the Ready condition before the pipeline marks success.

Key Lessons

Use the Flux Operator. flux bootstrap was the right tool for getting started quickly, but the Flux Operator is the right tool for operating Flux long-term. Declarative upgrades, version-controlled configuration, and no generated YAML to maintain.

Mirror images to private ACR. Public registries have rate limits and availability dependencies. Mirroring with crane copy preserves digests, which means Cosign signatures remain valid. Trivy scanning before the copy gives you a vulnerability gate. All three together — mirror, scan, sign — is a practical and automatable supply chain control.

Server-side dry-run on PRs is worth it. Client-side validation misses schema errors for custom resources and referential integrity issues. Server-side dry-run catches these before anything reaches the cluster.

prune: true on Kustomizations. This ensures that when you remove a manifest from Git, the corresponding resource is deleted from the cluster. Without it, removed resources silently linger and you lose the “Git is the source of truth” property.

Workload Identity over credentials. Setting up federated credentials takes a bit more upfront effort than creating a PAT, but you eliminate a class of credential management problems entirely. No rotation, no expiry, no secrets stored anywhere.

Conclusion

Building a GitOps platform that teams actually want to use comes down to making the happy path easy and the unsafe path hard. In this setup the platform team maintains the Flux infrastructure layer and provides the system registration pattern. System teams own their repositories, their charts, their release cadence, and their values. They do not need platform team approval to deploy a new version of their application.

The pipeline handles the trust model: images are scanned before they enter the private registry, signed after they arrive, and verified before they are deployed. Pull requests are validated with server-side dry-run before merge. Flux handles reconciliation continuously, not just on pipeline triggers.

If you are looking to build a similar setup or are in the middle of migrating from a more manual approach, I hope this walkthrough is useful. Although the examples here use Azure DevOps and AKS, the approach is broadly portable — Flux works just as well with GitHub, GitLab, and Gitea repositories, and the pipeline layer is standard YAML that maps directly to GitHub Actions or GitLab CI. The Flux side of the setup does not change at all. Feel free to reach out if you have questions — I’m happy to dig into any of the details.

Thank you for reading!
