## The “Stateful” Reality Check
In our last post, we solved the persistence layer by deploying Longhorn on Talos Linux. We finally have a place to put data. But a raw block device isn’t a database.
Running a database in Kubernetes is often cited as one of the hardest challenges in platform engineering. You have to manage failover, backups, upgrades, and configuration changes… all while ensuring data integrity. In the enterprise world, we often offload this to managed services like AWS RDS or Google Cloud SQL.
However, in my experience working in the private banking sector, “just use RDS” isn’t always an option. Government regulations and data sovereignty laws frequently mandate that data stays on-premise. In these environments, I’ve seen many setups rely on traditional PostgreSQL clusters managed by tools like Patroni on bare-metal or VMs. While effective, they require significant operational overhead to manage (and that’s a topic for a future blog post).
But in the homelab? We are the cloud provider. We have to build our own RDS.
This post details how to build a production-ready PostgreSQL service using CloudNativePG (CNPG), and crucially, how to tune it to play nicely with our underlying Longhorn storage to avoid performance killers.
## The Operator Pattern: Why CNPG?
You could deploy PostgreSQL with a simple Helm chart that spins up a StatefulSet. It works… until the primary node dies, or you need a major version upgrade, or you need point-in-time recovery.
This is where the Operator Pattern shines. An Operator is essentially a robotic sysadmin running inside your cluster. It watches your custom resources (like a YAML file saying “I want a Postgres Cluster”) and actively manages the underlying Pods and Services to make that reality happen.
I chose CloudNativePG (CNPG) because:
- It’s Declarative: You define the desired state of your cluster, not the steps to get there.
- Immutability: It treats PostgreSQL instances as disposable. If a node fails, it spins up a new one and resyncs.
- Enterprise Origins: Originally built by EDB, it brings serious features like WAL archiving and synchronous replication to the open-source table.
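The declarative model is easiest to see in the custom resource itself. As a minimal sketch (the field names follow the CNPG `postgresql.cnpg.io/v1` API; the cluster name and storage size here are placeholders, not values from this post), a three-instance cluster is little more than:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: example-db      # placeholder name
spec:
  instances: 3          # desired state: the operator keeps 3 instances running
  storage:
    size: 1Gi           # placeholder size
```

You declare "3 instances" and the operator does the rest: bootstrapping, replication, failover, and replacement of lost Pods.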
## The Double Replication Trap
Here is the specific architectural challenge we face when combining Longhorn with a Distributed Database.
By default, Longhorn replicates every block of data to 3 different nodes to ensure availability. By default, a high-availability PostgreSQL cluster also replicates data to 3 different instances.
If you run a standard 3-node CNPG cluster on top of standard Longhorn volumes, you are writing every single byte of data 9 times (3 DB replicas × 3 Storage replicas).
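The multiplication is worth making explicit. A tiny sketch of the write-amplification arithmetic:

```python
def data_copies(db_replicas: int, storage_replicas: int) -> int:
    """Each database replica writes its own copy of the data, and the
    storage layer then replicates every one of those writes."""
    return db_replicas * storage_replicas

# Default stacking: 3 CNPG instances on 3-replica Longhorn volumes
print(data_copies(3, 3))  # → 9

# Letting CNPG own HA: 3 instances on single-replica volumes
print(data_copies(3, 1))  # → 3
```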
| Setup | Data Copies | Performance | Reliability |
|---|---|---|---|
| Default (Longhorn 3 + CNPG 3) | 9 | Very Slow (High Latency) | Extreme (Overkill) |
| Minimal (Longhorn 3 + CNPG 1) | 3 | Stale Data Potential | Low (No DB failover) |
| Optimized (Longhorn 1 + CNPG 3) | 3 | Fast (Local Speed) | High (Standard) |
The Solution: We need to let the Application (CNPG) handle the High Availability, effectively treating the storage layer as “ephemeral” local disks.
## Implementation Guide

### 1. The Optimized StorageClass

First, we define a specific Longhorn StorageClass that replicates the behavior of a local SSD. We set `numberOfReplicas` to 1 and force `dataLocality` to `strict-local`.

File: `base/longhorn/storage-class-cnpg.yaml`
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-cnpg
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "1"
  staleReplicaTimeout: "2880" # 48 hours
  fromBackup: ""
  fsType: "ext4"
  dataLocality: "strict-local" # Critical for performance
```

With `strict-local`, Longhorn keeps the data on the same node as the Pod. If the node dies, we lose that specific volume, and that is okay. CNPG will detect the failure, promote one of the other two standby instances to Primary, and eventually spin up a new replica to replace the lost one.
### 2. Deploying the CNPG Operator

We use Argo CD to manage the operator lifecycle. This ensures our “robotic sysadmin” handles updates and configuration drift.

File: `base/database/cnpg-operator/cnpg-operator.yaml`
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cloudnative-pg-operator
  namespace: argocd
spec:
  destination:
    namespace: cnpg-system
    server: https://kubernetes.default.svc
  project: argo-config
  sources:
    - repoURL: https://cloudnative-pg.github.io/charts
      chart: cloudnative-pg
      targetRevision: 0.27.1
      helm:
        releaseName: cnpg-operator
        values: |
          config:
            clusterWide: true # Manage DBs in all namespaces
```

### 3. Deploying a Database Cluster (The GitOps Way)
Now we can request a database. Unlike a traditional VM where you might install one Postgres server and create 50 databases inside it, the Kubernetes pattern is one Cluster per Microservice. This ensures isolation: if one app goes rogue and eats all the CPU, it doesn’t take down the others.
To manage this at scale without copying 500 lines of YAML for every microservice, we use a Kustomize Overlay strategy with Argo CD. We define a “Base Application” that knows how to deploy a standard CNPG cluster, and then we just patch the specifics (name, namespace, storage size) for each app.
#### Step 3.1: The Base Application

This manifest tells Argo CD how to deploy a generic CNPG cluster using the official Helm chart.

File: `base/database/cnpg-cluster/cnpg-cluster.yaml`
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cnpg-cluster
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  destination:
    namespace: cnpg-system
    server: https://kubernetes.default.svc
  project: argo-config
  sources:
    - repoURL: https://github.com/anvaplus/homelab-k8s-argo-config.git
      targetRevision: main
      ref: valuesRepo
    - repoURL: https://cloudnative-pg.github.io/charts
      chart: cluster
      targetRevision: 0.5.0
      helm:
        releaseName: cnpg-cluster
        valueFiles:
          - $valuesRepo/base/database/cnpg-cluster/values.yaml
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

#### Step 3.2: The Kustomize Overlay (e.g., Keycloak)
When we need a database for `keycloak`, we don’t start from scratch. We simply patch the base application to update the target namespace and values file.

File: `environments/dev/database/cnpg-cluster/clusters/keycloak/kustomization.yaml`
```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../../../../../base/database/cnpg-cluster/
patches:
  - target:
      group: argoproj.io
      version: v1alpha1
      kind: Application
      name: cnpg-cluster
    patch: |
      - op: replace
        path: "/spec/sources/1/targetRevision"
        value: "0.5.0"
      - op: add
        path: "/spec/sources/1/helm/valueFiles/-"
        value: "$valuesRepo/environments/dev/database/cnpg-cluster/clusters/keycloak/override.values.yaml"
      - op: replace
        path: /metadata/name
        value: cnfg-cluster-keycloak
      - op: replace
        path: /spec/destination/namespace
        value: keycloak
      - op: replace
        path: /spec/sources/1/helm/releaseName
        value: cnfg-cluster-keycloak
```

#### Step 3.3: The Configuration Values
Finally, we define the actual database configuration in override.values.yaml. This is where we reference our optimized storage class.
File: `environments/dev/database/cnpg-cluster/clusters/keycloak/override.values.yaml`
```yaml
type: postgresql
mode: standalone
version:
  postgresql: "16"
cluster:
  instances: 3
  storage:
    size: 1Gi
    storageClass: "longhorn-cnpg" # The magic happens here
backups:
  enabled: false # Or true if configured
```

This tiered approach allows us to spin up production-ready, high-availability databases in seconds by adding just two small files to our GitOps repo.
## A Note on Backups

**Do not use Longhorn Snapshots for Databases.**
Snapshots happen at the block level. If you snapshot a running database while it’s flushing memory to disk, you risk capturing a corrupted state. Always use the database’s native backup tools. CNPG integrates with Barman, which streams the Write-Ahead Logs (WAL) to object storage. This allows for Point-In-Time Recovery (PITR)… you can literally restore your database to the state it was in at 14:03:22 yesterday.
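When the time comes to switch `backups.enabled` on, the relevant knobs live in the Cluster spec. A sketch of what that could look like (the bucket path, endpoint, and Secret name are placeholders; the field names follow the CNPG `barmanObjectStore` API):

```yaml
spec:
  backup:
    retentionPolicy: "30d"
    barmanObjectStore:
      destinationPath: s3://backups/keycloak   # placeholder bucket path
      endpointURL: https://s3.example.lan      # placeholder S3-compatible endpoint
      s3Credentials:
        accessKeyId:
          name: backup-creds                   # placeholder Secret
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: backup-creds
          key: ACCESS_SECRET_KEY
```

Barman then archives WAL segments continuously to the object store, which is what makes PITR possible.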
## Conclusion
By combining the CNPG Operator with a tuned Longhorn StorageClass, we have achieved a setup that rivals enterprise RDS offerings:
- High Availability: Automated failover in seconds.
- Performance: Near-native disk speeds using `strict-local` storage.
- Resilience: Automated backups and self-healing.
As always, all the code and configuration files discussed in this post are available in my GitHub repository.
With networking, storage, and now a robust database layer in place, we have cleared all the infrastructure hurdles. In the next post, we will finally deploy our first major application: Keycloak, the Identity Provider that will secure our entire platform.
Stay tuned!

Andrei

