## The “Stateful” Reality Check
In our last post, we solved the persistence layer by deploying Longhorn on Talos Linux. We finally have a place to put data. But a raw block device isn’t a database.
Running a database in Kubernetes is often cited as one of the hardest challenges in platform engineering. You have to manage failover, backups, upgrades, and configuration changes… all while ensuring data integrity. In the enterprise world, we often offload this to managed services like AWS RDS or Google Cloud SQL.
However, in my experience working in the private banking sector, “just use RDS” isn’t always an option. Government regulations and data sovereignty laws frequently mandate that data stays on-premise. In these environments, I’ve seen many setups rely on traditional PostgreSQL clusters managed by tools like Patroni on bare-metal or VMs. While effective, they require significant operational overhead to manage (and that’s a topic for a future blog post).
But in the homelab? We are the cloud provider. We have to build our own RDS.
This post details how to build a production-ready PostgreSQL service using CloudNativePG (CNPG), and crucially, how to tune it to play nicely with our underlying Longhorn storage to avoid performance killers.
## The Operator Pattern: Why CNPG?
You could deploy PostgreSQL with a simple Helm chart that spins up a StatefulSet. It works… until the primary node dies, or you need a major version upgrade, or you need point-in-time recovery.
This is where the Operator Pattern shines. An Operator is essentially a robotic sysadmin running inside your cluster. It watches your custom resources (like a YAML file saying “I want a Postgres Cluster”) and actively manages the underlying Pods and Services to make that reality happen.
I chose CloudNativePG (CNPG) because:
- It’s Declarative: You define the desired state of your cluster, not the steps to get there.
- Immutability: It treats PostgreSQL instances as disposable. If a node fails, it spins up a new one and resyncs.
- Enterprise Origins: Originally built by EDB, it brings serious features like WAL archiving and synchronous replication to the open-source table.
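The declarative model is easiest to see in the custom resource itself. As a minimal sketch (the field names follow the CNPG `postgresql.cnpg.io/v1` API; the cluster name and storage size here are placeholders, not values from this post), a three-instance cluster is little more than:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: example-db      # placeholder name
spec:
  instances: 3          # desired state: the operator keeps 3 instances running
  storage:
    size: 1Gi           # placeholder size
```

You declare "3 instances" and the operator does the rest: bootstrapping, replication, failover, and replacement of lost Pods.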
## The Double Replication Trap
Here is the specific architectural challenge we face when combining Longhorn with a Distributed Database.
By default, Longhorn replicates every block of data to 3 different nodes to ensure availability. By default, a high-availability PostgreSQL cluster also replicates data to 3 different instances.
If you run a standard 3-node CNPG cluster on top of standard Longhorn volumes, you are writing every single byte of data 9 times (3 DB replicas × 3 Storage replicas).
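The multiplication is worth making explicit. A tiny sketch of the write-amplification arithmetic:

```python
def data_copies(db_replicas: int, storage_replicas: int) -> int:
    """Each database replica writes its own copy of the data, and the
    storage layer then replicates every one of those writes."""
    return db_replicas * storage_replicas

# Default stacking: 3 CNPG instances on 3-replica Longhorn volumes
print(data_copies(3, 3))  # → 9

# Letting CNPG own HA: 3 instances on single-replica volumes
print(data_copies(3, 1))  # → 3
```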
| Setup | Data Copies | Performance | Reliability |
|---|---|---|---|
| Default (Longhorn 3 + CNPG 3) | 9 | Very Slow (High Latency) | Extreme (Overkill) |
| Minimal (Longhorn 3 + CNPG 1) | 3 | Stale Data Potential | Low (No DB failover) |
| Optimized (Longhorn 1 + CNPG 3) | 3 | Fast (Local Speed) | High (Standard) |
The Solution: We need to let the Application (CNPG) handle the High Availability, effectively treating the storage layer as “ephemeral” local disks.
## Implementation Guide

### 1. The Optimized StorageClass

First, we define a specific Longhorn StorageClass that replicates the behavior of a local SSD. We set `numberOfReplicas` to 1 and force `dataLocality` to `strict-local`.

File: `base/longhorn/storage-class-cnpg.yaml`
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-cnpg
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "1"
  staleReplicaTimeout: "2880" # 48 hours
  fromBackup: ""
  fsType: "ext4"
  dataLocality: "strict-local" # Critical for performance
```

With `strict-local`, Longhorn keeps the data on the same node as the Pod. If the node dies, we lose that specific volume, and that is okay. CNPG will detect the failure, promote one of the other two standby instances to Primary, and eventually spin up a new replica to replace the lost one.
### 2. Deploying the CNPG Operator

We use Argo CD to manage the operator lifecycle. This ensures our “robotic sysadmin” handles updates and configuration drift.

File: `base/database/cnpg-operator/cnpg-operator.yaml`
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cloudnative-pg-operator
  namespace: argocd
spec:
  destination:
    namespace: cnpg-system
    server: https://kubernetes.default.svc
  project: argo-config
  sources:
    - repoURL: https://cloudnative-pg.github.io/charts
      chart: cloudnative-pg
      targetRevision: 0.27.1
      helm:
        releaseName: cnpg-operator
        values: |
          config:
            clusterWide: true # Manage DBs in all namespaces
```

### 3. Deploying a Database Cluster (The GitOps Way)
Now we can request a database. Unlike a traditional VM where you might install one Postgres server and create 50 databases inside it, the Kubernetes pattern is one Cluster per Microservice. This ensures isolation: if one app goes rogue and eats all the CPU, it doesn’t take down the others.
To manage this at scale without copying 500 lines of YAML for every microservice, we use a Kustomize Overlay strategy with Argo CD. We define a “Base Application” that knows how to deploy a standard CNPG cluster, and then we just patch the specifics (name, namespace, storage size) for each app.
#### Step 3.1: The Base Application

This manifest tells Argo CD how to deploy a generic CNPG cluster using the official Helm chart.

File: `base/database/cnpg-cluster/cnpg-cluster.yaml`
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cnpg-cluster
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  destination:
    namespace: cnpg-system
    server: https://kubernetes.default.svc
  project: argo-config
  sources:
    - repoURL: https://github.com/anvaplus/homelab-k8s-argo-config.git
      targetRevision: main
      ref: valuesRepo
    - repoURL: https://cloudnative-pg.github.io/charts
      chart: cluster
      targetRevision: 0.5.0
      helm:
        releaseName: cnpg-cluster
        valueFiles:
          - $valuesRepo/base/database/cnpg-cluster/values.yaml
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

#### Step 3.2: The Kustomize Overlay (e.g., Keycloak)
When we need a database for `keycloak`, we don’t start from scratch. We simply patch the base application to update the target namespace and values file.

File: `environments/dev/database/cnpg-cluster/clusters/keycloak/kustomization.yaml`
```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../../../../../base/database/cnpg-cluster/
patches:
  - target:
      group: argoproj.io
      version: v1alpha1
      kind: Application
      name: cnpg-cluster
    patch: |
      - op: replace
        path: "/spec/sources/1/targetRevision"
        value: "0.5.0"
      - op: add
        path: "/spec/sources/1/helm/valueFiles/-"
        value: "$valuesRepo/environments/dev/database/cnpg-cluster/clusters/keycloak/override.values.yaml"
      - op: replace
        path: /metadata/name
        value: cnfg-cluster-keycloak
      - op: replace
        path: /spec/destination/namespace
        value: keycloak
      - op: replace
        path: /spec/sources/1/helm/releaseName
        value: cnfg-cluster-keycloak
```

#### Step 3.3: The Configuration Values
Finally, we define the actual database configuration in override.values.yaml. This is where we reference our optimized storage class.
File: `environments/dev/database/cnpg-cluster/clusters/keycloak/override.values.yaml`
```yaml
type: postgresql
mode: standalone
version:
  postgresql: "16"
cluster:
  instances: 3
  storage:
    size: 1Gi
    storageClass: "longhorn-cnpg" # The magic happens here
backups:
  enabled: false # Or true if configured
```

This tiered approach allows us to spin up production-ready, high-availability databases in seconds by adding just two small files to our GitOps repo.
## A Note on Backups

**Do not use Longhorn Snapshots for Databases.**
Snapshots happen at the block level. If you snapshot a running database while it’s flushing memory to disk, you risk capturing a corrupted state. Always use the database’s native backup tools. CNPG integrates with Barman, which streams the Write-Ahead Logs (WAL) to object storage. This allows for Point-In-Time Recovery (PITR)… you can literally restore your database to the state it was in at 14:03:22 yesterday.
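When the time comes to switch `backups.enabled` on, the relevant knobs live in the Cluster spec. A sketch of what that could look like (the bucket path, endpoint, and Secret name are placeholders; the field names follow the CNPG `barmanObjectStore` API):

```yaml
spec:
  backup:
    retentionPolicy: "30d"
    barmanObjectStore:
      destinationPath: s3://backups/keycloak   # placeholder bucket path
      endpointURL: https://s3.example.lan      # placeholder S3-compatible endpoint
      s3Credentials:
        accessKeyId:
          name: backup-creds                   # placeholder Secret
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: backup-creds
          key: ACCESS_SECRET_KEY
```

Barman then archives WAL segments continuously to the object store, which is what makes PITR possible.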
## Conclusion
By combining the CNPG Operator with a tuned Longhorn StorageClass, we have achieved a setup that rivals enterprise RDS offerings:
- High Availability: Automated failover in seconds.
- Performance: Near-native disk speeds using `strict-local` storage.
- Resilience: Automated backups and self-healing.
As always, all the code and configuration files discussed in this post are available in my GitHub repository.
With networking, storage, and now a robust database layer in place, we have cleared all the infrastructure hurdles. In the next post, we will finally deploy our first major application: Keycloak, the Identity Provider that will secure our entire platform.
Stay tuned!

Andrei

