I Tried to Secure My Self-Hosted Nextcloud. It Didn't Go Well.

A homelab horror story about an Unraid update, a perfectly fine Nextcloud, and why you should never touch a working system at 10pm.

How It Actually Started

It started with an Unraid update.

I upgraded my Unraid server to version 7.3 — a routine update, didn’t think twice about it — and the next thing I knew, Nextcloud wasn’t loading. My immediate assumption: something broke in Nextcloud. I had just finished deploying it to my K3s cluster, moved on to other homelab chores, and came back to find it dead.

So I started digging into Nextcloud. Logs, pod states, exec into containers. The more I looked, the more I found things I wanted to fix. And since I had just installed it fresh and hadn’t really locked it down yet, I thought: this is a good time to do a proper security audit.

What followed was an entire day of chasing my own tail.

The Setup

I run a 3-node Kubernetes cluster on three Lenovo M910q mini PCs in Prague. Nextcloud is backed by PostgreSQL and Redis, with data on an NFS share from an Unraid server. Everything is managed by ArgoCD with GitOps, secrets encrypted with SOPS+Age.

It was working fine. Then an Unraid update happened, I thought Nextcloud was broken, and I decided to make it more secure.

Act 1: The Unraid Upgrade Nobody Asked About

The actual root cause revealed itself early: the Unraid 7.3 upgrade had silently broken my NFS server. rpcbind was no longer starting on boot, and the NFS exports had been reset to all_squash, which maps every client to the anonymous user, so my init container’s chown commands were being silently ignored.

Forty minutes of digging through /etc/exports, showmount errors, and Unraid forums later, NFS was back. I added a startup hook to /boot/config/go so it survives reboots. Crisis one averted.

But by this point I had already decided to do the security audit. So I kept going.

Act 2: The Security Audit

The findings were reasonable:

No Redis password (anyone on the cluster could connect)
No NetworkPolicies (pods could talk to anything)
No liveness/readiness probes
No securityContext (containers running with unnecessary privileges)
localhost in trusted domains
Federation app enabled unnecessarily

Good findings. Let’s fix them all at once. What could go wrong?

Act 3: Everything, Simultaneously

Problem 1: The Redis Password Trap

I generated a Redis password with openssl rand -base64 32. Redis rejected every connection.

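The issue: the base64 string itself, with its +, / and = characters, is perfectly valid text, but that isn’t what reached Redis. When Kubernetes base64-decodes the secret and injects it as an environment variable, what comes out is the 32 raw random bytes underneath, and those aren’t valid UTF-8. The container runtime refused to inject the variable at all.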

Fix: openssl rand -hex 32. Hex is always valid ASCII.
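One way to keep encoding surprises out of the picture is to hand Kubernetes the literal string and let it do the encoding. Values under a Secret's `data` field must be base64 and are decoded before use, while values under `stringData` are taken as-is. A minimal sketch, assuming a Secret named `redis-auth` and a key named `REDIS_PASSWORD` (both hypothetical names, not taken from the actual repo):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: redis-auth            # hypothetical name
  namespace: nextcloud
type: Opaque
stringData:
  # Paste the output of: openssl rand -hex 32
  # stringData takes the literal value; Kubernetes base64-encodes it on write.
  REDIS_PASSWORD: "<64 hex characters go here>"
```

In a SOPS+Age setup this file would be encrypted before it is committed, so the placeholder never appears in git in plain text.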

Problem 2: Dropping Capabilities Breaks Everything

I added capabilities: drop: ["ALL"] to all three containers. Security textbooks say this is correct. The containers disagreed.

PostgreSQL crashed immediately. The official postgres image runs its entrypoint as root, does chmod on the data directory, then drops to the postgres user. With ALL capabilities dropped, the chmod fails.

Nextcloud crashed more subtly. Apache prefork MPM starts as root to bind port 80, then uses setuid/setgid to drop to www-data. Without SETUID and SETGID, the workers can’t switch users. Apache fails silently.

Redis was fine — the alpine image starts directly as the redis user. No privilege dropping needed.

The lesson: security contexts aren’t one-size-fits-all. You need to understand what the container’s entrypoint actually does before restricting it.
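A common pattern that follows from this is to drop everything and then add back only what the entrypoint demonstrably needs. The fragment below is an untested sketch for the Nextcloud container: SETUID and SETGID come from the Apache behaviour described above, while NET_BIND_SERVICE is an assumption for binding port 80, and the image's entrypoint may well need more (CHOWN or DAC_OVERRIDE, for instance). The postgres container would need its own list.

```yaml
# Fragment of a container spec, not a complete manifest
securityContext:
  capabilities:
    drop: ["ALL"]
    add:
      - SETUID            # Apache workers switch to www-data
      - SETGID
      - NET_BIND_SERVICE  # assumption: required to bind port 80
```

Adding capabilities back one at a time, and restarting the pod after each, shows exactly which one the entrypoint was missing.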

Problem 3: NetworkPolicy Forgot UDP

I wrote egress rules allowing DNS on port 53. TCP only.

DNS lookups go over UDP by default, with TCP only as a fallback. Nextcloud couldn’t resolve postgres.nextcloud.svc.cluster.local. It took twenty minutes of confusion before I noticed the missing protocol: UDP line.
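The mistake is easy to make because a NetworkPolicy port with no `protocol` field defaults to TCP. A minimal sketch of a DNS egress rule that allows both protocols, assuming CoreDNS runs in `kube-system` (the K3s default) and that the namespace carries the automatic `kubernetes.io/metadata.name` label; the policy name is hypothetical:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress      # hypothetical name
  namespace: nextcloud
spec:
  podSelector: {}             # applies to every pod in the namespace
  policyTypes: ["Egress"]
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP       # normal DNS queries
          port: 53
        - protocol: TCP       # fallback for large responses
          port: 53
```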

Problem 4: ArgoCD Is Always Watching

I applied fixes directly with kubectl apply to test them quickly. ArgoCD has selfHeal: true. Within 3 minutes it had reverted everything to the state in git.

With GitOps, the only source of truth is git. kubectl apply is not a fix — it’s a temporary hallucination that ArgoCD will cure.
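If a quick manual experiment really is needed, one option is to switch off self-healing for that single application first and turn it back on once the change is committed. The fragment below is a sketch, assuming an Application named `nextcloud` in the `argocd` namespace (hypothetical names). If the Application itself is reconciled from git by a parent app, this edit has to go through git as well.

```yaml
# Fragment of an Argo CD Application; source, destination and project omitted
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: nextcloud             # hypothetical name
  namespace: argocd
spec:
  syncPolicy:
    automated:
      selfHeal: false         # was true; restore after the change is in git
```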

Problem 5: The Probes That Broke Everything

I added readiness and liveness probes hitting /status.php. PHP hung on the first request. Apache’s worker pool filled up with stuck requests. The probe kept firing, kept hanging, and no workers were ever free to serve traffic.

Symptoms: 503 Service Unavailable with the pod showing 1/1 Running.
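A gentler probe setup might keep the PHP check for readiness only and use a plain TCP check for liveness, so a hung PHP request doesn't also trigger restarts. This is a sketch with illustrative numbers, not tuned values, and it does not fix a backend that hangs on every request:

```yaml
# Fragment of the Nextcloud container spec
readinessProbe:
  httpGet:
    path: /status.php
    port: 80
  initialDelaySeconds: 30     # illustrative values throughout
  periodSeconds: 30           # probe less often so fewer workers get tied up
  timeoutSeconds: 5
  failureThreshold: 3
livenessProbe:
  tcpSocket:
    port: 80                  # only checks that Apache accepts connections
  periodSeconds: 30
  timeoutSeconds: 5
```

A longer `periodSeconds` limits how many workers the probe itself can occupy when requests stall.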

Act 4: The Real Problem

After stripping everything back, the pod was running but every HTTP request timed out. PHP CLI worked fine. The logs told the story:

MISCONF Redis is configured to save RDB snapshots, but it's currently
unable to persist to disk. Commands that may modify the data set are
disabled, because this instance is configured to report errors during
writes if RDB snapshotting fails (stop-writes-on-bgsave-error option).

Redis was in a locked state. When a Redis BGSAVE fails, Redis refuses all write commands to protect data integrity. It had been silently failing to write its RDB snapshot because the container had no persistent volume for it.

Every PHP request touching Redis — sessions, file locks, caching — hung waiting for a write that would never be acknowledged. Apache workers filled up. Cloudflare returned a 524 timeout.

The fix was one flag:

redis-server --requirepass $(REDIS_PASSWORD) --save ""

--save "" disables RDB persistence. Redis for Nextcloud is a pure cache and session store — there’s no data to protect. The persistence feature was never needed and silently poisoned the whole stack.
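In a container spec, `$(REDIS_PASSWORD)` is expanded by Kubernetes from the container's environment, not by a shell. A sketch of how the command might look in the Deployment, assuming the password lives in a Secret called `redis-auth` under the key `REDIS_PASSWORD` and that the image tag is `redis:alpine` (all three are assumptions):

```yaml
# Fragment of the Redis pod template
containers:
  - name: redis
    image: redis:alpine       # assumption: exact tag not given in the post
    args:
      - redis-server
      - --requirepass
      - $(REDIS_PASSWORD)     # expanded by Kubernetes from env below
      - --save
      - ""                    # empty string disables RDB snapshots
    env:
      - name: REDIS_PASSWORD
        valueFrom:
          secretKeyRef:
            name: redis-auth  # hypothetical Secret name
            key: REDIS_PASSWORD
```

Using `args` rather than `command` keeps the image's own entrypoint in place.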

The Full Damage Report

| What I tried | What broke | Why |
| --- | --- | --- |
| Redis password with base64 | Redis rejected connections | Raw bytes aren’t valid UTF-8 env vars |
| `capabilities: drop: ALL` on postgres | Crash loop | Entrypoint needs root to chmod data dir |
| `capabilities: drop: ALL` on Nextcloud | Apache silent failure | Needs SETUID/SETGID to drop to www-data |
| `allowPrivilegeEscalation: false` | `Permission denied` on config.php | Same root cause |
| DNS egress policy (TCP only) | Can’t resolve postgres | DNS queries default to UDP |
| `kubectl apply` to test changes | Changes silently reverted | ArgoCD selfHeal enforces git state |
| Readiness/liveness probes | 503 with pod “healthy” | Hung workers fill the pool |
| Redis RDB persistence (default) | All PHP requests hang | BGSAVE fails → Redis blocks all writes |

The Takeaway

This whole day started because I didn’t check whether an update to a completely different system could affect a dependent service. Unraid updated, NFS broke, Nextcloud went down, and I assumed the problem was Nextcloud.

Securing a running system is harder than securing a new one. Every container has assumptions baked into its entrypoint. Every framework has defaults that are traps in containerized environments. And every automated system has opinions it will enforce aggressively.

The security audit was worth doing. Most of the findings were real. But “apply all fixes simultaneously after midnight” is how you spend a full day chasing your own tail.

Next time: check the other systems first. Then one change, test, commit, repeat.

Posted from Prague, sometime after midnight, while Nextcloud finally loads.