
Report: Redis failure modes and how they cause data loss

11/12/2025

Executive summary

Redis is fast and widely used, but it is not immune to data loss. Across persistence (AOF/RDB), asynchronous replication and failover, filesystem limits, operator misconfiguration, and software bugs, Redis has several well-documented failure modes that can cause partial or total loss of data. This report stitches together evidence from operator guides, bug reports, research analyses, and postmortems to show where Redis is resilient and where it can fail.

The key failure modes (what goes wrong and why)

1) Asynchronous replication and failover windows

  • What happens: Redis replication is asynchronous by default. The master acknowledges writes before replicas confirm them, so recent writes can exist only on the master when it crashes.
  • How that causes loss: If the master fails and a replica is promoted, any writes not yet replicated are lost.

"Because the master doesn’t wait for replicas to acknowledge writes (in the default configuration), if the master fails, some of the last commands it processed might not have made it to a replica. Those writes are lost." (Redis discussion/guide)

Practical impact: short windows of data loss on failover, made worse by replication lag, network partitions, or heavy write load. Mitigations include requiring replica acknowledgement with the WAIT command (synchronous-like behavior), increasing replica count, and careful topology design. See how does Redis replication lead to data loss?.
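The failover loss window described above can be illustrated with a toy in-memory model. This is not Redis or its client API; it is a minimal simulation of the one property that matters here: the master acknowledges a write before the replica has a copy of it.

```python
class ToyAsyncMaster:
    """Toy model: acknowledges writes before replicating them,
    mimicking Redis's default asynchronous replication."""

    def __init__(self):
        self.data = {}
        self.pending = []  # writes acknowledged to clients but not yet replicated

    def write(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))
        return "OK"  # client sees success immediately

    def replicate(self, replica, n=1):
        # Asynchronously ship up to n pending writes to the replica.
        for _ in range(min(n, len(self.pending))):
            key, value = self.pending.pop(0)
            replica[key] = value


master = ToyAsyncMaster()
replica = {}

master.write("a", 1)
master.write("b", 2)
master.replicate(replica, n=1)   # replication lags: only "a" reaches the replica
master.write("c", 3)

# Master crashes here and the replica is promoted:
promoted = replica
print(promoted)  # "b" and "c" were acknowledged to clients but are gone
```

Every write returned "OK" to the client, yet after the simulated failover only the replicated key survives; that gap is exactly the window WAIT is designed to shrink.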

2) AOF rewrite and append-only file corruption

  • What happens: AOF persistence appends every write to a log file. Periodically Redis performs a background rewrite (BGREWRITEAOF) that writes a compacted AOF file, then replaces the old file.
  • How that causes loss: Fork failures, disk full during rewrite, or corruption of the rewritten file can leave Redis with an incomplete or corrupted AOF. Repair tools may discard trailing operations.

"If the AOF file is not just truncated, but corrupted with invalid byte sequences in the middle, things are more complex... The best thing to do is to run the redis-check-aof utility..." (redis-check-aof docs and troubleshooting)

Examples and evidence: operators report AOF rewrites causing excessive memory use (writes are buffered during the rewrite) and disk exhaustion. The repair tool redis-check-aof --fix may discard all data after the corruption point, producing data loss (troubleshooting AOF rewrite issues). See what happens during AOF rewrite and failure modes?.
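The rewrite behavior above is governed by a handful of redis.conf directives; a minimal sketch (directive names are standard, values illustrative):

```
# redis.conf — AOF settings relevant to rewrite failures
appendonly yes                   # enable the append-only file
appendfsync everysec             # fsync once per second (bounded loss window)
auto-aof-rewrite-percentage 100  # rewrite when the AOF doubles since last rewrite
auto-aof-rewrite-min-size 64mb   # ...but never rewrite below this size
aof-use-rdb-preamble yes         # hybrid format: faster rewrites and restarts
```

If a rewrite does leave a corrupted file, the quoted guidance applies: run redis-check-aof (with --fix to truncate at the corruption point, accepting loss of everything after it).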

3) RDB snapshot gaps

  • What happens: RDB snapshotting writes point-in-time dumps at configured intervals. Snapshots are cheap for restart speed but coarse-grained.
  • How that causes loss: Any writes between the last snapshot and a crash are lost. Less frequent snapshots increase potential loss.

"Setting RDB snapshots to occur every 15 minutes or more... heightens the risk of data loss if Redis crashes shortly after a snapshot." (RDB persistence guidance)

Mitigations: use AOF (or AOF+RDB hybrid), tune snapshot frequency, and combine with replication. See how to tune RDB and AOF for durability.
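The snapshot schedule is controlled by save lines in redis.conf; a sketch using classic default-style thresholds (tune to your loss tolerance):

```
# redis.conf — snapshot schedule: "save <seconds> <changes>"
save 900 1       # snapshot if >= 1 key changed in 15 minutes
save 300 10      # ...or >= 10 keys changed in 5 minutes
save 60 10000    # ...or >= 10000 keys changed in 1 minute
# save ""        # disables snapshots entirely — see section 8 below
```

The worst-case RDB-only loss is everything written since the last completed snapshot, which is why pairing RDB with AOF is the usual durability recommendation.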

4) Fsync semantics and the OS/filesystem trust

  • What happens: Redis issues fsync calls according to the appendfsync policy (always/everysec/no). Historically, Redis trusted the OS to report success and, in some code paths, did not check fsync return codes.
  • How that causes loss: If fsync silently fails, or data is still sitting in a volatile write cache (the OS page cache, or a disk cache without battery backup) when power is lost, Redis may report a successful write that was never persisted to disk. Choosing appendfsync no or everysec explicitly accepts the risk of losing recent writes.

"Redis trusts the file system to successfully persist the data and does not check the fsync return code." (research notes)

Mitigation: choose appendfsync always (higher latency), ensure reliable storage with write barriers or battery-backed caches, and monitor disk health. See what are appendfsync tradeoffs?.
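The three policies map to a single redis.conf directive; a sketch of the tradeoff (exactly one line should be active):

```
# redis.conf — appendfsync policies and their approximate loss windows
appendfsync always      # fsync on every write: strongest durability, highest latency
# appendfsync everysec  # fsync once per second: up to ~1s of writes lost on crash
# appendfsync no        # let the OS flush when it likes: largest loss window
```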

5) Disk full, permission, and filesystem failures

  • What happens: AOF rewrites and RDB dumps require disk space. If disk fills, writes fail; if permissions or hardware errors occur, persistence can fail.
  • How that causes loss: Running out of disk can interrupt rewrite operations, cause corrupted files, or stop Redis from accepting writes, potentially resulting in application-visible data loss.

"ERROR: Could not write to RDB file: No space left on device" (example from operator logs and troubleshooting guides)

Mitigation: monitor disk; provision headroom; use separate disks for AOF temp files; automate alerts.
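A disk-headroom check of the kind such monitoring might run can be a few lines of standard-library Python. The path and threshold here are illustrative; in practice you would point it at the volume holding your AOF/RDB files and wire the result into your alerting:

```python
# Minimal disk-headroom check using only the standard library.
import shutil

def disk_headroom_ok(path=".", min_free_fraction=0.20):
    """Return (ok, free_fraction) for the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    free_fraction = usage.free / usage.total
    return free_fraction >= min_free_fraction, free_fraction

ok, free = disk_headroom_ok(".", min_free_fraction=0.20)
print(f"free={free:.1%} ok={ok}")
```

Remember that an AOF rewrite or BGSAVE can briefly need roughly as much extra space as the existing file, so the threshold should account for that transient doubling.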

6) Memory pressure and eviction policies

  • What happens: Redis is an in-memory database. When the instance reaches maxmemory, eviction policy determines behavior (volatile-lru, allkeys-lru, noeviction, etc.).
  • How that causes loss: With eviction policies that remove keys, user data can be evicted and effectively lost from the dataset. With noeviction the server returns errors for writes, which may drop data at the application layer.

"If your Redis instance hits the memory limit and eviction is enabled, keys will be evicted according to the policy; if eviction is disabled, writes will fail." (Redis memory and eviction docs)

Mitigation: size memory appropriately, choose eviction policies intentionally, and use persistence to recover evicted-but-important data where possible.
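A sketch of the relevant redis.conf directives (the 2gb limit is illustrative):

```
# redis.conf — memory limit and eviction behavior
maxmemory 2gb
maxmemory-policy allkeys-lru    # evict least-recently-used keys across all keys
# maxmemory-policy volatile-lru # only evict keys that have a TTL set
# maxmemory-policy noeviction   # never evict: writes fail with OOM errors instead
```

The policy choice decides which side of the tradeoff you lose on: silent eviction of keys, or visible write errors the application must handle.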

7) Split-brain and network partitions

  • What happens: Network partitions can isolate masters from replicas or subsets of a cluster; without proper quorum rules, different nodes may accept writes.
  • How that causes loss: Divergent writes on partitioned masters can be lost or require complex reconciliation when partition heals. In worst cases, promotion of an outdated replica causes newer writes (on the prior master) to be lost.

"Network partitions can cause split-brain situations, where multiple nodes believe they are the master. This can result in divergent data states and potential data loss." (Jepsen/cluster analyses)

Mitigation: design cluster topology with quorum, multi-AZ Sentinels, and odd counts of voting Sentinels; use Redis Enterprise or Active-Active CRDTs for use cases that must tolerate partitions. See how to prevent split-brain in Redis Cluster?.
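For Sentinel, the quorum is declared per monitored master in sentinel.conf; a sketch assuming three Sentinels and an illustrative master name and address:

```
# sentinel.conf — require 2 of 3 Sentinels to agree the master is down
# before starting a failover ("mymaster" and 10.0.0.1 are illustrative)
sentinel monitor mymaster 10.0.0.1 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 60000
```

With an odd number of Sentinels spread across failure domains, a minority partition cannot reach quorum and therefore cannot promote a second master.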

8) Operator error and misconfiguration

  • What happens: Misconfiguring persistence (disabling save, setting appendfsync no), improperly restoring backups, or accidental flushes can erase or prevent persistence of data.
  • How that causes loss: Misconfiguration can leave Redis with no durable copy, making crashes permanent losses; careless restores or commands (FLUSHALL) can delete production datasets.

"A misconfigured Redis instance with 'save ""' (disabling snapshots) and no AOF enabled will not persist data to disk." (persistence documentation and operator guides)

Mitigation: use configuration management, runbooks, and guardrails (role-based access, require confirmations for destructive commands), and test restores regularly.
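One concrete guardrail is a Redis ACL user that cannot run destructive commands; a sketch (user name, password, and key pattern are all illustrative):

```
# users.acl — an application user stripped of destructive commands
user app-rw on >example-password ~app:* +@read +@write -@dangerous
```

The -@dangerous rule removes FLUSHALL, FLUSHDB, SHUTDOWN, and similar commands from the user's allowed set; administrative access goes through a separate, tightly controlled user instead.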

9) Software bugs and third-party extensions (Redis-Raft, modules, operators)

  • What happens: Bugs in Redis or in ecosystem pieces (Redis-Raft, Kubernetes operators, modules) can introduce logic errors that lead to data loss, crashes, or split-brain.
  • How that causes loss: Jepsen and other audits found critical bugs in Redis-Raft that could cause total data loss on failover; operator bugs or module crashes can leave the cluster in an inconsistent state.

"We found twenty-one issues in development builds of Redis-Raft, including ... split-brain leading to lost updates, and total data loss on any failover." (Jepsen Redis-Raft analysis)

Mitigation: track CVEs and bug reports, run version-pinned tested releases, and prefer battle-tested operators and modules. See are Redis operators safe for production?.

10) Managed/cloud provider pitfalls

  • What happens: Managed Redis (ElastiCache, Memorystore, Azure Cache) simplifies operations but masks underlying tradeoffs and failures (AZ failover, storage performance, backup restore windows).
  • How that causes loss: Provider misconfiguration, restore mistakes, or limits in managed offerings (e.g., snapshot frequency, replication lag across AZs) can result in data loss or larger failover windows.

"Depending on how in-sync the promoted read replica is with the primary node, the failover process can take several minutes. ... Some data loss might occur if you have rapidly changing data. This effect is currently a limitation of Redis replication itself." (AWS ElastiCache guidance)

Mitigation: understand provider SLAs and failure modes, configure multi-AZ, enable backups/AOF where available, and test provider failover/restore procedures.

Real incidents, audits, and evidence (selected excerpts)

"We found twenty-one issues in development builds of Redis-Raft, including partial unavailability in healthy clusters, crashes, infinite loops on any request, stale reads, aborted reads, split-brain leading to lost updates, and total data loss on any failover." (Jepsen analysis of Redis-Raft)

"If an AOF rewrite fails due to disk space issues, Redis continues appending to the existing AOF file. This ongoing growth can eventually exhaust disk space, leading to write failures and potential data loss." (troubleshooting AOF rewrite)

"Because the master doesn’t wait for replicas to acknowledge writes... if the master fails, some of the last commands it processed might not have made it to a replica. Those writes are lost." (Redis replication docs)

"Redis trusts the file system to successfully persist the data and does not check the fsync return code." (research paper notes on filesystem assumptions)

These quotes demonstrate that the failure modes are real, documented, and have produced concrete issues in practice.

Where Redis holds up well

  • Fast write throughput and mature persistence options (AOF/RDB) let you tune durability vs latency.
  • Sentinel and Cluster provide automated failover and sharding; when configured correctly they substantially reduce downtime.
  • Managed offerings (ElastiCache, Memorystore) provide automated backups and multi-AZ deployments that reduce operator burden.

However, "holds up well" is conditional: correct configuration, operational practices, and understanding of tradeoffs are required.

Practical checklist to reduce data-loss risk

  1. Enable AOF (appendfsync everysec or always for critical data); combine with RDB snapshots for faster restarts.
  2. Use WAIT for critical synchronous write guarantees where applicable.
  3. Monitor replication lag, disk usage, fsync errors, and AOF rewrite activity; alert on anomalies.
  4. Deploy Sentinels/replicas across independent failure domains (AZs, machines) and run an odd number of Sentinels.
  5. Provision disk headroom and use separate disks or volumes for AOF temporary files.
  6. Test failover and restore procedures in staging, and rehearse disaster recovery.
  7. Avoid disabling persistence in production; restrict destructive commands with ACLs.
  8. Keep Redis and ecosystem components (operators, modules, Redis-Raft) up to date and follow security/bug advisories.
  9. Consider Active-Active CRDT-based solutions, or strongly consistent alternatives (e.g., etcd, CockroachDB, Consul), where strict durability under partitions is required.

Conclusion and synthesis (the debate)

  • The "Redis Advocate" case: Redis provides powerful tools—AOF, snapshots, Sentinel, Cluster—that let you tune durability and availability to your needs. When configured correctly and run on reliable infrastructure, Redis can meet the durability needs of many production systems and is used successfully at scale by cloud providers and companies (AWS ElastiCache examples).

  • The "Skeptical Sysadmin" case: Redis's default behaviors (asynchronous replication, fsync tradeoffs), the complexity of AOF rewrites, file-system trust, operator errors, and documented bugs (e.g., Redis-Raft failures) mean that Redis can and has lost data. These are not hypothetical: Jepsen audits and operator postmortems show cases where replication and persistence decisions led to loss (Jepsen).

Net: Redis is not inherently durable by default — it is configurable. You must understand the tradeoffs, pick appropriate persistence and replication options, architect for failure domains, and test restores. Treat Redis like a component with defined failure modes, not a magically durable source of truth.


This report includes inline links to deeper topics you may want explored next: how does Redis replication lead to data loss?, what happens during AOF rewrite and failure modes?, what are appendfsync tradeoffs?, how to tune RDB and AOF for durability, how to prevent split-brain in Redis Cluster?, are Redis operators safe for production?