Apache Kafka is a distributed streaming platform designed to handle real-time data feeds with high throughput and low latency. The question of whether Kafka stores data purely in memory or on disk is not exactly straightforward, because it uses both.
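Kafka writes every message to a log on disk and can retain very large volumes of data for long periods, subject to its retention settings. It is not an in-memory system that loses data when a process shuts down. Messages that have been persisted survive a broker restart, and replicating partitions across brokers protects against the loss of a single machine. By default, however, Kafka does not force every write to the physical disk immediately, so its durability guarantees depend on replication and producer acknowledgement settings (for example, acks=all) rather than on one server alone.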
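To make the persistence idea concrete, here is a minimal Python sketch of an append-only log file that survives a simulated restart. It is an illustration only: the SegmentLog class, its methods, and the length-prefixed record layout are assumptions made for this example and are not Kafka's real on-disk format. The file name merely imitates the zero-padded style Kafka uses for segment files.

```python
import os
import struct
import tempfile


class SegmentLog:
    """Toy append-only log (hypothetical): 4-byte length prefix + payload."""

    def __init__(self, path):
        self.path = path
        # Append mode creates the file if needed and never overwrites old records.
        self._file = open(path, "ab")

    def append(self, payload: bytes) -> None:
        self._file.write(struct.pack(">I", len(payload)) + payload)

    def close(self) -> None:
        self._file.flush()             # push Python's buffer to the OS
        os.fsync(self._file.fileno())  # ask the OS to write it to the device
        self._file.close()

    @staticmethod
    def read_all(path):
        """Read every record back from disk, in the order it was appended."""
        records = []
        with open(path, "rb") as f:
            while True:
                header = f.read(4)
                if len(header) < 4:
                    break
                (size,) = struct.unpack(">I", header)
                records.append(f.read(size))
        return records


def demo_restart():
    path = os.path.join(tempfile.mkdtemp(), "00000000000000000000.log")

    log = SegmentLog(path)
    log.append(b"order-created")
    log.append(b"order-paid")
    log.close()  # simulate the process shutting down

    # A "restarted" process reopens the same file and keeps appending.
    log = SegmentLog(path)
    log.append(b"order-shipped")
    log.close()

    return SegmentLog.read_all(path)
```

Calling demo_restart() returns all three records in order, including the two written before the simulated shutdown, because the data lives in the file rather than in the process's memory.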
However, Kafka also relies heavily on the operating system's page cache, which keeps recently written and recently read parts of the log in memory. This is done for performance: reading from and writing to memory is significantly faster than doing the same operations on disk. Writes land in the page cache first and are flushed to disk by the OS later, and consumers that keep up with producers are usually served straight from memory. As a result, while its storage is primarily on disk, Kafka often performs like an in-memory system.
Here is a simplified view of how Kafka leverages memory:
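The sketch below is a toy Python model, not Kafka's code. It assumes a hypothetical LazyFlushLog class whose flush_interval_messages parameter is loosely modeled on the broker setting log.flush.interval.messages. It shows two things: a write returns once the OS has accepted the bytes, without waiting for the disk, and a reader can see those bytes before any fsync has covered them.

```python
import os
import tempfile


class LazyFlushLog:
    """Toy log (hypothetical): writes go to the OS, fsync runs every N messages."""

    def __init__(self, path, flush_interval_messages):
        # O_APPEND makes every write land at the end of the file.
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.flush_interval_messages = flush_interval_messages
        self.unflushed = 0    # messages written since the last fsync
        self.fsync_calls = 0  # how many times we forced data to disk

    def append(self, line: bytes) -> None:
        # os.write hands the bytes to the kernel, which typically keeps them
        # in the page cache and writes them to disk later.
        os.write(self.fd, line + b"\n")
        self.unflushed += 1
        if self.unflushed >= self.flush_interval_messages:
            self.flush()

    def flush(self) -> None:
        os.fsync(self.fd)  # explicitly force cached data to the device
        self.fsync_calls += 1
        self.unflushed = 0

    def close(self) -> None:
        os.close(self.fd)


def demo_page_cache():
    path = os.path.join(tempfile.mkdtemp(), "segment.log")
    log = LazyFlushLog(path, flush_interval_messages=3)
    for i in range(4):
        log.append(b"message-%d" % i)

    # A reader with its own file handle sees all four messages, including
    # the last one, which no fsync has covered yet.
    with open(path, "rb") as reader:
        visible = reader.read().splitlines()

    result = (visible, log.fsync_calls, log.unflushed)
    log.close()
    return result
```

In this run, four messages are readable after only one fsync, with one message still unflushed. Kafka itself goes further by default and leaves flushing almost entirely to the operating system, relying on replication for durability.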
Keep in mind that this is a simplification, and the actual code in Kafka looks different. The point is that Kafka combines memory and disk to achieve both high performance and reliability.