Question: What is Kafka Tiered Storage?

Answer

Kafka Tiered Storage is a feature introduced as early access in Apache Kafka 3.6.0 (KIP-405). It splits a topic's log storage into two tiers, local and remote, to reduce storage costs and improve scalability.

Background and Motivation

In traditional Kafka clusters, each broker has local storage attached, and data is stored only on these local disks. As data volumes increase, scaling storage becomes costly and operationally complex. Adding more disks or larger disks to brokers can be expensive, and managing multiple log directories (JBOD) adds operational overhead. Tiered Storage addresses these issues by providing a built-in mechanism to move log segments from local storage to remote storage, thereby decoupling storage from compute resources.

How it Works

  1. Data Storage Tiers:

    • Local Tier: This is the faster, more expensive storage where the most recent data is stored. It uses the local disks attached to Kafka brokers.
    • Remote Tier: This is the slower, cost-effective storage where historical data is archived. Examples include object stores and distributed file systems such as Amazon S3, Azure Blob Storage, or HDFS.
  2. Data Movement:

    • Data is initially stored in the local tier as log segments.
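    • Once a log segment is closed (rolled), it becomes eligible to be copied asynchronously to the remote tier. The local copy is deleted when it exceeds the local retention (local.retention.ms on the topic, log.local.retention.ms on the broker), while the remote copy is kept until the overall retention (retention.ms on the topic, log.retention.ms on the broker) expires. The metadata of these remote objects is stored in an internal topic (__remote_log_metadata by default), allowing Kafka to retrieve the data when needed.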
  3. Data Retrieval:

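    • When a consumer requests data, the broker first checks whether the data is available in the local tier. If not, the broker fetches it from the remote tier on the consumer's behalf, so consumers need no changes. Depending on the remote storage plugin, fetched data or indexes may be cached locally for faster access.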
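
The tiering and read path described above can be sketched as a small model. This is a simplified illustration, not Kafka's implementation: the Segment class and the placement and read_from functions are hypothetical names, and the model assumes that a closed segment has already been copied to the remote tier by the time it is inspected.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    base_offset: int
    age_ms: int            # time since the newest record in the segment
    active: bool = False   # True for the segment still being written


def placement(seg: Segment, local_retention_ms: int, retention_ms: int) -> str:
    """Return which tier(s) hold the segment in this simplified model."""
    if seg.active:
        return "local"          # the active segment is not offloaded
    if seg.age_ms > retention_ms:
        return "deleted"        # past total retention, removed everywhere
    if seg.age_ms > local_retention_ms:
        return "remote"         # local copy expired, remote copy remains
    return "local+remote"       # closed and copied, local copy still kept


def read_from(seg: Segment, local_retention_ms: int, retention_ms: int) -> str:
    """Pick the tier a fetch would be served from: local first, then remote."""
    where = placement(seg, local_retention_ms, retention_ms)
    if where == "deleted":
        raise LookupError(f"offset {seg.base_offset} is no longer retained")
    return "local" if where.startswith("local") else "remote"
```

With a local retention of 1 second and a total retention of 1 day, a closed segment that is one minute old would be served from the remote tier in this model, while the active segment is always served locally.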

Benefits

  • Cost Optimization: By using high-performance storage for latency-sensitive data and lower-cost storage for less frequently accessed data, overall storage costs are reduced.
  • Scalability: Compute and storage resources can be scaled independently, allowing for more efficient cluster management and reduced operational complexity.
  • Elasticity: This feature enables longer data retention without the need for separate data pipelines, making Kafka a viable option for long-term storage.
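
To make the cost argument concrete, here is a back-of-the-envelope sketch. All numbers are placeholders rather than real prices, the function names are hypothetical, and the model assumes that local data is stored once per replica while the remote tier holds a single copy of everything. Request and transfer fees are ignored.

```python
def monthly_cost_without_tiering(total_gb: float, replication: int,
                                 local_price: float) -> float:
    """Every byte is kept on broker disks, once per replica."""
    return total_gb * replication * local_price


def monthly_cost_with_tiering(total_gb: float, hot_gb: float, replication: int,
                              local_price: float, remote_price: float) -> float:
    """Hot data stays replicated on broker disks; all data is stored once remotely."""
    local = hot_gb * replication * local_price
    remote = total_gb * remote_price
    return local + remote


# Placeholder inputs, chosen only to make the arithmetic easy to follow:
# 10,000 GB retained, 500 GB kept hot, replication factor 3,
# 0.10 per GB-month for local disk, 0.02 per GB-month for remote storage.
before = monthly_cost_without_tiering(10_000, 3, 0.10)
after = monthly_cost_with_tiering(10_000, 500, 3, 0.10, 0.02)
print(f"without tiering: {before:.2f}, with tiering: {after:.2f}")
```

The saving in this model comes from two effects: only the hot portion is multiplied by the replication factor, and the remaining data is billed at the cheaper remote rate.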

Limitations

  • Early Access: As of Kafka 3.6.0, Tiered Storage is in early access and should not be used for production use cases.
  • Unsupported Features: It does not support compacted topics or multiple log directories on a broker (JBOD). In addition, as of 3.6.0, once Tiered Storage is enabled for a topic, it cannot be disabled for that topic.
  • Configuration: Specific configurations like remote.log.storage.system.enable and remote.storage.enable are required at the cluster and topic levels.
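
The limitations above can be turned into a simple pre-flight check. The sketch below is illustrative only: tiering_problems is a hypothetical helper, and it assumes the broker and topic settings have already been loaded into plain dictionaries of strings. The property names (log.dirs, cleanup.policy, and the two enable flags) are standard Kafka configuration keys.

```python
from typing import Dict, List


def tiering_problems(broker: Dict[str, str], topic: Dict[str, str]) -> List[str]:
    """Return reasons why Tiered Storage would not work for this topic."""
    problems = []

    # Cluster-level switch must be on.
    if broker.get("remote.log.storage.system.enable", "false") != "true":
        problems.append("remote.log.storage.system.enable is not true on the broker")

    # More than one entry in log.dirs means the broker uses JBOD.
    log_dirs = [d for d in broker.get("log.dirs", "").split(",") if d.strip()]
    if len(log_dirs) > 1:
        problems.append("broker uses multiple log directories (JBOD)")

    # Compaction and tiering cannot be combined on the same topic.
    if topic.get("remote.storage.enable", "false") == "true":
        policies = {p.strip() for p in topic.get("cleanup.policy", "delete").split(",")}
        if "compact" in policies:
            problems.append("compacted topics cannot use tiered storage")

    return problems
```

An empty list means none of the three checks failed; it does not guarantee that the remote storage plugin itself is configured correctly.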

Implementation

To enable Tiered Storage, you need to configure your Kafka cluster and topics accordingly. Here are the key steps:

  1. Enable Tiered Storage at the Cluster Level (broker configuration):

    remote.log.storage.system.enable=true
  2. Enable Tiered Storage at the Topic Level (topic configuration):

    remote.storage.enable=true
  3. Configure Retention Policies (topic configuration):

    # Example: keep data on local disks for 1 second
    local.retention.ms=1000
    # Example: keep data for 1 day in total (local and remote)
    retention.ms=86400000

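With these settings, closed segments are copied to remote storage and local copies are removed according to the specified retention policies. The brokers also need a RemoteStorageManager plugin for the chosen remote store, set through remote.log.storage.manager.class.name, since Apache Kafka does not bundle a production implementation.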
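
As one possible way to apply the topic-level settings, the sketch below assembles a kafka-topics.sh command line that creates a topic with tiering enabled. It uses the topic-level property names remote.storage.enable, local.retention.ms, and retention.ms. The helper name create_topic_command, the topic name, and the broker address are assumptions for illustration, and the script only builds and prints the command without running it.

```python
import shlex
from typing import List


def create_topic_command(topic: str, bootstrap: str, local_retention_ms: int,
                         retention_ms: int) -> List[str]:
    """Build the argument list for kafka-topics.sh; nothing is executed here."""
    configs = {
        "remote.storage.enable": "true",
        "local.retention.ms": str(local_retention_ms),
        "retention.ms": str(retention_ms),
    }
    cmd = ["bin/kafka-topics.sh", "--bootstrap-server", bootstrap,
           "--create", "--topic", topic]
    for key, value in configs.items():
        cmd += ["--config", f"{key}={value}"]
    return cmd


# Hypothetical topic name and broker address.
cmd = create_topic_command("tiered-demo", "localhost:9092", 1000, 86_400_000)
print(shlex.join(cmd))
```

Keeping the command as a list makes it straightforward to pass to subprocess.run later, once the broker-level switch and the remote storage plugin are in place.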

Conclusion

Kafka Tiered Storage is a significant feature that enhances the scalability, cost efficiency, and operational simplicity of Kafka clusters. While it is still in early access and has some limitations, it offers promising benefits for managing large volumes of data effectively.
