Question: What is Kafka Tiered Storage?
Answer
Kafka Tiered Storage is a feature introduced as early access in Apache Kafka 3.6.0 (KIP-405). It is designed to reduce storage costs and improve scalability by splitting a topic's log storage into two tiers: local and remote.
Background and Motivation
In traditional Kafka clusters, each broker has local storage attached, and data is stored only on these local disks. As data volumes increase, scaling storage becomes costly and operationally complex. Adding more disks or larger disks to brokers can be expensive, and managing multiple log directories (JBOD) adds operational overhead. Tiered Storage addresses these issues by providing a built-in mechanism to move log segments from local storage to remote storage, thereby decoupling storage from compute resources.
How it Works
- Data Storage Tiers:
  - Local Tier: The faster, more expensive storage that holds the most recent data. It uses the local disks attached to Kafka brokers.
  - Remote Tier: The slower, cheaper storage that holds historical data. Examples include object stores and distributed file systems such as Amazon S3, Azure Blob Storage, or HDFS.
- Data Movement:
  - Data is initially written to the local tier as log segments.
  - Once a segment is closed (rolled), it is copied asynchronously to the remote tier. The local copy is deleted when local retention (log.local.retention.ms) expires, and the remote copy is deleted when overall retention (log.retention.ms) expires. Metadata about the remote segments is stored in an internal topic, allowing Kafka to locate the data when needed.
- Data Retrieval:
  - When a consumer requests data, Kafka first checks whether it is still available in the local tier. If not, the broker fetches it from the remote tier, caching the remote segment's index files locally to speed up later lookups.
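The lifecycle described above can be illustrated with a toy model. This is a minimal sketch, not Kafka's implementation: the names Segment, run_tiering_pass, and locate are hypothetical, the list holds only closed (rolled) segments, and copying to the remote tier is assumed to succeed instantly.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    base_offset: int
    end_offset: int
    closed_at_ms: int        # hypothetical field: when the segment was rolled
    in_local: bool = True
    in_remote: bool = False


def run_tiering_pass(segments, now_ms, local_retention_ms, retention_ms):
    """Apply one copy / local-delete / full-delete pass to rolled segments."""
    survivors = []
    for seg in segments:
        age_ms = now_ms - seg.closed_at_ms
        # Step 1: copy the rolled segment to the remote tier (assumed instant).
        if not seg.in_remote:
            seg.in_remote = True
        # Step 2: drop the local copy only once it exists remotely
        # and local retention has expired.
        if seg.in_local and seg.in_remote and age_ms > local_retention_ms:
            seg.in_local = False
        # Step 3: past overall retention, the segment is removed from both tiers.
        if age_ms > retention_ms:
            continue
        survivors.append(seg)
    return survivors


def locate(segments, offset):
    """Return which tier would serve a fetch for the given offset."""
    for seg in segments:
        if seg.base_offset <= offset <= seg.end_offset:
            return "local" if seg.in_local else "remote"
    return "not found"
```

In this model a fetch for a recent offset resolves to "local", an older offset whose local copy has expired resolves to "remote", and an offset past overall retention resolves to "not found".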
Benefits
- Cost Optimization: By using high-performance storage for latency-sensitive data and lower-cost storage for less frequently accessed data, overall storage costs are reduced.
- Scalability: Compute and storage resources can be scaled independently, allowing for more efficient cluster management and reduced operational complexity.
- Elasticity: This feature enables longer data retention without the need for separate data pipelines, making Kafka a viable option for long-term storage.
Limitations
- Early Access: As of Kafka 3.6.0, Tiered Storage is in early access and is not recommended for production use.
- Unsupported Features: It does not support compacted topics or brokers with multiple log directories (JBOD). As of 3.6.0, once it is enabled for a topic, it cannot be disabled for that topic.
- Configuration: It must be enabled explicitly with remote.log.storage.system.enable at the cluster level and remote.storage.enable at the topic level.
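These constraints can be checked before a topic is created. The sketch below is a hypothetical pre-flight helper (check_tiered_topic_config is not part of Kafka). It assumes topic-level configs are passed as a dict of strings, that local.retention.ms defaults to -2 (meaning "use retention.ms"), and that a negative retention.ms means no time limit. The rule that local retention should not exceed overall retention is modelled here as an assumption about sensible configuration.

```python
def check_tiered_topic_config(config):
    """Return a list of problems found in a topic-level config dict."""
    problems = []
    # Nothing to check unless tiered storage is requested for the topic.
    if config.get("remote.storage.enable", "false") != "true":
        return problems

    # Compacted topics are listed as unsupported with tiered storage.
    policies = {p.strip() for p in config.get("cleanup.policy", "delete").split(",")}
    if "compact" in policies:
        problems.append("compacted topics are not supported with tiered storage")

    # Assumed defaults: 7 days overall retention, -2 = inherit retention.ms.
    retention_ms = int(config.get("retention.ms", "604800000"))
    local_ms = int(config.get("local.retention.ms", "-2"))
    if local_ms >= 0 and retention_ms >= 0 and local_ms > retention_ms:
        problems.append("local.retention.ms should not exceed retention.ms")
    return problems
```

An empty list means the helper found nothing to flag; it does not guarantee that the broker will accept the configuration.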
Implementation
To enable Tiered Storage, you need to configure your Kafka cluster and topics accordingly. Here are the key steps:
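- Enable Tiered Storage at the Cluster Level (broker configuration):
  remote.log.storage.system.enable=true
- Enable Tiered Storage at the Topic Level (topic configuration):
  remote.storage.enable=true
- Configure Retention Policies (broker-level defaults shown; the topic-level equivalents are local.retention.ms and retention.ms):
  # Example: keep segments on local disk for 1 second
  log.local.retention.ms=1000
  # Example: keep data for 1 day overall
  log.retention.ms=86400000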
With these settings, segments are removed from local storage about 1 second after they become eligible, while the data remains available from the remote tier until the 1-day overall retention expires.
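As an illustration of the topic-level step, the following sketch assembles the arguments for a kafka-topics.sh invocation that creates a tiered topic. It only builds the argument list and does not run anything. The helper names, the topic name tiered-demo, and the address localhost:9092 are hypothetical, and the topic-level keys local.retention.ms and retention.ms are used in place of the broker-level defaults.

```python
def hours_to_ms(hours):
    """Convert hours to milliseconds, the unit Kafka retention settings use."""
    return int(hours * 60 * 60 * 1000)


def tiered_topic_create_args(topic, bootstrap_server, local_retention_ms, retention_ms):
    """Build the argument list for creating a topic with tiered storage enabled."""
    configs = {
        "remote.storage.enable": "true",
        "local.retention.ms": str(local_retention_ms),
        "retention.ms": str(retention_ms),
    }
    args = [
        "kafka-topics.sh", "--create",
        "--topic", topic,
        "--bootstrap-server", bootstrap_server,
    ]
    # Each topic-level override is passed as its own --config key=value pair.
    for key, value in configs.items():
        args += ["--config", f"{key}={value}"]
    return args


# Hypothetical topic name and broker address, for illustration only.
print(" ".join(tiered_topic_create_args(
    "tiered-demo", "localhost:9092", 1000, hours_to_ms(24))))
```

The cluster-level switch remote.log.storage.system.enable=true still has to be set in the broker configuration before a command like this can succeed.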
Conclusion
Kafka Tiered Storage is a significant feature that enhances the scalability, cost efficiency, and operational simplicity of Kafka clusters. While it is still in early access and has some limitations, it offers promising benefits for managing large volumes of data effectively.