Implementing Change Data Capture (CDC) in Databricks
Change Data Capture (CDC) is a process used to track changes made to data in a database. In Databricks, CDC can be simplified using Delta Live Tables, which provide APIs like APPLY CHANGES and APPLY CHANGES FROM SNAPSHOT to manage these changes efficiently.
Using APPLY CHANGES API
The APPLY CHANGES API is used to process changes from a change data feed (CDF). It supports both SCD Type 1 and Type 2 for updating tables. SCD Type 1 updates records directly without retaining history, while SCD Type 2 retains a history of changes.
Example with Python
import dlt from pyspark.sql.functions import col, expr @dlt.view def users(): return spark.readStream.table("cdc_data.users") dlt.create_streaming_table("target") dlt.apply_changes( target = "target", source = "users", keys = ["userId"], sequence_by = col("sequenceNum"), apply_as_deletes = expr("operation = 'DELETE'"), except_column_list = ["operation", "sequenceNum"], stored_as_scd_type = "2", track_history_except_column_list = ["city"] )
Example with SQL
-- Create and populate the target table. CREATE OR REFRESH STREAMING TABLE target; APPLY CHANGES INTO target FROM stream(cdc_data.users) KEYS (userId) APPLY AS DELETE WHEN operation = "DELETE" SEQUENCE BY sequenceNum COLUMNS * EXCEPT (operation, sequenceNum) STORED AS SCD TYPE 2 TRACK HISTORY ON * EXCEPT (city)
Frequently Asked Questions
- Q: What is the main difference between APPLY CHANGES and APPLY CHANGES FROM SNAPSHOT?
A: APPLY CHANGES processes changes from a change data feed, while APPLY CHANGES FROM SNAPSHOT processes changes from database snapshots.
- Q: Can I use DML statements to modify streaming tables created by APPLY CHANGES?
A: Yes, but only in a shared Unity Catalog cluster or a SQL warehouse using Databricks Runtime 13.3 LTS and above. Ensure that your DML statements do not attempt to evolve the table schema.
- Q: How do I handle out-of-sequence records in CDC?
A: Delta Live Tables simplifies handling out-of-sequence records by using the SEQUENCE BY clause to order changes correctly.
- Q: Can I use APPLY CHANGES FROM SNAPSHOT with SQL?
A: Currently, APPLY CHANGES FROM SNAPSHOT is only supported in the Delta Live Tables Python interface.
- Q: What is the purpose of the except_column_list parameter in APPLY CHANGES?
A: The except_column_list parameter specifies columns that should be excluded from the target table.
- Q: How do I display HTML content in a Databricks notebook?
A: You can use the displayHTML function in Databricks to display HTML content.
- Q: Can I use Delta Live Tables with external data sources like Fivetran or Debezium?
A: Yes, Delta Live Tables can be used with external data sources like Fivetran or Debezium to ingest CDC data.
Bottom Line
Implementing CDC in Databricks using Delta Live Tables simplifies the process of capturing and managing changes in data. The APPLY CHANGES and APPLY CHANGES FROM SNAPSHOT APIs provide flexible options for handling CDC data, making it easier to maintain data integrity and history.