Debezium Blog

Remember when debugging streaming data pipelines felt like playing detective in a crime scene where the evidence kept moving? Well, grab your magnifying glass, because we’re about to turn you into the Sherlock Holmes of the streaming world. After our introduction to OpenLineage integration with Debezium, it’s time to roll up our sleeves and get our hands dirty with some real detective work. We’ll build a complete order processing pipeline that captures database changes with Debezium, processes them through Apache Flink, and tracks every breadcrumb of data lineage using OpenLineage and Marquez, because losing track of your data is like losing your keys, except infinitely more embarrassing in production.

Case definition

In this showcase, we demonstrate how to leverage lineage metadata to troubleshoot issues in data pipelines. Our e-commerce order processing pipeline, despite its simplicity, effectively illustrates the benefits of lineage metadata for operational monitoring and debugging. We will simulate a configuration change in the Debezium connectors that causes the order processing job to skip records. Using the lineage graph, we’ll navigate through the pipeline components to identify the root cause of the problem and understand how metadata tracking enables faster issue resolution.
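To make the scenario concrete, the sketch below shows the shape of a connector configuration for this pipeline, in Kafka Connect properties form. The connector class is real, but the database coordinates and include list are hypothetical illustration values, and the OpenLineage property names follow Debezium’s OpenLineage integration as described in its documentation; verify them against the version you run.

# Hypothetical order-processing source connector (illustrative values only)
connector.class=io.debezium.connector.postgresql.PostgresConnector
database.hostname=postgres
database.port=5432
database.dbname=inventory
topic.prefix=inventory
# The kind of change we simulate: narrowing the include list so that
# records the downstream Flink job expects are silently no longer captured
table.include.list=public.orders
# OpenLineage integration (property names per Debezium's OpenLineage support)
openlineage.integration.enabled=true
openlineage.integration.config.file.path=/kafka/config/openlineage.yml

With lineage emission enabled, the connector reports its job and datasets to the configured OpenLineage backend, which is what lets us walk the graph backwards once the Flink job starts skipping records.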

Chances are, if you’re using the Debezium connector for Oracle, you’ve encountered the infamous exception reporting that an SCN could not be found in your transaction logs. In this blog post, we’ll talk about not only what this exception means, but also why it occurs and which troubleshooting steps you should take.
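When that exception appears, a useful first check is whether the SCN recorded in the connector offsets is still covered by the logs the database has kept. A minimal sketch, assuming query access to the standard V$ARCHIVED_LOG view and substituting the SCN from your own connector offsets for the placeholder value:

-- Replace 3604118 with the offset SCN from the exception message
SELECT name, first_change#, next_change#, status, deleted
FROM v$archived_log
WHERE 3604118 BETWEEN first_change# AND next_change#
ORDER BY first_change#;

If no rows match, or the matching files are marked as deleted, the logs containing that SCN have already been purged and the connector can no longer resume from its stored offset.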

The Debezium team has been extremely busy this past quarter as we prepared for this summer release, and we’re excited to announce the immediate availability of Debezium 3.2.0.Final. This release brings a slew of features, including integration with OpenLineage, a new Quarkus DevService/GraalVM extension, Qdrant vector database sink support, improvements to Debezium Platform and AI, and much more!

It’s useful from time to time to evaluate the performance of an entire project, or at least selected parts of it. This is especially important when adding new features or performing major code refactoring. However, performance checks can also be done ad hoc or, ideally, on a regular basis.

In this blog post, I’d like to demonstrate a quick way to identify and analyze a particular type of performance issue in Debezium. The post walks through the full cycle: setting up a lightweight performance test, analyzing the results, proposing an improvement, and evaluating its impact.
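As a flavor of what a lightweight performance test can look like, here is a minimal JMH sketch. The benchmarked operation, building a string from a small change-event-like payload, is a hypothetical stand-in for whichever Debezium code path you actually want to measure, not the harness used in the post.

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class EventSerializationBenchmark {

    private Map<String, Object> payload;

    @Setup
    public void setUp() {
        // Placeholder standing in for a captured change event
        payload = new HashMap<>();
        payload.put("op", "u");
        payload.put("before", "{\"id\":1,\"qty\":2}");
        payload.put("after", "{\"id\":1,\"qty\":3}");
    }

    @Benchmark
    public String serialize() {
        // Returning the value lets JMH consume it, preventing dead-code elimination
        return payload.toString();
    }
}

With the JMH annotation processor on the classpath, building and running the benchmark jar reports an average time per invocation that you can compare before and after a proposed improvement.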

The modern data landscape bears little resemblance to the centralized databases and simple ETL processes of the past. Today’s organizations operate in environments characterized by diverse data sources, real-time streaming, microservices architectures, and multi-cloud deployments. What began as straightforward data flows from operational systems to reporting databases has evolved into complex networks of interconnected pipelines, transformations, and dependencies. The shift from ETL to ELT patterns, the adoption of data lakes, and the proliferation of streaming platforms like Apache Kafka have created unprecedented flexibility in data processing. However, this flexibility comes at a cost: understanding how data moves, transforms, and evolves through these systems has become increasingly challenging.

Understanding data lineage

Data lineage is the process of tracking the flow and transformations of data from its origin to its final destination. It essentially maps the "life cycle" of data, showing where it comes from, how it’s changed, and where it ends up within a data pipeline. This includes documenting all transformations, joins, splits, and other manipulations the data undergoes during its journey.

At its core, data lineage answers critical questions: Where did this data originate? What transformations has it undergone? Which downstream systems depend on it? When issues arise, where should teams focus their investigation?
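In OpenLineage terms, those questions are answered by run events that tie a job to its input and output datasets. Below is a minimal sketch of such an event; since JSON carries no comments, treat the namespaces, dataset names, and runId as illustrative placeholders rather than values from a real deployment.

{
  "eventType": "COMPLETE",
  "eventTime": "2025-07-01T12:00:00Z",
  "producer": "https://github.com/OpenLineage/OpenLineage",
  "run": { "runId": "d46e465b-d358-4d32-83d4-df660ff614dd" },
  "job": { "namespace": "flink-jobs", "name": "order-processing" },
  "inputs": [ { "namespace": "kafka", "name": "inventory.public.orders" } ],
  "outputs": [ { "namespace": "kafka", "name": "orders.processed" } ]
}

A backend such as Marquez collects these events and stitches the inputs and outputs of consecutive jobs into the end-to-end lineage graph.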
