“Go with the Flow”: Stream-Based Real-Time ETL without Batch Processes
ETL is traditionally batch: efficiency, data consistency, traceability, simplicity, established methods, and mature tools make building a data warehouse (DWH) a standard job.
At the end of June, we were once again participants and sponsors of the TDWI conference. Peter Welker spoke about real-time ETL. Stream processing itself is technically not a problem. But how do you build complex business transformations ‘in the stream’? How do you ensure data consistency? Where does the data history go? And how can all this be done in a resource-efficient way?
For all those who did not have the chance to attend the talk, here is a brief summary of the most important points.
ASKING PETER WELKER…
We asked Peter Welker what he would like to share with us about his talk at TDWI:
“ETL processes are the heart of a data warehouse or a data lake. They ensure the data supply and take care of the availability, uniformity, correctness, and business-oriented preparation of all data. Since the processing of data in ETL processes is very resource-intensive and often takes place at night or at off-peak times, the processed data is typically available to the DWH end user only with a delay of around one day.”
THE DISADVANTAGE
Peter Welker also points out the resulting disadvantage: (near) real-time analyses, on data that is only a few minutes or even seconds old, are not possible. This means that use cases such as certain types of fraud detection, the detection of production problems, or direct analyses of user behaviour (for example, quick scoring of entries when concluding online contracts) cannot be implemented with data warehouse or data lake data.
NOW
In recent years, however, technologies and services have become available that can process even larger amounts of data simply and securely with a delay of only a few seconds. These include Spark, Flink, and kSQL, which is built on Apache Kafka.
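To give an idea of how kSQL sits on top of Kafka, here is a minimal, hypothetical sketch (the topic `customers_raw` and its schema are illustrative assumptions, not part of the actual demo): a stream is essentially a SQL declaration over a Kafka topic, and queries on it run continuously rather than once.

```sql
-- Declare a ksqlDB stream over an existing Kafka topic
-- (hypothetical topic name and schema, for illustration only).
CREATE STREAM customers_raw (
  customer_id VARCHAR KEY,
  country     VARCHAR,
  email       VARCHAR
) WITH (
  KAFKA_TOPIC  = 'customers_raw',
  VALUE_FORMAT = 'JSON'
);

-- A continuous (push) query: every new event on the topic is emitted immediately.
SELECT customer_id, country FROM customers_raw EMIT CHANGES;
```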
In our demo, we therefore implemented the ETL processes not as batch jobs but as a continuous, real-time data flow in SQL, based on kSQL and Kafka. The flow uses all the common procedures and patterns found in data warehouses, such as historization, change detection, and various data cleansing steps; a sketch of what this can look like follows below.
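As a rough idea of how such patterns translate to stream processing, the following ksqlDB-style sketch (building on the hypothetical `customers_raw` stream above; the column names and cleansing rules are illustrative assumptions, not the demo code) shows a continuous cleansing step and a “latest state per key” table, with the stream itself acting as the full change history:

```sql
-- Continuous cleansing step: classic DWH rules, applied per event
-- (illustrative columns and rules, not the actual demo code).
CREATE STREAM customers_clean AS
  SELECT
    customer_id,
    UCASE(TRIM(country)) AS country,   -- simple standardization rule
    LCASE(email)         AS email,
    ROWTIME              AS load_ts    -- audit column, as in batch ETL
  FROM customers_raw
  EMIT CHANGES;

-- Change detection / historization: the stream retains every version of a
-- record as an event, while this table always reflects the latest state per key.
CREATE TABLE customers_current AS
  SELECT
    customer_id,
    LATEST_BY_OFFSET(country) AS country,
    LATEST_BY_OFFSET(email)   AS email
  FROM customers_clean
  GROUP BY customer_id
  EMIT CHANGES;
```

In such a setup, historization largely falls out naturally: because every change is kept as an event in the stream, a Type-2-style history can be derived from it, while the table serves fast “current state” lookups.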
RESULT
The conversion of ETL processes to near-real-time procedures is already possible today for many use cases. In our example, new or changed data was available for analysis and reporting on average two seconds after it was modified in the upstream system.
The complete demo solution is available.
If you want to watch the full video of this presentation, or any of the other conference talks, the recorded sessions are now available on the TDWI conference platform (access required).