Flink Job Unit Testing
Apache Flink is an open-source framework for parallel stream processing, a Big Data technology that is rapidly gaining momentum in the market.
You can find many examples of how to start with Apache Flink, from a very simple one in the official documentation to the more practical Wikipedia Edit Stream, an example of real-time stream processing with a detailed explanation. The idea of the latter is to collect real-time changes to Wikipedia articles into a data stream from an IRC channel where all edits to the wiki are logged.
In this article, we are going to slightly modify the Wikipedia Edit Stream example to describe a practical approach to covering a Flink job with unit tests. The Guide for Unit Testing in Apache Flink shows how to unit test operators separately, one by one, but in a typical Flink application the operators are composed together into a job, and the goal here is to unit test the whole job.
For these purposes, Apache Flink provides a JUnit rule that allows testing jobs against a local mini-cluster. To be able to test the whole pipeline against the local Flink cluster, we need to make the source and sink functions pluggable into our pipeline. Let’s start by defining a simple pipeline.
For simplicity, this pipeline has a source function of WikipediaEditEvent objects. It filters out null edits (the ones whose format could not be parsed), keeps only the changes made to Talk pages, and converts them into a new object type, TalkEditEvent. As the final step, the objects are sent to the sink function, which simply prints them to the standard output stream.
Such a pipeline could serve as an example of validation in real-life applications, where messages that cannot be parsed or do not pass a validation rule are rejected (in real life they are sent to a different sink for further processing). This pipeline also models data enrichment: the map function adds to the TalkEditEvent an id generated from the title and the user.
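A sketch of what such a pipeline could look like is shown below. The class and method names (TalkEditEvent, pipeline, the id format user + ":" + title) are assumptions for illustration, not the exact code from the repository; the key refactoring is that the job logic takes the source stream as a parameter and returns the result stream, so that tests can plug in their own source and sink.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.wikiedits.WikipediaEditEvent;
import org.apache.flink.streaming.connectors.wikiedits.WikipediaEditsSource;

public class WikipediaAnalysis {

    // Hypothetical enriched event type: the id is derived from user and title.
    public static class TalkEditEvent {
        public String id;
        public String summary;

        public TalkEditEvent() {} // no-arg constructor so Flink treats it as a POJO

        public TalkEditEvent(String id, String summary) {
            this.id = id;
            this.summary = summary;
        }
    }

    // The job logic, extracted so that source and sink are pluggable.
    public static DataStream<TalkEditEvent> pipeline(DataStream<WikipediaEditEvent> edits) {
        return edits
            .filter(e -> e != null)                         // reject edits that failed to parse
            .filter(e -> e.getTitle().startsWith("Talk:"))  // keep only Talk page edits
            .map(e -> new TalkEditEvent(e.getUser() + ":" + e.getTitle(), e.getSummary()))
            .returns(TalkEditEvent.class);                  // type hint for the map lambda
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        pipeline(env.addSource(new WikipediaEditsSource()))
            .print();                                       // sink: standard output
        env.execute("wikipedia-talk-edits");
    }
}
```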
A simple unit test for our pipeline would pass test data to its source, run it, and collect the data from the sink, which allows making assertions on the collected data. To generate a large enough amount of test data, let’s put it into a text file and read it from there line by line. Reading test data from a text file into the data stream requires implementing a source function.
This source function follows the Apache Flink recommendations on source function implementation and can easily be converted into a more generic one.
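A minimal sketch of such a source function is shown below. It emits raw lines, leaving the parsing into WikipediaEditEvent objects to a downstream map function; the class name and constructor are assumptions. Following the SourceFunction contract, records are emitted under the checkpoint lock and the loop honors a volatile cancellation flag.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

import org.apache.flink.streaming.api.functions.source.SourceFunction;

// Emits one record per line of a text file with test data.
public class FileLinesSource implements SourceFunction<String> {

    private final String path;
    private volatile boolean isRunning = true;

    public FileLinesSource(String path) {
        this.path = path;
    }

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        List<String> lines = Files.readAllLines(Paths.get(path));
        for (String line : lines) {
            if (!isRunning) {
                break;
            }
            // Emit under the checkpoint lock, as recommended for source functions.
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(line);
            }
        }
    }

    @Override
    public void cancel() {
        isRunning = false;
    }
}
```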
Collecting pipeline execution results is also described in the Testing Flink Jobs example, with a few remarks. The most important one is to use a static variable to collect the values: Flink serializes all operators and distributes them across the cluster, so depending on the local mini-cluster configuration, several tasks will execute the pipeline in different threads and sink their results into the static variable.
Another problem with this approach is that some tasks might not finish before the test assertions are checked. One option is to decrease the parallelism to 1, either by modifying the pipeline or the mini-cluster configuration, but our goal is to reproduce execution in a clustered production environment. A better option is to synchronize the writing and reading of the data with locks.
In this collect sink example, the resulting values are collected into a Map so that assertions can be made on concrete entries. The id generated for each TalkEditEvent makes that possible, but this may not always be the case. If values from different threads are collected into a List, there is no guarantee that the order will be the same across test runs, so assertions can only be made on the list as a whole, not on separate elements.
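The core of this pattern can be sketched in plain Java, without Flink on the classpath (all names here are hypothetical): parallel sink instances write into one static map under a shared lock, and the test thread waits on the same lock until the expected number of entries has arrived. In the real sink, SinkFunction.invoke(...) would delegate to put(...).

```java
import java.util.HashMap;
import java.util.Map;

public class CollectedValues {

    // static: every deserialized sink instance in the mini-cluster sees the same map
    public static final Map<String, String> VALUES = new HashMap<>();

    // called from the sink's invoke(...) on every emitted record
    public static void put(String id, String value) {
        synchronized (VALUES) {
            VALUES.put(id, value);
            VALUES.notifyAll(); // wake up a test waiting for results
        }
    }

    // called from the test: block until `expected` entries arrived or the timeout expires
    public static Map<String, String> await(int expected, long timeoutMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        synchronized (VALUES) {
            while (VALUES.size() < expected && System.currentTimeMillis() < deadline) {
                VALUES.wait(100);
            }
            return new HashMap<>(VALUES); // defensive copy for assertions
        }
    }
}
```

Waiting on the map itself avoids the flakiness of a fixed sleep: the test proceeds as soon as all sink threads have written their entries.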
The unit test itself repeats the logic of the WikipediaAnalysis main method, but plugs in the source and sink functions discussed above.
Please note that with the static map of collected values in place, it is important to clear it before running the next test.
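Such a test might look like the sketch below, using Flink's MiniClusterWithClientResource JUnit rule. The names TestEditsSource (a source emitting WikipediaEditEvent test objects), CollectSink (a sink with a static VALUES map and an await helper), WikipediaAnalysis.pipeline, the data file name, and the asserted key are all assumptions for illustration.

```java
import java.util.Map;

import org.apache.flink.runtime.testutils.MiniClusterResourceConfiguration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.test.util.MiniClusterWithClientResource;
import org.junit.Before;
import org.junit.ClassRule;
import org.junit.Test;

import static org.junit.Assert.assertTrue;

public class WikipediaAnalysisTest {

    // JUnit rule that runs jobs against a local mini-cluster:
    // two task managers with two slots each, to model clustered execution
    @ClassRule
    public static final MiniClusterWithClientResource FLINK =
        new MiniClusterWithClientResource(
            new MiniClusterResourceConfiguration.Builder()
                .setNumberTaskManagers(2)
                .setNumberSlotsPerTaskManager(2)
                .build());

    @Before
    public void clearSink() {
        // the sink collects into a static map, so clear it before each test
        CollectSink.VALUES.clear();
    }

    @Test
    public void talkEditsAreEnrichedAndCollected() throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(2); // deliberately more than 1

        // same pipeline as in main(), with test source and collect sink plugged in
        WikipediaAnalysis.pipeline(env.addSource(new TestEditsSource("talk-edits.txt")))
            .addSink(new CollectSink());

        env.execute();

        Map<String, ?> collected = CollectSink.await(3, 10_000);
        assertTrue(collected.containsKey("SomeUser:Talk:SomeArticle"));
    }
}
```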
This simple application with unit test classes and data is available on GitHub.
Unit testing an Apache Flink job as a whole is possible. To achieve that, this article suggests: refactoring the job so that the source and sink functions are pluggable; implementing a source function that provides a decent amount of test data; implementing a sink function able to collect data from different threads; and writing a unit test that uses the Flink mini-cluster to simulate execution of the job by multiple task managers.