Historically, most firms generating large amounts of data had no choice but to discard it, since storing it was costly and processing it inconceivable. This changed with the advent of the first Big Data storage technologies, such as Hadoop and Cassandra. Nowadays the data can be stored, but processing it remains difficult, so new technologies have emerged to help with this work: Pig, Hive and, above all, Spark.
Once the processing problem had been solved, companies realised that the reason for storing so much data was that it arrived continually as streams and then progressively lost value as time passed. This in turn led to the advent of streaming solutions, mainly Spark Streaming and Storm, and most recently to the star of this blog: Apache Flink.
These streaming frameworks had to respond to new challenges that batch processing had never posed:
- Chained transformations and processing steps have to account for the fact that data arrives indefinitely: it is not possible to wait until every line has been read before splitting the text into words and counting them (to use the classic word-count example).
- In the event of any service downtime, we need mechanisms to guarantee that no stream data is lost. It therefore becomes necessary to decouple data acquisition from processing, and this aspect deserves special attention when designing the platform.
- Latency requirements now come into play: our program needs to guarantee a maximum response time for each piece of data. Related to the previous point, backpressure also becomes a concern: the processing chain must consume data quickly enough that the first stages of the pipeline do not become saturated.
- If any node fails, the framework has to ensure that each event is still processed exactly once. This requirement typically calls for costly synchronization and error-control mechanisms. Not every application needs it, but for some it is critical: think of an alarm recorded twice or, even worse, an alarm that goes unrecorded!
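To make the first challenge concrete, here is a minimal sketch in plain Python (no streaming framework, all names illustrative) of the incremental word count mentioned above: state is updated per arriving line, rather than waiting for the whole input to be read.

```python
from collections import Counter


class StreamingWordCount:
    """Keeps running word counts over an unbounded stream of lines."""

    def __init__(self):
        self.counts = Counter()

    def on_line(self, line: str) -> None:
        # Update state incrementally for each arriving line; the
        # stream never "ends", so counts are always partial results.
        self.counts.update(line.lower().split())

    def current(self, word: str) -> int:
        return self.counts[word]


wc = StreamingWordCount()
for line in ["to be or not to be", "to stream is to live"]:
    wc.on_line(line)
print(wc.current("to"))  # 4
```

The key difference from the batch version is that results can be queried at any moment, on whatever data has arrived so far.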
Spark Streaming and Storm have answered most of these problems, but it is Flink that seems to provide the best solution without sacrificing capacity or processing speed.
Without doubt Flink's biggest surprise is its Application Programming Interface (API) for managing time and events: very easy to understand and at the same time very powerful. It introduces concepts such as sliding windows, triggers and keyed streams, and it natively handles the question of which reference time should be used to order events: the time at which the event was generated, the time at which it was received, or the time at which it is processed? Flink lets the developer choose between them.
This abstraction makes it very easy to develop applications that manage time-sensitive events while satisfying the requirements listed at the start of the article.
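To illustrate what keyed, event-time sliding windows compute, here is a conceptual Python sketch. It is not the real Flink API (which in Java looks like `stream.keyBy(...).window(SlidingEventTimeWindows.of(...))`); all names below are illustrative. Each event is assigned, by its event time (when it was generated, not when it was received), to every window of length `size` that starts at a multiple of `slide` and covers it.

```python
from collections import defaultdict


def sliding_window_counts(events, size, slide):
    """events: iterable of (event_time, key) pairs.

    Returns a dict mapping (key, window_start) to the number of
    events of that key falling in the window [window_start,
    window_start + size). With slide < size, windows overlap and
    one event contributes to several windows.
    """
    counts = defaultdict(int)
    for t, key in events:
        # Latest window start that covers t, then walk backwards
        # through every overlapping window.
        start = (t // slide) * slide
        while start > t - size and start >= 0:
            counts[(key, start)] += 1
            start -= slide
    return dict(counts)


# Windows of 10 time units, sliding every 5: event at t=7 lands in
# both [0, 10) and [5, 15).
result = sliding_window_counts([(1, "a"), (7, "a"), (12, "a")],
                               size=10, slide=5)
print(result)  # {('a', 0): 2, ('a', 5): 2, ('a', 10): 1}
```

Because assignment depends only on the event's own timestamp, the same windows are produced even if events arrive out of order, which is exactly why choosing event time as the reference matters.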
In a couple of days we’ll publish another article with a practical Flink-use case, walking through the technology step by step. Watch this space!
The author’s views are entirely his own and may not reflect the views of GMV