Comparison of Popular Data Processing Systems

Written by Kamil Nasr

Paper category

Master Thesis


Computer Science




Data Processing Systems

Big data processing has evolved over the years, and its systems can be roughly divided into four generations, as shown in Figure 2.1.1.

The first-generation data processing system is Apache Hadoop [8], which mainly focuses on batch data. It is built around the concepts of Map and Reduce and provides an open source implementation of MapReduce [28]. Apache Hadoop offers many advantages, but its biggest limitation is that it involves a lot of disk operations.

The second-generation data processing systems improve slightly on the first generation. One of the most popular second-generation systems is Tez. In addition to batch processing, Tez also introduced interactive programming [64][19].

Apache Spark [10] is the most famous third-generation data processing system and is regarded as a unified model for batch processing and stream processing. The concept of the Resilient Distributed Dataset (RDD) is the core of Apache Spark [83]. Apache Spark is well suited to machine learning because it supports iterative processing. One of its advantages is support for in-memory computing and processing optimization. Spark applications can be written in Java, R, Python, and Scala.

Apache Flink [6] is regarded as the fourth-generation data processing system. Unlike many other frameworks, Flink supports real-time stream processing. Through its DataSet and DataStream core APIs, it supports both batch and stream processing. Other supported APIs include the SQL and Table APIs. Flink also handles stateful stream processing and iterative computations, and it provides effective fault tolerance and scalable state management [22][64].

2.2 Kaggle

Kaggle is a well-known website for data scientists and engineers. It gives them access to large data sets. It also hosts frequent competitions and challenges that anyone can join, and it awards prizes to the winners.
The daily summary data set of carbon monoxide used for the experiments in this project was found on Kaggle. It is published by the US Environmental Protection Agency and contains a summary of daily CO levels from 1990 to 2017.

Kaggle has been called the "AirBnB for Data Scientists". It has approximately 500,000 active users from more than 190 countries, and it was acquired by Google in 2017. Kaggle also aims to provide opportunities for data scientists, who often have no chance to practice on real data before joining a company. On the platform, they can practice on the available data sets in different ways, including organized competitions and challenges.

2.3 Batch processing and stream processing

In the world of big data and data analysis, batch processing and stream processing [20] are very important concepts, and it is important to understand the difference between them. Generally speaking, in batch processing, data is collected first and then processed, while stream processing happens in real time: the data is sent to the analysis tool piece by piece as it is produced. The following subsections discuss these two concepts in more detail and provide examples and use cases for each.

2.3.1 Batch processing

Batch processing is ideal when we process relatively large amounts of data, and/or when the data source is an old or legacy system that is incompatible with stream processing. For example, mainframe [81] data is processed in batches by default. Using mainframe data in newer analytical environments can be time-consuming and inconvenient, and converting it to streaming data is a challenge. Figure 2.3.1 shows how Hadoop MapReduce, a popular batch processing framework, processes data.

2.3.2 Stream processing

If we need to analyze the results in real time, then stream processing is the only way. The data is fed into the analysis tool as a stream the moment it is generated.
This allows us to obtain almost instant results. Stream processing can be used for fraud detection because it allows anomalies to be detected in real time. Latency in stream processing is usually measured in seconds or milliseconds. This is possible because the data is analyzed before it reaches the disk [20]. Figure 2.3.2 explains how real-time processing works in tools such as Apache Spark.

Batch and stream processing

The type of data being processed by a data engineer or scientist largely determines whether batch processing or stream processing is the better fit. However, batch data can be converted to streaming data to take advantage of real-time analysis results. Where time constraints apply, this may make it possible to respond to opportunities or challenges more quickly.

Apache Beam is a parallel computing framework. It is an open SDK based on the Dataflow model proposed by Google in [3]. Google Dataflow builds on the processing frameworks FlumeJava [25] for batch data processing and MillWheel [2] for stream data processing [5]. Most parallel processing frameworks try to optimize latency, correctness, or cost. For example, a developer may wait longer before starting processing to ensure that the data to be processed is complete and that all late data has arrived. This is likely to increase correctness, but it also increases latency. In the opposite situation, the developer starts processing early, which leads to lower latency but incomplete data and increased costs.
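The trade-off between latency and completeness described above can be illustrated with a small, self-contained Python sketch. It is not Apache Beam code and does not use any framework API. The names Event, process_window, and wait are hypothetical and chosen only for this illustration, as are the sample readings. The sketch assumes each event carries both the time it was produced and the time it arrived, and that a result for a window is emitted a fixed number of seconds after the window ends.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical record: one reading with two timestamps (in seconds)."""
    event_time: float    # when the reading was produced
    arrival_time: float  # when it reached the processing system
    value: float


def process_window(events, window_end, wait):
    """Sum the events produced before window_end.

    The result is emitted at window_end + wait. Events that belong to the
    window but arrive after the emit time are treated as late and left out.
    """
    emit_time = window_end + wait
    in_window = [e for e in events if e.event_time < window_end]
    on_time = [e for e in in_window if e.arrival_time <= emit_time]
    completeness = len(on_time) / len(in_window) if in_window else 1.0
    return {
        "latency": wait,                          # extra delay before the result
        "total": sum(e.value for e in on_time),   # result actually reported
        "completeness": completeness,             # share of window data included
    }


# Illustrative data only: two readings arrive promptly, two arrive late.
EVENTS = [
    Event(event_time=1.0, arrival_time=1.2, value=10.0),
    Event(event_time=4.0, arrival_time=4.5, value=20.0),
    Event(event_time=8.0, arrival_time=12.0, value=30.0),
    Event(event_time=9.5, arrival_time=18.0, value=40.0),
]

# Start processing as soon as the window closes: low latency, partial data.
eager = process_window(EVENTS, window_end=10.0, wait=0.0)

# Wait ten more seconds for late data: complete result, higher latency.
patient = process_window(EVENTS, window_end=10.0, wait=10.0)

print(eager)
print(patient)
```

With this sample data, the eager run reports only the two promptly delivered readings, while the patient run includes all four at the price of a ten-second delay. Real frameworks handle this choice with more elaborate mechanisms; the fixed wait here is a deliberate simplification.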