3.1.3. Yarn
Apache Yarn (Yet Another Resource Negotiator) is Hadoop's cluster
resource management system. Yarn was first introduced in Hadoop version 2 to
improve the MapReduce implementation, but today it also serves as the resource
manager for other distributed computing frameworks such as Spark and Tez.
Yarn provides its core services through two types of long-running daemons: the
ResourceManager and the NodeManager. A single ResourceManager per cluster
manages the use of the cluster's entire resources, while a NodeManager runs on
every machine and is responsible for launching and monitoring containers. To
run an application on Yarn, a client connects to the ResourceManager and asks
it to launch an ApplicationMaster process. For example, suppose a Spark job is
submitted while a Hadoop MapReduce job is already running on 100 servers.
Without Yarn, you would have to configure a separate MapReduce cluster on 50
servers and a Spark cluster on the remaining 50. In that case, if the MapReduce
job finishes earlier than the Spark job, 50 servers sit idle and resources are
wasted. Yarn minimizes this waste by letting jobs from different distributed
computing platforms be managed by a single resource manager. This is the
purpose for which Yarn was created and its greatest advantage.
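As a minimal sketch of this sharing (assuming a Hadoop cluster whose configuration is visible through HADOOP_CONF_DIR and a PySpark installation; the application name and resource sizes are illustrative, not prescribed by the text), a Spark application can ask Yarn's ResourceManager for its containers instead of requiring a dedicated Spark cluster:

```python
from pyspark.sql import SparkSession

# Sketch: delegate resource allocation to Yarn so Spark and MapReduce
# jobs can share the same machines. Assumes HADOOP_CONF_DIR points at
# the cluster configuration; sizes below are illustrative.
spark = (
    SparkSession.builder
    .appName("yarn-shared-cluster-sketch")
    .master("yarn")                           # Yarn allocates the containers
    .config("spark.executor.instances", "4")  # illustrative executor count
    .config("spark.executor.memory", "2g")    # illustrative executor memory
    .getOrCreate()
)

print(spark.sparkContext.master)  # "yarn"
spark.stop()
```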
3.2. Spark
Spark is a unified computing engine with a set of libraries for
processing data in parallel in a cluster environment. Spark's philosophy is to
provide a unified platform for developing big data applications: from simple
data reads to Spark SQL, Spark MLlib, and Spark Streaming, data analysis is
designed to be performed with the same computation engine and consistent APIs.
Spark does not store data internally for long and does not favor any particular
storage system; it is designed to focus on processing regardless of where the
data is stored.
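As a small illustration of this unified-API point (a sketch only; the HDFS path and column names are hypothetical), a single SparkSession can read data from external storage and query it with Spark SQL without switching engines:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unified-api-sketch").getOrCreate()

# One engine, consistent APIs: the same session reads external storage
# and answers SQL queries. Path and column names are hypothetical.
df = spark.read.csv("hdfs:///data/events.csv", header=True, inferSchema=True)

df.createOrReplaceTempView("events")
daily = spark.sql(
    "SELECT event_date, COUNT(*) AS cnt FROM events GROUP BY event_date"
)
daily.show()

spark.stop()
```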
The fundamental background of Spark's emergence lies in changes in the economic
factors underlying computer applications and hardware. Historically,
applications grew faster year by year thanks to improving processor
performance. Unfortunately, this hardware performance improvement largely
stopped around 2005 because of physical limits on heat dissipation. Since then,
improving application performance has required parallel processing, which gave
rise to distributed computing platforms such as Spark and Hadoop [6].
A computer cluster pools the resources of many machines so that they
can be used as if they were a single computer. Configuring a cluster is not
enough on its own, however; a framework is needed to coordinate the work across
the cluster, and Spark is such a framework. A Spark application consists of one
Driver process and a number of Executor processes. The Driver process runs on
one node of the cluster and is essential: it maintains the application's
information, responds to the user's program or input, and analyzes,
distributes, and schedules the work of all the Executor processes. Like a
heart, the Driver keeps all relevant information in memory throughout the
application's life cycle. The Executors, on the other hand, carry out the tasks
the Driver assigns: they execute the code given to them by the Driver and
report their progress back to the Driver node.
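A minimal sketch of this division of labor (the data size and partition count are illustrative): the script below runs in the Driver process, while the function passed to map is shipped to the Executors, which compute on their partitions and report results back to the Driver.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("driver-executor-sketch").getOrCreate()
sc = spark.sparkContext

# This script is the Driver: it holds application state and schedules work.
rdd = sc.parallelize(range(1_000_000), numSlices=8)

# The lambda is shipped to the Executors; each Executor runs it on its
# partitions and the aggregated result comes back to the Driver.
total = rdd.map(lambda x: x * x).sum()
print(total)

spark.stop()
```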
Spark splits data into chunks called partitions so that all Executors can work
in parallel. For example, if there is only one partition, the parallelism is
one even if the application has thousands of Executors; conversely, with
hundreds of partitions but only one Executor, the parallelism is still one.
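The sketch below (partition counts are illustrative) makes this relationship visible: the number of partitions bounds how many tasks can run at once, together with the number of Executor cores available.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-sketch").getOrCreate()
sc = spark.sparkContext

data = sc.parallelize(range(100))

one_partition = data.repartition(1)      # at most 1 task can run at a time
many_partitions = data.repartition(200)  # up to 200 tasks, if cores allow

print(one_partition.getNumPartitions())   # 1
print(many_partitions.getNumPartitions()) # 200

spark.stop()
```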