3.2.1. Low-Level API
Spark consists of a Low-Level API as its base element, a Structured API,
and a set of standard libraries that provide additional functionality. In most
situations from Spark version 2 onward, the Structured API is recommended, and
direct use of RDDs is expected to be further restricted after version 3. The
Low-Level API is nevertheless important, because it contains the Resilient
Distributed Dataset (RDD), the smallest unit on which Spark runs, and the
distributed shared variables. The RDD is a concept unique to Spark for
distributed data processing: simply put, it is an immutable collection of
partitioned records that can be processed in parallel. It is also the form into
which the optimized physical execution plan produced by the DataFrame API is
ultimately compiled.
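As an illustration only, the following PySpark sketch (assuming a local SparkSession; the names are hypothetical) creates a partitioned RDD with the Low-Level API and processes its records in parallel with a transformation and an action.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-sketch").getOrCreate()
sc = spark.sparkContext

# An RDD: an immutable, partitioned collection of records.
numbers = sc.parallelize(range(10), numSlices=4)  # 4 partitions
squares = numbers.map(lambda x: x * x)            # lazy transformation
print(squares.collect())                          # action triggers parallel execution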
There are two types of distributed shared variables: the Broadcast
Variable and the Accumulator. These shared variables exist because an
application does not run on a single machine; to do the same work on a cluster
of multiple machines, certain values must be shared, and these variables act as
the mechanism for exchanging them. A Broadcast Variable is an immutable shared
variable that is cached on every machine in the cluster. This is very efficient
when a large value, such as a machine learning model, must be reused several
times, so most machine learning algorithms that run on Spark rely on Broadcast
Variables. An Accumulator is used to update values inside transformations, and
it can deliver values from the Executors to the Driver efficiently while
guaranteeing fault tolerance. Like the Broadcast Variable, the Accumulator is
frequently used when working on a deep learning model to deliver
hyperparameter-related values to the driver, update them several times, and
search for a better model.
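A minimal sketch of both shared variables, assuming a local SparkSession and a hypothetical lookup table in place of a real model, might look as follows.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shared-vars-sketch").getOrCreate()
sc = spark.sparkContext

# Broadcast Variable: an immutable value cached on every executor.
lookup = sc.broadcast({"a": 1, "b": 2})

# Accumulator: executors add to it, the driver reads the result.
missing = sc.accumulator(0)

def score(record):
    if record not in lookup.value:   # read the cached broadcast copy
        missing.add(1)               # update the accumulator inside a transformation
        return 0
    return lookup.value[record]

rdd = sc.parallelize(["a", "b", "c", "a"])
total = rdd.map(score).sum()         # action executes the transformation
print(total, missing.value)          # accumulator value is read back on the driver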
3.2.2. Structured API
While each record in a DataFrame, one of the Structured APIs, is a
structured row consisting of fields with a known schema, an RDD record is
simply an object in the programming language chosen by the programmer. The
Structured API consists of the Dataset, the DataFrame, and SQL, and aims to
handle big data through these interfaces. The Structured APIs are recommended
because direct use of RDDs is discouraged in Spark version 3, scheduled for
release in 2021.
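The difference between the two kinds of records can be sketched as follows (hypothetical toy data, assuming a local SparkSession).

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("record-sketch").getOrCreate()
sc = spark.sparkContext

# An RDD record is just an object in the host language (here, a Python tuple).
rdd = sc.parallelize([(1, "alice"), (2, "bob")])
print(rdd.first())        # (1, 'alice'); Spark knows nothing about its structure

# A DataFrame record is a structured Row whose fields have a known schema.
df = spark.createDataFrame(rdd, ["id", "name"])
df.printSchema()          # id: long, name: string
print(df.first())         # Row(id=1, name='alice')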
The Structured APIs are the basic abstraction concepts that define data
flows. Their execution process consists of four steps. First, the user writes
code using the Dataset, DataFrame, or SQL API. Second, if the code is valid, the
Spark engine converts it into a logical execution plan. Third, the logical
execution plan is converted into a physical execution plan, and Spark checks
whether further optimization is possible during this step; producing the
physical execution plan means converting the query defined with the Structured
API into RDD operations. Finally, the plan is sent to the Spark Driver, which
executes the physical execution plan on the cluster, and Spark returns the
processing results to the user.
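These steps can be observed directly: in the sketch below (hypothetical toy data), explain(True) prints the logical and physical plans that Spark derives from the Structured API code, and the final action sends the physical plan to the cluster for execution.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("plan-sketch").getOrCreate()

# Step 1: write code with the Structured API.
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "key"])
result = df.groupBy("key").agg(F.count("id").alias("cnt"))

# Steps 2-3: Spark builds a logical plan, optimizes it, and compiles a
# physical plan that is ultimately executed as RDD operations.
result.explain(True)   # prints parsed, analyzed, optimized, and physical plans

# Step 4: an action executes the physical plan on the cluster and returns results.
result.show()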
3.2.3. Machine Learning on Spark (MLlib)
MLlib offers data collection, cleaning, feature extraction and
selection, training and tuning of supervised and unsupervised machine learning
models on large data, and an interface that helps these models be used in a
production environment. MLlib is often compared with Scikit-Learn and
TensorFlow. From a broad perspective, MLlib can be thought of as carrying out
tasks similar to those provided by Scikit-Learn and comparable Python tools.
Those tools, however, perform machine learning on a single machine, whereas the
MLlib package operates in a cluster environment, so the relationship is
complementary. MLlib has several basic 'structural' types: transformers,
estimators, evaluators, and pipelines.
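A minimal MLlib pipeline sketch, using hypothetical toy data, shows how these structural types fit together: Transformers add columns, an Estimator is fit to produce a model, an Evaluator scores the predictions, and a Pipeline chains the stages.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Hypothetical toy training data: (text, label).
train = spark.createDataFrame(
    [("spark is fast", 1.0), ("slow old batch job", 0.0)],
    ["text", "label"],
)

# Transformers produce new columns; the Estimator is fit to produce a model.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, tf, lr])

model = pipeline.fit(train)              # fitted PipelineModel
predictions = model.transform(train)     # apply the fitted pipeline as a Transformer

# Evaluator scores the predictions (area under the ROC curve by default).
evaluator = BinaryClassificationEvaluator()
print(evaluator.evaluate(predictions))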