3.2.3.1. Transformer
A transformer is a function that converts raw data in various
ways. This could be creating a new interaction variable, normalizing a column,
or casting an Integer column to a Double type so that the data can enter the model.
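As an illustration, the following is a minimal PySpark sketch of transformers at work; the DataFrame contents and column names are assumed for the example, and a SparkSession named spark is taken as given.

    from pyspark.ml.feature import VectorAssembler

    # Assumed toy DataFrame with an Integer column and a Double column.
    df = spark.createDataFrame([(1, 2.0), (2, 3.5)], ["id", "value"])

    # Cast the Integer column to Double so it can enter the model.
    df = df.withColumn("id_d", df["id"].cast("double"))

    # VectorAssembler is a transformer: it combines input columns into a feature vector.
    assembler = VectorAssembler(inputCols=["id_d", "value"], outputCol="features")
    transformed = assembler.transform(df)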
3.2.3.2. Estimator
The estimator has two meanings. First, it refers to a kind of
transformer that must be initialized with the data. For example, to normalize numeric data,
the transformation is initialized using the current values in the
column to be normalized. Second, the algorithms that users apply to learn a
model from the data are also called estimators.
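To make both meanings concrete, the following PySpark sketch fits a StandardScaler (an estimator that is initialized from column statistics) and a LogisticRegression (a learning algorithm); the training DataFrame train_df and its columns are assumed.

    from pyspark.ml.feature import StandardScaler
    from pyspark.ml.classification import LogisticRegression

    # StandardScaler is an estimator: fit() computes the column statistics and
    # returns a transformer (StandardScalerModel) that applies the normalization.
    scaler = StandardScaler(inputCol="features", outputCol="scaled", withMean=True)
    scaler_model = scaler.fit(train_df)
    scaled_df = scaler_model.transform(train_df)

    # A learning algorithm is also an estimator: fit() learns the model parameters
    # and returns a fitted LogisticRegressionModel.
    lr = LogisticRegression(featuresCol="scaled", labelCol="label")
    lr_model = lr.fit(scaled_df)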
3.2.3.3. Evaluator
It allows us to see how well a given model performs according to one
criterion, such as the area under the Receiver Operating Characteristic (ROC) curve.
After the best model among those tested has been selected using the evaluator, the
final prediction can be made with that model.
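Continuing the sketch above, a BinaryClassificationEvaluator can score the fitted model on a held-out DataFrame; test_df is assumed here.

    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    # Score the fitted model on held-out data using area under the ROC curve.
    predictions = lr_model.transform(test_df)
    evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")
    auc = evaluator.evaluate(predictions)
    print("Area under ROC:", auc)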
3.2.3.4. External libraries
Spark can run various projects using external libraries as well
as its built-in packages. Among them, a variety of external deep learning libraries
can be used, especially in this relatively new field, such as TensorFrames, BigDL,
TensorFlowOnSpark, DeepLearning4J, and Elephas. There are two ways to develop a
new deep learning model. One is to use a Spark cluster to parallelize training
of a single model across multiple servers and to update the final result through
communication between the servers. The other is to use a specific library
to train multiple model objects in parallel and to explore different model
architectures and hyperparameters so that the final model can be selected and
optimized efficiently.
Library             | Framework based on DL | Case of application
TensorFrames        | TensorFlow            | Inference, Transfer learning
BigDL               | BigDL                 | Distributed learning, Inference
TensorFlowOnSpark   | TensorFlow            | Distributed learning
DeepLearning4J      | DeepLearning4J        | Inference, Transfer learning, Distributed learning
Elephas             | Keras                 | Distributed learning
[Table 1] Deep Learning External Libraries
Elephas is a library designed to run the Keras deep learning
framework on Spark. It keeps Keras's simplicity and high usability while supporting
distributed models that can be trained on large datasets. Elephas is implemented
on top of Keras as a class of data-parallel algorithms that use Spark's RDDs and
DataFrames: the model is initialized on the Spark driver, the data are serialized
and passed to the executors, and the parameters the model needs are exchanged with
the executors through Spark's distributed shared variables, broadcast variables and
accumulators. The learned parameters and hyperparameters are then passed back to
the driver, where the optimizer on the master node synchronizes the updates and
training continues.
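As a rough illustration of this workflow, the sketch below wraps a compiled Keras model with Elephas and trains it on an RDD; the SparkContext sc, the compiled Keras model model, and the NumPy arrays x_train and y_train are assumed.

    from elephas.utils.rdd_utils import to_simple_rdd
    from elephas.spark_model import SparkModel

    # Serialize the training data into an RDD so it can be shipped to the executors.
    rdd = to_simple_rdd(sc, x_train, y_train)

    # Wrap the compiled Keras model; weight updates flow back to the driver,
    # where they are synchronized before training continues.
    spark_model = SparkModel(model, frequency='epoch', mode='asynchronous')
    spark_model.fit(rdd, epochs=10, batch_size=32, verbose=0, validation_split=0.1)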
3.3. Docker
Docker is an open-source project that makes it easier to run
applications as containers by adding multiple features on top of Linux containers.
Docker is written in the Go language. Unlike virtual machines, the
traditional method of virtualization, Docker containers incur little
performance loss, drawing attention from many developers as a next-generation
cloud infrastructure solution. There are many projects related to Docker,
including Docker Compose, Private Registry, Docker Machine, Kitematic, and so
on, but typically Docker refers to the Docker Engine. The Docker Engine is the
main Docker project: it creates and manages containers, provides a
variety of functions, and controls the containers on its own [7].
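As a brief illustration of driving the Docker Engine programmatically, the following sketch uses the Python Docker SDK (the docker package, assumed to be installed) against a locally running Docker daemon.

    import docker

    # Connect to the local Docker Engine through its default socket.
    client = docker.from_env()

    # Create and run a container from the ubuntu image, then read its output.
    container = client.containers.run("ubuntu", "echo hello from a container", detach=True)
    container.wait()                  # wait until the command has finished
    print(container.logs().decode())  # fetch the container's stdout
    container.remove()                # remove the stopped container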
Traditional virtualization technology uses a hypervisor to create
and run multiple operating systems on a single host. These operating systems
are identified as virtual machines, and each virtual machine runs a full guest
operating system such as Ubuntu or CentOS. The guest operating systems created
and managed by the hypervisor each use completely independent space and system
resources. Typical virtualization tools of this kind include VirtualBox,
VMware, and others. However, virtualizing a whole machine and creating an
independent space necessarily requires a hypervisor, which results in performance
loss compared to an ordinary host. So while virtual machines have the advantage of
providing a complete operating system, they tend to lose
performance compared to ordinary hosts, and it is hard to deploy gigabyte-sized
virtual machine images for applications.
In comparison, a Docker container has little performance loss
because it creates a process-level isolation environment using Linux's own
features (chroot, namespaces, and cgroups) to build the virtualized space. Because
the container shares and uses the host's kernel,
and the container holds only the libraries and executable files
needed to run the application, the image size is also significantly reduced
when the container is packaged as an image. As a result, containers are faster than
virtual machines and have the advantage of almost no performance loss when using the
virtualized space.