3.2.1. Low-Level API
Spark consists of a Low-Level API as its base element, a Structured API,
and a set of standard libraries that provide additional functionality. In most
situations from Spark version 2 onward, the Structured API is recommended, and
direct use of RDDs is expected to be further restricted after version 3. The
Low-Level API is nevertheless important, because it contains the Resilient
Distributed Dataset (RDD), the smallest unit on which Spark runs, and the
distributed shared variables. The RDD is a concept unique to Spark for
distributed data processing: simply put, it is an immutable collection of
partitioned records that can be processed in parallel. It is also the form into
which the optimized physical execution plan produced by the DataFrame API is
ultimately compiled.
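As an illustration only, the following PySpark sketch (assuming a local SparkSession; the names are hypothetical) creates a partitioned RDD with the Low-Level API and processes its records in parallel with a transformation and an action.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-sketch").getOrCreate()
sc = spark.sparkContext

# An RDD: an immutable, partitioned collection of records.
numbers = sc.parallelize(range(10), numSlices=4)  # 4 partitions
squares = numbers.map(lambda x: x * x)            # lazy transformation
print(squares.collect())                          # action triggers parallel execution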
There are two types of distributed shared variables: the Broadcast
Variable and the Accumulator. These shared variables exist because an
application does not run on a single machine; to do the same work on a cluster
of multiple machines, certain values must be shared, and these variables act as
the mechanism for exchanging them. A Broadcast Variable is an immutable shared
variable that is cached on every machine in the cluster. This is very efficient
when a large value, such as a machine learning model, must be reused several
times, so most machine learning algorithms that run on Spark rely on Broadcast
Variables. An Accumulator is used to update values inside transformations, and
it can deliver values from the Executors to the Driver efficiently while
guaranteeing fault tolerance. Like the Broadcast Variable, the Accumulator is
frequently used when working on a deep learning model to deliver
hyperparameter-related values to the driver, update them several times, and
search for a better model.
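A minimal sketch of both shared variables, assuming a local SparkSession and a hypothetical lookup table in place of a real model, might look as follows.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shared-vars-sketch").getOrCreate()
sc = spark.sparkContext

# Broadcast Variable: an immutable value cached on every executor.
lookup = sc.broadcast({"a": 1, "b": 2})

# Accumulator: executors add to it, the driver reads the result.
missing = sc.accumulator(0)

def score(record):
    if record not in lookup.value:   # read the cached broadcast copy
        missing.add(1)               # update the accumulator inside a transformation
        return 0
    return lookup.value[record]

rdd = sc.parallelize(["a", "b", "c", "a"])
total = rdd.map(score).sum()         # action executes the transformation
print(total, missing.value)          # accumulator value is read back on the driver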
3.2.2. Structured API
While each record in a DataFrame, one of the Structured APIs, is a
structured row consisting of fields with a known schema, an RDD record is
simply an object in the programming language chosen by the programmer. The
Structured API consists of the Dataset, the DataFrame, and SQL, and aims to
handle big data through these interfaces. The Structured APIs are recommended
because direct use of RDDs is discouraged in Spark version 3, scheduled for
release in 2021.
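The difference between the two kinds of records can be sketched as follows (hypothetical toy data, assuming a local SparkSession).

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("record-sketch").getOrCreate()
sc = spark.sparkContext

# An RDD record is just an object in the host language (here, a Python tuple).
rdd = sc.parallelize([(1, "alice"), (2, "bob")])
print(rdd.first())        # (1, 'alice'); Spark knows nothing about its structure

# A DataFrame record is a structured Row whose fields have a known schema.
df = spark.createDataFrame(rdd, ["id", "name"])
df.printSchema()          # id: long, name: string
print(df.first())         # Row(id=1, name='alice')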
The Structured APIs are the basic abstraction concepts that define data
flows. Their execution process consists of four steps. First, the user writes
code using the Dataset, DataFrame, or SQL API. Second, if the code is valid, the
Spark engine converts it into a logical execution plan. Third, the logical
execution plan is converted into a physical execution plan, and Spark checks
whether further optimization is possible during this step; producing the
physical execution plan means converting the query defined with the Structured
API into RDD operations. Finally, the plan is sent to the Spark Driver, which
executes the physical execution plan on the cluster, and Spark returns the
processing results to the user.
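These steps can be observed directly: in the sketch below (hypothetical toy data), explain(True) prints the logical and physical plans that Spark derives from the Structured API code, and the final action sends the physical plan to the cluster for execution.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("plan-sketch").getOrCreate()

# Step 1: write code with the Structured API.
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "key"])
result = df.groupBy("key").agg(F.count("id").alias("cnt"))

# Steps 2-3: Spark builds a logical plan, optimizes it, and compiles a
# physical plan that is ultimately executed as RDD operations.
result.explain(True)   # prints parsed, analyzed, optimized, and physical plans

# Step 4: an action executes the physical plan on the cluster and returns results.
result.show()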
3.2.3. Machine Learning on Spark (MLlib)
MLlib offers data collection, cleaning, feature extraction and
selection, training and tuning of supervised and unsupervised machine learning
models on large data, and an interface that helps these models be used in a
production environment. MLlib is often compared with Scikit-Learn and
TensorFlow. From a broad perspective, MLlib can be thought of as carrying out
tasks similar to those provided by Scikit-Learn and comparable Python tools.
Those tools, however, perform machine learning on a single machine, whereas the
MLlib package operates in a cluster environment, so the relationship is
complementary. MLlib has several basic 'structural' types: transformers,
estimators, evaluators, and pipelines.
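A minimal MLlib pipeline sketch, using hypothetical toy data, shows how these structural types fit together: Transformers add columns, an Estimator is fit to produce a model, an Evaluator scores the predictions, and a Pipeline chains the stages.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Hypothetical toy training data: (text, label).
train = spark.createDataFrame(
    [("spark is fast", 1.0), ("slow old batch job", 0.0)],
    ["text", "label"],
)

# Transformers produce new columns; the Estimator is fit to produce a model.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, tf, lr])

model = pipeline.fit(train)              # fitted PipelineModel
predictions = model.transform(train)     # apply the fitted pipeline as a Transformer

# Evaluator scores the predictions (area under the ROC curve by default).
evaluator = BinaryClassificationEvaluator()
print(evaluator.evaluate(predictions))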