3.1.3. Yarn
Apache Yarn (Yet Another Resource Negotiator) is Hadoop's cluster
resource management system. Yarn was first introduced in Hadoop version 2 to
improve the MapReduce implementation, but today it also serves as the resource
manager for other distributed computing frameworks such as Spark and Tez.
Yarn provides its core services through two types of long-running daemons: the
ResourceManager and the NodeManager. A single ResourceManager per cluster
manages the use of the cluster's entire resources, while a NodeManager runs on
every machine and is responsible for launching and monitoring containers. To
run an application on Yarn, a client connects to the ResourceManager and asks
it to launch an ApplicationMaster process. For example, suppose a Spark job is
submitted while a Hadoop MapReduce job is already running on 100 servers.
Without Yarn, you would have to configure a separate MapReduce cluster on 50
servers and a Spark cluster on the remaining 50. In that case, if the MapReduce
job finishes earlier than the Spark job, 50 servers sit idle and resources are
wasted. Yarn minimizes this waste by letting jobs from different distributed
computing platforms be managed by a single resource manager. This is the
purpose for which Yarn was created and its greatest advantage.
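As a minimal sketch of this sharing (assuming a Hadoop cluster whose configuration is visible through HADOOP_CONF_DIR and a PySpark installation; the application name and resource sizes are illustrative, not prescribed by the text), a Spark application can ask Yarn's ResourceManager for its containers instead of requiring a dedicated Spark cluster:

```python
from pyspark.sql import SparkSession

# Sketch: delegate resource allocation to Yarn so Spark and MapReduce
# jobs can share the same machines. Assumes HADOOP_CONF_DIR points at
# the cluster configuration; sizes below are illustrative.
spark = (
    SparkSession.builder
    .appName("yarn-shared-cluster-sketch")
    .master("yarn")                           # Yarn allocates the containers
    .config("spark.executor.instances", "4")  # illustrative executor count
    .config("spark.executor.memory", "2g")    # illustrative executor memory
    .getOrCreate()
)

print(spark.sparkContext.master)  # "yarn"
spark.stop()
```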
3.2. Spark
Spark is a unified computing engine with a set of libraries for
processing data in parallel in a cluster environment. Spark's philosophy is to
provide a unified platform for developing big data applications: from simple
data reads to Spark SQL, Spark MLlib, and Spark Streaming, data analysis is
designed to be performed with the same computation engine and consistent APIs.
Spark does not store data internally for long and does not favor any particular
storage system; it is designed to focus on processing regardless of where the
data is stored.
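As a small illustration of this unified-API point (a sketch only; the HDFS path and column names are hypothetical), a single SparkSession can read data from external storage and query it with Spark SQL without switching engines:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unified-api-sketch").getOrCreate()

# One engine, consistent APIs: the same session reads external storage
# and answers SQL queries. Path and column names are hypothetical.
df = spark.read.csv("hdfs:///data/events.csv", header=True, inferSchema=True)

df.createOrReplaceTempView("events")
daily = spark.sql(
    "SELECT event_date, COUNT(*) AS cnt FROM events GROUP BY event_date"
)
daily.show()

spark.stop()
```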
The fundamental background of Spark's emergence lies in changes in the economic
factors underlying computer applications and hardware. Historically,
applications grew faster year by year thanks to improving processor
performance. Unfortunately, this hardware performance improvement largely
stopped around 2005 because of physical limits on heat dissipation. Since then,
improving application performance has required parallel processing, which gave
rise to distributed computing platforms such as Spark and Hadoop [6].
A computer cluster pools the resources of many machines so that they
can be used as if they were a single computer. Configuring a cluster is not
enough on its own, however; a framework is needed to coordinate the work across
the cluster, and Spark is such a framework. A Spark application consists of one
Driver process and a number of Executor processes. The Driver process runs on
one node of the cluster and is essential: it maintains the application's
information, responds to the user's program or input, and analyzes,
distributes, and schedules the work of all the Executor processes. Like a
heart, the Driver keeps all relevant information in memory throughout the
application's life cycle. The Executors, on the other hand, carry out the tasks
the Driver assigns: they execute the code given to them by the Driver and
report their progress back to the Driver node.
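A minimal sketch of this division of labor (the data size and partition count are illustrative): the script below runs in the Driver process, while the function passed to map is shipped to the Executors, which compute on their partitions and report results back to the Driver.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("driver-executor-sketch").getOrCreate()
sc = spark.sparkContext

# This script is the Driver: it holds application state and schedules work.
rdd = sc.parallelize(range(1_000_000), numSlices=8)

# The lambda is shipped to the Executors; each Executor runs it on its
# partitions and the aggregated result comes back to the Driver.
total = rdd.map(lambda x: x * x).sum()
print(total)

spark.stop()
```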
Spark splits data into chunks called partitions so that all Executors can work
in parallel. For example, if there is only one partition, the parallelism is
one even if the application has thousands of Executors; conversely, with
hundreds of partitions but only one Executor, the parallelism is still one.
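The sketch below (partition counts are illustrative) makes this relationship visible: the number of partitions bounds how many tasks can run at once, together with the number of Executor cores available.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-sketch").getOrCreate()
sc = spark.sparkContext

data = sc.parallelize(range(100))

one_partition = data.repartition(1)      # at most 1 task can run at a time
many_partitions = data.repartition(200)  # up to 200 tasks, if cores allow

print(one_partition.getNumPartitions())   # 1
print(many_partitions.getNumPartitions()) # 200

spark.stop()
```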