3.1.1. HDFS
Hadoop is built on two important concepts. One of them, HDFS, is a
scalable, reliable, distributed file system written in Java; it stores files
in units of blocks and operates in a master-worker structure. The block size
is the maximum amount of data that can be read or written at a time.
There are some benefits of introducing the concept of block abstraction into a
distributed file system. The first benefit is that a single file can be larger
than the capacity of a single disk. Multiple blocks that make up a single file
need not be stored only on the same disk, so they can be stored on any disk in
the cluster. The second benefit is that abstracting into blocks rather than
file units simplifies the storage subsystem. Furthermore, blocks are highly
suitable for implementing the necessary replication to provide fault tolerance
and availability. To cope with block corruption and disk or machine failures, each
block is replicated to a number of physically separated machines. If one replica
becomes unavailable, the client can be directed to read a copy on another
machine. In Hadoop 2, the block size is 128 MB by default, and the cluster's
block size can be adjusted in the hdfs-site.xml file. The number of replicas
can also be set in the hdfs-site.xml file.
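For reference, these two settings correspond to the dfs.blocksize and
dfs.replication properties. A minimal hdfs-site.xml entry might look like the
following sketch; the values shown are simply the Hadoop 2 defaults (a 128 MB
block size and three replicas per block).

    <configuration>
      <property>
        <name>dfs.blocksize</name>
        <value>134217728</value> <!-- 128 MB = 128 * 1024 * 1024 bytes -->
      </property>
      <property>
        <name>dfs.replication</name>
        <value>3</value> <!-- number of replicas kept for each block -->
      </property>
    </configuration>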
An HDFS cluster consists of two types of nodes operating in a
master-worker pattern: one Namenode, the master, and several Datanodes, the
workers. The Namenode manages the namespace of the file system.
The Namenode maintains metadata about the file system tree and all of the files
and directories contained in it. This information is stored permanently on the
local disk as two types of files, FSimage and Editlog. The Namenode also keeps
track of which Datanodes hold the blocks of each file. The Datanode is the real
worker in the file system. A Datanode stores and retrieves blocks at the
request of a client or the Namenode, and periodically reports the list of
blocks it is storing to the Namenode. In addition, the Datanode sends
information such as its available disk space, the number of data transfers in
progress, and its current load to the Namenode every three seconds through a
'heartbeat'. The Namenode keeps its metadata up to date based on this
information.
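As an illustration of how a client uses this structure (this sketch is not
from the text), the following Java fragment reads a file from HDFS through
Hadoop's FileSystem API. Behind the scenes, the client obtains the block
locations from the Namenode and then streams the block contents from the
Datanodes, although the API hides this exchange; the class name and file path
below are hypothetical.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadExample {
      public static void main(String[] args) throws Exception {
        // Loads core-site.xml / hdfs-site.xml from the classpath, so the
        // client knows which Namenode to contact.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Opening a file asks the Namenode for the block locations; the
        // returned stream then reads the block contents from the Datanodes.
        Path path = new Path("/user/hadoop/sample.txt"); // hypothetical path
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(path)))) {
          String line;
          while ((line = reader.readLine()) != null) {
            System.out.println(line);
          }
        }
      }
    }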
3.1.2. MapReduce
MapReduce is a programming model for data processing. It is also
the execution framework for Hadoop clusters and is written in Java. To take
advantage of the parallel processing offered by Hadoop, we need to re-express
the client's requirements as a MapReduce job. A MapReduce operation is largely
divided into a Map phase and a Reduce phase, and a simple intermediate step
called Shuffling between Map and Reduce helps process the data faster.
Each phase takes key-value pairs as input and output, and their types are
chosen by the programmer. A MapReduce program follows a series of development
steps. First, write unit tests while implementing the Map and Reduce functions,
and verify that they work correctly. Next, write a driver program that runs the
job, and run it in an Integrated Development Environment (IDE) on part of the
dataset to verify that it behaves properly. The Map phase converts the input
data into one key-value pair per record (row). Afterwards, the step called
Shuffling gathers the values that share a common key into a single list.
Finally, the Reduce phase processes the data organized as key-list pairs and
returns the result. For example, suppose we want to write a program that finds
the highest temperature recorded each year at a meteorological station between
1950 and 1959. The Map phase first converts each daily weather record into a
single key-value pair, producing a total of 3650 (365 days * 10 years) records
such as (1950, 0), (1950, 2), and so on. Shuffling then groups these values
into one list per year: (1950, [0, 2, ...]), (1951, [2, 6, ...]), ...,
(1959, [3, 8, ..., 32]). Subsequently, if the Reduce phase runs an algorithm
that finds the maximum value of each list, results such as (1950, 43),
(1951, 46), ..., (1959, 32) are obtained.
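The following Java sketch shows how this example could be written against
Hadoop's MapReduce API. It is an illustration rather than the program described
in the text: it assumes each input line is already formatted as
"year<TAB>temperature", and the class names are hypothetical.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MaxTemperature {

      // Map: convert each record (row) into a (year, temperature) pair,
      // e.g. "1950\t0" becomes (1950, 0).
      public static class MaxTempMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String[] fields = value.toString().split("\t");
          context.write(new Text(fields[0]),
                        new IntWritable(Integer.parseInt(fields[1])));
        }
      }

      // Reduce: after Shuffling, each year arrives with the list of its
      // temperatures; emit the maximum, e.g. (1950, 43).
      public static class MaxTempReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text year, Iterable<IntWritable> values,
                              Context context)
            throws IOException, InterruptedException {
          int max = Integer.MIN_VALUE;
          for (IntWritable v : values) {
            max = Math.max(max, v.get());
          }
          context.write(year, new IntWritable(max));
        }
      }

      // Driver: configures and submits the job.
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "max temperature");
        job.setJarByClass(MaxTemperature.class);
        job.setMapperClass(MaxTempMapper.class);
        job.setReducerClass(MaxTempReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }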
Hadoop also provides an API to MapReduce that allows the Map and
Reduce functions to be written in languages other than Java. Hadoop Streaming
uses UNIX standard streams as the interface between Hadoop and the user's
program. Thus, users can write MapReduce programs in various languages, such as
Python and Ruby, that can read from standard input and write to standard
output.
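As a rough illustration (not taken from the text), the max-temperature example
above could be run with Hadoop Streaming using a Python mapper and reducer
along the following lines. Streaming delivers the reducer's input lines already
sorted by key, so records for the same year arrive consecutively; the file
names and the "year<TAB>temperature" input format are assumptions.

    # mapper.py -- emits one "year<TAB>temperature" pair per input record
    # (assumes each input line already has the form "year<TAB>temperature").
    import sys

    for line in sys.stdin:
        year, temp = line.rstrip("\n").split("\t")
        print("%s\t%s" % (year, temp))

    # reducer.py -- input lines arrive sorted by key, so all records for a
    # given year are consecutive; keep the maximum temperature per year.
    import sys

    current_year, max_temp = None, None
    for line in sys.stdin:
        year, temp = line.rstrip("\n").split("\t")
        temp = int(temp)
        if year == current_year:
            max_temp = max(max_temp, temp)
        else:
            if current_year is not None:
                print("%s\t%d" % (current_year, max_temp))
            current_year, max_temp = year, temp
    if current_year is not None:
        print("%s\t%d" % (current_year, max_temp))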