3.1.1. HDFS
Hadoop is built on two important concepts. One of them, HDFS, is a
scalable, reliable, distributed file system written in Java; it stores files
in units of blocks and operates in a master-worker structure. The block size
is the maximum amount of data that can be read or written at a time.
There are some benefits of introducing the concept of block abstraction into a
distributed file system. The first benefit is that a single file can be larger
than the capacity of a single disk. Multiple blocks that make up a single file
need not be stored only on the same disk, so they can be stored on any disk in
the cluster. The second benefit is that abstracting into blocks rather than
file units simplifies the storage subsystem. Furthermore, blocks are highly
suitable for implementing the necessary replication to provide fault tolerance
and availability. To cope with block corruption and disk or machine failures, each
block is replicated to a number of physically separated machines. If one replica
becomes unavailable, the client can be directed to read a copy on another
machine. In Hadoop 2, the block size is 128 MB by default, and the cluster's
block size can be adjusted in the hdfs-site.xml file. The number of replicas
can also be set in the hdfs-site.xml file.
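For reference, these two settings correspond to the dfs.blocksize and
dfs.replication properties. A minimal hdfs-site.xml entry might look like the
following sketch; the values shown are simply the Hadoop 2 defaults (a 128 MB
block size and three replicas per block).

    <configuration>
      <property>
        <name>dfs.blocksize</name>
        <value>134217728</value> <!-- 128 MB = 128 * 1024 * 1024 bytes -->
      </property>
      <property>
        <name>dfs.replication</name>
        <value>3</value> <!-- number of replicas kept for each block -->
      </property>
    </configuration>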
An HDFS cluster consists of two types of nodes operating in a
master-worker pattern: one Namenode, the master, and several Datanodes, the
workers. The Namenode manages the namespace of the file system.
The Namenode maintains metadata about the file system tree and all of the files
and directories contained in it. This information is stored permanently on the
local disk as two types of files, FSimage and Editlog. The Namenode also keeps
track of which Datanodes hold the blocks of each file. The Datanode is the real
worker in the file system. A Datanode stores and retrieves blocks at the
request of a client or the Namenode, and periodically reports the list of
blocks it is storing to the Namenode. In addition, the Datanode sends
information such as its available disk space, the number of data transfers in
progress, and its current load to the Namenode every three seconds through a
'heartbeat'. The Namenode keeps its metadata up to date based on this
information.
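As an illustration of how a client uses this structure (this sketch is not
from the text), the following Java fragment reads a file from HDFS through
Hadoop's FileSystem API. Behind the scenes, the client obtains the block
locations from the Namenode and then streams the block contents from the
Datanodes, although the API hides this exchange; the class name and file path
below are hypothetical.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadExample {
      public static void main(String[] args) throws Exception {
        // Loads core-site.xml / hdfs-site.xml from the classpath, so the
        // client knows which Namenode to contact.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Opening a file asks the Namenode for the block locations; the
        // returned stream then reads the block contents from the Datanodes.
        Path path = new Path("/user/hadoop/sample.txt"); // hypothetical path
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(path)))) {
          String line;
          while ((line = reader.readLine()) != null) {
            System.out.println(line);
          }
        }
      }
    }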
3.1.2. MapReduce
MapReduce is a programming model for data processing. It is also
the execution framework for Hadoop clusters and is written in Java. To take
advantage of the parallel processing offered by Hadoop, we need to re-express
the client's requirements as a MapReduce job. A MapReduce operation is largely
divided into a Map phase and a Reduce phase, and a simple intermediate step
called Shuffling between Map and Reduce helps process the data faster.
Each phase takes key-value pairs as input and output, and their types are
chosen by the programmer. A MapReduce program follows a series of development
steps. First, write unit tests while implementing the Map and Reduce functions,
and verify that they work correctly. Next, write a driver program that runs the
job, and run it in an Integrated Development Environment (IDE) on part of the
dataset to verify that it behaves properly. The Map phase converts the input
data into one key-value pair per record (row). Afterwards, the step called
Shuffling gathers the values that share a common key into a single list.
Finally, the Reduce phase processes the data organized as key-list pairs and
returns the result. For example, suppose we want to write a program that finds
the highest temperature recorded each year at a meteorological station between
1950 and 1959. The Map phase first converts each daily weather record into a
single key-value pair, producing a total of 3650 (365 days * 10 years) records
such as (1950, 0), (1950, 2), and so on. Shuffling then groups these values
into one list per year: (1950, [0, 2, ...]), (1951, [2, 6, ...]), ...,
(1959, [3, 8, ..., 32]). Subsequently, if the Reduce phase runs an algorithm
that finds the maximum value of each list, results such as (1950, 43),
(1951, 46), ..., (1959, 32) are obtained.
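The following Java sketch shows how this example could be written against
Hadoop's MapReduce API. It is an illustration rather than the program described
in the text: it assumes each input line is already formatted as
"year<TAB>temperature", and the class names are hypothetical.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MaxTemperature {

      // Map: convert each record (row) into a (year, temperature) pair,
      // e.g. "1950\t0" becomes (1950, 0).
      public static class MaxTempMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String[] fields = value.toString().split("\t");
          context.write(new Text(fields[0]),
                        new IntWritable(Integer.parseInt(fields[1])));
        }
      }

      // Reduce: after Shuffling, each year arrives with the list of its
      // temperatures; emit the maximum, e.g. (1950, 43).
      public static class MaxTempReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text year, Iterable<IntWritable> values,
                              Context context)
            throws IOException, InterruptedException {
          int max = Integer.MIN_VALUE;
          for (IntWritable v : values) {
            max = Math.max(max, v.get());
          }
          context.write(year, new IntWritable(max));
        }
      }

      // Driver: configures and submits the job.
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "max temperature");
        job.setJarByClass(MaxTemperature.class);
        job.setMapperClass(MaxTempMapper.class);
        job.setReducerClass(MaxTempReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }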
Hadoop also provides an API to MapReduce that allows the Map and
Reduce functions to be written in languages other than Java. Hadoop Streaming
uses UNIX standard streams as the interface between Hadoop and the user's
program. Thus, users can write MapReduce programs in various languages, such as
Python and Ruby, that can read from standard input and write to standard
output.
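As a rough illustration (not taken from the text), the max-temperature example
above could be run with Hadoop Streaming using a Python mapper and reducer
along the following lines. Streaming delivers the reducer's input lines already
sorted by key, so records for the same year arrive consecutively; the file
names and the "year<TAB>temperature" input format are assumptions.

    # mapper.py -- emits one "year<TAB>temperature" pair per input record
    # (assumes each input line already has the form "year<TAB>temperature").
    import sys

    for line in sys.stdin:
        year, temp = line.rstrip("\n").split("\t")
        print("%s\t%s" % (year, temp))

    # reducer.py -- input lines arrive sorted by key, so all records for a
    # given year are consecutive; keep the maximum temperature per year.
    import sys

    current_year, max_temp = None, None
    for line in sys.stdin:
        year, temp = line.rstrip("\n").split("\t")
        temp = int(temp)
        if year == current_year:
            max_temp = max(max_temp, temp)
        else:
            if current_year is not None:
                print("%s\t%d" % (current_year, max_temp))
            current_year, max_temp = year, temp
    if current_year is not None:
        print("%s\t%d" % (current_year, max_temp))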