How exactly Hadoop stores the data

Apache Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data. © Wikipedia

How about we explore how exactly distributed storage works in Hadoop?

Imagine that we need to store a large number of documents, run some computation on them, such as counting all the words, and store the resulting word counts close to the source data. To run this kind of job regularly with Hadoop, we first have to store the data on our hardware.

To begin, let’s picture storing the data on a single machine, without Hadoop:

Storing the data with one machine

Now, imagine that we need to handle an enormous number of files, and that a single file can be even bigger than one machine’s HDD. In this situation, we can upgrade our machine by adding more HDDs. Doing this has the following drawbacks:

  • Fitting an enormous amount of storage into a single machine is extremely expensive.
  • If our machine goes down unexpectedly, we can no longer process any data.

To solve these problems, we need to split the work across several machines. To achieve this, we add 3 extra machines and connect them to one another so that they form a computing cluster. The goal is to store the data while using as many resources of each machine as possible at the same time. We call these machines workers or nodes.

A few machines to store the data

Now, we can simply divide all input files by 4 (the number of our nodes), so that each machine receives ¼ of the input files. We expect the documents to be spread evenly across the 4 machines. But then we face even more disadvantages:

  • Before we begin handling any data, we have to query each worker to find out whether it has enough capacity to store the data. With many workers this can take a while, particularly if the network connection is slow.
  • We need to monitor the progress of every worker to know when the job is finished.
  • If a node breaks down, we have to detect it and decide what to do with the files assigned to that machine. In this case, its data has to be reassigned to the nodes that are still alive.
  • We need a single point of access to all the stored data. Otherwise, we can never be certain that we have asked every worker that may contain the information we need.
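The naive scheme and its weakest spot can be sketched in a few lines of Python. The file names, worker names and both helper functions are hypothetical and exist only for illustration; this is the do-it-yourself approach described above, not how Hadoop distributes data.

```python
from collections import defaultdict


def assign_round_robin(files, workers):
    """Give each worker roughly len(files) / len(workers) files."""
    assignment = defaultdict(list)
    for index, name in enumerate(files):
        assignment[workers[index % len(workers)]].append(name)
    return dict(assignment)


def unreachable_files(assignment, failed_worker):
    """Files that cannot be read once a worker fails (no replicas exist)."""
    return assignment.get(failed_worker, [])


files = [f"doc_{i}.txt" for i in range(8)]
workers = ["worker-1", "worker-2", "worker-3", "worker-4"]

assignment = assign_round_robin(files, workers)
for worker, assigned in assignment.items():
    print(worker, assigned)

# Without replication, losing one worker loses a quarter of the files.
print("lost:", unreachable_files(assignment, "worker-3"))
```

Nothing in this sketch tracks free capacity, progress or failures, which is exactly the bookkeeping the list above says we would have to build ourselves.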

Now, let’s talk about the real Hadoop architecture. It turns out that, to overcome the problems mentioned above, we have to designate one of the machines as the main node (known as the NameNode in Hadoop), while the workers are known as DataNodes.

Hadoop nodes

Once one node acts as the NameNode, it is responsible for allocating input files to the workers, monitoring the execution and, if necessary, reassigning input files to other workers. If we need our data to be stored reliably, the NameNode should also manage all the replicas of the stored data.

For these purposes, the data on each node is stored in blocks. The information about each block and the data inside it is kept in the metadata on the NameNode.
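As a rough illustration of blocks and metadata, the sketch below cuts a payload into blocks and records where each one lives. The `BlockInfo` class, the `metadata` dictionary, the file path and the DataNode names are all assumptions made up for this example, and the 16-byte block size is deliberately tiny (HDFS uses 128 MB by default).

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # bytes; tiny on purpose, HDFS defaults to 128 MB


@dataclass
class BlockInfo:
    block_id: int
    length: int
    datanodes: list = field(default_factory=list)  # where the block lives


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Cut the payload into chunks; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


# Hypothetical NameNode metadata: file path -> ordered list of blocks.
metadata = {}

payload = b"the quick brown fox jumps over the lazy dog"
chunks = split_into_blocks(payload)
metadata["/docs/fox.txt"] = [
    BlockInfo(block_id=i, length=len(chunk), datanodes=[f"datanode-{i % 3 + 1}"])
    for i, chunk in enumerate(chunks)
]

for block in metadata["/docs/fox.txt"]:
    print(block)
```

Note that the metadata holds only identifiers, lengths and locations. The bytes themselves stay on the DataNodes.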

Once we have the NameNode, writing data to Hadoop looks like this:

Hadoop writing

When a request to write data to the Hadoop cluster arrives, it goes to the NameNode first.

  1. The NameNode checks its metadata for blocks with enough free space. It then chooses exactly which block will hold each part of the data, as well as its replicas. Once the target blocks are chosen, the NameNode tells the client which DataNodes to use for each part of the data and exactly which block each part should go into.
  2. While the client writes its data, the NameNode waits for the client either to confirm the write or to time out, in which case the NameNode resets its metadata.
  3. The client connects to the target DataNode, writes the data to the target blocks and waits for the acknowledgment.
  4. After the data is written, but before sending the acknowledgment, the DataNode checks whether the newly written data needs replicas. If it does, the DataNode copies the new data to the next DataNode selected by the NameNode at the beginning, which does the same in turn, and so on.
  5. When the last DataNode has completed the replication, it sends an acknowledgment back to the previous DataNode, and so on along the chain until the acknowledgment reaches the client.
  6. When the client receives the acknowledgment, it reports the successful write to the NameNode.
  7. The NameNode saves the metadata about each block that was acknowledged. At this moment the write is complete, and the data is saved and replicated among the DataNodes. The NameNode now knows exactly which blocks hold each part of the data.
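The seven steps above can be mimicked with a small in-memory simulation. This is only a sketch under stated assumptions: the `DataNode`, `NameNode` and `client_write` names are hypothetical stand-ins, not the HDFS client API, and the target selection is a simple rotation rather than Hadoop's real placement policy. It also assumes the replication factor does not exceed the number of DataNodes.

```python
class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes

    def write(self, block_id, data, pipeline):
        """Store the block, forward it down the pipeline, then acknowledge."""
        self.blocks[block_id] = data
        if pipeline:  # step 4: more replicas requested, pass the data on
            next_node, rest = pipeline[0], pipeline[1:]
            if not next_node.write(block_id, data, rest):
                return False
        return True  # step 5: the ack travels back toward the client


class NameNode:
    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = replication
        self.pending = {}   # block_id -> targets, not yet confirmed
        self.metadata = {}  # block_id -> DataNode names, confirmed
        self._next_id = 0

    def allocate(self):
        """Step 1: choose target DataNodes for a new block."""
        block_id = self._next_id
        self._next_id += 1
        start = block_id % len(self.datanodes)
        targets = [self.datanodes[(start + i) % len(self.datanodes)]
                   for i in range(self.replication)]
        self.pending[block_id] = targets
        return block_id, targets

    def commit(self, block_id):
        """Steps 6-7: the client confirmed, so the metadata is saved."""
        targets = self.pending.pop(block_id)
        self.metadata[block_id] = [node.name for node in targets]

    def abort(self, block_id):
        """Step 2: no confirmation arrived, so the allocation is reset."""
        self.pending.pop(block_id, None)


def client_write(namenode, data):
    block_id, targets = namenode.allocate()
    acked = targets[0].write(block_id, data, targets[1:])  # steps 3-5
    if acked:
        namenode.commit(block_id)
    else:
        namenode.abort(block_id)
    return block_id


nodes = [DataNode(f"datanode-{i}") for i in range(1, 5)]
namenode = NameNode(nodes, replication=3)
block_id = client_write(namenode, b"hello hadoop")
print(block_id, namenode.metadata[block_id])
```

The client talks only to the first DataNode in the list. Each DataNode forwards the data to the next one, so the acknowledgment that reaches the client already covers every replica.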

The Hadoop architecture described above removes the disadvantages listed earlier:

  • The NameNode always knows about all of the other nodes, so there is no need to check every node each time we want to write data.
  • The NameNode monitors every active write and resets its metadata when a timeout is reached.
  • If a DataNode goes down, the NameNode redirects the write to a DataNode that is still alive.
  • The NameNode becomes the single point of access to all the stored data, because it keeps the metadata about each block on the other nodes.
  • Finally, if we need more resources, we can simply connect more DataNodes to enlarge the cluster’s capacity.
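One way the NameNode could keep track of which DataNodes are alive is by recording periodic check-ins (heartbeats) from each of them. The sketch below assumes that mechanism; the `LivenessTracker` class, its method names and the 30-second timeout are illustrative choices, not Hadoop defaults or Hadoop code. Time is passed in explicitly so the example stays deterministic.

```python
HEARTBEAT_TIMEOUT = 30.0  # seconds; illustrative value, not an HDFS default


class LivenessTracker:
    """Hypothetical slice of NameNode state: last check-in per DataNode."""

    def __init__(self):
        self.last_seen = {}

    def heartbeat(self, datanode, now):
        self.last_seen[datanode] = now

    def alive(self, now):
        return sorted(name for name, seen in self.last_seen.items()
                      if now - seen <= HEARTBEAT_TIMEOUT)

    def choose_targets(self, replicas, now):
        """Pick write targets only among DataNodes that are still alive."""
        candidates = self.alive(now)
        if len(candidates) < replicas:
            raise RuntimeError("not enough live DataNodes")
        return candidates[:replicas]


tracker = LivenessTracker()
for name in ("datanode-1", "datanode-2", "datanode-3", "datanode-4"):
    tracker.heartbeat(name, now=0.0)

# datanode-2 goes silent; the others keep reporting.
for name in ("datanode-1", "datanode-3", "datanode-4"):
    tracker.heartbeat(name, now=40.0)

print(tracker.choose_targets(replicas=3, now=45.0))
```

Enlarging the cluster in this model is just one more call such as `tracker.heartbeat("datanode-5", now=45.0)`: the new node becomes a candidate for the next write without any change on the client side.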

Data Engineer, Teacher and Technical Writer.