Hadoop is a framework for running applications on large clusters built of commodity hardware.

  • Scalable: It can reliably store and process petabytes.
  • Economical: It distributes the data and processing across clusters of commonly available computers, which can number in the thousands.
  • Efficient: By distributing the data, it can process it in parallel on the nodes where the data is located.
  • Reliable: It automatically maintains multiple copies of data and automatically redeploys computing tasks when failures occur.

Why Hadoop?

Challenge: read 1 TB of data.

  • 1 machine, 4 I/O channels, 100 MB/s per channel: ~45 minutes
  • 10 machines, 4 I/O channels each, 100 MB/s per channel: ~4.5 minutes
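The arithmetic behind those numbers is simple: one machine reads at 4 x 100 = 400 MB/s, so 1 TB takes roughly 42-45 minutes (depending on how 1 TB is counted), while ten machines reading in parallel cut that to about a tenth. A back-of-the-envelope sketch, assuming 1 TB is about 1,000,000 MB and the channels read in parallel with no overhead:

public class ReadTimeEstimate {
    public static void main(String[] args) {
        // Assumptions: 1 TB ~ 1,000,000 MB, 4 I/O channels per machine,
        // 100 MB/s per channel, perfectly parallel reads, no overhead.
        double totalMb = 1_000_000;
        double mbPerSecPerMachine = 4 * 100;  // 400 MB/s per machine

        double oneMachineMinutes  = totalMb / mbPerSecPerMachine / 60;        // ~42 min
        double tenMachinesMinutes = totalMb / (10 * mbPerSecPerMachine) / 60; // ~4.2 min

        System.out.printf("1 machine:   ~%.1f minutes%n", oneMachineMinutes);
        System.out.printf("10 machines: ~%.1f minutes%n", tenMachinesMinutes);
    }
}
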
HDFS (Hadoop Distributed File System):

HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
A computer cluster consists of a set of loosely connected or tightly connected computers that work together so that in many respects they can be viewed as a single system.
Commodity hardware: commonly available hardware that can be obtained from multiple vendors.
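To illustrate the streaming access pattern, here is a minimal sketch that opens a file in HDFS through the Java FileSystem API and reads it sequentially from start to end. The NameNode address and file path are placeholders; the actual values come from your cluster configuration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsStreamRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);        // connects to the configured HDFS cluster

        // Placeholder path; replace with a real NameNode host and file.
        Path file = new Path("hdfs://namenode:9000/user/data/input.txt");
        try (FSDataInputStream in = fs.open(file)) {
            // Stream the whole file sequentially ("write once, read many").
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}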

HDFS Files:

User data is divided into 64 MB blocks and replicated across the local disks of cluster nodes to address:
  • Cluster network bottleneck
  • Cluster node crashes
Master/Slave Architecture
  • Master (NameNode): maintains the file system namespace and metadata
  • Slaves (DataNodes): store the data blocks; each block is kept in three copies by default
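
A minimal sketch of how a client can inspect a file's block size, replication factor, and block placement through the HDFS Java API; the file path is a placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInfo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/data/input.txt");   // placeholder path

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size:  " + status.getBlockSize());   // e.g. 64 MB
        System.out.println("Replication: " + status.getReplication()); // e.g. 3

        // Each block is stored on several DataNodes; the NameNode tracks the placement.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block at offset " + block.getOffset()
                    + " hosted on: " + String.join(", ", block.getHosts()));
        }
    }
}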
HDFS Architecture:

A single NameNode manages the file system namespace and block metadata, while DataNodes store the actual blocks; clients contact the NameNode for metadata and then read or write block data directly from the DataNodes.