Hadoop is a framework for running applications on large clusters built of commodity hardware.
- Scalable: It can reliably store and process petabytes.
- Economical: It distributes the data and processing across clusters of commonly available computers, numbering in the thousands.
- Efficient: By distributing the data, it can process it in parallel on the nodes where the data is located.
- Reliable: It automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures.
Why Hadoop?
Challenge: read 1 TB of data.
- 1 machine, 4 I/O channels, 100 MB/s per channel: about 45 minutes
- 10 machines, 4 I/O channels each, 100 MB/s per channel: about 4.5 minutes
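A quick back-of-the-envelope check of those figures (treating 1 TB as 10^6 MB; the single-machine result is rounded up to 45 minutes in the comparison above):

```java
// Sketch of the read-time arithmetic, using the assumed values from the comparison above.
public class ReadTime {
    public static void main(String[] args) {
        double dataMB = 1_000_000;      // 1 TB expressed as 10^6 MB
        double channelMBps = 100;       // throughput per I/O channel
        int channelsPerMachine = 4;

        for (int machines : new int[] {1, 10}) {
            double seconds = dataMB / (machines * channelsPerMachine * channelMBps);
            System.out.printf("%2d machine(s): %.1f minutes%n", machines, seconds / 60);
        }
    }
}
```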
HDFS (Hadoop Distributed File System):
HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
A computer cluster consists of a set of loosely connected or tightly connected computers that work together so that in many respects they can be viewed as a single system.
Commodity hardware: commonly available hardware that can be obtained from multiple vendors.
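As a concrete illustration of the streaming access pattern, here is a minimal sketch that reads a file from HDFS with Hadoop's Java FileSystem API. The namenode URI and the path /user/demo/sample.txt are placeholders, not values from this post.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsStreamingRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode address; in practice this usually comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf);
             FSDataInputStream in = fs.open(new Path("/user/demo/sample.txt"))) {
            // Stream the file's bytes to stdout in 4 KB chunks.
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```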
HDFS Files:
User data is divided into 64 MB blocks and replicated across the local disks of cluster nodes to address:
- Cluster network bottleneck
- Cluster node crashes
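Both the block size and the replication factor are configurable; cluster-wide defaults normally live in hdfs-site.xml, and a client can also override them. A sketch, assuming the Hadoop 2+ property names dfs.blocksize and dfs.replication and a placeholder file path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Client-side overrides for files this client creates.
        conf.set("dfs.blocksize", String.valueOf(64L * 1024 * 1024)); // 64 MB, as in the post
        conf.set("dfs.replication", "3");

        try (FileSystem fs = FileSystem.get(conf)) {
            // Inspect how an existing file was actually stored (path is a placeholder).
            FileStatus status = fs.getFileStatus(new Path("/user/demo/sample.txt"));
            System.out.println("Block size:  " + status.getBlockSize() + " bytes");
            System.out.println("Replication: " + status.getReplication());
        }
    }
}
```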
Master/Slave Architecture
- Master (Namenode): maintains the filesystem namespace and metadata
- Slaves (Datanodes): maintain replicas of each data block (three by default)
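To see this split in practice: a client asks the Namenode where each block of a file lives, then reads the bytes directly from the Datanodes. A minimal sketch using the Java API (the file path is again a placeholder):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockLocations {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            FileStatus status = fs.getFileStatus(new Path("/user/demo/sample.txt"));
            // Answered from the Namenode's metadata; no file data is read here.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.printf("offset=%d length=%d datanodes=%s%n",
                        block.getOffset(), block.getLength(),
                        String.join(",", block.getHosts()));
            }
        }
    }
}
```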
HDFS Architecture: