
HDFS Architecture Guide

HDFS (Hadoop Distributed File System): a distributed file system designed to run on commodity hardware.

 

1. Assumptions & Goals

- Fault tolerance: data remains accessible even when some servers fail.

- Works with various Hadoop ecosystem applications, including those built around streaming data access.

- Emphasizes high data throughput over low-latency access.

- Handles very large datasets: files of terabytes in size.

 

2. Blocks

- A file is split into one or more "blocks". (File == sequence of blocks)

- Each block is replicated according to the file's replication factor, and the replicas are spread across different DataNodes.

- Default size of a block: 64 MB (raised to 128 MB in Hadoop 2.x and later)
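The file-to-block mapping above can be sketched in a few lines. This is an illustration only, not HDFS code; the 128 MB constant assumes the Hadoop 2.x default block size.

```python
# Sketch: how a file of a given size maps to HDFS blocks.
# BLOCK_SIZE assumes the Hadoop 2.x default (earlier releases used 64 MB).
BLOCK_SIZE = 128 * 1024 * 1024


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the size of each block for a file of file_size bytes.

    Every block is full-sized except possibly the last one, which holds
    whatever remains of the file.
    """
    blocks = []
    remaining = file_size
    while remaining > 0:
        blocks.append(min(block_size, remaining))
        remaining -= block_size
    return blocks


# A 300 MB file becomes two full 128 MB blocks plus a 44 MB tail block.
sizes = split_into_blocks(300 * 1024 * 1024)
```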

 

3. NameNode, DataNode

- NameNode

    - Master server

    - Manages the filesystem namespace and block metadata (replication factor, block-to-DataNode mapping)

    - Communicates with DataNodes (via heartbeats) to check their state/availability

    - SPOF (Single Point Of Failure): if the NameNode fails, the whole filesystem becomes unavailable (in classic, non-HA deployments).

- DataNode

    - Stores and serves the actual data blocks; creates, deletes, and replicates blocks on instruction from the NameNode

- Rack: a group of "physically close" DataNodes, typically 20~30 nodes sharing the same network switch.

    - Common replication policy: with a replication factor of 3, two replicas are placed on different nodes within one rack, and the third on a node in another rack.

    - Effect: in-rack transfers enjoy high network bandwidth and low latency, while the off-rack replica still survives a whole-rack failure.
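The rack-aware placement policy above can be sketched as follows. This is a simplified illustration, not the actual HDFS placement code: the `topology` dict shape and function names are made up, and tie-breaking (HDFS prefers the writer's local node for the first replica) is only loosely modeled.

```python
import random


def place_replicas(writer_node: str, topology: dict[str, list[str]]) -> list[str]:
    """Pick 3 nodes for a block, following the common HDFS policy:

    - 1st replica on the writer's own node (its local rack),
    - 2nd and 3rd replicas on two different nodes in one remote rack,

    so that two replicas share a rack and one sits in an external rack.
    `topology` maps rack name -> list of node names (hypothetical shape).
    """
    # Find which rack the writer lives in.
    local_rack = next(r for r, nodes in topology.items() if writer_node in nodes)
    # Pick a remote rack with room for two replicas.
    remote_rack = next(
        r for r in topology if r != local_rack and len(topology[r]) >= 2
    )
    return [writer_node] + random.sample(topology[remote_rack], 2)


topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
replicas = place_replicas("n1", topology)
# replicas[0] is the local node; the other two share the remote rack.
```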

 

4. Typical Failures

- NameNode failure

- DataNode failure

- Network Partitions: a group of DataNodes loses connectivity to the NameNode; the NameNode detects this through missed heartbeats, marks those DataNodes dead, and stops routing I/O to them (triggering re-replication of their blocks).
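Heartbeat-based failure detection, which covers both DataNode failures and network partitions, can be sketched like this. The class and method names are invented for illustration; the 10-minute threshold loosely mirrors the default interval after which HDFS considers a DataNode dead.

```python
class HeartbeatTracker:
    """Toy model of NameNode-side liveness tracking (not real HDFS code).

    A DataNode that has not sent a heartbeat within `dead_interval_secs`
    is declared dead; its blocks must then be re-replicated elsewhere.
    """

    def __init__(self, dead_interval_secs: float = 600.0):
        self.dead_interval = dead_interval_secs
        self.last_seen: dict[str, float] = {}

    def heartbeat(self, datanode: str, now: float) -> None:
        """Record a heartbeat from a DataNode at time `now` (seconds)."""
        self.last_seen[datanode] = now

    def dead_nodes(self, now: float) -> set[str]:
        """Return DataNodes whose last heartbeat is older than the threshold."""
        return {
            dn for dn, t in self.last_seen.items()
            if now - t > self.dead_interval
        }


tracker = HeartbeatTracker()
tracker.heartbeat("dn1", 0.0)
tracker.heartbeat("dn2", 500.0)
# At t=700s, dn1 is 700s stale (dead); dn2 is only 200s stale (alive).
dead = tracker.dead_nodes(700.0)
```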