HDFS (Hadoop Distributed File System): a distributed file system designed to run on commodity hardware.
1. Assumptions & Goals
- Fault tolerance: data stays accessible even when some servers fail.
- Works with various Hadoop ecosystem applications, including those that need streaming access to data.
- High throughput of data access (favored over low latency).
- Supports very large files, typically gigabytes to terabytes in size.
2. Blocks
- A file is split into one or more "blocks". (File == sequence of blocks)
- Each block is replicated according to the replication factor and stored across different DataNodes.
- Default block size: 64MB in Hadoop 1.x (128MB since Hadoop 2).
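The splitting above is just fixed-size chunking. A minimal sketch of the idea (illustrative only; real HDFS does this inside the client/NameNode, and these names are not actual HDFS APIs):

```python
# Sketch: how HDFS conceptually splits a file into fixed-size blocks.
def split_into_blocks(file_size, block_size=128 * 1024 * 1024):
    """Return (offset, length) pairs for each block of a file."""
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 300MB file with 128MB blocks yields two full blocks and one 44MB block.
mb = 1024 * 1024
print(split_into_blocks(300 * mb))
```

Note that a short final block occupies only its actual size on disk, not a full block's worth of space.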
3. NameNode, DataNode
- NameNode
  - Master server of the cluster
  - Manages the filesystem namespace and block metadata (replication, block-to-DataNode mapping)
  - Checks DataNode state/availability via periodic heartbeats and block reports
  - SPOF (Single Point of Failure): if the NameNode fails, the whole filesystem is unavailable (Hadoop 2 added NameNode HA to mitigate this)
- DataNode
  - Stores the actual data (blocks) and serves read/write requests from clients
  - Rack: a group of "physically close" DataNodes (20~30 nodes)
  - Common replication policy: with replication factor 3, two replicas are placed on one rack and the third on a different (external) rack
  - Effect: high intra-rack network bandwidth and low latency for writes, while the off-rack replica survives a whole-rack failure
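The rack-aware placement policy above can be sketched as follows. This is a simplified illustration under assumed data structures (the rack map and node names are hypothetical, not HDFS APIs):

```python
import random

def place_replicas(racks, local_rack):
    """Pick 3 DataNodes for a block: one on the writer's rack,
    two distinct nodes on a single different rack."""
    # First replica: close to the writer (low-latency, high-bandwidth write).
    first = random.choice(racks[local_rack])
    # Remaining two replicas: together on one other rack, so a whole-rack
    # failure of the local rack does not lose the block.
    other_rack = random.choice([r for r in racks if r != local_rack])
    second, third = random.sample(racks[other_rack], 2)
    return [first, second, third]

racks = {
    "rack-1": ["dn1", "dn2", "dn3"],
    "rack-2": ["dn4", "dn5", "dn6"],
}
print(place_replicas(racks, "rack-1"))  # e.g. ['dn2', 'dn4', 'dn6']
```

Placing two replicas on a single remote rack (rather than three separate racks) cuts cross-rack write traffic while still tolerating the loss of any one rack.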
4. Typical Failures
- NameNode failure
- DataNode failure
- Network Partitions: a partition can cut a subset of DataNodes off from the NameNode; their heartbeats stop arriving, the NameNode marks them dead, and their blocks are re-replicated from surviving copies.
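The heartbeat-based detection used for DataNode failures and network partitions can be sketched as below. Names and the timeout value are illustrative (real HDFS defaults to roughly 10.5 minutes before declaring a DataNode dead):

```python
HEARTBEAT_TIMEOUT = 10 * 60  # seconds; illustrative, tunable in real HDFS

def find_dead_datanodes(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Return DataNodes whose last heartbeat is older than the timeout.

    last_heartbeat maps a node name to the timestamp (in seconds) of its
    most recent heartbeat, as the NameNode would track it.
    """
    return [dn for dn, t in last_heartbeat.items() if now - t > timeout]

heartbeats = {"dn1": 1000.0, "dn2": 400.0}  # toy timestamps in seconds
print(find_dead_datanodes(heartbeats, now=1200.0))  # dn2 missed the window
```

Once a node lands in this "dead" list, the NameNode schedules re-replication of every block it held, restoring the replication factor on the remaining nodes.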