2013年3月7日 星期四

[hive][apache] hive basic instruction

[hive][apache] hive basic instruction

hive的簡介

緣起
- hadoop core 的介紹
- hdfs
- MapReduce 特點與架構
發展
- facebook 發展
- Qubole
The Qubole Data Service (QDS) is a Software-as-a-Service analytics platform running on leading cloud offerings like AWS.
- 使用的公司
* baidu
* facebook
* qubole

使用比例
為什麼使用hive的轉變

tips
- 對於同一個表使用多個查詢
(Making Multiple Passes over the Same Data)
The following rewrite achieves the same thing, but using a single pass through the source history table


HDFS was designed for many millions of large files, not billions of small files
Each partition corresponds to a directory that usually contains multiple files.
MapReduce processing converts a job into multiple tasks.

Another solution is to use two levels of partitions along different dimensions. For ex- ample, the first partition might be by day and the second-level partition might be by geographic region, like the state:

The primary reason to avoid normalization is to minimize disk seeks, such as those typically required to navigate foreign key relations
when you have 10s of terabytes to many petabytes of data, optimizing speed makes these limitations worth accepting.


沒有留言:

張貼留言