[hadoop][hive]What are partitions in Hive

Thursday, February 28, 2013

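Partitioning a table changes how Hive structures the data storage.
When designing the physical layout of your data, partitioning can make processing more efficient.
In other words, rows that share the same partition values are stored together: in the underlying HDFS they live in the same directory, which may hold one or more data files.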

As an example, suppose we partition our employee data first by country and then by state.


CREATE TABLE employees (
  name STRING,
  salary FLOAT
)
PARTITIONED BY (country STRING, state STRING);
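Note that country and state appear only in the PARTITIONED BY clause, not in the regular column list. Their values are not read from the data files; they come from the partition the data is loaded into. A minimal sketch of loading one partition, assuming a local file at a hypothetical path (the path and file name are made up for illustration):

```sql
-- Sketch: load a local file into a single partition of employees.
-- '/tmp/employees_ca_ab.txt' is a hypothetical path; substitute a real file.
-- The file holds only the name and salary columns; country and state
-- are taken from the PARTITION clause below.
LOAD DATA LOCAL INPATH '/tmp/employees_ca_ab.txt'
INTO TABLE employees
PARTITION (country = 'CA', state = 'AB');
```

After a load like this, the rows should end up under the country=CA/state=AB directory of the table.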

On HDFS, the table's data might be stored under
hdfs://master_server/user/hive/warehouse/mydb.db/employees
and the directories inside it might look like this:


.../employees/country=CA/state=AB
.../employees/country=CA/state=BC
.../employees/country=US/state=AL
.../employees/country=US/state=AK
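You do not have to browse HDFS to see this layout. As a small sketch, Hive's SHOW PARTITIONS statement lists the partitions of a table, and each one is printed in the same key=value form as the directory names:

```sql
-- List every partition of the employees table.
SHOW PARTITIONS employees;

-- List only the partitions under one country.
SHOW PARTITIONS employees PARTITION (country = 'US');
```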

What is the benefit of doing this?

Queries that filter on a specific country and state become faster.

For example, the following query selects all employees in the state of Illinois in the United States.
Intuitively, Hive can work out right away which directory holds the matching records,
so it does not have to scan every file in the table.


SELECT * FROM employees
WHERE country = 'US' AND state = 'IL';
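The same idea applies when only some of the partition columns are filtered. The queries below are an illustrative sketch against the employees table defined earlier; the salary threshold is an arbitrary value chosen for the example:

```sql
-- Filters on the first partition column only: Hive needs to read
-- just the directories under country=CA (all of its states).
SELECT name, salary, state
FROM employees
WHERE country = 'CA';

-- No partition column in the WHERE clause: nothing can be pruned,
-- so every partition of the table has to be read.
SELECT name
FROM employees
WHERE salary > 100000;
```

Partition columns can be selected and filtered like ordinary columns, but only filters on them let Hive skip directories.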
