Friday, October 11, 2013

[hadoop]MapFile


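A MapFile is a sorted, indexed Hadoop SequenceFile.
On HDFS a MapFile is a directory made up of two files: index, which holds the key index, and data, which holds the sorted original records.
To look up a key, only index has to be loaded into memory. A binary search over it finds the nearest indexed key, and the reader then seeks to that position in data, so the lookup is fast.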
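The lookup described above can be sketched in plain Java. This is a minimal in-memory model, not the Hadoop API: the class `SparseIndexLookup`, its fields, and the `interval` parameter are hypothetical names chosen for illustration. It keeps every Nth key in an index, binary-searches that index for the last entry at or before the target, then scans forward through the sorted data.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory model of a MapFile lookup; not the Hadoop API. */
class SparseIndexLookup {
    // Sampled keys (one per interval) and the position of each in the data arrays.
    final List<Integer> indexKeys = new ArrayList<>();
    final List<Integer> indexPositions = new ArrayList<>();
    final int[] keys;       // sorted keys, standing in for the data file
    final String[] values;  // values[i] belongs to keys[i]

    SparseIndexLookup(int[] sortedKeys, String[] values, int interval) {
        this.keys = sortedKeys;
        this.values = values;
        for (int i = 0; i < sortedKeys.length; i += interval) {
            indexKeys.add(sortedKeys[i]);
            indexPositions.add(i);
        }
    }

    /** Returns the value stored for key, or null if the key is absent. */
    String get(int key) {
        // Binary search for the last index entry whose key is <= the target.
        int lo = 0, hi = indexKeys.size() - 1, slot = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (indexKeys.get(mid) <= key) {
                slot = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        if (slot < 0) {
            return null; // target is smaller than the first key
        }
        // Scan forward from the indexed position until we reach or pass the target.
        for (int i = indexPositions.get(slot); i < keys.length && keys[i] <= key; i++) {
            if (keys[i] == key) {
                return values[i];
            }
        }
        return null;
    }
}
```

Only the sampled keys need to stay in memory, which is why a small index can serve a much larger data file.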

Contents of index:
# hadoop fs -text numbers.map/index


1 128
129 5820
257 11539
385 17255
513 22971
641 28676
769 34388
897 40107

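One index entry is written for every 128 keys. The first column is the key and the second column is the offset of that key's record in the data file.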
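To make the "one entry per 128 keys" rule concrete, here is a small sketch of how such an index could be built while records are appended. It is an illustration only, not Hadoop's writer: `IndexBuilderSketch`, `Entry`, `recordSizes`, and `headerBytes` are hypothetical names, and the record sizes in any call are made-up inputs rather than the real serialized sizes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of building a sparse (key, offset) index; not the Hadoop API. */
class IndexBuilderSketch {
    /** One index entry: a sampled key and the byte offset where its record starts. */
    static final class Entry {
        final long key;
        final long offset;

        Entry(long key, long offset) {
            this.key = key;
            this.offset = offset;
        }
    }

    /**
     * Walks the sorted keys in order, tracking the running byte offset, and
     * records an entry for every interval-th key (starting with the first).
     * headerBytes is the assumed size of whatever precedes the first record.
     */
    static List<Entry> buildIndex(long[] sortedKeys, int[] recordSizes,
                                  long headerBytes, int interval) {
        List<Entry> index = new ArrayList<>();
        long offset = headerBytes;
        for (int i = 0; i < sortedKeys.length; i++) {
            if (i % interval == 0) {
                index.add(new Entry(sortedKeys[i], offset));
            }
            offset += recordSizes[i]; // the next record starts after this one
        }
        return index;
    }
}
```

With keys 1 to 1000 and an interval of 128, this yields entries for keys 1, 129, 257, and so on up to 897, the same key pattern as the listing above.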


The data file contains the sorted key/value pairs.


# hadoop fs -text numbers.map/data
13/10/11 17:17:23 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/10/11 17:17:23 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/10/11 17:17:23 INFO compress.CodecPool: Got brand-new decompressor
1 one 11 fsdf afd fsdf 111
2 two 222 fsdf d fsd sd 222
3 thref sfd sfsdf e 333 fsd 333
4 four 44 fds 4sfsd fsdfs 4444
5 five 555 fsd fdsf fsd f sf sdfsdfsdf 5555
6 one 11 fsdf afd fsdf 111
7 two 222 fsdf d fsd sd 222
8 thref sfd sfsdf e 333 fsd 333



org.apache.hadoop.io
Class MapFile

java.lang.Object
  extended by org.apache.hadoop.io.MapFile

Direct Known Subclasses:
ArrayFile, SetFile

public class MapFile extends Object

A file-based map from keys to values.


A map is a directory containing two files, the data file, containing all keys and values in the map, and a smaller index file, containing a fraction of the keys. The fraction is determined by MapFile.Writer.getIndexInterval().


The index file is read entirely into memory. Thus key implementations should try to keep themselves small.


Map files are created by adding entries in-order. To maintain a large database, perform updates by copying the previous version of a database and merging in a sorted change list, to create a new version of the database in a new file. Sorting large change lists can be done with SequenceFile.Sorter.
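The update strategy in the paragraph above (copy the previous version and merge in a sorted change list) boils down to a two-way merge of sorted inputs. The sketch below shows that merge over in-memory lists. It is an assumption-laden illustration, not how Hadoop implements it: `MergeUpdateSketch` and `merge` are hypothetical names, and letting the change win on equal keys is a policy chosen here for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of producing a new sorted version from an old one plus changes. */
class MergeUpdateSketch {
    /**
     * Merges two lists that are each sorted by key. When both contain the
     * same key, the entry from the change list replaces the previous one.
     */
    static List<Map.Entry<Integer, String>> merge(
            List<Map.Entry<Integer, String>> previous,
            List<Map.Entry<Integer, String>> changes) {
        List<Map.Entry<Integer, String>> next = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < previous.size() && j < changes.size()) {
            int cmp = previous.get(i).getKey().compareTo(changes.get(j).getKey());
            if (cmp < 0) {
                next.add(previous.get(i++));   // unchanged old entry
            } else if (cmp > 0) {
                next.add(changes.get(j++));    // brand-new key
            } else {
                next.add(changes.get(j++));    // updated key: change wins
                i++;
            }
        }
        while (i < previous.size()) {
            next.add(previous.get(i++));
        }
        while (j < changes.size()) {
            next.add(changes.get(j++));
        }
        return next;
    }
}
```

Because the output is produced in key order, it can be appended entry by entry to a new map, which satisfies the requirement that entries are added in order.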


 


 


 

