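A MapFile is a sorted, indexed Hadoop SequenceFile.
On HDFS a MapFile is a directory containing two files: index, which holds the key index, and data, which holds the original records sorted by key.
For a lookup, only the index has to be loaded into memory. A binary search over the index finds the closest indexed key at or before the target, and the reader then scans the data file forward from that offset, so a key can be found quickly.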
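The lookup described above can be sketched in plain Java. This is a minimal in-memory model, not the Hadoop API: the class name SparseIndexLookup, its fields, and the use of array positions in place of byte offsets are all assumptions made for illustration.

```java
// Hypothetical in-memory model of a MapFile lookup; this is NOT the Hadoop API.
// Array positions stand in for byte offsets into the data file.
class SparseIndexLookup {
    static final int INTERVAL = 128;   // one index entry per 128 keys, as in the listing

    final long[] dataKeys;             // "data": keys in sorted order
    final String[] dataValues;         // "data": value stored with each key
    final long[] indexKeys;            // "index": every INTERVAL-th key
    final int[] indexPositions;        // "index": position of that key in data

    SparseIndexLookup(long[] sortedKeys, String[] values) {
        dataKeys = sortedKeys;
        dataValues = values;
        int n = (sortedKeys.length + INTERVAL - 1) / INTERVAL;
        indexKeys = new long[n];
        indexPositions = new int[n];
        for (int i = 0; i < n; i++) {
            indexKeys[i] = sortedKeys[i * INTERVAL];
            indexPositions[i] = i * INTERVAL;
        }
    }

    // Binary search: slot of the largest indexed key <= target, or -1 if none.
    int floorIndexSlot(long target) {
        int lo = 0, hi = indexKeys.length - 1, slot = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (indexKeys[mid] <= target) {
                slot = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return slot;
    }

    // Jump to the indexed position, then scan forward until the key is found or passed.
    String get(long target) {
        int slot = floorIndexSlot(target);
        if (slot < 0) {
            return null;
        }
        for (int pos = indexPositions[slot]; pos < dataKeys.length; pos++) {
            if (dataKeys[pos] == target) {
                return dataValues[pos];
            }
            if (dataKeys[pos] > target) {
                break;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        long[] keys = new long[1000];
        String[] values = new String[1000];
        for (int i = 0; i < 1000; i++) {
            keys[i] = i + 1;
            values[i] = "v" + (i + 1);
        }
        SparseIndexLookup map = new SparseIndexLookup(keys, values);
        System.out.println(map.get(130));
    }
}
```

With 1000 keys only 8 index entries are held in memory, and a lookup scans at most 128 records after the binary search.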
The index file contains:
# hadoop fs -text numbers.map/index
1 128
129 5820
257 11539
385 17255
513 22971
641 28676
769 34388
897 40107
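One index entry is written for every 128 keys. The first column is the key and the second column is the byte offset of that key's record in the data file.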
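How such an index comes about while records are appended can be sketched as follows. This is a simplified simulation, not Hadoop's writer: the class SparseIndexBuilder, its fields, and the record size (8 bytes for the key plus the UTF-8 length of the value) are assumptions, so the offsets it produces will not match the listing above, which also reflects SequenceFile headers and record framing.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of building a sparse index during in-order appends.
// This is NOT the Hadoop API and the record layout is made up for illustration.
class SparseIndexBuilder {
    final int interval;                            // keys per index entry
    final List<long[]> index = new ArrayList<>();  // each entry is {key, offset}
    private long offset = 0;                       // simulated byte offset into "data"
    private long count = 0;                        // records appended so far
    private long lastKey;                          // last key appended

    SparseIndexBuilder(int interval) {
        this.interval = interval;
    }

    void append(long key, String value) {
        // Entries must arrive in key order; this sketch rejects a smaller key.
        if (count > 0 && key < lastKey) {
            throw new IllegalArgumentException("key out of order: " + key);
        }
        // Record the offset of every interval-th record before writing it.
        if (count % interval == 0) {
            index.add(new long[] {key, offset});
        }
        // Assumed record size: 8-byte key plus the value's UTF-8 bytes.
        offset += Long.BYTES + value.getBytes(StandardCharsets.UTF_8).length;
        lastKey = key;
        count++;
    }

    public static void main(String[] args) {
        SparseIndexBuilder builder = new SparseIndexBuilder(128);
        for (long k = 1; k <= 1000; k++) {
            builder.append(k, "x");
        }
        for (long[] entry : builder.index) {
            System.out.println(entry[0] + " " + entry[1]);
        }
    }
}
```

The keys printed in the first column follow the same 1, 129, 257, ... pattern as the real index, because an entry is emitted at record counts 0, 128, 256, and so on.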
The data file contains the key/value pairs, sorted by key.
# hadoop fs -text numbers.map/data
13/10/11 17:17:23 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/10/11 17:17:23 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/10/11 17:17:23 INFO compress.CodecPool: Got brand-new decompressor
1 one 11 fsdf afd fsdf 111
2 two 222 fsdf d fsd sd 222
3 thref sfd sfsdf e 333 fsd 333
4 four 44 fds 4sfsd fsdfs 4444
5 five 555 fsd fdsf fsd f sf sdfsdfsdf 5555
6 one 11 fsdf afd fsdf 111
7 two 222 fsdf d fsd sd 222
8 thref sfd sfsdf e 333 fsd 333
From the Hadoop API documentation:
org.apache.hadoop.io
Class MapFile
java.lang.Object
org.apache.hadoop.io.MapFile
Direct Known Subclasses: ArrayFile, SetFile
public class MapFile extends Object
A file-based map from keys to values.
A map is a directory containing two files, the data file, containing all keys and values in the map, and a smaller index file, containing a fraction of the keys. The fraction is determined by MapFile.Writer.getIndexInterval().
The index file is read entirely into memory. Thus key implementations should try to keep themselves small.
Map files are created by adding entries in-order. To maintain a large database, perform updates by copying the previous version of a database and merging in a sorted change list, to create a new version of the database in a new file. Sorting large change lists can be done with SequenceFile.Sorter.
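The update procedure in the last paragraph is a two-way merge of sorted inputs. A minimal sketch, using in-memory lists instead of files: the class SortedMerge and the helper e are hypothetical, and the rule that a change replaces the previous entry with the same key is an assumption, since the documentation quoted above does not say how equal keys are resolved.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: merge a sorted previous version with a sorted change list
// into a new sorted version. In-memory lists stand in for the files on HDFS.
class SortedMerge {
    static List<Map.Entry<Long, String>> merge(
            List<Map.Entry<Long, String>> previous,
            List<Map.Entry<Long, String>> changes) {
        List<Map.Entry<Long, String>> out =
                new ArrayList<>(previous.size() + changes.size());
        int i = 0, j = 0;
        while (i < previous.size() && j < changes.size()) {
            long pk = previous.get(i).getKey();
            long ck = changes.get(j).getKey();
            if (pk < ck) {
                out.add(previous.get(i++));
            } else if (pk > ck) {
                out.add(changes.get(j++));
            } else {
                // Assumption: on equal keys the change replaces the old entry.
                out.add(changes.get(j++));
                i++;
            }
        }
        while (i < previous.size()) {
            out.add(previous.get(i++));
        }
        while (j < changes.size()) {
            out.add(changes.get(j++));
        }
        return out;
    }

    // Small helper for building entries in examples.
    static Map.Entry<Long, String> e(long key, String value) {
        return new SimpleImmutableEntry<>(key, value);
    }

    public static void main(String[] args) {
        List<Map.Entry<Long, String>> previous = new ArrayList<>();
        previous.add(e(1, "one"));
        previous.add(e(3, "three"));
        List<Map.Entry<Long, String>> changes = new ArrayList<>();
        changes.add(e(2, "two"));
        changes.add(e(3, "THREE"));
        System.out.println(merge(previous, changes));
    }
}
```

Because the output is produced in key order, it could be appended entry by entry to a new map file without violating the in-order requirement.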