Thursday, October 29, 2015

[tech] LinkedIn open-sources PalDB, a key-value store for handling 'side data'



cool

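A high-throughput, read-only database aimed at workloads that depend on side data. Examples include the stop-word file used in natural language processing, or the preprocessed data that has to be loaded up front for classification or spam detection. At LinkedIn, this kind of access turned out to be a real bottleneck, so they built PalDB.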


Side data can be defined as the extra read-only data needed by a process to do its job. For instance, a list of stopwords used by a natural language processing algorithm is side data. Machine learning models used in machine translation, content classification or spam detection are also side data. When this side data becomes large it can rapidly be a bottleneck for applications depending on them. PalDB aims to fill this gap.
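To make "side data" concrete, here is a minimal sketch of the stop-word case using a plain in-memory HashSet, the kind of structure PalDB is meant to replace once the data gets large. The class name, method name, and the tiny word list are made up for illustration; nothing here uses PalDB itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopWordFilter {
    // Side data: read-only, loaded once, consulted on every call.
    // Sample words only; a real list would be loaded from a file.
    private static final Set<String> STOP_WORDS = Collections.unmodifiableSet(
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of", "to")));

    // Lower-cases the text, splits on whitespace, and drops stop words.
    static List<String> removeStopWords(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Prints [size, index, limited]
        System.out.println(removeStopWords("The size of the index is limited"));
    }
}
```

With six words the set costs nothing, but the same pattern with millions of entries keeps the whole structure on the JVM heap, which is the situation the excerpt above describes.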

Performance


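On performance: every time a new database comes out, there is always an impressive comparison to go with it.
Against the commonly used LevelDB and RocksDB, the throughput can be as much as eight times higher.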




PalDB is specifically optimized for fast read performance and compact store sizes. Performances can be compared to in-memory data structures such as Java collections (e.g. HashMap, HashSet) or other key-values stores (e.g. LevelDB, RocksDB).
Current benchmark on a 3.1Ghz Macbook Pro with 10M integer keys index shows an average performance of ~2M reads/s for a memory usage 6X less than using a traditional HashSet. That is 8X faster throughput compared to LevelDB (1.8) or RocksDB (3.9.0).

Limitations

  • PalDB is optimal in replacing the usage of large in-memory data storage but still uses memory (off-heap, yet much less) to do its job. Disabling memory mapping and relying on seeks is possible but is not what PalDB has been optimized for.
  • The size of the index is limited to 2GB. There's no limitation in the data size however.
  • PalDB is not thread-safe at the moment so synchronization should be done externally if multi-threaded.
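The last limitation says synchronization has to be done externally. One simple way to do that, sketched below, is to wrap the reader so that every lookup goes through a single lock. `KeyValueReader` is a hypothetical stand-in interface, not PalDB's actual API, and a HashMap plays the role of the store so the example runs on its own.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a reader that is not thread-safe (not PalDB's API).
interface KeyValueReader<K, V> {
    V get(K key);
}

// Serializes all access to the wrapped reader through one lock.
final class SynchronizedReader<K, V> implements KeyValueReader<K, V> {
    private final KeyValueReader<K, V> delegate;
    private final Object lock = new Object();

    SynchronizedReader(KeyValueReader<K, V> delegate) {
        this.delegate = delegate;
    }

    @Override
    public V get(K key) {
        synchronized (lock) {
            return delegate.get(key);
        }
    }
}

class SynchronizedReaderDemo {
    // Builds a lock-protected reader over a map where each key maps to itself.
    static KeyValueReader<Integer, Integer> identityReader(int keys) {
        Map<Integer, Integer> data = new HashMap<>();
        for (int k = 0; k < keys; k++) {
            data.put(k, k);
        }
        KeyValueReader<Integer, Integer> raw = data::get;
        return new SynchronizedReader<>(raw);
    }

    // Has several threads read keys 0..keys-1 and returns the sum of all values read.
    static long sumLookups(KeyValueReader<Integer, Integer> reader, int threads, int keys) {
        AtomicLong total = new AtomicLong();
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            Thread worker = new Thread(() -> {
                for (int k = 0; k < keys; k++) {
                    total.addAndGet(reader.get(k));
                }
            });
            workers.add(worker);
            worker.start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for readers", e);
        }
        return total.get();
    }

    public static void main(String[] args) {
        // 4 threads each sum 0..999, so this prints 1998000
        System.out.println(sumLookups(identityReader(1000), 4, 1000));
    }
}
```

A single lock is the easiest option but it makes every read wait its turn. Giving each thread its own reader instance would avoid that contention, assuming the store allows several readers on the same file, which the excerpt above does not say.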


LinkedIn open-sources PalDB, a key-value store for handling 'side data' | VentureBeat | Big Data | by Jordan Novet
http://venturebeat.com/2015/10/26/linkedin-open-sources-paldb-a-key-value-store-for-handling-side-data/
linkedin/PalDB
https://github.com/linkedin/PalDB
