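Indexing records in a search engine relies on an implicit concept, the document ID, and that concept was also a bottleneck in early search engine designs.
Most large-scale search engines today index with an inverted index. When indexing data, we need to know which terms map to which documents, and each document has a document ID.
In earlier years, the maximum number of records in a single index was usually capped by the maximum value of this document ID, which was an int.
In Elasticsearch, document IDs are generated with UUID.randomBase64UUID.
If you index a record with POST (without supplying an ID yourself), Elasticsearch generates a document ID field named "_id" for you. It is a Base64-encoded UUID that is 22 characters long.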
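To make the term-to-document mapping concrete, here is a minimal sketch of an inverted index. The class name TinyInvertedIndex and its methods are made up for illustration, and the whitespace tokenizer is an assumption. Real engines such as Lucene use far more elaborate analysis and on-disk structures.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical toy inverted index: maps each term to the ids of the documents containing it. */
public class TinyInvertedIndex {
    // term -> ascending list of document ids that contain the term
    private final Map<String, List<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    /** Indexes one document and returns the int document id assigned to it. */
    public int add(String text) {
        int docId = nextDocId++;
        for (String term : text.toLowerCase().split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            List<Integer> docs = postings.computeIfAbsent(term, t -> new ArrayList<>());
            // Skip the id if this term was already recorded for the same document.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
        return docId;
    }

    /** Returns the ids of all documents containing the term, or an empty list. */
    public List<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.<Integer>emptyList());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add("elasticsearch uses an inverted index");
        index.add("solr uses an inverted index too");
        index.add("each document has a document id");
        System.out.println(index.search("inverted")); // [0, 1]
        System.out.println(index.search("document")); // [2]
    }
}
```

Note that the document id here is a plain int counter, which is the design that leads to the ceiling discussed next.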
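As a rough illustration of that ceiling, the sketch below shows the largest value a signed 32-bit id can hold and what happens one step past it. The class and method names are hypothetical. The actual per-index limit of any given engine may be slightly lower than this raw int maximum.

```java
/** Hypothetical illustration of the upper bound imposed by an int document id. */
public class DocIdCeiling {
    /** Largest value a signed 32-bit document id can take: 2^31 - 1. */
    static final int MAX_INT_DOC_ID = Integer.MAX_VALUE;

    /** Returns the next id, failing loudly instead of silently wrapping to a negative number. */
    static int nextDocId(int current) {
        return Math.addExact(current, 1); // throws ArithmeticException on overflow
    }

    public static void main(String[] args) {
        System.out.println(MAX_INT_DOC_ID);     // 2147483647
        System.out.println(MAX_INT_DOC_ID + 1); // -2147483648 (plain int arithmetic wraps around)
        try {
            nextDocId(MAX_INT_DOC_ID);
        } catch (ArithmeticException e) {
            System.out.println("no more ids available in this index");
        }
    }
}
```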
_index: i2
_type: test2
_id: hvLUDOYXR8C5CP0HwsuDLQ
_version: 1
_score: 1
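Why 22 characters? A UUID is 16 bytes (128 bits) and each Base64 character carries 6 bits, so 22 characters are needed; the padded form would be 24 characters ending in "==". The sketch below decodes the sample _id shown above using only the JDK's java.util.Base64 and reads the UUID version and variant bits out of it. The class and helper names are hypothetical.

```java
import java.util.Base64;

/** Hypothetical helper that inspects a 22-character URL-safe Base64 id. */
public class InspectId {
    /** Decodes an unpadded URL-safe Base64 id into its raw bytes. */
    static byte[] decode(String id) {
        return Base64.getUrlDecoder().decode(id);
    }

    /** UUID version: the high nibble of byte 6. */
    static int version(byte[] uuid) {
        return (uuid[6] & 0xf0) >>> 4;
    }

    /** The two most significant bits of byte 8; binary 10 (decimal 2) marks the RFC 4122 variant. */
    static int variantBits(byte[] uuid) {
        return (uuid[8] & 0xc0) >>> 6;
    }

    public static void main(String[] args) {
        byte[] raw = decode("hvLUDOYXR8C5CP0HwsuDLQ");
        System.out.println(raw.length);       // 16
        System.out.println(version(raw));     // 4
        System.out.println(variantBits(raw)); // 2
    }
}
```

So the sample id is consistent with a random (version 4) UUID that has simply been Base64-encoded with its padding removed.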
Distributed search engines such as Elasticsearch use this technique heavily.
For example, our node IDs and cluster IDs are generated in a similar way.
[2014-05-07 15:12:30,259][INFO ][cluster.service ] [renode1] detected_master [renode2][ZTOYwtRDRSiQ2DQaRMTn5g][renode2][inet[/10.1.191.177:9300]], added {[renode2][ZTOYwtRDRSiQ2DQaRMTn5g][renode2][inet[/10.1.191.177:9300]],}, reason: zen-disco-receive(from master [[renode2][ZTOYwtRDRSiQ2DQaRMTn5g][renode2][inet[/10.1.191.177:9300]]])
[2014-05-07 15:12:30,335][INFO ][discovery ] [renode1] elasticsearchpc2/me2UTC43TpKWoHZaR6cFOA
source code:
/**
 * Returns a Base64 encoded version of a Version 4.0 compatible UUID
 * as defined here: http://www.ietf.org/rfc/rfc4122.txt
 */
public static String randomBase64UUID() {
    return randomBase64UUID(SecureRandomHolder.INSTANCE);
}

/**
 * Returns a Base64 encoded version of a Version 4.0 compatible UUID
 * randomly initialized by the given {@link Random} instance
 * as defined here: http://www.ietf.org/rfc/rfc4122.txt
 */
public static String randomBase64UUID(Random random) {
    final byte[] randomBytes = new byte[16];
    random.nextBytes(randomBytes);

    /* Set the version to version 4 (see http://www.ietf.org/rfc/rfc4122.txt)
     * The randomly or pseudo-randomly generated version.
     * The version number is in the most significant 4 bits of the time
     * stamp (bits 4 through 7 of the time_hi_and_version field).*/
    randomBytes[6] &= 0x0f; /* clear the 4 most significant bits for the version */
    randomBytes[6] |= 0x40; /* set the version to 0100 / 0x40 */

    /* Set the variant:
     * The high field of the clock sequence multiplexed with the variant.
     * We set only the MSB of the variant*/
    randomBytes[8] &= 0x3f; /* clear the 2 most significant bits */
    randomBytes[8] |= 0x80; /* set the variant (MSB is set)*/
    try {
        byte[] encoded = Base64.encodeBytesToBytes(randomBytes, 0, randomBytes.length, Base64.URL_SAFE);
        // we know the bytes are 16, and not a multi of 3, so remove the 2 padding chars that are added
        assert encoded[encoded.length - 1] == '=';
        assert encoded[encoded.length - 2] == '=';
        // we always have padding of two at the end, encode it differently
        return new String(encoded, 0, encoded.length - 2, Base64.PREFERRED_ENCODING);
    } catch (IOException e) {
        throw new ElasticsearchIllegalStateException("should not be thrown");
    }
}
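The listing above depends on Elasticsearch's own Base64 helper and exception classes, so it does not compile on its own. The following is a minimal standalone sketch of the same idea using only the JDK, assuming Java 8 or later. It is not the Elasticsearch implementation: the class name Base64Uuid is hypothetical, and java.util.Base64 with withoutPadding() stands in for the manual removal of the two trailing '=' characters.

```java
import java.security.SecureRandom;
import java.util.Base64;
import java.util.Random;

/** Hypothetical JDK-only sketch of a Base64-encoded version 4 UUID generator. */
public class Base64Uuid {
    private static final SecureRandom SECURE_RANDOM = new SecureRandom();

    /** Generates an id from a cryptographically strong random source. */
    static String randomBase64UUID() {
        return randomBase64UUID(SECURE_RANDOM);
    }

    /** Generates an id from the given random source (useful for repeatable tests). */
    static String randomBase64UUID(Random random) {
        byte[] bytes = new byte[16];
        random.nextBytes(bytes);
        bytes[6] &= 0x0f; // clear the 4 version bits
        bytes[6] |= 0x40; // version 4: randomly generated
        bytes[8] &= 0x3f; // clear the 2 variant bits
        bytes[8] |= 0x80; // RFC 4122 variant: binary 10
        // URL-safe alphabet, no '=' padding: 16 bytes always encode to 22 characters.
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }

    public static void main(String[] args) {
        String id = randomBase64UUID();
        System.out.println(id + " (" + id.length() + " characters)");
    }
}
```

Because the value is random rather than a counter, any node can generate an id without coordinating with the others, which fits the distributed use described earlier.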
ref
UniqueKey - Solr Wiki
http://wiki.apache.org/solr/UniqueKey
"The Solr uniqueKey field encodes the identity semantics of a document. In database jargon, the primary key."