[elasticsearch] Restarting an Elasticsearch cluster in production with no downtime

Monday, March 2, 2015

Rolling restart of nodes (full cluster restart)

When you run Elasticsearch in production, sooner or later you will have to restart the whole cluster.

A rolling restart is a good way to do it: the cluster restarts one node at a time, and users do not even notice any downtime.

The steps are as follows.

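Disable shard reallocation:

Disabling shard reallocation makes each restart much faster. If you skip this step, the cluster starts re-creating the missing replicas on the remaining nodes as soon as a node goes down, which wastes a lot of time and I/O.

For a rolling restart, disable shard reallocation first and enable it again once the restart is done. That way the cluster does not try to rebalance shards across nodes while a node is restarting.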

curl -XPUT localhost:9200/_cluster/settings -H 'Content-Type: application/json' -d '{
    "transient" : {
        "cluster.routing.allocation.enable" : "none"
    }
}'

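Shut down a single node in the cluster. (If you have dedicated master nodes, start the master nodes before you start the data nodes.)
Start the node again and confirm that it has rejoined the cluster. (You can watch the logs of the master and data nodes, or check through the health and nodes APIs.)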
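One way to confirm that a restarted node has rejoined is to ask the cat nodes API for the list of node names. The sketch below assumes the cluster answers on localhost:9200; the ES_URL variable, the two function names, and the node name node-1 are made up for this example.

```shell
#!/bin/sh
# Assumption: the cluster address. Override ES_URL for your environment.
ES_URL="${ES_URL:-localhost:9200}"

# Succeeds if the exact name given as $1 appears on stdin (one name per line).
# Trailing padding from the _cat output is stripped before comparing.
has_node() {
  sed 's/[[:space:]]*$//' | grep -Fxq -- "$1"
}

# Succeeds once the node named $1 is listed by the cat nodes API.
node_joined() {
  curl -s "$ES_URL/_cat/nodes?h=name" | has_node "$1"
}
```

For example, `node_joined node-1 && echo "node-1 is back"`. The number_of_nodes field returned by `curl -s 'localhost:9200/_cluster/health?pretty'` gives the same information as a count.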

Re-enable shard reallocation:

curl -XPUT localhost:9200/_cluster/settings -H 'Content-Type: application/json' -d '{
    "transient" : {
        "cluster.routing.allocation.enable" : "all"
    }
}'
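Since the disable and enable calls differ only in one value, they can be wrapped in a small helper. This is just a sketch: ES_URL, allocation_body and set_allocation are names invented here, and the Content-Type header is included on the assumption that you may be running a newer Elasticsearch release that rejects request bodies without it (older releases ignore it).

```shell
#!/bin/sh
# Assumption: the cluster address. Override ES_URL for your environment.
ES_URL="${ES_URL:-localhost:9200}"

# Print the JSON settings body for the given allocation value.
allocation_body() {
  printf '{"transient":{"cluster.routing.allocation.enable":"%s"}}' "$1"
}

# Apply the setting; only the two values used in this post are accepted.
set_allocation() {
  case "$1" in
    none|all) ;;
    *) echo "usage: set_allocation none|all" >&2; return 1 ;;
  esac
  curl -s -XPUT "$ES_URL/_cluster/settings" \
    -H 'Content-Type: application/json' \
    -d "$(allocation_body "$1")"
}
```

With this in place, `set_allocation none` goes before shutting a node down and `set_allocation all` after it has rejoined.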

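Check that all shards have been allocated correctly across the machines. Rebalancing can take a while (about 20 minutes for 4,000 shards).
Repeat the steps above for each remaining node that needs a restart.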
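Rather than checking by hand, you can wait for the cluster health API to report green before moving on to the next node. A minimal sketch, assuming the cluster answers on localhost:9200; ES_URL and the function names are invented for this example, and the status is pulled out with sed only to avoid depending on a JSON tool.

```shell
#!/bin/sh
# Assumption: the cluster address. Override ES_URL for your environment.
ES_URL="${ES_URL:-localhost:9200}"

# Print the value of the "status" field from a cluster health JSON document on stdin.
health_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p'
}

# Block until the cluster reports green. Each request itself waits up to 60s.
wait_for_green() {
  until [ "$(curl -s "$ES_URL/_cluster/health?wait_for_status=green&timeout=60s" | health_status)" = "green" ]; do
    echo "cluster not green yet, waiting..."
    sleep 5
  done
}
```

Call `wait_for_green` after re-enabling allocation. For a per-shard view, `curl -s 'localhost:9200/_cat/shards?v'` lists the state of every shard.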

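This gives you a safe and efficient way to operate the cluster. (Note that a no-downtime restart requires the number of nodes to be >= the number of primary shards, and each shard needs a replica on another node so that it stays available while one node is down.)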
