peicheng note

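Monday, October 31, 2016

[mac] Remove an installed Mac printer driver

The driver files are in /Library/Printers/PPDs/Contents/Resources

List that directory to see which ones to remove.

Example

From that directory, remove all Fuji Xerox drivers (sudo may be needed, since /Library is system-owned):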

rm -rf FX*

and

rm -rf Fuji\ Xerox\ *
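
Because rm -rf is irreversible, it can help to preview what the two patterns match first. The following is a minimal Python sketch, assuming the PPD directory mentioned above; the helper name matching_drivers is made up for illustration, and nothing is deleted.

```python
from pathlib import Path

# Directory from the post above; adjust if your PPDs live elsewhere.
PPD_DIR = Path("/Library/Printers/PPDs/Contents/Resources")

def matching_drivers(directory, patterns=("FX*", "Fuji Xerox *")):
    """Return the sorted names of entries matching any of the glob patterns."""
    directory = Path(directory)
    names = set()
    for pattern in patterns:
        # Path.glob only inspects this directory for a pattern without "**".
        names.update(p.name for p in directory.glob(pattern))
    return sorted(names)
```

Calling print(matching_drivers(PPD_DIR)) shows the names that the two rm commands would remove.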

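[mac] Install the FX printer driver for Mac OS X to use a Fuji Xerox printer

The printer is a Fuji Xerox; its web console reports the model as ApeosPort-V C3374.
However, after installing the Fuji Xerox Printer Drivers v3.0 (OS X) package for Mac and selecting the matching model, printing still did not work.
https://support.apple.com/kb/DL1776?locale=zh_TW

Judging by the driver that AirPrint picks on its own, the machine is indeed a Fuji Xerox, but it uses a different driver.

When adding a new printer, the system chooses a driver for you after you enter the IP address. At that point, select the FX Print Driver for Mac OS X yourself (download link below).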

Print Driver for Mac OS X : Description : Download : Fuji Xerox

https://www.fujixerox.co.jp/download/apeosport/5_c7780/mac1010e/prt/

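With this driver, the Fuji Xerox printer works from the Mac.

Friday, October 21, 2016

[elasticsearch] Elasticsearch indexing throughput: over 40,000 documents per second on a single machine

Conclusion first: a single Elasticsearch machine can index more than 40,000 documents per second.

In the past, after setting up our own Elasticsearch cluster, we would run our own benchmark. But when friends who also use Elasticsearch described their indexing throughput or query performance, it was hard to compare the numbers fairly.

Then Rally came along, which helps produce consistent benchmark reference values.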
elastic/rally: Macrobenchmarking framework for Elasticsearch
https://github.com/elastic/rally

Using the same settings file, everyone can produce comparable numbers for their own cluster.
rally-tracks/track.json at master · elastic/rally-tracks
https://github.com/elastic/rally-tracks/blob/master/geonames/track.json

Elastic currently uses Rally to produce:
Elasticsearch Nightly Benchmarks
https://elasticsearch-benchmarks.elastic.co/geonames/index.html

Elasticsearch Nightly Benchmarks
https://elasticsearch-benchmarks.elastic.co/index.html

All benchmarks are run on a bare metal machine with the following specifications:

CPU: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
RAM: 32 GB
SSD: two Crucial MX200 (software RAID 0)
OS: Linux Kernel version 4.4.0-28

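Monday, October 17, 2016

[hadoop][EMR][hive] Automatically partition a Hive external table on Amazon S3

AWS offers a Hadoop cluster service, Amazon EMR.

Hive has two kinds of tables. One keeps its data under Hive's own directory; the other is called an external table, meaning the data lives outside Hive's default directory.

For example, if Hive's default directory is /user/hive/warehouse on HDFS, then after create table you can use load data inpath to load a file, and the data is moved into that directory.

However, output from other analysis tools such as MapReduce, Spark, or Pig sometimes ends up in a different directory. Instead of moving it, you can use create external table: the data stays where it is, and the Hive metastore only records which directory holds the table's data.

Amazon EMR can read files from s3://, so when there is heavy analysis to do, you can start some EMR clusters, run the computation, and store the results on S3. The next time heavy analysis is needed, use create external table to read the data already on S3.

Hive has a way to speed up data processing called partitioning. Put simply, partition conditions reduce the amount of data Hive has to process.

For data that keeps growing over time, analysis usually targets a specific time range.
If most analysis is done by day, the date can be used as the partition column, so that less data has to be processed.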
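
To make the idea concrete: with a per-day partition, a query over a date range only needs the partitions inside that range. The sketch below lists those partition values in Python, assuming day values are formatted as YYYYMMDD strings (for example 20161016). partitions_between is a hypothetical helper for illustration, not part of Hive.

```python
from datetime import date, timedelta

def partitions_between(start, end):
    """Return day-partition values (YYYYMMDD) from start to end, inclusive.

    An empty list is returned when end is earlier than start.
    """
    days = (end - start).days
    return [(start + timedelta(days=i)).strftime("%Y%m%d") for i in range(days + 1)]
```

A two-day range yields two values, so Hive would read two folders instead of the whole table.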

After the data is processed externally, each day's output is placed into its own folder on S3 for Hive to use.

CREATE EXTERNAL TABLE IF NOT EXISTS posts (
`id` int ,
`date` string ,
`pid` int ,
`country` string ,
`placement` string ,
`type` string ,
`count` double ,
`updated` timestamp
)
PARTITIONED BY (day string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 's3://mys3-table/posts';

My folder naming on S3 follows this rule; the partition column is really just a folder naming convention,
"day="

s3://mys3-table/posts/day=20161016/posts_20161016
s3://mys3-table/posts/day=20161017/posts_20161017
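
The naming rule can be written as a small function. This is only a sketch in Python: it assumes the table location from the CREATE statement above and a file name of the form posts_ followed by the date, with no space. partition_path is a hypothetical helper that builds the string and does not upload anything.

```python
from datetime import date

# Table LOCATION taken from the CREATE EXTERNAL TABLE statement above.
BASE = "s3://mys3-table/posts"

def partition_path(day, base=BASE):
    """Return the S3 path of one day's file under its day=YYYYMMDD folder."""
    stamp = day.strftime("%Y%m%d")
    return f"{base}/day={stamp}/posts_{stamp}"
```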

Once the data is in place,
run hive -e "msck repair table posts;"
and the data belonging to that partition becomes visible in Hive.
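
If the repair step should run right after each upload, the same command can be wrapped in a script. This is a minimal Python sketch, assuming the hive CLI is on the PATH of the machine running it; repair_command and repair_partitions are hypothetical names.

```python
import subprocess

def repair_command(table):
    """Build the argument list equivalent to: hive -e "msck repair table <table>;" """
    return ["hive", "-e", f"msck repair table {table};"]

def repair_partitions(table):
    """Run the repair; raises CalledProcessError if hive exits non-zero."""
    subprocess.run(repair_command(table), check=True)
```

Passing the arguments as a list avoids shell quoting problems around the statement.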
