[python] Detecting a webpage's encoding (Big5 vs. UTF-8)

Wednesday, July 4, 2012

Scenario
After fetching a webpage with urllib's urlopen, printing the result or passing it on to a framework often fails with an error such as:

at
'utf8' codec can't decode byte 0xc1 in position 0: invalid start byte

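(The message above was returned by web.py.)
Errors like this appear because frameworks are generally developed with UTF-8 as the default encoding, so any data you process yourself should also be converted to UTF-8 before you hand it over.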
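
The failure can be reproduced without any network access. The following is a minimal Python 3 sketch, assuming a page encoded in Big5; the sample text and variable names are made up for illustration and do not come from the original post.

```python
# Python 3 sketch: Big5 bytes are generally not valid UTF-8,
# so decoding them as UTF-8 raises UnicodeDecodeError.
text = "中文"                      # sample text, chosen for illustration
big5_bytes = text.encode("big5")   # what a Big5 page would deliver as raw bytes

try:
    big5_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError as exc:
    decode_failed = True
    print("decode failed:", exc.reason)

# Decoding with the correct codec recovers the text,
# which can then be re-encoded as UTF-8.
recovered = big5_bytes.decode("big5")
utf8_bytes = recovered.encode("utf-8")
```

The point is that the bytes themselves are fine; only the codec used to interpret them is wrong.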

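Using chardet to detect the webpage encoding
In addition to putting the following line at the top of every .py file:


# -*- coding: utf-8 -*-

use chardet. Install it first:
easy_install chardet
(pip install chardet works as well.)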


import sys
import chardet
# Python 2 only: reload(sys) and sys.setdefaultencoding() do not exist in Python 3
if sys.getdefaultencoding() != 'utf-8':
    reload(sys)
    sys.setdefaultencoding('utf-8')

In practice, chardet.detect() returns a dict:


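import urllib2
# url is assumed to hold the address of the page to fetch
htmltxt = urllib2.urlopen(url).read()
chardetdict = chardet.detect(htmltxt)  # e.g. {'encoding': 'Big5', 'confidence': 0.99}
if chardetdict.get('encoding') == 'Big5':
    # Convert Big5 bytes to UTF-8, dropping any bytes that cannot be converted
    htmltxt = htmltxt.decode('big5', 'ignore').encode('utf-8', 'ignore')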
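
The dict returned by chardet.detect() carries a 'confidence' value alongside 'encoding', and 'encoding' can be None when nothing could be determined. As a rough sketch, a small helper can turn that dict into a usable codec name. The function name choose_encoding, the 0.5 threshold, and the 'utf-8' fallback are assumptions made for this example, not part of chardet.

```python
def choose_encoding(result, min_confidence=0.5, fallback="utf-8"):
    """Pick a codec name from a chardet-style result dict.

    `result` is expected to look like {'encoding': 'Big5', 'confidence': 0.99}.
    `min_confidence` and `fallback` are illustrative defaults, not chardet settings.
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    if not encoding or confidence < min_confidence:
        return fallback
    return encoding.lower()


# Hand-written dicts shaped like chardet's output
print(choose_encoding({"encoding": "Big5", "confidence": 0.99}))  # big5
print(choose_encoding({"encoding": None, "confidence": 0.0}))     # utf-8
```

With chardet installed, the argument would be chardet.detect(htmltxt) rather than a hand-written dict.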

This takes care of most cases where the text turns into garbage after fetching a webpage.
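
The snippet above converts only pages detected as Big5. A possible generalization, sketched here for Python 3, re-encodes bytes from any named source encoding. The function name transcode_to_utf8, the errors="replace" policy, and the UTF-8 fallback for unusable codec names are assumptions of this example, not something the original code does.

```python
import codecs


def transcode_to_utf8(raw, source_encoding, errors="replace"):
    """Return `raw` (bytes in `source_encoding`) re-encoded as UTF-8 bytes."""
    try:
        # LookupError for unknown codec names, TypeError for None
        codecs.lookup(source_encoding)
    except (LookupError, TypeError):
        source_encoding = "utf-8"  # assumed fallback when the name is unusable
    text = raw.decode(source_encoding, errors)
    return text.encode("utf-8")


# Example: bytes as a Big5 page would deliver them
page_bytes = "測試".encode("big5")
utf8_page = transcode_to_utf8(page_bytes, "Big5")
```

Using errors="replace" instead of "ignore" leaves a visible replacement character where a byte could not be decoded, which makes a wrong detection easier to notice than silently dropped bytes.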
