[Question] Scraping data from a web page: problems handling Chinese text | elmo56 | PTT批踢踢实业坊

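Original poster: elmo56 (小树芽) 2014-10-25 20:13:12

I'm using Python 2.7
and BeautifulSoup to scrape data from a web page.
I've pulled the values out of a table,
for example a[2] = <td> 雅虎新闻 Yahoo news </td>
a[3] = <td> 四 thr </td>
I also ran a[2] = a[2].get_text()
to strip the tags
and keep only the text.
If I now run print a[2], a[3]
Result: 雅虎新闻 Yahoo news 四 thr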
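But here is the problem:
if I create newslist = []
and then call newslist.append(a[2])
newslist.append(a[3])
and then print newslist,
the Chinese characters come out garbled
while the English is fine.
Printing a single position on its own works:
print newslist[0] shows 雅虎新闻 Yahoo news
print newslist turns into u'\u4eda\u623f\u4eds\ Yahoo news u'\u4dsw thr
(I made up the escape codes above, but that is what it looks like.)
Printing a whole list or dict comes out messy like this,
so I'm posting to ask for help.
Thanks, everyone.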

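Author: alibuda174 (阿哩不达) 2014-10-25 20:53:00

That shouldn't happen. What exactly do you mean by garbled? Try print newslist[0]. What you are seeing is not garbled text; it is the Unicode escape form of the Chinese characters. If you switch to Python 3, the Chinese will probably print normally.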
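
The difference described in this reply can be reproduced without BeautifulSoup. What follows is a minimal sketch, assuming the list holds unicode strings like the ones get_text() returns. The sample values are copied from the original post, and `escaped` and `restored` are illustrative names only. The `__future__` imports are there so the same file runs on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

# Sample values from the original post (assumed to be unicode strings,
# which is what BeautifulSoup's get_text() returns).
newslist = ["雅虎新闻 Yahoo news", "四 thr"]

# Printing one element writes the text itself.
print(newslist[0])

# Printing the list writes repr() of every element. Python 2.7 escapes
# non-ASCII characters in that repr (u'\uXXXX...'); Python 3 does not.
print(newslist)

# The escaped form can be produced explicitly on either version, and
# decoding it gives back the original text, so no data has been lost.
escaped = newslist[0].encode("unicode_escape").decode("ascii")
print(escaped)
restored = escaped.encode("ascii").decode("unicode_escape")
print(restored == newslist[0])
```

In other words, the escapes are only a display form of the same string, which is why printing newslist[0] on its own looks fine.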

Original poster: elmo56 (小树芽) 2014-10-25 21:19:00

Thanks for the answer, but is there another way to solve this under Python 2.7?

Author: penguin7272 (企鹅) 2014-10-25 22:01:00

print " ".join(newslist)
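
This one-liner works because join() builds a single string, and print writes a string's text rather than its repr(). Below is a minimal sketch of the same idea, extended to a dict since the original post mentions dicts too. `newsdict`, its keys, and `lines` are hypothetical names, and the `__future__` imports only make the file run unchanged on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

newslist = ["雅虎新闻 Yahoo news", "四 thr"]

# One string instead of a list, so print shows the characters themselves.
joined = " ".join(newslist)
print(joined)

# Hypothetical dict holding the same data: format each pair into a string
# rather than printing the dict object itself.
newsdict = {"title": "雅虎新闻 Yahoo news", "day": "四 thr"}
lines = ["{0}: {1}".format(key, newsdict[key]) for key in sorted(newsdict)]
print("\n".join(lines))
```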

Author: uranusjr (←這人是超級笨蛋) 2014-10-26 00:14:00

I recommend this really nice library, uniout.
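
For readers who would rather not add a dependency, one standard-library alternative (not uniout, and not something suggested in the thread) is json.dumps with ensure_ascii=False, which renders a whole list or dict as a single readable string. This is a sketch under that assumption, using the sample data from the original post. Note that the output is JSON notation, not Python's repr().

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import json

newslist = ["雅虎新闻 Yahoo news", "四 thr"]

# ensure_ascii=False keeps non-ASCII characters as they are.
readable = json.dumps(newslist, ensure_ascii=False)
print(readable)

# With the default ensure_ascii=True they are written as \uXXXX escapes.
escaped = json.dumps(newslist)
print(escaped)
```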

Original poster: elmo56 (小树芽) 2014-10-26 01:10:00

uniout works. I asked because I also want to combine this with plotting later on.

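Author: yauhh (小y宝贝) 2014-11-02 14:34:00

In 2.x the default environment is ASCII; adding # -*- coding: utf-8 -*- at the top of the file lets it display correctly. If you are on Windows, look around for a similar solution yourself.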
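
For context, the coding line declares the encoding of the source file itself, which is what allows Chinese literals to appear in a Python 2 script. The sketch below (not from the thread) shows the declaration next to the interpreter's default codec. It assumes the file is saved as UTF-8, and `title`, `default_codec`, and `utf8_bytes` are illustrative names.

```python
# -*- coding: utf-8 -*-
# The line above declares the encoding of this source file. Python 2 assumes
# ASCII without it and rejects the non-ASCII literal below with a SyntaxError.
from __future__ import print_function, unicode_literals
import sys

# Codec used for implicit conversions: 'ascii' on Python 2, 'utf-8' on Python 3.
default_codec = sys.getdefaultencoding()
print(default_codec)

title = "雅虎新闻 Yahoo news"

# The same text counted as unicode characters and as UTF-8 bytes.
utf8_bytes = title.encode("utf-8")
print(len(title), len(utf8_bytes))
```

The declaration does not by itself change how Python 2 prints a list, so it complements the suggestions earlier in the thread rather than replacing them.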
