scrapy XPath extraction and its encoding problem - stevec - PTT (批踢踢实业坊)

scrapy XPath extraction and its encoding problem

Original poster: stevec (steve) 2014-11-29 19:20:32

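I can't figure out why this happens, so I'd like to ask everyone to take a look.
The code below just uses XPath in scrapy to extract some information I want.
raw_html_article_content_ stores the part I want to extract.
raw stores a larger range,
so in theory raw should contain everything in raw_html_article_content_.
However, the matching part of raw differs slightly from what is in raw_html_article_content_.
For example:
raw: 结婚并无Z&gt;B (this is the same as what Chrome shows when viewing the page source)
raw_html_article_content_: 结婚并无Z>B
How can I make what is stored in raw the same as what is in raw_html_article_content_?
PS: environment is Windows 7, Python 2.7, Scrapy 1.4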
# -*- coding: utf-8 -*-
from scrapy.http import HtmlResponse
from scrapy.selector import Selector
import urllib2

address = "http://www.ptt.cc/bbs/Boy-Girl/M.1416362560.A.881.html"

# Download the page and wrap the raw bytes in a Scrapy response.
response = urllib2.urlopen(address)
html = response.read()
html_response = HtmlResponse(address, body=html)
sel = Selector(html_response)

# Text of the marker span that follows the article body.
recog_assist_word = u"※ 文章网址: "
# Select every sibling node that precedes the last marker span.
xpath = u"""/html/body/div[@id="main-container"]/div[@id="main-content"]/
span[@class="f2" and text()="%s"][last()]/preceding-sibling::node()""" % recog_assist_word

# Join the extracted nodes into one string.
raw_html_article_content_ = sel.xpath(xpath).extract()
raw_html_article_content_ = "".join(raw_html_article_content_)
# Serialized markup of the whole document.
raw = sel.xpath(u"""/html""").extract()[0]

print raw_html_article_content_
print raw

Author: dritchie (卍~迈斯纳效应~卍) 2014-11-30 01:27:00

That encoding is called an HTML entity.
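
To illustrate what an HTML entity is, here is a minimal sketch using the standard `html` module of Python 3. The thread's environment is Python 2.7, where, as far as I know, `HTMLParser.HTMLParser().unescape` plays the same role. The sample string is based on the one in the question.

```python
import html

# Text as it is written in the page source: ">" appears as the entity "&gt;".
source_text = "结婚并无Z&gt;B"

# Decoding the entity yields the character a browser would display.
decoded = html.unescape(source_text)
print(decoded)

# Escaping goes the other way, back to the form stored in the markup.
escaped_again = html.escape(decoded)
print(escaped_again)
```

So the two strings in the question carry the same text; one is the escaped form used inside markup, the other is the decoded form.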

Original poster: stevec (steve) 2014-11-30 11:03:00

Thanks a lot! But how do I get named entities to display properly in Python? Why does scrapy sometimes fix them for me and sometimes not? What is the trick here?
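
One likely explanation, sketched here with the standard library's `xml.etree.ElementTree` rather than scrapy itself: selecting text nodes returns text that the parser has already decoded, while selecting an element returns serialized markup, in which special characters are escaped again. That scrapy's lxml-based selectors behave analogously is an assumption, and the one-element document below is made up for illustration.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical one-element document standing in for the real page.
root = ET.fromstring("<div>Z&gt;B</div>")

# The parsed text node holds the decoded character.
text_node = root.text

# Serializing the element writes markup again, so ">" is re-escaped.
serialized = ET.tostring(root, encoding="unicode")

# Unescaping the serialized string recovers the decoded characters,
# but the tags themselves stay in place.
unescaped = html.unescape(serialized)

print(text_node)
print(serialized)
print(unescaped)
```

If this reading is right, the preceding-sibling::node() query in the question mostly returns text nodes (decoded), while /html returns one serialized element (escaped), which would account for the difference.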
