1. A spider that originally ran fine:
```python
start_urls = ['http://quote.eastmoney.com']
rules = (
    # character classes fixed: the original '[s][z,h][0,3,6][0]' would also
    # match literal commas; '\.' escapes the dot before 'html'
    Rule(LinkExtractor(allow=(r'/s[zh][036]0\d{4}\.html',)),
         callback='parse_news', follow=True),
)
```
After I added the following code to scrape JS-generated content through Splash, it only ever scrapes the single page in start_urls:
```python
RENDER_HTML_URL = "http://127.0.0.1:8050/render.html"

def start_requests(self):
    print self.start_urls
    for url in self.start_urls:
        body = json.dumps({"url": url, "wait": 0.5})
        headers = Headers({'Content-Type': 'application/json'})
        yield Request(RENDER_HTML_URL, self.parse_news, method="POST",
                      body=body, headers=headers)
```
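A likely cause (not confirmed in the thread): every rendered response now comes back with the URL `http://127.0.0.1:8050/render.html`, so the CrawlSpider's `Rule`/`LinkExtractor` never sees URLs matching the `allow` pattern and nothing gets followed. Any link you want rendered must itself be wrapped in a Splash POST (the `scrapy-splash` middleware automates this). A minimal, stdlib-only sketch of such a wrapper — `splash_payload` is a hypothetical helper, not part of Scrapy:

```python
import json

RENDER_HTML_URL = "http://127.0.0.1:8050/render.html"

def splash_payload(url, wait=0.5):
    """Build the POST body and headers for Splash's /render.html endpoint.

    Every request that should be rendered -- including links followed
    from parse_news, not just the ones in start_urls -- must be wrapped
    this way, or the follow-up pages bypass Splash entirely.
    """
    body = json.dumps({"url": url, "wait": wait})
    headers = {"Content-Type": "application/json"}
    return body, headers

body, headers = splash_payload("http://quote.eastmoney.com/sz000001.html")
```

In the spider, `parse_news` would extract links itself and yield `Request(RENDER_HTML_URL, ..., method="POST", body=body, headers=headers)` for each one, using this payload.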
2. How can a pipeline filter on Chinese text? My current code is below — how should it be fixed?
```python
from scrapy.exceptions import DropItem

class FilterWordsPipeline(object):
    """A pipeline for filtering out items which contain certain words in their
    description"""
    def process_item(self, item, spider):
        w = '新股'
        w1 = w.encode('')
        w2 = w1.decode('')
        if [w1.decode('')] in item['name']:
            return item
        else:
            print [w1.decode('')][3:10]
            raise DropItem("======================drop=====================")
```
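For reference, a corrected sketch (Python 3, assuming `item['name']` may arrive as UTF-8 bytes and the intent is to keep items whose name contains the keyword): compare unicode with unicode and drop the empty-codec `encode('')`/`decode('')` calls entirely. The `try`/`except` import is only there so the sketch runs without Scrapy installed:

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:            # stand-in so this sketch runs without Scrapy
    class DropItem(Exception):
        pass

class FilterWordsPipeline:
    KEYWORD = '新股'           # a str literal is already unicode in Python 3

    def process_item(self, item, spider):
        name = item['name']
        if isinstance(name, bytes):      # decode once, at the boundary
            name = name.decode('utf-8')
        if self.KEYWORD in name:         # plain unicode substring test
            return item
        raise DropItem("keyword not found in %r" % name)
```

The point is that there is exactly one decode, and both sides of the `in` test are `str`; mixing `bytes` and `str` is what made the original fail.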
1 | XuTao | 2015-03-23 11:54:43 +08:00
1. If you only get the start_urls page, the recursion — the link-following step — is where it breaks.
2. That's a Python encoding problem; the simplest fix is to decode everything to unicode.
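A minimal illustration of XuTao's second point, in Python 3 terms (the byte string here just stands in for data read off the wire):

```python
w = '新股'                             # already unicode in Python 3
raw = '今日新股申购'.encode('utf-8')    # bytes, as scraped data often arrives
name = raw.decode('utf-8')             # decode once, at the boundary
print(w in name)                       # → True: unicode-vs-unicode just works
```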
2 | withrock | 2015-03-23 16:43:35 +08:00
```python
>>> s = 'v2ex汉字v2ex'
>>> l = []
>>> for i in s:
...     if ord(i) < 0x20 or ord(i) > 0x7E:
...         continue
...     l.append(i)
...
>>> print ''.join(l)
v2exv2ex
```
http://git.oschina.net/mktime/scrapy-douban-group
I've been playing with scrapy lately too; hope this helps.
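Note that withrock's snippet keeps the printable ASCII and drops the Chinese. If the goal is the opposite — keep only the Chinese characters — testing each character against the CJK Unified Ideographs range works. A sketch (Python 3; `chinese_only` is an illustrative name, not a library function):

```python
def chinese_only(s):
    """Keep only characters in the CJK Unified Ideographs block (U+4E00..U+9FFF)."""
    return ''.join(ch for ch in s if '\u4e00' <= ch <= '\u9fff')

print(chinese_only('v2ex汉字v2ex'))   # → 汉字
```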