- When using Scrapy's CrawlSpider, I defined a callback in `rules`, but the crawl never enters the defined callback `parse_item`
- If I replace `parse_item` with `parse`, the spider does enter the `parse` callback (Scrapy's default callback, which the docs advise against overriding in a CrawlSpider)
- Scrapy shell outputs the response for these pages normally
- I'd like to ask why my `parse_item` callback is never reached here
```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CrspiderSpider(CrawlSpider):
    name = 'crSpider'
    allowed_domains = ['china-railway.com.cn']
    start_urls = ['http://www.china-railway.com.cn/xwzx/ywsl/']

    rules = (
        Rule(LinkExtractor(allow=r'http://www.china-railway.com.cn/xwzx/[a-zA-Z]+/'), follow=True),
        Rule(LinkExtractor(allow=r'http://www.china-railway.com.cn/xwzx/[a-zA-Z]+/index_\d+.html'), follow=True),
        Rule(LinkExtractor(allow=r'http://www.china-railway.com.cn/xwzx/.+t\d{8}_\d{6}.html'), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        print('-' * 40, 'entered callback', '-' * 40)
        newsName = response.xpath('//h1').get()
        print(newsName)

    # def parse(self, response):
    #     item = {}
    #     print('-' * 40, 'entered parse callback', '-' * 40)
    #     print(response.text)
    #     newsName = response.xpath('//h1').get()
    #     return item
```
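One thing that may be worth checking (my own guess, not confirmed against the Scrapy source here): CrawlSpider evaluates its rules in order, and a link already extracted by an earlier rule is, as far as I know, not offered to later rules again. Since LinkExtractor's `allow` patterns are applied with `re.search` (a substring match), the first rule's pattern already matches the article URLs, so those links may be claimed by a rule that has `follow=True` but no callback. A minimal check, using the three patterns above and one article URL taken from the log:

```python
import re

# The three allow patterns, copied verbatim from the rules above, in order.
patterns = [
    r'http://www.china-railway.com.cn/xwzx/[a-zA-Z]+/',
    r'http://www.china-railway.com.cn/xwzx/[a-zA-Z]+/index_\d+.html',
    r'http://www.china-railway.com.cn/xwzx/.+t\d{8}_\d{6}.html',
]

# An article URL that appears in the log output below.
article_url = 'http://www.china-railway.com.cn/xwzx/ywsl/202003/t20200304_101019.html'

# Index of the first pattern that matches the article URL via re.search,
# mimicking which rule would see the link first.
first_match = next(i for i, p in enumerate(patterns) if re.search(p, article_url))
print(first_match)  # 0 -- the article URL is matched by the first (callback-less) rule
```

If this is indeed the cause, tightening the first pattern (e.g. anchoring it so it does not also match article pages) or reordering the rules so the callback rule comes first would be the thing to try.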
- Here is part of the output: pages matching the rules are crawled, but nothing from the `parse_item` callback is printed
```
2020-03-13 12:38:25 [scrapy.core.engine] INFO: Spider opened
2020-03-13 12:38:25 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-03-13 12:38:25 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2020-03-13 12:38:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.china-railway.com.cn/xwzx/ywsl/> (referer: None)
2020-03-13 12:38:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.china-railway.com.cn/xwzx/ywsl/> (referer: http://www.china-railway.com.cn/xwzx/ywsl/)
2020-03-13 12:38:26 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://www.china-railway.com.cn/xwzx/ywsl/> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2020-03-13 12:38:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.china-railway.com.cn/xwzx/ywsl/202003/t20200304_101019.html> (referer: http://www.china-railway.com.cn/xwzx/ywsl/)
2020-03-13 12:38:28 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.china-railway.com.cn/xwzx/ywsl/202003/t20200305_101067.html> (referer: http://www.china-railway.com.cn/xwzx/ywsl/)
2020-03-13 12:38:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.china-railway.com.cn/xwzx/ywsl/202003/t20200305_101100.html> (referer: http://www.china-railway.com.cn/xwzx/ywsl/)
2020-03-13 12:38:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.china-railway.com.cn/xwzx/ywsl/202003/t20200306_101120.html> (referer: http://www.china-railway.com.cn/xwzx/ywsl/)
2020-03-13 12:38:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.china-railway.com.cn/xwzx/ywsl/202003/t20200307_101174.html> (referer: http://www.china-railway.com.cn/xwzx/ywsl/)
2020-03-13 12:38:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.china-railway.com.cn/xwzx/ywsl/202003/t20200310_101326.html> (referer: http://www.china-railway.com.cn/xwzx/ywsl/)
2020-03-13 12:38:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.china-railway.com.cn/xwzx/ywsl/202003/t20200311_101362.html> (referer: http://www.china-railway.com.cn/xwzx/ywsl/)
```