Crawling other URLs works fine. Is this happening because Sina Weibo redirects the request?
import scrapy
import re
from scrapy.selector import Selector
from scrapy.http import Request
from tutorial.items import DmozItem
from string import maketrans
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


def extractData(regex, content, index=1):
    # Return the requested capture group, or '0' when nothing matches.
    r = '0'
    p = re.compile(regex)
    m = p.search(content)
    if m:
        r = m.group(index)
    return r


class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["weibo.com"]
    download_delay = 2
    rules = [
        Rule(LinkExtractor(allow=('/')), callback='parse_item', follow=True)
    ]
    headers = {
        "Accept": "*/*",
        "Accept-Encoding": "gzip, deflate, sdch, br",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Connection": "keep-alive",
        # "Host": "login.sina.com.cn",
        "Referer": "http://weibo.com/",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36"
    }
    # Real cookie values are replaced with a placeholder.
    cookies = {
        'ALF': 'my cookie',
        'Apache': 'my cookie',
        'SCF': 'my cookie',
        'SINAGLOBAL': 'my cookie',
        'SSOLoginState': 'my cookie',
        'SUB': 'my cookie',
        'SUBP': 'my cookie',
        'SUHB': 'my cookie',
        'TC-Page-G0': 'my cookie',
        'TC-Ugrow-G0': 'my cookie',
        'TC-V5-G0': 'my cookie',
        'ULV': 'my cookie',
        'UOR': 'my cookie',
        'WBStorage': 'my cookie',
        'YF-Page-G0': 'my cookie',
        'YF-Ugrow-G0': 'my cookie',
        'YF-V5-G0': 'my cookie',
        '_s_tentry': '-',
        'log_sid_t': 'my cookie',
        'un': 'my cookie',
    }

    def start_requests(self):
        return [Request("http://weibo.com/u/2010226570?refer_flag=1001030101_&is_all=1",
                        cookies=self.cookies, headers=self.headers)]

    def parse_item(self, response):
        print "comehere!"
        regexID = r'class=\\"username\\">(.*)\<\\/h1>'
        content = response.body
        item = DmozItem()
        ID = extractData(regexID, content, 1)
        item['ID'] = ID
        print ID
        yield item
The console output is as follows:
2017-01-08 17:51:34 [scrapy.core.engine] INFO: Spider opened
2017-01-08 17:51:34 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-01-08 17:51:34 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-01-08 17:51:46 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://login.sina.com.cn/sso/login.php?url=http%3A%2F%2Fweibo.com%2Fu%2F2010226570%3Frefer_flag%3D1001030101_%26is_all%3D1&_rand=1483869098.691&gateway=1&service=miniblog&entry=miniblog&useticket=1&returntype=META&sudaref=http%3A%2F%2Fweibo.com%2F&_client_version=0.6.23> from <GET http://weibo.com/u/2010226570?refer_flag=1001030101_&is_all=1>
2017-01-08 17:51:47 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (meta refresh) to <GET http://weibo.com/u/2010226570?refer_flag=1001030101_&is_all=1&sudaref=weibo.com&retcode=6102> from <GET http://login.sina.com.cn/sso/login.php?url=http%3A%2F%2Fweibo.com%2Fu%2F2010226570%3Frefer_flag%3D1001030101_%26is_all%3D1&_rand=1483869098.691&gateway=1&service=miniblog&entry=miniblog&useticket=1&returntype=META&sudaref=http%3A%2F%2Fweibo.com%2F&_client_version=0.6.23>
2017-01-08 17:51:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://weibo.com/u/2010226570?refer_flag=1001030101_&is_all=1&sudaref=weibo.com&retcode=6102> (referer: http://weibo.com/)
2017-01-08 17:51:49 [scrapy.core.engine] INFO: Closing spider (finished)
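For anyone reading the log above: the request is bounced through login.sina.com.cn and comes back to weibo.com with extra query parameters (sudaref, retcode). A minimal sketch of how one might inspect the final URL for that round trip is below. The helper name describe_final_url is hypothetical, and the thread does not say what retcode=6102 means, so the code only reports the value and does not interpret it.

```python
try:
    from urllib.parse import urlparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs  # Python 2


def describe_final_url(url, login_host="login.sina.com.cn"):
    """Summarize a URL: its host, whether it is the SSO login host,
    and the value of a 'retcode' query parameter if one is present.

    describe_final_url is a hypothetical helper, not part of Scrapy.
    """
    parts = urlparse(url)
    query = parse_qs(parts.query)
    return {
        "host": parts.netloc,
        "is_login_page": parts.netloc == login_host,
        # parse_qs maps each name to a list of values; take the first.
        "retcode": query.get("retcode", [None])[0],
    }


# Final URL copied from the "Crawled (200)" line of the log.
final_url = ("http://weibo.com/u/2010226570?refer_flag=1001030101_"
             "&is_all=1&sudaref=weibo.com&retcode=6102")
info = describe_final_url(final_url)
print(info["host"], info["is_login_page"], info["retcode"])
```

In a spider, the same check could be applied to response.url inside a callback to tell a page that went through the SSO hop apart from one that was fetched directly.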
1 XDMonkey OP: ~
2 gouchaoer 2017-01-08 18:33:22 +08:00 via Android: Don't you know that only Liang Bo (梁博) can handle Weibo?
3 hiluluke 2017-01-08 18:45:26 +08:00: Just stuff in some cookies and it won't redirect anymore...
6 sunwei0325 2017-01-08 23:56:30 +08:00: I'd suggest trying the WAP version of Weibo.
7 XDMonkey OP: @sunwei0325 Thanks a lot.
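On the cookie suggestion in the replies: one way to avoid typing each cookie by hand is to copy the Cookie request header from the browser's developer tools and split it into a dict, which can then be passed as the cookies argument of Request as in the spider above. This is only a sketch; parse_cookie_header is a hypothetical helper, the sample header values are made up, and whether any particular set of cookies stops the redirect is not something the thread confirms.

```python
def parse_cookie_header(header):
    """Split a raw 'Cookie:' header value into a {name: value} dict.

    parse_cookie_header is a hypothetical helper. It splits on ';' and on
    the first '=' of each pair, so values containing '=' are kept whole.
    """
    cookies = {}
    for pair in header.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # skip empty fragments such as a trailing ';'
        name, _, value = pair.partition("=")
        cookies[name.strip()] = value.strip()
    return cookies


# Made-up sample values; real ones come from a logged-in browser session.
raw_header = "SUB=abc123; SUBP=def=ghi; _s_tentry=-;"
cookies = parse_cookie_header(raw_header)
print(cookies)
```

The resulting dict has the same shape as the cookies attribute in the original spider, so it could replace the hand-written one without other changes.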