Scraping Taobao and Tmall shop products with PHP - V2EX
    hpxl 2014-04-29 22:07:06 +08:00 15559 views
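    It works around Taobao's anti-scraping measures and uses proxies to scrape a shop's products quickly. Product information and images are stored in the ./data directory by default.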

    https://github.com/hpxl/fetch-taobao-goods
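    If you find it useful, stars are welcome.
    Addendum 1    2014-04-30 22:28:30 +08:00
    1. Fixed product scraping failing when a Taobao shop has no shop categories.
    2. The script requires the curl extension to be enabled.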
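As a rough illustration of the second point, a script could check for the extension up front instead of failing later at curl_init(). This is only a sketch: missingExtensions is a hypothetical helper and is not taken from the repository.

```php
<?php
// Hypothetical pre-flight check; not part of fetch-taobao-goods.
function missingExtensions(array $required)
{
    // Keep only the names that extension_loaded() reports as absent.
    return array_values(array_filter($required, function ($name) {
        return !extension_loaded($name);
    }));
}

$missing = missingExtensions(array('curl'));
if ($missing) {
    echo 'Missing PHP extension(s): ' . implode(', ', $missing) . PHP_EOL;
    echo 'Enable them in php.ini before running the script.' . PHP_EOL;
}
```

Running `php -m` on the command line lists the loaded extensions and gives the same information without any code.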
    18 replies    2014-09-03 14:30:13 +08:00
    sadara
        1
    sadara  
       2014-04-29 22:49:51 +08:00 via iPhone
    I recall there is a Taobaoke (Taobao affiliate) program called 单店宝.
    mahone3297
        2
    mahone3297  
       2014-04-29 23:35:37 +08:00
    Forked...
    leyle
        3
    leyle  
       2014-04-29 23:55:55 +08:00 via Android
    This is interesting. Following for now; I'll take a look on my computer during the day.
    bigshan
        4
    bigshan  
       2014-04-30 01:49:46 +08:00 via iPhone
    I'll take a look on my computer tomorrow.
    huangsong
        5
    huangsong  
       2014-04-30 10:35:31 +08:00
    Forking it.
    aWangami
        6
    aWangami  
       2014-04-30 12:40:28 +08:00
    C:\Users\Administrator\Desktop\Fetch-Taobao>php fetch.php 'http://shop65262430.taobao.com'
    PHP Warning: file_put_contents(/tmp/fetchgoods.pid): failed to open stream: No such file or directo
    ry in C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php on line 13
    PHP Stack trace:
    PHP 1. {main}() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:0
    PHP 2. file_put_contents() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:13

    Warning: file_put_contents(/tmp/fetchgoods.pid): failed to open stream: No such file or directory in
    C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php on line 13

    Call Stack:
    0.0010 127528 1. {main}() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:0
    0.0010 128008 2. file_put_contents() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php
    :13

    PHP Notice: Undefined index: scheme in C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class
    .php on line 59
    PHP Stack trace:
    PHP 1. {main}() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:0
    PHP 2. FetchGoods->run() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:33
    PHP 3. FetchGoods->fetchOneShop() C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.php
    :50

    Notice: Undefined index: scheme in C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.php
    on line 59

    Call Stack:
    0.0010 127528 1. {main}() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:0
    0.0068 192584 2. FetchGoods->run() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:3
    3
    0.0068 193152 3. FetchGoods->fetchOneShop() C:\Users\Administrator\Desktop\Fetch-Taobao\Fe
    tchGoods.class.php:50

    PHP Notice: Undefined index: host in C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.p
    hp on line 59
    PHP Stack trace:
    PHP 1. {main}() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:0
    PHP 2. FetchGoods->run() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:33
    PHP 3. FetchGoods->fetchOneShop() C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.php
    :50

    Notice: Undefined index: host in C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.php on
    line 59

    Call Stack:
    0.0010 127528 1. {main}() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:0
    0.0068 192584 2. FetchGoods->run() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:3
    3
    0.0068 193152 3. FetchGoods->fetchOneShop() C:\Users\Administrator\Desktop\Fetch-Taobao\Fe
    tchGoods.class.php:50

    shop_url:'http://shop65262430.taobao.com' ... start_time:04-29 15:19:11 ... start!
    PHP Fatal error: Call to undefined function curl_init() in C:\Users\Administrator\Desktop\Fetch-Tao
    bao\HttpFetch.class.php on line 127
    PHP Stack trace:
    PHP 1. {main}() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:0
    PHP 2. FetchGoods->run() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:33
    PHP 3. FetchGoods->fetchOneShop() C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.php
    :50
    PHP 4. HttpFetch->get() C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.php:74
    PHP 5. HttpFetch->disguise_curl() C:\Users\Administrator\Desktop\Fetch-Taobao\HttpFetch.class.php:
    29

    Fatal error: Call to undefined function curl_init() in C:\Users\Administrator\Desktop\Fetch-Taobao\H
    ttpFetch.class.php on line 127

    Call Stack:
    0.0010 127528 1. {main}() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:0
    0.0068 192584 2. FetchGoods->run() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:3
    3
    0.0068 193152 3. FetchGoods->fetchOneShop() C:\Users\Administrator\Desktop\Fetch-Taobao\Fe
    tchGoods.class.php:50
    0.0215 197880 4. HttpFetch->get() C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.c
    lass.php:74
    0.0215 197896 5. HttpFetch->disguise_curl() C:\Users\Administrator\Desktop\Fetch-Taobao\Ht
    tpFetch.class.php:29

    PHP Warning: unlink(/tmp/fetchgoods.pid): No such file or directory in C:\Users\Administrator\Deskt
    op\Fetch-Taobao\fetch.php on line 15
    PHP Stack trace:
    PHP 1. {main}() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:0
    PHP 2. FetchGoods->run() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:33
    PHP 3. FetchGoods->fetchOneShop() C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.php
    :50
    PHP 4. HttpFetch->get() C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.php:74
    PHP 5. HttpFetch->disguise_curl() C:\Users\Administrator\Desktop\Fetch-Taobao\HttpFetch.class.php:
    29
    PHP 6. removePidFile() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:0
    PHP 7. unlink() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:15

    Warning: unlink(/tmp/fetchgoods.pid): No such file or directory in C:\Users\Administrator\Desktop\Fe
    tch-Taobao\fetch.php on line 15

    Call Stack:
    0.0010 127528 1. {main}() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:0
    0.0068 192584 2. FetchGoods->run() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:3
    3
    0.0068 193152 3. FetchGoods->fetchOneShop() C:\Users\Administrator\Desktop\Fetch-Taobao\Fe
    tchGoods.class.php:50
    0.0215 197880 4. HttpFetch->get() C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.c
    lass.php:74
    0.0215 197896 5. HttpFetch->disguise_curl() C:\Users\Administrator\Desktop\Fetch-Taobao\Ht
    tpFetch.class.php:29
    0.0342 194016 6. removePidFile() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:0
    0.0342 194128 7. unlink() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:15


    C:\Users\Administrator\Desktop\Fetch-Taobao>
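Reading the log above, three separate things appear to go wrong on Windows. The hard-coded /tmp path for the PID file does not exist. The URL argument seems to reach PHP with its single quotes still attached (cmd.exe does not strip single quotes, which would explain the undefined scheme and host indexes). And the curl extension is not loaded. Below is a minimal sketch of how the first two might be handled. normalizeShopUrl and pidFilePath are hypothetical names, not functions from the repository.

```php
<?php
// Hypothetical helpers; the names are illustrative and do not come from fetch.php.

// Strip whitespace and any quote characters the shell left around the argument.
function normalizeShopUrl($raw)
{
    return trim($raw, " \t\r\n'\"");
}

// Build the PID file path under the platform's temp directory
// instead of relying on a hard-coded /tmp.
function pidFilePath($name = 'fetchgoods.pid')
{
    return rtrim(sys_get_temp_dir(), '/\\') . DIRECTORY_SEPARATOR . $name;
}

$url   = normalizeShopUrl("'http://shop65262430.taobao.com'");
$parts = parse_url($url);

if (!isset($parts['scheme'], $parts['host'])) {
    echo 'Not a usable shop URL: ' . $url . PHP_EOL;
} else {
    echo 'Shop host: ' . $parts['host'] . PHP_EOL;
    echo 'PID file:  ' . pidFilePath() . PHP_EOL;
}
```

On cmd.exe, passing the URL in double quotes (or with no quotes at all) should also avoid the quoting problem.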
    andyhu
        7
    andyhu  
       2014-05-01 04:45:49 +08:00
    Bookmarking this to follow. That said, doing scraping in PHP is a bit too painful.
    ptsa
        8
    ptsa  
       2014-05-02 22:28:53 +08:00
    @sadara $1199. Taobaoke isn't easy to make work these days, is it?
    ptsa
        9
    ptsa  
       2014-05-02 22:30:23 +08:00
    @sadara And it's still last year's version. No idea whether it works well.
    hanchengluo
        10
    hanchengluo  
       2014-05-03 10:22:41 +08:00
    @andyhu I scrape with PHP too. 2 GB of data took me almost a month. Do you have a better recommendation?
    andyhu
        11
    andyhu  
       2014-05-03 10:43:43 +08:00
    @hanchengluo Try node.js + request + cheerio. I actually use PHP at work, but for jobs like fetching remote pages, once you have used this combination, going back to PHP feels very painful.
    andyhu
        12
    andyhu  
       2014-05-03 10:45:02 +08:00
    @ptsa What exactly makes Taobaoke hard to do? I heard 蘑菇街 (Mogujie) and 美丽说 (Meilishuo) have both pivoted. What is the actual situation?
    hanchengluo
        13
    hanchengluo  
       2014-05-03 10:52:18 +08:00
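    @andyhu Mostly it's extracting the tags and then storing them in the database. The main bottlenecks should be fetch speed and database I/O. I don't think it has much to do with the program used.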
    www.smartweb.cn
    andyhu
        14
    andyhu  
       2014-05-03 10:57:42 +08:00
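    HTML parsing also takes time. Besides, PHP doesn't support multithreading, so every request has to wait, which is slow. For the database I use mongodb, and it is quite fast.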
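On the point about every request waiting: separately from threads, PHP's curl extension provides the curl_multi_* functions, which let a single process keep several transfers in flight at once. The following is a minimal sketch under that assumption. fetchAll is a hypothetical helper, and whether it helps in practice depends on where the real bottleneck is.

```php
<?php
// Sketch only: keep several HTTP transfers in flight from one PHP process.
// fetchAll() is a hypothetical helper and needs the curl extension.
function fetchAll(array $urls, $timeoutSeconds = 10)
{
    $multi   = curl_multi_init();
    $handles = array();

    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
        curl_multi_add_handle($multi, $ch);
        $handles[$key] = $ch;
    }

    // Drive every transfer until none is still running.
    $running = 0;
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0 && curl_multi_select($multi, 1.0) === -1) {
            usleep(100000); // select() failed; back off briefly instead of spinning
        }
    } while ($running > 0 && ($status === CURLM_OK || $status === CURLM_CALL_MULTI_PERFORM));

    // Record which transfers finished cleanly (CURLE_OK).
    $ok = array();
    while (($info = curl_multi_info_read($multi)) !== false) {
        $key = array_search($info['handle'], $handles, true);
        if ($key !== false) {
            $ok[$key] = ($info['result'] === CURLE_OK);
        }
    }

    // Collect bodies; a failed or unfinished transfer is reported as null.
    $bodies = array();
    foreach ($handles as $key => $ch) {
        $bodies[$key] = !empty($ok[$key]) ? curl_multi_getcontent($ch) : null;
        curl_multi_remove_handle($multi, $ch);
        curl_close($ch); // releases the handle on older PHP versions
    }
    curl_multi_close($multi);

    return $bodies;
}
```

A call such as `fetchAll(array('home' => 'http://example.com/'))` returns an array with the same keys, holding each response body or null for a transfer that failed.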
    andyhu
        15
    andyhu  
       2014-05-03 11:01:49 +08:00
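    @hanchengluo I just looked at your site. What do you use for the page snapshots? Is it done with phantomjs? Node has thumbbot, which is quite powerful and handles thumbnail previews for web pages, images, and video alike. It is based on phantomjs, though, so if you need to capture pages with Flash you would probably need a specially customized build; the old phantomjs no longer supports Flash. Overall, when it comes to scraping, PHP and node.js are simply not comparable. Even Python is much nicer to use than PHP, and it has quite a few dedicated crawler modules.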
    hanchengluo
        16
    hanchengluo  
       2014-05-03 11:14:40 +08:00
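    @andyhu Thanks for visiting. I only use CI on PHP and I'm not familiar with JS. I once wanted to build a crawler and learn GoLang for it, but I didn't stick with it and went back to PHP. I'm getting old and can't keep learning. I'm planning to turn the site into a small portal; it is still at the idea stage. Without scraping there is no content, but I'm also afraid of the site getting penalized for scraping.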
    laodao
        17
    laodao  
       2014-05-03 12:14:27 +08:00
    ym1623
        18
    ym1623  
       2014-09-03 14:30:13 +08:00
    I found that your project doesn't work. It still gets blocked by Tmall...