Help: spent 2 days trying to block a crawler, without success - V2EX
    cokyhe  2024-08-04 17:16:28 +08:00  4332 views

    A 10-year-old server has recently been hammered by bingbot traffic. Why can't I block it with $http_user_agent? The nginx log is full of entries like these:

    172.68.244.177 - - [04/Aug/2024:04:04:10 -0400] "GET /find-app/%E0%B8%AD%E0%B8%B2%E0%B8%8A%E0%B8%B5%E0%B8%9E%E0%B8%82%E0%B8%AD%E0%B8%87%E0%B8%8A%E0%B8%B2%E0%B8%A7%E0%B8%AB%E0%B8%A7%E0%B8%B9%E0%B9%88%E0%B8%AB%E0%B8%A5%E0%B8%B4%E0%B8%87%E0%B8%84%E0%B8%B7%E0%B8%AD%E0%B8%81%E0%B8%B2%E0%B8%A3%E0%B8%95%E0%B8%81%E0%B8%9B%E0%B8%A5%E0%B8%B2%E3%80%90ta777.me%E3%80%91%E0%B8%AD%E0%B8%B2%E0%B8%8A%E0%B8%B5%E0%B8%9E%E0%B8%82%E0%B8%AD%E0%B8%87%E0%B8%8A%E0%B8%B2%E0%B8%A7%E0%B8%AB%E0%B8%A7%E0%B8%B9%E0%B9%88%E0%B8%AB%E0%B8%A5%E0%B8%B4%E0%B8%87%E0%B8%84%E0%B8%B7%E0%B8%AD%E0%B8%81%E0%B8%B2%E0%B8%A3%E0%B8%95%E0%B8%81%E0%B8%9B%E0%B8%A5%E0%B8%B2%E3%80%90ta777.me%E3%80%91w7t?page=2 HTTP/1.1" 403 571 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36" 52.167.144.211
    172.71.222.36 - - [04/Aug/2024:04:04:10 -0400] "GET /find-app/%E6%B9%96%E5%8D%97%E6%B0%B4%E5%88%A9%E6%B0%B4%E7%94%B5%E8%81%8C%E4%B8%9A%E6%8A%80%E6%9C%AF%E5%AD%A6%E9%99%A2%E6%AF%95%E4%B8%9A%E8%AF%81%E6%A0%B7%E6%9C%AC%E5%9B%BE%E7%89%87%E2%8F%A9%E5%8A%9E%E7%90%86%E7%BD%91zhengjian.shop%E2%8F%AA-%E5%93%AA%E9%87%8C%E4%B9%B0%E6%B9%96%E5%8D%97%E6%B0%B4%E5%88%A9%E6%B0%B4%E7%94%B5%E8%81%8C%E4%B8%9A%E6%8A%80%E6%9C%AF%E5%AD%A6%E9%99%A2%E6%AF%95%E4%B8%9A%E8%AF%81%E6%A0%B7%E6%9C%AC%E5%9B%BE%E7%89%87%F0%9F%8C%9F%E5%8A%9E%E8%AF%81%E7%BD%91zhengjian.shop%F0%9F%8C%9F-%E5%BC%A0%E5%AE%B6%E6%B8%AF%E6%B9%96%E5%8D%97%E6%B0%B4%E5%88%A9%E6%B0%B4%E7%94%B5%E8%81%8C%E4%B8%9A%E6%8A%80%E6%9C%AF%E5%AD%A6%E9%99%A2%E6%AF%95%E4%B8%9A%E8%AF%81%E6%A0%B7%E6%9C%AC%E5%9B%BE%E7%89%87%E5%93%AA%E9%87%8C%E6%9C%89-%E5%93%AA%E9%87%8C%E5%8A%9E%E6%B9%96%E5%8D%97%E6%B0%B4%E5%88%A9%E6%B0%B4%E7%94%B5%E8%81%8C%E4%B8%9A%E6%8A%80%E6%9C%AF%E5%AD%A6%E9%99%A2%E6%AF%95%E4%B8%9A%E8%AF%81%E6%A0%B7%E6%9C%AC%E5%9B%BE%E7%89%87Q5?page=3 HTTP/1.1" 403 571 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36" 52.167.144.211

    Here is the full configuration:

    server {
        listen 80;
        listen 443 ssl;
        #
        ssl_certificate /etc/letsencrypt/live/domain.com/fullchain.pem;
        ssl_certificate_key /etc/letsencrypt/live/domain.com/privkey.pem;
        ##
        ssl_session_timeout 5m;
        ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
        ssl_prefer_server_ciphers on;
        ssl_ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:!aNULL:!eNULL:!LOW:!3DES:!MD5:!EXP:!PSK:!SRP:!DSS;

        ## enable HSTS including subdomains
        add_header Strict-Transport-Security "max-age=31536000; includeSubdomains";

        server_name domain.com www.domain.com;
        index index.html index.htm index.php;
        root /opt/htdocs/www.domain.com/public;

        #301
        if ($host = 'domain.com') {
            rewrite ^/(.*)$ https://www.domain.com/$1 permanent;
        }

        #location ~ /find-app {
        if ($http_user_agent ~* "bingbot|AhrefsBot") {
            return 403;
        }
        #}

        location / {
            try_files $uri $uri/ /index.php?$query_string;
        }

        # used when requesting the Let's Encrypt SSL certificate
        location ~ /.well-known {
            allow all;
        }

        if (!-e $request_filename) {
        }

        location ~ .*\.(php|php5)?$ {
            #fastcgi_pass unix:/tmp/php-cgi.sock;
            fastcgi_pass 127.0.0.1:9000;
            fastcgi_index index.php;
            include fcgi.conf;
        }

        location ~ .*\.(gif|jpg|jpeg|png|bmp|swf)$ {
            expires 30d;
        }

        location ~ .*\.(js|css)?$ {
            expires 15d;
        }

        access_log /data0/logs/domain.log access;
    }
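
The user-agent check in the configuration above could also be written with a `map`, which keeps the bot list in one place. This is only a sketch: it assumes the `map` block goes in the `http {}` context, and `$blocked_bot` is a variable name made up for illustration. The elided part of `server {}` stands for the original directives.

```nginx
# http {} context. $blocked_bot is a hypothetical variable name.
map $http_user_agent $blocked_bot {
    default       0;
    ~*bingbot     1;   # ~* is a case-insensitive regex match
    ~*AhrefsBot   1;
}

server {
    # ... existing listen / ssl / server_name / root directives ...

    # Server-level rewrite directives run before a location is chosen.
    # Unless an error_page for 403 is configured elsewhere, a matching
    # request is answered here and never reaches a location block.
    if ($blocked_bot) {
        return 403;
    }
}
```

`if ($blocked_bot)` is false when the variable is empty or `0`, so only user agents matched by the map are rejected.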
    16 replies    2024-08-06 01:00:29 +08:00
    #1  Xusually  2024-08-04 17:19:33 +08:00 via iPhone
    The block is working. Both log entries you posted have status code 403, which means your rule took effect.
    #2  ZhilingQwQ  2024-08-04 17:24:00 +08:00
    Why not use robots.txt?
    #3  unidotnet  2024-08-04 17:25:06 +08:00
    Why not use robots.txt? +1
    #4  cokyhe (OP)  2024-08-04 17:26:27 +08:00
    The log does show 403, but the corresponding PHP code still seems to be running and the CPU is spiking. I'll keep digging. Thanks to everyone above.
    #5  julyclyde  2024-08-04 17:31:08 +08:00
    @cokyhe In principle, return 403 should be the final result. It shouldn't go on to fastcgi_pass after that, should it?
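
One way to check whether the blocked requests really reach PHP is to log the upstream variables next to the status code. The following is a sketch under assumptions: the log format name `upstream_debug` and the log file path are hypothetical. `$upstream_addr` and `$upstream_response_time` are standard nginx variables, and nginx writes `-` for them when no upstream (here, the FastCGI backend) was contacted.

```nginx
# http {} context. "upstream_debug" is a hypothetical format name.
log_format upstream_debug '$remote_addr [$time_local] "$request" $status '
                          'upstream=$upstream_addr '
                          'upstream_time=$upstream_response_time '
                          '"$http_user_agent"';

# server {} context, alongside the existing access_log.
# The file path is an assumption for illustration.
access_log /data0/logs/domain.debug.log upstream_debug;
```

If the 403 entries show `upstream=-`, those requests were not passed to the FastCGI backend, and the CPU load would have to come from other requests.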
    #6  AkaGhost  2024-08-04 17:49:38 +08:00
    bingbot respects robots.txt. Just use robots.txt to tell bingbot to stop crawling the site.
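
A sketch of the robots.txt approach, served directly from nginx. Two assumptions: only the `/find-app/` prefix seen in the posted logs is put off-limits, and the rules are inlined with `return` instead of being read from a file on disk. Note that the server-level `if ($http_user_agent ...)` in the posted config answers 403 before any location is chosen, so bingbot could not fetch `/robots.txt` while that rule is active.

```nginx
# server {} context
location = /robots.txt {
    default_type text/plain;
    # "\n" inside a quoted string becomes a newline in the response body
    return 200 "User-agent: bingbot\nDisallow: /find-app/\n";
}
```

Using `Disallow: /` instead would ask bingbot to stay away from the whole site.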
    #7  Xusually  2024-08-04 18:06:55 +08:00
    Also, the two IPs you posted aren't Bing bot IPs at all. You can check whether an address really belongs to the Bing bot here: https://www.bing.com/toolbox/verify-bingbot

    Those two IPs belong to Cloudflare, so this is a fake crawler.
    #8  moyaya  2024-08-04 19:05:08 +08:00
    There have been far too many crawlers lately, especially from AI companies.
    #9  yb2313  2024-08-04 19:55:48 +08:00
    Poison the content you serve it. Crawlers don't check for that anyway.
    #10  xqzr  2024-08-04 20:15:50 +08:00
    @Xusually The IP at the end of each log line is the real one.
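
The first field of each posted log line is a Cloudflare edge address, and the client address only appears at the end. A sketch of restoring the client address into `$remote_addr` with ngx_http_realip_module, assuming the module is compiled in and that the visitor address arrives in the `CF-Connecting-IP` header. Only one Cloudflare range is shown, the one covering the two addresses in the log; the complete list would have to be taken from Cloudflare's published IP ranges.

```nginx
# http {} or server {} context
# Trust the forwarded header only when the connection comes from this range.
set_real_ip_from 172.64.0.0/13;      # covers 172.68.x.x and 172.71.x.x
real_ip_header   CF-Connecting-IP;
```

With this in place, `allow` / `deny` rules and the `$remote_addr` log field would operate on the visitor address instead of the proxy address.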
    #11  lavvrence  2024-08-04 20:33:41 +08:00
    Put it behind a CDN and block by JA3 / JA4 fingerprint at the WAF.
    For a service hosted overseas, [Cloudflare](https://lawrenceli.me/blog/cloudflare#client-hello---ja3) is the no-brainer choice.
    #12  Xusually  2024-08-04 21:03:21 +08:00 via iPhone
    @xqzr Right, I misread it. That IP is the Bing bot.
    #13  xiaoxiaov  2024-08-05 07:04:24 +08:00 via Android
    Install SafeLine (雷池) Community Edition on the server and be done with it once and for all.
    #14  lpe234  2024-08-05 11:53:33 +08:00
    Is this even a crawler? It looks more like someone placing ads. [app/湖南水利水电职业技术学院毕业证样本图片办理网]
    #15  RangerWolf  2024-08-05 16:37:48 +08:00
    @Xusually Is there an IP verification tool for Facebook's crawler as well?
    My site is being crawled by Facebook.
    #16  coder001  2024-08-06 01:00:29 +08:00
    @lpe234 Looks like this may be yet another new black-hat trick, such as luring search engine crawlers into doing the dirty work.