Gitbook2pdf: a tool that crawls GitBook-generated sites and produces PDF files - V2EX
Gitbook2pdf: a tool that crawls GitBook-generated sites and produces PDF files

  •   fuergaosi 2019-03-07 10:21:36 +08:00 7717 views

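    Introduction

    I keep finding that books generated with GitBook are of very high quality,
    and I want to save them for offline reading.
    But the PDFs that GitBook generates can't be copied from and are very large,
    and some sites don't offer a download option at all.
    So a friend and I built a tool
    that crawls GitBook-generated sites,
    parses the pages, and then uses weasyprint to generate the file.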

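    Features

    • Asynchronous crawling: pages are fetched with aiohttp, so crawling a site's content usually finishes within seconds

    • Copyable text

    • Preserves the original table-of-contents structure

    • Keeps the original links

    • Faithfully reproduces the styling of the original HTML pages
    • Small output: an 800+ page PDF takes only 4.6 MB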
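
The asynchronous crawl mentioned in the feature list can be sketched roughly as follows. This is a minimal illustration and not gitbook2pdf's actual code: `crawl_all`, `fake_fetch`, and the `limit` parameter are hypothetical names, and the stand-in fetcher replaces a real aiohttp request so that the sketch runs with the standard library alone.

```python
import asyncio


async def crawl_all(urls, fetch, limit=10):
    """Fetch every URL concurrently and return the results in input order.

    `fetch` is any coroutine function that takes a URL and returns its body;
    `limit` caps how many requests are in flight at the same time.
    """
    semaphore = asyncio.Semaphore(limit)

    async def fetch_one(url):
        async with semaphore:
            return await fetch(url)

    # gather() returns results in argument order, so chapter order survives.
    return await asyncio.gather(*(fetch_one(url) for url in urls))


async def fake_fetch(url):
    """Stand-in for a real HTTP request (hypothetical, for illustration)."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


if __name__ == "__main__":
    pages = asyncio.run(crawl_all(["a.html", "b.html", "c.html"], fake_fetch))
    print(len(pages), "pages fetched")
```

In a real crawler, `fetch` would presumably wrap a request made through an aiohttp session; the semaphore simply keeps the number of simultaneous connections bounded.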
    Addendum 1    2019-03-07 19:15:28 +08:00

    Project page: gitbook2pdf

    33 replies    2019-05-07 23:51:07 +08:00
    fuergaosi
        1
    fuergaosi  
    OP
       2019-03-07 10:25:20 +08:00
    Stars appreciated!
    magicZ
        2
    magicZ  
       2019-03-07 10:28:11 +08:00
    How about a link?
    fuergaosi
        3
    fuergaosi  
    OP
       2019-03-07 10:31:40 +08:00
    I forgot to include the link.
    gitbook2pdf: https://github.com/fuergaosi233/gitbook2pdf
    22k
        4
    22k  
       2019-03-07 10:32:00 +08:00
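    Just yesterday I was wondering whether there was a way to download GitBook books. Bookmarking this. OP, if you can share it, please update the original post. Thanks!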
    fuergaosi
        5
    fuergaosi  
    OP
       2019-03-07 10:50:09 +08:00
    @22k See reply #3
    changjiangzzZ
        6
    changjiangzzZ  
       2019-03-07 11:22:48 +08:00
    Starred :)
    newmind
        7
    newmind  
       2019-03-07 11:27:17 +08:00
    Works really well, upvoted
    newmind
        8
    newmind  
       2019-03-07 11:28:13 +08:00
    It would be even better if there were an online version
    jasonslyvia
        9
    jasonslyvia  
       2019-03-07 11:55:25 +08:00
    Nice, I've always wanted a tool like this. I hope you keep polishing it!
    FakeLeung
        10
    FakeLeung  
       2019-03-07 11:59:18 +08:00
    Is there no usage documentation?
    From the code, it looks like you edit the URL passed to run in main directly?

    PS: you can append the GitHub URL to the post.
    fffflyfish
        11
    fffflyfish  
       2019-03-07 12:19:46 +08:00
    Upvoted! Finally someone has built this
    mseasons
        12
    mseasons  
       2019-03-07 14:31:23 +08:00
    aiohttp.client_exceptions.ClientConnectorCertificateError: Cannot connect to host wizardforcel.gitbooks.io:443 ssl:True [SSLCertVerificationError: (1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1056)')]
    d5
        13
    d5  
       2019-03-07 14:34:09 +08:00
    OP, you could consider making an online version, with the backend hosted on a server in another region~
    privil
        14
    privil  
       2019-03-07 16:32:32 +08:00
    ...It seems fairly memory-hungry; the process got killed
    tongdongdong
        15
    tongdongdong  
       2019-03-07 18:59:15 +08:00
    C:\Users\TDD\Desktop>python -m weasyprint https://ts.xcatliu.com ts.pdf
    WARNING: Ignored `text-rendering:auto` at 4:620, unknown property.
    WARNING: Ignored `filter:none` at 4:2882, unknown property.
    WARNING: Expected a media type, got (max-width:600px)
    WARNING: Invalid media type " (max-width:600px)" the whole @media rule was ignored at 9:83.
    WARNING: Expected a media type, got (max-width:600px)
    WARNING: Invalid media type " (max-width:600px)" the whole @media rule was ignored at 9:669.
    WARNING: Ignored `box-shadow:none` at 9:1092, unknown property.
    WARNING: Ignored `text-overflow:ellipsis` at 9:1686, unknown property.
    WARNING: Expected a media type, got (max-width:1000px)
    WARNING: Invalid media type " (max-width:1000px)" the whole @media rule was ignored at 9:1805.
    WARNING: Ignored `box-shadow:0 6px 12px rgba(0,0,0,.175)` at 9:2336, unknown property.
    WARNING: Ignored `overflow-y:auto` at 9:3908, unknown property.
    WARNING: Ignored `text-overflow:ellipsis` at 9:4934, unknown property.
    WARNING: Expected a media type, got (max-width:600px)
    WARNING: Invalid media type " (max-width:600px)" the whole @media rule was ignored at 9:5254.
    WARNING: Expected a media type, got (min-width:600px)
    WARNING: Invalid media type " (min-width:600px)" the whole @media rule was ignored at 9:5583.
    WARNING: Expected a media type, got (max-width:600px)
    WARNING: Invalid media type " (max-width:600px)" the whole @media rule was ignored at 9:5650.
    WARNING: Ignored `overflow-y:auto` at 9:6180, unknown property.
    WARNING: Ignored `overflow-y:auto` at 9:6418, unknown property.
    WARNING: Expected a media type, got (max-width:1240px)
    WARNING: Invalid media type " (max-width:1240px)" the whole @media rule was ignored at 9:6434.
    WARNING: Ignored `text-size-adjust:100%` at 9:7377, unknown property.
    WARNING: Expected a media type, got (max-width:1240px)
    WARNING: Invalid media type " (max-width:1240px)" the whole @media rule was ignored at 9:11595.
    WARNING: Ignored `box-shadow:none` at 9:12111, unknown property.
    WARNING: Ignored `text-size-adjust:100%` at 9:12512, unknown property.
    WARNING: Ignored `text-rendering:optimizeLegibility` at 9:20972, unknown property.
    WARNING: Ignored `font-smoothing:antialiased` at 9:21006, unknown property.
    WARNING: Ignored `text-size-adjust:100%` at 9:21124, unknown property.
    WARNING: Ignored `box-shadow: none` at 235:3, unknown property.
    WARNING: Ignored `box-shadow: none` at 272:3, unknown property.
    And then only the home page was converted successfully!!!
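
The `weasyprint` command renders only the single document at the URL it is given, so converting a whole book requires collecting the chapter URLs first. The sketch below shows one way to do that with the standard library. It assumes the table of contents marks each entry as `<li class="chapter">`, which is an assumption about the site's markup rather than something confirmed in this thread; `ChapterLinkParser` and `chapter_links` are hypothetical names, not gitbook2pdf's code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ChapterLinkParser(HTMLParser):
    """Collect hrefs of <a> tags nested inside <li class="chapter"> items."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.li_stack = []   # one flag per open <li>: is it a chapter item?
        self.links = []      # absolute chapter URLs, in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li":
            classes = (attrs.get("class") or "").split()
            self.li_stack.append("chapter" in classes)
        elif tag == "a" and any(self.li_stack) and attrs.get("href"):
            url = urljoin(self.base_url, attrs["href"])
            if url not in self.links:
                self.links.append(url)

    def handle_endtag(self, tag):
        if tag == "li" and self.li_stack:
            self.li_stack.pop()


def chapter_links(html, base_url):
    """Return absolute chapter URLs found in `html`, without duplicates."""
    parser = ChapterLinkParser(base_url)
    parser.feed(html)
    return parser.links


SAMPLE = """
<ul class="summary">
  <li class="chapter"><a href="./">Introduction</a></li>
  <li class="chapter"><a href="basics/index.html">Basics</a>
    <ul><li class="chapter"><a href="basics/types.html">Types</a></li></ul>
  </li>
</ul>
<a href="https://example.com/elsewhere">not a chapter</a>
"""

if __name__ == "__main__":
    for link in chapter_links(SAMPLE, "https://book.example/"):
        print(link)
```

Each collected URL can then be fetched and rendered, in order, instead of rendering the home page alone.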
    changjiangzzZ
        16
    changjiangzzZ  
       2019-03-07 19:02:54 +08:00
    @tongdongdong Please take a look at the docs first~
    changjiangzzZ
        17
    changjiangzzZ  
       2019-03-07 19:04:38 +08:00
    @mseasons Network conditions inside China aren't great, so the connection timed out. Try adding a proxy
    fuergaosi
        18
    fuergaosi  
    OP
       2019-03-07 19:13:55 +08:00
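    @privil The memory use is a `weasyprint` issue; I'm trying to write the output in chunks
    @tongdongdong Head over to the `weasyprint` issues section
    @mseasons I can't access that URL, so I don't know how you reached it. Please post the problem and the URL you were crawling in the `issues` section
    @FakeLeung Thanks for the reminder. I couldn't find the append button earlier (_). Also, for now you use it by editing the URL; I'll change the usage shortly. I had always tested it this way, so I hadn't paid attention to this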
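
As an illustration of the usage change discussed here, a command-line entry point that takes the URL as an argument could look roughly like this. It is a hypothetical sketch: the `-o/--output` option, its `book.pdf` default, and the `parse_args` helper are assumptions, not the interface the project actually adopted.

```python
import argparse


def parse_args(argv=None):
    """Parse a hypothetical command line: gitbook.py URL [-o OUTPUT]."""
    parser = argparse.ArgumentParser(
        description="Crawl a GitBook-generated site and write a PDF.")
    parser.add_argument("url", help="root URL of the GitBook site")
    parser.add_argument("-o", "--output", default="book.pdf",
                        help="path of the PDF to write (assumed default)")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    # A real entry point would hand args.url to the crawler at this point.
    print(f"would crawl {args.url} and write {args.output}")
```

With something like this in place, users would pass the book's address on the command line instead of editing the source.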
    Ahs
        19
    Ahs  
       2019-03-07 19:14:26 +08:00 via Android
    Starred
    fuergaosi
        20
    fuergaosi  
    OP
       2019-03-07 19:21:27 +08:00
    @d5 @newmind This thing is a bit memory-hungry; once that is solved I'll consider making an online version
    aWangami
        21
    aWangami  
       2019-03-07 19:27:16 +08:00
    (Python3) gitbook2pdf python gitbook.py
    Traceback (most recent call last):
    File "gitbook.py", line 5, in <module>
    import weasyprint
    File "/Users/Python3/lib/python3.7/site-packages/weasyprint/__init__.py", line 393, in <module>
    from .css import preprocess_stylesheet # noqa
    File "/Users/Python3/lib/python3.7/site-packages/weasyprint/css/__init__.py", line 26, in <module>
    from . import computed_values
    File "/Users/Python3/lib/python3.7/site-packages/weasyprint/css/computed_values.py", line 17, in <module>
    from .. import text
    File "/Users/Python3/lib/python3.7/site-packages/weasyprint/text.py", line 14, in <module>
    import cairocffi as cairo
    File "/Users/Python3/lib/python3.7/site-packages/cairocffi/__init__.py", line 39, in <module>
    cairo = dlopen(ffi, 'cairo', 'cairo-2', 'cairo-gobject-2', 'cairo.so.2')
    File "/Users/Python3/lib/python3.7/site-packages/cairocffi/__init__.py", line 36, in dlopen
    raise OSError("dlopen() failed to load a library: %s" % ' / '.join(names))
    OSError: dlopen() failed to load a library: cairo / cairo-2 / cairo-gobject-2 / cairo.so.2

    What's going on here?
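
The traceback shows that cairocffi could not load the native cairo library, which usually means cairo is not installed or not on the loader's search path. A rough way to check, using only the standard library, is sketched below; the candidate names are copied from the error message, and `locate_cairo` is a hypothetical helper.

```python
from ctypes.util import find_library

# Library names copied from the OSError message in the traceback.
CANDIDATES = ("cairo", "cairo-2", "cairo-gobject-2")


def locate_cairo(names=CANDIDATES):
    """Map each candidate name to what the system loader finds, or None."""
    return {name: find_library(name) for name in names}


if __name__ == "__main__":
    for name, found in locate_cairo().items():
        print(f"{name}: {found or 'not found'}")
```

If every entry prints "not found", installing cairo through the system's package manager is the likely next step.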
    privil
        22
    privil  
       2019-03-07 19:28:17 +08:00
    @fuergaosi #18 It also errored out while crawling. Then again, my VPS has very little memory, only 512 MB, so crawling the original k8s handbook won't work.

    https://funhacks.gitbooks.io/explore-python
    crawling : https://funhacks.gitbooks.io/explore-python/Conclusion/reference_material.html
    Traceback (most recent call last):
    File "gitbook.py", line 298, in <module>
    Gitbook2PDF("https://funhacks.gitbooks.io/explore-python/").run()
    File "gitbook.py", line 190, in run
    loop.run_until_complete(self.crawl_main_content(content_urls))
    File "/usr/local/python3.7.2/lib/python3.7/asyncio/base_events.py", line 584, in run_until_complete
    return future.result()
    File "gitbook.py", line 212, in crawl_main_content
    await asyncio.gather(*tasks)
    File "gitbook.py", line 233, in gettext
    text = ChapterParser(metatext, level).parser()
    File "gitbook.py", line 95, in parser
    if len(context.find('footer')):
    TypeError: object of type 'NoneType' has no len()
    privil
        23
    privil  
       2019-03-07 19:30:23 +08:00
    hooych
        24
    hooych  
       2019-03-07 19:38:27 +08:00
    @aWangami Same on Mac with python3: OSError: dlopen() failed to load a library: cairo / cairo-2 / cairo-gobject-2 / cairo.so.2
    @fuergaosi What's going on?
    hooych
        25
    hooych  
       2019-03-07 19:40:01 +08:00
    @fuergaosi #23 Thanks a lot, I'll install it and try again
    fuergaosi
        26
    fuergaosi  
    OP
       2019-03-07 19:40:07 +08:00
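    @privil I can't reproduce it. This error is the fault of an official recommendation. I originally hadn't written len; when I ran the code today, an official warning told me that a bare `if None`-style check might not be allowed in the future and recommended writing it this way instead. That turned into a bug. I'll go fix it now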
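
The failure in reply 22 comes from calling `len()` on the result of `find('footer')` when no footer exists, because `find` returns `None` in that case. A minimal sketch of the safer check is shown below. It uses the standard library's `xml.etree.ElementTree` as a stand-in, on the assumption that the project's `context` object has an ElementTree-style `find`; `has_footer` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def has_footer(context):
    """Return True when `context` has a <footer> child element."""
    # find() returns None when nothing matches, so len(find(...)) raises
    # TypeError; compare against None explicitly instead.
    return context.find("footer") is not None


with_footer = ET.fromstring("<section><p>text</p><footer/></section>")
without_footer = ET.fromstring("<section><p>text</p></section>")

print(has_footer(with_footer))     # True
print(has_footer(without_footer))  # False
```

Note that `len(element)` counts child elements, so even when the footer exists, an empty `<footer/>` would have length 0; the explicit `is not None` test avoids both problems.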
    mseasons
        27
    mseasons  
       2019-03-07 22:15:15 +08:00
    @changjiangzzZ It's not a timeout problem; it seems to be an HTTPS verification problem. I added verify=False to the parameters of every get request and it worked.
    mseasons
        28
    mseasons  
       2019-03-07 22:18:23 +08:00
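    @fuergaosi I didn't change the URL; I just ran the source straight from a git clone. Later I checked the docs and added the parameter verify=False to all get requests, and then it went through.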
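
The error in reply 12 ("unable to get local issuer certificate") indicates that the local trust store could not validate the server's certificate chain. The sketch below builds an `ssl.SSLContext` either with verification on (optionally pointing at a CA bundle) or with verification off, which is roughly what `verify=False` achieves. `make_ssl_context` is a hypothetical helper; as far as I know, aiohttp request methods accept such a context through their `ssl` parameter, but treat that wiring as an assumption to check against the aiohttp docs.

```python
import ssl


def make_ssl_context(verify=True, cafile=None):
    """Build an SSLContext; use verify=False only for hosts you trust."""
    context = ssl.create_default_context(cafile=cafile)
    if not verify:
        # Order matters: check_hostname must be disabled before CERT_NONE.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context


strict = make_ssl_context()
relaxed = make_ssl_context(verify=False)
print(strict.verify_mode == ssl.CERT_REQUIRED)  # True
print(relaxed.verify_mode == ssl.CERT_NONE)     # True
```

Turning verification off removes protection against man-in-the-middle attacks, so supplying a valid CA bundle via `cafile` is the safer fix when it is available.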
    dyxang
        29
    dyxang  
       2019-03-07 22:24:18 +08:00 via Android
    I'd love to just use it directly. Why not package it with py2exe?
    leesymbol
        30
    leesymbol  
       2019-03-08 08:22:04 +08:00 via iPhone
    Bump
    cye3s
        31
    cye3s  
       2019-03-08 11:25:50 +08:00
    I tried one, and the table-of-contents structure wasn't preserved. For example, this one:
    https://go.tanglei.name/content
    fuergaosi
        32
    fuergaosi  
    OP
       2019-03-08 14:18:46 +08:00
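    @cye3s I tested it and the table-of-contents structure was preserved, although two chapters are missing because of two 404s ![kz37f1.png]( https://s2.ax1x.com/2019/03/08/kz37f1.png) Also, if you run into problems, please post them directly in the issues section
    @dyxang Because I don't have Windows ┑( ̄Д  ̄)┍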
    soulteary
        33
    soulteary  
       2019-05-07 23:51:07 +08:00
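    @fuergaosi Your little tool works great, but I saw that some people couldn't get the environment set up, so I packaged a container image. The code is here: https://github.com/soulteary/docker-gitbook-pdf-generator

    If you're willing to adjust the project directory structure a little & tag releases, later upgrades and maintenance would be easier, for example customizing the e-book style, etc...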