Comparing Node.js, Python, PHP, Ruby, Go, and Perl on a single 400 MB CSV file - V2EX

zhouyin  244 days ago  1806 views
This topic was created 244 days ago; the information in it may have since evolved or changed.

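### Timing

Perl was the slowest: I ran out of patience and killed it before it finished.

Node.js: a little over 1 minute

PHP: 30-odd seconds

Ruby: 30-odd seconds

Python: about 11 seconds

Go: about 4 seconds

### On speed, Go and Python win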

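### Functionality: the CSV file is non-standard, with a lone unpaired double quote inside one field

Go, Node.js, and Ruby all raised errors and could not finish. Their times above were measured on a copy of the CSV with that stray quote removed.

PHP raised no error, but it silently skipped many rows because of the lone double quote: it treated the quote as a quoting delimiter.

On functionality Python wins. It handled the non-standard CSV without trouble and produced a correct CSV in the end, in just a few lines of code.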
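
To make the failure mode concrete, here is a minimal sketch of two quoting rules a CSV reader might follow. It is an illustration only, not the actual logic of PHP, Go, Ruby, or any library named in this thread; `parseCsv`, its `quoteAnywhere` option, and the sample data are made up for the example. With `quoteAnywhere: true` every double quote toggles quoted mode, so one stray quote swallows the following lines into a single field. With `quoteAnywhere: false` a quote opens a quoted field only at the start of a field and is otherwise kept as a literal character, which is, as far as I know, roughly what Python's `csv` reader does in its default non-strict mode.

```javascript
// Split CSV text into rows of fields with a small state machine.
// quoteAnywhere = true : every double quote toggles quoted mode (naive).
// quoteAnywhere = false: a quote opens a quoted field only at field start
//                        and is kept as a literal character elsewhere.
function parseCsv(text, { quoteAnywhere }) {
  const rows = [];
  let row = [];
  let field = "";
  let inQuotes = false;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inQuotes) {
      if (ch === '"') {
        if (text[i + 1] === '"') {
          field += '"'; // doubled quote inside a quoted field is an escape
          i++;
        } else {
          inQuotes = false;
        }
      } else {
        field += ch; // commas and newlines are plain data while quoted
      }
    } else if (ch === '"' && (quoteAnywhere || field === "")) {
      inQuotes = true;
    } else if (ch === ",") {
      row.push(field);
      field = "";
    } else if (ch === "\n") {
      row.push(field);
      rows.push(row);
      row = [];
      field = "";
    } else {
      field += ch;
    }
  }
  // Flush a final row that has no trailing newline (or an unterminated quote).
  if (field !== "" || row.length > 0) {
    row.push(field);
    rows.push(row);
  }
  return rows;
}

// Hypothetical sample: the second line has a lone quote (5" meaning 5 inches).
const sample = 'id,desc\n1,pipe 5" wide\n2,elbow\n3,valve\n';

console.log(parseCsv(sample, { quoteAnywhere: true }).length);  // 2
console.log(parseCsv(sample, { quoteAnywhere: false }).length); // 4
```

In the naive mode the last two data rows do not disappear with an error; they end up inside the second field of row 1, which matches the "many rows silently skipped" symptom described above.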

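### Writing the code: Node.js was the most unpleasant

Node.js is a lot like what the Ghostscript author said of Perl: Perl is like something vomited out of a dog's anus.

After writing this small project, I feel Node.js is the one that deserves that description.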

23 replies    2025-02-10 19:31:20 +08:00

1  ysc3839  244 days ago via Android
So where's the code?

2  zhouyin (OP)  244 days ago
I can't upload the code here.

See this link:

https://cowtransfer.com/s/f0a48d2009fd4f

3  zhouyin (OP)  244 days ago

4  hefish  244 days ago
Haha, what a refined way of putting it.

5  gainsurier  244 days ago via iPhone
I'd guess a C version would need about a second?

6  zhouyin (OP)  244 days ago
@gainsurier
Aren't Python, PHP, and Ruby all implemented in C anyway? Python is just implemented better.

8  chenqh  244 days ago
Wait, how is Node.js this slow? Where's the JIT? It is slower than PHP and Ruby, which don't even have a JIT?

9  zhouyin (OP)  244 days ago
@gainsurier
Also, Node.js is implemented in C++, and it is not done as well as Python.

10  henbf  244 days ago
Before bashing Node.js, ask yourself whether you should first get the basic concepts of I/O and streams straight.

11  zhouyin (OP)  244 days ago
@henbf
I'm no Node.js expert. I updated a.js to use an output stream, but now it fails with a heap out-of-memory error:

    ```bash
    -bash-4.2# node a.js
    (node:17974) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 drain listeners added to [WriteStream]. Use emitter.setMaxListeners() to increase limit
    (Use `node --trace-warnings ...` to show where the warning was created)
    (node:17974) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 drain listeners added to [WriteStream]. Use emitter.setMaxListeners() to increase limit
    (node:17974) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 drain listeners added to [WriteStream]. Use emitter.setMaxListeners() to increase limit

    <--- Last few GCs --->

    [17974:0x1c3dbf0] 40306 ms: Scavenge (reduce) 2046.8 (2082.1) -> 2046.5 (2082.6) MB, 44.4 / 0.0 ms (average mu = 0.342, current mu = 0.316) allocation failure
    [17974:0x1c3dbf0] 40396 ms: Scavenge (reduce) 2047.2 (2082.6) -> 2046.8 (2082.8) MB, 31.1 / 0.0 ms (average mu = 0.342, current mu = 0.316) allocation failure


    <--- JS stacktrace --->

    FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - Javascript heap out of memory
    1: 0x7fcfb6136908 node::Abort() [/lib64/libnode.so.93]
    2: 0x7fcfb6024451 [/lib64/libnode.so.93]
    3: 0x7fcfb732a552 v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/lib64/libnode.so.93]
    4: 0x7fcfb732a8e7 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/lib64/libnode.so.93]
    5: 0x7fcfb74ea305 [/lib64/libnode.so.93]
    6: 0x7fcfb74ea3e5 [/lib64/libnode.so.93]
    7: 0x7fcfb74fe77c v8::internal::Heap::PerformGarbageCollection(v8::internal::GarbageCollector, v8::GCCallbackFlags) [/lib64/libnode.so.93]
    8: 0x7fcfb74ff0a1 v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [/lib64/libnode.so.93]
    9: 0x7fcfb7502269 v8::internal::Heap::AllocateRawWithLightRetrySlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/lib64/libnode.so.93]
    10: 0x7fcfb75022f7 v8::internal::Heap::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/lib64/libnode.so.93]
    11: 0x7fcfb74c27d0 v8::internal::Factory::AllocateRaw(int, v8::internal::AllocationType, v8::internal::AllocationAlignment) [/lib64/libnode.so.93]
    12: 0x7fcfb74badb4 v8::internal::FactoryBase<v8::internal::Factory>::AllocateRawWithImmortalMap(int, v8::internal::AllocationType, v8::internal::Map, v8::internal::AllocationAlignment) [/lib64/libnode.so.93]
    13: 0x7fcfb74bcbdf v8::internal::FactoryBase<v8::internal::Factory>::NewRawOneByteString(int, v8::internal::AllocationType) [/lib64/libnode.so.93]
    14: 0x7fcfb74c4d5d v8::internal::Factory::NewStringFromUtf8(v8::base::Vector<char const> const&, v8::internal::AllocationType) [/lib64/libnode.so.93]
    15: 0x7fcfb733d59d v8::String::NewFromUtf8(v8::Isolate*, char const*, v8::NewStringType, int) [/lib64/libnode.so.93]
    16: 0x7fcfb6215390 node::StringBytes::Encode(v8::Isolate*, char const*, unsigned long, node::encoding, v8::Local<v8::Value>*) [/lib64/libnode.so.93]
    17: 0x7fcfb6123ef3 [/lib64/libnode.so.93]
    18: 0x7fcfb71ba3cc [/lib64/libnode.so.93]
    Aborted
    ```
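
The warning about 11 `drain` listeners on the `WriteStream` suggests that a new `drain` handler was registered on every write, and heap exhaustion is what tends to happen when data is queued faster than it can be flushed. The a.js that produced this log is not shown in the thread, so that reading is a guess. Under that assumption, a minimal backpressure sketch looks like the following; `writeLines` is a hypothetical helper name, and the only APIs used are `fs.createWriteStream`, the boolean returned by `write()`, and `events.once`.

```javascript
const { createWriteStream } = require("fs");
const { once } = require("events");

// Write every line from an iterable to `path`, pausing whenever the
// stream's internal buffer is full instead of queuing more data in memory.
async function writeLines(path, lines) {
  const out = createWriteStream(path);
  let count = 0;
  for (const line of lines) {
    // write() returns false once the buffered data reaches highWaterMark.
    if (!out.write(line + "\n")) {
      // once() attaches a single one-shot 'drain' listener and removes it
      // when it fires, so listeners never accumulate on the stream.
      await once(out, "drain");
    }
    count++;
  }
  out.end();
  await once(out, "finish"); // resolves after all buffered data is flushed
  return count;
}

// Usage (hypothetical data source):
//   function* numbered(n) { for (let i = 0; i < n; i++) yield `row ${i}`; }
//   writeLines("./test.txt", numbered(1e6)).then((n) => console.log(n));
```

The key point is that the loop stops producing while `write()` reports a full buffer, so memory use stays bounded by the stream's buffer rather than by the size of the input.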

12  henbf  244 days ago
@zhouyin Your version is written incorrectly. Try this:

```javascript
const { createReadStream, createWriteStream } = require("fs");
const { parse } = require("csv-parse");

const inputPath = "../outpy.csv";
const outputPath = "./test.txt";

const readStream = createReadStream(inputPath);
// flags: "a" appends, so rerunning the script grows the output file.
const writeStream = createWriteStream(outputPath, { flags: "a" });

// Split fields on commas and start at line 2, skipping the first line.
const parser = parse({ delimiter: ",", from_line: 2 });

readStream.pipe(parser);

// Each parsed row arrives as an array of fields; join it back into a line.
parser.on("data", (row) => {
  writeStream.write(row.join(",") + "\n");
});

parser.on("end", () => {
  console.log("finished");
  writeStream.end();
});

parser.on("error", (error) => {
  console.error("CSV Parsing Error:", error);
});
```
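
One thing to watch in the snippet above: `row.join(",")` writes fields back without any quoting, so a field that itself contains a comma, a double quote, or a line break would produce a malformed output line. A minimal sketch of RFC 4180-style quoting follows; `escapeField` and `toCsvLine` are hypothetical helper names, not part of csv-parse.

```javascript
// Quote a single field only when it needs it: wrap it in double quotes
// and double any embedded quotes, as RFC 4180 describes.
function escapeField(value) {
  const s = String(value);
  return /[",\r\n]/.test(s) ? '"' + s.replace(/"/g, '""') + '"' : s;
}

// Serialize one parsed row (an array of fields) into a CSV line.
function toCsvLine(row) {
  return row.map(escapeField).join(",") + "\n";
}

console.log(JSON.stringify(toCsvLine(["a", "b,c", 'say "hi"'])));
// "a,\"b,c\",\"say \"\"hi\"\"\"\n"
```

Swapping `row.join(",") + "\n"` for `toCsvLine(row)` would add per-field work, so it trades some speed for output that round-trips.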
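
13  zhouyin (OP)  244 days ago
At first I wrote it almost exactly the way you did. To my surprise the speed did not improve, so I changed it to that other version, assuming the write side had a buffer.

I ran your code without changing a single character. It took over a minute, nowhere near Python.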

```bash
-bash-4.2# time node a.js
finished

real 1m3.579s
user 1m4.103s
sys 0m2.478s
```
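
14  henbf  244 days ago
@zhouyin It also depends on what processing you do on each CSV row. Your Python script just reads and writes with no extra processing, which amounts to a copy. In Node.js you convert every row into an array and then convert the array back into a string when writing, so of course it is slow.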

```javascript
const { createReadStream, createWriteStream } = require("fs");

const inputPath = "../outpy.csv";
const outputPath = "./test.txt";

// Read in 256 KiB chunks instead of the 64 KiB default for fs read streams.
const readStream = createReadStream(inputPath, { highWaterMark: 256 * 1024 });
const writeStream = createWriteStream(outputPath, { flags: "a" });

// pipe() copies raw bytes and handles backpressure; no CSV parsing happens.
readStream.pipe(writeStream);

readStream.on("end", () => {
  console.log("finished");
  // pipe() already ends the destination by default, so this call is redundant.
  writeStream.end();
});

readStream.on("error", (err) => {
  console.error("Error reading file:", err);
});

writeStream.on("error", (err) => {
  console.error("Error writing file:", err);
});
```

15  zhouyin (OP)  244 days ago via Android
@henbf
Python returns each row as an array too, and what it writes is also an array.

16  zhouyin (OP)  244 days ago
@henbf

I tried yet another library, csvwriter, and it was painfully slow.

Python's libraries are simply well designed. You have to give it that.

17  zhouyin (OP)  244 days ago
@zhouyin
With csvwriter it took over 3 minutes:

```bash
-bash-4.2# time node a.js
finished

real 3m45.028s
user 4m12.751s
sys 2m59.847s
```

18  henbf  244 days ago
@zhouyin Fine, Node.js is not suited to parsing CSV, and Python is awesome.
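
19  stabc  244 days ago
1. Parsing CSV means splitting and reassembling text one character at a time. Low-level languages have a decisive advantage here because they can use the data in place by position, whereas Node creates a new string object every time.

2. Python's standard library ships a csv module, so the work there also runs in low-level code. That it is still so much slower than Go suggests it is not written very well.

3. I just ran a quick test: if you optimize the parsing in Node and cut down on string concatenation, the total time to parse a 400 MB CSV file can be brought under 5 seconds.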
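
The test code behind point 3 is not shown, so the following is only a sketch of the general idea in point 1: scan the raw bytes of each chunk and track positions, without building a string per field. `scanChunk` and `scanFile` are hypothetical names, the state object is an assumption of this example, and the scanner uses the strict rule that every double quote toggles quoted mode, so it assumes a well-formed file. No timing is claimed for it.

```javascript
const { createReadStream } = require("fs");

const COMMA = 0x2c;   // ","
const NEWLINE = 0x0a; // "\n"
const QUOTE = 0x22;   // '"'

// Count rows and fields in one Buffer chunk without allocating strings.
// `state` carries over between chunks, so a quoted field may span chunks.
// These three byte values never occur inside a multi-byte UTF-8 sequence,
// so scanning bytes is safe for UTF-8 input.
function scanChunk(buf, state) {
  for (let i = 0; i < buf.length; i++) {
    const b = buf[i];
    if (b === QUOTE) {
      // An escaped quote ("") toggles twice, leaving the state unchanged.
      state.inQuotes = !state.inQuotes;
    } else if (!state.inQuotes) {
      if (b === COMMA) {
        state.fields++;
      } else if (b === NEWLINE) {
        state.fields++;
        state.rows++;
      }
    }
  }
  return state;
}

// Stream a file through scanChunk; chunks arrive as Buffers by default.
async function scanFile(path) {
  const state = { rows: 0, fields: 0, inQuotes: false };
  for await (const chunk of createReadStream(path)) {
    scanChunk(chunk, state);
  }
  return state;
}
```

A real parser would also record the start and end offsets of each field and decode only the fields it needs, for example with `buf.toString("utf8", start, end)`.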

20  gesse  243 days ago
@henbf Hahaha

21  julyclyde  242 days ago
@stabc Why? Just because something is "in the standard library", it must be "low-level"?
https://github.com/python/cpython/blob/main/Lib/csv.py
Python's csv module is pure Python, not C.

22  stabc  242 days ago
@julyclyde What you linked is the interface layer. The low-level part is here: https://github.com/python/cpython/blob/main/Modules/_csv.c

23  julyclyde  241 days ago
@stabc Thanks for the correction. I'll go read up on it.