
Python: logic problem matching each line of file B against the contents of file A

aaa5838769 · 2020-05-26 21:59:26 +08:00 · 2718 clicks

File A is a JSON file: a.txt

    { "_id": "113.254.82.124", "_index": "fofapro_subdomain", "header": "HTTP/1.1 401 Unauthorized\r\nConnection: close\r\nContent-Length: 195\r\nCache-Control: no-cache\r\nContent-Type: text/html\r\nDate: Sat, 20 Oct 2018 15:59:44 GMT\r\nEtag: \"0-29d-b90\"\r\nServer: Embedthis-Appweb/3.3.1\r\nWww-Authenticate: Basic realm=\"DCS-2530L\"\r\nX-Frame-Options: SAMEORIGIN\r\n", } { "_id": "http://10.254.82.12", "_index": "fofapro_subdomain", "header": "HTTP/1.1 401 Unauthorized\r\nConnection: close\r\nContent-Length: 195\r\nCache-Control: no-cache\r\nContent-Type: text/html\r\nDate: Sat, 20 Oct 2018 15:59:44 GMT\r\nEtag: \"0-29d-b90\"\r\nServer: Embedthis-Appweb/3.3.1\r\nWww-Authenticate: Basic realm=\"DCS-2530L\"\r\nX-Frame-Options: SAMEORIGIN\r\n", } { "_id": "https://192.168.1.10:9090", "_index": "fofapro_subdomain", "header": "HTTP/1.1 401 Unauthorized\r\nConnection: close\r\nContent-Length: 195\r\nCache-Control: no-cache\r\nContent-Type: text/html\r\nDate: Sat, 20 Oct 2018 15:59:44 GMT\r\nEtag: \"0-29d-b90\"\r\nServer: Embedthis-Appweb/3.3.1\r\nWww-Authenticate: Basic realm=\"DCS-2530L\"\r\nX-Frame-Options: SAMEORIGIN\r\n", } { "_id": "127.0.0.1:8343", "_index": "fofapro_subdomain", "header": "HTTP/1.1 401 Unauthorized\r\nConnection: close\r\nContent-Length: 195\r\nCache-Control: no-cache\r\nContent-Type: text/html\r\nDate: Sat, 20 Oct 2018 15:59:44 GMT\r\nEtag: \"0-29d-b90\"\r\nServer: Embedthis-Appweb/3.3.1\r\nWww-Authenticate: Basic realm=\"DCS-2530L\"\r\nX-Frame-Options: SAMEORIGIN\r\n", } 

File B: b.txt

127.0.01
192.168.1.10
192.168.88.88

Code

import re
import json

def filesJson(filepath, dstpaths):
    datas = set()
    # regex match
    rule = re.compile('^[a-zA-z]{1}.*$')
    with open(filepath, 'r', encoding='UTF-8') as a, open(dstpaths, 'r', encoding='UTF-8') as b:
        b.seek(0)
        for realine_a in a:
            json_datas = json.loads(realine_a)
            ips = json_datas['_id']
            if rule.findall(ips):
                ips = ips.strip("http[s]?://")
                ips = ips.split(":")[0]
                datas.add(ips)
                for realine_b in b:
                    if realine_b in datas:
                        print(realine_b)
                    else:
                        break

if __name__ == '__main__':
    file_paths = "a.txt"
    dstpaths = 'b.txt'
    filesJson(file_paths, dstpaths)

My idea is to take the IPs from file A, strip the protocol and port so only the bare IP is left, and write them into a set; then check each entry of file B against that set. If the IP is found, the corresponding line of file A should be written to file C. The problem right now is that nothing in file B matches the data from file A. Also, if file A grows to a few million lines and file B to tens of thousands of lines, is this logic going to be a serious problem?

Appendix 1 · 2020-05-27 19:16:27 +08:00
Many thanks for all the replies. I also found another pitfall when extracting the data from file B: every line carries a trailing \n newline, and that also breaks the matching logic.
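Putting the thread's points together, here is a minimal sketch of the intended flow (build a set of bare IPs from b.txt, make one pass over a.txt, write matching lines to c.txt). It assumes each line of a.txt parses as JSON and each line of b.txt is one IP; the regex and the bare_ip helper are illustrative, not the OP's code:

import json
import re

# Assumption: one JSON object per line in a.txt, one IP per line in b.txt.
SCHEME_PORT = re.compile(r'^(?:https?://)?([^:/]+)')   # keep only the host/IP part

def bare_ip(value):
    # 'https://192.168.1.10:9090' -> '192.168.1.10', '127.0.0.1:8343' -> '127.0.0.1'
    m = SCHEME_PORT.match(value)
    return m.group(1) if m else value

with open('b.txt', encoding='utf-8') as f:
    wanted = {line.strip() for line in f if line.strip()}   # strip the trailing \n

with open('a.txt', encoding='utf-8') as a, open('c.txt', 'w', encoding='utf-8') as c:
    for line in a:
        if bare_ip(json.loads(line)['_id']) in wanted:
            c.write(line)   # keep the original a.txt line

Set membership tests are O(1) on average, so a few million lines in A against tens of thousands of IPs in B is not a scale problem; the pass over A stays linear.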
20 replies · 2020-05-27 22:32:21 +08:00
1 · F281M6Dh8DXpD1g2 · 2020-05-26 22:01:24 +08:00 via iPhone
Just throw pyspark at it.
2 · aaa5838769 (OP) · 2020-05-26 22:03:16 +08:00
@liprais Thanks for the reply, but with my current code logic I can't match any data at all.
3 · telnetning · 2020-05-26 22:13:38 +08:00 via Android
Double-check that each JSON really is on one line; take another look at the rule regex; and just set a breakpoint and step through it.
4 · noqwerty · 2020-05-26 22:14:51 +08:00 via Android
Everything else aside, scanning b once for every record you add makes no sense. Once you have the two sets, just take their intersection.
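For illustration, a tiny example of that suggestion with made-up sets (the values are taken from the samples in this thread):

ips_from_a = {'113.254.82.124', '10.254.82.12', '192.168.1.10', '127.0.0.1'}   # bare IPs extracted from a.txt
ips_from_b = {'127.0.01', '192.168.1.10', '192.168.88.88'}                     # lines of b.txt, newline stripped

print(ips_from_a & ips_from_b)   # set intersection -> {'192.168.1.10'}

Note that '127.0.01' and '127.0.0.1' are different strings, so only 192.168.1.10 survives; a single stray character or trailing newline is enough to empty the intersection.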
5 · aaa5838769 (OP) · 2020-05-26 22:30:18 +08:00 via iPhone
@noqwerty I tried the intersection and still got no data.
6 · aaa5838769 (OP) · 2020-05-26 22:32:47 +08:00 via iPhone
@telnetning Each JSON is not on a single line; one JSON record holds a lot of data and I only pulled out a few fields. My regex is mainly there to match the entries that start with http; I then strip the protocol, cut off the port if there is one, and store the result in the set.
7 · AFlash · 2020-05-26 22:39:52 +08:00
Try removing the "else: break" part.
8 · noqwerty · 2020-05-26 22:39:55 +08:00
@aaa5838769 #6 If they're not on one line, doesn't your json.loads(realine_a) raise an error?
9 · ipwx · 2020-05-26 22:40:27 +08:00
@aaa5838769 ...

b_set = set(filter(lambda s: s, (l.strip() for l in open('b.txt', 'r', encoding='utf-8'))))
with open('a.txt', 'r', encoding='utf-8') as a_file, open('c.txt', 'r', encoding='utf-8') as c_file:
    for s in a_file:
        # do what you have to do to get s_ip
        if s_ip in b_set:
            c_file.write(s_ip + '\n')
10 · ipwx · 2020-05-26 22:40:59 +08:00
    open('c.txt', 'r', encoding='utf-8') as c_file => open('c.txt', 'w', encoding='utf-8') as c_file:
11 · ipwx · 2020-05-26 22:41:41 +08:00
OP, for what you need, do it like my snippet above: turn b into a set and make a single pass over a. I don't see what you're agonizing over.
12 · xujunfu · 2020-05-26 22:42:17 +08:00 via iPhone
Pull all the IPs out of file b, convert each IP to a long integer and sort the array; then walk through file a, convert each IP to a long as well, and binary-search whether it is in the long_ips array.
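A rough sketch of that idea, assuming well-formed IPv4 strings in b.txt (the ip_to_long and contains helpers are illustrative names):

import bisect

def ip_to_long(ip):
    # '192.168.1.10' -> 3232235786
    a, b, c, d = (int(part) for part in ip.split('.'))
    return (a << 24) | (b << 16) | (c << 8) | d

with open('b.txt', encoding='utf-8') as f:
    long_ips = sorted(ip_to_long(line.strip()) for line in f if line.strip())

def contains(ip):
    # binary search in the sorted long_ips array
    target = ip_to_long(ip)
    i = bisect.bisect_left(long_ips, target)
    return i < len(long_ips) and long_ips[i] == target

Compared with a plain Python set of strings this is mostly a memory optimisation; the set lookup is already O(1) on average, while the bisect lookup is O(log n).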
13 · aaa5838769 (OP) · 2020-05-26 22:46:30 +08:00 via iPhone
@noqwerty Sorry, they are on one line; my brain short-circuited. I had specially formatted the sample before copying it into the forum.
14 · aaa5838769 (OP) · 2020-05-26 22:51:49 +08:00 via iPhone
@ipwx I tried intersecting two sets earlier and that didn't match any data either, which is why I switched to this other logic.
15 · wuwukai007 · 2020-05-26 23:06:16 +08:00
Is file a formatted like this?
{'_id': xxx} (then a new line)
{'_id': xxx} (then a new line)
16 · aaa5838769 (OP) · 2020-05-26 23:14:45 +08:00 via iPhone
@wuwukai007 Yes.
17 · rrfeng · 2020-05-26 23:41:47 +08:00 via Android
The best practice here is:

read b completely up front, build a hash (dict) with the IP as the key, and keep it in memory;

then parse a with streaming JSON processing and look each record up in that hash table directly.
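A minimal sketch of that approach with the thread's file names; since a.txt holds one JSON object per line, "streaming" here just means reading it line by line, and the dict value below merely counts hits (it could as well be left as None):

import json

# build the hash table from b.txt: IP as key, hit counter as value
ip_table = {}
with open('b.txt', encoding='utf-8') as f:
    for line in f:
        ip = line.strip()
        if ip:
            ip_table[ip] = 0

with open('a.txt', encoding='utf-8') as a, open('c.txt', 'w', encoding='utf-8') as c:
    for line in a:
        ip = json.loads(line)['_id'].split('://')[-1].split(':')[0]   # drop scheme and port
        if ip in ip_table:            # O(1) dict lookup
            ip_table[ip] += 1
            c.write(line)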
18 · wuwukai007 · 2020-05-27 00:24:58 +08:00
a.txt format:
{xxxxxxxx}\n
{xxxxxxxx}\n
b.txt format:
192.168.1.10
192.168.1.10

    import pandas as pd
    import re
    df = pd.read_csv(r'a.txt',header=None)
    df.rename(columns={0:'ip'},inplace=True)
    regex = '(([01]{0,1}\d{0,1}\d|2[0-4]\d|25[0-5])\.){3}([01]{0,1}\d{0,1}\d|2[0-4]\d|25[0-5])'
    df['ip_host'] = df.ip.str.replace(rf'^((?!{regex}).)*','').str.replace('(:\d+)|"','')
    df2 = pd.read_csv(r"b.txt",sep='\n',header=None)
    df2.columns = ['ip_host']
    df3 = df2.merge(df,on='ip_host',how='inner')
    df3.drop('ip_host',axis=1,inplace=True)
    df3.to_csv(r"c.txt",index=None,header=None,quoting=None,quotechar="'")
19 · wuwukai007 · 2020-05-27 00:26:09 +08:00
a.txt format:
{xxxxxxxx} (newline)
{xxxxxxxx} (newline)
b.txt format:
192.168.1.10 (newline)
192.168.1.10 (newline)

    import pandas as pd
    import re
    df = pd.read_csv(r'a.txt',header=None)
    df.rename(columns={0:'ip'},inplace=True)
    regex = '(([01]{0,1}\d{0,1}\d|2[0-4]\d|25[0-5])\.){3}([01]{0,1}\d{0,1}\d|2[0-4]\d|25[0-5])'
    df['ip_host'] = df.ip.str.replace(rf'^((?!{regex}).)*','').str.replace('(:\d+)|"','')
    df2 = pd.read_csv(r"b.txt",sep='\n',header=None)
    df2.columns = ['ip_host']
    df3 = df2.merge(df,on='ip_host',how='inner')
    df3.drop('ip_host',axis=1,inplace=True)
    df3.to_csv(r"c.txt",index=None,header=None,quoting=None,quotechar="'")

Output c.txt:
{xxxxxxxx} (newline)
{xxxxxxxx} (newline)
20 · aaa5838769 (OP) · 2020-05-27 22:32:21 +08:00
@rrfeng Is the value just left empty?