寻求一个概率型计数器,效果类似 BloomFilter + Counter,能告诉我每个 item 大约出现了多少次 - V2EX
V2EX = way to explore
V2EX 是一个关于分享和探索的地方
现在注册
已注册用户请  登录

寻求一个概率型计数器,效果类似 BloomFilter + Counter,能告诉我每个 item 大约出现了多少次

  •  
  •   RedisMasterNode 2023-05-05 18:31:27 +08:00 1383 次点击
    这是一个创建于 899 天前的主题,其中的信息可能已经有所发展或是发生改变。

    背景

    在一些流处理场景中,因为数据太多,不想把所有东西都持久化保存,所以需要一些采样的手段。已有的采样策略会按照例如“是否包含错误”、“是否 bla bla bla”来采样,这些都是可以的。

    目前在寻求新的思路,尽可能覆盖到不同的数据,例如所有数据的可能组合1 亿种,那每种出现的频率可能不一样,如果用一个 map 来计数的话很精确,但是会用很多内存,例如:

    {"组合_uniq_id_1": 999, "组合_uniq_id_2": 12, ..., "组合_uniq_id_192818939291": 21} 

    需求

    BloomFilter 可以把 1 亿种可能映射到一个 BitMap 上,空间很小,但是可以告诉我这种组合是否出现过。如果我不光想知道某种组合(可能)出现过,还想大致了解一下它出现过多少次,有没有合适的数据结构满足这个要求呢?

    8 条回复    2023-05-06 11:47:39 +08:00
    lxy42
        1
    lxy42  
       2023-05-05 19:00:18 +08:00 via Android
    hyperloglog ,Redis 就支持
    aphorism
        2
    aphorism  
       2023-05-05 19:30:55 +08:00   1
    CM Sketch or Counting Bloom filter.
    RedisMasterNode
        3
    RedisMasterNode  
    OP
       2023-05-05 20:11:35 +08:00
    @lxy42 HyperLogLog 是不行的,HyperLogLog 能告诉我某个东西,例如 “今日见过的 IP”,里面有多少个不同的 item ,我要的不是不同的次数,每个不同的 item 出现的次数
    RedisMasterNode
        4
    RedisMasterNode  
    OP
       2023-05-05 20:11:52 +08:00
    @aphorism 谢谢,搜了一下很接近了,我详细了解下
    matrix1010
        5
    matrix1010  
       2023-05-05 21:34:19 +08:00   1
    如果你需要比较精确而空间又有限的话 Count-min Sketch 不一定合适,通常来说使用 Count-min Sketch 来进行相对比较,比如 LFU cache 。可以看看这个: https://redis.com/blog/count-min-sketch-the-art-and-science-of-estimating-stuff/ 里面的 Count-min Sketch example 和 What is Count-min Sketch good for?部分
    lxy42
        6
    lxy42  
       2023-05-05 22:31:12 +08:00 via Android
    @RedisMasterNode 我记混了,看到 bloom filter 和你 ID 里的 Redis ,就想起了 hyperloglog
    RedisMasterNode
        7
    RedisMasterNode  
    OP
       2023-05-05 22:55:40 +08:00
    @matrix1010 非常有帮助,感谢
    RedisMasterNode
        8
    RedisMasterNode  
    OP
       2023-05-06 11:47:39 +08:00
    @matrix1010
    @aphorism
    谢谢两位,信息很有用,Redis 里面已经有拓展的模块可以试验一下。
    同时还有几种概率型数据结构的“一句话用途描述”,快速理解: https://redis.io/docs/stack/bloom/

    Use these data structures to answer a set of common questions concerning data streams:

    - HyperLogLog: How many unique values appeared so far in the data stream?
    - Bloom filter and Cuckoo filter: Did the value v already appear in the data stream?
    - Count-min sketch: How many times did the value v appear in the data stream?
    - Top-K: What are the k most frequent values in the data stream?
    - t-digest can be used to answer these questions:
    - - What fraction of the values in the data stream are smaller than a given value?
    - - How many values in the data stream are smaller than a given value?
    - - Which value is smaller than p percent of the values in the data stream? (what is the p-percentile value)?
    - - What is the mean value between the p1-percentile value and the p2-percentile value?
    - - What is the value of the n-th smallest / largest value in the data stream? (what is the value with [reverse] rank n)?
    关于     帮助文档     自助推广系统     博客     API     FAQ     Solana     2955 人在线   最高记录 6679       Select Language
    创意工作者们的社区
    World is powered by solitude
    VERSION: 3.9.8.5 25ms UTC 13:28 PVG 21:28 LAX 06:28 JFK 09:28
    Do have faith in what you're doing.
    ubao msn snddm index pchome yahoo rakuten mypaper meadowduck bidyahoo youbao zxmzxm asda bnvcg cvbfg dfscv mmhjk xxddc yybgb zznbn ccubao uaitu acv GXCV ET GDG YH FG BCVB FJFH CBRE CBC GDG ET54 WRWR RWER WREW WRWER RWER SDG EW SF DSFSF fbbs ubao fhd dfg ewr dg df ewwr ewwr et ruyut utut dfg fgd gdfgt etg dfgt dfgd ert4 gd fgg wr 235 wer3 we vsdf sdf gdf ert xcv sdf rwer hfd dfg cvb rwf afb dfh jgh bmn lgh rty gfds cxv xcv xcs vdas fdf fgd cv sdf tert sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf shasha9178 shasha9178 shasha9178 shasha9178 shasha9178 liflif2 liflif2 liflif2 liflif2 liflif2 liblib3 liblib3 liblib3 liblib3 liblib3 zhazha444 zhazha444 zhazha444 zhazha444 zhazha444 dende5 dende denden denden2 denden21 fenfen9 fenf619 fen619 fenfe9 fe619 sdf sdf sdf sdf sdf zhazh90 zhazh0 zhaa50 zha90 zh590 zho zhoz zhozh zhozho zhozho2 lislis lls95 lili95 lils5 liss9 sdf0ty987 sdft876 sdft9876 sdf09876 sd0t9876 sdf0ty98 sdf0976 sdf0ty986 sdf0ty96 sdf0t76 sdf0876 df0ty98 sf0t876 sd0ty76 sdy76 sdf76 sdf0t76 sdf0ty9 sdf0ty98 sdf0ty987 sdf0ty98 sdf6676 sdf876 sd876 sd876 sdf6 sdf6 sdf9876 sdf0t sdf06 sdf0ty9776 sdf0ty9776 sdf0ty76 sdf8876 sdf0t sd6 sdf06 s688876 sd688 sdf86