Learning-augmented count-min sketches via Bayesian nonparametrics
Emanuele Dolera, Stefano Favaro, Stefano Peluchetti; 24(12):1−60, 2023.
Abstract
The count-min sketch (CMS) is an efficient randomized data structure that estimates the frequencies of tokens in a data stream (point queries) based on randomly hashed data. Recently, a learning-augmented version of the CMS, called CMS-DP, was proposed by Cai, Mitzenmacher, and Adams (NeurIPS 2018). CMS-DP incorporates Bayesian nonparametric (BNP) modeling of the data stream of tokens through a Dirichlet process (DP) prior, and estimates point queries as mean functionals of the posterior distribution of the point query given the hashed data. While CMS-DP improves over CMS in certain respects, it relies on a “constructive” proof tailored to the DP prior, making it unsuitable for other nonparametric priors.
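To fix ideas, the classical (non-Bayesian) CMS point-query estimate described above can be sketched as follows. This is a minimal illustrative implementation, not the authors' code; the class name, parameters, and salting scheme are our own choices.

```python
import random


class CountMinSketch:
    """Minimal count-min sketch: `depth` rows of `width` counters,
    each row indexed by its own (salted) hash function."""

    def __init__(self, width, depth, seed=0):
        self.width, self.depth = width, depth
        rng = random.Random(seed)
        # One random salt per row yields `depth` distinct hash functions.
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.tables = [[0] * width for _ in range(depth)]

    def _index(self, row, token):
        return hash((self.salts[row], token)) % self.width

    def update(self, token, count=1):
        # Increment one counter per row for the hashed token.
        for r in range(self.depth):
            self.tables[r][self._index(r, token)] += count

    def query(self, token):
        # Classical CMS point-query estimate: the minimum over rows.
        # Hash collisions only inflate counters, so this is an
        # upper bound on the true frequency.
        return min(self.tables[r][self._index(r, token)]
                   for r in range(self.depth))
```

For example, after inserting a stream, `query(token)` never underestimates the token's true frequency; learning-augmented variants such as CMS-DP replace this min rule with a posterior mean given the same hashed counts.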
In this paper, we present a “Bayesian” proof of CMS-DP that can be applied under the popular Pitman-Yor process (PYP) prior. The PYP is a more flexible extension of the DP prior, accommodating tail behaviors ranging from geometric tails to heavy power-law tails. This result enables the development of a novel learning-augmented CMS, called CMS-PYP, designed for power-law data streams. CMS-PYP applies BNP modeling of the data stream of tokens with a PYP prior, and, by adapting the arguments of the “Bayesian” proof of CMS-DP to the PYP prior, computes the posterior distribution of a point query given the hashed data.
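The power-law behavior that distinguishes the PYP from the DP can be illustrated by sampling from the two-parameter Chinese restaurant process, the sequential scheme induced by the PYP (discount 0 recovers the DP). This is a hedged sketch for intuition only; the function name and parameterization are ours, not the paper's.

```python
import random


def pyp_crp_counts(n, discount, strength, seed=0):
    """Seat n customers by the two-parameter Chinese restaurant process
    (Pitman-Yor predictive rule); returns the table occupancy counts.
    discount=0 recovers the Dirichlet-process CRP."""
    rng = random.Random(seed)
    counts = []  # customers per table (i.e., frequency of each distinct token)
    for i in range(n):
        k = len(counts)
        # New table with probability (strength + discount * k) / (strength + i);
        # existing table j with probability (counts[j] - discount) / (strength + i).
        if i == 0 or rng.random() < (strength + discount * k) / (strength + i):
            counts.append(1)
        else:
            r = rng.uniform(0, i - discount * k)
            for j, c in enumerate(counts):
                r -= c - discount
                if r < 0:
                    counts[j] += 1
                    break
            else:  # guard against floating-point spillover
                counts[-1] += 1
    return counts
```

With `discount > 0` the number of distinct tokens grows polynomially in the stream length (a power law), whereas with `discount = 0` (the DP) it grows only logarithmically; this is the regime in which CMS-PYP is designed to help.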
Experimental results on synthetic and real textual data show that CMS-PYP outperforms both CMS and CMS-DP in estimating low-frequency tokens, which are of critical interest in textual data analysis, and is competitive with a variation of CMS designed specifically for low-frequency token estimation. We also discuss extending our BNP approach to more general queries, such as range queries.