Space-efficient representation of genomic k-mer count tables - Archive ouverte HAL Access content directly
Journal Articles Algorithms for Molecular Biology Year : 2022

Space-efficient representation of genomic k-mer count tables

, (1) ,
1
Yoshihiro Shibuya
  • Function : Author
Djamal Belazzougui
  • Function : Author
Gregory Kucherov
  • Function : Author

Abstract

Motivation: k-mer counting is a common task in bioinformatic pipelines, with many dedicated tools available. Many of these tools produce in output k-mer count tables containing both k-mers and counts, easily reaching tens of GB. Furthermore, such tables do not support efficient random-access queries in general. Results: In this work, we design an efficient representation of k-mer count tables supporting fast random-access queries. We propose to apply Compressed Static Functions (CSFs), with space proportional to the empirical zero-order entropy of the counts. For very skewed distributions, like those of k-mer counts in whole genomes, the only currently available implementation of CSFs does not provide a compact enough representation. By adding a Bloom filter to a CSF we obtain a Bloom-enhanced CSF (BCSF) effectively overcoming this limitation. Furthermore, by combining BCSFs with minimizer-based bucketing of k-mers, we build even smaller representations breaking the empirical entropy lower bound, for large enough k. We also extend these representations to the approximate case, gaining additional space. We experimentally validate these techniques on k-mer count tables of whole genomes (E.Coli and C.Elegans) and unassembled reads, as well as on k-mer document frequency tables for 29 E.Coli genomes. In the case of exact counts, our representation takes about a half of the space of the empirical entropy, for large enough k's.
Fichier principal
Vignette du fichier
locom.pdf (514.26 Ko) Télécharger le fichier
Origin : Publisher files allowed on an open archive

Dates and versions

hal-03867526 , version 1 (23-11-2022)

Identifiers

Cite

Yoshihiro Shibuya, Djamal Belazzougui, Gregory Kucherov. Space-efficient representation of genomic k-mer count tables. Algorithms for Molecular Biology, 2022, 17 (1), pp.5. ⟨10.1186/s13015-022-00212-0⟩. ⟨hal-03867526⟩

Collections

CNRS
0 View
0 Download

Altmetric

Share

Gmail Facebook Twitter LinkedIn More