Efficient reconciliation of genomic datasets of high similarity

Yoshihiro Shibuya; Djamal Belazzougui; Gregory Kucherov

doi:10.4230/LIPIcs.WABI.2022.14

Conference paper, Year: 2022


Yoshihiro Shibuya

Role: Author

Laboratoire d'Informatique Gaspard-Monge

Djamal Belazzougui

Role: Author

DTISI

Gregory Kucherov

Role: Author
PersonId: 14903
IdHAL: gregory-kucherov
ORCID: 0000-0001-5899-5424
IdRef: 093602189

Laboratoire d'Informatique Gaspard-Monge

Abstract

We apply Invertible Bloom Lookup Tables (IBLTs) to the comparison of k-mer sets originating from large DNA sequence datasets. We show that for similar datasets, IBLTs provide a more space-efficient and, at the same time, more accurate method for estimating the Jaccard similarity of the underlying k-mer sets, compared to MinHash, a go-to sketching technique for efficient pairwise similarity estimation. This is achieved by combining IBLTs with k-mer sampling based on syncmers, which constitute a context-independent alternative to minimizers and provide an unbiased estimator of Jaccard similarity. A key property of our method is that the data structures involved require space proportional to the difference of the k-mer sets and independent of the size of the sets themselves. As another application, we show how our ideas can be applied to efficiently compute (an approximation of) the k-mers that differ between two datasets, still using space proportional only to their number. We experimentally illustrate our results on both simulated and real data (SARS-CoV-2 and Streptococcus pneumoniae genomes).
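As an illustration of the idea in the abstract, the following is a minimal sketch of IBLT-based set reconciliation, not the authors' implementation. The class `IBLT`, the helpers `kmers` and `jaccard_from_difference`, the parameters `m` (number of cells) and `r` (hash functions per key), and the use of BLAKE2b as the hash are all assumptions made for this example; syncmer sampling is omitted. Each set is inserted into its own table, the tables are subtracted cell by cell so that shared k-mers cancel, and the remaining keys are peeled out. With the two set sizes and the recovered difference d = |A Δ B|, the Jaccard similarity follows from J = (|A| + |B| − d) / (|A| + |B| + d).

```python
import hashlib


class IBLT:
    """Toy Invertible Bloom Lookup Table over non-negative integer keys (illustrative only)."""

    def __init__(self, m=120, r=3):
        assert m % r == 0, "m must be a multiple of r"
        self.m, self.r = m, r
        self.count = [0] * m      # signed number of keys hashed to each cell
        self.key_xor = [0] * m    # XOR of the keys in each cell
        self.chk_xor = [0] * m    # XOR of the key checksums in each cell

    @staticmethod
    def _h(key, salt):
        data = f"{salt}:{key}".encode()
        return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

    def _cells(self, key):
        # One cell per partition, so the r cells of a key are always distinct.
        part = self.m // self.r
        return [i * part + self._h(key, i) % part for i in range(self.r)]

    def _update(self, key, delta):
        chk = self._h(key, "chk")
        for j in self._cells(key):
            self.count[j] += delta
            self.key_xor[j] ^= key
            self.chk_xor[j] ^= chk

    def insert(self, key):
        self._update(key, 1)

    def subtract(self, other):
        """Cell-wise difference; keys present in both tables cancel out."""
        out = IBLT(self.m, self.r)
        for j in range(self.m):
            out.count[j] = self.count[j] - other.count[j]
            out.key_xor[j] = self.key_xor[j] ^ other.key_xor[j]
            out.chk_xor[j] = self.chk_xor[j] ^ other.chk_xor[j]
        return out

    def peel(self):
        """Decode a difference table in place: (only_in_first, only_in_second, success)."""
        first, second = set(), set()
        progress = True
        while progress:
            progress = False
            for j in range(self.m):
                pure = self.count[j] in (1, -1) and \
                    self.chk_xor[j] == self._h(self.key_xor[j], "chk")
                if pure:  # the cell holds exactly one key: extract and remove it
                    key, sign = self.key_xor[j], self.count[j]
                    (first if sign == 1 else second).add(key)
                    self._update(key, -sign)
                    progress = True
        success = not any(self.count) and not any(self.key_xor) and not any(self.chk_xor)
        return first, second, success


def kmers(seq, k):
    """Set of k-mers of an ACGT string, each packed into an integer (2 bits per base)."""
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    out = set()
    for i in range(len(seq) - k + 1):
        value = 0
        for ch in seq[i:i + k]:
            value = value * 4 + code[ch]
        out.add(value)
    return out


def jaccard_from_difference(size_a, size_b, diff):
    """Jaccard similarity from the set sizes and the size of the symmetric difference."""
    return (size_a + size_b - diff) / (size_a + size_b + diff)


if __name__ == "__main__":
    # Two made-up sequences that differ by a single substitution.
    a = kmers("ACGTTGCATGGATCCA", 4)
    b = kmers("ACGTTGCATGCATCCA", 4)
    ta, tb = IBLT(), IBLT()
    for x in a:
        ta.insert(x)
    for x in b:
        tb.insert(x)
    only_a, only_b, ok = ta.subtract(tb).peel()
    if ok:
        print(jaccard_from_difference(len(a), len(b), len(only_a) + len(only_b)))
    else:
        print("peeling failed; a larger table is needed")
```

In this sketch the table size `m` has to be chosen in relation to the expected number of differing k-mers, not the total number of k-mers, which is the property the abstract highlights. Peeling can fail when the difference is too large for the chosen `m`, in which case the sketch reports failure instead of returning a partial result.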

Keywords

k-mers; counts; compression; Compressed Static Function; Bloom filter; phrases; k-mers; sketching; Invertible Bloom Lookup Tables; IBLT; MinHash; syncmers; minimizers

Domains

Data Structures and Algorithms [cs.DS]; Bioinformatics [q-bio.QM]

Main file

aldiff.pdf (661.98 KB)

Origin: Publisher files allowed on an open archive


https://cnrs.hal.science/hal-03867538

Submitted on: Wednesday, November 23, 2022, 13:42:20

Last modified on: Wednesday, November 15, 2023, 15:43:42

Long-term archiving on: Friday, February 24, 2023, 19:01:10

Dates and versions

hal-03867538, version 1 (23-11-2022)

Identifiers

HAL Id: hal-03867538, version 1
DOI: 10.4230/LIPIcs.WABI.2022.14

Cite

Yoshihiro Shibuya, Djamal Belazzougui, Gregory Kucherov. Efficient reconciliation of genomic datasets of high similarity. 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022), Sep 2022, Potsdam, Germany. ⟨10.4230/LIPIcs.WABI.2022.14⟩. ⟨hal-03867538⟩


Collections

ENPC CNRS PARISTECH LIGM LIGM_MOA UNIV-EIFFEL LIGM_ADA
