Application of N-Gram Based Distances to Genetic Texts Comparison

Kirzhner, Valery; Volkovich, Zeev

doi:10.1007/s12304-021-09442-y

Original Research
Published: 20 August 2021

Biosemiotics, volume 14, pages 271–285 (2021)

Abstract

The article discusses the possible “physical” meaning of a distance between genetic sequences that is based on comparing the sets of all words of fixed length (N-grams) occurring in two genomic sequences. The considered distances suitably describe phylogenetic relationships and allow genomes to be ranked by similarity in situations where alignment methods are practically inapplicable. A simulation shows that, for relatively small artificial evolutionary modifications, the distances between the N-gram distributions change almost linearly as genome length grows. In the general case of comparing two genetic texts, a function for “calibrating” the distance between N-gram distributions is found. This makes it possible to interpret the considered distances in terms of the number of elementary operations performed in an alignment process between the compared sequences.
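
As an illustration of the kind of quantity the abstract refers to, the following is a minimal sketch of a distance between the N-gram distributions of two sequences. The abstract does not state which distance the authors use, so the L1 (sum of absolute differences) distance between relative N-gram frequencies is an assumption made here, and the function names `ngram_distribution`, `l1_distance`, and `ngram_distance` are hypothetical.

```python
from collections import Counter


def ngram_distribution(seq, n):
    """Relative frequencies of all overlapping length-n words in seq."""
    total = len(seq) - n + 1
    if total <= 0:
        raise ValueError("sequence is shorter than n")
    counts = Counter(seq[i:i + n] for i in range(total))
    return {word: count / total for word, count in counts.items()}


def l1_distance(p, q):
    """Sum of absolute frequency differences over the union of words.

    Assumption: L1 is chosen only for illustration; the article may use
    a different distance between N-gram distributions.
    """
    return sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in p.keys() | q.keys())


def ngram_distance(a, b, n):
    """Alignment-free distance between two sequences via their N-grams."""
    return l1_distance(ngram_distribution(a, n), ngram_distribution(b, n))


if __name__ == "__main__":
    original = "ACGTACGTAC"
    mutated = "ACGTTCGTAC"  # one substitution at position 4 (A -> T)
    print(ngram_distance(original, original, 2))  # identical sequences: 0.0
    print(ngram_distance(original, mutated, 2))
```

With this choice the distance is 0 for identical sequences and 2 for sequences that share no N-gram, and no alignment of the two sequences is needed to compute it.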

Author information

Authors and Affiliations

Institute of Evolution, University of Haifa, 31905, Haifa, Israel
Valery Kirzhner
Software Engineering Department, ORT Braude College of Engineering, 21982, Karmiel, Israel
Zeev Volkovich

Corresponding author

Correspondence to Valery Kirzhner.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Kirzhner, V., Volkovich, Z. Application of N-Gram Based Distances to Genetic Texts Comparison. Biosemiotics 14, 271–285 (2021). https://doi.org/10.1007/s12304-021-09442-y

Received: 26 March 2021
Accepted: 14 July 2021
Published: 20 August 2021
Issue Date: August 2021
DOI: https://doi.org/10.1007/s12304-021-09442-y
