Application of N-Gram Based Distances to Genetic Texts Comparison

  • Original Research
  • Published in: Biosemiotics

Abstract

The article discusses the possible “physical” meaning of a distance between genetic sequences that is based on comparing the sets of all words of fixed length (N-grams) occurring in two genomic sequences. The considered distances suitably describe phylogenetic relationships and allow genomes to be ranked by similarity in situations where this is practically impossible with alignment methods. A simulation shows that, for relatively small artificial evolutionary modifications, the distances between the N-gram distributions change almost linearly as genome length grows. For the general case of comparing two genetic texts, a function for “calibrating” the distance between N-gram distributions is found. This makes it possible to interpret the considered distances in terms of the number of elementary operations performed in an alignment process between the compared sequences.
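
As a rough illustration of the kind of comparison the abstract describes, the sketch below builds the N-gram distribution of each sequence and measures how far apart the two distributions are. The abstract does not say which distance the authors use, so the L1 (sum of absolute differences) distance here is an assumption, and the function names `ngram_distribution` and `ngram_distance` are hypothetical.

```python
from collections import Counter
from typing import Dict


def ngram_distribution(seq: str, n: int) -> Dict[str, float]:
    """Relative frequency of every length-n word in seq (overlapping windows)."""
    total = len(seq) - n + 1
    if total <= 0:
        raise ValueError("sequence is shorter than n")
    counts = Counter(seq[i:i + n] for i in range(total))
    return {word: c / total for word, c in counts.items()}


def ngram_distance(a: str, b: str, n: int) -> float:
    """L1 distance between the N-gram distributions of a and b.

    Assumption: L1 is chosen only for illustration. The value is 0 when the
    distributions coincide and 2 when the sequences share no N-gram.
    """
    pa = ngram_distribution(a, n)
    pb = ngram_distribution(b, n)
    return sum(abs(pa.get(w, 0.0) - pb.get(w, 0.0)) for w in pa.keys() | pb.keys())


if __name__ == "__main__":
    # Two short toy sequences differing by a single substitution at the end.
    print(ngram_distance("ACGTACGT", "ACGTACGA", 2))
```

No alignment is computed at any point: only word counts are compared, which is why such distances remain usable where alignment is impractical.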



Author information


Corresponding author

Correspondence to Valery Kirzhner.


Cite this article

Kirzhner, V., Volkovich, Z. Application of N-Gram Based Distances to Genetic Texts Comparison. Biosemiotics 14, 271–285 (2021). https://doi.org/10.1007/s12304-021-09442-y


  • DOI: https://doi.org/10.1007/s12304-021-09442-y
