Abstract
The article discusses a possible “physical” meaning of the distance between genetic sequences obtained by comparing the sets of all words of fixed length (N-grams) occurring in two genomic sequences. Such distances describe phylogenetic relationships suitably and allow genomes to be ranked by similarity in situations where alignment methods are practically inapplicable. A simulation shows that, for relatively small artificial evolutionary modifications, the distances between N-gram distributions change almost linearly with growing genome length. For the general case of comparing two genetic texts, a function for “calibrating” the distance between N-gram distributions is found. This makes it possible to interpret the considered distances in terms of the number of elementary operations performed when aligning the compared sequences.
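As a rough illustration of the kind of comparison the abstract describes, the sketch below builds the N-gram frequency distribution of a sequence and measures how far two such distributions are apart. It is a minimal sketch under stated assumptions, not the authors' method: the L1 (sum of absolute differences) distance, the word length 8, the random "genome", and the helper names `ngram_distribution`, `l1_distance`, and `mutate` are all hypothetical choices made for the example, and point substitutions stand in for the "artificial evolutionary modifications".

```python
from collections import Counter
import random


def ngram_distribution(seq, n):
    """Relative frequencies of all overlapping words of length n in seq."""
    total = len(seq) - n + 1
    if total <= 0:
        raise ValueError("sequence is shorter than n")
    counts = Counter(seq[i:i + n] for i in range(total))
    return {word: c / total for word, c in counts.items()}


def l1_distance(p, q):
    """Sum of absolute frequency differences over the union of words.

    Assumption: L1 is used only for illustration; the source does not
    state which distance between distributions it relies on.
    """
    return sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in p.keys() | q.keys())


def mutate(seq, k, rng, alphabet="ACGT"):
    """Apply k point substitutions at k distinct random positions."""
    chars = list(seq)
    for pos in rng.sample(range(len(chars)), k):
        # Always pick a letter different from the current one.
        chars[pos] = rng.choice([a for a in alphabet if a != chars[pos]])
    return "".join(chars)


# Toy experiment: distance to the original as the number of edits grows.
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(20000))
base = ngram_distribution(genome, 8)
for k in (0, 50, 100, 200):
    mutated = mutate(genome, k, rng)
    print(k, round(l1_distance(base, ngram_distribution(mutated, 8)), 4))
```

Printing the distance next to the number of substitutions gives a feel for how an alignment-free distance could be read as a count of elementary edits, which is the interpretation the abstract aims at. The sketch does not reproduce the simulation or the calibration function of the article.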
Cite this article
Kirzhner, V., Volkovich, Z. Application of N-Gram Based Distances to Genetic Texts Comparison. Biosemiotics 14, 271–285 (2021). https://doi.org/10.1007/s12304-021-09442-y
DOI: https://doi.org/10.1007/s12304-021-09442-y