Abstract
This paper develops a construction-based dialectometry capable of identifying previously unknown constructions and measuring the degree to which a given construction is subject to regional variation. The central idea is to learn a grammar of constructions (a CxG) using construction grammar induction and then to use these constructions as features for dialectometry. This offers a method for measuring the aggregate similarity between regional CxGs without limiting in advance the set of constructions subject to variation. The learned CxG is evaluated on how well it describes held-out test corpora, while dialectometry is evaluated on how well it can model regional varieties of English. The method is tested using two distinct datasets: first, the International Corpus of English, representing eight outer circle varieties; second, a web-crawled corpus representing five inner circle varieties. Results show that the method (1) produces a grammar with stable quality across subsets of a single corpus that is (2) capable of distinguishing between regional varieties of English with a high degree of accuracy, thus (3) supporting dialectometric methods for measuring the similarity between varieties of English and (4) measuring the degree to which each construction is subject to regional variation. This is important for cognitive sociolinguistics because it operationalizes the idea that competition between constructions is organized at the functional level, so that dialectometry needs to represent as much of the available functional space as possible.
Acknowledgements
This research was supported in part by an appointment to the Visiting Scientist Fellowship at the National Geospatial-Intelligence Agency administered by the Oak Ridge Institute for Science and Education through an interagency agreement between the U.S. Department of Energy and NGA. The views expressed in this presentation are the author’s and do not imply endorsement by the DoD or the NGA.
Appendix
A Spatially-conditioned constructions
This appendix contains five of the top constructions for each region. The models ultimately depend on a large number of constructions, each of which has a relatively small degree of conditioning. A small number of highly predictive features for a region indicates a shallow model that is exploiting some irregularity in a small number of samples from that region (cf. Koppel et al. 2007). Thus, the top features listed here include only those with a feature weight below 0.02, a threshold that removes a very small number of unusually predictive features that occur infrequently. To aid interpretation of these representations, examples of the semantic domains they contain are given in Appendix B.
East Africa | Singapore |
---|---|
[<25> – adv – ‘that’] | [verb – ‘down’] |
[‘one’ – <25> – pron] | [‘my’ – adj] |
[‘out’ – ‘of’] | [det – verb – adv] |
[‘one’ – pron] | [det – <25> – ‘as’] |
[<25> – ‘from’ – noun] | [‘when’ – ‘the’] |

Hong Kong | Australia |
---|---|
[pron – verb – pron – noun] | [‘people’ – adp] |
[‘government’ – noun] | [<25> – ‘young’ – noun] |
[noun – noun – ‘is’] | [<47> – conj] |
[det – ‘world’] | [‘use’ – ‘of’] |
[‘do’ – <25> – verb] | [aux – ‘only’] |

India | Canada |
---|---|
[verb – pron – ‘is’] | [‘please’ – verb] |
[adp – pron – pron – verb] | [‘all’ – adp] |
[<25> – verb – ‘there’] | [<49> – noun – <25>] |
[adp – <25> – <25> – ‘this’] | [‘for’ – adj – noun – adp] |
[aux – ‘given’ – <25>] | [‘it’ – verb – det] |

Ireland | New Zealand |
---|---|
[‘’s’ – verb] | [‘high’ – <25>] |
[<25> – ‘and’ – pron – aux] | [<25> – ‘required’ – <25>] |
[‘’s’ – <25> – adp] | [<49> – aux] |
[‘say’ – <25>] | [‘you’ – ‘to’] |
[‘said’ – pron] | [‘or’ – adp – det] |

Jamaica | United Kingdom |
---|---|
[<25> – sconj – <25> – adv] | [‘are’ – verb – <25> – <25> – <25>] |
[‘end’ – ‘of’] | [‘taken’ – adp] |
[<25> – ‘in’ – noun – adp] | [‘down’ – <25>] |
[‘would’ – verb – <25> – <25> – <25>] | [<25> – ‘this’ – verb] |
[adp – ‘a’ – <25> – <25> – det] | [‘range’ – adp] |

Nigeria | South Africa |
---|---|
[noun – <96>] | [‘you’ – ‘to’] |
[sconj – ‘are’] | [det – ‘world’] |
[noun – ‘from’ – <25>] | [<25> – <39> – <25>] |
[‘of’ – ‘and’] | [‘where’ – pron – <25>] |
[adp – ‘people’] | [‘your’ – adj] |

Philippines |
---|
[‘and’ – noun – conj] |
[<25> – ‘let’] |
[sconj – <25> – verb – pron] |
[‘that’ – <25> – <25> – adv – <25>] |
[adp – ‘other’ – noun] |
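
The selection rule described at the start of this appendix can be sketched as follows. This is only an illustrative sketch: it assumes that each region has a mapping from construction to a non-negative feature weight, and the function name `top_constructions` and all example weights are hypothetical rather than taken from the paper's models.

```python
from typing import Dict, List, Tuple

WEIGHT_CAP = 0.02  # threshold stated in the appendix text
TOP_N = 5          # the appendix lists five constructions per region


def top_constructions(weights: Dict[str, float],
                      cap: float = WEIGHT_CAP,
                      n: int = TOP_N) -> List[Tuple[str, float]]:
    """Return the n highest-weighted constructions whose weight is below cap."""
    # Drop unusually predictive features first, then rank what remains.
    kept = [(cxn, w) for cxn, w in weights.items() if w < cap]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:n]


# Hypothetical weights for one region; the values are invented for illustration.
example_weights = {
    "[verb – 'down']": 0.015,
    "['my' – adj]": 0.012,
    "[det – verb – adv]": 0.011,
    "[rare – outlier]": 0.310,  # unusually predictive, removed by the cap
    "['when' – 'the']": 0.009,
}
selected = top_constructions(example_weights)
```

Filtering before ranking means that a rare but highly predictive feature cannot displace the more broadly distributed constructions that the models actually depend on.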
B Examples of semantic domains
This appendix shows 10 lexical items that belong to each of a select number of semantic domains, selected to aid interpretation of the example representations in Appendix A. A complete inventory of each semantic domain is contained in the external resources accompanying this paper.
<25> | <39> | <47> |
---|---|---|
auditorium | wheelchairs | law |
industry | contraband | concurrence |
fundraisers | yard | severally |
members | spare | exempts |
press | depots | sentence |
delighted | handpicked | federal |
appeared | storage | purporting |
wondered | assortment | administering |
expecting | wheelie | certifying |
discovering | torches | commissioners |

<49> | <96> |
---|---|
srt | occupations |
cetls | government-sponsored |
aba | homebuy |
rcr | anti-poverty |
cmg | burglary |
gnn | self-build |
lcs | householder |
gdl | landfill |
pss | dwellers |
ecc | municipal |
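
To make the notation in Appendix A concrete, the sketch below shows one way a representation such as [<25> – ‘from’ – noun] could be matched against tagged text. It rests on assumptions that are not confirmed here: that a quoted slot is a lexical item, a bracketed number is a semantic domain, a bare label is a part-of-speech tag, and that slots match contiguous tokens. The names `Token`, `slot_matches`, and `count_matches` are hypothetical, and the toy sentence is hand-tagged.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    word: str
    pos: str               # part-of-speech tag, e.g. "noun"
    domain: Optional[int]  # semantic domain id, e.g. 25, or None


def slot_matches(slot: str, tok: Token) -> bool:
    """Check one slot: 'word' is lexical, <N> is a domain, anything else is a POS tag."""
    if slot.startswith("'") and slot.endswith("'"):
        return tok.word.lower() == slot[1:-1]
    if slot.startswith("<") and slot.endswith(">"):
        return tok.domain == int(slot[1:-1])
    return tok.pos == slot


def count_matches(construction: List[str], sentence: List[Token]) -> int:
    """Count contiguous token spans that satisfy every slot in order."""
    k = len(construction)
    return sum(
        all(slot_matches(s, t) for s, t in zip(construction, sentence[i:i + k]))
        for i in range(len(sentence) - k + 1)
    )


# [<25> – 'from' – noun], listed under East Africa in Appendix A.
cxn = ["<25>", "'from'", "noun"]
# Hand-tagged toy sentence; "members" is listed under <25> in Appendix B.
sentence = [
    Token("members", "noun", 25),
    Token("from", "adp", None),
    Token("parliament", "noun", None),
    Token("arrived", "verb", None),
]
n = count_matches(cxn, sentence)
```

Counts of this kind, collected per construction and per text, are the sort of quantity that could serve as a feature vector, although the exact feature definition used in the paper is not restated in this appendix.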
References
Argamon, S., M. Koppel, J. Fine & A. R. Shimoni. 2003. Gender, genre, and writing style in formal written texts. Text 23(3). 321–346. doi:10.1515/text.2003.014
Baayen, R. Harald, P. Milin, D. Durdević, P. Hendrix & M. Marelli. 2011. An amorphous model for morphological processing in visual comprehension based on naive discriminative learning. Psychological Review 118. 438–482. doi:10.1037/a0023851
Baroni, M., S. Bernardini, A. Ferraresi & E. Zanchetta. 2009. The WaCky Wide Web: A collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation 43. 209–226. doi:10.1007/s10579-009-9081-4
Biber, Douglas. 2014. Using multi-dimensional analysis to explore cross-linguistic universals of register variation. Languages in Contrast 14(1). 7–34. doi:10.1075/bct.87.02bib
Bybee, Joan. 2006. From usage to grammar: The mind’s response to repetition. Language 82(4). 711–733. doi:10.1353/lan.2006.0186
Cilibrasi, R. & P. Vitanyi. 2007. The Google similarity distance. IEEE Transactions on Knowledge and Data Engineering 19(3). 370–383. doi:10.1109/TKDE.2007.48
Claes, Jeroen. 2014. A cognitive construction grammar approach to the pluralization of presentational haber in Puerto Rican Spanish. Language Variation and Change 26(2). 219–246. doi:10.1017/S0954394514000052
Dąbrowska, Ewa. 2012. Different speakers, different grammars: Individual differences in native language attainment. Linguistic Approaches to Bilingualism 2(3). 219–253. doi:10.1075/lab.2.3.01dab
Dąbrowska, Ewa. 2014. Words that go together: Measuring individual differences in native speakers’ knowledge of collocations. The Mental Lexicon 9(3). 401–418. doi:10.1075/ml.9.3.02dab
Divjak, Dagmar, Ewa Dąbrowska & Antti Arppe. 2016. Machine meets man: Evaluating the psychological reality of corpus-based probabilistic models. Cognitive Linguistics 27(1). 1–33. doi:10.1515/cog-2015-0101
Dunn, Jonathan. 2017. Computational learning of construction grammars. Language and Cognition 9(2). 254–292. doi:10.1017/langcog.2016.7
Dunn, Jonathan. 2018. Modeling the complexity and descriptive adequacy of construction grammars. In Proceedings of the Society for Computation in Linguistics (SCiL 2018), 81–90. Stroudsburg, PA: Association for Computational Linguistics.
Dunn, Jonathan, S. Argamon, A. Rasooli & G. Kumar. 2016. Profile-based authorship analysis. Literary and Linguistic Computing 31(4). 689–710. doi:10.1093/llc/fqv019
Firth, J. 1957. Papers in linguistics, 1934–1951. Oxford: Oxford University Press.
Geeraerts, Dirk. 2010. Lexical variation in space. In P. Auer & J. Schmidt (eds.), Language in space: An international handbook of linguistic variation. Vol. 1: Theories and methods, 821–837. Berlin & New York: Mouton de Gruyter. doi:10.1515/9783110220278.821
Geeraerts, Dirk. 2016. The sociosemiotic commitment. Cognitive Linguistics 27(4). 527–542. doi:10.1515/cog-2016-0058
Gisborne, N. 2011. Constructions, word grammar, and grammaticalization. Cognitive Linguistics 22(1). 155–182. doi:10.1515/cogl.2011.007
Goebl, H. 1982. Dialektometrie. Prinzipien und Methoden des Einsatzes der numerischen Taxonomie im Bereich der Dialektgeographie (Denkschriften, Bd. 157). Wien: Österreichische Akademie der Wissenschaften.
Goebl, H. 1984. Dialektometrische Studien: Anhand italoromanischer, rätoromanischer und galloromanischer Sprachmaterialien aus AIS und ALF (Beihefte zur Zeitschrift für romanische Philologie, Bd. 191). Tübingen: Niemeyer.
Goebl, H. 2006. Recent advances in Salzburg dialectometry. Literary and Linguistic Computing 21(4). 411–435. doi:10.1093/llc/fql042
Goldberg, Adele. 2006. Constructions at work: The nature of generalization in language. Oxford: Oxford University Press. doi:10.1093/acprof:oso/9780199268511.001.0001
Goldberg, Adele. 2011. Corpus evidence of the viability of statistical preemption. Cognitive Linguistics 22(1). 131–154. doi:10.1515/9783110335255.57
Goldhahn, D., T. Eckart & U. Quasthoff. 2012. Building large monolingual dictionaries at the Leipzig Corpora Collection: From 100 to 200 languages. Proceedings of the Eighth Conference on Language Resources and Evaluation 2012 (LREC’12), 759–765. http://www.lrec-conf.org/proceedings/lrec2012/index.html (accessed 18 March 2018).
Grieve, Jack. 2013. A statistical comparison of regional phonetic and lexical variation in American English. Literary and Linguistic Computing 28. 82–107. doi:10.1093/llc/fqs051
Grieve, Jack. 2014. A comparison of statistical methods for the aggregation of regional linguistic variation. In Benedikt Szmrecsanyi & Bernhard Wälchli (eds.), Aggregating dialectology, typology, and register analysis: Linguistic variation in text and speech, within and across languages, 53–88. Berlin & New York: Walter de Gruyter. doi:10.1515/9783110317558.53
Grieve, Jack. 2016. Regional variation in written American English. Cambridge, UK: Cambridge University Press. doi:10.1017/CBO9781139506137
Grieve, Jack, Dirk Speelman & Dirk Geeraerts. 2011. A statistical method for the identification and aggregation of regional linguistic variation. Language Variation & Change 23. 1–29. doi:10.1017/S095439451100007X
Heeringa, W. 2004. Measuring dialect pronunciation differences using Levenshtein distance. Groningen, Netherlands: University of Groningen dissertation.
Henderson, J., G. Zarrella, C. Pfeifer & J. Burger. 2013. Discriminating non-native English with 350 words. In Proceedings of the 8th Workshop on Innovative Use of NLP for Building Educational Applications, 101–110. Stroudsburg, PA: Association for Computational Linguistics.
Hoffmann, T. & G. Trousdale. 2011. Variation, change, and constructions in English. Cognitive Linguistics 22(1). 1–24. doi:10.1515/cogl.2011.001
Hollmann, W. & A. Siewierska. 2011. The status of frequency, schemas, and identity in cognitive sociolinguistics: A case study on definite article reduction. Cognitive Linguistics 22(1). 25–54. doi:10.1515/cogl.2011.002
Joachims, T. 1998. Text categorization with support vector machines: Learning with many relevant features. In C. Nédellec (ed.), Machine learning: ECML-98: 10th European Conference on Machine Learning, 137–142. Berlin: Springer. doi:10.1007/BFb0026683
Kay, Paul & Charles J. Fillmore. 1999. Grammatical constructions and linguistic generalizations: The What’s X doing Y? construction. Language 75(1). 1–33. doi:10.2307/417472
Koppel, Moshe, J. Schler & E. Bonchek-Dokow. 2007. Measuring differentiability: Unmasking pseudonymous authors. Journal of Machine Learning Research 8. 1261–1276.
Kortmann, Bernd, E. Schneider, K. Burridge, R. Mesthrie & C. Upton (eds.). 2004. A handbook of varieties of English. Berlin & New York: Mouton de Gruyter.
Kretzschmar, William A. 1992. Isoglosses and predictive modeling. American Speech 67(3). 227–249. doi:10.2307/455562
Kretzschmar, William A. 1996. Quantitative areal analysis of dialect features. Language Variation & Change 8. 13–39. doi:10.1017/S0954394500001058
Kretzschmar, William A., I. Juuso & C. Bailey. 2014. Computer simulation of dialect feature diffusion. Journal of Linguistic Geography 2. 41–57. doi:10.1017/jlg.2014.2
Labov, William, S. Ash & C. Boberg. 2005. The atlas of North American English: Phonetics, phonology and sound change. Berlin: De Gruyter Mouton. doi:10.1515/9783110167467
Langacker, Ronald. 1987. Foundations of cognitive grammar, Vol. 1: Theoretical prerequisites. Stanford: Stanford University Press.
Langacker, Ronald. 2008. Cognitive grammar: A basic introduction. Oxford: Oxford University Press. doi:10.1093/acprof:oso/9780195331967.001.0001
Lee, Jay & William A. Kretzschmar. 1993. Spatial analysis of linguistic data with GIS functions. International Journal of Geographical Information Systems 7(6). 541–560. doi:10.1080/02693799308901981
Levshina, Natalia. 2016. When variables align: A Bayesian multinomial mixed-effects model of English permissive constructions. Cognitive Linguistics 27(2). 235–268. doi:10.1515/cog-2015-0054
Milin, Petar, D. Divjak, S. Dimitrijević & R. H. Baayen. 2016. Towards cognitively plausible data science in language research. Cognitive Linguistics 27(4). 507–526. doi:10.1515/cog-2016-0055
Nagy, N. 2016. Heritage languages as new dialects. In M. Cote & J. Nerbonne (eds.), The future of dialects, 15–35. Berlin: Language Science Press.
Nelson, G., S. Wallis & B. Aarts. 2002. Exploring natural language. Working with the British component of the International Corpus of English. Amsterdam: John Benjamins. doi:10.1075/veaw.g29
Nerbonne, John. 2006. Identifying linguistic structure in aggregate comparison. Literary and Linguistic Computing 21(4). 463–476. doi:10.1093/llc/fql041
Nerbonne, John. 2009. Data-driven dialectology. Language and Linguistics Compass 3(1). 175–198. doi:10.1111/j.1749-818X.2008.00114.x
Nerbonne, John & W. Heeringa. 2010. Measuring dialect differences. In S. Jürgen & P. Auer (eds.), Language and space: Theories and methods in series handbooks of linguistics and communication science, 550–567. Berlin: Mouton De Gruyter. doi:10.1515/9783110220278.550
Nerbonne, John & P. Kleiweg. 2007. Toward a dialectological yardstick. Journal of Quantitative Linguistics 14(2/3). 148–166. doi:10.1080/09296170701379260
Nerbonne, John, P. Kleiweg, W. Heeringa & F. Manni. 2008. Projecting dialect distances to geography: Bootstrap clustering vs. noisy clustering. In C. Preisach, L. Schmidt-Thieme, H. Burkhardt & R. Decker (eds.), Data analysis, machine learning and applications, 647–654. Berlin: Springer. doi:10.1007/978-3-540-78246-9_76
Nerbonne, John & W. Kretzschmar. 2013. Dialectometry++. Literary and Linguistic Computing 28(1). 2–12. doi:10.1093/llc/fqs062
Nguyen, Dat Quoc, Dai Quoc Nguyen, Dang Duc Pham & Son Bao Pham. 2016. A robust transformation-based learning approach using ripple down rules for part-of-speech tagging. AI Communications 29(3). 409–422. doi:10.3233/AIC-150698
Onishi, T. 2016. Timespan comparison of dialectal distributions. In M. Cote & J. Nerbonne (eds.), The future of dialects, 377–388. Berlin: Language Science Press.
Peirsman, Yves, Dirk Geeraerts & Dirk Speelman. 2010. The automatic identification of lexical variation between language varieties. Natural Language Engineering 16(4). 469–491. doi:10.1017/S1351324910000161
Petrov, Slav, D. Das & R. McDonald. 2012. A universal part-of-speech tagset. Proceedings of the Eighth Conference on Language Resources and Evaluation 2012 (LREC’12), 2089–2096. http://www.lrec-conf.org/proceedings/lrec2012/index.html (accessed 18 March 2018).
Pickl, Simon. 2016. Fuzzy dialect areas and prototype theory: Discovering latent patterns in geolinguistic variation. In M. Cote & J. Nerbonne (eds.), The future of dialects, 75–98. Berlin: Language Science Press.
Pickl, Simon, A. Spettl, S. Pröll, S. Elspaß, W. König & V. Schmidt. 2014. Linguistic distances in dialectometric intensity estimation. Journal of Linguistic Geography 2. 25–40. doi:10.1017/jlg.2014.3
Pröll, Simon. 2013. Detecting structures in linguistic maps: Fuzzy clustering for pattern recognition in geostatistical dialectometry. Literary and Linguistic Computing 28(1). 108–118. doi:10.1093/llc/fqs059
Řehůřek, Radim & Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 45–50. Valletta, Malta: University of Malta.
Roller, Stephen, M. Speriosu, S. Rallapalli, B. Wing & J. Baldridge. 2012. Supervised text-based geolocation using language models on an adaptive grid. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 1500–1510. Stroudsburg, PA: Association for Computational Linguistics.
Ruette, Tom, Dirk Geeraerts & Dirk Speelman. 2014. Lexical variation in aggregate perspective. In Augusto da Silva Soares (ed.), Pluricentricity: Language variation and sociocognitive dimensions, 103–126. Berlin: de Gruyter. doi:10.1515/9783110303643.103
Rumpf, Jonas, S. Pickl, S. Elspaß, W. König & V. Schmidt. 2009. Structural analysis of dialect maps using methods from spatial statistics. Zeitschrift für Dialektologie und Linguistik 76(3). 280–308. Stuttgart: Franz Steiner Verlag. doi:10.25162/zdl-2009-0010
Sanders, Nathan C. 2007. Measuring syntactic difference in British English. In Proceedings of the 45th Annual Meeting of the ACL: Student Research Workshop, 1–6. Association for Computational Linguistics. http://aclweb.org/anthology/P07-3 (accessed 18 March 2018). doi:10.3115/1557835.1557837
Sanders, Nathan C. 2010. A statistical method for syntactic dialectometry. Bloomington: Indiana University dissertation.
Schmid, Hans-Jörg. 2016. Why cognitive linguistics must embrace the social and pragmatic dimensions of language and how it could do so more seriously. Cognitive Linguistics 27(4). 543–557. doi:10.1515/cog-2016-0048
Schneider, E. 2007. Postcolonial English: Varieties around the world. Cambridge, UK: Cambridge University Press. doi:10.1017/CBO9780511618901
Séguy, Jean. 1973. La dialectométrie dans l’Atlas linguistique de la Gascogne. Revue de linguistique romane 37. 1–24.
Sibler, Pius, R. Weibel, E. Glaser & G. Bart. 2012. Cartographic visualization in support of dialectology. In The 2012 AutoCarto International Symposium on Automated Cartography, Columbus, Ohio, USA, 16–18 September.
Stefanowitsch, A. 2011. Constructional preemption by contextual mismatch: A corpus-linguistic investigation. Cognitive Linguistics 22(1). 107–129. doi:10.1515/9783110335255.33
Szmrecsanyi, Benedikt. 2009. Corpus-based dialectometry: Aggregate morphosyntactic variability in British English dialects. International Journal of Humanities and Arts Computing 2(1/2). 279–296. doi:10.3366/E1753854809000433
Szmrecsanyi, Benedikt. 2013. Grammatical variation in British English dialects: A study in corpus-based dialectometry (Studies in English Language). Cambridge: Cambridge University Press. doi:10.1017/CBO9780511763380
Szmrecsanyi, Benedikt. 2014. Forests, trees, corpora, and dialect grammars. In Benedikt Szmrecsanyi & Bernhard Wälchli (eds.), Aggregating dialectology, typology, and register analysis: Linguistic variation in text and speech, 89–112. Berlin: Mouton De Gruyter. doi:10.1515/9783110317558.89
Szmrecsanyi, Benedikt. 2016. About text frequencies in historical linguistics: Disentangling environmental and grammatical change. Corpus Linguistics and Linguistic Theory 12(1). 153–171. doi:10.1515/cllt-2015-0068
Uiboaed, K., C. Hasselblatt, L. Lindström, K. Muischnek & J. Nerbonne. 2013. Variation of verbal constructions in Estonian dialects. Literary and Linguistic Computing 28(1). 42–62. doi:10.1093/llc/fqs053
Wible, David & Nai-Lung Tsao. 2010. StringNet as a computational resource for discovering and investigating linguistic constructions. In Proceedings of the NAACL HLT Workshop on Extracting and Using Constructions in Computational Linguistics, 25–31. http://www.aclweb.org/anthology/W10-0804 (accessed 18 March 2018).
Wieling, Martijn, W. Heeringa & J. Nerbonne. 2007. An aggregate analysis of pronunciation in the Goeman-Taeldeman-Van Reenen-Project data. Taal en Tongval 59. 84–116.
Wieling, Martijn & S. Montemagni. 2016. Infrequent forms: Noise or not? In M. Cote & J. Nerbonne (eds.), The future of dialects, 215–224. Berlin: Language Science Press.
Wieling, Martijn, J. Nerbonne & R. H. Baayen. 2011. Quantitative social dialectology: Explaining linguistic variation geographically and socially. PloS One 6(9). e23613. doi:10.1371/journal.pone.0023613 (accessed 18 March 2018).
Wieling, Martijn & John Nerbonne. 2011. Bipartite spectral graph partitioning for clustering dialect varieties and detecting their linguistic features. Computer Speech & Language 25(3). 700–715. doi:10.1016/j.csl.2010.05.004
Wieling, Martijn & John Nerbonne. 2015. Advances in dialectometry. Annual Review of Linguistics 1. 243–264. doi:10.1146/annurev-linguist-030514-124930
Wolk, C. & B. Szmrecsanyi. 2016. Top-down and bottom-up advances in corpus-based dialectometry. In M. Cote & J. Nerbonne (eds.), The future of dialects, 225–244. Berlin: Language Science Press.
Zenner, Eline, Dirk Speelman & Dirk Geeraerts. 2012. Cognitive sociolinguistics meets loanword research: Measuring variation in the success of anglicisms in Dutch. Cognitive Linguistics 23(4). 749–792. doi:10.1515/9783110335255.251
© 2018 Walter de Gruyter GmbH, Berlin/Boston