The Relationship Between Word Length and Average Information Content in Japanese

Yuki Tanida

Cognitive Science 47 (6):e13302 (2023)

Abstract

Piantadosi, Tily, and Gibson analyzed a large‐scale web‐scraping corpus (the Google 1T dataset) and reported that word length is independently predicted from average information content (surprisal) calculated by a 2‐ to 4‐gram model (hereafter, longer‐span surprisal) across 11 Indo‐European languages, namely, Czech, Dutch, English, French, German, Italian, Polish, Spanish, Portuguese, Romanian, and Swedish. However, a recent article by Meylan and Griffiths suggested the importance of preprocessing for studies with large‐scale corpora and reanalyzed the same databases. After their preprocessing, the results in Piantadosi et al. were not replicated in Czech, Romanian, and Swedish. Additionally, a German‐specific study by Koplenig, Kupietz, and Wolfer showed that the strict analysis did not replicate the result in Piantadosi et al. for that language with the preprocessing suggested by Meylan and Griffiths in a large‐scale but less noisy database. These three studies provide evidence from 11 Indo‐European languages and one Afro‐Asiatic language, Hebrew, as relevant in this debate. However, we do not have evidence from other linguistic groups. This study provides evidence about Japanese based on a strict preprocessing of Google's web‐scraping database. The results show that Japanese word length can be predicted independently by 2‐ to 4‐gram surprisal.
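As an illustration of the quantity the abstract calls average information content, the sketch below computes, for each word, the mean surprisal (in bits) over all of its occurrences given the preceding word. It is a minimal sketch and not the paper's pipeline: it assumes a bigram model only (the study uses 2- to 4-gram models), unsmoothed maximum-likelihood estimates, and a pre-tokenized input. The function name average_information_content and the toy token sequence are hypothetical.

```python
import math
from collections import Counter, defaultdict


def average_information_content(tokens):
    """Return each word's mean bigram surprisal in bits.

    Assumption: P(word | prev) is the unsmoothed maximum-likelihood
    estimate count(prev, word) / count(prev) from the same token list.
    The first token has no preceding context and is not scored.
    """
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter(tokens[:-1])

    surprisal_totals = defaultdict(float)
    occurrences = Counter()
    for prev, word in zip(tokens, tokens[1:]):
        p = bigram_counts[(prev, word)] / context_counts[prev]
        surprisal_totals[word] += -math.log2(p)
        occurrences[word] += 1

    # Average over every scored occurrence of the word.
    return {w: surprisal_totals[w] / occurrences[w] for w in surprisal_totals}


# Toy, hypothetical token sequence (not data from the study).
toy_tokens = "a b a c a b".split()
toy_result = average_information_content(toy_tokens)
```

In the toy sequence, "c" follows "a" in one of the three contexts where "a" precedes another token, so it gets the highest average surprisal, while "a" is fully predictable from its preceding words. A study of the kind described would then relate such per-word values to word length, which this sketch does not attempt.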

Keywords

Corpus analysis; Information theory; Large-scale corpora; N-gram model; Zipf's law

DOI

10.1111/cogs.13302

References found in this work

The Challenges of Large‐Scale, Web‐Based Language Datasets: Word Length and Predictability Revisited. Stephan C. Meylan & Thomas L. Griffiths - 2021 - Cognitive Science 45 (6):e12983.
