
2024 | OriginalPaper | Book chapter

How Large Corpora Sizes Influence the Distribution of Low Frequency Text n-grams

Written by: Joaquim F. Silva, Jose C. Cunha

Published in: Advances in Knowledge Discovery and Data Mining

Publisher: Springer Nature Singapore


Abstract

The prediction of the numbers of distinct word n-grams and their frequency distributions in text corpora is important in domains like information processing and language modelling. With big data corpora, there is an increased application complexity due to the large volume of data. Traditional studies have been confined to small or moderate size corpora leading to statistical laws on word frequency distributions. However, when going to very large corpora, some of the assumptions underlying those laws need to be revised, related to the corpus vocabulary and numbers of word occurrences. So, although it becomes critical to know how the corpus size influences those distributions, there is a lack of models that characterise such influence. This paper aims at filling this gap, enabling the prediction of the impact of corpus growth upon application time and space complexities. It presents a fully principled model, which, distinctively, considers words and multiwords in very large corpora, predicting the cumulative numbers of distinct n-grams above or equal to a given frequency in a corpus, as well as the sizes of equal-frequency n-gram groups, from unigrams to hexagrams, as a function of corpus size, in a language, assuming a finite n-gram vocabulary. The model applies to low occurrence frequencies, encompassing the larger populations of n-grams. Practical assessment with real corpora shows relative errors around \(3\%\), stable over the considered ranges of n-gram frequencies, n-gram sizes and corpora sizes from million to billion words, for English and French.
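The two quantities the abstract refers to can be made concrete with a small counting sketch. It is only an illustration on a toy corpus, not the authors' counting pipeline: the function names `ngrams`, `cumulative_distinct` and `group_size` are hypothetical, and whitespace splitting stands in for real tokenisation.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous word n-gram in `tokens` as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))


def cumulative_distinct(tokens, n, k):
    """Number of distinct n-grams occurring at least k times."""
    counts = Counter(ngrams(tokens, n))
    return sum(1 for c in counts.values() if c >= k)


def group_size(tokens, n, k):
    """Number of distinct n-grams occurring exactly k times."""
    counts = Counter(ngrams(tokens, n))
    return sum(1 for c in counts.values() if c == k)


# Toy corpus (assumption: whitespace tokenisation is good enough here).
toy_corpus = "the cat sat on the mat and the cat ran".split()

for n in (1, 2):
    for k in (1, 2):
        print(n, k, cumulative_distinct(toy_corpus, n, k), group_size(toy_corpus, n, k))
```

By construction, the size of the equal-frequency group at k equals the cumulative count at k minus the cumulative count at k + 1, which is why a model of the cumulative counts also yields the group sizes.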


Footnotes
1
This excludes corpus collection and n-gram counting, which are done only once, for parameter estimation and validation of the model.
 
2
Equation (1) is equivalent to \(\frac{\frac{dD(k,C)}{D(k,C)}}{\frac{dC}{C}}=g_k \, \frac{V - D(k,C)}{V}\). Under the infinite V assumption, the ratio on the left-hand side of the equation would be constant (equal to \(g_k\)) with respect to C, but empirical observations showed that the ratio decreases instead. This decrease is captured by the vocabulary finiteness assumption (the second factor).
 
3
Indeed, (2) can be written as \(\frac{V - D(k,C)}{D(k,C)}= (h_k\,C)^{-g_k} \), which, for each k and n, is a power law with respect to C, since \(g_k\) and \(h_k\) were found to be constant with respect to C.
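The relation between footnotes 2 and 3 can be checked numerically. Solving the power-law form for \(D(k,C)\) gives \(D(k,C) = V / (1 + (h_k\,C)^{-g_k})\), and the sketch below compares a finite-difference estimate of the ratio \(\frac{dD/D}{dC/C}\) with \(g_k \frac{V - D}{V}\). The function names and the values of `V`, `g_k` and `h_k` are assumptions chosen purely for illustration; they are not parameters estimated in the paper.

```python
def distinct_ngrams(C, V, g_k, h_k):
    """D(k, C) obtained by solving (V - D) / D = (h_k * C) ** (-g_k) for D."""
    return V / (1.0 + (h_k * C) ** (-g_k))


def elasticity(C, V, g_k, h_k, rel_step=1e-6):
    """Central-difference estimate of (dD / D) / (dC / C) at corpus size C."""
    dC = C * rel_step
    d_hi = distinct_ngrams(C + dC, V, g_k, h_k)
    d_lo = distinct_ngrams(C - dC, V, g_k, h_k)
    d = distinct_ngrams(C, V, g_k, h_k)
    return ((d_hi - d_lo) / (2.0 * dC)) * (C / d)


# Hypothetical parameter values, for illustration only.
V, g_k, h_k = 1e8, 0.8, 1e-9

for C in (1e6, 1e8, 1e10):
    d = distinct_ngrams(C, V, g_k, h_k)
    numeric = elasticity(C, V, g_k, h_k)
    analytic = g_k * (V - d) / V
    print(f"C={C:.0e}  D={d:.4e}  numeric={numeric:.6f}  analytic={analytic:.6f}")
```

With these illustrative values the ratio stays below \(g_k\) and shrinks as C grows, which is the decrease that footnote 2 attributes to the finite vocabulary.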
 
4
Empirical counts were obtained from the corpora with the help of Carlos Gonçalves.
 
Metadata
Title
How Large Corpora Sizes Influence the Distribution of Low Frequency Text n-grams
Written by
Joaquim F. Silva
Jose C. Cunha
Copyright year
2024
Publisher
Springer Nature Singapore
DOI
https://doi.org/10.1007/978-981-97-2259-4_16
