Published in: Arabian Journal for Science and Engineering 9/2021

2 March 2021 | Research Article-Computer Engineering and Computer Science

Addressing Limited Vocabulary and Long Sentences Constraints in English–Arabic Neural Machine Translation

Authors: Safae Berrichi, Azzeddine Mazroui

Abstract

Neural Machine Translation (NMT) has attracted growing interest in recent years for its promising performance compared to traditional approaches such as Statistical Machine Translation (SMT). However, its performance degrades when it is applied to language pairs with different structures, such as the English–Arabic pair studied in this work. Indeed, the limited vocabulary size required by NMT models reduces the vocabulary coverage rate for Arabic, a language well known for its morphological richness. Likewise, long sentences present an additional challenge, because NMT systems perform worse on long sentences than on short ones. In this paper, we report a series of experiments aimed at mitigating the effects of these constraints. To address the problem of out-of-vocabulary words, we integrated morphosyntactic features into factored NMT models as output factors, namely stem, lemma, POS, root, and pattern. We have also developed two techniques for segmenting long sentences into smaller sub-sentences. The first uses a list of lexical markers that we have collected as segmentation points, and the second integrates into the NMT model the parallel phrases extracted by an SMT system. The experiments carried out on the English–Arabic pair show that the proposed approaches considerably improve translation quality compared to the basic NMT system.
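
To make the first segmentation technique more concrete, the following is a minimal Python sketch of splitting a long sentence at lexical markers. It is not the authors' implementation: the marker list, the function name split_at_markers, and the max_len and min_len thresholds are all assumptions chosen for illustration, and the sketch assumes whitespace-tokenized English input.

```python
# Hypothetical marker list for illustration; the authors' collected list
# is not reproduced on this page.
MARKERS = {"because", "although", "however", "which", "while"}


def split_at_markers(sentence, markers=MARKERS, max_len=15, min_len=3):
    """Split a long sentence into sub-sentences, cutting just before markers.

    Sentences with at most max_len tokens are returned unchanged. A cut is
    made only when the current segment already holds at least min_len tokens,
    which avoids producing very short fragments. Both thresholds are assumed
    values, not taken from the paper.
    """
    tokens = sentence.split()
    if len(tokens) <= max_len:
        return [sentence]

    segments, current = [], []
    for token in tokens:
        # Compare case-insensitively and ignore attached punctuation.
        word = token.lower().strip(".,;:!?")
        if word in markers and len(current) >= min_len:
            segments.append(" ".join(current))
            current = []
        current.append(token)
    if current:
        segments.append(" ".join(current))
    return segments


example = ("The committee approved the new budget proposal because the previous "
           "plan had failed although several members objected strongly to it")
for part in split_at_markers(example, max_len=10):
    print(part)
```

In this sketch each sub-sentence would be translated separately and the outputs concatenated in order; how the sub-translations are actually recombined in the paper is not described in the abstract.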

Metadata
Title
Addressing Limited Vocabulary and Long Sentences Constraints in English–Arabic Neural Machine Translation
Authors
Safae Berrichi
Azzeddine Mazroui
Publication date
2 March 2021
Publisher
Springer Berlin Heidelberg
Published in
Arabian Journal for Science and Engineering / Issue 9/2021
Print ISSN: 2193-567X
Electronic ISSN: 2191-4281
DOI
https://doi.org/10.1007/s13369-020-05328-2
