Published in: Social Network Analysis and Mining 1/2024

1 December 2024 | Original Article

Construction of a training dataset for a sentiment analysis model of dairy products tweets in Brazil

Written by: Thallys da Silva Nogueira, Kennya Beatriz Siqueira, Priscila Vanessa Zabala Capriles Goliatt


Abstract

Creating specific datasets for machine learning models is a frequent and challenging task, requiring considerable effort in sample collection and maintaining a balanced representation of each class. In this study, our objective was to create a training dataset for a sentiment analysis model by combining results obtained from 5 natural language processing tools through 3 distinct approaches, aiming to automatically label various tweets in the negative, neutral, and positive classes. Additionally, we applied data balancing techniques to assess different methods' impacts on the sentiment analysis models' ability to generalize classes to previously unseen samples. The results demonstrated that the three approaches used to combine tool results and apply balancing techniques provided significantly superior outcomes compared to models with imbalanced datasets. These advancements enabled sentiment analysis models to achieve greater precision and generalization capacity for novel samples. These findings underscore the importance of considering effective data balancing strategies when creating training datasets for machine learning applications, especially in tasks sensitive to class imbalance, such as sentiment analysis. This enhanced approach is crucial to improving the performance and applicability of sentiment analysis models in real-world scenarios, providing more precise data analyses that unveil valuable insights in digital marketing.
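The abstract does not say which three combination approaches or which balancing techniques were used, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes majority voting as one possible way to merge the labels of five tools and plain random oversampling as one possible balancing step. The function names, the tie-breaking rule (fall back to neutral), and the sample data are all hypothetical.

```python
import random
from collections import Counter


def majority_label(tool_labels, fallback="neutral"):
    """Return the label most tools agree on; use `fallback` when the top two tie.

    Assumption: majority voting stands in for the paper's unspecified
    combination approaches.
    """
    counts = Counter(tool_labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return fallback
    return counts[0][0]


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until every class
    has as many samples as the largest class.

    Assumption: random oversampling stands in for the paper's
    unspecified balancing techniques.
    """
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_samples.extend(group + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


# Invented outputs of five hypothetical tools for four placeholder tweets.
tool_outputs = [
    ["positive", "positive", "neutral", "positive", "negative"],
    ["negative", "negative", "negative", "neutral", "negative"],
    ["neutral", "positive", "neutral", "negative", "positive"],  # 2-2 tie
    ["positive", "positive", "positive", "positive", "neutral"],
]
tweets = ["tweet_a", "tweet_b", "tweet_c", "tweet_d"]

labels = [majority_label(row) for row in tool_outputs]
balanced_tweets, balanced_labels = random_oversample(tweets, labels)

print(labels)
print(Counter(balanced_labels))
```

In practice, a library such as imbalanced-learn offers oversampling and undersampling methods that operate on feature vectors rather than raw text. The standard-library version here is only meant to make the two steps concrete.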


Footnotes
1. Dairy Drinks, Sour Cream, Dulce de Leche, Yogurt, Milk, Condensed Milk, Fermented Milk, Butter, Cheese, and Ice Cream.
 
Metadata
Title
Construction of a training dataset for a sentiment analysis model of dairy products tweets in Brazil
Written by
Thallys da Silva Nogueira
Kennya Beatriz Siqueira
Priscila Vanessa Zabala Capriles Goliatt
Publication date
1 December 2024
Publisher
Springer Vienna
Published in
Social Network Analysis and Mining, Issue 1/2024
Print ISSN: 1869-5450
Electronic ISSN: 1869-5469
DOI
https://doi.org/10.1007/s13278-024-01254-5
