2.1 Data augmentation for text classification
The goal of data augmentation techniques is to enhance data quality and diversity without the need for additional data collection. Several fields have successfully applied data augmentation (DA) to improve model performance and generalization (Jiang et al. 2023; Wang et al. 2023; Yadav and Vishwakarma 2023). To accomplish this, new samples are generated through methods such as noise addition, cropping, or flipping applied to the training samples, while maintaining the integrity of the original dataset (Li et al. 2022, 2023; Al-Dhabyani et al. 2019).
Unlike computer vision, which operates on pixels, NLP data augmentation emphasizes linguistic variations. Commonly used methods include synonym substitution, sentence shuffling, back-translation, and embedding modification. These techniques address tasks such as balancing classes within imbalanced datasets and generating additional data for under-resourced domains (Wang et al. 2023; Bayer et al. 2023). Text DA can be implemented in the feature space or the data space (Bayer et al. 2022). In data-space DA, adjustments can be made at various granularities, including the character, word, sentence, or document level (Bayer et al. 2022). These strategies aim to expose the model to a broad array of linguistic patterns, fostering improved generalization and performance across diverse NLP tasks such as sentiment analysis and named entity recognition (Wang et al. 2023; Jiang et al. 2023). Deep learning models, including pretrained transformer models such as BERT, have gained popularity in natural language processing (NLP) tasks owing to their ability to streamline feature extraction (Gupta et al. 2023; Yadav et al. 2023; Asta and Setiawan 2023). While the use of data augmentation in NLP is still in its early stages, it plays a crucial role in addressing data scarcity and strengthening the resilience of language models as the field continues to evolve (Pellicer et al. 2023; Wang et al. 2023).
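As an illustration of word-level, data-space augmentation, the sketch below performs probabilistic synonym substitution. The synonym table is a hypothetical toy example; a practical system would draw candidates from a lexical resource such as WordNet or from embedding neighbors.

```python
import random

# Hypothetical hand-built synonym table, for illustration only.
SYNONYMS = {
    "film": ["movie", "picture"],
    "great": ["excellent", "superb"],
    "bad": ["poor", "awful"],
}

def synonym_substitute(sentence, p=0.5, rng=None):
    """Replace each word with a random synonym with probability p.
    The class label of the augmented sentence is assumed unchanged."""
    rng = rng or random.Random(0)  # seeded for reproducibility
    out = []
    for word in sentence.split():
        candidates = SYNONYMS.get(word.lower())
        if candidates and rng.random() < p:
            out.append(rng.choice(candidates))
        else:
            out.append(word)
    return " ".join(out)

augmented = synonym_substitute("a great film with a bad ending", p=1.0)
```

With `p=1.0`, every word found in the table is replaced, yielding a label-preserving paraphrase of the original sentence.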
Text augmentation for fake news detection is becoming increasingly popular as a method for improving model resilience and generalization when dealing with the growing number of fake news sources (Refai et al. 2022). A variety of methods are employed to augment or expand the training dataset, enhancing the model’s capacity to handle a broader range of cases that simulate the complexity and diversity found in real-life misleading information (Refai et al. 2022; Kumar et al. 2020; Hua et al. 2023). These methods include back-translation, paraphrasing, and context-based augmentation. Back-translation utilizes translation models to create paraphrases: text is translated into another language and then translated back into the original language. The underlying idea is that natural language admits multiple valid translations of a given text, so the round trip yields paraphrases. The approach is highly effective because of its strong paraphrasing capability and because the labels of newly generated instances are preserved. Back-translation outperformed the EDA method and the GPT-2 and BERT pre-trained models for data augmentation (DA) (Kumar et al. 2020; Hua et al. 2023). In a study by Kumar et al. (2020), the authors applied DA based on back-translation to two datasets. Their experimental findings demonstrated that the back-translation technique surpasses other methods in terms of precision, illustrating the ability of contemporary translation systems to preserve language semantics. In the work presented by Refai et al. (2022), a novel DA method was introduced for Arabic text classification, aiming to integrate the unique features of the Arabic language. The motivation behind this endeavor stemmed from the established efficacy of textual augmentation in enhancing the performance of text classification tasks. The authors utilized Arabic transformers, specifically AraGPT-2 and AraBERT, for the generation and processing of Arabic text. Furthermore, they employed well-known similarity metrics, including cosine, Euclidean, Jaccard, and BLEU measures, to ensure the quality of the augmented text with respect to diversity, context, and semantics. Additionally, in Sabty et al. (2021), DA was employed to augment the limited amount of labeled data available for named entity recognition (NER). Various automatic augmentation techniques, such as back-translation, modified EDA, and word embedding substitution, were introduced to expand the training data and enhance the performance of Arabic NER. The study’s findings demonstrated that NER performance can be improved by combining different DA techniques.
Automatically predicting misleading information in Arabic social media is both a technological and socio-cultural imperative. Addressing inaccurate information on Arabic social media is crucial for several reasons (Albalawi et al. 2023; Singh et al. 2023). Firstly, misinformation can significantly impact public perceptions, potentially giving rise to misguided beliefs, fostering fear, or even inciting unjustified actions. Secondly, the linguistic and cultural characteristics of Arabic pose specific challenges for the automated identification of misleading information, necessitating specialized methodologies and models.
On social media platforms, machine-learning (ML) algorithms offer a sophisticated and efficient way to analyze large volumes of textual and contextual data. Recent research in fake news detection sheds light on two primary approaches: classification and propagation (El Ballouli et al. 2017; Jin et al. 2014; Albalawi et al. 2023). Propagation-based approaches (Singh et al. 2023; Azad 2023) delve into the analysis of social graph structures to identify misinformation (Jin et al. 2014). In contrast, classification-based approaches employ machine-learning algorithms that rely on textual features extracted from the content itself (El Ballouli et al. 2017; Zubiaga et al. 2017; Sabbeh and Baatwah 2018).
Additionally, advancements in the field have introduced two key dimensions to misinformation detection: source-based and content-based features. Content-based approaches rely on factors such as text length, the presence of hashtags (#) in the text, and sentiment features (El Ballouli et al. 2017; Kazmi et al. 2023). On the other hand, source-based features are derived from user characteristics, including follower count and user account verification. Some studies, such as El Ballouli et al. (2017) and Hassan et al. (2018), propose a hybrid approach, combining both source and content features for a more comprehensive analysis.
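A minimal sketch of the two feature families may clarify the distinction; the feature names are hypothetical, and a plain dictionary stands in for a user profile object:

```python
import re

def content_features(tweet_text):
    """Illustrative content-based features extracted from the text itself."""
    return {
        "length": len(tweet_text),
        "num_hashtags": len(re.findall(r"#\w+", tweet_text)),
        "num_urls": len(re.findall(r"https?://\S+", tweet_text)),
    }

def source_features(user):
    """Illustrative source-based features from user account metadata."""
    return {
        "followers": user["followers_count"],
        "verified": int(user["verified"]),
    }

# A hybrid approach simply concatenates both feature sets.
feats = {**content_features("Breaking! #news http://t.co/x"),
         **source_features({"followers_count": 120, "verified": False})}
```

A hybrid classifier in the spirit of El Ballouli et al. (2017) would feed the merged dictionary, vectorized, to a standard ML algorithm.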
Lorek et al. (2015) automated the matching between the contents of external links and the text content as an evidence feature for classifying tweets as credible or not credible. Zubiaga et al. (2017) introduced a misinformation detection system that helps users identify fake tweets using the conditional random fields algorithm. Their approach compares results obtained with content-based features against those obtained with social features. According to the results, the textual features of the tweet’s content effectively detect the tweet’s credibility.
Furthermore, Hassan et al. (2018) examined different feature sets, including content and source features, over the dataset used in Lorek et al. (2015). Their work concluded that features related to the source are more effective than features related to the content: source features improved the F-measure by 49% (Lorek et al. 2015). Unfortunately, the creation of handcrafted features is not only time-consuming but also poses the risk of being misleading. An illustrative example is the reliance on the number of followers or reshares as indicators of a tweet’s credibility, a metric that may not reliably signify the verification of content by users before resharing (Ravikumar et al. 2012).
In the domain of Arabic misinformation, a considerable challenge arises from the scarcity of labeled datasets essential for training machine-learning (ML) algorithms. Recent research efforts have explored deep learning (DL) approaches to tackle misinformation detection in Arabic blogs. Gaanoun (2020) employed a semi-supervised technique based on the Arabic-BERT model and ensemble models after the DA process, achieving higher performance with Arabic-BERT models compared to baseline models. However, the majority of DA works for Arabic focus on sentiment analysis tasks. In Abuzayed and Al-Khalifa (2021), several BERT models were utilized, and the DA process was applied to tweets to enhance the performance of sentiment detection. Another study (Alkadri 2022) investigated the effect of using DA for Arabic tweets in the spam detection task, focusing on the dataset imbalance problem by increasing the instances of minority classes; a large corpus was used to extract Word2Vec embedding vectors representing tweet contents. A notable exploration of DL methodologies was conducted by Ajao et al. (2018), employing recurrent neural networks (RNN) and long short-term memory (LSTM) models for social media text classification. This research achieved 82% accuracy on the dataset also used in previous works (Hassan et al. 2018; Lorek et al. 2015). The advantage of DL models lies in their automatic feature extraction, eliminating the need for manual crafting of features; however, their substantial requirement for labeled training data remains a drawback.
In another direction, Hassan et al. (2020) proposed a novel approach using an N-gram model for misinformation detection in tweets. N-gram features, which illuminate word relationships and their context within a sentence, were applied to both the Arabic dataset from El Ballouli et al. (2017) and the PHEME dataset. Remarkably, the N-grams model exhibited superior performance on the PHEME and CAT datasets compared to the LSTM DL model presented by Ajao et al. (2018) for fake tweet detection: on the PHEME dataset, the N-grams-based model surpassed LSTM by 48% in F-measure and 2% in accuracy.
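Word-level N-gram extraction of the kind discussed here is straightforward; a minimal sketch:

```python
def word_ngrams(text, n=2):
    """Extract contiguous word n-grams, which expose local context
    (word order and co-occurrence) to a downstream classifier."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = word_ngrams("breaking news about the election", n=2)
# -> ["breaking news", "news about", "about the", "the election"]
```

In practice such N-grams are typically vectorized as counts or TF-IDF weights before being passed to an ML classifier.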
This emphasizes the effectiveness of relying primarily on text features for optimal performance. The advantage of N-gram features lies in their ease of extraction from textual content. Moreover, unlike representations such as word embeddings, which require extensive data corpora during training, N-gram features capture words and phrases directly, enabling the model to grasp the contextual information surrounding each word.

This paper aims to fill the existing research gaps by studying the effect of the back-translation DA technique on the fake/misinformation detection task. The study presents various representations of tweet content to identify the most effective one. Moreover, it leverages the high performance of the Arab-BERT model to enhance detection performance compared to traditional machine-learning algorithms.