2024 | Book

Advances in Knowledge Discovery and Data Mining

28th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2024, Taipei, Taiwan, May 7–10, 2024, Proceedings, Part III

Edited by: De-Nian Yang, Xing Xie, Vincent S. Tseng, Jian Pei, Jen-Wei Huang, Jerry Chun-Wei Lin

Publisher: Springer Nature Singapore

Book Series: Lecture Notes in Computer Science

About this book

The 6-volume set LNAI 14645-14650 constitutes the proceedings of the 28th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2024, which took place in Taipei, Taiwan, during May 7–10, 2024.

The 177 papers presented in these proceedings were carefully reviewed and selected from 720 submissions. They deal with new ideas, original research results, and practical development experiences from all KDD related areas, including data mining, data warehousing, machine learning, artificial intelligence, databases, statistics, knowledge engineering, big data technologies, and foundations.

Table of Contents

Frontmatter

Interpretability and Explainability

Frontmatter
Neural Additive and Basis Models with Feature Selection and Interactions

Deep neural networks (DNNs) exhibit attractive performance in various fields but often suffer from low interpretability. The neural additive model (NAM) and its variant, the neural basis model (NBM), use neural networks (NNs) as nonlinear shape functions in generalized additive models (GAMs). Both models are highly interpretable and exhibit good performance and flexibility in NN training. Owing to their GAM-based architectures, NAM and NBM can provide and visualize the contribution of each feature to the prediction. However, when using two-input NNs to consider feature interactions or when applying them to high-dimensional datasets, training NAM and NBM becomes intractable due to the increased computational resources required. This paper proposes incorporating a feature selection mechanism into NAM and NBM to resolve these computational bottlenecks. We introduce a feature selection layer in both models and update the selection weights during training. Our method is simple and reduces computational costs and model sizes compared to vanilla NAM and NBM. In addition, it enables the use of two-input NNs even on high-dimensional datasets to capture feature interactions. We demonstrate that the proposed models are computationally efficient compared to vanilla NAM and NBM, and they exhibit better or comparable performance to state-of-the-art GAMs.

Yasutoshi Kishimoto, Kota Yamanishi, Takuya Matsuda, Shinichi Shirakawa
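
As a rough illustration of the feature-selection mechanism described in the abstract above, the following PyTorch sketch gates each input feature with a trainable selection weight and routes only the top-k gated features to per-feature shape functions. The class name, gating scheme, and top-k routing are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FeatureSelectionNAM(nn.Module):
    """Hypothetical NAM-style additive model with a trainable selection layer."""

    def __init__(self, n_features: int, n_selected: int, hidden: int = 32):
        super().__init__()
        self.n_selected = n_selected
        # One trainable selection logit per input feature.
        self.selection_logits = nn.Parameter(torch.zeros(n_features))
        # One small shape-function network per selected slot.
        self.shape_fns = nn.ModuleList(
            nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))
            for _ in range(n_selected)
        )
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gates = torch.sigmoid(self.selection_logits)  # soft gates in (0, 1)
        # Keep only the k most strongly gated features, cutting the number of
        # shape functions from n_features to n_selected.
        topk = torch.topk(gates, self.n_selected).indices.tolist()
        out = self.bias.expand(x.shape[0]).clone()
        for slot, j in enumerate(topk):
            # Multiplying by the gate keeps selection weights trainable end to end.
            out = out + gates[j] * self.shape_fns[slot](x[:, j:j + 1]).squeeze(-1)
        return out
```
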
Random Mask Perturbation Based Explainable Method of Graph Neural Networks

Graph Neural Networks (GNNs) have garnered considerable attention due to their potential applications across multiple domains. However, enhancing their interpretability remains a significant challenge for crucial applications. This paper proposes an innovative node perturbation-based method to explain GNNs and unveil their decision-making processes. Categorized as a black-box method, it generates explanations of node importance solely through input-output analysis of the model, obviating the need for internal access. The method employs fidelity as the metric for calculating the significance of perturbation masks and utilizes a sparsity threshold to filter the computation results. Furthermore, recognizing the impact of different node combinations on model prediction outcomes, we treat the mask as a random variable. By randomly sampling various masks, we compute perturbed node importance, facilitating the generation of user-friendly explanations. Comparative experiments and ablation studies conducted on both real and synthetic datasets substantiate the efficacy of our approach in interpreting GNNs. Additionally, through a case study, we visually demonstrate the method's compelling interpretative evidence regarding model predictions.

Xinyue Yang, Hai Huang, Xingquan Zuo
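
The black-box perturbation loop described above can be sketched compactly: sample random node masks, zero out the masked nodes, measure the fidelity (drop in the predicted class probability), and average that fidelity over the masks each node appears in. Here `predict_proba` and the zero-feature perturbation are assumptions for illustration; the paper's sparsity filtering step is omitted.

```python
import numpy as np

def random_mask_importance(predict_proba, features, adj,
                           n_samples=500, mask_prob=0.3, rng=None):
    """Black-box node importance via random mask perturbation (sketch)."""
    rng = np.random.default_rng(rng)
    n_nodes = features.shape[0]
    base = predict_proba(features, adj)        # prediction on the intact graph
    target = int(np.argmax(base))              # class being explained
    scores = np.zeros(n_nodes)
    counts = np.zeros(n_nodes)
    for _ in range(n_samples):
        mask = rng.random(n_nodes) < mask_prob  # random node subset (the random variable)
        perturbed = features.copy()
        perturbed[mask] = 0.0                   # perturb the masked nodes
        # Fidelity: how much the target-class probability drops when the
        # masked node subset is removed.
        fidelity = base[target] - predict_proba(perturbed, adj)[target]
        scores[mask] += fidelity
        counts[mask] += 1
    return scores / np.maximum(counts, 1)       # average importance per node
```
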
RouteExplainer: An Explanation Framework for Vehicle Routing Problem

The Vehicle Routing Problem (VRP) is a widely studied combinatorial optimization problem that has been applied to various practical problems. Although explainability is important for improving the reliability and interactivity of practical VRP applications, it remains unexplored for VRP. In this paper, we propose RouteExplainer, a post-hoc explanation framework that explains the influence of each edge in a generated route. Our framework realizes this by rethinking a route as a sequence of actions and extending counterfactual explanations based on the action influence model to VRP. To enhance the explanations, we additionally propose an edge classifier that infers the intention of each edge, a loss function to train the edge classifier, and explanation-text generation by Large Language Models (LLMs). We quantitatively evaluate our edge classifier on four different VRPs. The results demonstrate its rapid computation while maintaining reasonable accuracy, highlighting its potential for deployment in practical applications. Moreover, on a tourist route, we qualitatively evaluate explanations generated by our framework. This evaluation not only validates our framework but also shows the synergy between explanation frameworks and LLMs. See https://ntt-dkiku.github.io/xai-vrp for code, appendices, and a demo.

Daisuke Kikuta, Hiroki Ikeuchi, Kengo Tajiri, Yuusuke Nakano
On the Efficient Explanation of Outlier Detection Ensembles Through Shapley Values

Feature bagging models have proven their practical usability in various contexts, among them outlier detection, where they build ensembles to reliably assign outlier scores to data samples. However, the interpretability of the resulting outlier detection methods is far from achieved. Among the standard approaches to interpreting black-box models, Shapley values clarify the roles of individual inputs. However, Shapley values suffer from high computational runtimes that make them practical only in fairly low-dimensional applications. We propose bagged Shapley values, a method to achieve interpretability of feature bagging ensembles, especially for outlier detection. The method not only assigns local importance scores to each feature of the initial space, helping to increase interpretability, but also solves the computational issue; specifically, the bagged Shapley values can be computed exactly in polynomial time.

Simon Klüttermann, Chiara Balestra, Emmanuel Müller
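
A generic sketch of the idea, under the assumption that each bag holds only a few features: exact Shapley values are enumerated within each low-dimensional bag and then averaged per original feature over the bags that contain it. Enumeration is exponential in the bag size, so this is polynomial overall only for small, fixed bag sizes; the paper's exact scheme may differ.

```python
from itertools import combinations
from math import factorial
import numpy as np

def exact_shapley(value_fn, n_features):
    """Exact Shapley values of a low-dimensional coalition value function."""
    phi = np.zeros(n_features)
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                w = factorial(len(S)) * factorial(n_features - len(S) - 1) / factorial(n_features)
                phi[i] += w * (value_fn(set(S) | {i}) - value_fn(set(S)))
    return phi

def bagged_shapley(bags, detectors, x, background):
    """bags: list of feature-index tuples; detectors: matching score functions."""
    total = np.zeros(len(x))
    counts = np.zeros(len(x))
    for feat_idx, score in zip(bags, detectors):
        def value_fn(S, feat_idx=feat_idx, score=score):
            # Features outside coalition S take background (reference) values.
            z = background[list(feat_idx)].copy()
            for pos in S:
                z[pos] = x[feat_idx[pos]]
            return score(z)
        phi = exact_shapley(value_fn, len(feat_idx))
        for pos, f in enumerate(feat_idx):
            total[f] += phi[pos]   # credit flows back to the original feature
            counts[f] += 1
    return total / np.maximum(counts, 1)
```
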
Interpreting Pretrained Language Models via Concept Bottlenecks

Pretrained language models (PLMs) have made significant strides in various natural language processing tasks. However, the lack of interpretability due to their “black-box” nature poses challenges for responsible implementation. Although previous studies have attempted to improve interpretability by using, e.g., attention weights in self-attention layers, these weights often lack clarity, readability, and intuitiveness. In this research, we propose a novel approach to interpreting PLMs by employing high-level, meaningful concepts that are easily understandable for humans. For example, we learn the concept of “Food” and investigate how it influences the prediction of a model’s sentiment towards a restaurant review. We introduce C$$^3$$M, which combines human-annotated and machine-generated concepts to extract hidden neurons designed to encapsulate semantically meaningful and task-specific concepts. Through empirical evaluations on real-world datasets, we show that our approach offers valuable insights to interpret PLM behavior, helps diagnose model failures, and enhances model robustness amidst noisy concept labels.

Zhen Tan, Lu Cheng, Song Wang, Bo Yuan, Jundong Li, Huan Liu
Unmasking Dementia Detection by Masking Input Gradients: A JSM Approach to Model Interpretability and Precision

The evolution of deep learning and artificial intelligence has significantly reshaped technological landscapes. However, their effective application in crucial sectors such as medicine demands more than superior performance: it demands trustworthiness as well. While interpretability plays a pivotal role, existing explainable AI (XAI) approaches often do not reveal Clever Hans behavior, where a model makes (ungeneralizable) correct predictions using spurious correlations or biases in the data. Likewise, current post-hoc XAI methods are susceptible to generating unjustified counterfactual examples. In this paper, we approach XAI with an innovative model debugging methodology realized through the Jacobian Saliency Map (JSM). To cast the problem into a concrete context, we employ Alzheimer’s disease (AD) diagnosis as the use case, motivated by its significant impact on human lives and the formidable challenge of its early detection, stemming from the intricate nature of its progression. We introduce an interpretable, multimodal model for AD classification over its multi-stage progression, incorporating JSM as a modality-agnostic tool that provides insights into volumetric changes indicative of brain abnormalities. Our extensive evaluation, including an ablation study, demonstrates the efficacy of using JSM for model debugging and interpretation, while significantly enhancing model accuracy as well.

Yasmine Mustafa, Tie Luo
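
Computing a Jacobian saliency map for a single input is a short autograd exercise; the following minimal PyTorch sketch shows only the saliency computation, not the paper's masking and debugging workflow.

```python
import torch

def jacobian_saliency(model, x, target_class):
    """Gradient of the target-class logit w.r.t. every input voxel/pixel."""
    x = x.clone().detach().requires_grad_(True)
    logits = model(x.unsqueeze(0)).squeeze(0)  # add/remove the batch dimension
    logits[target_class].backward()            # backprop from one class score
    # The magnitude of the input gradient serves as the saliency map.
    return x.grad.detach().abs()
```
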
Towards Nonparametric Topological Layers in Neural Networks

Various topological techniques and tools have been applied to neural networks to study network complexity, explainability, and performance. One fundamental assumption of this line of research is the existence of a global (Euclidean) coordinate system upon which the topological layer is constructed. Despite promising results, such topologization methods have yet to be widely adopted because parametrizing a topological layer takes considerable time and lacks a theoretical foundation, leading to suboptimal performance and a lack of explainability. This paper proposes a learnable topological layer for neural networks that does not require a Euclidean space. Instead, the proposed construction relies on a general metric space, specifically a Hilbert space equipped with an inner product. As a result, the parametrization of the proposed topological layer is free of user-specified hyperparameters, eliminating the costly parametrization stage and the corresponding risk of suboptimal networks. Experimental results on three popular datasets demonstrate the effectiveness of the proposed approach.

Gefei Shen, Dongfang Zhao

Online, Streaming, Distributed Algorithms

Frontmatter
Streaming Fair k-Center Clustering over Massive Dataset with Performance Guarantee

Emerging applications impose challenges for incorporating fairness constraints into k-center clustering in the streaming setting. Unlike the traditional k-center problem, the fairness constraints require that the input points be divided into disjoint groups and that the number of centers from each group be bounded by a given upper limit. Motivated by applications of fair k-center to massive datasets, we consider the problem in the streaming setting, where data points arrive in a stream and each point is processed upon arrival. As the main contribution, we propose a two-pass streaming algorithm for the fair k-center problem with two groups, achieving an approximation ratio of $$3+\epsilon$$ while consuming only $$O(k\log n)$$ memory and O(k) update time, matching the state-of-the-art ratio for the offline setting. We then show that the algorithm can be easily adapted into a one-pass streaming algorithm with an approximation ratio of $$7+\epsilon$$ and the same memory complexity and update time. Moreover, our algorithm can be simply tuned to handle an arbitrary number of groups while achieving the same ratio and space complexity. Lastly, we carry out extensive experiments to evaluate the practical performance of our algorithm against state-of-the-art algorithms.

Zeyu Lin, Longkun Guo, Chaoqi Jia
Projection-Free Bandit Convex Optimization over Strongly Convex Sets

Projection-free algorithms for bandit convex optimization have received increasing attention due to their ability to handle bandit feedback and complicated constraints simultaneously. The state-of-the-art ones achieve an expected regret bound of $$O(T^{3/4})$$. However, they need to utilize a blocking technique, which is unsatisfying in practice due to the delayed reaction to changes in the functions, and which results in a logarithmically worse high-probability regret bound of $$O(T^{3/4}\sqrt{\log T})$$. In this paper, we study the special case of bandit convex optimization over strongly convex sets and present a projection-free algorithm that keeps the $$O(T^{3/4})$$ expected regret bound without employing the blocking technique. More importantly, we prove that it enjoys an $$O(T^{3/4})$$ high-probability regret bound, which removes the logarithmic factor from the previous high-probability bound. Furthermore, empirical results on synthetic and real-world datasets demonstrate the better performance of our algorithm.

Chenxu Zhang, Yibo Wang, Peng Tian, Xiao Cheng, Yuanyu Wan, Mingli Song
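
For intuition, here is a heavily simplified sketch of the projection-free bandit template this line of work builds on: a one-point gradient estimate from a single function value (in the spirit of Flaxman et al.), followed by a linear-optimization step over a Euclidean ball, one example of a strongly convex set. The step sizes, the update rule, and the omission of feasible-set shrinking are all simplifications, not the paper's algorithm.

```python
import numpy as np

def bandit_frank_wolfe(loss, d, T, radius=1.0, delta=0.1, eta=0.05, rng=None):
    """One-point bandit feedback + linear optimization over a ball (sketch)."""
    rng = np.random.default_rng(rng)
    x = np.zeros(d)
    for t in range(T):
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)              # uniform direction on the unit sphere
        y = x + delta * u                   # the single point we may query
        f = loss(y, t)                      # bandit feedback: one scalar value
        g = (d / delta) * f * u             # one-point gradient estimator
        # Linear minimization oracle over the ball: no projection required.
        # (Real algorithms also shrink the set so that y stays feasible.)
        v = -radius * g / (np.linalg.norm(g) + 1e-12)
        x = (1 - eta) * x + eta * v         # convex-combination update stays feasible
    return x
```
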
Adaptive Prediction Interval for Data Stream Regression

Prediction Intervals (PIs) are a powerful technique for quantifying the uncertainty of regression tasks. However, research on PIs for data streams has received little attention. Moreover, traditional PI-generating approaches are not directly applicable due to the dynamic and evolving nature of data streams. This paper presents AdaPI (ADAptive Prediction Interval), a novel method that automatically adjusts the interval width by an appropriate amount based on historical information, so that coverage converges to a user-defined percentage. AdaPI can be applied to any streaming PI technique as a post-processing step. This paper also develops an incremental variant of the pervasive Mean and Variance Estimation (MVE) method for use with AdaPI. An empirical evaluation over a set of standard streaming regression tasks demonstrates AdaPI’s ability to generate compact prediction intervals with coverage close to the desired level, outperforming alternative methods.

Yibin Sun, Bernhard Pfahringer, Heitor Murilo Gomes, Albert Bifet
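
The adaptive-width idea lends itself to a small sketch: after each labeled instance arrives, widen the interval if running coverage falls below the target and shrink it otherwise. The multiplicative update and step size below are assumptions; the paper's adjustment rule is more refined.

```python
class AdaptiveInterval:
    """Post-processing sketch: steer interval width toward a coverage target."""

    def __init__(self, target=0.9, step=0.01, width=1.0):
        self.target, self.step, self.width = target, step, width
        self.n, self.covered = 0, 0

    def interval(self, y_pred):
        # Symmetric interval around the point prediction.
        return y_pred - self.width, y_pred + self.width

    def update(self, y_pred, y_true):
        lo, hi = self.interval(y_pred)
        self.n += 1
        self.covered += int(lo <= y_true <= hi)
        coverage = self.covered / self.n
        # Widen when under-covering, shrink when over-covering.
        self.width *= 1 + self.step * (1 if coverage < self.target else -1)
```
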
Probabilistic Guarantees of Stochastic Recursive Gradient in Non-convex Finite Sum Problems

This paper develops a new dimension-free Azuma-Hoeffding type bound on the norm of the sum of a martingale difference sequence with random individual bounds. With this novel result, we provide high-probability bounds for the gradient norm estimator in the proposed algorithm Prob-SARAH, a modified version of the StochAstic Recursive grAdient algoritHm (SARAH), a state-of-the-art variance-reduced algorithm that achieves optimal computational complexity in expectation for the finite sum problem. The in-probability complexity of Prob-SARAH matches the best in-expectation result up to logarithmic factors. Empirical experiments demonstrate the superior probabilistic performance of Prob-SARAH on real datasets compared to other popular algorithms.

Yanjie Zhong, Jiaqi Li, Soumendra Lahiri
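
For reference, the core SARAH recursion that Prob-SARAH modifies is $$v_t = \nabla f_{i_t}(x_t) - \nabla f_{i_t}(x_{t-1}) + v_{t-1}$$, anchored by a full gradient at the start of each epoch. The sketch below shows one epoch of plain SARAH on a finite sum $$f(x) = \frac{1}{n}\sum_i f_i(x)$$; Prob-SARAH's probabilistic stopping checks are omitted.

```python
import numpy as np

def sarah_epoch(grad_full, grad_i, x, n, m, eta, rng=None):
    """grad_full(x): full gradient; grad_i(i, x): gradient of one component."""
    rng = np.random.default_rng(rng)
    v = grad_full(x)            # anchor: exact gradient at the epoch start
    x_prev = x.copy()
    x = x - eta * v
    for _ in range(m):
        i = rng.integers(n)
        # Recursive estimator: new stochastic difference plus the old estimate.
        v = grad_i(i, x) - grad_i(i, x_prev) + v
        x_prev = x.copy()
        x = x - eta * v
    return x
```
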
Rethinking Personalized Federated Learning with Clustering-Based Dynamic Graph Propagation

Most existing personalized federated learning approaches rely on intricate designs that often require complex implementation and tuning. To address this limitation, we propose a simple yet effective personalized federated learning framework. Specifically, during each communication round, we group clients into multiple clusters on the server side based on their model training status and data distribution. We then consider each cluster center as a node equipped with model parameters and construct a graph that connects these nodes using weighted edges. Additionally, we update the model parameters at each node by propagating information across the entire graph. Subsequently, we design a precise personalized model distribution strategy that allows clients to obtain the most suitable model from the server side. We conduct experiments on three image benchmark datasets and create synthetic structured datasets with three types of topologies. Experimental results demonstrate the effectiveness of the proposed FedCedar.

Jiaqi Wang, Yuzhong Chen, Yuhang Wu, Mahashweta Das, Hao Yang, Fenglong Ma
Unveiling Backdoor Risks Brought by Foundation Models in Heterogeneous Federated Learning

Foundation models (FMs) have been used to generate synthetic public datasets for the heterogeneous federated learning (HFL) problem, where each client uses a unique model architecture. However, the vulnerabilities of integrating FMs, especially against backdoor attacks, are not well explored in HFL contexts. In this paper, we introduce a novel backdoor attack mechanism for HFL that circumvents the need for client compromise or ongoing participation in the FL process. This method plants and transfers the backdoor through a generated synthetic public dataset, which can evade existing backdoor defenses in FL by presenting normal client behavior. Empirical experiments across different HFL configurations and benchmark datasets demonstrate the effectiveness of our attack compared to traditional client-based attacks. Our findings reveal significant security risks in developing robust FM-assisted HFL systems. This research contributes to enhancing the safety and integrity of FL systems, highlighting the need for advanced security measures in the era of FMs. The source code is available at https://github.com/lixi1994/backdoor_FM_hete_FL .

Xi Li, Chen Wu, Jiaqi Wang
Combating Quality Distortion in Federated Learning with Collaborative Data Selection

Federated Learning (FL), a paradigm facilitating collaborative model training across distributed devices, has attracted substantial attention due to its potential to address privacy concerns and data localization requirements. However, the inherent inaccessibility of data poses a critical challenge to ensuring data quality within FL systems. Consequently, FL systems grapple with a range of data-related issues, encompassing erroneous samples, imbalanced data distributions, and data skew, all of which significantly impact model performance. The judicious selection of appropriate training data is therefore of paramount importance for ameliorating these challenges. This paper tackles a crucial but often overlooked concern: the presence of low-quality data samples. We introduce an innovative algorithm that strategically curates a subset of data for each training iteration, with the overarching objective of optimizing the model’s accuracy while simultaneously addressing privacy concerns and reducing communication costs. Our primary innovation lies in the global selection of data, in contrast to the conventional approach that relies on individualized, client-specific data selection. Furthermore, we introduce a novel medical dataset tailored specifically for classification tasks. This dataset intentionally incorporates various attributes associated with low-quality data to effectively replicate real-world conditions. Through rigorous empirical evaluation on this dataset, we show the effectiveness of our algorithm. The results demonstrate a notable improvement of approximately 2-3% in model performance, particularly in scenarios characterized by imbalanced data distributions.

Duc Long Nguyen, Phi Le Nguyen, Thao Nguyen Truong

Probabilistic Models and Statistical Inference

Frontmatter
Neural Marked Hawkes Process for Limit Order Book Modeling

Streams of various order types submitted to financial exchanges can be modeled with multivariate Temporal Point Processes (TPPs). The multivariate Hawkes process has been the predominant choice for this purpose. To jointly model various order types together with their volumes, the framework is extended to the multivariate Marked Hawkes Process by considering order volumes as marks. Rich empirical evidence suggests that the volume distributions exhibit temporal dependencies and multimodality. However, the existing literature employs simple distributions for the volumes and assumes that they are independent of the history or depend only on the latest observation. To address these limitations, we present the Neural Marked Hawkes Process (NMHP), whose key idea is to condition the mark distributions on a history vector embedded with the Neural Hawkes Process architecture. To ensure the flexibility of the mark distributions, we propose and evaluate two promising choices: univariate Conditional Normalizing Flows and the Mixture Density Network. The utility of NMHP is demonstrated on large-scale real-world limit order book data for three popular futures listed on the Korea Exchange. To the best of our knowledge, this is the first work to incorporate complex, history-dependent order volume distributions into multivariate TPPs of order book dynamics.

Guhyuk Chung, Yongjae Lee, Woo Chang Kim
How Large Corpora Sizes Influence the Distribution of Low Frequency Text n-grams

The prediction of the numbers of distinct word n-grams and their frequency distributions in text corpora is important in domains like information processing and language modelling. With big data corpora, application complexity increases due to the large volume of data. Traditional studies have been confined to small or moderately sized corpora, leading to statistical laws on word frequency distributions. However, for very large corpora, some of the assumptions underlying those laws, related to the corpus vocabulary and the numbers of word occurrences, need to be revised. So, although it becomes critical to know how corpus size influences those distributions, there is a lack of models characterising that influence. This paper aims at filling this gap, enabling the prediction of the impact of corpus growth on application time and space complexities. It presents a fully principled model which, distinctively, considers words and multiwords in very large corpora, predicting the cumulative numbers of distinct n-grams at or above a given frequency in a corpus, as well as the sizes of equal-frequency n-gram groups, from unigrams to hexagrams, as a function of corpus size in a language, assuming a finite n-gram vocabulary. The model applies to low occurrence frequencies, which encompass the larger populations of n-grams. Practical assessment with real corpora shows relative errors around $$3\%$$, stable over the considered ranges of n-gram frequencies, n-gram sizes, and corpus sizes from millions to billions of words, for English and French.

Joaquim F. Silva, Jose C. Cunha
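
The quantities the model predicts, counts of distinct n-grams at or above a given frequency, can be computed empirically with a few lines of Python; the paper models them analytically, so the sketch below is only a companion utility.

```python
from collections import Counter

def ngram_frequency_profile(tokens, n, max_freq=10):
    """Cumulative counts of distinct n-grams occurring at least k times."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    freqs = Counter(grams.values())  # frequency -> number of distinct n-grams
    return {k: sum(c for f, c in freqs.items() if f >= k)
            for k in range(1, max_freq + 1)}

# Usage sketch:
# profile = ngram_frequency_profile("the cat sat on the mat".split(), 2)
```
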
Meta-Reinforcement Learning Algorithm Based on Reward and Dynamic Inference

Meta-Reinforcement Learning aims to rapidly address unseen tasks that share similar structures. However, the agent heavily relies on a large amount of experience during the meta-training phase, making high sample efficiency a formidable challenge. Current methods typically adapt to novel tasks within the Meta-Reinforcement Learning framework through task inference. Unfortunately, these approaches still exhibit limitations when faced with high-complexity task spaces. In this paper, we propose a Meta-Reinforcement Learning method based on reward and dynamic inference. We introduce independent reward and dynamic inference encoders, which sample specific context information to capture the deep-level features of task goals and dynamics. By reducing the task inference space, the agent effectively learns the shared structures across tasks and acquires a profound understanding of the differences between tasks. We illustrate the performance degradation caused by high task inference complexity and demonstrate that our method outperforms previous algorithms in terms of sample efficiency.

Jinhao Chen, Chunhong Zhang, Zheng Hu

Security and Privacy

Frontmatter
SecureBoost: Large Scale and High-Performance Vertical Federated Gradient Boosting Decision Tree

Gradient boosting decision tree (GBDT) is an ensemble machine learning algorithm that is widely used in industry. Due to the problem of data isolation and privacy requirements, many works try to use vertical federated learning to train machine learning models collaboratively between different data owners. SecureBoost is one of the most popular vertical federated learning algorithms for GBDT. However, to achieve privacy preservation, SecureBoost involves complex training procedures and time-consuming cryptographic operations, which make it slow to train and unable to scale to large-scale data. In this work, we propose SecureBoost+, a large-scale and high-performance vertical federated gradient boosting decision tree framework. SecureBoost+ is secure in the semi-honest model, the same threat model as SecureBoost, yet scales to tens of millions of data samples far faster than SecureBoost. SecureBoost+ achieves high performance through several novel optimizations over SecureBoost, including ciphertext operation optimization and the introduction of new training mechanisms. The experimental results show that SecureBoost+ is 6-35x faster than SecureBoost with the same accuracy, and can be scaled up to tens of millions of data samples and thousands of feature dimensions.

Tao Fan, Weijing Chen, Guoqiang Ma, Yan Kang, Lixin Fan, Qiang Yang
Construct a Secure CNN Against Gradient Inversion Attack

Federated learning enables collaborative model training across multiple clients without sharing raw data, in line with privacy regulations: clients send model updates (gradients) to a central server, where they are aggregated to improve a global model. Despite its benefits, federated learning faces threats from gradient inversion attacks, which can reconstruct private data from gradients. Traditional defenses, including cryptography, differential privacy, and perturbation techniques, offer protection but may reduce computational efficiency and model performance. Thus, in this paper, we introduce Secure Convolutional Neural Networks (SecCNN), a novel approach that embeds an upsampling layer into CNNs to inherently defend against gradient inversion attacks. SecCNN leverages Rank Analysis for enhanced security without sacrificing model accuracy or incurring significant computational costs. Our results demonstrate SecCNN’s effectiveness in securing federated learning against privacy breaches, thereby building trust among participants and advancing secure collaborative learning.

Yu-Hsin Liu, Yu-Chun Shen, Hsi-Wen Chen, Ming-Syan Chen
Backdoor Attack Against One-Class Sequential Anomaly Detection Models

Deep anomaly detection on sequential data has garnered significant attention due to its wide range of application scenarios. However, deep learning-based models face a critical security threat: their vulnerability to backdoor attacks. In this paper, we explore compromising deep sequential anomaly detection models by proposing a novel backdoor attack strategy. The attack comprises two primary steps: trigger generation and backdoor injection. Trigger generation derives imperceptible triggers by crafting perturbed samples from benign normal data such that the perturbed samples remain normal. Backdoor injection then plants the triggers so as to compromise the model only for samples carrying them. The experimental results demonstrate the effectiveness of our proposed attack strategy by injecting backdoors into two well-established one-class anomaly detection models.

He Cheng, Shuhan Yuan

Semi-supervised and Unsupervised Learning

Frontmatter
DALLMi: Domain Adaption for LLM-Based Multi-label Classifier

Large language models (LLMs) increasingly serve as the backbone for classifying text associated with distinct domains and, simultaneously, several labels (classes). When encountering domain shifts, e.g., a classifier of movie reviews moving from IMDb to Rotten Tomatoes, adapting such an LLM-based multi-label classifier is challenging due to incomplete label sets at the target domain and daunting training overhead. Existing domain adaptation methods address either image multi-label classifiers or text binary classifiers. In this paper, we design DALLMi, Domain Adaptation Large Language Model interpolator, a first-of-its-kind semi-supervised domain adaptation method for LLM-based text data models, specifically BERT. The core of DALLMi is the novel variation loss and MixUp regularization, which jointly leverage the limited positively labeled and large quantity of unlabeled text and, importantly, their interpolation from the BERT word embeddings. DALLMi also introduces a label-balanced sampling strategy to overcome the imbalance between labeled and unlabeled data. We evaluate DALLMi against partially-supervised and unsupervised approaches on three datasets under different scenarios of label availability for the target domain. Our results show that DALLMi achieves higher mAP than the unsupervised and partially-supervised approaches by 19.9% and 52.2%, respectively.

Miruna Bețianu, Abele Mălan, Marco Aldinucci, Robert Birke, Lydia Chen
Contrastive Learning for Unsupervised Sentence Embedding with False Negative Calibration

Contrastive learning, a transformative approach to unsupervised sentence embedding, fundamentally works by amplifying similarity within positive samples and suppressing it among negative ones. However, a subtle issue in contrastive learning is the occurrence of false negatives, where semantically similar samples are treated as negatives, hurting the semantics of the sentence embedding. To address this, we propose a framework called FNC (False Negative Calibration) to alleviate the influence of false negatives. Our approach employs two strategies: false negative elimination and reuse. Specifically, during training, our method eliminates false negatives by clustering and comparing semantic similarity. Next, we reuse the eliminated false negatives to construct new positive pairs that boost contrastive learning performance. Our experiments on seven semantic textual similarity tasks demonstrate that our approach is more effective than competitive baselines.

Chi-Min Chiu, Ying-Jia Lin, Hung-Yu Kao
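
A schematic sketch of the two strategies, using a similarity threshold on an in-batch similarity matrix as a stand-in for the paper's clustering step: suspected false negatives are masked out of the InfoNCE denominator (elimination) and then pulled toward the anchor as extra positives (reuse). The threshold and the reuse weight below are assumptions.

```python
import torch
import torch.nn.functional as F

def fnc_contrastive_loss(z1, z2, tau=0.05, fn_threshold=0.8, reuse_weight=0.1):
    """z1, z2: (batch, dim) embeddings of two views of the same sentences."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    sim = z1 @ z2.T / tau                         # temperature-scaled similarities
    batch = sim.shape[0]
    eye = torch.eye(batch, dtype=torch.bool, device=sim.device)
    # Off-diagonal pairs with very high cosine similarity are suspected
    # false negatives (threshold stands in for the clustering step).
    false_neg = (sim * tau > fn_threshold) & ~eye
    # Elimination: drop them from the InfoNCE denominator.
    logits = sim.masked_fill(false_neg, float("-inf"))
    loss = F.cross_entropy(logits, torch.arange(batch, device=sim.device))
    # Reuse: treat detected false negatives as extra positives.
    if false_neg.any():
        loss = loss - reuse_weight * (sim * tau)[false_neg].mean()
    return loss
```
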
Recovering Population Dynamics from a Single Point Cloud Snapshot

Discovering population dynamics from point cloud data has experienced increased popularity in various applications, including GPS behavior prediction, multi-target tracking, and single cell analysis. Existing methods require data in multiple time periods. However, to address privacy concerns and observational restrictions, our method estimates trajectories solely from a single snapshot without time series information or features other than coordinates. We propose a model that recovers vector fields by solving an optimal transport problem and introducing the smoothness of point movements as regularization terms. Experiments with point cloud data generated from typical vector fields show that our method can accurately recover the original vector fields and predict the trajectories at arbitrary coordinates from just one point cloud snapshot.

Yuki Wakai, Koh Takeuchi, Hisashi Kashima
SAWTab: Smoothed Adaptive Weighting for Tabular Data in Semi-supervised Learning

Self-supervised and semi-supervised learning (SSL) on tabular data is an understudied topic. Despite some attempts, two major challenges remain: 1) the imbalanced nature of tabular datasets; 2) the one-hot encoding used in these methods becomes inefficient for high-cardinality categorical features. To cope with these challenges, we propose SAWTab, which uses a target encoding method, Conditional Probability Representation (CPR), for efficient representation of categorical features in the input space. We improve this representation by incorporating unlabeled samples through pseudo-labels. Furthermore, we propose a Smooth Adaptive Weighting mechanism in the target encoding to mitigate the issue of noisy and biased pseudo-labels. Experimental results on various datasets and comparisons with existing frameworks show that SAWTab yields the best test accuracy on all datasets. We find that pseudo-labels can help improve the input space representation in the SSL setting, which enhances the generalization of the learning algorithm.

Morteza Mohammady Gharasuie, Fengjiao Wang, Omar Sharif, Ravi Mukkamala
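
Smoothed target encoding of the kind the abstract's Conditional Probability Representation builds on can be sketched in a few lines of pandas; the smoothing constant and the blend with the global prior below are standard choices, not necessarily the paper's exact weighting, and the pseudo-label machinery is omitted.

```python
import pandas as pd

def smoothed_target_encode(values: pd.Series, target: pd.Series, m: float = 10.0):
    """Replace each category with a smoothed conditional target rate."""
    prior = target.mean()                       # global target rate
    stats = target.groupby(values).agg(["mean", "count"])
    # Categories with few observations shrink toward the global prior.
    enc = (stats["count"] * stats["mean"] + m * prior) / (stats["count"] + m)
    return values.map(enc).fillna(prior)        # unseen categories get the prior

# Usage sketch: df["city_enc"] = smoothed_target_encode(df["city"], df["label"])
```
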

Big Data

Frontmatter
Improving Anti-money Laundering via Fourier-Based Contrastive Learning

Anti-money laundering (AML) aims to detect money laundering in daily transactions and is a key frontier in combating financial crimes. Previous deep-learning AML methods are not robust enough. To address this problem, we propose a novel Fourier-based contrastive learning model (FCLM) to improve AML. With contrastive learning, FCLM maintains prediction consistency and is more robust in the face of data perturbations. Experiments on both the synthetic benchmark IBM2023 and a real-world benchmark show that FCLM outperforms seven state-of-the-art baselines, demonstrating the effectiveness of the proposed Fourier-based contrastive learning model.

Meihan Tong, Shuai Wang, Xinyu Chen, Jinsong Bei
A Novel SegNet Model for Crack Image Semantic Segmentation in Bridge Inspection

Cracks on bridge surfaces represent a significant defect that demands accurate and efficient inspection methods. However, current approaches for segmenting cracks suffer from low accuracy and slow detection speed, particularly when dealing with fine and small cracks that occupy only a few pixels. In this work, we propose a novel crack image semantic segmentation method based on an enhanced SegNet. The proposed approach addresses these challenges through three key innovations. First, we reduce the network depth to improve computational efficiency while maintaining accuracy. Furthermore, we employ ConvNeXt-V2 to effectively extract and fuse crack features, thereby improving segmentation performance. To handle pixel imbalance during loss calculation, we integrate the Dice coefficient into the original cross-entropy loss function. Experimental results demonstrate that our enhanced SegNet achieves remarkable improvements in mIoU for non-steel and steel crack segmentation tasks, reaching 82.37% and 77.26%, respectively. Our approach outperforms state-of-the-art methods in both inference speed and accuracy.

Rong Pang, Hao Tan, Yan Yang, Xun Xu, Nanqing Liu, Peng Zhang
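
Integrating the Dice coefficient into the cross-entropy loss, as described above, can be sketched for the binary case as follows; the weighting between the two terms is an assumption.

```python
import torch
import torch.nn.functional as F

def dice_ce_loss(logits, targets, dice_weight=0.5, eps=1e-6):
    """logits: (batch, 1, H, W); targets: same shape, values in {0, 1}."""
    ce = F.binary_cross_entropy_with_logits(logits, targets.float())
    probs = torch.sigmoid(logits)
    inter = (probs * targets).sum(dim=(1, 2, 3))
    union = probs.sum(dim=(1, 2, 3)) + targets.sum(dim=(1, 2, 3))
    dice = (2 * inter + eps) / (union + eps)   # per-image Dice coefficient
    # Imbalance-aware objective: cross-entropy plus (1 - Dice).
    return (1 - dice_weight) * ce + dice_weight * (1 - dice.mean())
```
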
Graph-based Dynamic Preference Modeling for Personalized Recommendation

Sequential Recommendation (SR) predicts possible future behaviors by considering a user’s behavioral sequence. However, users’ preferences constantly change in practice and are difficult to track. Existing methods only consider neighbouring items and neglect the impact of non-adjacent items on user choices, so building an accurate recommendation model remains a complex challenge. We propose a novel Graph Neural Network (GNN) based model, Graph-based Dynamic Preference Modeling for Personalized Recommendation (DPPR). In DPPR, a graph attention network (GAT) learns the features of long-term preferences, while a short-term graph computes item dependencies through link propagation between items and attributes and adjusts node features according to the user’s view. The module emphasizes skip features among entity nodes and incorporates the time intervals between items to quantify the impact of non-adjacent items. Finally, we combine the two representations to generate user preferences and aid decisions. Experimental results indicate that our model outperforms state-of-the-art methods on three public datasets.

Jiaqi Wu, Yidan Xu, Bowen Zhang, Zekun Xu, Bohan Li
LEAF: A Less Expert Annotation Framework with Active Learning

Many modern ML applications rely on large amounts of labeled data, which can be difficult and time-consuming to obtain. Active Learning (AL) is an advanced solution to this problem: it enables efficient training with limited data while speeding up the labeling process and saving labor costs. However, existing AL methods primarily focus on optimizing the query sampling strategy for single-task, fixed-model scenarios, which is inefficient for real-world multi-task scenarios. In multi-task AL, optimizing hyperparameters across multiple models and coordinating multiple query strategies bring new challenges that demand additional labor. In this paper, we propose LEAF, a Less Expert Annotation Framework, to tackle these challenges and reduce the workload of both data experts and technical experts. In LEAF, we apply AutoML techniques to automatically optimize hyperparameters for multi-task, multi-model AL and design a heuristic adaptive query strategy for the multi-query setting. Experimental results on three publicly available datasets show that our framework requires fewer iterations and less training time, and achieves higher precision than conventional AL frameworks. Additionally, we present a detailed case study demonstrating the practical use and high quality of our framework for real-world data annotation tasks.

Aishan Maoliniyazi, Chaohong Ma, Xiaofeng Meng, Yingtao Peng
MLT-Trans: Multi-level Token Transformer for Hierarchical Image Classification

This paper focuses on Multi-level Hierarchical Classification (MLHC) of images, presenting a novel architecture that exploits the “[CLS]” (classification) token within transformers, which is often disregarded in computer vision tasks. Our primary goal is to utilize the information of every [CLS] token in a hierarchical manner. Toward this aim, we introduce the Multi-level Token Transformer (MLT-Trans). This model, trained with sharpness-aware minimization and a hierarchical loss function based on knowledge distillation, can be adapted to various transformer-based networks; we choose the Swin Transformer as the backbone model. Empirical results across diverse hierarchical datasets confirm the efficacy of our approach. The findings highlight the potential of combining transformers and [CLS] tokens, demonstrating improvements in hierarchical evaluation metrics and accuracy of up to 5.7% on the last level compared to the base network, thereby supporting the adoption of the MLT-Trans framework in MLHC.

Tanya Boone Sifuentes, Asef Nazari, Mohamed Reda Bouadjenek, Imran Razzak
Improving Knowledge Tracing via Considering Students’ Interaction Patterns

Knowledge Tracing (KT), which aims to accurately identify students’ evolving mastery of different concepts during their learning process, is a popular task for providing intelligent tutoring in online learning systems. Recent research has leveraged various variants of single-state recurrent neural networks to model the transition of students’ knowledge states. However, the interaction patterns implicit in students’ learning records are overlooked, even though they play an important role in reflecting students’ mental states and learning habits. Moreover, interaction patterns affect an individual’s self-efficacy and knowledge acquisition. To fill this gap, we propose the Interaction Pattern-Aware Knowledge Tracing (IPAKT) model, which uses two hidden states to model the knowledge state and interaction patterns separately. Specifically, we first extract the interaction patterns from two types of interaction responses: hints and time. These interaction patterns are then employed to regulate the update of the knowledge state. Extensive experiments on three common datasets demonstrate that our method achieves state-of-the-art performance. We further validate the design of IPAKT through ablation testing. Our code is available at https://github.com/SummerGua/IPAKT .

Shilong Shu, Liting Wang, Junhua Tian
MDAN: Multi-distribution Adaptive Networks for LTV Prediction

In industry, Customer Lifetime Value (LTV) represents the entire revenue generated by a single user within an application. Accurate LTV prediction can help marketers make more informed decisions about acquiring high-quality new users and increasing revenue. However, LTV prediction is a complex and challenging task, and the LTV of most application users is prone to bias and sparsity. To address these issues, this paper proposes Multi-Distribution Adaptive Networks (MDAN) for LTV prediction. For classification debiasing, we leverage multi-channel networks to learn disparate distributions simultaneously, with a Channel Learning Controller (CLC) advancing the learning of the different channels. For regression debiasing, a novel loss function called Distance Similarity Loss is introduced specifically for LTV prediction. This loss function distinguishes between the feature representations associated with different LTV values, improving the model’s ability to represent user characteristics. The MDAN framework has been successfully deployed in multiple applications within Tencent, leading to considerable increases in revenue. Extensive experiments on three million-level datasets, QB, YYB, and WeSing, demonstrate the superiority of the proposed method over state-of-the-art baselines such as the DNN, RankSim, ZILN, and ODMN models.

Wenshuang Liu, Guoqiang Xu, Bada Ye, Xinji Luo, Yancheng He, Cunxiang Yin
Backmatter
Metadata
Title
Advances in Knowledge Discovery and Data Mining
Edited by
De-Nian Yang
Xing Xie
Vincent S. Tseng
Jian Pei
Jen-Wei Huang
Jerry Chun-Wei Lin
Copyright Year
2024
Publisher
Springer Nature Singapore
Electronic ISBN
978-981-97-2259-4
Print ISBN
978-981-97-2261-7
DOI
https://doi.org/10.1007/978-981-97-2259-4
