2024 | Book

Advances in Intelligent Data Analysis XXII

22nd International Symposium on Intelligent Data Analysis, IDA 2024, Stockholm, Sweden, April 24–26, 2024, Proceedings, Part II


About this book

The two-volume set LNCS 14641 and 14642 constitutes the proceedings of the 22nd International Symposium on Intelligent Data Analysis, IDA 2024, which was held in Stockholm, Sweden, during April 24–26, 2024.

The 40 full and 3 short papers included in the proceedings were carefully reviewed and selected from 94 submissions. IDA is an international symposium presenting advances in the intelligent analysis of data. Distinguishing characteristics of IDA are its focus on novel, inspiring ideas, its emphasis on research, and its relatively small scale.

Table of Contents

Frontmatter

Temporal and Sequence Data

Frontmatter
Kernel Corrector LSTM
Abstract
Forecasting methods are affected by data quality issues in two ways: first, the affected data are harder to predict, and second, they may degrade the model when it is updated with new data. The latter issue is usually addressed by pre-processing the data to remove those issues. An alternative approach has recently been proposed, Corrector LSTM (cLSTM), a Read & Write Machine Learning (RW-ML) algorithm that changes the data while learning in order to improve its predictions. Despite promising reported results, cLSTM is computationally expensive, as it uses a meta-learner to monitor the hidden states of the LSTM. We propose a new RW-ML algorithm, Kernel Corrector LSTM (KcLSTM), that replaces the meta-learner of cLSTM with a simpler method: kernel smoothing. We empirically evaluate the forecasting accuracy and training time of the new algorithm and compare it with cLSTM and LSTM. Results indicate that KcLSTM decreases the training time while maintaining competitive forecasting accuracy.
Rodrigo Tuna, Yassine Baghoussi, Carlos Soares, João Mendes-Moreira
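The kernel-smoothing step that KcLSTM substitutes for cLSTM's meta-learner can be sketched with a plain Nadaraya-Watson smoother over a series; this is an illustrative stand-in, not the authors' implementation, and the Gaussian kernel and bandwidth are arbitrary choices:

```python
import math

def kernel_smooth(series, bandwidth=2.0):
    """Nadaraya-Watson smoothing of a 1-D series with a Gaussian kernel.

    Each point is replaced by a distance-weighted average of the whole
    series, damping suspicious values such as isolated spikes.
    """
    n = len(series)
    smoothed = []
    for i in range(n):
        weights = [math.exp(-((i - j) / bandwidth) ** 2 / 2) for j in range(n)]
        total = sum(weights)
        smoothed.append(sum(w * x for w, x in zip(weights, series)) / total)
    return smoothed
```

A constant series passes through unchanged, while an isolated spike is pulled toward its neighbours, which is the "corrective" effect being exploited.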
Unsupervised Representation Learning for Smart Transportation
Abstract
In the automotive industry, sensors collect data that contain valuable driving information. The collected datasets are multivariate time series (MTS) that are noisy, non-stationary, lengthy, and unlabeled, making them difficult to analyze and model. To understand driving behavior at specific times of operation, we employ an unsupervised representation learning method. We present Temporal Neighborhood Coding for Maneuvering (TNC4maneuvering), which aims to understand maneuverability in smart transportation data via a use case of bivariate accelerations from three operation days out of 2.5 years of driving. Our method proves capable of extracting meaningful maneuver states as representations. We evaluate them on various downstream tasks, including time-series classification, clustering, and multi-linear regression. Moreover, we propose methods for pruning the sizes of representations, along with a window-size optimization algorithm. Our results show that TNC4maneuvering can generalize over longer temporal dependencies, although scalability and speedup present challenges.
Thabang Lebese, Cécile Mattrand, David Clair, Jean-Marc Bourinet, François Deheeger
T-DANTE: Detecting Group Behaviour in Spatio-Temporal Trajectories Using Context Information
Abstract
The present study addresses the group detection problem using spatio-temporal data. This study relies on modeling contextual information embedded in the trajectories of surrounding agents as well as temporal dynamics in the trajectories of the agent of interest to determine if two agents belong to the same group. Specifically, our proposed method, called T-DANTE, builds upon the Deep Affinity Network (DANTE) [16] for Clustering Conversational Interactants using spatio-temporal data and extends it by incorporating Recurrent Neural Networks (RNN) (i.e., Long Short-term Memory (LSTM) and Gated Recurrent Unit (GRU)) to capture the temporal dynamics inherent in the trajectories of agents. Our ablation study demonstrates that including context information, combined with temporal dynamics, yields promising results for the group detection task across five real-world pedestrian and five simulation datasets using two common evaluation metrics, namely Group Correctness and Group Mitre metrics. Moreover, in the comparative study, the proposed method outperformed three state-of-the-art baselines in terms of the group correctness metric by at least 17.97% for pedestrian datasets. Although some baselines perform better in simulation datasets, the difference is not statistically significant.
Maedeh Nasri, Thomas Maliappis, Carolien Rieffe, Mitra Baratchi

Statistical Learning

Frontmatter
Backward Inference in Probabilistic Regressor Chains with Distributional Constraints
Abstract
State-of-the-art approaches for multi-target prediction, such as Regressor Chains, can exploit interdependencies among the targets and model the outputs jointly, by flowing predictions from the first output to the last. While these models are very useful in applications where targets are highly interdependent and should be modeled jointly, they are, however, unable to answer queries when targets are not only mutually dependent but also subject to joint constraints over the output. In addition, existing models are unsuitable when certain target values are fixed or manually imputed prior to inference, since the flow of predictions cannot cascade backward from an already-imputed output. Here we present a solution to this problem in the form of a backward inference algorithm for Regressor Chains based on Metropolis-Hastings sampling. We evaluate the proposed approach via different metrics using both synthetic and real-world data. We show that our approach notably reduces errors when compared to traditional marginal inference methods that overlook joint modeling. Furthermore, we show that the proposed method can provide useful insights into a problem in conservation science: predicting the distribution of potential natural vegetation.
Ekaterina Antonenko, Michael Mechenich, Rita Beigaitė, Indrė Žliobaitė, Jesse Read
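The Metropolis-Hastings backbone of such backward inference can be sketched generically: sample an earlier target from an unnormalized density that conditions on a later, already-fixed target. The toy chain below (a Gaussian prior on y1 and a fixed observation y2 = 0.8*y1 + noise) is purely illustrative and is not the paper's model:

```python
import math
import random

def metropolis_hastings(log_target, x0, n_samples=4000, step=0.5, seed=0):
    """Random-walk Metropolis-Hastings over a 1-D unnormalized log-density."""
    rng = random.Random(seed)
    x, samples = x0, []
    lp = log_target(x)
    for _ in range(n_samples):
        prop = x + rng.gauss(0.0, step)       # symmetric Gaussian proposal
        lp_prop = log_target(prop)
        if math.log(rng.random()) < lp_prop - lp:  # accept/reject
            x, lp = prop, lp_prop
        samples.append(x)
    return samples
```

Backward inference for a chain then amounts to choosing `log_target` as the joint log-density of the earlier target and the imputed later one, and reading off the posterior over the earlier target from the samples.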
Empirical Comparison Between Cross-Validation and Mutation-Validation in Model Selection
Abstract
Mutation validation (MV) is a recently proposed approach for model selection, garnering significant interest due to its unique characteristics and potential benefits compared to the widely used cross-validation (CV) method. In this study, we empirically compared MV and k-fold CV using benchmark and real-world datasets. Employing Bayesian tests, we compared generalization estimates, yielding three posterior probabilities: practical equivalence, CV superiority, and MV superiority. We also evaluated the differences in the capacity of the selected models and in computational efficiency. We found that both MV and CV select models with practically equivalent generalization performance across various machine learning algorithms and the majority of benchmark datasets. MV exhibited advantages in terms of selecting simpler models and lower computational costs. However, in some cases MV selected overly simplistic models, leading to underfitting, and showed instability in hyperparameter selection. These limitations of MV became more evident in the evaluation of a real-world neuroscientific task of predicting sex at birth using brain functional connectivity.
Jinyang Yu, Sami Hamdan, Leonard Sasse, Abigail Morrison, Kaustubh R. Patil
Amplified Contribution Analysis for Federated Learning
Abstract
The problem of establishing a client's marginal contribution is essential to any decentralised machine-learning process that relies on the participation of remote agents. Detecting harmful participants on an ongoing basis can constitute a significant challenge, as one can obtain only a very limited amount of information from the external environment in order not to break the privacy assumption that underlies the federated learning paradigm. In this work, we present an Amplified Contribution Function - a set of aggregation operations performed on the gradients received by the central orchestrator that allows it to non-intrusively investigate the risk of accepting a given set of gradients dispatched from a remote agent. Our proposed method is distinguished by a high degree of interpretability and interoperability, as it supports the vast majority of currently available federated techniques and algorithms. It is also characterised by space and time complexity similar to that of the leave-one-out method - a common baseline for all deletion and sensitivity analytics tools.
Maciej Krzysztof Zuziak, Salvatore Rinzivillo

Data Mining

Frontmatter
Monitoring Concept Drift in Continuous Federated Learning Platforms
Abstract
Continuous federated learning (CFL), a recently emerging learning paradigm that facilitates collaborative, yet privacy-preserving machine learning (ML), bears the potential to shape the future of distributed ML. In spite of its great potential, it is - similar to continuous ML - prone to suffer from concept drift (a change in data properties over time). In turn, CFL can greatly benefit from employing drift detection to react adequately to emerging drifts. Although various such approaches exist, respective research lacks application of drift detection to CFL with dynamic client participation, as well as detailed analysis of the advantages of different drift detection approaches, such as error-based or data-based drift detection. To this end, we apply these drift detection approaches to a CFL platform that allows new clients to join even after the training has started, and measure the negative impact of concept drift on model performance. Moreover, we uncover distinct differences between error- and data-based drift detection. In particular, we find the former to be more suitable for detecting the point in time where the joint model stops benefiting from concept drift, whereas the latter allows for a more precise detection of the first occurrence of concept drift.
Christoph Düsing, Philipp Cimiano
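As a concrete instance of error-based detection, a Page-Hinkley test over the model's error stream is a common choice; the detector and its thresholds below are illustrative, not necessarily what the platform in the paper uses:

```python
class PageHinkley:
    """Error-based concept-drift detector (Page-Hinkley test).

    Flags drift when the cumulative deviation of the error from its
    running mean rises too far above its historical minimum.
    Thresholds are arbitrary illustrative values.
    """
    def __init__(self, delta=0.005, threshold=5.0):
        self.delta, self.threshold = delta, threshold
        self.mean, self.cum, self.min_cum, self.n = 0.0, 0.0, 0.0, 0

    def update(self, error):
        self.n += 1
        self.mean += (error - self.mean) / self.n      # running mean of errors
        self.cum += error - self.mean - self.delta     # cumulative deviation
        self.min_cum = min(self.min_cum, self.cum)
        return self.cum - self.min_cum > self.threshold
```

Data-based detection would instead monitor the input distribution itself (e.g. with a statistical two-sample test per feature), which is what allows it to catch drift before the error degrades.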
S+t-SNE - Bringing Dimensionality Reduction to Data Streams
Abstract
We present S+t-SNE, an adaptation of the t-SNE algorithm designed to handle infinite data streams. The core idea behind S+t-SNE is to update the t-SNE embedding incrementally as new data arrives, ensuring scalability and adaptability to handle streaming scenarios. By selecting the most important points at each step, the algorithm ensures scalability while keeping informative visualisations. By employing a blind method for drift management, the algorithm adjusts the embedding space, which facilitates the visualisation of evolving data dynamics. Our experimental evaluations demonstrate the effectiveness and efficiency of S+t-SNE, whilst highlighting its ability to capture patterns in a streaming scenario. We hope our approach offers researchers and practitioners a real-time tool for understanding and interpreting high-dimensional data.
Pedro C. Vieira, João P. Montrezol, João T. Vieira, João Gama
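The incremental idea can be illustrated with a simple out-of-sample heuristic: place an arriving point at the distance-weighted average of its nearest neighbours' embedded positions. S+t-SNE itself updates the embedding by optimising the t-SNE objective; this sketch only conveys the streaming insertion step and is not the paper's algorithm:

```python
def embed_new_point(new_x, X_old, Y_old, k=3):
    """Place a new high-dimensional point into an existing 2-D embedding.

    Position = distance-weighted average of the k nearest neighbours'
    embedded coordinates. A common out-of-sample heuristic, used here
    purely to illustrate incremental insertion.
    """
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

    order = sorted(range(len(X_old)), key=lambda i: dist(new_x, X_old[i]))[:k]
    weights = [1.0 / (dist(new_x, X_old[i]) + 1e-12) for i in order]
    total = sum(weights)
    return tuple(
        sum(w * Y_old[i][d] for w, i in zip(weights, order)) / total
        for d in range(2)
    )
```

A full streaming method additionally needs the two ingredients the abstract names: selection of the most important points to bound memory, and drift management to shift the embedding space as the distribution evolves.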
λ-DBSCAN: Augmenting DBSCAN with Prior Knowledge
Abstract
State-of-the-art density based cluster algorithms offer remarkable speed and robustness. However, they do not allow the user to make local changes without affecting the global outcome. The user thus has to choose between clustering a local region well or keeping the global result.
We present a new approach, λ-DBSCAN, which augments the DBSCAN algorithm to include local a priori knowledge. The parameters can be specified per observation, rather than globally, which enables the user to include local knowledge about the data without modifying other regions. Furthermore, we define regions in the data that should be affected by certain parameter choices, to reduce the workload for a user.
Joel Dierkes, Daniel Stelter, Christian Braune
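The core idea of per-observation parameters can be sketched as a DBSCAN variant in which eps and min_pts are arrays indexed by observation. This illustrates the concept only; the paper's exact neighborhood and merging rules differ:

```python
def dbscan_local(points, eps, min_pts):
    """DBSCAN where eps and min_pts are given per observation.

    labels: cluster id per point, or -1 for noise. Sketch of the
    per-observation-parameter idea, not the authors' algorithm.
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    n = len(points)
    # neighborhood of i uses i's own radius eps[i]
    neighbors = [[j for j in range(n) if dist(points[i], points[j]) <= eps[i]]
                 for i in range(n)]
    labels = [None] * n   # None = unvisited, -1 = noise
    cid = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts[i]:
            labels[i] = -1          # not a core point under its own min_pts
            continue
        labels[i] = cid
        stack = list(neighbors[i])
        while stack:
            j = stack.pop()
            if labels[j] == -1:
                labels[j] = cid     # noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cid
            if len(neighbors[j]) >= min_pts[j]:
                stack.extend(neighbors[j])   # expand only from core points
        cid += 1
    return labels
```

Tightening eps for a handful of observations then re-segments only their region, leaving clusters elsewhere untouched, which is the local-control property the abstract describes.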
Putting Sense into Incomplete Heterogeneous Data with Hypergraph Clustering Analysis
Abstract
Many industrial scenarios are concerned with the exploration of high-dimensional heterogeneous data sets originating from diverse sources and often incomplete, i.e., containing a substantial amount of missing values. This paper proposes a novel unsupervised method that efficiently facilitates the exploration and analysis of such data sets. The methodology combines multi-layer data analysis with shared nearest neighbor similarity and hypergraph clustering in an exploratory workflow. It produces overlapping homogeneous clusters, i.e., clusters whose assets are assumed to exhibit comparable behavior. The latter can be used for computing relevant KPIs per cluster for the purpose of performance analysis and comparison. More concretely, such KPIs have the potential to aid domain experts in monitoring and understanding asset performance and, subsequently, enable the identification of outliers and the timely detection of performance degradation.
Vishnu Manasa Devagiri, Pierre Dagnely, Veselka Boeva, Elena Tsiporkova

Optimization

Frontmatter
Efficient Lookahead Decision Trees
Abstract
Conventionally, decision trees are learned using a greedy approach, beginning at the root and moving toward the leaves. At each internal node, the feature that yields the best data split is chosen based on a metric like information gain. This process can be regarded as evaluating the quality of the best depth-one subtree. To address the shortsightedness of this method, one can generalize it to greater depths. Lookahead trees have demonstrated strong performance in situations with high feature interaction or low signal-to-noise ratios. They constitute a good trade-off between optimal decision trees and purely greedy decision trees. Currently, there are no readily available tools for constructing these lookahead trees, and their computational cost can be significantly higher than that of purely greedy ones. In this study, we introduce an efficient implementation of lookahead decision trees, specifically LGDT, by adapting a recently introduced algorithmic concept from the MurTree approach to find optimal decision trees of depth two. Additionally, we utilize an efficient reversible sparse bitset data structure to store the filtered examples while expanding the tree nodes in a depth-first-search manner. Experiments on state-of-the-art datasets demonstrate that our implementation offers remarkable computation-time performance.
Harold Kiossou, Pierre Schaus, Siegfried Nijssen, Gaël Aglin
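The lookahead principle itself is easy to state: score each candidate root split by the error of the best subtree of limited depth beneath it, rather than by immediate impurity gain. A brute-force sketch for binary features and labels follows; LGDT's efficiency instead comes from the adapted MurTree depth-two algorithm and reversible sparse bitsets, so this is for illustration only:

```python
def lookahead_root_split(X, y, depth=2):
    """Pick the root feature whose best subtree of the given depth
    minimises misclassifications. Binary features/labels, brute force."""
    n_feats = len(X[0])

    def best_error(rows, d):
        ys = [y[i] for i in rows]
        leaf = min(sum(ys), len(ys) - sum(ys))   # majority-vote error
        if d == 0 or leaf == 0:
            return leaf
        best = leaf
        for f in range(n_feats):
            left = [i for i in rows if X[i][f] == 0]
            right = [i for i in rows if X[i][f] == 1]
            if not left or not right:
                continue
            best = min(best, best_error(left, d - 1) + best_error(right, d - 1))
        return best

    rows = list(range(len(y)))
    scores = {}
    for f in range(n_feats):
        left = [i for i in rows if X[i][f] == 0]
        right = [i for i in rows if X[i][f] == 1]
        if left and right:
            scores[f] = best_error(left, depth - 1) + best_error(right, depth - 1)
    return min(scores, key=scores.get)
```

On XOR-like data, where no single split has any immediate gain, lookahead still identifies the interacting features, which is exactly the high-feature-interaction regime the abstract mentions.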
Learning Curve Extrapolation Methods Across Extrapolation Settings
Abstract
Learning curves are important for decision-making in supervised machine learning. They show how the performance of a machine learning model develops as a function of a given resource. In this work, we consider learning curves that describe the performance of a machine learning model as a function of the number of data points used for training. It is often useful to extrapolate learning curves, which can be done by fitting a parametric model to the observed values, or by using an extrapolation model trained on learning curves from similar datasets. We perform an extensive analysis comparing these two methods across different observations and prediction objectives. Depending on the setting, different extrapolation methods perform best. When only a small initial segment of the learning curve has been observed, we find that it is better to rely on learning curves from similar datasets. Once more observations have been made, a parametric model, or just the last observation, should be used. Moreover, using a parametric model is mostly useful when the exact value of the final performance itself is of interest.
Lionel Kielhöfer, Felix Mohr, Jan N. van Rijn
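A minimal example of the parametric route is fitting a two-parameter power law, err = a * n^b, by least squares in log-log space and then evaluating it at an unseen training-set size. Real pipelines use richer parametric families and proper curve fitting; this only illustrates the mechanism:

```python
import math

def fit_power_law(sizes, scores):
    """Fit err = a * n**b by linear least squares on (log n, log err)
    and return a callable extrapolator. Assumes positive scores."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(s) for s in scores]
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - b * mx)
    return lambda size: a * size ** b
```

The alternative route the abstract compares against would instead train a predictor on whole learning curves collected from similar datasets and query it with the observed prefix.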
Efficient NAS with FaDE on Hierarchical Spaces
Abstract
Neural architecture search (NAS) is a challenging problem. Hierarchical search spaces allow for cheap evaluations of neural network submodules, which serve as surrogates for full architecture evaluations. Yet, sometimes the hierarchy is too restrictive or the surrogate fails to generalize. We present FaDE, which uses differentiable architecture search to obtain relative performance predictions on finite regions of a hierarchical NAS space. The relative nature of these ranks calls for a memory-less, batch-wise outer search algorithm, for which we use an evolutionary algorithm with pseudo-gradient descent. FaDE is especially suited to deep hierarchical, multi-cell search spaces, which it can explore at linear instead of exponential cost, thereby eliminating the need for a proxy search space.
Our experiments show that firstly, FaDE-ranks on finite regions of the search space correlate with corresponding architecture performances and secondly, the ranks can empower a pseudo-gradient evolutionary search on the complete neural architecture search space.
Simon Neumeyer, Julian Stier, Michael Granitzer
Investigating the Relation Between Problem Hardness and QUBO Properties
Abstract
Combinatorial optimization problems, integral to various scientific and industrial applications, often vary significantly in their complexity and computational difficulty. Transforming such problems into Quadratic Unconstrained Binary Optimization (Qubo) has regained considerable research attention in recent decades due to the central role of Qubo in Quantum Annealing. This work aims to shed some light on the relationship between a problem's hardness and the properties of its Qubo formulation. In particular, we examine how the spectral gap of the Qubo formulation correlates with properties of the original problem, since the gap has an impact on how efficiently the Qubo can be solved on quantum computers. We analyze two well-known problems from Machine Learning, namely Clustering and Support Vector Machine (SVM) training, regarding the spectral gaps of their respective Qubo counterparts. An empirical evaluation provides interesting insights, showing that the spectral gap of Clustering Qubo instances positively correlates with data separability, while for SVM Qubo the opposite is true.
Thore Gerlach, Sascha Mücke
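For small instances, the classical spectral gap of a Qubo, i.e. the difference between its two lowest distinct energies E(x) = x^T Q x over binary vectors x, can be computed by brute-force enumeration. This is exponential in n and only meant to make the quantity concrete:

```python
from itertools import product

def spectral_gap(Q):
    """Gap between the two lowest distinct energies of x^T Q x,
    enumerated over all binary vectors x. Exponential in n."""
    n = len(Q)
    energies = set()
    for x in product((0, 1), repeat=n):
        e = sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
        energies.add(e)
    ordered = sorted(energies)
    return ordered[1] - ordered[0] if len(ordered) > 1 else 0.0
```

In quantum annealing, a small minimum gap forces slower annealing schedules, which is why the gap serves as a proxy for how efficiently an instance can be solved.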

XAI

Frontmatter
Example-Based Explanations of Random Forest Predictions
Abstract
A random forest prediction can be computed by the scalar product of the labels of the training examples and a set of weights that are determined by the leaves of the forest into which the test object falls; each prediction can hence be explained exactly by the set of training examples for which the weights are non-zero. The number of examples used in such explanations is shown to vary with the dimensionality of the training set and the hyperparameters of the random forest algorithm. This means that the number of examples involved in each prediction can to some extent be controlled by varying these parameters. However, for settings that lead to a required predictive performance, the number of examples involved in each prediction may be unreasonably large, preventing the user from grasping the explanations. In order to provide more useful explanations, a modified prediction procedure is proposed, which includes only the top-weighted examples. An investigation on regression and classification tasks shows that the number of examples used in each explanation can be substantially reduced while maintaining, or even improving, predictive performance compared to the standard prediction procedure.
Henrik Boström
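The identity the paper builds on can be made concrete: given each training example's leaf id per tree and a test point's leaf ids, the forest prediction equals the scalar product of the training labels with the weights below. This sketches the weight computation only, not the paper's modified top-weighted procedure:

```python
def example_weights(train_leaves, test_leaves):
    """Weights w such that forest_prediction = sum_i w[i] * y[i].

    train_leaves[i][t] = leaf id of training example i in tree t;
    test_leaves[t]     = leaf id of the test object in tree t.
    Each tree contributes 1/|leaf| to every co-resident example,
    averaged over trees, so the weights sum to 1.
    """
    n_train = len(train_leaves)
    n_trees = len(test_leaves)
    w = [0.0] * n_train
    for t in range(n_trees):
        members = [i for i in range(n_train) if train_leaves[i][t] == test_leaves[t]]
        for i in members:
            w[i] += 1.0 / (len(members) * n_trees)
    return w
```

The explanation for a prediction is then simply the training examples with non-zero (or, in the modified procedure, top-ranked) weight.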
FLocalX - Local to Global Fuzzy Explanations for Black Box Classifiers
Abstract
The need for explanations of new, complex machine learning models has caused the rise and growth of the field of eXplainable Artificial Intelligence. Different explanation types arise, such as local explanations, which focus on the classification of a particular instance, or global explanations, which aim to give a global overview of the inner workings of the model. In this paper, we propose FLocalX, a framework that builds a fuzzy global explanation expressed in terms of fuzzy rules, using local explanations as a starting point and a metaheuristic optimization process to obtain the result. An initial experimentation has been carried out with a genetic algorithm as the optimization process. Across several datasets, black-box algorithms and local explanation methods, FLocalX has been tested in terms of both the fidelity of the resulting global explanation and its complexity. The results show that FLocalX successfully generates short and understandable global explanations that accurately imitate the classifier.
Guillermo Fernandez, Riccardo Guidotti, Fosca Giannotti, Mattia Setzu, Juan A. Aledo, Jose A. Gámez, Jose M. Puerta
Interpretable Quantile Regression by Optimal Decision Trees
Abstract
The field of machine learning is subject to an increasing interest in models that are not only accurate but also interpretable and robust, thus allowing their end users to understand and trust AI systems. This paper presents a novel method for learning a set of optimal quantile regression trees. The advantages of this method are that (1) it provides predictions about the complete conditional distribution of a target variable without prior assumptions on this distribution; (2) it provides predictions that are interpretable; (3) it learns a set of optimal quantile regression trees without compromising algorithmic efficiency compared to learning a single tree.
Valentin Lemaire, Gaël Aglin, Siegfried Nijssen
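The criterion underlying quantile regression trees is the pinball (quantile) loss: in each leaf, the optimal constant prediction at level tau is the empirical tau-quantile of the targets there. A sketch of the loss only, not of the optimal-tree search:

```python
def pinball_loss(y_true, pred, tau):
    """Mean quantile (pinball) loss at level tau for a constant prediction.

    Under-predictions are weighted by tau, over-predictions by 1 - tau,
    so minimising it pushes pred toward the tau-quantile of y_true.
    """
    total = 0.0
    for y in y_true:
        diff = y - pred
        total += tau * diff if diff >= 0 else (tau - 1) * diff
    return total / len(y_true)
```

Fitting one tree per quantile level (or a set of them jointly, as the paper does optimally) yields the interpretable, distribution-free conditional-distribution predictions described above.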

Open Access

SLIPMAP: Fast and Robust Manifold Visualisation for Explainable AI
Abstract
We propose a new supervised manifold visualisation method, slipmap, that finds local explanations for complex black-box supervised learning methods and creates a two-dimensional embedding of the data items such that data items with similar local explanations are embedded nearby. This work extends and improves our earlier algorithm and addresses its shortcomings: poor scalability, inability to make predictions, and a tendency to find patterns in noise. We present our visualisation problem and provide an efficient GPU-optimised library to solve it. We experimentally verify that slipmap is fast and robust to noise, provides explanations that are on par with or better than those of other local explanation methods, and is usable in practice.
Anton Björklund, Lauri Seppäläinen, Kai Puolamäki
A Frank System for Co-Evolutionary Hybrid Decision-Making
Abstract
We introduce Frank, a human-in-the-loop system for co-evolutionary hybrid decision-making that aids the user in labeling records from an unlabeled dataset. Frank employs incremental learning to “evolve” in parallel with the user’s decisions, by training an interpretable machine learning model on the records labeled by the user. Furthermore, it advances state-of-the-art approaches by offering inconsistency controls, explanations, fairness checks, and bad-faith safeguards simultaneously. We evaluate our proposal by simulating the users’ behavior with various levels of expertise and reliance on Frank’s suggestions. The experiments show that Frank’s intervention leads to improvements in the accuracy and the fairness of the decisions.
Federico Mazzoni, Riccardo Guidotti, Alessio Malizia

Industrial Challenge

Frontmatter
Predicting the Failure of Component X in the Scania Dataset with Graph Neural Networks
Abstract
We use Graph Neural Networks on signature-augmented graphs derived from time series for Predictive Maintenance. With this technique, we propose a solution to the Intelligent Data Analysis Industrial Challenge 2024 on the newly released SCANIA Component X dataset. We describe an Exploratory Data Analysis and preprocessing of the dataset, proposing improvements for its description in the SCANIA paper.
Maurizio Parton, Andrea Fois, Michelangelo Vegliò, Carlo Metta, Marco Gregnanin
Towards Contextual, Cost-Efficient Predictive Maintenance in Heavy-Duty Trucks
Abstract
Predictive maintenance is a crucial yet challenging task in many industrial applications. This work explores a large repository of existing techniques and approaches to process historical data and predict if an asset is at risk of failure. In particular, the operational condition and specification of Scania trucks in heavy-duty applications is considered as part of the IDA 2024 Industrial Challenge.
Louis Carpentier, Arne De Temmerman, Mathias Verbeke
Implementing Deep Learning Models for Imminent Component X Failures Prediction in Heavy-Duty Scania Trucks
Abstract
This paper explores the application of predictive maintenance (PdM) in vehicle management, focusing on improving performance and reliability of critical truck components. By leveraging a newly acquired, comprehensive real-world dataset, the study aims to develop machine learning models for accurately predicting component failures. The dataset, sourced from the Symposium on Intelligent Data Analysis (IDA 2024), includes multivariate time series data from an anonymized engine component of a fleet of trucks, featuring operational data, repair records, and specifications. The research employs advanced deep learning techniques like Convolutional and Recurrent Neural Networks, including Long Short-Term Memory (LSTM) networks, to identify patterns indicative of potential failures. This initiative aims to optimize maintenance interventions, resource allocation, and fleet management by predicting the time or class of potential failures, thereby reducing downtime and maintenance costs.
Jie Zhong, Zhenkan Wang
Backmatter
Metadata
Title
Advances in Intelligent Data Analysis XXII
Editors
Ioanna Miliou
Nico Piatkowski
Panagiotis Papapetrou
Copyright Year
2024
Electronic ISBN
978-3-031-58553-1
Print ISBN
978-3-031-58555-5
DOI
https://doi.org/10.1007/978-3-031-58553-1
