2024 | Book

Web and Big Data

7th International Joint Conference, APWeb-WAIM 2023, Wuhan, China, October 6–8, 2023, Proceedings, Part III

Edited by: Xiangyu Song, Ruyi Feng, Yunliang Chen, Jianxin Li, Geyong Min

Publisher: Springer Nature Singapore

Book series: Lecture Notes in Computer Science

About this book

The 4-volume set LNCS 14331, 14332, 14333, and 14334 constitutes the refereed proceedings of the 7th International Joint Conference, APWeb-WAIM 2023, which took place in Wuhan, China, in October 2023.

A total of 138 papers were carefully reviewed and selected from 434 submissions for inclusion in the proceedings. They focus on innovative ideas, original research findings, case study results, and experience-based insights in the areas of the World Wide Web and big data, covering Web technologies, database systems, information management, software engineering, knowledge graphs, recommender systems, and big data.

Table of contents

Frontmatter
Adaptive Graph Attention Hashing for Unsupervised Cross-Modal Retrieval via Multimodal Transformers

Unsupervised cross-modal hashing retrieval has been extensively studied due to its advantages in storage, retrieval efficiency, and label independence. However, there are still two obstacles to existing unsupervised methods: (1) Existing unsupervised methods suffer from inaccurate similarity as simple features do not describe fine-grained multimodal relationships. (2) Existing methods suffer from unbalanced multimodal learning due to the different coding capabilities of different modal networks. To address these obstacles, we devised an effective Adaptive Graph Attention Hashing (AGAH) for unsupervised cross-modal retrieval. Firstly, we use the multimodal transformer model CLIP to extract cross-modal fine-grained features and exploit multiple data similarities to mine similar information from different perspectives in multi-modal data and perform similarity enhancement. In addition, we present an adaptive graph attention hashing module to assist in generating hash codes, which uses an attention mechanism to learn relation-based similarity from image-text modality. It aggregates the essential neighborhood message of neighboring data nodes through the graph neural networks to generate more discriminative hash codes. Sufficient experiments on three benchmark datasets demonstrate that the proposed AGAH outperforms existing advanced unsupervised cross-modal hashing methods.

Yewen Li, Mingyuan Ge, Yucheng Ji, Mingyong Li
Answering Property Path Queries over Federated RDF Systems

Property path queries, an important new query type in SPARQL 1.1, can retrieve graph nodes that satisfy complex conditions using more concise expressions than some SPARQL basic graph patterns. However, property path queries have so far been supported only in centralized RDF systems, not in federated RDF systems. In this paper, a MinDFA-based property path query method over federated RDF systems is proposed, called FPPQO (Federated Property Path Query Optimization). FPPQO first decomposes a federated property path query into multiple subqueries. Then, a Thompson-based MinDFA algorithm is used to construct the MinDFA corresponding to the property path expression of each subquery. Finally, a query execution strategy based on B-DFS is used to search for MinDFA matches. During the matching process, the possible circular matching problem is eliminated by adopting an alternating buffer marking mechanism. Experimental results on datasets of different sizes show that the proposed scheme is effective and scalable.

Ningchao Ge, Peng Peng, Jibing Wu, Lihua Liu, Haiwen Chen, Tengyun Wang
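
The core idea in the abstract above, matching a property path expression against graph edges with an automaton, can be illustrated on a single in-memory graph. The following is a minimal sketch, not the FPPQO method itself: the function name, the dictionary encoding of the DFA, and the plain visited set (standing in for the paper's B-DFS strategy and alternating buffer marking) are all assumptions made for illustration, and no federation or query decomposition is modelled.

```python
from collections import deque


def eval_property_path(triples, dfa, start_state, accept_states, source):
    """Return nodes reachable from `source` along a predicate sequence
    accepted by the DFA.

    triples: iterable of (subject, predicate, object)
    dfa: dict mapping (state, predicate) -> next state (hypothetical encoding)
    """
    # Adjacency list: subject -> [(predicate, object), ...]
    adj = {}
    for s, p, o in triples:
        adj.setdefault(s, []).append((p, o))

    results = set()
    if start_state in accept_states:
        # A zero-length path matches, e.g. for an expression like `p*`.
        results.add(source)

    # Search the product of graph nodes and DFA states; the visited set
    # guarantees termination on cyclic graphs.
    seen = {(source, start_state)}
    queue = deque([(source, start_state)])
    while queue:
        node, state = queue.popleft()
        for pred, nxt in adj.get(node, []):
            nxt_state = dfa.get((state, pred))
            if nxt_state is None:
                continue  # the DFA has no transition on this predicate
            if nxt_state in accept_states:
                results.add(nxt)
            if (nxt, nxt_state) not in seen:
                seen.add((nxt, nxt_state))
                queue.append((nxt, nxt_state))
    return results


# DFA for the property path `knows+` (one or more `knows` edges).
knows_plus = {(0, "knows"): 1, (1, "knows"): 1}
graph = [
    ("a", "knows", "b"),
    ("b", "knows", "c"),
    ("c", "knows", "a"),
    ("a", "likes", "d"),
]
reachable = eval_property_path(graph, knows_plus, 0, {1}, "a")
```

Tracking (node, state) pairs rather than nodes alone is what lets the search revisit a node under a different automaton state while still terminating on cycles.
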
Distributed Knowledge Graph Query Acceleration Algorithm

As the era of big data continues to evolve, the scale of knowledge data that needs to be processed in practice is enormous, and a single machine is incapable of handling queries on large-scale knowledge graph data. Therefore, distributed clusters are necessary to improve processing capability. Existing approaches all rest on the same core idea: splitting the large-scale graph data into multiple parts, distributing each part to a different machine for processing, and finally merging the results. However, these approaches suffer from two problems: (i) the intermediate result of knowledge graph merging is huge, far exceeding the final result itself, resulting in a lot of data transfer overhead during the distributed merging phase; (ii) the parallelism of the algorithms is limited to machine-level parallelism in task partitioning and lacks parallelism in the computational logic; the merging phase, for example, does not achieve good parallelism. To address these issues, we propose a distributed framework for offline index construction and online SPARQL query processing to achieve parallel accelerated processing. Our approach can more efficiently filter out candidate solutions that do not match the result, reducing the size of the results to be merged and leading to a reduction in computational and communication costs. Additionally, we introduce further parallelism in the mutual merging phase to improve computational efficiency and system throughput.

Peifan Shi, Youhuan Li, Wenjie Li, Xinhuan Chen
Truth Discovery of Source Dependency Perception in Dynamic Scenarios

In the era of big data, obtaining large amounts of data from different sources has become increasingly easy. However, conflicts may arise among the information provided by these sources. Therefore, various truth discovery methods have been proposed to solve this problem. In practical applications, information may be generated in chronological order, such as daily or hourly updates on weather conditions in a particular location. As a result, the truth of an object and the reliability of sources may dynamically change over time. Besides, there may be dependencies among data sources and the dependencies are stable in the short term. However, existing truth discovery methods for dynamic scenarios ignore the continuity of source dependencies in the short term. To address this issue, we study the source dependency detection and the problem of data sparsity caused by removing dependent sources in dynamic scenarios, and propose an incremental model based on source dependency detection, namely SDPTD, which can dynamically update object truth values and source weights and detect source dependencies when new data arrive. Experiments on two real-world datasets and synthetic datasets demonstrate the effectiveness and efficiency of our proposed method.

Xiu Fang, Chenling Shen, Guohao Sun, Hao Chen, Yating Tang
Truth Discovery Against Disguised Attack Mechanism in Crowdsourcing

Crowdsourcing is an effective paradigm for recruiting online workers to perform intelligent tasks that are difficult for computers to complete. More and more attacks bring challenges to crowdsourcing systems. Although the truth discovery method can defend against common attacks to a certain extent, the real scene is much more complex. Malicious workers can not only improve their reliability by agreeing with normal workers on tasks that are unlikely to be overturned, but also gather together to launch more effective attacks on tasks that are easily overturned. This disguised attack is smarter and harder to defend. To solve this problem, we propose a new defense framework TD-DA (Truth Discovery against Disguised Attack) composed of truth discovery and task allocation. In the truth discovery phase, we quantify the aggressiveness and reliability of workers on the golden task based on the sigmoid function. In the task allocation phase, the Weighted Arithmetic Mean (WAM) is used to estimate the allocation probability of golden tasks to avoid the shortage of golden tasks. Extensive experiments on real-world datasets and synthetic datasets demonstrate that our method is effective against disguised attacks.

Xiu Fang, Yating Tang, Guohao Sun, Chenling Shen, Hao Chen
Continuous Group Nearest Group Search over Streaming Data

Group nearest group query (GNG for short) is an important variant of NN search. Let $\mathcal{D}$ be a set of $d$-dimensional objects, and let $GQ\langle k, Q\rangle$ be a GNG query, where $Q$ is a set of $d$-dimensional query points. The target of GNG is to select $k$ object points $O_Q$ from $\mathcal{D}$ such that the total distance between the query points and their NNs in $O_Q$ is minimal. In this paper, we study GNG in a highly dynamic data environment, i.e., the continuous GNG query (CGNG for short) over a sliding window, which has many applications. To the best of our knowledge, this is the first study of the problem of CGNG over a sliding window. In this paper, we propose a novel framework named KMPT (short for K-Means Partition Tree-based framework) for supporting CGNG. The key idea behind KMPT is to partition the query points into a group of $k$ subsets, generate a group of $k$ virtual points based on the objects in these subsets, and reduce the CGNG problem to continuous NN search over a data stream. In order to efficiently support continuous NN search, we first partition the objects in the window into a group of sub-windows based on their arrival order. We then form a group of quad-tree based indexes to maintain the objects’ position information in each partition, form an R-tree based index to evaluate which objects have a chance to become query result objects in the near future, and finally achieve the goal of using a small number of objects to support query processing. Comprehensive experiments on both real and synthetic data sets demonstrate the superiority of the framework in both efficiency and quality.

Rui Zhu, Chunhong Li, Anzhen Zhang, Chuanyu Zong, Xiufeng Xia
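
The GNG objective stated in the abstract above (choose $k$ objects so that the summed distance from each query point to its nearest chosen object is minimal) can be made concrete with an exhaustive baseline. This is a minimal sketch for a static object set, assuming Euclidean distance; the function names are hypothetical, and it is neither the KMPT framework nor a sliding-window algorithm.

```python
import math
from itertools import combinations


def gng_cost(queries, chosen):
    """Total distance from each query point to its nearest chosen object."""
    return sum(min(math.dist(q, o) for o in chosen) for q in queries)


def gng_brute_force(objects, queries, k):
    """Try every size-k subset of `objects` and keep the cheapest one.

    Exponential in k; only meant to pin down the objective on tiny inputs.
    """
    best = min(combinations(objects, k), key=lambda c: gng_cost(queries, c))
    return list(best), gng_cost(queries, best)


objects = [(0.0, 0.0), (10.0, 10.0), (5.0, 5.0)]
queries = [(0.0, 1.0), (10.0, 9.0)]
best_set, best_cost = gng_brute_force(objects, queries, k=2)
```

A baseline of this kind is mainly useful as a correctness oracle when checking a faster, index-based method on small inputs.
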
Approximate Continuous Skyline Queries over Memory Limitation-Based Streaming Data

Continuous skyline query over a sliding window is an important problem over streaming data. The query returns all skyline objects to the system whenever the window slides. Existing efforts include exact algorithms and approximate algorithms. Their key idea is to find objects that cannot become skyline objects before they expire from the window, delete them, and use the remaining objects to support query processing. However, the space cost of all existing efforts is high, and they cannot work over memory-limited streaming data, which is a common environment in real applications. In this paper, we define a novel query named the $\rho$-approximate continuous skyline query ($\rho$-ACSQ), which returns error-bounded answers to the system. Here, $\rho$ is a threshold that bounds the error ratio between approximate and exact results. In order to support $\rho$-ACSQ, we propose a novel framework named $\rho$-SEAK (short for $\rho$-Self-adaptive Error-based Approximate Skyline). It can self-adaptively adjust $\rho$ based on the distribution of the streaming data, and achieves the goal of supporting $\rho$-ACSQ over memory-limited streaming data. Theoretical analysis indicates that even in the worst case, both the running cost and the space cost of $\rho$-SEAK are independent of the data scale.

Yunzhe An, Zhu Zhen, Rui Zhu, Tao Qiu, Xiufeng Xia
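
For readers unfamiliar with the underlying query, the exact continuous skyline over a count-based sliding window can be sketched as follows. This is a naive baseline that recomputes the skyline on every slide and stores the whole window, which is exactly the cost the abstract above seeks to avoid; it is not $\rho$-SEAK. The function names and the convention that smaller values are better in every dimension are assumptions.

```python
from collections import deque


def dominates(a, b):
    """a dominates b if a is no worse in every dimension and strictly
    better in at least one (assuming smaller values are better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(points):
    """Naive O(n^2) skyline: keep the points dominated by no other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def sliding_skylines(stream, window_size):
    """Yield the exact skyline after each arrival in a count-based window."""
    window = deque(maxlen=window_size)  # oldest object expires automatically
    for p in stream:
        window.append(p)
        yield skyline(list(window))


static_result = skyline([(1, 5), (2, 2), (5, 1), (3, 3), (4, 4)])
stream_result = list(sliding_skylines([(3, 3), (2, 2), (4, 1)], window_size=2))
```
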
Identifying Backdoor Attacks in Federated Learning via Anomaly Detection

Federated learning has seen increased adoption in recent years in response to the growing regulatory demand for data privacy. However, the opaque local training process of federated learning also sparks rising concerns about model faithfulness. For instance, studies have revealed that federated learning is vulnerable to backdoor attacks, whereby a compromised participant can stealthily modify the model’s behavior in the presence of backdoor triggers. This paper proposes an effective defense against the attack by examining shared model updates. We begin with the observation that the embedding of backdoors influences the participants’ local model weights in terms of the magnitude and orientation of their model gradients, which can manifest as distinguishable disparities. We enable a robust identification of backdoors by studying the statistical distribution of the models’ subsets of gradients. Concretely, we first segment the model gradients into fragment vectors that represent small portions of model parameters. We then employ anomaly detection to locate the distributionally skewed fragments and prune the participants with the most outliers. We embody the findings in a novel defense method, ARIBA. We demonstrate through extensive analyses that our proposed methods effectively mitigate state-of-the-art backdoor attacks with minimal impact on task utility.

Yuxi Mi, Yiheng Sun, Jihong Guan, Shuigeng Zhou
PaTraS: A Path-Preserving Trajectory Simplification Method for Low-Loss Map Matching

Massive and redundant vehicle trajectory data is being accumulated and recorded at an unprecedented speed and scale, incurring expensive cost for storage, transmission, and query processing. Trajectory simplification is a typical way to reduce the size of raw trajectory as well as maintaining its structural information. However, existing methods mainly focus on preserving the shape of the trajectory while ignoring its influence on downstream applications. Since most applications require trajectories to be map-matched into paths before further processing, in this paper, we propose PaTraS, a path-preserving trajectory simplification method that aims to minimize the accuracy loss on the map-matching results of the compressed trajectories. To achieve this objective, we build an index that materializes the road network connectivity, and propose a connectivity-based similarity function that measures the importance of a trajectory point with respect to how it contributes to the map-matching results. Extensive experiments show that, compared with state-of-the-art methods, our proposed solution can better preserve the path generated by trajectory map-matching at the cost of a slightly increased running time, and it works effectively in both online and offline modes.

Ruoyu Leng, Chunhui Feng, Chenxi Hao, Pingfu Chao, Junhua Fang
Coordinate Descent for k-Means with Differential Privacy

In recent years, Lloyd’s heuristic has become one of the most useful methods for solving the k-means problem due to its simplicity. However, Lloyd’s heuristic suffers from bad local minima and from privacy issues, which make it unsuitable for privacy-preserving scenarios. In this paper, we propose a differentially private framework for k-means clustering based on the coordinate descent method. Firstly, we propose an approximate version of the updating functions of the indicator matrix, which indicates each point’s assignment. Then we ensure differential privacy for k-means clustering by using the exponential mechanism to perturb the indicator matrix. Finally, we conduct several experiments on multiple real-world datasets. Our experimental results show that our algorithm outperforms the state of the art in terms of the trade-off between utility and privacy.

Yuchen Xie, Yi-Jun Yang, Wei Zeng
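
The exponential mechanism mentioned in the abstract above is a standard differential-privacy primitive: a candidate $r$ is sampled with probability proportional to $\exp(\varepsilon\, u(r) / (2\Delta u))$, where $u$ is a utility score and $\Delta u$ its sensitivity. The sketch below applies it to choosing a cluster for one point. It is an illustration under stated assumptions, not the paper's update rule: the utility (negative squared distance to each centroid), the caller-supplied sensitivity bound, and the function names are all hypothetical.

```python
import math
import random


def exponential_mechanism(utilities, epsilon, sensitivity, rng=random):
    """Sample an index with probability proportional to
    exp(epsilon * utility / (2 * sensitivity))."""
    top = max(utilities)
    # Subtracting the maximum leaves the distribution unchanged and
    # keeps math.exp from overflowing.
    weights = [math.exp(epsilon * (u - top) / (2 * sensitivity)) for u in utilities]
    return rng.choices(range(len(utilities)), weights=weights, k=1)[0]


def private_assign(point, centroids, epsilon, sensitivity, rng=random):
    """Pick a cluster index for `point`; closer centroids are more likely.

    Assumes the data are scaled so that `sensitivity` really bounds the
    change in utility (an assumption the caller must justify)."""
    utilities = [-math.dist(point, c) ** 2 for c in centroids]
    return exponential_mechanism(utilities, epsilon, sensitivity, rng)


rng = random.Random(0)
centroids = [(0.0, 0.0), (1.0, 1.0)]
label = private_assign((0.9, 0.8), centroids, epsilon=1.0, sensitivity=2.0, rng=rng)
```

A small `epsilon` makes the choice close to uniform (more privacy, less utility), while a large `epsilon` makes it close to the ordinary nearest-centroid assignment.
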
DADR: A Denoising Approach for Dense Retrieval Model Training

With the development of representation learning techniques, Dense Retrieval (DR) has become a new paradigm to retrieve relevant texts for better ranking performance. Although current DR models have achieved encouraging results, their performance is highly affected by the noise level in training samples. In particular, a large number of examples that were not labeled as positives (which were used as negative samples by default) were found to actually be positive or highly relevant. As such, it is of critical importance to account for the inevitable noises in DR model training. However, little work on dense retrieval has taken the noisy nature into consideration. In this work, we intensely investigate the serious negative impacts of noisy training samples and propose a new denoising approach, i.e., A Denoising Approach based on dynamic weights for Dense Retrieval model training (DADR), which reduces the effects of noise on model performance by assigning diverse weights to the different samples during the training process. We incorporate the proposed DADR approach with three representative kinds of sampling methods and different loss functions. Experimental results on two publicly available retrieval benchmark datasets show that our approach significantly improves the performance of the DR model over normal training.

Mengxue Du, Shasha Li, Jie Yu, Jun Ma, Huijun Liu, Miaomiao Li, Bin Ji
Multi-pair Contrastive Learning Based on Same-Timestamp Data Augmentation for Sequential Recommendation

The core of sequential recommendations is to model users’ dynamic preferences from their sequential historical behaviors. Bidirectional representation models can make better sequential recommendations because each item in user’s historical behaviors fuses information from both left and right sides. Despite their effectiveness, we argue that such bidirectional models are sub-optimal due to the limitations including: a) items with the same timestamp interactions have adverse effect on user modeling; b) the random masking process often produces noises. To address these limitations, we propose Multi-pair Contrastive Learning based on same-timestamp data augmentation for Sequential Recommendation (MCL4SR). Specifically, we firstly modify the masking strategies of BERT encoder. Then we propose a multi-pair contrastive learning framework by exploring data augmentation of the same timestamp interactions. During the training and testing process, we design three types of samples so as to imitate human learning. Extensive experiments on two benchmark datasets show that our model outperforms state-of-the-art sequential models.

Shun Zheng, Shaoqing Wang, Lijie Zhang, Yao Zhang, Fuzhen Sun
Enhancing Collaborative Features with Knowledge Graph for Recommendation

Knowledge graphs (KGs) are of great help in improving the performance of recommendation systems. Models based on graph neural networks (GNNs) have gradually become the mainstream of knowledge-aware recommendation (KGR). However, existing GNN-based KGR models underutilize the semantic information in the KG to enhance collaborative features. Therefore, we propose a Collaborative Knowledge Graph-Aware framework (CKGA). In general, we first use the knowledge graph to obtain the semantic representation of items and users, and then feed these representations into the Collaborative Filtering (CF) model to obtain better collaborative features. Specifically, (1) we design a novel CF model to learn the collaborative features of items and users, which partitions the interaction graph into different subgraphs of similar interest and performs high-order graph convolution inside subgraphs. (2) For learning important semantic information in the KG, we design an attribute aggregation scheme and an inference mechanism for the GNN which directly propagate further attributes and inference information to the central node. Extensive experiments conducted on three public datasets demonstrate the superior performance of CKGA over state-of-the-art methods.

Lingang Zhu, Yi Zhang, Gang Li
PageCNNs: Convolutional Neural Networks for Multi-label Chinese Webpage Classification with Multi-information Fusion

Along with the popularity and development of the Internet in China, Chinese webpage classification has become an important research topic. Because webpage text is a kind of text, webpage classification builds on text classification. However, due to the particular composition of webpages, externally linked webpages can provide helpful information to improve webpage classification performance. The goal of this work is to design accurate multi-label Chinese webpage classification models by effectively fusing the information extracted from the current webpage and externally linked webpages, including the text information and label information of the externally linked webpages. A convolutional neural network for webpage classification (PageCNN) model and its two variants (PageCNN-CLL and PageCNN-WLL) are proposed to effectively fuse the text and label information extracted from multiple Chinese webpages. The proposed PageCNN models are compared with two modified traditional machine learning models, the modified TextCNN model, and three state-of-the-art deep learning based multi-label text classification models. The experimental results demonstrate that the PageCNN models perform better than the compared models in terms of subset accuracy, Hamming loss, macro F1, and micro F1. Moreover, an in-depth analysis of the effect of the externally linked webpages on current webpage classification is conducted by analyzing the error correction rate and hit rate of the proposed models and preliminary prediction variables. As demonstrated in the experiments, the multi-information fusion methods developed in the PageCNN models can effectively manipulate the input data from multiple webpages to enhance multi-label Chinese webpage classification performance.

Jiawei Zheng, Junying Chen, Yi Cai
MFF-Trans: Multi-level Feature Fusion Transformer for Fine-Grained Visual Classification

In fine-grained visual classification, fusing both local and global information is crucial. However, current methods based on the vision transformer tend to focus only on selecting discriminative patch tokens, ignoring the variation of rich global and semantic information in the classification tokens at different layers. To address this limitation, we propose a novel framework dubbed MFF-Trans that considers the mutual relationships between all tokens. Specifically, we put forward the important token election module (ITEM), which utilizes the multi-head self-attention mechanism in the vision transformer to evaluate the importance of all tokens. This module guides the model to select tokens that contain discriminative local information and global information with different semantics at each ViT layer. Meanwhile, to enhance the model’s perception of the semantic connections between selected patch tokens, we further introduce the semantic connection enhancing module (SCEM), which uses a graph convolutional network to mine the structural information between them in the deep layers of the vision transformer. Extensive experimental results on three benchmark datasets indicate that MFF-Trans achieves satisfactory performance compared with other methods. We achieve good results on CUB (92.1%), Stanford Cars (95.4%), and Stanford Dogs (92.3%).

Qi Hang, Xuefeng Yan, Lina Gong
Summarizing Doctor’s Diagnoses and Suggestions from Medical Dialogues

Nowadays, doctors can provide consultation services to patients by dialogues on the online medical platforms, and need to summarize their diagnoses and suggestions according to the regulations of the platform, which will play an important guiding role in the follow-up treatments. The essential challenges of automatic summarization lie in the high overlap between summaries and doctors’ original utterances and the adaption to the specific structure of medical dialogues, which are overlooked by the majority of existing work. In response to this problem, we propose a pointer generator network model, dubbed as PMDS, to generate accurate and concise summaries for doctors’ diagnoses and suggestions. PMDS takes the pointer generator network as the basic architecture and uses multi-level enhanced input feature representation, the latter of which helps to effectively distinguish speakers and better focus on key information. We evaluate our proposed model on a Chinese medical dialogue summarization dataset, and the experimental results exceeded several strong baselines in previous studies.

Tianbao Zhang, Yuan Cui, Zhenfei Yang, Shi Feng, Daling Wang
HSA: Hyperbolic Self-attention for Sequential Recommendation

Recently, researchers have applied various deep neural networks to the task of sequential recommendation, which captures the dynamics of user preference from user behavior data to make accurate recommendations. Self-attention based approaches have been proposed to effectively identify relevant items and better capture long-term dependencies, achieving competitive results in the sequential recommendation domain. However, most existing methods operate in Euclidean space, which expands only polynomially, limiting the capacity of the models. Besides, these methods typically do not consider or leverage the latent hierarchical structures existing in real-world datasets. To this end, we propose to learn representations in hyperbolic space for sequential recommendation, which brings two advantages. First, hyperbolic space expands exponentially and thus provides higher representation ability. Second, it is able to effectively model the latent hierarchical structures, which are indicated by the power-law distributions of user behavior sequences. Specifically, we propose a novel hyperbolic self-attention model, which learns item embeddings in hyperbolic space and adopts self-attention to model the user sequence representation, using hyperbolic distance to measure preference and make recommendations. Extensive experiments conducted on three real-world datasets demonstrate the superiority of our proposed hyperbolic embedding approach over various competitive baselines, including its Euclidean self-attention counterpart. We also apply the proposed hyperbolic embedding method to classic sequential recommendation models and observe improvements, showing that it is a general technique which can boost other models.

Peizhong Hou, Haiyang Wang, Tianming Li, Junchi Yan
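
The abstract above scores preference by hyperbolic distance but does not say which model of hyperbolic space is used. As an illustration only, the sketch below assumes the Poincaré ball model, where the distance between points $u, v$ with norm below 1 is $d(u, v) = \operatorname{arcosh}\!\left(1 + \frac{2\lVert u - v\rVert^2}{(1 - \lVert u\rVert^2)(1 - \lVert v\rVert^2)}\right)$. The function name is hypothetical, and the code is not taken from the HSA model.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance in the Poincare ball (assumed model; both points
    must have Euclidean norm strictly below 1)."""
    sq_norm = lambda x: sum(c * c for c in x)
    sq_diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * sq_diff / denom)


origin = (0.0, 0.0)
d_mid = poincare_distance(origin, (0.5, 0.0))    # equals ln 3
d_edge = poincare_distance(origin, (0.99, 0.0))  # grows quickly near the boundary
```

The distance grows without bound as a point approaches the boundary of the ball, which is one way to see the exponential expansion that the abstract contrasts with Euclidean space.
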
CFGCon: A Scheme for Accurately Generating Control Flow Graphs of Smart Contracts

Smart contracts are a significant component that allows decentralized applications (DApps) to automate the exchange of digital assets without third-party surveillance. To build trust, smart contracts are designed to be immutable, so design flaws may remain unrevealed in deployed contracts. Many analysis tools have been developed to identify various vulnerabilities that could be targeted by hackers after deployment and thus cause financial losses. However, these approaches, which are based on graph classification, rely heavily on the quality of the control flow graphs (CFGs) generated from the bytecode of smart contracts. In this paper, we propose a novel generator named CFGCon to convert the bytecode of smart contracts to CFGs. After identifying the difficulties faced by existing CFG generators, we design a program counter to deal with opcodes involving loops and instructions that need to read the current counter. Experimental results show that our proposed CFGCon reaches a much higher success rate than other state-of-the-art CFG generators on a dataset containing 579 open source contracts and 10,000 non-open source contracts from Ethereum. At the same time, the analysis speed of CFGCon is similar to that of current mainstream tools.

Nengyu Xia, Yixin Zhang, Wei Ren, Xianyi Chen
Hypergraph-Enhanced Self-supervised Heterogeneous Graph Representation Learning

Heterogeneous graphs are widely used to model complex systems in the real world, such as social networks, biomedical networks, and citation networks. Learning heterogeneous graph embeddings (i.e., representations) provides a way to perform deep learning-driven downstream tasks, such as recommendation and prediction. However, existing heterogeneous graph neural networks mainly capture pairwise relations in heterogeneous graphs, while real-world relations are often more complex and not limited to pairs. In this paper, we propose a novel method to capture relations beyond pairwise in heterogeneous graphs, namely HHGR. First, we construct hypergraphs from heterogeneous graphs and preserve semantic information of network schema and meta paths. Second, we design a cross-view contrast module to aggregate information on different aspects. Further, to enhance the performance of HHGR, we propose a semantic positive sampling strategy, which chooses proper positive samples according to structure and attribute semantics. Extensive experiments conducted on various real-world datasets demonstrate the state-of-the-art performance of HHGR.

Yuanhao Zhang, Chengxin He, Longhai Li, Bingzhe Zhang, Lei Duan, Jie Zuo
LAF: A Local Depth Autoregressive Framework for Cardinality Estimation of Multi-attribute Queries

Cardinality estimation is significant for database query optimization, as it affects query efficiency. Most existing methods use a uniform approach to model strongly and weakly correlated attributes and seldom make comprehensive use of both data information and query information. Some methods have poor accuracy due to their simple structure, while others suffer from low efficiency due to their complex structure. The problem of cardinality estimation when strong and weak associations coexist among attributes cannot be solved well by these methods or their simple combinations. Therefore, we propose LAF, a new Local deep Autoregressive Framework, which performs fine-grained modeling for attributes with strong and weak correlation. LAF utilizes mutual information to identify the strong and weak associations between attributes. It applies a local strategy to construct deep autoregressive models that learn the joint distribution of strongly correlated attributes and output corresponding local estimations, and it uses a lightweight regression model to capture the complex mapping between the weakly correlated local estimations and the cardinality. LAF also uses information entropy to sort attributes in descending order. We not only enable the local deep autoregressive models to learn from data information, but also enable the lightweight regression model to learn from query information. Extensive experimental evaluations on real datasets show that accurate results are achieved while estimation time is significantly shortened, and the model size is kept within a reasonable range.

Qianwen Cheng, Hao Li, Dawei Wang, Yue Zhang, Zhaohui Peng
MGCN-CT: Multi-type Vehicle Fuel Consumption Prediction Based on Module-GCN and Config-Transfer

Accurate vehicle fuel consumption prediction is crucial to reduce pollutant emissions and save commercial vehicle operating costs. With the support of Internet of Vehicles data, data-driven multivariate time series forecasting methods have been adopted for fuel consumption prediction. Different types of vehicles are composed of modules with different configurations and contain different domain knowledge. However, existing methods rarely consider these differences, and cannot be adjusted according to the vehicle configuration when facing multiple types of vehicles. Moreover, the number of vehicle samples for some personalized configurations is not enough to support the training of the model. To solve the above problems, we propose a multi-type vehicle fuel consumption prediction model based on a Module Graph Convolution Network and Configuration Transfer (MGCN-CT). First, in order to express the vehicle module domain knowledge and driving data uniformly, a module graph embedded with domain knowledge is proposed. Then a module graph convolutional network is proposed to model the spatio-temporal dependence of the module graph and realize fuel consumption prediction. Finally, a configuration transfer module based on a configuration classifier is proposed to realize fuel consumption prediction for vehicles with personalized configurations that have few samples. The effectiveness of the model is verified through extensive experiments on real datasets. Compared with the baseline methods, our method achieves superior accuracy for fuel consumption prediction.

Hao Li, Qianwen Cheng, Zhaohui Peng, Yashu Tan, Zengzhe Chen
Hardware and Software Co-optimization of Convolutional and Self-attention Combined Model Based on FPGA

Since the Transformer was proposed, the self-attention mechanism has been widely used. Some studies have tried to apply the self-attention mechanism to the field of computer vision (CV). However, since self-attention lacks some inductive biases inherent to CNNs, it cannot achieve good generalization when data is insufficient. To solve this problem, researchers have proposed combining the convolution module with the self-attention module to supply the inductive bias that the self-attention mechanism lacks. Many models based on this idea have been developed with good results. However, traditional central processor architectures cannot take good advantage of the parallel nature of these models. Among various computing platforms, the FPGA is a suitable solution for algorithm acceleration thanks to its high parallelism. At the same time, we note that the combined modules of convolution and self-attention have not received enough attention in terms of acceleration. Therefore, customizing computational units using FPGAs to improve model parallelism is a feasible solution. In this paper, we optimize the parallelism of the combined model of convolution and self-attention, design algorithm optimizations for two of the most complex generic nonlinear functions from the perspective of hardware-software co-optimization to further reduce the hardware complexity and the latency of the whole system, and design the corresponding hardware modules. The design is coded in a hardware description language (HDL) and simulated on a Xilinx FPGA. The experimental results show that the hardware resource consumption of the ZCU216 FPGA-based design is greatly reduced compared to the conventional design, while the throughput is increased by 8.82$\times$ and 1.23$\times$ compared to the CPU and GPU, respectively.

Wei Hu, Heyuan Li, Fang Liu, Zhiyv Zhong
FBCA: FPGA-Based Balanced Convolutional Attention Module

Large-scale computation and data processing are common tasks in machine learning. While traditional central processors are capable of performing these tasks, their computational speed is often inadequate when dealing with large-scale data sets and deep neural networks. As a result, many accelerators have emerged, such as graphics processors and field-programmable gate arrays (FPGAs). FPGAs have become widely used accelerators because of their high flexibility, high performance, low power consumption, and low latency. However, most existing FPGA accelerators accelerate only a single type of module, such as a CNN, RNN, or attention module, and few works address joint acceleration of combinations of different network types. Therefore, this work presents a hardware design for a model that combines convolutional and attention modules, whose way of jointly processing data is well suited to hardware acceleration. On the hardware device, the data in this model can flow into both computations at the same time, which yields parallel processing speed. We partition the data entering both modules in a way that suits hardware parallelism, thus making the best use of resources and keeping the execution times of the two modules close to each other. Similarly, for the most computationally heavy loop structure, we adapt the array structure for faster computation. We also parallelize the serial linear layers in the attention module. With these efforts, the model is further streamlined and accelerated, and our model finally achieves a speedup of 12.5 times with only a 0.25 decrease in BLEU.

Wei Hu, Zhiyv Zhong, Fang Liu, Heyuan Li
Multi-level Matching of Natural Language-Based Vehicle Retrieval

Utilizing natural language to retrieve vehicles of specific types and motion states in videos holds great significance for analyzing traffic conditions. However, both natural language and vehicle videos contain rich semantics, including static and dynamic information about vehicles. Additionally, the flexibility of natural language allows sentences with identical semantics to be expressed in multiple ways. To make full use of this information, we divide the natural language and video data into different levels that represent overall and local information. We propose information enhancement methods for the different data levels, and then generate embedded representations for the layered data using representation learning networks. Finally, the overall cross-modal similarity is calculated by applying weighted measures. Experimental results demonstrate the method’s capability to enhance the accuracy of retrieving vehicles in specific states from videos using natural language.

Ying Liu, Zhongshuai Zhang, Xiaochun Yang
Improving the Consistency of Semantic Parsing in KBQA Through Knowledge Distillation

Knowledge base question answering (KBQA) is an important task that involves analyzing natural language questions and retrieving relevant answers from a knowledge base. To achieve this, Semantic Parsing (SP) is used to parse the question into a structured logical form, which is then executed to obtain the answer. Although different logical forms have unique advantages, existing methods focus on only a single logical form and do not consider the semantic consistency between different logical forms. In this paper, we address the issue of consistency in semantic parsing, which has not been explored before. We show that improving the semantic consistency between multiple logical forms can help increase the parsing performance. To address the consistency problem, we present a dynamic knowledge distillation framework for semantic parsing (DKD-SP). Our framework enables one logical form to learn useful hidden knowledge from another, which improves the semantic consistency of different logical forms. Additionally, it dynamically adjusts the supervised weight of the hidden knowledge as the student model’s ability changes. We evaluate our approach on the KQA Pro dataset, and our experimental results confirm its effectiveness. Our method improves the overall accuracy of the seven types of questions by 0.57%, with notable improvements in the accuracy of Qualifier, Compare, and Count questions. Furthermore, in the compositional generalization scenario, the overall accuracy improved by 4.02%. Our code is publicly available at https://github.com/zjtfo/SP_Consistency_By_KD.

Jun Zou, Shulin Cao, Jing Wan, Lei Hou, Jianjun Xu
DYGL: A Unified Benchmark and Library for Dynamic Graph

Difficulty in reproducing code and inconsistent experimental methods hinder the development of the dynamic network field. We present DYGL, a unified, comprehensive, and extensible library for dynamic graph representation learning. The main goal of the library is to make dynamic graph representation learning available to researchers in a unified, easy-to-use framework. To accelerate the development of new models, we design unified model interfaces based on unified data formats, which effectively encapsulate the details of the implementation. Experiments demonstrate the predictive performance of the models implemented in the library on node classification and link prediction. Our library will contribute to standardization and reproducibility in the dynamic graph field. The project is released at https://github.com/half-salve/DYGL-lib

Teng Ma, Bin Shi, Yiming Xu, Zihan Zhao, Siqi Liang, Bo Dong
TrieKV: Managing Values After KV Separation to Optimize Scan Performance in LSM-Tree

Persistent key-value (KV) stores are mainly designed based on the Log-Structured Merge-tree (LSM-tree) for high write performance, yet the LSM-tree suffers from inherently high I/O amplification, which degrades read and write performance as KV stores grow in size. KV separation mitigates I/O amplification by storing only keys in the LSM-tree while values are kept in separate storage. However, KV separation breaks the key order of the values, which hurts range query performance. We propose TrieKV, which makes the most of the sequential read performance of hard disk drives (HDDs) to improve range query performance. TrieKV uses a dynamic prefix index and a collaborative KV data merging and sorting mechanism to manage values after KV separation. Compared with the typical KV separation storage system WiscKey, TrieKV achieves 2.35× range query performance on HDDs. Meanwhile, TrieKV also performs better than WiscKey in all six YCSB workloads.
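
The range query problem that KV separation introduces can be illustrated with a minimal Python sketch. This is not TrieKV or WiscKey code: the class `SeparatedKVStore`, its sorted key list (a stand-in for the LSM-tree), and its append-only `value_log` are all hypothetical names chosen for illustration.

```python
import bisect


class SeparatedKVStore:
    """Toy KV separation: a sorted key index points into an append-only value log."""

    def __init__(self):
        self.keys = []       # sorted keys (stand-in for the LSM-tree)
        self.offsets = {}    # key -> offset of its latest value in the log
        self.value_log = []  # values stored in arrival order, not key order

    def put(self, key, value):
        if key not in self.offsets:
            bisect.insort(self.keys, key)
        self.offsets[key] = len(self.value_log)
        self.value_log.append(value)

    def scan(self, start, end):
        """Return (key, value, log_offset) for every key with start <= key < end."""
        lo = bisect.bisect_left(self.keys, start)
        hi = bisect.bisect_left(self.keys, end)
        return [(k, self.value_log[self.offsets[k]], self.offsets[k])
                for k in self.keys[lo:hi]]


store = SeparatedKVStore()
for k in ["d", "a", "c", "b"]:
    store.put(k, k.upper())
result = store.scan("a", "d")
print(result)
```

The scan returns keys in sorted order, but the log offsets it visits are not increasing, so each value read can become a random access on disk. Managing values so that a scan reads them sequentially is the gap the abstract describes.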

Zekun Yao, Yang Song, Yinliang Yue, Jinzhou Liu, Zhixin Fan
Bit Splicing Frequent Itemset Mining Algorithm Based on Dynamic Grouping

Frequent itemset mining has always been one of the most classic tasks in data mining. It supports effective decision-making for many problems. This paper proposes a novel MPL (multi-partition list) structure that combines bit combination with a linear table structure. The MPL is composed of arrays in which each unit stores a combination of items rather than a single item, which avoids the need to maintain many pointers as in the traditional tree structure. In addition, the MPL stores only the minimum valid information required in the mining process. This paper further proposes a bit splicing frequent itemset mining algorithm based on dynamic grouping (BSFIM-DG) for the MPL. The algorithm dynamically calculates the number of groups by using coverage according to the dataset’s characteristics. The candidate itemset is obtained by the bit-splicing method. The length of the MPL to be traversed is determined by the low-bit feature of the candidate itemset. The search space is reduced with the corresponding pruning strategy. Experiments on various open datasets demonstrate that the algorithm has excellent running speed, especially when the support is low. The proposed algorithm has a similar running speed to BCLT-O and FP-growth on some datasets. In terms of memory usage, the algorithm is better than FP-growth and comparable to BCLT-O, but there is still a gap with the Bit-combination algorithm. Nevertheless, as technology iterates faster and faster, trading space for speed is quite feasible.
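
The idea of storing item combinations as bits can be illustrated with a generic Python sketch. It is not the MPL structure or the BSFIM-DG algorithm: it encodes each transaction as an integer bitmask and enumerates candidates by brute force, and the function names `encode` and `frequent_itemsets` are assumptions made for this example.

```python
from itertools import combinations


def encode(items, item_index):
    """Pack a collection of items into one integer, one bit per item."""
    mask = 0
    for item in items:
        mask |= 1 << item_index[item]
    return mask


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for itemsets found in >= min_support transactions."""
    items = sorted({i for t in transactions for i in t})
    item_index = {item: pos for pos, item in enumerate(items)}
    masks = [encode(t, item_index) for t in transactions]
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, size):
            cmask = encode(combo, item_index)
            # A transaction contains the candidate if all candidate bits are set.
            support = sum(1 for m in masks if m & cmask == cmask)
            if support >= min_support:
                result[combo] = support
                found = True
        if not found:
            break  # no frequent itemset of this size, so none of a larger size
    return result


transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
frequent = frequent_itemsets(transactions, min_support=2)
print(frequent)
```

A single bitwise AND replaces a per-item membership check, which is the general appeal of bit-based encodings. The grouping, splicing, and pruning strategies named in the abstract are not modeled here.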

Wenhe Xu, Jun Lu
Entity Resolution Based on Pre-trained Language Models with Two Attentions

Entity Resolution (ER) is one of the most important issues for improving data quality; it aims to identify the records from one or more datasets that refer to the same real-world entity. For textual datasets whose attribute values are long word sequences, traditional ER methods may fail to accurately capture the semantic information of records, leading to poor effectiveness. To address this challenging problem, we use the pre-trained language model RoBERTa, fine-tuned during training, to propose a novel entity resolution model, IGaBERT. In this model, interactive attention is applied to capture token-level differences between records and to remove the restriction that the schemas must be identical, and global attention is then utilized to determine the importance of these differences. Extensive experiments without injecting domain knowledge are conducted to measure the effectiveness of the IGaBERT model over both structured and textual datasets. The results indicate that IGaBERT significantly outperforms several state-of-the-art approaches over textual datasets, especially with small amounts of training data, and it is highly competitive with those approaches over structured datasets.

Liang Zhu, Hao Liu, Xin Song, Yonggang Wei, Yu Wang
A High-Performance Hybrid Index Framework Supporting Inserts for Static Learned Indexes

The learned index is a new index structure that uses a trained model to directly predict the position of a key and thus has high query performance. However, static learned indexes cannot handle insert operations. Although the PGM-index uses a dynamic data structure to support inserts, it faces a serious read amplification problem under read-write workloads, as the inefficient lookup process of the buffers diminishes the benefit of the learned indexes. Besides, this structure also leads to periodic retraining of the internal PGM-indexes because the buffers and the learned indexes are strongly coupled, which is unacceptable for static learned indexes that need tuning. This structure is therefore not an ideal general framework. In this paper, we propose a two-layer Hybrid Index Framework (HIF) to address these issues. Specifically, the dynamic layer is used as a buffer for inserts, and the static layer, consisting of static learned indexes, is used for lookups only. HIF effectively alleviates read amplification by searching the static layer directly. With this hierarchical structure, HIF isolates the learned indexes from insert operations. Thus, HIF can completely avoid retraining the learned indexes through a transformation strategy from the dynamic layer to the static layer. Moreover, we provide a self-tuning algorithm for learned indexes that cannot be built in a single pass over the data, allowing them to be applied to dynamic workloads with low training overhead. We have conducted experiments using multiple datasets and workloads, and the results show that, on average, three HIF-based static learned indexes, HLI, PGM, and RMI, achieve up to 1.8×, 1.7×, and 1.5× higher throughput than the original dynamic PGM-index for insert ratios below 70%.
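
The split between a read-only learned layer and an insert buffer can be sketched in Python as follows. This is an illustration under stated assumptions, not HIF itself: the model is a single least-squares line with a recorded maximum error, and the flush policy (turning a full buffer into a new static run without touching existing runs) is a guess at what a transformation strategy could look like. The names `StaticLinearIndex`, `HybridIndex`, and `buffer_limit` are hypothetical.

```python
import bisect


class StaticLinearIndex:
    """Read-only index: a fitted line predicts a key's position in a sorted array."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        self.slope, self.intercept, self.max_err = 0.0, 0.0, 0
        if n == 0:
            return
        mean_k = sum(self.keys) / n
        mean_p = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in self.keys)
        cov = sum((k - mean_k) * (p - mean_p) for p, k in enumerate(self.keys))
        self.slope = cov / var if var else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # The largest prediction error bounds the search window at lookup time.
        self.max_err = max(abs(p - self._predict(k)) for p, k in enumerate(self.keys))

    def _predict(self, key):
        return int(round(self.slope * key + self.intercept))

    def contains(self, key):
        n = len(self.keys)
        guess = self._predict(key)
        lo = min(max(0, guess - self.max_err), n)
        hi = max(lo, min(n, guess + self.max_err + 1))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i < hi and self.keys[i] == key


class HybridIndex:
    """Static runs serve lookups; a small sorted buffer absorbs inserts."""

    def __init__(self, keys, buffer_limit=4):
        self.static_runs = [StaticLinearIndex(keys)]
        self.buffer = []
        self.buffer_limit = buffer_limit

    def insert(self, key):
        bisect.insort(self.buffer, key)
        if len(self.buffer) >= self.buffer_limit:
            # Assumed policy: build a new static run; existing runs are not retrained.
            self.static_runs.append(StaticLinearIndex(self.buffer))
            self.buffer = []

    def contains(self, key):
        if any(run.contains(key) for run in self.static_runs):
            return True
        i = bisect.bisect_left(self.buffer, key)
        return i < len(self.buffer) and self.buffer[i] == key


idx = HybridIndex([10, 20, 30, 40, 50], buffer_limit=2)
idx.insert(25)
idx.insert(35)
print(idx.contains(35), idx.contains(26), len(idx.static_runs))
```

Because each static run is built once and never modified, inserts never trigger retraining of an existing model in this sketch. A real design would also need to bound the number of runs.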

Yuquan Ding, Xujian Zhao
A Study on Historical Behaviour Enabled Insider Threat Prediction

Insider threats have been a major challenge in cybersecurity in recent years, since they come from authorized individuals and usually cause significant losses once they succeed. Researchers have been trying to solve this problem by discovering malicious activities that have already happened, which offers little help in preventing those threats. In this paper, we propose a novel problem setting that focuses on predicting whether an individual will be a malicious insider on a future day based on their daily behavioral records from the previous several days, which could assist cybersecurity specialists in better allocating managerial resources. We investigate seven traditional machine learning methods and two deep learning methods, evaluating their performance on the CERT-r4.2 dataset for this specific task. Results show that the random forest algorithm tops the ranking list with F1 = 0.8447 in the best case, and that deep learning models are not necessarily better than traditional machine learning models for this specific problem setting. Further study shows that historical records from roughly the previous four days offer the most predictive power compared with other length settings. We publish our code on GitHub: https://github.com/mybingxf/insider-threat-prediction.

Fan Xiao, Wei Hong, Jiao Yin, Hua Wang, Jinli Cao, Yanchun Zhang
PV-PATE: An Improved PATE for Deep Learning with Differential Privacy in Trusted Industrial Data Matrix

Differential privacy (DP) has been widely used in many domains of statistics and deep learning (DL), such as protecting the parameters of DL models. The Private Aggregation of Teacher Ensembles (PATE) framework is a popular solution for privacy protection that effectively avoids membership inference attacks in model training. However, in a Trusted Industrial Data Matrix (TDM), where privacy budgets are constrained and information sharing between models is required, existing works using PATE have two issues. First, data utility is reduced because of the overfitting problem resulting from insufficient knowledge transfer from teachers to students. Second, teachers cannot share information, which creates an information silo problem. In this paper, we first propose the Personalized Voting-based PATE framework (PV-PATE) in TDM to solve the above-mentioned issues. It includes Teacher Credibility, which reduces sensitivity by changing voting weights, and an Adaptive Voting mechanism based on teachers’ voting. In addition, we propose a Model Sharing mechanism to achieve model cloning and elimination. We conduct extensive experiments on the MNIST and SVHN datasets to demonstrate that our approach not only achieves outstanding learning performance but also provides strong privacy guarantees.
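
For readers unfamiliar with PATE-style aggregation, the following Python sketch shows a noisy vote in which each teacher's vote carries a weight. It is only an illustration of the general idea: the weights, the Laplace noise scale, and the function names `laplace_noise` and `weighted_noisy_vote` are assumptions, and the sketch does not reproduce PV-PATE's Teacher Credibility or Adaptive Voting rules or any privacy accounting.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    if scale <= 0:
        return 0.0
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def weighted_noisy_vote(teacher_labels, weights, num_classes, noise_scale, rng):
    """Aggregate teacher predictions into one label via a noisy weighted vote."""
    votes = [0.0] * num_classes
    for label, weight in zip(teacher_labels, weights):
        votes[label] += weight
    # Noise is added to every class count before taking the argmax.
    noisy = [v + laplace_noise(noise_scale, rng) for v in votes]
    return max(range(num_classes), key=lambda c: noisy[c])


rng = random.Random(0)
labels = [0, 1, 1, 2]  # predictions from four hypothetical teachers

# Equal weights and no noise: the plain majority label wins.
plain = weighted_noisy_vote(labels, [1.0, 1.0, 1.0, 1.0], 3, 0.0, rng)
# A heavily weighted teacher can override the majority.
weighted = weighted_noisy_vote(labels, [3.0, 1.0, 1.0, 0.5], 3, 0.0, rng)
# With noise enabled, the returned label is randomized.
noisy_label = weighted_noisy_vote(labels, [1.0, 1.0, 1.0, 1.0], 3, 1.0, rng)
print(plain, weighted, noisy_label)
```

Changing the weights changes how much a single teacher can move the vote counts, which is why weighting interacts with the sensitivity used to calibrate the noise.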

Hongyu Hu, Qilong Han, Zhiqiang Ma, Yukun Yan, Zuobin Xiong, Linyu Jiang, Yuemin Zhang
LayerBF: A Space Allocation Policy for Bloom Filter in LSM-Tree

LSM-Tree based key-value stores commonly suffer from the issue of read amplification, as the retrieval of a particular key typically requires examination of multiple layers of SSTables. To enhance query performance, a bloom filter is commonly employed, although it is susceptible to the problem of false positives, which leads to additional I/Os. To mitigate the issue of false positives, the bloom filter size can be increased, but this in turn results in higher memory consumption. In response, we have developed LayerBF, a space allocation strategy for layered bloom filters. By leveraging access frequency, LayerBF dynamically allocates bits-per-key of bloom filters in each layer. Hotter layers are allocated a larger space, while colder layers are allocated a smaller space. This approach reduces the average false positive rate, improves storage read performance, and simultaneously minimizes memory consumption. We have implemented LayerBF in the widely used RocksDB key-value store and evaluated its performance with and without LayerBF on both hard disk drives (HDDs) and solid-state drives (SSDs). The evaluation results demonstrate that LayerBF improves read performance by 5% to 14% and reduces the false positive rate by 8% to 10%.
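
The trade-off behind frequency-aware filter sizing can be sketched with the standard Bloom filter approximation: with an optimal number of hash functions, the false positive rate is about exp(-b · (ln 2)²) for b bits per key. The Python below is not LayerBF's policy; the proportional allocation rule, the per-layer key counts, and the access shares are all made-up assumptions for illustration.

```python
import math


def false_positive_rate(bits_per_key):
    """Approximate Bloom filter FPR with an optimal number of hash functions."""
    return math.exp(-bits_per_key * math.log(2) ** 2)


def allocate_bits(keys_per_layer, access_share, total_bits):
    """Give each layer bits-per-key proportional to its access share,
    scaled so the total memory equals total_bits (an assumed heuristic)."""
    denom = sum(n * a for n, a in zip(keys_per_layer, access_share))
    return [total_bits * a / denom for a in access_share]


def access_weighted_fpr(bits_per_key, access_share):
    """Average FPR seen by lookups, weighting each layer by how often it is probed."""
    return sum(a * false_positive_rate(b) for a, b in zip(access_share, bits_per_key))


keys_per_layer = [1_000, 10_000, 100_000]  # hypothetical layer sizes
access_share = [0.6, 0.3, 0.1]             # hypothetical share of probes per layer
total_bits = 10 * sum(keys_per_layer)      # same memory as a uniform 10 bits per key

uniform = [10.0] * len(keys_per_layer)
skewed = allocate_bits(keys_per_layer, access_share, total_bits)

uniform_fpr = access_weighted_fpr(uniform, access_share)
skewed_fpr = access_weighted_fpr(skewed, access_share)
print(skewed, uniform_fpr, skewed_fpr)
```

Under these assumed numbers, the skewed allocation uses the same memory as the uniform one but yields a lower access-weighted false positive rate, because the small, hot layers are cheap to protect with many bits per key. Other workloads may call for a different allocation rule.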

Jiaoyang Li, Zhixin Fan, Yinliang Yue, Zekun Yao, Jinzhou Liu, Jiang Zhou
HTStore: A High-Performance Mixed Index Based Key-Value Store for Update-Intensive Workloads

In this paper, we propose a high-performance Mixed Index based key-value store named HTStore to improve the write and read performance of LSM-tree based key-value stores in update-intensive workloads. The key idea of HTStore is to build a global index, called the Mixed Index, in DRAM and NVM hybrid storage, which saves keys and their latest positions. By accessing the Mixed Index, HTStore determines whether the version of a key participating in flush or compaction is the latest. If the key-value pair is an expired, redundant old version, HTStore filters it out, which reduces the flush and compaction overhead caused by such expired data and leads to improved write throughput. Additionally, the Mixed Index helps quickly locate the level of a key, avoiding layer-by-layer search. The Mixed Index comprises a HashTable in DRAM and a Trie-tree in NVM. We implemented HTStore on RocksDB and compared it with the original RocksDB. Our performance evaluation showed that write throughput increased by up to 73–105% in update-intensive workloads and read throughput by up to 80%. Compared with MatrixKV and PebblesDB, HTStore also demonstrated specific performance improvements.
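
The filtering step described above can be illustrated with a small Python sketch. It is a simplification and not HTStore code: a plain dictionary stands in for the Mixed Index, a sequence number stands in for a key's latest position, and the names `GlobalIndex`, `record_write`, and `filter_stale` are hypothetical.

```python
class GlobalIndex:
    """Maps each key to the sequence number of its most recent write."""

    def __init__(self):
        self.latest = {}

    def record_write(self, key, seq):
        self.latest[key] = seq

    def is_latest(self, key, seq):
        return self.latest.get(key) == seq


def filter_stale(entries, index):
    """Drop entries whose version is no longer the latest, as a flush or
    compaction step might do before writing data to the next level."""
    return [(key, seq, value) for key, seq, value in entries
            if index.is_latest(key, seq)]


index = GlobalIndex()
batch = [("a", 1, "v1"), ("b", 2, "v2"), ("a", 3, "v3")]
for key, seq, _ in batch:
    index.record_write(key, seq)

kept = filter_stale(batch, index)
print(kept)
```

In an update-intensive workload most entries in a batch are superseded versions, so dropping them before they are rewritten is where the write savings in this sketch would come from.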

Jinzhou Liu, Yinliang Yue, Jiang Zhou, Zhixin Fan, Zekun Yao
Backmatter
Metadata
Title
Web and Big Data
Edited by
Xiangyu Song
Ruyi Feng
Yunliang Chen
Jianxin Li
Geyong Min
Copyright year
2024
Publisher
Springer Nature Singapore
Electronic ISBN
978-981-9723-87-4
Print ISBN
978-981-9723-86-7
DOI
https://doi.org/10.1007/978-981-97-2387-4