
2023 | Book

Data Analysis and Optimization

In Honor of Boris Mirkin's 80th Birthday

About this book

This book presents the state of the art in the emerging field of data science. It includes models for layered security with applications to the protection of sites such as large gathering places, as well as high-stakes decision-making tasks such as cancer diagnostics and self-driving cars, where wrong decisions can have catastrophic consequences. Additionally, this book provides readers with automated methods to analyze patterns and models for various types of data, with applications ranging from scientific discovery to business intelligence and analytics.

The book primarily covers exploratory data analysis, pattern mining, clustering, and classification, supported by real-life case studies. The statistical section of the book explores the impact of data mining and modeling on the predictability assessment of time series. Further, new notions of mean values based on ideas of multi-criteria optimization are compared with their conventional definitions, leading to new algorithmic approaches to the calculation of the suggested new means.

The style of the chapters and the provision of a broad yet in-depth overview of data mining, integrating novel concepts from machine learning and statistics, make the book accessible to upper-level undergraduate and graduate students in data mining courses. Students and professionals specializing in computer and management science, data mining for high-dimensional data, and complex graphs and networks will benefit from the cutting-edge ideas and practically motivated case studies in this book.

Table of Contents

Frontmatter
Optimal Layered Defense for Site Protection
Abstract
We present a model for layered security with applications to the protection of sites such as stadiums or large gathering places. We formulate the problem as one of maximizing the capture of illegal contraband. The objective function is indefinite, and only limited information can be gained when the problem is solved by standard convex optimization methods. In order to solve the model, we develop a dynamic programming approach and study its convergence properties. Additionally, we formulate a version of the problem aimed at addressing intelligent adversaries who can adjust their direction of attack as they observe changes in site security, and we develop a method for the solution of this latter model. Finally, we perform computational experiments to demonstrate the use of our methods.
Tsvetan Asamov, Emre Yamangil, Endre Boros, Paul B. Kantor, Fred Roberts
Developing and Running a Set of Competitive College/University Blended Courses
Abstract
We propose a mathematical model underlying a decision support tool that lets a college/university administration (a) find a financially optimal composition of teachers to design and run blended courses for its students, using fragments of recorded lectures by distinguished professors from world-leading universities, and (b) remain competitive in the education market of potential students by offering the designed blended courses, along with corresponding explanations and tutorials, in a new learning environment at affordable tuition fees. This model, which reflects the college/university expenses associated with running each such blended course, lets the administration calculate (a) the maximal percentage of students expected to succeed in studying each course from a set of these courses under a fixed budget, and (b) the minimal budget needed to secure desirable percentages of students expected to succeed in studying particular blended courses from this set. The calculation can be done by formulating two Boolean programming problems on the basis of the proposed model and solving them with standard software packages.
Alexander S. Belenky
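To make the Boolean programming setting above concrete, here is a minimal, hypothetical sketch in the spirit of problem (a): choosing which blended courses to fund so that the expected number of succeeding students is maximized under a fixed budget. The costs, success estimates, budget, and the use of scipy.optimize.milp are illustrative assumptions, not the chapter's actual model, which also accounts for teacher composition and tuition.

```python
# Illustrative toy 0-1 program: maximize expected student success under a fixed budget.
# All numbers below are invented; the chapter's model is considerably richer.
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

cost = np.array([12.0, 8.0, 15.0, 6.0])      # cost of designing/running each blended course
success = np.array([40.0, 25.0, 55.0, 18.0]) # expected number of succeeding students per course
budget = 25.0

# milp minimizes, so negate the objective to maximize expected successes.
res = milp(
    c=-success,
    constraints=LinearConstraint(cost[np.newaxis, :], -np.inf, budget),
    integrality=np.ones_like(cost),          # all decision variables are Boolean (0/1)
    bounds=Bounds(0, 1),
)
chosen = np.flatnonzero(res.x > 0.5)
print("courses funded:", chosen, "expected successes:", -res.fun)
```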
SARAH-Based Variance-Reduced Algorithm for Stochastic Finite-Sum Cocoercive Variational Inequalities
Abstract
Variational inequalities are a broad formalism that encompasses a vast number of applications. Motivated by applications in machine learning and beyond, stochastic methods are of great importance. In this paper we consider the problem of stochastic finite-sum cocoercive variational inequalities. For this class of problems, we investigate the convergence of the method based on the SARAH variance reduction technique. We show that for strongly monotone problems it is possible to achieve linear convergence to a solution using this method. Experiments confirm the importance and practical applicability of our approach.
Aleksandr Beznosikov, Alexander Gasnikov
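As a rough illustration of the kind of method analyzed above, the following sketch runs a SARAH-style variance-reduced iteration on a toy strongly monotone finite-sum operator built from random positive-definite affine maps. The operator, step size, and epoch length are assumptions for illustration; the chapter's exact algorithm, assumptions, and convergence rates are given in the text.

```python
# Minimal sketch of a SARAH-style variance-reduced iteration for a finite-sum
# operator F(x) = (1/n) * sum_i F_i(x), with toy affine operators F_i(x) = A_i x - b_i.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
A = [np.eye(d) + 0.1 * rng.standard_normal((d, d)) for _ in range(n)]
A = [0.5 * (M + M.T) + d * np.eye(d) for M in A]          # make each A_i symmetric positive definite
b = [rng.standard_normal(d) for _ in range(n)]
F = lambda x, i: A[i] @ x - b[i]
full_F = lambda x: sum(F(x, i) for i in range(n)) / n

x = np.zeros(d)
gamma, epochs, m = 0.05, 30, n                            # illustrative step size and loop lengths
for _ in range(epochs):
    v = full_F(x)                                         # full operator at the epoch start
    x_prev, x = x, x - gamma * v
    for _ in range(m):
        i = rng.integers(n)
        v = F(x, i) - F(x_prev, i) + v                    # SARAH recursive estimator of F(x)
        x_prev, x = x, x - gamma * v
print("||F(x)|| after training:", np.linalg.norm(full_F(x)))
```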
Dimensionality Reduction Using Pseudo-Boolean Polynomials for Cluster Analysis
Abstract
We introduce the use of a reduction property of the penalty-based formulation of pseudo-Boolean polynomials as a mechanism for invariant dimensionality reduction in cluster analysis. In our experiments, we show that multidimensional data, such as the 4-dimensional Iris Flower dataset, can be reduced to a 2-dimensional space, while the 30-dimensional Wisconsin Diagnostic Breast Cancer (WDBC) dataset can be reduced to a 3-dimensional space, and that by searching for lines or planes that lie between the reduced samples we can extract clusters in a linear and unbiased manner with competitive accuracy, reproducibility and clear interpretation.
Tendai Mapungwana Chikake, Boris Goldengorin
Pseudo-Boolean Polynomials Approach to Edge Detection and Image Segmentation
Abstract
We introduce a deterministic approach to edge detection and image segmentation by formulating pseudo-Boolean polynomials on image patches. The approach works by applying a binary classification of blob and edge regions in an image based on the degrees of pseudo-Boolean polynomials calculated on patches extracted from the provided image. We test our method on simple images containing primitive shapes of constant and contrasting colour and establish its feasibility before applying it to complex instances such as aerial landscape images. The proposed method is based on the exploitation of the reduction, polynomial degree, and equivalence properties of penalty-based pseudo-Boolean polynomials.
Tendai Mapungwana Chikake, Boris Goldengorin, Alexey Samosyuk
Purifying Data by Machine Learning with Certainty Levels
Abstract
In autonomic computing, self-managing systems, and decision-making under uncertainty and faults, machine learning models are often used and combined to solve problems. These models use a data set, i.e., a collection of data items, where each data item is a vector of feature values and a classification. In many cases these data sets include outliers and/or misleading data items that were created by input-device malfunctions or were maliciously inserted to lead the machine learning to wrong conclusions. A reliable machine learning model must be able to handle a corrupted data set; otherwise, a malfunctioning input device that corrupts a portion of the data set, or a malicious adversary, may lead to inaccurate classifications. The challenge, therefore, is to find effective methods to evaluate and increase the certainty level of the learning process as much as possible. This work introduces the use of a certainty-level measure to obtain better classification capability in the presence of corrupted or malicious data items. Assuming that the data distribution is known, e.g., a normal distribution (a reasonable assumption for a large number of data items), and/or that an upper bound on the number of corrupted data items is given, our techniques define a certainty level for classifications. Another approach presented in this work enhances the random forest technique (the original model was developed by Leo Breiman) to cope with corrupted data items by augmenting the certainty level for the classification obtained in each leaf of the forest. This method is of independent interest, as it significantly improves the classification of the random forest machine learning technique in less severe settings.
Shlomi Dolev, Guy Leshem
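The following sketch illustrates one simple way to attach a certainty level to each leaf of a random forest and weight the trees' votes by it, using a Wilson-style lower bound on the leaf's majority-class proportion. The bound, the weighting scheme, and the scikit-learn setup are illustrative assumptions; the chapter's certainty-level constructions (distribution-based and corruption-bound-based) are defined in the text itself.

```python
# Illustrative certainty-weighted random forest voting (not the chapter's exact scheme).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xtr, ytr)

def wilson_lower(p, n, z=1.96):
    """Lower confidence bound on a proportion p estimated from n samples."""
    return (p + z * z / (2 * n) - z * np.sqrt(p * (1 - p) / n + z * z / (4 * n * n))) / (1 + z * z / n)

votes = np.zeros((len(Xte), forest.n_classes_))
for tree in forest.estimators_:
    leaves = tree.apply(Xte)                        # leaf index reached by each test sample
    value = tree.tree_.value[leaves, 0, :]          # per-leaf class counts (or fractions)
    p = value.max(axis=1) / value.sum(axis=1)       # majority-class proportion in each leaf
    n = tree.tree_.n_node_samples[leaves]           # training samples that reached each leaf
    labels = value.argmax(axis=1)
    votes[np.arange(len(Xte)), labels] += wilson_lower(p, n)   # certainty-weighted vote
print("certainty-weighted accuracy:", (votes.argmax(axis=1) == yte).mean())
```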
On Impact of Data Models on Predictability Assessment of Time Series
Abstract
The main methods for predicting traffic in telecommunications networks are the methods and techniques of Machine Learning (ML). Within the framework of the ML approach, the network traffic predictor is considered as a tool that uses statistics accumulated over time, in one way or another, to draw conclusions about the future behavior of network traffic. However, as the analysis of the literature shows, many modern ML tools do not work efficiently enough due to the pronounced non-linearity and non-stationarity of traffic changes.
This paper is an attempt at some ordering and categorization of the huge stream of publications on modern methods, techniques and models for predicting data of various nature. For this, the concept of a conceptual model (CM) of algorithms for predicting the state of systems in a subject area is introduced. This concept, widely used in modern Big Data and Software Engineering, is adapted here to represent specific probabilistic models of random sequences and processes of various nature (symbolic, binary, real). By such a conceptual model the paper means a formalized description of the relationships between the elements, objects (both real and model) and goals of mathematical models and prediction algorithms, expressed in probabilistic terms. This construction allows representing classes of models according to the types of predicted data, the probability measures used, and the loss functions used when deciding whether the obtained forecast is acceptable.
Among the forecasting tasks, the task of predicting the signs of increments (direction of change) of a time series is singled out separately. The paper proposes to use some results of the theory of random processes for a quick assessment of the predictability of signs of increments with acceptable accuracy. The proposed procedure is a simple heuristic rule for predicting the increment between two neighboring values of a random sequence. The connection of this approach for time series with known approaches for predicting binary sequences is shown.
Sergey Frenkel
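As a rough illustration of the kind of quick predictability check discussed above, the sketch below estimates how well the sign of the next increment of a series can be predicted by a naive persistence/anti-persistence baseline chosen on a training window. The synthetic series and the baseline rule are assumptions for illustration; the chapter's heuristic rule and its probabilistic justification are given in the text.

```python
# Quick empirical check of sign-of-increment predictability with a naive baseline.
import numpy as np

rng = np.random.default_rng(1)
x = np.cumsum(rng.standard_normal(2000)) + 0.3 * np.sin(np.arange(2000) / 20)
s = np.sign(np.diff(x))                      # signs of increments
train, test = s[:1500], s[1500:]

# Empirical probability that two neighboring increments share a sign.
p_same = np.mean(train[1:] == train[:-1])
repeat = p_same >= 0.5                       # persistence vs. anti-persistence regime

pred = test[:-1] if repeat else -test[:-1]   # predicted sign of the next increment
accuracy = np.mean(pred == test[1:])
print(f"P(same sign) on train: {p_same:.3f}, sign-prediction accuracy on test: {accuracy:.3f}")
```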
A Three-Step Method for Audience Extension in Internet Advertising Using an Industrial Taxonomy
Abstract
The paper addresses a very common problem in targeted digital advertising: insufficient audience size. Many approaches to audience extension, such as the so-called look-alike techniques, lead to significantly diminished quality metrics, such as audience quality or conversion rate. We present a novel method for the efficient extension of target audiences. Our basis is a popular taxonomy of user interests, the IAB Contents taxonomy, combined with a representation of the browsing behavior of millions of users by fuzzy sets of visited IAB taxonomy segments, which are leaves of the taxonomy tree. The method consists of three steps: (1) computing membership values for the interest segments of a user by a classifier; (2) generalizing those sets and obtaining high-ranked segments, which is the core part of the method; (3) obtaining a set of advertising campaigns for the user. Our method involves an algorithm for optimally lifting individual fuzzy leaf sets into a higher-rank taxonomy node, a so-called "head subject". The head subject must cover the input fuzzy leaf set in such a way that the number of errors is minimized. This algorithm was originally proposed as an intelligent information retrieval tool; it can be applied, however, to the very different task of targeted advertisement. To extend the audiences of a targeted advertisement, we find the head subjects off-line. Given a set of taxonomy segments corresponding to targeted audiences, we include a user as a target if her head subject covers any of those segments. This lifting-based step increases the number of successful matches between user segments and campaign segments two- or three-fold without losing targeting quality as measured by the click-through rate (CTR). This is in stark contrast to the conventional look-alike methods, which increase audience numbers by reducing admissibility thresholds and, as our experiments show, lead to a large decrease in CTR.
Dmitry Frolov, Zina Taran
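The sketch below shows a heavily simplified version of the lifting step: scoring candidate taxonomy nodes as head subjects by the membership they cover minus a penalty for the gaps they introduce. The toy taxonomy, membership values, scoring function, and gap penalty are all invented for illustration; the chapter's algorithm optimizes a more refined criterion over the IAB taxonomy.

```python
# Toy taxonomy: node -> children (leaves have no entry).
taxonomy = {
    "root": ["sports", "finance"],
    "sports": ["soccer", "tennis", "golf"],
    "finance": ["stocks", "crypto"],
}

def leaves_under(node):
    if node not in taxonomy:
        return {node}
    return set().union(*(leaves_under(c) for c in taxonomy[node]))

def best_head_subject(membership, gap_penalty=0.3):
    """Pick the internal node whose leaf set best covers the fuzzy membership."""
    best = None
    for node in taxonomy:                                   # candidate head subjects
        covered = leaves_under(node)
        coverage = sum(membership.get(leaf, 0.0) for leaf in covered)
        gaps = sum(1 for leaf in covered if membership.get(leaf, 0.0) == 0.0)
        score = coverage - gap_penalty * gaps
        if best is None or score > best[1]:
            best = (node, score)
    return best

# A user mostly interested in two sports segments is lifted to "sports".
print(best_head_subject({"soccer": 0.8, "tennis": 0.6, "stocks": 0.1}))
```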
From Prebase in Automata Theory to Data Analysis: Boris Mirkin’s Way
Abstract
On Google Scholar, references to Boris's books alone exceed 5000. In addition to Boris Mirkin's initial Ph.D. results in abstract automata, Mirkin's innovative contributions to various fields of decision-making theory and practice have marked the fourth quarter of the twentieth century and beyond. Boris has done pioneering work in group choice, mathematical psychology, clustering, data mining and knowledge discovery. Here is a list of notions and terms that have been introduced or heavily explored by B. Mirkin: Anomalous cluster, Bi-cluster, Categorical factor analysis, Chain order partition, Complementary partition criterion, Core-shell cluster, Data recovery approach, Distance between partitions, Federation consensus rule, Generalization in taxonomy, Interval order, Linear stratification, Mapping between evolutionary trees, Minkowski weighted feature clustering, Parsimonious gene history reconstruction, Single cluster clustering, Structured partition, Prebase (in automata theory), Taxonomic rank of research results, Tri-clustering.
Boris Goldengorin
Manipulability of Aggregation Procedures for the Case of Large Numbers of Voters
Abstract
Manipulation is a situation in which an agent misrepresents her preferences to obtain a better result from an aggregation procedure. It has been proven in the literature that there is no non-dictatorial aggregation procedure that is non-manipulable. A number of papers studying the degree of manipulability of aggregation procedures have been published since then. Such papers either look for theoretical formulae for particular cases or use computer modelling for empirical estimations. In the case of computer modelling, only small numbers of voters are considered in the literature, usually up to 100 agents. We use computer modelling to obtain the values of manipulability indices for large numbers of voters, for example, for the cases of 100 to 10,000 voters. The obtained results allow us to make empirical approximations of the manipulability of aggregation procedures and to find the least manipulable ones.
Alexander Ivanov
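To give a flavor of the computational experiments described above, the sketch below estimates, by Monte Carlo sampling under the impartial culture assumption, the share of random profiles in which at least one voter can manipulate the plurality rule with three alternatives. The rule, tie-breaking, profile sizes, and the particular index are illustrative choices; the chapter covers many aggregation procedures, several manipulability indices, and electorates of up to 10,000 voters.

```python
# Toy Monte Carlo manipulability estimate: plurality rule, 3 alternatives,
# impartial culture, alphabetical (lowest-index) tie-breaking.
import numpy as np
from itertools import permutations

def plurality_winner(top_choices, m):
    counts = np.bincount(top_choices, minlength=m)
    return counts.argmax()                     # argmax breaks ties by lowest index

def profile_is_manipulable(profile, m):
    sincere_tops = profile[:, 0]
    winner = plurality_winner(sincere_tops, m)
    for v, pref in enumerate(profile):
        rank = {a: r for r, a in enumerate(pref)}
        for alt in range(m):                   # voter v tries voting for some other alternative
            tops = sincere_tops.copy()
            tops[v] = alt
            if rank[plurality_winner(tops, m)] < rank[winner]:
                return True
    return False

rng = np.random.default_rng(0)
m, n, trials = 3, 25, 2000
orders = np.array(list(permutations(range(m))))
hits = sum(
    profile_is_manipulable(orders[rng.integers(len(orders), size=n)], m)
    for _ in range(trials)
)
print(f"estimated share of manipulable profiles: {hits / trials:.3f}")
```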
Preferences over Mixed Manna
Abstract
A mixed manna contains alternatives that are the best for some agents and alternatives that are never the best for any agent. We define a new class of Condorcet domains, called GF-domains. GF-domains are the unique Condorcet domains that are weakly minimally rich, semi-connected, and contain a pair of mutually reverse preference orders. GF-domains are single-peaked on a circle, which leads to a clear interpretation.
Alexander Karpov
About Some Clustering Algorithms in Evidence Theory
Abstract
The Dempster–Shafer theory of evidence considers data that have a frequency-set nature (the so-called bodies of evidence). In recent years, there has been interest in clustering such objects in order to approximate them with simpler bodies of evidence, to analyze the inconsistency of information, to reduce the computational complexity of processing algorithms, to reveal the structure of the set of focal elements, etc. The article discusses some existing algorithms for clustering bodies of evidence and suggests some new algorithms and approaches to such clustering.
Alexander Lepskiy
Inferring Multiple Consensus Trees and Supertrees Using Clustering: A Review
Abstract
Phylogenetic trees (i.e. evolutionary trees, additive trees or X-trees) play a key role in the processes of modeling and representing species evolution. Genome evolution of a given group of species is usually modeled by a species phylogenetic tree that represents the main patterns of vertical descent. However, the evolution of each gene is unique. It can be represented by its own gene tree which can differ substantially from a general species tree representation. Consensus trees and supertrees have been widely used in evolutionary studies to combine phylogenetic information contained in individual gene trees. Nevertheless, if the available gene trees are quite different, then the resulting consensus tree or supertree can either include many unresolved subtrees corresponding to internal nodes of high degree or can simply be a star tree. This may happen if the available gene trees have been affected by different reticulate evolutionary events, such as horizontal gene transfer, hybridization or genetic recombination. In this case, the problem of inferring multiple alternative consensus trees or supertrees using clustering becomes relevant, since it allows one to group into different clusters gene trees having similar evolutionary patterns (e.g. gene trees representing genes that have undergone the same horizontal gene transfer or recombination events). We critically review recent advances and methods in the field of phylogenetic tree clustering, discuss the methods' mathematical properties, and describe the advantages and limitations of multiple consensus tree and supertree approaches. In the application section, we show how the discussed supertree clustering approach can be used to cluster aaRS evolutionary trees according to their evolutionary patterns.
Vladimir Makarenkov, Gayane S. Barseghyan, Nadia Tahiri
Anomaly Detection with Neural Network Using a Generator
Abstract
This paper is concerned with the problem of detecting anomalies in X-ray images taken by Full Body Scanners (FBS). Our previous work describes the sequence of image preprocessing methods used to convert the original images produced by FBS into images with visually distinguishable anomalies. In this paper we focus on developing the proposed methods, including additional preprocessing methods and the creation of a generator that can produce synthetic anomalies. Examples of processed images are given, and the results of using a neural network for anomaly detection are shown.
Alexander S. Markov, Evgeny Yu. Kotlyarov, Natalia P. Anosova, Vladimir A. Popov, Yakov M. Karandashev, Darya E. Apushkinskaya
Controllability of Triangular Systems with Phase Space Change
Abstract
In the present paper the controllability of a composite system of the following structure is investigated: two phase spaces and two consecutive time intervals are given, and in each space, on the corresponding time interval, the motion of the object is described by a nonlinear system. The phase spaces are changed with the help of a given mapping, and the docking of trajectories is also connected with it. Conditions for the controllability of this system from an initial set in one space to a terminal set in the other space are obtained. An approach to finding trajectories of this motion is proposed.
Irina Sergeevna Maximova
A Parallel Linear Active Set Method
Abstract
Given a linear-inequality-constrained convex minimization problem in a Hilbert space, we develop a novel binary test that examines sets of constraints and passes only active-constraint sets. The test employs a black-box, linear-equality-constrained convex minimization method but can often fast fail, without calling the black-box method, by considering information from previous applications of the test on subsets of the current constraint set. This fast fail, as a function of the number of dimensions, has quadratic complexity, and can be completely multi-threaded down to near-constant complexity. Only when the test is unable to fast fail, does it use the black-box method. In both cases the test generates the optimal point over the subject inequalities. Iterative and largely parallel applications of the test over growing subsets of inequality constraints yields a minimization algorithm. We also include an adaptation of the algorithm for a non-convex polyhedron in Euclidean space. Outside of calling the black-box method, complexity is not a function of accuracy. The algorithm does not require the feasible space to have a non-empty interior, or even be nonempty. With ample threads, the multi-threaded complexity of the algorithm is constant as a function of the number of inequalities.
E. Dov Neimand, Şerban Sabău
Mean Values: A Multicriterial Analysis
Abstract
We introduce new notions of mean values based on ideas of multicriteria optimization. The distances from the current point to all points in the sample are regarded as elements of a vector estimate. We introduce preference relations on the set of all such vectors, based on information about the preferences of the decision maker, who could be a statistician, analyst or researcher. Such preference relations reflect the relative importance of the distances, including the case in which all distances are equally important. We define the mean values as the points whose corresponding vector estimates are nondominated with respect to the defined preference relation, and investigate their properties. Such mean values turn out to be multi-valued. We further explore the relationship between the new notions of mean values and their conventional definitions, and suggest computational approaches to the calculation of the suggested new means.
Vladislav V. Podinovski, Andrey P. Nelyubin
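The sketch below illustrates the simplest, componentwise (Pareto) case of the construction described above on a one-dimensional sample, with candidate means restricted to a grid. The grid and the sample are assumptions for illustration; the chapter treats general preference relations and proper computational procedures for the resulting set-valued means.

```python
# Pareto-nondominated "means": each candidate is scored by its vector of distances
# to all sample points; candidates whose distance vectors are dominated are discarded.
import numpy as np

sample = np.array([0.0, 1.0, 4.0])
candidates = np.linspace(-2.0, 6.0, 17)               # grid of candidate mean values
D = np.abs(candidates[:, None] - sample[None, :])     # distance vectors, one row per candidate

def dominated(i):
    """True if some candidate is at least as close to every sample point and closer to one."""
    better_eq = np.all(D <= D[i], axis=1)
    strictly = np.any(D < D[i], axis=1)
    return np.any(better_eq & strictly)

pareto_means = candidates[[i for i in range(len(candidates)) if not dominated(i)]]
print("nondominated candidates span:", pareto_means.min(), "to", pareto_means.max())
# For this sample the nondominated set fills the whole range [0, 4], illustrating
# why such multicriteria means are naturally multi-valued.
```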
Data and Text Interpretation in Social Media: Urban Planning Conflicts
Abstract
The relevance of this study is determined by the need to develop technologies for effective urban systems management and for the resolution of urban planning conflicts. The paper presents an algorithm for analyzing urban planning conflicts, illustrated by data and text interpretation in social media. The material for the study was data from social networks, microblogs, blogs, instant messaging, forums, reviews, video hosting services, thematic portals, online media, print media and TV related to the construction of the Big Circle metro line (Southern section) in Moscow (RF). Data collection period: 1 October 2020 to 10 June 2021. Number of tokens: 62,657,289. To analyze the content of social media, a multimodal approach was used. The paper presents the results of research on the development of methods and approaches for constructing mathematical and neural network models for analyzing social media users' perceptions based on user-generated content and on the digital footprints of users. Artificial neural networks, differential equations, and mathematical statistics were involved in building the models. Differential equations of dynamic systems were based on observations enabled by machine learning. In combination with the mathematical and neural network models, the developed approaches made it possible to draw conclusions about the tense situation, to identify residents' complaints against the constructors and city authorities, and to propose recommendations for resolving and preventing conflicts.
Maria Pilgun, Nailia Gabdrakhmanova
Visual Explainable Machine Learning for High-Stakes Decision-Making with Worst Case Estimates
Abstract
A major motivation for explaining and rigorously evaluating Machine Learning (ML) models comes from high-stakes decision-making tasks such as cancer diagnostics, self-driving cars, and others with possibly catastrophic consequences of wrong decisions. This chapter shows that visual knowledge discovery (VKD) methods based on the recently developed General Line Coordinates (GLC) can significantly contribute to solving this problem. The concept of hyperblocks (n-D rectangles) as interpretable dataset units is combined with GLC to create visual self-service machine learning models. Two variants of Dynamic Scaffold Coordinates (DSC) are proposed. They allow losslessly mapping high-dimensional datasets to a single two-dimensional Cartesian plane and interactively building an ML predictive model in this 2-D visualization space. A major benefit of DSC1 and DSC2 is their highly interpretable nature. They allow domain experts to control or establish new machine learning models through visual pattern discovery. This opens a visually appealing opportunity for domain experts who are not ML experts to build ML models as a self-service, bringing domain expertise into model discovery, which increases model explainability and trust for the end user. DSC were used to find, visualize, and estimate the worst-case validation splits in several benchmark datasets, which is important for high-risk applications. For large datasets, DSC is combined with dimensionality reduction techniques such as principal component analysis, singular value decomposition, and t-distributed stochastic neighbor embedding. A software package referred to as the Dynamic Scaffold Coordinates Visualization System (DSCViz) was created to showcase the DSC1 and DSC2 systems.
Charles Recaido, Boris Kovalerchuk
Algorithm of Trading on the Stock Market, Providing Satisfactory Results
Abstract
The paper proposes a new trading algorithm for the S&P 500 stock market, which provides positive results over a sufficiently long period of time. No assumptions are made about the behavior of this market (including probabilistic ones), and no predictions are made or used. The daily real stock price data are considered, and the gain (or loss) that would be obtained if the proposed stock-choice algorithm for the next day were applied is calculated. The algorithm uses only the closing price data for the preceding days and includes a special stopping rule based on the income accumulated from the initial day of the considered period up to the current day.
The suggested algorithm substantially uses the previously developed approach to constructing a family of graph decompositions (see Rubchinsky A, Family of graph decompositions and its applications to data analysis: working paper WP7/2016/09, Moscow: Higher School of Economics Publ. House, 2016, Series WP7 "Mathematical methods for decision making in economics, business and politics", p 60; Rubchinsky A, A new approach to network decomposition problems. In: Kalyagin V, Nikolaev A, Pardalos P, Prokopyev O (eds) Models, algorithms, and technologies for network analysis. NET 2016. Springer Proceedings in Mathematics & Statistics, vol 197. Springer, Cham, 2017; Rubchinsky A, Graph dichotomy algorithm and its applications to analysis of stocks market. In: Kalyagin V, Pardalos P, Prokopyev O, Utkina I (eds) Computational aspects and applications in large-scale networks. NET 2017. Springer Proceedings in Mathematics & Statistics, vol 247. Springer, Cham, 2018).
Alexander Rubchinsky, Kristina Baikova
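The skeleton below shows only where a stock-choice rule and an income-based stopping rule plug into a daily closing-price backtest. The chapter's choice rule, built on a family of graph decompositions of the market, is not reproduced here; a naive previous-day-momentum rule and an invented stopping threshold stand in for it, and the price data are synthetic.

```python
# Backtest skeleton with placeholder choice and stopping rules (illustrative only).
import numpy as np

def choose_stock(closes_so_far):
    """Placeholder choice rule: the stock with the best return on the last day."""
    last_returns = closes_so_far[-1] / closes_so_far[-2] - 1.0
    return int(np.argmax(last_returns))

def backtest(closes, stop_level=-0.10):
    """closes: array of shape (days, stocks) of daily closing prices."""
    income = 0.0
    history = [income]
    for t in range(2, len(closes)):
        stock = choose_stock(closes[:t])                 # decision uses only past closes
        income += closes[t, stock] / closes[t - 1, stock] - 1.0
        history.append(income)
        if income < stop_level:                          # stopping rule on accumulated income
            break
    return np.array(history)

rng = np.random.default_rng(0)
prices = 100 * np.cumprod(1 + 0.01 * rng.standard_normal((250, 20)), axis=0)
print("final accumulated income:", backtest(prices)[-1])
```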
Classification Using Marginalized Maximum Likelihood Estimation and Black-Box Variational Inference
Abstract
Based upon variational inference (VI), a new set of classification algorithms has recently emerged. These algorithms aim (A) to increase generalization power and (B) to decrease computational complexity. However, the complex mathematics and implementation considerations have led to the emergence of black-box variational inference (BBVI) methods. Relying on these principles, we assume the existence of a set of latent variables during the generation of data points. We then marginalize the conventional maximum likelihood objective function with respect to this set of latent variables and apply black-box variational inference to estimate the model's parameters. We evaluate the performance of the proposed method by comparing the results obtained from applying our method to real-world and synthetic data sets with those obtained using basic and state-of-the-art classification algorithms. We further scrutinize the impact of (1) the existence of non-informative features at various dimensionalities, (2) imbalanced data representation, (3) non-linear data sets, and (4) different data set sizes on the performance of the algorithms under consideration. The results obtained prove to be encouraging and effective.
Soroosh Shalileh
Generating Genomic Maps of Z-DNA with the Transformer Algorithm
Abstract
Z-DNA and Z-RNA have been shown to play an important role in various processes of genome functioning, acting as flipons that launch or suppress genetic programs. Genome-wide experimental detection of Z-DNA remains a challenge due to the dynamic nature of its formation. Recently we developed DeepZ, a deep learning approach based on CNN and RNN architectures that predicts Z-DNA regions using additional information from omics data collected from different cell types. Here we took advantage of the transformer algorithm, which trains attention maps, to improve classifier performance. We started with pretrained DNABERT models and fine-tuned them by training on experimental Z-DNA regions from mouse and human genome-wide studies. The resulting DNABERT-Z outperformed DeepZ. We demonstrated that DNABERT-Z fine-tuned on human data sets also generalizes to predict Z-DNA sites in the mouse genome.
Dmitry Umerenkov, Vladimir Kokh, Alan Herbert, Maria Poptsova
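A rough sketch of a fine-tuning setup in the spirit of DNABERT-Z is given below, using the Hugging Face transformers API. The checkpoint name, k-mer size, toy sequences, and training hyperparameters are assumptions for illustration; the chapter fine-tunes pretrained DNABERT models on experimentally determined Z-DNA regions.

```python
# Illustrative fine-tuning of a DNABERT-style checkpoint for Z-DNA classification.
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

CHECKPOINT = "zhihan1996/DNA_bert_6"        # assumed DNABERT 6-mer checkpoint on the Hub

def kmerize(seq, k=6):
    """DNABERT expects space-separated overlapping k-mers as input tokens."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

sequences = ["CGCGCGCGCGCGCGCGCG", "ATTTTAAATTTAAATTTA"]   # toy Z-prone vs. control sequences
labels = [1, 0]

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)
encodings = tokenizer([kmerize(s) for s in sequences], padding=True, return_tensors="pt")

class ZDNADataset(torch.utils.data.Dataset):
    def __len__(self):
        return len(labels)
    def __getitem__(self, i):
        item = {k: v[i] for k, v in encodings.items()}
        item["labels"] = torch.tensor(labels[i])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="dnabert-z", num_train_epochs=1,
                           per_device_train_batch_size=2, report_to=[]),
    train_dataset=ZDNADataset(),
)
trainer.train()
```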
Manipulation by Coalitions in Voting with Incomplete Information
Abstract
We consider the problem of coalitional manipulation in collective decision making and a probabilistic approach to solving it. We assume that voters have some information about other voters' preferences from opinion polls held before voting. There are five different types of poll information functions. Coalition members are assumed to have identical preferences. We consider the probability that in a randomly chosen preference profile there exists a coalition that has an incentive to manipulate under a given type of poll information. We answer the following questions: How does coalitional manipulability differ from individual manipulability? How do different types of poll information affect coalitional manipulability? We answer these questions via both theoretical investigation and computational experiments.
Yuliya A. Veselova
Rethinking Probabilistic Topic Modeling from the Point of View of Classical Non-Bayesian Regularization
Abstract
Probabilistic topic modeling, with hundreds of models and applications, has been an efficient text analysis technique for almost 20 years. This research area has evolved mostly within the frame of Bayesian learning theory. For a long time, the possibility of learning topic models with simpler conventional (non-Bayesian) regularization remained underestimated and rarely used. The framework of Additive Regularization for Topic Modeling (ARTM) fills this gap. It dramatically simplifies model inference and opens up new possibilities for combining topic models by simply adding their regularizers. This makes ARTM a tool for synthesizing models with desired properties and gives rise to fast online algorithms in the BigARTM open-source environment, which is equipped with a modular extensible library of regularizers. In this paper, a general iterative process is proposed that maximizes a smooth function on unit simplices. This process can be used as an inference mechanism for a wide variety of topic models. This approach is believed to be useful not only for rethinking probabilistic topic modeling, but also for building the neural topic models that have become increasingly popular in recent years.
Konstantin Vorontsov
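As a pointer to the kind of iterative process mentioned above, the sketch below runs the EM-like multiplicative iterations on unit simplices for the unregularized (PLSA) special case; in ARTM, a regularizer R is incorporated by adding phi * dR/dphi and theta * dR/dtheta to the counters before the non-negative normalization. The toy counts matrix and iteration budget are illustrative assumptions.

```python
# EM-like iterations on unit simplices (unregularized special case, toy data).
import numpy as np

rng = np.random.default_rng(0)
n_dw = rng.integers(0, 5, size=(200, 30)).astype(float)    # toy word-document counts (W x D)
W, D, T = n_dw.shape[0], n_dw.shape[1], 5

phi = rng.random((W, T)); phi /= phi.sum(axis=0)            # p(word | topic), columns on simplices
theta = rng.random((T, D)); theta /= theta.sum(axis=0)      # p(topic | doc), columns on simplices

for _ in range(50):
    # E-step: posterior topic responsibilities p(t | d, w), normalized over topics.
    p_tdw = phi[:, :, None] * theta[None, :, :]              # shape (W, T, D)
    p_tdw /= p_tdw.sum(axis=1, keepdims=True) + 1e-12
    # M-step: expected counters n_wt and n_td, then projection back onto the simplices.
    n_wt = (n_dw[:, None, :] * p_tdw).sum(axis=2)            # shape (W, T)
    n_td = (n_dw[:, None, :] * p_tdw).sum(axis=0)            # shape (T, D)
    phi = np.maximum(n_wt, 1e-12); phi /= phi.sum(axis=0, keepdims=True)
    theta = np.maximum(n_td, 1e-12); theta /= theta.sum(axis=0, keepdims=True)

loglik = (n_dw * np.log(phi @ theta + 1e-12)).sum()
print("log-likelihood after 50 iterations:", loglik)
```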
Metadata
Title
Data Analysis and Optimization
Edited by
Boris Goldengorin
Sergei Kuznetsov
Copyright year
2023
Electronic ISBN
978-3-031-31654-8
Print ISBN
978-3-031-31653-1
DOI
https://doi.org/10.1007/978-3-031-31654-8
