2020 | Book

Linking Sensitive Data

Methods and Techniques for Practical Privacy-Preserving Information Sharing

Written by: Prof. Peter Christen, Dr. Thilina Ranbaduge, Prof. Dr. Rainer Schnell

Publisher: Springer International Publishing

About this book

This book provides modern technical answers to the legal requirements of pseudonymisation as recommended by privacy legislation. It covers topics such as modern regulatory frameworks for sharing and linking sensitive information, concepts and algorithms for privacy-preserving record linkage and their computational aspects, practical considerations such as dealing with dirty and missing data, as well as privacy, risk, and performance assessment measures. Existing techniques for privacy-preserving record linkage are evaluated empirically and real-world application examples that scale to population sizes are described. The book also includes pointers to freely available software tools, benchmark data sets, and tools to generate synthetic data that can be used to test and evaluate linkage techniques.

This book consists of fourteen chapters grouped into four parts, and two appendices. The first part introduces the reader to the topic of linking sensitive data, the second part covers methods and techniques to link such data, the third part discusses aspects of practical importance, and the fourth part provides an outlook on future challenges and open research problems relevant to linking sensitive databases. The appendices provide pointers to, and describe, freely available open-source software systems that allow the linkage of sensitive data, and provide further details about the evaluations presented. A companion Web site at https://dmm.anu.edu.au/lsdbook2020 provides additional material and Python programs used in the book.

This book is mainly written for applied scientists, researchers, and advanced practitioners in governments, industry, and universities who are concerned with developing, implementing, and deploying systems and tools to share sensitive information in administrative, commercial, or medical databases.

"The book describes how linkage methods work and how to evaluate their performance. It covers all the major concepts and methods and also discusses practical matters such as computational efficiency, which are critical if the methods are to be used in practice - and it does all this in a highly accessible way!" David J. Hand, Imperial College, London

Table of contents

Frontmatter

Introduction

Frontmatter
Chapter 1. Introduction
Abstract
In this chapter, we show that linking individual records from different databases is indispensable for many research purposes and data usage in practical applications. Almost all analyses of Big data sources require linking several databases containing information about the same or similar populations. We discuss examples of applications from medicine, economics, and official statistics. Since the GDPR and other legal restrictions usually require pseudonymisation, the use of error-tolerant pseudonymisation methods becomes necessary. Based on the growing body of research published in diverse areas, we show that the need for the techniques presented in this book is increasing.
Peter Christen, Thilina Ranbaduge, Rainer Schnell
Chapter 2. Regulatory Frameworks
Abstract
In this chapter, we review ethical and legal regulatory frameworks as relevant to the topic of linking sensitive databases that contain personal information. With reference to the declarations of Helsinki and Taipei, the resulting importance of research ethics committees is explained. Thereafter, we describe formal regulations for linking databases in selected countries. The European General Data Protection Regulation (GDPR) and its implementation in different European countries (Austria, Germany, UK) are outlined first. We discuss the Caldicott principles, important in the UK but not well known in Europe. We then describe the basic principles of the Common Rule and the Health Insurance Portability and Accountability Act (HIPAA) in the US. For comparison, the legal regulations in Australia and Switzerland are outlined. We then introduce best practice approaches, such as separating microdata and identifiers, using technical and organisational measures to restrict data access, and implementing organisational structures and procedures such as the Five Safes. Finally, we highlight the importance of embedding research involving sensitive databases within organisational and societal settings, both for the evaluation of privacy and as a precondition for research.
Peter Christen, Thilina Ranbaduge, Rainer Schnell
Chapter 3. Linking Sensitive Data Background
Abstract
This chapter covers background topics relevant to linking databases, with a focus on linking sensitive data. We first provide a short history of linking data, and then give an overview of the record linkage process with a flow chart and descriptions of the major steps involved when linking databases. We discuss aspects of data quality that can influence the outcomes of a linking process, and we highlight the importance of data preprocessing in any linkage application. Next, we present measures that can be used to evaluate the linkage process with regard to linkage quality and completeness, as well as complexity and scalability when linking large databases. Linking sensitive databases can involve a variety of challenges, which we discuss in this chapter. We then introduce the topic of privacy-preserving record linkage (PPRL), and provide a formal definition. We also contrast the general record linkage process with the PPRL process and describe the additional requirements of PPRL over traditional record linkage.
Peter Christen, Thilina Ranbaduge, Rainer Schnell

Methods and Techniques

Frontmatter
Chapter 4. Private Information Sharing Protocols
Abstract
In this chapter we describe different aspects of information sharing protocols aimed at linking sensitive databases. We first discuss the different roles of the parties that can participate in these protocols. Next we describe the general principle of separating identifying data from sensitive microdata in a linkage process, and provide the details of the basic two-party, three-party, and multiparty protocols that can be used for different linkage scenarios. We then describe the different adversarial models that are being used to model the privacy risks and protection provided by these protocols, such as the honest-but-curious and malicious models. Finally, we cover additional aspects that need to be considered when private information sharing protocols are employed in practice, such as frameworks for key exchange as well as access control.
Peter Christen, Thilina Ranbaduge, Rainer Schnell
Chapter 5. Assessing Privacy and Risks
Abstract
An important aspect of any application that facilitates the linking of sensitive data is its evaluation with regard to the privacy protection it provides, as well as the risk of a successful reidentification of sensitive information in any encoded or encrypted database used by such an application. In this chapter we discuss how to measure privacy and risks in the context of conducting privacy-preserving linking of sensitive databases, and we present the different types of attacks that can potentially be applied to encoded or encrypted databases where the aim of an adversary is to learn about the sensitive information contained in such databases. We also discuss the related topic of statistical disclosure control methods, which have been used by many national statistical institutes in the context of publishing sensitive microdata while at the same time ensuring that the release of such data protects the identity of all subjects in the released data.
Peter Christen, Thilina Ranbaduge, Rainer Schnell
Chapter 6. Building Blocks for Linking Sensitive Data
Abstract
This chapter covers in detail the various techniques that are employed as building blocks in protocols used to link sensitive databases. These include techniques to generate random values, as well as various hashing, anonymisation, and encryption techniques. We also cover the basic building blocks used in secure multiparty computation approaches such as secure summation and secure set intersection, and discuss differential privacy. We then describe methods for phonetic encoding, statistical linkage keys, and measures that can be used to calculate similarities between two values. We conclude the chapter with a discussion about the applicability of such building blocks when they are used in the context of linking sensitive databases.
Peter Christen, Thilina Ranbaduge, Rainer Schnell
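As a rough illustration of one building block named in this abstract, secure summation, the following Python sketch simulates a simple ring-based variant in a single process. It is not taken from the book: the function name `secure_summation`, the modulus, and the ring ordering are assumptions made for illustration. The sketch also assumes non-negative integer inputs whose sum is below the modulus, and parties that follow the protocol and do not collude.

```python
import secrets


def secure_summation(private_values, modulus=2**61 - 1):
    """Simulate ring-based secure summation (illustrative sketch only).

    Party 0 hides its value behind a random mask and passes the result on.
    Every other party adds its own value to the running total it receives,
    so no party sees another party's individual value in the clear.
    Party 0 finally removes the mask to obtain the overall sum.
    """
    if not private_values:
        raise ValueError("at least one party is required")

    # Party 0: draw a secret random mask and add its own value.
    mask = secrets.randbelow(modulus)
    running = (mask + private_values[0]) % modulus

    # Parties 1..n-1: each only sees the masked running total.
    for value in private_values[1:]:
        running = (running + value) % modulus

    # Back at party 0: remove the mask to reveal only the total.
    return (running - mask) % modulus


print(secure_summation([12, 30, 8]))  # 50
```

The modular arithmetic keeps the intermediate totals uniformly distributed from the point of view of each receiving party, which is why the mask is drawn from the full range below the modulus.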
Chapter 7. Encoding and Comparing Sensitive Values
Abstract
Over the past nearly two decades researchers from different domains, including statistics, the social and health sciences, and computer science, have developed a variety of techniques to link sensitive databases in a privacy-preserving way. Many of these techniques have so far not been used in practical applications for a variety of reasons that range from security weaknesses or limitations in linkage capabilities to prohibitive computational requirements. We begin this chapter with a taxonomy that has been developed to categorise techniques for encoding and comparing sensitive databases based on different dimensions ranging from privacy and technical to practical aspects, and we provide a general discussion of the different generations of techniques that have been developed. We then give brief overviews of specific techniques, including those based on phonetic encoding, hashing, public reference values, embedding into multidimensional spaces, and secure multiparty computation. We end the chapter with a discussion on the suitability of these types of techniques for different linkage scenarios.
Peter Christen, Thilina Ranbaduge, Rainer Schnell
Chapter 8. Bloom Filter based Encoding Methods
Abstract
Bloom filter encoding is currently the most popular privacy technique employed in different practical applications to link sensitive databases. In this chapter we describe in detail the main techniques used to encode databases into Bloom filters, and how to calculate similarities between encoded Bloom filters. We cover existing hashing and encoding techniques for Bloom filters, and how they can be used to encode textual and numerical values, as well as hierarchical classification codes. We also discuss how to choose suitable Bloom filter encoding techniques and how to appropriately set the parameters of these techniques for their use in real-world linkage applications.
Peter Christen, Thilina Ranbaduge, Rainer Schnell
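To make the idea behind this abstract more concrete, here is a minimal Python sketch of one common way to encode a textual value into a Bloom filter and compare two encodings with the Dice coefficient. It is not the book's own code: the parameter values (`num_bits`, `num_hashes`, `q`), the demo secret key, and the use of HMAC-SHA256 and HMAC-SHA512 in a double hashing scheme are assumptions chosen for illustration only.

```python
import hashlib
import hmac


def qgrams(value, q=2):
    """Return the set of overlapping q-grams of a lower-cased string."""
    value = value.lower()
    return {value[i:i + q] for i in range(len(value) - q + 1)}


def bloom_encode(value, num_bits=1000, num_hashes=10,
                 secret=b"demo-key", q=2):
    """Return the set of bit positions set to 1 for the given value.

    Each q-gram is hashed with two keyed hash functions, and the
    positions are derived with double hashing: (h1 + i * h2) mod num_bits.
    """
    positions = set()
    for gram in qgrams(value, q):
        data = gram.encode("utf-8")
        h1 = int.from_bytes(hmac.new(secret, data, hashlib.sha256).digest(), "big")
        h2 = int.from_bytes(hmac.new(secret, data, hashlib.sha512).digest(), "big")
        for i in range(num_hashes):
            positions.add((h1 + i * h2) % num_bits)
    return positions


def dice_similarity(bits_a, bits_b):
    """Dice coefficient of two Bloom filters given as sets of 1-bit positions."""
    if not bits_a and not bits_b:
        return 0.0  # convention chosen here for two empty filters
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))


# Similar names share q-grams and therefore many bit positions.
print(dice_similarity(bloom_encode("smith"), bloom_encode("smyth")))
print(dice_similarity(bloom_encode("smith"), bloom_encode("jones")))
```

Both parties would need to use the same parameters and secret key for their encodings to be comparable; how to choose those settings in practice is the subject of the chapter itself.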
Chapter 9. Attacking and Hardening Bloom Filter Encoding
Abstract
While the Bloom filter hashing and encoding techniques described in the previous chapter have been shown to enable the efficient and accurate linkage of large sensitive databases, including the approximate matching of textual and numerical values, as well as hierarchical classification codes, recent research has shown that these techniques can be vulnerable to certain attack methods that are aimed at reidentifying the sensitive information encoded in Bloom filters. In this chapter we discuss the principal ideas of these attacks and describe in more detail several successful attack methods. We then describe a series of hardening techniques that aim to make Bloom filter encoding less vulnerable to such attacks.
Peter Christen, Thilina Ranbaduge, Rainer Schnell
Chapter 10. Computational Efficiency
Abstract
As the sensitive databases held by many organisations are getting larger, linking them can become computationally more challenging. Techniques known as blocking, indexing, and filtering have been developed and are being used to make linkage techniques more scalable. In this chapter we describe a variety of such techniques, as well as methods for linking large sensitive databases on modern parallel computing platforms and distributed environments. We also discuss scalability aspects when multiple (more than two) databases are to be linked, and the challenges involved when many, possibly hundreds or even thousands, of databases need to be linked.
Peter Christen, Thilina Ranbaduge, Rainer Schnell
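The effect of blocking mentioned in this abstract can be sketched in a few lines of Python. The example below is a plain, non-private illustration and is not taken from the book: the record fields (`id`, `surname`, `birth_year`) and the blocking key (first letter of the surname plus birth year) are hypothetical, and in a privacy-preserving setting the blocking keys themselves would also need to be protected.

```python
from collections import defaultdict


def blocking_key(record):
    """Hypothetical blocking key: first surname letter plus birth year."""
    return record["surname"][:1].lower() + str(record["birth_year"])


def candidate_pairs(db_a, db_b):
    """Return (id_a, id_b) pairs for records that share a blocking key.

    Only records in the same block are compared, instead of every
    record in db_a against every record in db_b.
    """
    # Index the second database by blocking key.
    blocks = defaultdict(list)
    for rec in db_b:
        blocks[blocking_key(rec)].append(rec["id"])

    # Look up each record of the first database in that index.
    pairs = []
    for rec in db_a:
        for other_id in blocks.get(blocking_key(rec), []):
            pairs.append((rec["id"], other_id))
    return pairs
```

Without blocking, two databases of sizes m and n require m × n comparisons; with blocking, only the pairs inside matching blocks are compared, at the cost of missing true matches whose blocking keys differ because of data errors.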

Practical Aspects, Evaluation, and Applications

Frontmatter
Chapter 11. Practical Considerations
Abstract
When planning to link real-world sensitive databases, an organisation is likely to face a variety of practical data-related issues that can include, but are not limited to, noisy and dirty data, missing values in the attributes used for linkage, data recorded in different formats and structures, as well as data collected at different points in time. In this chapter we discuss how these issues can affect the linkage of sensitive databases, and how to address them by employing appropriate data cleaning, preprocessing, and standardisation approaches, as well as linkage strategies that can deal with missing data. In the later sections of this chapter we then cover technical as well as institutional aspects that need to be considered when (sensitive) databases are being linked, such as the availability or lack of software, the difficulty of implementation of certain techniques, the setting and tuning of parameters required by different techniques, the requirement and management of computational resources, and the skills and expertise required to link sensitive databases. We conclude the chapter with a discussion of guidelines that have been developed to help practitioners improve their record linkage processes. Our aim with this chapter is to provide the reader with the breadth of issues they need to consider when linking (sensitive) databases.
Peter Christen, Thilina Ranbaduge, Rainer Schnell
Chapter 12. Empirical Evaluation
Abstract
In this chapter we describe an empirical evaluation of selected Bloom filter based encoding and hardening techniques we have described in Chapters 8 and 9 with regard to their linkage quality, scalability, and privacy. We describe the data sets and software used for this evaluation, the experimental platform employed, as well as the evaluation measures we used. The aim of this chapter is to provide the reader with an example of how sensitive databases can be linked using privacy-preserving linkage techniques, and how such a linkage exercise can be evaluated. We further describe how to use the software employed for this evaluation in Appendix B.
Peter Christen, Thilina Ranbaduge, Rainer Schnell
Chapter 13. Real-world Applications
Abstract
In this chapter we describe several existing real-world applications where sensitive databases are being linked in a practical setting using privacy-preserving techniques. These examples come from different countries, where different privacy frameworks and legislation exist that either make the use of such approaches necessary, or where organisations are using privacy-preserving linkage approaches to make their linkages more secure.
Peter Christen, Thilina Ranbaduge, Rainer Schnell

Outlook

Frontmatter
Chapter 14. Future Research Challenges and Directions
Abstract
The linking of sensitive databases across organisations is an active area of research in several domains. In this chapter we discuss some of the major open research questions that require further investigation. These include the development of frameworks that allow comparative empirical evaluations, the preparation of benchmark data collections, how to link sensitive databases in a cloud data service, how to properly assess the quality and completeness of linked databases in those situations when only encoded or encrypted records are available, improved theoretically grounded privacy measures, how to best deal with missing values, and novel adversary models. We also discuss how the linking of sensitive databases is challenged in the era of Big data, and the challenges and opportunities that novel types of data, such as biometric and genetic data, can provide.
Peter Christen, Thilina Ranbaduge, Rainer Schnell
Backmatter
Metadata
Title
Linking Sensitive Data
Written by
Prof. Peter Christen
Dr. Thilina Ranbaduge
Prof. Dr. Rainer Schnell
Copyright year
2020
Electronic ISBN
978-3-030-59706-1
Print ISBN
978-3-030-59705-4
DOI
https://doi.org/10.1007/978-3-030-59706-1