FREE ELECTRONIC LIBRARY - Books, dissertations, abstract

Pages:   || 2 | 3 | 4 |

«Abstract Despite its benefit in a wide range of applications, data mining techniques also have raised a number of ethical issues. Some such issues ...»

-- [ Page 1 ] --

Privacy Preserving Clustering By Data Transformation

Stanley R. M. Oliveira1,2 Osmar R. Za¨ane2


Embrapa Inform´ tica Agropecu´ ria

a a Department of Computing Science

Andr´ Tosello, 209 - Bar˜ o Geraldo

e a University of Alberta

13083-886 - Campinas, SP, Brasil Edmonton, AB, Canada, T6G 1K7

oliveira@cs.ualberta.ca zaiane@cs.ualberta.ca Abstract Despite its benefit in a wide range of applications, data mining techniques also have raised a number of ethical issues.

Some such issues include those of privacy, data security, intellectual property rights, and many others. In this paper, we address the privacy problem against unauthorized secondary use of information. To do so, we introduce a family of geometric data transformation methods (GDTMs) which ensure that the mining process will not violate privacy up to a certain degree of security. We focus primarily on privacy preserving data clustering, notably on partition-based and hierarchical methods. Our proposed methods distort only confidential numerical attributes to meet privacy requirements, while preserving general features for clustering analysis. Our experiments demonstrate that our methods are effective and provide acceptable values in practice for balancing privacy and accuracy. We report the main results of our performance evaluation and discuss some open research issues.

1. Introduction Huge volumes of detailed personal data are regularly collected and analyzed by applications using data mining. Such data include shopping habits, criminal records, medical history, credit records, among others [4]. On the one hand, such data is an important asset to business organizations and governments both to decision making processes and to provide social benefits, such as medical research, crime reduction, national security, etc. [12]. On the other hand, analyzing such data opens new threats to privacy and autonomy of the individual if not done properly.

The threat to privacy becomes real since data mining techniques are able to derive highly sensitive knowledge from unclassified data that is not even known to database holders. Worse is the privacy invasion occasioned by secondary usage of data when individuals are unaware of “behind the scenes” use of data mining techniques [13]. As an example in point, Culnan [6] made a particular study of secondary information use which she defined as “the use of personal information for other purposes subsequent to the original transaction between an individual and an organization when the information was collected.” The key finding of this study was that concern over secondary use was correlated with the level of control the individual has over the secondary use. As a result, individuals are increasingly feeling that they are losing control over their own personal information that may reside on thousands of file servers largely beyond the control of existing privacy laws. This scenario has led to privacy invasion on a scale never before possible.

The challenging problem that we address in this paper is: how can we protect against the abuse of the knowledge discovered from secondary usage of data and meet the needs of organizations and governments to support decision making or even to promote social benefits? We claim that a solution for such a problem requires two vital techniques: anonymity [15, 17] to remove identifiers (e.g. names, social insurance numbers, addresses, etc.) in the first phase of privacy protection, and data transformation to protect some sensitive attributes (e.g. salary, age, etc.) since the released data, after removing identifiers, may contain other information that can be linked with other datasets to re-identify individuals or entities [19]. In this paper, we focus on the latter technique. Specifically, we consider the case in which confidential numerical attributes are distorted in order to meet privacy protection in clustering analysis, notably on partition-based and hierarchical methods.

The intuition behind such methods is to partition a dataset into new classes (clusters) of similar objects. The goal is to group objects to achieve high similarity between objects within individual clusters (interclass similarity) and low similarity between objects that belong to different clusters (intraclass similarity) [11]. Clustering is widely used in many applications such as customer behaviour analysis, targeted marketing, and many others.

A motivating example for the privacy problem in data clustering could be found in business collaboration. Two or more companies have a very large dataset of records of their customers’ buying activities. These companies decide to cooperatively conduct data clustering on their datasets for their mutual benefit since this collaboration brings them an advantage over other competitors. The goal is to subdivide a market into distinct subsets of customers where any subset may be selected as a market to be reached with a distinct marketing mix. However, these companies would like to transform their data in such a way that the privacy of their customers cannot be violated. Is it possible for these companies to benefit from such collaboration by sharing their data while preserving the private information of their customers?

To address privacy concerns in clustering analysis, we need to design specific data transformation methods that enforce privacy without loosing the benefit of mining. The proposed data perturbation methods in the literature pertain to the context of statistical databases [1, 7, 5, 16]. They do not apply to data clustering as they have limitations when the perturbed attributes are considered as a vector in the Euclidean space. For instance, let us suppose that some confidential attributes (e.g. salary and age) are represented by points in a 2D discrete space for clustering analysis. If we distort these attributes using any perturbed methods proposed in the literature, the clusters obtained after perturbing the data would be very different from those mined from the original database. The main problem is that many points would move from one cluster to another jeopardizing the notion of similarity between data points in the global space. Consequently, this introduces the problem of misclassification. Therefore, the perturbation has to be uniformly applied to all attributes to guarantee safeguarding the global distances between data points, or even to slightly modify the distance between some points.

In this paper, we introduce a family of geometric data transformation methods (GDTMs) that distort confidential numerical attributes in order to meet privacy protection in clustering analysis. We benefit from the work on image processing [10]. Of particular interest is work on geometric transformation of digital images, notably the idea behind translation, scaling, and rotation. We also benefit from the work on statistical databases, particularly the intuition behind data distortion. We show that our transformation data methods are simple, independent of clustering algorithms, preserve the general features of the clusters, and have a sound mathematical foundation. Although our approach does not provide a comprehensive solution to the problem of privacy preservation in data mining, we argue that our approach is a simple building block toward privacy preserving data clustering. To date, such schemata have not been explored in detail.

This paper is organized as follows. Related work is reviewed in Section 2. In Section 3, we provide the basic concepts that are necessary to understand the scope and the issues addressed in this paper. We introduce our family of geometric data transformation methods in Section 4. In Section 5, we present the experimental results and discussion. Finally, Section 6 presents our conclusions and a discussion of future work.

2. Related Work

Some effort has been made to address the problem of privacy preservation in data mining. This effort has been restricted basically to classification and association rules. The class of solutions for this problem rely on data partition, data sanitization, randomization and data distortion. In this work, we focus on the last two categories.

Estivill-Castro and Brankovic [8] introduced a method for ensuring partial disclosure while allowing a miner to explore detailed data. In this approach, one first builds a local decision tree over true data, and then swaps values amongst records in a leaf node of the tree to generate randomized training data. The swapping is performed over the confidential attribute only, where the confidential attribute is the class label. This approach deals with a trade-off: statistical precision against security level, i.e., the closer to the root, the higher the security but lower the precision.

Agrawal and Srikant [3] considered the case of building a decision-tree classifier from training data in which the values of individual records have been perturbed, by adding random values from a probability distribution. The resulting data records look very different from the original records and the distribution of data values is also very different from the original distribution. While it is not possible to accurately estimate original values in individual data records, they proposed a novel reconstruction procedure to accurately estimate the distribution of original data values. The distribution reconstruction process naturally leads to some loss of information, but the authors argue that this is acceptable in many practical situations.

In [2], the authors proposed a new algorithm for distribution reconstruction which is more effective than that proposed in [3], in terms of the level of information loss. This algorithm, based on Expectation Maximization (EM) algorithm, converges to the maximum likelihood estimate of the original distribution based on the perturbed data, even when a large amount of data is available. They also pointed out that the EM algorithm was in fact identical to the Bayesian reconstruction proposed in [3], except for the approximation partitioning values into intervals.

Evfimievski et al. [9] proposed a framework for mining association rules from transactions consisting of categorical items in which the data has been randomized to preserve privacy of individual transactions. The idea behind this approach is that some items in each transaction are replaced by new items not originally present in this transaction. In doing so, some true information is taken away and some false information is introduced, which seems to have obtained a reasonable privacy protection. In general, this strategy is feasible to recover association rules, less frequent than originally, and preserve privacy using a straightforward uniform randomization. Although privacy is preserved on average, confidential information leaks through uniform randomization for some fraction of transactions.

More recently, the data distortion approach has been applied to boolean association rules [18].

Again, the idea is to modify data values such that reconstruction of the values for any individual transaction is difficult, but the rules learned on the distorted data are still valid. One interesting feature of this work is a flexibility definition of privacy. For instance, the ability to correctly guess a value of ‘1’ from the distorted data can be considered a greater threat to privacy than correctly learning a ‘0’.

This scheme is based on probabilistic distortion of user data, which is composed of a privacy metric and an analytical formula. Although this framework provides a high degree of privacy to the user and retains a high level of accuracy in the mining results, mining the distorted database can be, apart from being error-prone, significantly more expensive in terms of both time and space as compared to mining the original database.

The work presented here differs from the related work in some aspects, as follows: First, we aim to address the problem of privacy preservation in clustering analysis. To our best knowledge, this problem has not been considered so far. Our proposed solution and those ones in the related work are complementary. Second, we study the impact of our data transformation schemes in the original database by quantifying how much information is preserved after transforming a database.

So, our focus is not only on protecting individual data records, but also on providing accurate data for clustering analysis.

Pages:   || 2 | 3 | 4 |

Similar works:

«TECHNISCHE UNIVERSITÄT MÜNCHEN Lehrstuhl Entwicklungsgenetik The effect of mitochondrial dysfunction on astrocytes and radial glia like stem cells in the adult hippocampus Birgit Ebert Vollständiger Abdruck der von der Fakultät Wissenschaftszentrum Weihenstephan für Ernährung, Landnutzung und Umwelt der Technischen Universität München zur Erlangung des akademischen Grades eines Doktors der Naturwissenschaften genehmigten Dissertation. Vorsitzender: Univ.-Prof. Dr. E. Grill Prüfer der...»

«Informing Science InSITE “Where Parallels Intersect” June 2003 Investigating the Status of Data Mining in Practice Teh Ying Wah and Zaitun Abu Bakar University of Malaya, Kuala Lumpur, Malaysia tehyw@um.edu.my zab@um.edu.my Abstract This paper is based on a survey carried out in the Malaysian environment. The paper starts with a definition of data warehouses, data mining and this is followed by a description of its current status in the Malaysian data mining environment. This is followed by...»

«Variational Assimilation (§5.5) We now turn from Optimal Interpolation to another approach to objective analysis, the variational assimilation technique. Variational Assimilation (§5.5) We now turn from Optimal Interpolation to another approach to objective analysis, the variational assimilation technique. This method is of growing popularity and is now in use in several major NWP centres. Variational Assimilation (§5.5) We now turn from Optimal Interpolation to another approach to objective...»

«® UNITEST Bedienungsanleitung Instruction Manual Mode d’emploi CHB 15 Best.-Nr.: 93468 Order No.: 93468 Réf.: 93468 CHB 35 Best.-Nr.: 93469 Order No.: 93469 Réf.: 93469 CHB 10 Best.-Nr.: 93471 Order No.: 93471 Réf.: 93471 Inhaltsverzeichnis Seite 1.0 Einleitung/Lieferumfang 2 2.0 Sicherheitsmaßnahmen 4 3.0 Bedienelemente und Anschlüsse 5 4.0 Durchführung von Messungen 6 4.1 Vorbereitung und Sicherheitsmaßnahmen 6 4.2 Durchführen von Strommessungen 6 4.2.1 Gleichstrommessung 7 4.2.2...»

«TECHNISCHE UNIVERSITÄT MÜNCHEN Lehrstuhl für Brauund Getränketechnologie Fundamental studies on sodium reduction in processing cereal-based foods Margit Beck Vollständiger Abdruck der von der Fakultät Wissenschaftszentrum Weihenstephan für die Ernährung, Landnutzung und Umwelt der Technischen Universität München zur Erlangung des akademischen Grades eines Doktor der Naturwissenschaften genehmigten Dissertation. Vorsitzender: Univ.Prof. Dr. H.-Chr. Langowski Prüfer der Dissertation:...»

«ISSN (Print): 2279-0063 International Association of Scientific Innovation and Research (IASIR) ISSN (Online): 2279-0071 (An Association Unifying the Sciences, Engineering, and Applied Research) International Journal of Software and Web Sciences (IJSWS) www.iasir.net Data Mining Techniques for Software Defect Prediction Ms. Puneet Jai Kaur1, Ms. Pallavi2 Assistant Professor, UIET, Panjab University, Chandigarh, India Student, UIET, Panjab University, Chandigarh, India AbstractSoftware defect...»

«Czech Technical University in Prague Faculty of Electrical Engineering Diploma thesis Distributed topology control in MANETs Tom´ˇ Meiser as Thesis supervisor: Ing. Milan Rollo, Ph.D. Study programme: Open Informatics Specialization: Artificial Intelligence July 2012 Acknowledgements I would like to thank all the people who helped me in any way with this master thesis. Especially, I would like to thank to my supervisor Ing. Milan Rollo Ph.D. for his time and patience. ii Abstrakt C´ılem...»

«Please quote as: Schweizer, K.; Leimeister, J. M. & Krcmar, H. (2006): The role of virtual communities for the social network of cancer patients. In: Proceedings of the Americas Conference on Information Systems (AMCIS 2006), Acapulco. Schweizer et al. Virtual communities & the social network of cancer patients The role of virtual communities for the social network of cancer patients Karin Janina Schweizer Jan Marco Leimeister Microsoft Deutschland GmbH Technische Universität München...»

«Seen in a different light. Icons in Byzantine museums and churches Gianna Stavroulaki and John Peponis National Technical University of Athens, Greece, Georgia institute of technology, USA gstav 1@otenet.gr, john.peponis@arch.gatech.edu Abstract We analyze the principles that govern the visibility of icons in churches and museums. We develop methods for enriching visibility analysis by taking into consideration not only the arrangement but also the illumination of space. Finally, we suggest...»

«58th ILMENAU SCIENTIFIC COLLOQUIUM URN (Paper): urn:nbn:de:gbv:ilm1-2014iwk-070:2 Technische Universität Ilmenau, 08 – 12 September 2014 URN: urn:nbn:de:gbv:ilm1-2014iwk:3 Determination of the Near-Field-Acoustics of Primary Vehicle Sound Sources in Relation to Indoor Pass-by Noise Testing for the Verification of a Virtual Acoustic Vehicle Model Albert Albers1, David Landes1, Matthias Behrendt1, Christian Weber2, Antje Siegel2, Stephan Husung2 IPEK Institute of Product Engineering at...»

<<  HOME   |    CONTACTS
2016 www.book.dislib.info - Free e-library - Books, dissertations, abstract

Materials of this site are available for review, all rights belong to their respective owners.
If you do not agree with the fact that your material is placed on this site, please, email us, we will within 1-2 business days delete him.