Privacy Preserving Clustering By Data Transformation

Stanley R. M. Oliveira¹,²    Osmar R. Zaïane²

¹Embrapa Informática Agropecuária
André Tosello, 209 - Barão Geraldo
13083-886 - Campinas, SP, Brasil
email@example.com

²Department of Computing Science
University of Alberta
Edmonton, AB, Canada, T6G 1K7
firstname.lastname@example.org

Abstract

Despite its benefit in a wide range of applications, data mining techniques have also raised a number of ethical issues.
Among these issues are privacy, data security, and intellectual property rights. In this paper, we address the privacy problem posed by unauthorized secondary use of information. To do so, we introduce a family of geometric data transformation methods (GDTMs) which ensure that the mining process will not violate privacy up to a certain degree of security. We focus primarily on privacy preserving data clustering, notably on partition-based and hierarchical methods. Our proposed methods distort only confidential numerical attributes to meet privacy requirements, while preserving the general features required for clustering analysis. Our experiments demonstrate that our methods are effective and, in practice, achieve an acceptable balance between privacy and accuracy. We report the main results of our performance evaluation and discuss some open research issues.
1. Introduction

Huge volumes of detailed personal data are regularly collected and analyzed by data mining applications. Such data include shopping habits, criminal records, medical histories, and credit records, among others. On the one hand, such data are an important asset to business organizations and governments, both for decision-making processes and for providing social benefits, such as medical research, crime reduction, and national security. On the other hand, analyzing such data opens new threats to the privacy and autonomy of the individual if not done properly.
The threat to privacy becomes real since data mining techniques are able to derive highly sensitive knowledge from unclassified data that is not even known to the database holders. Worse still is the privacy invasion occasioned by secondary use of data, when individuals are unaware of the "behind the scenes" use of data mining techniques. As a case in point, Culnan conducted a study of secondary information use, which she defined as "the use of personal information for other purposes subsequent to the original transaction between an individual and an organization when the information was collected." The key finding of this study was that concern over secondary use was correlated with the level of control the individual has over that use. As a result, individuals increasingly feel that they are losing control over their own personal information, which may reside on thousands of file servers largely beyond the reach of existing privacy laws. This scenario has led to privacy invasion on a scale never before possible.
The challenging problem that we address in this paper is: how can we protect against the abuse of the knowledge discovered from secondary use of data while meeting the needs of organizations and governments to support decision making or even to promote social benefits? We claim that a solution to this problem requires two vital techniques: anonymity [15, 17], to remove identifiers (e.g., names, social insurance numbers, addresses) in the first phase of privacy protection, and data transformation, to protect sensitive attributes (e.g., salary, age), since the released data, even after removing identifiers, may contain other information that can be linked with other datasets to re-identify individuals or entities. In this paper, we focus on the latter technique. Specifically, we consider the case in which confidential numerical attributes are distorted in order to meet privacy-protection requirements in clustering analysis, notably for partition-based and hierarchical methods.
The intuition behind such methods is to partition a dataset into classes (clusters) of similar objects. The goal is to group objects so that similarity is high between objects within the same cluster (intra-class similarity) and low between objects that belong to different clusters (inter-class similarity). Clustering is widely used in many applications, such as customer behaviour analysis and targeted marketing.
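For concreteness, the following minimal Python sketch (our illustration; the toy points and the partition are invented, not taken from the paper) contrasts average intra-cluster and inter-cluster Euclidean distances for a fixed partition:

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points in R^d."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Toy 2D points (e.g., confidential attributes such as age and salary),
# already partitioned into two clusters by hand.
cluster_a = [(25, 30000), (27, 32000), (26, 31000)]
cluster_b = [(52, 90000), (55, 95000), (53, 92000)]

def avg_pairwise(points_1, points_2):
    """Average distance over all distinct pairs drawn from the two lists."""
    pairs = [(p, q) for p in points_1 for q in points_2 if p != q]
    return sum(euclidean(p, q) for p, q in pairs) / len(pairs)

intra = avg_pairwise(cluster_a, cluster_a)  # small: high intra-class similarity
inter = avg_pairwise(cluster_a, cluster_b)  # large: low inter-class similarity
print(f"avg intra-cluster distance: {intra:.1f}")
print(f"avg inter-cluster distance: {inter:.1f}")
```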
A motivating example of the privacy problem in data clustering can be found in business collaboration. Suppose two or more companies each hold a very large dataset recording their customers' buying activities. These companies decide to cooperatively conduct data clustering on their datasets for mutual benefit, since this collaboration gives them an advantage over competitors. The goal is to subdivide a market into distinct subsets of customers, where any subset may be selected as a target to be reached with a distinct marketing mix. However, these companies would like to transform their data in such a way that the privacy of their customers cannot be violated. Is it possible for these companies to benefit from such collaboration by sharing their data while preserving their customers' private information?
To address privacy concerns in clustering analysis, we need to design specific data transformation methods that enforce privacy without losing the benefit of mining. The data perturbation methods proposed in the literature pertain to the context of statistical databases [1, 7, 5, 16]. They do not apply to data clustering because they have limitations when the perturbed attributes are considered as a vector in Euclidean space. For instance, suppose that some confidential attributes (e.g., salary and age) are represented by points in a 2D discrete space for clustering analysis. If we distorted these attributes using any of the perturbation methods proposed in the literature, the clusters obtained after perturbing the data would be very different from those mined from the original database. The main problem is that many points would move from one cluster to another, jeopardizing the notion of similarity between data points in the global space and thereby introducing misclassification. Therefore, the perturbation has to be applied uniformly to all attributes so that the global distances between data points are preserved, or at most slightly modified.
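The following sketch makes this point concrete under simple assumptions of our own (toy 2D records and an arbitrary noise range): independently perturbing each value changes pairwise Euclidean distances, whereas translating all points by one common vector preserves them exactly:

```python
import math
import random

random.seed(42)

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Toy (age, salary-in-thousands) records; values are arbitrary.
points = [(25.0, 30.0), (27.0, 32.0), (52.0, 90.0)]

# Independent additive noise per value, as in classical statistical-database
# perturbation: pairwise distances change, so points may switch clusters.
noisy = [tuple(v + random.uniform(-5, 5) for v in p) for p in points]

# A uniform translation of every point by the same vector is an isometry:
# all pairwise distances are preserved exactly.
shift = (10.0, -7.0)
translated = [tuple(v + s for v, s in zip(p, shift)) for p in points]

for i, j in [(0, 1), (0, 2), (1, 2)]:
    print(f"pair {i}-{j}: original {euclidean(points[i], points[j]):7.2f}, "
          f"noisy {euclidean(noisy[i], noisy[j]):7.2f}, "
          f"translated {euclidean(translated[i], translated[j]):7.2f}")
```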
In this paper, we introduce a family of geometric data transformation methods (GDTMs) that distort confidential numerical attributes in order to meet privacy protection in clustering analysis. We draw on work in image processing, in particular on geometric transformations of digital images, notably the ideas behind translation, scaling, and rotation. We also draw on work in statistical databases, particularly the intuition behind data distortion. We show that our data transformation methods are simple, independent of the clustering algorithm, preserve the general features of the clusters, and have a sound mathematical foundation. Although our approach does not provide a comprehensive solution to the problem of privacy preservation in data mining, we argue that it is a simple building block toward privacy preserving data clustering. To date, such schemes have not been explored in detail.
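As a rough sketch of the geometric intuition only (the precise GDTM family is defined in Section 4; the function names and parameter values below are ours), the three basic transformations over pairs of confidential attributes can be expressed as follows:

```python
import math

def translate(points, dx, dy):
    """Shift every point by the same vector; distances are unchanged."""
    return [(x + dx, y + dy) for x, y in points]

def scale(points, sx, sy):
    """Multiply coordinates by constant factors; with sx == sy, all distances
    scale uniformly, so relative cluster structure is kept."""
    return [(x * sx, y * sy) for x, y in points]

def rotate(points, theta):
    """Rotate every point about the origin by angle theta (radians);
    a rigid motion, so distances are unchanged."""
    c, s = math.cos(theta), math.sin(theta)
    return [(x * c - y * s, x * s + y * c) for x, y in points]

# Toy pair of confidential attributes, e.g. (age, salary-in-thousands).
data = [(25.0, 30.0), (27.0, 32.0), (52.0, 90.0)]
print(translate(data, 15.0, -8.0))
print(scale(data, 2.0, 2.0))
print(rotate(data, math.pi / 6))
```

Translation and rotation are isometries, and uniform scaling changes all distances by one factor, which is why such transformations are natural candidates for hiding true values while keeping cluster structure intact.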
This paper is organized as follows. Related work is reviewed in Section 2. In Section 3, we provide the basic concepts that are necessary to understand the scope and the issues addressed in this paper. We introduce our family of geometric data transformation methods in Section 4. In Section 5, we present the experimental results and discussion. Finally, Section 6 presents our conclusions and a discussion of future work.
2. Related Work
Some effort has been made to address the problem of privacy preservation in data mining. This effort has so far been largely restricted to classification and association rules. The existing classes of solutions rely on data partitioning, data sanitization, randomization, and data distortion. In this work, we focus on the last two categories.
Estivill-Castro and Brankovic introduced a method for ensuring partial disclosure while allowing a miner to explore detailed data. In this approach, one first builds a local decision tree over the true data and then swaps values amongst the records in a leaf node of the tree to generate randomized training data. The swapping is performed over the confidential attribute only, where the confidential attribute is the class label. This approach involves a trade-off between statistical precision and security level: the closer the swapping is performed to the root, the higher the security but the lower the precision.
Agrawal and Srikant considered the case of building a decision-tree classifier from training data in which the values of individual records have been perturbed by adding random values drawn from a probability distribution. The resulting data records look very different from the original records, and the distribution of data values is also very different from the original distribution. While it is not possible to accurately estimate the original values in individual data records, they proposed a novel reconstruction procedure to accurately estimate the distribution of the original data values. The reconstruction process naturally leads to some loss of information, but the authors argue that this is acceptable in many practical situations.
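A minimal sketch of this style of value perturbation, assuming Gaussian noise (the function name, sigma, and sample values are our illustrative choices, not the authors' exact scheme):

```python
import random

random.seed(7)

def perturb(values, sigma=15.0):
    """Add independent Gaussian noise to each value: individual records become
    unreliable, while aggregate distributional information can be recovered."""
    return [v + random.gauss(0.0, sigma) for v in values]

ages = [23, 31, 45, 27, 62, 38, 54, 29, 41, 58]
noisy_ages = perturb(ages)
print([round(a, 1) for a in noisy_ages])
print(f"original mean: {sum(ages) / len(ages):.1f}, "
      f"perturbed mean: {sum(noisy_ages) / len(noisy_ages):.1f}")
```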
A follow-up work proposed a new algorithm for distribution reconstruction that is more effective, in terms of information loss, than the one described above. This algorithm, based on the Expectation Maximization (EM) algorithm, converges to the maximum likelihood estimate of the original distribution based on the perturbed data, provided a large amount of data is available. The authors also pointed out that the EM algorithm is in fact identical to the earlier Bayesian reconstruction procedure, except for an approximation that partitions values into intervals.
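To illustrate the EM idea, the sketch below reconstructs a discretized distribution from additively perturbed values under simplifying assumptions of our own (the domain is summarized by interval midpoints, and the noise is a known Gaussian); it is an illustration of the technique, not the authors' exact algorithm:

```python
import math
import random

random.seed(7)

def gauss_pdf(x, sigma):
    return math.exp(-x * x / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))

def em_reconstruct(perturbed, midpoints, sigma, iters=200):
    """Estimate interval probabilities of the original distribution from
    additively perturbed values w = x + noise, with noise ~ N(0, sigma^2).
    E-step: posterior over intervals for each value; M-step: average posteriors."""
    k = len(midpoints)
    theta = [1.0 / k] * k  # start from the uniform distribution
    for _ in range(iters):
        totals = [0.0] * k
        for w in perturbed:
            post = [theta[j] * gauss_pdf(w - midpoints[j], sigma) for j in range(k)]
            z = sum(post)
            for j in range(k):
                totals[j] += post[j] / z
        theta = [t / len(perturbed) for t in totals]
    return theta

# Original values cluster near 20 and 80; perturb them, then reconstruct.
original = [random.choice([20, 80]) + random.gauss(0, 3) for _ in range(500)]
perturbed = [x + random.gauss(0, 10) for x in original]
midpoints = [10, 30, 50, 70, 90]  # interval representatives
print([round(t, 2) for t in em_reconstruct(perturbed, midpoints, sigma=10)])
```

Most of the estimated mass ends up in the intervals surrounding 20 and 80, even though no individual perturbed value can be trusted.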
Evfimievski et al. proposed a framework for mining association rules from transactions consisting of categorical items, in which the data has been randomized to preserve the privacy of individual transactions. The idea behind this approach is that some items in each transaction are replaced by new items not originally present in that transaction. In doing so, some true information is taken away and some false information is introduced, which yields a reasonable level of privacy protection. In general, this strategy makes it feasible to recover association rules (albeit with lower support than in the original data) while preserving privacy through a straightforward uniform randomization. Although privacy is preserved on average, confidential information can still leak through uniform randomization for some fraction of the transactions.
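A minimal sketch of uniform item randomization, assuming a fixed universe of item identifiers (ITEM_UNIVERSE, the retention probability p, and all values below are illustrative, not from the cited framework):

```python
import random

random.seed(1)

ITEM_UNIVERSE = list(range(100))  # hypothetical catalogue of item ids

def uniform_randomize(transaction, p=0.8):
    """Keep each item with probability p; otherwise replace it with a random
    item not originally present in the transaction."""
    outside = [i for i in ITEM_UNIVERSE if i not in set(transaction)]
    result = set()
    for item in transaction:
        if random.random() < p:
            result.add(item)           # true information kept
        else:
            result.add(random.choice(outside))  # false information introduced
    return sorted(result)

print(uniform_randomize([3, 17, 42, 99]))
```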
More recently, the data distortion approach has been applied to boolean association rules. Again, the idea is to modify data values such that reconstructing the values of any individual transaction is difficult, while the rules learned on the distorted data remain valid. One interesting feature of this work is its flexible definition of privacy; for instance, the ability to correctly guess a value of '1' from the distorted data can be considered a greater threat to privacy than correctly learning a '0'. This scheme is based on probabilistic distortion of user data, combining a privacy metric with an analytical formula. Although the framework provides a high degree of privacy to the user and retains a high level of accuracy in the mining results, mining the distorted database can be, apart from being error-prone, significantly more expensive in terms of both time and space than mining the original database.
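The following sketch illustrates this kind of probabilistic distortion for a single boolean attribute, in the spirit of the scheme discussed above (the parameter p and the single-item support estimator are our simplified rendering, not the exact formulas of that work):

```python
import random

random.seed(3)

def distort(bits, p=0.9):
    """Keep each boolean entry with probability p, flip it with probability 1-p."""
    return [b if random.random() < p else 1 - b for b in bits]

def estimate_support(distorted, p=0.9):
    """If s is the true fraction of 1s, the observed fraction is
    s*p + (1-s)*(1-p); solving for s gives an unbiased estimate."""
    observed = sum(distorted) / len(distorted)
    return (observed - (1 - p)) / (2 * p - 1)

column = [1] * 300 + [0] * 700  # true support of this item: 0.30
noisy = distort(column)
print(round(estimate_support(noisy), 3))
```

The estimator recovers the aggregate support well, yet any single distorted bit reveals the original value only with bounded confidence, which is the source of the privacy guarantee.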
The work presented here differs from the related work in several respects. First, we aim to address the problem of privacy preservation in clustering analysis, a problem that, to the best of our knowledge, has not been considered so far; our proposed solution and those in the related work are complementary. Second, we study the impact of our data transformation schemes on the original database by quantifying how much information is preserved after transforming the data. Thus, our focus is not only on protecting individual data records, but also on providing accurate data for clustering analysis.