«QDex : A database profiler for generic bio-data exploration and quality aware integration F. Moussouni1, L. Berti-Équille 2, G. Rozé1, O. Loréal, ...»

GEDAW supports several functions that perform on-demand analyses on a group of genes of interest, obtained from a database selection query with one or more criteria. These analyses correspond to APIs that use OQL (Object Query Language) and Java to retrieve multiple information items about the genes. Some external analyses, corresponding to external bioinformatics tools such as clustering, have also been applied to subsets of the integrated gene data. These two kinds of analyses have been combined to connect successive steps, thus forming a workflow.

One such workflow (Fig 2) has been designed according to the hypothesis that genes sharing an expression pattern should be associated. The strategy consists in selecting a group of genes that are associated with the same disease and a typical expression pattern (steps 1 and 2 in Fig 2), and then extending this group to more genes involved in the disease (step 5) by searching for expression pattern similarity (step 4). The genes are then characterized by studying the biological processes and the cellular components using integrated GO annotations (step 6).
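Steps 4 and 5 of this strategy can be illustrated with a minimal sketch. It assumes that expression patterns are compared with Pearson correlation and a fixed threshold; the paper does not state which similarity measure GEDAW uses, so the measure, the function names `pearson` and `extend_gene_group`, the threshold and the expression values are all hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a flat profile carries no pattern to compare
    return cov / (sx * sy)

def extend_gene_group(seed, profiles, threshold=0.9):
    """Return genes outside `seed` whose profile correlates with any seed gene."""
    found = set()
    for gene, profile in profiles.items():
        if gene in seed:
            continue
        if any(pearson(profile, profiles[s]) >= threshold for s in seed):
            found.add(gene)
    return found

# Hypothetical expression levels across four conditions (illustrative only).
profiles = {
    "HFE":   [1.0, 2.0, 3.0, 4.0],
    "geneA": [2.1, 4.0, 6.2, 8.1],  # same trend as the seed gene
    "geneB": [4.0, 3.0, 2.0, 1.0],  # opposite trend
}
candidates = extend_gene_group({"HFE"}, profiles)
```

Here `candidates` holds the genes proposed for addition to the seed group; in the workflow they would still go through the characterization of step 6 and expert review.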

Fig 2: Combining Biomedical Information within an Expert Guided Workflow

This expert-guided example has been used to extract new knowledge in the form of new gene associations with hepatic disorders [5]. The genes found are now being investigated biologically by the expert to better understand their involvement in the disease.

The requirements of biologists and their associated workflows have since been evolving rapidly, as new complex data types such as protein structures, gene interactions and metabolic pathways keep emerging on the Web. This calls for continuous evolution of the warehouse schema, contents and applications.

3.2 Bio-entity identification

By using a GAV mapping approach to integrate one data source at a time into GEDAW (e.g. Fig 1 with GENBANK), we have minimized as much as possible the problem of identifying equivalent attributes. The problem of identifying equivalent instances remains complex to address. This is due to the general redundancy of bio-entities in life science, even within a single source. Biological databanks may also hold inconsistent values in equivalent attributes of records referring to the same real-world object. For example, there are more than 10 record IDs for the same DNA segment associated with the human HFE gene in GENBANK. The same segment may indeed be recorded as a clone, a marker or a genomic sequence.

Anyone can submit biological information to public databanks, through more or less formalized submission protocols that usually include neither name standardization nor data quality controls. Erroneous data may easily be entered and cross-referenced. Even if some tools (like EntryGene for GENBANK) propose clusters of semantically related records that identify the same biological concept across different biological databanks, biologists must still validate the correctness of these clusters and resolve the differences of interpretation among the records.
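As an illustration of why such clusters still need expert validation, the following minimal sketch proposes candidate clusters by grouping records on a normalized gene symbol. This is a deliberately naive stand-in, not the method used in GEDAW; the record layout, the field names `id` and `gene_symbol`, and the helper names are hypothetical.

```python
from collections import defaultdict

def normalize(symbol):
    """Upper-case a gene symbol and drop non-alphanumeric characters."""
    return "".join(ch for ch in symbol.upper() if ch.isalnum())

def candidate_clusters(records):
    """Group record ids sharing a normalized symbol; keep groups of two or more."""
    clusters = defaultdict(list)
    for rec in records:
        clusters[normalize(rec["gene_symbol"])].append(rec["id"])
    return {key: ids for key, ids in clusters.items() if len(ids) > 1}

# Hypothetical records, as they might arrive from different submissions.
records = [
    {"id": "r1", "gene_symbol": "HFE"},
    {"id": "r2", "gene_symbol": " hfe "},
    {"id": "r3", "gene_symbol": "geneX"},
]
clusters = candidate_clusters(records)
```

A match on the symbol alone says nothing about whether the records describe a clone, a marker or a genomic sequence, so each proposed cluster is only a starting point for the biologist.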

This is a typical problem of entity resolution and record linkage, made more complex by the high level of expertise and knowledge it requires (i.e., knowledge that is difficult to formalize and related to many different sub-disciplines of biology, chemistry, pharmacology, and medical sciences). After the bio-entity resolution step, data are scrubbed and transformed to fit the global DW schema with the appropriate standardized format for values, so that the data meet all the validation rules decided upon by the warehouse designer.

Problems that can arise during this step include null or missing data, violations of data type, non-uniform value formats, and invalid data. The process of data cleansing and scrubbing is rule-based. Then, data are migrated, physically integrated and imported into the data warehouse.

During and after data cleansing and migration, quality metadata are computed or updated in the data warehouse metadata repository by pre- and post- data validation programs.
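The rule-based cleansing step described above can be sketched as a table of per-field validation rules. The field names, the rules themselves and the simplified accession pattern are assumptions made for illustration; the actual validation rules are those chosen by the warehouse designer and are not listed in the paper.

```python
import re

# Hypothetical validation rules: each field name maps to a predicate.
RULES = {
    "gene_name": lambda v: isinstance(v, str) and v.strip() != "",
    # Simplified, illustrative pattern: 1-2 capital letters then 5-6 digits.
    "accession": lambda v: isinstance(v, str)
    and re.fullmatch(r"[A-Z]{1,2}\d{5,6}", v) is not None,
    "expression_level": lambda v: isinstance(v, (int, float)) and v >= 0,
}

def validate(record):
    """Return a list of (field, problem) pairs found in one record."""
    problems = []
    for field, rule in RULES.items():
        value = record.get(field)
        if value is None:
            problems.append((field, "missing"))
        elif not rule(value):
            problems.append((field, "invalid"))
    return problems

# Hypothetical records: one clean, one with several problems.
clean = {"gene_name": "HFE", "accession": "AB123456", "expression_level": 2.5}
dirty = {"gene_name": "", "accession": "12345", "expression_level": None}
clean_problems = validate(clean)
dirty_problems = validate(dirty)
```

The problem lists produced per record could feed the quality metadata computed by the pre- and post-validation programs, for example as counts of missing or invalid values per attribute.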



4.1 Database Profiling

Database profiling is the process of analyzing a database to determine its structure and internal relationships. It consists mainly of identifying: i) the objects used and their attributes (contents and number), ii) the relationships between objects, with their different kinds of associations including aggregation and inheritance, and iii) the objects' behaviour and their related functions. Database profiling is therefore useful when managing data conversion and data cleanup projects.

Fig 3: The XMI document generated from GEDAW schema

The XMI (XML Metadata Interchange) document (see Fig 3), which collects metadata on the objects of the database, is generated on demand for profiling GEDAW. It has been quite useful for coping with the syntactic heterogeneity of the evolving schemas of the data sources and of the warehousing system during its life cycle. Generic data exploration has been made possible by the development of the QDex tools (Quality based Database Exploration), which parse the XMI document detailing the database structure (in terms of classes, attributes, relationships, etc.) and generate a model-based interface to explore the multiple attributes of the gene descriptions stored in the warehouse.
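The parsing step can be sketched as follows, assuming a UML 1.x style XMI layout in which classes appear as `Class` elements with nested `Attribute` elements. The sample fragment, its namespace URI and its attribute names are hypothetical; the real GEDAW XMI document may be organized differently, and this is not the QDex implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical XMI fragment; only the class names echo the GEDAW schema.
SAMPLE_XMI = """<XMI xmlns:UML="org.omg.xmi.namespace.UML">
  <XMI.content>
    <UML:Class name="Gene">
      <UML:Classifier.feature>
        <UML:Attribute name="name"/>
        <UML:Attribute name="sequence"/>
      </UML:Classifier.feature>
    </UML:Class>
    <UML:Class name="GOAnnotation">
      <UML:Classifier.feature>
        <UML:Attribute name="term"/>
      </UML:Classifier.feature>
    </UML:Class>
  </XMI.content>
</XMI>"""

def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]

def profile_schema(xmi_text):
    """Map each class name to the list of its attribute names."""
    root = ET.fromstring(xmi_text)
    schema = {}
    for elem in root.iter():
        if local_name(elem.tag) == "Class" and "name" in elem.attrib:
            schema[elem.get("name")] = [
                child.get("name")
                for child in elem.iter()
                if local_name(child.tag) == "Attribute"
            ]
    return schema

schema = profile_schema(SAMPLE_XMI)
```

Matching on local tag names keeps the sketch independent of the exact namespace URI, which differs between XMI versions. A dictionary of this kind is enough to drive a generic schema viewer or query form.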

4.2 Data Quality Profiling

Data quality profiling is the process of analyzing a database to identify and prioritize data quality problems. The results include simple summaries (counts, averages, percentages, etc.) describing, for instance, the completeness of datasets and the number of missing data records, the data freshness, and various data problems in existing records (e.g., outliers, duplicates, redundancies). During data profiling, the data available in the existing database are examined, and statistics are computed and gathered to track summaries describing different aspects of data quality. Consequently, by providing data profiling tools, QDex also provides data quality profiling tools.
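Two of the summaries mentioned above, duplicates and freshness, can be computed with very little code. The sketch below assumes that records are dictionaries with a key attribute and a last-update date; the field names `gene` and `last_updated`, the function names and the sample values are hypothetical.

```python
from collections import Counter
from datetime import date

def duplicate_count(records, key):
    """Count records whose key value already occurs in another record."""
    counts = Counter(rec[key] for rec in records)
    return sum(c - 1 for c in counts.values() if c > 1)

def mean_age_days(records, today, date_field="last_updated"):
    """Average record age in days, used as a rough freshness indicator."""
    ages = [(today - rec[date_field]).days for rec in records]
    return sum(ages) / len(ages)

# Hypothetical records with illustrative dates.
records = [
    {"gene": "HFE", "last_updated": date(2006, 1, 10)},
    {"gene": "HFE", "last_updated": date(2006, 1, 20)},
    {"gene": "geneX", "last_updated": date(2006, 1, 30)},
]
n_duplicates = duplicate_count(records, "gene")
avg_age = mean_age_days(records, today=date(2006, 1, 30))
```

Passing `today` explicitly keeps the indicator reproducible; `mean_age_days` assumes a non-empty record list.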

A considerable amount of data quality research involves investigating and describing various categories of desirable attributes (or dimensions) of data quality. The resulting lists commonly include accuracy, consistency, completeness, unicity (i.e., no duplicates), and freshness. Nearly 200 such terms have been identified in [15,16], covering the nature, definitions and measures of these attributes.

Contradictory or ambiguous data are also a crucial problem, especially in bioinformatics, where data remain speculative. Centralizing data in a warehouse is one of the initiatives one can take to ensure data validity.

Taking advantage of the metadata stored in the XMI document obtained by database profiling, QDex provides generic tools for bio-database exploration and data quality profiling. We developed QDex in the belief that profiling databases (considering the structure of both the data sources and the data warehouse) can be very useful for the integration process.

Moreover, our work examines the extent of biological database profiling and proposes a way of flexibly building query workflows that follow the reasoning of biologists and assist them in elaborating their pioneering queries, including queries for tracking data quality.


5.1 Generic bio-data exploration

A global overview of the QDex interface is given in Fig 4. Parts of the workflow that was used to combine biological and medical knowledge in order to extract new knowledge on liver genes have been flexibly reformulated using the QDex GUI. The screenshot below shows the central database of GEDAW as profiled using the XMI metadata document, which gives an insight into the current warehouse schema. An overview of the extracted database profile (classes and attributes) is browsed in the Database Schema Viewer frame. This includes the Gene, mRNA, ExpressionLevels, GOAnnotation and UMLSAnnotation classes. Based on these classes, the user builds query scenarios on the objects, using his or her own criteria, in the Query Maker frame.

Fig 4: Global Overview of QDex interface

By having an immediate glance at the intermediate or final results browsed in the Result Viewer, the user may modify and re-execute queries when needed. The user can also save a workflow for later reuse on different data, or export the resulting data for future use in external tools (clustering, spreadsheet, etc.). This interface makes QDex quite flexible and attractive for the biologist.

To construct the Liver Disease Associated Genes Group, the genes of the array that are annotated with the “liver disease” concept and its descendants in UMLS are selected using the UMLSAnnotation class (see Fig 4). The corresponding mRNA or gene names are browsed in the Result Viewer sub-frame. Using the Query Maker, the selected objects are refined through successive queries on the group, adding boxes on demand to look for information on their sequences, their expression levels, and their Gene Ontology annotations, which makes the workflow more exhaustive.

5.2 Preliminary tools for bio-data quality track

The completeness dimension of the result of a query workflow is computed by counting the number of missing values of the queried objects (see Fig 5). Beyond this, QDex offers the user many more possibilities for composing various workflows on the integrated objects in GEDAW.
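Counting missing values in a query result can be sketched as below. The sketch assumes that a result is a list of dictionaries and that a value counts as missing when it is absent, `None` or an empty string; the function name, the attribute names and the sample rows are hypothetical and do not reproduce the QDex implementation.

```python
def completeness(rows, attributes):
    """Count missing values over the given attributes and derive a ratio."""
    total = len(rows) * len(attributes)
    missing = sum(
        1
        for row in rows
        for attr in attributes
        if row.get(attr) in (None, "")
    )
    # An empty result has nothing missing; report it as fully complete.
    ratio = 1.0 if total == 0 else 1 - missing / total
    return {"missing": missing, "total": total, "ratio": ratio}

# Hypothetical query result: one of four values is missing.
rows = [
    {"gene": "HFE", "go_term": "iron ion homeostasis"},
    {"gene": "geneX", "go_term": None},
]
report = completeness(rows, ["gene", "go_term"])
```

Reporting both the raw count and the ratio lets the user compare results of different sizes.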

Fig 5: Preliminary tools for tracking completeness of biomedical data

The user can associate indicators with the datasets or query results by specifying various useful metrics that describe aspects of database or query result quality. As the QDex project is still in progress, more tools will be provided to the user for evaluating the quality of the data being explored, including redundancy, freshness, and inconsistency (by checking user-defined or statistical constraints).


In this paper, we have presented a database profiling approach for designing a generic biomedical database exploration tool devoted to quality aware data integration and exploration.

QDex has been applied to GEDAW, an object-oriented data warehouse devoted to the study of high-throughput gene expression levels in the domain of hepatology. Metadata extracted from the XMI document of GEDAW have been used to provide a generic interface that supports tools for the user-friendly building of query workflows using multiple profiled attributes of the genes, together with preliminary tools for data quality tracking. QDex assumes that the data have already been integrated. Using QDex, the user can obtain a clearer view of the database content and quality. As mentioned, QDex is under ongoing development, and our perspectives are to keep taking advantage of the extracted metadata and to provide more tools (such as a quality metric library), to be gradually integrated into the interface in order to evaluate the quality of the data being explored. Our main objective is to cover the main data quality dimensions by providing predefined analytical functions whose results (as computed indicators) will describe various aspects of consistency, accuracy, unicity, and freshness of data. Another important aspect of our future work is linked to the detection of data quality problems and concerns the design of pragmatic tools to help the expert cleanse erroneous (or low-quality) data within the QDex interface.

Finally, the original advantage of QDex resides in the fact that it can be generalized to any database schema outside bioinformatics. More specifically, we intend to apply QDex to the forthcoming version of GEDAW, which is being upgraded to store more current bioinformatics data, such as graph structures for gene pathways and systems biology studies of gene expression profiles on the scale of a pangenomic DNA chip.


1. Ananthakrishna, R., Chaudhuri, S., Ganti, V., Eliminating Fuzzy Duplicates in Data Warehouses, Proc. of Intl. Conf. VLDB, 2002.

2. Batini, C., Catarci, T., Scannapieco, M., A Survey of Data Quality Issues in Cooperative Information Systems, Tutorial presented at the Intl. Conf. on Conceptual Modeling (ER), 2004.
