QDex: A database profiler for generic bio-data exploration and quality aware integration

F. Moussouni (1), L. Berti-Équille (2), G. Rozé (1), O. Loréal, E. Guérin (1)

(1) INSERM U522 CHU Pontchaillou, 35033 Rennes, France

(2) IRISA, Campus Universitaire de Beaulieu, 35042 Rennes, France

Corresponding author : fouzia.moussouni@univ-rennes.fr

Abstract: In human health and life sciences, researchers extensively collaborate with each other, sharing genomic, biomedical and experimental results. This necessitates dynamically integrating different databases into a single repository or a warehouse. The data integrated in these warehouses are extracted from various heterogeneous sources, having different degrees of quality and trust. Most of the time, they are neither rigorously chosen nor carefully controlled for data quality. Data preparation and data quality metadata are recommended but still insufficiently exploited for ensuring quality and validating the results of information retrieval or data mining techniques.

In previous work, we built a data warehouse called GEDAW (Gene Expression Data Warehouse) that stores several kinds of information: data on genes expressed in the liver during iron overload and liver diseases, relevant information from public databanks (mostly in XML), in-house DNA-chip experiments, and medical records.

Based on our past experience, this paper reports briefly on the lessons learned from biomedical data integration and data quality issues, and the solutions we propose to the numerous problems of schema evolution of both data sources and warehousing system. In this context, we present QDex, a Quality driven bio-Data Exploration tool, which provides a functional and modular architecture for database profiling and exploration, enabling users to set up query workflows and take advantage of data quality profiling metadata before the complex processes of data integration in the warehouse. An illustration with QDex Tool is shown afterwards.

Keywords: warehousing, metadata, bio-data integration, database profiling, bioinformatics, data quality

1. INTRODUCTION

In the context of modern life science, integrating resources is very challenging, mainly because biological objects are complex and spread across highly autonomous and evolving web resources.

Biomedical web resources are extremely heterogeneous: they contain different kinds of data, have different structures and use different vocabularies to name the same biological entities. Their information and knowledge contents are also partial and erroneous, and they change continually.

In spite of these barriers, bioinformatics is witnessing an explosion of data integration approaches that help biomedical researchers interpret their results and test and generate new hypotheses. In high throughput biotechnologies, data warehouse solutions have met with great success in recent decades, owing to the constant need to store in-house data locally and to compare and enrich them with web information for multiple kinds of analysis.

A large number of data warehouse projects devoted to bioinformatics studies now exist in the literature. These warehouses integrate data from various heterogeneous sources, having different degrees of quality and trust. Most of the time, the data are neither rigorously chosen nor carefully controlled for data quality. Data preparation and data quality metadata are recommended but still insufficiently exploited for ensuring quality and validating the results of information retrieval or data mining techniques [17]. Moreover, data are physically imported and transformed to match the warehouse schema, which tends to change rapidly with user requirements, as is typical in bioinformatics. In the case of materialised integration, data model modifications that add new concepts in response to the rapidly evolving needs of biologists lead to considerable updates of the warehouse schemas and their applications, complicating warehouse maintenance.

Lessons learned from the problems of biomedical data sources integration and warehouse schema evolution are presented in this paper. The main data quality issues in this context with current solutions for warehousing and exploring biomedical data are shown [1,2]. An illustration is given using QDex, a Quality driven bio-Data Exploration tool that: i) provides a generic functional and modular architecture for database quality profiling and exploration, ii) takes advantage of data quality profiling metadata during the process of biomedical data integration in the warehouse and, iii) enables users to set up query workflows, store intermediate results or quality profiles, and refine their queries.
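To make the notion of data quality profiling metadata concrete, the following is a minimal sketch of profiling one column of a small table. It is not QDex code: the function `profile_column`, the metadata fields (`completeness`, `distinct`, `duplicated_values`) and the toy gene records are all assumptions chosen for illustration.

```python
from collections import Counter


def profile_column(rows, column):
    """Return simple quality-profile metadata for one column.

    rows is a list of dicts (one per record). A missing key, None or an
    empty string all count as a missing value.
    """
    values = [row.get(column) for row in rows]
    present = [v for v in values if v not in (None, "")]
    counts = Counter(present)
    total = len(values)
    return {
        "column": column,
        "rows": total,
        # Share of records that actually carry a value for this column.
        "completeness": len(present) / total if total else 0.0,
        "distinct": len(counts),
        # Values appearing more than once: candidates for redundancy checks.
        "duplicated_values": sorted(v for v, n in counts.items() if n > 1),
    }


# Toy records, invented for this sketch.
genes = [
    {"gene_id": "G1", "name": "HFE", "organism": "Homo sapiens"},
    {"gene_id": "G2", "name": "", "organism": "Homo sapiens"},
    {"gene_id": "G3", "name": "HFE", "organism": None},
    {"gene_id": "G4", "name": "TFR2"},
]

name_profile = profile_column(genes, "name")
organism_profile = profile_column(genes, "organism")
```

A profile of this kind can be stored next to the data and consulted before integration, for instance to decide whether a source column is complete enough to be loaded into the warehouse.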

This paper is structured as follows. Section 2 presents requirement analyses in bioinformatics and the limits of current data warehousing techniques with regard to data quality profiling, in the perspective of related work. Section 3 illustrates our experience in building a gene expression data warehousing system: system design, data curation, cleansing, analyses, and new insight on schema evolution. Section 4 presents the QDex architecture and the functionalities that remedy some of these limits by providing database quality profiling and extraction of quality metadata. Section 6 concludes the paper.


2.1 Data integration issues at the structural level

High throughput biotechnologies, such as transcriptome analysis, generate thousands of gene expression levels measured in different physiopathological situations. Beyond management, normalization and clustering, biologists need to give a biological, molecular and medical meaning to these raw data. Expression levels need to be enriched with the multitude of publicly available data on expressed genes: nucleic sequences, chromosomal and cellular locations, biological processes, molecular function, associated pathologies, and associated pathways. Relevant information on genes must be integrated from public databanks and warehoused locally to support multiple kinds of analysis and data mining.
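The enrichment step described above amounts to joining in-house measurements with public annotations on a gene identifier. The sketch below shows that join on toy data; the function `enrich`, the field names and the identifiers are hypothetical and do not reflect any particular databank format.

```python
def enrich(expression_levels, annotations):
    """Attach annotation fields to each expression measurement by gene id.

    Measurements whose gene has no annotation are kept and flagged with
    annotated=False, so that gaps in the public data stay visible.
    """
    enriched = []
    for measurement in expression_levels:
        extra = annotations.get(measurement["gene_id"])
        record = dict(measurement)  # copy, so the input is not modified
        record["annotated"] = extra is not None
        record.update(extra or {})
        enriched.append(record)
    return enriched


# Toy in-house measurements and public annotations, invented for this sketch.
expression_levels = [
    {"gene_id": "G1", "condition": "iron_overload", "level": 2.4},
    {"gene_id": "G2", "condition": "iron_overload", "level": 0.7},
]
annotations = {
    "G1": {"chromosome": "6", "process": "iron ion homeostasis"},
}

enriched = enrich(expression_levels, annotations)
```

Keeping unannotated measurements, rather than dropping them as an inner join would, is a design choice of this sketch: it avoids silently losing experimental data when a public source is incomplete.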

In the context of biological data warehouses, a survey of representative data integration systems is given in [8]. Current solutions are mostly based on a data warehouse architecture (e.g., GIMS, DataFoundry) or on a federation approach with physical or virtual integration of data sources (e.g., TAMBIS, P/FDM, DiscoveryLink); these rely on the union of the local schemas, which have to be transformed into a uniform schema. In [3], Do and Rahm proposed a system called GenMapper for integrating biological and molecular annotations based on the semantic knowledge represented in cross-references. Finally, BioMart [18], a query-oriented data integration system that can be applied to a single database or to multiple databases, is a heavily used data warehouse system in bioinformatics, since it supports large scale querying of individual databases as well as query-chaining between them.

Major problems in the context of biomedical data integration come from heterogeneity, strong autonomy and rapid evolution of the data sources on the Web. A data warehouse is relevant as long as it adapts its structure, schemas and applications to the constantly growing knowledge on the bio-Web.

GIMS, http://www.cs.man.ac.uk/img/gims/
DataFoundry, http://www.llnl.gov/CASC/datafoundry/
TAMBIS, http://imgproj.cs.man.ac.uk/tambis/
P/FDM, http://www.csd.abdn.ac.uk/~gjlk/mediator/
DiscoveryLink, http://www.research.ibm.com/journal/sj/402/haas.html

2.2 Bio-data quality issues at the instance level

Recent advances in biotechnology have produced massive amounts of raw biological data, which are accumulating at an exponential rate. Errors, redundancy and discrepancies are prevalent in the raw data, and there is a serious need for systematic approaches to biological data cleaning. Biological databank providers will not directly support data quality evaluations to the same degree, since they are not equally motivated to do so, and there are currently no standards for evaluating and comparing biomedical data quality. Little work has been done on biological data cleaning, and it is usually carried out in a proprietary or ad hoc manner, sometimes even manually. Systematic processes are lacking. Among the few examples, Thanaraj [14] uses stringent selection criteria to select 310 complete and unique records of Homo sapiens splice sites from the 4300 raw records in the EMBL database.
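Selecting complete and unique records can be sketched as a simple filter. The code below is only an illustration of the general idea and does not reproduce the selection criteria of [14]; the function `select_complete_unique`, the field names and the sample records are assumptions.

```python
def select_complete_unique(records, required_fields, key_fields):
    """Keep records that have every required field and a not-yet-seen key.

    key_fields should be a subset of required_fields, so that every
    record reaching the uniqueness test has a full key.
    """
    seen = set()
    selected = []
    for record in records:
        if any(not record.get(field) for field in required_fields):
            continue  # incomplete record: a required value is missing or empty
        key = tuple(record[field] for field in key_fields)
        if key in seen:
            continue  # redundant record: same key as an earlier one
        seen.add(key)
        selected.append(record)
    return selected


# Toy records, invented for this sketch.
raw = [
    {"accession": "A1", "sequence": "GTAAGT", "organism": "Homo sapiens"},
    {"accession": "A2", "sequence": "GTAAGT", "organism": "Homo sapiens"},
    {"accession": "A3", "sequence": "", "organism": "Homo sapiens"},
    {"accession": "A4", "sequence": "GTGAGT", "organism": "Homo sapiens"},
    {"accession": "A5", "sequence": "GTAAGT", "organism": "Mus musculus"},
]

clean = select_complete_unique(
    raw,
    required_fields=["accession", "sequence", "organism"],
    key_fields=["sequence", "organism"],
)
kept = [record["accession"] for record in clean]
```

Here the uniqueness key combines sequence and organism, so record A5 is kept even though its sequence matches A1. Which fields form the key is a modelling decision that depends on how an entity is defined for the application.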

Moreover, bio-entity identification is a complex problem in the biomedical domain, since the meaning of “entity” cannot be defined precisely. In most applications, identical sequences of two genes in different organisms, or even in different organs of the same organism, are not treated as a single object, since they can behave differently. In the GENBANK data source, for example, each sequence is treated as an entity in its own right, since it was derived using a particular technique, has a particular annotation, and could have individual errors.

Müller et al. [11] examined the production process of genome data and identified common types of data errors. Mining for patterns in contradictory biomedical data has been proposed in [10], but data quality evaluation techniques are needed for structured, semi-structured or textual data before any biomedical mining application. Although rigorous elimination of data is effective in removing redundancy, it may result in the loss of critical information. In another example, a sequence structure parser is used to find missing or inconsistent features in records using the constraints of gene structure [12]. This method is limited to detecting violations of the gene structure.

More specifically on data quality scoring in the biomedical context, the authors of [9] propose to extend the semi-structured model with useful quality measures that are biologically relevant, objective (i.e., with no ambiguous interpretation when assessing the value of the quality measure), and easy to compute. Six criteria are defined and stored as quality metadata for each record (XML file) of the RefSeq genomic databank: stability (i.e., magnitude of changes applied to a record), density (i.e., number of attributes and values describing a data item), time since last update, redundancy (i.e., fraction of redundant information contained in a data item and its sub-items), correctness (i.e., degree of confidence that the data represents true information), and usefulness (i.e., utility of a data item defined as a function combining density, correctness, and redundancy). The authors also propose algorithms for updating the scores of quality measures when navigating, inserting or updating/deleting a node in the semi-structured record.
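The text above gives only informal definitions of these measures, so the following sketch fixes one possible reading of three of them on a nested record. The formulas are assumptions made for illustration and are not those of [9]: density is taken as the number of leaf values, redundancy as the fraction of leaf values repeating an earlier one, and usefulness as a hypothetical product of normalised density, correctness and non-redundancy.

```python
from datetime import date


def leaf_values(item):
    """Yield every leaf value of a nested dict/list record."""
    if isinstance(item, dict):
        for value in item.values():
            yield from leaf_values(value)
    elif isinstance(item, list):
        for value in item:
            yield from leaf_values(value)
    else:
        yield item


def density(item):
    """Number of leaf values describing the item (assumed reading)."""
    return sum(1 for _ in leaf_values(item))


def redundancy(item):
    """Fraction of leaf values that repeat an earlier leaf value."""
    values = list(leaf_values(item))
    if not values:
        return 0.0
    return 1 - len(set(values)) / len(values)


def days_since_update(last_update, today):
    """Time since last update, in days."""
    return (today - last_update).days


def usefulness(item, correctness, max_density):
    """Hypothetical combination of density, correctness and redundancy."""
    normalised_density = min(density(item) / max_density, 1.0)
    return normalised_density * correctness * (1 - redundancy(item))


# Toy record, invented for this sketch; the symbol appears twice.
record = {
    "accession": "X1",
    "gene": {"symbol": "HFE", "synonyms": ["HFE", "HLA-H"]},
    "organism": "Homo sapiens",
}

record_density = density(record)
record_redundancy = redundancy(record)
record_usefulness = usefulness(record, correctness=0.9, max_density=10)
record_age = days_since_update(date(2024, 1, 1), date(2024, 1, 31))
```

Scores computed this way could be stored as metadata next to each record and recomputed when a node is inserted, updated or deleted, in the spirit of the update algorithms mentioned above.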


3.1 Database design, data integration, and application-driven workflow

The Gene Expression Data Warehouse GEDAW [5] has been developed by the French National Institute of Health and Medical Research (INSERM U522) to warehouse data on genes expressed in the liver during iron overload and liver pathologies. To interpret gene expression measurements in different physiopathological situations in the liver, relevant information from public databanks (mostly in XML format), micro-array data, in-house DNA-chip experiments and medical records are integrated, stored and managed in GEDAW.

GEDAW aims at studying liver pathologies in silico by enriching the expression levels of genes with data extracted from a variety of scientific data sources, ontologies and standards in life science and medicine, including the Gene Ontology (GO) [6] and UMLS [7].

Designing a single global data warehouse schema (Fig 1) that syntactically and semantically integrates all the heterogeneous life science data sources is still challenging. In the GEDAW context, we integrate structured and semi-structured data sources using a Global As View (GAV) schema mapping approach and a rule-based transformation process from a source schema to the global schema of the data warehouse (see [4] for details).


Fig 1: UML Class diagram representing the conceptual schema of GEDAW and some correspondences with the GENBANK DTD (e.g., Seqdes_title and Molinfo values will be extracted and migrated to the name and other description attributes of the class Gene in GEDAW schema).
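The correspondence mentioned in the caption can be sketched as a small set of mapping rules applied to a source record. This is a simplified illustration, not the GEDAW transformation process described in [4]: the element names `Seqdes_title` and `Molinfo` and the target attribute `name` come from the caption, whereas the target attribute `description`, the `Entry` wrapper element, the function `to_gene` and the sample record are assumptions.

```python
import xml.etree.ElementTree as ET

# Each rule maps a source element (by tag name) to an attribute of the
# global class Gene. The "description" target is an assumed attribute name.
GENE_RULES = [
    ("Seqdes_title", "name"),
    ("Molinfo", "description"),
]


def to_gene(source_xml, rules=GENE_RULES):
    """Build a Gene-like dict by applying each mapping rule to the source."""
    root = ET.fromstring(source_xml)
    gene = {}
    for source_tag, target_attribute in rules:
        element = root.find(f".//{source_tag}")
        if element is not None and element.text:
            gene[target_attribute] = element.text.strip()
        else:
            gene[target_attribute] = None  # source value absent
    return gene


# Hypothetical source record; the real GENBANK layout is more deeply nested.
sample = """
<Entry>
  <Seqdes_title>Homo sapiens hemochromatosis (HFE) mRNA</Seqdes_title>
  <Molinfo>mRNA</Molinfo>
</Entry>
""".strip()

gene = to_gene(sample)
```

Keeping the rules in a data structure separate from the code that applies them means that a change in a source schema only requires editing the rule list, which is one way to limit the maintenance cost of schema evolution.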

With the overall integrated knowledge, the warehouse has provided an excellent analysis framework where enriched experimental data can be mined through various workflows combining successive analysis steps.
