Structured metadata built on controlled vocabularies plays an important role in modern medicine: it enables efficient literature searches and aids the distribution and exchange of medical knowledge.

Until recently, many bioinformatics databases used proprietary access protocols, data formats, and naming conventions for objects and their relationships. Users had to work with a different tool for each site, and each tool gave them only partial information about the topic of interest. Today, RDF-compatible LSIDs (Life Science Identifiers), served by SOAP-enabled servers running the LSID protocol, have become the standard in most public bioinformatics databases, which provide RDF data either directly or through proxy LSID authorities. Providers of such data include GenBank, Gene Ontology, and PDB [Cowan02]. This allows seamless integration of sources in projects such as BioHaystack [Karger04].

One example of an LSID is urn:lsid:ncbi.nlm.nih.gov.lsid.i3c.org:genbank:nm_001240
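The structure of such an identifier can be made explicit with a small parser. The sketch below assumes the usual LSID layout of colon-separated fields (urn, lsid, authority, namespace, object, and an optional revision); the names Lsid and parse_lsid are illustrative and do not belong to any LSID toolkit.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(urn: str) -> Lsid:
    """Split an LSID URN into its colon-separated components."""
    parts = urn.split(":")
    # Assumed layout: urn:lsid:<authority>:<namespace>:<object>[:<revision>]
    if len(parts) not in (5, 6) or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {urn!r}")
    revision = parts[5] if len(parts) == 6 else None
    return Lsid(parts[2], parts[3], parts[4], revision)


example = parse_lsid("urn:lsid:ncbi.nlm.nih.gov.lsid.i3c.org:genbank:nm_001240")
print(example.authority, example.namespace, example.object_id)
```

For the identifier above, the authority is the proxy ncbi.nlm.nih.gov.lsid.i3c.org, the namespace is genbank, and the object is nm_001240.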

The Gene Ontology Consortium (http://www.geneontology.org/), for example, provides its own controlled vocabularies to describe specific aspects of gene products. Collaborating databases annotate their gene products with these terms in RDF/XML. Gene products are annotated by molecular function, biological process, and cellular component. The terms form directed acyclic graphs of inheritance and aggregation relationships with semantics similar to RDF Schema. Every statement in the database must also be backed by evidence information, such as a traceable author statement, a direct result from a publication, or an automatically inferred fact. All of this information is then connected to LSIDs and published.
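The graph structure described above can be sketched in a few lines. This is a minimal illustration, not the Gene Ontology data model: the term identifiers (EX:1 to EX:4), term names, and the class names Term and Annotation are all hypothetical, and the evidence value is simply one of the kinds of evidence named in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Term:
    term_id: str
    name: str
    is_a: List[str] = field(default_factory=list)     # inheritance parents
    part_of: List[str] = field(default_factory=list)  # aggregation parents


@dataclass
class Annotation:
    gene_product: str
    term_id: str
    evidence: str  # every annotation carries evidence information


def ancestors(terms: Dict[str, Term], term_id: str) -> Set[str]:
    """Collect every term reachable through is_a or part_of edges."""
    seen: Set[str] = set()
    stack = [term_id]
    while stack:
        current = terms[stack.pop()]
        for parent in current.is_a + current.part_of:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical vocabulary: EX:4 has two parents, so this is a DAG, not a tree.
terms = {
    "EX:1": Term("EX:1", "root"),
    "EX:2": Term("EX:2", "A", is_a=["EX:1"]),
    "EX:3": Term("EX:3", "B", is_a=["EX:1"]),
    "EX:4": Term("EX:4", "C", is_a=["EX:2"], part_of=["EX:3"]),
}
annotation = Annotation("example-gene-product", "EX:4", "traceable author statement")
print(sorted(ancestors(terms, annotation.term_id)))
```

Because a term may have several parents, the traversal keeps a set of visited terms so that shared ancestors are reported only once.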

An excerpt from a Gene Ontology file describes how two gene products are associated:

B.2 SESAME

Sesame [MKvH02] is an architecture for storing and querying large amounts of RDF and RDF Schema data. It was developed within the joint European On-To-Knowledge project, and the whole system is available for download by noncommercial users.

The storage layer in Sesame is built from multiple stackable Repository Abstraction Layers with a custom API. In newer versions of Sesame, these were extended to SAILs (Storage and Inference Layers). They can interface the Sesame core with database management systems such as PostgreSQL, RDF files, specialized RDF stores, or RDF network services.
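The idea of stacking layers behind one interface can be sketched as follows. This is not Sesame's API (Sesame is a Java system); the class names MemoryStore and AuditLayer and the two-method interface are assumptions chosen only to show how one layer can wrap another.

```python
from typing import List, Protocol, Set, Tuple

Triple = Tuple[str, str, str]


class Layer(Protocol):
    """Hypothetical minimal interface shared by every layer in the stack."""

    def add(self, triple: Triple) -> None: ...
    def statements(self) -> Set[Triple]: ...


class MemoryStore:
    """Bottom layer: keeps triples in memory (stand-in for a DBMS or RDF file)."""

    def __init__(self) -> None:
        self._triples: Set[Triple] = set()

    def add(self, triple: Triple) -> None:
        self._triples.add(triple)

    def statements(self) -> Set[Triple]:
        return set(self._triples)


class AuditLayer:
    """Stacked layer: records each added triple, then delegates downward."""

    def __init__(self, below: Layer) -> None:
        self._below = below
        self.log: List[Triple] = []

    def add(self, triple: Triple) -> None:
        self.log.append(triple)
        self._below.add(triple)

    def statements(self) -> Set[Triple]:
        return self._below.statements()


stack = AuditLayer(MemoryStore())
stack.add(("ex:g1", "rdf:type", "ex:Gene"))
print(stack.statements(), stack.log)
```

Since every layer exposes the same interface, the code above the stack does not need to know how many layers sit below it or which backend stores the data.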

The administration module incrementally adds RDF/S information, statement by statement.

As triples are added, basic RDF Schema entailments are computed from types, subclass information, and property domains and ranges, and the resulting statements are checked for consistency with the data store.
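The kinds of entailment mentioned here can be illustrated with a small fixed-point computation. This sketch is not Sesame's inferencer; it applies only four standard RDFS rules (subclass transitivity, type inheritance through subclasses, domain, and range), and the ex: resources are hypothetical.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"
DOMAIN = "rdfs:domain"
RANGE = "rdfs:range"


def rdfs_closure(triples: Set[Triple]) -> Set[Triple]:
    """Repeat the four rules until no new triple is produced."""
    closure = set(triples)
    while True:
        new: Set[Triple] = set()
        for s, p, o in closure:
            for s2, p2, o2 in closure:
                if p == SUBCLASS and p2 == SUBCLASS and s2 == o:
                    new.add((s, SUBCLASS, o2))   # subclass transitivity
                if p == SUBCLASS and p2 == RDF_TYPE and o2 == s:
                    new.add((s2, RDF_TYPE, o))   # instance of subclass
                if p == DOMAIN and p2 == s:
                    new.add((s2, RDF_TYPE, o))   # subject typed by domain
                if p == RANGE and p2 == s:
                    new.add((o2, RDF_TYPE, o))   # object typed by range
        if new <= closure:
            return closure
        closure |= new


data = {
    ("ex:Gene", SUBCLASS, "ex:Entity"),
    ("ex:encodes", DOMAIN, "ex:Gene"),
    ("ex:encodes", RANGE, "ex:Protein"),
    ("ex:g1", "ex:encodes", "ex:p1"),
}
for triple in sorted(rdfs_closure(data) - data):
    print(triple)
```

The nested loop is quadratic in the number of triples, which is acceptable for a demonstration but would need indexing in a store of realistic size.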

The export module simply exports schema or data in RDF/XML.

The query module interfaces with the user through the SeRQL query language. For the PostgreSQL database, queries are optimized as they are translated to SQL. Further performance improvements come from the storage structure, which organizes the relational tables according to RDF Schema information.
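To give a feel for translating a graph query into SQL, the sketch below turns a single triple pattern into a parameterized SELECT statement. It is an illustration under simplifying assumptions, not Sesame's translator: it uses SQLite instead of PostgreSQL, one generic triples table instead of tables organized by RDF Schema information, and a hypothetical helper named pattern_to_sql.

```python
import sqlite3
from typing import List, Optional, Tuple


def pattern_to_sql(subject: Optional[str] = None,
                   predicate: Optional[str] = None,
                   obj: Optional[str] = None) -> Tuple[str, List[str]]:
    """Build a parameterized query; None means 'match anything'."""
    clauses: List[str] = []
    params: List[str] = []
    for column, value in (("subject", subject),
                          ("predicate", predicate),
                          ("object", obj)):
        if value is not None:
            clauses.append(f"{column} = ?")
            params.append(value)
    sql = "SELECT subject, predicate, object FROM triples"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [("ex:g1", "rdf:type", "ex:Gene"),       # hypothetical sample data
     ("ex:p1", "rdf:type", "ex:Protein"),
     ("ex:g1", "ex:encodes", "ex:p1")],
)

sql, params = pattern_to_sql(predicate="rdf:type", obj="ex:Gene")
rows = conn.execute(sql, params).fetchall()
print(sql, rows)
```

Only fixed column names are interpolated into the SQL text; all user-supplied values are passed as parameters.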

The interface layer makes the modules accessible through the HTTP, SOAP, and RMI protocols.

B.3 HAYSTACK

Haystack [Karger04] is a tool intended for casual users, developed by MIT in cooperation with IBM. It addresses the fact that people know a lot that they are willing to share but are too lazy to publish. The environment, based on IBM’s open-source Eclipse IDE, gathers that knowledge in RDF form for exchange and analysis without interfering with the user. This enables intelligent searching, mail filtering, automatic categorization, and context-sensitive support for user tasks.

Activities supported by Haystack include text processing, e-mail management, calendar and planner functions, Internet browsing, searching and bookmarking, and content annotation and categorization. The whole environment relies heavily on context and tries to present the user only with information that is relevant to the task at hand. All information is decoupled from physical location and graphical representation and is displayed when needed through persistent views that can be customized and even exchanged among users.

Apart from personal management interfaces, another existing Haystack application is BioHaystack, which integrates LSID data from various bioinformatics databases and lets users annotate, discuss, and analyze their content.

