Integrating ontologies is a mess

Using ontologies, controlled vocabularies, and thesauri to aid integration of data, and in fact also services, has come a long way over the last several years. Vocabularies help to make sure we use the exact same terms to say the same thing, and ontologies provide for computable semantics over a conceptualization of reality. The semantic web crucially relies on these so that software agents can actually interpret the data and the services they encounter. With integration at the core of the mission of the ontology effort, wouldn’t one assume that a common infrastructure for integrating vocabularies and ontologies would be in place by now, one that is straightforward to use yet powerful and flexible, well documented, widely adopted, and solidly a part of the Programmable Web? Well, it turns out, no, not by a long shot.

So why is integrating ontologies a problem? Ontologies, controlled vocabularies, and thesauri are being developed by multiple organizations, in different fields, and for different applications, such as biological, medical, clinical, or in the library and information sciences. Almost all ontologies hosted by the OBO Foundry, such as the often-cited Gene Ontology (GO), are maintained in OBO format, but other ontologies, even in the life sciences, are in OWL (Web Ontology Language) format, the W3C standard for semantic web ontologies. In addition, there is a further variety of terminology, vocabulary, and thesaurus formats, which the recent W3C-recommended standard SKOS (Simple Knowledge Organization System) tries to unify. All of these are served by different providers, through different user and query interfaces. As the LexGrid project puts it succinctly:

Just about every terminology has its own format, its own set of tools, and its own update mechanisms. The only thing that most of these pieces have in common with each other is their incompatibility.

As a result, by far the most common approach for applications that want to leverage ontologies and controlled vocabularies is to parse and load local copies of vocabularies. This limits the choices to the one or few formats supported, creates numerous copies of vocabularies that are outdated to varying degrees, lacks discoverability, and in the end simply doesn’t scale. What if instead vocabulary and ontology services were seamlessly embedded in the programmable web?

You may remember the times when gene annotations were incompatible in format and query interface (and other ways) between gene annotation providers, making it painful or impossible to compare and mash up gene annotations from different sources, such as different databases, or those resulting from new gene or function prediction methods. The Distributed Annotation Service (DAS) changed this dramatically. Meanwhile most genome database browsers allow mashing up the information from other, user-specified gene annotation resources on the fly as a standard feature. What if there was DOVS – a Distributed Ontology and Vocabulary Service?

Having recently started to research the situation, I find that what we have instead is a mess of obscure, complex APIs that are incompatible with each other and largely incompatible with the semantic web. Leveraging and integrating multiple ontologies is centrally important for at least two of the projects I’m involved in at NESCent: automatically extracting rich metadata from manuscript texts and data objects (see DRIADE), and converting evolutionary character descriptions into formal, interoperable phenotype assertions with computable semantics (see PhenoScape). One is in the domain of digital library and information science, the other is in biology, but there isn’t a programmable web infrastructure for ontologies or vocabularies in either.

So let’s look at some of the things I found. Note that this isn’t meant to be an exhaustive list by any means; rather, they are just a few points in a rugged landscape that seems to have a long way to go to become a smooth one.

We’ll start with the library and information science side.

  • Notably, just two weeks ago the W3C finally released SKOS, a unifying standard format for thesauri and vocabularies. Having first emerged in 2004, it could become the common format for all applications that don’t require strict semantics, such as, I would claim, a user choosing a term that accurately describes something of relevance, one of the most prominent use cases.
  • The Zthes profile comprises an abstract model, an XML schema, and a query specification layered on top of Z39.50 (which has its own protocol) or SRU (which has a SOAP and a REST-like binding) to “facilitate interoperability for applications that deal with thesauri.” The approach looks promising but seems dormant, and there seem to be no significant implementations other than those by the creators themselves. Though SKOS isn’t mentioned, it could conceivably be used instead of the Zthes XML format.
  • The OCLC has an ongoing project on Terminology Services, with some prototype development using MARC21 over SRU, though SKOS appears to be envisioned as the core encoding scheme. A concrete API description seems elusive.
  • The SWAD-Europe project devised a SOAP-based SKOS API. The corresponding Java API has concept-related as well as relation and synonym-related queries, though I can’t identify how to obtain a list of supported thesauri as URIs. The only non-toy implementation of this that I can find is by the NBII for its Biocomplexity Thesaurus Web Services.

This leads to the biology and life science side.

  • The LexGrid project has the declared aim to “bridge terminologies and ontologies with a common set of tools, formats and update mechanisms.” It doesn’t, however, seem to support SKOS. It published its own standard of a SOAP-based Common Terminology Service (CTS), which is tailored to the healthcare community and compliant with HL7 software requirements. The corresponding Java API is complex and bears little resemblance to other ontology data models, such as those of OBO-Edit or go-perl, or NCBO for that matter (see below). That assessment may simply reflect the fact that I am not a member of the healthcare community, but the complexity surely doesn’t speed adoption outside of that community.
  • The National Center for Biomedical Ontology’s (NCBO) BioPortal at present integrates 86 ontologies from the biomedical domain in OBO and OWL format. SKOS is not supported. BioPortal publishes a SOAP-based web-service API (though that fact is somewhat obscure). The Java class that implements the web-service endpoint (NCBOWebserviceEndpoint.java; the online link to the JavaDoc is broken) reveals queries for finding ontologies, terms, and relations, as well as queries for traversing the hierarchy. A JUnit test class has some examples for engaging the service as a client. Although the BioPortal uses LexGrid (as “LexBIO”) under the hood, if I am not mistaken, the API is very different from the LexGrid CTS (and it is not clear whether a CTS interface is exposed too). There don’t seem to be any documented consumers of the NCBO web-services API.
  • The NCI’s Enterprise Vocabulary Services (EVS) is a part of their Cancer Biomedical Informatics Grid (caBIG) infrastructure platform. EVS v4.0 uses LexGrid (as “LexBIG”) as its underlying vocabulary engine. It exposes a web-service API (see pp. 46-55 of the caCORE Technical Guide), with both SOAP and REST bindings. The API is different yet again from LexGrid CTS and the NCBO API and, even though the technical guide is detailed, resists rapid understanding of how to move beyond the examples given; for example, I remain stumped as to how to obtain the name of a concept found through a search (note: replace @id in the guide with @code, or otherwise the example returns an error). Also, engaging the API requires understanding the EVS object model, which reflects Apelon’s proprietary Ontylog data model used by NCI, and is unlike the other models (see above).
  • The NCI has launched a collaborative effort with NCBO (and LexGrid?), called the Open Terminology Portal (TOP), with the mission to “develop a unified version of BioPortal and a variety of services to meet the needs of terminology and ontology users.” After 7 months, there is no code yet.
  • A team at the EBI created the Ontology Lookup Service (OLS), which currently integrates 58 ontologies from the biomedical domain in OBO format hosted by the OBO Foundry. It exposes a SOAP-based web-service API with term, relationship, and ontology-based discovery queries. Most results seem to be returned as flat names, or key-value pairs, not as structured objects.
  • The BioMoby project provides an extensible web-service infrastructure with machine-interpretable semantics by using ontologies to describe input and output objects as well as the actual operation of a service. Ironically, rather than bootstrapping Moby-compliant query and update services for its own ontology service needs, querying the Moby Registry ontologies is only supported through an idiosyncratic Moby client API. In fact, querying Moby Central for services that produce ontology terms returns no results, and searching through more than 1,000 registered services reveals none that would allow traversing hierarchies, or searching for matching terms.
  • The Stanford Microarray Database (SMD) implements its own REST-like web-service interface (scroll to Appendix 3 in the document) for use with its Ontology Widget. The interface is straightforward and easy to use, but lacks provisions for obtaining metadata for ontologies, the hierarchical navigation is simplistic, and the response XML seems to be their own invention.
  • The DAS/2 specification includes a REST-like API for ontology retrieval. The capabilities are relatively minimal, comprising methods to list all supported ontologies, obtain all terms within an ontology, and query for matching terms by name or identifier. Hierarchical navigation queries are not supported. The default response format is OBO-XML (called das2xml here).
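Looking across the services above, the operations that keep recurring are few: list the available ontologies, fetch the terms of one ontology, search terms by name or identifier, and walk the hierarchy. The following is only a sketch of what a shared client-side contract for those operations might look like, backed by an in-memory dictionary rather than a remote service. The class, the method names, and the example identifiers are hypothetical and do not correspond to any of the APIs listed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    term_id: str         # identifier, unique across ontologies
    name: str
    ontology: str        # short name of the ontology the term belongs to
    parents: tuple = ()  # identifiers of direct (is_a) parents


class InMemoryTerminologyService:
    """Hypothetical stand-in for a remote service; holds all terms in a dict."""

    def __init__(self, terms):
        self._terms = {t.term_id: t for t in terms}

    def list_ontologies(self):
        """Names of all ontologies that have at least one term."""
        return sorted({t.ontology for t in self._terms.values()})

    def get_terms(self, ontology):
        """All terms belonging to one ontology."""
        return [t for t in self._terms.values() if t.ontology == ontology]

    def search(self, query):
        """Case-insensitive: exact identifier match or substring of the name."""
        q = query.lower()
        return [t for t in self._terms.values()
                if q == t.term_id.lower() or q in t.name.lower()]

    def parents(self, term_id):
        """Direct parents of a term."""
        return [self._terms[p] for p in self._terms[term_id].parents
                if p in self._terms]

    def ancestors(self, term_id):
        """All terms reachable through parent links, each reported once."""
        seen = {}
        stack = list(self._terms[term_id].parents)
        while stack:
            pid = stack.pop()
            if pid in self._terms and pid not in seen:
                seen[pid] = self._terms[pid]
                stack.extend(seen[pid].parents)
        return list(seen.values())


# Made-up identifiers and ontology names, for illustration only.
service = InMemoryTerminologyService([
    Term("EX:1", "anatomical structure", "EXA"),
    Term("EX:2", "appendage", "EXA", ("EX:1",)),
    Term("EX:3", "fin", "EXA", ("EX:2",)),
])
print([t.name for t in service.ancestors("EX:3")])
```

A client written against five methods like these would not need to care whether the terms came from an OBO file, an OWL document, or a SKOS thesaurus, which is the kind of indirection the current landscape lacks.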

So where do we go from here? I do understand that not all applications are equally well served by the same API. However, don’t we have more interesting things to do than writing code to support 10 different ontology and vocabulary APIs?

Finally, for full integration into the semantic web, terms (or concepts) need unique URIs anyway, which will ideally resolve (directly or indirectly) to a metadata document in RDF. Hence, we need at least a minimalist REST-like API in order to serve our ontologies to the semantic web. Wouldn’t it be nice to have an agreed-upon Distributed Terminology Service that accomplishes that as a side effect?
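As a rough illustration of how small such a minimalist API could be, the sketch below resolves a term URI to an RDF document in Turtle using nothing but Python's standard library. Everything specific in it is an assumption made for the example: the `/term/<id>` URL layout, the `localhost:8000` base URI, the two toy terms, and the choice of SKOS properties for the metadata.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical toy data: local term id -> (label, parent id or None)
TERMS = {"fin": ("fin", "appendage"), "appendage": ("appendage", None)}
BASE = "http://localhost:8000/term/"  # assumed base URI for term identifiers
PREFIX = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n"


def resolve(path):
    """Map a request path to a (status, content type, body) triple."""
    if not path.startswith("/term/"):
        return 404, "text/plain", "not found\n"
    term_id = path[len("/term/"):]
    if term_id not in TERMS:
        return 404, "text/plain", "unknown term\n"
    label, parent = TERMS[term_id]
    statements = [f'skos:prefLabel "{label}"@en']
    if parent is not None:
        statements.append(f"skos:broader <{BASE}{parent}>")
    body = (PREFIX + f"<{BASE}{term_id}> a skos:Concept ;\n    "
            + " ;\n    ".join(statements) + " .\n")
    return 200, "text/turtle", body


class TermHandler(BaseHTTPRequestHandler):
    """Answers GET requests by delegating to resolve()."""

    def do_GET(self):
        status, content_type, body = resolve(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", f"{content_type}; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8000):
    """Run the toy server until interrupted."""
    HTTPServer(("localhost", port), TermHandler).serve_forever()
```

Calling `serve()` and then requesting `http://localhost:8000/term/fin` should return the Turtle description of that term, including a link to its broader term, which a client can follow in turn. Hierarchical navigation then falls out of plain URI resolution rather than needing a separate query method.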

3 Responses to “Integrating ontologies is a mess”

  1. Hi Hilmar,

    The Moby ontologies are published as OWL-RDF documents, as is the entire Biomoby registry:

    http://biomoby.org/RESOURCES/MOBY-S/Objects
    http://biomoby.org/RESOURCES/MOBY-S/Namespaces
    http://biomoby.org/RESOURCES/MOBY-S/Services
    http://biomoby.org/RESOURCES/MOBY-S/ServiceInstances

    these addresses are available through an API call to ensure that they are always discoverable if they change.

    The ontologies are *also* available for manipulation through code (in particular for the purpose of adding/removing nodes) but for simple browsing of the ontology almost everyone uses the dynamically-generated RDF files at the addresses above.

    Cheers!

    Mark

  2. Hi Mark -

    that’s great (have I missed where this is documented at BioMoby.org or should it be more prominent?). However, am I missing some detail, or is this equivalent to posting the ontologies for download in RDF/OWL? I.e., it’s not an API for querying and accessing individual terms, and programmatically navigating their neighborhood, right?

    Of course I could suck them up using an OWL/RDF-capable library, but isn’t that a bit akin to saying we don’t need an ontology web-service API, just a SPARQL endpoint?

  3. Great to see commentary on this issue. I believe we are in exciting times. Although we may call the current situation “a mess” — we actually have an information infrastructure, and a host of new enabling technologies, that permit integration, mapping, and linking different vocabularies, along the ontology/vocabulary continuum, in a way that was never before possible.

    By our sharing different developments, and hearing from different communities…I think we may be able to improve the situation dramatically, and your [Hilmar] generating this discussion is a positive step in this direction.

    It seems to me that the development of registries is a big component of our cleaning up (or improving) the mess…

    Thanks for prompting the discussion!! jane
