Remote tree reconstruction is (almost) here

This week the CIPRES project released a REST-based API for running optimized versions of some of the most powerful programs for phylogenetic tree reconstruction on the CIPRES compute cluster at the San Diego Supercomputer Center (SDSC). This lays the foundation for the same kind of seamless remote execution of phylogenetic tree inference that is already widely used, for example, for BLAST-based sequence similarity searches run at NCBI.

Running compute-intensive tools on a remote server rather than locally has a number of advantages. You can run them from the convenience of your laptop, or generally from any machine connected to the network, without needing to install the tools (and, in the case of sequence database searches, the databases) locally, and the resource limitations of your own computer don’t matter. Usually the job also runs detached from the user interface, so a long-running analysis keeps running while you are offline.

So far, an interactive (web-based) user interface (such as the BLAST search interface) delivers both of these benefits, and is indeed what an end-user trying to reconstruct a tree will most likely be interested in. In fact, CIPRES released a web-based user interface to its portal almost a year ago. The portal currently allows running ClustalW for computing multiple sequence alignments, and GARLI, RAxML, PAUP, and as of recently also MrBayes (which is not yet supported through the REST-based API) for reconstructing phylogenetic trees. The jobs run on the CIPRES compute cluster at SDSC, and some of the tools have been optimized, for example by embedding them in a Rec-I-DCM3 boosting framework. This makes the service particularly useful for analyzing large datasets.

So why is the availability of the REST-based access to this service so noteworthy? Basically, because unlike the end-user oriented interactive version it is programmable, in any of your favorite languages that allow you to interact with an HTTP server. If you are trying to execute tree reconstruction as one step of an iteration or as one of many more steps in a workflow (see the Phyloinformatics Hackathon use-cases for examples), it is machines that need to access the service, not humans. Programmatic screen-scraping is possible but has proven notoriously fragile and is an eternal pain to maintain – because user interfaces need to change, improve, become more elaborate, etc. Now consider screen-scraping not 1 but 5 pages, which is the number of steps that job submission on the CIPRES portal user interface is currently divided into. It’s possible, but the thought of having to maintain that puts up a fairly high barrier.

Fortunately, when Mark Holder, who leads the software development at CIPRES, and I instigated integrating the CIPRES portal into the programmable web, we met receptive ears. Lucie Chan at SDSC designed and implemented the API; overall, my role in this consisted merely in occasional cheerleading :-)

So how does the API work? The resource URLs one must address, the expected input data, and the responses returned are documented at the portal (prefix all URLs with http://8ball.sdsc.edu:8888/cipres-web, which is the base URL of the service). The API follows the REST style on top of plain HTTP, which has the nice feature that, instead of needing special client software, any of the widely available HTTP client implementations will let us interact with the server interface.

For example, curl nicely supports HTTP GET as well as POSTing multipart/form-data encoded input data including file upload, which is all we need for a start. To submit a job, we call the /restapi/job resource:

$ curl -F "datafile=@dna.nex" -F email="your@email.org" \
       -F "analysis=ML GTR+G" -F tool=RAxML -F datatype=DNA_DATATYPE \
       -D h.txt http://8ball.sdsc.edu:8888/cipres-web/restapi/job

This is a Create operation in terms of CRUD, and since the parameters needed to create the job include the input data file of (aligned) character data, we POST the input in multipart/form-data encoding. The @ prefix to the filename in the first -F flag tells curl to attach the file as a file upload (if you don’t have an example file, you can use one of the BioPerl test data files, for example Primate_mtDNA.nex). The other values provided to the -F flags give the email address to send a completion acknowledgment to and stipulate that the analysis we wish to run is a maximum likelihood (ML) method, using the General Time Reversible substitution model with rate heterogeneity (GTR+G), the tool to run is RAxML, and the uploaded file contains DNA data (i.e., a multiple alignment of DNA sequences).
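Since the whole point is that machines can now make this call, here is a minimal Python sketch of the same submission. It simply assembles the curl invocation from above and assumes curl is on the PATH. The function names build_submit_command and submit_job are made up for illustration; only the form fields, their values, and the URL come from the call shown above.

```python
import subprocess

# Base URL of the service as given in this post.
BASE_URL = "http://8ball.sdsc.edu:8888/cipres-web"


def build_submit_command(datafile, email, analysis="ML GTR+G",
                         tool="RAxML", datatype="DNA_DATATYPE",
                         header_file="h.txt"):
    """Return the curl argument list for POSTing a job to /restapi/job.

    Hypothetical helper: it only mirrors the curl call shown in the post.
    """
    return [
        "curl",
        "-F", f"datafile=@{datafile}",  # '@' makes curl upload the file
        "-F", f"email={email}",
        "-F", f"analysis={analysis}",
        "-F", f"tool={tool}",
        "-F", f"datatype={datatype}",
        "-D", header_file,              # dump the response headers here
        f"{BASE_URL}/restapi/job",
    ]


def submit_job(datafile, email, **options):
    """Run curl and return the XML response body (needs network access)."""
    command = build_submit_command(datafile, email, **options)
    completed = subprocess.run(command, capture_output=True,
                               text=True, check=True)
    return completed.stdout
```

Passing the arguments as a list rather than as one shell string sidesteps quoting problems, for instance with the space in "ML GTR+G".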

The success of the operation is returned as the HTTP status code, which is the first line of the response header (saved to h.txt by the -D flag). The above call yields

HTTP/1.1 200 OK

and the following XML-formatted response body:

<?xml version="1.0" encoding="UTF-8"?>
<JobSubmission status="OK">
  <StatusURL>
    http://8ball.sdsc.edu:8888/cipres-web/restapi/job/SID
  </StatusURL>
</JobSubmission>
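A script will want to pull the StatusURL out of this response. The following is a small sketch using Python's standard xml.etree.ElementTree module; the function name status_url_from_submission is hypothetical, and the element and attribute names are taken from the response shown above rather than from a schema.

```python
import xml.etree.ElementTree as ET


def status_url_from_submission(xml_text):
    """Extract the StatusURL from a JobSubmission response body.

    Raises ValueError if the response does not report status="OK".
    """
    root = ET.fromstring(xml_text.strip())
    if root.tag != "JobSubmission" or root.get("status") != "OK":
        raise ValueError("submission was not accepted")
    url = root.findtext("StatusURL")
    if url is None or not url.strip():
        raise ValueError("response contains no StatusURL")
    # The URL may be surrounded by whitespace and line breaks.
    return url.strip()
```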

Note that SID will in reality be a rather longish session identifier. If there had been an error in the submission, the HTTP status code would have been

HTTP/1.1 400 Bad Request

and the status attribute in the response would have read ERROR. In that case, the response body shows the error code and message. For example, if we leave out the email in the above call, we get

<?xml version="1.0" encoding="UTF-8"?>
<JobSubmission status="ERROR">
  <error code="4" message="Email is missing or invalid" />
</JobSubmission>
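For a program, both signals are easy to check. Here is a rough Python sketch that reads the status code from the saved header text and the error details from the body; the function names http_status_code and submission_error are invented for this example, and the layout of the error element is assumed to be exactly as in the response above.

```python
import xml.etree.ElementTree as ET


def http_status_code(header_text):
    """Return the numeric code from a status line such as
    'HTTP/1.1 400 Bad Request' (the first line of the header dump)."""
    status_line = header_text.splitlines()[0]
    return int(status_line.split()[1])


def submission_error(xml_text):
    """Return (code, message) if the body reports an error, else None.

    Both values are returned as strings, as they appear in the XML.
    """
    root = ET.fromstring(xml_text.strip())
    if root.get("status") != "ERROR":
        return None
    error = root.find("error")
    if error is None:
        return (None, None)
    return (error.get("code"), error.get("message"))
```

With the example response above, submission_error gives the code "4" together with the message text, which a wrapper script could print before giving up.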

In case submission is successful, we can now use the StatusURL to inquire about the status of the submitted job using a simple GET request (in CRUD terms, this is a Retrieve operation):

$ curl http://8ball.sdsc.edu:8888/cipres-web/restapi/job/SID

which yields the following (also XML-formatted) response body:

<?xml version="1.0" encoding="UTF-8"?>
<JobStatus status="COMPLETE">
 <ResultURL>
   http://8ball.sdsc.edu:8888/cipres-web/restapi/result/job/SID/file/RAxML_dna.nex
 </ResultURL>
</JobStatus>
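In a workflow we would typically poll the StatusURL until the job is done. Below is a minimal polling sketch in Python, using only the standard library. The names parse_job_status, http_get, and wait_for_result, the 30-second interval, and the polling limit are all assumptions for illustration. The only status values this post mentions are COMPLETE and IN_PROGRESS, so the sketch treats anything else as unexpected.

```python
import time
import urllib.request
import xml.etree.ElementTree as ET


def parse_job_status(xml_text):
    """Return (status, result_url) from a JobStatus response body.

    result_url is None when the response carries no ResultURL.
    """
    root = ET.fromstring(xml_text.strip())
    result_url = root.findtext("ResultURL")
    return root.get("status"), (result_url.strip() if result_url else None)


def http_get(url):
    """Fetch a URL with a plain HTTP GET and return the body as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def wait_for_result(status_url, fetch=http_get, interval=30.0, max_polls=120):
    """Poll status_url until the job is COMPLETE; return the ResultURL."""
    for _ in range(max_polls):
        status, result_url = parse_job_status(fetch(status_url))
        if status == "COMPLETE":
            return result_url
        if status != "IN_PROGRESS":
            raise RuntimeError(f"unexpected job status: {status!r}")
        time.sleep(interval)
    raise TimeoutError("job did not complete within the polling limit")
```

The fetch parameter is there so the loop can be tried out with canned responses instead of a live server.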

Had it been a longer-running job, the status attribute would have read IN_PROGRESS. Once the job is complete we can retrieve the result at the returned ResultURL, which is also a simple Retrieve (GET) operation on the file as a resource:

$ curl -o result.nex \
       http://8ball.sdsc.edu:8888/cipres-web/restapi/result/job/SID/file/RAxML_dna.nex

[Figure: Primate mtDNA tree, computed with RAxML]

This will save the result as result.nex. We can now view the reconstructed tree, for example using FigTree.
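The download step is just as easy to script. The following Python sketch does the equivalent of curl -o using the standard library; the function name download_result is made up, and the default file name simply mirrors the one used above.

```python
import shutil
import urllib.request


def download_result(result_url, destination="result.nex"):
    """Save the file at result_url to destination, like `curl -o`."""
    with urllib.request.urlopen(result_url) as response, \
            open(destination, "wb") as out:
        # Stream the body to disk without loading it all into memory.
        shutil.copyfileobj(response, out)
    return destination
```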

Note that this reconstruction ran with default parameters, which you can look up in the API documentation. The API also allows attaching a configuration file to the job submission.

With this API in place, there are many possibilities for where we can go from here. You could simply wrap the above example as a shell script that can simultaneously masquerade as GARLI, RAxML, and PAUP (and any other tool CIPRES will support in the future). We can now also write modules for any of the programming toolkits that have proven so powerful for scripting analysis workflows, such as BioPerl, Biojava, and the other Bio* toolkits, or for visual workflow construction tools such as Taverna and Kepler.

We could even try to MOBY-fy the service, which would make it immediately visible to all BioMOBY service consumers. Moreover, as BioMOBY adds strong semantics to input, output, and the operation itself, the service would become discoverable, by humans and software agents alike, by how it fits logically into a desired analysis, making it ultimately part of the semantic web. In fact, I’ve been speaking with Mark Wilkinson and we might look into this at the upcoming BioHackathon 2008 that we will both be attending.

Update 2008-01-29: MrBayes is supported through the REST API, too. The documentation on available tools has been updated accordingly. Also, there is an interactive tool for generating suitable config files.

One Response to “Remote tree reconstruction is (almost) here”

  1. Feifei Xu Says:

    Hi,

    I like the idea of this project. I was trying out the service, but didn’t get it started properly. I wonder if I can get any help? The following try gives me an error “Analysis name is missing or invalid”. Where can I get help with the analysis name? And could you point me to some good documentation of how to use this service? Thanks a lot.

    curl -F "datafile=@Contig8.fas" -F email="feifei.xu@ebc.uu.se" -F "analysis=PROTCATJTT" -F tool=RAxML -F datatype=AA_DATATYPE -D h.txt http://8ball.sdsc.edu:8888/cipres-web/restapi/job
