Using NCBI EFetch and XML::SAX

Now that you understand the basics of XML::SAX, you can apply this knowledge to dynamically retrieve and parse sequence data from NCBI. Fortunately for us, NCBI provides a web service called EFetch that simplifies the process of retrieving sequence records. EFetch is an example of a REST-based web service (for details on REST-based web services, refer to Chapter 9). In a nutshell, client applications connect to EFetch via HTTP and specify search criteria with a set of URL parameters. Based on the search criteria, the EFetch service connects to the NCBI Entrez back-end database system, finds a matching record, and returns the requested record in the format of your choosing. EFetch currently provides access to several NCBI Entrez databases, including sequence, literature, and taxonomy databases, and can return data in several formats, including text, HTML, ASN.1, and XML. If you are eager to try out a few sample EFetch requests, refer to Table 5.2.

As of this writing, the base URL for connecting to EFetch is:

To retrieve a specific nucleotide sequence record, you must append a database parameter and an ID parameter, which uniquely identifies the record. For example, the following URL retrieves the complete genome record for the SARS coronavirus, formatted in the GenBank flat file format:

Table 5.2 Example EFetch queries

How to Retrieve a Nucleotide Record:

Example #1: Retrieves information regarding the BRCA2 gene in human, and formats the results in TinySeq XML:
nucleotide&id=U43746&rettype=fasta&retmode=xml

Example #2: Retrieves information regarding the BRCA2 gene in human, and formats the results in GenBank XML:

How to Retrieve a Protein Record:

Example: Retrieves information regarding the BRCA2 protein in human, and formats the results in GenPept XML:

How to Retrieve a Literature Record:

Example: Retrieves citation and abstract information regarding PMID: 14597658, and formats the results in XML:

How to Retrieve a Taxonomy Record:

Example: Retrieves the species name for NCBI Taxonomy ID: 7227. In this case, EFetch returns a single string: "Drosophila melanogaster".

text&id=30271926

In the URL above, the db parameter specifies the NCBI nucleotide database, rettype specifies the GenBank flat file format, retmode specifies text content, and id specifies the NCBI GI number for the SARS virus. Conveniently, the id parameter accepts both NCBI GI numbers and NCBI accession numbers.

For XML content, set the retmode parameter to "xml". For example, to retrieve data in the NCBI TinySeq XML format, set rettype=fasta and retmode=xml. To retrieve data in the more comprehensive NCBI GBSeq XML format, set rettype=gb and retmode=xml. The following URL retrieves the same SARS virus record, but this time formatted in GBSeq XML:

xml&id=30271926
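The four parameters described above can be assembled into a query string programmatically. The following is a minimal sketch, not code from the listings: build_efetch_url() is a hypothetical helper, the base URL is supplied by the caller, and http://example.org/ is only a placeholder. It also assumes the parameter values need no URL escaping, which holds for GI numbers and accession numbers.

```perl
use strict;
use warnings;

# Hypothetical helper: joins a caller-supplied base URL with the EFetch
# parameters discussed in the text (db, rettype, retmode, id).
sub build_efetch_url {
    my ($base_url, %params) = @_;
    my @pairs;
    # Emit parameters in a fixed order; skip any the caller left out.
    for my $key (qw(db rettype retmode id)) {
        next unless defined $params{$key};
        push @pairs, "$key=$params{$key}";
    }
    return $base_url . "efetch.fcgi?" . join("&", @pairs);
}

# Placeholder base URL (an assumption for illustration only).
my $url = build_efetch_url(
    "http://example.org/",
    db      => "nucleotide",
    rettype => "gb",
    retmode => "xml",
    id      => "30271926",
);
print "$url\n";
```

Switching from GBSeq XML to TinySeq XML then only requires changing rettype to "fasta"; the rest of the call stays the same.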

Complete details regarding NCBI EFetch are available online at:

Our goal is to write a Perl program capable of automatically retrieving sequence data from EFetch and extracting a small subset of the XML content for display to the console. The program expects a single command line argument, indicating an NCBI GI number or accession number. A sample run of the application is shown below:

> NC_004718
Downloading XML from NCBI E_Fetch
Using URL: fcgi?db=nucleotide&rettype=gb&retmode=xml&id=30271926
Definition: SARS coronavirus, complete genome
Accession: NC_004718
Locus: NC_004718
Organism: SARS coronavirus


Source code for the Perl fetcher is shown in Listings 5.4 and 5.5. Examine the code now and we will describe its main components below.

As in our first SAX application, the fetch application consists of two parts: a main application, which initiates parsing (Listing 5.4), and a SAX event handler (Listing 5.5). The main application uses the World Wide Web library for Perl (LWP) [60] to connect to NCBI EFetch and retrieve the specified sequence record. It also obtains an XML parser via the SAX factory, and initiates parsing via the parse_string() method. The parse_string() method returns an associative array, which we then print to the console.
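One detail worth guarding against in the main application: LWP::Simple's get() returns undef when a request fails, so handing its result straight to the parser can produce a confusing error. The following is a minimal sketch of a guard. The helper fetch_or_die() is hypothetical and not part of the listings; it takes the download routine as a code reference so that it can be exercised without network access.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the listings): calls the supplied
# download routine and dies with a readable message if nothing came back.
sub fetch_or_die {
    my ($url, $fetch) = @_;
    my $doc = $fetch->($url);
    die "Unable to download $url\n" unless defined $doc;
    return $doc;
}

# Stand-in for a download routine that fails by returning undef,
# which is how LWP::Simple::get signals an unsuccessful request.
my $always_fails = sub { return undef };

# The URL below is a placeholder used only for this demonstration.
my $result = eval { fetch_or_die("http://example.org/efetch.fcgi", $always_fails) };
print "Caught: $@" unless defined $result;
```

In the fetch application, the equivalent call would be fetch_or_die($ncbi_url, \&LWP::Simple::get), assuming LWP::Simple has been loaded.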

Chapter 5 • Parsing NCBI XML in Perl

Listing 5.4 Parsing NCBI EFetch data via the SAX API

# Fetches NCBI XML from the NCBI E_Fetch Utility.
# Author: Ethan Cerami
use strict;
use XML::SAX;
use LWP::Simple;
use NcbiHandler;

# Require exactly one command line argument
if (@ARGV != 1) {
    print "Usage: ncbi_identifier (NCBI GI or Accession Number)\n";
    die "Example: 30271926\n";
}

# Download File from NCBI E_Fetch; uses LWP Module
my $ncbi_url = get_ncbi_url($ARGV[0]);
print "Downloading XML from NCBI E_Fetch\n";
print "Using URL: $ncbi_url\n";
my $xml_doc = LWP::Simple::get($ncbi_url);

# Parse XML Document
my $handler = NcbiHandler->new;
my $parser = XML::SAX::ParserFactory->parser(Handler => $handler);
my %data = $parser->parse_string($xml_doc);

# Output Results of Parsing (first 20 characters of the sequence only)
my $sequence = substr($data{"GBSeq_sequence"}, 0, 20);
print "Definition: ", $data{"GBSeq_definition"};
print "\nAccession: ", $data{"GBSeq_primary-accession"};
print "\nLocus: ", $data{"GBSeq_locus"};
print "\nOrganism: ", $data{"GBSeq_organism"};
print "\nSequence (0..20): $sequence... \n";

# Gets NCBI Identifier from user, and returns an absolute URL
# to the NCBI E_Fetch Utility.
sub get_ncbi_url {
    my ($id) = @_;
    # Set Base URL for NCBI E_Fetch
    my $baseurl = ""
        . "efetch.fcgi?db=nucleotide&rettype=gb&retmode=xml&id=";
    return $baseurl . $id;
}

Listing 5.5

# Parses NCBI GBSeq XML Documents, and extracts only
# selected elements.
package NcbiHandler;
use strict;

# Extend XML::SAX::Base
use base qw(XML::SAX::Base);

# Selected element data, and the current character buffer
my %data;
my $current_text;

# Report Start Element Events.
# Each time we get a start element event,
# reset the character buffer.
sub start_element {
    my ($self, $element) = @_;
    $current_text = "";
}

# Selectively store element information.
sub end_element {
    my ($self, $element) = @_;
    my $name = $element->{"Name"};
    if ($name eq "GBSeq_locus"
        || $name eq "GBSeq_primary-accession"
        || $name eq "GBSeq_definition"
        || $name eq "GBSeq_organism"
        || $name eq "GBSeq_sequence") {
        $data{$name} = $current_text;
    }
}

# Keep Character Buffer.
sub characters {
    my ($self, $characters) = @_;
    $current_text .= $characters->{"Data"};
}

# Return Associative Array to main application.
sub end_document {
    return %data;
}

1;

The NcbiHandler module listens for specific SAX events, and selectively stores specific GBSeq elements in an internal associative array. There are a few important items to note. First, the characters() method uses a character buffer. This is important because SAX parsers are free to perform character "chunking": for example, one SAX parser may report a line of text via a single call to characters(), whereas a second SAX parser may break the line into two "chunks" and report it via two calls to characters(). Since there is no way to know ahead of time which chunking method the parser will use, it is always safest to assume multiple calls to characters() and to append to a character buffer each time. Second, the end_element() method is used to selectively filter for specific GBSeq elements. For those specific elements of interest, we store the current character buffer into an associative array and use the element name as a hash key. We subsequently return the associative array to the main calling application by returning it from the end_document() method.
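To see why the character buffer matters, the events for one element can be replayed by hand with the text delivered in one chunk and then in two. The following is a simplified sketch rather than the handler from the listings: BufferDemo and replay() are hypothetical names, the class uses the same method names as a SAX handler but does not extend XML::SAX::Base, and no parser is involved.

```perl
use strict;
use warnings;

# Simplified stand-in for a SAX handler: same method names as a SAX
# handler, but driven by hand so that no XML parser is required.
package BufferDemo;

sub new { return bless { buffer => "", data => {} }, shift }

# Reset the character buffer at the start of every element.
sub start_element { my ($self, $el) = @_; $self->{buffer} = ""; }

# Append each chunk; never assume the text arrives in a single call.
sub characters { my ($self, $chars) = @_; $self->{buffer} .= $chars->{Data}; }

# Store the accumulated text under the element name.
sub end_element {
    my ($self, $el) = @_;
    $self->{data}{ $el->{Name} } = $self->{buffer};
}

package main;

# Replays the events for one element, delivering its text as the given chunks.
sub replay {
    my ($name, @chunks) = @_;
    my $handler = BufferDemo->new;
    $handler->start_element({ Name => $name });
    $handler->characters({ Data => $_ }) for @chunks;
    $handler->end_element({ Name => $name });
    return $handler->{data}{$name};
}

my $one_call  = replay("GBSeq_organism", "SARS coronavirus");
my $two_calls = replay("GBSeq_organism", "SARS coro", "navirus");
print "Same result: ", ($one_call eq $two_calls ? "yes" : "no"), "\n";
```

This sketch keeps the buffer and the stored data on the handler object instead of in file-level variables. That is a design choice made here so that each replay starts from a clean state.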
