Representing Sequence Features

In addition to raw sequence data, BSML can also represent sequence features. A sequence feature is any piece of annotation that provides additional details regarding a specific location or range of sequence data. When we get to Chapter 6, we will spend more time formally defining sequence annotation, and discuss in detail the Distributed Annotation System (DAS). For now, it is simplest to think of sequence annotation as any piece of data that provides additional details regarding a raw sequence record. For example, we can take a raw sequence record, and identify important parts, such as promoter regions, protein-coding regions, and 5' and 3' untranslated regions. We can also annotate sequence records with important references to scientific articles. Sequence features are an important element in other file formats as well. For example, the GenBank Flat File Format includes extensive support for sequence features and includes a recommended list of feature types.

Listing 2.2 The SARS virus, Take 2. This example is identical to Listing 2.1, except that we have now added additional attributes.

<!DOCTYPE Bsml PUBLIC "-//Labbook, Inc. BSML DTD//EN"
  "http://www.rescentris.com/dtd/bsml3-1.dtd">
<Bsml>
  <Definitions>
    <Sequences>
      <Sequence id="AY278741" title="AY278741" molecule="rna" length="29727"
                db-source="GenBank" ic-acckey="AY278741" topology="linear"
                strand="ss" representation="raw">
        <Attribute name="definition" content="SARS coronavirus Urbani, complete genome."/>
        <Attribute name="submission-date" content="21-APR-2003"/>
        <Attribute name="version" content="AY278741.1 GI:30027617"/>
        <Attribute name="source" content="SARS coronavirus Urbani"/>
        <Seq-data>
atattaggtttttacctacccaggaaaagccaaccaacctcgatctcttgtagatctgtt
ctctaaacgaactttaaaatctgtgtagctgtcgctcggctgcatgcctagtgcacctac
gcagtataaacaataataaattttactgtcgttgacaagaaacgagtaactcgtccctct
tctgcagactgcttacggtttcgtccgtgttgcagtcgatcatcagcatacctaggtttc
gtccgggtgtgaccgaaaggtaagatggagagccttgttcttggtgtcaacgagaaaaca
cacgtccaactcagtttgcctgtcc
[For brevity, sequence is truncated.]
        </Seq-data>
      </Sequence>
    </Sequences>
  </Definitions>
</Bsml>

In BSML, each sequence can contain any number of features. Features are formally nested within a Feature-tables element and individual features are defined within a Feature element. Two types of features are supported: positional and nonpositional. Positional features are tied to specific sequence locations and can be used to represent a host of sequence annotations, including protein-coding regions, locations of predicted genes, single nucleotide polymorphisms (SNPs), etc. Nonpositional features are not tied to any specific region of sequence, but are instead associated with the sequence record as a whole. For example, you can attach literature references that are associated with the entire sequence record.

Nonpositional features are slightly less complex than positional features. Let's take a look at an example, shown in Listing 2.3. This new example adds a single nonpositional feature detailing the direct submission to GenBank. More specifically, it lists the primary contributors of the work and their affiliation with the Centers for Disease Control and Prevention. As you can see, the Reference element contains a list of authors, a title, and the complete journal reference. For references to published material, you can include cross-reference identifiers to MEDLINE and PubMed.

Chapter 2 • Fundamentals of XML and BSML

Listing 2.3 The SARS virus, Take 3. The record now includes a single nonpositional feature, describing the direct submission to GenBank.

<!DOCTYPE Bsml PUBLIC "-//Labbook, Inc. BSML DTD//EN"
  "http://www.rescentris.com/dtd/bsml3_1.dtd">
<Bsml>
  <Definitions>
    <Sequences>
      <Sequence id="AY278741" title="AY278741" molecule="rna" length="29727"
                db-source="GenBank" ic-acckey="AY278741" topology="linear"
                strand="ss" representation="raw">
        <Attribute name="definition" content="SARS coronavirus Urbani, complete genome."/>
        <Attribute name="submission-date" content="21-APR-2003"/>
        <Attribute name="version" content="AY278741.1 GI:30027617"/>
        <Attribute name="source" content="SARS coronavirus Urbani"/>
        <Feature-tables id="AY278741.FTS1">
          <Feature-table id="AY278741.FTS1.FTB1" title="Genbank References" class="GB_REFERENCES">
            <Reference id="REF1" title="Direct Submission">
              <RefAuthors>Bellini,W.J., Campagnoli,R.P., Icenogle,J.P., Monroe,S.S., Nix,W.A., Oberste,M.S., Pallansch,M.A. and Rota,P.A.</RefAuthors>
              <RefTitle>Direct Submission</RefTitle>
              <RefJournal>Submitted (17-APR-2003) Division of Viral and Rickettsial Diseases, Centers for Disease Control and Prevention, 1600 Clifton RD, NE, Atlanta, GA 30333, USA</RefJournal>
            </Reference>
          </Feature-table>
        </Feature-tables>
        <Seq-data>
atattaggtttttacctacccaggaaaagccaaccaacctcgatctcttgtagatctgttctc
taaacgaactttaaaatctgtgtagctgtcgctcggctgcatgcctagtgcacctacgcagta
taaacaataataaattttactgtcgttgacaagaaacgagtaactcgtccctcttctgcagac
tgcttacggtttcgtccgtgttgcagtcgatcatcagcatacctaggtttcgtccgggtgtga
ccgaaaggtaagatggagagccttgttcttggtgtcaacgagaaaacacacgtccaactcagt
ttgcctgtcc
[For brevity, sequence is truncated.]
        </Seq-data>
      </Sequence>
    </Sequences>
  </Definitions>
</Bsml>
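To make the structure of a nonpositional feature concrete, here is a minimal sketch that reads the Reference element from Listing 2.3 with a general-purpose XML parser. It uses Python's standard xml.etree.ElementTree module; the function name read_references is our own and is not part of BSML or of any BSML toolkit. The fragment contains only the Feature-table from the listing, so it can be parsed without the DTD.

```python
import xml.etree.ElementTree as ET

# Feature-table fragment copied from Listing 2.3.
REFERENCE_XML = """
<Feature-table id="AY278741.FTS1.FTB1" title="Genbank References" class="GB_REFERENCES">
  <Reference id="REF1" title="Direct Submission">
    <RefAuthors>Bellini,W.J., Campagnoli,R.P., Icenogle,J.P., Monroe,S.S., Nix,W.A., Oberste,M.S., Pallansch,M.A. and Rota,P.A.</RefAuthors>
    <RefTitle>Direct Submission</RefTitle>
    <RefJournal>Submitted (17-APR-2003) Division of Viral and Rickettsial Diseases, Centers for Disease Control and Prevention, 1600 Clifton RD, NE, Atlanta, GA 30333, USA</RefJournal>
  </Reference>
</Feature-table>
"""


def read_references(xml_text):
    """Return one dict per Reference element found in the fragment.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_text)
    references = []
    for ref in root.iter("Reference"):
        references.append({
            "id": ref.get("id"),
            # findtext returns None when a child is absent, so fall back to "".
            "authors": (ref.findtext("RefAuthors") or "").strip(),
            "title": (ref.findtext("RefTitle") or "").strip(),
            "journal": (ref.findtext("RefJournal") or "").strip(),
        })
    return references


references = read_references(REFERENCE_XML)
for ref in references:
    print(ref["id"], "|", ref["title"])
    print(ref["authors"])
```

Because a nonpositional feature carries no location element, nothing in this sketch needs to refer to the sequence data itself.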

Positional features are just slightly more complicated. Each feature can contain any number of Qualifier and location elements. A Qualifier element holds a name/value pair that describes the feature. A location element describes the location of the feature. Two types of locations can be specified: Site-loc and Interval-loc. A Site-loc identifies a single point within a raw sequence; an Interval-loc identifies a specific interval or range of raw sequence data. Again, a specific example should clarify the most important points. Take a look at Listing 2.4.

Listing 2.4 SARS virus, Take 4. The record now includes a single positional feature.

<!DOCTYPE Bsml PUBLIC "-//Labbook, Inc. BSML DTD//EN"
  "http://www.rescentris.com/dtd/bsml3-1.dtd">
<Bsml>
  <Definitions>
    <Sequences>
      <Sequence id="AY278741" title="AY278741" molecule="rna" length="29727"
                db-source="GenBank" ic-acckey="AY278741" topology="linear"
                strand="ss" representation="raw">
        <Attribute name="definition" content="SARS coronavirus Urbani, complete genome."/>
        <Attribute name="submission-date" content="21-APR-2003"/>
        <Attribute name="version" content="AY278741.1 GI:30027617"/>
        <Attribute name="source" content="SARS coronavirus Urbani"/>
        <Feature-tables id="AY278741.FTS1">
          <Feature-table id="AY278741.FTS1.FTB2" title="Genbank Features" class="GB-FEATURES">
            <Feature id="AY278741.FTS1.FTB2.FTR9" title="envelope protein" class="CDS"
                     comment="envelope protein" display-auto="1">
              <Interval-loc startpos="26117" endpos="26347"/>
              <Qualifier value-type="note" value="envelope protein"/>
              <Qualifier value-type="codon-start" value="1"/>
              <Qualifier value-type="product" value="E protein"/>
              <Qualifier value-type="protein-id" value="AAP13443.1"/>
              <Qualifier value-type="db-xref" value="GI:30027622"/>
              <Qualifier value-type="translation" value="MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNIVNVSLVKPTVYVYSRVKNLNSSEGVPDLLV"/>
            </Feature>
          </Feature-table>
        </Feature-tables>
        <Seq-data>
atattaggtttttacctacccaggaaaagccaaccaacctcgatctcttgtagatctgttctcta
aacgaactttaaaatctgtgtagctgtcgctcggctgcatgcctagtgcacctacgcagtataaa
caataataaattttactgtcgttgacaagaaacgagtaactcgtccctcttctgcagactgctta
cggtttcgtccgtgttgcagtcgatcatcagcatacctaggtttcgtccgggtgtgaccgaaagg
taagatggagagccttgttcttggtgtcaacgagaaaacacacgtccaactcagtttgcctgtcc
[For brevity, sequence is truncated.]
        </Seq-data>
      </Sequence>
    </Sequences>
  </Definitions>
</Bsml>

If you download the full SARS virus genome record from GenBank, you will see that it includes dozens of features. However, to keep the example more manageable, we have chosen to include just one positional feature in Listing 2.4. As you can see, this feature identifies a single coding sequence region, identifying the SARS virus envelope protein. The coding region spans a specific interval of sequence data and we therefore use the Interval-loc element:

<Interval-loc startpos="26117" endpos="26347"/>

Figure 2.11 Sample screenshot of the Rescentris Genomic Workspace™ application. We have just loaded the SARS example from Listing 2.4. Note that our envelope protein is now included in the main sequence window (it is denoted with a single line between the markers 23,782 and 29,727).

As stated above, each feature can include any number of Qualifier elements. In this case, we use Qualifier elements to denote important attributes. For example, we identify the protein ID, a cross-reference to the protein GI number in GenBank, and the amino acid sequence of the translated region.
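One way to see how the location and the qualifiers fit together is to read the feature from Listing 2.4 with a general-purpose XML parser and check that the numbers agree. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; it is our own illustration, not part of BSML or Genomic Workspace™. It assumes that startpos and endpos are 1-based and inclusive, as in the GenBank record, and that the coding region includes the stop codon. The translation value is written here without the spaces introduced by line wrapping.

```python
import xml.etree.ElementTree as ET

# Feature element copied from Listing 2.4.
FEATURE_XML = """
<Feature id="AY278741.FTS1.FTB2.FTR9" title="envelope protein" class="CDS"
         comment="envelope protein" display-auto="1">
  <Interval-loc startpos="26117" endpos="26347"/>
  <Qualifier value-type="note" value="envelope protein"/>
  <Qualifier value-type="codon-start" value="1"/>
  <Qualifier value-type="product" value="E protein"/>
  <Qualifier value-type="protein-id" value="AAP13443.1"/>
  <Qualifier value-type="db-xref" value="GI:30027622"/>
  <Qualifier value-type="translation" value="MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNIVNVSLVKPTVYVYSRVKNLNSSEGVPDLLV"/>
</Feature>
"""

feature = ET.fromstring(FEATURE_XML)

# Location: a single interval on the raw sequence.
loc = feature.find("Interval-loc")
start = int(loc.get("startpos"))
end = int(loc.get("endpos"))

# Qualifiers: collect the name/value pairs into a dictionary.
qualifiers = {
    q.get("value-type"): q.get("value")
    for q in feature.findall("Qualifier")
}

# Assumption: coordinates are 1-based and inclusive.
span = end - start + 1
codons = span // 3
protein_length = len(qualifiers["translation"])

print("nucleotides:", span)
print("codons:", codons)
print("amino acids in translation:", protein_length)
print("protein id:", qualifiers["protein-id"])
```

Under these assumptions the interval covers 231 nucleotides, or 77 codons, which matches the 76 amino acids of the translation plus one stop codon.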

The Rescentris Genomic Workspace™ application will automatically draw all sequence features for you. For example, Figure 2.11 shows a screenshot of our revised SARS example. As you can see, our single feature is overlaid onto the main sequence widget in the center of the screen. Of course, this is one of the simplest possible feature examples. If you import a fully annotated sequence with multiple features, Genomic Workspace™ will draw all these features for you as well. You can then interactively select specific features and drill down to an increased level of detail. For example, Figure 2.12 shows a screenshot of one of the sample BSML files that comes bundled with the viewer. All features are displayed around the perimeter of the main circular sequence widget. If you select one of the features in the main window, detailed feature information is immediately displayed in the "Details" panel on the left.

Figure 2.12 Sample screenshot of the Rescentris Genomic Workspace™ application. A fully annotated BSML sequence is shown.

You may have noticed that the Feature element contains a display-auto attribute. When set to "1," this provides a hint to the BSML rendering software that you want to automatically display the feature with a separate graphical widget. For example, Genomic Workspace™ uses this information to automatically render and visualize all BSML files.

In BSML 3.1, you can explicitly denote that some features span multiple regions. For example, you can specify all the exons for a protein-coding sequence. To do so, you must specify a join attribute, and set it to "1." Following this, the first Interval-loc specifies the complete range of the sequence, and each subsequent Interval-loc element specifies a specific subrange of data. For example, the following excerpt describes a protein-coding sequence with three exons:

<Feature-table>
  <Feature id="sample-protein" class="CDS" display-auto="1" join="1">
    <Interval-loc startpos="100" endpos="400"/>
    <Interval-loc startpos="120" endpos="150"/>
    <Interval-loc startpos="190" endpos="210"/>
    <Interval-loc startpos="300" endpos="400"/>
  </Feature>
</Feature-table>
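The convention described above (first Interval-loc for the complete range, the rest for the subranges) is easy to misread, so here is a minimal sketch of how a program might interpret it. It uses Python's standard xml.etree.ElementTree module; the helper split_joined_feature is a hypothetical name of our own, and the sketch assumes that positions are 1-based and inclusive.

```python
import xml.etree.ElementTree as ET

# Excerpt from the text: one CDS feature with join="1" and three exons.
JOIN_XML = """
<Feature-table>
  <Feature id="sample-protein" class="CDS" display-auto="1" join="1">
    <Interval-loc startpos="100" endpos="400"/>
    <Interval-loc startpos="120" endpos="150"/>
    <Interval-loc startpos="190" endpos="210"/>
    <Interval-loc startpos="300" endpos="400"/>
  </Feature>
</Feature-table>
"""


def split_joined_feature(feature):
    """Return (full_range, subranges) for a Feature element.

    Hypothetical helper: when join="1", the first interval is treated as
    the complete range and the remaining intervals as subranges.
    Otherwise full_range is None and every interval is returned as-is.
    """
    intervals = [
        (int(loc.get("startpos")), int(loc.get("endpos")))
        for loc in feature.findall("Interval-loc")
    ]
    if feature.get("join") == "1" and len(intervals) > 1:
        return intervals[0], intervals[1:]
    return None, intervals


table = ET.fromstring(JOIN_XML)
feature = table.find("Feature")
full_range, exons = split_joined_feature(feature)

# Every exon should fall inside the complete range.
inside = all(full_range[0] <= s and e <= full_range[1] for s, e in exons)

print("complete range:", full_range)
print("exons:", exons)
print("all exons inside complete range:", inside)
```

Keeping the complete range separate from the subranges means that a viewer can draw the overall extent of the feature even if it chooses to ignore the individual exons.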
