Bioinformatics101
Analyzing a Sequence Ontology (SO) database
By SAJEEWA PEMASINGHE
(NOTE: This is not the first tutorial. To learn about the computer setup needed for these analyses, read this.)
The terms used to describe biological sequences (e.g. DNA sequences, amino acid sequences), together with their definitions and the relationships between those terms, are collectively called the Sequence Ontology (SO).
http://www.sequenceontology.org hosts the SO, a collaborative project for defining the sequence features used in biological sequence annotation.
The SO data used by this website can be downloaded as follows.
URL=https://raw.githubusercontent.com/The-Sequence-Ontology/SO-Ontologies/master/so-simple.obo
wget $URL
Unlike typical database files, which are tabular, the obo file downloaded above does not have a column-based structure.
The terms, their definitions etc. are stored as a series of “multi-line records”, each separated from the next by a blank line.
An example of one such multi-line record is given below.
[Term]
id: SO:0000436
name: ARS
def: "A sequence that can autonomously replicate, as a plasmid, when transformed into a bacterial host." [SO:ma]
subset: SOFA
synonym: "autonomously replicating sequence" EXACT []
is_a: SO:0000296 ! origin_of_replication
As shown above, the record contains a term id, a term name, a term definition and an “is_a” relationship to its direct parent term, along with subset and synonym lines.
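Because every line of a record has the form "tag: value", individual values can be pulled out with standard text tools. The following is a minimal sketch, assuming the ARS record shown above has been saved to a temporary file; the variable names (record, term_id, term_name, parent_id) are made up for illustration.

```shell
# Toy copy of the ARS record shown above, so the example runs without so-simple.obo.
record=$(mktemp)
cat > "$record" <<'EOF'
[Term]
id: SO:0000436
name: ARS
def: "A sequence that can autonomously replicate, as a plasmid, when transformed into a bacterial host." [SO:ma]
subset: SOFA
synonym: "autonomously replicating sequence" EXACT []
is_a: SO:0000296 ! origin_of_replication
EOF

# Each line is "tag: value"; deleting the tag leaves only the value.
# sed -n ... p prints only the lines on which the substitution succeeded.
term_id=$(sed -n 's/^id: //p' "$record")
term_name=$(sed -n 's/^name: //p' "$record")
# For is_a, keep only the parent id (the text up to the first space).
parent_id=$(sed -n 's/^is_a: \([^ ]*\).*/\1/p' "$record")

echo "$term_name ($term_id) is_a $parent_id"
rm -f "$record"
```

For this record the script prints ARS (SO:0000436) is_a SO:0000296.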
The Linux command to extract the ARS record from so-simple.obo is
cat so-simple.obo | grep "name: ARS$" -B 2 -A 5
Description of the command
The cat (stands for ‘concatenate’) command followed by a filename writes the contents of that file to standard output, which is normally the screen. In this case, however, the ‘pipe’ character | passes the contents of the so-simple.obo file as input to the grep (stands for ‘global regular expression print’) command. The grep command outputs the lines of its input that match a given pattern. Here the pattern is name: ARS$. The $ sign anchors the pattern to the end of the line, so only lines that end with name: ARS match; a line with a longer name that merely begins with ARS does not. The -B 2 -A 5 options tell grep to output not only the matching line but also the two lines before it (-B 2) and the five lines after it (-A 5). In this case the fifth line after the matching line is the blank line that ends the record.
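The effect of the $ anchor is easiest to see on a small example. The sketch below uses a two-line toy file rather than so-simple.obo; the second name is there only for illustration and is not claimed to appear in the real file. It also uses grep's -c option, which prints the number of matching lines instead of the lines themselves.

```shell
# Toy input: two name lines that both start with "name: ARS".
sample=$(mktemp)
cat > "$sample" <<'EOF'
name: ARS
name: ARS_consensus_sequence
EOF

# Without the anchor, both lines match because both contain "name: ARS".
unanchored=$(grep -c 'name: ARS' "$sample")
# With the anchor, only the line that ends right after "ARS" matches.
anchored=$(grep -c 'name: ARS$' "$sample")

echo "unanchored: $unanchored, anchored: $anchored"
rm -f "$sample"
```

On this toy file the unanchored pattern matches 2 lines and the anchored pattern matches 1.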
The Sequence Ontology is all about the definitions of terms related to biological sequences and the relationships between those terms.
If we want to look up the definition of the term ‘gene’, we can use the following command. The gene record is longer than the ARS record, so this time we ask for seven lines after the match (-A 7).
cat so-simple.obo | grep "name: gene$" -B 2 -A 7
This will give us the following output.
[Term]
id: SO:0000704
name: gene
def: "A region (or regions) that includes all of the sequence elements necessary to encode a functional transcript. A gene may include regulatory regions, transcribed regions and/or other functional sequence regions." [SO:immuno_workshop]
comment: This term is mapped to MGED. Do not obsolete without consulting MGED ontology. A gene may be considered as a unit of inheritance.
subset: SOFA
synonym: "INSDC_feature:gene" EXACT []
xref: http://en.wikipedia.org/wiki/Gene "wiki"
is_a: SO:0001411 ! biological_region
relationship: member_of SO:0005855 ! gene_group
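The two lookups above needed different -A values (5 for ARS, 7 for gene) because records differ in length. One way around this, sketched below, is to let awk read the file one blank-line-separated record at a time. This assumes records are separated by empty lines, as in the excerpts above; the function name print_term and the shortened toy records are made up for illustration.

```shell
# Shortened toy records, separated by a blank line as in the obo file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
[Term]
id: SO:0000436
name: ARS
is_a: SO:0000296 ! origin_of_replication

[Term]
id: SO:0000704
name: gene
is_a: SO:0001411 ! biological_region
relationship: member_of SO:0005855 ! gene_group
EOF

# print_term NAME FILE: print the whole record whose name line is exactly "name: NAME".
# RS='' makes awk treat each blank-line-separated block as one record;
# -F'\n' makes every line of that block a separate field.
print_term() {
    awk -v RS='' -F'\n' -v want="name: $1" '
        { for (i = 1; i <= NF; i++) if ($i == want) { print; next } }
    ' "$2"
}

gene_record=$(print_term gene "$sample")
echo "$gene_record"
rm -f "$sample"
```

Calling print_term gene so-simple.obo on the real file should print the full gene record without having to count its lines first.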
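Since each term description starts with the line “[Term]”, counting the lines that contain “[Term]” gives the number of terms in the database. We can implement this idea with the following command.
cat so-simple.obo | grep 'Term' | wc -l
At the time of writing this gives the output 2454. The number is likely to change as the ontology is updated. Note that the pattern Term matches any line containing that text, not only the “[Term]” header lines; the stricter pattern '^\[Term\]' counts only the lines that begin with [Term].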
Description of the command
The wc command stands for ‘word count’. The lines that match the pattern Term are passed as input to wc. By default wc prints the number of lines, words and bytes in its input. When we give it the -l option (as in wc -l), it prints the number of lines only.
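The difference between a loose and a strict pattern can be checked on a small example. The sketch below uses a toy file with made-up ids and names, not the real so-simple.obo, and compares the pipeline above with a stricter count that uses grep -c (which makes grep count the matching lines itself).

```shell
# Toy file: two records, one of which has a comment containing the word "Term".
sample=$(mktemp)
cat > "$sample" <<'EOF'
[Term]
id: SO:9999991
name: toy_term_a
comment: Term names in this file are made up.

[Term]
id: SO:9999992
name: toy_term_b
EOF

# Loose pattern: counts every line containing "Term", including the comment line.
# tr -d ' ' removes the padding that some versions of wc add around the number.
loose=$(grep 'Term' "$sample" | wc -l | tr -d ' ')
# Strict pattern: counts only lines that are exactly "[Term]".
strict=$(grep -c '^\[Term\]$' "$sample")

echo "loose: $loose, strict: $strict"
rm -f "$sample"
```

On this toy file the loose count is 3 and the strict count is 2, which is the actual number of records.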