SampleCodes

This document is based on the original 'BioJava in Anger' by Mark Schreiber et al. "BioJava in Anger, A Tutorial and Recipe Book for Those in a Hurry".

Introduction
BioRuby can be both big and intimidating. For those of us who are in a hurry there really is a whole lot there to get your head around. This document is designed to help you develop BioRuby programs that do 99% of common tasks without needing to read and understand 99% of the BioRuby API.

The page was inspired by various programming cookbooks and follows a "How do I...?" type approach. Each How do I is linked to some example code that does what you want and sometimes more. Basically if you find the code you want and copy and paste it into your program you should be up and running quickly. I have endeavoured to over document the code to make it more obvious what I am doing so some of the code might look a bit bloated.

'BioRuby in Anger' is maintained by Toshiaki Katayama and Pjotr Prins. If you have any suggestions, questions or comments contact the BioRuby mailing list.

These demos are tested with BioRuby 0.5.3 and Ruby 1.6.8 (Partly 1.8.1).

Alphabets and Symbols
In BioRuby, Sequence class inherits String so you can treat Sequence object as a String with various powerful methods implemented in Ruby's String class. You can easily generate DNA and/or Amino Acid sequences to edit, extract subsequence, regexp pattern match on it with usual methods for String object. Sequence class also has methods for splicing, translation, calculate stastical values, window search etc.

There are nothing equivarent to BioJava's Alphabet and/or Symbols in BioRuby, however, BioRuby provides lists of nucleic acids, amino acids and codon tables and use it transparently in appropreate methods?as needed.

How can I make an ambiguous Symbol like Y or R?
The IBU defines standard codes for symbols that are ambiguous such as Y to indicate C or T and R to indicate G or C or N to indicate any nucleotide. BioRuby represents these symbols as the same Bio::Sequence::NA object which can be easily converted to Regular expression that matches components of the ambiguous symbols. In turn, Bio::Sequence::NA object can contain symbols matching one or more component symbols that are valid members of the same alphabet as the Bio::Sequence::NA and are therefore capable of being ambiguous.

Generally an ambiguity symbol is converted to a Regexp object by calling the to_re method from the Bio::Sequence::NA that contains the symbol itself. You don't need to make symbol 'Y' by yourself because it is already built in the Bio::NucleicAcid class as a hash named Bio::NucleicAcid::Names.

How do I make a Sequence from a String or make a Sequence Object back into a String?
A lot of the time we see sequence represented as a String of characters eg "atgccgtggcatcgaggcatatagc". It's a convenient method for viewing and succinctly representing a more complex biological polymer. BioRuby makes use of a Ruby's String class to represent these biological polymers as Objects. Unlike BioJava's SymbolList, BioRuby's Bio::Sequence inherits String and provide extra methods for the sequence manipulation. We don't have a container class like a BioJava's Sequence class, to store things like the name of the sequence and any features it might have, you can think of to use other container classes such as a Bio::FastaFormat, Bio::GFF, Bio::Features etc. for now (We have a plan to prepare a general container class for this to be compatible with a Sequence class in other Open Bio* projects).

Bio::Sequence class has same capability as a Ruby's String class, it is simple easy to use. You can represent a DNA sequence within the Bio::Sequence::NA class and a protein sequence within the Bio::Sequence::AA class. You can translate DNA sequence into protein sequence with a single method call and can concatenate them with the same method '+' as a String class's.

String to Bio::Sequence object
Simply pass the sequence string to the constructor.

String to Sequence with comments
Yes, we should prepare a better container class for this. Temporally, you can do this as:

Bio::Sequence to String
You don't need to call any method to convert a Bio::Sequence object to use as a String object because it can behave as a String, although you can call a to_s method to stringify explicitly.

How do I get a subsection of a Sequence?
Given a Sequence object we might only be interested in examining the first 10 bases or we might want to get a region between two points. You might also want to print a subsequence to a file or to STDOUT how could you do this?

BioRuby partly uses a biological coordinate system for identifying bases. You can use Bio::Sequence#subseq method to extract subsequence as the first base is numbered 1 and the last base index is equal to the length of the sequence. Other methods that are inherited from a String class use a normal String indexing which starts at 0 and proceedes to length -1. If you attempt to access a region outside of 1..length with a subseq method nil will be returned. Other methods in a String class will behave as a same.

Iteration
You can iterate on every subsequences easily with BioRuby.

How do I transcribe a DNA Sequence to a RNA Sequence?
In BioRuby, DNA and RNA sequences are stored in the same Bio::Sequence::NA class just using different Alphabets, you can convert from DNA to RNA or RNA to DNA using the rna or dna methods, respectively.

How do I reverse complement a DNA or RNA Sequence?
To reverse complement a DNA sequence simply use the complement method.

Sequences are immutable so how can I change it's name?
Sequences are not immutable in BioRuby - just use the freeze method to make sequence unchangable.

How can I edit a Sequence or SymbolList?
Sometimes you will want to modify the order of Symbols in a sequence. For example you may wish to delete some bases, insert some bases or overwrite some bases in a DNA sequence. BioRuby's Bio::Sequence object can be edited by any methods inherited from Ruby's String class.

How can I make a sequence motif into a regular expression?
In BioRuby, to make a Sequence into a Ruby's regular expression pattern is done by calling to_re method for the Bio::Sequence::NA object. The generated pattern can even be from an ambiguous sequence such as "acgytnwacrs". For the protein sequences, you can do this by just wrapping a Bio::Sequence::AA object with // as usual for the String to make it Regexp (or use Regexp.new). You can then use this pattern to search Strings for the existence of that pattern.

How do I translate a DNA or RNA Sequence or SymbolList to Protein?
All you need is to call a translate method for a Bio::Sequence::NA object. In BioRuby, you don't need to convert DNA to RNA before its translation.

How do I translate a single codon to a single amino acid?
The general translation example shows how to use the translate method of Bio::Sequence::NA object but most of what goes on is hidden behind the convenience method. If you only want to translate a single codon into a single amino acid you get exposed to a bit more of the gory detail but you also get a chance to figure out more of what is going on under the hood.

Here's the other way

How do I use a non standard translation table?
The convenient translate method in Bio::Sequence::NA, used in the general translation example, is not limited to use the "Universal" translation table. The translate method also accepts a translation starting frame and a codon table number as its arguments.

The following translation tables are available:


 * 1 - Standard (Eukaryote)
 * 2 - Vertebrate Mitochondrial
 * 3 - Yeast mitochondorial
 * 4 - Mold, Protozoan, Coelenterate Mitochondrial and Mycoplasma/Spiroplasma
 * 5 - Invertebrate Mitochondrial
 * 6 - Ciliate Macronuclear and Dasycladacean
 * 9 - Echinoderm Mitochondrial
 * 10 - Euplotid Nuclear
 * 11 - Bacteria
 * 12 - Alternative Yeast Nuclear
 * 13 - Ascidian Mitochondrial
 * 14 - Flatworm Mitochondrial
 * 15 - Blepharisma Macronuclear
 * 16 - Chlorophycean Mitochondrial
 * 21 - Trematode Mitochondrial
 * 22 - Scenedesmus obliquus mitochondrial
 * 23 - Thraustochytrium Mitochondrial Code

The following program shows the use of the Euplotid Nuclear translation table (where UGA = Cys).

How do I translate a nucleotide sequence in all six frames?
To be done...

How do I write Sequences in Fasta format?
FASTA format is a fairly standard bioinformatics output that is convenient and easy to read. BioRuby's Sequence class has a to_fasta method for formatting sequence in FASTA format.

Printing any Bio::Sequence sequence object in FASTA format.

How do I read in a Fasta file?
This program opens FASTA format file for reading and iterates on each sequence in the file.

This program automatically detects and reads FASTA format files given as its arguments.

Similar but specify FASTA format explicitly.

How do I read a GenBank, EMBL or SwissProt file?
The Bio::FlatFile class contains methods for reading GenBank, SwissProt and EMBL files. Because any of these files can contain more than one sequence entry Bio::FlatFile can be used with a block to iterate through the individual sequences. One of the attractive features of this model is that the Sequences are only parsed and created as needed so very large collections of sequences can be handled with moderate resources.

Information in the file is stored in the Sequence (for now, this is one of the Bio::GenBank, Bio::EMBL or Bio::SwissProt classes, please wait for our future implementation of the generic Sequence class for this) as Bio::Features or where there is location information as Bio::Locations. Reading GenBank, EMBL or SwissProt

Save the following script as the readflat.rb and execute it with files in GenBank, EMBL or SwissProt format as arguments like:

The file format is auto detected.

How do I extract GenBank, EMBL or Swissprot sequences and write them as Fasta?
To perform this task we are going to extend the general reader from the previous demo and include in it the ability to write sequence data in fasta format.

How do I turn an ABI sequence trace into a BioJava Sequence?
BioRuby doesn't support ABI trace at the moment...

How do I list the Annotations in a Sequence?
To be done...

How do I filter a Sequences based on their species (or another Annotation property)?
To be done...

How do I specify a PointLocation?
To be done...

How do I specify a RangeLocation?
To be done...

How do CircularLocations work?
To be done...

How can I make a Feature?
To be done...

How can I filter Features by type?
To be done...

How can I remove features?
To be done...

How do I set up a BLAST parser?
A frequent task in bioinformatics is the generation of BLAST search results.

Save the following script as a blastparse.rb and execute it with a file in BLAST XML output format (-m 7) like:

This script will print BLAST hits and the scores having E-value smaller than 0.001.

We recommend you to install Ruby's xmlparser and expat library to make parsing much faster.

You can also parse blast output with one of the following two approaches:


 * Example (to use File object):


 * Example (to use String containing file content):

How do I set up a FASTA parser? (or GenBank, Kegg etc.)
Search fasta file tags using a regular expression (regex) and print to stdout

Usage: fastasearch [-q query] filename(s)

Example:

Source:

But in fact the BioRuby parser is a lot more powerful than that and is not limitied to Fasta. You can run

Using the following

How do I extract information from parsed results?
To be done...

How do I calculate the frequency of a Symbol in a Sequence?
To be done...

How can I turn a Count into a Distribution?
Using a function.

Output "atgcatgcaaaa" {"a"=>6, "c"=>2, "g"=>2, "t"=>2} {"a"=>0.5, "c"=>0.166666666666667, "g"=>0.166666666666667, "t"=>0.166666666666667}

Using a class method.

Output "atgcatgcaaaa" {"a"=>6, "c"=>2, "g"=>2, "t"=>2} {"a"=>0.5, "c"=>0.166666666666667, "g"=>0.166666666666667, "t"=>0.166666666666667}

How can I generate a random sequence from a Distribution?
To be done...

How can I find the amount of information or entropy in a Distribution?
To be done...

What is an easy way to tell if two Distributions have equal weights?
To be done...

How can I make an OrderNDistribution over a custom Alphabet?
To be done...

How can I write a Distribution as XML?
To be done...

How do I use a WeightMatrix to find a motif?
To be done...

How do I make a HMMER like profile HMM?
To be done...

How do I set up a custom HMM?
To be done...

How can I visualize Annotations and Features as a tree?
To be done...

How can I display a Sequence in a GUI?
To be done...

How do I display Sequence coordinates?
To be done...

How can I display features?
To be done...

How do I set up BioSQL?
To be done...

How do I read a structure file?
The current implementation only reads PDB format files, there will hopefully be code for parsing mmCIF and PDB XML files in the future. Adding these new formats will probably change the API for reading and writing structure files.

Given the above caveat, the current way to read a PDB file is to slurp the file into a string which is fed to the constructor of the Bio::PDB class:

The new Bio::PDB object now contains all the annotations and coordinate data.

How do I write a structure file?
As with reading, the current implementation can only write in PDB format.

All the coordinate classes in the Bio::PDB heirarchy have .to_s methods which return the object in PDB format. This makes output as simple as:

How do I access annotations?
The annotations from the PDB file are stored in Bio::PDB::Record objects. You can retrieve a specific annotation by calling the '.record' method of a Bio::PDB object with the name of the annotation as the argument:

In fact '.record' returns an array of all Bio::PDB::Records of the given type, so you'll need to call '.first' to get to the actual Bio::PDB::Record. Bio::PDB::Record provides methods to get to the individual data in each record. E.g. The 'HEADER' record provides classification, depDate and idCode methods. You can look at the PDB format to see what fields are available in each record, or just look in the pdb.rb file which has a hash of all the definitions.

So, to get the title of a PDB file you would do it like this:

Some records have their own special methods:

And some methods are included for Bio::DB compatability:

How do I access the coordinate data?
Coordinate data is stroed in a heirarchy of classes with Bio::PDB at the top. A Bio::PDB object contains one or more Bio::PDB::Model objects which contain one or more Bio::PDB::Chain objects which contain one or more Bio::PDB::Residue objects which contain Bio::PDB::Atom objects.

There are two ways to access the coordinate data in a Bio::PDB object: keyed access and iterators.

Keyed access provides the simplest way to get to data if you know which residues or atoms you're interested in. For instance if you wanted to get hold of chain 'A' you can do it like this:

The nil refers to the model, which will be nil in crystal structures but may have a number in NMR structures. Every level in the heirarchy can be accessed this way. E.g. to get hold of the CA atom of residue 100 in chain 'A':

or if you still have your chain object:

The residue identifier is not just the residue number. PDB residues can have insertion codes and even negative numbers.

could be valid.

Iterators are also provided so you can easily go through all objects in the heirarchy. E.g:

Goes through every atom in the structure and outputs it. All the classes in the heirarchy implement the standard Enumerable and Comparable mixins as well.

Each class has a number of accessors to allow access to the data, most is in the Bio::PDB::Atom class:


 * Bio::PDB::Model has a 'model_serial' accessor only.
 * Bio::PDB::Chain has an 'id' accessor for getting and setting the chain id.
 * Bio::PDB::Residue has accessors for resName, resSeq and iCode.
 * Bio::PDB::Atom has accessors for serial, element, alt_loc, x, y, z, occ, and bfac

To find the x coordinate of the CA atom of residue 100 is then:

How do I make a deep copy of a structure object?
You can't! There seem to be problems implementing a deep copy function in BioRuby. This is to be fixed in future versions.

Note to developers: have we tried the time-old hack of Marshal.load(Marshal.dump(obj)) to do the trick?

How do I measure distances and angles?
Methods for measuring distance and dihedral angles are provided in the Utils module. To measure distance, use the Bio::PDB::Utils.distance method:

Bio::PDB::Utils.distance requires two Vectors as its arguments. Bio::PDB::Coordinate objects, which are produced by the .xyz call to the Bio::PDB::Atom object, are Vectors. You can also provide a Vector directly:

And use the .centreOfGravity and .geometricCentre methods, both of which provide a Bio::PDB::Coordinate:

All objects in the Bio::PDB heirarchy implement the centreOfGravity and geometricCentre methods.

Dihedral angles are calculated from four Vectors / Bio::PDB::Coordinates:

How do I find specific atoms, residues, chains or models?
If the iterators and keyed access aren't powerful enough for you, the finder method is also provided. All the objects in the Bio::PDB hierarchy implement finder. It requires the class of object you wish to find as the first argument, and then a block which is evaluated for each object. If the block returns true then that object is added to an array of 'found' things.

For example, to find all the atoms within a box:

Or, to find all the HIS residues, in chain 'A' can be written like this:

or like this:

Since the results are returned in arrays, it is very simple to do set operations, such as finding the overlap or difference between two selections:

Selection 3 now contains all the HIS atoms within the box and selection 1 contains all the HIS atoms outside the box.

How do I work with ligand data?
Because some PDB files reuse residue numbers that they already used in the ATOM records in the HETATM records, we have to add an extra flag when we want to access HETATM data. The extra flag is simply adding 'LIGAND' to the residue id. E.g Given the PDB file

ATOM     1  N   GLY A   2      -5.308  22.647  33.657  1.00 59.93 ATOM     2  CA  GLY A   2      -5.090  23.965  33.005  1.00 53.36 ATOM     3  C   GLY A   2      -6.209  24.197  32.021  1.00 52.56 ATOM     4  O   GLY A   2      -7.000  25.134  32.176  1.00 55.02 .... HETATM 2097 O8  PHT A   2     -18.122  40.603  18.146  1.00 16.14 HETATM 2098 O9  PHT A   2     -17.348  39.940  20.109  1.00 16.06 HETATM 2099 C10 PHT A   2     -15.622  41.753  16.662  1.00 13.34 HETATM 2100 O11 PHT A   2     -16.077  42.457  15.721  1.00 16.27

Asking for is ambiguous, so if you want the ligand data you must use

There should also be an .each_ligand method in the Bio::PDB::Chain class for iterating through the ligands, but this is not implemented in the current cvs version.

How do I work with solvent?
Solvent is defined in the current parser as any HETATM record with HOH as the residue type. This may not be ideal for some files, but deals with 95% of cases which is good enough for me at the moment!

Solvent residues are stored in a separate Bio::PDB::Chain attached to each Bio::PDB::Model. Currently the only methods available are 'addSolvent' which adds a residue to the solvent chain and 'removeSolvent', which sets the whole solvent chain to nil.

An each_solvent method for Bio::PDB::Model has been implemented in my copy but is not in cvs at the moment.

How do I get the sequence of a chain?
There are two types of sequence in a PDB file: the real sequence of whatever protein has been crystalised, and the sequence observable in the structure.

The first type of sequence is available through the SEQRES record (if this is provided in the PDB file). You can access this through the top-level Bio::PDB object:

provides the sequence given in the SEQRES record for chain A. The observed sequence is obtained from the Bio::PDB::Chain object. E.g.

Both methods return a Bio::Sequence::AA object.

Disclaimer
This code is generously donated by people who probably have better things to do. Where possible we test it but errors may have crept in. As such, all code and advice here in has no warranty or guarantee of any sort. You didn't pay for it and if you use it we are not responsible for anything that goes wrong. Be a good programmer and test it yourself before unleashing it on your corporate database.

Copyright
The documentation on this site is the property of the people who contributed it. If you wish to use it in a publication please make a request through the BioRuby mailing list.The original version was based on the 'BioJava in Anger' document by Mark Schreider et al.

The code is open-source. A good definition of open-source can be found here. If you agree with that definition then you can use it.