HOWTO:Trees

This page is under construction.

This HOWTO is translated from BioPerl HOWTO:Trees. The original document is copyright Jason Stajich. It can be copied and distributed under the terms of the Perl Artistic License.

Author
Jason Stajich, Dept. Molecular Genetics and Microbiology, Institute for Genome Sciences and Policy, Duke University. jason-at-bioperl.org

Copyright
This document is copyright Jason Stajich. It can be copied and distributed under the terms of the Perl Artistic License.

Abstract
This HOWTO intends to show how to use the BioRuby Tree objects to manipulate phylogenetic trees. It shows how to read and write trees, query them for information about specific nodes or overall statistics, and create pictures of trees. Advanced topics include discussion of generating random trees and extensions of the basic structure for integration with other modules in BioRuby.

Introduction
Generating and manipulating phylogenetic trees is an important part of modern systematics and molecular evolution research. The construction of trees is the subject of a rich literature and active research. This HOWTO and the modules described within are focused on querying and manipulating trees once they have been created.

The data we intend to capture with these objects concerns the notion of Trees and their Nodes. A Tree is made up of Nodes and the relationships which connect these nodes. The basic representation of parent and child nodes is intended to represent the directionality of evolution: some ancestral species gave rise, through speciation events, to a number of child species. The data need not form a strictly bifurcating tree (a binary tree, to the computer science types); a parent node can give rise to one or more child nodes.

In practice there are just a few main objects, or modules, you need to know about. The Tree object, Bio::Tree, is the main entry point to the data represented by a tree. A Node is represented generically by Bio::Tree::Node; subclasses of this object could handle particular cases where a richer object is needed. The connections between Nodes are described by Bio::Tree::Edge. Unlike in BioPerl, Nodes hold no pointers or references to other Nodes. An Edge object holds just two pointers, one to each of the Nodes it connects. The two Nodes are treated equally, and no parent-child relationship is recorded in the Edge object. Also unlike in BioPerl, data specific to nodes, such as bootstrap values and labels, are stored in the Node objects, and data specific to edges, such as distances, are stored in the Edge objects. The Bio::Tree object is just a container for some summary information about the tree, the nodes and edges in the tree, and a description of the tree's root node.
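The design described above can be illustrated with a small plain-Ruby sketch. The classes SketchNode, SketchEdge and SketchTree and the helper build_sketch_tree are hypothetical stand-ins written for this example; they are not the BioRuby classes and do not reproduce their API. The point is only that parent and child relationships are not stored anywhere: they are derived from the undirected edges together with the tree's root.

```ruby
# Hypothetical stand-ins for illustration; these are NOT the BioRuby classes.
class SketchNode
  attr_accessor :name, :bootstrap      # node-specific data lives on the node

  def initialize(name, bootstrap = nil)
    @name = name
    @bootstrap = bootstrap
  end
end

# An edge links two nodes symmetrically and carries the branch length.
SketchEdge = Struct.new(:node_a, :node_b, :distance)

# The tree is a container: nodes, edges, and a designated root.
class SketchTree
  attr_reader :nodes, :edges
  attr_accessor :root

  def initialize(root)
    @root = root
    @nodes = [root]
    @edges = []
  end

  def add_edge(node_a, node_b, distance)
    [node_a, node_b].each { |n| @nodes << n unless @nodes.include?(n) }
    @edges << SketchEdge.new(node_a, node_b, distance)
  end

  # Nodes directly connected to +node+, ignoring direction.
  def adjacent(node)
    @edges.select { |e| e.node_a.equal?(node) || e.node_b.equal?(node) }
          .map { |e| e.node_a.equal?(node) ? e.node_b : e.node_a }
  end

  # Parent of every node, found by walking outward from the root.
  def parents
    found = { @root => nil }
    queue = [@root]
    until queue.empty?
      current = queue.shift
      adjacent(current).each do |neighbor|
        next if found.key?(neighbor)
        found[neighbor] = current
        queue << neighbor
      end
    end
    found
  end

  def children(node)
    parents.select { |_child, parent| parent.equal?(node) }.keys
  end
end

# Builds a small example tree; names and values are made up.
def build_sketch_tree
  root  = SketchNode.new("root")
  inner = SketchNode.new("inner", 95)
  tree  = SketchTree.new(root)
  tree.add_edge(root, inner, 0.1)
  tree.add_edge(inner, SketchNode.new("A"), 0.2)
  tree.add_edge(inner, SketchNode.new("B"), 0.3)
  tree.add_edge(root, SketchNode.new("C"), 0.5)
  tree
end

tree = build_sketch_tree
puts tree.children(tree.root).map(&:name).inspect
```

Because direction is computed on demand, the same set of edges can be viewed from any root without rewriting the nodes themselves.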

Reading and Writing Trees
Trees are used to represent the ancestry of a collection of taxa, sequences, or populations.

Using Bio::FlatFile one can read trees from files or datastreams and create Bio::Tree objects. This is analogous to how we read sequences from sequence files with Bio::FlatFile to create BioRuby sequence objects which can be queried and manipulated. Similarly, we can write Bio::Tree objects out to string representations such as the Newick or New Hampshire tree formats, which can then be printed to a file or a datastream, stored in a database, etc.

The main module for reading and writing trees is the Bio::FlatFile factory class, which calls several driver classes. These drivers include Bio::Newick for New Hampshire or Newick tree format, and for the New Hampshire extended tree format from Sean Eddy and Christian Zmasek as part of their RIO, Forester and ATV system (RIO, ATV, SDI). The parser Bio::Nexus supports parsing tree data from PAUP's Nexus format. However, this driver currently supports only parsing, not writing, of Nexus tree files. There are also modules for writing out the lintree and Pagel tree formats. The phyloXML tree format will be supported in the future.

By default, Bio::Newick automatically determines whether the internal node ids encode bootstrap values rather than identifiers. To override the default behavior, pass an appropriate :bootstrap_style option to Bio::Newick.new. This is only valid for the Nexus and Newick tree formats.
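The idea of treating internal node ids as bootstrap values can be sketched in plain Ruby. The LabelledNode record and the helpers move_names_to_bootstrap and example_labelled_nodes are hypothetical and written only for this illustration; the actual detection rules used by Bio::Newick may differ. The sketch assumes that an internal node whose name is purely numeric carries a bootstrap value.

```ruby
# Hypothetical node record; not the BioRuby node class.
LabelledNode = Struct.new(:name, :bootstrap, :leaf)

# For internal nodes whose name looks like a number, treat it as a bootstrap value.
def move_names_to_bootstrap(nodes)
  nodes.each do |node|
    next if node.leaf || node.name.nil?
    next unless node.name.match?(/\A\d+(\.\d+)?\z/)

    node.bootstrap = node.name.include?(".") ? node.name.to_f : node.name.to_i
    node.name = nil
  end
end

# Example data: one leaf, one internal node with a numeric id, one with a real name.
def example_labelled_nodes
  [LabelledNode.new("A", nil, true),
   LabelledNode.new("95", nil, false),
   LabelledNode.new("cladeX", nil, false)]
end

move_names_to_bootstrap(example_labelled_nodes).each do |n|
  puts "name=#{n.name.inspect} bootstrap=#{n.bootstrap.inspect}"
end
```

Leaves and non-numeric internal names are left untouched, so a named clade keeps its identifier.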

Example Code
Here is some code which will read in a Tree from a file called "tree.tre" and produce a Bio::Tree object which is stored in the variable tree.

Like most modules which do input/output you can also specify an IO object instead of the filename.

Once you have a Tree object you can do a number of things with it. These are all methods required in Bio::Tree.

For example, try these two different example scripts, which read in tree data and print out the node ids and bootstrap values. The first example assumes that the internal node ids are ids and not bootstrap values.

The second relies on the default behavior, in which the Bio::Newick parser automatically moves the bootstrap values over from the internal node ids.

One can also explicitly invoke this by calling the move_id_to_bootstrap method on a tree.

Bio::Tree methods 1
Request the taxa (leaves of the tree).

Get the root node.

Get the total length of the tree (sum of all the branch lengths), which is only useful if the edges (connections between nodes) actually have the branch length stored, of course.
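What "leaves" and "total length" mean can be shown with a minimal plain-Ruby sketch. It does not use the Bio::Tree API. The edge list TREE_EDGES and the helpers leaves and total_length are hypothetical, and edges are assumed to be undirected triples of two node names and a branch length.

```ruby
# Undirected edges as [node, node, branch_length]; names and lengths are made up.
TREE_EDGES = [
  [:root, :inner, 0.1],
  [:inner, :A, 0.2],
  [:inner, :B, 0.3],
  [:root, :C, 0.5]
].freeze

# Leaves are the non-root nodes that touch exactly one edge.
def leaves(edges, root)
  degree = Hash.new(0)
  edges.each do |a, b, _len|
    degree[a] += 1
    degree[b] += 1
  end
  degree.select { |node, count| count == 1 && node != root }.keys
end

# Total tree length: the sum of all stored branch lengths (missing ones count as 0).
def total_length(edges)
  edges.sum { |_a, _b, len| len.to_f }
end

puts leaves(TREE_EDGES, :root).inspect
puts total_length(TREE_EDGES)
```

An edge without a stored branch length contributes nothing to the sum, which is why the total is only meaningful when lengths are actually present.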

Bio::Tree methods 2
Bio::Tree has many functions which are useful for manipulating a Tree.

Find a particular node, either by name or by some other field that is stored in a Node.

If you would like to do more sophisticated searches, like "find all the nodes with bootstrap values better than 70", you can easily implement this yourself.
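Such searches amount to filtering the collection of nodes. The following is a minimal plain-Ruby sketch that assumes each node exposes a name and a bootstrap value; the SupportNode record, the sample data and both helper methods are hypothetical and not part of BioRuby.

```ruby
# Hypothetical node record with a name and an optional bootstrap value.
SupportNode = Struct.new(:name, :bootstrap)

SUPPORT_NODES = [
  SupportNode.new("A", nil),
  SupportNode.new(nil, 95),
  SupportNode.new(nil, 62),
  SupportNode.new(nil, 71)
].freeze

# Find the first node with the given name, or nil if there is none.
def node_named(nodes, name)
  nodes.find { |node| node.name == name }
end

# Keep only nodes whose bootstrap value exists and exceeds the threshold.
def nodes_with_bootstrap_above(nodes, threshold)
  nodes.select { |node| node.bootstrap && node.bootstrap > threshold }
end

puts node_named(SUPPORT_NODES, "A").name
puts nodes_with_bootstrap_above(SUPPORT_NODES, 70).map(&:bootstrap).inspect
```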

Remove a Node from the Tree and update the graph (children/ancestor links) where the Node is an intervening one.
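One way to picture removing an intervening node is as an operation on an edge list. This is a plain-Ruby sketch, not the BioRuby implementation. The helper splice_out and the data SPLICE_EDGES are hypothetical. Joining the two neighbors with an edge whose length is the sum of the two removed edges is an assumption made for this example.

```ruby
# Undirected edges as [node, node, branch_length]; :mid sits between :root and :A.
SPLICE_EDGES = [
  [:root, :mid, 0.1],
  [:mid, :A, 0.2],
  [:root, :B, 0.4]
].freeze

# Remove +node+ and its edges. If it had exactly two neighbors, reconnect them
# directly, using the sum of the two removed branch lengths (an assumption).
def splice_out(edges, node)
  touching, rest = edges.partition { |a, b, _len| a == node || b == node }
  return rest unless touching.size == 2

  neighbors = touching.map { |a, b, _len| a == node ? b : a }
  length = touching.sum { |_a, _b, len| len.to_f }
  rest + [[neighbors[0], neighbors[1], length]]
end

puts splice_out(SPLICE_EDGES, :mid).inspect
```

When the node is not an intervening one (for example a leaf), the sketch simply drops the node together with its edges.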

Get the lowest common ancestor for a set of Nodes. This method is used to find an internal Node of the Tree which can be traced, through its children, to the requested set of Nodes. It is used in the calculations of monophyly and paraphyly and in determining the distance between two nodes.
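The lowest common ancestor can be found by intersecting the paths from each node up to the root. The plain-Ruby sketch below assumes the parent of every node is already known from the root; the table LCA_PARENT and the helpers path_to_root and lowest_common_ancestor are hypothetical and do not reproduce the Bio::Tree API.

```ruby
# Parent of each node as seen from the root; the tree shape is made up.
LCA_PARENT = { A: :inner, B: :inner, inner: :root, C: :root, root: nil }.freeze

# The node itself followed by its ancestors, ending at the root.
def path_to_root(node, parent)
  path = []
  while node
    path << node
    node = parent[node]
  end
  path
end

# The first node shared by every path is the deepest common ancestor,
# because Array#& keeps the order of the first path.
def lowest_common_ancestor(nodes, parent)
  nodes.map { |node| path_to_root(node, parent) }.reduce(:&).first
end

puts lowest_common_ancestor([:A, :B], LCA_PARENT)
puts lowest_common_ancestor([:A, :B, :C], LCA_PARENT)
```

The same function handles two nodes or a larger set, since the intersection is folded over all paths.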

The above works for 2 or more nodes. For just 2 nodes,

Get the distance between two nodes by adding up the branch lengths of all the connecting edges between two nodes.
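The calculation can be sketched in plain Ruby: climb from each node toward the root while accumulating branch lengths, then add the two partial sums at the first node both climbs reach. The tables DIST_PARENT and DIST_LENGTH (branch length from each node to its parent) and the helpers climb and node_distance are hypothetical names for this example, not BioRuby methods.

```ruby
# Parent of each node and the branch length leading to that parent (made-up values).
DIST_PARENT = { A: :inner, B: :inner, inner: :root, C: :root, root: nil }.freeze
DIST_LENGTH = { A: 0.2, B: 0.3, inner: 0.1, C: 0.5 }.freeze

# Map every ancestor of +node+ (and the node itself) to the distance travelled to reach it.
def climb(node, parent, length)
  travelled = { node => 0.0 }
  total = 0.0
  while parent[node]
    total += length.fetch(node)
    node = parent[node]
    travelled[node] = total
  end
  travelled
end

# Distance between two nodes: the sum of both climbs up to their first shared ancestor.
def node_distance(first, second, parent, length)
  from_first = climb(first, parent, length)
  from_second = climb(second, parent, length)
  meeting = from_first.keys.find { |node| from_second.key?(node) }
  from_first[meeting] + from_second[meeting]
end

puts node_distance(:A, :B, DIST_PARENT, DIST_LENGTH)
puts node_distance(:A, :C, DIST_PARENT, DIST_LENGTH)
```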

Perform a test of monophyly for a set of nodes and a given outgroup node. This means the common ancestor for the members of the internal_nodes group is more recent than the common ancestor that any of them share with the outgroup node.

(coming soon)

Perform a test of paraphyly for a set of nodes and a given outgroup node. This means that a common ancestor 'A' for the members of the ingroup is more recent than a common ancestor 'B' that they share with the outgroup node and that there are no other nodes in the tree which have 'A' as a common ancestor before 'B'.

(coming soon)

Re-root a tree, specifying a different node as the root (and a different node as the outgroup).
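When edges carry no direction, re-rooting changes only which node is treated as the starting point; the parent of each node is then recomputed from the new root. The plain-Ruby sketch below illustrates this with a hypothetical edge list REROOT_EDGES and a hypothetical helper parents_from; it is not the BioRuby implementation.

```ruby
# Undirected edges as [node, node, branch_length]; names and lengths are made up.
REROOT_EDGES = [
  [:root, :inner, 0.1],
  [:inner, :A, 0.2],
  [:inner, :B, 0.3],
  [:root, :C, 0.5]
].freeze

# Map each node to its parent as seen from +root+, walking outward breadth-first.
def parents_from(root, edges)
  parents = { root => nil }
  queue = [root]
  until queue.empty?
    current = queue.shift
    edges.each do |a, b, _len|
      neighbor = if a == current then b elsif b == current then a end
      next if neighbor.nil? || parents.key?(neighbor)

      parents[neighbor] = current
      queue << neighbor
    end
  end
  parents
end

puts parents_from(:root, REROOT_EDGES)[:inner]  # parent of :inner with the original root
puts parents_from(:A, REROOT_EDGES)[:inner]     # parent of :inner after re-rooting at :A
```

The edge list itself is untouched by the change of root; only the derived parent relationships differ.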

Operations on Nodes
(below is still under translation)