Beginning Python for Bioinformatics
Pages: 1, 2, 3, 4, 5
Python Classes
To create your own custom objects, you must define a sort of template, or
cookie cutter, called a class. You do so in Python using the class statement,
followed by the name of the class and a colon. Following this, the body of the
class definition contains the properties and methods that will be available for
all object instances that are based on this class.
Let's take all the functions that we've created so far and recast them as methods of a DNA class. Then we'll see how to create DNA objects based on our DNA class. While we could do all this from the Python shell, instead we will place this code into a bio.py file and show how we can use this file from the Python shell. The contents of our bio.py file, which Python calls a module, look like this.
class DNA:
"""Class representing DNA as a string sequence."""
basecomplement = {'A': 'T', 'C': 'G', 'T': 'A', 'G': 'C'}
def __init__(self, s):
"""Create DNA instance initialized to string s."""
self.seq = s
def transcribe(self):
"""Return as rna string."""
return self.seq.replace('T', 'U')
def reverse(self):
"""Return dna string in reverse order."""
letters = list(self.seq)
letters.reverse()
return ''.join(letters)
def complement(self):
"""Return the complementary dna string."""
letters = list(self.seq)
letters = [self.basecomplement[base] for base in letters]
return ''.join(letters)
def reversecomplement(self):
"""Return the reverse complement of the dna string."""
letters = list(self.seq)
letters.reverse()
letters = [self.basecomplement[base] for base in letters]
return ''.join(letters)
def gc(self):
"""Return the percentage of dna composed of G+C."""
s = self.seq
gc = s.count('G') + s.count('C')
return gc * 100.0 / len(s)
def codons(self):
"""Return list of codons for the dna string."""
s = self.seq
end = len(s) - (len(s) % 3) - 1
codons = [s[i:i+3] for i in range(0, end, 3)]
return codons
Much of this should look familiar based on our existing functions. Class definitions do add a few new elements that we need to cover. Let's look at how to use this new class before exploring the extra details.
We create object instances by calling the class, much like we would call a function. The first thing we need to do is make the Python shell aware of this class definition. We do that by importing the DNA class definition from our bio.py module. Then we create an instance of the DNA class, passing in the initial string value. From that point on the object keeps track of its own sequence value, and we simply call the methods that are defined for that object.
>>> from bio import DNA
>>> dna1 = DNA('CGACAAGGATTAGTAGTTTAC')
>>> dna1.transcribe()
'CGACAAGGAUUAGUAGUUUAC'
>>> dna1.reverse()
'CATTTGATGATTAGGAACAGC'
>>> dna1.complement()
'GCTGTTCCTAATCATCAAATG'
>>> dna1.reversecomplement()
'GTAAACTACTAATCCTTGTCG'
>>> dna1.gc()
38.095238095238095
>>> dna1.codons()
['CGA', 'CAA', 'GGA', 'TTA', 'GTA', 'GTT', 'TAC']
Since a class acts as a kind of template that's used to create multiple
object instances, we need the ability, inside a class method, to refer to the
specific object instance on which the method is called. To accommodate this
need, Python automatically passes the object instance as the first argument to
each method. The convention in the Python community is to name that first
argument "self." That's why you see "self" as the first argument in all the
method definitions of our DNA class.
The other thing to note is that the __init__() method. Python calls this
specially named method when creating instances of the class. In our example,
DNA.__init__ expects to receive a string argument, which we then store as a
property of the object instance, self.seq.
We made one other change when we moved our functions into class methods. We
moved the basecomplement dictionary definition out of the complement() method
and into the class definition. As part of the class definition, the dictionary
is only created once, rather than each time the method is called. The
dictionary is shared by all instances of the class, and it can be used by more
than one method. This is in contrast to the seq property, for which each object
instance will have its own unique value.
As you can see, classes provide a effective way to group related data and functionality. Let's finish our shell session by creating a few more DNA instances.
>>> dna2 = DNA('ACGGGAGGACGGGAAAATTACTAGCACCCGCATAGACTT')
>>> dna2.codons()
['ACG', 'GGA', 'GGA', 'CGG', 'GAA', 'AAT', 'TAC', 'TAG',
'CAC', 'CCG', 'CAT', 'AGA', 'CTT']
>>> dna3 = DNA(dna1.seq + dna2.seq)
>>> dna3.reversecomplement()
'AAGTCTATGCGGGTGCTAGTAATTTTCCCGTCCTCCCGTGTAAACTACTAATCCTTGTCG'
>>> dna4 = DNA(dna3.reversecomplement())
>>> dna4.codons()
['AAG', 'TCT', 'ATG', 'CGG', 'GTG', 'CTA', 'GTA', 'ATT',
'TTC', 'CCG', 'TCC', 'TCC', 'CGT', 'GTA', 'AAC', 'TAC',
'TAA', 'TCC', 'TTG', 'TCG']
Even with this rudimentary class definition, manipulated from the Python shell, we can start to see Python's potential for analyzing biological data in a clear, coherent fashion, with a minimum of syntactic overhead.
Conclusion
Python is a popular, open source programming language with much to offer the bioinformatics community. At the same time, Python came late to the bioinformatics party and may never rise to level of popularity of Perl. Choice is always a good thing, though, and Python offers a viable, reliable option for biologists and professional programmers alike. We hope this article gives you a reason to take a closer look at Python.
Additional Resources
If you like what you've seen of Python, here are some additional resources to explore.
Patrick O'Brien is an independent software developer and trainer, specializing in the Python programming language. He is the creator of PyCrust, a developer on the PythonCard project, and leader of the PyPerSyst project. He may be reached at pobrien@orbtech.com.
Return to the Python DevCenter.
You must be logged in to the O'Reilly Network to post a talkback.
Showing messages 1 through 7 of 7.
-
Hats off for a great introduction!
2003-01-16 20:25:44 anonymous2 [Reply | View]
I've been looking around to various tutorials of python (including the very good job of the French Pasteur institute), but this one beats them all. The discussion is the point, well-organized, accurate, perfect for someone who knows programming, is vaguely familiar with bioinformatics and wants to get productive quickly. In a couple of hours I installed pycrust on my MacOSX (including the supporting machopython s/w), and run through the tutorial. Highly recommended.
Peter Takis
DB Connectivity
dbconn2000@yahoo.com
-
Nice introduction - but here's a link to the real thing!
2002-10-31 14:00:47 anonymous2 [Reply | View]
I recently came across a link to the most comprehensive resource on Python and Bioinformatics
I have ever seen:
http://www.pasteur.fr/recherche/unites/sis/formation/python/index.html
I have no connection with the authors or the
Pasteur Institute.
-
Re: Re: Gene expression data analysis with Python
2002-10-25 01:51:58 anonymous2 [Reply | View]
> > This then gives a very flexible environment for
> > data analysis. As far as I know, Perl is not
> > very suitable for numerics.
> Not quite true. Check
> http://pdl.perl.org
Interesting ... I wasn't aware of that. Thanks for the pointer.
--Michiel.
-
Re: Gene expression data analysis with Python
2002-10-23 15:01:56 anonymous2 [Reply | View]
> This then gives a very flexible environment for
> data analysis. As far as I know, Perl is not
> very suitable for numerics.
Not quite true. Check
http://pdl.perl.org
The bioperl people have not used it much though.
Anyway, not intending to take anything away from the nice Python story, the best IMHO would be easy interoperability between bioperl and biopython (dataformats etc).
Christian Soeller
--
Dept. of Physiology
University of Auckland Auckland, New Zealand
-
Clear introduction to python
2002-10-22 08:51:24 anonymous2 [Reply | View]
This is a very straightforward and hence very useful intro to manipulating bio data with python. It will help anyone with moderate programming skills understand what they can achieve with very little learning. The only thing lacking was a demonstration of how easy it is to read in data from a file - which you definitely will need to get at the raw data.
-
Gene expression data analysis with Python
2002-10-18 22:46:28 anonymous2 [Reply | View]
This article may make you wonder why to use Python at all -- as it says in the conclusion, Python may never rise to the level of popularity of Perl. I would like to point out that Python particularly shines in the area of numerical analysis. Python is heavily used for numerical work in physics and engineering, and since recently also in bioinformatics, particularly for gene expression data analysis. The Numerical Python package (numpy) has become the de facto standard for numerical analysis with Python, and provides many mathematical functions that would have to be written from scratch otherwise. In addition, as mentioned in the article, the tight integration between Python and C makes it possible to implement numerically intensive routines in C, and call them from Python. This then gives a very flexible environment for data analysis. As far as I know, Perl is not very suitable for numerics.
I would further like to point out that using PyCrust might make things more difficult for beginning users, especially because of the installation process. Instead, you can use IDLE, the standard Python shell which is part of the generic Python distribution. It doesn't require any additional installation steps, and should run on any platform. PyCrust may provide some features that IDLE doesn't provide. I have used IDLE for several years though, and have been quite happy with it.
--
Michiel de Hoon
University of Tokyo, Human Genome Center



One note (also from experience); many DNA strings contain lowercase characters, eg to distinguish G from C or to denote handedited bases. The example class will fail miserably when it encounters such a character whem making a complement.
Either convert the input to all upper or lowercase or change the functions so that it can handle both variants.
If the program outputs DNA strings, which are intended to be used further, many users prefer a system that keeps the case.