Python for bioinformatics: Getting started with sequence analysis in Python
A Biopython tutorial about DNA, RNA and other sequence analysis

In this post, I am going to discuss how Python is being used in the field of bioinformatics and how you can use it to analyze sequences of DNA, RNA, and proteins. Yeah, Python is being used by biologists as well.

Before I get into coding, I’d like to give a brief background of bioinformatics and related things.

What is bioinformatics?

According to Wikipedia:

Bioinformatics  is an interdisciplinary field that develops methods and software tools for understanding biological data. As an interdisciplinary field of science, bioinformatics combines biology, computer science, information engineering, mathematics and statistics to analyze and interpret biological data. Bioinformatics has been used for in silico analyses of biological queries using mathematical and statistical techniques.

Bioinformatics is the field which is a combination of two major fields: Biological data (sequences and structures of proteins, DNA, RNAs, and others ) and Informatics (computer science, statistics, maths, and engineering). This is why it is called an interdisciplinary field. The common use of bioinformatics includes the identification of genes and SNPs.

The goal of bioinformatics are:

  • Sequence Alignment.
  • Gene finding.
  • Drug discovery and drug designing.
  • Protein structure alignment and prediction. etc.

As you have figured out, bioinformatics is a huge field that contains different areas and relevant operations. In this introductory post, we are discussing in brief about sequence analysis.

What is Sequence Analysis?

According to Wikipedia:

In bioinformatics, sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of a wide range of analytical methods to understand its features, function, structure, or evolution.

Sequencing is the process of finding the primary structure whether it is DNA, RNA.

DNA Sequencing is all about determining the nucleic acid sequence. It helps to find out the order of the four bases: adenine (A), guanine (G), cytosine (C) and thymine (T), in a strand of DNA.

Sequencing is done with the help of sequence machines. Once sequenced, the data is then available for further processing. A textual example of a DNA sequence looks like below:

CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
CCGCCTCGGGAGCGTCCATGGCGGGTTTGAACCTCTAGCCCGGCGCAGTTTGGGCGCCAAGCCATATGAA
AGCATCACCGGCGAATGGCATTGTCTTCCCCAAAACCCGGAGCGGCGGCGTGCTGTCGCGTGCCCAATGA

DNA sequencing is the method of determining the order of nucleotide in a DNA.

RNA sequencing is the method to find the quantity of RNA in a biological sample.

Protein sequencing is the method of determining the amino acid sequence of all or part of a protein or peptide.

OK, enough theory, it’s time to get into code.

Biopython Development Setup

We are going to use Biopython, the library for dealing with biological computation and analysis of sequences. To install it you can either use pip or conda , in case you are using Conda based distribution.

pip install biopython

You can check the version as well:

>>> import Bio 
>>> Bio.__version__ '1.74' 
>>>

 

1.74 as of writing this post.

Reading Sequence in FASTA format

OK, so we are going to read a DNA sequence that is available in FASTA format. FASTA is a text-based format to represent different sequences which are represented in single-letter codes. A typical FASTA format looks like below:

;LCBO - Prolactin precursor - Bovine
; a sample sequence in FASTA format
>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
CCGCCTCGGGAGCGTCCATGGCGGGTTTGAACCTCTAGCCCGGCGCAGTTTGGGCGCCAAGCCATATGAA
AGCATCACCGGCGAATGGCATTGTCTTCCCCAAAACCCGGAGCGGCGGCGTGCTGTCGCGTGCCCAATGA
ATTTTGATGACTCTCGCAAACGGGAATCTTGGCTCTTTGCATCGGATGGAAGGACGCAGCGAAATGCGAT
AAGTGGTGTGAATTGCAAGATCCCGTGAACCATCGAGTCTTTTGAACGCAAGTTGCGCCCGAGGCCATCA
GGCTAAGGGCACGCCTGCTTGGGCGTCGCGCTTCGTCTCTCTCCTGCCAATGCTTGCCCGGCATACAGCC
AGGCCGGCGTGGTGCGGATGTGAAAGATTGGCCCCTTGTGCCTAGGTGCGGCGGGTCCAAGAGCTGGTGT
TTTGATGGCCCGGAACCCGGCAAGAGGTGGACGGATGCTGGCAGCAGCTGCCGTGCGAATCCCCCATGTT
GTCGTGCTTGTCGGACAGGCAGGAGAACCCTTCCGAACCCCAATGGAGGGCGGTTGACCGCCATTCGGAT
GTGACCCCAGGTCAGGCGGGGGCACCCGCTGAGTTTACGC

>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY

Comments are added after semicolons. Each sequence is separated by a > symbol. The example given above has 3 sequences given in a single FASTA file. This is just a sample otherwise it is not necessary.

We are going to read a sample FASTA file, given here, which is the sequence of lady slipper orchids.

from Bio import SeqIO


if __name__ == '__main__':
    idx = 0
    for sequence in SeqIO.parse('ls_orchid.fasta','fasta'):
        print(sequence.id)
        print(sequence.seq)
        print('No of nucleotides: {}'.format(len(sequence)))
        idx+=1

    print('Total Sequence found {}'.format(idx))

When it runs, it shows output like below:

gi|2765658|emb|Z78533.1|CIZ78533
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTGAATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGGCCGCCTCGGGAGCGTCCATGGCGGGTTTGAACCTCTAGCCCGGCGCAGTTTGGGCGCCAAGCCATATGAAAGCATCACCGGCGAATGGCATTGTCTTCCCCAAAACCCGGAGCGGCGGCGTGCTGTCGCGTGCCCAATGAATTTTGATGACTCTCGCAAACGGGAATCTTGGCTCTTTGCATCGGATGGAAGGACGCAGCGAAATGCGATAAGTGGTGTGAATTGCAAGATCCCGTGAACCATCGAGTCTTTTGAACGCAAGTTGCGCCCGAGGCCATCAGGCTAAGGGCACGCCTGCTTGGGCGTCGCGCTTCGTCTCTCTCCTGCCAATGCTTGCCCGGCATACAGCCAGGCCGGCGTGGTGCGGATGTGAAAGATTGGCCCCTTGTGCCTAGGTGCGGCGGGTCCAAGAGCTGGTGTTTTGATGGCCCGGAACCCGGCAAGAGGTGGACGGATGCTGGCAGCAGCTGCCGTGCGAATCCCCCATGTTGTCGTGCTTGTCGGACAGGCAGGAGAACCCTTCCGAACCCCAATGGAGGGCGGTTGACCGCCATTCGGATGTGACCCCAGGTCAGGCGGGGGCACCCGCTGAGTTTACGC
No of nucleotides: 740
gi|2765657|emb|Z78532.1|CCZ78532
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAGAATATATGATCGAGTGAATCTGGAGGACCTGTGGTAACTCAGCTCGTCGTGGCACTGCTTTTGTCGTGACCCTGCTTTGTTGTTGGGCCTCCTCAAGAGCTTTCATGGCAGGTTTGAACTTTAGTACGGTGCAGTTTGCGCCAAGTCATATAAAGCATCACTGATGAATGACATTATTGTCAGAAAAAATCAGAGGGGCAGTATGCTACTGAGCATGCCAGTGAATTTTTATGACTCTCGCAACGGATATCTTGGCTCTAACATCGATGAAGAACGCAGCTAAATGCGATAAGTGGTGTGAATTGCAGAATCCCGTGAACCATCGAGTCTTTGAACGCAAGTTGCGCTCGAGGCCATCAGGCTAAGGGCACGCCTGCCTGGGCGTCGTGTGTTGCGTCTCTCCTACCAATGCTTGCTTGGCATATCGCTAAGCTGGCATTATACGGATGTGAATGATTGGCCCCTTGTGCCTAGGTGCGGTGGGTCTAAGGATTGTTGCTTTGATGGGTAGGAATGTGGCACGAGGTGGAGAATGCTAACAGTCATAAGGCTGCTATTTGAATCCCCCATGTTGTTGTATTTTTTCGAACCTACACAAGAACCTAATTGAACCCCAATGGAGCTAAAATAACCATTGGGCAGTTGATTTCCATTCAGATGCGACCCCAGGTCAGGCGGGGCCACCCGCTGAGTTGAGGC
No of nucleotides: 753

The very first line of the sequence appears in the form of id. The text after each > then appears as a sequence.

FASTA is not the only format available. There is another format available, called, GeneBank, having a file extension .gbk.

from Bio import SeqIO


if __name__ == '__main__':
    idx = 0
    for sequence in SeqIO.parse('ls_orchid.gbk','genbank'):
        print(sequence.seq)
        print('No of nucleotides: {}'.format(len(sequence)))
        idx+=1

    print('Total Sequence found {}'.format(idx))

No change in code as such other than file and file type. The rest of the API is the same across the file format. If I run the above program it prints like below:

Z78533.1
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTGAATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGGCCGCCTCGGGAGCGTCCATGGCGGGTTTGAACCTCTAGCCCGGCGCAGTTTGGGCGCCAAGCCATATGAAAGCATCACCGGCGAATGGCATTGTCTTCCCCAAAACCCGGAGCGGCGGCGTGCTGTCGCGTGCCCAATGAATTTTGATGACTCTCGCAAACGGGAATCTTGGCTCTTTGCATCGGATGGAAGGACGCAGCGAAATGCGATAAGTGGTGTGAATTGCAAGATCCCGTGAACCATCGAGTCTTTTGAACGCAAGTTGCGCCCGAGGCCATCAGGCTAAGGGCACGCCTGCTTGGGCGTCGCGCTTCGTCTCTCTCCTGCCAATGCTTGCCCGGCATACAGCCAGGCCGGCGTGGTGCGGATGTGAAAGATTGGCCCCTTGTGCCTAGGTGCGGCGGGTCCAAGAGCTGGTGTTTTGATGGCCCGGAACCCGGCAAGAGGTGGACGGATGCTGGCAGCAGCTGCCGTGCGAATCCCCCATGTTGTCGTGCTTGTCGGACAGGCAGGAGAACCCTTCCGAACCCCAATGGAGGGCGGTTGACCGCCATTCGGATGTGACCCCAGGTCAGGCGGGGGCACCCGCTGAGTTTACGC
No of nucleotides: 740
Z78532.1
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAGAATATATGATCGAGTGAATCTGGAGGACCTGTGGTAACTCAGCTCGTCGTGGCACTGCTTTTGTCGTGACCCTGCTTTGTTGTTGGGCCTCCTCAAGAGCTTTCATGGCAGGTTTGAACTTTAGTACGGTGCAGTTTGCGCCAAGTCATATAAAGCATCACTGATGAATGACATTATTGTCAGAAAAAATCAGAGGGGCAGTATGCTACTGAGCATGCCAGTGAATTTTTATGACTCTCGCAACGGATATCTTGGCTCTAACATCGATGAAGAACGCAGCTAAATGCGATAAGTGGTGTGAATTGCAGAATCCCGTGAACCATCGAGTCTTTGAACGCAAGTTGCGCTCGAGGCCATCAGGCTAAGGGCACGCCTGCCTGGGCGTCGTGTGTTGCGTCTCTCCTACCAATGCTTGCTTGGCATATCGCTAAGCTGGCATTATACGGATGTGAATGATTGGCCCCTTGTGCCTAGGTGCGGTGGGTCTAAGGATTGTTGCTTTGATGGGTAGGAATGTGGCACGAGGTGGAGAATGCTAACAGTCATAAGGCTGCTATTTGAATCCCCCATGTTGTTGTATTTTTTCGAACCTACACAAGAACCTAATTGAACCCCAATGGAGCTAAAATAACCATTGGGCAGTTGATTTCCATTCAGATGCGACCCCAGGTCAGGCGGGGCCACCCGCTGAGTTGAGGC
No of nucleotides: 753

You can also mention sequence type by using Alphabet. By default, it will be SingleLetterAlphabet()

from Bio.Alphabet import generic_dna

if __name__ == '__main__':
    idx = 0
    for sequence in SeqIO.parse('ls_orchid.fasta','fasta',generic_dna):
        print(sequence)

I imported generic_dna type and used it in parse() method. When it runs, it prints the following output:

Seq('CGTAACAAGGTTTCCGTAGGTGGACCTCCGGGAGGATCATTGTTGAGATCACAT...GCA', DNAAlphabet())

As you can read DNAAlphabet() is printed now.

Calculating GC Content

Biopython provides an easy way to calculate GC Content(guanine-cytosine content).

from Bio import SeqIO
from Bio.Alphabet import generic_dna
from Bio.SeqUtils import GC

if __name__ == '__main__':
    idx = 0
    for sequence in SeqIO.parse('ls_orchid.fasta','fasta',generic_dna):
        print(GC(sequence.seq))

You imported the GC method from the utility classes and passed the actual sequence object in it. On execution, it prints the numeric percentage.

Complement and Reverse Complement

The two antiparallel strands (+ve or 5’-> 3’ &-ve or 3’-> 5’) are complementary to each other in a DNA molecule.

A nucleotide sequence can be reverse complemented to get a new sequence. Similarly, the complemented sequence can be reverse complemented to get the original sequence. Biopython provides two methods to do this functionality. In Biopython it is very easy to get both of a sequence.

for sequence in SeqIO.parse('ls_orchid.fasta', 'fasta', generic_dna):
        print('Original Sequence:  {} '.format(sequence.seq))
        print('Complement: {}: '.format(sequence.seq.complement()))
        print('Reverse Complement: {}: '.format(sequence.seq.reverse_complement()))
        print('============================================================')

When it runs it prints the following output:

Original Sequence:  CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTGAATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGGCCGCCTCGGGAGCGTCCATGGCGGGTTTGAACCTCTAGCCCGGCGCAGTTTGGGCGCCAAGCCATATGAAAGCATCACCGGCGAATGGCATTGTCTTCCCCAAAACCCGGAGCGGCGGCGTGCTGTCGCGTGCCCAATGAATTTTGATGACTCTCGCAAACGGGAATCTTGGCTCTTTGCATCGGATGGAAGGACGCAGCGAAATGCGATAAGTGGTGTGAATTGCAAGATCCCGTGAACCATCGAGTCTTTTGAACGCAAGTTGCGCCCGAGGCCATCAGGCTAAGGGCACGCCTGCTTGGGCGTCGCGCTTCGTCTCTCTCCTGCCAATGCTTGCCCGGCATACAGCCAGGCCGGCGTGGTGCGGATGTGAAAGATTGGCCCCTTGTGCCTAGGTGCGGCGGGTCCAAGAGCTGGTGTTTTGATGGCCCGGAACCCGGCAAGAGGTGGACGGATGCTGGCAGCAGCTGCCGTGCGAATCCCCCATGTTGTCGTGCTTGTCGGACAGGCAGGAGAACCCTTCCGAACCCCAATGGAGGGCGGTTGACCGCCATTCGGATGTGACCCCAGGTCAGGCGGGGGCACCCGCTGAGTTTACGC 
Complement: GCATTGTTCCAAAGGCATCCACTTGGACGCCTTCCTAGTAACTACTCTGGCACCTTATTTGCTAGCTCACTTAGGCCTCCTGGCCACATGAGTCGAGTGGCCCCCGTAACGAGGGCACCACTGGGACTAAACAACAACCCGGCGGAGCCCTCGCAGGTACCGCCCAAACTTGGAGATCGGGCCGCGTCAAACCCGCGGTTCGGTATACTTTCGTAGTGGCCGCTTACCGTAACAGAAGGGGTTTTGGGCCTCGCCGCCGCACGACAGCGCACGGGTTACTTAAAACTACTGAGAGCGTTTGCCCTTAGAACCGAGAAACGTAGCCTACCTTCCTGCGTCGCTTTACGCTATTCACCACACTTAACGTTCTAGGGCACTTGGTAGCTCAGAAAACTTGCGTTCAACGCGGGCTCCGGTAGTCCGATTCCCGTGCGGACGAACCCGCAGCGCGAAGCAGAGAGAGGACGGTTACGAACGGGCCGTATGTCGGTCCGGCCGCACCACGCCTACACTTTCTAACCGGGGAACACGGATCCACGCCGCCCAGGTTCTCGACCACAAAACTACCGGGCCTTGGGCCGTTCTCCACCTGCCTACGACCGTCGTCGACGGCACGCTTAGGGGGTACAACAGCACGAACAGCCTGTCCGTCCTCTTGGGAAGGCTTGGGGTTACCTCCCGCCAACTGGCGGTAAGCCTACACTGGGGTCCAGTCCGCCCCCGTGGGCGACTCAAATGCG: 
Reverse Complement: GCGTAAACTCAGCGGGTGCCCCCGCCTGACCTGGGGTCACATCCGAATGGCGGTCAACCGCCCTCCATTGGGGTTCGGAAGGGTTCTCCTGCCTGTCCGACAAGCACGACAACATGGGGGATTCGCACGGCAGCTGCTGCCAGCATCCGTCCACCTCTTGCCGGGTTCCGGGCCATCAAAACACCAGCTCTTGGACCCGCCGCACCTAGGCACAAGGGGCCAATCTTTCACATCCGCACCACGCCGGCCTGGCTGTATGCCGGGCAAGCATTGGCAGGAGAGAGACGAAGCGCGACGCCCAAGCAGGCGTGCCCTTAGCCTGATGGCCTCGGGCGCAACTTGCGTTCAAAAGACTCGATGGTTCACGGGATCTTGCAATTCACACCACTTATCGCATTTCGCTGCGTCCTTCCATCCGATGCAAAGAGCCAAGATTCCCGTTTGCGAGAGTCATCAAAATTCATTGGGCACGCGACAGCACGCCGCCGCTCCGGGTTTTGGGGAAGACAATGCCATTCGCCGGTGATGCTTTCATATGGCTTGGCGCCCAAACTGCGCCGGGCTAGAGGTTCAAACCCGCCATGGACGCTCCCGAGGCGGCCCAACAACAAATCAGGGTCACCACGGGAGCAATGCCCCCGGTGAGCTGAGTACACCGGTCCTCCGGATTCACTCGATCGTTTATTCCACGGTCTCATCAATGATCCTTCCGCAGGTTCACCTACGGAAACCTTGTTACG: 
============================================================

Conclusion

So this was a brief intro to bioinformatics and Biopython. It is way more than that. You can also apply machine learning methods to find a new drug for saving lives. Hopefully, I will be writing more on this subject, provided if it’s liked by you 🙂

 

If you like this post then you should subscribe to my blog for future updates.

* indicates required