In this post, I am going to discuss how Python is being used in the field of bioinformatics and how you can use it to analyze sequences of DNA, RNA, and proteins. Yeah, Python is being used by biologists as well.
Before I get into coding, I’d like to give a brief background of bioinformatics and related things.
What is bioinformatics?
According to Wikipedia:
Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding biological data. As an interdisciplinary field of science, bioinformatics combines biology, computer science, information engineering, mathematics and statistics to analyze and interpret biological data. Bioinformatics has been used for in silico analyses of biological queries using mathematical and statistical techniques.
Bioinformatics is the field which is a combination of two major fields: Biological data (sequences and structures of proteins, DNA, RNAs, and others ) and Informatics (computer science, statistics, maths, and engineering). This is why it is called an interdisciplinary field. The common use of bioinformatics includes the identification of genes and SNPs.
The goal of bioinformatics are:
- Sequence Alignment.
- Gene finding.
- Drug discovery and drug designing.
- Protein structure alignment and prediction. etc.
As you have figured out, bioinformatics is a huge field that contains different areas and relevant operations. In this introductory post, we are discussing in brief about sequence analysis.
What is Sequence Analysis?
According to Wikipedia:
In bioinformatics, sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of a wide range of analytical methods to understand its features, function, structure, or evolution.
Sequencing is the process of finding the primary structure whether it is DNA, RNA.
DNA Sequencing is all about determining the nucleic acid sequence. It helps to find out the order of the four bases: adenine (A), guanine (G), cytosine (C) and thymine (T), in a strand of DNA.
Sequencing is done with the help of sequence machines. Once sequenced, the data is then available for further processing. A textual example of a DNA sequence looks like below:
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG CCGCCTCGGGAGCGTCCATGGCGGGTTTGAACCTCTAGCCCGGCGCAGTTTGGGCGCCAAGCCATATGAA AGCATCACCGGCGAATGGCATTGTCTTCCCCAAAACCCGGAGCGGCGGCGTGCTGTCGCGTGCCCAATGA
DNA sequencing is the method of determining the order of nucleotide in a DNA.
RNA sequencing is the method to find the quantity of RNA in a biological sample.
Protein sequencing is the method of determining the amino acid sequence of all or part of a protein or peptide.
OK, enough theory, it’s time to get into code.
Biopython Development Setup
We are going to use Biopython, the library for dealing with biological computation and analysis of sequences. To install it you can either use pip
or conda
, in case you are using Conda based distribution.
pip install biopython
You can check the version as well:
>>> import Bio >>> Bio.__version__ '1.74' >>>
1.74 as of writing this post.
Reading Sequence in FASTA format
OK, so we are going to read a DNA sequence that is available in FASTA format. FASTA is a text-based format to represent different sequences which are represented in single-letter codes. A typical FASTA format looks like below:
;LCBO - Prolactin precursor - Bovine ; a sample sequence in FASTA format >gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG CCGCCTCGGGAGCGTCCATGGCGGGTTTGAACCTCTAGCCCGGCGCAGTTTGGGCGCCAAGCCATATGAA AGCATCACCGGCGAATGGCATTGTCTTCCCCAAAACCCGGAGCGGCGGCGTGCTGTCGCGTGCCCAATGA ATTTTGATGACTCTCGCAAACGGGAATCTTGGCTCTTTGCATCGGATGGAAGGACGCAGCGAAATGCGAT AAGTGGTGTGAATTGCAAGATCCCGTGAACCATCGAGTCTTTTGAACGCAAGTTGCGCCCGAGGCCATCA GGCTAAGGGCACGCCTGCTTGGGCGTCGCGCTTCGTCTCTCTCCTGCCAATGCTTGCCCGGCATACAGCC AGGCCGGCGTGGTGCGGATGTGAAAGATTGGCCCCTTGTGCCTAGGTGCGGCGGGTCCAAGAGCTGGTGT TTTGATGGCCCGGAACCCGGCAAGAGGTGGACGGATGCTGGCAGCAGCTGCCGTGCGAATCCCCCATGTT GTCGTGCTTGTCGGACAGGCAGGAGAACCCTTCCGAACCCCAATGGAGGGCGGTTGACCGCCATTCGGAT GTGACCCCAGGTCAGGCGGGGGCACCCGCTGAGTTTACGC >MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA DIDGDGQVNYEEFVQMMTAK* >gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus] LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX IENY
Comments are added after semicolons. Each sequence is separated by a >
symbol. The example given above has 3 sequences given in a single FASTA file. This is just a sample otherwise it is not necessary.
We are going to read a sample FASTA file, given here, which is the sequence of lady slipper orchids.
from Bio import SeqIO if __name__ == '__main__': idx = 0 for sequence in SeqIO.parse('ls_orchid.fasta','fasta'): print(sequence.id) print(sequence.seq) print('No of nucleotides: {}'.format(len(sequence))) idx+=1 print('Total Sequence found {}'.format(idx))
When it runs, it shows output like below:
gi|2765658|emb|Z78533.1|CIZ78533 CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTGAATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGGCCGCCTCGGGAGCGTCCATGGCGGGTTTGAACCTCTAGCCCGGCGCAGTTTGGGCGCCAAGCCATATGAAAGCATCACCGGCGAATGGCATTGTCTTCCCCAAAACCCGGAGCGGCGGCGTGCTGTCGCGTGCCCAATGAATTTTGATGACTCTCGCAAACGGGAATCTTGGCTCTTTGCATCGGATGGAAGGACGCAGCGAAATGCGATAAGTGGTGTGAATTGCAAGATCCCGTGAACCATCGAGTCTTTTGAACGCAAGTTGCGCCCGAGGCCATCAGGCTAAGGGCACGCCTGCTTGGGCGTCGCGCTTCGTCTCTCTCCTGCCAATGCTTGCCCGGCATACAGCCAGGCCGGCGTGGTGCGGATGTGAAAGATTGGCCCCTTGTGCCTAGGTGCGGCGGGTCCAAGAGCTGGTGTTTTGATGGCCCGGAACCCGGCAAGAGGTGGACGGATGCTGGCAGCAGCTGCCGTGCGAATCCCCCATGTTGTCGTGCTTGTCGGACAGGCAGGAGAACCCTTCCGAACCCCAATGGAGGGCGGTTGACCGCCATTCGGATGTGACCCCAGGTCAGGCGGGGGCACCCGCTGAGTTTACGC No of nucleotides: 740 gi|2765657|emb|Z78532.1|CCZ78532 CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAGAATATATGATCGAGTGAATCTGGAGGACCTGTGGTAACTCAGCTCGTCGTGGCACTGCTTTTGTCGTGACCCTGCTTTGTTGTTGGGCCTCCTCAAGAGCTTTCATGGCAGGTTTGAACTTTAGTACGGTGCAGTTTGCGCCAAGTCATATAAAGCATCACTGATGAATGACATTATTGTCAGAAAAAATCAGAGGGGCAGTATGCTACTGAGCATGCCAGTGAATTTTTATGACTCTCGCAACGGATATCTTGGCTCTAACATCGATGAAGAACGCAGCTAAATGCGATAAGTGGTGTGAATTGCAGAATCCCGTGAACCATCGAGTCTTTGAACGCAAGTTGCGCTCGAGGCCATCAGGCTAAGGGCACGCCTGCCTGGGCGTCGTGTGTTGCGTCTCTCCTACCAATGCTTGCTTGGCATATCGCTAAGCTGGCATTATACGGATGTGAATGATTGGCCCCTTGTGCCTAGGTGCGGTGGGTCTAAGGATTGTTGCTTTGATGGGTAGGAATGTGGCACGAGGTGGAGAATGCTAACAGTCATAAGGCTGCTATTTGAATCCCCCATGTTGTTGTATTTTTTCGAACCTACACAAGAACCTAATTGAACCCCAATGGAGCTAAAATAACCATTGGGCAGTTGATTTCCATTCAGATGCGACCCCAGGTCAGGCGGGGCCACCCGCTGAGTTGAGGC No of nucleotides: 753
The very first line of the sequence appears in the form of id. The text after each > then appears as a sequence.
FASTA is not the only format available. There is another format available, called, GeneBank, having a file extension .gbk
.
from Bio import SeqIO if __name__ == '__main__': idx = 0 for sequence in SeqIO.parse('ls_orchid.gbk','genbank'): print(sequence.seq) print('No of nucleotides: {}'.format(len(sequence))) idx+=1 print('Total Sequence found {}'.format(idx))
No change in code as such other than file and file type. The rest of the API is the same across the file format. If I run the above program it prints like below:
Z78533.1 CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTGAATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGGCCGCCTCGGGAGCGTCCATGGCGGGTTTGAACCTCTAGCCCGGCGCAGTTTGGGCGCCAAGCCATATGAAAGCATCACCGGCGAATGGCATTGTCTTCCCCAAAACCCGGAGCGGCGGCGTGCTGTCGCGTGCCCAATGAATTTTGATGACTCTCGCAAACGGGAATCTTGGCTCTTTGCATCGGATGGAAGGACGCAGCGAAATGCGATAAGTGGTGTGAATTGCAAGATCCCGTGAACCATCGAGTCTTTTGAACGCAAGTTGCGCCCGAGGCCATCAGGCTAAGGGCACGCCTGCTTGGGCGTCGCGCTTCGTCTCTCTCCTGCCAATGCTTGCCCGGCATACAGCCAGGCCGGCGTGGTGCGGATGTGAAAGATTGGCCCCTTGTGCCTAGGTGCGGCGGGTCCAAGAGCTGGTGTTTTGATGGCCCGGAACCCGGCAAGAGGTGGACGGATGCTGGCAGCAGCTGCCGTGCGAATCCCCCATGTTGTCGTGCTTGTCGGACAGGCAGGAGAACCCTTCCGAACCCCAATGGAGGGCGGTTGACCGCCATTCGGATGTGACCCCAGGTCAGGCGGGGGCACCCGCTGAGTTTACGC No of nucleotides: 740 Z78532.1 CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAGAATATATGATCGAGTGAATCTGGAGGACCTGTGGTAACTCAGCTCGTCGTGGCACTGCTTTTGTCGTGACCCTGCTTTGTTGTTGGGCCTCCTCAAGAGCTTTCATGGCAGGTTTGAACTTTAGTACGGTGCAGTTTGCGCCAAGTCATATAAAGCATCACTGATGAATGACATTATTGTCAGAAAAAATCAGAGGGGCAGTATGCTACTGAGCATGCCAGTGAATTTTTATGACTCTCGCAACGGATATCTTGGCTCTAACATCGATGAAGAACGCAGCTAAATGCGATAAGTGGTGTGAATTGCAGAATCCCGTGAACCATCGAGTCTTTGAACGCAAGTTGCGCTCGAGGCCATCAGGCTAAGGGCACGCCTGCCTGGGCGTCGTGTGTTGCGTCTCTCCTACCAATGCTTGCTTGGCATATCGCTAAGCTGGCATTATACGGATGTGAATGATTGGCCCCTTGTGCCTAGGTGCGGTGGGTCTAAGGATTGTTGCTTTGATGGGTAGGAATGTGGCACGAGGTGGAGAATGCTAACAGTCATAAGGCTGCTATTTGAATCCCCCATGTTGTTGTATTTTTTCGAACCTACACAAGAACCTAATTGAACCCCAATGGAGCTAAAATAACCATTGGGCAGTTGATTTCCATTCAGATGCGACCCCAGGTCAGGCGGGGCCACCCGCTGAGTTGAGGC No of nucleotides: 753
You can also mention sequence type by using Alphabet. By default, it will be SingleLetterAlphabet()
from Bio.Alphabet import generic_dna if __name__ == '__main__': idx = 0 for sequence in SeqIO.parse('ls_orchid.fasta','fasta',generic_dna): print(sequence)
I imported generic_dna
type and used it in parse()
method. When it runs, it prints the following output:
Seq('CGTAACAAGGTTTCCGTAGGTGGACCTCCGGGAGGATCATTGTTGAGATCACAT...GCA', DNAAlphabet())
As you can read DNAAlphabet()
is printed now.
Calculating GC Content
Biopython provides an easy way to calculate GC Content(guanine-cytosine content).
from Bio import SeqIO from Bio.Alphabet import generic_dna from Bio.SeqUtils import GC if __name__ == '__main__': idx = 0 for sequence in SeqIO.parse('ls_orchid.fasta','fasta',generic_dna): print(GC(sequence.seq))
You imported the GC
method from the utility classes and passed the actual sequence object in it. On execution, it prints the numeric percentage.
Complement and Reverse Complement
The two antiparallel strands (+ve or 5’-> 3’ &-ve or 3’-> 5’) are complementary to each other in a DNA molecule.
A nucleotide sequence can be reverse complemented to get a new sequence. Similarly, the complemented sequence can be reverse complemented to get the original sequence. Biopython provides two methods to do this functionality. In Biopython it is very easy to get both of a sequence.
for sequence in SeqIO.parse('ls_orchid.fasta', 'fasta', generic_dna): print('Original Sequence: {} '.format(sequence.seq)) print('Complement: {}: '.format(sequence.seq.complement())) print('Reverse Complement: {}: '.format(sequence.seq.reverse_complement())) print('============================================================')
When it runs it prints the following output:
Original Sequence: CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTGAATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGGCCGCCTCGGGAGCGTCCATGGCGGGTTTGAACCTCTAGCCCGGCGCAGTTTGGGCGCCAAGCCATATGAAAGCATCACCGGCGAATGGCATTGTCTTCCCCAAAACCCGGAGCGGCGGCGTGCTGTCGCGTGCCCAATGAATTTTGATGACTCTCGCAAACGGGAATCTTGGCTCTTTGCATCGGATGGAAGGACGCAGCGAAATGCGATAAGTGGTGTGAATTGCAAGATCCCGTGAACCATCGAGTCTTTTGAACGCAAGTTGCGCCCGAGGCCATCAGGCTAAGGGCACGCCTGCTTGGGCGTCGCGCTTCGTCTCTCTCCTGCCAATGCTTGCCCGGCATACAGCCAGGCCGGCGTGGTGCGGATGTGAAAGATTGGCCCCTTGTGCCTAGGTGCGGCGGGTCCAAGAGCTGGTGTTTTGATGGCCCGGAACCCGGCAAGAGGTGGACGGATGCTGGCAGCAGCTGCCGTGCGAATCCCCCATGTTGTCGTGCTTGTCGGACAGGCAGGAGAACCCTTCCGAACCCCAATGGAGGGCGGTTGACCGCCATTCGGATGTGACCCCAGGTCAGGCGGGGGCACCCGCTGAGTTTACGC Complement: GCATTGTTCCAAAGGCATCCACTTGGACGCCTTCCTAGTAACTACTCTGGCACCTTATTTGCTAGCTCACTTAGGCCTCCTGGCCACATGAGTCGAGTGGCCCCCGTAACGAGGGCACCACTGGGACTAAACAACAACCCGGCGGAGCCCTCGCAGGTACCGCCCAAACTTGGAGATCGGGCCGCGTCAAACCCGCGGTTCGGTATACTTTCGTAGTGGCCGCTTACCGTAACAGAAGGGGTTTTGGGCCTCGCCGCCGCACGACAGCGCACGGGTTACTTAAAACTACTGAGAGCGTTTGCCCTTAGAACCGAGAAACGTAGCCTACCTTCCTGCGTCGCTTTACGCTATTCACCACACTTAACGTTCTAGGGCACTTGGTAGCTCAGAAAACTTGCGTTCAACGCGGGCTCCGGTAGTCCGATTCCCGTGCGGACGAACCCGCAGCGCGAAGCAGAGAGAGGACGGTTACGAACGGGCCGTATGTCGGTCCGGCCGCACCACGCCTACACTTTCTAACCGGGGAACACGGATCCACGCCGCCCAGGTTCTCGACCACAAAACTACCGGGCCTTGGGCCGTTCTCCACCTGCCTACGACCGTCGTCGACGGCACGCTTAGGGGGTACAACAGCACGAACAGCCTGTCCGTCCTCTTGGGAAGGCTTGGGGTTACCTCCCGCCAACTGGCGGTAAGCCTACACTGGGGTCCAGTCCGCCCCCGTGGGCGACTCAAATGCG: Reverse Complement: GCGTAAACTCAGCGGGTGCCCCCGCCTGACCTGGGGTCACATCCGAATGGCGGTCAACCGCCCTCCATTGGGGTTCGGAAGGGTTCTCCTGCCTGTCCGACAAGCACGACAACATGGGGGATTCGCACGGCAGCTGCTGCCAGCATCCGTCCACCTCTTGCCGGGTTCCGGGCCATCAAAACACCAGCTCTTGGACCCGCCGCACCTAGGCACAAGGGGCCAATCTTTCACATCCGCACCACGCCGGCCTGGCTGTATGCCGGGCAAGCATTGGCAGGAGAGAGACGAAGCGCGACGCCCAAGCAGGCGTGCCCTTAGCCTGATGGCCTCGGGCGCAACTTGCGTTCAAAAGACTCGATGGTTCACGGGATCTTGCAATTCACACCACTTATCGCATTTCGCTGCGTCCTTCCATCCGATGCAAAGAGCCAAGATTCCCGTTTGCGAGAGTCATCAAAATTCATTGGGCACGCGACAGCACGCCGCCGCTCCGGGTTTTGGGGAAGACAATGCCATTCGCCGGTGATGCTTTCATATGGCTTGGCGCCCAAACTGCGCCGGGCTAGAGGTTCAAACCCGCCATGGACGCTCCCGAGGCGGCCCAACAACAAATCAGGGTCACCACGGGAGCAATGCCCCCGGTGAGCTGAGTACACCGGTCCTCCGGATTCACTCGATCGTTTATTCCACGGTCTCATCAATGATCCTTCCGCAGGTTCACCTACGGAAACCTTGTTACG: ============================================================
Conclusion
So this was a brief intro to bioinformatics and Biopython. It is way more than that. You can also apply machine learning methods to find a new drug for saving lives. Hopefully, I will be writing more on this subject, provided if it’s liked by you 🙂