
 



Introductory Bioinformatics Lab

for High School AP Biology

Teacher Version (for student version, click HERE )

 
 

Bioinformatics is emerging as a hugely important field affecting all areas of biology.  While bioinformatics is formally the application of computer technologies to biological sciences - ranging from automated analysis of microarrays containing thousands of individual experiments to the development of browser tools for looking at whole genomes - students in all areas of biology need to be familiar with software tools developed by bioinformaticians to accomplish routine tasks in biology.

Skills developed in this lab:
  • Use of National Center for Biotechnology Information (NCBI) databases
  • Retrieval of sequences from NCBI
  • Alignment of homologous protein sequences using ClustalX
  • Using ClustalX output to prepare phylogenetic networks (trees)
  • Testing evolutionary hypotheses
  • Identification and visualization of evolutionarily conserved structures in proteins

 

 

PART ONE

 

 

 

1. First you will do the guided activity asking the question, "Is Euglena a plant or an animal?" using Cytochrome C as a demonstration exercise. 

2. You will then have the tools to answer the question: "Are whales and dolphins a sister group to Artiodactyls (ungulates)?  Or should they be placed within the Artiodactyls as a sister group to Hippopotami?" You will answer that question during part two of the lab (see Reading).

 

"Is Euglena a plant or an animal?"

 


Photo courtesy of The Euglenoid Project at Rutgers University. More on the Protist genus Euglena, including fantastic images!

Algae are protists with chloroplasts. However, Euglena is a protist genus in which some species have chloroplasts and others don't. Many Euglena are heterotrophic and highly motile, behavior that is more animal-like than plant-like. So, are they more closely related to animals or to plants?!!

To answer that question we will use the resources of the:

National Center for Biotechnology Information (NCBI)

 

 


It is impossible to provide a reasonable guide to even a small section of this tremendous resource... You will have to explore it yourself...

 

As an example, look up "Mirounga"

 

As you can see, there is a vast amount of information cataloged even for this monachine phocid...

 

try clicking on "PubMed Central: free, full text journal articles."


Here, for example, you will find an important article that should be read by all biologists.

"Sequential megafaunal collapse in the North Pacific Ocean:
An ongoing legacy of industrial whaling?"


For these next steps use a pencil or pen and put a check next to the steps as you complete them.

To see what is available for Euglena let's enter that search term instead of Mirounga. Go ahead and refine the search a bit by clicking "Protein" and adding the search modifier for "organism"  like this:

Euglena [orgn]

 

That should reduce the number of hits a bit. Adding "cytochrome c" with quotes like this should help a lot:

Euglena [orgn] "cytochrome c"

 

Finally, if you add the search modifier for "protein" like this:

Euglena [orgn] "cytochrome c" [prot]

 

 ...the list should be reduced to a very few hits that include the Cytochrome C sequences for Euglena viridis and Euglena gracilis.
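The same stepwise narrowing can be written down as plain query strings. The sketch below is only an illustration: the helper names `build_term` and `esearch_url` are made up for this example, and it assumes that NCBI's E-utilities `esearch` address accepts the same search terms as the web search box. It builds the search URLs but does not contact NCBI.

```python
from urllib.parse import urlencode

# NCBI E-utilities search address (assumed to take the same terms as the web form).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(organism, phrase=None, phrase_field=None):
    """Assemble a search term in the style used in this lab (hypothetical helper)."""
    parts = [f"{organism} [orgn]"]
    if phrase:
        parts.append(f'"{phrase}"')          # quotes keep the phrase together
        if phrase_field:
            parts.append(f"[{phrase_field}]")  # e.g. [prot] for the protein name
    return " ".join(parts)


def esearch_url(term, db="protein"):
    """Return a URL that would run the search; nothing is downloaded here."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


# The three searches from the steps above, from broadest to narrowest.
for term in (
    build_term("Euglena"),
    build_term("Euglena", "cytochrome c"),
    build_term("Euglena", "cytochrome c", "prot"),
):
    print(term)
    print("   ", esearch_url(term))
```

Each added modifier makes the term longer and the result list shorter, which is the whole idea behind refining a search.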

At this point it is worth starting to take some notes... If you examine the sequence record you will see that there is an accession number. Writing down "Euglena viridis Cytochrome C P22342" in your lab book can make your life much easier later. Using the accession number P22342 whenever you search or communicate your results will ensure that the exact sequence you used is known. Write down "Euglena gracilis Cytochrome C P00076" as well.

  • Create a folder called "Sequences" somewhere on your hard drive where you can find it again (Perhaps write down the pathname).
  • Save the Euglena viridis sequence P22342 to that folder as a web page called "Cyt_c_Eug_vir.html"
    Then save the Euglena gracilis sequence P00076 as a web page called "Cyt_c_Eug_gra.html"
Note: Those files contain all sorts of important information associated with the sequence... however, we will use a much simpler file type called "FASTA format" for the actual computer manipulations.
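Writing down accession numbers pays off because a sequence can be requested again by accession alone. As a hedged sketch (the function name `efetch_fasta_url` is hypothetical, and it assumes the E-utilities `efetch` address with `rettype=fasta` serves these two protein accessions), the following builds a download URL for the two Euglena records without actually fetching anything:

```python
from urllib.parse import urlencode

# NCBI E-utilities fetch address (an assumption for this sketch, not part of the lab steps).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Accession numbers from the lab book, keyed by the file names used above.
ACCESSIONS = {"Cyt_c_Eug_vir": "P22342", "Cyt_c_Eug_gra": "P00076"}


def efetch_fasta_url(accessions, db="protein"):
    """Build a URL asking for the given accessions as plain-text FASTA."""
    params = {
        "db": db,
        "id": ",".join(accessions),  # several accessions can share one request
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


url = efetch_fasta_url(ACCESSIONS.values())
print(url)
```

Pasting the printed address into a browser is one way to check whether the assumption holds for your accessions.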

 

  • Now erase your previous search terms and try typing in "Cytochrome C" in quotes... what results do you get when you search?
    • Click on "Protein" if you aren't already in the Protein database.

You should see "Page 1" of at least "1,500 pages" of results!!!  A bit more than Mirounga... "Cytochrome C" is clearly an important protein.

 

  • To refine the search try adding [prot] after the "Cytochrome C" - that should get it down to only 25 pages of results (!).

 

  • Finally try adding "mammalia" to the search terms as in the example below:
 

 

What do you see?  You should see that the results have been narrowed to about 45 items (2006) on 3 pages.

One of the first ones in the list should be Q7YR71, the Silvered Leaf Monkey version of Cytochrome C.
Click on it!

Once again, you see a lot of "meta data" - additional information associated with the actual sequence data. We can save this file again as HTML, just as we did for the two Euglena sequences. This makes it easier to cite the authors of that data when (and if) we are writing the research paper. But this time we will also save the sequences in FASTA format.

Sequences are available in a variety of formats which are selected via the "Display" button. The sequences can also be sent to "text" for printing or saved in a file. Copying and pasting into Notepad also works. There is also information associated with structure, taxonomy, other genes and publications, etc.

A FASTA format file looks like this:

 Line 1: >The name and information goes here. Always begins with a ">"
 Line 2: HERE IS THE SEQUENCE ONLY, NOTHING ELSE, ESPECIALLY NO EMPTY LINES
		  

Try clicking on the DISPLAY button and selecting FASTA for the monkey sequence Q7YR71.
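To make the two-part structure of a FASTA record concrete, here is a minimal reader, offered as a sketch rather than as what ClustalX does internally. The function name `parse_fasta` is made up for this example. It is deliberately more forgiving than the rule above: it skips empty lines instead of failing on them, which real alignment programs may not do.

```python
def parse_fasta(text):
    """Return a list of (title, sequence) pairs from FASTA-formatted text."""
    records = []
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # this sketch tolerates blank lines; other programs may not
        if line.startswith(">"):
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:].strip(), []
        elif title is not None:
            chunks.append(line)  # sequence lines are joined into one string
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


# The Euglena viridis record used in this lab.
example = """>E_viridis P22342 Cytochrome c [Euglena viridis]
GDAERGKKLFESRAGQCHSSQKGVNSTGPALYGVYGRTSGTVPGYAYSNANKNAAIVWED
ESLNKFLENPKKYVPGTKMAFAGIKAKKDRLDIIAYMKTLKD
"""

for title, sequence in parse_fasta(example):
    # The first word of the title is the short name; the rest is description.
    print(title.split()[0], len(sequence), "residues")
```

Everything after the ">" up to the end of that line is the title; everything else, however many lines it is wrapped over, is the sequence.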

 

  • Next, open Notepad (Start --> Programs --> Accessories --> Notepad). We use Notepad because it is a plain-text ASCII editor, not a word processor. In other words, it is very simple and doesn't embed all kinds of formatting commands and font information into the file. Most bioinformatics programs can't work with WORD files or anything with embedded formatting.
  • Copy and paste the FASTA version of the monkey sequence into Notepad.

  • Save the Notepad file as Cyt_c_monkey_fasta.txt
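The point of the Notepad steps is that the saved file contains nothing but a title line and sequence letters. For anyone curious, the same thing can be sketched in a few lines of code. The names `format_fasta` and `save_plain_text` are hypothetical, and the sequence fragment in the demonstration is only the start of the monkey sequence, not the whole record.

```python
def format_fasta(name, sequence, width=60):
    """Return one FASTA record: a '>' title line, then the sequence wrapped to `width`."""
    residues = "".join(sequence.split()).upper()  # drop spaces or line breaks pasted along
    lines = [residues[i:i + width] for i in range(0, len(residues), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


def save_plain_text(path, text):
    """Write text as plain ASCII; raises an error if a non-ASCII character slipped in."""
    with open(path, "w", encoding="ascii", newline="\n") as handle:
        handle.write(text)


# Demonstration with only the first residues of the monkey sequence.
record = format_fasta("Monkey", "MGDVEKGKKILIMKCSQCHTVEKGG", width=10)
print(record)
# To write the file, one could call:
# save_plain_text("Cyt_c_monkey_fasta.txt", record)
```

Writing with an ASCII encoding is a simple guard against the hidden formatting characters that word processors add.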

 

In order to save time I have downloaded an additional five sequences in FASTA format and saved them into a single master file for us to use in this exercise.  Follow the numbered steps below the sequences.


The Cytochrome C sequences we will use:

>Arabidopsis gi|4539007 Cytochrome c [Arabidopsis thaliana]
MASFDEAPPGNPKAGEKIFRTKCAQCHTVEKGAGHKQGPNLNGLFGRQSGTTPGYSYSAA
NKSMAVNWEEKTLYDYLLNPKKYIPGTKMVFPGLKKPQDRADLIAYLKEGTA

>E_viridis P22342 Cytochrome c [Euglena viridis]
GDAERGKKLFESRAGQCHSSQKGVNSTGPALYGVYGRTSGTVPGYAYSNANKNAAIVWED
ESLNKFLENPKKYVPGTKMAFAGIKAKKDRLDIIAYMKTLKD

>Monkey (Silvered Leaf) Q7YR71 Cytochrome c [Trachypithecus cristatus]
MGDVEKGKKILIMKCSQCHTVEKGGKHKTGPNHHGLFGRKTGQAPGYSYTAANKNKGITWGEDTLMEYLE
NPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE

>E_gracilis P00076 Cytochrome c [Euglena gracilis]
GDAERGKKLFESRAAQCHSAQKGVNSTGPSLWGVYGRTSGSVPGYAYSNANKNAAIVWEE
ETLHKFLENPKKYVPGTKMAFAGIKAKKDRQDIIAYMKTLKD

>Mosquito gi|31202411|ref|XP_310154.1| [Anopheles gambiae]
MGVPAGDVEKGKKLFVQRCAQCHTVEAGGKHKVGPNLHGLFGRKTGQAAGFSYTDANKAK
GITWNEDTLFEYLENPKKYIPGTKMVFAGLKKPQERGDLIAYLKSATK


>Rice gi|218249 Cytochrome C [Oryza sativa (japonica cultivar-group)]
MASFSEAPPGNPKAGEKIFKTKCAQCHTVDKGAGHKQGPNLNGLFGRQSGTTPGYSYSTA
NKNMAVIWEENTLYDYLLNPKKYIPGTKMVFPGLKKPQERADLISYLKEATS


Preparing sequences for comparison by aligning them using ClustalX
  1. If you haven't already done so, create a folder called "Sequences" somewhere on your hard drive where you can find it again (Perhaps write down the pathname).

  2. Copy & Paste all six sequences above into Notepad.  IT IS IMPORTANT THAT YOUR EDITOR BE ABLE TO SAVE THE FILE AS TEXT. ClustalX will only use text files. It will help a lot later if you insert a reasonable name in the space behind the ">" as illustrated below. ClustalX reads names up to the first space.


  3. Save the file as "all_six.txt" or something similar, maybe "AraEugMonEugMosRice.txt" in your sequences folder, but leave it open.

  4. Now open the website at: http://align.genome.jp/ (Kyoto University Genome Net)

  5. Copy and paste all six sequences from the Notepad window into the Clustal window at the Kyoto University Genome Net website.

  6. Before we ask the program to prepare the alignment for us, we should do some cleaning up of the title lines for the sequences... that will make the trees that are produced much more aesthetically pleasing. Edit each line so there is just the ">" and the name of the organism.

  7. Click "Execute Multiple Alignment." Although there can be some machine errors... Clustal does a fairly good job of aligning the sequences. Notice that the program has rearranged the order of the sequences and grouped the two plants together, the two animals together, and the two Euglena together. Not surprisingly the software has been able to identify the organisms that are most closely related on the basis of their Cytochrome C amino acid sequences. Notice that there is a region "NPKKYIPGTKM" that is nearly identical in all six organisms! That can be interpreted as a region that was already present in the common ancestor to plants and animals and which cannot be changed without affecting the survival of the organism. Often these sites are critical to the function of the protein and good drug targets.
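Two of the steps above, trimming the title lines and spotting the conserved region, can also be checked with a short script. This is a sketch under stated assumptions, not the method Clustal uses: the helper names are invented, the search allows a single mismatch (an arbitrary choice made here so that "nearly identical" copies are still found), and only the second line of two of the records is used as input.

```python
def short_title(title_line):
    """Reduce a FASTA title line to '>' plus its first word."""
    return ">" + title_line.lstrip(">").split()[0]


def mismatches(a, b):
    """Count positions where two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def find_motif(sequence, motif, max_mismatches=1):
    """Return (index, matched_text) for the first window close to `motif`, else None."""
    n = len(motif)
    for i in range(len(sequence) - n + 1):
        window = sequence[i:i + n]
        if mismatches(window, motif) <= max_mismatches:
            return i, window
    return None


MOTIF = "NPKKYIPGTKM"

# Second lines of two records, copied from the sequence list above.
fragments = {
    ">Arabidopsis gi|4539007 Cytochrome c [Arabidopsis thaliana]":
        "NKSMAVNWEEKTLYDYLLNPKKYIPGTKMVFPGLKKPQDRADLIAYLKEGTA",
    ">E_viridis P22342 Cytochrome c [Euglena viridis]":
        "ESLNKFLENPKKYVPGTKMAFAGIKAKKDRLDIIAYMKTLKD",
}

for title, fragment in fragments.items():
    print(short_title(title), find_motif(fragment, MOTIF))
```

The plant fragment contains the motif exactly, while the Euglena fragment differs at one position (V in place of I), which is what "nearly identical" means here.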

Preparing phylogenetic trees based on the sequence comparison


    The program also compares the aligned sequences and measures how different they are from each other.  The more differences, the less related they should be, and the more distant they should appear on a phylogenetic tree.  The program first finds the two most related sequences then adds the next most related "neighbor" sequence.  It calculates a difference score and outputs a little file of brackets and numbers that show the relationships and degree of relationship in the form of "branch lengths."
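The idea of "measuring how different they are" can be illustrated with the simplest possible difference score: the fraction of aligned positions at which two sequences differ. The sketch below is a simplification and not the calculation Clustal or the neighbor-joining method actually performs; the function names are invented and the three short sequences are toy data made up for the example.

```python
from itertools import combinations


def p_distance(a, b, gap="-"):
    """Fraction of aligned positions that differ, skipping columns with a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not compared:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in compared) / len(compared)


def closest_pair(alignment):
    """Return (distance, name1, name2) for the least different pair."""
    return min(
        (p_distance(alignment[m], alignment[n]), m, n)
        for m, n in combinations(sorted(alignment), 2)
    )


# Toy alignment, invented for illustration (not real protein data).
toy = {
    "A": "ACDEFGHIKL",
    "B": "ACDEFGHIKV",
    "C": "ACDQFGYIRV",
}

for m, n in combinations(sorted(toy), 2):
    print(m, n, p_distance(toy[m], toy[n]))
print("closest:", closest_pair(toy))
```

In this toy case A and B differ at one position in ten, so they would be joined first, and C, which differs from both at more positions, would be added as the next "neighbor."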


  1. Scroll down to the bottom of the "results" web page. You should see a drop down menu like this one:

     

  2. Select the Unrooted Neighbor Joining Tree and click "Exec" - and then try the others (don't click "Generate Profile HMM," as that will suck up a lot of time and CPU effort). What do you think about the results?

The distance from Euglena to Monkey is very slightly less than the distance to the plants... and it's the same as the distance to the Mosquito...  so, one could argue that based on a comparison of Cytochrome C sequences, Euglena is an animal.

Are you convinced?!!

 


 

 

PART TWO

 

 

Now for the more interesting question which you will answer on your own!


"Are whales and dolphins a sister group to Artiodactyls (ungulates)...
or should they be placed within the Artiodactyls as a sister group to Hippopotami?"


You may wish to prepare for this project by reading the short scientific paper on the Ancestry of Whales by Kenneth Rose.

For a project like this, Cytochrome C would probably be useless. It changes so little during evolution that it is essentially the same for all mammals. On the other hand, Pancreatic Ribonuclease is an enzyme that exhibits just the right amount of variability in mammals.

Example Pancreatic Ribonuclease sequences for this project:


You may choose a different project for your lab... some of you may choose to work with fish, insects or plants... Or, perhaps the most challenging and interesting of all, comparing whales, seals, bears, weasels... However, be aware that it will take you a lot of extra time since you will have to find your own sequences to compare and confirm that they are what you think they are... not always easy for beginners.
 


Your write-up should consist of:

  • A hypothesis, for example, "Whales and dolphins form a sister group with Hippopotami within the Artiodactyls."
  • A background paragraph explaining the controversy or question you are attempting to resolve.
  • The protein (or RNA) you chose to use for the analysis and why. For example, what is Pancreatic Ribonuclease?  Has it been used before for phylogenetic analysis? Why choose it for this taxonomic group? Hint: why wouldn't you use Pancreatic Ribonuclease to answer the Euglena question above? If possible, include a 3D image of your protein (see below).
  • The species you chose and why. Please prepare a figure showing the FASTA format collection of sequences using 8 pt. Courier font. Also include a photo of your organism.
  • A figure showing the alignment you produced. The caption should describe how the alignment was prepared. In the discussion, you should point out the highly conserved regions... can you speculate on why they are conserved?
  • A figure showing the resulting tree.  The caption should describe how the figure was prepared.
  • In the discussion conclude whether your hypothesis was supported or not, and provide suggestions for additional comparisons along with your reasoning.
  • Literature cited. 

 

 

 

How about another interesting dataset to try? Remember, you can pick and choose from among this set... no need to run them all (See the paper from the O'Brien Lab on the Origins of Placentals [2001]). Important! Notice that these sequences are cDNA! You can try to run alignments using the DNA and check your results, or you can click on the accession number, and then on the protein ID link (it's down in the CDS section)... finally convert them to FASTA.

>Rhinoceros (white) ATP7A [Ceratotherium simum]
IVYQPHLITVQEIKKQIEAAGFPAFIKKQPKFLKLGAIDIERLKNTPVKSSERPQQRSPSYTSDSTVTFI
VDGMHCKSCVSNIESALSTLQYISSIVVSLENRSAIVKYNASLVTPETLRKAIEAVSPGQYRVNITSEVE
STSNSPSSSSLQKIPLNIVSQPLTQETVINIDGMTCNSCVQSIEGVISKKAGVKSIRVSLSNGNGTVEYD
PLLTSPETLRKAIED

>Horse ATP7A [Equus caballus]
IVYQPHLITVEEIKKQIEAAGFPAFIKKQPKFLKLGAIDIERLKNTPVKSSERPQQRSPSCTNDSAVTFI
VDGMHCKSCVSNIESALSTLQYVSSVVVSLENRSAIVKYNASLVTPETLRKAIEAISPGQYRVSFPSEVE
STSNSPSGSSLHKIPLNIVSQPLTQETVINIDGMTCNSCVQSIEGVISKKAGVKSIRVSLANGNGTVEYD
PLLTSPETLRKAIED

>Hippopotamus ATP7A [Hippopotamus amphibius]
IVYQPHLITAEEIKKQIEAVGFPAFIRKQPKYLKLGAIDIERLKNTPVKSSEGSQQRSPSYTNNSTVVFI
IDGMHCKSCVSNIESALSTLQYVSSVVVSLENRSAVVKYNASLVTPETLRKAIETMSPGQYKVSSTSEIE
STSNSPSSSSLQKSPLNIVSQPLTQETVINIDGMTCNSCVQSIEGVISKKAGVKSIRVSLANSKGTVEYD
PLLTSPETLREAIED

>Elephant (African) ATP7A [Loxodonta africana]
IIYQPHLITAEEIKKQIEAVGFSAFIKKQPKYLTLGAIDVERLKNTPVRYSEGSEQRSPSYTNDSTATFI
INGMHCKSCVSNIESALSTLQYVSSIAISLENRSATVKYNASLVTPETLRKAIEAVSPGQYSVSITSDVE
STPSSPFSSYHQQIPLNIVSQPLTQETVINIGGMTCNSCVQSIEGVISEKAGVKSIRVSLANSSGVIEYD
PLLNSPETLREAIEN

>Whale (Humpback) ATP7A [Megaptera novaeangliae]
VVYQPHLITAEEIKKQIEAVGFPAFIKKQPKYLRLGAIDIERLKNTPVKSSEGSQQRSPSYTNNSTVIFI
IDGMHCKSCVSNIESALSTLQYVSSVVVSLENRSATVKYNASLVTPETLRKAIEAISPGQYRVSSTSEIE
STSNSPSSSSLQKSPLNIVSQPLTQETVINIDGMTCNSCVQSIEGVISKKAGVKSIRVSLANGKGTVEYD
PLLTSPETLREAIED

>Okapia (Giraffe family) ATP7A [Okapia johnstoni] Photo
VVYQPHLITAEEIKKQIEAVGFTAFIKKQPKYLKLGAIDIERLKNTPVKSSEGSQQRSPSSTSNSTVIFT
IDGMHCKSCVSNIESALSTFQHISSVVVSLENKSAIVKYNANLVTPEALRKAIEAISQGQYRVSTASDVG
STSNSPSSSSLQKSPLNVVSQPLTQETVINIDGMTCNSCVQSIEGVLSKKAGVKSVQVSLANGKGTVEYD
PLLTSPETLREAIED

>Pig (note "X" at 2nd to last residue) ATP7A [Sus scrofa]
YQPHLITVEEIKKQIEAVGFPVFIKKQPKYLKLGAIDIERLKNTPVKSLEGSPQRSTSYTNNSTVIFIID
GMHCKSCVSNIESALSTLQYVSSIVVSLENRTAIVKYNASLVTPETLRKAIEDISPGQYRVTSTSDIECT
SNSPSSSSLQKSPLNIVSQPLTQEAVINIDGMTCNSCVQSIEGVISKKPGVKYIRISLANGKGTVEYDPL
LTSPETLREXI

>Manatee (Caribbean) ATP7A [Trichechus manatus]
IVYQPHLITVEEIKKQIEAVGFSVFIKKQPKYLTLGAIDIERLKNTPVRSSEGSEQRSPSYTNDSTATFI
INGMHCKSCVSNIESALSTLQYVSSIAISLENRSANVKYNASLVTPETLRKTIEAISPGQYSVSITSDAE
STPSSPSSSYHQKIPLNIVSQPLTQETVINIGGMTCNSCVQSIEGVISKKAGVKSIQVSLVNSSGIIEYD
PLLNSPETLREAIEN

>Dolphin (Bottle-nosed) ATP7A [Tursiops truncatus]
VVYQPHLITAEEIKKQIEAVGFPAFIKKQPKYLKLGAIDIERLKNTPVKSSEGSQQRSPSYTNNSTVIFI
IDGMHCKSCVSNIENALSTLQYVSSVVVSLENRTATVXXKASLVTPETLRKAIEAISPGQYRVSSTNEIE
STSNSPSSSSLQKSPLSIVSQPLTQETGINIDGMTCNSCVQSIEGVILKKAGVKSIRVSLANGKGIVEYD
PLLTCPETLREAIED
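If you work from cDNA records instead of the protein records, the step from DNA to protein can be sketched as below. This is a minimal illustration, not what NCBI does when it builds a protein record: it assumes the standard genetic code, assumes the sequence starts in the correct reading frame, and reports any codon it cannot read as "X". The names `CODON_TABLE` and `translate` are made up for the example, and the short DNA string is invented.

```python
# Standard genetic code, written compactly: codons are ordered by T, C, A, G
# at each of the three positions, and AMINO lists the matching amino acids.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cdna):
    """Translate a cDNA string codon by codon, starting at the first base."""
    dna = "".join(cdna.split()).upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):  # a trailing partial codon is ignored
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))  # unreadable codon -> X
    return "".join(protein)


# Invented nine-base example: start codon, one alanine codon, stop codon.
print(translate("ATGGCCTAA"))
```

Because three DNA letters make one amino acid, aligning the DNA and aligning the protein can give slightly different pictures of how similar two species are, which is why it is worth trying both.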



 

 

Preparing a nicer-looking figure of your tree using Adobe Acrobat.

When you are done with alignment and tree-building, you may be interested in producing a nicer image of your tree than a screen shot will provide.   If your computer has the software "Adobe Acrobat," or other software that can read postscript files, follow the steps below:


  1. When you ask the Kyoto Website to generate a tree for you, it also generates a link at the top of the new browser window called "PostScript file" (See right).




  2. Right-click on "PostScript file" and then choose "Save Target As," then save it under a filename you will remember in a place you will remember :-)




  3. Right-click on the file you saved and this time select "Open With" and then choose "Adobe Acrobat."


  4. It will take a while to convert the PostScript file; however, the image comes out much cleaner.

  5. If you like what you see, choose "File" ---> "Save As" and select "PNG Files (*.png)"


  6. When you open Microsoft WORD, you will be able to "Insert" ---> "Picture" ---> "From File" and move your tree around in your write-up document.

 

 

 

   

 

 

PART THREE

 


Viewing three dimensional structures of proteins and their sequences.

Some proteins have had their structures determined by X-ray crystallography or Nuclear Magnetic Resonance.  This is an arduous but rewarding endeavor and especially important for understanding enzyme mechanisms or for drug discovery.

 


Cytochrome c - fully oxidized Cytochrome c - fully reduced

  1. Return to the NCBI and this time select the "Structure" database with "Cytochrome C" Equus as your query.

  2. Scroll down until you see 1HRC. It may be on the second or third page.  If you don't find it, click here.  It should open as a Cn3D rotatable image.  If for some reason it doesn't, you may need to install the Cn3D browser plugin. This is a link to version 4.1 for the Windows operating system.  As updated versions are developed, they will be available at the NCBI Cn3D website. Macintosh and Unix versions are available there also.

  3. Under "Style" ---> "Options" ---> "Settings" the display can be radically altered. Here is the space-filling version of Cytochrome C from Horse.
 



© Henrik Kibak 2004
