
 



Introductory Bioinformatics Lab

for High School AP Biology


 
 
Skills developed in this lab:
  • Use of National Center for Biotechnology Information (NCBI) databases
  • Retrieval of sequences from NCBI
  • Alignment of homologous protein sequences using ClustalX
  • Using ClustalX output to prepare phylogenetic networks (trees)
  • Testing evolutionary hypotheses
  • Identification and visualization of evolutionarily conserved structures in proteins

 

PART ONE

 

 

 

 

"Is Euglena a plant or an animal?"

 


Photo courtesy of The Euglenoid Project at Rutgers University. More on the Protist genus Euglena including fantastic images!

Algae are Protists with chloroplasts. Euglena, however, is a protist genus in which some species have chloroplasts and others don't. Many Euglena are heterotrophic and highly motile, which is behavior more animal-like than plant-like.

So, are they more closely related to animals or plants?!! 

Scientists have been debating this question for over 100 years.

We are going to use the amino acid sequence in a protein, Cytochrome C, to try to answer this question.

 


Getting Euglena, animal, and plant sequences to compare...


We are going to use the National Center for Biotechnology Information (NCBI) to obtain the data we need to try to answer this question. Enter this location in your browser: http://www.ncbi.nlm.nih.gov/ or RIGHT CLICK on the link to open it in a new browser window or tab.

To see what is available for Euglena let's enter that search term into the NCBI search box:

 

 

Also refine the search a bit by clicking "Protein" on the drop-down menu and adding the search modifier for "organism"  like this: Euglena [orgn]

 

That should reduce the number of hits a bit. Adding "cytochrome c" with quotes like this should help a lot:

Euglena [orgn] "cytochrome c"

 

Finally, if you add the search modifier for "protein" like this:

Euglena [orgn] "cytochrome c" [prot]

 ...the list should be reduced to a very few hits that include the Cytochrome C sequences for Euglena viridis and Euglena gracilis (see below).
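For anyone who would rather script the search than type it into the web form, here is a minimal sketch of how the three queries above could be turned into search URLs. It assumes the public NCBI E-utilities "esearch" endpoint, which this lab does not otherwise use, and the helper names build_query and esearch_url are made up for illustration. The code only builds and prints the URLs; it does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumption: the public NCBI E-utilities search endpoint (not part of this lab).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query(organism, phrase=None, field=None):
    """Assemble an Entrez query string step by step, as in the text above."""
    term = f"{organism} [orgn]"
    if phrase:
        term += f' "{phrase}"'      # quoted phrase, e.g. "cytochrome c"
        if field:
            term += f" [{field}]"   # field modifier, e.g. [prot]
    return term


def esearch_url(term, db="protein"):
    """Return a URL that asks E-utilities to run `term` against database `db`."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


# The three progressively narrower searches from the lab.
for term in (
    build_query("Euglena"),
    build_query("Euglena", "cytochrome c"),
    build_query("Euglena", "cytochrome c", "prot"),
):
    print(term)
    print("   ", esearch_url(term))
```

Pasting one of the printed URLs into a browser should return a list of matching record IDs rather than the usual results page.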

At this point it is worth beginning to take some notes... If you examine the sequence record you will see that there is an accession number. Writing down "Euglena viridis Cytochrome C P22342" in your lab book can make your life much easier later. Using the accession number P22342 whenever you search or communicate your results will ensure that the exact sequence you used is known. Write down "Euglena gracilis Cytochrome C P00076" as well.

  • Click on P22342... You should see a page with a lot of scientific information we call "metadata," or information about the amino acid sequence. The actual sequence data is at the bottom of the page.
Sequences are available in a variety of formats which are selected via the "Display" button. Microsoft WORD and other word processing programs insert all kinds of special hidden characters into documents that cause data processing software to become confused. Therefore we will use very simple ASCII (pronounced "ASK-EE") text editors, such as Notepad, to work with our sequence data.

 

  • Open Notepad: START --> PROGRAMS --> ACCESSORIES --> Notepad

 

  • Then go back to the browser open to the NCBI Cytochrome C sequence from Euglena viridis.

 

  • Click on the Display drop-down menu and select "FASTA."

 

 

  • Copy and paste the FASTA format sequence into Notepad.
  • Simplify the header by editing it down to Euglena_viridis like this:

 

  • Then save the untitled.txt as Cyt_C_Eug_vir.txt in a folder called Sequences you have created somewhere on your hard drive where you can find it again (Perhaps write down the pathname).
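The copy-and-paste steps above can also be done by a short script, which is handy once you have your accession numbers written down. This is only a sketch: it assumes the public NCBI E-utilities "efetch" endpoint (not mentioned elsewhere in this lab), and the function names fasta_url and download_fasta are hypothetical. The download function is defined but never called here, so running the block does not touch the network.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the public NCBI E-utilities fetch endpoint (not part of this lab).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def fasta_url(accession, db="protein"):
    """Build a URL that requests one record as plain-text FASTA."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def download_fasta(accession, path):
    """Fetch the record and save it as a plain ASCII text file (needs internet)."""
    with urlopen(fasta_url(accession)) as response:
        text = response.read().decode("ascii")
    with open(path, "w", encoding="ascii") as handle:
        handle.write(text)


# Only print the URL; call download_fasta yourself when you are online, e.g.
# download_fasta("P22342", "Cyt_C_Eug_vir.txt")
print(fasta_url("P22342"))
```

The header of a downloaded file will still be the long NCBI one, so it needs the same simplification to Euglena_viridis described above.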

As you can see, a FASTA format file looks like this:

 Line 1: >The name and information goes here. Always begins with a ">"
 Line 2: HERE IS THE SEQUENCE ONLY, NOTHING ELSE, ESPECIALLY NO EMPTY LINES
		  

It is important to respect this format because in our next step we will put the sequences we want to compare all into the same Notepad text file. The format helps the computer programs distinguish one sequence from another.
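To see why the format matters, here is a minimal sketch of how a program might read a FASTA file. It is not how Clustal does it internally; parse_fasta is a name made up for this example. Every line starting with ">" opens a new record, and every other line is glued onto the sequence of the current record.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # this sketch tolerates blank lines; many programs do not
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))  # close previous record
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before any '>' header line")
        else:
            chunks.append(line)  # a sequence may be split over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Two of the Cytochrome C sequences used in this lab, with simplified headers.
example = """>Arabidopsis
MASFDEAPPGNPKAGEKIFRTKCAQCHTVEKGAGHKQGPNLNGLFGRQSGTTPGYSYSAA
NKSMAVNWEEKTLYDYLLNPKKYIPGTKMVFPGLKKPQDRADLIAYLKEGTA
>E_gracilis
GDAERGKKLFESRAAQCHSAQKGVNSTGPSLWGVYGRTSGSVPGYAYSNANKNAAIVWEE
ETLHKFLENPKKYVPGTKMAFAGIKAKKDRQDIIAYMKTLKD
"""

for name, sequence in parse_fasta(example):
    print(name, len(sequence), "residues")
```

If a ">" goes missing, two sequences silently merge into one long sequence, which is exactly the kind of confusion the format is meant to prevent.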

In order to save time we will copy and paste the five sequences below into the Euglena viridis file and save them into a single master file for us to use in this exercise.  Follow the numbered steps below the sequences.


The Cytochrome C sequences we will use:

>Arabidopsis gi|4539007 Cytochrome c [Arabidopsis thaliana]
MASFDEAPPGNPKAGEKIFRTKCAQCHTVEKGAGHKQGPNLNGLFGRQSGTTPGYSYSAA
NKSMAVNWEEKTLYDYLLNPKKYIPGTKMVFPGLKKPQDRADLIAYLKEGTA

>Monkey (Silvered Leaf) Q7YR71 Cytochrome c [Trachypithecus cristatus] 
MGDVEKGKKILIMKCSQCHTVEKGGKHKTGPNHHGLFGRKTGQAPGYSYTAANKNKGITWGEDTLMEYLE
NPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE

>E_gracilis P00076 Cytochrome c [Euglena gracilis]
GDAERGKKLFESRAAQCHSAQKGVNSTGPSLWGVYGRTSGSVPGYAYSNANKNAAIVWEE
ETLHKFLENPKKYVPGTKMAFAGIKAKKDRQDIIAYMKTLKD

>Mosquito gi|31202411|ref|XP_310154.1| [Anopheles gambiae]
MGVPAGDVEKGKKLFVQRCAQCHTVEAGGKHKVGPNLHGLFGRKTGQAAGFSYTDANKAK
GITWNEDTLFEYLENPKKYIPGTKMVFAGLKKPQERGDLIAYLKSATK

>Rice gi|218249 Cytochrome C [Oryza sativa (japonica cultivar-group)]
MASFSEAPPGNPKAGEKIFKTKCAQCHTVDKGAGHKQGPNLNGLFGRQSGTTPGYSYSTA
NKNMAVIWEENTLYDYLLNPKKYIPGTKMVFPGLKKPQERADLISYLKEATS

 


Preparing the comparison by aligning the Cytochrome C sequences from two plants, two animals, and two Euglenas


  1. Copy and paste all five sequences above into Notepad below the Euglena viridis sequence.  It will help a lot later if you insert a reasonable name right after the ">" such as "mosquito" or "monkey."
  2. Save the file as "all_six.txt" or something similar, maybe "AraEugMonEugMosRice.txt" in your sequences folder, but leave it open.

  3. Now open the website at: http://align.genome.jp/ (Kyoto University Genome Net)
  4. Copy and paste all six sequences from the Notepad window into the Clustal window at the Kyoto University Genome Net website.

  5. Before we ask the program to prepare the alignment for us, we should do some cleaning up of the title lines for the sequences... that will make the trees that are produced much more aesthetically pleasing. Edit each line so there is just the ">" and the name of the organism.

  6. Click "Execute Multiple Alignment."
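The title-line cleanup in the steps above is easy to do by hand for six sequences, but it can also be automated. The sketch below keeps only the first word after each ">" and leaves the sequence lines alone. The function name simplify_headers is made up, and the rule "first word is the organism name" is an assumption that happens to fit the headers used on this page; the sequence lines in the example are cut short to save space.

```python
def simplify_headers(fasta_text):
    """Shorten every FASTA title line to '>' plus its first word."""
    cleaned = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            words = line[1:].split()
            # Assumption: the first word is the name we want on the tree.
            line = ">" + (words[0] if words else "unnamed")
        cleaned.append(line)
    return "\n".join(cleaned) + "\n"


# Headers copied from this page; sequences truncated for the example.
messy = """>Monkey (Silvered Leaf) Q7YR71 Cytochrome c [Trachypithecus cristatus]
MGDVEKGKKILIMKCSQCHTVEKGG
>Mosquito gi|31202411|ref|XP_310154.1| [Anopheles gambiae]
MGVPAGDVEKGKKLFVQRCAQCHTV
"""

print(simplify_headers(messy))
```

Names with spaces, such as "Euglena viridis", would be cut to "Euglena" by this rule, which is one reason to write them with an underscore (Euglena_viridis).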

Although there can be some machine errors... Clustal does a fairly good job of aligning the sequences. Notice that the program has rearranged the order of the sequences and grouped the two plants together, the two animals together, and the two Euglena together. Not surprisingly the software has been able to identify the organisms that are most closely related on the basis of their Cytochrome C amino acid sequences.

Notice that there is a region "NPKKYIPGTKM" that is nearly identical in all six organisms! That can be interpreted as a region that was already present in the common ancestor of plants and animals and which cannot be changed without affecting the survival of the organism. Such sites are often critical to the function of the protein, and they can make good drug targets.
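You can check the conserved region yourself without running an alignment. The sketch below slides the motif along each of the five sequences printed earlier on this page (Euglena viridis is left out because its sequence is not printed here) and reports the stretch that matches best. The helper name best_match is made up for this example, and simple mismatch counting is a stand-in for what a real alignment program does.

```python
MOTIF = "NPKKYIPGTKM"

# The five Cytochrome C sequences given earlier on this page.
SEQS = {
    "Arabidopsis": "MASFDEAPPGNPKAGEKIFRTKCAQCHTVEKGAGHKQGPNLNGLFGRQSGTTPGYSYSAA"
                   "NKSMAVNWEEKTLYDYLLNPKKYIPGTKMVFPGLKKPQDRADLIAYLKEGTA",
    "Monkey": "MGDVEKGKKILIMKCSQCHTVEKGGKHKTGPNHHGLFGRKTGQAPGYSYTAANKNKGITWGEDTLMEYLE"
              "NPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE",
    "E_gracilis": "GDAERGKKLFESRAAQCHSAQKGVNSTGPSLWGVYGRTSGSVPGYAYSNANKNAAIVWEE"
                  "ETLHKFLENPKKYVPGTKMAFAGIKAKKDRQDIIAYMKTLKD",
    "Mosquito": "MGVPAGDVEKGKKLFVQRCAQCHTVEAGGKHKVGPNLHGLFGRKTGQAAGFSYTDANKAK"
                "GITWNEDTLFEYLENPKKYIPGTKMVFAGLKKPQERGDLIAYLKSATK",
    "Rice": "MASFSEAPPGNPKAGEKIFKTKCAQCHTVDKGAGHKQGPNLNGLFGRQSGTTPGYSYSTA"
            "NKNMAVIWEENTLYDYLLNPKKYIPGTKMVFPGLKKPQERADLISYLKEATS",
}


def best_match(seq, motif):
    """Return (mismatches, start, window) for the window most similar to motif."""
    best = None
    for start in range(len(seq) - len(motif) + 1):
        window = seq[start:start + len(motif)]
        mismatches = sum(a != b for a, b in zip(window, motif))
        if best is None or mismatches < best[0]:
            best = (mismatches, start, window)
    return best


for name, seq in SEQS.items():
    mismatches, start, window = best_match(seq, MOTIF)
    print(f"{name:12s} position {start + 1:3d}  {window}  mismatches: {mismatches}")
```

In these five sequences the only difference inside the motif is in Euglena gracilis, which has a V where the others have an I.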

Preparing phylogenetic trees based on the sequence comparison


ClustalW also compares the aligned sequences and measures how different they are from each other.  The more differences, the less related they should be, and the more distant they should appear on a phylogenetic tree.  The program first finds the two most related sequences then adds the next most related "neighbor" sequence.  It calculates a difference score and outputs a little file of brackets and numbers that show the relationships and degree of relationship in the form of "branch lengths."
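The idea of a "difference score" can be tried out on a small scale. The sketch below is a simplification, not ClustalW's actual calculation: it counts the fraction of positions that differ between two equal-length stretches (sometimes called a p-distance) and then picks the closest pair, which is where tree building starts. As an assumption, it uses the first 30 residues starting at the conserved NPKKY... region of each sequence printed on this page, because that stretch lines up without gaps; the names p_distance and pairwise_distances are made up for this example.

```python
from itertools import combinations

WINDOW = 30  # assumption: compare only the first 30 residues of each tail

# C-terminal ends of the five Cytochrome C sequences printed on this page,
# each starting at the conserved NPKKY... region.
TAILS = {
    "Arabidopsis": "NPKKYIPGTKMVFPGLKKPQDRADLIAYLKEGTA",
    "Monkey": "NPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE",
    "E_gracilis": "NPKKYVPGTKMAFAGIKAKKDRQDIIAYMKTLKD",
    "Mosquito": "NPKKYIPGTKMVFAGLKKPQERGDLIAYLKSATK",
    "Rice": "NPKKYIPGTKMVFPGLKKPQERADLISYLKEATS",
}


def p_distance(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def pairwise_distances(aligned):
    """Distance for every pair of names in the `aligned` dictionary."""
    return {
        (m, n): p_distance(aligned[m], aligned[n])
        for m, n in combinations(aligned, 2)
    }


aligned = {name: tail[:WINDOW] for name, tail in TAILS.items()}
distances = pairwise_distances(aligned)

# Print the pairs from most similar to least similar.
for pair, dist in sorted(distances.items(), key=lambda item: item[1]):
    print(f"{pair[0]:12s} {pair[1]:12s} {dist:.3f}")

closest = min(distances, key=distances.get)
print("Closest pair:", closest)
```

On this short stretch the two plants come out as the closest pair. A real analysis uses the full alignment, gaps included, so the numbers from the website will differ from these.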


  1. Scroll down to the bottom of the "results" web page. You should see a drop down menu like this one:

     

  2. Select the Unrooted Neighbor Joining Tree and click "Exec" - and then try the others (Don't click "Generate Profile HMM," as that will take up a lot of time and CPU effort). What do you think about the results?

The distance from Euglena to Monkey is very slightly less than the distance to the plants... and it's the same as the distance to the Mosquito...  so, one could argue that based on a comparison of Cytochrome C sequences, Euglena is an animal.

Are you convinced?!!

 


 

 

PART TWO

 

 

Now for the more interesting question which you will answer on your own!


"Are Hippopotami the closest relatives of Whales and Dolphins?"


You may wish to prepare for this project by reading the short scientific paper on the Ancestry of Whales by Kenneth Rose.

For a project like this, Cytochrome C would probably be useless. It changes so little during evolution that it is essentially the same for all mammals. On the other hand, Pancreatic Ribonuclease is an enzyme that exhibits just the right amount of variability in mammals. It is found only in organisms with a pancreas... which rules out plants and mosquitoes, and pretty much leaves chordates.


Let's Review your "Plan of Attack!"

1. Collect Sequences for Comparison

2. Align (=compare) Sequences Using ClustalW

3. Ask ClustalW to draw the tree supported by the alignment.

4. Prepare the writeup describing your answer to the question and the data you base your conclusion upon.

 

If we are short on time you may click on these sequences to speed your completion of this project.


If there is time, you are encouraged to choose a different project for your lab... some of you may choose to work with fish, insects or plants... Or, perhaps the most challenging and interesting of all, comparing whales, seals, bears, weasels... However, be aware that it will take you a lot of extra time since you will have to find your own sequences to compare and confirm that they are what you think they are... not always easy for beginners.
 


Your write-up should consist of:

  • A hypothesis, for example, "Whales and dolphins form a sister group with Hippopotami within the Artiodactyls."
  • A background paragraph explaining the controversy or question you are attempting to resolve.
  • The protein (or RNA) you chose to use for the analysis and why. For example, what is Pancreatic Ribonuclease?  Has it been used before for phylogenetic analysis? Why choose it for this taxonomic group? Hint: why wouldn't you use Pancreatic Ribonuclease to answer the Euglena question above? If possible, include a 3D image of your protein (see below).
  • The species you chose and why. Please prepare a figure showing the FASTA format collection of sequences using 8 pt. Courier font. Also include a photo of your organism.
  • A figure showing the alignment you produced. The caption should describe how the alignment was prepared. In the discussion, you should point out the highly conserved regions... can you speculate on why they are conserved?
  • A figure showing the resulting tree.  The caption should describe how the figure was prepared.
  • In the discussion conclude whether your hypothesis was supported or not, and provide suggestions for additional comparisons along with your reasoning.
  • Literature cited. 

 

 

 

How about another interesting dataset to try? Remember, you can pick and choose from among this set... no need to run them all (See the paper from the O'Brien Lab on the Origins of Placentals [2001]). Important! Notice that these sequences are cDNA! You can try to run alignments using the DNA and check your results, or you can click on the accession number, and then on the protein ID link (it's down in the CDS section)... finally convert them to FASTA.

>Rhinoceros (white) ATP7A [Ceratotherium simum]
IVYQPHLITVQEIKKQIEAAGFPAFIKKQPKFLKLGAIDIERLKNTPVKSSERPQQRSPSYTSDSTVTFI
VDGMHCKSCVSNIESALSTLQYISSIVVSLENRSAIVKYNASLVTPETLRKAIEAVSPGQYRVNITSEVE
STSNSPSSSSLQKIPLNIVSQPLTQETVINIDGMTCNSCVQSIEGVISKKAGVKSIRVSLSNGNGTVEYD
PLLTSPETLRKAIED

>Horse ATP7A [Equus caballus]
IVYQPHLITVEEIKKQIEAAGFPAFIKKQPKFLKLGAIDIERLKNTPVKSSERPQQRSPSCTNDSAVTFI
VDGMHCKSCVSNIESALSTLQYVSSVVVSLENRSAIVKYNASLVTPETLRKAIEAISPGQYRVSFPSEVE
STSNSPSGSSLHKIPLNIVSQPLTQETVINIDGMTCNSCVQSIEGVISKKAGVKSIRVSLANGNGTVEYD
PLLTSPETLRKAIED

>Hippopotamus ATP7A [Hippopotamus amphibius]
IVYQPHLITAEEIKKQIEAVGFPAFIRKQPKYLKLGAIDIERLKNTPVKSSEGSQQRSPSYTNNSTVVFI
IDGMHCKSCVSNIESALSTLQYVSSVVVSLENRSAVVKYNASLVTPETLRKAIETMSPGQYKVSSTSEIE
STSNSPSSSSLQKSPLNIVSQPLTQETVINIDGMTCNSCVQSIEGVISKKAGVKSIRVSLANSKGTVEYD
PLLTSPETLREAIED

>Elephant (African) ATP7A [Loxodonta africana]
IIYQPHLITAEEIKKQIEAVGFSAFIKKQPKYLTLGAIDVERLKNTPVRYSEGSEQRSPSYTNDSTATFI
INGMHCKSCVSNIESALSTLQYVSSIAISLENRSATVKYNASLVTPETLRKAIEAVSPGQYSVSITSDVE
STPSSPFSSYHQQIPLNIVSQPLTQETVINIGGMTCNSCVQSIEGVISEKAGVKSIRVSLANSSGVIEYD
PLLNSPETLREAIEN

>Whale (Humpback) ATP7A [Megaptera novaeangliae]
VVYQPHLITAEEIKKQIEAVGFPAFIKKQPKYLRLGAIDIERLKNTPVKSSEGSQQRSPSYTNNSTVIFI
IDGMHCKSCVSNIESALSTLQYVSSVVVSLENRSATVKYNASLVTPETLRKAIEAISPGQYRVSSTSEIE
STSNSPSSSSLQKSPLNIVSQPLTQETVINIDGMTCNSCVQSIEGVISKKAGVKSIRVSLANGKGTVEYD
PLLTSPETLREAIED

>Okapi (Giraffe family) ATP7A [Okapia johnstoni]
VVYQPHLITAEEIKKQIEAVGFTAFIKKQPKYLKLGAIDIERLKNTPVKSSEGSQQRSPSSTSNSTVIFT
IDGMHCKSCVSNIESALSTFQHISSVVVSLENKSAIVKYNANLVTPEALRKAIEAISQGQYRVSTASDVG
STSNSPSSSSLQKSPLNVVSQPLTQETVINIDGMTCNSCVQSIEGVLSKKAGVKSVQVSLANGKGTVEYD
PLLTSPETLREAIED

>Pig (note "X" at 2nd to last residue) ATP7A [Sus scrofa]
YQPHLITVEEIKKQIEAVGFPVFIKKQPKYLKLGAIDIERLKNTPVKSLEGSPQRSTSYTNNSTVIFIID
GMHCKSCVSNIESALSTLQYVSSIVVSLENRTAIVKYNASLVTPETLRKAIEDISPGQYRVTSTSDIECT
SNSPSSSSLQKSPLNIVSQPLTQEAVINIDGMTCNSCVQSIEGVISKKPGVKYIRISLANGKGTVEYDPL
LTSPETLREXI

>Manatee (Caribbean) ATP7A [Trichechus manatus]
IVYQPHLITVEEIKKQIEAVGFSVFIKKQPKYLTLGAIDIERLKNTPVRSSEGSEQRSPSYTNDSTATFI
INGMHCKSCVSNIESALSTLQYVSSIAISLENRSANVKYNASLVTPETLRKTIEAISPGQYSVSITSDAE
STPSSPSSSYHQKIPLNIVSQPLTQETVINIGGMTCNSCVQSIEGVISKKAGVKSIQVSLVNSSGIIEYD
PLLNSPETLREAIEN

>Dolphin (Bottle-nosed) ATP7A [Tursiops truncatus]
VVYQPHLITAEEIKKQIEAVGFPAFIKKQPKYLKLGAIDIERLKNTPVKSSEGSQQRSPSYTNNSTVIFI
IDGMHCKSCVSNIENALSTLQYVSSVVVSLENRTATVXXKASLVTPETLRKAIEAISPGQYRVSSTNEIE
STSNSPSSSSLQKSPLSIVSQPLTQETGINIDGMTCNSCVQSIEGVILKKAGVKSIRVSLANGKGIVEYD
PLLTCPETLREAIED
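If you decide to start from the cDNA records rather than the protein links, it helps to see how a coding DNA sequence maps onto amino acids. The sketch below uses the standard genetic code and assumes the sequence starts in the correct reading frame. The function name translate and the short DNA string in the example are made up for illustration; they are not taken from any of the records above. Codons containing unknown bases become "X", the same symbol that appears in the pig and dolphin sequences.

```python
from itertools import product

# Standard genetic code, listed in T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cdna):
    """Translate a coding DNA sequence, stopping at the first stop codon."""
    cdna = cdna.upper().replace("U", "T")  # accept RNA spelling too
    protein = []
    for i in range(0, len(cdna) - 2, 3):
        aa = CODON_TABLE.get(cdna[i:i + 3], "X")  # unknown codon -> X
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical 18-base example, invented to show the mechanics only.
print(translate("ATGGGTGATGTTGAGAAA"))
```

The translated text can then be given a ">" title line and used exactly like the protein FASTA sequences above.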



 

 

Preparing a nicer-looking figure of your tree using Adobe Acrobat.

When you are done with alignment and tree-building, you may be interested in producing a nicer image of your tree than a screen shot will provide. If your computer has the software "Adobe Acrobat," or other software that can read PostScript files, follow the steps below:


  1. When you ask the Kyoto Website to generate a tree for you, it also generates a link at the top of the new browser window called "PostScript file" (See right).




  2. Right-click on "PostScript file" and choose "Save Target As," then save it under a filename you will remember in a place you will remember :-)




  3. Right-click on the file you saved and this time select "Open With" and then choose "Adobe Acrobat."


  4. It will take a while to convert the PostScript file; however, the image comes out much cleaner.

  5. If you like what you see, choose "File" ---> "Save As" and select "PNG Files (*.png)"


  6. When you open Microsoft WORD, you will be able to "Insert" ---> "Picture" ---> "From File" and move your tree around in your write-up document.

 

 

 

   

 



© Henrik Kibak 2004
