3. Sequence comparison

 

Central dogma: DNA – RNA – Protein

DNA: ATGGGAGTTCTG...
RNA: AUGGGAGUUCUG...
PRO:  M  G  V  L  ...
  • Three bases (= one codon) correspond to one amino acid.
  • A protein is a sequence of amino acids.
  • Proteins are major constituents of living organisms.
  • Many proteins are “enzymes”, which act as catalysts that convert substances within a cell.
  • Enzymes are responsible for “metabolism”.
  • The protein chain folds into a particular shape that depends on the properties of its amino acids.
  • Twenty types of amino acids are used in vivo.
A	Ala	Alanine
C	Cys	Cysteine
D	Asp	Aspartate
E	Glu	Glutamate
F	Phe	Phenylalanine
G	Gly	Glycine
H	His	Histidine
I	Ile	Isoleucine
K	Lys	Lysine
L	Leu	Leucine
M	Met	Methionine
N	Asn	Asparagine
P	Pro	Proline
Q	Gln	Glutamine
R	Arg	Arginine
S	Ser	Serine
T	Thr	Threonine
V	Val	Valine
W	Trp	Tryptophan
Y	Tyr	Tyrosine
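
The DNA, RNA and protein lines at the top of this section can be reproduced with a short script. This is a minimal sketch that assumes the standard genetic code; the names `transcribe` and `translate` are illustrative and do not come from any library.

```python
from itertools import product

# Standard genetic code, with codons enumerated in U, C, A, G order ("*" = stop).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Return the mRNA matching the coding DNA strand (T replaced by U)."""
    return dna.upper().replace("T", "U")


def translate(rna: str) -> str:
    """Translate RNA codon by codon; incomplete trailing codons are ignored."""
    usable = len(rna) - len(rna) % 3
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, usable, 3))


dna = "ATGGGAGTTCTG"
rna = transcribe(dna)
protein = translate(rna)
print("DNA:", dna)
print("RNA:", rna)      # AUGGGAGUUCUG
print("PRO:", protein)  # MGVL
```

The one-letter codes in the output are the same ones listed in the amino acid table above.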

Genome annotation

  • Structural annotation
    Annotation describing the structure of the gene
  • Functional annotation
    Annotation describing gene function

Similarity search

Conventional method to predict the structure of genes

  • Sequence regions similar to known genes are probably genes.
    • Orthologs are genes in different species that evolved from a common ancestral gene by speciation. Orthologs normally retain the same function in the course of evolution, so identifying them is critical for reliable prediction of gene function in newly sequenced genomes.
    • Paralogs are genes related by duplication within a genome. Whereas orthologs normally retain the same function, paralogs tend to evolve new functions, even if these are related to the original one.

Basic Local Alignment Search Tool (BLAST)

  • Why is BLAST so popular? It offers a good balance of sensitivity and speed.
  • See: [Movie] Webinar: A Practical Guide to NCBI BLAST by NCBI
  • Program options for BLAST
    Program	Query	Database	Summary
    BLASTN	nucleotide	nucleotide	No conversion is done on the query or database
    BLASTP	protein	protein	No conversion is done on the query or database
    BLASTX	nucleotide	protein	All six reading frames are translated on the query and used to search the database
    TBLASTN	protein	nucleotide	All six frames are translated in the database and searched with the protein sequence
    TBLASTX	nucleotide	nucleotide	All six frames are translated on the query and on the database
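
To make "all six reading frames" concrete, the sketch below translates a nucleotide sequence in three frames on the forward strand and three on the reverse complement. It illustrates the idea only and is not BLAST's own implementation; it assumes the standard genetic code and input containing only A, C, G and T, and the frame labels (`+1` to `-3`) and function names are chosen here for illustration.

```python
from itertools import product

# Standard genetic code on DNA codons, enumerated in T, C, A, G order ("*" = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate DNA codon by codon; incomplete trailing codons are ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def six_frame_translation(dna: str) -> dict:
    """Map frame labels (+1, +2, +3, -1, -2, -3) to translated peptides."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for label, peptide in six_frame_translation("ATGGGAGTTCTG").items():
    print(label, peptide)
```

For the example sequence, frame `+1` gives `MGVL`, matching the translation shown in the central dogma example.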

Training

NCBI BLAST

  1. Open https://blast.ncbi.nlm.nih.gov/Blast.cgi.
  2. Select “protein BLAST”.
  3. Copy and paste the following sequence into the window (cmd-C, then cmd-V).
    >opsin Rh2(Drosophila melanogaster)
    MERSHLPETPFDLAHSGPRFQAQSSGNGSVLDNVLPDMAHLVNPYWSRFAPMDPMMSKIL
    GLFTLAIMIISCCGNGVVVYIFGGTKSLRTPANLLVLNLAFSDFCMMASQSPVMIINFYY
    ETWVLGPLWCDIYAGCGSLFGCVSIWSMCMIAFDRYNVIVKGINGTPMTIKTSIMKILFI
    WMMAVFWTVMPLIGWSAYVPEGNLTACSIDYMTRMWNPRSYLITYSLFVYYTPLFLICYS
    YWFIIAAVAAHEKAMREQAKKMNVKSLRSSEDCDKSAEGKLAKVALTTISLWFMAWTPYL
    VICYFGLFKIDGLTPLTTIWGATFAKTSAVYNPIVYGISHPKYRIVLKEKCPMCVFGNTD
    EPKPDAPASDTETTSEADSKA
  4. Under “Choose Search Set”, select “UniProtKB/Swiss-Prot(swissprot)” as the database.
  5. Click the “BLAST” button in the lower left to execute.
  6. “Conserved domains” are shown first (once the BLAST results are returned, they can still be viewed via “Show Conserved Domains” under “Graphic Summary”).
  7. Click the “7tmA_photoreceptors_insect” area in the “Conserved domains” image. The following conserved domains (seven-transmembrane receptors) were found:
    • 7tmA_photoreceptors_insect cd15079 insect photoreceptors R1-R6 and similar proteins
    • 7tm_1 pfam00001 7 transmembrane receptor (rhodopsin family)
    • PHA03087 PHA03087 G protein-coupled chemokine receptor-like protein

  8. When the results are shown, look at “Graphic Summary” and “Descriptions”.
  9. If a “Related Information Gene-associated gene details” link is present in the “Alignment” panel, it leads to information on the gene in NCBI Gene, NCBI's integrated gene database.
  10. From the “Edit and Resubmit” link at the top of the result page, you can narrow down the results by species name, keywords, etc.
  11. Practice: Enter “cat family (taxid: 9681)” in the “Organism” field of “Choose Search Set” and search again to find similar genes in the cat family.
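
The same search settings (protein BLAST, Swiss-Prot, optional organism filter) can also be written down as request parameters. The sketch below only builds a URL-encoded request body and sends nothing. The parameter names (`CMD`, `PROGRAM`, `DATABASE`, `QUERY`, `ENTREZ_QUERY`) and the `txid…[Organism]` filter syntax reflect my understanding of NCBI's BLAST URL API and should be checked against the official documentation before use; `build_blastp_request` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Endpoint from step 1 of the walkthrough above.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_blastp_request(fasta, database="swissprot", taxid=None):
    """Return a URL-encoded body describing a protein BLAST submission.

    Assumption: parameter names follow NCBI's BLAST URL API.
    Nothing is sent over the network here.
    """
    params = {
        "CMD": "Put",          # submit a new search
        "PROGRAM": "blastp",   # protein query against a protein database
        "DATABASE": database,
        "QUERY": fasta,
    }
    if taxid is not None:
        # Restrict hits to one taxon, as in the "Organism" field (step 11).
        params["ENTREZ_QUERY"] = f"txid{taxid}[Organism]"
    return urlencode(params)


# Only the first line of the query sequence is used here, for brevity.
query = (
    ">opsin Rh2(Drosophila melanogaster)\n"
    "MERSHLPETPFDLAHSGPRFQAQSSGNGSVLDNVLPDMAHLVNPYWSRFAPMDPMMSKIL\n"
)
print(BLAST_URL)
print(build_blastp_request(query, taxid=9681))
```

If you do send such requests, follow NCBI's usage guidelines on request frequency; for occasional searches the web form described above is simpler.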

DDBJ BLAST

Compared with NCBI’s BLAST, the output is simpler, but DDBJ’s BLAST is usually faster.

      1. Open http://blast.ddbj.nig.ac.jp/blastp?lang=en (for BLASTP).
      2. Copy and paste the following sequence into the window (cmd-C, then cmd-V).
        >opsin Rh2(Drosophila melanogaster)
        MERSHLPETPFDLAHSGPRFQAQSSGNGSVLDNVLPDMAHLVNPYWSRFAPMDPMMSKIL
        GLFTLAIMIISCCGNGVVVYIFGGTKSLRTPANLLVLNLAFSDFCMMASQSPVMIINFYY
        ETWVLGPLWCDIYAGCGSLFGCVSIWSMCMIAFDRYNVIVKGINGTPMTIKTSIMKILFI
        WMMAVFWTVMPLIGWSAYVPEGNLTACSIDYMTRMWNPRSYLITYSLFVYYTPLFLICYS
        YWFIIAAVAAHEKAMREQAKKMNVKSLRSSEDCDKSAEGKLAKVALTTISLWFMAWTPYL
        VICYFGLFKIDGLTPLTTIWGATFAKTSAVYNPIVYGISHPKYRIVLKEKCPMCVFGNTD
        EPKPDAPASDTETTSEADSKA
      3. Choose “UniProt (Swiss-Prot)” from Data Sets.
      4. Click the “Send to BLAST” button to execute.
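
Both web forms expect the query in FASTA format: a header line starting with `>` followed by the sequence lines. Before pasting, it can help to check that a protein query contains only the 20 one-letter amino acid codes. The following is a minimal sketch using only the standard library; `parse_fasta` and `invalid_residues` are illustrative names, and ambiguity codes such as X or B are deliberately reported as invalid here.

```python
# The 20 one-letter amino acid codes from the table earlier in this section.
AMINO_ACID_LETTERS = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def invalid_residues(sequence):
    """Return the sorted characters that are not one of the 20 amino acid codes."""
    return sorted(set(sequence.upper()) - AMINO_ACID_LETTERS)


# Only the first line of the query sequence is used here, for brevity.
example = """>opsin Rh2(Drosophila melanogaster)
MERSHLPETPFDLAHSGPRFQAQSSGNGSVLDNVLPDMAHLVNPYWSRFAPMDPMMSKIL
"""
for header, sequence in parse_fasta(example):
    print(header, len(sequence), invalid_residues(sequence))
```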

* Also see:

GGRNA

Ultra-fast Google-like full text search engine for genes and transcripts. The web server accepts arbitrary words and phrases, such as gene names, IDs, gene descriptions, annotations of gene and even nucleotide/amino acid sequences through one simple search box, and quickly returns relevant RefSeq transcripts.

