4. Functional annotation


Knowing the function of genes

Few genes have functions that are known in the “real” sense.

See: Review – Why Are There Still Over 1000 Uncharacterized Yeast Genes? by Lourdes Peña-Castillo and Timothy R. Hughes (2007) GENETICS, vol. 176, no. 1, pp. 7-14 https://doi.org/10.1534/genetics.107.074468 (a little old, but a good review).

More than 600 S. cerevisiae ORFs are still annotated as “Putative protein of unknown function” in SGD (search result from YeastMine).

How can gene function be inferred from the sequence?

Again, similarity of sequences is useful, but…

    • A sequence that is similar to a similar sequence (a “similarity of similarities”) may not be similar in function.
    • Practice: Compare the meanings of the terms homology and similarity as applied to biological sequences.
    • A partially matching region may not be related to the function.
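The last point can be checked numerically. The following is a minimal sketch, assuming the aligned region of a hit is already known from a similarity search (1-based, inclusive coordinates). The function name `query_coverage` and the numbers are hypothetical and do not come from any real search.

```python
def query_coverage(query_length, hit_start, hit_end):
    """Return the fraction of the query covered by one aligned region.

    Coordinates are 1-based and inclusive, as in most search reports.
    """
    if not (1 <= hit_start <= hit_end <= query_length):
        raise ValueError("hit coordinates must lie inside the query")
    return (hit_end - hit_start + 1) / query_length


# Hypothetical example: a 400-residue query that matches only at residues 21-60.
coverage = query_coverage(400, 21, 60)
print(f"coverage = {coverage:.2f}")  # prints "coverage = 0.10"
```

A good score over only 10% of the query says little about the rest of the protein, so it is worth checking where the matched region lies before transferring an annotation.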

How to avoid “false positives” when inferring gene functions

1. Use a small but highly reliable library, such as UniProtKB/Swiss-Prot – http://www.uniprot.org/

A well-defined manual curation process is essential to ensure that all manually annotated entries are handled in a consistent manner. This process consists of 6 major mandatory steps: (1) sequence curation, (2) sequence analysis, (3) literature curation, (4) family-based curation, (5) evidence attribution, (6) quality assurance and integration of completed entries. Curation is performed by expert biologists using a range of tools that have been iteratively developed in close collaboration with curators. (from How do we manually annotate a UniProtKB entry?)

See: [Movie] UniProt videos by UniProt

Libraries for similarity searches should be chosen in order of trustworthiness, starting with the most trusted one.

All Swiss-Prot entries are manually curated and handled in a consistent manner, which helps to avoid unreliable annotations.
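One way to apply this ordering to search results is sketched below. It assumes hits are given as (subject ID, E-value) pairs and that subject IDs follow the UniProt FASTA convention, where the prefix `sp` marks reviewed Swiss-Prot entries and `tr` marks unreviewed TrEMBL entries. The function name, the E-value threshold and the identifiers are hypothetical placeholders.

```python
# Trust order: lower number = more trusted. "sp" marks reviewed Swiss-Prot
# entries and "tr" marks unreviewed TrEMBL entries in UniProt FASTA identifiers.
TRUST_RANK = {"sp": 0, "tr": 1}


def pick_annotation_source(hits, max_evalue=1e-5):
    """Choose one hit, preferring the most trusted library, then the lowest E-value.

    hits: list of (subject_id, evalue) tuples, where subject_id looks like
    "sp|ACCESSION|ENTRY_NAME". Returns None if no hit passes the threshold.
    The default threshold is an arbitrary assumption for this sketch.
    """
    usable = [(sid, ev) for sid, ev in hits if ev <= max_evalue]
    if not usable:
        return None

    def sort_key(hit):
        subject_id, evalue = hit
        prefix = subject_id.split("|", 1)[0]
        # Unknown libraries are ranked after every known one.
        return (TRUST_RANK.get(prefix, len(TRUST_RANK)), evalue)

    return min(usable, key=sort_key)


# Placeholder identifiers, invented for illustration only.
hits = [
    ("tr|X00001|X00001_EXAMPLE", 1e-80),
    ("sp|X00002|EXAMPLE_SPECIES", 1e-40),
    ("tr|X00003|X00003_EXAMPLE", 1e-3),
]
print(pick_annotation_source(hits))
```

Here the curated `sp` hit is chosen even though an unreviewed hit has a lower E-value. Whether that trade-off is appropriate depends on the analysis.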

2. Search for subsequences involved in the function of proteins.

Partial patterns in proteins that are involved in function are called motifs or domains.

InterPro is a resource that provides functional analysis of protein sequences by classifying them into families and predicting the presence of domains and important sites. To classify proteins in this way, InterPro uses predictive models, known as signatures, provided by several different databases (referred to as member databases) that make up the InterPro consortium. InterProScan is the software package that allows sequences to be scanned against InterPro’s signatures (from About InterPro)
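The simplest kind of signature is a short sequence pattern. The sketch below translates a subset of the PROSITE pattern syntax into a Python regular expression and scans a sequence with it. It is only an illustration of pattern matching, not how InterProScan works internally (member databases also use profiles and hidden Markov models). The helper names and the test sequence are invented; the pattern used is the PROSITE N-glycosylation site pattern `N-{P}-[ST]-{P}`.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern (common subset) into a Python regex."""
    pattern = pattern.rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchor at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchor at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as "(2)" or "(2,4)".
        match = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residues, repeat = match.groups()
        if residues == "x":
            core = "."                           # any residue
        elif residues.startswith("{") and residues.endswith("}"):
            core = "[^" + residues[1:-1] + "]"   # any residue except these
        else:
            core = residues                      # single residue or [ABC] set
        count = "{" + repeat + "}" if repeat else ""
        parts.append(prefix + core + count + suffix)
    return "".join(parts)


def find_motifs(sequence, prosite_pattern):
    """Return (1-based start, matched text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(prosite_pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]


# Invented test sequence, for illustration only.
print(prosite_to_regex("N-{P}-[ST]-{P}."))                  # N[^P][ST][^P]
print(find_motifs("MKNGSAANPTQNLTW", "N-{P}-[ST]-{P}."))    # [(3, 'NGSA'), (12, 'NLTW')]
```

Such short patterns match by chance quite often, which is one reason to compare the results of several signature types.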

3. Use annotations that come with explicit “evidence”.

See: http://www.geneontology.org/ -> Documentation -> GO: Evidence Code Guide

See: [Movie] Gene Ontology (GO) playlist by Saccharomyces Genome Database

For example:

    • IDA (Inferred from Direct Assay)
    • TAS (Traceable Author Statement)
    • IEA (Inferred from Electronic Annotation)
    • ISS (Inferred from Sequence or Structural Similarity), etc.
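Evidence codes make it easy to filter annotations by how they were obtained. The sketch below assumes annotations in the tab-separated GO Annotation File (GAF) layout, where column 3 is the gene symbol, column 5 the GO ID and column 7 the evidence code. It keeps only codes from the experimental group of the evidence code guide (the high-throughput codes form a separate group and are left out here). The function names and example records are invented placeholders, not real annotations.

```python
# Evidence codes in the "experimental" group of the GO evidence code guide.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def parse_gaf_line(line):
    """Return (symbol, go_id, evidence_code) from one GAF annotation line."""
    columns = line.rstrip("\n").split("\t")
    # GAF columns are tab-separated: 3 = object symbol, 5 = GO ID, 7 = evidence code.
    return columns[2], columns[4], columns[6]


def experimental_only(lines):
    """Keep annotations with experimental evidence; skip '!' comments and blank lines."""
    kept = []
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue
        record = parse_gaf_line(line)
        if record[2] in EXPERIMENTAL_CODES:
            kept.append(record)
    return kept


# Placeholder records, invented for illustration (not real annotations).
example_lines = [
    "!gaf-version: 2.2",
    "DB\tID0001\tGENE1\t\tGO:9999991\tREF:1\tIDA\t\tP",
    "DB\tID0002\tGENE2\t\tGO:9999992\tREF:2\tIEA\t\tP",
    "DB\tID0003\tGENE3\t\tGO:9999993\tREF:3\tISS\t\tF",
]
print(experimental_only(example_lines))  # [('GENE1', 'GO:9999991', 'IDA')]
```

The IEA and ISS records are dropped because neither rests on a direct experiment on the gene product itself.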



Practice: Search a protein sequence against InterPro.

  1. Open InterPro – Search By Sequence
  2. Paste the following sequence into the “Analyse your protein sequence” box
    >opsin Rh2(Drosophila melanogaster)
  3. Click the “Submit” button and wait (it takes some time).
  4. Examine the reported motifs, profiles and GO annotations of the protein in each panel: Protein family membership, Homologous superfamilies, Domains and repeats, Detailed signature matches, Residue annotation, GO term prediction.
  • Advanced: What kinds of programs and databases are used in InterProScan? (hint: About InterPro)
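If the results are downloaded as a tab-separated file, the member databases that reported matches can be listed with a short script. This is a sketch that assumes the InterProScan TSV layout (column 4 = analysis or member database, column 5 = signature accession, columns 7 and 8 = start and stop, column 12 = InterPro accession when present). Please check the column order against the InterProScan documentation before relying on it. The function name and the example rows are invented placeholders.

```python
from collections import defaultdict


def group_matches_by_database(tsv_text):
    """Group signature matches by the member database that reported them."""
    grouped = defaultdict(list)
    for line in tsv_text.splitlines():
        if not line.strip():
            continue
        columns = line.split("\t")
        analysis = columns[3]            # member database (assumed column 4)
        signature = columns[4]           # signature accession (assumed column 5)
        start, stop = int(columns[6]), int(columns[7])
        # Column 12 may be absent or "-" when a signature is not integrated.
        interpro = columns[11] if len(columns) > 11 and columns[11] != "-" else None
        grouped[analysis].append((signature, start, stop, interpro))
    return dict(grouped)


# Placeholder rows, invented for illustration (not real InterProScan output).
example_tsv = "\n".join([
    "\t".join(["query1", "md5-placeholder", "300", "DatabaseA", "SIG0001",
               "example domain", "10", "120", "1e-20", "T", "01-01-2000",
               "IPR_PLACEHOLDER", "example entry"]),
    "\t".join(["query1", "md5-placeholder", "300", "DatabaseB", "SIG0002",
               "example site", "50", "66", "-", "T", "01-01-2000", "-", "-"]),
])

grouped = group_matches_by_database(example_tsv)
for database in sorted(grouped):
    print(database, grouped[database])
```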

Gene Ontology (GO)

Practice: Search for genes related to the circadian clock.

    1. Open Gene Ontology Consortium.
    2. Search the GO data with “circadian clock associated”.
    3. Click the “Genes and gene products” button to get information on the genes and gene products associated with the GO terms.
    4. Click on “CCA1” to see the details under Gene Product Information and Gene Product Associations.
    5. Look at the “Evidence” column and filter the list using the “Evidence” panel (e.g. + “experimental evidence”).