Shoshana Brown, Ph.D.
Research Summary
A Gold Standard Set of Enzyme Superfamilies
Functional characterization of the proteins encoded in the many completely sequenced genomes remains a major challenge. Superfamily analysis is a powerful approach to this problem, but it must be automated to be used efficiently on large datasets. To facilitate the development of automated superfamily analysis methods, we are creating a validation data set from a subset of enzyme superfamilies whose members are functionally related by common aspects of chemistry. Each superfamily in the validation set is chosen to consist of evolutionarily related proteins that share a set of conserved residues known, in characterized members, to be involved in catalysis of a common chemical step. Superfamilies are further divided into families, within which every enzyme catalyzes the same overall reaction. A major issue in creating this gold standard set is the accurate classification of sequences into superfamilies and families. This classification may require information from several sources, including sequence similarity measures, sequence length, the presence or absence of catalytic residues, and the presence or absence of homologs of additional enzymes involved in the same biochemical pathway as a given family. This gold standard set will form the core data set for the Structure-Function Linkage Database (SFLD), developed by our lab in conjunction with the UCSF Computer Graphics Laboratory. The SFLD is designed to explicitly link enzyme sequence, structure, and function in order to facilitate sophisticated computational analysis, such as the prediction of function for uncharacterized proteins.
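The superfamily/family hierarchy and two of the classification criteria mentioned above (catalytic residues and sequence length) can be illustrated with a minimal Python sketch. This is not the SFLD schema or our classification pipeline: the class names, the `has_catalytic_residues` and `candidate_families` helpers, the toy sequences, and the length ranges are all hypothetical and chosen only for illustration. Sequence similarity and pathway context are omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Family:
    """Hypothetical family: all members catalyze the same overall reaction."""
    name: str
    reaction: str
    length_range: Tuple[int, int]  # assumed typical ungapped length (min, max)


@dataclass
class Superfamily:
    """Hypothetical superfamily: members share conserved catalytic residues."""
    name: str
    # Alignment column (0-based) -> expected amino acid; illustrative only.
    catalytic_residues: Dict[int, str]
    families: List[Family] = field(default_factory=list)


def has_catalytic_residues(aligned_seq: str, sf: Superfamily) -> bool:
    """True if every conserved catalytic residue is present in the aligned sequence."""
    return all(
        col < len(aligned_seq) and aligned_seq[col] == aa
        for col, aa in sf.catalytic_residues.items()
    )


def candidate_families(aligned_seq: str, sf: Superfamily) -> List[str]:
    """Names of families compatible with the sequence by residues and length."""
    if not has_catalytic_residues(aligned_seq, sf):
        return []
    length = len(aligned_seq.replace("-", ""))  # ignore alignment gaps
    return [
        fam.name
        for fam in sf.families
        if fam.length_range[0] <= length <= fam.length_range[1]
    ]


# Toy data, invented for this example.
toy = Superfamily(
    name="toy_superfamily",
    catalytic_residues={2: "K", 5: "D"},
    families=[
        Family("family_A", "reaction A", (5, 8)),
        Family("family_B", "reaction B", (20, 30)),
    ],
)

print(candidate_families("MAK-LDGE", toy))  # both catalytic residues present
print(candidate_families("MAR-LDGE", toy))  # catalytic lysine missing
```

In practice, filters like these would only narrow the candidates; the final assignment would still need to weigh the other sources of evidence described above.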
