Hierarchical Motif Vectors for Protein Alignment and Functional Classification
The two principal tasks in the computational analysis of proteins from their amino acid sequences are identifying proteins that are related by function or by evolutionary descent, and identifying the specific amino acid combinations, or motifs, that determine protein function. Both tasks have largely been addressed by matching amino acid sequences across proteins. Sequence alignment algorithms compute similarity scores between the sequences of different proteins, and these scores are then used to identify protein subgroups with high within-group similarity. Amino acid combinations observed consistently among the proteins of a functional subgroup constitute that subgroup's sequence motifs, and their presence in a protein indicates its membership in the group. The primary challenge in this kind of analysis is the combinatorial complexity inherent in representing amino acid sequences as words over an alphabet of twenty letters, one for each amino acid.
Rather than evaluating potentially millions of amino acid combinations for functional specificity, we introduce a numerical alternative that characterizes protein structure from amino acid sequences using techniques from multi-scale signal decomposition and statistical learning. The framework is built on hierarchical motif vectors, which capture the variation of the local physico-chemical composition along a protein’s amino acid sequence. This representation makes an extensive library of vector space data processing methods available for rigorously computing the similarity of amino acid sequence motifs, both in aligning amino acid sequences and in identifying motifs specific to functional protein groups.
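As a rough illustration of what a hierarchical motif vector might look like, the sketch below maps each residue to a single physico-chemical value and then describes every sequence position by its deviations from progressively wider local averages. Everything here is an assumption made for illustration: the project itself uses wavelet transforms in Matlab, whereas this Python sketch substitutes differences of moving averages for the wavelet decomposition, uses the Kyte-Doolittle hydropathy index as the only property, and the names `property_signal`, `window_mean` and `motif_vectors` are hypothetical.

```python
# Kyte-Doolittle hydropathy index; chosen here only as an example property.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def property_signal(sequence):
    """Turn an amino acid string into a numeric signal."""
    return [HYDROPATHY[residue] for residue in sequence]


def window_mean(signal, center, half_width):
    """Mean of the signal in a window around `center`, truncated at the ends."""
    lo = max(0, center - half_width)
    hi = min(len(signal), center + half_width + 1)
    return sum(signal[lo:hi]) / (hi - lo)


def motif_vectors(sequence, levels=3):
    """Return one multi-scale vector per residue (illustrative stand-in).

    Components 0..levels-1 hold the detail lost when moving to the next,
    wider moving average (half-widths 1, 2, 4, ...); the last component
    holds the coarsest average. The components sum to the residue's own
    property value.
    """
    signal = property_signal(sequence)
    vectors = []
    for i in range(len(signal)):
        previous = signal[i]
        vector = []
        for level in range(1, levels + 1):
            smooth = window_mean(signal, i, 2 ** (level - 1))
            vector.append(previous - smooth)  # detail at this scale
            previous = smooth
        vector.append(previous)  # coarsest approximation
        vectors.append(vector)
    return vectors


vectors = motif_vectors("MKTAYIAKQR")
```

With `levels=3`, each residue is described by four numbers covering short-, mid- and long-range neighborhoods, so two positions can be compared with ordinary vector distances instead of letter-by-letter matching.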
The project first develops global and local alignment methods for sequences of motif vectors in order to establish correspondence between the underlying amino acid sequences. Next, it identifies hierarchical motif vectors that possess functional or structural specificity in protein groups via quasi-supervised statistical learning. Finally, it formulates a protein classification strategy based on group-specific hierarchical motif vectors.
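Global alignment of two motif vector sequences can be sketched with the classic Needleman-Wunsch dynamic program, replacing the usual substitution matrix with a vector similarity. The scoring rule below (a fixed bonus minus the Euclidean distance), the gap penalty, and the function names are assumptions chosen for illustration; the text does not specify the scoring used in the project.

```python
import math


def similarity(u, v, bonus=1.0):
    """Assumed score: positive for close vectors, negative for distant ones."""
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return bonus - distance


def global_align(a, b, gap=-1.0, bonus=1.0):
    """Needleman-Wunsch over two lists of vectors.

    Returns the optimal score and the list of matched index pairs (i, j).
    """
    n, m = len(a), len(b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + similarity(a[i - 1], b[j - 1], bonus),
                score[i - 1][j] + gap,  # gap in b
                score[i][j - 1] + gap,  # gap in a
            )
    # Trace back from the bottom-right corner to recover matched positions.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        diagonal = score[i - 1][j - 1] + similarity(a[i - 1], b[j - 1], bonus)
        if score[i][j] == diagonal:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


# Toy one-dimensional "motif vectors": the middle element of `first`
# has no counterpart in `second`.
first = [[0.0], [1.0], [2.0]]
second = [[0.0], [2.0]]
total, matched = global_align(first, second)
```

A local variant in the style of Smith-Waterman would clamp each cell at zero and trace back from the highest-scoring cell rather than from the corner.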
The experimental results on both local and global motif vector alignment indicate that the motif vectors characterize the physico-chemical composition along amino acid sequences and allow segments to be associated when they share similar amino acid configurations in short-, mid- and long-range neighborhoods along their respective sequences. Associations can thus be established between sequence segments that have similar functions because their configurations are formed by amino acids with similar physico-chemical properties.
In addition, results on the prediction of N-glycosylation at consensus sequence sites confirm that the hierarchical motif vectors adequately characterize the physico-chemical configurations at and around amino acid sites for functional evaluation. Furthermore, the quasi-supervised learning strategy can sort through the prospective sites of activity and identify the ones with real functional potential based on their respective motif vectors. This strategy is especially well suited to biomedical information processing tasks in which a relatively small collection of experimentally verified examples is available against the backdrop of a very large number of unknown prospects. The quasi-supervised learning algorithm separates the probable prospects from the unlikely ones automatically, with no user intervention.
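The prospective sites mentioned above can be enumerated before any learning takes place. The sketch below assumes that the consensus sequence meant here is the standard N-glycosylation sequon N-X-S/T, where X is any residue except proline; the text itself does not spell the pattern out. It covers only candidate enumeration and does not reproduce the quasi-supervised learning step. The function name `candidate_sites` is hypothetical.

```python
import re

# Asparagine followed by any residue except proline, then serine or threonine.
# The lookahead keeps overlapping candidates (e.g. "NNTS" yields two sites).
SEQUON = re.compile(r"N(?=[^P][ST])")


def candidate_sites(sequence):
    """Return 0-based positions of asparagines in an N-X-S/T sequon."""
    return [match.start() for match in SEQUON.finditer(sequence)]


sites = candidate_sites("MNGSANPTNKT")
```

Each returned position would then be described by its motif vector and passed to the learning stage, which decides which candidates resemble the experimentally verified sites.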
Financial support for this project is provided by the 7th Framework Programme, Marie Curie Actions - International Re-integration Grants (PIRG03-GA-2008-230903).
The software for this project has been developed in the Matlab mathematical analysis environment, whose Wavelet Toolbox provides, among other functions, the wavelet transform of numeric sequences. The quasi-supervised learning algorithm, developed earlier, is being adapted to the goals of this project.