Date of Award

Fall 11-17-2018

Document Type


Degree Name

Doctor of Philosophy (PhD)

First Advisor

Sumeet Dua


Protein sequence data has been produced at an astounding speed. This creates an opportunity to characterize these proteins for the treatment of illness. A crucial characterization of proteins is their post translational modifications (PTM). There are 20 amino acids coded by DNA after coding (translation) nearly every protein is modified at an amino acid level. We focus on three specific PTMs. First is the bonding formed between two cysteine amino acids, thus introducing a loop to the straight chain of a protein. Second, we predict which cysteines can generally be modified (oxidized). Finally, we predict which lysine amino acids are modified by the active form of Vitamin B6 (PLP/pyridoxal-5-phosphate.) Our work aims to predict the PTM's from protein sequencing data. When available, we integrate other data sources to improve prediction.

Data mining finds patterns in data and uses these patterns to give a confidence score to unknown PTMs. There are many steps to data mining; however, our focus is on the feature engineering step i.e. the transforming of raw data into an intelligible form for a prediction algorithm. Our primary innovation is as follows: First, we created the Local Similarity Matrix (LSM), a description of the evolutionarily relatedness of a cysteine and its neighboring amino acids. This feature is taken two at a time and template matched to other cysteine pairs. If they are similar, then we give a high probability of it sharing the same bonding state. LSM is a three step algorithm, 1) a matrix of amino acid probabilities is created for each cysteine and its neighbors from an alignment. 2) We multiply the iv square of the BLOSUM62 matrix diagonal to each of the corresponding amino acids. 3) We z-score normalize the matrix by row.

Next, we innovated the Residue Adjacency Matrix (RAM) for sequential and 3-D space (integration of protein coordinate data). This matrix describes cysteine's neighbors but at much greater distances than most algorithms. It is particularly effective at finding conserved residues that are further away while still remaining a compact description. More data than necessary incurs the curse of dimensionality. RAM runs in O(n) time, making it very useful for large datasets.

Finally, we produced the Windowed Alignment Scoring algorithm (WAS). This is a vector of protein window alignment bit scores. The alignments are one to all. Then we apply dimensionality reduction for gains in speed and performance. WAS uses the BLAST algorithm to align sequences within a window surrounding potential PTMs, in this case PLP attached to Lysine. In the case of WAS, we tried many alignment algorithms and used the approximation that BLAST provides to reduce computational time from months to days. The performances of different alignment algorithms did not vary significantly.

The applications of this work are many. It has been shown that cysteine bonding configurations play a critical role in the folding of proteins. Solving the protein folding problem will help us to find the solution to Alzheimer's disease that is due to a misfolding of the amyloid-beta protein. Cysteine oxidation has been shown to play a role in oxidative stress, a situation when free radicals become too abundant in the body. Oxidative stress leads to chronic illness such as diabetes, cancer, heart disease and Parkinson's. Lysine in concert with PLP catalyzes the aminotransferase reaction. Research suggests that anti-cancer drugs will potentially selectively inhibit this reaction. Others have targeted this reaction for the treatment of epilepsy and addictions.