Date of Award

Winter 2009

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Computational Analysis and Modeling

First Advisor

Vir Phoha

Abstract

In this dissertation, we present two methods for identifying computer users from their keystroke patterns. In the first method, "Competition between naïve Bayes models for user identification," a naïve Bayes model is created for each user. In the training phase, each user's model is trained by maximum likelihood estimation on the key press latency values extracted from the texts typed by that user. In the identification phase, we compute, for each user, the likelihood that the typed text was produced by that user, and the typed text is assigned to the user with the highest likelihood. In the second method, "Similarity based user identification," each user is likewise represented by a distinct model. In the training phase, the model parameters of a user are estimated from the key press latency values extracted from the texts typed by that user. In the identification phase, we assign each user a similarity score for a given typed text. The similarity score of a user is the ratio of (1) the number of key press latency values extracted from the typed text that are similar to the estimated model parameters of the user to (2) the total number of key press latency values extracted from the typed text. Finally, the typed text is assigned to the user with the highest similarity score.
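The two identification schemes can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the abstract does not state the latency distribution, the definition of "similar," or how unseen key pairs are handled. The sketch therefore assumes a Gaussian per key pair (digraph), treats a latency as similar when it lies within k standard deviations of the user's mean, and applies a fixed penalty to key pairs missing from a user's model. The function names, the standard deviation floor, and the toy data are all hypothetical.

```python
import math
from collections import defaultdict


def train(samples):
    """Estimate a (mean, std) pair per digraph from (digraph, latency_ms) tuples.

    Assumption: latencies of each digraph are modeled as Gaussian.
    """
    grouped = defaultdict(list)
    for digraph, latency in samples:
        grouped[digraph].append(latency)
    model = {}
    for digraph, values in grouped.items():
        mean = sum(values) / len(values)
        # Maximum likelihood variance divides by n, not n - 1.
        var = sum((v - mean) ** 2 for v in values) / len(values)
        # Assumption: floor the std at 1 ms so a zero variance cannot occur.
        model[digraph] = (mean, max(math.sqrt(var), 1.0))
    return model


def log_likelihood(model, typed, unseen_log_density=-20.0):
    """Naive Bayes score: sum of per-latency Gaussian log densities."""
    total = 0.0
    for digraph, latency in typed:
        if digraph not in model:
            # Assumption: fixed penalty for digraphs absent from the model.
            total += unseen_log_density
            continue
        mean, std = model[digraph]
        total += (-0.5 * math.log(2 * math.pi * std ** 2)
                  - (latency - mean) ** 2 / (2 * std ** 2))
    return total


def similarity_score(model, typed, k=2.0):
    """Fraction of typed latencies within k std of the user's mean.

    Assumption: "similar" means |latency - mean| <= k * std.
    """
    if not typed:
        return 0.0
    similar = sum(
        1 for digraph, latency in typed
        if digraph in model
        and abs(latency - model[digraph][0]) <= k * model[digraph][1]
    )
    return similar / len(typed)


def identify(models, typed, score):
    """Assign the typed text to the user whose model scores highest."""
    return max(models, key=lambda user: score(models[user], typed))


# Toy data (hypothetical): two users and one unlabeled typing sample.
models = {
    "alice": train([("th", 100), ("th", 110), ("he", 90), ("he", 95)]),
    "bob": train([("th", 200), ("th", 210), ("he", 180), ("he", 190)]),
}
typed = [("th", 105), ("he", 92)]
print(identify(models, typed, log_likelihood))
print(identify(models, typed, similarity_score))
```

Both schemes share the same training step and differ only in the scoring function passed to `identify`, which is why the score is a parameter here.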

We also present a novel application of a distance-based outlier detection method for discarding outliers in the key press latency values extracted from a user's typed text. Outliers are detected using the following three-step procedure: (1) for each extracted latency value xi, a neighborhood region is created using a distance threshold; (2) a latency value xj is considered a neighbor of xi if xj falls in the neighborhood region of xi; and (3) the latency value xi is considered an outlier if the number of neighbors determined for xi is less than a preset threshold.
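The three-step procedure above can be sketched as follows. This is an illustrative reading only: it assumes the neighborhood region of xi is the interval within `distance` of xi, and that xi does not count as its own neighbor. The function name, parameter names, and the example values are hypothetical.

```python
def remove_outliers(latencies, distance, min_neighbors):
    """Discard latency values that have fewer than min_neighbors neighbors.

    Assumptions: xj is a neighbor of xi when |xj - xi| <= distance,
    and a value is never counted as a neighbor of itself.
    """
    kept = []
    for i, xi in enumerate(latencies):
        # Steps 1 and 2: count the values inside the neighborhood of xi.
        neighbors = sum(
            1 for j, xj in enumerate(latencies)
            if j != i and abs(xj - xi) <= distance
        )
        # Step 3: xi is an outlier if it has too few neighbors.
        if neighbors >= min_neighbors:
            kept.append(xi)
    return kept


# Hypothetical latencies in milliseconds; 400 has no neighbors within 10 ms.
print(remove_outliers([100, 105, 98, 102, 400], distance=10, min_neighbors=2))
```

The pairwise comparison is quadratic in the number of latency values, which is acceptable for the small per-key samples typical of a single typing session.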

To empirically evaluate the proposed methods, a keystroke dataset was collected from ten users, each of whom provided 15 typing samples. From these samples, six distinct datasets were created in which the number of user identification attempts varied from 150 to 54600. Results on these datasets indicate that the identification accuracy of the "Competition between naïve Bayes models for user identification" method ranges from 89.62% to 99.65% and that of the "Similarity based user identification" method ranges from 96.33% to 100%. Further, the performance of our two proposed user identification methods is compared with that of two user identification methods reported in the recent literature.

To further improve the performance of the user identification methods, we theoretically analyze Majority Voting Rule (MVR) based fusion of two or more user identification methods. We formulate a procedure for theoretically estimating the identification accuracy of MVR-based fusion. Unlike the procedure presented in the literature on MVR-based fusion, our procedure does not assume that the methods to be fused have identical identification accuracies. The theoretically estimated identification accuracy of the MVR-based fusion is then analyzed in light of the empirical results.
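As an illustration of estimating fused accuracy without assuming identical accuracies, the sketch below computes the probability that a strict majority of methods is correct. It is one standard calculation, not necessarily the procedure developed in the dissertation. It assumes that the methods err independently and that the fused decision is correct only when more than half of the methods are correct. In a multi-user setting this is a conservative estimate, because a correct plurality can win without a strict majority. The function name and the example accuracies are hypothetical.

```python
def mvr_accuracy(accuracies):
    """Probability that a strict majority of independent methods is correct.

    accuracies: per-method identification accuracies in [0, 1], which
    may differ from one another.
    """
    # dist[k] holds the probability that exactly k methods are correct.
    dist = [1.0]
    for p in accuracies:
        nxt = [0.0] * (len(dist) + 1)
        for k, prob in enumerate(dist):
            nxt[k] += prob * (1 - p)   # this method is wrong
            nxt[k + 1] += prob * p     # this method is correct
        dist = nxt
    n = len(accuracies)
    # Sum the cases with more than n / 2 correct methods.
    return sum(dist[k] for k in range(n // 2 + 1, n + 1))


# Hypothetical accuracies for three methods.
print(mvr_accuracy([0.9, 0.8, 0.7]))
```

The dynamic programming pass builds the distribution of the number of correct methods one method at a time, which avoids enumerating every subset of methods.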
