CS572, Winter 2000
Secure methods for authenticating users have been a topic of research since the introduction of multi-user computing systems, but the principles behind such methods have been with society much longer. Traditional finance and banking industries were the first to apply wide-spread user authentication systems, using picture IDs and handwriting signatures as a means of accurately authenticating the users who authorize transactions. Such systems are successful because they are based on the use of physiological and biological data that is largely unique to individuals: the features of a person's face and the features of an individual's handwriting. The machine-measurable features of such physiological and biological data have become to be known collectively as biometrics. Some popular biometrics include voice, iris patterns, and fingerprints.
In most current computer systems that require user authentication, whether they be web-based or traditional computer access terminals, users are authenticated with a username/password pair. This method of authentication relies on the secrecy of the password and, in some cases, the secrecy of the username. The disadvantage of this type of authentication in relation to biometrics is that all of the authentication tokens are easily transferable from one user to another, whether transfered inadvertently or not.
With the increasing need for reliable Internet commerce and the advent of ``smart cards,'' it is the finance industry that is once again providing the impetus for systems that can accurately classify biometric data for use in authentication of users. Most of the systems currently available are expensive to deploy, because special hardware is needed to reliably collect the most common biometrics.
User typing patterns can be used as a ``cheap'' biometric. Proposed as early as 1980 , this biometric requires no additional hardware. And since users are already accustomed to typing in their username/password pair or an account/PIN pair to authenticate themselves to computer systems, adding keystroke pattern authentication comes at no cost to the end user.
A good summary of this research area was published in , in which Joyce and Gupta were able to show an imposter pass rate of 0.25 percent (2 out of 810 unauthorized attempts gained access) and a false alarm rate of 16.36 percent (27 out of every 165 valid attempts were denied). More recently, some additions to the system described by Joyce and Gupta were proposed, such as continuous monitoring and updating of the user keystroke model .
A chancery survey of the 30 or so papers that have been published on this topic shows that the majority use techniques that wouldn't traditionally fall under the machine learning umbrella. The notable exceptions include , which used a Markov model to classify users, and , in which Brown and Rogers experimented with using neural networks to learn to identify users by their typing patterns. Their results were better than previous attempts, but their methods had some shortcomings that are difficult to overcome in a real world application. Some of these shortcomings and a comparison to their results will be presented in Section 6.
Universal to virtually all the techniques discussed in various methods of measuring user typing patterns is the concept of a key digraph, which is defined as two adjacent keystrokes. When a user types the sequence ionlyread, for example, the digraphs which can be obtained are io, on, nl, ly, yr, re, ea, and ad. The most commonly used metric of a digraph is the time between the pressing of the first key in the digraph and the pressing of the second key. There are a total of six metrics that can be used, each of which is illustrated by Figure 1. Note that these six metrics can be collected regardless of the order that the press and release events occur in. Extending the digraph model to use all sequences of three consecutive keys or even to use all the keys typed for a single word have also been proposed.
Methods for collecting a user's keystroke ``signature'' include requiring the user to type long texts, collecting traces transparently as the user works normally, and collecting traces only for a static set of words (usually the login typing pattern).
Keystroke identification can be applied to several proposed domains. Static methods apply the authentication once, usually during user login or after the user has locked the terminal. Dynamic methods apply authentication continuously, constantly monitoring the user's typing input and locking the terminal once the classifier determines that the pattern doesn't reflect that of the user.
Models for keystroke pattern representation vary from lookup tables with entries for means and variances of every possible digraph to storage of unmodified keystroke traces.
There are also several issues relating to the technology present for collecting keystroke traces. In many of the earlier studies, only key-press events were available. Additionally, many operating systems report key events with a best possible resolution of 10 milliseconds or greater.
The application for collecting keystrokes for this study is shown in Figure 2. As can be seen from the screen shot, the traces consist of the user's username, a password, and the static psuedo-word ``ionlyread.'' The last entry is used primarily as a control, but leads to some interesting experiments and results as discussed in Sections 4 and 5. Participants submit a number of traces through this application so that a valid sample of their keystroke patterns can be obtained.
An example trace is shown in Table 1. The first column displays the event, which is either a PRES(s) or a RELE(ase). The second and third columns show the character typed and the key pressed, respectively. The last column contains a time stamp, in standard Unix epoch time format.1
Once the traces are collected, the key events can be paired into press-release events, and then into digraphs as previously discussed. The digraphs are arranged to represent the tokens typed by the user, consisting of their name, password, and ``ionlyread''. The six digraph statistics are combined by addition to form a single value for each digraph, so that the representation for one user's 20 preliminary ``ionlyread'' samples can be shown in Figure 3.
For comparison, two more user digraph patterns for the ``ionlyread'' are shown in Figure 4. As can be seen, there are some similarities between all three examples. The natural pause between y and r (the phrase ``i only read'' from which the ``ionlyread'' psuedo-word is derived is broken by a word boundary here, and the user switches from right to left hands at this same point) has a higher average latency for all users, as expected. There are also some unique characteristics for each user, such as the consistently fast ``ad'' digraph for Michael in comparison to Rachel and Bruce.
The presence of samples with high variance from the user's average is evident in the training data of all three users. Some of this variance can be eliminated simply by throwing out the earliest samples, more by using statistical means or by discarding traces which are closer to the samples of other users than they are to the samples of the correct user. No samples are discarded during the training or classifying stages of the algorithm presented here.
A simple weighted k-nearest neighbor algorithm is used to classify examples. Nearest neighbor algorithms are resilient to noise, and can also be very accurate when applied to features that are easily distinguished. Additionally, there is no immediate cost for the addition of new samples to the training set, though the cost is felt during classification.
Every new example is compared with its k-nearest neighbors. Nearness is determined by adding the summed squared error between the example's digraphs and the summed digraphs of any candidate samples. The k-nearest neighbors are weighted according to their distance from the example, and the sample with the highest score is chosen.
Each of the three tokens in the username/password/''ionlyread'' triple are classified separately, then modified k-nearest neighbor is performed on these classifications. The modification simply adds a threshold with which to accept or reject examples. For the experiments described in the next section, the threshold was 2/3, meaning that at least two of the three tokens must match the chosen classification.
Participation in the experiment was open. Participants could access the keystroke trace collector from any Java enabled browser. Eleven users chose to participate. All users were asked to submit data for a username/password/''ionlyread'' triple 20 times. After throwing out 3 samples because of errors, this left us with 217 examples.
The first experiment attempted to determine whether a user could be uniquely identified given only the user's typing of the static ``ionlyread'' token. The first 15 traces from each user were used as training data for the learning algorithm, the last 5 as validation examples.
Some of the earlier methods for authentication of users via their keystroke digraphs involved computing a mean and variance for each digraph over several samples, then accepting the user if their sample fell within 0.5 standard deviations of the mean. Using a similar method and tuning the weights of the six digraph statistics by hand, 74% of the validation examples were correctly classified.
Using the k-nearest neighbor implementation described in Section 3, that percentage jumped to 92%. Furthermore, reducing the training set from 15 examples to 5 and increasing the validation set from 5 to 15 examples only reduced the accuracy to 86%, suggesting that the algorithm works fairly accurately even with limited training data.
Since none of the original data included traces of users attempting to log in as impostors with someone else's username/password/''ionlyread'' triple, the users were invited back to make attempts at successfully logging in as impostors. Each user who wished to participate was given access to another user's tokens, then allowed 5 attempts at successfully logging in with those tokens. The imposter experiment could be repeated by the user as many times as desired, allowing the user unlimited imposter attempts, but the user for which they could attempt logging in was chosen randomly for each set of 5 attempts. 95 imposter logins were attempted. Of those, 4 were successful, resulting in a 4.2% imposter pass rate. All four successful impostors came from one single imposter/valid_user pair.
Classification of the 5 submitted samples typically took only 1-2 seconds of computation time.
It at first seems like a more accurate result could be obtained by computing the k-nearest neighbor to each digraph separately, then weighting these according to distance, and finally performing a k-nearest neighbor on the nearest neighbors to all the digraphs. This, however, results in a lower overall accuracy, likely because the value for single digraphs overlaps significantly in the training data, and the likelihood of the k-nearest neighbors being incorrect is higher. Error is effectively multiplied by performing this k-nearest-k-nearest neighbor method. For an intuitive feel for why this is the case, see Figure 5. The lines connecting individual values have been removed not only to make the graph readable, but to emphasize that this method would ``forget'' which trace a sample digraph belongs to.
The most interesting results in each experiment come from examining the false alarms (users who are rejected when they shouldn't be) and the successful impostors. In both instances, the same users make up the bulk of the problem. The user Bruce was often confused for other users. In the first experiment, Bruce was classified as Todd three times. These misclassifications resulted in 3 of the 6 false alarms reported in the first experiment. Removing Bruce from the experiment resulted in boosting the accuracy to 95%.
The graphs which describe Bruce and Todd are shown in Figure 6. As can be seen, the two graphs are very similar. The only real notable difference is the quicker average time for Todd with the digraph ``ad''. For the rest of the token, there is a lot of overlap. This is a problem that is not easy to overcome, especially since some of it can be attributed to the coarse grained resolution of the timing events from which the traces were collected. Finer-grained traces may differentiate these users more effectively.
Interestingly, Todd was never confused for Bruce. These examples need further study, but for now it seems like some additional tuning may be sufficient for differentiating these users and other similar examples.
In the case of the successful impostors, Bruce was the sole user able to log in as another user, this time as Rachel. Removing Bruce from the experiment lowers the imposter pass rate to 0%. The traces for these two users are shown in Figure 4.
Some of the other misclassifications resulted from examples that were only distantly related to any of the training examples. Setting a rather generous minimum distance threshold can immediately reject 2 more of the false positives. While this doesn't improve the false alarm rate, it may further reduce the imposter pass rate.
Brown and Rogers  reported imposter pass rates of 0% and false alarm rates of at best 11.5%. While the results of this work are slightly better for false alarm rates, the 0% imposter pass rate is the more important of the two performance measurements, since unauthorized access is a graver problem than occasionally rejecting a valid user. They achieved the low imposter pass rate by requiring their test subjects to submit a number of imposter attempts about equal to the number of valid logins. These traces were used to set minimum thresholds for similarity based on the impostors that passed classification; if an imposter was classified correctly as the target user, the distance to that user's valid samples in the future must be less than the distance to the imposter. This was carried out as a pre-processing step.
A similar method could be applied to the k-nearest neighbor algorithm described here. It's effect would be to eliminate the user Bruce, which would boost the results as described in Section 5. But there is a problem with this method - it requires that all users submit imposter attempts that cover the other users uniformly, and it requires that this occur statically at the beginning of the process. Such a technique would not be practical for most real-world systems, in which the userbase is in constant flux. Asking a certain percentage of current users to make imposter attempts every time a new user is added to the system is impractical at best, and insufficient at worst.
Additionally, there are some problems with using a neural network for a fluctuating instance base. Neural networks are very efficient computationally once trained, but training can be computationally expensive. Retraining a neural network when new users are added to the system or when old users are removed can be costly. While incremental neural network training methods do exist, Brown and Rogers did not address this issue, nor did they report the cost of training on their examples. An instance-based approach such as the weighted k-nearest neighbor implementation described in this paper has the advantage that adding new instances or removing old ones is trivial. The disadvantage is that all the training is done at classification time. The performance results of k-nearest neighbor as reported at the end of Section 4 are promising even with the knowledge of the computational costs, and indexing methods such as kd-tree  promise to decrease the computation time even further.
Nevertheless, the technique presented here is inferior to that presented in  in several ways. First, weighted k-nearest neighbor is a lazy learner, and all the work must be done at classification time. Much of the work involves iterating through all the samples in the training set to find the nearest neighbors. This task is proportional to the number of users in the system. Although there are ways to offset some of this work, none of them have yet been applied. Second, the work presented here used only 11 test subjects. Brown and Rogers used 46 test subjects. There is little doubt that increasing the number of test subjects increases computation time and makes the classification task more difficult.
The most serious limitation of this work is the small number of individuals involved in the experiment. The experiments will soon be carried out on a larger pool of test subjects.
Additionally, it may be useful to add some dynamic shuffling of the training examples, replacing older, unused or erroneous samples with newer correctly classified samples. Such an addition could be added by simply maintaining a count for each user's samples, and occasionally replacing the sample least used in correct classification with a fresh sample obtained while the user logs in.
Work should be carried out in determining appropriate thresholds for immediately rejecting examples. Currently, an example that is 1000 times more distant than any sample yet seen would still be classified according to its nearest neighbors. An examination of some of the current samples reveals that even a generous threshold will have no other effect than immediately removing examples that would be misclassified anyway. The effect of such an improvement should be measured according to how it affects both the false alarm rate and the imposter pass rate.
Methods to eliminate overlap of classification, such as that described for the users Todd and Bruce (Section 5) seems like a promising avenue of study. Clustering and segmenting algorithms may be of some help here.
Lastly, it appears that some examination of additional features might be helpful. One such attribute includes the slope of the line connecting adjacent digraphs. Further study of classification of individual digraphs and digraph statistics is also needed.
This document was generated using the LaTeX2HTML translator Version 98.1p1 release (March 2nd, 1998)
Copyright © 1993, 1994, 1995, 1996, 1997, Nikos Drakos, Computer Based Learning Unit, University of Leeds.
The command line arguments were:
latex2html -split 1 -no_navigation KeystrokePaper.tex.
The translation was initiated by Alen Peacock on 2000-04-12