Web user anomaly detection through user profiling with Naive Bayes classifier / by Gregg Victor D. Gabison

By: Gabison, Gregg Victor D [author]
Language: English
Description: viii, 52, [11] leaves : color illustrations ; 28 cm
Content type: text
Media type: unmediated
Carrier type: volume
Subject(s): Data mining | Machine learning -- Statistical methods | Web databases
Genre/Form: Academic theses
DDC classification: 006.312
Dissertation note: Thesis (DIT) -- Cebu Institute of Technology - University, College of Computer Studies, 2017


Includes bibliographical references.

Most organizations today deliver the Internet as one of their basic services to employees and customers. However, such a service has to be managed, monitored, and secured in order to maintain its optimal delivery. Unfortunately, users inadvertently share or expose their credentials, which results in account compromise, allowing other individuals to take over their accounts.
The objective of this paper is to develop a web application service that will determine whether a user account has been compromised, through user profiling or pattern discovery of the user's web utilization with the Naïve Bayes Classifier. Naïve Bayes provides a simple approach with clear semantics that returns impressive classification results compared with other mining algorithms. To validate, we tested the NB classifier against another mining algorithm using WEKA, and it proved more accurate under the two (2) experimentation methods, namely: a) the percentage split and b) cross-validation (10 folds). In the preprocessing activity, two repositories, namely the web activity logs and the user log-ins, are integrated into one data source that serves as the corpus of this application. For the training process, we construct and analyze the visited sites in terms of n-grams, resulting in an idiom-agnostic trigram form. Using the Naïve Bayes Classifier, we adopt the multinomial model, which captures the frequency of words and not just their presence, unlike the Bernoulli model. Comparing against the generated user profile/class, a likelihood score is computed to determine its similarity vis-à-vis the new user web activity. To qualify the generated likelihood score, we used Pearson's correlation coefficient (r), a measure of the strength of the association between two variables.
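For illustration, a minimal sketch of the profiling and scoring pipeline the abstract describes, assuming character-level trigrams over visited hostnames, a multinomial model with add-one smoothing, and Pearson's r over score series; the function names and toy data are illustrative, not taken from the thesis:

    from collections import Counter
    import math

    def ngrams(host, n=3):
        """Character n-grams of a visited site, e.g. trigrams of its hostname."""
        return [host[i:i + n] for i in range(len(host) - n + 1)]

    def build_profile(visits, n=3):
        """Multinomial model: n-gram *frequencies* over all of a user's visits."""
        profile = Counter()
        for host in visits:
            profile.update(ngrams(host, n))
        return profile

    def log_likelihood(profile, visits, n=3, alpha=1.0):
        """Average per-n-gram log-likelihood of new visits under a profile.
        alpha=1.0 is the add-one (Laplace) smoothing the abstract mentions."""
        total = sum(profile.values())
        vocab = len(profile) + 1  # reserve one slot for unseen n-grams
        grams = [g for host in visits for g in ngrams(host, n)]
        score = sum(math.log((profile[g] + alpha) / (total + alpha * vocab))
                    for g in grams)
        return score / max(len(grams), 1)

    def pearson_r(xs, ys):
        """Pearson's correlation coefficient between two score series."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
        sy = math.sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy) if sx and sy else 0.0

    # A profile trained on habitual sites scores familiar visits higher:
    profile = build_profile(["mail.example.com", "news.example.com"])
    print(log_likelihood(profile, ["mail.example.com"]))   # relatively high
    print(log_likelihood(profile, ["casino.win-big.ru"]))  # relatively low

This is the comparison the likelihood score formalizes: new activity is scored against the user's own historical n-gram distribution, and Pearson's r then relates score series between the compared sets.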
The testing process used the remaining dataset, representing ¼ or 25% of the data, as the test set. Four (4) testing methods were undertaken. The first method was the use of a confusion matrix given a computed sample size. The accuracy score was high at 93%, with very few false positives. However, it was observed that scores lower than 50% pointed to web activities not related to the user. The second test made use of three (3) users, with synthetic website visits added to both the training and test sets, for a controlled test of the accuracy of the NB algorithm used in profile creation. The third test determined whether there is a significant difference between generating a profile from trigrams alone versus the combined n-grams (trigrams, 4-grams, and 5-grams). The fourth testing method was a random selection of thirty (30) users and a comparison between their computed likelihood scores and their actual activities in both the training and the testing sets, with reference to the correlation matrix used. Of the users tested, one came out with an overwhelming number of sites scoring "rather weak" to medium (50% and below). Further analysis showed that the sites appearing in testing were not present in the training set, meaning either that these are newly visited sites or a possible case of account compromise.
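As a concrete rendering of the first testing method, a hedged sketch of a binary confusion matrix and the accuracy derived from it; the labels (1 = the user's own activity, 0 = not the user's) and the toy values are hypothetical, not the thesis's data:

    def confusion_matrix(y_true, y_pred):
        """Counts for binary labels: 1 = user's own activity, 0 = anomalous."""
        tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
        tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
        fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
        fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
        return tp, fp, fn, tn

    # Hypothetical 75/25 split evaluation: accuracy = (TP + TN) / all.
    tp, fp, fn, tn = confusion_matrix([1, 1, 0, 1, 0, 1], [1, 1, 0, 0, 0, 1])
    print((tp + tn) / (tp + fp + fn + tn))  # 0.833... for this toy sample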
Overall, the common observation is that the computed likelihood score tends to overestimate the probability, yielding very high scores. A further observation made in this paper suggests that the smoothing value of one (1) used in the NB Classifier algorithm affected the computation, explaining this overestimation phenomenon. Nevertheless, the intention of this paper is not to predict the actual probabilities accurately but rather to determine the relation between the compared sets, in this case the training versus the testing set, in determining whether an account is compromised or not.
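To make the smoothing observation concrete, a small sketch comparing add-one smoothing (alpha = 1) against smaller alpha values for an n-gram absent from the training profile; the counts are illustrative only:

    def smoothed_prob(count, total, vocab, alpha):
        """Laplace-style smoothed multinomial probability estimate."""
        return (count + alpha) / (total + alpha * vocab)

    # An unseen trigram (count = 0) in a profile with 200 observed trigrams
    # across 50 distinct types (+1 slot reserved for unseen n-grams):
    for alpha in (1.0, 0.1, 0.01):
        print(alpha, smoothed_prob(0, 200, 51, alpha))
    # alpha = 1 gives the unseen trigram about 0.004, roughly 80 times the
    # alpha = 0.01 estimate, so unfamiliar activity still scores relatively
    # high; this is consistent with the overestimation the paper reports.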

Keywords: Web data mining, data mining, web log analysis, pattern detection, Naïve Bayes, machine learning, profiling, classification, anomaly detection

