Vandalism detection in Wikipedia: a high-performing, feature-rich model and its reduction through Lasso.
In: Proceedings of the 7th International Symposium on Wikis and Open Collaboration, series WikiSym '11, pages 82-90.
ACM, New York, NY, USA, 2011.
Sara Javanmardi, David W. McDonald and Cristina V. Lopes.
User generated content (UGC) constitutes a significant fraction of the Web. However, some wiki-based sites, such as Wikipedia, are so popular that they have become a favorite target of spammers and other vandals. In such popular sites, human vigilance is not enough to combat vandalism, and tools that detect possible vandalism and poor-quality contributions become a necessity. The application of machine learning techniques holds promise for developing efficient online algorithms for better tools to assist users in vandalism detection. We describe an efficient and accurate classifier that performs vandalism detection in UGC sites. We show the results of our classifier in the PAN Wikipedia dataset. We explore the effectiveness of a combination of 66 individual features that produce an AUC of 0.9553 on a test dataset -- the best result to our knowledge. Using Lasso optimization we then reduce our feature-rich model to a much smaller and more efficient model of 28 features that performs almost as well -- the drop in AUC being only 0.005. We describe how this approach can be generalized to other user generated content systems and describe several applications of this classifier to help users identify potential vandalism.
A user-oriented splog filtering based on a machine learning.
In: Proceedings of the 2008/2009 international conference on Social software: recent trends and developments in social software, series BlogTalk'08/09, pages 88-99.
Springer-Verlag, Berlin, Heidelberg, 2010.
Takayuki Yoshinaka, Soichi Ishii, Tomohiro Fukuhara, Hidetaka Masuda and Hiroshi Nakagawa.
A method for filtering spam blogs (splogs) based on a machine learning technique, and its evaluation results are described. Today, spam blogs (splogs) became one of major issues on the Web. The problem of splogs is that values of blog sites are different by people. We propose a novel user-oriented splog filtering method that can adapt each user's preference for valuable blogs. We use the SVM (Support Vector Machine) for creating a personalized splog filter for each user. We had two experiments: (1) an experiment of individual splog judgement, and (2) an experiment for user oriented splog filtering. From the former experiment, we found existence of 'gray' blogs that are needed to treat by persons. From the latter experiment, we found that we can provide appropriate personalized filters by choosing the best feature set for each user. An overview of proposed method, and evaluation results are described.
Using dynamic Markov compression to detect vandalism in the Wikipedia.
In: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, series SIGIR '09, pages 822-823.
ACM, New York, NY, USA, 2009.
Kelly Y. Itakura and Charles L. A. Clarke.
We apply the Dynamic Markov Compression model to detect spam edits in the Wikipedia. The method appears to outperform previous efforts based on compression models, providing performance comparable to methods based on manually constructed rules.
Data mining: concepts and techniques.
2005.
Jiawei Han and Micheline Kamber.
95/46/EC of the European Parliament and of the Council of 24 October 1995 on the Protection of Individuals with Regard to the Processing of Personal Data and on the Free Movement of such Data.
Official Journal of the EC, 23, 1995.
E. U. Directive.