Human beings are unpredictable, and for those of us in security this poses a problem. Even the most resilient systems with multiple layers of security are subject to failure because of human error (usually incompetence). However, while individuals on their own often make stupid errors that can compromise themselves or your infrastructure as a whole, groups of individuals often reach an intelligent consensus. This phenomenon is known as group intelligence, and it can be a powerful tool for identifying evil in data-driven communities such as content-hosting sites, public forums, image-boards, and torrent sites.
The current solution to such problems involves hiring moderators to crawl comments and discussions and remove any spam or harmful download links from the site. Although this solution works, it is time-consuming and, depending on the size of your site, it can be expensive as well. In the context of forums and image-boards, moderator bots are also used, but these bots usually fire only on a few keywords or combinations of words, and they don’t persist or analyze this potentially useful data later. To make this data useful you need a way to persist, correlate, and quantify it. You also need a feedback loop that learns from new content.
To avoid confusion I will use a torrent download site as an easy-to-understand illustration. Most torrent sites host multiple torrents and allow their users to rate and comment on each. The mechanism we will use to rate this content is called sentiment analysis, which gives us a negative or positive integer rating based on the “feeling” of each individual comment. These comment “feelings” are calculated from content-specific criteria of good and bad words and signatures. The overall rating of the content can then be calculated by adding up the ratings of the individual comments.
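The aggregation step can be sketched in a few lines of Python. This is only an illustration: the function name overall_rating and the sample ratings are made up for the example, and the per-comment ratings are assumed to have been produced already by some sentiment analysis step.

```python
def overall_rating(comment_ratings):
    """Add up per-comment sentiment ratings into one rating for the content."""
    return sum(comment_ratings)


# Hypothetical per-comment ratings for a single torrent (not real data).
ratings = [8, 3, -5, 0]
print(overall_rating(ratings))  # prints 6
```

A plain sum means a handful of strongly negative comments can outweigh many mildly positive ones, which may or may not be what you want for your site.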
Here is a very simplified wordlist containing a few keywords and their values (negative, positive, or neutral). The values used in the example that follows are good = +3, copy = 0, not = -1, and malware = -5.
Now let’s use this small amount of context to analyze the following comment.
“This is a good copy does not contain malware”
The words in this comment that our sentiment analysis algorithm understands are “good”, “copy”, “not”, and “malware”:
(good + copy) + (not * malware) = (3 + 0) + (-1 * -5) = 3 + 5 = +8 rating
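The same calculation can be sketched in Python. The word values are taken from the arithmetic above; everything else is an assumption made for illustration: the names WORD_VALUES, NEGATORS, and score_comment are hypothetical, and the rule that a negator multiplies the next known word (skipping unknown words such as “contain”) is just one simple way to reproduce the result, not a complete treatment of negation.

```python
import re

# Word values from the worked example; a real wordlist would be much larger.
WORD_VALUES = {"good": 3, "copy": 0, "malware": -5}
# Assumption: a negator multiplies the value of the next known word.
NEGATORS = {"not": -1}


def score_comment(text):
    """Return an integer sentiment rating for a single comment."""
    multiplier = 1
    total = 0
    for token in re.findall(r"[a-z']+", text.lower()):
        if token in NEGATORS:
            # Remember the negation until the next known word appears.
            multiplier *= NEGATORS[token]
        elif token in WORD_VALUES:
            total += multiplier * WORD_VALUES[token]
            multiplier = 1
        # Unknown words are ignored and leave a pending negation in place.
    return total


print(score_comment("This is a good copy does not contain malware"))  # prints 8
```

Without the “not”, the same function scores “contains malware” at -5, so the negation handling is what turns the comment from a warning into an endorsement.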
Obviously, the flexible syntax of natural language can pose problems for even the most advanced natural language processing algorithms. However, as long as you can correctly interpret the majority of your content, you do not need to worry much about these outliers. After all, we are trying to find a consensus.
Once you have a consensus you can use this data to advance your knowledge base through learning algorithms. For example, suppose you have one hundred comments for a particular torrent. Eighty of these comments received positive scores and twenty received negative scores. Think about what this tells us: with a population of this size we can reasonably say the content is both safe and of good quality. We can now use association algorithms to look for commonly recurring unknown words in our newly identified positive comments. From there we can say with some certainty that these newly identified words are positive, and add them to our existing positive wordlist. This information is persisted and used in the next analysis cycle. The same concept can be applied to negative words as well, and more specific “feelings” can be assigned to word groups by custom signatures.
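One learning cycle might look roughly like the following Python sketch. It uses plain frequency counting as a stand-in for a real association algorithm, and every name and threshold in it (learn_positive_words, min_count, min_positive_share, the stop-word set, the sample comments) is a hypothetical choice made for illustration.

```python
import re
from collections import Counter


def learn_positive_words(scored_comments, known_words, stop_words,
                         min_count=3, min_positive_share=0.8):
    """Suggest unknown words that recur in positively scored comments.

    scored_comments is a list of (text, score) pairs for one piece of content.
    Nothing is suggested unless the comments reach a positive consensus.
    """
    positives = [text for text, score in scored_comments if score > 0]
    if not scored_comments:
        return []
    if len(positives) / len(scored_comments) < min_positive_share:
        return []  # no clear positive consensus, so learn nothing this cycle

    counts = Counter()
    for text in positives:
        # Count each unknown word at most once per comment.
        tokens = set(re.findall(r"[a-z']+", text.lower()))
        counts.update(tokens - known_words - stop_words)
    return sorted(word for word, n in counts.items() if n >= min_count)


# Made-up comments and scores, purely to exercise the function.
comments = [
    ("good copy fast seeders", 3),
    ("good fast download", 3),
    ("fast and good", 3),
    ("fast good copy", 3),
    ("malware inside", -5),
]
known = {"good", "copy", "not", "malware"}
print(learn_positive_words(comments, known, stop_words={"and"}))  # prints ['fast']
```

The returned words would then be added to the positive wordlist and persisted for the next cycle. A more careful version would also check that a candidate word is rare in the negative comments before trusting it.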
The ultimate goal is to make your data somewhat context-aware, where each cycle of analysis builds on the previous one. This way, as the community content on your site grows, so does the overall “intelligence” of the algorithm. In the next few months I will be adding write-ups to this blog on my own “context-aware” security project and sharing what I have learned from the data I have gleaned.