Security, Sentiment Analysis, and Machine Learning

Human beings are unpredictable, and for those of us in security this poses a problem. Even the most resilient systems with multiple layers of security are subject to failure because of human error (usually incompetence). However, where individuals acting alone often make careless mistakes that can compromise themselves or your infrastructure as a whole, groups of individuals together often reach intelligent consensuses. This phenomenon is known as group intelligence, and it can be a powerful tool for identifying evil in data-driven communities such as content-hosting sites, public forums, image boards, and torrent sites.

The current solution to such problems involves hiring moderators to crawl comments and discussions and remove any spam or harmful download links from the site. Although this solution works, it is time-consuming, and depending on the size of your site it can be expensive as well. Forums and image boards also use moderator bots, but these bots usually only fire on a few key words or combinations of words, and don’t persist or analyze this potentially useful data later. To make this data useful you need a way to persist, correlate, and quantify it. You also need a feedback loop that learns from new content.

For the sake of avoiding confusion I will use a torrent site as an easy-to-understand illustration. Most torrent sites host many torrents and allow their users to rate and comment on each one. The mechanism we will use to rate this content is called sentiment analysis, which gives us a negative or positive integer rating based on the “feeling” of each individual comment. These comment “feelings” are calculated from a content-specific list of good and bad words and signatures. The overall rating of the content can then be calculated by adding up the ratings of the individual comments.


Here is a very simplified wordlist containing a few key words and their values: negative, positive, or neutral.


no -1
not -1
never -1

amazing +5
good +3
excellent +5

awful -5
bad -3
horrible -5
malware -5

copy 0
video 0
software 0

Now let’s use this small amount of context to analyze the following comment.

“This is a good copy does not contain malware”

The words in this comment that our sentiment analysis algorithm understands are: good, copy, not, and malware.


(good + copy) + (not * malware) = (3 + 0) + (-1 * -5) = + 8 rating
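This scoring scheme can be sketched in a few lines of Python. It is a toy illustration: the wordlist and the negation-as-multiplier rule come straight from the example above, and the function names are made up.

```python
# Toy sentiment scorer using the wordlist above. Negation words act as
# multipliers on the next scored word; all other scores are summed.
WORDLIST = {
    "no": -1, "not": -1, "never": -1,
    "amazing": 5, "good": 3, "excellent": 5,
    "awful": -5, "bad": -3, "horrible": -5, "malware": -5,
    "copy": 0, "video": 0, "software": 0,
}
NEGATORS = {"no", "not", "never"}

def score_comment(comment):
    """Return an integer sentiment rating for a single comment."""
    score = 0
    multiplier = 1
    for word in comment.lower().split():
        word = word.strip(".,!?\"'")
        if word in NEGATORS:
            multiplier = WORDLIST[word]  # e.g. "not" -> -1
        elif word in WORDLIST:
            score += multiplier * WORDLIST[word]
            multiplier = 1
    return score

print(score_comment("This is a good copy does not contain malware"))  # 8
```

Running it on the example comment reproduces the +8 rating: (3 + 0) from "good copy" plus (-1 × -5) from "not … malware".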


Obviously, the flexible syntax of natural language can pose problems to even the most advanced natural language processing algorithms. However, as long as you are able to correctly interpret the majority of your content, you do not need to worry much about these outliers; after all, we are trying to find a consensus.

Once you have a consensus you can use this data to advance your knowledge base through learning algorithms. For example, suppose you have one hundred comments for a particular torrent. Eighty of these comments received positive scores, twenty negative. Think about what this tells us: with a population of this size we can reasonably say the content is both safe and good quality. We can now use association algorithms to look for commonly recurring unknown words in our newly identified positive comments. From there we can say with some certainty that these newly identified words are positive, and add them to our existing positive wordlist. This information is persisted and used in the next analysis cycle. The same concept can be applied to negative words as well, and more specific “feelings” can be assigned to word groups by custom signatures.
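A minimal sketch of that association step might look like the following. The `KNOWN` wordlist, the threshold, and the default value assigned to newly learned words are all made-up illustrations, not a prescribed algorithm.

```python
from collections import Counter

# Sketch of the association step: count unknown words across comments that
# already scored positive, and promote the frequent ones to the wordlist.
KNOWN = {"good": 3, "not": -1, "malware": -5, "copy": 0}

def learn_positive_words(positive_comments, min_count=3, value=1):
    """Return {word: value} for unknown words recurring in positive comments."""
    counts = Counter()
    for comment in positive_comments:
        # set() so a word is counted once per comment, not once per use
        for word in set(comment.lower().split()):
            if word not in KNOWN:
                counts[word] += 1
    return {w: value for w, c in counts.items() if c >= min_count}

positive = ["great seed fast download", "fast seed good", "fast upload seed"]
print(learn_positive_words(positive))  # {'seed': 1, 'fast': 1}
```

The returned words would then be merged into the existing wordlist before the next analysis cycle, which is what makes each cycle build on the last.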

The ultimate goal is to make your data somewhat context aware, where each cycle of analysis builds on the previous cycle. This way, as the community content on your site grows, so does the overall “intelligence” of the algorithm. In the next few months I will be adding write-ups to this blog on my own “context-aware” security project, and share what I have learned from the data I have gleaned.


~Jamin Becker

Linksys & Netgear Backdoor by the Numbers

If you’d like to just skip to the data, feel free to scroll on down. Research is not endorsed by or attributable to $DayJob 🙂

After reading Rick Lawshae’s post on Hunting Botnets with ZMAP, I started wondering what types of cool things ZMAP can be used for. It wasn’t but a day or two later that something fell into my hands. On January first, Eloi Vanderbeken posted his findings on a backdoor that listens on TCP port 32764. The backdoor appears to affect older versions of Netgear and Linksys routers, but some users are reporting that other brands are affected as well. Eloi also wrote a Python script that can check for the vulnerability, among other functions. To get more info on the backdoor and how Eloi discovered it, you can check it out here:

Once I had wrapped up my reading on his work, I got excited. I realized that I finally had a way of answering a question that usually goes unanswered. Almost every couple of months you hear someone say, “There’s another backdoor in XYZ product!” and that’s about when the media blows up, PR statements are released, Snort sigs are written, and we all wait for the first exploits to start rolling out.

I know I don’t speak for everyone, but I feel like the general mindset is that when a major backdoor or ZeroDay starts to make headlines, hundreds of thousands, maybe millions, of users are affected by the vulnerability. With this in mind I set out to answer the question, “How bad is it?”

Step one was to figure out how to use Zmap, so I installed it on my Kali VM and gave it a shot. I followed the extremely simple instructions on their webpage, and in one line I had my scan configured: “$ zmap -p 32764 -o OpenPorts.csv”.

I then went to my VPS provider of choice and purchased a VPS that had a gigabit connection to the intertubes. I loaded up a vanilla install of Ubuntu Server 12.04 and installed Zmap. Before I launched the scan, I made sure to read the Scanning Best Practices section of the Zmap documentation, which lists things such as “Conduct scans no longer than time needed for research” and “Should someone ask you to cease scanning their environment, add their range to your blacklist”.

The scan took roughly 22 hours to complete. The Zmap documentation and advertising state that you can get it done in less than an hour, but I think they used a cluster setup, and besides, 22 hours isn’t bad by any means. 22 hours and 13 abuse complaints later (all complaints were acknowledged and scanning was ceased), I had my list of roughly 1.5 million IP addresses that currently had TCP port 32764 open. 1.5 million… I thought to myself, “That’s a pretty big number.”

I knew that this probably wasn’t statistically accurate, though, because there had been no validation that the backdoor was actually the service listening on those open ports. To help validate how many of our 1.5 million hosts were actually vulnerable, I pulled in my friend Farel (Jamin) Becker.

Using Eloi’s findings, Jamin was able to write some scripts in bash and Python that allowed us to quickly check the 1.5 million hosts for the vulnerability. They did this by simply reaching out to the port and looking for indicators that the service was running. No exploitation or malicious actions were taken against the vulnerable routers. Our check was comparable to connecting to a web page.
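The shape of such a check might look like the sketch below. The “ScMM”/“MMcS” signature bytes come from Eloi’s write-up of the backdoor’s responses, but the probe payload and function name here are placeholders for illustration, not the actual checker we ran.

```python
import socket

# Hedged sketch of a non-intrusive check: connect to TCP/32764, send a
# short probe, and look for the signature bytes in whatever comes back.
# No commands are sent to or executed on the remote device.
def looks_vulnerable(ip, port=32764, timeout=5):
    """Return True if the service banner starts with the known signature."""
    try:
        with socket.create_connection((ip, port), timeout=timeout) as s:
            s.settimeout(timeout)
            s.sendall(b"\x00" * 12)   # placeholder probe message
            banner = s.recv(12)
            return banner[:4] in (b"ScMM", b"MMcS")
    except OSError:                    # refused, unreachable, or timed out
        return False
```

A host that fails to connect, times out, or answers with anything else is simply counted as not vulnerable, which keeps the check conservative.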

To effectively check for the vulnerable service, Jamin’s scripts functioned by splitting the list of 1.5 million IPs into roughly 2000 different lists. The system then spun up 2000 independent instances of Python to perform the work. To do this we needed a pretty beefy computer, so we rented the top EC2 instance we could find. Needless to say, it worked beautifully and only cost about $2.40 for the hour it took to complete the validation.

This is where the real data comes in. My first thought was, “Oh man, here comes the part where we get to tell the world 400,000 routers are vulnerable RIGHT NOW!” The results were actually quite surprising. It turns out that only 4,998 routers were exposed and vulnerable. Safe to say I expected more, and I feel most would too. Below is some statistical data around what Jamin and I found. Geo data was gathered by querying the MaxMind database.




-Max Rogers & Jamin Becker
