Category Archives: Projects

Network Analysis #2

Building a decent PCAP analysis engine has turned out to be a lot of work. Since my last post I decided to scrap the Postgres database and RESTful API design in favor of an Elasticsearch backend. The decision was primarily motivated by how poorly the original setup scaled. Every time I wanted to add a new log source I had to create a corresponding model. Initially this seemed simple, but defining constraints became a huge issue. For example, DNS queries sometimes returned hundreds of lines of response data, which would break INSERTs about 1 in 10,000 times. A schema-based backend forced me to define fields for every potential value, but often only a fraction of those fields were populated, leaving tons of null data in my database. Another issue was the absolute shit data-transport protocol I improvised: moving the data from analysis nodes to storage nodes often took twice as long as the actual analysis.

Switching to an Elastic backend was a huge pain, but it ended up being the perfect fit for the unstructured data I was storing. The effort paid off: I no longer have to define new log sources. Instead, the processing nodes translate each extracted log into a list of dictionaries, where each dictionary represents a row in the log. The result is wrapped in a JSON object, given a type, and stored in an analysis index on my Elastic cluster.
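Conceptually, the translation step looks something like the sketch below. This is a minimal illustration rather than the actual processing-node code: the analysis index name, the parse_bro_log helper, and the document layout are assumptions, and the doc_type argument reflects the Elasticsearch 2.x-era Python client.

```python
# Minimal sketch of the log-to-Elasticsearch flow described above.
# The "analysis" index, the parse_bro_log() helper, and the document layout
# are illustrative assumptions, not the project's actual schema.
from elasticsearch import Elasticsearch

es = Elasticsearch()  # defaults to localhost:9200

def parse_bro_log(path):
    """Translate a tab-separated BRO log into a list of dictionaries, one per row."""
    rows, fields = [], []
    with open(path) as f:
        for line in f:
            if line.startswith("#fields"):
                fields = line.rstrip("\n").split("\t")[1:]
            elif not line.startswith("#"):
                rows.append(dict(zip(fields, line.rstrip("\n").split("\t"))))
    return rows

def store_log(log_type, path, submission_id):
    doc = {"type": log_type, "submission": submission_id, "rows": parse_bro_log(path)}
    # doc_type was still part of the client API in the Elasticsearch 2.x/5.x era.
    es.index(index="analysis", doc_type=log_type, body=doc)

store_log("dns", "dns.log", "pcap-1234")
```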

I cut analysis times from 2 minutes to 8 seconds by reading, not skimming, the BRO IDS documentation (really important to do that). Up to this point I had assumed BRO operated only in live mode, and did not realize -r would read PCAP files in offline mode, generating logs without reading directly off the network card. Previously I had been using tcpreplay to replay the PCAP over a physical network interface at max speed. This was fairly inefficient; even with the PF_RING kernel module installed the process took almost 2 minutes. I swapped the tcpreplay method for bro -r and got results almost instantaneously.
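For reference, the offline run boils down to something like the hedged sketch below; the temporary working directory and paths are illustrative, and bro -r simply drops its logs into the directory it is run from.

```python
# Rough sketch: run BRO against a submitted capture in offline mode and collect
# whatever logs it produces. Directory handling here is an illustrative assumption.
import glob
import os
import subprocess
import tempfile

def run_bro_offline(pcap_path):
    workdir = tempfile.mkdtemp(prefix="bro-")
    # -r reads the PCAP offline instead of sniffing an interface,
    # so no tcpreplay or PF_RING is involved.
    subprocess.check_call(["bro", "-r", os.path.abspath(pcap_path)], cwd=workdir)
    return glob.glob(os.path.join(workdir, "*.log"))

print(run_bro_offline("suspicious.pcap"))
```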

Another area I spent a lot of time on was the UI itself, which got a complete redesign. During each stage of the submission and analysis process the interface rearranges itself, displaying only the information relevant to that stage. When analysis of the PCAP file is complete, only the analysis panel and a small tools interface are shown. I also incorporated several jQuery UI widgets to allow dragging, dropping, and resizing of panels.

I took advantage of lobipanel.js's built-in full-screen mode in case a user wants to focus attention on one specific panel.

Another concept I have been experimenting with is row-specific tools. The idea is that each row contains data which may warrant further analysis. I decided to categorize each potential cell value as a datatype using various regex patterns. When a user clicks "tools" on the left of any entry, that row is parsed and the fields which were assigned a datatype are extracted. I then generate a set of tools which can be used to provide further information about the extracted row data.
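A rough sketch of that tagging step is below; the datatype names and regex patterns are simplified stand-ins for the ones actually used.

```python
# Classify cell values into datatypes with regex patterns so the UI knows
# which row-tools apply. Patterns here are simplified examples.
import re

DATATYPE_PATTERNS = [
    ("ipv4",   re.compile(r"^(?:\d{1,3}\.){3}\d{1,3}$")),
    ("md5",    re.compile(r"^[a-fA-F0-9]{32}$")),
    ("url",    re.compile(r"^https?://", re.I)),
    ("domain", re.compile(r"^(?:[a-z0-9-]+\.)+[a-z]{2,}$", re.I)),
]

def classify(value):
    for name, pattern in DATATYPE_PATTERNS:
        if pattern.match(value):
            return name
    return None

def extract_row_datatypes(row):
    """Return only the fields whose values matched a known datatype."""
    tagged = {}
    for field, value in row.items():
        datatype = classify(str(value))
        if datatype is not None:
            tagged[field] = (datatype, value)
    return tagged

print(extract_row_datatypes({"id.orig_h": "192.168.1.5", "query": "example.com", "port": "53"}))
```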

Two row-tools I’ve built so far are a simple IP2Geo tool as well as a whois lookup.
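Neither tool is complicated; a minimal sketch, assuming MaxMind's geoip2 reader and the system whois client (the database path is a placeholder), might look like this:

```python
# Illustrative row-tools: IP2Geo via MaxMind's geoip2 library and a whois lookup
# that shells out to the system whois client. The .mmdb path is a placeholder.
import subprocess
import geoip2.database

reader = geoip2.database.Reader("/opt/geoip/GeoLite2-City.mmdb")

def ip2geo(ip):
    r = reader.city(ip)
    return {"country": r.country.name, "city": r.city.name,
            "lat": r.location.latitude, "lon": r.location.longitude}

def whois_lookup(value):
    # A real implementation would parse the interesting fields out of the raw text.
    return subprocess.check_output(["whois", value]).decode(errors="replace")
```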


I also plan on adding tools which make it easy to pivot between corresponding connections in various BRO logs (those sharing connection UID).
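That pivot could be as simple as the hypothetical query below, assuming each log row were indexed as its own document with its BRO uid field preserved (the index name and example UID are placeholders):

```python
# Sketch of pivoting across BRO logs by connection UID. Assumes each log row is
# its own document in the "analysis" index with a "uid" field; both are assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch()

def related_log_entries(uid):
    result = es.search(index="analysis", body={"query": {"term": {"uid": uid}}, "size": 100})
    return [hit["_source"] for hit in result["hits"]["hits"]]

# Every conn.log, dns.log, http.log, etc. row tied to one connection:
for entry in related_log_entries("CUM0KZ3MLUfNB0cl11"):  # placeholder UID
    print(entry.get("type"), entry)
```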

The last major improvement came with incorporating Suricata into my analysis nodes. BRO is great at extracting protocol information and giving you a good idea of the content of a PCAP file. That context is necessary for any decent PCAP analysis; however, out of the box BRO is not very good at telling you whether the PCAP data contains indicators of malicious activity. Suricata, on the other hand, uses Emerging Threats signatures and can instantly tell you whether malicious binaries, suspicious HTTP requests, or other IOCs exist within the capture.
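The Suricata pass can be sketched roughly as below; paths are placeholders, and it assumes eve.json output is enabled in suricata.yaml.

```python
# Rough sketch: run Suricata offline against the PCAP and pull alerts out of
# eve.json. The log directory is a placeholder and eve-log must be enabled.
import json
import os
import subprocess

def suricata_alerts(pcap_path, logdir="/tmp/suricata-out"):
    os.makedirs(logdir, exist_ok=True)
    subprocess.check_call(["suricata", "-r", pcap_path, "-l", logdir])
    alerts = []
    with open(os.path.join(logdir, "eve.json")) as f:
        for line in f:
            event = json.loads(line)
            if event.get("event_type") == "alert":
                alerts.append({"signature": event["alert"]["signature"],
                               "severity": event["alert"]["severity"]})
    return alerts

print(suricata_alerts("suspicious.pcap"))
```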


The next steps of the project are focused on making this tool actually useful. Up to this point I have been capturing a ton of data about individual PCAPs but ultimately throwing the PCAP away once analysis is complete. I want to allow the user to download the PCAP as well as artifacts extracted from it; for this I am considering several large-scale storage options.

Hopefully, my next update will be weeks not months from now.

 

Network Analysis #1 – New Projects!

Two years ago I began working on SmartTorrent, a sentiment-analysis-based torrent search engine. The goal of the project was to rank torrent search results based on user content such as comments and determine whether or not a torrent was safe to download. At the time this seemed like a feasible goal, but as the project grew I began to realize how incredibly complex the problem actually was. My approach was inherently flawed because it assumed some level of consistency in the semantic structure of the content I was analyzing. Frustrated by this and by the monolithic pile of crap the tool had become, I decided to discontinue work on the project and begin a more feasible one: automating various aspects of network analysis for incident responders.

This has actually taken the form of several projects, the two foremost being a web-based (I hate the word "cloud") packet-capture (PCAP) analysis engine (think VirusTotal for PCAPs) and a NetFlow log visualizer which identifies top talkers, potential lateral movement, and other incident-response-related metrics.

The web-based PCAP analysis engine allows a user to upload a PCAP file to our site, where it is offloaded to a processing node, replayed over a virtual network interface, and analyzed by several IDSs. The resulting analysis will return:

  1. Protocols found within the capture.
  2. Detailed logs of all connections.
  3. Signatures fired.
  4. A list of related PCAP submissions containing similar data.


The obvious value of this tool comes from its ability to group similar packet captures into one consolidated view. This allows an analyst to search our database using indicators such as IPs, hostnames, URLs, etc. and receive results which could be used to extend existing blacklists.

Over the next few months I will go into greater detail about each of these projects as I add features.

The link to the NetFlow log visualizer can be found here. Please feel free to fork and improve.

 

 

Security, Sentiment Analysis, and Machine Learning

Human beings are unpredictable, and for those of us in security this poses a problem. Even the most resilient systems with multiple layers of security are subject to failure because of human error (usually incompetence). However, where individuals by themselves often make stupid errors that can compromise themselves or your infrastructure as a whole, groups of individuals together often reach intelligent consensuses. This phenomenon is known as group intelligence, and it can be a powerful tool for identifying evil in data-driven communities such as content-hosting sites, public forums, image boards, and torrent sites.

The current solution to such problems involves hiring moderators to crawl comments and discussions and remove any spam or harmful download links from the site. Although this solution works, it is time-consuming, and depending on the size of your site it can be expensive as well. In the context of forums and image boards, moderator bots are also used, but these bots usually only fire on a few key words or combinations of words, and they don't persist or analyze this potentially useful data later. To make the data useful you need a way to persist, correlate, and quantify it. You also need a feedback loop that learns from new content.

For the sake of avoiding confusion I will use a torrent downloading site as an easy-to-understand illustration. Most torrent sites host multiple torrents and allow their users to rate and comment on each. The mechanism we will use to rate this content is called sentiment analysis, which gives us a negative or positive integer rating based on the "feeling" of each individual comment. These comment "feelings" are calculated from a content-specific set of good and bad words and signatures. The overall rating of the content can then be calculated by adding up the ratings of the individual comments.


 

Here is a very simplified wordlist containing a few key words and their values (negative, positive, or neutral).

Negators:

no -1
not -1
never -1

Positive:

amazing +5
good +3
excellent +5

Negative:

awful -5
bad -3
horrible -5
malware -5

Context:

copy 0
video 0
software 0
Now let’s use this small amount of context to analyze the following comment.

“This is a good copy does not contain malware”

The words our sentiment analysis algorithm understands are: good, copy, not, and malware.

Evaluated:

(good + copy) + (not * malware) = (3 + 0) + (-1 * -5) = +8 rating
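A toy implementation of this scoring, treating a negator as a sign flip on the next recognized word, reproduces the example above:

```python
# Toy sentiment scorer using the wordlist above. A negator flips the sign of
# the next recognized word; unknown words are skipped.
WORDLIST = {
    "no": -1, "not": -1, "never": -1,                        # negators
    "amazing": 5, "good": 3, "excellent": 5,                 # positive
    "awful": -5, "bad": -3, "horrible": -5, "malware": -5,   # negative
    "copy": 0, "video": 0, "software": 0,                    # context
}
NEGATORS = {"no", "not", "never"}

def score(comment):
    total, multiplier = 0, 1
    for word in comment.lower().split():
        if word in NEGATORS:
            multiplier = WORDLIST[word]          # -1 flips the next known word
        elif word in WORDLIST:
            total += multiplier * WORDLIST[word]
            multiplier = 1
    return total

print(score("This is a good copy does not contain malware"))  # -> 8
```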


 

Obviously, the flexible syntax of natural language can pose problems for even the most advanced natural language processing algorithms. However, as long as you are able to correctly interpret the majority of your content you do not need to worry so much about these outliers; after all, we are trying to find a consensus.

Once you have a consensus you can use this data to advance your knowledge base through learning algorithms. For example, suppose you have one hundred comments for a particular torrent: eighty of these comments received positive scores, twenty negative. Think about what this tells us: with a population of this size we can reasonably say the content is both safe and of good quality. We can now use association algorithms to look for commonly recurring unknown words in our newly identified positive comments. From there we can say with some certainty that these newly identified words are positive, and add them to our existing positive wordlist. This information is persisted and used in the next analysis cycle. The same concept can be applied to negative words as well, and more specific "feelings" can be assigned to word groups by custom signatures.
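As a rough sketch of that word-learning step (the occurrence threshold is arbitrary):

```python
# Find frequently recurring unknown words in comments already judged positive,
# so they can be tentatively added to the positive wordlist for the next cycle.
from collections import Counter

def learn_positive_words(positive_comments, known_words, min_occurrences=10):
    unknown = Counter(
        word
        for comment in positive_comments
        for word in comment.lower().split()
        if word not in known_words
    )
    return [word for word, count in unknown.items() if count >= min_occurrences]
```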

The ultimate goal is to make your data somewhat context aware, where each cycle of analysis builds on the previous cycle. This way, as the community content on your site grows, so does the overall "intelligence" of the algorithm. In the next few months I will be adding write-ups to this blog on my own "context-aware" security project, and I will share what I have learned from the data I have gleaned.


~Jamin Becker

Quick & Easy Malware Discovery/Submission

In this quick project the goal was to automate downloading malware and submitting to VirusTotal any samples that aren't already in its dataset. To gather the malware, I decided to use Maltrieve.

From the GitHub page: "Maltrieve originated as a fork of mwcrawler. It retrieves malware directly from the sources as listed at a number of sites, including Malc0de, Malware Black List, Malware Domain List, Malware Patrol, Sacour.cn, VX Vault, URLquery, and CleanMX." I would like to thank its author for taking the time to build this out.

To upload samples to VirusTotal, I utilized a script written by @it4sec. It can be found at http://ondailybasis.com/blog/wp-content/uploads/2012/12/yaps.py_.txt. All I had to do was add my API key to the script and tell it what directory my samples were in. From there it checks whether each sample is already in the VirusTotal dataset and, if it isn't, uploads it. It even keeps track of everything in a log file for future reference.
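I haven't reproduced yaps.py here, but the check-then-upload idea against the VirusTotal v2 API of that era looks roughly like the sketch below; the API key, directory handling, and logging are placeholders.

```python
# Hedged sketch of the check-then-upload logic against the VirusTotal v2 API.
# The public API was rate limited (about 4 requests/minute), so a real script
# needs throttling; see yaps.py for the version actually used.
import hashlib
import os
import requests

API_KEY = "YOUR_VT_API_KEY"  # placeholder

def already_known(sha256):
    r = requests.get("https://www.virustotal.com/vtapi/v2/file/report",
                     params={"apikey": API_KEY, "resource": sha256})
    return r.json().get("response_code") == 1   # 1 means VT has seen this hash

def submit_new_samples(directory):
    for name in os.listdir(directory):
        with open(os.path.join(directory, name), "rb") as f:
            data = f.read()
        if not already_known(hashlib.sha256(data).hexdigest()):
            requests.post("https://www.virustotal.com/vtapi/v2/file/scan",
                          data={"apikey": API_KEY},
                          files={"file": (name, data)})
```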

I added a cron job that runs Maltrieve at the top of every hour and another cron job that runs yaps.py 30 minutes after. This essentially allows me to pull down new samples every hour and do my part in uploading new samples to VirusTotal.

Analysis:
So far I've pulled down 7,424 malware samples using Maltrieve over the last few days. Out of those 7,424 samples, ~1,400 have never been seen by VirusTotal. I've found different variants of malware such as Zeus and Asprox, and lots of malicious iframe injections on web pages. I'm actually surprised at the number of unique samples being uploaded; I was expecting someone to already be doing this exact same process and uploading samples before I could.

The next step of the project is to get the malware uploaded automatically to Malwr.com to generate sandbox reports for the samples. I look forward to expanding this out and hopefully receiving some input on what direction this should or could go.

– Max Rogers


Linksys & Netgear Backdoor by the Numbers

If you’d like to just skip to the data, feel free to scroll on down. Research is not endorsed or attributable to $DayJob 🙂

After reading Rick Lawshae's post on Hunting Botnets with ZMap, I started wondering what other cool things ZMap could be used for. It wasn't but a day or two later that something fell into my hands. On January first, Eloi Vanderbeken posted his findings on a backdoor that listens on TCP port 32764. The backdoor appears to affect older versions of Netgear and Linksys routers, but some users are reporting that other brands are affected as well. Eloi also wrote a Python script that, among other functions, can check for the vulnerability. To get more info on the backdoor and how Eloi discovered it, you can check it out here: https://github.com/elvanderb/TCP-32764/blob/master/backdoor_description_for_those_who_don-t_like_pptx.pdf.

Once I had wrapped up my reading on his work, I got excited. I realized that I finally have a way of answering a question we usually go without knowing. Almost every couple months you hear someone say, “There’s another backdoor in XYZ product!” and that’s about when media blows up, PR statements are released, Snort sigs are written, and we all wait for the first exploits to start rolling out.

I know I don't speak for everyone, but I feel like the general mindset is that when a major backdoor or zero-day starts to make headlines, we assume that hundreds of thousands, maybe millions, of users are affected by the vulnerability. With this in mind I set out to answer the question, "How bad is it?"

Step one was to figure out how to use ZMap, so I installed it on my Kali VM and gave it a shot. I followed the extremely simple instructions on their webpage, and in one line I had my scan configured: "$ zmap -p 32764 -o OpenPorts.csv".

I then went to my VPS provider of choice and purchased a VPS with a gigabit connection to the intertubes. I loaded up a vanilla install of Ubuntu Server 12.04 and installed ZMap. Before I launched the scan, I made sure to read the Scanning Best Practices section of the ZMap documentation, which lists things such as "Conduct scans no longer than time need for research" and "Should someone ask you to cease scanning their environment, add their range to your blacklist".

The scan took roughly 22 hours to complete. The ZMap documentation and advertising state that you can get it done in less than an hour, but I think they used a cluster setup; besides, 22 hours isn't bad by any means. 22 hours and 13 abuse complaints later (all complaints were acknowledged and scanning was ceased), I had my list of roughly 1.5 million IP addresses that currently had TCP port 32764 open. 1.5 million… I thought to myself, "That's a pretty big number."

I knew that this probably wasn't statistically accurate, though, because there had been no validation that the backdoor service was the service listening on those open ports. To help validate how many of those 1.5 million hosts were actually vulnerable, I pulled in my friend Farel (Jamin) Becker.

Using Eloi's findings, Jamin was able to write some scripts in bash and Python that allowed us to quickly check the 1.5 million hosts for the vulnerability. The scripts did this by simply reaching out to the port and looking for indicators that the backdoor service was running. No exploitation or malicious actions were taken against the vulnerable routers; our check was comparable to connecting to a web page.
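A heavily simplified sketch of that kind of check is below. The probe payload and the "ScMM"/"MMcS" magic values are assumptions paraphrased from Eloi's write-up, so trust the PDF linked above rather than this snippet.

```python
# Simplified indicator check: connect to TCP/32764 and look for the backdoor's
# magic header in the response. The probe bytes and magic values are assumptions
# taken from Eloi's write-up; nothing is exploited, we only read a banner.
import socket

def looks_vulnerable(ip, port=32764, timeout=3):
    try:
        s = socket.create_connection((ip, port), timeout=timeout)
        s.settimeout(timeout)
        try:
            s.sendall(b"\x00" * 12)   # minimal probe; the real script mirrors the protocol header
            banner = s.recv(16)
        finally:
            s.close()
        return b"ScMM" in banner or b"MMcS" in banner
    except (socket.timeout, OSError):
        return False
```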

To effectively check for the vulnerable service, Jamin's scripts split the list of 1.5 million IPs into roughly 2000 different lists. The system then spun up 2000 independent instances of Python to perform the work. To do this we needed a pretty beefy computer, so we rented the top EC2 instance we could find. Needless to say it worked beautifully and only cost about $2.40 for the hour it took to complete the validation.
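The actual run used 2000 independent Python instances; as a simpler stand-in, a multiprocessing sketch of the same fan-out (chunk and worker counts are illustrative, and looks_vulnerable() is the hypothetical check from the sketch above) could look like this:

```python
# Fan the 1.5 million IPs out across worker processes. Chunk count and worker
# count are illustrative; looks_vulnerable() is the sketch shown earlier.
from multiprocessing import Pool

def chunk(items, n_chunks):
    size = max(1, len(items) // n_chunks)
    return [items[i:i + size] for i in range(0, len(items), size)]

def check_chunk(ips):
    return [ip for ip in ips if looks_vulnerable(ip)]

def find_vulnerable(all_ips, n_chunks=2000, workers=200):
    with Pool(processes=workers) as pool:
        results = pool.map(check_chunk, chunk(all_ips, n_chunks))
    return [ip for sub in results for ip in sub]
```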

This is where the real data comes in. My first thought was, "Oh man, here comes the part where we get to tell the world 400,000 routers are vulnerable RIGHT NOW!" The results were actually quite surprising. It turns out that only 4,998 routers were exposed and vulnerable. Safe to say I expected more, and I feel most would too. Below is some statistical data around what Jamin and I found. Geo data was gathered by querying the MaxMind database.

[Charts: vulnerable hosts by country, by ISP, and by state]

-Max Rogers & Jamin Becker
