Miki's Programmers Blog: 2017

Monday, December 4, 2017

How to build your own spam filter based entirely on machine learning

Several posts ago, I explained the basic logic behind how spam filters work, and how to approach such a project from a programmer's point of view when factors like scalability come into play.

Today I want to explain how you can create your own spam filter that does not depend on blocklists or external reputation services, in a way that makes sense and is easy to understand and apply in practice.

First, I highly recommend that you read the post linked above, which explains the logic, especially the part about reputation filtering. Once you've read that, or if you already have, feel free to continue.

"Machine Learning" explained

The term "Machine Learning" is brought up in almost every article that tries to explain spam filtering, usually with all sorts of mathematical equations, which only confuses and complicates a concept that is rather simple.

All it means is that we collect data, lots of data, and then aggregate that data to draw conclusions from it. That's all this big term means.

Part 1 - Create an "internal" reputation table for ANY IP connecting to your server

The first part of "machine learning" that your filter needs in order to be effective is to simulate, as much as possible, what external IP reputation services do when they measure an IP's "reputation score".

In practical terms, what it means is that you need to count, for every IP connecting to your server:

The number of connections made from that specific IP address
The number of valid recipients vs invalid recipients - usually, the higher the number of invalid recipients tried from an IP address over a given period of time, the higher the probability that it is a "Spammer" trying to abuse your clients.
The number of spam marked messages - if a certain percentage of messages received from this IP were marked as "spam" by your users, chances are, it's an IP belonging to a server that sends out spam!

So in general, a "good" IP address is one that sends email to a high percentage of valid recipients, and where a high percentage of the messages received from it aren't marked as "spam" by your recipients.

You can decide on thresholds of your own. For example, an IP would be in "good" reputation if at least 90% of its messages go to valid recipients and at least 99% of its messages aren't marked as spam. Then you can change this reputation according to behaviour: if, for example, 30% of the recipients are invalid, or 10% of the messages are marked as "spam", you can change the IP's reputation to "suspicious". And you can set the thresholds for "blocked" at 50% invalid recipients OR 30% spam reports.
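Here is a minimal sketch, in Python, of how the example thresholds above could be applied. The counter names, the `min_messages` cutoff and the "unrated" state are my own assumptions for illustration; the text only defines "good", "suspicious" and "blocked", and leaves the range between "good" and "suspicious" open.

```python
from dataclasses import dataclass


@dataclass
class IpStats:
    """Per-IP counters (field names are illustrative, not a fixed schema)."""
    connections: int = 0
    valid_rcpts: int = 0
    invalid_rcpts: int = 0
    messages: int = 0
    spam_marked: int = 0


def ip_reputation(s: IpStats, min_messages: int = 20) -> str:
    """Classify an IP using the example thresholds from the text."""
    rcpts = s.valid_rcpts + s.invalid_rcpts
    # Assumption: with too little data we do not judge the IP at all.
    if s.messages < min_messages or rcpts == 0:
        return "unrated"
    invalid = s.invalid_rcpts / rcpts
    spam = s.spam_marked / s.messages
    if invalid >= 0.50 or spam >= 0.30:
        return "blocked"
    if invalid >= 0.30 or spam >= 0.10:
        return "suspicious"
    if invalid <= 0.10 and spam <= 0.01:
        return "good"
    # Assumption: anything between "good" and "suspicious" stays undecided.
    return "unrated"
```

In a real server these counters would live in a database and be limited to a time window, so that an IP can recover its reputation once the bad traffic stops.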

Part 2 - Create an "internal" reputation table for ANY domain used in received emails

Creating an internal reputation score ("good" / "suspicious" / "bad") for an IP is one part of the learning, but as I explained in my linked article, it can cause problems with false positives and with blocking legitimate email that is sent from the same IP.

So to complete it, create an internal reputation table for ANY domain used in the received emails. Extracting the domains used in a message is a rather quick process.

Then, have a table made up of domain names, and for every domain count the number of messages received with that domain name as a link, AND how many of those messages were marked as "spam".

And again, apply the same kind of percentage thresholds here as well. A domain in "good" reputation can be one where 99% of the messages received with this domain name are good messages and only 1% were marked as spam. A "suspicious" domain can be one where up to 10% of the messages containing it were marked as spam, and a "blocked" domain can be one where the percentage goes above 10%.

Applying an internal reputation to domain names can be a much better choice, because it helps you avoid blocking whole IP addresses and lets you be more specific.
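The domain table can be sketched the same way. This is only an illustration under a few assumptions of mine: the regular expression picks up plain http(s) links only, the `DomainTable` class is a hypothetical in-memory stand-in for a real database table, and a domain with no data gets an "unrated" state that the text does not define.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Simplified pattern: only absolute http(s) links are picked up.
URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def linked_domains(body):
    """Return the set of lowercase hostnames linked from a message body."""
    hosts = set()
    for url in URL_RE.findall(body):
        try:
            host = urlsplit(url).hostname
        except ValueError:  # malformed URL, e.g. a broken IPv6 literal
            continue
        if host:
            hosts.add(host.rstrip("."))
    return hosts


class DomainTable:
    """In-memory stand-in for the domain reputation table (hypothetical)."""

    def __init__(self):
        self.counts = defaultdict(lambda: {"received": 0, "spam": 0})

    def record_received(self, body):
        # Each domain is counted once per message, however often it is linked.
        for domain in linked_domains(body):
            self.counts[domain]["received"] += 1

    def record_spam_report(self, body):
        for domain in linked_domains(body):
            self.counts[domain]["spam"] += 1

    def reputation(self, domain):
        c = self.counts.get(domain)
        if not c or c["received"] == 0:
            return "unrated"  # assumption: no data means no verdict
        ratio = c["spam"] / c["received"]
        if ratio <= 0.01:
            return "good"
        if ratio <= 0.10:
            return "suspicious"
        return "blocked"
```

Call `record_received` when a message arrives and `record_spam_report` when a user marks it as spam; `reputation` then applies the 1% and 10% thresholds from the example above.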

And if you want to extend it even further ...

Part 3 - Create an "internal" reputation table for ANY "From address" used in received emails

A "From address" is usually the way in which a specific sender identifies itself. It is usually a brand name, it's usually consistent, and senders usually do not change it, because a lot of recipients who are interested in the sender's services will add it to their contact list or "whitelisted" senders list.

So again, the logic for this table is exactly the same as the one used for domain names, except that you'll use the From address (for example: "support@microsoft.com") to do the learning.

And again, you can give a reputation score to a from address - "Good", "Suspicious" and "Blocked" using the same logic I suggested above.

What about content? I see a lot of email messages that all look quite the same, but are sent from different IPs, domains and From addresses. Is there a way to identify and block those?

Yes, there's a way. As I explained in the previous article, I don't really recommend going too deeply into content filtering because it can be a rather heavy solution that doesn't scale, but if all the above methods didn't help you get rid of the spam, then you have to add some level of content filtering, and I'll try to explain it here.

Theoretically speaking, a "bulk message" is the same message sent to many people at once. The emphasis is on "same message": it means the message will have some kind of constant structure that can be detected and marked as "Good", "Suspicious" or "Blocked".

Some of the things to look for on bulk messages that are being marked as "spam" but come from different IPs, domains or From addresses, and that can be checked for in a scalable way:

Length of the message - Yes, as simple as this may sound, if the same message gets marked as "Spam" by users again and again, it will usually have either a consistent length, or a length that varies by at most 2-3% of its size.
Content Type, Number of Parts, and size of every part - If the Content-Type is the same, and the number of parts being used (text, html and attachments) is the same, that's also something that is very easy to check quickly.
If Images are included - CHECK THEIR DIMENSIONS! - This is again something that can be very quick to test, especially for embedded images. Embedded images are usually JPG, GIF or PNG, and these formats store the image's dimensions near the start of the file, where they can be read very quickly.
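To show why the image check is cheap, here is a small Python sketch that reads the dimensions straight from the first bytes of a PNG, GIF or JPEG without decoding the image. The function name `image_size` is my own, and the JPEG branch is simplified: it assumes every segment before the frame header carries a length field, which holds for typical files but is not a full parser.

```python
import struct

# JPEG start-of-frame markers, the segments that carry the image size.
_SOF_MARKERS = set(range(0xC0, 0xD0)) - {0xC4, 0xC8, 0xCC}


def image_size(data):
    """Return (width, height) for PNG, GIF or JPEG bytes, or None if unknown."""
    # PNG: 8-byte signature, then the IHDR chunk with big-endian width/height.
    if data[:8] == b"\x89PNG\r\n\x1a\n" and data[12:16] == b"IHDR" and len(data) >= 24:
        return struct.unpack(">II", data[16:24])
    # GIF: 6-byte signature, then little-endian 16-bit width/height.
    if data[:6] in (b"GIF87a", b"GIF89a") and len(data) >= 10:
        return struct.unpack("<HH", data[6:10])
    # JPEG: walk the segments until a start-of-frame marker is found.
    if data[:2] == b"\xff\xd8":
        i = 2
        while i + 9 <= len(data) and data[i] == 0xFF:
            marker = data[i + 1]
            if marker == 0xFF:  # fill byte before a marker
                i += 1
                continue
            if marker in _SOF_MARKERS:
                height, width = struct.unpack(">HH", data[i + 5:i + 9])
                return (width, height)
            # Skip this segment: 2 marker bytes plus the declared length.
            (seg_len,) = struct.unpack(">H", data[i + 2:i + 4])
            i += 2 + seg_len
    return None
```

Only a few dozen bytes are touched for PNG and GIF, and usually a few hundred for JPEG, so this can run on every embedded image without hurting throughput.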

These 3 should be enough for this type of "content filter". I wouldn't go for checking actual words, because spammers will usually just use random combinations of words, making a dictionary type of lookup useless.

What you can do is create a "profile" for every message that is marked as spam: extract the length of the message, the content type, the number of parts, the size of each part, and the image dimensions. One "row" that contains all these details shouldn't take more than 50-100 bytes of storage.

Then, for any message that gets marked, extract all these elements and compare them to the ones already in your database, and keep the same counts for them as well: the number of messages received that match this "content profile", and the number of those that were marked as spam. If the number of messages marked as spam is high, you can mark this "content profile" as "Blocked" and use it to block similar messages as well.
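A rough sketch of such a "content profile" in Python, using the standard `email` package. The names `ContentProfile`, `build_profile` and `same_profile` are hypothetical, the 3% tolerance comes from the length rule above, and image dimensions are left out here to keep the example short.

```python
import email
from email import policy
from typing import NamedTuple, Tuple


class ContentProfile(NamedTuple):
    length: int                          # size of the raw message in bytes
    content_type: str                    # top-level Content-Type
    parts: Tuple[Tuple[str, int], ...]   # (content type, decoded size) per leaf part


def build_profile(raw: bytes) -> ContentProfile:
    """Extract the structural profile of a raw RFC 5322 message."""
    msg = email.message_from_bytes(raw, policy=policy.default)
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers have no payload of their own
        payload = part.get_payload(decode=True) or b""
        parts.append((part.get_content_type(), len(payload)))
    return ContentProfile(len(raw), msg.get_content_type(), tuple(parts))


def close(a: int, b: int, tolerance: float = 0.03) -> bool:
    """True if two sizes differ by at most `tolerance` of the larger one."""
    return abs(a - b) <= tolerance * max(a, b)


def same_profile(p: ContentProfile, q: ContentProfile, tolerance: float = 0.03) -> bool:
    """Compare two profiles: same structure, sizes within the tolerance."""
    if p.content_type != q.content_type or len(p.parts) != len(q.parts):
        return False
    if not close(p.length, q.length, tolerance):
        return False
    return all(
        ta == tb and close(sa, sb, tolerance)
        for (ta, sa), (tb, sb) in zip(p.parts, q.parts)
    )
```

Because the comparison is fuzzy, a real table would need some way to find candidate profiles quickly, for example by indexing on content type and number of parts first and only then comparing sizes.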

Conclusion

So, in this article I brought you a practical idea for how to create your own spam filter based on "machine learning" over all the parameters of the message that really count: the IP address, the domain used, the From address, and a quick "content profile" to check as well.

These should be enough for you to create a good and reliable spam filter for your company.

Perfecting this type of filter really comes down to research, experience and "out of the box" thinking. Usually, unsolicited bulk messages (aka "spam") will either be sent from a consistent IP space, have a consistent "From" address, have a consistent domain linked inside, or have a consistent content structure that can be detected.

And you can also make your filter even more accurate by using external blocklists, as I suggested in my previous article, like Spamhaus.org, which are very reliable at detecting spamming activity.

One last word I must add: while some filters are more accurate than others, no spam filter is 100% accurate. For example, a spammer sending emails from different IPs, using different From addresses, with NO linked domain, and with randomized text in every message will probably never be detected by ANY filter. Recently, though, an authentication protocol called DMARC has started to be widely used, which aims to authenticate the "From:" address to make sure the message comes from a legitimate sender. But even with this, not every message can be detected, and this is why stopping spam is a global effort: not only do the companies that provide spam filters do their best to detect it, but the ISPs that send the mail (like Hotmail, Gmail, etc.) also take measures on their part to stop spammers from abusing their systems (like deploying DMARC, counting the number of messages sent per sender, and much more).

Saturday, January 28, 2017

PopUp/PopUnder is the WORST advertising strategy (+FACTS!)

Marketing is one of the key elements for a successful campaign / product launch and even sales.
You have your product ready to be sold, and you need to get people exposed to it, so some of them will be interested enough to buy it.

While I myself represent the E-mail marketing world, over the years I have had my experience with display ads (Banners), but recently I wanted to put the PopUp/PopUnder marketing strategy to a test.

Not long ago I read a few blog posts that suggested that popups have a 2% CTR. Trusting this data blindly, one of our clients, who has a viral website, put out a popup/popunder campaign to test the effectiveness of this marketing method.

The results were quite disappointing: out of 1600 popups that were shown, only 3 people clicked on his landing page. 3 clicks out of 1600 popup impressions = about 0.19% CTR !!

We must add one more fact here: the landing page that he displayed in the popup had exactly the same design that was sent in his e-mail campaigns. And in his e-mail campaigns this same design brought up to 7% CTR, which is roughly 37 times more clicks.

So we wanted to put the popup/popunder to a more accurate test, to better reflect the performance of popups/popunders.

This is what we did:

1. We created a landing page based on the same design he uses in his email campaign.
2. We set up 4 signals to measure the popup performance:

First signal is fired when the landing page is first requested by the popup ad network
Second signal is fired 2 seconds after the page was initially requested
Third signal is fired when the page is physically viewed by the client (using the Page Visibility API of modern browsers)
Fourth signal is fired when a user clicks on one of the links in the landing page

3. We also installed a user-recording analytics tool (similar to ClickTale) to see what users physically did on the page in case they interacted with it.

The results:

We gave this landing page a few hours to run, collecting well over 500 popup/popunders that were called. The result:

All of the popups fired the first signal (initial page request)
45% of the popups fired the second signal (JS event 2 seconds after the page loaded)
3.4% of the popups fired the third signal (indicated the page was physically viewed)
0% of the popups fired the fourth signal (click on one of the links)

So, from this test we can conclude that before we calculate the CTR of the popup, we first need to get the visitor to actually SEE the landing page. This is true in any form of marketing: you can only draw conclusions based on real, physical views of your advertisement.

And as we can see from this data, the results are far from being in our favor: only 3.4% of the people who have a popup or popunder shown to them get to the point where they really see what's inside.

We can assume, then, that a little more than 96% of the people who are exposed to a popup/popunder NEVER SEE IT. They either close it immediately, or they close the popup window without even opening it (this is, of course, possible just like you can close any window on your desktop without having to open it: a simple right-click on the window, and an option to close it is available).

So, if around 96.6% of popups aren't even viewed, consider the following: out of 100,000 popups generated, 96.6% are never seen. That means that out of 100,000 popups, only 3,400 people will actually see your ad.
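The arithmetic is simple enough to put in a few lines of Python. The function names are my own, and the click count in the usage note below is a made-up number purely for illustration; only the 3.4% view rate comes from the test above.

```python
def viewed_impressions(impressions, view_rate):
    """Estimated number of impressions that were actually seen."""
    return round(impressions * view_rate)


def ctr_on_views(clicks, impressions, view_rate):
    """CTR computed against viewed impressions instead of all impressions."""
    views = viewed_impressions(impressions, view_rate)
    return clicks / views if views else 0.0


print(viewed_impressions(100_000, 0.034))  # 3400
```

For example, a hypothetical 34 clicks on 100,000 popups would look like a 0.034% CTR on raw impressions, but `ctr_on_views(34, 100_000, 0.034)` gives 1%, because only 3,400 of those popups were ever seen.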

One more disadvantage of popup/popunder marketing to consider is the obvious pattern interrupt and context switching that this form of marketing presents. The visitor has his attention somewhere else when suddenly an intrusive attempt to sell him something pops up in front of his face. And people's natural response to a pattern interrupt is to simply ignore it and go back to what they were doing before. This is another aspect that makes popups a really bad marketing strategy, even more so than banner ads.

I hope this information was useful to anyone who wishes to consider popup advertising.

I won't completely ditch popup advertising because there are some advantages to it as well (for instance, the relatively low cost), but I thought it was necessary to put this information out there so people can evaluate this marketing strategy for themselves.

Friday, January 6, 2017

Razor2 "blacklist" .. apparently that's what it is

In my last post about how spam filters really work, I talked about the use of distributed "black lists" that are used to collaborate on blocking senders' IPs, domains, or content based on some "shared" knowledge.

Most of the "known" blacklists, i.e. services that say pretty straightforwardly that they are blacklisting services, are usually fed by automated honeypots (aka "spamtraps") to collect their data.

The reason why automated honeypots are used is the huge amount of spam email that is sent each day. I mean, think about it: there are several billion emails sent each day, and according to the latest statistics from Cisco's SenderBase online email monitoring service, about 85% of email sent is "spam".

That is way too much for humans to read and interpret message by message, so automated honeypots are used for "flagging" a sender in a blacklist, and on the other hand senders are given the option to request removal from the blacklist in case of a false positive.

And then here comes the Razor2 engine that is used by SpamAssassin, and apparently by a big majority of email filtering platforms.

If you don't know what Razor2 is, let me refer you to some email headers generated by SpamAssassin that will give you a clue what I'm talking about. If you've seen any of the following:

RAZOR2_CHECK Listed in Razor2 (http://razor.sf.net/)

RAZOR2_CF_RANGE_E4_51_100 Razor2 gives engine 4 confidence level above 50% [cf: 100]

RAZOR2_CF_RANGE_51_100 Razor2 gives confidence level above 50% [cf: 100]

Then congratulations - You are listed in Razor2's database.. erm.. sorry.. blacklist.
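If you want to check a batch of messages for these rule names instead of reading headers by eye, a short Python sketch like this one will do. It simply searches the header text for the `RAZOR2_` rule names quoted above; the function name is my own, and it makes no assumption about which header SpamAssassin put them in.

```python
import re

# Matches rule names such as RAZOR2_CHECK or RAZOR2_CF_RANGE_51_100.
RAZOR2_RULE = re.compile(r"\bRAZOR2_[A-Z0-9_]+")


def razor2_hits(header_text):
    """Return the sorted, de-duplicated Razor2 rule names found in the text."""
    return sorted(set(RAZOR2_RULE.findall(header_text)))
```

An empty list means none of the Razor2 rules fired for that message.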

It's funny that Razor2's creators insist that it is not a blacklist.. but a "distributed hash sharing system".
So.. here are some interesting facts to know about Razor2 :

Fact #1 - Razor2 is now owned by CloudMark security

You're probably familiar with CloudMark's IP reputation check tool. They are a huge player in the field of email security / threat protection and of course.. spam filtering.

And it turns out, that they decided to purchase Razor2, and not only that.. but they also host Razor2's main query requesting domain:

Name: discovery.razor.cloudmark.com
Address: 208.83.139.205
Name: discovery.razor.cloudmark.com
Address: 208.83.137.118
Name: discovery.razor.cloudmark.com
Address: 208.83.137.117
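You can repeat this lookup yourself. The sketch below uses Python's standard `socket.gethostbyname_ex`; the function name `resolve_all` and the optional `resolver` parameter are my own additions for illustration. The addresses returned today may well differ from the ones listed above, or the name may no longer resolve at all.

```python
import socket


def resolve_all(hostname, resolver=socket.gethostbyname_ex):
    """Return the sorted IPv4 addresses that a hostname resolves to."""
    # gethostbyname_ex returns (canonical name, alias list, address list).
    _name, _aliases, addresses = resolver(hostname)
    return sorted(addresses)


# Needs network access, and raises socket.gaierror if the name does not resolve:
# resolve_all("discovery.razor.cloudmark.com")
```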

Fact #2 - Razor2 is a blacklist

A sender blacklist is, by definition, any distributed list of blocked IPs and/or sending domains that are used to send spam.. and that's exactly what Razor2 is.

Fact #3 - Razor2 does not offer a removal tool

Actually in this aspect, I can understand. Razor2 tends to be a reputation based system, which means that records are "cleaned" once the offending (spam) traffic has ceased to be sent. Gmail and many other known email providers seem to work that way as well.

One Unknown fact about Razor2

There's one aspect of the Razor2 list, though, that remains a mystery, and that is - who feeds this list?

Razor2 has created an Outlook plugin for reporting email to them, so apparently the list is fed by real humans who get spam messages and then use this "report" plugin to send the reported hash to the Razor2 database.

On the other hand.. a question arises.. how many reports should be sent to them in order to decide that a sender should be blocked? 1 report? 2 reports? 10 reports?

And the other question is.. who can guarantee that there's no person who just installed a honeypot with a Razor2 report plugin that just reports any message that arrives to it automatically as it arrives?

Interesting questions indeed, and I hope that someone from Razor2 will one day provide the answers.
