Monday, December 4, 2017

How to build your own spam filter based entirely on machine learning

Several posts ago, I explained the basic logic behind how spam filters work, and how to approach such a project from a programmer's point of view when factors like scalability come into play.

Today I want to explain how you can create your own spam filter that does not depend on blocklists or external reputation services, in a way that is easy to understand and to apply in practice.

First, I highly recommend that you read the post linked above, which explains the logic, especially the part about reputation filtering. Once you've read it, or if you already have, feel free to continue.

"Machine Learning" explained

The term "Machine Learning" comes up in almost every article that tries to explain spam filtering, usually along with all sorts of mathematical equations, which only confuses and complicates a concept that is rather simple.

All it means is that we collect data, lots of data, and then aggregate that data to draw conclusions from it.  That's all this big term means.


Part 1 - Create an "internal" reputation table for ANY IP connecting to your server

The first part of the "machine learning" that your filter needs, if you want it to be effective, is to simulate as closely as possible what external IP reputation services do when they measure an IP's "reputation score".

In practical terms, what it means is that you need to count, for every IP connecting to your server:
  1. The number of connections made from that specific IP address
  2. The number of valid recipients vs invalid recipients - usually, the higher the number of invalid recipients tried from an IP address over a given period of time, the higher the probability that it is a "Spammer" trying to abuse your clients.
  3. The number of spam marked messages - if a certain percentage of the messages received from this IP were marked as "spam" by your users, chances are it's an IP belonging to a server that sends out spam!
So in general, a "good" IP address would be one that sends email to a high percentage of valid recipients, and from which a high percentage of the messages received aren't marked as "spam" by your recipients.

You can decide on thresholds of your own. For example, an IP would have a "good" reputation if at least 90% of its messages go to valid recipients and at least 99% of its messages aren't marked as spam.   Then you can change this reputation according to behaviour: if, for example, 30% of the recipients are invalid, or 10% of the messages from it are marked as "spam", you can change this IP's reputation to "suspicious", and you can set the thresholds for "blocked" to be 50% invalid recipients OR 30% spam reports.
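To make the thresholds above concrete, here is a minimal sketch in Python. The names `IpStats` and `ip_reputation` are my own and purely illustrative, and the "unrated" result for IPs with no data, or for IPs that fall between the "good" and "suspicious" thresholds, is an assumption of this sketch, since the example thresholds leave that range open.

```python
from dataclasses import dataclass


@dataclass
class IpStats:
    """Counters kept per connecting IP (hypothetical layout)."""
    connections: int = 0
    valid_recipients: int = 0
    invalid_recipients: int = 0
    messages: int = 0
    spam_marked: int = 0


def ip_reputation(stats: IpStats) -> str:
    """Map the counters to a reputation label using the example thresholds."""
    recipients = stats.valid_recipients + stats.invalid_recipients
    if recipients == 0 or stats.messages == 0:
        return "unrated"  # assumption: no data yet, so no verdict

    invalid_ratio = stats.invalid_recipients / recipients
    spam_ratio = stats.spam_marked / stats.messages

    # Check the harshest thresholds first.
    if invalid_ratio >= 0.50 or spam_ratio >= 0.30:
        return "blocked"
    if invalid_ratio >= 0.30 or spam_ratio >= 0.10:
        return "suspicious"
    if invalid_ratio <= 0.10 and spam_ratio <= 0.01:
        return "good"
    return "unrated"  # assumption: between "good" and "suspicious"
```

In a real server you would keep one such row per IP in a database table and probably reset or age the counters over a time window, which this sketch leaves out.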


Part 2 - Create an "internal" reputation table for ANY domain used in received emails

Creating an internal reputation score ("good" / "suspicious" / "bad") for an IP is one part of the learning, but as I explained in my linked article, it can cause problems such as false positives and blocking legitimate email that is sent from the same IP.

So to complete it, create an internal reputation table for ANY domain used in the received emails. Extracting the domains used in a message is a rather quick process.

Then, keep a table of domain names, and for every domain count the number of messages received that contain this domain name as a link, AND how many of those messages were marked as "spam".

And again, apply the same kind of percentage counting here as well: a domain with a "good" reputation can be one where 99% of the messages received with this domain name are good messages and only 1% were marked as spam.   A "suspicious" domain can be one where up to 10% of the messages containing this domain were marked as spam, and a "blocked" domain can be one where that percentage goes above 10%.
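Here is a rough sketch of the domain table in Python, using only the standard library. The function names, the simple URL pattern, and the in-memory dictionary are all assumptions made for illustration. Note that it counts full hostnames (so "www.example.com" and "example.com" are separate rows); reducing them to one registered domain would need extra logic that is not shown here.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Deliberately simple pattern for http/https links in a message body.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def linked_domains(body: str) -> set:
    """Return the set of lower-cased hostnames linked in the body."""
    domains = set()
    for url in URL_PATTERN.findall(body):
        try:
            host = urlsplit(url).hostname
        except ValueError:  # malformed URL, skip it
            continue
        if host:
            domains.add(host.lower())
    return domains


def record_message(table, body: str, marked_spam: bool) -> None:
    """Count each linked domain once per message: [received, spam]."""
    for domain in linked_domains(body):
        table[domain][0] += 1
        if marked_spam:
            table[domain][1] += 1


def domain_reputation(received: int, spam: int) -> str:
    """Apply the example thresholds: 1% and 10% spam reports."""
    if received == 0:
        return "unrated"  # assumption: no data yet
    ratio = spam / received
    if ratio <= 0.01:
        return "good"
    if ratio <= 0.10:
        return "suspicious"
    return "blocked"


# Usage: one row per domain, holding [received, spam] counters.
table = defaultdict(lambda: [0, 0])
record_message(table, "Buy now at http://Spam.example/buy today", True)
```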

Applying an internal reputation to domain names can be a much better choice, because it helps you avoid blocking whole IP addresses and lets you be more specific by acting on domain names instead.

And if you want to extend it even further ...


Part 3 - Create an "internal" reputation table for ANY "From address" used in received emails

A "From address" is usually the way in which a specific sender identifies itself.  It is usually a brand name, it's usually consistent, and senders usually do not change it, because a lot of recipients who are interested in the sender's services will add it to their contact list or "whitelisted" senders list.

So again, the logic for this table would be exactly the same as the one used for domain names, except that you'll use the From address (for example: "support@microsoft.com") to do the learning.

And again, you can give a reputation score to a From address - "Good", "Suspicious" and "Blocked" - using the same logic I suggested above.
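As a small illustration, the From address table could look like the sketch below. It uses `email.utils.parseaddr` from Python's standard library to pull the bare address out of the header; lower-casing the address, the function names, and reusing the 1% / 10% thresholds from the domain table are my own assumptions here.

```python
from collections import Counter
from email.utils import parseaddr

received = Counter()  # messages seen per From address
spam = Counter()      # messages marked as spam per From address


def from_key(header_value: str) -> str:
    """Reduce a From header to a bare, lower-cased address."""
    _name, address = parseaddr(header_value)
    return address.strip().lower()


def record(header_value: str, marked_spam: bool) -> None:
    key = from_key(header_value)
    if not key:  # unparsable or empty From header, nothing to count
        return
    received[key] += 1
    if marked_spam:
        spam[key] += 1


def from_reputation(header_value: str) -> str:
    key = from_key(header_value)
    if received[key] == 0:
        return "unrated"  # assumption: no data yet
    ratio = spam[key] / received[key]
    if ratio <= 0.01:
        return "good"
    if ratio <= 0.10:
        return "suspicious"
    return "blocked"
```

With this, "Microsoft Support <Support@Microsoft.com>" and "support@microsoft.com" end up in the same row of the table.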



What about content?    I see a lot of email messages that all look quite the same, but are sent from different IPs, domains and From addresses.   Is there a way to identify and block those?

Yes, there's a way.  As I explained in the previous article, I don't really recommend going too deeply into content filtering, because it can be a rather heavy solution that doesn't scale well. But if all the above methods didn't help you get rid of the spam, then you have to add some level of content filtering, and I'll try to explain it here.

Theoretically speaking, a "bulk message" is usually the same message being sent to many people at once. The key words here are "same message": it means the message will have some kind of constant structure that can be detected and marked as "Good", "Suspicious" or "Blocked".

Some of the things to look for in bulk messages that are being marked as "spam" but come from different IPs, domains or From addresses, and that can be checked in a scalable way:

  1. Length of the message - Yes, as simple as this may sound, if the same message gets marked again and again by users as "Spam", it will usually have either a consistent length, or a length that changes by at most 2-3% of its size.
  2. Content Type, Number of Parts, and size of every part - If the Content-Type is the same, and the number of parts being used (text, html and attachments) is the same, that's also something that is very easy to check quickly.
  3. If Images are included - CHECK THEIR DIMENSIONS!  - This is again something that can be very quick to test, especially if it's an embedded image. Embedded images are usually in the form of JPG, GIF or PNG, and these formats store the image's dimensions in their headers, where they can be read very quickly.
These 3 should be enough for this type of "content filter". I wouldn't go for an actual word check, because spammers will usually just use random combinations of words, making a dictionary type of lookup useless.
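To show how cheap the image check can be, here is a small sketch that reads the dimensions straight from the first bytes of a PNG or GIF, without decoding the image. The function name is hypothetical. JPEG is left out on purpose: its dimensions sit in a later segment that has to be located by walking the file's markers, which takes a few more lines.

```python
import struct
from typing import Optional, Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def image_dimensions(data: bytes) -> Optional[Tuple[int, int]]:
    """Return (width, height) for PNG or GIF data, or None if unknown."""
    # PNG: 8-byte signature, then the IHDR chunk, whose data starts with
    # width and height as 4-byte big-endian integers at offsets 16 and 20.
    if len(data) >= 24 and data[:8] == PNG_SIGNATURE and data[12:16] == b"IHDR":
        return struct.unpack(">II", data[16:24])
    # GIF: 6-byte signature, then width and height as 2-byte
    # little-endian integers at offsets 6 and 8.
    if len(data) >= 10 and data[:6] in (b"GIF87a", b"GIF89a"):
        return struct.unpack("<HH", data[6:10])
    return None
```

Only the first 24 bytes of the attachment are needed, so this stays fast even for large images.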

What you can do is create a "profile" for every message that is marked as spam this way: extract the length of the message, the content type, the number of parts, the size of each part, and the image dimensions.   One "row" that contains all these details shouldn't take more than 50-100 bytes of storage.

Then, for any message that is received or marked, extract the same elements, compare them to the profiles already in your database, and keep the same counts for them as well: the number of messages received that match this "content" profile, and the number of messages marked as spam that match it. If the number of messages marked as spam is high, you can mark this "content profile" as "Blocked", and then use it to block similar messages as well.
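A content profile along these lines could be sketched as follows, using Python's standard `email` package. The `ContentProfile` layout, the function names, and the choice to apply the same 3% tolerance to every part size are assumptions of this sketch; image dimensions are left out to keep it short.

```python
from email import message_from_bytes
from typing import NamedTuple, Tuple


class ContentProfile(NamedTuple):
    length: int                            # size of the raw message in bytes
    content_type: str                      # top-level Content-Type
    parts: Tuple[Tuple[str, int], ...]     # (content type, decoded size) per part


def build_profile(raw: bytes) -> ContentProfile:
    """Extract the structural profile of a raw message."""
    msg = message_from_bytes(raw)
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers have no payload of their own
        payload = part.get_payload(decode=True) or b""
        parts.append((part.get_content_type(), len(payload)))
    return ContentProfile(len(raw), msg.get_content_type(), tuple(parts))


def close(x: int, y: int, tolerance: float) -> bool:
    """True if x and y differ by at most `tolerance` of the larger one."""
    return abs(x - y) <= tolerance * max(x, y)


def similar(a: ContentProfile, b: ContentProfile, tolerance: float = 0.03) -> bool:
    """Same structure, and all sizes within the tolerance (3% by default)."""
    if a.content_type != b.content_type or len(a.parts) != len(b.parts):
        return False
    if not close(a.length, b.length, tolerance):
        return False
    return all(
        type_a == type_b and close(size_a, size_b, tolerance)
        for (type_a, size_a), (type_b, size_b) in zip(a.parts, b.parts)
    )
```

Each stored profile would then carry its own "received" and "marked as spam" counters, and a profile whose spam percentage passes your threshold gets marked as "Blocked".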

Conclusion

So, in this article I gave you a practical idea of how to create your own spam filter, based on "machine learning" over all the parameters of the message that really count: the IP address, the domain used, the From address, and a quickly built "content profile" to check as well.

These should be enough for you to create a good and reliable spam filter for your company.

Perfecting this type of filter really comes down to research, experience and "out of the box" thinking. Usually, unsolicited bulk messages (aka "spam") will either be sent from a consistent IP space, have a consistent "From" address, have a consistent domain linked inside, or have a consistent content structure that can be detected.

And you can also make your filter even more accurate by using external blocklists, as I suggested in my previous article, such as Spamhaus.org, which is very reliable at detecting spamming activity.

One last word I must add, though: while some filters are more accurate than others, no spam filter is 100% accurate.   For example, a spammer sending emails from different IPs, using different From addresses, with NO linked domain, and with randomized text in every message will probably never be detected by ANY filter.    Recently, though, a new authentication protocol called DMARC has started to be widely used, which is aimed at authenticating the "From:" address to make sure it's coming from legitimate senders.   But even with this - still, no message is 100% undetectable, and this is why stopping spam is a global effort. Not only do the companies that provide spam filters do their best to detect it, but the ISPs that send the mail (like Hotmail, Gmail, etc.) also take measures on their part to stop spammers from abusing their systems (like deploying DMARC, counting the number of messages sent by each sender, and much more).