Monday, October 21, 2019

How to get high open rates with email marketing

The success of any good email marketing campaign starts with a high click rate, obviously.
But the number of clicks an email gets is directly related to the number of opens the same email gets.

If you sent your email to 10,000 recipients and got a 2% open rate, only 200 people opened your email.  Even if 10% of those openers click, that's still only 20 clicks.  However, if the same email generated a 20% open rate, both numbers are multiplied by 10: you get 2,000 opens and 200 clicks.
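
To make the arithmetic concrete, here is a minimal Python sketch. The function name `campaign_funnel` is just an illustration, and it assumes the click rate is measured against opens, as in the example above.

```python
def campaign_funnel(recipients, open_rate, click_rate):
    """Return (opens, clicks); click_rate is applied to opens, not to recipients."""
    opens = round(recipients * open_rate)
    clicks = round(opens * click_rate)
    return opens, clicks


low = campaign_funnel(10_000, 0.02, 0.10)
high = campaign_funnel(10_000, 0.20, 0.10)
print(low)   # (200, 20)
print(high)  # (2000, 200)
```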

So actually it's more correct to say that your email campaign starts at the open rate itself.
Now the question arises: how do you get a high open rate with your email?

Here are a few key points that contribute to this:

1. The quality of your list

If those 10,000 recipients you sent your email to are mostly dead or unused email addresses (for example, because people have switched to other providers and abandoned their old mailbox), then it makes sense that your open rates will suffer.

So the first and most important criterion for getting high open rates is the quality of your list.  A good list of 10,000 active users is far better, more valuable, and more stable for your campaigns than a list of 100,000 users that consists of 90% dead or unused email addresses and only 10% active addresses.  And it's not just the open rates that are affected by these numbers.

You must understand that over time, dead or unused email addresses are converted into spamtraps.  This is a practice many email providers use in order to filter out spam.  Emails landing in spamtraps lead to your further emails getting filtered.  So if you send to a list of 100,000 users that contains only 10% active users and 90% unused addresses, then even if only 10 of those unused addresses are spamtraps, that will directly impact your ability to reach the 10% real users you have.

So the correct way to achieve this is to actually monitor your email opens: see which recipients open your emails consistently and keep only them in your list.  A smaller, higher quality list leads to better open rates and far more consistent email deliveries.
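
As a rough sketch of that pruning step in Python: the 90-day window, the function name `active_recipients` and the sample addresses are all assumptions for illustration, and your own cutoff will depend on how often you send.

```python
from datetime import date, timedelta


def active_recipients(last_open, today, max_idle_days=90):
    """Keep addresses whose last recorded open is within max_idle_days of today."""
    cutoff = today - timedelta(days=max_idle_days)
    return sorted(
        address
        for address, opened in last_open.items()
        if opened is not None and opened >= cutoff
    )


# Hypothetical tracking data: address -> date of the last recorded open (None = never opened).
last_open = {
    "anna@example.com": date(2019, 10, 1),
    "bob@example.com": date(2018, 3, 14),
    "carol@example.com": None,
}
kept = active_recipients(last_open, today=date(2019, 10, 21))
print(kept)  # ['anna@example.com']
```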


2. The sender's name and subject line

When your emails first arrive at the inbox (or spam folder) of the user, the only thing they see is your sender's name and the subject line (sometimes even the subject line isn't fully shown, as some email clients have a smaller screen that trims it).

It makes sense that both the sender name and the subject line will largely determine whether the recipient will open your email, delete it without even opening it, or mark it as spam.

If you have a spam folder in your own email account, check it from time to time and pay special attention to the spam emails.  If you do this you'll see that a classic spam email usually uses the personal name of someone you never heard of, has a subject line that appears to be a response ("Re: xxxxx") or something urgent ("URGENT!!"), and contains text that doesn't really add any value to you.

If you do not want your mails to be immediately marked as spam, as a rule of thumb, do your best to avoid being perceived as a spammer.  DON'T use personal names as your sender's name unless the recipients already know you or tried contacting you first.  And obviously don't use blank, misleading, or nonsense subject lines: when people can't determine what your email is about by looking at your sender name and the subject line, they will immediately delete your message or, even worse, mark it as spam, which will hurt your reputation.

Make sure that people can immediately understand what your email is about by looking at the sender name and subject line.  This step is critically important.  You will be surprised to know that when your emails do provide some added value to people, then even if your email appears to be "spam", a lot of people will actually un-flag it and mark your mails as not-spam, and by this improve your reputation.

That's the next point ...


3. The content of the message itself

How does the content of the message affect your open rates?
Well, since the only way to really track opens is by using downloadable images inside the mail itself, people who open your email and don't download images by default won't be counted.

There are many mailbox providers who block images by default to protect their users' privacy.
When images aren't shown, that leaves just the plain html or text to read, which is why including text inside your email and not only images is a good practice.

So when they look at your email and realize that your email is interesting, or gives them an added value, then they use the "show images" button which then downloads your images, and adds more opens to you.  In addition, it also flags to the mailbox provider that your email is "relevant" or provides value, which also in turn improves your sender's reputation.

In today's email world, relevancy and value play a bigger role in email marketing than ever before.  It's not just in email, but also in search engines.  The days when spammers could abuse a mail system or a search engine for their own benefit using blackhat strategies are gone.

BUT - if your mail provides value to the recipients, if your mail is relevant to them, and if it looks good, they will tell their mailbox providers that your mail is "relevant" and good by flagging your emails accordingly.  So be relevant, add value to your users.   That's the way to go.

Tuesday, December 25, 2018

How to bypass spam filters and get your email delivered to the inbox

This is a topic that may be of interest to many internet marketers as well as brands.
So how do you bypass spam filters and get your email delivered to the inbox?


First step: Understand how spam filters work

One of the first things you'll need to realize is that spam filtering these days is an entirely automatic process.

Even though sometimes it might feel like there's a person who is manually reviewing your email and deciding that it looks like spam so it should be filtered - the reality is, it doesn't really work like that.  

Nearly 400 billion emails are sent daily, and even the smallest ISPs will still get hundreds of thousands of emails.  DAILY!  There's no way a person will sit and manually review every email.

So everything is automatic, and spam filters are just very clever, so in order to bypass them, you'll need to understand the process they use to detect spam.  Let's go over the criteria they use, in order of importance.


First criterion for detecting spam: Volume and Speed

Let's assume, for the purposes of this example, that you are a sender named someone@someplace.com.
You start sending your brand new email campaign to users of some server, let's say Hotmail.   If Hotmail doesn't "know" you yet (you are a brand new sender to them) then the system will give you a short time period of trying to "learn" your sending behaviour.

Now you might assume that it's ok to just blast away all your emails as fast as you can, but this is exactly the type of behaviour that marks you as a suspicious sender.  Because you see, this is exactly what spammers do - they try to blast away their emails as fast as possible, before they get "detected" and filtered.

So the first criterion of spam filtering that you have to pass, in order to be considered a legitimate sender (and thus have all your emails delivered to the inbox), is volume and speed.

You'll need to start slow.  Very slow, in fact.  And then gradually increase your volume and speed.  This is the kind of behaviour that is accepted, and in fact it also makes sense - the growth graph of almost every business out there starts slow and then gradually increases.  No business ever opens and immediately, within its first hour, starts blasting tens of thousands of emails.  Think about it.
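
One way to picture "start slow, then gradually increase" is a daily sending cap that grows by a fixed factor.  The sketch below is only an illustration: the starting volume, the growth factor and the function name `warmup_schedule` are my own assumptions, not numbers published by any ISP.

```python
def warmup_schedule(start_volume, target_volume, growth=1.5):
    """Daily send caps that start small and grow by a fixed factor up to the target."""
    if start_volume <= 0 or growth <= 1:
        raise ValueError("start_volume must be positive and growth must exceed 1")
    caps = []
    volume = start_volume
    while volume < target_volume:
        caps.append(int(volume))
        volume *= growth
    caps.append(target_volume)  # the final day sends the full target volume
    return caps


schedule = warmup_schedule(100, 1000, growth=2)
print(schedule)  # [100, 200, 400, 800, 1000]
```

If a provider starts deferring or filtering your mail partway through, the sensible reaction is to hold or lower the cap for a while rather than keep climbing.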



Second criterion for detecting spam: Spamtraps

Remember the "it feels like a person is manually reviewing your email.." phrase at the beginning of this article?  Well, there's no person who really sits there.  Most likely, either you failed the volume and/or speed criterion (see above), or there is an automated spamtrap address, or even a few of them, inside your lists.

Spamtraps are usually old and abandoned email addresses that haven't been used for a very long time.  VERY LONG time.   Which is exactly why they shouldn't end up in your "brand new" email list - they are dead addresses.   There's no human person who manually confirmed the inclusion of these addresses in your list.   So if you email them, it means that you are sending email to people who didn't request it, which makes you a "spammer".   

So when your email arrives at these addresses, an automated process just evaluates the sender of this email (you) and within a very short time marks you as a "spammer" whose further emails should be ignored and/or filtered.  It's actually a very clever mechanism!

So in order to get your emails delivered to the inbox, you have to make sure there are no spamtraps in your lists - only real and valid email addresses.  The best way to do this is to send a confirmation email to your list first, to see who is interested in receiving your email and who is not.  This is actually the method suggested by major ISPs as well.



Third criterion for detecting spam: Invalid addresses / Hard bounces

This goes back again to the common spammer behaviour.  Spammers just take an email list and start blasting it.  We talked about this already in the first criterion (volume and speed).  In addition, most of their lists contain invalid email addresses, so they get a relatively HIGH bounce rate.

Most legitimate email senders perform list hygiene practices to remove dead or invalid addresses and inactive users.  This goes back again to the second criterion above (spamtraps).  Because dead email addresses might turn, in time, into spamtraps, no legitimate sender will want them in his list.  So a legitimate sender might have less than 0.5% of his list made up of invalid addresses.  This is considered a really good value.

But if 30% of your list is invalid addresses, then that's a real problem.  Not only might at least a couple of these invalid addresses be spamtraps (which marks you as a spammer), but it also triggers the alarms of ISPs about you as a bad sender - which you DON'T want!

The solution to this?  Again, same solution as in the spamtraps section - send a confirmation email, first, to your list.   You might want to consider using email validation services to remove invalid addresses from your list even before you send your confirmation email.
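
To keep an eye on this number yourself, a small Python sketch like the following can help.  The 0.5% and 30% marks come from the text above; the function names and the "needs cleaning" label for everything in between are my own assumptions.

```python
def hard_bounce_rate(sent, hard_bounces):
    """Share of sent messages that hard-bounced (invalid or non-existent addresses)."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return hard_bounces / sent


def list_health(rate):
    """Label a bounce rate; the middle label is an assumption for illustration."""
    if rate < 0.005:
        return "good"
    if rate >= 0.30:
        return "real problem"
    return "needs cleaning"


print(list_health(hard_bounce_rate(10_000, 40)))     # good
print(list_health(hard_bounce_rate(10_000, 800)))    # needs cleaning
print(list_health(hard_bounce_rate(10_000, 3_500)))  # real problem
```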



Fourth criterion for detecting spam: Spam complaints

This just makes sense.  Maybe even more than the first 3 criteria I explained previously - if a lot of users are clicking the "This is spam" button, or "complaining" about you as a sender, then chances are you are a spammer.

However, you might be surprised to discover that this criterion is not as important as the other 3 I mentioned earlier.  Why, you may ask?

Well the first reason is, that most people who mark your email as "This is spam" don't even open your email in the first place.   They just see an unknown sender in their list, and just mark it with a checkbox and remove it.   Just the same as using the "Delete" button, but they do it with "This is spam" button.

The second reason, and this may come as a surprise to you, is that many people will actually OPEN emails from unknown senders, mostly out of curiosity.  And when this happens, if the email actually looks legitimate and looks safe, they may either click on it, or just use the "Unsubscribe" link inside of it instead of marking it as spam - which is exactly WHY you want to make the Unsubscribe link as easy to find as possible.

Another problem with relying on spam complaints as an indication of spam is market competition - if two companies compete against each other for clients, and both of them have the same clients, one company can give incentives to some of their clients to mark the other company's emails as "Spam", which then causes many legitimate emails to be filtered as spam for no reason.  Big problem!

So, I believe that ISPs take the spam complaint rate into consideration only if it is VERY high (above 3% of overall emails sent), that they give it only some percentage of the entire score, and that they have other measures which are a lot more "reliable" for detecting spam than just spam complaints (like volume and speed, spamtraps, invalid users, etc).



Fifth and final criterion for detecting spam: Content

You might think that content should be the first criterion for detecting spam, not the last.  But there's a reason why it's the final decision maker and I'll explain it.

Some of the best spammers know how spam filters work.  They know that the filters look at volume and speed, so they stay under those limits.  They know that the filters use spamtraps to mark senders.  They know that the filters measure the share of invalid addresses, and they know how relatively unimportant spam complaints are.

So what they do is pose as a thousand different senders, using a thousand different servers, all sending the same email.

In this case, spam filters can't really pin down the "sender", because it changes all the time.  And all the big criteria for detecting this email as spam fall apart - you can't assign volume and speed, spamtrap, and unknown user penalties to a specific sender.

But there's one thing that does remain constant, in this scenario, which is the content itself.
So, as a final line of defence, if all others fail, spam filters will look for patterns of content that repeat themselves, and mark those.

This is, of course, considered very bad behaviour, and is definitely not something that a legitimate sender will EVER do, so I don't really think you should ever have a problem with this level.




What if I followed ALL of these and still get filtered?

So you have a perfectly clean list - No spamtraps, No invalid addresses.
And you still get filtered and arrive at the spam folder, and not the inbox.
What do you do in such a case?

Well, first - you can always try to contact the email provider that hosts your recipients and let them know about this issue.  I find that a lot of ISPs are actually willing to help in such cases.

Second, you may want to check how you send your email.  Some ISPs will be more strict than others about topics like authentication (SPF records, DKIM, DMARC) or security (using a secure channel, like STARTTLS, to send emails).

Third, if you are not sure, try to send a simple text email to one of the servers where you are filtered.  Just a regular plain email.  This SHOULD go through.  If this doesn't work, try to change the sender's name several times.  If that doesn't work, try to change your sending IP address.  And if that doesn't work, try to change your email sending infrastructure or even try sending the email from a different infrastructure.  At least ONE of these combinations SHOULD work, and it may help you in debugging the reason why you are filtered.

Sometimes, you get filtered because of your sender name.  This is something you don't really have a lot of control over, because some spammer can just start sending emails and pretend to be you.  This is why authentication mechanisms like SPF, DKIM, and DMARC were created.
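
A quick way to see how a receiving server judged your SPF, DKIM and DMARC is to send yourself a test message and read its "Authentication-Results" header, which many receivers add.  The Python sketch below only parses that header with a simple regular expression; the function name `auth_results` and the sample message are hypothetical, and real headers can be more complex than this handles.

```python
import re
from email import message_from_string


def auth_results(raw_message):
    """Return the spf/dkim/dmarc verdicts found in the Authentication-Results header."""
    msg = message_from_string(raw_message)
    header = msg.get("Authentication-Results", "")
    verdicts = {}
    for method in ("spf", "dkim", "dmarc"):
        match = re.search(r"\b%s=(\w+)" % method, header)
        verdicts[method] = match.group(1).lower() if match else "missing"
    return verdicts


# Hypothetical test message as saved from the receiving mailbox.
raw = (
    "Authentication-Results: mx.example.net; spf=pass smtp.mailfrom=news.example.com; "
    "dkim=fail header.d=example.com; dmarc=pass header.from=example.com\n"
    "From: News <news@example.com>\n"
    "Subject: Test\n"
    "\n"
    "Hello\n"
)
verdicts = auth_results(raw)
print(verdicts)  # {'spf': 'pass', 'dkim': 'fail', 'dmarc': 'pass'}
```

Anything other than "pass" here is a good first place to look before changing sender names or IP addresses.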

Sometimes, you get filtered because your mailing infrastructure doesn't work properly, or doesn't generate a fully RFC-compliant email.

Sometimes, your server IPs might belong to a neighbourhood of IPs that have been known to send spam.

And sometimes, it can actually be some problematic words inside your content that are triggering the "alarms" of spam filters.

I hope this post helped you.

Monday, December 4, 2017

How to build your own spam filter based entirely on machine learning

Several posts ago, I explained the basic logic behind how spam filters work and I explained how to approach such a project from a programmer's point of view when factors like scalability come into play.

Today I want to explain how you can create your own spam filter that is not dependent on blocklists or external reputation services, in a way that makes sense and that's easy to understand and apply in practice.

First, I highly recommend that you read the post I linked above, which explains the logic, especially the part that is related to reputation filtering.  Once you've read that, or if you already did, feel free to continue.

"Machine Learning" explained

The term "Machine Learning" is usually brought up in every article that tries to explain spam filtering, with all sorts of mathematical equations, which usually just confuses and complicates a concept that is rather simple.

All it means is that we collect data, lots of data, and then aggregate that data to draw conclusions from it.  That's all this big term means.


Part 1 - Create an "internal" reputation table for ANY IP connecting to your server

The first part of "machine learning" that you need for your filter if you want it to be effective, is to simulate, as much as possible, what external IP reputation services are doing when they measure an IP's "reputation score".

In practical terms, what it means is that you need to count, for every IP connecting to your server:
  1. The number of connections made from that specific IP address
  2. The number of valid recipients vs invalid recipients - usually the higher the number of invalid recipients tried from an IP address over a given period of time - the higher the probability it is a "Spammer" trying to abuse your clients.
  3. The number of spam marked messages - if a certain percentage of messages received from this IP were marked as "spam" by your users, chances are, it's an IP belonging to a server that sends out spam!

So in general, a "good" IP address would be one that sends email to a high percentage of valid recipients, and from which a high percentage of the received messages aren't marked as "spam" by your recipients.

You can decide on thresholds of your own - for example, an IP would be in "good" reputation if 90% of the messages go to valid recipients, and 99% of the messages aren't marked as spam.  Then you can choose to change this reputation according to behaviour - if, for example, 30% of the recipients are invalid, you can change this IP's reputation to "suspicious", or if 10% of the messages from it are marked as "spam" it can also lead to "suspicious", and then you can set the thresholds for "blocked" to be 50% invalid recipients OR 30% spam reports.
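
Here is a minimal Python sketch of such a table, using the example thresholds above.  The names `IpStats`, `record` and `reputation` are illustrative only, the table lives in memory rather than in a database, and the "neutral" label for IPs that fall between "good" and "suspicious" is an assumption of mine.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class IpStats:
    connections: int = 0
    valid_rcpts: int = 0
    invalid_rcpts: int = 0
    messages: int = 0
    spam_marked: int = 0


def record(table, ip, valid_rcpts, invalid_rcpts, messages, spam_marked):
    """Add the counters from one connection to the row for this IP."""
    row = table[ip]
    row.connections += 1
    row.valid_rcpts += valid_rcpts
    row.invalid_rcpts += invalid_rcpts
    row.messages += messages
    row.spam_marked += spam_marked


def reputation(row):
    """Classify an IP from its counters using the example thresholds."""
    rcpts = row.valid_rcpts + row.invalid_rcpts
    invalid_ratio = row.invalid_rcpts / rcpts if rcpts else 0.0
    spam_ratio = row.spam_marked / row.messages if row.messages else 0.0
    if invalid_ratio >= 0.50 or spam_ratio >= 0.30:
        return "blocked"
    if invalid_ratio >= 0.30 or spam_ratio >= 0.10:
        return "suspicious"
    if invalid_ratio <= 0.10 and spam_ratio <= 0.01:
        return "good"
    return "neutral"  # assumption: between "good" and "suspicious"


table = defaultdict(IpStats)
record(table, "192.0.2.10", valid_rcpts=95, invalid_rcpts=5, messages=95, spam_marked=0)
record(table, "198.51.100.7", valid_rcpts=60, invalid_rcpts=40, messages=60, spam_marked=3)
record(table, "203.0.113.9", valid_rcpts=10, invalid_rcpts=90, messages=10, spam_marked=5)
for ip, row in table.items():
    print(ip, reputation(row))
```

In a real filter you would probably also expire old counters, so that an IP can recover its reputation once the bad traffic stops.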


Part 2 - Create an "internal" reputation table for ANY domain used in received emails

Creating an internal reputation score ("good" / "suspicious" / "bad") for an IP is one part of the learning, but it can have some problems as I explained in my linked article when it comes to false-positives and also blocking legitimate email that could be sent from that IP.

So to complete it, create an internal reputation table for ANY domain used in the received emails - extracting the domains being used in messages is a rather quick process.

Then, have a table made up of domain names, and for every domain count the number of messages received with that domain name as a link, AND how many of the messages with that domain were marked as "spam".

And again, apply the same kind of percentage counting here as well - a domain that is in "good" reputation can be one where 99% of the messages received with this domain name are good messages and only 1% were marked as spam.  A "suspicious" domain can be one where the percentage of messages containing this domain that were marked as spam is up to 10%, and a "blocked" domain can be one where the percentage goes above 10%.

Applying an internal reputation for domain names can be a much better choice, because it helps you avoid blocking whole IP addresses and be more specific about the domain names instead.
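
A rough Python sketch of the domain table could look like this.  The URL regular expression is deliberately simple, the class name `DomainTable` and the sample messages are hypothetical, and the "unknown" label for domains that were never seen is my own addition.

```python
import re
from collections import Counter
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def linked_domains(body):
    """Return the set of lower-cased host names linked in a message body."""
    domains = set()
    for url in URL_RE.findall(body):
        try:
            host = urlparse(url).hostname
        except ValueError:  # malformed URL, skip it
            continue
        if host:
            domains.add(host)
    return domains


class DomainTable:
    def __init__(self):
        self.seen = Counter()  # messages received that link to the domain
        self.spam = Counter()  # of those, how many were marked as spam

    def record(self, body, marked_spam):
        for domain in linked_domains(body):
            self.seen[domain] += 1
            if marked_spam:
                self.spam[domain] += 1

    def reputation(self, domain):
        if self.seen[domain] == 0:
            return "unknown"  # assumption: never seen before
        ratio = self.spam[domain] / self.seen[domain]
        if ratio <= 0.01:
            return "good"
        if ratio <= 0.10:
            return "suspicious"
        return "blocked"


domains = DomainTable()
for i in range(100):
    domains.record('<a href="https://news.example.org/issue/7">read</a>', marked_spam=(i < 1))
for i in range(20):
    domains.record("Sale at http://Shop.Example.com/deal today", marked_spam=(i < 5))
print(domains.reputation("news.example.org"))  # good
print(domains.reputation("shop.example.com"))  # blocked
```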

And if you want to extend it even further ...


Part 3 - Create an "internal" reputation table for ANY "From address" used in received emails

A "From address" is usually the way in which a specific sender identifies himself.  It is usually a brand name, it's usually consistent, and senders usually do not change it - because a lot of recipients who are interested in the sender's services will usually add him to their contact list or "whitelisted" senders list.

So again, the logic for this table would be exactly the same as the one used for domain names - except that you'll use the From address (for example: "support@microsoft.com") to do the learning.

And again, you can give a reputation score to a From address - "Good", "Suspicious" and "Blocked" - using the same logic I suggested above.
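
The only new piece here is pulling a consistent key out of the "From:" header.  A small Python sketch, assuming you have the raw message text at hand (the function name `from_address_key` is illustrative):

```python
from email import message_from_string
from email.utils import parseaddr


def from_address_key(raw_message):
    """Return the lower-cased address from the From: header, or '' if there is none."""
    msg = message_from_string(raw_message)
    _display_name, address = parseaddr(msg.get("From", ""))
    return address.lower()


raw = "From: Microsoft Support <Support@Microsoft.com>\nSubject: Hi\n\nBody\n"
key = from_address_key(raw)
print(key)  # support@microsoft.com
```

The key can then be counted in a table exactly like the domain names: messages received, and messages marked as spam.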



What about content?  I see a lot of email messages that all look quite the same, but are sent from different IPs, domains and From addresses.  Is there a way to identify and block those?

Yes, there's a way.  As I explained in the previous article, I don't really recommend going too deeply into content filtering because it can be a rather heavy solution that doesn't scale, but if all the above methods didn't help you get rid of the spam, then you have to add some level of content filtering and I'll try to explain it here.

Theoretically speaking, a "bulk message" is usually the same message being sent to many people at once, and because the emphasis is on "same message", it will have some kind of constant structure that can be detected and marked as "Good", "Suspicious" or "Blocked".

Some of the things to look for on bulk messages that are being marked as "spam" but come from different IPs, domains or From addresses, and that can be checked for in a scalable way:

  1. Length of the message - Yes, as simple as this may sound, if the same message gets marked again and again by users as "Spam" it will usually have either a consistent length, or a length that changes by 2-3% of its size at most.
  2. Content Type, Number of Parts, and size of every part - If the Content-Type is the same, and the number of parts being used (text, html and attachments) is the same, that's also something that is very easy to check quickly.
  3. If Images are included - CHECK THEIR DIMENSIONS!  - This is again something that can be very quick to test, especially if it's an embedded image - embedded images are usually in the form of JPG, GIF and PNG, and these formats usually have the image's dimensions accessible very quickly.

These 3 should be enough for this type of "content filter".  I wouldn't go for an actual word check because spammers will usually just use random combinations of words, making a dictionary type of lookup useless.

What you can do is create a "profile" for every message that is marked as spam - extract the length of the message, the content type, number of parts, size of each part, and image dimensions.  One "row" that contains all these details shouldn't take more than 50-100 bytes of storage.  Then, for any message that arrives, extract the same elements and compare them to the ones already in your database, and keep the same counts for them as well - the number of messages received that match this "content profile", and the number of those that were marked as spam.  If the number of messages marked as spam is high, you can mark this "content profile" as "Blocked", and then use it to block similar messages as well.
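
To show how cheap such a profile is to compute, here is a sketch using Python's standard email parser.  Everything named here (`Profile`, `content_profile`, `same_profile`, `sample_message`) is illustrative; it reads dimensions only from PNG and GIF headers (JPEG needs a little more parsing and is left out), and the 3% length tolerance follows the 2-3% figure mentioned above.

```python
import struct
from email import message_from_bytes
from email.message import EmailMessage
from typing import NamedTuple


class Profile(NamedTuple):
    length: int
    content_type: str
    part_count: int
    part_sizes: tuple
    image_dims: tuple


def image_dimensions(data):
    """(width, height) read from a PNG or GIF header, else None."""
    if len(data) >= 24 and data[:8] == b"\x89PNG\r\n\x1a\n" and data[12:16] == b"IHDR":
        return struct.unpack(">II", data[16:24])
    if len(data) >= 10 and data[:6] in (b"GIF87a", b"GIF89a"):
        return struct.unpack("<HH", data[6:10])
    return None


def content_profile(raw):
    """Build the structural profile of one raw message (bytes)."""
    msg = message_from_bytes(raw)
    leaves = [part for part in msg.walk() if not part.is_multipart()]
    sizes, dims = [], []
    for part in leaves:
        payload = part.get_payload(decode=True) or b""
        sizes.append(len(payload))
        if part.get_content_maintype() == "image":
            dims.append(image_dimensions(payload))
    return Profile(len(raw), msg.get_content_type(), len(leaves), tuple(sizes), tuple(dims))


def same_profile(a, b, tolerance=0.03):
    """Same structure and image sizes, with total length within the tolerance."""
    close = abs(a.length - b.length) <= tolerance * max(a.length, b.length)
    return close and a[1:3] == b[1:3] and a.image_dims == b.image_dims


def sample_message(text):
    """Hypothetical bulk message: some text plus a (truncated) 600x400 PNG."""
    msg = EmailMessage()
    msg["Subject"] = "hypothetical sample"
    msg.set_content(text)
    png = b"\x89PNG\r\n\x1a\n" + struct.pack(">I", 13) + b"IHDR" + struct.pack(">II", 600, 400)
    msg.add_attachment(png, maintype="image", subtype="png", filename="banner.png")
    return msg.as_bytes()


first = content_profile(sample_message("Buy now, random words: alpha bravo"))
second = content_profile(sample_message("Buy now, random words: charlie delta"))
print(first.content_type, first.part_count, first.image_dims)
print(same_profile(first, second))
```

The two sample messages differ in their random words but still match on structure, which is exactly the case a word-based check would miss.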

Conclusion

So, in this article I gave you a practical idea of how to create your own spam filter that is based on "machine learning" over all the parameters of the message that really count - the IP address, the domain used, the From address, and a way to quickly create a "content profile" to be checked as well.

These should be enough for you to create a good and reliable spam filter for your company.

Perfecting this type of filter really comes down to research, experience and "out of the box" thinking - usually unsolicited bulk messages (aka "spam") will either be sent from a consistent IP space, have a consistent "From" address, have a consistent domain linked inside, or have a consistent content structure that can be detected.

And you can also make your filter even more accurate by using external blocklists, as I suggested in my previous article, like Spamhaus.org which are very reliable at detecting spamming activities.

One last word I must add, though: while some filters are more accurate than the rest, no spam filter is 100% accurate.  For example - a spammer sending emails from different IPs, using different From addresses, with NO linked domain, and with randomized text in every message will probably never be detected by ANY filter.  Recently, though, an authentication protocol called DMARC has started to be widely used, which is aimed at authenticating the "From:" address to make sure it's coming from legitimate senders.  But even with this, no message is 100% undetectable, and this is why stopping spam is a global effort - not only do the companies that provide spam filters do their best to detect it, but the ISPs that send the mail (like Hotmail, Gmail, etc.) also take measures on their part to stop spammers from abusing their systems (like deploying DMARC, counting the number of messages sent by each sender, and much more).

Saturday, January 28, 2017

PopUp/PopUnder is the WORST advertising strategy (+FACTS!)

Marketing is one of the key elements for a successful campaign / product launch and even sales.
You have your product ready to be sold, and you need to get people exposed to it, so some of them will be interested enough to buy it.

While I myself represent the E-mail marketing world, over the years I have had my experience with display ads (Banners), but recently I wanted to put the PopUp/PopUnder marketing strategy to a test.

I had read a few blog posts not long ago that suggested that popups have a 2% CTR.  Trusting this data blindly, one of our clients, who has a viral website, put out a popup/popunder campaign to test the effectiveness of this marketing method.

The results were quite disappointing - out of 1600 popups that were shown, only 3 people clicked on his landing page.  3 clicks out of 1600 popup impressions = roughly 0.19% CTR !!

We must add another fact here - the landing page that he displayed in the popup used exactly the same design that was sent in his E-mail campaigns.  And in his e-mail campaigns this same design brought up to 7% CTR, which is roughly 37 times higher.

So we wanted to put the popup/popunder to a more accurate test, to better reflect the performance of popups/popunders.

This is what we did:

1. We created a landing page based on the same design he uses in his email campaign.
2. We set up 4 signals to measure the popup performance:

  • First signal is fired when the landing page is first called by the popup ad network
  • Second signal is fired 2 seconds after the page was initially requested
  • Third signal is fired when the page is physically viewed by the client (using the Visibility API of modern browsers)
  • Fourth signal is fired when a user clicks on one of the links in the landing page
3. We also installed a user-recording analytical tool (similar to ClickTale) to see what the user has physically done on the page in case he interacted with it.


The results:

We gave this landing page a few hours to run, collecting well over 500 popup/popunders that were called.  The result:

  • 100% of the popups fired the first signal (initial page request)
  • 45% of the popups fired the second signal (JS event after 2 seconds since the page loaded)
  • 3.4% of the popups fired the third signal (indicated the page was physically viewed)
  • 0% of the popups fired the fourth signal (click on one of the links)

So, from this test we can conclude that before we calculate the CTR of the popup, we first need to get the visitor to actually SEE the landing page.  This is true in any form of marketing - you can only draw conclusions based on real, physical views of your advertisement.

And as we can see from this data, the results are far from being in our favor - only 3.4% of the people who have a popup or popunder shown to them get to the point where they really see what's inside.

We can assume, then, that a little more than 96% of the people that are exposed to a popup/popunder NEVER SEE THEM.  They just either close them immediately, or they close the popup window without even opening it (This is, of course, possible just like you can close any window in your desktop without having to open it - A simple right-click on the window, and an option to close it is available).

So, if around 96.6% of popups aren't even viewed, consider the following: out of 100,000 popups generated, 96.6% aren't even seen.  That means that out of 100,000 popups, only 3,400 people will actually see your ad.
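
To project the measured rates onto any campaign size, a few lines of Python are enough.  The rates are the ones from the test above; the names `SIGNAL_RATES` and `project` are illustrative, and the projection assumes the rates hold at larger volumes, which a sample of a little over 500 popups cannot guarantee.

```python
# Share of popups that fired each signal in the test above.
SIGNAL_RATES = [
    ("page requested by the ad network", 1.00),
    ("still loaded after 2 seconds", 0.45),
    ("actually viewed", 0.034),
    ("clicked a link", 0.0),
]


def project(popups):
    """Expected number of popups reaching each signal, for a given campaign size."""
    return [(name, round(popups * rate)) for name, rate in SIGNAL_RATES]


projection = project(100_000)
for name, count in projection:
    print(f"{name}: {count}")
```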

One more thing to consider as a disadvantage of popup/popunder marketing is the obvious pattern interrupt and context-switching that this form of marketing presents - the visitor has his attention somewhere else, when suddenly an intrusive attempt to sell something to him pops up in front of his face.  And people's natural response to a pattern interrupt is to simply ignore it and go back to what they were doing before.  This is another aspect which makes popups a really bad marketing strategy, even more so than banner ads.

I hope this information was useful to anyone who wishes to consider popup advertising.

I won't completely ditch popup advertising because there are some advantages to it as well (for instance, the relatively low cost), but I thought it was necessary to have this information out there so people can evaluate this marketing strategy for themselves.






Friday, January 6, 2017

Razor2 "blacklist" .. apparently that's what it is

In my last post about how spam filters really work, I talked about the use of distributed "black lists" that are used to collaborate on blocking senders' IPs, domains, or content based on some "shared" knowledge.

Most of the "known" blacklists, i.e. services that openly describe themselves as blacklisting services, are usually fed from automated honeypots (aka "spamtraps") to collect their data.

The reason why automated honeypots are used is the huge amount of spam email that is sent each day.  I mean, think about it.  There are several billions of emails sent each day, and according to the latest statistics from Cisco's SenderBase online email monitoring service, about 85% of email sent is "spam".

That is far too much for humans to read and interpret message by message, so blacklists use automated honeypots to "flag" a sender, and on the other hand give senders the option to request removal from the blacklist in case of a false positive.

And then here comes the Razor2 engine that is used by SpamAssassin, and apparently by a big majority of email filtering platforms.

If you don't know what Razor2 is, let me refer you to some email headers generated by SpamAssassin that will give you a clue what I'm talking about.  If you've seen any of the following:


RAZOR2_CHECK Listed in Razor2 (http://razor.sf.net/) 

RAZOR2_CF_RANGE_E4_51_100 Razor2 gives engine 4 confidence level above 50%
                          [cf: 100] 

RAZOR2_CF_RANGE_51_100 Razor2 gives confidence level above 50% [cf: 100] 



Then congratulations - You are listed in Razor2's database.. erm.. sorry.. blacklist.
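If you'd rather check a batch of messages for these hits than eyeball every header, a few lines of Python will do. This is only a minimal sketch: the rule names come from the report lines quoted above, while the function name and the sample text are my own, and it does nothing more than pattern-match the report text.

```python
import re

# Matches SpamAssassin rule names that start with RAZOR2_ (as in the report above).
RAZOR2_RULE = re.compile(r"\bRAZOR2_[A-Z0-9_]+\b")

def razor2_rules(report_text):
    """Return the distinct RAZOR2_* rule names in a report, in order of appearance."""
    seen = []
    for name in RAZOR2_RULE.findall(report_text):
        if name not in seen:
            seen.append(name)
    return seen

# Sample text copied from the report lines quoted in this post.
sample_report = """\
RAZOR2_CHECK Listed in Razor2 (http://razor.sf.net/)
RAZOR2_CF_RANGE_E4_51_100 Razor2 gives engine 4 confidence level above 50%
                          [cf: 100]
RAZOR2_CF_RANGE_51_100 Razor2 gives confidence level above 50% [cf: 100]
"""

print(razor2_rules(sample_report))
```

An empty list means the report mentions no Razor2 rule at all.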

It's funny that Razor2's creators insist that it is not a blacklist.. but a "distributed hash sharing system".
So.. here are some interesting facts to know about Razor2 :


Fact #1 - Razor2 is now owned by CloudMark security

You're probably familiar with CloudMark's IP reputation check tool.  They are a huge player in the field of email security / threat protection and of course.. spam filtering.

And it turns out that they decided to purchase Razor2, and not only that.. but they also host Razor2's main query domain:


Name: discovery.razor.cloudmark.com
Address: 208.83.139.205
Name: discovery.razor.cloudmark.com
Address: 208.83.137.118
Name: discovery.razor.cloudmark.com
Address: 208.83.137.117



Fact #2 - Razor2 is a blacklist

A sender blacklist is, by definition, any distributed list of blocked IPs and/or sending domains that are used to send spam.. that's exactly what Razor2 is.


Fact #3 - Razor2 does not offer a removal tool

Actually in this aspect, I can understand.  Razor2 tends to be a reputation based system, which means that records are "cleaned" once the offending (spam) traffic has ceased to be sent.  Gmail and many other known email providers seem to work that way as well.



One Unknown fact about Razor2

There's one aspect of the Razor2 list, though, that remains a mystery: who feeds this list?

Razor2 has an Outlook plugin for reporting email to them, so apparently the list is fed by real humans who get spam messages and then use this "report" plugin to send the reported hash to the Razor2 database.

On the other hand.. a question arises.. how many reports does it take for them to decide that a sender should be blocked?  1 report?   2 reports?   10 reports?

And the other question is.. who can guarantee that there's no person who just installed a honeypot with a Razor2 report plugin that automatically reports any message the moment it arrives?

Interesting questions indeed, and I hope that someone from Razor2 will one day provide the answers.

Friday, July 1, 2016

How do spam filters actually work?

Nowadays spam email has become a serious problem, which is why spam filters were invented.

There are many articles out there that outline how spam filters are "supposed" to work.  Some of these are guesses at best, and very few give actual technical details on how this thing works.

So in this article, I'm going to give you my perspective as a programmer, as someone who needed to write a spam filter for his own domain.


Solution #1: do spam filters block spam based on content?

We are all familiar with those online pharmacy spam emails trying to make us buy "pills" or "medications" from them, or with those fake PayPal confirmation emails that try to steal our personal login information (and in fact some of them actually succeed in doing so!).

So our initial thought would be to block the emails based on content that sits in them - if an email contains words like "viagra" / "pharmacy" or "free shipping!!" it is most likely a spam message.

But this is where the problems only begin.  You see, spammers became smart and found ways to use other spellings and combinations for viagra, such as v1agr4, or even to use images in place of those specific words.
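To make the cat-and-mouse game concrete, here is a tiny sketch of a word-list filter and what one round of "getting smarter" looks like. The word list and the character substitution table are hypothetical examples of mine and are not taken from SpamAssassin or any real filter.

```python
# Hypothetical word list and substitution table, for illustration only.
SPAM_WORDS = {"viagra", "pharmacy"}
LEET_MAP = str.maketrans({"1": "i", "4": "a", "3": "e", "0": "o", "@": "a", "$": "s"})

def normalize(word):
    """Lower-case a word and undo a few common character substitutions."""
    return word.lower().translate(LEET_MAP)

def naive_match(text):
    """Plain dictionary lookup: misses any spelling that is not in the list."""
    return any(w.lower().strip(".,!?") in SPAM_WORDS for w in text.split())

def normalized_match(text):
    """Same lookup, after mapping look-alike characters back to letters."""
    return any(normalize(w).strip(".,!?") in SPAM_WORDS for w in text.split())

print(naive_match("Buy v1agr4 now!!"))       # the obfuscated spelling slips through
print(normalized_match("Buy v1agr4 now!!"))  # caught once the substitutions are undone
```

Every new trick (extra spaces, look-alike Unicode letters, words inside images) needs yet another rule or table entry, and each one adds work per message.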

Content-based spam filters were good in the beginning, when there was only a very small dictionary of words to search for.  But as spammers became more intelligent, the spam filters tried to keep up, and the dictionary files became bigger and bigger.. to the point where the CPU consumption of such filters made them ineffective.  In fact, a lot of people who installed such filters in the past had to either remove them completely because their servers got stuck, or dramatically reduce their CPU usage and by doing so also reduce their capability of filtering content.

I actually found this post on serverfault.com where some guy asked for help on how to reduce his SpamAssassin's memory and CPU usage.  This was not a joke at all!

The main problem with content-based filtering is scalability.

If you are running a very small email exchange, where you receive around 100-200 emails daily, then you can use SpamAssassin or other content-based solutions, and it can be OK (although it may take up to 2-10 seconds to "scan" the email in order to filter it).

However, the "big" folks, Yahoo, Gmail and Hotmail, whose servers receive billions of messages each day, just can't allow any bottleneck of CPU or memory usage on their servers, so they need a faster and a lot more scalable solution.


Solution #2: I heard about this "reputation" thing, what is it?

If content filtering is one side of the spam filtering scale, then sender reputation would be the other side of it.  A sender's reputation is basically an internal score from 0-100 that is kept in the receiving domain's database (be it Hotmail, Yahoo, Gmail or others..) for a given "sender", "domain" or "website".

A "sender" simply means any of the following: the IP address that sent the message, the domain that sent the message (in SMTP's "MAIL FROM" command), the domain that appears in the From: address, or any domains that appear inside the message's body (like redirection links etc.), either each on its own or in some combination.

Some servers even keep a different reputation score for each one of these.  For example, a sending IP address may have a score of 100 (perfect), but the domain that sends the message may have a low score (20), or the From: address may have a low score, or a message may contain links inside that belong to a domain that has a very low score.

This solution is essentially a perfect solution for very large servers, because it is extremely scalable (requires almost no CPU or memory consumption at all), and also pretty practical in servers like Hotmail, Yahoo and Gmail where they have a very large number of users that report spam messages, and by doing so they actually teach the filters directly.

The only problem with this solution is that it may generate false positives: it is not always accurate, and may mistakenly flag legitimate emails as spam.  That in turn requires Hotmail or Yahoo to provide technical support for senders whose messages don't get delivered, and to "mitigate" or "clean" their reputation score once in a while.

How is this possible?   Let's examine this scenario:

A new server sends a message to Hotmail, his IP is 1.2.3.4.   When he first arrives at Hotmail, his "reputation" score is theoretically clean, so he'll have a 100 score.  Now he starts sending some messages, some of them are legitimate emails (plain-text messages, co-workers information exchange, invitations, etc), and some are spam.

People will report the spam messages, and each report will affect the IP's score, reducing it a little each time.   Once the IP's score drops below some threshold, almost all messages from this IP will be sent to the Junk folder, including some legitimate email.
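Here is the scenario as a toy simulation. The starting score of 100 is from the text above; the penalty per report and the Junk threshold are numbers I made up for illustration, since no provider publishes its real values.

```python
def simulate_reports(start_score, penalty_per_report, junk_threshold, reports):
    """Return the score after each report, and the report number at which
    the score first falls below the Junk threshold (None if it never does)."""
    score = start_score
    history = []
    junked_at = None
    for n in range(1, reports + 1):
        score = max(0, score - penalty_per_report)  # the score never goes below 0
        history.append(score)
        if junked_at is None and score < junk_threshold:
            junked_at = n
    return history, junked_at

# Assumed values: each report costs 2 points, Junk folder below a score of 50.
history, junked_at = simulate_reports(
    start_score=100, penalty_per_report=2, junk_threshold=50, reports=40
)
print(junked_at, history[-1])
```

Notice that nothing in the loop asks whether the next message is the spam or the co-worker's invitation: once the score is under the threshold, both get the same treatment.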

This is a real problem not only for the sender, but for Hotmail as well.   If all the messages from a sender whose mail you really want (for example: bank notifications, money transfers, etc) were sent to the Junk folder at best, or at worst did not arrive at all, then you would leave Hotmail and go to another email provider, causing Hotmail to lose customers.  Which is not good for them at all!

So, because of this problem, many servers must use multiple reputation scores and give each one a different impact on the overall "decision" whether to allow a message in or not.  But just as this problem happened with the IP address, the same problem could also happen with any of the other elements I mentioned.
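One simple way to picture "multiple scores with different impact" is a weighted average. The four elements are the ones listed earlier in this post; the weights and the threshold are hypothetical, and a real provider may combine its scores in a completely different way.

```python
# Hypothetical weights (integers, so the arithmetic stays exact).
WEIGHTS = {"ip": 4, "mail_from_domain": 2, "from_domain": 2, "link_domains": 2}

def overall_score(scores, weights=WEIGHTS):
    """Weighted average of the per-element reputation scores (each 0-100)."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight

def decide(scores, junk_threshold=50):
    """Return 'inbox' or 'junk' for a message, given its per-element scores."""
    return "inbox" if overall_score(scores) >= junk_threshold else "junk"

# A clean IP can carry a message whose other elements score badly...
mixed = {"ip": 100, "mail_from_domain": 20, "from_domain": 80, "link_domains": 30}
# ...and a badly scored IP can sink a message whose other elements are fine.
bad_ip = {"ip": 30, "mail_from_domain": 60, "from_domain": 60, "link_domains": 60}

print(overall_score(mixed), decide(mixed))
print(overall_score(bad_ip), decide(bad_ip))
```

Spreading the decision over several scores softens the damage from any single one, but as the second example shows, it does not remove the false positives.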


So on one hand we have content filtering, which is CPU intensive but provides good results, and on the other hand we have reputation-based filtering, which is very light on the CPU but produces false positives.

What other solutions are there that are both light on the CPU and at the same time pretty effective at determining whether a message should be filtered or not?



Solution #3: What are blocklists, and are they any good?

A blocklist is basically a list, run by a third-party provider like SpamHaus or SURBL, of known IP addresses and/or domains that appear in spam messages.  Anyone can query it, either via a DNS query or a URL request.

This solution is great because it is scalable (sending a DNS or web request takes almost no CPU at all), and generally anyone with a simple mail server can apply it without much technical knowledge.
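For the DNS flavour, the usual convention is to reverse the octets of the IPv4 address, append the blocklist's zone, and look that name up: if it resolves, the address is listed. Below is a minimal sketch of that. The zone name zen.spamhaus.org is my example and is not from this post, the helper names are mine, and the lookup function is defined but not called here because it needs network access.

```python
import ipaddress
import socket

def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up for an IPv4 address in a blocklist zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on a malformed address
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"

def is_listed(ip, zone):
    """True if the query name resolves. Any resolver failure counts as
    'not listed' here, which a production check should not assume."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False

print(dnsbl_query_name("1.2.3.4", "zen.spamhaus.org"))
```

Real blocklists also encode information in the address they return, and some refuse queries that arrive through large public resolvers; this sketch ignores both.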

The question that comes to mind when using blocklists is: are they reliable?

Blocklists are basically another form of reputation-based system, and as such they can also make mistakes (which is why many of them provide a web form to request de-listing).

In the past, blocklists used to be maintained by individuals on mailing lists who had "honeypot" email addresses with spam messages sent to them, and then a human sitting and reading those messages would make a decision on what IP or domain should be added, if at all, to their block list.

But as the internet grew, and the number of spam messages grew with it, in many cases it became almost impossible for a human to keep track of all the arriving messages, which is why the automated spamtrap was created: it simply replaces the human with a machine-based algorithm that is supposed to "read" the arriving messages and decide whether or not to add that sender to the block list.

Blocklists are almost entirely based on automated spamtraps, and as such some legitimate email may land in a spamtrap as well, which is why even they will sometimes have false-positives in this way.

So in order to understand the blocklists entirely, we must provide a section on spamtraps, which will help shed some light on blocklists as well, and then we'll be able to conclude this article.



Solution #4: The Spamtraps

Spamtraps and blocklists are generally linked together, so what is a spamtrap?

A spamtrap is basically an email address that is not supposed to be on any email list.   It is an address that has never signed up for any newsletter, never conducted any email communication, and generally speaking was never used before.. so by this logic, it makes sense that any email that arrives in that mailbox is not wanted, or in practical words: unsolicited.

Blocklist operators have deliberately set up such addresses, and a lot of them.   In fact, there are around 220 million spamtrap addresses around the world today according to Project Honeypot.

These addresses are usually published in "hacked" lists that spammers would usually collect, and then start mailing them.. and these spamtraps are usually just connected to an automated system that would simply mark any email arriving at them as spam immediately, flagging the IP address and the domain name of the sender immediately.

Of course, a single spamtrap hit every now and then can be understandable, because some of these addresses are just typos (like hotmial.com or gimal.com, etc..).  But if a specific sender hits 100 or more different spamtraps a day, then he's bound to be blocked.

This is in general a really smart solution that everyone gains from.

Email providers can use blocklists that are fed from spamtrap data, and usually these will be very accurate in predicting whether a sender is a spammer or not.   This makes detecting spam messages efficient and accurate, and it also makes blocking the messages on the server side very scalable, since using block lists needs almost no CPU power at all.
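The "one hit is forgivable, a hundred is not" rule is easy to sketch. The limit of 100 distinct spamtraps per day is the figure from the paragraph above; the function name, the data layout, and the example addresses are hypothetical.

```python
from collections import defaultdict

DAILY_TRAP_LIMIT = 100  # "100 or more different spamtraps a day", as in the text

def senders_to_block(hits, limit=DAILY_TRAP_LIMIT):
    """hits: (sender_ip, trap_address) pairs seen during one day.
    Returns the senders that reached `limit` or more DISTINCT traps."""
    traps_by_sender = defaultdict(set)
    for sender_ip, trap in hits:
        traps_by_sender[sender_ip].add(trap)  # repeat hits on one trap count once
    return {ip for ip, traps in traps_by_sender.items() if len(traps) >= limit}

# One sender spraying a harvested list, one sender with a single typo address.
hits = [("1.2.3.4", f"trap{i}@example.com") for i in range(150)]
hits.append(("5.6.7.8", "someone@hotmial.com"))

print(senders_to_block(hits))
```

Counting distinct traps rather than raw hits is what keeps a newsletter with one mistyped subscriber from looking like a spammer.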

While this is not a 100% solution, it is very close to an ideal solution.

And then, there is one last and final part of content filtering, which is also related to spamtraps, and that is signatures.



Solution #5: Mail block signatures as a scalable solution for content-filtering

A mail block signature is basically a light-weight and scalable solution to content filtering.

Imagine a mail hitting a spamtrap.   Other than the fact that the sending IP and/or domain gets a reputation score (which can sometimes be inaccurate, as explained above!), a signature is generated for the message.  The signature is a large number calculated from a statistical analysis of the content inside the email.

This is a good scalable solution, because we can take a single email message, produce a quick tag cloud of the words used, HTML tags, links, etc., and create a signature that would identify similar content.  We then send this mail signature to some third-party block list like Symantec's or Cloudmark's.  Other email servers that receive similar messages generate the signature on their side, check whether it exists on the block list, and then decide whether to block the message based on its content.

This provides a good solution to the content-filtering problem that I described in solution #1 by providing a much faster way to "analyse" the content based on statistical analysis.
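As a rough illustration of the idea, the sketch below reduces a message to its set of words and hashes that, so trivial rewording such as case, punctuation, or word order does not change the signature. This is my own simplification using SHA-256 and is not the algorithm Razor2, Cloudmark, or Symantec use; real systems rely on fuzzier signatures that survive bigger changes.

```python
import hashlib
import re

def content_signature(body):
    """Hash a normalized view of the message: its lower-cased words,
    de-duplicated and sorted, so word order and punctuation do not matter."""
    words = re.findall(r"[a-z0-9]+", body.lower())
    canonical = " ".join(sorted(set(words)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Stand-in for the shared third-party list: signatures of mail seen in spamtraps.
known_spam_signatures = {content_signature("Buy cheap pills now! Free shipping!!")}

def is_known_spam(body):
    """A receiving server computes the signature locally and checks the list."""
    return content_signature(body) in known_spam_signatures

print(is_known_spam("FREE shipping... buy cheap pills NOW"))
print(is_known_spam("Meeting moved to 3pm"))
```

The receiving server does one hash and one lookup per message, which is where the scalability comes from compared to scanning against a growing dictionary.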

This solution is used in spam filtering too, and is very effective.



Conclusion:

I have presented in this article all the different techniques and methodologies that spam filters use in order to block email.

When I had to add spam filtering to my own server, I decided to go with the block lists solution, because it is very scalable, light weight, and with the kind of spam messages I receive on my servers, I can rely on it.

There are some other techniques to prevent spam messages from arriving, like greylisting, which I didn't mention in this article because there are other articles on the internet that talk about them, and I wanted to focus on the technical view and most importantly on the scalability considerations.








Saturday, January 23, 2010

How to make your website load faster

Most websites these days share a very similar layout concept - there is a "template" that contains the page's header, a side or top menu, a bottom "Copyright" text, and a "middle section" where the actual page content is shown.

While other considerations, such as database performance, may impact the loading speed, the real causes of slow loading are the size of the HTML pages being transmitted, and the size and number of images (GIFs/JPGs etc) that are being sent.

Therefore, if you want to make your website load faster you'll need to make sure that your images take as little space as possible, and that your HTML pages are as small as they can be.

Google is a good example site to look at.
If you look at the size of the HTML of their homepage you will see it takes just 4KB, which is nothing!

So how do you achieve this?

1. Reducing image size - Use an image optimizer. Adobe's ImageReady and PhotoShop are great products for just that, as they let you save "optimized" images while showing you a preview of how the compressed image will look, allowing you to pick the maximum compression that still looks good.

2. Reducing HTML size

Now, as far as your HTML goes, here's a quick little tip: If your pages have a common structure (such as a table) that is used to draw menus or catalog items etc, you can dramatically increase the loading speed of such pages by creating a JavaScript function that does the "drawing" of the table row or cells automatically: you just pass the parameters to it and it does the drawing itself.

This amazingly decreases the page's size and makes it load very fast, and since there's no browser today that doesn't support JavaScript, it's a sure go!

Here's an example of what I'm talking about:

-- Original Page --



Assuming this page has 200 table rows like that, that's at least 138 characters per row, which means 27,600 bytes (or 27.6KB!!) with only the table's layout and WITHOUT any information!

So, instead of doing this, you can convert this page into a very nice and quickly loading page using a simple JavaScript function that draws the table for you, saving you tons of typing and speeding up your loading considerably.

Here's how:

-- Optimized Page: --



As you can see, we created a JavaScript function called dr(), which is short for "draw", which does the actual drawing of the table row's HTML portion. You just pass the parameters for ID, Name and Address to it, and it does the drawing for you. By using this approach, we have now turned those 138 bytes into 24 bytes without information, which means that on a page with 200 rows like that, the table layout will now take 4,800 bytes, or 4.8KB (plus the one-time cost of the function itself). That's quite a difference!!

Unoptimized = 27.6KB
Optimized = 4.8KB

That means we reduced the page's size by roughly 83%!
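If you want to redo this arithmetic for your own pages, here is a quick check in Python. The 138 and 24 bytes per row and the 200 rows are the figures from the text; I'm assuming 1KB = 1000 bytes and ignoring the one-time size of the dr() function itself.

```python
ROWS = 200
BYTES_PER_ROW_HTML = 138  # layout-only markup per row, from the text
BYTES_PER_ROW_JS = 24     # per-row overhead of a dr(...) call, from the text

def page_bytes(rows, bytes_per_row):
    """Total layout overhead for a table with the given number of rows."""
    return rows * bytes_per_row

before = page_bytes(ROWS, BYTES_PER_ROW_HTML)
after = page_bytes(ROWS, BYTES_PER_ROW_JS)
saving_pct = round(100 * (before - after) / before, 1)

print(before / 1000, "KB before")
print(after / 1000, "KB after")
print(saving_pct, "% smaller")
```

With these inputs the layout overhead goes from 27,600 bytes to 4,800 bytes, a saving of about 82.6%.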

And that was a very short demonstration.
I have seen webpages where every table row takes 2KB, so imagine a page like that containing 100 rows - it takes 200KB to send !!!

And if you can reduce a page like that to even 10KB, that makes a huge difference in loading speed.

Ponder that for a while, and feel free to leave your comments, feedback and questions.