PURVEYOR OF FINE WORDS

January 14, 2007

Concise Adblock Filter Set Explained

Adblock is the single most useful Firefox plugin available today. Just like watching sitcoms with automatic commercial-skip, adblock’s banner ad suppression system elicits a smug sense of satisfaction even after browsing through your 10,000th ad-free web page. However, a huge barrier to adoption seems to be the lack of a default filter set: when you first install adblock, nothing happens.

The main issue is that adblock has no intelligence about the content that is included with a webpage; it is just a generic regex-based filter system, so it is only as effective as the filters that you provide. There are plenty of pre-made lists available, but they tend to be overly aggressive in what they suppress, resulting in occasional broken pages and/or pages that dead-end because adblock has removed the “Next” button. The most dangerous public set seems to be the EasyList, which has a 360+ item block list. Evidence that the creators know of its greedy nature is their inclusion of a 20+ item whitelist to manually compensate for what was initially blocked. Even more unstable is the EasyElement list, which searches through the DOM to remove suspected elements directly from the main document, using a list of 570+ substrings to search for.

Instead of using such a large, reactive list of simple and site-specific string matches that tries to suppress 100% of ads, I posit that you only need 2 adblock filters to eliminate 70-80% of ads while still being confident that legitimate content isn’t being flagged as a false positive. By getting into the heads of HTML writers, we can pick out the most common patterns used to include ads and write regex patterns that suppress them.

  1. /(\b|_)ad(x|s?)(\b|_)/
    This regex looks for any element that contains the string ‘ad’, ‘ads’, or ‘adx’ bounded on both sides by a word boundary or an underscore, because the vast majority of web sites partition their ads into a single directory or serve them through a single script. The boundary check is crucial to this filter because just searching for the characters ‘ad’ would match far too much legitimate content. With the restriction in place, adblock will suppress elements that contain strings like ‘ads.server.com’ or ‘www.server.com/ads/’ or ‘server.com/ad_server.php’, but not ‘adobe.com’ or ‘server.com/adjustment’.
  2. /ad.*\d+[xX]\d+/
    This regex exploits the common habit among ad designers of putting the image dimensions in the element name, e.g., “server.com/newads.php?location=top&size=468x80”. As with the previous rule, we don’t exclude every element that has dimensions, but qualify the match by requiring the string ‘ad’ as well.
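
If you want to check the two filters against the example strings above without installing anything, a minimal Python sketch will do. It assumes that adblock applies each filter as an unanchored search over the element’s address, and that Python’s `re` module behaves like the browser’s regex engine for these particular patterns; the function name `is_blocked` is made up for illustration.

```python
import re

# The two filters from the list above, written as Python regexes.
# Assumption: a filter is applied as an unanchored search over the
# element's address, so re.search() is used rather than re.match().
WORD_AD = re.compile(r"(\b|_)ad(x|s?)(\b|_)")
AD_WITH_SIZE = re.compile(r"ad.*\d+[xX]\d+")


def is_blocked(url: str) -> bool:
    """Return True if either filter matches anywhere in the URL."""
    return bool(WORD_AD.search(url) or AD_WITH_SIZE.search(url))


# Example strings taken from the descriptions of the two filters.
samples = [
    "ads.server.com",
    "www.server.com/ads/",
    "server.com/ad_server.php",
    "adobe.com",
    "server.com/adjustment",
    "server.com/newads.php?location=top&size=468x80",
]

for url in samples:
    verdict = "BLOCK" if is_blocked(url) else "allow"
    print(f"{verdict:5}  {url}")
```

The ‘newads.php’ address slips past the first filter, since ‘ad’ sits in the middle of a word there, and is only caught by the dimension rule.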

At this point, your browsing experience will be significantly improved, but you can bump up your block rate to about 80-90% with a few more simple substring matches. There are many well-known ad providers that exist solely to deliver ads, so we can consolidate them into composite filter rules:

  1. /a(2\.yimg|dserv|dvert|tdmt|twola)/
    This rule collects all the ad serving systems that start with ‘a’: Yahoo, Atlas, AOLTimeWarner, and generic ad serving systems.
  2. /b(anners|logads)/
    falkag.net

    These pick up anything labeled with ‘banners’, the ‘blogads’ network, and Falk AdSolutions (falkag.net).
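
The provider rules can be tried out the same way. This is only a sketch: every address below is a hypothetical example invented for the test, the helper `matches_provider` is not part of adblock, and ‘falkag.net’ is treated as a plain substring match.

```python
import re

# Composite provider filters from the list above.
PROVIDERS_A = re.compile(r"a(2\.yimg|dserv|dvert|tdmt|twola)")
BANNERS_B = re.compile(r"b(anners|logads)")
FALK = "falkag.net"  # assumption: handled as a plain substring filter


def matches_provider(url: str) -> bool:
    """Return True if any of the three provider filters hits the URL."""
    return bool(
        PROVIDERS_A.search(url)
        or BANNERS_B.search(url)
        or FALK in url
    )


# Hypothetical addresses, made up purely to exercise each alternative.
samples = [
    "us.a2.yimg.com/promo.gif",
    "adserver.example.com/show.js",
    "view.atdmt.com/img",
    "ar.atwola.com/image",
    "example.com/banners/top.gif",
    "example.com/blogads/strip.js",
    "a.falkag.net/tag.js",
    "example.com/articles/browser-tips.html",
]

for url in samples:
    verdict = "BLOCK" if matches_provider(url) else "allow"
    print(f"{verdict:5}  {url}")
```

Only the last address, an ordinary article page, gets through.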

Realistically, reducing the ad load by 90% should be more than sufficient for anyone. Chasing that last 10% — and whitelisting the collateral damage — will always be a losing battle. Your time is better used reading the content that is on the page you requested in the first place.



3 Comments »

  1. Funny how you measure the “dangerousness” of a list by the number of filters. Any reason why you would say that http://adblock.free.fr/adblock.txt is less dangerous? It has less filters but those filters are so complex that hardly anybody can tell what they block.

    Easylist goes by the recommendations for Adblock Plus - use specific filters and avoid regular expressions. This allows the filter list to be processed very fast. And the whitelisting entries are mostly due to the fact that some sites started to serve regular content through known advertising sites.

    Have a look at the Filterset.G whitelist (http://pierceive.com/filtersetg/whitelist-beta/) - now that’s scary…

    Comment by Wladimir Palant — January 15, 2007 @ 3:14 pm

  2. “Dangerous”? That’s a bit harsh don’t you think?

    The EasyList is fairly aggressive because users want it that way … and “yes” there will be an occasional ‘burp’ in what it blocks. But considering the amount of users the EasyList has, problems have been very minimal at best and false-positives have been addressed NOT mainly through whitelisting, but rather through a rewrite of the filtering strings.

    You wrote:
    “Evidence that the creators know of its greedy nature is their inclusion of a 20+ item whitelist to manually compensate what was initially blocked.”

    The irony of your statement is that your first proposed filter string:
    /(\b|_)ad(x|s?)(\b|_)/
    …is EXACTLY why about 80% of those whitelist strings exist. Most of the whitelistings are for video players served thru an “ad” string. The whitelists allow the player to function correctly on some very MAJOR sites without having to remove the broader generic filter strings like */ads/* or *//ads.*. And I don’t have to whitelist the ENTIRE page. The EasyList works quite well this way. Try watching news video at FoxNews, MSN, Forbes, etc with just that one filter string that you proposed …. you will have a whitelist larger than mine with just that one string.

    You wrote:
    “… resulting in occasional broken pages and/or pages that dead-end because adblock has removed the “Next” button.”

    I don’t know what ‘next’ button you are talking about. Things like this could occasionally happen, but I currently have no reports of any big problem with things like that … and if someone did have a problem, I would hope that they would bring it to my attention. These things are usually fixed as fast as I can fix them when they occur. Trying to keep pages free of ads without interrupting a user’s surfing experience is no small task … but I love doing it and devote a lot of my time to it. :-)

    ps: Adblock Plus does NOT have a problem with large filter lists as long as they follow the simple expression ‘shortcut’ rules. So using string totals is irrelevant to ABP’s operation. I increased the filter size because it does not take any noticeable performance hit.

    Sincerely:
    rick752 - ABP EasyList/EasyElement author.

    Comment by rick752 — January 15, 2007 @ 5:18 pm

  3. I’m not sure who you’re writing for. Are you targeting end-users or filter subscription maintainers? If the former, then they’re not spending time figuring out the best way to create filters. If the latter, then EasyList is the way it is because it’s optimized for the way Adblock Plus works. You can read all about it on adblockplus.org where excellent documentation is maintained and Wladimir explains in his blog which filters work best. I think you’re still used to Adblock’s filter style where regular expressions are preferred and people try to cram as many rules as they can in one expression. That is bad form for Adblock Plus because those types of filters are slower and make debugging harder.

    Comment by Stupid Head — April 20, 2007 @ 1:36 pm
