squid

Fun with Squid and CDNs

One neat upgrade in Debian's recent 5.0.0 release1 was Squid 2.7. In this bandwidth-starved corner of the world, a caching proxy is a nice addition to a network, as it should shave at least 10% off your monthly bandwidth usage. However, the recent rise of CDNs has made many objects that should be highly cacheable, un-cacheable.

For example, a YouTube video has a static ID. The same piece of video will always have the same ID, it'll never be replaced by anything else (except a "sorry this is no longer available" notice). But it's served from one of many delivery servers. If I watch it once, it may come from

http://v3.cache.googlevideo.com/videoplayback?id=0123456789abcdef&itag=34&ip=1.2.3.4&region=0&signature=5B1BA40D8464F2303DDDD59B2586C10A0AEFAD19.169DA15A09AB88E824DE63DF138F0D835295463B&sver=2&expire=1234714137&key=yt1&ipbits=0

But the next time it may come from v15.cache.googlevideo.com. And that's not all, the signature parameter is unique (to protect against hot-linking) as well as other not-static parameters. Basically, any proxy will probably refuse to cache it (because of all the parameters) and if it did, it'd be a waste of space because the signature would ensure that no one would ever access that cached item again.

I came across a page on the squid wiki that addresses a solution to this. Squid 2.7 introduces the concept of a storeurl_rewrite_program which gets a chance to rewrite any URL before storing / accessing an item in the cache. Thus we could rewrite our example file to

http://cdn.googlevideo.com.SQUIDINTERNAL/videoplayback?id=0123456789abcdef&itag=34

We've normalised the URL and kept the only two parameters that matter, the video id and the itag which specifies the video quality level.

The squid wiki page I mentioned includes a sample perl script to perform this rewrite. They don't include the itag, and my perl isn't good enough to fix that without making a dog's breakfast of it, so I re-wrote it in Python. You can find it at the end of this post. Each line the rewrite program reads contains a concurrency ID, the URL to be rewritten, and some parameters. We output the concurrency ID and the URL to rewrite to.

The concurrency ID is a way to use a single script to process rewrites from different squid threads in parallel. The documentation is this is almost non-existant, but if you specify a non-zero storeurl_rewrite_concurrency each request and response will be prepended with a numeric ID. The perl script concatenated this directly before the re-written URL, but I separate them with a space. Both seem to work. (Bad documentation sucks)

All that's left is to tell Squid to use this, and to override the caching rules on these URLs.

storeurl_rewrite_program /usr/local/bin/storeurl-youtube.py
storeurl_rewrite_children 1
storeurl_rewrite_concurrency 10

#  The keyword for all youtube video files are "get_video?", "videodownload?" and "videoplaybeck?id"
#  The "\.(jp(e?g|e|2)|gif|png|tiff?|bmp|ico|flv)\?" is only for pictures and other videos
acl store_rewrite_list urlpath_regex \/(get_video\?|videodownload\?|videoplayback\?id) \.(jp(e?g|e|2)|gif|png|tiff?|bmp|ico|flv)\? \/ads\?
acl store_rewrite_list_web url_regex ^http:\/\/([A-Za-z-]+[0-9]+)*\.[A-Za-z]*\.[A-Za-z]*
acl store_rewrite_list_path urlpath_regex \.(jp(e?g|e|2)|gif|png|tiff?|bmp|ico|flv)$
acl store_rewrite_list_web_CDN url_regex ^http:\/\/[a-z]+[0-9]\.google\.com doubleclick\.net

# Rewrite youtube URLs
storeurl_access allow store_rewrite_list
# this is not related to youtube video its only for CDN pictures
storeurl_access allow store_rewrite_list_web_CDN
storeurl_access allow store_rewrite_list_web store_rewrite_list_path
storeurl_access deny all

# Default refresh_patterns
refresh_pattern ^ftp:           1440    20%     10080
refresh_pattern ^gopher:        1440    0%      1440
refresh_pattern -i (/cgi-bin/|\?) 0     0%      0

# Updates (unrelated to this post, but useful settings to have):
refresh_pattern windowsupdate.com/.*\.(cab|exe)(\?|$) 518400 100% 518400 reload-into-ims
refresh_pattern update.microsoft.com/.*\.(cab|exe)(\?|$) 518400 100% 518400 reload-into-ims
refresh_pattern download.microsoft.com/.*\.(cab|exe)(\?|$) 518400 100% 518400 reload-into-ims
refresh_pattern (Release|Package(.gz)*)$        0       20%     2880
refresh_pattern \.deb$         518400   100%    518400 override-expire

# Youtube:
refresh_pattern -i (get_video\?|videodownload\?|videoplayback\?) 161280 50000% 525948 override-expire ignore-reload
# Other long-lived items
refresh_pattern -i \.(jp(e?g|e|2)|gif|png|tiff?|bmp|ico|flv)(\?|$) 161280 3000% 525948 override-expire reload-into-ims

refresh_pattern .               0       20%     4320

# All of the above can cause a redirect loop when the server
# doesn't send a "Cache-control: no-cache" header with a 302 redirect.
# This is a work-around.
minimum_object_size 512 bytes

Done. And it seems to be working relatively well. If only I'd set this up last year when I had pesky house-mates watching youtube all day ;-)

It should of course be noted that doing this instructs your Squid Proxy to break rules. Both override-expire and ignore-reload violate guarantees that the HTTP standards provide the browser and web-server about their communication with each other. They are relatively benign changes, but illegal nonetheless.

And it goes without saying that rewriting the URLs of stored objects could cause some major breakage by assuming that different objects (with different URLs) are the same. The provided regexes seem sane enough to not assume that this won't happen, but YMMV.

#!/usr/bin/env python
# vim:et:ts=4:sw=4:

import re
import sys
import urlparse

youtube_getvid_res = [
    re.compile(r"^http:\/\/([A-Za-z]*?)-(.*?)\.(.*)\.youtube\.com\/get_video\?video_id=(.*?)&(.*?)$"),
    re.compile(r"^http:\/\/(.*?)\/get_video\?video_id=(.*?)&(.*?)$"),
    re.compile(r"^http:\/\/(.*?)video_id=(.*?)&(.*?)$"),
]

youtube_playback_re = re.compile(r"^http:\/\/(.*?)\/videoplayback\?id=(.*?)&(.*?)$")

others = [
    (re.compile(r"^http:\/\/(.*?)\/(ads)\?(?:.*?)$"), "http://%s/%s"),
    (re.compile(r"^http:\/\/(?:.*?)\.yimg\.com\/(?:.*?)\.yimg\.com\/(.*?)\?(?:.*?)$"), "http://cdn.yimg.com/%s"),
    (re.compile(r"^http:\/\/(?:(?:[A-Za-z]+[0-9-.]+)*?)\.(.*?)\.(.*?)\/(.*?)\.(.*?)\?(?:.*?)$"), "http://cdn.%s.%s.SQUIDINTERNAL/%s.%s"),
    (re.compile(r"^http:\/\/(?:(?:[A-Za-z]+[0-9-.]+)*?)\.(.*?)\.(.*?)\/(.*?)\.(.{3,5})$"), "http://cdn.%s.%s.SQUIDINTERNAL/%s.%s"),
    (re.compile(r"^http:\/\/(?:(?:[A-Za-z]+[0-9-.]+)*?)\.(.*?)\.(.*?)\/(.*?)$"), "http://cdn.%s.%s.SQUIDINTERNAL/%s"),
    (re.compile(r"^http:\/\/(.*?)\/(.*?)\.(jp(?:e?g|e|2)|gif|png|tiff?|bmp|ico|flv)\?(?:.*?)$"), "http://%s/%s.%s"),
    (re.compile(r"^http:\/\/(.*?)\/(.*?)\;(?:.*?)$"), "http://%s/%s"),
]

def parse_params(url):
    "Convert a URL's set of GET parameters into a dictionary"
    params = {}
    for param in urlparse.urlsplit(url)[3].split("&"):
        if "=" in param:
            n, p = param.split("=", 1)
            params[n] = p
    return params

while True:
    line = sys.stdin.readline()
    if line == "":
        break
    try:
        channel, url, other = line.split(" ", 2)
        matched = False

        for re in youtube_getvid_res:
            if re.match(url):
                params = parse_params(url)
                if "fmt" in params:
                    print channel, "http://video-srv.youtube.com.SQUIDINTERNAL/get_video?video_id=%s&fmt=%s" % (params["video_id"], params["fmt"])
                else:
                    print channel, "http://video-srv.youtube.com.SQUIDINTERNAL/get_video?video_id=%s" % params["video_id"]
                matched = True
                break

        if not matched and youtube_playback_re.match(url):
            params = parse_params(url)
            if "itag" in params:
                print channel, "http://video-srv.youtube.com.SQUIDINTERNAL/videoplayback?id=%s&itag=%s" % (params["id"], params["itag"])
            else:
                print channel, "http://video-srv.youtube.com.SQUIDINTERNAL/videoplayback?id=%s" % params["id"]
            matched = True

        if not matched:
            for re, pattern in others:
                m = re.match(url)
                if m:
                    print channel, pattern % m.groups()
                    matched = True
                    break

        if not matched:
            print channel, url

    except Exception:
        # For Debugging only. In production we want this to never die.
        #raise
        print line

    sys.stdout.flush()

  1. Yes, Vhata, Debian released in 2009, I won the bet, you owe me a dinner now. 

Easy home transparent proxy

Everyone in South Africa wants to save a little more bandwidth, as low traffic caps are the rule of the day (esp if you are hanging off an expensive 3G connection).

While the "correct" thing to do is to use wpad autodetection, and thus politely request that users use your proxy, this isn't always an option:

  • Firefox doesn't Autodetect Proxies by default
  • Autodetection doesn't behave well for many roaming users (firefox should talk to network-manager)
  • Many programs simply don't support wpad.
  • Your upstream ISP transparently proxies anyway (the norm in ZA), so it's not like we have any end-to-endness to protect.

So, here's how you do it:

  1. Lets assume your network is 10.1.1.0/24, and the squid box is 10.1.1.1 on eth0
  2. Install squid (aptitude install squid), configure it to have a reasonably large storage pool, give it some sane ACLs, etc.
  3. Add http_port 8080 transparent to squid.conf(or http_port 10.1.1.1:8080 transparent if you are using explicit http_port options)
  4. invoke-rc.d squid reload
  5. Add the following to your iptables script:
iptables -t nat -A PREROUTING -i eth0 -s 10.1.1.0/24 -d ! 10.20.1.1 -p tcp --dport 80 -j REDIRECT --to 8080

If you run squid on your network's default gateway, then you are done. Otherwise, if you have a separate router, you need to do the following on the router:

  1. Add a new transprox table to /etc/iproute2/rt_tables, i.e. 1 transprox
  2. Pick a new netfilter MARK value, i.e. 0x04
  3. Add the following to the router's iptables script:
# Transparent proxy
iptables -t mangle -F PREROUTING
iptables -t mangle -A PREROUTING -i br-lan -s ! 10.1.1.1 -d ! 10.1.1.0/24 -p tcp --dport 80 -j MARK --set-mark 0x04
ip route del table transprox
ip route add default via 10.1.1.1 table transprox
ip rule del table transprox
ip rule add fwmark 0x04 pref 10 table transprox
  1. Done: test and tail your squid logs

The reason we use iproute rules rather than iptables DNAT is that you lose destination-IP information with a DNAT (like the envelope of an e-mail).

An alternative solution is to run tinyproxy on the router (with the transparent option, enabled in ubuntu but not debian), use the REDIRECT rule above on the router, to redirect to the tinyproxy, and have that upstream to the squid. But tinyproxy requires some RAM, and on a WRT54 or the likes, you don't have any of that to spare...

Should you need to temporarily disable this for any reason:

  • With all-in-one-router: iptables -t nat -F PREROUTING
  • With the separate router: iptables -t mangle -F PREROUTING
Syndicate content