One neat upgrade in Debian's recent 5.0.0 release was Squid 2.7. In this bandwidth-starved corner of the world, a caching proxy is a nice addition to a network, as it should shave at least 10% off your monthly bandwidth usage. However, the recent rise of CDNs has made many objects that should be highly cacheable effectively un-cacheable.
For example, a YouTube video has a static ID. The same piece of video will always have the same ID; it'll never be replaced by anything else (except a "sorry this is no longer available" notice). But it's served from one of many delivery servers. If I watch it once, it may come from
But the next time it may come from
v15.cache.googlevideo.com. And that's not all: the signature parameter is unique (to protect against hot-linking), and there are other non-static parameters as well.
Basically, any proxy will probably refuse to cache it (because of all the parameters) and if it did, it'd be a waste of space because the signature would ensure that no one would ever access that cached item again.
I came across a page on the Squid wiki that describes a solution to this.
Squid 2.7 introduces the concept of a storeurl_rewrite_program, which gets a chance to rewrite any URL before storing / accessing an item in the cache. Thus we could rewrite our example file to
We've normalised the URL and kept the only two parameters that matter: the video ID and the itag, which specifies the video quality level.
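To make the idea concrete, here's a minimal Python sketch of that kind of normalisation. It is not the script from this post. The canonical URL template (including the .SQUIDINTERNAL host), the two example request URLs, the assumption that every delivery server lives under googlevideo.com, and the assumption that the parameters are literally named id and itag are all placeholders chosen for illustration.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical canonical form: any stable URL that is never fetched would do.
CANONICAL = "http://video-srv.youtube.com.SQUIDINTERNAL/get_video?id={id}&itag={itag}"


def normalise(url):
    """Return a stable store URL for video requests, or the URL unchanged."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return url  # malformed URL: leave it alone
    host = parts.hostname or ""
    # Assumption: every delivery server lives under googlevideo.com.
    if not host.endswith(".googlevideo.com"):
        return url
    query = parse_qs(parts.query)
    # Assumption: the two parameters that matter are named "id" and "itag".
    if "id" not in query or "itag" not in query:
        return url
    return CANONICAL.format(id=query["id"][0], itag=query["itag"][0])


# Two hypothetical requests for the same video from different servers,
# with different signatures and a different parameter order.
first = "http://v3.cache.googlevideo.com/videoplayback?id=abc123&itag=34&signature=AAA"
second = "http://v15.cache.googlevideo.com/videoplayback?itag=34&signature=BBB&id=abc123"
print(normalise(first) == normalise(second))
```

Anything that doesn't look like a video request is passed through untouched, which keeps the blast radius small if the guesses about the URL layout turn out to be wrong.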
The Squid wiki page I mentioned includes a sample Perl script to perform this rewrite. They don't include the itag, and my Perl isn't good enough to fix that without making a dog's breakfast of it, so I rewrote it in Python. You can find it at the end of this post. Each line the rewrite program reads contains a concurrency ID, the URL to be rewritten, and some parameters. We output the concurrency ID and the URL to rewrite to.
The concurrency ID is a way to use a single script to process rewrites from different Squid threads in parallel. The documentation on this is almost non-existent, but if you specify a non-zero storeurl_rewrite_concurrency, each request and response will be prepended with a numeric ID. The Perl script concatenated this directly before the rewritten URL, but I separate them with a space. Both seem to work. (Bad documentation sucks.)
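The line protocol described above can be sketched roughly like this. It assumes concurrency is enabled (so the first field on each line is the ID and the second is the URL) and it uses the space-separated reply. The names handle_line, serve and the rewrite callback are invented for this example, and the behaviour for malformed lines is a guess rather than anything documented.

```python
import sys


def handle_line(line, rewrite):
    """Turn one request line into one reply line (without the newline)."""
    fields = line.split()
    if len(fields) < 2:
        # Not "ID URL ...": echo it back rather than guessing (assumption).
        return line.strip()
    channel_id, url = fields[0], fields[1]
    # Reply with the same ID so the answer can be matched to its request.
    return channel_id + " " + rewrite(url)


def serve(rewrite, stdin=sys.stdin, stdout=sys.stdout):
    """Answer requests until stdin is closed."""
    for line in stdin:
        stdout.write(handle_line(line, rewrite) + "\n")
        # Flush after every reply; a buffered answer never reaches the proxy.
        stdout.flush()
```

Calling serve() with a URL-rewriting function at the bottom of the script would turn it into a helper. URLs that shouldn't change are simply returned as they came in.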
All that's left is to tell Squid to use this, and to override the caching rules on these URLs.
Done. And it seems to be working relatively well. If only I'd set this up last year when I had pesky house-mates watching YouTube all day ;-)
It should of course be noted that doing this instructs your Squid proxy to break rules. Options such as ignore-reload violate guarantees that the HTTP standards provide the browser and web server about their communication with each other.
They are relatively benign changes, but illegal nonetheless.
And it goes without saying that rewriting the URLs of stored objects could cause some major breakage by assuming that different objects (with different URLs) are the same. The provided regexes seem sane enough that this shouldn't happen, but YMMV.
Everyone in South Africa wants to save a little more bandwidth, as low traffic caps are the rule of the day (especially if you are hanging off an expensive 3G connection).
While the "correct" thing to do is to use wpad autodetection, and thus politely request that users use your proxy, this isn't always an option:
So, here's how you do it:
Install Squid (aptitude install squid), configure it to have a reasonably large storage pool, give it some sane ACLs, etc.
http_port 8080 transparent to
http_port 10.1.1.1:8080 transparent if you are using explicit
invoke-rc.d squid reload
If you run squid on your network's default gateway, then you are done. Otherwise, if you have a separate router, you need to do the following on the router:
The reason we use iproute rules rather than iptables DNAT is that you lose destination-IP information with a DNAT (like the envelope of an e-mail).
An alternative solution is to run tinyproxy on the router (with the transparent option, which is enabled in Ubuntu but not Debian), use the REDIRECT rule above on the router to redirect to tinyproxy, and have that upstream to the Squid. But tinyproxy requires some RAM, and on a WRT54 or the like, you don't have any of that to spare...
Should you need to temporarily disable this for any reason:
iptables -t nat -F PREROUTING
iptables -t mangle -F PREROUTING