Fun with Squid and CDNs

One neat upgrade in Debian's recent 5.0.0 release[1] was Squid 2.7. In this bandwidth-starved corner of the world, a caching proxy is a nice addition to a network, as it should shave at least 10% off your monthly bandwidth usage. However, the recent rise of CDNs has made many objects that should be highly cacheable effectively un-cacheable.

For example, a YouTube video has a static ID. The same piece of video will always have the same ID; it'll never be replaced by anything else (except a "sorry this is no longer available" notice). But it's served from one of many delivery servers. If I watch it once, it may come from

http://v3.cache.googlevideo.com/videoplayback?id=0123456789abcdef&itag=34&ip=1.2.3.4&region=0&signature=5B1BA40D8464F2303DDDD59B2586C10A0AEFAD19.169DA15A09AB88E824DE63DF138F0D835295463B&sver=2&expire=1234714137&key=yt1&ipbits=0

But the next time it may come from v15.cache.googlevideo.com. And that's not all: the signature parameter is unique (to protect against hot-linking), and several other parameters aren't static either. Basically, any proxy will probably refuse to cache it (because of all the parameters), and if it did, it'd be a waste of space, because the signature would ensure that no one would ever access that cached item again.

I came across a page on the Squid wiki that describes a solution to this. Squid 2.7 introduces the concept of a storeurl_rewrite_program, which gets a chance to rewrite any URL before storing / accessing an item in the cache. Thus we could rewrite our example file to

http://cdn.googlevideo.com.SQUIDINTERNAL/videoplayback?id=0123456789abcdef&itag=34

We've normalised the URL and kept the only two parameters that matter: the video id and the itag, which specifies the video quality level.
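
As a rough illustration of that normalisation, here is a minimal Python 3 sketch (the script at the end of this post targets Python 2). The function name normalise_playback_url is made up for this example, and the store host simply reuses the cdn.googlevideo.com.SQUIDINTERNAL name from above. It drops every parameter except id and itag.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical helper for illustration; not part of Squid or the wiki script.
STORE_HOST = "cdn.googlevideo.com.SQUIDINTERNAL"

def normalise_playback_url(url):
    """Return a canonical store URL keeping only the id and itag parameters."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    if parts.path != "/videoplayback" or "id" not in params:
        return url  # not a video URL we recognise: leave it untouched
    store_url = "http://%s/videoplayback?id=%s" % (STORE_HOST, params["id"][0])
    if "itag" in params:
        store_url += "&itag=%s" % params["itag"][0]
    return store_url

# Two invented requests for the same video from different delivery servers.
a = normalise_playback_url(
    "http://v3.cache.googlevideo.com/videoplayback"
    "?id=0123456789abcdef&itag=34&ip=1.2.3.4&signature=AAAA&expire=1234714137")
b = normalise_playback_url(
    "http://v15.cache.googlevideo.com/videoplayback"
    "?id=0123456789abcdef&itag=34&ip=5.6.7.8&signature=BBBB&expire=1234999999")
print(a)
print(a == b)
```

Both requests end up with the same store URL, which is what lets the second viewer hit the cache.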

The Squid wiki page I mentioned includes a sample Perl script to perform this rewrite. It doesn't include the itag, and my Perl isn't good enough to fix that without making a dog's breakfast of it, so I rewrote it in Python. You can find it at the end of this post. Each line the rewrite program reads contains a concurrency ID, the URL to be rewritten, and some parameters. We output the concurrency ID and the URL to rewrite to.

The concurrency ID is a way to use a single script to process rewrites from different squid threads in parallel. The documentation on this is almost non-existent, but if you specify a non-zero storeurl_rewrite_concurrency, each request and response will be prepended with a numeric ID. The Perl script concatenated this directly before the rewritten URL, but I separate them with a space. Both seem to work. (Bad documentation sucks.)
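
To make the line format concrete, here is a small Python 3 sketch of parsing one request line and building the reply when concurrency is enabled. The sample line and its trailing fields are invented for the example, and the helper names are hypothetical. Since the extra fields are poorly documented, the code only relies on the first two.

```python
def answer(line, rewrite):
    """Build the reply for one helper request line.

    `line` is assumed to look like "<id> <url> <extra fields...>";
    `rewrite` is any function mapping a URL to its store URL.
    """
    fields = line.rstrip("\n").split(" ", 2)
    if len(fields) < 2:
        return line.rstrip("\n")  # malformed input: echo it back unchanged
    channel, url = fields[0], fields[1]
    # The ID is echoed back so Squid can pair the reply with its request.
    return "%s %s" % (channel, rewrite(url))

def strip_query(url):
    # Toy rewrite used only for this demonstration.
    return url.split("?", 1)[0]

# Invented sample request on channel 7.
print(answer("7 http://example.com/a.png?x=1 10.0.0.1/- - GET\n", strip_query))
```

Keeping the parsing separate from the rewrite rules makes it easy to test the rules without a running Squid.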

All that's left is to tell Squid to use this, and to override the caching rules on these URLs.

storeurl_rewrite_program /usr/local/bin/storeurl-youtube.py
storeurl_rewrite_children 1
storeurl_rewrite_concurrency 10

#  The keywords for all youtube video files are "get_video?", "videodownload?" and "videoplayback?id"
#  The "\.(jp(e?g|e|2)|gif|png|tiff?|bmp|ico|flv)\?" is only for pictures and other videos
acl store_rewrite_list urlpath_regex \/(get_video\?|videodownload\?|videoplayback\?id) \.(jp(e?g|e|2)|gif|png|tiff?|bmp|ico|flv)\? \/ads\?
acl store_rewrite_list_web url_regex ^http:\/\/([A-Za-z-]+[0-9]+)*\.[A-Za-z]*\.[A-Za-z]*
acl store_rewrite_list_path urlpath_regex \.(jp(e?g|e|2)|gif|png|tiff?|bmp|ico|flv)$
acl store_rewrite_list_web_CDN url_regex ^http:\/\/[a-z]+[0-9]\.google\.com doubleclick\.net

# Rewrite youtube URLs
storeurl_access allow store_rewrite_list
# This is not related to youtube video; it's only for CDN pictures
storeurl_access allow store_rewrite_list_web_CDN
storeurl_access allow store_rewrite_list_web store_rewrite_list_path
storeurl_access deny all

# Default refresh_patterns
refresh_pattern ^ftp:           1440    20%     10080
refresh_pattern ^gopher:        1440    0%      1440
refresh_pattern -i (/cgi-bin/|\?) 0     0%      0

# Updates (unrelated to this post, but useful settings to have):
refresh_pattern windowsupdate.com/.*\.(cab|exe)(\?|$) 518400 100% 518400 reload-into-ims
refresh_pattern update.microsoft.com/.*\.(cab|exe)(\?|$) 518400 100% 518400 reload-into-ims
refresh_pattern download.microsoft.com/.*\.(cab|exe)(\?|$) 518400 100% 518400 reload-into-ims
refresh_pattern (Release|Package(.gz)*)$        0       20%     2880
refresh_pattern \.deb$         518400   100%    518400 override-expire

# Youtube:
refresh_pattern -i (get_video\?|videodownload\?|videoplayback\?) 161280 50000% 525948 override-expire ignore-reload
# Other long-lived items
refresh_pattern -i \.(jp(e?g|e|2)|gif|png|tiff?|bmp|ico|flv)(\?|$) 161280 3000% 525948 override-expire reload-into-ims

refresh_pattern .               0       20%     4320

# All of the above can cause a redirect loop when the server
# doesn't send a "Cache-control: no-cache" header with a 302 redirect.
# This is a work-around.
minimum_object_size 512 bytes
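
A note on the numbers in the refresh_pattern lines above: the min and max columns are in minutes, which makes the larger values hard to read at a glance. A quick Python 3 conversion (plain arithmetic, nothing Squid-specific; the function name is just for this example) gives a feel for them:

```python
MINUTES_PER_DAY = 24 * 60

def minutes_to_days(minutes):
    """Convert a refresh_pattern min/max value (minutes) to days."""
    return minutes / MINUTES_PER_DAY

for value in (1440, 4320, 10080, 161280, 518400, 525948):
    print("%7d minutes = %.2f days" % (value, minutes_to_days(value)))
```

So 161280 is 112 days, 518400 is 360 days, and 525948 is roughly one year.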

Done. And it seems to be working relatively well. If only I'd set this up last year when I had pesky house-mates watching youtube all day ;-)

It should of course be noted that doing this instructs your Squid proxy to break rules. Both override-expire and ignore-reload violate guarantees that the HTTP standards give the browser and web server about their communication with each other. They are relatively benign changes, but violations of the standard nonetheless.

And it goes without saying that rewriting the URLs of stored objects could cause some major breakage, by assuming that different objects (with different URLs) are the same. The provided regexes seem sane enough that this shouldn't happen, but YMMV.
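
As a hypothetical illustration of that risk, this Python 3 sketch applies one of the generic CDN patterns from the script at the end of this post to two invented URLs that differ in server name and query string. If the v parameter actually selected different content, the cache would serve the wrong object.

```python
import re

# One of the generic CDN patterns from the script at the end of this post.
cdn_re = re.compile(
    r"^http:\/\/(?:(?:[A-Za-z]+[0-9-.]+)*?)\.(.*?)\.(.*?)\/(.*?)\.(.*?)\?(?:.*?)$")
template = "http://cdn.%s.%s.SQUIDINTERNAL/%s.%s"

def store_url(url):
    """Return the store URL for `url`, or `url` itself if the pattern misses."""
    m = cdn_re.match(url)
    return template % m.groups() if m else url

# Invented URLs: same path, different servers and "version" parameters.
old = store_url("http://img1.static.example.com/banner.png?v=1")
new = store_url("http://img2.static.example.com/banner.png?v=2")
print(old)
print(old == new)  # both requests would share one cache entry
```

Running candidate URLs from your own access.log through the patterns like this is a cheap way to spot unwanted collisions before enabling the rewrite.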

#!/usr/bin/env python
# vim:et:ts=4:sw=4:

import re
import sys
import urlparse

youtube_getvid_res = [
    re.compile(r"^http:\/\/([A-Za-z]*?)-(.*?)\.(.*)\.youtube\.com\/get_video\?video_id=(.*?)&(.*?)$"),
    re.compile(r"^http:\/\/(.*?)\/get_video\?video_id=(.*?)&(.*?)$"),
    re.compile(r"^http:\/\/(.*?)video_id=(.*?)&(.*?)$"),
]

youtube_playback_re = re.compile(r"^http:\/\/(.*?)\/videoplayback\?id=(.*?)&(.*?)$")

others = [
    (re.compile(r"^http:\/\/(.*?)\/(ads)\?(?:.*?)$"), "http://%s/%s"),
    (re.compile(r"^http:\/\/(?:.*?)\.yimg\.com\/(?:.*?)\.yimg\.com\/(.*?)\?(?:.*?)$"), "http://cdn.yimg.com/%s"),
    (re.compile(r"^http:\/\/(?:(?:[A-Za-z]+[0-9-.]+)*?)\.(.*?)\.(.*?)\/(.*?)\.(.*?)\?(?:.*?)$"), "http://cdn.%s.%s.SQUIDINTERNAL/%s.%s"),
    (re.compile(r"^http:\/\/(?:(?:[A-Za-z]+[0-9-.]+)*?)\.(.*?)\.(.*?)\/(.*?)\.(.{3,5})$"), "http://cdn.%s.%s.SQUIDINTERNAL/%s.%s"),
    (re.compile(r"^http:\/\/(?:(?:[A-Za-z]+[0-9-.]+)*?)\.(.*?)\.(.*?)\/(.*?)$"), "http://cdn.%s.%s.SQUIDINTERNAL/%s"),
    (re.compile(r"^http:\/\/(.*?)\/(.*?)\.(jp(?:e?g|e|2)|gif|png|tiff?|bmp|ico|flv)\?(?:.*?)$"), "http://%s/%s.%s"),
    (re.compile(r"^http:\/\/(.*?)\/(.*?)\;(?:.*?)$"), "http://%s/%s"),
]

def parse_params(url):
    "Convert a URL's set of GET parameters into a dictionary"
    params = {}
    for param in urlparse.urlsplit(url)[3].split("&"):
        if "=" in param:
            n, p = param.split("=", 1)
            params[n] = p
    return params

while True:
    line = sys.stdin.readline()
    if line == "":
        break
    try:
        channel, url, other = line.split(" ", 2)
        matched = False

        # Use a name other than "re" so the re module isn't shadowed.
        for regex in youtube_getvid_res:
            if regex.match(url):
                params = parse_params(url)
                if "fmt" in params:
                    print channel, "http://video-srv.youtube.com.SQUIDINTERNAL/get_video?video_id=%s&fmt=%s" % (params["video_id"], params["fmt"])
                else:
                    print channel, "http://video-srv.youtube.com.SQUIDINTERNAL/get_video?video_id=%s" % params["video_id"]
                matched = True
                break

        if not matched and youtube_playback_re.match(url):
            params = parse_params(url)
            if "itag" in params:
                print channel, "http://video-srv.youtube.com.SQUIDINTERNAL/videoplayback?id=%s&itag=%s" % (params["id"], params["itag"])
            else:
                print channel, "http://video-srv.youtube.com.SQUIDINTERNAL/videoplayback?id=%s" % params["id"]
            matched = True

        if not matched:
            # Use a name other than "re" so the re module isn't shadowed.
            for regex, pattern in others:
                m = regex.match(url)
                if m:
                    print channel, pattern % m.groups()
                    matched = True
                    break

        if not matched:
            print channel, url

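    except Exception:
        # For Debugging only. In production we want this to never die.
        #raise
        # Echo the request back without its trailing newline, so that
        # exactly one reply line is written per request.
        print line.rstrip("\n")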

    sys.stdout.flush()

  1. Yes, Vhata, Debian released in 2009, I won the bet, you owe me a dinner now. 

Comments

Nice work

The thing that bothers me about youtube, more than that it reloads videos on every visit (although that is stupid), is that it reloads videos every time you push the play button. This has happened to me: "Hey, come look at this cool video I just laboriously buffered then watched!" "Oh, OK, let's just wait 20 minutes for it to buffer again"

Of course, this seems like it solves both problems.

Upgrading quality

It seems this will always ensure you get the quality you asked for. But it seems to me that you don't mind, as long as you get *at least* the quality you ask for. But this can't be done by canonicalizing URLs. Can it?

Interesting thought

No, I don't know how best to handle that, without knowing exactly what's cached or only downloading specific qualities.

YouTube video keyword

Things have changed this time: videoplayback\?id -> videoplayback.*id

YouTube video quality

Up to now I don't see the difference between high quality and normal quality (although both are still low quality) in the URL. Maybe they call it high quality because it's the original file uploaded by the user. By the way HOME is true high quality I've ever seen on YouTube.

By the way, what's the good thing about Python vs Perl? I'm only concerned about speed.
I only know assembly. I'm a beginner with these languages. That's why I'm toying with Squid.

minimum_object_size 512 bytes
You'll lose all small content.

mem object in hit has mis-matched

I have tried your post, and I found these in /var/log/squid/cache.log

================================================
2009/07/01 02:27:02| clientCacheHit: request has store_url 'http://cdn.static.zynga.com.SQUIDINTERNAL/zbar-new/images/trans/yoville-mini-button.png'; mem object in hit has mis-matched url 'http://zbar2.static.zynga.com/zbar-new/images/trans/yoville-mini-button.png?x=10001'!
2009/07/01 02:27:02| clientCacheHit: request has store_url 'http://cdn.static.zynga.com.SQUIDINTERNAL/zbar-new/images/trans/streetracing-mini-button.png'; mem object in hit has mis-matched url 'http://zbar2.static.zynga.com/zbar-new/images/trans/streetracing-mini-button.png?x=10001'!
2009/07/01 02:27:02| clientCacheHit: request has store_url 'http://cdn.static.zynga.com.SQUIDINTERNAL/zbar-new/images/trans/pirates-mini-button.png'; mem object in hit has mis-matched url 'http://zbar2.static.zynga.com/zbar-new/images/trans/pirates-mini-button.png?x=10001'!
2009/07/01 02:27:02| clientCacheHit: request has store_url 'http://cdn.static.zynga.com.SQUIDINTERNAL/zbar-new/images/trans/vampires-mini-button.png'; mem object in hit has mis-matched url 'http://zbar2.static.zynga.com/zbar-new/images/trans/vampires-mini-button.png?x=10001'!
2009/07/01 02:27:02| clientCacheHit: request has store_url 'http://cdn.static.zynga.com.SQUIDINTERNAL/zbar-new/images/trans/fashionwars-mini-button.png'; mem object in hit has mis-matched url 'http://zbar2.static.zynga.com/zbar-new/images/trans/fashionwars-mini-button.png?x=10001'!
2009/07/01 02:27:02| clientCacheHit: request has store_url 'http://cdn.static.zynga.com.SQUIDINTERNAL/zbar-new/templates/facebook_dropdown_allgames/images/bttn_moreGames.png'; mem object in hit has mis-matched url 'http://zbar2.static.zynga.com/zbar-new/templates/facebook_dropdown_allgames/images/bttn_moreGames.png?x=10001'!
2009/07/01 03:28:33| clientCacheHit: request has store_url 'http://cdn.static.zynga.com.SQUIDINTERNAL/zbar-new/images/icons/mafiawars-mini-button.jpg'; mem object in hit has mis-matched url 'http://zbar2.static.zynga.com/zbar-new/images/icons/mafiawars-mini-button.jpg?x=10001'!
2009/07/01 03:28:35| clientCacheHit: request has store_url 'http://cdn.static.zynga.com.SQUIDINTERNAL/zbar-new/images/trans/mafiawars-mini-button.png'; mem object in hit has mis-matched url 'http://zbar2.static.zynga.com/zbar-new/images/trans/mafiawars-mini-button.png?x=10001'!
2009/07/01 04:18:30| storeClientReadHeader: URL mismatch
2009/07/01 04:18:30| {http://static.ak.fbcdn.net/images/icons/fbpage.gif} != {http://static.ak.fbcdn.net/images/icons/fbpage.gif?8:67540}
===============================================

What should I do?

Seems like a documented but

Seems like a documented but unloved bug.
"storeurl_rewrite mismatched when object stored on memory"
http://bugs.squid-cache.org/show_bug.cgi?id=2248

Repo proxy cache

Have you come across squirm (http://squirm.foote.com.au/)? It was written for redirect_program (now url_rewrite_program) but could easily be customised for storeurl_rewrite_program.

It's a shame they haven't yet ported the storeurl_rewrite feature to the 3.x release, as this would make it really easy to create a repo cache for CentOS or any other distribution that uses a mirror list. Yum first fetches a list of repo mirrors, then chooses one of them to fetch the packages from. Using the storeurl_rewrite_program directive you could then normalise the store URL for use between all of the returned mirrors.