Spelunking in M&G's Zapiro Archive

Those who follow me will know that I used to maintain a web frontend to the Mail & Guardian Online Zapiro archive.

M&G used to have a rather crufty website. Subscriber-only content was trivial to access (for non-subscribers), URLs were ugly, and dinosaurs roamed in the far corners of the site. It had RSS feeds, but not an RSS feed for the Zapiro archive (or any specific-interest RSS feeds for that matter). I don’t check websites; I read RSS feeds.

Me being a young geek with a little too much spare time, I put together zapiro.rivera.za.net, as a ~200-line PHP script (with no SQL DB) that was really nice to use (in my books) and gave me a Zapiro RSS feed.

When they noticed, the powers that be at M&G weren’t too impressed with it, because it deprived them of eyeballs (and hot-linked their Zapiro images). However, I felt satisfied that I was merely providing fair-use access to their content and allowing people to follow it who wouldn’t have been able to otherwise. The site never got much traffic, so thus far it’s not been a serious problem.

Around June this year, M&G redesigned their website, and I don’t think I even noticed (did I say something about them not having decent feeds?). This redesign broke the machinery in zapiro.rivera.za.net but I didn’t notice that because Zapiro had taken a sabbatical earlier this year, and was going weeks without posting cartoons.

Enough back-story. Point is, I took a look at the new M&G Zapiro Archive this evening and was shocked. Before I go into all my problems with it, let me just disclaim that they are rather nit-picky, but if these problems weren’t there the site would be a hell of a lot more usable:

  • There are still no useful RSS feeds. There is a rather terse selection of general feeds.
  • The Archive menu only goes back to 2001. M&G has Zapiro cartoons going back to 1999.
  • Archive menu URLs are in /Month/Year format. Did anyone even think about URL scheme when they were designing?
  • Their tagging feature, while using multi-select widgets, only allows single tags to be selected (oh, and it requires JavaScript).
  • Each cartoon has two URLs. OK, I guess they weren’t thinking about URL scheme.
      1. Today’s cartoon has the /zapiro/all/ URL. Yesterday’s is /zapiro/all/1, etc., going back to the beginning of time (currently residing at /zapiro/all/1870). Way to go with permalinks, guys. Oh, and did you notice that they are all titled “Latest Zapiro”?
      2. Clicking on the “Comments” link or using the “Archive” menu below takes you to something like /zapiro/fullcartoon/1. Oh, except 1 gives us a non-existent cartoon at the beginning of the Unix epoch. But take a closer look: it has tags associated. Can anyone say WTF?

The insanity continues: 2 gives us a cartoon from September 1999. 3-25 are more non-existent wonders, and then things go backwards in time until 36 which jumps us to June 3 2008. (Hmm, I think that may have been around the M&G redesign launch date.)

We move forward in time until 40, when we start moving backwards from May 2008, through many seas of well-tagged gaps, to … well somewhere. (OK, so I got bored and didn’t manually crawl 2000 pages, but would you?) Some cartoons are in totally the wrong position, and we randomly move backwards and forwards and sideways.

Finally things settle down, and we go forwards again (with gaps of course) from 2054 to today’s cartoon at 2101, a fine Zapiro specimen if ever I saw one.

Why was I doing all this mind-numbing crawling, you ask? Well, I wanted to know if I could do anything to make my Zapiro scraper work again. The answer? Not simply. They don’t have any sensible way to locate the cartoon from a specific day, short of crawling the entire archive and recording the URLs found. I don’t think there is any logic to this LSD-induced URL scheme.
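
For the curious, here is roughly what "crawl the whole thing and record what you find" could look like. It is only a sketch in Python (not the PHP the old scraper was written in), and it makes assumptions: fetch_date is a hypothetical callback that downloads /zapiro/fullcartoon/<n> and returns the cartoon's date, or None for one of the empty pages. How to pull a date out of M&G's HTML is not something I have worked out, so that part is left to the caller.

```python
from datetime import date
from typing import Callable, Dict, Iterable, Optional


def build_date_index(
    ids: Iterable[int],
    fetch_date: Callable[[int], Optional[date]],
) -> Dict[date, int]:
    """Map each publication date to the cartoon ID found for it.

    fetch_date is a hypothetical callback: it should fetch page `n`
    and return the cartoon's date, or None if the page is empty.
    """
    index: Dict[date, int] = {}
    for n in ids:
        published = fetch_date(n)
        if published is None:
            continue  # one of the "non-existent" gaps
        # IDs are not in chronological order, so we cannot stop early
        # or binary-search; keep the first ID seen for each date.
        index.setdefault(published, n)
    return index


def permalink(index: Dict[date, int], day: date) -> Optional[str]:
    """Return the fullcartoon path for a given day, if we crawled one."""
    n = index.get(day)
    return None if n is None else f"/zapiro/fullcartoon/{n}"
```

With a stub in place of the real fetcher (the September date below is made up for illustration; only the June one comes from my crawl), it is used like this:

```python
stub = {2: date(1999, 9, 1), 3: None, 36: date(2008, 6, 3)}
index = build_date_index([1, 2, 3, 36], stub.get)
print(permalink(index, date(2008, 6, 3)))
```

The point is that the lookup table has to be built by brute force, one request per ID, because nothing in the URL tells you the date.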

URL schemes matter. This seems to be something that the big guns haven’t noticed. I don’t think it’s a coincidence that the most expensive CMSs out there have the worst URLs, whereas WordPress and Drupal (with pathauto) encourage sensible URLs and are Open Source.

Sure, most users don’t change what they see in the address bar, but if people are going to link into your site, you should provide nice permalinks. Then, if you want anyone to build anything on top of your site (where anyone includes yourself), it would really help if you had a sane URL scheme. Finally, it gives you geek-cred. :-)
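
To make "sane" concrete, here is a minimal sketch of the kind of scheme I have in mind, assuming one cartoon per day. The /zapiro/YYYY/MM/DD/ layout and the function names are my own invention, not anything M&G uses. The useful property is that the URL can be computed from the date and the date recovered from the URL, with no crawling and no lookup table.

```python
import re
from datetime import date
from typing import Optional

# Hypothetical date-based permalink layout: /zapiro/YYYY/MM/DD/
_CARTOON_PATH = re.compile(r"^/zapiro/(\d{4})/(\d{2})/(\d{2})/$")


def cartoon_url(day: date) -> str:
    """Build the permalink path for the cartoon published on `day`."""
    return f"/zapiro/{day:%Y/%m/%d}/"


def parse_cartoon_url(path: str) -> Optional[date]:
    """Recover the date from a permalink path, or None if it is not one."""
    match = _CARTOON_PATH.match(path)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        return None  # looks like a date but is not one, e.g. month 13
```

A scraper (or the site's own archive menu) can then jump straight to any day with cartoon_url(date(2008, 6, 3)), and a path like /zapiro/all/1870 is simply rejected by parse_cartoon_url.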

While I think of a better way to get my scraper working again, Happy Spelunking!

Comments

Mister! Mister! The web site has bitted me!

This post is better late than never. The M&G redesign was much, much needed but I feel it failed to deliver in several key areas. One of them was certainly the Zapiro archives. But still. Remember what it used to look like and tell me things haven’t progressed.

On the case

Hi Stefano

You’re quite right. When the new website went up, a lot of attention was paid to sensible URLs and consistency, not to mention a great new visual design. Somehow the Zapiro section just didn’t get the love it deserved during the mammoth process. The M&G technical team is aware of the problems with the Zapiro section, and we will sort it out. So before you redesign your scraper, wait a bit to see what we do with that section. I’ll make sure that you get a decent RSS feed for your script.

Kind regards
Jason Norwood-Young
Technical Manager
M&G Online

re: On the case

Awesome, thanks.

Still made for a fun blog post :-)

Other niggles

While we are on the topic, I have another M&G niggle. As a FLOSS user, I live without Flash plugins. (Actually I have it installed on some machines, but on Debian/amd64 it breaks regularly, so this applies there too.)

When I visit M&G articles I get a JavaScript pop-up telling me to install Flash. While I use NoScript on some machines, I don’t use it everywhere.

Can you set a cookie that says “we’ve asked him about Flash” or even just not do the pop-up?

BTW: The URLs in the rest of your site are very pretty :-)

Dilbert inspiration

The format of zapiro.rivera.za.net reminds me of www.dilbert.com/fast, which incidentally is referred to as the “Linux/Unix” section by the Dilbert site. Maybe they mean that the URL format gels better with the sensibilities of “Linux/Unix” fans, because presumably the kernel you use doesn’t have that much to do with the web pages you can access…?

re: Dilbert inspiration

I’d never seen that before.

But: dilbert.com (the non-fast page) still has perfectly nice URLs
