Raw Thought

by Aaron Swartz

Some Announcements

My new Python web application library, web.py, is out.

An essay of mine about productivity has made the rounds. I’m not sure it’s finished, but lots of people have already read it and loved it, so I thought I might as well tell you about it.

On a similar note, over the winter break I wrote a program called arcget to download web sites from the Internet Archive’s Wayback Machine. There are still a few more changes I’d like to make, but I’m unlikely to get to them anytime soon.

The reason I built arcget was so I could put up the Lingua Franca archive. Lingua Franca was a fantastic magazine about academic life that ran for several years before going bust in the recession. Their website, with many of their fantastic articles, sadly disappeared with it (and thus from Google).

Now you can read such classics as Oh My Darwin!, A Most Dangerous Method, and Who Owns The Sixties?.

Of course, these and many others are gathered in the wonderful collection Quick Studies which, along with Boob Jubilee, is some of the most enjoyable and edifying material out there, combining dazzlingly intriguing writing with meaty subjects.

Anyway, that should keep you busy for a while.


January 5, 2006

Comments

Note that Lingua Franca’s editor, Alexander Star, later went on to improve the Boston Globe’s weekly “Ideas” section. As you’d expect, there’s the same pronounced emphasis on academic controversies.

posted by Mike Sierra on January 5, 2006 #

I find playing classical music helps me be productive.

posted by joe on January 5, 2006 #

You might also be interested in Warrick:

“Warrick is a command-line utility for reconstructing or recovering a website that has been lost due to a hard drive crash, fire, failed backup, etc. Warrick will search the Internet Archive, Google, MSN, and Yahoo for stored pages and images and will save them to your filesystem.”

http://www.cs.odu.edu/~fmccown/research/lazy/warrick.html

I haven’t tried it and it’s “not yet available for download” but looks like it might be available on request from the author.

— Regarding arcget’s policy of retrieving the oldest version of any page, “because generally the newest version is whatever lame site has replaced the site you want to archive”:

If you know a threshold date for which you don’t want any later captures, you can use the Wayback Machine’s ranged datespecs to limit the results you see. So, for example, the last version of the linguafranca.com homepage in the IA that looks like a normal active publication appears to be as of 20020207200457:

http://web.archive.org/web/20020207200457/http://www.linguafranca.com/index.html

You could then limit any search results page to include only results before that date with a URL construction like:

http://web.archive.org/web/0-20020207200457*/linguafranca.com
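The two URL shapes above can be sketched in a few lines of Python. This is only an illustration of the string construction; the function names are made up for the example, the "0" lower bound is simply copied from the URL above, and nothing here checks whether the Wayback Machine still honors these forms.

```python
WAYBACK = "http://web.archive.org/web"

def capture_url(timestamp, url):
    """URL for one capture of `url` at a 14-digit timestamp (YYYYMMDDhhmmss)."""
    return f"{WAYBACK}/{timestamp}/{url}"

def listing_url(site, end, start="0"):
    """URL for a results page limited to captures between `start` and `end`.

    The default start of "0" mirrors the example URL; it is an assumption
    that it means "from the earliest capture".
    """
    return f"{WAYBACK}/{start}-{end}*/{site}"

# Rebuild the two URLs quoted in the comment.
print(capture_url("20020207200457", "http://www.linguafranca.com/index.html"))
print(listing_url("linguafranca.com", "20020207200457"))
```

A tool like arcget could presumably use a listing URL of this kind to skip captures made after a site was replaced, rather than always taking the oldest version.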

Finally, also of potential interest for any such tools: there’s an undocumented method of getting (pseudo-)XML lists of URL captures from the Wayback Machine. Append “xm_” to the datespec portion of the retrieval URL. So the above would become:

http://web.archive.org/web/0-20020207200457*xm_/linguafranca.com

(I say pseudo-XML because there’s no attempt to properly escape reported URLs, so you can get illegal XML from this interface.)
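To make the escaping problem concrete, here is a small Python sketch. The "xm_" URL is built exactly as described above. The sample captured URL and the <url> element name are hypothetical, since the real layout of the listing isn't described here. The point is only that a bare "&" inside a URL makes the text ill-formed XML unless it is escaped first.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def xml_listing_url(site, end, start="0"):
    """Listing URL with the undocumented "xm_" flag appended to the datespec."""
    return f"http://web.archive.org/web/{start}-{end}*xm_/{site}"

def is_well_formed(text):
    """Return True if `text` parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical capture URL with a query string; "&" is reserved in XML.
sample_url = "http://www.linguafranca.com/index.html?a=1&b=2"

raw = f"<url>{sample_url}</url>"            # unescaped, as the interface reports it
fixed = f"<url>{escape(sample_url)}</url>"  # "&" becomes "&amp;"

print(xml_listing_url("linguafranca.com", "20020207200457"))
print(is_well_formed(raw), is_well_formed(fixed))
```

A client that wants to consume the listing with a strict XML parser would need to repair the text along these lines first, or fall back to a more forgiving parser.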

Caveat: this underdocumented/undertested/unsupported avenue could have other problems or go away unexpectedly. Ideally it would be replaced with an OpenSearch API-compliant interface, with some domain-specific special operators, but no such change is imminent.

  • Gordon @ IA

posted by Gordon Mohr on January 9, 2006 #

Powered by theinfo.org.