Update 7/23/04 1:25 PM
== Useless Content Follows ==
First of all, you should know that WGet's "recursive" switch will not help you. As far as I can see, it simply doesn't recognize the intra-site URL convention that Taco et al. have devised. Use Firefox's "view source" and you'll see what I'm referring to. Apparently, they do this in order to keep you within your "topic" when you click on a link. For those of you that always browse under yro.slashdot.org: blah, whatever. So anyway, maybe I'm doing something wrong, but WGet no likey.
Easy workaround: take advantage of WGet's -i filename switch, which instructs WGet to obtain the list of URLs to download from filename.
Start by obtaining a list of all your journal URLs from the "All Journal Entries" page:
and put them in a simple text file. How you harvest the URLs is up to you. I used a Visual Basic subroutine Sub GrabLinks(strURL) that I wrote last year for work; it uses the WebBrowser.InternetExplorer and MSHTML.HTMLDocument ActiveX objects. God, how I have grown to hate Visual Basic.
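Whatever tool you use, the resulting file is just one URL per line. Something like this (the entry numbers here are placeholders, not real ones):

```
http://slashdot.org/~yourname/journal/81990
http://slashdot.org/~yourname/journal/82337
http://slashdot.org/~yourname/journal/83125
```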
Actually, I bet you could set up a script that checks your journal index for new entries, calls WGet to download the new journals, and creates an updated local index.html page for your archive.
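Here's a rough shell sketch of that idea. I'm assuming entry links look like /~yourname/journal/NNNNN - check your own page's source for the real pattern - and the sample HTML and entry numbers below are made up so the sketch runs standalone:

```shell
# Build journals.txt from a saved copy of the "All Journal Entries" page.
# This heredoc stands in for the real saved page:
cat > index.html <<'EOF'
<a href="//slashdot.org/~yourname/journal/12345">entry one</a>
<a href="//slashdot.org/~yourname/journal/12399">entry two</a>
<a href="//slashdot.org/~yourname/journal/12345">entry one again</a>
EOF
# Extract the journal links, de-duplicate, and make them absolute URLs:
grep -o '/~yourname/journal/[0-9]*' index.html \
  | sort -u \
  | sed 's|^|http://slashdot.org|' > journals.txt
cat journals.txt
```

From there, diffing the extracted list against the files you've already got (comm -23 against a sorted "have" list) would give you only the new entries to feed to WGet.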
Now set up your wgetrc with:
robots=off # uncouth but required!
span_hosts=off # just to be safe
We're almost ready to go. Navigate to the folder where you want your journals to reside and execute the following WGet command:
wget -p --convert-links -Pjournal \
    --html-extension -i journals.txt \
    -nH -nd -w 2
And watch the magic happen! After about 15 minutes I had all 521 (!) of my journal entries on my computer...and I didn't even get banned!
(Man...521 entries over 2 years...that's an entry every 1.4 days! I need a hobby...)
If you really want to know what the switches mean, see this reference page. Notes: put the full path to journals.txt if it's not in your current working directory. -Pjournal has WGet create a sub-directory "journal," in which it places the downloaded files.
There are still some problems - I tried to include my cookies file so it would see me as logged in, but slashdot is apparently smarter than that and it didn't work. This means that my comments viewing preferences didn't go through. I got "logged-out" pages.
However, if comments are your thing, you can pretty easily get all of those by using Slashdot's standard querystring interface. Fortunately, all journal discussions can be accessed on their own page - i.e.
so just get a list of all the discussions you want to archive, figure out the options you want in the query string, and WGet away!
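For example, a little loop like this builds the URL list - the sids below are invented, and the threshold/mode options are just the ones I happen to like, so swap in your own:

```shell
# Turn a list of discussion ids (sids) into comment-page URLs
# suitable for WGet's -i switch. These sids are made up.
for sid in 109196 110234; do
  echo "http://slashdot.org/comments.pl?sid=$sid&threshold=1&mode=nested"
done > discussions.txt
cat discussions.txt
```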
Also, the journal.com?op=list page doesn't create relative links...however, parsing the file and making your own handy index page without the distracting slashdot stuff shouldn't be so hard.
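A minimal sketch of such an index page, assuming the entries landed in ./journal as NNNNN.html (per -Pjournal and --html-extension above; the two filenames here are stand-ins):

```shell
# Create a couple of stand-in downloads so the sketch runs on its own:
mkdir -p journal
touch journal/12345.html journal/12399.html
# Emit a bare-bones index linking to every downloaded entry
# (the [0-9]* glob keeps index.html itself out of the list):
{
  echo '<html><body><h1>Journal archive</h1><ul>'
  for f in journal/[0-9]*.html; do
    n=$(basename "$f")
    echo "<li><a href=\"$n\">$n</a></li>"
  done
  echo '</ul></body></html>'
} > journal/index.html
cat journal/index.html
```

Parsing the real entry titles out of each file and using those as link text would be the obvious next step.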
What would really be cool: writing a perl script that parses the actual content out of the journal downloads. I'm pretty sure there are standard delimiters for the start and end of each journal entry...
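As a sed sketch of the idea - the BEGIN/END markers below are hypothetical placeholders, since I haven't verified what delimiters Slashdot actually emits; view the source of one download and substitute the real ones:

```shell
# Stand-in journal download with made-up delimiter comments:
cat > 12345.html <<'EOF'
<html><body>
<!-- BEGIN JOURNAL -->
Today I archived my journal.
<!-- END JOURNAL -->
</body></html>
EOF
# Print only the lines between the (assumed) start and end markers:
sed -n '/<!-- BEGIN JOURNAL -->/,/<!-- END JOURNAL -->/p' 12345.html
```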
And there's more... if you can't get WGet's cookies feature to trick Slashdot into thinking you're a logged-in user, then you don't have your date formatting preferences... and for some reason Taco et al. default to an MMM DD format (???) - that means NO YEAR. Fortunately, the journals themselves are conveniently numbered, so you could write a perl script that processes each of the files, adding the year to the journal's "Posted on" date. Of course, that's going to be one ugly-ass regex...
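A sed sketch of the fix - the "Posted on MMM DD" wording is my guess at the page text, and the year is hardcoded here, whereas a real script would derive it from the entry-number ranges:

```shell
# Stand-in line in the year-less format described above:
echo 'Posted on Jul 23, @01:25PM' > entry.txt
# Tack the (assumed) year onto the month-day date:
sed 's/Posted on \([A-Z][a-z][a-z] [0-9][0-9]*\)/Posted on \1 2004/' entry.txt
```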
I can really see now why so many people like perl..."it makes the hard jobs possible."