Journal cyranoVR's Journal: mini-HOWTO: Archiving your /. user journal using WGet 8

Journal by cyranoVR on Friday July 23, 2004 @12:24AM

Update 7/23/04 1:25 PM

= = Useless Content Follows = =

First of all, you should know that WGet's "recursive" switch will not help you. As far as I can see, it simply doesn't recognize the intra-site URL convention that Taco et al. have devised - namely

<A HREF="//slashdot.org/journal.pl?op=display">

Use Firefox's "view source" and you'll see what I'm referring to. Apparently, they do this in order to keep you within your "topic" when you click on a link, for those of you that always browse under yro.slashdot.org. Blah, whatever. So anyway, maybe I'm doing something wrong, but WGet no likey.
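For the curious, here is a minimal sketch of what that scheme-less link is supposed to resolve to. It uses Python's standard urllib.parse.urljoin. The page URL is the journal list address from this entry with its placeholder uid, so treat it as an illustration and not as anything WGet itself does.

```python
from urllib.parse import urljoin

# Page the link was found on (placeholder uid, as in the list URL in this entry).
page_url = "http://slashdot.org/journal.pl?op=list&uid=your_uid_here"

# The scheme-less href exactly as it appears in the page source.
href = "//slashdot.org/journal.pl?op=display"

# urljoin takes the host and path from the href and borrows only the scheme from the page.
absolute = urljoin(page_url, href)
print(absolute)
```

A browser does the same thing: the link inherits http from whatever page you are on and gets everything else from the href.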

Easy workaround: take advantage of WGet's -i filename switch, which instructs WGet to obtain the list of URLs to download from filename.

Start by obtaining a list of all your journal URLs from the "All Journal Entries" page:

http://slashdot.org/journal.pl?op=list&uid=your_uid_here

and put them in a simple text file. How you harvest the URLs is up to you. I used a Visual Basic subroutine Sub GrabLinks(strURL) that I wrote last year for work. It utilizes WebBrowser.InternetExplorer and MSHTML.HTMLDocument ActiveX objects. God, how I have grown to hate Visual Basic.

Next week I am going to figure out how to do the same in Perl using LWP::Simple, HTML::Parser and some clever regular expressions. I bet it could be done in something like 5-7 lines.
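In the meantime, here is a rough sketch of the harvesting step in Python (not the Perl version promised above, and not the VB routine either). It uses only the standard html.parser and urllib.parse modules. The names LinkCollector and harvest_links, the sample hrefs and the "/journal/" filter are all made up for illustration; check your own list page's source to see what the real entry links look like.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against base_url."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <A HREF=...> matches too.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def harvest_links(html, base_url, must_contain):
    """Return absolute links containing must_contain, in page order, without duplicates."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    seen = set()
    result = []
    for link in parser.links:
        if must_contain in link and link not in seen:
            seen.add(link)
            result.append(link)
    return result


# Tiny stand-in for the "All Journal Entries" page; these hrefs are hypothetical.
sample = """
<A HREF="//slashdot.org/~your_nick/journal/101">first entry</A>
<a href="//slashdot.org/faq/">FAQ</a>
<A HREF="//slashdot.org/~your_nick/journal/102">second entry</A>
"""
urls = harvest_links(sample, "http://slashdot.org/", "/journal/")
print("\n".join(urls))
```

Writing the returned list to journals.txt, one URL per line, gives you the file that the -i switch expects.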

Actually, I bet you could set up a script that checks your journal index for new entries, calls WGet to download the new journals, and creates an updated local index.html page for your archive.
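A minimal sketch of the "only fetch what's new" part of that idea, in Python. It assumes you already have the URLs from the journal index and the URLs fetched on earlier runs as two plain lists. The function names, the example.org URLs and the new_journals.txt file name are hypothetical, and the WGet call is only built here, not executed.

```python
def select_new(index_urls, archived_urls):
    """Return URLs from the index that are not archived yet, keeping index order."""
    archived = set(archived_urls)
    return [url for url in index_urls if url not in archived]


def wget_command(list_path):
    """Build (but do not run) a wget call that reads its URLs from list_path."""
    # -i reads the URL list from a file; -w 2 waits two seconds between requests.
    return ["wget", "-i", list_path, "-w", "2"]


# Made-up URLs standing in for a harvested index and a previous run.
index_urls = [
    "http://example.org/journal/1",
    "http://example.org/journal/2",
    "http://example.org/journal/3",
]
archived_urls = ["http://example.org/journal/1"]

new_urls = select_new(index_urls, archived_urls)
print(new_urls)
print(" ".join(wget_command("new_journals.txt")))
```

To actually run it you would write new_urls to the list file and hand the command to subprocess.run, then append the new URLs to your record of archived ones.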

Anyways...

Now set up .wgetrc , the WGet configuration file, which is located in your home directory. If it doesn't exist, create it. Mine looks like this (minus the perl-style comments):

robots=off # uncouth but required!
span_hosts=off # just to be safe
cookies=on
load_cookies=~/.mozilla/default/wierd-ass-profile-name/cookies.txt

We're almost ready to go. Navigate to the folder where you want your journals to reside and execute the following WGet command:

wget -p --convert-links -Pjournal --html-extension -i journals.txt -nH -nd -w 2

And watch the magic happen! After about 15 minutes I had all 521 (!) of my journal entries on my computer...and I didn't even get banned!

(Man...521 entries over 2 years...that's an entry every 1.4 days! I need a hobby...)

If you really want to know what the switches mean, see this reference page. Notes: put the full path to journals.txt if it's not in your current working directory. -Pjournal has WGet create a sub-directory "journal," in which it places the downloaded files.

There are still some problems - I tried to include my cookies file so it would see me as logged in, but slashdot is apparently smarter than that and it didn't work. This means that my comments viewing preferences didn't go through. I got "logged-out" pages.

However, if comments are your thing, you can pretty easily get all of those by using Slashdot's standard query string interface. Fortunately, all journal discussions can be accessed on their own page - e.g.

comments.pl?sid=115456&threshold=3&mode=flat&commentsort=4&op=Change

so just get a list of all the discussions you want to archive, figure out the options you want in the query string, and WGet away!
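Here is a small sketch of building those discussion URLs in Python with the standard urllib.parse.urlencode. The parameter names and values are the ones from the example query string above. The discussion_url helper is a made-up name, and putting comments.pl at the root of http://slashdot.org is my assumption.

```python
from urllib.parse import urlencode


def discussion_url(sid, threshold=3, mode="flat", commentsort=4):
    """Build a comments.pl URL for one discussion id (sid)."""
    query = urlencode({
        "sid": sid,
        "threshold": threshold,
        "mode": mode,
        "commentsort": commentsort,
        "op": "Change",
    })
    # Assumption: comments.pl lives at the site root.
    return "http://slashdot.org/comments.pl?" + query


print(discussion_url(115456))
```

Loop over your list of discussion ids, write one URL per line to a text file, and feed that file to WGet with -i just like the journal list.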

Also, the journal.pl?op=list page doesn't create relative links...however, parsing the file and making your own handy index page without the distracting slashdot stuff shouldn't be so hard.
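As a rough idea of what that index page could look like, here is a Python sketch that turns a list of (local file name, title) pairs into a bare-bones HTML list. The build_index name, the file names and the titles are invented for the example. Getting the real titles out of the downloaded pages is a separate job that this does not attempt.

```python
from html import escape


def build_index(entries):
    """Return a minimal HTML page linking to each (filename, title) pair."""
    items = "\n".join(
        '<li><a href="{}">{}</a></li>'.format(escape(name, quote=True), escape(title))
        for name, title in entries
    )
    return "<html><body>\n<ul>\n" + items + "\n</ul>\n</body></html>\n"


# Hypothetical downloaded files and their titles.
entries = [
    ("101.html", "First & foremost"),
    ("102.html", "mini-HOWTO"),
]
page = build_index(entries)
print(page)
```

Save the result as index.html in the same directory as the downloaded files and the relative links will just work.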

What would really be cool: writing a perl script that parses the actual content out of the journal downloads. I'm pretty sure there are standard delimiters for the start and end of each journal entry...

And there's more...if you can't get WGet's cookies feature to trick slashdot into thinking you're a logged-in user, then you don't have your date formatting preferences...and for some reason Taco et al. default to MMM DD format (???), which means NO YEAR. Fortunately, the journals themselves are conveniently numbered, so you could write a perl script that processes each of the files, adding the year to the journal "Posted on" date. Of course, that's going to be one ugly-ass regex...

I can really see now why so many people like perl..."it makes the hard jobs possible."

MORE FUN READING
All about SQL injection exploits.
CERT: Protecting web forms from cross-site scripting and code injection.

and

USA TODAY: Ambush TV - behind the scenes at "fake" news shows like Crossballs and The Daily Show with Jon Stewart

This discussion was created by cyranoVR (518628) for Friends and Friends of Friends only, but now has been archived. No new comments can be posted.

Very nice (Score:2)

by Safety Cap ( 253500 ) writes:

Unfortunately if you don't subscribe, you can't get that list of your earlier JEs without a whole lotta paging...
- Re:Very nice (Score:2)
  
  by cyranoVR ( 518628 ) * writes:
  
  Not true. You can see a list of ALL of ANY user's Journal entries without a subscription. Just log out and try it [slashdot.org].
  - OMG!!!111 LOL!!!111 WTF!!!11 (Score:2)
    
    by Safety Cap ( 253500 ) writes:
    
    You are the Shizat, man. I humbly beg forgiveness :)
d00d u r teh 1337 (Score:2)

by the_mad_poster ( 640772 ) writes:

Heh *ahem* sorry... I slipped into a bout of stupid there for a minute :p

Just chop out all the extraneous bullshit from my script (hint: using HTML::TokeParser is a lot easier than using HTML::Parser) and you could probably shrink the thing down from 500 lines to almost nothing (feeping creaturism strikes again...). Getting the journals is easy, turning them back into something useful isn't hard, but it's not quite as easy. :\
- Re:d00d u r teh 1337 (Score:2)
  
  by cyranoVR ( 518628 ) * writes:
  
  Thanks for the tip on HTML::TokeParser...but what do you mean by "my script?" What script?
  - Re:d00d u r teh 1337 (Score:2)
    
    by the_mad_poster ( 640772 ) writes:
    
    Clicky link [slashdot.org].
    - Re:d00d u r teh 1337 (Score:2)
      
      by cyranoVR ( 518628 ) * writes:
      
      Well, well...great minds think alike I guess. I will take a look this weekend.
    - Re:d00d u r teh 1337 (Score:2)
      
      by cyranoVR ( 518628 ) * writes:
      
      Well, I decided what the hell and tried it
      
      perl getslash.pl -NcyranoVR -aC:\journal
      
      Well look at that, it's working...
      
      Hooray. Once again I've wasted my time.
