Posts for the month of August 2007

Offline Wikipedia

As seen on a million and one websites (/. et al.), a smart geek put together an offline Wikipedia setup. I had some problems on my work laptop (Fedora Core 7), partly because of the OS (the PHP executable isn't named what the scripts expect) and partly because of how I have it partitioned (the data lives on a FAT32 partition). Anyway, here's my email to the original poster (wiki-fied):

  1. Thanks, you're a geek hero. ;)
  2. I had a few problems with relative paths; I had to edit the places that pointed at the quickstart* executables.
  3. Fedora Core 7's "php5" executable is actually just named "php". No big deal; "ln -s /usr/bin/php /usr/bin/php5" takes care of it.
  4. My machine is dual-boot, and the only partition big enough was FAT32. I had some problems with the sheer number of split files, so I threw together a Perl script (I had done the split by hand before downloading the Makefile ;) ). It's pasted below.
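
A rough back-of-the-envelope check of the "too many split files" theory. This is only a sketch: the two constants are the commonly documented VFAT/FAT32 figures (13 characters per long-filename entry, 65,536 32-byte entries per directory) and do not come from the email, and `entries_per_file` is a hypothetical helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed VFAT/FAT32 figures (not taken from the original email):
my $CHARS_PER_LFN_ENTRY = 13;      # characters stored in one long-filename entry
my $MAX_DIR_ENTRIES     = 65_536;  # 32-byte entries allowed in a single directory

# Directory entries one file consumes: its LFN entries plus one 8.3 short entry.
sub entries_per_file {
    my ($name) = @_;
    my $lfn = int((length($name) + $CHARS_PER_LFN_ENTRY - 1) / $CHARS_PER_LFN_ENTRY);
    return $lfn + 1;
}

my $name  = 'rec13778enwiki-20070802-pages-articles.xml.bz2';
my $files = 13778;
my $per   = entries_per_file($name);
my $total = $per * $files;

printf "%d chars -> %d entries per file, %d entries total (limit %d)\n",
    length($name), $per, $total, $MAX_DIR_ENTRIES;
print $total > $MAX_DIR_ENTRIES ? "over the limit\n" : "fits\n";
```

Under those assumptions the split files alone would need more directory entries than a single FAT32 directory can hold, which would fit the symptoms. It is an estimate, not a diagnosis.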

Anyway, thanks again. Feel free to add any of this stuff to your page (like the FC7 notes). If you do, please don't include my email; a credit to RevRagnarok is fine.

  • RevRagnarok

#!/usr/bin/perl -w

# This was written by RevRagnarok (I'm on Wikipedia)
# I was having problems with all the split files on a FAT32 partition. I assume
# it is because there were so many of them, plus two entries for each (LFNs).
# This simply combines the rec* files again into larger chunks of N files each;
# I used 5, but you can change that below with $combine.
# Verification info is at the bottom.
# Lastly, I needed to modify the Makefile and remove the "split" from the
# "wikipedia" target.

use strict;

# Using: rec13778enwiki-20070802-pages-articles.xml.bz2
my $last = 13778;
my $lastd = 5; # How many digits in above (yes, I can compute this, but why?)
my $date = 20070802;
my $suffix = "enwiki-${date}-pages-articles.xml.bz2";
my $combine = 5; # This will combine every 5 into a group
                 # (If this number makes > 4 digit results, it will not sort nicely)
my $outputdir = '/data/wikipedia/'; # Don't make it the same place...

my $joinstr = '';
my $fcount = 0;

for (1 .. $last) {
  my $num = sprintf "%0${lastd}d", $_;
  $joinstr .= "rec${num}${suffix} ";
  if (($_ % $combine) == 0) {
      &catthem($joinstr, $fcount++);
      $joinstr = '';
  }
}

&catthem($joinstr, $fcount++) if ($joinstr ne '');
print "All done!\n";

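sub catthem ($$) {
  my ($files, $index) = @_;  # space-separated input names, output sequence number
  my $ofile = sprintf "rec%04d.bz2", $index;
  # Lazy again, there are more Perl-ish ways. Stop on failure (e.g. a full disk)
  # instead of silently leaving a short output file behind.
  system("/bin/cat $files >${outputdir}${ofile}") == 0
    or die "cat failed for ${ofile} (status $?)\n";
  print ".";
}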

__DATA__

To make sure they were all taken in, you can do this:
bash$ bzip2 -tvv *bz2 2>&1 | grep -v ok | grep -v bz2 | wc -l
13778

...which is equal to the number of start blocks, so I know nothing is missing now.
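
The script's own comment admits that shelling out to /bin/cat is the lazy route. Below is a minimal sketch of one more Perl-ish alternative, with a cheap byte-count check on top. It is not the code from the email: `cat_files` and `total_size` are hypothetical names, and the demo runs on throwaway files in a temporary directory rather than on the real rec* files.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Concatenate @inputs into $out in binary mode; return the bytes written.
sub cat_files {
    my ($out, @inputs) = @_;
    open(my $ofh, '>', $out) or die "Cannot write $out: $!";
    binmode $ofh;
    my $written = 0;
    for my $in (@inputs) {
        open(my $ifh, '<', $in) or die "Cannot read $in: $!";
        binmode $ifh;
        my $buf;
        while (my $len = read($ifh, $buf, 65536)) {
            print {$ofh} $buf or die "Write to $out failed: $!";
            $written += $len;
        }
        close($ifh);
    }
    close($ofh) or die "Cannot close $out: $!";
    return $written;
}

# Sum of on-disk sizes, for comparing the inputs against the combined output.
sub total_size {
    my $sum = 0;
    $sum += (-s $_) || 0 for @_;
    return $sum;
}

# Demo on three small stand-in files.
my $dir = tempdir(CLEANUP => 1);
my @parts;
for my $i (1 .. 3) {
    my $path = "$dir/part$i";
    open(my $fh, '>', $path) or die "Cannot write $path: $!";
    binmode $fh;
    print {$fh} "chunk $i\n";
    close($fh) or die "Cannot close $path: $!";
    push @parts, $path;
}
my $written = cat_files("$dir/combined", @parts);
print $written == total_size(@parts) ? "sizes match\n" : "size mismatch!\n";
```

Matching byte counts only show that nothing was dropped during concatenation; the bzip2 -t run above is still the real integrity test.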