
Setting up a Local Copy of Wikipedia

Date: 2021-03-07

Hello! I've always wanted to have a local copy of Wikipedia, and now seems like as good a time as any.

I'm not going to set up a true copy of Wikipedia with all the bells and whistles, just the bare minimum. For this we will use a zim file provided by Wikipedia and kiwix-serve, a tool for serving zim content.

Before we start, an example!

Wikipedia Dumps

The first step is to get the Wikipedia dump.

https://dumps.wikimedia.org/

This link has a variety of dumps, but the ones we are interested in are the kiwix files.

https://dumps.wikimedia.org/other/kiwix/zim/wikipedia/

Now we have a list of all the zim files that Wikipedia makes available.

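The naming scheme is wikipedia_$LANGUAGE_$TOPIC_$PICTURES_$YYYY-MM.zim, where $LANGUAGE is a language code such as en and $PICTURES says how much media is included, such as nopic for no pictures.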
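
To make the naming scheme concrete, here is a small sketch that splits a file name into its fields using plain shell word splitting. The function name parse_zim_name is made up for this example, and it assumes that no field contains an underscore, which may not hold for every file in the listing.

```shell
#!/bin/sh
# Hypothetical helper: split a zim file name into its underscore-separated
# fields. Assumes exactly five fields and no underscore inside a field.
parse_zim_name() {
    name=${1%.zim}      # strip the .zim extension
    (
        IFS=_           # split on underscores, only inside this subshell
        set -- $name    # fields become $1..$5
        printf 'project=%s language=%s topic=%s pictures=%s date=%s\n' \
            "$1" "$2" "$3" "$4" "$5"
    )
}

parse_zim_name wikipedia_en_astronomy_nopic_2021-02.zim
# prints: project=wikipedia language=en topic=astronomy pictures=nopic date=2021-02
```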

The English dump with all the pictures is over 80 GB and will take some time to download, so for now let's grab the astronomy dump without pictures.

> wget https://dumps.wikimedia.org/other/kiwix/zim/wikipedia/wikipedia_en_astronomy_nopic_2021-02.zim

If the connection drops, we can resume the download where we left off by rerunning wget with the -c option.
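
Picking up a partial download relies on wget's -c (--continue) option. Below is a minimal sketch of a retry loop built on it. The function name fetch_zim, the default of 5 attempts, and the 10 second pause are arbitrary choices for illustration.

```shell
#!/bin/sh
# Hypothetical helper: retry a download, resuming the partial file each time.
# Arguments: URL, number of attempts (default 5), pause in seconds (default 10).
fetch_zim() {
    url=$1
    attempts=${2:-5}
    pause=${3:-10}
    i=1
    while [ "$i" -le "$attempts" ]; do
        # -c tells wget to continue a partially downloaded file
        if wget -c "$url"; then
            return 0
        fi
        echo "attempt $i of $attempts failed" >&2
        # pause between attempts, but not after the last one
        if [ "$i" -lt "$attempts" ]; then
            sleep "$pause"
        fi
        i=$((i + 1))
    done
    return 1
}

# Usage:
# fetch_zim https://dumps.wikimedia.org/other/kiwix/zim/wikipedia/wikipedia_en_astronomy_nopic_2021-02.zim
```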

Kiwix

The second step is to download kiwix-tools, which will let us serve the zim content directly over the web.

> wget https://download.kiwix.org/release/kiwix-tools/kiwix-tools_linux-x86_64.tar.gz
> tar xvf kiwix-tools_linux-x86_64.tar.gz

We grab the tools and unpack everything. Inside the kiwix-tools directory is kiwix-serve, a program that runs a web server for the zim content.

> ./kiwix-tools_linux-x86_64-3.1.2-4/kiwix-serve -i 0.0.0.0 -p 7998 -d wikipedia_en_astronomy_nopic_2021-02.zim

We start the server on the IP 0.0.0.0 so that it binds to all interfaces, and on port 7998 instead of the default, 80. I also opened the port on my firewall. The -d option runs the server in daemon mode. Finally, we give the path of the zim file we want to serve.

Now we should be able to navigate to the IP address and port number of the machine that kiwix is running on and voila! We have a working copy of Wikipedia!
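
A quick way to confirm the server is answering without opening a browser is to request the front page with curl. This is only a sketch: the function name check_kiwix and the localhost default are assumptions, so substitute the address of the machine running kiwix-serve.

```shell
#!/bin/sh
# Hypothetical helper: succeed if something answers HTTP on host:port.
# Arguments: host (default localhost), port (default 7998, as used above).
check_kiwix() {
    host=${1:-localhost}
    port=${2:-7998}
    # -f: fail on HTTP errors, -s: no progress output, -S: still show errors,
    # -o /dev/null: discard the page body
    curl -fsS -o /dev/null "http://$host:$port/"
}

# Usage:
# check_kiwix localhost 7998 && echo "kiwix-serve is up"
```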

Enabling On Boot

To make sure our local copy runs on startup, we'll need to write some scripts so that it integrates with systemd.

The first things we need are start and stop scripts. Both must be executable (chmod +x) for systemd to run them.

start.sh

#!/usr/bin/bash

/home/media/wikipedia/kiwix-tools_linux-x86_64-3.1.2-4/kiwix-serve -i 0.0.0.0 -p 7998 /home/media/wikipedia/wikipedia_en_astronomy_nopic_2021-02.zim

stop.sh

#!/usr/bin/bash

pkill -f "kiwix-serve"

Now we can write the systemd unit file.

/etc/systemd/system/kiwix-wikipedia.service

# Systemd unit file for Kiwix

[Unit]
Description=Kiwix Wikipedia
After=syslog.target network.target

[Service]
ExecStart=/home/media/wikipedia/start.sh
ExecStop=/home/media/wikipedia/stop.sh
Restart=always

[Install]
WantedBy=multi-user.target

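Our unit file is quite basic. It orders the service after the network target and points at our start and stop scripts. Restart is set to always, so systemd will start kiwix-serve again if it exits. Note that start.sh leaves out the -d option, so kiwix-serve stays in the foreground where systemd can supervise it.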

# systemctl enable kiwix-wikipedia.service
# systemctl start kiwix-wikipedia.service

Voila! We have now started our local copy and enabled it to run on boot.
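
To double-check both claims, systemctl can report whether the unit is enabled and whether it is currently running. A small sketch follows; the function name service_report is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print the enabled and active state of a unit.
# Argument: unit name (defaults to the unit file written above).
service_report() {
    unit=${1:-kiwix-wikipedia.service}
    # Both subcommands print the state and exit non-zero for states such as
    # "disabled" or "inactive", so the exit status is ignored here.
    enabled=$(systemctl is-enabled "$unit" 2>/dev/null) || true
    active=$(systemctl is-active "$unit" 2>/dev/null) || true
    printf '%s: enabled=%s active=%s\n' \
        "$unit" "${enabled:-unknown}" "${active:-unknown}"
}

# Usage:
# service_report
# journalctl -u kiwix-wikipedia.service    # view the service's log output
```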