Archiving websites with redbean and recursive wget

2023-02-01

Here is a simple way to make a fully navigable archive of an interesting static website, for instance for offline consumption, or because you are afraid that the Internet will cease to exist soon. We will be using redbean to make this archive easily viewable: the archive will be a single executable that runs a simple static web server when launched. For this example, we will be archiving redbean’s website itself.

Step 1. Download all files in the website using recursive wget:

wget --recursive \
     --page-requisites \
     --adjust-extension \
     --convert-links \
     --restrict-file-names=windows \
     --domains redbean.dev \
     --no-parent \
     http://redbean.dev

Step 2. Download a small redbean executable suited to static websites:

wget https://redbean.dev/redbean-static-2.2.com -O redbean.dev.com

Step 3. Put all the stuff in there (redbean serves the contents of its own zip):

(cd redbean.dev; zip -r ../redbean.dev.com *)

Step 4. Make it executable and serve it right here and now:

chmod +x redbean.dev.com
./redbean.dev.com

The archived website can now be viewed in your browser at http://localhost:8080.
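If you prefer the terminal, a quick header check confirms the server is up (this assumes the redbean from step 4 is still running on its default port, 8080):

```shell
# Fetch only the response headers from the local redbean; the
# first line should report an HTTP 200 status.
curl -sI http://localhost:8080 | head -n 1
```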

Step 5. Clean up the mess; you don’t need it anymore:

rm -rv redbean.dev

Here it is in the form of a generic script for any web domain starting at any URL:

#!/usr/bin/env bash

if [ -z "$1" ]; then
    echo "Usage: $0 <domain>[/path] (without the http:// scheme)"
    exit 1
fi

URL=$1
DOMAIN=$(echo "${URL}" | cut -d '/' -f 1)

echo "Archiving web domain: $DOMAIN, starting at url: http://$URL"
echo "Is this ok? Press enter for yes, ^C for no"
read -r

wget --recursive \
     --page-requisites \
     --adjust-extension \
     --convert-links \
     --restrict-file-names=windows \
     --domains "$DOMAIN" \
     --no-parent \
     --user-agent "curl" \
     "http://$URL"

if [ ! -d "$DOMAIN" ]; then
    echo "Could not retrieve website content"
    exit 1
fi

wget https://redbean.dev/redbean-static-2.2.com -O "${DOMAIN}.com"

(cd "$DOMAIN" && zip -r "../${DOMAIN}.com" *) || exit 1

chmod +x "${DOMAIN}.com"
rm -rv "${DOMAIN}"
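
A hypothetical invocation, assuming the script above is saved as archive.sh: pass the domain, optionally with a starting path. The domain is taken as the first path component of the argument, which is why the URL must be given without the http:// scheme:

```shell
#   ./archive.sh redbean.dev
#   ./archive.sh redbean.dev/index.html

# How the script splits the argument: everything before the
# first slash becomes the domain passed to wget --domains.
URL="redbean.dev/index.html"
DOMAIN=$(echo "$URL" | cut -d '/' -f 1)
echo "$DOMAIN"   # prints "redbean.dev"
```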

I might start a small project of doing this for websites I find useful.