dave's cup of tech

Utterances on Data Science, Machine Learning, Artificial Intelligence, Visualization, and Complex Systems.

Mirroring Websites Is a Pain

The modern web is a pain to mirror, or even just to copy for future reference.
You have to go through every single page of a website and probably print each one to keep a copy of what was there.

I end up with mountains of PDFs. And that sucks, especially when a website uses the same title for every page. You have to come up with a filename by hand for each page you need to preserve, and the whole process is slow.
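The filename part, at least, can be automated. Here is a minimal sketch in Python that builds a filename from the page URL instead of the page title, so pages with identical titles still get distinct names. The function name `filename_for` and the `.pdf` default are my own assumptions for illustration, not part of any tool.

```python
import re
from urllib.parse import urlsplit


def filename_for(url: str, extension: str = ".pdf") -> str:
    """Derive a readable, filesystem-safe filename from a page URL.

    Hypothetical helper: the naming scheme is just one reasonable choice.
    """
    parts = urlsplit(url)
    # Combine host and path so pages from different sites do not collide.
    raw = parts.netloc + parts.path
    if parts.query:
        # Keep the query string, since it often distinguishes pages.
        raw += "-" + parts.query
    # Collapse every run of characters other than letters, digits,
    # and dots into a single hyphen, then trim stray separators.
    slug = re.sub(r"[^A-Za-z0-9.]+", "-", raw).strip("-.")
    # Fall back to "index" if nothing usable is left.
    return (slug or "index") + extension


if __name__ == "__main__":
    print(filename_for("https://example.com/blog/2020/mirroring/"))
```

Two different URLs can still map to the same name in rare cases (for example, `/a-b` and `/a/b`), so a real script would want to check for existing files before writing.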

Wget might be a solution, and I’ve used it in the past, but it isn’t ideal.

The --mirror switch is not enough on its own, and you end up needing to go through the manual anyway to find out what each switch means.
This is because Wget is not a dedicated mirroring tool, but a general-purpose downloader.
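To give an idea of what "not enough" means, here is a rough sketch of the switches that usually get combined with --mirror to produce a copy you can browse offline. It is written as a small Python helper (the name `wget_mirror_command` is made up for this post) so that each switch can carry a comment. The switches themselves are standard GNU Wget options, but whether this set suits a given site is an assumption you would need to check.

```python
import shlex
from typing import List


def wget_mirror_command(url: str) -> List[str]:
    """Build a wget argument list for an offline-browsable mirror.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    return [
        "wget",
        "--mirror",            # recursive download with timestamping, unlimited depth
        "--convert-links",     # rewrite links so they work in the local copy
        "--adjust-extension",  # add .html/.css suffixes where the URL lacks them
        "--page-requisites",   # also fetch images, stylesheets, and similar assets
        "--no-parent",         # never climb above the starting directory
        url,
    ]


if __name__ == "__main__":
    # Print the command so it can be reviewed before running it in a shell.
    print(shlex.join(wget_mirror_command("https://example.com/docs/")))
```

Printing the command instead of running it is deliberate: a recursive download can hit a site hard, so it is worth reading the line before pasting it into a terminal.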

Yes, I can hammer a screw in, but a hammer isn’t the right tool for the job.