Web Scraping for Preservation's Sake

Running a modern website or blog securely (or at least conveniently) increasingly means putting it behind a service like CloudFlare. The downside is that when one of those services goes down, every site relying on its DNS and proxying goes down with it. Many of the blogs I follow have jumped on that bandwagon, which is annoying because it can take quite a while for a site to come back, whether the problem was on CloudFlare's side or the blog owner's. Because of this, I have looked into ways to keep backup copies of their sites while that is still an option.

This is the command I found produces the most organized output, with a local copy whose formatting matches the original pages:

wget -r --convert-links --html-extension --no-parent some-site-online.com

To pull the entire website while masquerading as a Mozilla browser doing a site crawl, waiting 10 seconds between each page pull, and limiting the download rate to 35 KB/s, run:

wget -r -p -U Mozilla --wait=10 --limit-rate=35K some-site-online.com

This second form is the better choice if you want to avoid being flagged as a spam crawler.
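As an aside, wget also has a --random-wait option that varies the delay around the --wait value, which can make the request pattern look a bit less mechanical. That flag is my own addition rather than part of the commands above, but a sketch combining it with the same flags might look like:

# hedged variant: --random-wait is not in the original command, it just varies the pause around --wait
wget -r -p -U Mozilla --wait=10 --random-wait --limit-rate=35K some-site-online.com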

Both commands create a folder in your present working directory containing everything needed to browse the site locally; the only gaps will be links to anything that was not downloaded.

As an example, I ran the first command against my own blog, which created a folder named blog.wretchednet.com. Opening the index.html file in Firefox presents my home page, with every page I have created traversable and viewable exactly as I intended.
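Put together, the steps for that example look roughly like this (the folder name is simply whatever hostname you scraped):

# mirror the blog, then open the local copy of the home page in Firefox
wget -r --convert-links --html-extension --no-parent blog.wretchednet.com
firefox blog.wretchednet.com/index.html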

Caveats

I have found some sites that will not let you traverse them in this fashion. I don't know whether this was by design or just poor web structure, but one of the blogs I wanted to scrape would only give me access to one page at a time. That is not ideal, but it still let me grab the few pages I wanted.
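For that one-page-at-a-time situation, a minimal sketch (the URL here is a placeholder, not a real page) is to skip recursion and fetch each page individually along with the images and stylesheets it needs:

# hypothetical single-page pull; swap in the actual page URL you want to save
wget -p --convert-links --html-extension https://some-site-online.com/post-i-want.html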

I have also found myself blocked from even viewing a site after scraping it. Luckily I was able to grab the one post I needed, but shortly after trying to pull their entire site my public IP address was blocked for about 30 days.