From ed25b6b650bc6cc863ed9ed42147394ebb1c219e Mon Sep 17 00:00:00 2001
From: Wesley Moore
Date: Mon, 22 Mar 2010 17:43:29 +1100
Subject: [PATCH] QA /technical/2009/05/spider-a-site-with-wget-using-sitemap-xml/

---
 .../2009/05/spider-a-site-with-wget-using-sitemap-xml.html | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/content/technical/2009/05/spider-a-site-with-wget-using-sitemap-xml.html b/content/technical/2009/05/spider-a-site-with-wget-using-sitemap-xml.html
index d87889f..8ddf51f 100644
--- a/content/technical/2009/05/spider-a-site-with-wget-using-sitemap-xml.html
+++ b/content/technical/2009/05/spider-a-site-with-wget-using-sitemap-xml.html
@@ -1,5 +1,5 @@
-On a number of sites at work we employ a static file caching extension to do just that: create static files that are served until the cache is invalidated. One of things that will invalidate the cache is deploying a new release of the code. This means that many of the requests after deploying will need to be generated from scratch, often causing the full Rails stack to be started (via Passenger) each time. To get around this I came up with the following to use wget to spider each of the URLs listed in the sitemap.xml. This ensures each of the major pages has been cached so most requests will be cache hits.
+On a number of sites at work we employ a static file caching extension to do just that: create static files that are served until the cache is invalidated. One of things that will invalidate the cache is deploying a new release of the code. This means that many of the requests after deploying will need to be generated from scratch, often causing the full Rails stack to be started (via Passenger) each time. To get around this I came up with the following to use `wget` to spider each of the URLs listed in the `sitemap.xml`. This ensures each of the major pages has been cached so most requests will be cache hits.
 
-wget --quiet http://www.example.com/sitemap.xml --output-document - | egrep -o "http://www\.example\.com[^<]+" | wget --spider -i - --wait 1
+    wget --quiet http://www.example.com/sitemap.xml --output-document - | egrep -o "http://www\.example\.com[^<]+" | wget --spider -i - --wait 1
 
 That should all be executed on one line. There's a one second wait in there to spread out the requests a bit but you can remove it if you like.
\ No newline at end of file
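For context, the pipeline the patched post describes has three stages: fetch the sitemap to stdout, extract every URL for the domain, then spider each URL so the cache gets warmed. A small sketch of the extraction stage, run against an inline sample sitemap so it can be tried without network access (the `www.example.com` domain is the post's placeholder; the sample sitemap content here is invented for illustration):

```shell
# Sample sitemap standing in for what `wget --output-document -` would fetch.
sample_sitemap='<?xml version="1.0"?>
<urlset>
  <url><loc>http://www.example.com/</loc></url>
  <url><loc>http://www.example.com/about</loc></url>
</urlset>'

# Extract the URLs: grep -E is the modern spelling of the deprecated egrep;
# -o prints each match on its own line, and [^<]+ stops at the closing tag.
printf '%s\n' "$sample_sitemap" | grep -Eo 'http://www\.example\.com[^<]+'

# Full warm-up pipeline from the post (requires network, shown for reference):
#   wget --quiet http://www.example.com/sitemap.xml --output-document - \
#     | grep -Eo 'http://www\.example\.com[^<]+' \
#     | wget --spider -i - --wait 1
```

`--spider` makes the second `wget` request each URL without saving anything to disk, and `--wait 1` paces the requests one second apart, matching the post's note.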