forked from wezm/wezm.net

QA /technical/2009/05/spider-a-site-with-wget-using-sitemap-xml/

Wesley Moore 2010-03-22 17:43:29 +11:00
parent 0ba542043d
commit ed25b6b650


@@ -1,5 +1,5 @@
-On a number of sites at work we employ a static file caching extension to do just that: create static files that are served until the cache is invalidated. One of the things that will invalidate the cache is deploying a new release of the code. This means that many of the requests after deploying will need to be generated from scratch, often causing the full Rails stack to be started (via Passenger) each time. To get around this I came up with the following to use <code>wget</code> to spider each of the URLs listed in the <code>sitemap.xml</code>. This ensures each of the major pages has been cached so most requests will be cache hits.
+On a number of sites at work we employ a static file caching extension to do just that: create static files that are served until the cache is invalidated. One of the things that will invalidate the cache is deploying a new release of the code. This means that many of the requests after deploying will need to be generated from scratch, often causing the full Rails stack to be started (via Passenger) each time. To get around this I came up with the following to use `wget` to spider each of the URLs listed in the `sitemap.xml`. This ensures each of the major pages has been cached so most requests will be cache hits.
-<p style="text-align: left;"><code>wget --quiet http://www.example.com/sitemap.xml --output-document - | egrep -o "http://www\.example\.com[^<]+" | wget --spider -i - --wait 1</code></p>
+wget --quiet http://www.example.com/sitemap.xml --output-document - | egrep -o "http://www\.example\.com[^<]+" | wget --spider -i - --wait 1
That should all be executed on one line. There's a one second wait in there to spread out the requests a bit but you can remove it if you like.
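
For repeated use, the same pipeline could be wrapped in a pair of shell functions. This is only a sketch: the names `sitemap_urls` and `warm_cache` are made up for illustration, and it assumes the base URL is passed without a trailing slash. It uses `grep -E` in place of `egrep`, which newer GNU grep releases flag as obsolescent.

```shell
#!/bin/sh

# Hypothetical helper: read sitemap XML on stdin and print every URL that
# starts with the given base, one per line. Dots in the base are escaped
# so they match literally rather than as regex wildcards.
sitemap_urls() {
  base_pattern=$(printf '%s' "$1" | sed 's/\./\\./g')
  grep -Eo "${base_pattern}[^<]+"
}

# Hypothetical wrapper around the one-liner: fetch the sitemap, extract
# its URLs, then request each one in spider mode, one second apart.
warm_cache() {
  wget --quiet "$1/sitemap.xml" --output-document - |
    sitemap_urls "$1" |
    wget --spider --input-file - --wait 1
}
```

After a deploy it would be invoked as `warm_cache http://www.example.com`.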