All the Wget Commands You Should Know, and How to Download an Entire Website for Offline Viewing

How do you download an entire website for offline viewing? How do you save all the MP3s from a website to a folder on your computer? How do you download files that are behind a login page? How do you build a mini version of Google?

Wget is a free utility, available for Mac, Windows and Linux, that can help you accomplish all of this and more. What sets it apart from most download managers is that wget can follow the HTML links on a web page and download files recursively. It is the same tool that a soldier used to download thousands of unauthorized documents from the US Army's intranet that were later published on the Wikileaks website.



Spider Websites with Wget: 20 Practical Examples

Wget is extremely powerful, but like most other command-line programs, the plethora of options it supports can be intimidating to new users. What we have here is a collection of wget commands that you can use to accomplish common tasks, from downloading a single file to mirroring an entire website. It will help to read the wget manual, but for the busy souls, these commands are ready to execute.

1. Download a single file

wget http://example.com/file.iso

2. Download a file and save it locally under a different name

wget --output-document=filename.html example.com

3. Download a file and save it in a specific folder

wget --directory-prefix=folder/subfolder example.com

4. Resume an interrupted download previously started by wget itself

wget --continue example.com/big.file.iso

5. Download a file only if the version on the server is newer than your local copy

wget --continue --timestamping wordpress.org/latest.zip

6. Download multiple URLs with wget. Put the list of URLs in a text file, one URL per line, and pass the file to wget; a sketch of the file's format follows the command below.

wget --input-file=list-of-file-urls.txt
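
The list-of-file-urls.txt file is just plain text with one URL per line; a minimal sketch (these URLs are placeholders):

http://example.com/archive/file1.zip
http://example.com/archive/file2.zip
http://example.com/videos/file3.mp4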

7. Download a list of sequentially numbered files from a server

wget http://example.com/images/{1..20}.jpg

8. Download a web page with all the assets, such as stylesheets and inline images, that are required to properly display the page offline.

wget --page-requisites --span-hosts --convert-links --adjust-extension http://example.com/dir/file

Mirror Websites with Wget


9. Download an entire website including all the linked pages and files

wget --execute robots=off --recursive --no-parent --continue --no-clobber http://example.com/

10. Download all MP3 files from a subdirectory

wget --level=1 --recursive --no-parent --accept mp3,MP3 http://example.com/mp3/

11. Download all the images from a website into a common folder

wget --directory-prefix=files/pictures --no-directories --recursive --no-clobber --accept jpg,gif,png,jpeg http://example.com/images/

12. Download PDF documents recursively but stay within the specified domains.

wget --mirror --span-hosts --domains=abc.com,files.abc.com,docs.abc.com --accept=pdf http://abc.com/

13. Download all the files from a website but exclude a few directories.

wget --recursive --no-clobber --no-parent --exclude-directories /forums,/support http://example.com

Wget for Downloading Restricted Content

We can use wget to download content from sites that are behind a login screen, or from sites that check the HTTP Referer and User-Agent strings of the bot to prevent screen scraping.

14. Download files from sites that check the User-Agent and the HTTP Referer

wget --referer=http://google.com --user-agent="Mozilla/5.0 Firefox/4.0.1" http://nytimes.com

15. Download files from password-protected sites

wget --http-user=labnol --http-password=hello123 http://example.com/secret/file.zip

16. Fetch pages that are behind a login page. Replace user and password with the actual form field names, and point the URL to the page the login form submits to.

wget --cookies=on --save-cookies cookies.txt --keep-session-cookies --post-data 'user=labnol&password=123' http://example.com/login.php
wget --cookies=on --load-cookies cookies.txt --keep-session-cookies http://example.com/paywall

Retrieving File Details with Wget

17. Find the size of a file without downloading it (look for Content-Length in the server response; the size is in bytes)

wget --spider --server-response http://example.com/file.iso

18. Download a file and display its content on the screen without saving it locally.

wget --output-document=- --quiet google.com/humans.txt

19. Find the last-modified date of a web page (check the Last-Modified header in the server response)

wget --server-response --spider http://www.labnol.org/

20. Check the links on your website to see whether they are working. The spider option means the pages will not be saved locally.

wget --output-file=logfile.txt --recursive --spider http://example.com
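
To list the broken links afterwards, one possible approach is to search the log for the 404 responses that wget records; this is only a sketch that assumes wget's default English output (adjust the number of context lines as needed to see the offending URLs):

grep -B 4 '404 Not Found' logfile.txt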

How to make wget be nice to the server

The wget tool is essentially a spider that scrapes / leeches web pages, but some web hosts may block these spiders with robots.txt files. wget will also not follow links on web pages that use the rel=nofollow attribute.

We can force wget to ignore the robots.txt and nofollow directives by adding the switch --execute robots=off to all wget commands. If a web host is blocking wget requests by looking at the User-Agent string, you can always fake that with the --user-agent=Mozilla switch.
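
For instance, a rough sketch that combines both switches (the URL and the user-agent string here are only placeholders):

wget --execute robots=off --user-agent="Mozilla/5.0" --recursive --no-parent http://example.com/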

The wget command will put additional strain on the site's server because it continuously traverses the links and downloads files. A good scraper would therefore limit the retrieval rate and include a wait period between consecutive fetch requests to reduce the load on the server.

wget --limit-rate=20k --wait=60 --random-wait --mirror example.com

In the above example, we have limited the download bandwidth to 20 KB/s, and the wget utility will wait anywhere between 30 and 90 seconds before retrieving the next resource.

Finally, what do you think this wget command will do?

wget --span-hosts --level=inf --recursive dmoz.org
