2006-08-11

A few wget options

The wget options I used most often in the past include -c (resume), -r (recursive), -np (don't ascend to the parent directory), -i (read download URLs from a file), and -l (recursion depth). Today I learned a few more very useful options: -E, -k, and -p, covered below:
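First, a quick sketch of how those older options combine (the URL, the depth value, and the urls.txt file are placeholders for illustration, not from any real session):

$ wget -c -r -np -l 3 http://example.com/docs/
$ wget -c -i urls.txt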

-E
--html-extension
If a file of type application/xhtml+xml or text/html is downloaded and the URL does not end with the regexp \.[Hh][Tt][Mm][Ll]?, this option will cause the suffix .html to be appended to the local filename. This is useful, for instance, when you're mirroring a remote site that uses .asp pages, but you want the mirrored pages to be viewable on your stock Apache server. Another good use for this is when you're downloading CGI-generated materials. A URL like http://site.com/article.cgi?25 will be saved as article.cgi?25.html.

Note that filenames changed in this way will be re-downloaded every time you re-mirror a site, because Wget can't tell that the local X.html file corresponds to remote URL X (since it doesn't yet know that the URL produces output of type text/html or application/xhtml+xml). To prevent this re-downloading, you must use -k and -K so that the original version of the file will be saved as X.orig.
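A minimal sketch of -E in action, reusing the man page's article.cgi?25 example (site.com is the man page's placeholder host):

$ wget -E -k -K http://site.com/article.cgi?25

The page is saved as article.cgi?25.html, and per the note above, combining -E with -k/-K keeps a pre-conversion copy under an .orig suffix so a later re-mirror won't fetch it again.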

-k (lowercase k)
--convert-links
After the download is complete, convert the links in the document to make them
suitable for local viewing. This affects not only the visible hyperlinks, but any
part of the document that links to external content, such as embedded images,
links to style sheets, hyperlinks to non-HTML content, etc.

-K (uppercase K)
--backup-converted
When converting a file, back up the original version with a .orig suffix. Affects the behavior of -N.
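As a sketch, a one-level mirror with links rewritten for local viewing and the pre-conversion copies preserved (example.com is a placeholder):

$ wget -r -l 1 -k -K http://example.com/

Keeping the .orig copies is what the -N interaction mentioned above depends on: on a later run, timestamps can be compared against the unconverted files rather than the rewritten ones.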

-p
--page-requisites
This option causes Wget to download all the files that are necessary to properly
display a given HTML page.  This includes such things as inlined images, sounds,
and referenced stylesheets.

Ordinarily, when downloading a single HTML page, any requisite documents that may be needed to display it properly are not downloaded. Using -r together with -l can help, but since Wget does not ordinarily distinguish between external and inlined documents, one is generally left with "leaf documents" that are missing their requisites.

Note that Wget will behave as if -r had been specified, but only that single page and its requisites will be downloaded. Links from that page to external documents will not be followed. Actually, to download a single page and all its requisites (even if they exist on separate websites), and make sure the lot displays properly locally, this author likes to use a few options in addition to -p:

wget -E -H -k -K -p http://site/document
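For context, -H (--span-hosts) is what permits Wget to follow links onto other hosts here; without it, requisites served from a separate site would be skipped.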

1 comment:

  1. -A (specify which file types to download)
    -A acclist --accept acclist
    -R rejlist --reject rejlist
    Specify comma-separated lists of file name suffixes or patterns to accept or reject (see the "Types of Files" section of the manual for more details).
    For example, to download only mp3 files while skipping html and gif files:
    $ wget -r -l2 -c -np -A.mp3 -R.html,.gif http://www.oldtimeradioarchives.com/mp3/

    Or first use the dog and grep commands to pull the links to mp3 files out into a file, then download them in batch:
    $ dog --links http://www.djbc.net/beastles/ | grep mp3 > beastles ; wget -i beastles
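
    If dog isn't installed, a rough equivalent of that link extraction can be sketched with wget and grep alone (this only catches absolute mp3 URLs, and the pattern is an assumption about the page's markup):

    $ wget -qO- http://www.djbc.net/beastles/ | grep -oE 'http[^"]*\.mp3' > beastles ; wget -i beastles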
