wget

Last updated: 2024-10-09

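Contents

  • Configuration files
  • Background download
  • Resuming downloads
  • Download an entire directory
  • View headers
  • Download a list of files
  • Mirror Website
  • Download specific file types from a directory
  • Login
  • Limit Speed
  • Other options
  • wget 401 then 200
  • Drupal cron jobs
  • Other Tools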

Introduction

 

Speed Unit

  • Unit: MB = megabytes (MByte), not megabits

wget http://x.x.x.x:8080/systemrescuecd-amd64-6.1.8.iso

--2022-06-15 10:36:12--  http://x.x.x.x:8080/systemrescuecd-amd64-6.1.8.iso
Connecting to x.x.x.x:8080... connected.
HTTP request sent, awaiting response... 200 OK
Length: 716177408 (683M) [application/octet-stream]
Saving to: ‘systemrescuecd-amd64-6.1.8.iso.1’

systemrescuecd-amd64-6.1.8 100%[=======================================>] 683.00M  17.8MB/s    in 36s
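
The "683M" in the output above can be checked against the byte count on the "Length:" line. A minimal sketch, assuming the M suffix is 1024-based; the variable names are only illustrative:

```shell
# Size reported on the "Length:" line of the sample output, in bytes.
length_bytes=716177408

# Convert to units of 1024 * 1024 bytes with integer arithmetic.
length_mib=$(( length_bytes / 1024 / 1024 ))

printf '%sM\n' "$length_mib"
```

The result matches the 683M that wget prints, so the size figures here are binary (1024-based) units.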

Limitation

wget does not download in parallel.

For parallel downloads, consider aria2.

 


Configuration files

 

/etc/wgetrc                                     # system-wide settings

~/.wgetrc                                       # per-user settings; can also store a password

wget --config=/path/to/wgetrc ...     # use a different config file
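
A minimal sketch of a separate config file for use with --config. The settings chosen (tries, continue, wait) and the temporary path are only examples, and the wget command is printed rather than run:

```shell
# Write a throwaway wgetrc into a temporary directory (example settings only).
conf_dir=$(mktemp -d)
conf_file="$conf_dir/wgetrc"

cat > "$conf_file" <<'EOF'
# Retry each download up to 3 times (same as -t 3)
tries = 3
# Resume partially downloaded files (same as -c)
continue = on
# Wait 5 seconds between retrievals (same as -w 5)
wait = 5
EOF

# Keep the file private in case credentials are added to it later.
chmod 600 "$conf_file"

# Print the command that would use it; http://link is the placeholder URL used in these notes.
echo "wget --config=$conf_file http://link"
```

Each line in the file is a wgetrc command in "name = value" form, so anything that works here also works with -e on the command line.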

 


Example: Background download

 

# -t, --tries=            Set number of tries to number. Default:20, 0=inf.
# -o logfile               Log all messages to logfile.

wget -t 3 -o log.txt http://link &

or

# -b, --background     Go to background; without -o, messages are logged to wget-log
# -c, --continue       Continue getting a partially-downloaded file

wget -b -t 3 -c http://link

 

Example: Resume a download

wget -c http://link/bigfile

# -c  resume a partially downloaded file

 

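Example: Download an entire directory

wget -cp http://link/directory/

# -p, --page-requisites    # download everything needed to display the page (images, CSS, ...)
#                          # on its own it does not fetch every file in a directory; for that use -r -np (see Mirror Website)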

 

Example: View headers

# Print the server's response headers before downloading (-S, --server-response)

wget -S http://web-site/

 

Example: Download a list of files

wget -nc -i dl.file

# -i <file>                   file containing one URL per line

# -nc, --no-clobber           do not re-download files that already exist, even if they are incomplete (the opposite of -c)
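
A small sketch of preparing the list file for -i. The example.com URLs are hypothetical, and the download itself is left as a comment so nothing is fetched:

```shell
# Build the list file: one URL per line (hypothetical example.com URLs).
list_file=$(mktemp)
cat > "$list_file" <<'EOF'
https://example.com/files/a.iso
https://example.com/files/b.iso
https://example.com/files/c.iso
EOF

# Count non-empty lines to see how many downloads wget would attempt.
url_count=$(grep -c . "$list_file")
echo "$url_count URLs queued"

# The actual download (not run here):
#   wget -nc -i "$list_file"
```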

 


Mirror Website

 

Method 1: --recursive (-r)

Method 2: --mirror (-m)

 

Method 1: --recursive (-r)

-r, --recursive                  # Retrieve recursively (default maximum depth: 5 levels)

-l, --level=                      # Maximum recursion depth (nested levels)

-P /PATH                         # Save files under /PATH (default prefix: ".")

-nX

  • -nd, --no-directories           Do not create directories (for URL A/B/C, the directories A/B/C are created unless -nd is given)
  • -np, --no-parent                 Do not ascend to the parent directory when recursing
  • -nH, --no-host-directories    Disable generation of host-prefixed directories
                                             (for http://you-domain/, a folder you-domain/ is normally created)

e.g.

# After downloading, the directory structure "dl.dahunter.org/mysql/c7/m80" is created

wget -r -l 1 -np https://dl.dahunter.org/mysql/c7/m80/

 * The trailing "/" is required

# Download all files into the current directory "."

wget -r -l 1 -np -nd https://dl.dahunter.org/mysql/c7/m80/

--cut-dirs=N

Ignore N remote directory components when creating local directories;
wget behaves as if the first N directory components of the URL did not exist.

Combining "-nX" with "--cut-dirs=N"

url = http://you-domain/A/B/C

# Resulting local folder structure

-nH -> A/B/C/
-nH --cut-dirs=1  -> B/C/
-nH --cut-dirs=2  -> C/
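
The mapping above can be imitated with a small shell function. This is only a sketch of the path arithmetic, not wget's own code; cut_dirs is a hypothetical helper, and it assumes -nH has already removed the host part:

```shell
# cut_dirs PATH N: drop the first N directory components of PATH,
# printing "." when nothing is left (hypothetical helper).
cut_dirs() {
    path=$1
    n=$2
    while [ "$n" -gt 0 ] && [ -n "$path" ]; do
        case $path in
            */*) path=${path#*/} ;;   # remove the leading "component/"
            *)   path="" ;;           # last component removed
        esac
        n=$(( n - 1 ))
    done
    printf '%s\n' "${path:-.}"
}

cut_dirs A/B/C 0
cut_dirs A/B/C 1
cut_dirs A/B/C 2
cut_dirs A/B/C 3
```

The four calls print A/B/C, B/C, C and "." in that order, which lines up with the table above plus the case where every component is cut.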

--convert-links                 # rewrite links so the documents can be viewed offline

e.g.

# Convert the links in the downloaded HTML files

wget --convert-links -N -l2  -P/tmp -r http://www.gnu.org/

 

-N, --timestamping          # don't re-retrieve files unless newer than local.

-I                                    # comma-separated list of directories included in the retrieval.
                                      # Any other directories will simply be ignored. The directories are absolute paths.

-L                                   # Follow relative links only. The following are NOT relative links:

  • <a href="/foo.gif">
  • <a href="/foo/bar.gif">
  • <a href="http://www.server.com/foo/bar.gif">

-D <domain-list>                 comma-separated list of domains that will be followed,
                                         thus limiting the recursion only to the hosts that belong to these domains.

 

方法2: --mirror(-m)

wget -m -w 5 http://www.gnu.org/

  • -m,  --mirror                      Equivalent to -N -r -l inf --no-remove-listing ( -l inf is the same as -l 0 )
  • -k,   --convert-links
  • -w,  --wait                         Wait the given number of seconds between retrievals

 

Example: Download specific file types from a directory

wget -r -l1  -A'.gif,.swf,.css,.html,.htm,.jpg,.jpeg' url

Example: Limit Speed

# Limit the download rate; the value is in bytes per second and accepts the suffixes k and m

--limit-rate=100k
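
To see what a suffixed value amounts to in bytes per second, here is a rough sketch. It assumes k means 1024 bytes and m means 1024 * 1024 bytes, and it only handles whole numbers; rate_to_bytes is a hypothetical helper, not part of wget:

```shell
# Convert a --limit-rate style value such as 100k or 2m to bytes per second.
# Assumption: k = 1024 bytes, m = 1024 * 1024 bytes; integers only.
rate_to_bytes() {
    value=$1
    case $value in
        *[kK]) echo $(( ${value%[kK]} * 1024 )) ;;
        *[mM]) echo $(( ${value%[mM]} * 1024 * 1024 )) ;;
        *)     echo "$value" ;;
    esac
}

rate_to_bytes 100k
rate_to_bytes 2m
rate_to_bytes 500
```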

Example: Download only newer files

-N,  --timestamping

If the server does not send a Last-Modified header, wget prints:

Last-modified header missing -- time-stamps turned off.

Exclude

--reject-regex

To use more than one regex, join them with "|"

wget --reject-regex 'expr1|expr2|…' https://example.com

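--reject-regex can only be given once. To exclude two lists, merge them into one regex first

FILE_LIST='
/SRPMS/|
/aarch64/|
/debuginfo/|
/i386/|
/i686/
'

VER_LIST='
-8\.0\.1.-|
-8\.0\.2.-|
-8\.0\.3[0-5]-
'

# Strip spaces and newlines so each list becomes a single-line regex.
# Quoting the variables keeps the shell from glob-expanding patterns such as [0-5].
EXCLUDE_FILE=$(printf '%s' "$FILE_LIST" | tr -d ' \n')
EXCLUDE_VER=$(printf '%s' "$VER_LIST" | tr -d ' \n')
EXCLUDE_LIST="$EXCLUDE_FILE|$EXCLUDE_VER"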
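
Before handing the merged regex to wget, it can be tried on a few sample URLs. A sketch with shortened lists: join_regex is a hypothetical helper, the example.com URLs are made up, and grep -E is assumed to be close enough to wget's default posix regex type for this check:

```shell
# join_regex LIST...: strip spaces and newlines from each list and
# join the lists with "|" (hypothetical helper).
join_regex() {
    out=""
    for list in "$@"; do
        part=$(printf '%s' "$list" | tr -d ' \n')
        [ -n "$part" ] || continue
        out="${out:+$out|}$part"
    done
    printf '%s\n' "$out"
}

FILE_LIST='
/SRPMS/|
/i686/
'
VER_LIST='
-8\.0\.3[0-5]-
'

EXCLUDE_LIST=$(join_regex "$FILE_LIST" "$VER_LIST")
printf '%s\n' "$EXCLUDE_LIST"

# Try the regex on sample URLs (hypothetical) before passing it to wget.
for url in \
    https://example.com/repo/SRPMS/pkg.src.rpm \
    https://example.com/repo/x86_64/pkg-8.0.31-1.rpm \
    https://example.com/repo/x86_64/pkg-8.0.36-1.rpm
do
    if printf '%s\n' "$url" | grep -Eq -- "$EXCLUDE_LIST"; then
        echo "reject $url"
    else
        echo "keep   $url"
    fi
done
```

The first two URLs are rejected (an /SRPMS/ path and an 8.0.31 version) and the third is kept, since 8.0.36 falls outside 3[0-5].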

Usage

wget --reject-regex logout  https://example.com

--regex-type posix|pcre      # default: posix

Note that to be able to use pcre type, wget has to be compiled with libpcre support.

wget -V | grep pcre

e.g.

list='/fdr0./|/fdr1./'
list='/fdr\.0/|/fdr\.1/'
list='/fdr1[0-3]/|/fdr2[0-3]/'

-R, --reject rejlist

 * --reject can be used together with --reject-regex

Specify a comma-separated list of file name suffixes (--reject "bin,dd").
If any of the wildcard characters *, ?, [ or ] appears in an element of rejlist, it is treated as a pattern rather than a suffix ("*.bin,*.dd")

wget ... \
--reject "xml.gz,sqlite.bz2" \
--regex-type pcre --reject-regex "$EXCLUDE_LIST"

Testing the exclude rules

# Prepare the directory structure

cd /home/vhosts/test.local/public_html
mkdir -p fdr{1,2}{0..4}
touch fdr1{0..4}/test.txt
touch fdr2{0..4}/test.sh
touch test.bin test.dd
touch bin.txt dd.txt
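
The same layout can be used to preview which paths a reject regex would catch without running wget at all. A sketch under two assumptions: the tree is recreated in a temporary directory instead of the vhost path, and the regex is applied to local paths with grep -E, whereas wget applies it to the URL:

```shell
# Recreate the test layout in a temporary directory
# (POSIX loop instead of brace expansion).
test_root=$(mktemp -d)
cd "$test_root" || exit 1
for i in 0 1 2 3 4; do
    mkdir -p "fdr1$i" "fdr2$i"
    touch "fdr1$i/test.txt" "fdr2$i/test.sh"
done
touch test.bin test.dd bin.txt dd.txt

# One of the sample regexes from the e.g. list above.
list='/fdr1[0-3]/|/fdr2[0-3]/'

# Paths printed here are the ones --reject-regex would be expected to skip.
find . -type f | sort | grep -E -- "$list"

matched=$(find . -type f | grep -Ec -- "$list")
echo "$matched files match"
```

Only the files under fdr10 to fdr13 and fdr20 to fdr23 match; fdr14, fdr24 and the top-level files are left alone.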

 


Execute command

 

-e, --execute command

Execute command as if it were a part of .wgetrc.
If you need to specify more than one wgetrc command, use multiple instances of -e.

Use case

Running "wget -m ... " prints the message

http://test.local/robots.txt:
2024-10-09 16:35:04 ERROR 404: Not Found.

Fix

-e robots=off

 


Login

 

e.g.

wget -O - ftp://USER:PASS@server/README

Remark

# -O -     write the downloaded content to stdout ("-")
# -O file  write the downloaded content to file

Ways to log in:

  • USER:PASS@URL
  • --user=USER --password=PASS  # without --password, wget does not prompt for a password
  • --user=USER --ask-password      # wget prompt for password
  • --use-askpass=command            # If no command is specified, ENV - WGET_ASKPASS is used

Storing credentials in ~/.wgetrc

chmod 600 ~/.wgetrc              # keep the file private, it contains a password

Contents of ~/.wgetrc:

user=????

# ask_password = on/off
password=????

 


Other options

 

-U, --user-agent="user agent"

--referer=

--accept=jpg,gif

--reject=html

--wait=5

 

 


wget 401 then 200

 

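401: the server replies with an authentication challenge containing realm="..."

200: wget resends the request with the credentials and logs in

wget, like most other programs, waits for a Basic authentication challenge from the server before sending the credentials.

Versions up to and including 1.10.2 sent the credentials without waiting for a challenge; later versions wait by default.

You can change this behaviour with the --auth-no-challenge option, which sends the credentials with the first request.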

 


Drupal cron jobs

 

-O, --output-document=file

-q, --quiet

0 * * * * wget -O - -q http://?????  > /dev/null 2>&1  &&  touch /root/getlink
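
In the cron line above, touch runs only when wget exits with status 0. A sketch of a wrapper that also reports why wget failed, using the exit statuses listed in the wget manual; run_cron and wget_status are hypothetical helper names, and neither function is called here, so nothing is fetched:

```shell
# Map a wget exit status to the description given in the wget manual.
wget_status() {
    case $1 in
        0) echo "no problems" ;;
        1) echo "generic error" ;;
        2) echo "parse error" ;;
        3) echo "file I/O error" ;;
        4) echo "network failure" ;;
        5) echo "SSL verification failure" ;;
        6) echo "authentication failure" ;;
        7) echo "protocol error" ;;
        8) echo "server issued an error response" ;;
        *) echo "unknown status $1" ;;
    esac
}

# run_cron URL MARKER: fetch URL quietly, touch MARKER only on success.
run_cron() {
    url=$1
    marker=$2
    wget -O - -q "$url" > /dev/null 2>&1
    status=$?
    if [ "$status" -eq 0 ]; then
        touch "$marker"
    else
        echo "wget failed: $(wget_status "$status")" >&2
        return 1
    fi
}
```

A cron entry could then call a script containing run_cron with the URL and /root/getlink as arguments instead of chaining the commands with &&.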

Notes

-nv

Turn off verbose without being completely quiet (use -q for that)

which means that error messages and basic information still get printed.

 


Other Tools

 

Parallel download tools

 

 


 

 

 

 
