You may want to read a similar post I wrote for cURL, which covers the reasons for avoiding re-downloading the same content.
1 The flow
It seems fairly easy, too:
% wget http://example.com -S -N
--2012-03-23 20:27:23-- http://example.com/
Resolving example.com... 192.0.43.10
Connecting to example.com|192.0.43.10|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.0 302 Found
Location: http://www.iana.org/domains/example/
Server: BigIP
Connection: Keep-Alive
Content-Length: 0
Location: http://www.iana.org/domains/example/ [following]
--2012-03-23 20:27:23-- http://www.iana.org/domains/example/
Resolving www.iana.org... 192.0.32.8
Connecting to www.iana.org|192.0.32.8|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Date: Fri, 23 Mar 2012 12:27:24 GMT
Server: Apache/2.2.3 (CentOS)
Last-Modified: Wed, 09 Feb 2011 17:13:15 GMT
Vary: Accept-Encoding
Connection: close
Content-Type: text/html; charset=UTF-8
Length: unspecified [text/html]
Server file no newer than local file `index.html' -- not retrieving.
You may have noticed the difference: Wget does not use If-Modified-Since as cURL does. Arguably this check should be a HEAD request; instead, Wget decides whether to GET based on the Last-Modified and Content-Length headers. In the example above the server did not send Content-Length, and this becomes a real problem when a server sends Content-Length: 0 even though the content's actual length is non-zero. Blogger is one such case:
% wget http://oopsbroken.blogspot.com --server-response --timestamping --no-verbose
HTTP/1.0 200 OK
X-Robots-Tag: noindex, nofollow
Content-Type: text/html; charset=UTF-8
Expires: Fri, 23 Mar 2012 12:38:47 GMT
Date: Fri, 23 Mar 2012 12:38:47 GMT
Cache-Control: private, max-age=0
Last-Modified: Fri, 23 Mar 2012 10:49:23 GMT
ETag: "f5024c0a-c96f-464f-b96b-d89efdd69010"
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
Content-Length: 0
Server: GSE
Connection: Keep-Alive
HTTP/1.0 200 OK
X-Robots-Tag: noindex, nofollow
Content-Type: text/html; charset=UTF-8
Expires: Fri, 23 Mar 2012 12:38:47 GMT
Date: Fri, 23 Mar 2012 12:38:47 GMT
Cache-Control: private, max-age=0
Last-Modified: Fri, 23 Mar 2012 10:49:23 GMT
ETag: "f5024c0a-c96f-464f-b96b-d89efdd69010"
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
Server: GSE
2012-03-23 20:38:48 URL:http://oopsbroken.blogspot.com/ [44596] -> "index.html" [1]
Every time you run this, the file gets re-downloaded even though the content is the same.
From Wget's info page:
A file is considered new if one of these two conditions are met:
1. A file of that name does not already exist locally.
2. A file of that name does exist, but the remote file was modified
more recently than the local file.
[snip]
If the local file does not exist, or the sizes of the files do not
match, Wget will download the remote file no matter what the time-stamps
say.
If the sizes do not match, Wget will GET the file. In the case of Blogger, it returns:
Content-Length: 0
This is incorrect, since the content's length is not really zero, but Wget believes it. The actual file length is 44596 bytes; the sizes do not match, therefore Wget re-downloads the file.
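The decision rule quoted above can be sketched in shell. This is a simplified illustration, not Wget's actual code; the sizes and timestamps below are made up, and `stat -c` is GNU coreutils:

```shell
#!/bin/sh
# Simplified illustration of Wget's -N decision, not Wget's real code.
# Create a sample "local file" whose size differs from the bogus header.
local_file=$(mktemp)
printf 'old content' > "$local_file"       # 11 bytes locally

remote_len=0                               # Blogger's bogus Content-Length
remote_mtime=$(( $(date +%s) - 3600 ))     # pretend remote is an hour older

local_len=$(( $(wc -c < "$local_file") ))
local_mtime=$(stat -c %Y "$local_file")    # GNU stat

if [ ! -e "$local_file" ]; then
    echo "download: no local file"
elif [ "$local_len" -ne "$remote_len" ]; then
    echo "download: sizes differ"          # this branch fires for Blogger
elif [ "$remote_mtime" -gt "$local_mtime" ]; then
    echo "download: remote is newer"
else
    echo "skip: up to date"
fi
rm -f "$local_file"
```

With a bogus Content-Length of 0, the size comparison always fails, so the timestamps are never even consulted.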
To avoid this, you need the --ignore-length option:
% wget http://oopsbroken.blogspot.com --server-response --timestamping --no-verbose --ignore-length
HTTP/1.0 200 OK
X-Robots-Tag: noindex, nofollow
Content-Type: text/html; charset=UTF-8
Expires: Fri, 23 Mar 2012 12:42:06 GMT
Date: Fri, 23 Mar 2012 12:42:06 GMT
Cache-Control: private, max-age=0
Last-Modified: Fri, 23 Mar 2012 10:49:23 GMT
ETag: "f5024c0a-c96f-464f-b96b-d89efdd69010"
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
Content-Length: 0
Server: GSE
Now Wget ignores the Content-Length header and, since the timestamps match, does not re-download the file.
2 Issues
There are several issues or difficulties when using Wget instead of cURL.
2.1 Incompatible with -O
As you can see, I did not use -O to specify the output file, because it is incompatible with -N (--timestamping): using -O disables -N. You have to deal with the downloaded filename yourself. Basically it is the basename of the URL, or index.html, or some bizarre name if a query string is present.
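For a rough guess at the name Wget will pick, something like the following can work. It is an approximation, not Wget's exact naming algorithm (Wget also escapes some characters), and the feed URL is just an illustrative example:

```shell
# Rough guess at the filename Wget saves without -O; an approximation,
# not Wget's exact naming algorithm.
guess_name() {
    path=${1#*://}                  # drop the scheme
    case $path in
        */*) name=${path##*/} ;;    # basename of the path (query included)
        *)   name='' ;;             # bare host with no path
    esac
    [ -n "$name" ] || name=index.html
    printf '%s\n' "$name"
}

guess_name 'http://example.com'                      # index.html
guess_name 'http://www.iana.org/domains/example/'    # index.html
guess_name 'http://oopsbroken.blogspot.com/feeds/posts/default?alt=rss'
                                                     # default?alt=rss
```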
2.2 When to process
You cannot rely on the exit status; you need to check whether the file was updated, either by watching the file's timestamp change or by parsing Wget's output to see whether a file was saved. Neither is a particularly smart way to deal with it.
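One workable, if clumsy, approach is to wrap the command and compare the file's mtime before and after. A sketch (the helper name is mine, `stat -c` is GNU coreutils, and the Wget invocation is commented out because it needs the network):

```shell
# Run a command, then report whether it modified FILE, by comparing
# mtimes -- Wget's exit status alone cannot tell you this.
file_updated_by() {    # usage: file_updated_by FILE CMD [ARGS...]
    file=$1; shift
    before=$(stat -c %Y "$file" 2>/dev/null || echo 0)   # GNU stat
    "$@"
    after=$(stat -c %Y "$file" 2>/dev/null || echo 0)
    [ "$after" -gt "$before" ]
}

# With Wget (network access required):
# file_updated_by index.html \
#     wget -N -q --ignore-length http://oopsbroken.blogspot.com/ \
#     && echo "file changed, process it"
```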
You might think that if you save the timestamp yourself, you could use it for the check and would not need to keep a local file. Unfortunately, that is not true: you still need the local file, because Wget reads the timestamp from it. I cannot find any way to supply a timestamp directly.
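What you can do is restore a separately stored timestamp onto the kept local file with touch before invoking Wget; the file itself is still required, with the right size, since Wget also compares sizes. A hypothetical workaround (`touch -d` as used here is GNU):

```shell
# Hypothetical workaround: restore a saved timestamp onto the kept local
# file so Wget's -N comparison works across runs; the file itself is
# still required because Wget also compares sizes.
saved='2011-02-09 17:13:15 UTC'        # e.g. stored from a previous run
touch -d "$saved" index.html
```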
3 Conclusion
I recommend using cURL instead of Wget. You could manually add request headers to work around these issues, but using cURL is much easier, so why bother?
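For comparison, the cURL approach can be wrapped like this. A sketch: the helper name fetch is mine, and downloading to a temporary file avoids clobbering the local copy when nothing changed:

```shell
# cURL sends If-Modified-Since from the local file's mtime (-z) and
# reports the status code (-w), so 200 vs 304 is easy to act on.
# -R (--remote-time) keeps the server's Last-Modified on the saved file.
fetch() {    # usage: fetch URL FILE  -> prints "updated" or "unchanged"
    url=$1 file=$2
    code=$(curl -s -R -o "$file.tmp" -z "$file" -w '%{http_code}' "$url")
    case $code in
        200) mv "$file.tmp" "$file"; echo updated ;;
        *)   rm -f "$file.tmp"; echo unchanged ;;
    esac
}

# fetch http://www.iana.org/domains/example/ index.html
```

Unlike the Wget workflow, the status code tells you directly whether there is new content to process.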
If you have a good way to deal with these issues (other than doing the whole process manually in your script), feel free to comment with code. There may be some useful options I missed when reading the manpage and info page.