When writing a script that processes files from the Internet, it’s a good idea to take advantage of the If-Modified-Since header because:

  1. You don’t waste bandwidth re-downloading content that may be exactly the same, and
  2. You don’t need to re-process the file if you know the content is unchanged.

1   The flow

cURL has an option for it, -z <date expression>, where the date expression can be either a date in one of the formats listed in the curl_getdate man page or the name of an existing file, whose modification time is then used. Using it together with -o FILE is easy:


~ $ curl http://example.com -z index.html -o index.html --verbose --silent --location
Warning: Illegal date format for -z/--timecond (and not a file name).
Warning: Disabling time condition. See curl_getdate(3) for valid date syntax.
* About to connect() to example.com port 80 (#0)
* Trying 192.0.43.10... connected
* Connected to example.com (192.0.43.10) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.21.4 (x86_64-pc-linux-gnu) libcurl/7.21.4 NSS/3.13.1.0 zlib/1.2.5 libssh2/1.3.0
> Host: example.com
> Accept: */*
>
* HTTP 1.0, assume close after body
< HTTP/1.0 302 Found
< Location: http://www.iana.org/domains/example/
< Server: BigIP
* HTTP/1.0 connection set to keep alive!
< Connection: Keep-Alive
< Content-Length: 0
<
* Connection #0 to host example.com left intact
* Issue another request to this URL: 'http://www.iana.org/domains/example/'
* About to connect() to www.iana.org port 80 (#1)
* Trying 192.0.32.8... connected
* Connected to www.iana.org (192.0.32.8) port 80 (#1)
> GET /domains/example/ HTTP/1.0
> User-Agent: curl/7.21.4 (x86_64-pc-linux-gnu) libcurl/7.21.4 NSS/3.13.1.0 zlib/1.2.5 libssh2/1.3.0
> Host: www.iana.org
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Fri, 23 Mar 2012 11:31:14 GMT
< Server: Apache/2.2.3 (CentOS)
< Last-Modified: Wed, 09 Feb 2011 17:13:15 GMT
< Vary: Accept-Encoding
< Connection: close
< Content-Type: text/html; charset=UTF-8
<
{ [data not shown]
* Closing connection #1
* Closing connection #0

There is a warning about the date syntax. It is safe to ignore, since the file index.html has not been created yet and cURL has no modification time to read from it. On the second run:


~ $ curl http://example.com -z index.html -o index.html --verbose --silent --location
* About to connect() to example.com port 80 (#0)
* Trying 192.0.43.10... connected
* Connected to example.com (192.0.43.10) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.21.4 (x86_64-pc-linux-gnu) libcurl/7.21.4 NSS/3.13.1.0 zlib/1.2.5 libssh2/1.3.0
> Host: example.com
> Accept: */*
> If-Modified-Since: Fri, 23 Mar 2012 11:33:24 GMT
>
* HTTP 1.0, assume close after body
< HTTP/1.0 302 Found
< Location: http://www.iana.org/domains/example/
< Server: BigIP
* HTTP/1.0 connection set to keep alive!
< Connection: Keep-Alive
< Content-Length: 0
<
* Connection #0 to host example.com left intact
* Issue another request to this URL: 'http://www.iana.org/domains/example/'
* About to connect() to www.iana.org port 80 (#1)
* Trying 192.0.32.8... connected
* Connected to www.iana.org (192.0.32.8) port 80 (#1)
> GET /domains/example/ HTTP/1.0
> User-Agent: curl/7.21.4 (x86_64-pc-linux-gnu) libcurl/7.21.4 NSS/3.13.1.0 zlib/1.2.5 libssh2/1.3.0
> Host: www.iana.org
> Accept: */*
> If-Modified-Since: Fri, 23 Mar 2012 11:33:24 GMT
>
< HTTP/1.1 304 NOT MODIFIED
< Date: Fri, 23 Mar 2012 11:33:54 GMT
< Server: Apache/2.2.3 (CentOS)
< Connection: close
<
* Closing connection #1
* Closing connection #0

As you can see, the server returned 304, no content was transferred, and index.html was left untouched.

2   When to process

The bandwidth is saved, as demonstrated in the previous section. The next question is how our script knows when to process the file. The key is to get the HTTP response code from the server. cURL provides a list of variables for formatting its output with --write-out, or simply -w. Here is the command we need:


~ $ curl http://example.com -z index.html -o index.html --silent --location --write-out %{http_code}
304

To utilize this, here is a complete example:


if [[ "$(curl http://example.com -z index.html -o index.html -s -L -w '%{http_code}')" == "200" ]]; then
  # 200 means index.html was updated, so process it here
  : # replace this no-op with your processing commands
fi

I also shortened the command by using single-letter option names. A response code of 200 means the file has been updated and your script needs to process it. The example code does not deal with errors; you may want to save the response code to a variable, check that variable with case, and respond accordingly.
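The error handling suggested above could look something like the following sketch. The function name classify_http_code and the action words it prints are made up for illustration; only the curl options come from the examples above. It treats 000, which is what %{http_code} reports when no HTTP response was received at all, as a network error.

```shell
# Hypothetical helper: map an HTTP response code to an action word.
classify_http_code () {
  case "$1" in
    200) echo process ;;         # content was (re)downloaded
    304) echo skip ;;            # not modified, nothing to do
    000) echo network-error ;;   # no HTTP response was received
    4??|5??) echo http-error ;;  # client or server error
    *) echo unknown ;;
  esac
}

# Possible use (commented out so the sketch makes no network request):
# code=$(curl http://example.com -z index.html -o index.html -s -L -w '%{http_code}')
# case "$(classify_http_code "$code")" in
#   process) echo "index.html updated, processing" ;;
#   skip)    echo "index.html unchanged" ;;
#   *)       echo "fetch failed with code $code" >&2 ;;
# esac
```

Keep in mind that without --fail, cURL still writes the body of an error response to the -o file, so on an http-error result index.html may contain an error page rather than the content you expect.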

3   Without a saved file

The file will not always be kept on disk. You may process it and then discard it. If so, you need to save the timestamp for later use. A possible workflow may look like:


do_process () {
  # process index.html here
  # ...
  # save its modification time (Unix time) for the next run
  stat -c %Y index.html > index.html.timestamp
  rm index.html
}

if [[ -f index.html.timestamp ]]; then
  # not first run: send the saved time as If-Modified-Since
  [[ "$(curl http://example.com -z "$(date --rfc-2822 -d "@$(<index.html.timestamp)")" -o index.html -s -L -w '%{http_code}')" == "200" ]] && do_process
else
  # first run: download unconditionally, process only if cURL succeeded
  curl http://example.com -o index.html -s -L && do_process
fi

First, check whether the timestamp file exists. If it does not, run cURL directly and call the process function. The process function saves the timestamp of index.html after the file has been processed successfully; that timestamp will be used for the next call of cURL.

When the timestamp file exists, we use date to convert it to RFC 2822 format, which cURL can understand. The $(<...) is equivalent to $(cat ...), and date accepts a date in Unix time when it is prefixed with @.
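To make the conversion concrete, here is a small sketch. The helper name epoch_to_http_date is hypothetical, and it assumes GNU date, as the script above already does. Instead of --rfc-2822 it uses an explicit format string, so the result matches the GMT form seen in the If-Modified-Since header of the earlier trace; the sample value 1332502404 is that same moment expressed in Unix time.

```shell
# Hypothetical helper: print a Unix time as an HTTP-style date in GMT.
# Assumes GNU date for the -d @SECONDS syntax.
epoch_to_http_date () {
  # LC_ALL=C keeps the day and month names in English
  LC_ALL=C date -u -d "@$1" '+%a, %d %b %Y %H:%M:%S GMT'
}

epoch_to_http_date 1332502404
# prints: Fri, 23 Mar 2012 11:33:24 GMT
```

The output can be passed to -z in place of the $(date --rfc-2822 ...) expression.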

4   Conclusion

You can do it without really having a file saved, even temporarily, by piping to standard output. But you will need to parse the response header for the timestamp, and since the header is mixed with the content, you need to separate the two as well. Saving to a temporary file is much easier.
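To give an idea of the parsing involved, here is a rough sketch of the header half only. The function name last_modified_from_headers is made up for illustration, and it assumes it is fed nothing but response headers on standard input; separating headers from the content is left out. When redirects are followed there are several header blocks, so it keeps the last Last-Modified value it sees.

```shell
# Hypothetical helper: read HTTP response headers on stdin and print the
# value of the last Last-Modified header found (nothing if there is none).
last_modified_from_headers () {
  awk 'tolower($0) ~ /^last-modified:/ {
         sub(/^[^:]*:[ \t]*/, "")   # drop the header name
         sub(/\r$/, "")             # drop the trailing carriage return
         value = $0
       }
       END { if (value != "") print value }'
}

# Possible use (commented out so the sketch makes no network request);
# -I asks for the headers only:
# curl -s -L -I http://example.com | last_modified_from_headers
```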

Another way to check is to compare the timestamps instead of the response code; that is really up to your liking.
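A sketch of the timestamp comparison, assuming GNU stat as used earlier. file_changed_by is a hypothetical helper name; it runs whatever command you give it and reports whether the file's modification time advanced. It relies on cURL leaving index.html untouched when the server answers 304, and it only sees changes at one-second resolution.

```shell
# Hypothetical helper: file_changed_by FILE COMMAND [ARGS...]
# Returns 0 if COMMAND advanced FILE's modification time,
#         1 if the modification time did not advance,
#         2 if COMMAND itself failed.
file_changed_by () {
  fcb_file=$1
  shift
  fcb_before=0
  fcb_after=0
  if [ -e "$fcb_file" ]; then fcb_before=$(stat -c %Y "$fcb_file"); fi
  "$@" || return 2
  if [ -e "$fcb_file" ]; then fcb_after=$(stat -c %Y "$fcb_file"); fi
  [ "$fcb_after" -gt "$fcb_before" ]
}

# Possible use (commented out so the sketch makes no network request):
# if file_changed_by index.html curl http://example.com -z index.html -o index.html -s -L; then
#   echo "index.html updated, processing"
# fi
```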

Using timestamps to avoid wasting bandwidth and processing time is not very hard with cURL. It will be nice for your script to take care of that once it gets mature enough.

I am also writing a similar post for Wget, which may be a little tricky.