YJL: Google Docs Terms and Robots

A Twitter user asked:

Anyone have a way to validate the Links / URLs in a Google Docs (docs.google.com) document? (not http://validator.w3.org/che...)

He had also bookmarked the W3C Link Checker and commented earlier:

For documents at docs.google.com, getting: Error: 403 Forbidden by robots.txt

The first question that popped into my head was: why does the W3C Link Checker comply with robots.txt? Is it a robot? The answer is yes, and even if it weren't, it still would not be allowed to access the documents, because it is a script and Google's Terms forbid that.

robotstxt.org defines a robot as follows:

A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.

The W3C Link Checker identifies all the links on the page you specify and checks them one by one, which is an act of traversal. I don't think it meets the latter condition, recursive retrieval, though. I would still say: yes, it's a robot, and therefore it complies with the robots.txt on Google Docs, whose content is:

User-agent: *
Allow: /$
Allow: /support/
Allow: /a/
Disallow: /
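To see how a well-behaved robot would read these rules, here is a minimal sketch using Python's standard urllib.robotparser. The document URL and the user-agent name are made up for illustration. Note that the standard-library parser does plain prefix matching and does not implement the `$` end-of-URL extension, so its answer for the bare root URL may differ from Google's own interpretation; the sketch only checks paths where that doesn't matter.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content quoted above, fed in as lines (no network access).
ROBOTS_TXT = """\
User-agent: *
Allow: /$
Allow: /support/
Allow: /a/
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical URLs, used only to exercise the rules.
doc_url = "https://docs.google.com/View?docid=example"
support_url = "https://docs.google.com/support/"

# "ExampleLinkChecker" is an assumed user-agent name; it falls under "User-agent: *".
doc_allowed = parser.can_fetch("ExampleLinkChecker", doc_url)
support_allowed = parser.can_fetch("ExampleLinkChecker", support_url)

print("document:", doc_allowed)
print("support:", support_allowed)
```

A published document falls through to the final Disallow rule, while anything under /support/ is matched by an Allow rule first.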

The last line denies all robots access to published documents on Google Docs. robots.txt is not only a standard that Google honors, it is also part of their Terms of Service. As a developer or user of an app, you can't just say "I don't follow the standard." Section 5.3 reads:

5. Use of the Services by you

[snip]

5.3 You agree not to access (or attempt to access) any of the Services by any means other than through the interface that is provided by Google, unless you have been specifically allowed to do so in a separate agreement with Google. You specifically agree not to access (or attempt to access) any of the Services through any automated means (including use of scripts or web crawlers) and shall ensure that you comply with the instructions set out in any robots.txt file present on the Services.

Robots (including scripts) must not access the Services. A possible future way to get automated access is the Google Docs API (its full name is the Google Documents List Data API). Right now, I see no interface in it that lets you retrieve a Google Docs document. Manually downloading the document and checking it is the only way that doesn't violate something.
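If you do go the manual route, the first half of a link check can be done offline. The following is a rough sketch, not a full link checker: it assumes you have saved the document as HTML yourself, and it only pulls the href values out of the markup with Python's standard html.parser. The class and function names, and the sample markup, are my own invention for illustration.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return all link targets found in an HTML string, in document order."""
    parser = LinkExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Stand-in for the content of a manually downloaded document.
sample = (
    '<p><a href="http://example.com/">one</a> '
    '<a href="http://example.org/page">two</a></p>'
)
print(extract_links(sample))
```

The extracted list can then be checked against the linked sites, each of which has its own robots.txt and terms to respect.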
