
This question made me think of a situation:

  • Alice asks Bob to crawl web site realestate.example.com and return results of the regular expression "Price:([0-9]*).*Size:([0-9]*)"
  • Bob points a major open source web crawler, one that uses Google's robots.txt parser (and is therefore fully consistent with industry standard practice), at realestate.example.com, greps the result, and sends the data to Alice. He never visits the site manually; the whole job took about 2 minutes of human time, and he had no need to.
  • realestate.example.com has a robots.txt that says "Take everything" and a sitemap.xml describing the pages of interest because they care about SEO. It has a human readable TOS that says "Scraping is not allowed".
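
To make the "grep" step concrete, here is a minimal sketch in Python of applying Alice's regular expression to one page of text. The function name `extract_listing` and the sample text are hypothetical, invented only for illustration; the pattern itself is the one quoted above.

```python
import re

# The pattern from Alice's request, unchanged.
LISTING_RE = re.compile(r"Price:([0-9]*).*Size:([0-9]*)")

def extract_listing(page_text):
    """Return (price, size) as strings, or None if the pattern does not match."""
    match = LISTING_RE.search(page_text)
    if match is None:
        return None
    return match.group(1), match.group(2)

# Hypothetical page text, not taken from any real site.
sample = "Price:250000 Location: Springfield Size:120"
print(extract_listing(sample))
```

Because `.` does not match a newline by default, this only finds a price and a size that appear on the same line, which is roughly how a line-oriented grep would behave.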

Has Bob done anything wrong?

Trish
Dave

1 Answer


Probably not

There have not been many cases in this area of the law, and those have mostly dealt with "deep linking", particularly cases where a person knowingly linked to a page in a way that bypassed a login or introductory page, when the site was designed so that a visitor could ordinarily reach other pages only by going through that login or intro page. Where this deprives the site owner of income, or harms the site's reputation by bypassing disclaimers, it has been held to be actionable. See Nolo's page on Linking, Framing, and Inlining, and the Wikipedia article on Deep linking.

In Intellectual Reserve, Inc. v. Utah Lighthouse Ministry, Inc., 75 F. Supp. 2d 1290 (D. Utah 1999), deep linking was held to be contributory copyright infringement (see the Wikipedia article on the case). In that case, the content being linked to had been posted without the authorization of the copyright holder, and no fair use issue was raised by the defense.

In general, courts have found that publishing a page on the web invites others to visit it and link to it. In the Wikipedia article on "Deep linking" (linked above) it is said that:

In a February 2006 ruling, the Danish Maritime and Commercial Court (Copenhagen) found systematic crawling, indexing and deep linking by portal site ofir.dk of real estate site Home.dk not to conflict with Danish law or the database directive of the European Union. The Court stated that search engines are desirable for the functioning of the Internet, and that, when publishing information on the Internet, one must assume—and accept—that search engines deep-link to individual pages of one's website.

In Perfect 10, Inc. v. Amazon.com, Inc., 508 F.3d 1146 (9th Cir. 2007), a US court held that links to copyrighted images as part of an image search were not copyright infringement. The Ninth Circuit Court of Appeals held that Google's display and caching of thumbnails was fair use, mainly because they were "highly transformative."

In Craigslist vs 3Taps (see Jaxenter article), Craigslist objected to repeated scraping by PadMapper, sent a cease and desist order to PadMapper, and blocked its IP addresses. PadMapper used the services of 3Taps to bypass this block with a proxy. Craigslist sued and won. The court held that under the US Computer Fraud and Abuse Act (CFAA), the cease and desist order and the IP block were sufficient notice of denial of access, and that further access was unauthorized and a violation of the act. The individual notice was considered essential to this holding.

The case of Ryanair vs PR Aviation was brought in the European Court of Justice. There, Ryanair had argued that continued scraping was a violation of its TOS and a copyright infringement. The court held that the owners of publicly available databases were entitled to impose access restrictions. It further held that the applicability of a TOS was a matter for national courts to determine.

See also this article on "Essential Legal Issues Associated With Web Scraping". It emphasizes that much scraping is legal, except when copyright is infringed or when specific access restrictions under the US CFAA (or similar laws) are violated.

Individual facts, such as home prices and sizes, are not subject to copyright protection, although the selection and organization of such facts may be, and a database consisting of such facts may be protected. Pages posted on the web are publicly accessible unless specific steps are taken to make them private, such as password protection, requiring a login, or individual notice not to access. The robots.txt file, while not technically enforced, is a widely accepted standard, and a visitor is probably entitled to assume that access in accord with the site's robots.txt file is authorized, in the absence of specific notice from the site owner to the contrary. Repeated access that negatively impacts the bandwidth or performance of the site might be a different matter.
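
As an illustration of what "access in accord with the robots file" means mechanically, here is a minimal sketch using Python's standard-library `urllib.robotparser`. This is not the Google parser mentioned in the question, and the robots.txt content, the user agent name `example-crawler`, and the URLs are all assumptions made up for the example. It shows only what the file permits technically; it says nothing about what a TOS or a court would allow.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that allows everything, as in the question's scenario.
ROBOTS_TXT = """\
User-agent: *
Allow: /
Sitemap: https://realestate.example.com/sitemap.xml
"""

def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# "example-crawler" and the listing URL are illustrative names only.
print(is_allowed(ROBOTS_TXT, "example-crawler",
                 "https://realestate.example.com/listings/1"))
```

A well-behaved crawler runs a check like this before each request and skips any URL for which it returns False.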

David Siegel