(I'm reposting this from Workplace S.E., where I was advised to ask here; apologies in advance for any inconvenience.)
tl;dr: In the USA, is copying the HTML code of a site (any site whose code is presumably copyrighted) and storing it, for a limited or unlimited amount of time, a violation of copyright? Are there prior lawsuits related to this? I'm mostly interested in the particular case where the copy is not republished but kept private.
I've recently learned that this is indeed the case. This came as a huge surprise to me since:
1. Most browsers retain a copy of the HTML (for the duration of the visit, or much longer if caching is enabled).
2. Proxy servers often keep cached copies of these files.
3. Web archives (like Google's) not only copy all the assets of a site they find in order to keep historical versions of its pages, but also make these historical copies available to the general public.
4. Programs that scrape external sites often keep copies of (likely copyrighted) HTML in their repositories for testing purposes.
Number (4) is the one that directly affects the company I work for, since we do web analysis and therefore write programs that visit other sites. For example, we make extensive use of the vcrpy library to record external accesses and test our code against these "frozen" HTML responses.
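To make concrete what kind of copy I'm talking about, here is a minimal sketch of the record-and-replay idea, written with the standard library only. It is not vcrpy's API or how vcrpy works internally; `replay_or_record`, `fake_fetch`, and the one-file-per-URL layout are hypothetical names and choices made up for this illustration. The point is that the first access stores the site's HTML on disk, and every later test run reads that stored copy instead of visiting the site.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def replay_or_record(url: str, store: Path, fetch: Callable[[str], str]) -> str:
    """Return the stored HTML for `url`, fetching and saving it on first use."""
    store.mkdir(parents=True, exist_ok=True)
    # One file per URL, named by a hash so any URL maps to a safe filename.
    path = store / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html")
    if path.exists():
        # Replay: the "frozen" copy is used, no external access happens.
        return path.read_text(encoding="utf-8")
    # Record: a single real access, whose result is kept on disk.
    html = fetch(url)
    path.write_text(html, encoding="utf-8")
    return html


# Hypothetical stand-in for a real HTTP client, so the sketch runs offline.
calls = []


def fake_fetch(url: str) -> str:
    calls.append(url)
    return "<html><title>Example</title></html>"


with tempfile.TemporaryDirectory() as tmp:
    first = replay_or_record("https://example.com/", Path(tmp), fake_fetch)
    second = replay_or_record("https://example.com/", Path(tmp), fake_fetch)
```

In this sketch the second call never reaches `fake_fetch`; it is served entirely from the stored file, which is exactly the privately kept copy my question is about.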
Also, specifically in our case, we don't copy the entirety of any site, since we are only concerned with a subset of its pages. From what I've been told, though, that doesn't seem to qualify as "fair use" in the way that quoting a passage of a book would (where, in a sense, the book would be analogous to the entire site with all its public assets). We don't even copy assets like CSS files or images, so we can't reproduce the actual content in full.
Since I was told that such copies are likely unlawful, we have not only been held back from exploring more robust testing methodologies (which would likely involve storing a large amount of HTML copied from the web locally), but our current use of the vcrpy library has also come under review, as it's not clear whether that use is unlawful.