Pulitzer Prize-winning US novelist Michael Chabon and several other writers have filed a proposed class action accusing OpenAI of copyright infringement for allegedly pulling their work into the datasets used to train the models behind ChatGPT.
The suit claims that OpenAI “cast a wide net across the internet” to capture the most comprehensive set of content available to better train its GPT models, allegedly “necessarily” leading it “to capture, download, and copy copyrighted written works, plays and articles.”
One of the more interesting parts of the lawsuit is an allegation about how the authors believe the AI business got its hands on “two internet-based book corpora,” which the filing notes OpenAI refers to simply as “Books1” and “Books2.” The filing alleges that in the July 2020 paper introducing GPT-3, “Language Models are Few-Shot Learners,” OpenAI disclosed that in addition to “Common Crawl” and “WebText” web page datasets, “16 percent of the GPT3 training dataset came from… ‘Books1’ and ‘Books2’.”
The writers’ lawsuit goes on to allege that there are only a few places on the public internet that contain this much material, claiming that OpenAI’s Books1 dataset “is based on either the Standardized Project Gutenberg Corpus or Project Gutenberg itself” and accusing the AI biz of sourcing Books2 from:
infamous “shadow library” websites, like Library Genesis (“LibGen”), Z-Library, Sci-Hub, and Bibliotik, which host massive collections of pirated books, research papers, and other text-based materials. The materials aggregated by these websites have also been available in bulk through torrent systems.
Also included in the suit are Tony and Grammy award winner David Henry Hwang, the playwright and screenwriter behind M. Butterfly, Chinglish, Yellow Face, and The Dance and the Railroad; Peabody winner and Love and Other Impossible Pursuits author Ayelet Waldman; Women We Buried, Women We Burned author Rachel Louise Snyder; and Who is Rich? scribe Matthew Klam.
The writers say they believe “the underlying GPT model was trained using [the] plaintiffs’ works” because “when ChatGPT is prompted, it generates not only summaries, but in-depth analyses of the themes present in Plaintiffs’ copyrighted works.”
The writers’ lawyers also claim that when asked to write a paragraph in the style of The Amazing Adventures of Kavalier & Clay, the book that bagged US novelist Chabon his Pulitzer, ChatGPT generated a passage imitating his writing style and including references to the characters dealing with “the weight of the world at war.”
The suit [PDF] was filed in California federal court late last week and was yesterday assigned to San Francisco Magistrate Judge Peter H. Kang.
OpenAI is facing multiple lawsuits around copyright – including two in San Francisco filed by novelists Paul Tremblay and Mona Awad, and, separately, comedian Sarah Silverman and novelists Christopher Golden and Richard Kadrey. Its lawyers argued in those cases that the AI biz has not violated copyright laws, claiming ChatGPT’s LLMs are protected under the US doctrine of “fair use.” Their argument is that the way the business uses the text conforms to US copyright law, which allows a fair use exception for so-called “transformative uses” of work – a remix of the original that serves a different purpose or audience.
The US Copyright Office is currently seeking comment on a study of the copyright law and policy issues raised by artificial intelligence systems.
OpenAI’s lawyers have not yet filed a response to the Chabon complaint. We have asked OpenAI for comment.
The allegations in the case include direct and vicarious copyright infringement, illegal removal of copyright management information, unfair competition, and unjust enrichment. The plaintiffs are seeking an injunction against the infringement of their copyrights as well as unspecified damages.
OpenAI boss Sam Altman last week scored Indonesia’s first ever golden visa – meaning he can now live in the archipelagic nation for up to 10 years – in recognition of his potential to “generate inbound investment.” ®