Authors allege Meta used copyrighted books for AI training despite warnings
Sarah Silverman, Michael Chabon, Ta-Nehisi Coates and other authors accused Meta on Monday of using their copyrighted books to train its artificial intelligence (AI) models despite warnings from the company’s legal team.
The complaint, which consolidates two copyright cases against the parent company of Facebook and Instagram, alleges Meta trained its large language models, Llama 1 and Llama 2, on an online resource that contained the authors’ works without their permission.
Large language models, which can produce human-like responses, require substantial amounts of data for training. In releasing Llama 1, Meta acknowledged it used the Books3 section of The Pile, a publicly available dataset that contains nearly 200,000 books, to train the model.
The authors, whose copyrighted works were included in Books3, allege Meta was aware of potential legal problems with using the dataset, pointing to a series of messages between a Meta AI researcher and researchers affiliated with EleutherAI, the organization that assembled The Pile.
In late 2020, Meta researcher Tim Dettmers expressed interest in using The Pile in a conversation on the EleutherAI public Discord server and asked about “any legal concerns” with using the dataset.
While one researcher with EleutherAI suggested there was “a very strong case” for fair use, Dettmers later said Meta’s lawyers “recommended to avoid” using Books3, adding “it seems to be already clear the data cannot be used or models cannot be published if they are trained on that data.”
“At Facebook there are a lot of people interested in working with [T]he [P]ile, including myself, but in its current form, we are unable to use it for legal reasons,” Dettmers added in early 2021, according to Monday’s filing.
However, Meta ultimately used Books3 in its training dataset for Llama 1. The authors also accused the tech giant of using Books3 to train Llama 2, though the company opted not to reveal the training datasets for that later model “for competitive reasons.”
“This explanation, however, is likely pretextual,” the lawsuit says. “A more plausible explanation for Meta’s decision to conceal its training data is to avoid scrutiny by those whose copyrighted works were copied and ingested during the training process for Llama 2.”
“On information and belief, a key reason Meta chose not to share the training dataset for Llama 2 was to avoid litigation from using copyrighted materials for training that Meta had previously determined to be legally problematic,” it continues.