8 January 2024 | Features | Copyright Channel | Sarah Speight

Why The New York Times has more to lose than OpenAI

In what could be the biggest lawsuit to date levelled by a mass media organisation at AI platforms, The New York Times (NYT) is taking on OpenAI for copyright infringement.

And at stake is who gets to write the next chapter in generative AI’s future. This is a story of paywalled, original content pitted against free, AI-generated content, with “billions of dollars” of damages on the table. If it ever gets that far, of course.

In a scathing complaint filed on December 27, NYT claims that both OpenAI and Microsoft, which own ChatGPT and Copilot respectively, copied “millions” of its articles to train their chatbots.

NYT further alleges that the pair used this work to develop and commercialise their generative AI (GenAI) products without obtaining its permission.

“Defendants seek to free-ride on The Times’s massive investment in its journalism by using it to build substitutive products without permission or payment,” it goes on to say, adding that such infringement threatens the service it provides.

For the avoidance of doubt, Microsoft is implicated due to its partnership with OpenAI to develop GPT large language models, or LLMs, as well as creating bespoke computing systems for its partner. The product at issue in this lawsuit is ChatGPT.

In a statement to WIPR from NYT’s “outside counsel”, the news outlet asserts that the defendants used its copyrighted works “to make GenAI products that directly compete with The Times”.

“This is the opposite of ‘fair use’ and violates basic principles of copyright law.”

Fair use a ‘hard sell’

Matthew Kohel, a partner at Saul Ewing, indicates that while OpenAI will likely rely on a fair use defence, he believes this will not be easily won.

“Based on The Times’ allegations, OpenAI has an uphill battle to meet the criteria of this defence.

“In particular, OpenAI’s use of The Times’ content to train LLMs may very well be found commercial in nature,” he tells WIPR.

Willie Stroever, chair of the Intellectual Property Department at Cole Schotz, agrees, pointing out that the approach taken by GenAI companies to past lawsuits has been to claim fair use of copyrighted materials to create something new.

“In this case, though, The Times [makes] some solid claims that OpenAI saves the original copyrighted works and [that] they can be reproduced with the right search parameters,” he adds.

But if, as HansonBridgett partner Andy Stroud suggests, ChatGPT could be taught the parameters of fair use standards, then the “problem can be solved”.

Not so fast. Stroud believes that this case goes further than fair use.

“The fair use analysis in this lawsuit has nothing to do with transformative use,” he suggests. “The defendants are using The Times’ information for the exact same purpose as [the plaintiff]—to inform.

“The issue here is not whether the defendants’ use of the copyrighted work is transformative—the Andy Warhol case completely undermines that trope. The issue is whether they are simply copying and quoting too much of it.”

He recommends that all the litigants in this case should be required to read Justice Story’s opinion in Folsom v Marsh, which established the foundation for fair use in 1841.

“[Justice Story’s] opinion in that case is determinative of the issues in this one.”

Independent journalism under threat?

Clearly, NYT, which currently has more than 10 million subscribers, is building on its 170-plus years’ existence and mighty reputation.

That said, it raises a noteworthy point about how the rise of generative AI tools will affect online, journalistic content created by humans.

An NYT spokesperson told WIPR that the paper “recognises the power and potential of GenAI for the public and for journalism”.

However, they added: “These tools were built with and continue to use independent journalism and content that is only available because we and our peers reported, edited, and fact-checked it at high cost and with considerable expertise.

“Settled copyright law protects our journalism and content. If Microsoft and OpenAI want to use our work for commercial purposes, the law requires that they first obtain our permission. They have not done so.”

A ‘compelling narrative’

Some have commented on how NYT presents its complaint. Silicon Valley lawyer Cecilia Ziniti noted on X that it portrays OpenAI as “profit-driven and closed”, contrasting this “with the public good of journalism”.

“This narrative could prove powerful in court, weighing the societal value of copyright against tech innovation,” she said, adding that allegations of hallucinations (GenAI fabrications of NYT content) add weight to its position.

Kohel agrees. “Irrespective of whether this ends up at trial, The Times tells a compelling story about the value of journalism that it provides to the public, in contrast to OpenAI, which the allegations paint as the big bad profit-driven tech company.

“The NYT may have framed the narrative in this fashion to reach a settlement sooner than later.”

Money talks

Stroud is sceptical. “It seems to me that The Times is complaining about something it should be celebrating,” he tells WIPR.

“One of the goals of The Times, as explained ad nauseam in the lawsuit, is to provide fair, high-quality journalism so that readers have a better understanding of the news and world events.”

This information, he points out, is something that AI is capable of digesting and reproducing on a massive scale.

“Thus, more than ever before, the accurate, factual information presented by The Times can be disseminated to the masses otherwise living in darkness.”

The veteran news provider’s grievance, though, has less to do with AI using its news to inform users.

“The Times is upset because they are not getting paid by AI for the use of their information,” he says. “This lawsuit is not about the news—it’s about money.”

Visual evidence

Nonetheless, the case presented by NYT is still deemed to be remarkably robust, at least according to Ziniti. She wrote that this “historic lawsuit” is “the best case yet alleging that generative AI is copyright infringement”.

This is, in part, due to meticulous visual evidence, argued Ziniti, with which NYT highlights the “substantial similarity” between its articles and ChatGPT’s outputs.

It’s fair to say that the evidence presented by NYT surpasses anything presented in the slew of suits brought by authors against OpenAI and Meta last year, although those made waves in their own right.

Strikingly, NYT places the allegedly stolen text adjacent to ChatGPT-produced text, highlighting how the two differ by only the odd word or paragraph.

There are also examples of prompts used within ChatGPT, which enable the user to circumvent NYT’s paywall and recreate its content.

Indeed, it backs up its arguments with in-depth detail of alleged data scraping by the two AI companies, which apparently favoured the domain www.nytimes.com because of its high-quality content.

For example, OpenAI’s WebText dataset was created as a “new web scrape that emphasises document quality”, and this dataset “contains a staggering amount of scraped content from The Times”, according to the complaint.

And the same domain is the largest proprietary source in Common Crawl, the most heavily weighted dataset used to train ChatGPT. The complaint cites more than 16 million unique records of content in Common Crawl from NYT’s various channels, and more than 66 million total records of content.

“The amount and substantiality to which The Times’ articles were used by OpenAI will need to be determined through discovery,” suggests Kohel.

“On the one hand, [it] has provided examples where ChatGPT output completely copied an article. And, on the other hand, what prompts were used to generate those results is likely to be a key factual question.”

Inputs vs outputs

Despite this evidence, Stroud picks apart the theory that NYT is in a strong position.

He believes it stands a good chance of winning with regard to the output of ChatGPT.

However, he is “much less certain as to the input, or training aspect”.

For example, even should NYT win, he believes the decision would not change how AI companies gather information to train their chatbots. “It is not a violation of copyright law to read a newspaper article, even for free (such as at a library),” he explains.

“Writers commonly read newspaper articles and then write about what they have read. That practice does not violate copyright law, unless they actually copy verbatim from [those] articles.”

But he concedes that NYT shows “several examples of how that has happened with ChatGPT, which is problematic for the defendants”.

Why the stakes are high for NYT

With previous lawsuits levelled against OpenAI et al prompting the argument for licensing of content, could this be a watershed moment for GenAI companies?

It’s a question of consumer demand, according to Stroever. “AI companies have shown that there is a product that customers want,” he tells WIPR.

But, if NYT wins, “there will certainly be guardrails put up around how AI systems operate,” he adds. “It is unlikely that the sanctions or restrictions would be significant enough to eliminate the generative-AI model, though.”

For Kohel, a win for NYT is a win for content creators by “maintaining control over their work and potentially creating new revenue streams, generated through the licensing of works to be used as training data”.

And if it loses? “AI implementation will certainly expand,” Stroud suggests. “Currently there are technological limits to the ‘creativity’ of AI, and there will still be a demand for human-created media.”

Added to this, a loss for NYT “could seriously impact licensing rates charged by these media outlets,” adds Stroud, “because it would mean copyrighted works could be reproduced legally without serious change.”

Kohel believes that if NYT fails in its claims, it will “certainly have a large impact on the news media industry, especially given their struggles with online revenue-generation and widespread availability of free content on the internet and social media”.

While Stroud believes there are ways other than licensing to solve the problem—such as retraining AI so that outputs summarise articles rather than quoting extensively from them verbatim—he concedes that licensing could certainly be a solution to this case.

“If the defendants agree to pay for the information they are using, then the case goes away. The Times even admits that the only reason they sued was because they could not reach an agreement with the defendants as to licensing,” he notes.

“To me, this case is copyright law basics,” he adds. “If you copy too much [content], then you have to pay.”

A new chapter for copyright law?

Undoubtedly, this lawsuit will garner its fair share of spectators, commentators and column inches.

But Stroever believes that “realistically, it will probably not see trial, given that the parties had previously been negotiating licensing terms and at least one side thought those discussions were progressing”.

A settlement is more likely, he adds, “especially since OpenAI has already been actively licensing at least some content it was using.

“However, even with a settlement this case should serve as a warning sign to other generative-AI companies to ensure that they are aware of the capabilities of their systems to reproduce copyrighted materials.”

If the case does go to trial, Kohel suggests that the outcome “could have a significant impact on the use of generative AI technologies and how providers of AI tools train their models”.

However this story ends, even Stroud concedes that “these are serious parties and this is a serious matter” and the case “could help shape copyright law in the age of AI”.


