Is NYT v OpenAI bad news for law firms using legal GenAI?

6 March 2024

The New York Times’ copyright infringement lawsuit against OpenAI should be followed closely by any organization that either uses GenAI-backed tech or collects data to train such tools. The implications of the case for the legal profession remain to be seen, but at the very least, using free-for-all GenAI tools to carry out legal research is looking more and more like a case of ‘you get what you pay for’.

The New York Times’ lawsuit against OpenAI

It’s been just over a year since ChatGPT transformed the AI landscape, and the New York Times has potentially imperiled the very basis of the technology by filing a lawsuit in December 2023 against both OpenAI and Microsoft for using millions of its articles to train chatbots.

The NYT wants all of its content held by ChatGPT to be destroyed – an outcome that could have huge implications for the platform if other data owners follow the NYT’s lead. ChatGPT is not just a technology; it has become something of a cultural moment.

Yet the NYT appears to have reasonable arguments. As intellectual property lawyer and researcher Andres Guadamuz explains, the newspaper has been able to show potentially infringing output in the form of verbatim paragraphs from 100 of its articles.

But as OpenAI explains in its response, the NYT achieved those verbatim outputs by “hacking” the system. OpenAI argues that the NYT made thousands of attempts to manipulate GPT, most of them unsuccessful, and that the successful attempts were those in which it prompted the model with large verbatim portions of the articles themselves – as many as eight paragraphs – resulting in “regurgitation” of each article’s remainder. On this view, the NYT’s actions were not “normal” use but deliberate circumvention of OpenAI’s safety guardrails.
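To make the mechanics concrete, here is a minimal sketch of that prompting pattern using the OpenAI Python SDK. The model name and prompt wording are illustrative assumptions, not the NYT’s actual prompts, and the excerpt is a placeholder rather than real article text.

```python
# Minimal sketch (not from the court filings) of the prompting pattern
# OpenAI describes: priming the model with a long verbatim excerpt and
# asking it to continue, which OpenAI characterizes as circumventing
# its guardrails to induce "regurgitation".
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical stand-in for the first eight paragraphs of an article.
excerpt = "(verbatim opening paragraphs of a published article)"

response = client.chat.completions.create(
    model="gpt-4",  # assumption: any chat-completion model
    messages=[{
        "role": "user",
        "content": f"Continue this article exactly as published:\n\n{excerpt}",
    }],
)
print(response.choices[0].message.content)
```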

Should Law Firms Be Worried?

What does this mean, if anything, for law firms that use legal data to train GenAI-backed legal tech, or that purchase tech already trained on legal data? Could clients and other data owners turn around and claim infringement? Legal AI tech has always had issues with its training data – including the extent to which copyrighted content can be used as part of the dataset. There have also been problems of availability: clients and law firms may not want to disclose specific data.

But, as Guadamuz points out, when it comes to copyright infringement, GenAI carries a risk that regular AI does not: some GenAI models have seen a particular work so many times during training that they can reproduce it substantially.

Still, law firms can take safeguards to ensure that, when training GenAI-backed legal tech, they don’t infringe copyright. One basic safeguard, according to Josh Blandi, CEO of UniCourt, is to vet vendors thoroughly – the onus is on firms to carry out their own due diligence on a vendor and its data sources. “UniCourt takes its responsibility to its clients to properly vet its sources of court information very seriously so that our clients can use our products and services with great trust,” he adds.

He also recommends that the training data be properly licensed: “If such licensing can be obtained ahead of time, it would prevent many of these claims of infringement being brought by authors, artists, and others against those creating and training Gen AI.”

Another safeguard is using information that is in the public domain or made available by government sources. Blandi notes that the courts have looked favorably on collecting data from public websites and pages. “Similarly, the use of legal repositories has also been given favorable opinions from the court regarding their collection and use of legal briefs.”

Damien Riehl, vLex’s VP of Product, outlines two arguments that could counter the NYT’s infringement claim. The first is the idea-expression dichotomy. “During training, the LLM's foundational model is extracting uncopyrightable Ideas – not copyrightable Expressions,” he explains.

The second is that the NYT, when it provided the first eight paragraphs to GPT, was performing a red-teaming exercise, in which bad actors try to get a system to do something it shouldn’t. “Those Bad Actors are the primary infringers, and the Tool that they use is — at most — the contributory infringer,” he says. “The analysis for contributory infringement — under the Supreme Court’s Betamax case — is whether the Tool is capable of substantial non-infringing uses.” And he adds that GPT and other LLMs are capable of thousands or millions of “non-infringing uses,” including law firms using LLMs for their legal work. (A recent episode of the vLex Insights webcast offers an in-depth analysis of the complex copyright issues in the NYT case.)

The Right Legal Tech Can Reduce GenAI Data Risks

Whatever the outcome of the NYT case, the copyright issue may give lawyers reason to steer clear of free-to-use GenAI tools (we have written about how, throughout 2023, the judiciary clamped down on ChatGPT-generated legal submissions). Because free-to-use models rely on largely uncurated datasets scraped from the open web, these tools can increase the risk of reputational, financial, and legal harm.

This is in contrast to GenAI-backed legal technologies trained on sources made available through strategic partnerships with legal publishers and other reputable content providers. The legal data and information fueling Vincent AI, for example, is trusted and authoritative – a vast global repository of vLex content that is continually expanding (most recently with the addition of high-quality legal resources from Chile, Mexico, Singapore, and the EU, and exclusive partnerships with the ABA to host legal title collections online).

And with Vincent AI Analyze Documents, lawyers can choose from a set of carefully engineered prompts to achieve a specific outcome for a particular legal document – creating a timeline of facts, for example, or setting out possible defenses. Again, the underlying content is authoritative and reputable, mitigating many of the dangers of GenAI that make the news headlines.