Artificial intelligence (“AI”) has been part of our lives for some time, but it became a buzzword when OpenAI introduced ChatGPT to the public. For that reason, the lawsuit against OpenAI over the datasets used to train ChatGPT deserves more attention than other similar cases.
The New York Times Company (“The Times”) has initiated legal proceedings against Microsoft Corporation and various OpenAI entities, accusing them of illegally using its copyright-protected material. According to the lawsuit, this material was used in the development of AI models like ChatGPT and Bing Chat, which allegedly violates The Times’s copyrights. The case underscores the potential negative effects of AI on traditional media, including potential revenue losses and a decline in reader trust.
The Times, a recipient of 135 Pulitzer Prizes and publisher of over 250 original articles daily, emphasizes how resource-intensive essential news reporting is. It generates significant revenue through content licensing, while also granting some free licenses for academic and non-profit use. The lawsuit alleges that The Times’s articles were used to train ChatGPT and that the trained system then presents those articles to users, either in whole or in summary form, and imitates their style. The complaint includes striking examples of alleged infringement in which ChatGPT reproduced the text with only small alterations.
Data Training
The first step in training an AI model is to gather a large dataset. This data can be anything from text, images, and sounds to more complex data like user interactions or sensor readings. The quality and quantity of this data significantly impact the model’s performance. According to the lawsuit, earlier versions of ChatGPT relied on substantial content from The Times (e.g., one dataset contained 333,160 entries traced to The Times), and these entries were used for training without The Times’s permission. Later versions diversified their data sources (mostly social media posts and comments), yet The Times remained a critical source of reliable information. One of the sources used is a large collection of online material called “Common Crawl,” which the suit alleges contains 16 million unique records from sites published by The Times. Some of the articles were copied in full.
According to the lawsuit, Microsoft developed specialized systems to replicate The Times’s content for AI models. To train the GPT models, Microsoft and OpenAI worked together to create a complex, customized supercomputing system that could store and replicate copies of the training dataset, including The Times’s content. The Times’s articles were allegedly copied and ingested multiple times in the course of that training.
The defendants publicly defend their actions as fair use, arguing that the use of copyrighted material in AI training serves a transformative purpose (the AI-generated output has a different character than the input). However, The Times argues this is not transformative, as it involves creating competing products without compensation.
(Non)Profit
A competing product? OpenAI initially presented itself as an altruistic, non-profit organization, but that changed in 2019, when a for-profit affiliate was established. Since that shift, OpenAI has stopped releasing its models as open source, starting with GPT-3 in 2020, and has kept subsequent model designs and training details secret. As of August 2023, OpenAI was on pace to generate more than USD 1 billion in revenue over the next twelve months, and its market valuation is reported to be as high as USD 90 billion. Users might get the same or similar articles from both The Times and ChatGPT, which could lead to market disruption.
Negotiations re Licensing
These differing standpoints led to negotiations. The Times, along with numerous other media outlets, began talks with the AI developers about the price and terms of licensing its content. The negotiations focused on a partnership built around the real-time display of The Times’s articles (with attribution) in ChatGPT, through which The Times would gain a new way to connect with existing and new readers, and ChatGPT users would gain access to its reporting. However, the negotiations with The Times have not resulted in an agreement. The Associated Press, by contrast, did reach one.
Claims
The Times contends that the success of OpenAI and Microsoft’s AI models heavily relies on copyright infringement.
Professionals are already discussing ChatGPT’s so-called “hallucinations.” The lawsuit addresses this issue as well, alleging that false attributions of content to The Times cause it commercial harm.
The lawsuit raises various legal claims, including vicarious and contributory copyright infringement and trademark dilution, and requests the destruction of all infringing AI models that are based on The Times’s articles.
AI is neither good nor bad, and there are still nuances in its creation and use. This court case should clarify many of them and bring more legal certainty to the field of new technologies.
OpenAI responds
On 8 January 2024, OpenAI published a blog post setting out its position, claiming that:
- It collaborates with news organizations and is creating new opportunities;
- Training is fair use, but it provides an opt-out because it is the right thing to do;
- “Regurgitation” (reproducing almost unchanged articles) is a rare bug that it is working to eliminate;
- The New York Times is not telling the whole story; here OpenAI emphasizes the content of the negotiations and its own good-faith actions, such as taking down ChatGPT to fix bugs affecting The Times’s content.
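For readers curious about what an opt-out of this kind can look like in practice, the short Python sketch below may help. It assumes the opt-out is expressed through a site's robots.txt file addressed to a crawler user agent such as GPTBot (the name OpenAI documents for its web crawler). The robots.txt contents, the example.com address, and the helper function name are hypothetical and serve only as an illustration; they do not describe any publisher's actual configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one AI crawler, allows everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    article = "https://example.com/2024/01/08/some-article.html"  # placeholder URL
    for agent in ("GPTBot", "SomeOtherBot"):
        print(agent, is_crawl_allowed(ROBOTS_TXT, agent, article))
```

A rule like this only tells a compliant crawler not to fetch pages in the future. It does not remove material that was collected earlier, which is one reason the opt-out does not settle the dispute over past training.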
The official response before the court in New York is still expected.
The entire tech and IP world is watching how these interdependent interests will be resolved.
The information in this document does not constitute legal advice on any particular matter and is provided for general informational purposes only.