As copyright concerns plague the field of generative AI, Apple seeks to preserve privacy and legality through innovative training methods for large language models, all while avoiding controversy.
In recent years, the relationship between generative AI and copyright law has remained an important and complex issue. As large language models (LLMs) and generative AI apps increase in popularity, copyright issues have continued to pile up without any kind of meaningful resolution.
Problems arise when companies use copyrighted works in training their generative AI software, and when the outputs of said AI software contain sections of works under copyright protection.
Copying copyrighted works in their entirety or using significant sections of such works for training generative AI software is copyright infringement. There is no "fair use" carve-out for AI training, despite what the companies that are training the models say or believe.
Generative AI and copyright infringement lawsuits
In late December of 2023, The New York Times sued OpenAI and Microsoft for copyright infringement. The lawsuit claimed that the two companies trained their generative AI software using millions of articles published by The New York Times.
This was not the first time OpenAI faced a lawsuit about model training. In September 2023, the company was also sued by several prominent authors, including George R. R. Martin, Michael Connelly and Jonathan Franzen.
The history of generative AI and copyright issues goes back even further. In July of 2023, over 15,000 authors signed an open letter addressed to several prominent companies, including Alphabet, OpenAI, Meta and Microsoft.
The letter requested that the authors be properly credited and compensated for their work, which was used in the training of generative AI and large language models.
Another, similar class-action lawsuit alleging copyright infringement was filed against OpenAI by non-fiction authors Nicholas Basbanes and Nicholas Gage. The lawsuit was filed in January of 2024.
In late April of 2024, another AI-related lawsuit was filed, this time against Amazon. The lawsuit alleges that an Amazon employee was instructed to deliberately ignore and violate copyright law so that Amazon could compete against rival products and services more effectively.
In the lawsuit, a former Amazon employee claims that a supervisor, discussing copyright-violating AI training, told her that "everyone else is doing it," implying that people from rival companies were knowingly engaging in copyright infringement.
And it's pretty clear that they are.
AI and publishers' concerns about reproduction of copyrighted content
AI has been known to reproduce copyrighted content on multiple occasions, and the severity of the problem has inspired companies to analyze the frequency at which this happens.
To gain a better understanding of the rate at which AI chatbots generate copyright-protected content, the company Patronus AI decided to look into the matter. The company, which evaluates generative AI models, compared four major AI models: OpenAI's ChatGPT-4, Meta's Llama 2, Mistral's Mixtral and Anthropic's Claude 2.1.
Patronus AI found that the rate at which AI generated copyrighted content varied from model to model, but that the rates were high overall. The company also released its own tool, known as CopyrightCatcher, which detects potential copyright violations in LLMs.
While the generation of copyrighted content has serious implications, publishers are also concerned about the use of copyrighted material in training large language models.
In March of 2024, The Wall Street Journal reported that prominent publishers were investigating the use of their copyrighted works in the training of generative AI models. The publishers wanted to be paid for the use of their work by AI.
Given the number of lawsuits related to generative AI and copyright and the seriousness of the concerns expressed by publishers, it makes sense that a company like Apple would try its best to avoid any potential legal issues.
Apple's unique approach to generative AI, large language models and copyright issues
As a way of avoiding similar copyright issues during the training of its own generative AI software, Apple has reportedly been licensing the works of major news publications.
In December of 2023, it was reported that Apple planned to license works from Condé Nast, the publisher of Vogue and The New Yorker. The company had also spoken to IAC and NBC News in an attempt to make a deal worth approximately $50 million.
While Apple developed its large language model, known internally as Ajax, with basic on-device functionality, the company took a different approach to more advanced features. Apple considered licensing software such as Google Gemini for more complex tasks requiring an internet connection.
By employing this strategy, Apple clearly intended to avoid copyright issues. With the paid licensing, Apple would not be responsible for copyright infringement caused or perpetrated by software such as Google Gemini.
In a research paper published in March of 2024, Apple revealed that it used a carefully curated mixture of images, image-text and text-based input to train its in-house LLM. The method Apple used allowed for better image captioning, multi-step reasoning and preserving privacy, all at the same time.
We were told by industry sources that Apple's Ajax LLM preserves privacy because it does not require an internet connection for basic text analysis. This means that the on-device LLM cannot connect to a database and identify copyrighted content in offline mode, although more advanced features like text generation would likely include such checks and connections.
Reporting and documented projects aside, guard rails and licensing are only secure if they are enforced. Sources familiar with Apple's AI test environments, speaking to AppleInsider, have revealed that there were seemingly few to no restrictions preventing someone from using copyrighted material in the input for on-device test environments.
Our source wasn't clear about regulations inside Apple to prevent copyright-violating training. The output, however, is likely more regulated to avoid word-for-word reproduction of copyrighted material.
Apple is expected to debut its generative AI technology during WWDC, which starts on June 10.