OpenAI Faces Scrutiny Over Alleged Use of O'Reilly Books for GPT-4o Training

Thu 3rd Apr, 2025

The artificial intelligence company OpenAI is facing allegations that it used works from O'Reilly Media, a prominent US technology publisher, to train its GPT-4o model without authorization. The claim comes from a recent study by the AI Disclosures Project, whose contributors include O'Reilly's founder and CEO, Timothy O'Reilly.

According to the research, OpenAI reportedly relied on at least 34 O'Reilly titles during the training of GPT-4o. The study also examined two other models, GPT-3.5 Turbo and GPT-4o mini, but found less conclusive evidence that those models were trained on the books.

In their analysis, the researchers posed a variety of multiple-choice questions to the OpenAI models. One of the answer options contained a direct quote from one of the 34 O'Reilly books, while the other choices were paraphrased versions. The study encompassed nearly 14,000 excerpts from these works. If the AI model correctly identified the verbatim quote, it was interpreted as an indication that the model had been trained using copyrighted material from the O'Reilly collection.
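The quiz procedure described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the researchers' code: the function names, the `choose` callback (a stand-in for a call to the model being tested), and the data layout are all assumptions made for the example.

```python
import random


def build_question(verbatim, paraphrases, rng):
    """Shuffle one verbatim excerpt among its paraphrases.

    Returns the shuffled options and the index of the verbatim excerpt.
    """
    options = [verbatim] + list(paraphrases)
    rng.shuffle(options)
    return options, options.index(verbatim)


def guess_rate(items, choose, seed=0):
    """Fraction of questions where `choose` picks the verbatim excerpt.

    `items` is a list of (verbatim, paraphrases) pairs.
    `choose` is a hypothetical stand-in for the model under test: it
    receives the list of options and returns the index it selects.
    """
    if not items:
        raise ValueError("items must not be empty")
    rng = random.Random(seed)  # fixed seed keeps the shuffling reproducible
    hits = 0
    for verbatim, paraphrases in items:
        options, answer = build_question(verbatim, paraphrases, rng)
        if choose(options) == answer:
            hits += 1
    return hits / len(items)
```

With one verbatim option and three paraphrases, a model that has never seen the text would be expected to land near 25 percent; a markedly higher rate is what the study treats as a sign of prior exposure.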

The researchers calculated an AUROC score (area under the receiver operating characteristic curve), a statistical measure of how reliably a model's answers separate text it has likely seen from text it has not; a score of 50 percent corresponds to chance. The score for GPT-4o reached 82 percent, which the researchers read as strong evidence that the copyrighted content was used in training. They also speculated that OpenAI might have obtained the books from the shadow library Library Genesis, which reportedly contains all 34 titles in question.
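For readers unfamiliar with AUROC, the following minimal Python sketch shows one standard way to compute it: as the probability that a randomly chosen item from one group scores higher than a randomly chosen item from the other, with ties counted as half. The article does not spell out which two groups the study compared, so the split into "suspected" and "comparison" excerpts, the function name, and the sample numbers are assumptions for illustration only.

```python
def auroc(suspected_scores, comparison_scores):
    """AUROC via pairwise comparison (Mann-Whitney formulation).

    Returns the share of (suspected, comparison) pairs in which the
    suspected item has the higher score; ties count as half a win.
    0.5 means the two groups are indistinguishable, 1.0 means perfect
    separation.
    """
    if not suspected_scores or not comparison_scores:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for s in suspected_scores:
        for c in comparison_scores:
            if s > c:
                wins += 1.0
            elif s == c:
                wins += 0.5
    return wins / (len(suspected_scores) * len(comparison_scores))


# Made-up guess rates, purely to show the call:
print(auroc([0.9, 0.3], [0.5, 0.1]))  # 3 of 4 pairs favour the first group
```

Read this way, a score of 82 percent means the suspected excerpts outscored the comparison excerpts in roughly four out of five pairings, while scores in the mid-50s are barely distinguishable from a coin flip.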

The study also indicated that the role of non-public data in training OpenAI models has grown over time. For non-public excerpts, the AUROC score for GPT-3.5 Turbo, whose training data dates from 2021, was 54 percent, and GPT-4o mini, released in 2024, scored 56 percent. Both values are close to the 50 percent expected by chance, so the study found little evidence that these two models were trained on O'Reilly's non-public content.

The authors of the study point to a broader, systemic issue regarding the use of copyrighted material in training language models. They advocate greater transparency and a formal licensing framework for content used in training. Without appropriate compensation, they warn, the supply of content suitable for training AI models could shrink significantly. The New York Times has also filed a lawsuit against OpenAI, alleging copyright violations related to the training of its AI systems.

