The debate over copyright and AI involves a trade-off between the potential benefits of access to training data and the possible harm to content providers. My research suggests that the focus should shift away from the use of data to train AI models and toward whether AI outputs violate copyright. Training on copyrighted works itself should be classified as fair use.
When models, data, and compute are scaled together, AI performance improves. Small models trained on more data often outperform larger models with less data, and do so more efficiently in terms of compute (and electricity).
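This scaling relationship has been formalized empirically. A Chinchilla-style loss formula (Hoffmann et al., 2022), shown here for illustration rather than drawn from the article, expresses model loss $L$ as a function of parameter count $N$ and training tokens $D$:

```latex
L(N, D) \approx E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}}
```

Here $E$ is an irreducible loss floor, and the two remaining terms shrink as the model grows and as training data grows, respectively. For a fixed compute budget, loss is minimized by scaling $N$ and $D$ roughly in proportion, which is why a smaller model trained on more tokens can outperform a larger, under-trained one.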
Yet data is finite, and scarcity risks slowing AI performance gains. If broad swaths of text or images are unavailable because of copyright controls, developers cannot simply “train longer” on what remains without diminishing returns. Synthetic data may help compensate for missing material, but it risks reflecting the biases of the model that generated it.
A large and diverse corpus of training materials also helps ensure that models reflect cultural diversity. Copyright frameworks that restrict data access risk becoming a bottleneck on representativeness as well as performance.
“The rules of copyright law privilege access to certain works over others, encouraging AI creators to use easily available, legally low-risk sources of data for teaching AI, even when those data are demonstrably biased,” warns Georgetown Professor Amanda Levendowski. “Limited training data diversity” may “bias the model towards overrepresented cultures,” add Singaporean researchers Yao Qu and Jue Wang.
AI models may violate copyright if they reproduce protected content. Yet AI may also ease the detection of potential violations, and models can be instructed to avoid them. For example, the system prompt for Anthropic’s model Claude includes the following instruction: “CRITICAL: Always respect copyright by NEVER reproducing large 20+ word chunks of content from web search results, to ensure legal compliance and avoid harming copyright holders.”
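Safeguards of the kind quoted above can also be approximated mechanically. The sketch below is illustrative only (Anthropic’s actual filtering is not public): it flags any output that reproduces a run of 20 or more consecutive words verbatim from a reference text.

```python
def verbatim_overlap(output: str, source: str, n: int = 20) -> bool:
    """Return True if `output` reproduces a run of `n` or more
    consecutive words found verbatim in `source`."""
    src_words = source.lower().split()
    out_words = output.lower().split()
    # Index every n-word window of the source for O(1) lookup.
    src_ngrams = {tuple(src_words[i:i + n])
                  for i in range(len(src_words) - n + 1)}
    # Flag the output if any of its n-word windows appears in the source.
    return any(tuple(out_words[i:i + n]) in src_ngrams
               for i in range(len(out_words) - n + 1))
```

Real-world detection would need to handle paraphrase, punctuation, and formatting changes, but the example shows why verbatim reproduction is the tractable thing to police, whereas policing training data itself is not.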
We should be wary of proposals that require opt-in or opt-out mechanisms. The difficulty of implementing and enforcing such choices makes them impractical. “The right to opt-out amounts to economically inefficient overprotection of copyright,” argues Bertin Martens of the Bruegel think tank in Brussels. “The ongoing bargaining and court cases between media producers and GenAI developers risk entrenching this market failure in jurisprudence.”
While appealing in principle, transparency requirements could also prove impractical and counterproductive. Training datasets are vast and often include proprietary and personal data, and it is hard to determine what is and is not copyrighted. Disclosure could compromise trade secrets and reduce incentives to invest in data quality.
AI is a revolutionary technology, like steam, electricity, and computing. It requires access to a large and representative corpus of data. Copyright should not stop new technology from achieving its full potential.
Brian Williamson is a partner at the London-based Communications Chambers consultancy. He works at the intersection of technology, economics, and policy. Clients have included governments, regulators, telcos, and tech companies. This article is adapted from a just-published paper, AI, Copyright and the Public Good.
Bandwidth is CEPA’s online journal dedicated to advancing transatlantic cooperation on tech policy. All opinions expressed on Bandwidth are those of the author alone and may not represent those of the institutions they represent or the Center for European Policy Analysis. CEPA maintains a strict intellectual independence policy across all its projects and publications.