It’s a conundrum. Leading AI developers such as OpenAI, Microsoft, and Google depend on text, images, and videos from the web to create their revolutionary large language models (LLMs). Restricting access to copyrighted work risks harming AI innovation and creating biased algorithms. But rightsholders fear for their livelihoods and demand compensation.

How should policymakers respond? CEPA is launching a series on copyright and AI to address this challenge. Countries approach the copyright challenge differently. The US battles in court over fair use. The EU allows commercial text and data mining as long as rightsholders can opt out. Japan and Singapore’s loose copyright protections privilege AI innovation. By comparing and contrasting different jurisdictions, we aim to foster a productive, solutions-oriented debate and identify potential paths forward.

Data is essential to train LLMs, and the larger and more representative the data set, the better the model’s output. AI companies use web crawlers and web scrapers – automated bots that systematically browse the internet to index and download information – to collect training data. Most training data is copyrighted.

“Because copyright today covers virtually every sort of human expression – including blog posts, photographs, forum posts, scraps of software code, and government documents – it would be impossible to train today’s leading AI models without using copyrighted materials,” OpenAI wrote in a statement to the UK Parliament. 

Copyright holders object. In the US, they are suing leading AI developers. In one of the most high-profile cases, the New York Times accuses OpenAI and Microsoft of using millions of its articles to train their chatbots. US legal battles hinge on whether the Fair Use Doctrine permits harvesting and using copyrighted material to train AI models. If Fair Use applies, copyrighted works can be used without compensation. 

Europe’s approach to copyright and AI is also contested. The EU’s 2019 Copyright Directive grants copyright exceptions for text and data mining, as long as rightsholders can exercise the right to opt out. The EU AI Act extends the Copyright Directive. These exceptions “provide balance” between protecting rightsholders and facilitating text and data mining, former EU tech commissioner Thierry Breton argues.


Many disagree. An open letter on behalf of the International Federation of Journalists and 12 organizations representing creative workers, addressed to the new European digital commissioner Henna Virkkunen, argues that EU copyright law “fails to adequately protect the rights of our creative communities and the value of their cultural works.” The letter points to uncertainty over how to exercise the opt-out right and concern that opt-outs are being ignored. In response, it calls for strengthened opt-out rights as well as “enforceable mechanisms to remunerate the creative community for the AI-generated output.” 

Others fear Europe’s opt-out mechanism could kill the continent’s AI hopes. Widespread uptake would worsen model performance by reducing the amount of high-quality data available to train AI. More than half of news publishers already block the main AI web crawlers by using shared standards for robot exclusion.

Several countries, led by Japan and Singapore, are loosening copyright laws to fuel AI. Since 2019, Japan has allowed AI companies to use copyrighted works without remuneration. It hopes to become “the world’s most AI-friendly country.” In 2021, Singapore introduced a similar exception allowing the use of copyrighted works for “computational data analysis” to support its tech industry. 

The UK is also looking to introduce broader exceptions for text and data mining to encourage AI development. “We will establish a gold standard data access regime,” Prime Minister Keir Starmer has promised, arguing that “artificial intelligence is the defining opportunity of our generation.” 

India represents a test case for the Global South. It has both a strong software industry and a strong news industry. Now, Indian publishers and news agencies are suing OpenAI. The outcome could set a precedent for the developing world. 

Rapid AI breakthroughs have left policymakers playing catch up. Research shows a strong link between permissive copyright laws and fast AI innovation. But rightsholders are demanding compensation for the role their works play in training lucrative AI models. Urgent responses are needed to strike a fair balance.

Hillary Brill is a non-resident Senior Fellow with the Tech Policy Program at the Center for European Policy Analysis (CEPA). Hillary served as interim Executive Director of the Georgetown Law Institute for Technology Law & Policy and teaches Copyright Law at Georgetown Law. Previously, she was the IP Practitioner-in-Residence at the American University Washington College of Law. Hillary received her BA from Harvard University and her JD from Georgetown.

Oona Lagercrantz is a researcher at CEPA based in Brussels. She earned a master’s degree with distinction from the University of Cambridge.

Bandwidth is CEPA’s online journal dedicated to advancing transatlantic cooperation on tech policy. All opinions expressed on Bandwidth are those of the author alone and may not represent those of the institutions they represent or the Center for European Policy Analysis. CEPA maintains a strict intellectual independence policy across all its projects and publications.
