When Carnegie funded thousands of public libraries at the turn of the 20th century, he recognized that shared access to knowledge was the cornerstone of progress. Today, in the era of artificial intelligence, we need a similar vision and investment in shared, trusted data infrastructure. 

The Scottish-American steel magnate spent almost $60 million (equivalent to around $2.3 billion today) to build a network of 2,509 libraries globally — 1,689 in the US and the rest in places as diverse as Australia, Fiji, South Africa, and his native Scotland. Carnegie supported these libraries for a number of reasons, among them burnishing his own reputation, but most of all because he was “dedicated to the diffusion of knowledge.”

In the 21st century, information is both abundant and scarce: it offers tremendous potential for the public good, yet it remains accessible and reusable largely to a small (corporate) minority. While more and more aspects of our lives are captured in digital form, the resulting data is increasingly locked away or otherwise inaccessible.

The rise of generative AI underscores the centrality of data. AI relies on vast troves of high-quality, diverse, and timely datasets. Yet access to such data is being eroded as governments, corporations, and institutions impose new restrictions on what can be used and reused.

Open data portals and official statistics, once celebrated as milestones of transparency, have been defunded or scaled back. Data is increasingly treated as a proprietary asset to be hoarded or sold, rather than a shared resource that can be stewarded responsibly for mutual benefit. Private platforms that once offered public APIs for research — such as Twitter (now X), Meta, and Reddit — have closed or heavily monetized access, cutting off academics, civil society groups, and smaller enterprises from vital resources.

Generative AI has triggered what some call “generative AI-nxiety,” prompting news organizations, academic institutions, and other data custodians to block crawlers and restrict even non-sensitive repositories, often in (understandable) reaction to unconsented scraping for commercial model training. This is compounded by a broader research data lockdown, in which critical resources — such as social media datasets used to study misinformation, political discourse, or mental health, and open environmental data essential for climate modeling — are increasingly subject to paywalls, restrictive licensing, or geopolitical disputes. 

Rising calls for digital sovereignty have also led to a proliferation of data localization laws that restrict cross-border data flows, undermining collaborative efforts on urgent global challenges like pandemic preparedness, disaster response, and environmental monitoring.

We may be entering a new “data winter.” This narrowing of the data commons comes precisely at a moment when global challenges demand greater openness, collaboration, and trust. Left unchecked, it risks stalling scientific breakthroughs, weakening evidence-based policymaking, deepening inequities in access to knowledge, and entrenching power in the hands of a few large actors, reshaping not only innovation but our collective capacity to understand and respond to the world. 

A Carnegie commitment to the “diffusion of knowledge,” updated for the digital age, can help avert this dire situation. Building modern data libraries and embedding principles of the commons could restore openness while safeguarding privacy and security. Without such action, the promise of our data-rich era may curdle into a new form of information scarcity, with deep and lasting societal costs. 

One solution is to empower trusted institutions to steward equitable and responsible access to data, revitalizing the data commons in ways suited to the AI era. These libraries could help break down silos, foster public trust and democratic participation, and empower a broad range of actors. By making datasets more accessible, they would support the development of independent AI models and foster algorithmic pluralism, so that varied algorithms, rather than a handful of dominant ones, shape the field. By lowering barriers to data and insight, they could empower researchers and civil society groups to pursue academic and nonprofit work. And by encouraging clear standards for data ethics and transparency, they could promote responsible AI development; more transparency, in turn, could help build public trust in AI.

Much like Carnegie’s original libraries, these proposed “data for public-interest AI libraries” would serve as trusted, community-oriented institutions; instead of books, they would curate, maintain, and share high-quality datasets for public benefit. Modeled on existing projects such as the Institutional Data Initiative at Harvard, they would be operated by multistakeholder consortia and governed transparently, in a manner that ensures broad accountability.

Most importantly, the Carnegie AI libraries would prioritize access for those currently locked out of the data economy, such as researchers and public interest actors — the sectors most at risk from the impending data winter. 

Stefaan G. Verhulst is Co-Founder of The GovLab (New York City) and The DataTank (Brussels). He is also Research Professor at the Tandon School of Engineering at New York University and Editor-in-Chief of Data & Policy, an open-access journal published by Cambridge University Press.

Bandwidth is CEPA’s online journal dedicated to advancing transatlantic cooperation on tech policy. All opinions expressed on Bandwidth are those of the author alone and may not represent those of the institutions they represent or the Center for European Policy Analysis. CEPA maintains a strict intellectual independence policy across all its projects and publications.
