Technology | May 20, 2026

Harvard Data Trained This AI Model

“Talkie” is a large language model trained on only pre-1931 public domain content from Harvard libraries.

by Olivia Farrar

The Institutional Data Initiative at Harvard Law School Library is a research initiative designed to “bring the public domain to AI.” | Photograph by Olivia Farrar / Harvard Magazine

Nearly a century of printed history from Harvard libraries has become raw material for artificial intelligence.

In late April, a team of AI researchers including Nick Levine, David Duvenaud, and former OpenAI scientist Alec Radford unveiled “Talkie”: a mid-sized large language model (LLM) trained exclusively on text published before January 1, 1931. Ask Talkie about the internet, television, or World War II, and it falters or makes guesses. Prompt it about early aviation or 1920s social customs, and it responds fluently.

The timeframe was chosen because many works published in the United States in 1930 entered the public domain on January 1, 2026, making them free to use and distribute.

Researchers increasingly see historically bounded models like Talkie as a new way to study how artificial intelligence learns, including whether an LLM can infer ideas that emerged after the historical cutoff of its dataset. Since its release, users have been testing Talkie to see whether it can accurately forecast future events from historical data or generalize to concepts it was never explicitly taught. In one experiment, Talkie demonstrated the ability to produce new code when given small snippets of Python, despite being trained on material published before electronic computers existed.

Could a model trained only on pre-1931 data, for example, come up with any of the paradigm-shifting theories (cosmic inflation, the Standard Model of particle physics, punctuated equilibrium) developed in the latter half of the twentieth century?

Talkie has limitations and liabilities, its researchers acknowledge. Because the model was trained on historical texts that may contain offensive views, it can reproduce racist or discriminatory attitudes that were common at the time. Its creators chose not to sanitize the underlying dataset, arguing that doing so would distort the historical record. The public-facing demo does include moderation layers and warnings for problematic outputs, however.

Other issues are technical. Historical datasets can contain metadata errors, revised editions, editorial insertions, and flawed recognition of historical images, and the model can learn from error-ridden data. In some cases, Talkie appears to know facts and details beyond its historical cutoff, a phenomenon referred to as “temporal leakage.”

Talkie demonstrates an important facet of developing AI on firm legal ground: it may depend as much on libraries and archives as on technology companies. Libraries possess vast collections of material, some of it in the public domain. As lawsuits against companies like OpenAI, Microsoft, and Meta pile up over the use of copyrighted material in training models, models trained on public-domain works face no such liabilities.

Libraries like Harvard’s, which holds one of the largest collections in the country, are increasingly becoming participants in the transfer of knowledge into computational systems.

“We have the foundational materials needed to train inclusive AI systems,” says Martha Whitehead, Harvard’s university librarian. “We aim to partner in shaping the ethical use of those materials in emerging systems, to ensure they reflect the breadth and depth of human knowledge for the benefit of all.”

Much of Talkie’s training data comes from the Harvard Law School Library’s Institutional Data Initiative (IDI), which last year released a dataset of nearly one million public-domain volumes digitized from Harvard Library collections for use in computational research. The IDI also includes the Caselaw Access Project, a repository of more than seven million judicial decisions extending back to the founding of the United States.

In the coming months, IDI plans to release an expanded version of that dataset along with a new archive of roughly two million digitized newspaper pages created in partnership with the Boston Public Library.

“The AI community has historically played fast and loose with data quality,” says Greg Leppert, the IDI executive director and chief technologist at the Berkman Klein Center for Internet and Society. “[A]t times blindly trusting frontier models with wholesale data cleanup…but the team [working on Talkie] have gone to great pains to ensure their data is a reflection of history rather than their AI tools.”

Such care, he adds, has effects beyond creating an accurate snapshot of the past. “If your goal is testing language models for their ability to predict the future, suddenly there’s reason to care about the qualities of data that knowledge stewards have long valued.”
