Harvard Data Trained This AI Model

“Talkie” is a large language model trained on only pre-1931 public domain content from Harvard libraries.

The Institutional Data Initiative at Harvard Law School Library is a research initiative designed to “bring the public domain to AI.” | Photograph by Olivia Farrar / Harvard Magazine

Nearly a century of printed history from Harvard libraries has become raw material for artificial intelligence.

In late April, a team of AI researchers including Nick Levine, David Duvenaud, and former OpenAI scientist Alec Radford unveiled “Talkie”: a mid-sized large language model (LLM) trained exclusively on text published before January 1, 1931. Ask Talkie about the internet, television, or World War II, and it falters or makes guesses. Prompt it about early aviation or 1920s social customs, and it responds fluently.

The timeframe was chosen because many works published in the United States in 1930 entered the public domain on January 1, 2026, making them free to use and distribute.

Researchers increasingly see historically bounded models like Talkie as a new way to study how artificial intelligence learns, including whether an LLM can infer ideas that may have emerged after the historical cutoff of its dataset. Since its release, users have been testing Talkie to see whether it can accurately forecast future events from historical data or generalize concepts it was never explicitly taught. In one experiment, Talkie demonstrated the ability to produce new code when given small snippets of Python, despite being trained on material published decades before computers existed.

Could a model trained only on pre-1931 data, for example, arrive at any of the paradigm-shifting theories (cosmic inflation, the Standard Model of particle physics, punctuated equilibrium) developed in the latter half of the twentieth century?

Talkie has limitations and liabilities, its researchers acknowledge. Because the model was trained on historical texts that may contain offensive views, it can reproduce racist or discriminatory attitudes that were common at the time. Its creators chose not to sanitize the underlying dataset, arguing that doing so would distort the historical record. The public-facing demo does include moderation layers and warnings for problematic outputs, however.

Other issues are technical. Historical datasets can contain metadata errors, revised editions, editorial insertions, and flawed recognition of historical images, and the model can learn from that error-ridden data. In some cases, Talkie appears to know facts and details from after its historical cutoff, a problem referred to as “temporal leakage.”

Talkie demonstrates an important facet of lawful AI development: that development may depend as much on libraries and archives as on technology companies. Libraries possess vast collections of material, some of it in the public domain. As lawsuits against companies like OpenAI, Microsoft, and Meta pile up over the use of copyrighted material in training models, models trained on public domain works face no such liability.

Libraries like Harvard’s, with one of the largest collections in the country, are increasingly becoming participants in the transfer of knowledge into computational systems.

“We have the foundational materials needed to train inclusive AI systems,” says Martha Whitehead, Harvard’s university librarian. “We aim to partner in shaping the ethical use of those materials in emerging systems, to ensure they reflect the breadth and depth of human knowledge for the benefit of all.”

Much of Talkie’s training data comes from a dataset compiled by the Harvard Law School Library’s Institutional Data Initiative (IDI): nearly one million public-domain volumes digitized from Harvard Library collections and released last year for use in computational research. The IDI also includes the Caselaw Access Project, a repository of more than seven million judicial decisions extending back to the founding of the United States.

In the coming months, IDI plans to release an expanded version of that dataset along with a new archive of roughly two million digitized newspaper pages created in partnership with the Boston Public Library.

“The AI community has historically played fast and loose with data quality,” says Greg Leppert, the IDI executive director and chief technologist at the Berkman Klein Center for Internet and Society. “[A]t times blindly trusting frontier models with wholesale data cleanup…but the team [working on Talkie] have gone to great pains to ensure their data is a reflection of history rather than their AI tools.”

Such care has effects beyond creating an accurate snapshot of the past, he adds. “If your goal is testing language models for their ability to predict the future, suddenly there’s reason to care about the qualities of data that knowledge stewards have long valued.”

Read more articles by Olivia Farrar
