Nearly a century of printed history from Harvard libraries has become raw material for artificial intelligence.
In late April, a team of AI researchers including Nick Levine, David Duvenaud, and former OpenAI scientist Alec Radford unveiled “Talkie”: a mid-sized large language model (LLM) trained exclusively on text published before January 1, 1931. Ask Talkie about the internet, television, or World War II, and it falters or makes guesses. Prompt it about early aviation or 1920s social customs, and it responds fluently.
The timeframe was chosen because many works published in the United States in 1930 entered the public domain on January 1, 2026, making them free to use and distribute.
Researchers increasingly see historically bounded models like Talkie as a new way to study how artificial intelligence learns, including whether an LLM can infer ideas that emerged only after the historical cutoff of its dataset. Since its release, users have been testing whether Talkie can accurately forecast future events from historical data or generalize to concepts it was never explicitly taught. In one experiment, Talkie produced new code when given small snippets of Python, despite being trained only on material published before modern computers existed.
Could a model trained only on pre-1931 data, for example, arrive at any of the paradigm-shifting theories (cosmic inflation, the Standard Model of particle physics, punctuated equilibrium) developed in the latter half of the twentieth century?
Talkie has limitations and liabilities, its researchers acknowledge. Because the model was trained on historical texts that may contain offensive views, it can reproduce racist or discriminatory attitudes that were common at the time. Its creators chose not to sanitize the underlying dataset, arguing that doing so would distort the historical record. The public-facing demo does include moderation layers and warnings for problematic outputs, however.
Other issues are technical. Historical datasets can contain metadata errors, revised editions, editorial insertions, and flawed recognition of historical images, and a model trained on them can absorb those errors. In some cases, Talkie appears to know facts and details from beyond its historical cutoff, a phenomenon referred to as “temporal leakage.”
Talkie also illustrates an important facet of legally sound AI development: it may depend as much on libraries and archives as on technology companies. Libraries hold vast collections of material, some of it in the public domain. As lawsuits pile up against companies like OpenAI, Microsoft, and Meta over the use of copyrighted material in training models, models trained on public-domain works face no such liabilities.
Libraries like Harvard’s, which holds one of the largest collections in the country, are increasingly becoming participants in the transfer of knowledge into computational systems.
“We have the foundational materials needed to train inclusive AI systems,” says Martha Whitehead, Harvard’s university librarian. “We aim to partner in shaping the ethical use of those materials in emerging systems, to ensure they reflect the breadth and depth of human knowledge for the benefit of all.”
Much of Talkie’s training data comes from a dataset assembled by the Harvard Law School Library’s Institutional Data Initiative (IDI): nearly one million public-domain volumes digitized from Harvard Library collections and released last year for use in computational research. IDI also includes the Caselaw Access Project, a repository of more than seven million judicial decisions extending back to the founding of the United States.
In the coming months, IDI plans to release an expanded version of that dataset along with a new archive of roughly two million digitized newspaper pages created in partnership with the Boston Public Library.
“The AI community has historically played fast and loose with data quality,” says Greg Leppert, the IDI executive director and chief technologist at the Berkman Klein Center for Internet and Society. “[A]t times blindly trusting frontier models with wholesale data cleanup…but the team [working on Talkie] have gone to great pains to ensure their data is a reflection of history rather than their AI tools.”
Such care, he adds, has effects beyond creating an accurate snapshot of the past. “If your goal is testing language models for their ability to predict the future, suddenly there’s reason to care about the qualities of data that knowledge stewards have long valued.”