Making the Public Record Public

Harvard legal database released

Harvard Law School digitized seven million judicial opinions, which are now available for public use. | Montage illustration by Niko Yaitanes/Harvard Magazine

Generally, librarians are tasked with protecting books. But over the past decade, Harvard Law School (HLS) librarians have sent tens of thousands of books to the literary guillotine, severing their spines and running their pages through scanners. In March, the school announced the full public release of its ambitious digitization effort, the Caselaw Access Project (CAP). Now, researchers and civilians (and artificial intelligence models) can freely access seven million judicial decisions extending back to the nation’s founding.

Before the digital age, lawyers seeking to consult past judicial opinions had to comb through bulky “legal reporters”—thick hardcover books of court decisions. But these chronological volumes are inadequate for understanding the law, says Bemis professor of international law and vice dean for library and information resources Jonathan Zittrain, who helped lead the initiative. Research is “bound up in understanding the concurrent outputs of judges around the country,” he says. Creating an “entire database that could be searched and analyzed in all sorts of ways…was a grail…worth pursuing.”

To digitize seven million case records, HLS partnered with Ravel Law, a startup that used machine learning techniques to run complex searches. As part of that agreement, announced in 2015, Ravel funded the digitization of Harvard’s 40,000 law reporters and agreed to release the full database to the public after a few years of limited access during which Ravel could recoup its investment. During that interim period, researchers had full access and regular users could make up to 500 daily searches, Zittrain says.

He jokes that the arrangement between Ravel and Harvard was quite complex, “perhaps appropriately, given this is all about the law,” but that it exemplifies how private companies and public institutions can collaborate on projects for the public good. Ravel delivered investor value by demonstrating a “new way of visualizing cases and seeing interrelationships among them,” he says, with Harvard’s data, and was acquired by legal information giant LexisNexis in 2017. Now, Harvard students—and the broader world—can benefit from a powerful open legal database.

CAP is more than a simple amalgamation of millions of judges’ opinions: it’s a tool for drawing broad inferences that would be difficult to reach by hand. It can be used, for example, to track the prevalence of “he said” versus “she said,” a proxy for gendered participation in law, or to quantify the present legacy of slavery by pulling all decisions that cite slave-related cases. (A separate feature describes the effort a Harvard historian had to apply in searching such cases by hand.) “It’s really a whole new category of questions that you can try to pose and answer,” Zittrain says.
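
To make the “he said” versus “she said” example concrete, here is a minimal sketch of how such a count might be run once opinion text is in hand. It does not use CAP’s actual download format or any CAP interface: the sample records, the field names `year` and `text`, and the function `count_phrases_by_year` are all hypothetical, chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical sample records standing in for opinion text.
# Real CAP data would have to be downloaded and parsed separately.
SAMPLE_OPINIONS = [
    {"year": 1950, "text": "He said the contract was void. He said so twice."},
    {"year": 1950, "text": "She said the deed was forged."},
    {"year": 2010, "text": "She said the notice was late, and he said it was not."},
]

PHRASES = ("he said", "she said")


def count_phrases_by_year(opinions, phrases=PHRASES):
    """Return {year: Counter({phrase: occurrences})}.

    Matching is case-insensitive and anchored on word boundaries,
    so "She said" is not also counted as "he said".
    """
    patterns = {
        phrase: re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
        for phrase in phrases
    }
    counts = {}
    for opinion in opinions:
        year_counts = counts.setdefault(opinion["year"], Counter())
        for phrase, pattern in patterns.items():
            year_counts[phrase] += len(pattern.findall(opinion["text"]))
    return counts


if __name__ == "__main__":
    for year, tally in sorted(count_phrases_by_year(SAMPLE_OPINIONS).items()):
        print(year, dict(tally))
```

The word-boundary check matters here: without it, every “she said” would also register as a “he said” and the comparison would be meaningless. A real study would also need to decide how to handle quoted testimony, pronouns for parties versus witnesses, and changes in reporting style over time.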

This March milestone, making the entire data set available to the public, has significant implications for AI models. Unlike commercial legal repositories such as LexisNexis or Westlaw, CAP allows large language models to download all of its decisions. When scanning began in the mid-2010s, few people imagined the power and adoption of AI tools, Zittrain says. Now, CAP can “become part of the secret sauce of various large-language models,” either as a training set or as a referential database. He hopes that sharing CAP’s data with AI models will help alleviate misinformation. “The internet is fast becoming a place where maybe the least authoritative or helpful information is the most available,” he says. Training AI models with this data could help elevate complex, accurate information.

The database does not include every related legal document. Case dockets—exhaustive records of a lawsuit’s development—remain mostly confined to courthouse file cabinets. Instead, it focuses on judicial opinions (explanations of how the court reached its decision) and includes “pretty much all” opinions deemed important enough to be printed in a legal reporter, says Zittrain, ranging from the founding of each jurisdiction up to 2018.

Deprived of their spines and shrink-wrapped in plastic, tens of thousands of Harvard’s legal books now reside in a repurposed Kentucky limestone mine. Though the volumes sacrificed their physical utility for digital optimization, their knowledge could soon be used to ask an AI chatbot important legal questions, including how much jail time participants in a book heist would face.

Read more articles by Max J. Krupnick
