September 1, 2002

Testing Trap

From The September-October 2002 Issue

Supporters of the reauthorization, last January, of the Elementary and Secondary Education Act hail it for tightening school accountability substantially, for granting more flexibility to states and school districts in the use of federal funds, and for applying sanctions to and providing aid for failing schools. Opponents argue that the bill doesn't go far enough, because congressional supporters of school choice failed to persuade their colleagues and the president's advisers to include vouchers in the bill.

Sadly, from an educational perspective, both sides miss the major issues. This is an "accountability bill" that utterly fails to understand the institutional realities of accountability in states, districts, and schools. And its provisions are considerably at odds with the technical realities of test-based accountability. In the history of federal education policy, the disconnect between policy and practice has never been so evident, nor so dangerous. Ironically, the conservative Republicans who control the White House and the House of Representatives are sponsoring the single largest, and the single most damaging, expansion of federal power over the nation's education system.

Under the new law, the federal government mandates a single test-based accountability system for all states, a system currently operating in fewer than half the states. It requires annual testing at every grade level, and states must disaggregate their test scores by students' racial and socioeconomic backgrounds (a system currently operating in only a handful of states, and one fraught with technical difficulties). The federal government further mandates a single definition of adequate yearly progress, the amount by which schools must increase their test scores in order to avoid some sort of sanction, an issue that in the past has been decided jointly by states and Washington. Finally, the law sets a single target date by which all students must exceed a state-defined proficiency level, an issue that in the past has been left almost entirely to states and localities.

Thus the federal government is now accelerating the worst trend of the current accountability movement: that performance-based accountability has come to mean testing alone. In the early stages of the current movement, reformers had an expansive view of performance that included, in addition to tests, portfolios and formal exhibitions of students' work, student-initiated projects, and teachers' evaluations of their students. The comparative appeal of standardized tests is easy to see: they are relatively inexpensive to administer; can be mandated simply; can be rapidly implemented; and deliver clear, visible results. But relying only on standardized tests dodges the complicated questions of what tests actually measure and of how schools and students react when tests are the sole yardstick of performance.

If this shift in federal policy were based on the accumulated wisdom gained from experiences with accountability in states, districts, and schools, or if it were based on clear design principles that had some basis in practice, it might be worth the risk. In fact, however, it is based on little more than talk among people who know hardly anything about the institutional realities of accountability, and even less about the problems of improving instruction in schools.

The idea of performance-based accountability was introduced in the mid-1980s by the National Governors Association, led by Bill Clinton, then governor of Arkansas. It took the form of what was then called the "horse trade": states would grant schools and districts more flexibility in making decisions about what and how to teach, in return for more accountability for academic performance. This idea became the central theory of today's accountability reforms. It was appealing in principle: governors and state legislators could take credit for improving schools without committing themselves to serious increases in funding. From the beginning, performance-based accountability was an explicitly political idea, designed to bring a broad coalition together behind a single vision of reform. As with most such ideas, it was weak on practical details, most of which were left to state and local policymakers and educators.

The movement got a major boost in 1994, when Title I, the flagship federal compensatory education program, was amended to require states to create performance-based accountability systems for schools. The vision behind the 1994 amendments was that Title I would complement and accelerate the trend that began at the state level; the amendments required states to develop academic standards, assessments based on the standards, and progress goals for schools and school districts, all within ambitious timetables. The merger of state and federal accountability policies ("alignment," as it was called) was supposed to occur by 2000. By the end of the decade, it was difficult to find more than one or two states lacking some form of testing program and public release of the results. In all but a few states, however, the basic architecture of accountability remained relatively crude and underdeveloped. In those few states where the idea had been developed most extensively (Texas and Kentucky, for example), the systems worked well enough, according to the testimonials of their sponsors, to legitimate the idea that they were successful in general. But even in these states, there were legitimate criticisms of the accountability system's actual effect on academic performance and drop-out rates.

By the late 1990s, it was abundantly clear that the states had fallen well short of what the crafters of the 1994 Title I amendments had envisioned. It was also clear that the federal government possessed very little leverage with which to force them along. States varied vastly in their administrative capacities to implement performance-based accountability systems. More important, creating accountability systems at the state level is essentially a political act, and Washington's harmless knuckle-rapping was hardly going to overcome the intransigence of a state legislature or governor. The U.S. Department of Education's ability to monitor and enforce compliance was limited; budget cuts whittled away at the Department's Title I staff just as their responsibilities were increasing; and its senior political appointees were reluctant to make life too difficult for governors and chief state school officers, who are among their key political constituencies. So by the target date for full compliance, fewer than half the states had met the requirements. It came as no surprise to learn that by the year 2000, many schools with Title I-eligible students were simply unaware of the program's major policy shift in 1994.

This experience should have signaled to the Bush administration and Congress that complex issues of state and local capacity could not be brushed aside just by tightening the existing law's requirements. If more than half the states were unable or unwilling to comply with the requirements of the previous, less-stringent, more forgiving law, why would one expect all the states to comply with a much more stringent and exacting law?

Even though virtually all the states have joined the accountability bandwagon, doing so was, for many, largely a symbolic act. The designs of the systems are still primitive; state education officials' authority to oversee school districts is still limited in many cases; and the political consequences of imposing large-scale, statewide testing in areas with strong traditions of local control are risky. Moreover, mounting a statewide testing system is beyond the capacity of most state departments of education. Those that have embarked on large-scale testing are stretched to their limits just managing test-development work or monitoring testing contractors. Finally, there are technical issues. Standardized tests inevitably become highly politicized and, in the course of the debate, the limits of testing are subjected to public scrutiny. Many policymakers enter the accountability debate not knowing much about testing, and they often discover, much to their chagrin, that off-the-shelf tests may not validly measure the content specified in state-mandated standards and that norm-referenced tests (tests that deliberately create a normal distribution around a mean) may not be effective in measuring changes in performance.

The working theory behind test-based accountability seems simple, perhaps fatally so. Students take tests that measure their academic performance in various subject areas. The results trigger certain consequences for students and schools: rewards, in the case of high performance, and sanctions for poor performance. Attaching stakes to test scores is supposed to create incentives for students and teachers to work harder and for school and district administrators to do a better job of monitoring their performance. If students, teachers, or schools are chronically low performing, presumably something more must be done: students must be denied diplomas or held back a grade; teachers or principals must be sanctioned or dismissed; and failing schools must be fixed or simply closed. The threat of such measures is supposed to motivate students and schools to ever-higher levels of achievement.

In fact, this is a naïve view of what it takes to improve student learning. Fundamentally, internal accountability must precede external accountability. That is, school personnel must share a coherent, explicit set of norms and expectations about what a good school looks like before they can use signals from the outside to improve student learning. Giving test results to an incoherent, atomized, badly run school doesn't automatically make it a better school. A school's ability to make improvements has to do with the beliefs and practices that people in the organization share, not with the kind of information they receive about their performance. Low-performing schools aren't coherent enough to respond to external demands for accountability.

The work of turning a school around entails improving "capacity" (the knowledge and skills of teachers), changing their command of content and how to teach it, and helping them to understand where their students are in their academic development. Low-performing schools, and the people who work in them, don't know what to do. If they did, they would be doing it already. You can't improve a school's performance, or that of any teacher or student in it, without increasing the investment in teachers' knowledge, pedagogical skills, and understanding of students. Test scores don't tell us much of anything about these important domains; they provide a composite, undifferentiated signal about students' responses to a problem.

Test-based accountability without substantial investments in internal accountability and instructional improvement is unlikely to elicit better performance from low-performing students and schools. Furthermore, the increased pressure of test-based accountability alone is likely to aggravate the existing inequalities between low-performing and high-performing schools and students. Most high-performing schools simply reflect the social capital of their students (they are primarily schools with students of high socioeconomic status), rather than the internal capacity of the schools themselves. Most low-performing schools cannot rely on the social capital of students and families and instead must rely on their organizational capacity. With little or no investment in capacity, low-performing schools get worse relative to high-performing schools.

Some changes in the new law provide unrestricted money that states can use to enhance capacity in schools, if they choose to. But neither state nor federal policy addresses the capacity issue with anything like the intensity applied to test-based accountability. The result is an enormous distortion in the relationship between accountability and capacity, a distortion that is being amplified rather than dampened by federal policy.

In today's environment, critics who suggest that there might be problems with the ways tests are used for accountability purposes are branded apologists for a broken system. That the performance of students and schools can be accurately and reliably measured by test scores is almost an article of faith. As a result, tests are being misused in ways that will eventually undermine the credibility of performance-based accountability systems.

The most serious problem lies in the use of test scores to make decisions about whether students can advance to the next grade or graduate from high school. The American Psychological Association's guidelines for test use (and the consensus of professional judgment in the field of educational testing and measurement) specifically prohibit basing any consequential judgment about an individual student on a single test score. Why? Because test scores are associated with a significant margin of error. That margin of error increases as the number of cases decreases; individual scores are typically much less reliable than aggregates of many individual scores.
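
The relationship between the number of cases and the margin of error can be put in rough textbook terms. This is only an illustrative sketch, not a model of any particular test: it assumes that each observed score is a true score plus an independent error, and the symbols (n for the number of scores averaged, sigma_e for the standard deviation of the error in one score) are introduced here for illustration.

```latex
% Assumption: independent errors with a common standard deviation \sigma_e.
% Standard error of the average of n scores:
\mathrm{SE}(\bar{x}) = \frac{\sigma_e}{\sqrt{n}}
% A single score (n = 1) carries the full error \sigma_e;
% an average over n = 100 scores carries one-tenth of it.
```

Under these assumptions the error in an average shrinks only with the square root of the number of scores, which is why a judgment resting on one student's single score is the least reliable case.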

The solution is to use multiple measures of a student's performance when making consequential decisions. But this solution is more expensive and it introduces a new level of complexity into the system. Were high-school graduation to be contingent on a composite of grades, test scores, and portfolios of students' work, developing such a composite would be a challenging technical feat. It would also introduce a certain amount of judgment into the system, and policymakers tend to distrust the professionals who make such judgments.

A similar problem arises at the lower-school level. Under Title I, schools are expected to meet their adequate yearly progress goals, measured by a school's annual gain in test scores. Title I also requires disaggregating these scores by students' ethnic and economic backgrounds. But such measures are highly unreliable for populations the size of a typical elementary school, and they are particularly unreliable for even smaller sub-groups of students. Schools are often misclassified as low- or high-performing purely because of random variation in their test scores, unrelated to any educational factor.
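
A small simulation can make the misclassification point concrete. The sketch below is hypothetical throughout: every simulated school has the same true average, so any school that falls below the cutoff does so by chance alone. The function name `false_flag_rate`, the score scale (mean 50, standard deviation 10), the cutoff of 48, and the group sizes are all assumptions chosen for illustration, not figures from Title I or from any state's data.

```python
import random


def false_flag_rate(group_size, n_schools=1000, true_mean=50.0,
                    student_sd=10.0, cutoff=48.0, seed=0):
    """Share of identical schools flagged as low-performing by chance.

    All numbers are illustrative assumptions. Every school has the same
    true mean, so a school is flagged only because of random variation
    in the scores of the students who happen to be tested.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    flagged = 0
    for _ in range(n_schools):
        # Draw one score per tested student from the same distribution.
        scores = [rng.gauss(true_mean, student_sd) for _ in range(group_size)]
        observed_mean = sum(scores) / group_size
        if observed_mean < cutoff:
            flagged += 1
    return flagged / n_schools


if __name__ == "__main__":
    # Hypothetical sizes: a small subgroup, a single grade, a large school.
    for size in (15, 60, 400):
        print(f"group of {size:>3}: flagged by chance {false_flag_rate(size):.1%}")
```

Under these assumptions the smallest groups are flagged far more often than the largest, even though nothing about the quality of the schools differs. Real score data are messier than this model, so the output should be read as a direction of effect, not an estimate.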

The standards and accountability movement is in danger of being transformed into the testing and accountability movement. States without the human and financial resources to select, administer, and monitor tests are now being forced to begin testing at all grade levels. Instead of creating academic standards that drive the design of an appropriate assessment, low-capacity states will simply select a test based on its expense and ease of administration, making charges of "teaching to the test" increasingly accurate. A test with no external anchor in standards or expectations about student learning becomes a curriculum in itself, trivializing the whole idea of accountability.

The enthusiasm for performance-based accountability plays to the worst weaknesses of the American education system. After World War II, most industrialized countries nationalized their education systems, but not the United States. Because decisions about content and performance were left to states and localities for so long, they never developed the capacity to monitor the quality of teaching and learning in schools, to support the development of teachers' and administrators' knowledge and skill, or to evolve measures of performance that are useful to educators and the public.

The difficult, uneven, and protracted slog toward clearer expectations and supports for learning has barely begun in most states and localities. The history of federal involvement in that long effort is mixed at best. The current law repeats all of the strategic errors of the previous law, but with greater federal intervention. The prognosis is not good.

The best we can hope for is that the capacity problems of states and localities will become more visible as a political issue, triggering responses that will help schools overcome the real obstacles they face in improving the quality and intensity of teaching and learning. Similarly, we can hope that the technical failures of testing will trigger a response that focuses more on broad assessments of student learning.

The worst that can happen is that test-based accountability will widen the gap between schools serving the well-off and those serving the poor, thus confirming the public's suspicion that expecting high levels of learning from all children is unrealistic. Performance-based accountability in education is mutating into a caricature of itself.

Richard F. Elmore, Ed.D. '76, Anrig professor of educational leadership at the Harvard Graduate School of Education, is completing a study of school accountability. Recent publications include "Building a New Structure for School Leadership" and "Bridging the Gap between Standards and Achievement," both available from www.shankerinstitute.org. This article is adapted with permission from an earlier version, titled "Unwarranted Intrusion," which appeared in the Spring 2002 issue of Education Next (www.educationnext.org), published by the Hoover Institution, Stanford University.

Published in the September-October 2002 print issue in the Features section.