Exploring the Reliability and Validity of Pilot Teacher Ratings in a Large California School District

Abstract

Many states and school districts have recently instituted revamped teacher evaluation policies in response to incentives from the federal government and a changing political climate that favors holding teachers accountable for their students' performance. Many of these overhauls have mandated the incorporation of multiple performance indicators -- often including rubric-based classroom observation scores, estimated contributions to student test score outcomes, and surveys of students and parents -- into teacher evaluations. This three-paper dissertation explores the pilot implementation of a new standards-based, multiple-measure teacher evaluation system in a large California school district in 2011/12.

It examines both participants' views about the new system (particularly the challenges they faced and the early outcomes they felt were achieved) and the reliability and validity of the teacher observation ratings produced during pilot implementation.

Results indicated that this self-selected group of pilot teachers and administrators generally appreciated the district's new teaching framework and its pre/post-observation conferencing process. Participants also tended to report that certain key early outcomes were achieved, including increased reflection by teachers about their performance against the new teaching framework and a better understanding of teachers' individual needs for instructional support (although a higher proportion of administrators than teachers reported that this latter outcome was achieved). Time constraints, staffing shortfalls, and technology problems were all key challenges cited by both teachers and administrators during the pilot year.

Analyses of ratings for the small sample of participating teachers who received a complete set of observational focus element (item) scores from both of their raters across both observation cycles indicated that these teachers tended to be scored higher during the second cycle, although such improvement was not universal. Across cycles, the scores from second raters (who typically did not work at the school site) tended to be slightly lower than those awarded by the teachers' supervising site administrators, but ultimately, agreement between the primary and second raters who scored common teachers was good. Generalizability analyses indicated that approximately two-thirds of the variation in participating pilot teachers' total scores was attributable to systematic differences among teachers, while the variability associated with the observation cycle (approximately 25 percent) was larger than that associated with the rater group (approximately 6 percent). These results were then used to forecast reliability coefficients under different combinations of rater groups and observed lessons (cycles); based solely on pilot implementation and results from this particular analysis sample, these forecasts suggested that varying the number of observations influenced reliability estimates far more than varying the number of observers.
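To make the forecasting logic concrete, a minimal sketch follows, assuming the standard generalizability-theory decision (D) study setup for a teacher (t) by observation cycle (c) by rater group (r) design; the variance-component labels here are illustrative placeholders rather than the estimates reported in the dissertation. Under that setup, the projected generalizability coefficient for a design with n_c cycles and n_r rater groups takes the form

E\hat{\rho}^2(n_c, n_r) = \hat{\sigma}^2_{t} \big/ \left( \hat{\sigma}^2_{t} + \hat{\sigma}^2_{tc}/n_c + \hat{\sigma}^2_{tr}/n_r + \hat{\sigma}^2_{tcr,e}/(n_c n_r) \right)

Because the cycle-related error components are divided only by n_c, adding observation cycles shrinks error variance faster than adding rater groups whenever the cycle-related components are the larger ones, which is consistent with the pattern of forecasts described above.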

Finally, the participating pilot teachers who completed end-of-year surveys generally felt that the observations of their practice conducted during the pilot year represented a valid measure of their effectiveness, and pilot teachers' classroom observation-based ratings were not related to their ethnicity or to the grade span they taught (factors that should, in theory, be unrelated to performance). Low to moderate correlations were evident between pilot teachers' classroom observation-based ratings and both their student survey ratings and their value-added scores for the 2011/12 year.

The uniqueness of this pilot context restricts the generalizability of these findings, however. The pilot consisted primarily of volunteers, and there was attrition during the pilot year -- approximately one-third of the teachers trained in fall 2011 never had any ratings entered online by an observer. In turn, the final pilot sample comprised a self-selected group of experienced, mostly elementary school teachers whom administrators from case study sites tended to characterize as particularly hard-working and high-performing. Moreover, our research team's limited capacity for qualitative data collection in spring 2012 (we were able to visit only five participating schools) and our low survey response rates (52 percent for teachers and 54 percent for administrators) also limit our ability to generalize findings more broadly. We did not hear the perspectives of those who dropped out of the pilot. Finally, the tools and processes under study were still being revised and fine-tuned by the district during the pilot year; observers were still learning the tools, and teachers and administrators were just becoming familiar with the processes and measures. All told, these results likely do not reflect what will be found in any eventual full-scale rollout.
