The Widget Effect, published in 2009, described the state of teacher evaluation in the United States. It found that, under the evaluation systems in use at the time, virtually all teachers were rated satisfactory (less than 1% were rated unsatisfactory). The report also concluded that truly excellent teaching went unrecognized, that professional development was not connected to evaluations, and that poor performance went unaddressed. Despite the overwhelmingly positive ratings teachers were receiving, 57% of teachers and 81% of administrators reported that there were poor teachers in their school. These findings pointed, we were told, toward the need for a new system of teacher evaluation. Results similar to the report's findings were published for New York and discussed in this post.
Recently, Kraft and Gilmour explored the ratings that teachers received and found that a similar gap persists. Digging deeper, through data they collected and interviews they conducted, they found that evaluators of teachers actually expect such a gap before the evaluation process even begins. In effect, it’s just part of the flawed system. A variety of reasons were offered to explain the inflated ratings:
- Evaluators are factoring in teacher potential when determining a rating
- Evaluators are uncomfortable with the assignment of low ratings
- Evaluators have a persistent lack of faith in the overall system
- Evaluators disagree with the underlying foundation of the system
- Evaluators reason that a teacher’s performance could always be worse
- District-determined scoring systems are skewed toward higher ratings
Basically, the rating system is inaccurate for explainable and understandable reasons. Whether the teacher evaluation system as it exists now should be maintained is a matter for a different post and a different discussion. The purpose of this post is to offer food for thought.
Along those lines, we did a little data collection of our own to see whether the discrepancy that Kraft & Gilmour reported exists in our area. We also compared the NYSUT and FFT rubrics to see whether the choice of rubric made a difference. Our results (based on data collected from 109 lead evaluators):
[Table: Difference between ratings derived from the system compared to evaluators’ overall assessment]
As you can see, the reported inflation of the highly effective rating was similar for both rubrics: approximately one-sixth more teachers were rated highly effective by the system than by the evaluators’ overall assessment. On the other end, fewer teachers were rated at the lower end of the scale by the system than by the evaluators’ overall judgment. These data suggest that APPR ratings need to be taken with a grain of salt and a dose of skepticism. Whether this is different under §3012-d remains to be seen. As attention turns to the development of a new system for APPR, perhaps the persistent “Widget Effect” should be taken into consideration as we realize that scores and overall rating systems like these are doomed to fail. We need evaluation, absolutely. But the emphasis on overall ratings and scores is misguided.