Validity and Reliability in Performance Assessment

Teaching has been characterized as "holistic, multidimensional, and ever-changing; it is not a single, fixed phenomenon waiting to be discovered, observed, and measured" (Merriam, 1988, p. 167). As such, any attempt to assess the performance of a teacher necessarily will be fraught with difficulties. Nonetheless, a means of assessing teaching practice accurately must be established if the assessment is to be both believed and trusted.

In this course you will be asked to assess the quality of the teaching practice you observe in area high school physics classrooms. As a teacher candidate, you can expect to have your own teaching performance similarly evaluated as you teach micro lessons and move through student teaching. Fair and accurate assessment of preservice teacher practice is very important because it allows for formative assessment of teaching practice that can lead to improvement in practice, provides quality assurance to those who would hire the teacher candidate some time in the future, and offers a means of ensuring accountability for both teacher training specialists and prospective teachers.

In this course you will learn about established standards for exemplary teaching practice, and how teacher assessment works using these standards as a basis. To help you understand standards-based assessment practices, you will learn how to perform assessments of videotaped micro lessons and inservice teachers. It is hoped that in doing so, you will internalize these standards of quality teaching practice and eventually exhibit them in your own. In addition to learning about standards, you also will learn about appropriate indicators that can be used to assess teaching practice both validly and reliably.

Assessment of teacher practice must be both valid and reliable if it is to be believed and trusted. Validity concerns whether one assesses what one claims or intends to assess. It deals with whether or not an assessor's findings correspond to some form of objective reality. The data collected during an assessment must in some way accurately reflect the actions being assessed. To the extent that this is so, the assessment is valid. Reliability concerns whether the findings can be replicated, either by the same observer watching similar teaching practice or by a second observer viewing the same teaching practice as the first. If an assessment practice is reliable, then both assessors should arrive at approximately the same score. To the extent that the assessors agree in their scoring, the assessment is reliable.

Validity does not ensure reliability, and reliability does not ensure validity. For instance, an assessment can be valid but lack reliability, and vice versa. An analogy might help one understand the meaning of and relationship between validity and reliability. Consider a set of targets, and someone who shoots ten bullets at each target. In this analogy the tightness of the clustering corresponds to reliability (consistency of aim), and the centering of the cluster corresponds to validity (closeness to the mark). In Figure 1 the bullet holes are clustered but off center; the targeting is repeatable, just not accurate. In Figure 2 the bullet holes are scattered all around the target, but on average they are centered; such targeting is accurate on average, but individual shots fall wide of the mark. In Figure 3 the ten bullet holes are to one side of the target and are scattered; such targeting is neither accurate nor repeatable. In Figure 4 the bullet holes are tightly clustered and centered on the bull's-eye; such targeting is both accurate and repeatable. So it is with good assessment of teaching practice.

Figure 1. Not valid, reliable. (Not accurate, but repeatable)

Figure 2. Valid, not reliable. (Accurate on average, not repeatable)

Figure 3. Not valid, not reliable. (Not accurate, not repeatable)

Figure 4. Valid and reliable. (Accurate and repeatable)
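
The target analogy can also be put into numbers. The following is only an illustrative sketch, not part of the original analogy: it assumes the bull's-eye sits at the point (0, 0), treats the distance from the bull's-eye to the average hit as a stand-in for (lack of) validity, and treats the scatter of the hits about their own average as a stand-in for (lack of) reliability. The function names summarize_shots and simulate_shots, and the simulated centers and scatter values, are hypothetical choices made for this example.

```python
import math
import random


def summarize_shots(shots):
    """Return (bias, spread) for a list of (x, y) hits; bull's-eye assumed at (0, 0).

    bias   -- distance from the bull's-eye to the average hit (small = "valid")
    spread -- root-mean-square distance of hits from their own average
              (small = "reliable")
    """
    n = len(shots)
    mean_x = sum(x for x, _ in shots) / n
    mean_y = sum(y for _, y in shots) / n
    bias = math.hypot(mean_x, mean_y)
    spread = math.sqrt(
        sum((x - mean_x) ** 2 + (y - mean_y) ** 2 for x, y in shots) / n
    )
    return bias, spread


def simulate_shots(center, scatter, n=10, seed=0):
    """Draw n hypothetical hits around `center` with Gaussian scatter."""
    rng = random.Random(seed)
    return [
        (rng.gauss(center[0], scatter), rng.gauss(center[1], scatter))
        for _ in range(n)
    ]


if __name__ == "__main__":
    # Illustrative (made-up) settings loosely matching the four figures.
    cases = {
        "Figure 1 (not valid, reliable)": ((3.0, 3.0), 0.2),
        "Figure 2 (valid, not reliable)": ((0.0, 0.0), 3.0),
        "Figure 3 (not valid, not reliable)": ((3.0, 3.0), 3.0),
        "Figure 4 (valid and reliable)": ((0.0, 0.0), 0.2),
    }
    for label, (center, scatter) in cases.items():
        bias, spread = summarize_shots(simulate_shots(center, scatter))
        print(f"{label}: bias={bias:.2f}, spread={spread:.2f}")
```

Keeping the two numbers separate mirrors the point of the analogy: a small spread says nothing about the bias, and a small bias says nothing about the spread.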

To carry the explanation further, consider two assessors who observe a teacher's practice in order to assess classroom atmosphere. To one assessor the students, though noisily engaged in pertinent discussion, appear to be enthralled with an intriguing investigation; the students are seen as independent and self-directed, and the lesson is student-centered. To the other assessor the students are unruly, not sitting in their seats, and talking loudly; they appear to be having a grand time, but they are not learning anything. In this second assessor's view, the teacher isn't doing his job and the students appear to be completely in charge.

What we have here is a problem with interrater reliability. The assessment procedure is most probably unstructured and based upon the personal preferences of the raters. The teacher being assessed may well be at the mercy of subjective interpretations of student behavior -- not objective standards of performance. At least one of the raters appears to be untrained. Such conflicting assessments could not be guaranteed to be either valid or reliable, even though one of them might be right on target.
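
Interrater agreement of the kind discussed above can be quantified. The sketch below is offered only as an illustration under stated assumptions: it supposes that two raters have each scored the same set of lessons on a categorical rubric, and it computes simple percent agreement along with Cohen's kappa, a commonly used statistic that corrects agreement for chance. Cohen's kappa is not mentioned in the source text and is brought in here as one possible measure; the function names and the scores are made up for the example.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Fraction of items on which the two raters gave the same score."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of items.")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e).

    p_o is the observed agreement; p_e is the agreement expected by chance,
    computed from how often each rater used each score category.
    """
    n = len(rater_a)
    p_o = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = counts_a.keys() | counts_b.keys()
    p_e = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if p_e == 1.0:
        # Both raters used a single identical category for every item.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Hypothetical rubric scores (1 = weak, 3 = strong) for eight lessons.
    rater_a = [3, 2, 3, 1, 2, 3, 2, 1]
    rater_b = [3, 2, 2, 1, 2, 3, 1, 1]
    print("Percent agreement:", percent_agreement(rater_a, rater_b))
    print("Cohen's kappa:", round(cohens_kappa(rater_a, rater_b), 3))
```

Note that high agreement by either measure speaks only to reliability; two raters who share the same mistaken standard could agree closely and still produce an assessment that is not valid.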

Using observable indicators of performance and trained assessors is the only way to ensure valid and reliable assessment of teacher practice. Following a discussion of a number of teaching standards in this course (primarily INTASC standards and the Illinois Professional Teaching Standards), students will formulate a grading rubric that will be used objectively to assess teaching performance in a way that is (hopefully) both valid and reliable.

By Carl Wenning and based on the following book: Merriam, S. B. (1988). Case Study Research in Education: A Qualitative Approach. San Francisco: Jossey-Bass Publishers.