Evaluation Design and Tools
Edited by Gilbert Valdez, PhD, Director of the Teaching, Learning, and Curriculum Center at the North Central Regional Educational Laboratory
Neff Walker of the Georgia Institute of Technology (Georgia Tech) created an evaluation design for the SUCCEED program that is relevant and informative and can serve as a model for developing an evaluation design and process. Walker noted that the purpose of his primer was to help SUCCEED project investigators develop and carry out sound evaluations. The full text of the document is at http://succeed.ee.vt.edu/walker.html
Walker's evaluation design model is intended to help answer three basic questions:
1. Why should you evaluate your project?
2. What should you evaluate?
3. How should you evaluate?
In an effort to make the primer as practical and useful as possible, he divided the material into a series of eight sequential evaluation steps. Those steps represent a series of decisions and actions to be taken by SUCCEED investigators, each building on the one before. Each section of the primer addresses one step and includes three parts. First, the basic evaluation concepts and techniques are briefly explained. Second, an example is provided, illustrating the application of evaluation to an engineering education project. Finally, each section closes with a series of recommended action steps that can serve as guidelines for project investigators.
Because the original document was designed to evaluate college engineering courses, Walker's work has been edited and excerpted so that the design is more generic and appropriate for K-12 settings. People interested in evaluation are encouraged to view the original work so as to better capture Walker's original intent.
Step 1: Defining the Purpose of the Evaluation
One common misconception about evaluation is that it is something done after a project has been implemented, by an outside group who then judges whether or not the project was effective. While this scenario is often true in practice, it is not the only or necessarily the right way to conduct an evaluation. A good evaluation plan should be developed before a project is implemented, should be designed in conjunction with the project investigators, and should ensure that the evaluation serves two broad purposes. First, evaluation activities should provide information to guide the redesign and improvement of the intervention. Second, evaluation activities should provide information that can be used by both the project investigators and other interested parties to decide whether or not they should implement the intervention on a wider scale.
These two purposes correspond with two broad types of evaluation: formative and summative. The goal of formative evaluation is to improve an intervention or project. The goal of summative evaluation is to judge the effectiveness, efficiency, or cost of an intervention.
The purpose of formative evaluation is to provide information to the project team so that their intervention can be modified and improved. It focuses on whether the intervention is being carried out as planned. Formative evaluation activities can include materials and software development and beta testing, focus groups to assess students' attitudes and responses to aspects of intervention design and materials, and experimental studies to determine the effect of specific design characteristics on students' mastery and retention of concepts and skills. While some of these activities also yield data related to intervention effectiveness, their primary goal is to provide information for intervention improvement.
The purpose of summative evaluation is to produce information that can be used to make decisions about the overall success of the intervention. There are three specific and sequential types of summative evaluation questions that should be addressed for any intervention:
The use of a staggered approach to summative evaluation should allow one to identify and address operational difficulties in the use of the intervention. Too often, summative evaluations simply measure efficacy. If an intervention is to go beyond being a simple "pilot project," the investigators must also evaluate intervention effectiveness and cost.
Action Steps for Investigators:
Step 2: Clarify Project Objectives
A prerequisite for evaluation is the development of a project plan with measurable objectives that are logically related to one another and to the goals and interventions defined in the project proposal. All objectives should specify what is to be done and by when. There are three types of objectives: impact, outcome, and process. Impact objectives should focus on changes in the long-term performance of students that are expected to result from project activities, and should correspond to the priority goal of the project as stated in the project proposal. Outcome objectives should focus on changes in knowledge, attitudes, behaviors, or availability of educational programs or supports that result from project activities, and should be directly related to the intervention's target population. Process objectives specify the actions needed for project implementation and should correspond to the various activities (development of written or computer software materials, peer education sessions, placements in internships, training of educators, etc.) necessary to achieve the intended outcomes and impact.
Action Steps for Investigators:
Step 3: Create a Model of Change
A model of change clarifies underlying assumptions about how the proposed intervention will lead to the expected outcomes and goals of the intervention. While this concept sounds like a simple one, it is often the weakest element of an evaluation plan. Development of a clear and correct model of change is the most critical step in the development of a sound evaluation plan.
What is a model of change? A model of change refers to the specific set of relationships that one believes connects the intervention to the achievement of the impact objectives of the project. The model should specify how the proposed interventions will lead to these goals.
Consider a project that replaces traditional lectures in a chemistry course with a multimedia system. A simple model of change for this project might begin with the assumption that multimedia methods are more effective for presenting knowledge than didactic lectures. Because multimedia methods are more effective, students will learn more, retain more, and, therefore, will have a higher probability of passing the course.
If this model reflects the assumptions underlying the proposed intervention and how it leads to achievement of the project goals, investigators should try to assess each of the proposed links in the model of change. For example, do the students who use the multimedia system learn more than those taught by the traditional lecture system? Does use of multimedia result in a higher percentage of students passing the course? Does it result in students liking chemistry more?
It could be that the intervention does increase learning (let's say that the students develop better conceptual knowledge of chemistry due to use of interactive simulations, as reflected in their laboratory worksheets) but this knowledge may not lead to a higher percentage of students passing the course. This "failure to pass" could occur because the course grade is based on a curve or because the exams do not tap this increased conceptual understanding. Alternatively, students could learn more and perform better in the course but still choose to drop out of chemistry because, even when passing the course, they do not like chemistry more. Or it could be that they like chemistry so much after participating in the multimedia intervention that they decide to take more chemistry.
The important point here is that the set of relationships theorized to exist between the intervention and the goals of the project must be clearly defined. To the extent possible, each of the defined relationships should then be measured as part of the evaluation plan, allowing you to determine why and how the project either succeeded in reaching its goals or failed to do so. The more specific you are in developing your model of change, the more useful the information generated by the evaluation will be.
Of course, few projects have sufficient resources to assess all assumptions. They must choose which of the relationships that exist in their model to test. These choices should be based on:
1) What can be measured well, given available resources
2) Where problems can be anticipated
3) Where investigators have control and can improve the intervention or project based on the results
Action Steps for Investigators:
Step 4: Select Criteria and Indicators
Once measurable objectives and priority assumptions have been defined, investigators can make plans for evaluation based on specific criteria and indicators. Criteria are technical standards that can be used as the basis for making judgments about the quality of a curriculum, intervention, or other project component. For example, criteria for a curriculum might include whether it has measurable learning objectives or the quality of support and training provided to educators in the use of participatory learning methods.
Indicators are quantified measurements that can be repeated over time to track progress toward the achievement of objectives. Most indicators are expressed as rates or proportions, and include a numeric numerator and denominator. Selection of indicators should be based on their:
In addition, only those indicators that can be measured with available project resources should be selected.
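As a minimal sketch of an indicator expressed as a proportion and tracked over time, the Python below stores a numerator and a denominator for each measurement period. The class name, the period labels, and all numbers are hypothetical and serve only to illustrate the idea; they are not taken from Walker's primer.

```python
from dataclasses import dataclass


@dataclass
class IndicatorReading:
    """One measurement of an indicator at a point in time (hypothetical structure)."""
    period: str        # label for the measurement period
    numerator: int     # e.g., number of students who passed the course
    denominator: int   # e.g., number of students enrolled in the course

    def proportion(self) -> float:
        if self.denominator <= 0:
            raise ValueError("denominator must be positive")
        return self.numerator / self.denominator


# Illustrative numbers only, not data from any project.
readings = [
    IndicatorReading("Term 1", 42, 60),
    IndicatorReading("Term 2", 51, 60),
]

# Repeating the same measurement lets you track progress toward an objective.
pass_rates = [round(r.proportion(), 2) for r in readings]
change = round(pass_rates[1] - pass_rates[0], 2)
print(pass_rates, change)
```

Keeping the numerator and denominator separate, rather than storing only the rate, makes it easier to check that the same population is being counted each time the indicator is measured.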
Step 5: Identify Data Sources and Define How Often Indicators Will Be Measured
Once criteria and indicators have been defined, investigators must identify the best sources of data and determine how often these variables will be measured. Reports and records collected routinely by project or institutional personnel, such as class attendance reports, graduation records, SAT scores, or student performance on examinations, can be important sources of evaluation data if they are of sufficient accuracy. Where such data do not exist or are not accurate, special studies or audits may be necessary. Investigators should also explore whether data collected for other purposes or projects may be available and appropriate for use in evaluating activities. For example, student course evaluations conducted for other educational purposes may provide an opportunity to obtain data specific to project activities.
Investigators must also define how often indicators will be measured. Considerations include:
Step 6: Design Evaluation Research
The key to a good evaluation plan is the design of the study or studies to answer the evaluation questions. There are many possible research designs and plans. Your objective should be to maximize the reliability and the validity of your evaluation results.
Reliability refers to the consistency or dependability of the data. The idea is simple: if the same test, questionnaire, or evaluation procedure is used a second time, or by a different research team, would it obtain the same results? If so, the test is reliable. In any evaluation or research design, the data collected are useful only if the measures used are reliable.
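One common way to quantify this kind of consistency, although the primer does not prescribe it, is to give the same test twice and correlate the two sets of scores (test-retest reliability). The sketch below assumes that approach; the function name and the scores are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores must vary in both administrations")
    return cov / sqrt(var_x * var_y)


# Illustrative scores for the same five students on two administrations.
first_administration = [70, 82, 65, 90, 75]
second_administration = [72, 80, 68, 91, 74]

# A value close to 1 suggests the test ranks students consistently.
test_retest_r = pearson_r(first_administration, second_administration)
print(round(test_retest_r, 2))
```

A high correlation speaks only to reliability. It says nothing about whether the test measures what it claims to measure.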
Validity refers to the extent to which the questions or procedures actually measure what they claim to measure. Another way to say this is that valid data are not only reliable, but are also true and accurate. Measures used to collect data about a variable in your evaluation study must be both reliable and valid if the overall evaluation is to produce useful data.
Investigators should select a research design that controls as many threats to validity as possible. Of course, few studies can control completely for all threats, and investigators are often constrained by cost, availability of subjects, or other factors that preclude the optimal study design. However, the key is to systematically assess possible designs based on the various threats to validity, and select the design that is most valid given other constraints. Below we will give a brief overview of three of the major threats to validity in evaluation research designs, followed by an overview of qualitative and quantitative research methods.
Common Threats to Validity
Selection. A common threat to validity occurs when the people selected for the experimental group are different from those in the comparison group. For example, suppose you want to determine if tutorial sessions will improve course performance. In seeking to answer this question, you ask for volunteers from the class to participate in the tutorial sessions and then compare their performance in the course to that of the students who did not volunteer. The question, however, is whether the two groups of students are alike in all characteristics except for participation in the tutorial session. Perhaps better students (or more motivated students) volunteer for the extra work. Any differences in course performance may be due simply to the selection bias introduced through asking students to volunteer rather than randomly assigning students to the tutorial group. Investigators need to ensure that the students in all the groups being compared on course or test performance are equal in all the characteristics that may affect performance (e.g., knowledge, skills, motivation). If this is not possible, some differences can be addressed through statistical analysis.
Mortality. Mortality refers to the differential loss of students from an intervention as compared to the usual treatment group, resulting in differences between the students in the groups at the time of testing. For example, one could assign students to one of two groups: one group spends an extra hour each week solving problems while the other has small, one-hour discussion groups weekly. It could be that more students would drop out of the problem group than the discussion group, especially those with less motivation. If this occurs, one could end up with differences between the students in the two groups that could be the source of any differences in performance.
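A simple first check for this threat is to compare dropout rates between the groups. The sketch below uses the problem-solving and discussion groups from the example; the function name and the enrollment figures are hypothetical.

```python
def attrition_rate(enrolled: int, completed: int) -> float:
    """Share of enrolled students who were lost before the final test."""
    if enrolled <= 0 or not 0 <= completed <= enrolled:
        raise ValueError("need enrolled > 0 and 0 <= completed <= enrolled")
    return (enrolled - completed) / enrolled


# Illustrative counts only.
problem_group_rate = attrition_rate(enrolled=30, completed=21)
discussion_group_rate = attrition_rate(enrolled=30, completed=27)

# A large gap signals that the groups may no longer be comparable at testing.
differential_attrition = problem_group_rate - discussion_group_rate
print(round(differential_attrition, 2))
```

The rates alone do not show who dropped out. If the students lost from one group differ systematically from those who stayed, for example in motivation, the remaining groups can differ even when the rates look similar.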
Hawthorne Effect. The Hawthorne Effect, while not normally described as a threat to validity, is one issue that evaluators of educational interventions must consider. The Hawthorne Effect can best be explained by relating it to the concept of placebo effects. It has been shown that when people believe they are being given an effective treatment, whether for a psychological or physical illness, they tend to improve even if the treatment is simply a sugar pill. People begin to feel or perform better because of increased motivation or self-confidence. The Hawthorne Effect is similar. It states that when one introduces a new method of performing a task and participants know that it is part of an effort to improve performance, there is a temporary gain in performance, even if the new method is no better (or even worse) than the old way of doing things. The explanation for this phenomenon is that when people are told a new system will improve their performance and when they know they are being watched or evaluated, they tend to increase their effort and motivation, which results in better performance. However, this increase in performance is only temporary. The Hawthorne Effect can seriously affect the validity of evaluation results, particularly if you are evaluating a new educational intervention.
Sound evaluation plans include study designs that control for these threats to validity. In the following section we will provide an overview of various research designs.
Qualitative Research. Some evaluation questions address issues that are not easily quantified. Particularly in formative research, investigators may be interested in faculty or student attitudes about an intervention or approach, their ideas about how it could be improved, or their explanations about why they performed in a particular way. Qualitative research can help investigators understand these issues.
Qualitative research must be undertaken with the same level of methodological rigor as quantitative research. Indeed, for investigators without previous experience, we recommend that they identify an experienced qualitative researcher to provide technical assistance.
Qualitative methods that may be particularly useful include the following:
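Quantitative Research. There are three broad classes of quantitative research designs: non-experimental designs, experimental designs, and quasi-experimental designs. In describing these designs, we will use the notation developed by Campbell and Stanley (1963). In that notation, O denotes an observation or measurement (numbered in time order, as in O1, O2), X denotes exposure of a group to the intervention, and R denotes random assignment of subjects to groups.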
Non-experimental designs. Non-experimental designs are generally used only when one is trying to collect descriptive data. These types of studies are characterized by the absence of a control or comparison group. There are two commonly used non-experimental designs in evaluation research: (1) the posttest-only design and (2) the pretest-posttest design.
There are two key points to note about these non-experimental designs. First, while both can be used for descriptive purposes, neither can be used to claim that the intervention is better than any other intervention. Second, the pretest-posttest design does allow one to judge the amount of gain made by the treatment group, but one cannot attribute this change to the intervention. It could be that time or other events that occurred during the intervening time period caused the gains between the first and second tests. Because of these problems, non-experimental designs are the designs of last choice.
Quasi-Experimental Designs. Quasi-experimental designs are studies that follow the basic structure of a true experiment, but without controlling for differences in subject selection. That is, the subjects are not randomly assigned to conditions. There are two classic quasi-experimental designs that will be discussed: time series design and nonequivalent control group design.
Time series designs are similar to non-experimental pretest-posttest designs, with the added advantage of repeated measurements before and after the intervention. The primary advantage of this type of design is that it gives trend information. One can compare the changes between O3 and O4 to all other pairs of observations. If the intervention is the cause of the change (not time or changes in subjects' performance due to aging or learning in other courses), the changes between O3 and O4 should be greater than those between any other pair of observations.
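The comparison of the O3-to-O4 change against the other adjacent pairs can be sketched as follows. The sketch assumes six observations with the intervention falling between O3 and O4; the total number of observations, the function name, and the scores are assumptions made for illustration.

```python
def adjacent_changes(observations):
    """Change between each observation and the one that follows it."""
    return [later - earlier
            for earlier, later in zip(observations, observations[1:])]


# Illustrative mean scores for O1..O6; the intervention falls between O3 and O4.
scores = [61, 62, 62, 71, 72, 72]
changes = adjacent_changes(scores)

intervention_index = 2  # position of the O3 -> O4 change in the list
intervention_change = changes[intervention_index]
other_changes = changes[:intervention_index] + changes[intervention_index + 1:]

# The design supports the intervention as the cause only if this is True.
exceeds_all_others = all(intervention_change > c for c in other_changes)
print(changes, exceeds_all_others)
```

This is a descriptive check only. Deciding whether the jump is larger than ordinary fluctuation would normally call for a statistical test suited to repeated measurements.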
The nonequivalent control group design has the advantage of providing a direct comparison group. It controls for changes that may be due to time or other causes, but does not control for subject differences. However, if the two groups are equivalent on the pretest scores, the threat to the validity of the study due to differences in subjects is somewhat reduced.
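A minimal sketch of the two checks this design invites: whether the groups start out equivalent on the pretest, and whether the intervention group gains more than the comparison group. All names and scores are hypothetical.

```python
from statistics import mean


def mean_gain(pre, post):
    """Average posttest-minus-pretest gain for students listed in the same order."""
    if len(pre) != len(post) or not pre:
        raise ValueError("pre and post must be non-empty and the same length")
    return mean(after - before for before, after in zip(pre, post))


# Illustrative scores only; students were NOT randomly assigned to groups.
intervention_pre = [60, 65, 70, 55]
intervention_post = [72, 75, 80, 69]
comparison_pre = [61, 64, 69, 56]
comparison_post = [65, 67, 74, 60]

# A small pretest gap reduces, but does not remove, the selection threat.
pretest_gap = mean(intervention_pre) - mean(comparison_pre)

# Extra gain in the intervention group beyond what the comparison group showed.
gain_difference = (mean_gain(intervention_pre, intervention_post)
                   - mean_gain(comparison_pre, comparison_post))
print(pretest_gap, gain_difference)
```

Matching pretest means does not guarantee the groups match on unmeasured characteristics such as motivation, which is why the threat is described as reduced rather than eliminated.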
Experimental Designs. The key distinction that separates experimental designs from non- or quasi-experimental designs is the random assignment of subjects into the intervention groups. Random assignment helps ensure that subjects in the groups will be equal before the intervention is introduced. This leveling helps eliminate bias due to subject selection. We will briefly describe two of the more common experimental designs: the pretest-posttest control group design and the multiple intervention design.
The pretest-posttest control group design has several advantages over the designs presented earlier. First, it provides for random assignment of students into groups, helping eliminate the threat of selection bias. Second, it provides a clear comparison group and uses a pre- and posttest design, allowing one to measure not only differential gains between groups, but also absolute gains in skills and knowledge. The only weakness in this design is that it does not control for the Hawthorne Effect.
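The random assignment step in this design can be sketched as below. The function name, group labels, student identifiers, and seed are hypothetical; the fixed seed is there only so the example is reproducible.

```python
import random


def randomly_assign(student_ids, group_names, seed=None):
    """Shuffle students, then deal them into groups of near-equal size."""
    rng = random.Random(seed)
    shuffled = list(student_ids)
    rng.shuffle(shuffled)
    groups = {name: [] for name in group_names}
    for position, student in enumerate(shuffled):
        groups[group_names[position % len(group_names)]].append(student)
    return groups


# Twenty hypothetical students split between an intervention and a control group.
assignment = randomly_assign(range(1, 21), ["intervention", "control"], seed=2024)
print({name: len(members) for name, members in assignment.items()})
```

Because chance rather than volunteering decides who lands in each group, differences such as motivation should balance out on average, and the pretest offers a way to check that they did.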
The multiple intervention design has the advantage of controlling for threats to validity due to selection and the Hawthorne Effect. In addition, if interventions are based on theoretical understanding of how the intervention produces change, isolating individual or groups of causal variables, it can be used to identify the specific causes of any changes in learning due to the intervention.
In the multiple intervention design, the intervention groups can be systematically designed to vary on how much of the total intervention is received by students in each group. For example, if one is interested in determining the effectiveness of a multimedia tutoring system in teaching chemistry, there may be many aspects of the system that one believes will aid learning (e.g., additional simulations, structured drill). One could design the study so that one group receives the simulations only, one group receives the structured drill only, one group receives both structured drill and simulations, and a fourth group receives extra chemistry problems to work. By comparing the four groups on how much chemistry was learned (e.g., exam and course grades), one could determine the relative effectiveness of the drill alone, the simulations alone, and the two combined, and, with the problem set group, the effect of additional time spent studying without the use of the multimedia system. By randomly assigning subjects to the groups, one controls for selection bias, most other threats to validity, and the Hawthorne Effect.
This use of a multiple intervention group design provides the best test of the effectiveness of the proposed intervention, yielding data on both process and outcome variables. It does so by isolating the effects of specific variables in the overall intervention. This type of design can be combined with a pretest-posttest design, yielding even more data regarding initial equivalence of groups. The use of multiple intervention groups allows one to test the independent effects of variables in a complex intervention, and provides an easy way to control for the Hawthorne Effect and time-on-task that other designs do not. This method is clearly the superior study design for the evaluation of most projects, although it is often difficult for investigators to implement.
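The four-group comparison from the chemistry example can be sketched as a comparison of group means against the extra-problems group, which received additional study time without the multimedia system. The group labels and exam scores are hypothetical, and the sketch is descriptive only; a real evaluation would add an appropriate statistical test.

```python
from statistics import mean

# Illustrative exam scores for four randomly assigned groups.
exam_scores = {
    "simulations_only": [74, 78, 80, 76],
    "drill_only": [72, 75, 73, 76],
    "simulations_and_drill": [82, 85, 80, 85],
    "extra_problems": [70, 71, 69, 70],  # time-on-task comparison group
}

group_means = {group: mean(scores) for group, scores in exam_scores.items()}

# Advantage of each multimedia component over extra study time alone.
baseline = group_means["extra_problems"]
advantage_over_baseline = {
    group: group_mean - baseline
    for group, group_mean in group_means.items()
    if group != "extra_problems"
}
print(advantage_over_baseline)
```

Comparing against the extra-problems group rather than an untreated group is what separates the effect of the multimedia components from the effect of simply spending more time on chemistry.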
Action Steps for Investigators:
Step 7: Monitor and Evaluate
Once project investigators have developed a plan for evaluation, the next challenge is to carry it out successfully. This is harder than it may seem. All too often, evaluation is forgotten amid the day-to-day pressures of project implementation, and becomes important only when reports are due or publications are being prepared. Under these conditions, the essential formative role of evaluation as a means of improving project interventions and operations is lost.
Strategies that can help ensure that evaluation activities are an integral part of the project include:
Action Step for Investigators:
Step 8: Use and Report Evaluation Results
All evaluation is wasted unless the results are used to improve project operations or interventions. This essential step, however, is frequently overlooked. For projects, all reports should include not only evaluation results and reports of progress, but detailed explanations of how those results were used to reinforce, refine, or modify project activities.
Evaluation activities should be fully incorporated into the project management process (design, implement, evaluate, redesign...). Too often, insufficient time and resources are available for the redesign stage. You should schedule project activities to allow time for reviewing evaluation results and modifying project design after results become available and before the next iteration of the intervention begins.
The purpose of educational science is to improve educational practice. Therefore, it is essential that you use evaluation results not only to inform your project and primary audiences, but also that you disseminate them to a wider audience. Dissemination can and should include the publication of your evaluation results in peer-reviewed journals, or presentations at professional conferences. In addition, there are a number of less formal avenues that can be used to share preliminary results and experiences.
Action Steps for Investigators:
Evaluation is essential to improve the quality and effectiveness of projects designed to improve K-12 education. The first step in the development of appropriate evaluation activities is to incorporate an evaluation strategy into the project planning process.
So where do you start? Most currently funded projects do not have the personnel or financial resources to design and implement comprehensive evaluations of their projects. A practical approach to this dilemma is to proceed incrementally, beginning with what is possible now and gradually increasing evaluation activities as the project develops. Projects should strive to evaluate a few components well, rather than several poorly or not at all. Investigators may want to focus their short-term evaluation efforts on the most important process and outcome objectives of their projects. From an evaluation perspective, a focus on implementation and immediate outcomes is advantageous because relatively inexpensive and straightforward methods for valid assessments of student performance exist and have been used successfully to evaluate other educational interventions.
A limited set of priority indicators useful to project managers should be identified in an overall plan for evaluation. The plan should specify the data sources and how often indicators will be measured. Priority indicators will vary from project to project, based on their goals and specific objectives. Project directors should systematically select the indicators appropriate for their project as a part of the planning process. In addition, State, Regional or National managers may have uniform indicators to be collected by all projects; you should discuss the selection of indicators with your Project Officer to ensure that any key indicators are adequately covered by your evaluation plan.
For evaluation to lead to improvements in educational programs, it must be clearly defined as a part of the project activities. Investigators can increase the yield from their project evaluation activities by working collaboratively with other disciplines and with national staff of the project to define appropriate guidelines, evaluation questions, and methods. A coordinated approach will conserve resources and allow comparisons among various projects.