Throughout this article, we illustrate major issues by infusing examples from real-life project evaluations into the text. It is not our intent to present comprehensive evaluation results from any of these projects or to critique them publicly; rather, we aim to provide examples from multiple sources that exemplify the points being made. One or both of the authors were involved with the evaluation of each of these projects. The Toward Other Planetary Systems (TOPS) and Authentic Research on Variable Stars (VS) projects are professional development workshops for teachers, the former a three-week residential workshop focusing on the integration of astronomy (including archaeoastronomy and astrobiology) into the classroom, and the latter a 10-week part-time training program in the use of a research protocol for investigating variable star data available on the Internet. The University of Hawai'i's Research Experiences for Undergraduates (UH REU) program is a 10-week summer research program in astronomy. Finally, the Princeton Earth Physics Project (PEPP) created a network of seismometers in middle and high school classrooms across the country. Additional information about each of these programs can be found in Notes 2–5.
The selection of an evaluator or evaluation team is not a simple matter. In general, evaluators work for the project investigators or directors and not for the funding agency, and as such, it is the project directors who decide what information is released to the funding agency or into the public domain. As Frechtling (2002) described, it is desirable to enlist the services of an evaluator from the beginning of the project, although this may not always be possible. Although the evaluator can be internal to the project investigators' department, it is often preferable to enlist someone from outside the project. This could be a colleague from another department at the same institution (such as education or psychology departments), or someone from a professional evaluation company. The evaluator should have a good understanding of the project's goals and objectives and be accepting of them without introducing bias. It is also preferable for the evaluator to have at least a basic understanding of the scientific content underlying the project.
The appropriate use of project evaluation is analogous to that of the selection of suitable assessment techniques in our classrooms (Brissenden et al. 2002). As described in greater detail by Hannah (1996), evaluators generally use three types of evaluation. Other evaluative studies can be used (see Rossi, Lipsey, & Freeman 2004, pages 62–65, for a definitional list), but those described below tend to be the most common. The choice and depth of evaluation type depends largely on the project under scrutiny and the goals of the project directors and evaluators. In the ideal case, multiple types of evaluation would be included over the lifetime of the project.
A planning evaluation serves to assess a project in its planning stages—in essence, to align a project's goals, objectives, processes, and timelines. This helps to focus project goals and, when used early enough in the proposal development process, can uncover weaknesses or inconsistencies in proposed programs before they are submitted for merit review. A needs assessment, a systematic study of the needs of the program and its potential consumers, may be included as part of the planning evaluation.
A formative evaluation assesses a project while in progress, making suggestions for change and further evaluating any midproject alterations that are implemented. It is ongoing throughout the project and may be iterative in nature. In contrast, a summative evaluation looks at a project only upon its completion. This is performed with the intent of making a final judgment about the level of success of the project—for example, by determining if its goals and objectives were met.
Just as in a well-designed astronomy course, the most successful astronomy education and public outreach projects are those that have plainly articulated goals and a clear path to achieving them. Remember that the purpose of an evaluation, especially the summative, is to determine whether the project goals have been met. The presence of well-defined goals was probably part of what helped the project receive funding; don't lose sight of those goals once this happens.
It is also important to make explicit all of the goals that will be evaluated. Often, we have implicit goals—an example is to improve a participant's attitude toward astronomy—that go unstated but are assumed by the persons involved in the project. These implicit goals may then surface during a project evaluation without ever having been addressed by the project in a measurable way.
Science education projects are frequently built upon the following types of goals: (1) cognitive goals designed to increase participants' knowledge of science concepts or to improve their scientific inquiry skills; (2) affective goals aimed at enhancing participants' attitudes, values, and interests (defined further in section II F 3) in science; and (3) product creation and dissemination goals for classroom-tested instructional materials or techniques. Projects may focus on only one of these types of goals, or may have all three types incorporated. The VS project, for example, has all three types of goals at varying levels. The cognitive goal is that teachers (and eventually their students) who used the research protocol should demonstrate increases in knowledge of variable stars. Generating participants' interest in using the protocol is related to affective goals, and the widespread distribution of the student research protocol and accompanying guide book for teachers is a clear example of a product creation and dissemination goal.
Instructors can generally attest that having overarching goals for their class is important, but also can recognize that broad goals are not easily measured. Instead, we often break down our instructional goals into smaller chunks—objectives or measurable outcomes—so that we might more easily and effectively establish the extent to which our goals have been met. For example, if a project (or classroom) goal is to significantly increase participant interest in cosmology, objectives might include increased scores on an interest survey when measured over the course through pretests and posttests, increased time spent with media related to cosmology (such as choosing to read magazine articles or watch science-based television programs), or the voluntary selection of cosmology as a topic for a paper or project. In much the same way that scientists use different data sources to establish the validity of their conclusions, multiple objectives often are needed to determine if a broad goal has been met, providing a way to “triangulate” the evaluation data through a variety of sources.
In the PEPP evaluation, the original goal of the project (established by a different principal investigator from the one who contracted the evaluation efforts) was simply to create a network of working seismometers located within high schools across the country. This is actually quite simple to evaluate by counting the number of active seismometers within the network. However, implicit goals that were eventually made explicit for the purposes of evaluation included, for example, the use of the real science data in the classroom and affective gains toward learning about earthquakes. The summative evaluation for PEPP met with many challenges as a result of the low number of project goals explicit in the initial design of the project and the variety of implicit goals that became apparent during the one-year evaluation period.
A common format to summarize evaluation plans is a matrix that relates the various goals of a project to the evaluation procedures and objectives to be met (the evaluation matrix for the VS project is provided in Table I as an example). Rows of the matrix describe the specific project goals and outcomes; columns indicate project activities, assessment data sources and analysis strategies, and performance indicators of success for each of the listed goals. As such, the evaluation matrix provides a structured approach for the external project evaluation, and clearly specifies how the evaluation is related to the project's desired outcomes and definitions of success. The development of such an evaluation matrix during planning is essential to making project goals and activities more explicit for the evaluators, program directors, and proposal reviewers. When clearly laid out, the matrix directs the work of the evaluator and also provides merit reviewers a succinct and clear overview of a proposed project, its planned evaluation, and its indicators of success.
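The row-and-column structure described above can be sketched as a simple record. This is a minimal illustration only; the entries below paraphrase the VS cognitive goal from the text, and the field names are our own, not drawn from Table I.

```python
# One hypothetical row of an evaluation matrix. Each field corresponds to a
# column described in the text: the goal itself, the project activities that
# address it, the data sources and analysis strategies, and the performance
# indicator that defines success for that goal.
matrix_row = {
    "goal": "Teachers demonstrate increased knowledge of variable stars",
    "activities": ["10-week part-time training in the research protocol"],
    "data_sources": ["pre/post content survey", "participant interviews"],
    "indicator": "Significant gain from pretest to posttest survey scores",
}

for column, entry in matrix_row.items():
    print(f"{column}: {entry}")
```

Laying each goal out this way, one row at a time, makes it easy to spot goals that lack a data source or an indicator of success before the proposal is submitted.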
Project evaluation, like educational research in general, can encompass a variety of quantitative and qualitative data collection strategies. Systematic qualitative methods might include, but are not limited to, repeated classroom observations, collections of participants' work, or clinical and group interviews of participants. While such techniques are ideally suited for gathering insightful anecdotal evidence or describing a single event or participant, they tend to work best for time-intensive case studies. However, such case studies are usually insufficient for making wide generalizations that can be applied to other projects or participants. Alternatively, quantitative methods—surveys, for example—allow more room for statistical analysis and perhaps wider generalizations. Unfortunately, these methods are also limited in scope and might not thoroughly describe a given situation if the evaluator is not aware of the nuances of the project or specific scientific concepts.
The most useful evaluation studies use a combination of quantitative and qualitative data sources. In particular, results obtained from quantitative instruments need to be validated qualitatively using individual or group interviews (or other qualitative techniques) with participants. One approach is to use qualitative methods to inform the design of quantitative instruments, later revisiting qualitative methods to support the analysis and interpretations of the data. Some evaluators accomplish this in part by presenting the qualitative data to project participants and asking them to help interpret the results. The evaluation results that are the most convincing employ a triangulated multiple data source approach to assess stated project goals with both quantitative and qualitative collection strategies. Such triangulation of data also helps reduce potential weaknesses in a single data source.
Changes in knowledge (the cognitive domain) among participants are commonly measured using pretests and posttests that cover either a narrow range of content knowledge specific to the project, or broader science content. Only a few national instruments exist with large comparison databases to measure broad understanding of astronomy and earth/space science concepts, and none is yet in widespread use. These instruments include the Astronomy Diagnostic Test (Hufnagel et al. 2000), the AGI/NSTA Earth Science Examination (Callister et al. 1988), the Earth Science “Literacy Test” (DeLaughter et al. 1998), and the Nature of Science Survey (Libarkin 2001). The Astronomy Diagnostic Test was used in pre/posttest administrations for the TOPS workshop nearly every summer in order to assess what astronomy content knowledge was gained by the participants over the three-week period. However, it is much more common for evaluators, or even the project investigators, to develop their own pretest/posttest knowledge surveys that are responsive to the emphases of individual projects. Usually multiple-choice in format, these surveys are time consuming to develop and suffer from many of the weaknesses of such instruments outlined elsewhere (Astwood & Slater 1997).
Experience suggests that participant skills (unlike content knowledge) that have improved as a result of an activity or project can be measured reasonably reliably by self-report if participants do not have a reason to inflate the results. Alternatively, participants can be observed and the important components of a performed skill can be rated by the evaluator using an observation checklist. In the UH REU evaluation, interviews of participants at both the beginning and end of the summer term often included questions about their observational or data reduction skills.
Whereas the cognitive domain focuses on what participants understand and can apply, the affective domain embraces attitudes, values, and interests. In the language of education, attitudes are the extent to which participants like or enjoy something, values are the degrees to which participants think that something is important to engage in, and participants' interests are things worthy of allocating time to (Anderson 1981). These facets of the affective domain may be related to, but quite different from, participants' satisfaction with the program itself. The affective domain is most commonly evaluated by asking participants to self-report their feelings using pretest and posttest Likert scale surveys. Participants are presented with a statement and asked to respond on a scale ranging from 1 (strongly disagree) to 5 (strongly agree). As an example, Table II shows the survey that was administered after the 2003 TOPS workshop. Typically, mean scores and standard deviations are calculated for each item, and a t test of significance is calculated for a given response to judge the meaningfulness of any observed change between preproject and postproject administrations.
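The pre/post comparison described above can be sketched with a paired t test, which compares each participant's preproject and postproject responses on a single item. The function and the Likert scores below are hypothetical illustrations, not data from the TOPS survey.

```python
from statistics import mean, stdev

def paired_t(pre, post):
    """Paired t statistic for one Likert item, matching each participant's
    preproject score with the same participant's postproject score."""
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    # t = mean difference divided by the standard error of the differences
    return mean(diffs) / (stdev(diffs) / n ** 0.5)

# Hypothetical 5-point responses from eight participants on one item
pre  = [2, 3, 3, 2, 4, 3, 2, 3]
post = [4, 4, 3, 3, 5, 4, 3, 4]
print(f"pre mean {mean(pre):.2f}, post mean {mean(post):.2f}, "
      f"t = {paired_t(pre, post):.2f}")
# → pre mean 2.75, post mean 3.75, t = 5.29
```

With seven degrees of freedom, a t statistic this large would indicate a significant shift in the item mean; in practice the statistic would be compared against a t distribution (or computed directly with a statistics package) to obtain a p value.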
One concern about surveys that use such a scale is the variability across participants; what one person agrees with, another might strongly agree with. Likewise, an individual might respond differently from one administration to another. Often results are binned into positive (e.g., responses 5 and 4), neutral (3), and negative (2 and 1) categories to help alleviate the potential differences between participants or over time. These are questions of validity and reliability, respectively. More information on how to design validated, reliable surveys can be found in any number of references on testing and measurement, such as Campbell & Stanley (1963). Other data collection methods, especially qualitative methods such as interviews, might more accurately represent participants' true feelings in the affective domain.
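The binning described above is a simple mapping from the five response values into three categories. The responses below are hypothetical; the cut points follow the grouping given in the text (5 and 4 positive, 3 neutral, 2 and 1 negative).

```python
from collections import Counter

def bin_likert(responses):
    """Collapse 5-point Likert responses into positive/neutral/negative bins
    to reduce sensitivity to how individuals interpret the scale."""
    labels = {1: "negative", 2: "negative", 3: "neutral",
              4: "positive", 5: "positive"}
    return Counter(labels[r] for r in responses)

# Hypothetical responses to one survey item
counts = bin_likert([5, 4, 4, 3, 2, 4, 5, 3, 1, 4])
print(counts)  # → Counter({'positive': 6, 'neutral': 2, 'negative': 2})
```

Reporting the three binned proportions, rather than the raw item mean alone, trades some resolution for robustness against the scale-interpretation differences noted above.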
Science education projects often include the development of curriculum products. It is important that these products be thoroughly evaluated (including classroom testing) to judge their ease of use, appropriateness, value, and effectiveness before they are widely disseminated. The measurement of these attributes is often tied to the previously described measures of the cognitive and affective domains. Similarly, a careful plan at the beginning of the project for how and how widely the materials will be disseminated and adopted may point to fixable weaknesses and thus increase the later success of the dissemination plan. Dissemination of the VS research protocol is one of that project's three primary goals; as such, the tracking of its distribution and use will be an important part of the evaluation efforts. Continued contact with current and future teachers will be paramount to the success of this part of the project evaluation.
The uses of planning and formative project evaluations are fairly straightforward. Because they occur before or during a project, changes can be made based upon evaluation results in order to improve the project if the project leadership values and actively participates in the evaluation process. During the TOPS program, daily forms were collected with evaluation comments on each session offered. The project directors could use this information, compiled in a report submitted at the end of the summer, to determine which sessions to repeat (or not) during future summer workshops. Also, if immediate needs or concerns were provided by participants on a daily evaluation form, project staff could address them within one to two days.
Summative evaluations have their own place in improving education and outreach efforts, though their intrinsic value may not be as obvious as that of formative evaluation. Lessons can be learned both about the project itself (e.g., how to best achieve certain goals, logistics that need to be reconsidered) and the evaluation process. Sometimes we find that our data collection was not the best for the objective or goal described; this can be fixed in future projects and their evaluations.
The most tenacious problem with evaluation efforts is the self-selection effect that pervades the process. Especially in the case of teacher workshops, participants who are the most willing to participate in evaluation activities (such as interviews or the completion of surveys) are often those most motivated to succeed, either intrinsically or as a result of the project with which they are involved. This self-selection effect can skew results in a positive direction, making the project appear quite successful when in reality a portion of the participants (and their data) are not represented. In the case of PEPP, those teachers who were active and excited about the program were happy to contribute their time to evaluation efforts. However, several of the original teachers had moved on to new jobs or simply had not continued their participation with the seismic network. Not surprisingly, these teachers often could not be contacted to determine why their participation had lapsed and what could be done to alter the situation.
An additional challenge faced by the PEPP evaluation team was its late involvement in the project. PEPP had been funded at some level since 1993, but the formal evaluation efforts did not begin until 2000, largely as a result of changes in funding requirements as they related to evaluation. In this case, a summative evaluation suffers from the inability to investigate or demonstrate the potential changes in the cognitive and affective domains as a result of the project; it is impossible to collect pretest data after the fact.
As mentioned earlier, sometimes we don't realize until after the analysis is complete that a data collection strategy is less effective than we would like. Early during the TOPS evaluation, we used the Astronomy Diagnostic Test (Hufnagel et al. 2000; Zeilik 2003) to ascertain cognitive growth as a result of participation in the workshop. However, upon analysis, it was noted that many of the topics included on the ADT had not been addressed in the TOPS workshop; likewise, many astrobiology and archaeoastronomy topics were covered in the workshop but not assessed in the ADT. The instrument was subsequently changed to more accurately reflect the topics emphasized in the workshop, incorporating questions from the ADT and other sources. Any data-driven claims about how the workshop impacted repeat participants were, of course, compromised by this change.
The collection of longitudinal data (tracking participants months or even years after their initial involvement in a project) is one of the most valuable and most difficult of all evaluation tasks. Funding for such evaluation efforts is rarely available, and often deadlines for reports to funding agencies preclude the collection of long-term data. However, knowing the influence of a project “down the road” (what is known as an impact evaluation) can be beneficial for similar or extension projects of the future.