Astronomy Education Review, Vol. 6, No. 1, pp. 25–42, April 2007
©2007 Erik Brogt. Copyright assigned to the Association of Universities for Research in Astronomy, Inc. All rights reserved.

Analysis of the Astronomy Diagnostic Test

Erik Brogt

University of Arizona

Darrell Sabers

University of Arizona

Edward E. Prather

University of Arizona

Grace L. Deming

University of Maryland, College Park

Beth Hufnagel

Anne Arundel Community College

Timothy F. Slater

University of Arizona

(Received: 31 January 2007; revised: 13 April 2007; published online: 14 June 2007)

Seventy undergraduate class sections were examined from the database of Astronomy Diagnostic Test (ADT) results of Deming and Hufnagel to determine if course format correlated with ADT normalized gain scores. Normalized gains were calculated for four different classroom scenarios: lecture, lecture with discussion, lecture with lab, and lecture with both lab and discussion. Statistical analysis shows that there are no significant differences in normalized gain among the self-reported classroom formats. Prerequisites related to mathematics courses did show differences in normalized gain. Of all reported course activities, only the lecture and the readings for the course correlate significantly with the normalized gain. This analysis suggests that the ADT may not have enough sensitivity to measure differences in the effectiveness of different course formats because of the wide range of topics that the ADT addresses with few questions. Different measures of gain and their biases are discussed. We argue that the use of the normalized gain is not always warranted because of its strong bias toward high pretest scores.


INTRODUCTION

Conceptual diagnostic tests can be used to measure course effectiveness by assessing student understanding about a particular concept both before and after instruction. This is often called pretest and posttest design. Student achievement is measured prior to and after instruction, and a gain in students' scores as a result of instruction is calculated. In the context of physics education research, one of the most commonly used diagnostic tests is the Force Concept Inventory (FCI; Hestenes, Wells, & Swackhamer 1992). In a large meta-study, Hake (1998) obtained results from more than 60 courses, encompassing more than 6,000 students who were surveyed with the FCI, the Mechanics Baseline Test (MBT; Hestenes & Wells 1992), or the Mechanics Diagnostic test (MD; Halloun & Hestenes 1985). He used these results to measure the effectiveness of interactive engagement and traditional lecture-based course formats. Hake showed that interactive engagement methods in physics led to higher gains than traditional lecture-based methods.

In astronomy, tools like the FCI, MBT, and MD are less common. Furthermore, the student population is significantly different. Students taking introductory physics courses in which these physics conceptual diagnostics are administered are typically science, engineering, or pre-med majors. Most of these students are required to take the introductory physics series as a prerequisite for their degree program. The vast majority of students taking an introductory astronomy course are non–science majors fulfilling a general education science requirement; the course often will serve as their terminal course in science.

The most commonly used diagnostic to date in introductory astronomy courses is the Astronomy Diagnostic Test (ADT; Zeilik 2003). The ADT has 21 multiple-choice content questions covering a wide range of astronomy topics and is aimed at the introductory-level courses typically taught to non–science majors at colleges and universities. In addition, the ADT includes 12 demographic questions. Deming and Hufnagel (2001) constructed a database that contains more than 5,000 students' pretest scores and over 3,500 students' posttest scores on the ADT version 2.0. In addition, a vast array of instructor-reported information about the courses is available in the database.

In this study, we are interested in pursuing two questions: Are there differences in gain between the different course formats? Can we identify and quantify additional variables that may help predict student gains? It should be noted that this study is not similar to the Hake (1998) study in the sense that we do not make a distinction between interactive engagement and traditional lecture-based formats. There is not enough information in the database or other materials (Deming 2002) to suggest that the different course formats have true interactive engagement elements in them. Although we do not discount such an option, it is not an a priori assumption in this study.

The article is set up as follows: In Section II, we briefly discuss the structure of the ADT database; in Section III, we describe the methods we used; and in Section IV, we present the results. In Section V, we discuss a further analysis on measures of gain. In Section VI, we discuss the results guiding our conclusions and offer recommendations for further study.

THE ADT DATABASE

The ADT database contains information about more than 5,000 pretest and 3,500 posttest results obtained from approximately 100 classrooms across the United States, reflecting a wide variety of institutions. In addition to the student responses to individual questions on the ADT and the total number of correct items, the database contains instructor-reported information about the following items: geography, institution type and size, class size and format, type of course (Solar System, universe in one semester, and so on), math prerequisite for the course, and information on how well the course topics align with ADT questions. To maintain protections afforded by human subjects policies, the database we worked with contained no variables that might identify individual students. In this study, we were interested in the pretest and posttest scores as a function of course format. Of all the formats listed, four were useful for this analysis: lecture alone, lecture with mandatory laboratory, lecture with mandatory discussion or recitation sessions, and lecture with both laboratory and discussion sessions. We reduced the data set to contain only those entries that had these class formats; 70 of the approximately 100 classes met the requirements for this study.

METHODS

The participants for the collection and submission of ADT results for the database were instructors who volunteered to administer the ADT to their students at the beginning and/or end of their undergraduate introductory astronomy survey courses. As such, the sample represents one of convenience rather than a true random sample, and many instructors were recruited through personal contacts of the ADT design team. Students' pretest and posttest data are not matched; this restriction was imposed partly by the removal of identifying characteristics and partly by attrition in student numbers in the classes over the semester, as indicated by the difference in the number of pretests and posttests administered. We calculated class mean prescores and postscores and the normalized gain (Hake 1998) for each class, which is defined as

$$\text{Normalized gain} = \frac{\%\,\text{post} - \%\,\text{pre}}{100 - \%\,\text{pre}}$$

We then averaged the normalized gains per instructional format. We have 16 classes characterized by lecture alone, 5 for lecture with discussion, 40 for lecture with lab, and 9 for lecture with both lab and discussion. If a course has multiple components, it is likely that a wider variety of student learning styles is being served. Based on the variety of opportunities to learn, we predicted that the lowest gains would be in the lecture-only format, and the highest gains would be in the lecture with both lab and discussion format. The lecture with only lab and the lecture with only discussion formats were predicted to have gains between those two extremes. This assumption allowed us to do one-tailed tests, increasing the statistical power.
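The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the tuple layout, and the example class means are all hypothetical and are not taken from the ADT database.

```python
from collections import defaultdict
from statistics import mean


def normalized_gain(pre_pct: float, post_pct: float) -> float:
    """Hake's normalized gain from class-mean percentage scores."""
    if not 0 <= pre_pct < 100:
        raise ValueError("pretest percentage must be in [0, 100)")
    return (post_pct - pre_pct) / (100.0 - pre_pct)


def mean_gain_by_format(classes):
    """Average the per-class normalized gains within each course format.

    `classes` is an iterable of (format_label, pre_pct, post_pct) tuples,
    one tuple per class section.
    """
    gains = defaultdict(list)
    for fmt, pre, post in classes:
        gains[fmt].append(normalized_gain(pre, post))
    return {fmt: mean(values) for fmt, values in gains.items()}


# Hypothetical class means, invented for illustration only.
example_classes = [
    ("lecture", 30.0, 44.0),
    ("lecture", 40.0, 52.0),
    ("lecture+lab", 35.0, 48.0),
]
format_means = mean_gain_by_format(example_classes)
```

Note that the gain is computed per class first and only then averaged per format, which matches the order of operations stated in the text.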

We chose a family-wise alpha level of α_FW = .05. This means that the overall chance of finding at least one significant contrast when in fact no true difference exists is at most 5%. For the analysis, we used the Holm-Bonferroni planned contrast method. We chose this method because it is appropriate for the unequal sample sizes in the data obtained for this analysis. The high statistical power comes with a price: One is required to plan all contrasts prior to analysis to keep the alpha slippage (increasing the chance of claiming a significant result when it is not warranted) for the entire set of contrasts under control. Independent sample t tests are done for each contrast. One can argue that there is considerable overlap between the students doing the pretest and the posttest in each class and that an independent sample t test would lead to a decrease in statistical power. However, because there is no information available on which students did the pretest and posttest, using an independent sample t test is the most conservative estimate that one can make. The resulting p values (the probability of obtaining a result, in this case a difference in gains, at least as large as the one observed if random chance alone were at work) are rank ordered, with lowest value first, and compared with the threshold value. Because the first contrast is evaluated at an alpha level of α = α_FW/(total number of contrasts), it is important to keep the number of contrasts low to increase the statistical power of the test. Each subsequent evaluation in this method is run at a slightly higher alpha level (denominator goes down by one for each evaluation), but it requires a statistically significant previous evaluation. When one of the evaluations yields a nonsignificant result, the subsequent evaluations are not considered significant either. For this reason, we decided not to evaluate the contrast dealing with lecture with lab and lecture with discussion because we could not a priori make a reliable prediction of which of those formats would yield a higher gain. Table I summarizes the planned contrasts, which are set up in the following form:

$$\text{Gain}(\text{format 1}) - \text{Gain}(\text{format 2}) < 0$$

Because we have five contrasts, the threshold for significance for the first contrast is α = .05/5 = .01.
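The step-down logic of the Holm-Bonferroni procedure described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name is hypothetical, the p values are invented and are not the results in Table III, and the t tests that would produce the p values are assumed to have been run elsewhere.

```python
def holm_bonferroni(p_values, alpha_fw=0.05):
    """Return a list of booleans: True where the contrast is significant.

    Contrasts are tested from smallest to largest p value. The k-th test
    (k = 0, 1, ...) uses the threshold alpha_fw / (m - k), so the
    denominator goes down by one at each step. Testing stops at the first
    nonsignificant result; all remaining contrasts stay nonsignificant.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    significant = [False] * m
    for k, idx in enumerate(order):
        threshold = alpha_fw / (m - k)
        if p_values[idx] > threshold:
            break  # every later (larger) p value is nonsignificant too
        significant[idx] = True
    return significant


# Hypothetical p values for five planned contrasts (not the study's results).
example = holm_bonferroni([0.030, 0.004, 0.200, 0.011, 0.450])
```

With five contrasts, the smallest p value is compared with .05/5 = .01, the next with .05/4 = .0125, and so on, which is why a nonsignificant first contrast settles the whole family.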

A. Other Variables in the Analysis

The database contains several variables that could potentially influence the normalized gain as well. Three variables had an a priori high face validity for further investigation. Those were class size, math prerequisite for the course, and the extent to which the course content mapped onto the topics covered in the ADT. All these data were self-reported by the course instructors or listed on the course syllabi. In the project that created the database used in this study, not all instructors reported all these variables. Therefore, because not all course formats in the database had data associated with these variables, it was not possible to use them as covariates; too many classes would have been eliminated from the analysis. Instead, we used a simple correlation to measure the effects of the variable on the entire set of classes.

A1. Class size

Class sizes in the database varied from only a few students to over 300. In larger classes, it is generally accepted that students will be more anonymous than in smaller classes. This could lead to a lessened sense of relatedness to the class, one of the three fundamental ingredients for intrinsic motivation (Deci & Ryan 1985). Although one could argue that smaller lab or discussion sections would partly negate this effect, we expected to see a slight negative correlation between normalized gain and class size.

A2. Math prerequisite

In traditional instruction of introductory astronomy for non–science majors, some emphasis is placed on mathematical operations, usually in the form of solving algebraic equations and interpretation of graphs, as evidenced by introductory astronomy textbooks. This is likely to present a problem for those students with math anxiety and/or limited math skills. The courses in the original ADT database are coded for mathematics prerequisites. We expected a difference in gain scores between classes that have a formal university-level math prerequisite (algebra and trigonometry) and those that do not. We expected the former to have a higher normalized gain than the latter.

A3. Course content

In the original ADT study (Deming 2002; Hufnagel 2002), instructors were asked to rate on a scale from 1 to 11 the extent to which they thought that the different parts of their course (reading, lecture, homework, activities, and lab) aligned with the items on the ADT. The alignment does not indicate what fraction of the course was actually spent on topics covered on the ADT. However, we use the reported alignment as a first-order approximation because it seems reasonable to assume that a course with a higher reported alignment will produce a higher normalized gain than a course with a lower reported alignment. Because not all course designs had a sufficient number of classes to make a stratification, we aggregated all classes and calculated the Pearson correlation coefficients (a measure for a linear relation) between the different elements of the course and the normalized gain.

RESULTS

Summary statistics for the class formats can be found in Table II. In the subsections below, we discuss the various results in more detail.

A. Homogeneity of the Class Formats

The original Hake (1998) study examined distinct populations: high school, college, and university students in either interactive engagement (IE) or traditional course formats. In all populations, the normalized gains for the IE classes are higher than the gains achieved by traditional classes. This is shown in Hake's plot, pretest percentage score (the percentage of questions on the FCI or ADT answered correctly before instruction) versus normalized gain, in Figure 1. Moreover, the traditional classes in Hake's original study also occupied distinct areas in the plot, indicative of different populations (high school, college, and university students). In Figure 2, we plotted our data in a similar fashion to Figure 1. The different class formats show overlap, which we interpret to be indicative of a more homogeneous population. This result is not particularly surprising because the database contains only information about introductory astronomy students at the college/university level, making the population much more homogeneous than the populations in the original Hake study (Deming 2002; Hufnagel 2002).

Figure 1. Figure 2.

A1. Shape of the distributions

For all classes, we calculated skew (γ1) and kurtosis (γ2) of the distribution of scores (see Appendix A). The skew value measures in what direction the distribution is tailed, with γ1 < 0 meaning that the distribution has a tail to the left, and γ1 > 0 meaning that the distribution is tailed to the right. The kurtosis is a measure of the flatness of the distribution, with γ2 > 0 meaning the distribution has a high peak, and γ2 < 0 meaning that the distribution is less peaked. Overall, the classes showed a shift from pretest to posttest toward lower values for both γ1 and γ2. This is consistent with learning taking place (shift to the right in scores), but not everyone is learning at the same rate (flattening of the distribution and a larger standard deviation posttest as compared with pretest). Because in our study, an entire class is the unit of analysis, we did not consider the skew and kurtosis and their effects on the normality assumption in statistical tests of the distribution of an individual class. However, in an analysis on the classroom level, skew and kurtosis should be considered because they can undermine the assumptions of normality that underlie most statistical analyses.
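For readers who want to reproduce such shape statistics, the sketch below uses the standard moment-based definitions, with the kurtosis reported as excess kurtosis so that a normal distribution gives zero. That choice is an assumption made to match the sign convention in the text; the estimator actually used for Appendix A is not stated, and the function name is hypothetical.

```python
def skew_and_excess_kurtosis(scores):
    """Moment-based skew and excess kurtosis of a list of scores.

    skew     = m3 / m2**1.5      (negative: tail to the left)
    kurtosis = m4 / m2**2 - 3    (negative: flatter than a normal curve)
    where mk is the k-th central moment with a 1/n normalization.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores")
    mu = sum(scores) / n
    m2 = sum((x - mu) ** 2 for x in scores) / n
    m3 = sum((x - mu) ** 3 for x in scores) / n
    m4 = sum((x - mu) ** 4 for x in scores) / n
    if m2 == 0:
        raise ValueError("all scores are identical; shape is undefined")
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


# Hypothetical scores: a symmetric set and a set with a tail to the right.
symmetric = skew_and_excess_kurtosis([1, 2, 3, 4, 5])
right_tailed = skew_and_excess_kurtosis([1, 1, 1, 1, 6])
```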

B. Differences in Normalized Gain as a Function of Teaching Format

Results for the statistical tests using five planned contrasts are given in Table III. The results show that even the first contrast is not significant, meaning that the other contrasts are not significant either. For one class that included lecture plus discussion, we noticed that the data showed extremely low gain that severely impacted the mean gain of the group (the sample size of this group is only 5). We recalculated the contrasts leaving out this anomalous value in Table IV, and again, no contrasts were significant.

C. Additional Variables

C1. Class size

We plotted the normalized gain as a function of class size in Figure 3. A bivariate correlation yielded a nonsignificant Pearson r correlation coefficient (a measure for a linear relationship) for this distribution of r=.05. Class size does not appear to be a significant factor in predicting normalized gain scores in classes up to 50 students. The larger classes are not sufficiently sampled to draw a firm conclusion.

Figure 3.

C2. Math prerequisite

The courses in the database are coded for a prerequisite in mathematics. Table V shows the results of a one-tailed independent sample t test comparing the average normalized gain in courses that had a math prerequisite with that in courses that did not; the test yielded a significant (p<.01) difference. The Cohen's d effect size (in essence, the difference between the means in units of standard deviation) was calculated using the formula of Rosenthal and Rosnow (1991):

$$d = \frac{t\,(n_1 + n_2)}{\left(df \cdot n_1 \cdot n_2\right)^{1/2}}$$

in which t is the obtained t value, n1 and n2 the sample sizes and df the degrees of freedom. The effect size in Table V shows that we are dealing with a medium to large effect, keeping in mind that Cohen's classification of d=.8 as being a large effect should not be used as an absolute benchmark, following Thompson (2007).
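As a worked illustration of the effect-size formula, the sketch below converts a t value into Cohen's d. The function name and the numbers are hypothetical, and taking df = n1 + n2 − 2 (the usual value for a pooled-variance independent sample t test) is an assumption, since the text does not state how df was obtained.

```python
from math import sqrt


def cohens_d_from_t(t: float, n1: int, n2: int) -> float:
    """Cohen's d from an independent-sample t value and the group sizes.

    Implements d = t (n1 + n2) / sqrt(df * n1 * n2), with df assumed to
    be n1 + n2 - 2.
    """
    df = n1 + n2 - 2
    if df <= 0:
        raise ValueError("need at least three observations in total")
    return t * (n1 + n2) / sqrt(df * n1 * n2)


# Hypothetical values, not those reported in Table V.
d_example = cohens_d_from_t(t=2.0, n1=10, n2=10)
```

For equal group sizes the expression reduces to the familiar 2t/√df, which gives a quick sanity check on any value computed this way.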

We checked if this result was due to a higher pretest level of student content knowledge in the courses that had a mathematics prerequisite. A one-tailed independent sample t test yielded a significant (p<.05) result. The Cohen's d effect size indicates that this is a medium effect. The results are summarized in Table VI. At least part of the difference in normalized gain between classes that had a math prerequisite and those that did not can be explained by the difference in pretest scores.

C3. Course content mapping

Using a scale from 1 to 11, instructors self-reported the alignment of a course element with items covered on the ADT. We correlated the reported alignment on the various course elements with the normalized gain. However, in the database, some of the fields for a class were left blank, whereas others had the value zero. It was not clear whether a zero actually meant “not related at all to any item on the ADT” or if it simply was another way of denoting missing data (normally, fields missing data are left blank). Therefore, we calculated the Pearson r coefficients twice in Table VII: once with the original database, in which only blank values were ignored in the analysis, and once in which all the zero values were also ignored.
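The two passes over the data described above (blanks ignored, then blanks and zeros ignored) can be sketched as follows. The function names, the use of `None` to stand for a blank field, and all ratings and gains are hypothetical; the actual database encoding is not described in the text.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def alignment_gain_correlation(alignments, gains, drop_zeros=False):
    """Correlate reported alignment with normalized gain.

    Blank fields (None) are always skipped. With drop_zeros=True, a rating
    of zero is treated as missing data and skipped as well.
    """
    pairs = [
        (a, g)
        for a, g in zip(alignments, gains)
        if a is not None and not (drop_zeros and a == 0)
    ]
    xs, ys = zip(*pairs)
    return pearson_r(xs, ys)


# Hypothetical ratings and gains, invented for illustration only.
ratings = [0, 3, None, 7, 9, 0, 11]
gains = [0.20, 0.10, 0.15, 0.18, 0.20, 0.05, 0.26]
r_blanks_ignored = alignment_gain_correlation(ratings, gains)
r_zeros_ignored = alignment_gain_correlation(ratings, gains, drop_zeros=True)
```

Comparing the two coefficients shows how sensitive the result is to the interpretation of a zero, which is the reason for reporting both.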

One class in particular stood out. Class number 69 (see Appendix A) reported a rating of 1 (out of 11) for the alignment of the reading with the ADT, yet has a normalized gain of 0.52. We judged this to be an anomaly. Leaving out this anomalous value leads to the Pearson r value reported in brackets.

Normally, one would expect combinations of factors—for example, a high score on the content mapping for reading and lecture—to have an effect as well. However, because all the data are self-reported, almost certainly leading to inconsistent values attached to similar mappings, we did not investigate such interactions.

THE USE OF DIFFERENT ESTIMATORS

The normalized gain is biased toward high pretest scores. It is thus possible to find a statistically significant difference between two normalized gains that is an artifact of the different pretest scores. To investigate the effect of bias, we modeled three different measures of gain. These different measures of gain are biased toward different regions of pretest scores. If one finds statistical significance in one measure but not in others, the results can be suspect. However, if one finds significance on a multitude of measures, or if one fails to find significance on a multitude of measures, a much more compelling case can be made regarding the validity of the results.

We evaluated the following measures of gain:

• Hake's normalized gain, defined as: gain = (post − pre)/(100 − pre)

• Gain 2, defined as: gain = (post − pre)/(post + pre)

• Gain 3, defined as: gain = (post − pre)/pre

For a detailed analysis of the biases of these measures of gain, see Appendix B.
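To make the different biases concrete, the sketch below evaluates the three measures for two hypothetical classes that both improve by the same 10 percentage points but start from different pretest scores. The function names and the scores are illustrative assumptions, not data from the study.

```python
def hake_gain(pre: float, post: float) -> float:
    """Hake's normalized gain: improvement relative to the room left to improve."""
    return (post - pre) / (100.0 - pre)


def gain_2(pre: float, post: float) -> float:
    """Gain 2: improvement relative to the sum of both scores."""
    return (post - pre) / (post + pre)


def gain_3(pre: float, post: float) -> float:
    """Gain 3: improvement relative to the pretest score."""
    return (post - pre) / pre


# Two hypothetical classes with the same raw improvement of 10 points.
low_start = (20.0, 30.0)
high_start = (60.0, 70.0)

comparison = {
    "hake": (hake_gain(*low_start), hake_gain(*high_start)),
    "gain_2": (gain_2(*low_start), gain_2(*high_start)),
    "gain_3": (gain_3(*low_start), gain_3(*high_start)),
}
```

In this example the normalized gain is larger for the class with the higher pretest score, whereas Gain 2 and Gain 3 are larger for the class with the lower pretest score, even though the raw improvement is identical.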

A correlation between the pretest and posttest scores (Table VIII) shows the biases involved in a different way. It is clear that there is a strong linear relation between pretest score and normalized gain.

To investigate the effect of these biases on our data, we used the different measures of gain to recalculate the planned contrasts in order to see if one of them would yield significance. The results of the planned contrasts analysis are given in Tables IX and X. No significance for any contrast was found with the estimators Gain 2 and Gain 3.

CONCLUSIONS

Based on the results listed in the previous sections, we reached the following conclusions. First, there are no significant differences in normalized gain between the four course formats. This can be interpreted in two ways. One interpretation is that the ADT contains only 21 questions that cover a wide range of astronomy topics. The ADT is thus not as tightly focused on a sample of related concepts as the FCI is. As such, the ADT cannot be considered a true diagnostic tool in the same sense as the FCI (Hestenes et al. 1992). There are probably not enough questions per concept covered in a typical introductory astronomy class to adequately probe student understanding of any one particular concept, if the ADT covers the concept at all. The low resolution of the ADT may thus influence the relatively low normalized gains that were observed. Gains lower than 0.3 are considered to be in the low region according to Hake (1998); the medium region is between 0.3 and 0.7, and gains larger than 0.7 are considered large. Only five classes (numbers 22, 28, 53, 67, and 69 in Appendix A) score a medium gain, and the rest of the classes are in the low region. Because of these low gains and low final scores (around 50%), there is ample room for growth, both positive and negative. This means that ceiling and floor effects in all the measures of gain are negligible.

Another way to interpret the results is via the argument that the four different formats that we investigated here are instructionally equivalent; all are essentially instructor-centered formats, without explicit interactive engagement elements in the courses in the sense of the Hake (1998) study. Therefore, it should not come as a surprise that all observed gains are statistically equivalent because one can argue that the only pedagogically relevant variable among the courses potentially is time on task.

Second, the use of the normalized gain as a measure for course effectiveness may be suspect. The normalized gain is biased toward high pretest scores, as indicated in Table VIII. The bias inflates differences, which makes it easier to find statistical significance. This can lead to claims about course effectiveness that may not be warranted. Other estimators used in this study were not so strongly biased toward pretest scores.

Third, the size of the class does not correlate with the normalized gain for class sizes smaller than 50, as illustrated by Figure 3 and by the low and nonsignificant Pearson r coefficient found for this distribution. Larger classes were not sufficiently sampled to draw any conclusion. Although this may indicate that class size does not influence student scores, we do not want to draw that conclusion because of the issues with the ADT as an instrument mentioned above. In addition, the sizes of lab and discussion sections were unknown. Part of the anonymity of a large lecture can be overcome in smaller, more personalized lab and discussion sections.

Fourth, prerequisites for mathematics show a positive correlation with the calculated normalized gain. This may be partly due to students entering such a class having higher pretest scores than students in classes with no mathematics prerequisite. We also suspect individual student demographics to be a factor because some mathematics prerequisites encourage students to take an astronomy course later in their academic careers. This may mean that students have developed more success skills for college courses. As such, they may have learned to get more out of a class, resulting in higher gains.

Fifth, the reported alignment of the lecture with ADT items is positively correlated with the normalized gain. The lecture is the most consistent factor in Table VII. This also is not surprising because it seems likely that lecture encompasses most of the time on task for the course, although the database does not provide direct evidence for this.

Last, it appears that the learning in all formats can be described best by a growth model of the form postscore = constant × prescore. This model accounts for over 50% of the variance in the cases in which a significant value (significantly deviating from zero) was found. The combination of low gains, the avoidance of the ceiling and floor effect regions, and a first-order relationship between pretest score and posttest score would argue for using a different measure for gain than the normalized gain. A case can be made for using Gain 3 because it is not biased toward pretest scores in this region, with this functional relationship between pretest and posttest scores. In general, choosing a measure for gain should depend on the relationship between pretest and posttest scores because different functional relations will bias different measures for gain in different ways.
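The dependence of each measure on the pretest score under this growth model can be checked with a short derivation. The symbol c stands for the constant in the model and is introduced here for illustration; the derivation treats the model as exact, which the data only approximate.

```latex
\begin{align*}
\text{post} &= c \cdot \text{pre}, \qquad c > 1 \\
\text{Normalized gain} &= \frac{\text{post} - \text{pre}}{100 - \text{pre}}
  = \frac{(c - 1)\,\text{pre}}{100 - \text{pre}} \\
\text{Gain 2} &= \frac{\text{post} - \text{pre}}{\text{post} + \text{pre}}
  = \frac{c - 1}{c + 1} \\
\text{Gain 3} &= \frac{\text{post} - \text{pre}}{\text{pre}} = c - 1
\end{align*}
```

Under this idealized model, the normalized gain grows with the pretest score because its numerator increases while its denominator shrinks, whereas Gain 3 (and, in this special case, Gain 2) does not depend on the pretest score at all.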

A. Recommendations

To truly measure student understanding as a function of class format, more sensitive instruments will be needed. However, this is a double-edged sword. Concept inventories that focus on a single conceptual domain, like the ones on lunar phases (Lindell 2001), stars (Bailey 2006), greenhouse effect (Keller 2006), and light and spectra (Bardar 2006; Bardar et al. 2007), probe conceptual understanding of one particular topic and may be more sensitive to different instructional designs. For concept inventories to be successful, it is important that they are developed by people who are also experts in the discipline, as Hake (2007) argued. However, a word of caution is applicable in the use of concept inventories as measures for overall course effectiveness. Just as with the FCI in physics, a measurement of student understanding for a single concept might not be representative of student overall understanding or course effectiveness. In a semester, many concepts are covered, and time spent on one of the topics that can be measured by one of the concept inventories listed above may be small. The alternative viewpoint is that if an instructor designs effective instruction for a particular conceptual domain, it is likely that students are receiving similarly effective instruction on other topics.

On a more logistical front, several recommendations can be made. If a large data-gathering project like the one by Hufnagel and Deming (1999) is undertaken again, some elements from that project could be improved to make the database product more useful to researchers. The first recommendation is to find avenues to develop and secure pretest and posttest data that are matched to individual student gains, yielding more powerful normalized gain scores. This would allow us to use repeated-measure statistics rather than independent-sample statistics, which would drastically reduce error terms. Moreover, Bao (2006) argued that using class averages rather than individual student scores can lead to different gain scores. This may require additional adjustments to determine how the attrition rate biases the data. An additional advantage would be that researchers can investigate individual classes rather than an aggregate of classes only. It would allow us to do a rigorous item analysis on the questions on the ADT. A second recommendation is to endeavor to obtain a more homogeneous determination of the mapping of the content on the ADT (rather than to rely on self-reports by the instructors) and to give an estimate of time spent on each of the mapping factors (the course elements). Although this would be difficult to do, it would allow researchers to make a more rigorous determination of which course elements influence gain scores most effectively.
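The benefit of matched data mentioned above can be illustrated with a small numerical sketch. It compares a pooled-variance independent sample t statistic with a paired (repeated-measure) t statistic on the same invented scores. The function names and the five score pairs are hypothetical and serve only to show how matching removes between-student variation from the error term.

```python
from math import sqrt
from statistics import mean, stdev, variance


def independent_t(pre, post):
    """Pooled-variance independent-sample t statistic (post minus pre)."""
    n1, n2 = len(pre), len(post)
    pooled = ((n1 - 1) * variance(pre) + (n2 - 1) * variance(post)) / (n1 + n2 - 2)
    return (mean(post) - mean(pre)) / sqrt(pooled * (1 / n1 + 1 / n2))


def paired_t(pre, post):
    """Paired t statistic computed from each student's own difference."""
    diffs = [b - a for a, b in zip(pre, post)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Hypothetical matched percentage scores for five students.
pre_scores = [20.0, 35.0, 50.0, 65.0, 80.0]
post_scores = [28.0, 41.0, 60.0, 72.0, 89.0]

t_independent = independent_t(pre_scores, post_scores)
t_paired = paired_t(pre_scores, post_scores)
```

In this example every student improves by a similar amount, so the paired statistic is far larger than the independent one: the wide spread in starting scores dominates the unmatched error term but cancels out when each student is compared with himself or herself.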

Acknowledgments

EB would like to thank Joel Levin and Sanlyn Buxner for their advice, recommendations, and comments. The authors thank the anonymous referee for the thoughtful comments and suggestions.

APPENDIX A: SUMMARY STATISTICS FOR THE CLASSES MODEL

See EPAPS supplementary material for Appendix A in PDF format.

APPENDIX B: GAIN BEHAVIOR AS A FUNCTION OF LEARNING MODEL

See EPAPS supplementary material for Appendix B in PDF format.

REFERENCES


  1. Bailey, J. M. 2006, “Development of a Concept Inventory to Assess Students' Understanding and Reasoning Difficulties About the Properties and Formation of Stars,” PhD dissertation, University of Arizona, Tucson.
  2. Bao, L. 2006, “Theoretical Comparison of Average Normalized Gain Calculations,” Am. J. Phys., 74(10), 917.
  3. Bardar, E. M. 2006, “Development and Analysis of Spectroscopic Learning Tools and the Light and Spectroscopy Concept Inventory for Introductory College Astronomy,” PhD dissertation, Boston University, Boston, MA.
  4. Bardar, E. M., Prather, E. E., Brecher, K., & Slater, T. F. 2007, “Development and Validation of the Light and Spectroscopy Concept Inventory,” Astronomy Education Review, 5(2), 103.
  5. Deci, E. L., & Ryan, R. M. 1985, Intrinsic Motivation and Self-Determination in Human Behavior, New York: Plenum Press.
  6. Deming, G. 2002, “Results from the Astronomy Diagnostic Test National Project,” Astronomy Education Review, 1(1), 52.
  7. Deming, G., & Hufnagel, B. 2001, “Who's Taking ASTRO 101?” Phys. Teach., 39(6), 368.
  8. Hake, R. R. 1998, “Interactive-Engagement versus Traditional Methods: A Six-Thousand-Student Survey of Mechanics Test Data for Introductory Physics Courses,” Am. J. Phys., 66(1), 64.
  9. Hake, R. R. 2007, “Should We Measure Change? Yes!” To appear as a chapter in Evaluation of Teaching and Student Learning in Higher Education, American Evaluation Association monograph.
  10. Halloun, I., & Hestenes, D. 1985, “Common Sense Concepts about Motion,” Am. J. Phys., 53, 1056.
  11. Hestenes, D., & Wells, M. 1992, “A Mechanics Baseline Test,” Phys. Teach., 30, 159.
  12. Hestenes, D., Wells, M., & Swackhamer, G. 1992, “Force Concept Inventory,” Phys. Teach., 30, 141.
  13. Hufnagel, B. 2002, “Development of the Astronomy Diagnostic Test,” Astronomy Education Review, 1(1), 47.
  14. Hufnagel, B., & Deming, G. 1999, “The Astronomy Diagnostic Test: Comparing Your Class to Others,” BAAS, 31(3), 937.
  15. Keller, J. M. 2006, “Eliciting and Addressing Undergraduate Student Beliefs and Reasoning Difficulties Regarding the Atmospheric Greenhouse Effect,” PhD dissertation, University of Arizona, Tucson.
  16. Lindell, R. 2001, “Enhancing College Students' Understanding of Lunar Phases,” PhD dissertation, University of Nebraska, Lincoln.
  17. Rosenthal, R., & Rosnow, R. L. 1991, Essentials of Behavioral Research: Methods and Data Analysis, (2nd ed.), New York: McGraw Hill.
  18. Thompson, B. 2007, “Effect Sizes, Confidence Intervals, and Confidence Intervals for Effect Sizes,” Psychology in the Schools, 44(5), 423.
  19. Zeilik, M. 2003, “Birth of the Astronomy Diagnostic Test: Prototest Evolution,” Astronomy Education Review, 1(2), 46.
  20. See EPAPS supplementary material at http://dx.doi.org/10.3847/AER2007003 for the additional content of Appendices A and B in PDF format.

FIGURES


Fig. 1. The Hake distribution of classes. Note that the different populations barely overlap. (Adapted from E. F. C. Dokter & S. R. Buxner, pers. comm.)

Fig. 2. The ADT distribution of classes. Note that the different course formats do overlap.

Fig. 3. Normalized gain distribution as a function of class size.

TABLES

Table I. Planned contrasts evaluations

| Format 1 | Format 2 |
|---|---|
| Lecture | Lecture+discussion |
| Lecture | Lecture+lab |
| Lecture | Lecture+lab+discussion |
| Lecture+lab | Lecture+lab+discussion |
| Lecture+discussion | Lecture+lab+discussion |

Table II. Summary statistics for the different course formats

| | Lecture | Lecture+discussion | Lecture+lab | Lecture+lab+discussion |
|---|---|---|---|---|
| # classes | 16 | 5 | 40 | 9 |
| # students pretest | 1045 | 549 | 1730 | 723 |
| Mean prescore (21 max) | 6.65 | 6.09 | 6.70 | 7.31 |
| Standard deviation | 0.71 | 0.28 | 1.09 | 0.90 |
| # students posttest | 758 | 369 | 1371 | 582 |
| Mean postscore (21 max) | 9.66 | 8.60 | 9.77 | 11.44 |
| Standard deviation | 1.61 | 1.13 | 1.88 | 2.01 |
| Mean normalized gain | 0.2098 | 0.1681 | 0.2146 | 0.3016 |
| Standard deviation on normalized gain | 0.0870 | 0.0628 | 0.0973 | 0.1221 |
| Standard error on normalized gain | 0.0217 | 0.0281 | 0.0154 | 0.0407 |
| Lower limit 95% confidence interval | 0.1672 | 0.1131 | 0.1845 | 0.2218 |
| Upper limit 95% confidence interval | 0.2525 | 0.2231 | 0.2448 | 0.3814 |
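The standard errors and confidence limits in Table II appear consistent with dividing the standard deviation of the class-level normalized gains by the square root of the number of classes and then applying a normal-approximation multiplier of 1.96. The Python sketch below checks that reading against the tabulated values. The multiplier, the function name `gain_interval`, and the dictionary layout are assumptions made for illustration and are not a description of the authors' actual procedure.

```python
from math import sqrt

# Values transcribed from Table II: (number of classes, mean gain, SD of gain).
TABLE_II = {
    "Lecture": (16, 0.2098, 0.0870),
    "Lecture+discussion": (5, 0.1681, 0.0628),
    "Lecture+lab": (40, 0.2146, 0.0973),
    "Lecture+lab+discussion": (9, 0.3016, 0.1221),
}

Z_95 = 1.96  # assumed normal-approximation multiplier for a 95% interval


def gain_interval(n_classes, mean_gain, sd_gain, z=Z_95):
    """Return (standard error, lower limit, upper limit) for a mean gain."""
    se = sd_gain / sqrt(n_classes)
    return se, mean_gain - z * se, mean_gain + z * se


for course_format, (n, mean, sd) in TABLE_II.items():
    se, lower, upper = gain_interval(n, mean, sd)
    print(f"{course_format:24s} SE={se:.4f}  95% CI=[{lower:.4f}, {upper:.4f}]")
```

Under these assumptions the computed limits agree with Table II to within rounding in the fourth decimal place. With so few classes in some formats, a t-based multiplier would give somewhat wider intervals.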

Table III. Ranked planned contrasts with obtained and critical (Holm-Bonferroni) p values

| Format 1 | Format 2 | Obtained t value | Obtained p value | Rank of contrast | Critical p value |
|---|---|---|---|---|---|
| Lecture | Lecture+discussion | 1.105^a | 0.16 (0.84) | 5 | N/A |
| Lecture | Lecture+lab | −0.202 | 0.42 | 4 | N/A |
| Lecture | Lecture+lab+discussion | −2.24 | 0.017 | 2 | N/A |
| Lecture+discussion | Lecture+lab+discussion | −2.31 | 0.19 | 3 | N/A |
| Lecture+lab | Lecture+lab+discussion | −2.35 | 0.012 | 1 | 0.01 |

Critical value for the first contrast is p=0.01.
^a The obtained t value indicates that it is located in the opposite tail of the distribution in the one-tail analysis, hence the ranking of 5.
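The ranking and critical p values in Table III follow the sequential Holm-Bonferroni logic: contrasts are ordered from the smallest obtained p value upward, each is compared with a progressively less strict threshold, and testing stops at the first contrast that fails. A minimal Python sketch of that procedure is given below. The family-wise level of 0.05 is an assumption inferred from the stated critical value of 0.01 for five contrasts, the function name `holm_bonferroni` is hypothetical, and the opposite-tail contrast is entered with its parenthesized p value (0.84).

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Rank contrasts by p value and apply the Holm-Bonferroni step-down test.

    Returns a list of (label, p, critical_p, rejected) tuples in rank order.
    alpha=0.05 is an assumed family-wise error rate.
    """
    ranked = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ranked)
    results = []
    still_rejecting = True
    for rank, (label, p) in enumerate(ranked, start=1):
        critical = alpha / (m - rank + 1)
        # Once one contrast fails, every later contrast is retained as well.
        still_rejecting = still_rejecting and p <= critical
        results.append((label, p, critical, still_rejecting))
    return results


# Obtained one-tailed p values transcribed from Table III.
TABLE_III_P = {
    "Lecture vs Lecture+discussion": 0.84,
    "Lecture vs Lecture+lab": 0.42,
    "Lecture vs Lecture+lab+discussion": 0.017,
    "Lecture+discussion vs Lecture+lab+discussion": 0.19,
    "Lecture+lab vs Lecture+lab+discussion": 0.012,
}

for label, p, critical, rejected in holm_bonferroni(TABLE_III_P):
    print(f"{label:46s} p={p:<6} critical={critical:.4f} rejected={rejected}")
```

Under these assumptions the first-ranked contrast (p=0.012) already exceeds its critical value of 0.01, so the procedure stops there and no later contrast is tested. This is why the remaining rows list N/A as the critical p value.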

Table IV. Ranked planned contrasts after the anomalous value in the group Lecture+Discussion was removed

| Format 1 | Format 2 | Obtained t value | Obtained p value | Rank of contrast | Critical p value |
|---|---|---|---|---|---|
| Lecture | Lecture+discussion | 0.662^a | 0.26 (0.73) | 5 | N/A |
| Lecture | Lecture+lab | −0.202 | 0.42 | 4 | N/A |
| Lecture | Lecture+lab+discussion | −2.24 | 0.017 | 3 | N/A |
| Lecture+discussion | Lecture+lab+discussion | −2.60 | 0.014 | 2 | N/A |
| Lecture+lab | Lecture+lab+discussion | −2.35 | 0.012 | 1 | 0.01 |

Critical value for the first contrast is p=0.01.
^a The obtained t value indicates that it is located in the opposite tail of the distribution in the one-tail analysis, hence the ranking of 5.

Table V. Results for the independent sample t test between math prerequisite and normalized gain

| Prerequisite | N | Mean normalized gain | Obtained t value | p (one-tailed) | Cohen's d effect size |
|---|---|---|---|---|---|
| Algebra+trigonometry | 32 | 0.26 | 2.796 | 0.004 | 0.70 |
| No math prerequisite | 33 | 0.19 | | | |

Table VI. Results for the independent sample t test between math prerequisite and pretest score

| Prerequisite | N | Raw mean prescore | Obtained t value | p (one-tailed) | Cohen's d effect size |
|---|---|---|---|---|---|
| Algebra+trigonometry | 32 | 6.92 | 1.741 | 0.043 | 0.44 |
| No math prerequisite | 33 | 6.50 | | | |
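The Cohen's d values in Tables V and VI can be recovered from the obtained t values alone. The reported effect sizes appear consistent with the conversion d = 2t/√df for two groups of nearly equal size, with df = N1 + N2 − 2. The Python sketch below checks this against both tables. That the authors used this particular conversion is an inference from the numbers rather than something stated here, and the function name `cohens_d_from_t` is hypothetical.

```python
from math import sqrt


def cohens_d_from_t(t_value, n1, n2):
    """Approximate Cohen's d from an independent-samples t value.

    Uses d = 2t / sqrt(df) with df = n1 + n2 - 2, which assumes the two
    groups are of (nearly) equal size.
    """
    df = n1 + n2 - 2
    return 2 * t_value / sqrt(df)


# Group sizes and t values transcribed from Tables V and VI.
d_gain = cohens_d_from_t(2.796, 32, 33)      # normalized gain (Table V)
d_pretest = cohens_d_from_t(1.741, 32, 33)   # pretest score (Table VI)

print(f"Normalized gain: d = {d_gain:.2f}")
print(f"Pretest score:   d = {d_pretest:.2f}")
```

Rounded to two decimals, these reproduce the tabulated effect sizes of 0.70 and 0.44.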

Table VII. Pearson r coefficients for the correlation of various parts of a class with the normalized gain

| Form of content delivery | r (zeros included) | r (zeros ignored) |
|---|---|---|
| Reading | 0.22 (0.40^b) | 0.19 (0.40^b) |
| Lecture | 0.38^b | 0.38^b |
| Homework | 0.31^a | 0.02 |
| Activity | 0.10 | 0.05 |
| Lab | 0.12 | 0.23 |

^a Significant at the 0.05 level.
^b Significant at the 0.01 level.

Table VIII. Correlations between pretest percent score and the various measures of gain

| | Posttest% | Normalized | Gain 2 | Gain 3 |
|---|---|---|---|---|
| Pearson r | 0.754 | 0.469 | 0.018 | 0.025 |

Note the strong correlation between the normalized gain and the pretest value.

Table IX. Obtained p values using Gain 2

| Format 1 | Format 2 | Obtained t value | Obtained p value | Rank of contrast | Critical p value |
|---|---|---|---|---|---|
| Lecture | Lecture+discussion | 0.466^a | 0.33 (0.67) | 5 | N/A |
| Lecture | Lecture+lab | −0.178 | 0.43 | 3 | N/A |
| Lecture | Lecture+lab+discussion | −1.61 | 0.06 | 1 | 0.01 |
| Lecture+discussion | Lecture+lab+discussion | −1.63 | 0.065 | 2 | N/A |
| Lecture+lab | Lecture+lab+discussion | 1.48^a | 0.07 | 4 | N/A |

Critical value for the first contrast is p=0.01.
^a The obtained t value indicates that it is located in the opposite tail of the distribution in the one-tail analysis, hence the ranking at the bottom of the list.

Table X. Obtained p values using Gain 3

| Format 1 | Format 2 | Obtained t value | Obtained p value | Rank of contrast | Critical p value |
|---|---|---|---|---|---|
| Lecture | Lecture+discussion | 0.494^a | 0.31 (0.69) | 5 | N/A |
| Lecture | Lecture+lab | −0.245 | 0.40 | 3 | N/A |
| Lecture | Lecture+lab+discussion | −1.63 | 0.06 | 2 | N/A |
| Lecture+discussion | Lecture+lab+discussion | −1.71 | 0.06 | 1 | 0.01 |
| Lecture+lab | Lecture+lab+discussion | 1.48^a | 0.07 | 4 | N/A |

Critical value for the first contrast is p=0.01.
^a The obtained t value indicates that it is located in the opposite tail of the distribution in the one-tail analysis, hence the ranking at the bottom of the list.

