Part of the ritual of college life is that, at the end of every semester, students dutifully fill out evaluations of their courses, rating their professors’ teaching ability and their own satisfaction with the course. These student evaluations of teaching, or SETs, can influence everything from tenure and promotion decisions to yearly raises. How important they are varies from school to school: smaller liberal arts colleges and universities tend to value teaching more than do big state schools, where research and grants are more critical. Recent years have witnessed a plethora of critiques of these student evaluations from the academics they affect. The objections tend to fall into two broad categories: first, that students are unqualified to rate professors, so student evaluations don’t really measure teaching effectiveness; second, that student evaluations tend to be biased against women and ethnic minorities. As with many emotionally laden issues, the condemnations of student evaluations are often expressed with almost religious zeal. But is the evidence against them clear cut?
Faculty critiques of student evaluations have an obvious, built-in conflict of interest. Given that the reliability of social science data has lately come under scrutiny because of false positives largely driven by researcher biases, this isn’t a minor issue. But is there evidence to support these faculty concerns anyway?
Many of the critics express absolute certainty about either the worthlessness or biased nature of SETs. For instance, the American Sociological Association recently released a statement suggesting that SETs are only “weakly related” to other measures of student learning and “systematically disadvantage faculty from marginalized groups,” particularly women and ethnic minorities. But do the available data support such claims?
Many critics of SETs appear to employ citation bias. That is, they only cite the evidence that supports their claims and fail to cite contradictory evidence. This is poor scientific practice, and arguably unethical, yet it remains common even among professional advocacy organizations when they adopt policy positions. The other issue is shifting goalposts: critics insist on a high standard of evidence for the validity of SETs, but don’t apply the same standard to studies of bias.
Do SETs Reflect Student Learning?
Validity is assessed via correlation coefficients between SET scores and outcomes like student grades or standardized tests. Correlation effect sizes measure the strength of association between these variables, or the degree to which knowing the score on one helps us predict the score on the other. In magnitude, correlation effect sizes range from 0 to 1.0, with 0 being no correlation and 1.0 a perfect correlation. The number of almonds you personally eat each year and the rainfall in your neighborhood might have a close to 0 correlation, whereas the correlation between arsenic consumption and a decrease in heart rate might be close to 1.0 (dead people’s hearts don’t beat at all). Most correlations in social science don’t exceed 0.2. Studies suggest that SETs do, in fact, correlate with outcomes such as student grades, with correlations in the range of 0.2–0.5, which is not bad. Most social science findings scholars believe to be true are based on far smaller effect sizes. Granted, a lot of those beliefs based on weak correlations (below 0.20) are probably false positives, particularly given the replication crisis, but the evidence for SETs’ validity is probably in the safe zone. Taking the mid-range of these effect sizes and squaring it (the variance explained by a correlation is its square), we might say that somewhere around 15% of the variance in SETs is accounted for by student learning. Obviously, that means that many other factors influence SETs. Indeed, research suggests that other predictable factors, such as instructor likeability and personality, influence SETs. But, in the world of measurement, something that can predict 15% of an outcome behavior is nothing to sneeze at. Also, data suggest that SETs correlate pretty strongly with other measurement tools faculty seem to prefer, such as peer evaluations.
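To make the arithmetic behind that 15% figure explicit, here is a small worked illustration. The r values are simply the endpoints and a rough midpoint of the 0.2–0.5 range discussed above, not estimates from any particular study:

```python
# A small worked example (not data from any study): the share of variance in
# SETs that student learning can explain is the square of the correlation.
for r in (0.2, 0.35, 0.5):
    print(f"r = {r:.2f} -> r^2 = {r * r:.2f} ({r * r:.0%} of variance explained)")

# Output:
# r = 0.20 -> r^2 = 0.04 (4% of variance explained)
# r = 0.35 -> r^2 = 0.12 (12% of variance explained)
# r = 0.50 -> r^2 = 0.25 (25% of variance explained)
```

A mid-range correlation of roughly 0.35–0.4 therefore corresponds to somewhere in the neighborhood of 12–16% of variance, hence the “around 15%” figure.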
SETs do reflect student learning, but they also reflect a bunch of other things. Indeed, they’re perhaps best thought of as satisfaction surveys, of which learning is one component. However, the notion that they do not reflect student learning at all, or that they do so less well than other methods, does not appear to be well supported by the data.
Are SETs Biased?
The other issue is bias. Here, I found the evidence to be contradictory and inconsistent. First, some critiques of SETs rely, in part, on studies that examine ratemyprofessor.com, rather than actual SETs. Generalizing from ratemyprofessor.com to professionally designed SETs is conceptually foolish and sloppy, so we need to be careful here. Studies that look at mean differences in SETs based on gender or ethnicity turn up mixed results: some find mean group differences, while others do not. One meta-analysis of studies of gender bias concludes that, “Findings suggest that SETs appear to be valid, have practical use that is largely free from gender bias and are most effective when implemented with consultation strategies.” In that meta-analysis, the effect size for gender bias was nearly zero, certainly an effect size much lower than that supporting the validity of SETs. One article, even a meta-analysis, is not the end of the story. Other individual studies do find evidence for bias larger than that found by this meta-analysis, although even there the effect sizes are generally smaller than those used in assessing the validity of SETs. Yet critics of SETs tend to ignore null results and inconsistencies in the evidence when raising claims of bias: some studies support an argument for mean differences, others don’t find them, and many critiques of SETs fail to acknowledge this.
I generally favor conservative interpretations of effect size. However, we can’t interpret effect sizes in the 0.2–0.5 range as poor evidence of SETs’ validity and simultaneously interpret lower effect sizes as sufficient evidence for bias. To do so would imply obvious confirmation bias, suggesting that the research data itself is irrelevant, merely a fig leaf used to conceal personal opinions beneath a covering of science.
How to Best Demonstrate Bias (or Lack Thereof) Empirically
Even if mean differences between groups exist, this is not, in itself, evidence of bias. Mean differences can indicate either bias or real differences between groups. Given that the mean differences, when found, appear to be very small, they may not be very meaningful. In other words, if some research finds tiny differences between Group X and Group Y on student evaluations, we should not conclude that Group X is better at teaching than Group Y, nor that students are particularly biased against Group Y. Some effects are just too small to interpret as practically meaningful.
To demonstrate bias, studies would need to show not only that there are mean group differences, but that these differences do not correspond to similar mean group differences in the outcome variable (e.g., student learning outcomes). That is to say, if Group X has higher SETs than Group Y, but student learning outcomes are also higher for Group X than for Group Y, and the predictive validity of SETs is the same for both groups, there is no evidence of bias. This isn’t to say that there isn’t any bias, merely that most currently available studies, even those that show evidence of mean differences, generally don’t take the data to the next step that would be required to make such arguments. There are exceptions, but the overall trend hasn’t been clear, and the current data pool has not been assessed under conditions of open, transparent science. The potential for false positive results, particularly on such an emotionally valenced topic, is very high.
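For readers who want a concrete picture of what that “next step” looks like, here is a minimal sketch of a differential-prediction check. The data frame, column names, and group labels are hypothetical placeholders, not values from any cited study; the logic is simply that, after accounting for the learning outcome, the group term and the group-by-learning interaction should be near zero if SETs track learning similarly for both groups.

```python
# Minimal sketch of a differential-prediction check (hypothetical data).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "set_score": [4.1, 3.8, 4.5, 3.9, 4.2, 3.7, 4.4, 4.0],  # mean SET per course (hypothetical)
    "learning":  [78, 71, 85, 74, 80, 70, 83, 76],           # standardized exam score (hypothetical)
    "group":     ["X", "X", "X", "X", "Y", "Y", "Y", "Y"],   # instructor group
})

# Regress SETs on learning, group, and their interaction. Near-zero group and
# interaction coefficients would mean SETs predict learning similarly for both
# groups; a sizeable group effect after controlling for learning would be the
# kind of evidence of bias described above.
model = smf.ols("set_score ~ learning * C(group)", data=df).fit()
print(model.summary())
```

In real use, one would of course need far more course sections than this toy example, and a learning measure that isn’t itself contaminated by grading leniency.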
To design an effective set of SET questions, administrators could develop a pool of potential items, see which best predict student learning (preferably measured via some kind of standardized content examination in each major), and then examine the items to make sure their validity is similar across relevant gender and ethnic groups. This might take a bit of work, but developing effective SETs is far from impossible. The quality of SETs probably varies from institution to institution, and advocating for high-quality SETs is worthwhile.
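As a rough illustration of that screening step, the sketch below keeps only candidate items that correlate reasonably well with the standardized learning measure and do so to a similar degree in each group. The column names and thresholds are hypothetical choices for the example, not an established standard:

```python
# Minimal sketch of screening candidate SET items against a learning outcome
# (all column names and thresholds are hypothetical placeholders).
import pandas as pd

def screen_items(df, item_cols, outcome="learning", group="group",
                 min_r=0.30, max_group_gap=0.10):
    """Keep items that predict learning overall and do so similarly across groups."""
    kept = []
    for item in item_cols:
        overall_r = df[item].corr(df[outcome])  # predictive validity overall
        group_rs = df.groupby(group).apply(lambda g: g[item].corr(g[outcome]))  # validity per group
        if overall_r >= min_r and (group_rs.max() - group_rs.min()) <= max_group_gap:
            kept.append(item)
    return kept

# Example usage on a pilot administration (hypothetical item names):
# screen_items(pilot_df, ["item_1", "item_2", "item_3"])
```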
However, many current arguments that SETs are invalid or biased exceed the available data, or fail to present the often conflicting and inconsistent evidence comprehensively. Further, given the current replication crisis in social science, and the growing list of beliefs that have turned out to rest on false positive findings, evaluations of SETs conducted by faculty, who as a group have an obvious conflict of interest, are clearly ripe for publication bias and other problems. Thus, it is essential that future research adopt open science principles, such as preregistration, to reduce the potential for researcher expectancy effects to create false positives.
Conclusions
At present, the evidence probably does not support the wholesale rejection of SETs as invalid, nor clearly indicate that they are, as a whole, biased (nor does it fully exonerate them). More and better research is needed and, in the future, could indicate some real issues. Nonetheless, faculty are within their rights at an institutional level to insist that SETs be developed through a rigorous, empirical, preregistered process, and that the potential for biases, as indicated by differential validity between groups, be carefully examined. Clear metrics for what outcomes constitute evidence of concern should be established. For instance, the potential for false positive “noise” in effect sizes below r = 0.10 is quite high, and such evidence may not be sufficient to establish either validity or bias, even if statistically significant. In the meantime, faculty would do well to be cautious in their claims when advocating for better SETs.