Today the EEF published a long-term study looking at the relationship between structured teacher observation and pupil outcomes at GCSE in maths and English. This interested me as it is one of the issues in education on which I do not hold a strong opinion. On the one hand, I find the idea of teachers observing one another a good thing, and my starting presumption would be that I and my pupils have benefited from other teachers observing me, and that I have helped other teachers by observing them. On the other hand, I have always been sceptical about non-specialists observing one another, and I suspect a great deal of observation that takes place in schools – particularly performance management observation – does not do the job it is supposed to do. So I came to this EEF study with a relatively open mind.
What I found did not answer the questions I had about teacher observation. Instead, I found a study that I felt to be poorly thought-through, and which raised some significant questions about value for money and the rigour of the process by which projects are approved.
The best starting point here is to think through the causal mechanism that sits behind this study. The study sought to answer the research question “What is the impact of two years of the Teacher Observation programme on learners’ GCSE maths and English achievement?” To be clear, the evaluation looks at a particular structured form of observation that involves teachers observing one another and then rating the other teacher’s teaching using a rubric which is set out on pp.62-64 of the evaluation.
Now if I were designing this study, my first question here would be “why would teacher observation make a difference?” I would want to know why this intervention – observing teachers using a rubric – would result in a change in the grades the pupils get in the exams. This is what I mean by a ‘causal model’ – what is the presumed causal relationship between the intervention and the results? By far the most obvious answer, barring some form of ‘observer influence’ on the pupils (which seems unlikely given the relatively small number of observations), is that observing or being observed causes the teacher to change his or her practice in some way as a consequence of the observation. It is this change in practice, one presumes, that would be the thing that has an effect on how well the pupils do.
Having established that the causal link in the design of this study is a change in teacher practice, the next step would be to ask “in what ways does the study imagine that teachers’ practice will change?” The answer here lies in the rubric on pp.62-64. The rubric defines various types of practice as ‘ineffective’, ‘basic’, ‘effective’ and ‘highly effective’. I presume that the designers of the study were working on the basis that teachers who are observed and judged against this rubric will do fewer of the practices defined as ‘ineffective’ or ‘basic’ and more of the practices defined as ‘effective’ or ‘highly effective’. This would then have an effect on pupil outcomes. The causal model can thus be summarised as:
- Teachers are observed against a rubric of ineffective and effective teaching strategies
- As a consequence of observation, teachers adopt more effective teaching strategies
- As a consequence of teachers adopting more effective teaching strategies, pupil results improve.
This is not a particularly complex causal model, although I could not find it set out anywhere in the evaluation document, which I do find baffling. Anyway, what does this all mean? Well, it means that the question
“Does structured teacher observation improve outcomes?”
is actually the following two questions:
(a) Do the teaching styles identified as ‘effective’ and ‘highly effective’ in the rubric improve pupil outcomes?
(b) Does the observation model used cause a teacher to adopt this style of teaching?
Question (b) should in theory be a relatively straightforward research question to answer. Teacher surveys and interviews, subsequent observation findings, analyses of pupil work, and so on, can all tell researchers the extent to which the style of teaching recommended in the observation rubric has or has not been adopted. It is, however, only one part of the causal link. The next one – whether those practices are then effective in improving outcomes – is the next crucial stage in the logic behind the study. And, of course, this question is significantly harder to answer. A great deal of digital ink has been spilt on the effectiveness of different teaching practices. This debate is not considered in the study. Instead, the study relies on the assumption that the teaching practices identified in the rubric as ‘effective’ and ‘highly effective’ are the ones that are going to result in improved outcomes.
So let’s turn to that question. What does the rubric in the EEF intervention describe as ‘highly effective’ teaching? I have chosen a sample here and you can read the whole rubric on pp.62-64 of the evaluation. ‘Highly-effective’ teaching means:
- Questions reflect high expectations and are culturally and developmentally appropriate. Students formulate many of the high-level questions and ensure that all voices are heard.
- Students, throughout the lesson, are highly intellectually engaged in significant learning and make material contributions to the activities, student groupings, and materials. The lesson is adapted as needed to the needs of individuals, and the structure and pacing allow for student reflection and closure.
- Assessment is used in a sophisticated manner in teaching, through student involvement in establishing the assessment criteria, self- or peer assessment by students, monitoring of progress by both students and the teacher, and high-quality feedback to students from a variety of sources. Students use self-assessment and monitoring to direct their own learning.
- The teacher seizes an opportunity to enhance learning, building on a spontaneous event or student interests, or successfully adjusts and differentiates instruction to address individual student misunderstandings. The teacher ensures the success of all students by using an extensive repertoire of teaching strategies and soliciting additional resources from the school or community.
Straightaway, I can see some fairly fundamental problems in this rubric, which I am surprised did not form part of the evaluation. For starters, the rubric ‘incorporates features’ of the Teachers’ Standards and Ofsted framework, which is somewhat surprising given that Ofsted itself says that the framework should not be used to evaluate the effectiveness of individual lessons. As is nearly always the case with generic descriptors, the statements in the rubric are very vague, non-subject-specific and open to interpretation. I do not know what “highly intellectually engaged in significant learning” means. The rubric states that a highly-effective teacher uses “an extensive repertoire of teaching strategies”, without specifying what those teaching strategies are. The evaluation states that teachers received training on using these statements, but the detail of this is not recorded in the evaluation, so we are left none the wiser as to the nature of a fundamental causal link in the theory underpinning the study.
Where the statements are less vague, there are some steers towards a certain style of teaching. For example, the following is deemed an ‘ineffective’ teaching style:
- Some of the teacher’s questions elicit a thoughtful response, but most are low-level, posed in rapid succession.
There is of course a great deal of debate on this. I attended a very interesting seminar recently which looked at the relationship between fluency (i.e. accuracy + speed) and attainment in mathematics, where at least some of the research evidence suggested that getting better at maths was helped by the teacher posing lots of rapid low-level questions. In the evaluation materials one teacher refers to the observations encouraging her to do more ‘group work’, although it is not at all clear whether the form of group work the project encouraged was likely to help or hinder pupils in maths or English. Most of those who read the education blogs will know that ‘group work’ is a highly contested idea whose efficacy is far from clear.
So, in short, the fundamental link in the causal chain between the observations and pupil performance – i.e. the teaching practices that the observations encouraged – is defined in terms that are vague and open to interpretation and that, in some cases, might encourage teachers to adopt a teaching style that is not actually effective.
What conclusions can we draw from this study? You should note, first, that the trial showed no improvement in results when the group of schools that conducted the intervention was compared to a control group of schools. I think, from the design of the study, that there are three possible conclusions one might draw from this:
Possible Conclusion 1 – the observation model used did not result in a change in teacher practice, and therefore there was no impact on pupil outcomes
Possible Conclusion 2 – the observation model used did result in a change in teacher practice, but the vague definitions of ‘effective’ and ‘highly effective’ meant that the changes did not all take the same form, resulting in more and less effective teaching approaches cancelling each other out in the end result.
Possible Conclusion 3 – the observation model used did result in a change in teacher practice in line with the definitions of ‘effective’ and ‘highly effective’ used in the rubric, but these practices are not actually effective.
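A rough way to see why a null result cannot tell these three apart is to treat the overall effect as the share of teachers who change their practice multiplied by the average effect of the changes they make. The sketch below is only an illustration under that assumption: the names `Scenario` and `expected_effect` are hypothetical, and the figures are invented rather than taken from the evaluation.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Scenario:
    name: str
    adoption_rate: float            # share of teachers who change practice (invented)
    practice_effects: list[float]   # effect on grades of each form the change takes (invented)


def expected_effect(s: Scenario) -> float:
    """Share of teachers who change x average effect of the changes they make."""
    return s.adoption_rate * mean(s.practice_effects)


scenarios = [
    # Nobody changes practice, so any benefit of the practices never reaches pupils.
    Scenario("Possible Conclusion 1: no change in practice", 0.0, [0.3]),
    # Everyone changes, but in different directions that cancel each other out.
    Scenario("Possible Conclusion 2: changes cancel out", 1.0, [0.3, -0.3]),
    # Everyone changes in the same way, but the practice itself does nothing.
    Scenario("Possible Conclusion 3: consistent but ineffective", 1.0, [0.0]),
]

for s in scenarios:
    print(f"{s.name}: expected effect = {expected_effect(s):+.2f}")
```

All three scenarios give an expected effect of zero, which is the point: the headline result looks identical in each case, so only evidence about what teachers actually did differently could separate them.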
Each of these possible conclusions results in a different takeaway message when read by a wider audience.
Possible Conclusion 1 implies that the form of structured observation used in the intervention was ineffective in changing teacher practice, and should therefore not be adopted. This outcome would mean that we spent £1.18 million (and a great deal of teacher time) determining that giving observers an iPad with the RANDA TOWER software loaded (which cost £200,000 – I must admit I cannot tell from the Evaluation what this software could do that existing freeware could not) and asking them to observe lessons with the frequency used did not result in a change in practice. If this is the correct interpretation of the results, then at least we know not to use this particular approach to observation in future, although I would question the value for money here. Importantly, this possible conclusion does not support the headline (both on the EEF website and in the media) that “increasing structured teacher observation makes no difference to GCSE English and maths results”.
Possible Conclusion 1 at least gives us a takeaway. Possible Conclusion 2 is in many ways the worst outcome, as it means we wasted £1.18 million by not having a rubric that was sufficiently helpful to result in a consistent change in teacher practice. I reckon the designers of the intervention would reject Possible Conclusion 2 on the grounds that there was lots of training and that teachers did say they found it helpful, but I would want to see a lot more evidence to show that (a) all the teachers interpreted the rubric in the same way and (b) what that way was. Without this, this intervention is £1.18 million down the drain, for we are no closer to understanding the relationship between structured observation and pupil outcomes.
Possible Conclusion 3 is the most interesting, and would actually be a meaningful discovery from this study. If it is the case that the observations did change practice, and that ‘effective practice’ was understood consistently by teachers, then we have some good evidence here that what was defined as ‘effective’ teaching in the rubric is in fact not that effective. If so, then this would suggest that the practices encouraged through this intervention are perhaps not worth using in teacher training or CPD. Interestingly, however, at no point in the Evaluation do we even get the consideration that the types of teacher practice assumed to be ‘highly effective’ might in fact not be, even though that would seem a very plausible interpretation of the results.
I try not to be cynical concerning large-scale trials in education. I have been persuaded that it is possible to design teaching interventions and to use an experimental design to assess whether these improve outcomes for pupils. It is clear to me, however, that this requires the teaching intervention to be defined with great precision and delivered consistently. This is very hard to do, although not impossible, particularly if you have large sums of money available to support the project.
It is also clear to me that the causal mechanism that sits behind a project – that is, what the causal chain is that links the intervention made with the results achieved – should be made clear. With such large sums of money on the line, presumably someone, somewhere, asks “so why would you expect this intervention to work?” This particular EEF study does not address this question: if it had, then some of the conceptual flaws in the study that I have set out in this blog post could have been avoided. As it happens, we have ended up with newspaper headlines saying things like “teacher observations do not improve outcomes”, when in fact, for all the reasons I have set out in this post, this study does not actually show this. We all know that scientific studies are very frequently poorly reported, but in this case I think the fault lies not in a journalist’s interpretation, but rather in a poorly conceptualised study and an evaluation that ignores or brushes over these complexities.
I have focused in this critique on what I see as the obvious oversights in the Evaluation. It should be noted, however, that the Evaluation itself sets out a number of quite concerning things about the whole set-up of this experiment. I still think the EEF is a good idea and that we should be putting public funds into well-conceptualised projects. In this case, over £1 million was spent telling us very little. This is not a good use of public money, and I would urge the EEF and the DfE to think more carefully about approving such studies in the future. To reiterate my opening point, I have written this critique not because I am particularly attached one way or the other to the idea of teacher observation. Rather, it is because I think studies such as this threaten to undermine the potential of good experimental work in education.
This particular study is, in my final analysis, an eye-wateringly-expensive case study of how not to conduct an educational trial.