
Learning by Implementing: What Small-Scale Implementation Studies Reveal about Instructional Innovations

in Implementation and Replication Studies in Mathematics Education
Author:
Julia Tsygan, Department of Pedagogical, Curricular and Professional Studies, University of Gothenburg, Ulricehamnsvägen 31, 121 39 Johanneshov, Sweden

https://orcid.org/0000-0003-4357-9764

Abstract

Research-designed innovations need not be fixed once they are published. Implementation of such innovations may reveal not just contextually significant factors that influence where and how the innovation can successfully be implemented, but also opportunities to make the innovations more effective or practical. In this study, two groups of secondary school teachers implemented two proof-related innovations originally designed and tested in tertiary education. The first was an instructional sequence which aims to transition students from an empirical proof scheme to a deductive one. In parallel, another team of teachers attempted to implement an assessment framework for proof comprehension. Reflection on these two implementation cases illustrates how studies of small-scale implementation attempts can help identify opportunities for adaptations or improvements, as well as the potential need for greater clarity in the instructions accompanying the innovation.

The impact sheet to this article can be accessed at https://doi.org/10.1163/26670127-bja10032 under Supplementary Materials.

1 Introduction

This paper reports on two teacher-led implementations of innovations pertaining to mathematical proof. These implementations did not originate as research studies, but rather as developmental work in which a group of secondary school teachers, meeting every other week, sought to address two problems in teaching and assessment of proof comprehension. Interesting experiences and questions became apparent in the rich notes, teaching materials, and student work produced during the implementation. This inspired me, a researcher and simultaneously one of the teachers implementing the innovations, to take an interest in what such implementations can teach us about the innovations being implemented.

Methodologically, this study is a retrospective multiple case study (Thomas, 2011). The object of inquiry, i.e., the theoretically framed phenomenon under scrutiny, is what small-scale teacher-led implementations can illuminate about the nature, strengths, and limitations of the innovations being enacted. Innovations may have flaws or opportunities for further improvement, unnoticed by the developers. Moreover, even when an innovation works as intended in its original setting, effectiveness and practicality can vary across contexts. When other researchers or teachers implement the innovation in new settings, improvements and context-specific adaptations often become visible. This is not a replication study, because it does not repeat an academic study but rather analyzes the implementation of innovations communicated through such studies. However, the insights one may gain from implementation and replication studies sometimes overlap (Jankvist, Aguilar, et al., 2021). Analogous to the potential of replication studies to inform appraisal of validity and reliability (Aguilar, 2020), implementation studies may help identify issues in effectiveness or practicality and clarify the role of contextual variables. Accordingly, this study examines what insights about innovations are offered by small-scale teacher-led implementations.

The subjects of the inquiry — the particular cases through which this object is explored — are the two implementation efforts documented in this paper. According to Thomas’s (2011) typology, these two cases constitute key cases of small-scale implementation: they are serious and well-informed attempts, carried out by experienced practitioners who sought to implement the innovations faithfully to solve salient problems in teaching and learning. These cases therefore exemplify contexts in which research-based innovations are expected to be adopted and thus offer a meaningful lens through which to examine the object of inquiry.

Both innovations aim to improve the teaching and learning of proof, a domain whose practices diverge markedly between secondary and tertiary education, making contextual variables salient. Yet the two cases differ in kind — one is a teaching sequence and the other an assessment model — so each highlights distinct implementation problems and opportunities. They are also communicated through different channels: one exclusively in academic articles, the other also in a teacher-facing book. Analyzing their implementation side by side, therefore, yields a richer account than a single-case study.

One way to understand the research presented in the current article is as implementation-integral (Cai & Hwang, 2021), meaning that there is ‘a dynamic, continuous reciprocity between the activities of research and implementation with an accompanying blurring of the roles of researchers and teachers’ (p. 1151). Cai and Hwang (2021) argue that such research may be characterized by the interplay of theory for teaching and teaching for theory. By the former, Cai and Hwang (2021) mean theory that informs teachers about actions that promote individual learning and productive social interactions among students. By the latter phrase, they mean ‘teaching that is deliberately designed to generate, elaborate, and test theory so that the field, as well as the student, learns’ (Cai & Hwang, 2021, p. 1150). Cai and Hwang (2021) outline four aspects of implementation-integral research which, in the context of the current study, include: (i) attempts to solve problems of practice related to the teaching and assessment of proof comprehension; (ii) iterative refinement of proof comprehension assessments; (iii) production of shareable artifacts such as lesson plans, tests, and an assessment rubric; and (iv) development of teachers’ capacity to cross the teacher-researcher divide and participate in disciplined inquiry.

This study examines how small-scale, teacher-led implementation studies serve as teaching for theory and clarify the practical usefulness of theory for teaching. The research question is: how may analysis of small-scale, teacher-led implementations inform (a) innovation and theory refinement and (b) evaluation of the feasibility of theory in context and clarity of published guidance?

This article proceeds as follows. The Background clarifies two domains. Implementation research refers to the object of inquiry for this case study and is framed by the lenses of implementation and implementability. Mathematical proof comprehension defines the intended outcomes of the two innovations and thus helps explain the subjects, the two implementation cases. The first case is then presented: implementation of the Stylianides & Stylianides (2009) teaching sequence designed to help students develop an appreciation of the need for mathematical proof. This section describes the teachers’ implementation of this teaching sequence, the obstacles they encountered and the improvements and adaptations they proposed. Next, the second case is presented. This innovation includes a test-based approach to assessing proof comprehension and a process for generating such tests, as presented in Mejía-Ramos et al. (2012) and Mejía-Ramos et al. (2017). Given its greater scope, this section is longer and subdivides the implementation method. Finally, the discussion compares the experiences of implementing the two innovations and synthesizes the reflections.

1.1 Background

This section positions the study within implementation research and introduces the theoretical terminology. It then outlines key constructs and theory in proof comprehension. Throughout, each concept is contextualized in relation to the two secondary-school, teacher-led implementations that follow.

2 Constructs from Implementation Research Motivating this Study

Following Koichu et al. (2024), the dual lenses present in this paper are those of implementation and of implementability. This aligns well with many of the reasons for doing implementation research, as discussed in Century and Cassata (2016). The first lens examines the effectiveness of the innovations and suggests possible improvements. It is the fourth aim of implementation research, in Century and Cassata (2016), to ‘improve innovation design, use, and support in practice settings’ (p. 174).

Using the terminology of Century and Cassata (2016), innovations here signify purposefully designed changes in practice, implementation is the enactment of these changes, and implementation research is the study of this implementation, the variables that affect it and the interplay between the innovation itself, influential variables, and outcomes. As will be evident, a more detailed definition of implementation is useful here. Koichu et al. (2021, p. 986) define implementation as

an ecological disruption to a particular mathematics education system, through the gradual endorsement of innovation in conjunction with an action plan aimed at resolving what is perceived as a problem by (at least some of) the stakeholders involved. The defining feature of implementation is that it occurs in interaction between the innovation and plan proponents and the innovation adapters.

In this study, the two implementation cases addressed perceived problems in teaching proof and assessing proof comprehension. The ecological disruption was the change from ordinary teacher practices to using the two researcher-designed innovations explained later in this text.

The interaction between the innovation proponents and the adopters was indirect. Rather than contact with the researchers who had developed the two innovations, the team consulted material resources that detailed the two innovations. For the first innovation, the Stylianides teaching sequence (Stylianides & Stylianides, 2009), we consulted the academic article just cited, in which the teaching sequence is explained in detail. We also used the depiction in the subsequent teacher-facing book (Arbaugh et al., 2019), to which the researchers involved in the academic article contributed. For the second innovation, the proof comprehension design process, we consulted two articles: Mejía-Ramos et al. (2012), which introduced the proof-comprehension framework, and Mejía-Ramos et al. (2017), which presented the process for constructing multiple-choice tests.

Our work in these two implementation cases is part of a larger teacher-researcher co-learning partnership project. Four teachers participated in different constellations, and I participated in the dual roles of researcher and teacher. In the researcher role, I facilitated the disciplined inquiry elements as described in Pinto and Koichu (2021) by conducting in-depth reading of research, systematizing and analyzing data, and generalizing and communicating findings, such as in the current article. Together, the teachers and I, in my teacher role, contributed insights from teacher inquiry (Pinto & Koichu, 2021) such as selecting proofs and writing proof-comprehension tasks, planning for how and when to use the innovations in our classrooms, deciding how to interpret student responses and behaviors during the enactment of the innovations, and drawing conclusions about feasibility and effectiveness of the innovations.

The implementation cases in this article are unusual. Following the categorization in Koichu et al. (2021), this is a local (i.e., small-scale instance) implementation study of a material-centered (i.e., physical or digital artefact or guidelines for designing such artefacts) object of implementation, in which the practitioners themselves implement a researcher-designed innovation. In a review of implementation research in mathematics education, Koichu et al. (2021) state that such research is rare: ‘This is not to say that such situations cannot exist in practice: perhaps they exist but are not documented’ (p. 982). Given the sparsity of such studies, it is especially interesting to note what they, by virtue of the strong teacher agency, can add to our understanding of specific innovations and implementation research itself.

Following Koichu et al. (2021), the second lens, implementability, examines the feasibility of innovations in a specific context, here the international secondary school in which these implementation efforts took place. It also includes a critical look at the interpretation challenges faced by the teacher-researcher team seeking to learn about the innovations through the articles and the book in which the innovations are presented. I therefore now turn to issues of feasibility, which include necessary adaptations, and questions of interpretability.

To study adaptations to local context, I use the integrity approach to implementation (Meland & Brion-Meisels, 2023). In this context, integrity means

the degree to which an intervention was implemented maintaining its core active ingredients, while authentically and fully integrating the assets and needs of the local community. (Meland & Brion-Meisels, 2023, p. 2)

Integrity, therefore, requires implementers to be attentive to the opportunities and challenges in the context in which they seek to implement the innovation, and that they choose strategies accordingly.

To facilitate implementation with integrity, researchers may provide a clear description of the innovation to those seeking to adopt it. Meland and Brion-Meisels (2023) refer to the innovation’s critical elements as its core active ingredients, whose preservation is essential for maintaining integrity during implementation. This requirement places specific demands on designers’ articulation of the innovation: the description must explicitly state not only the overall goals (ultimate outcomes) but also the intermediate outcomes, those proximal or intermediate determinants, that the various components or features of the innovation are intended to achieve, individually or jointly. In the words of Meland and Brion-Meisels (2023, p. 5), ‘specifying the theoretically important program components — often called active ingredients […] becomes critical’. Likewise, Jankvist, Gregersen & Lauridsen (2021) argued for a clearly articulated and stakeholder-aligned theory of change, i.e., the theoretical constructs underpinning the design, implementation, and evaluation of implementation efforts. In the absence of such alignment, integrity will be difficult to achieve and evaluate.

When teachers encounter challenges during implementation, whether due to contextual differences from the original research setting or resulting from flaws of the innovation or its articulation, they may introduce changes aimed at ensuring the innovation achieves its intended positive outcomes. When such modifications result from teachers’ efforts to adapt an innovation to the particular needs and resources of their students and institution, one may speak of productive adaptations (Debarger et al., 2013). When, on the other hand, modifications are motivated by perceived opportunities to enhance the innovation itself, the literature offers no term for these, so I label them simply improvements. Productive adaptations thus arise from external conditions and may vary considerably across different implementation contexts, whereas improvements result from issues directly related to the innovation’s inherent characteristics or design.

3 Proof and Proof Comprehension

Mathematical proof is widely regarded as an essential part of mathematical practice, yet the extent to which students comprehend the proofs they encounter remains uncertain. While many mathematics courses in tertiary education require students to learn proofs, this comprehension has often been assessed by asking students to reproduce a previously presented proof (Weber, 2012). Repeating back a proof one has learned may, however, indicate memorization rather than comprehension (Conradie & Frith, 2000). This concern has prompted scholars to investigate the nature of proof comprehension and to develop instruments that measure students’ understanding of the structure, content, and purpose of proofs (e.g., Conradie & Frith, 2000; Davies & Jones, 2022; Mejía-Ramos et al., 2012). To date, most of these efforts have targeted tertiary contexts, where the expectation of formal reasoning is higher than in secondary education, and students are typically exposed to transition-to-proof courses or their equivalents in other tertiary contexts.

Comprehension of proof can enhance students’ reasoning, communication, and problem-solving skills (Zaslavsky et al., 2012). Hanna and Barbeau (2010) argue that explicit engagement with proofs also exposes students to standard mathematical techniques that may transfer to proving a range of similar results, just as the standard proof by contradiction that the square root of 2 is irrational transfers to similar proofs of the irrationality of the square root of 3 and of the square roots of other prime numbers. From another perspective, Harel (2013) highlights the need for causality in students’ mathematical learning, noting that proofs sometimes illuminate why particular results hold, or how they build on students’ prior knowledge. This emphasis on causality highlights how some proofs can serve not only to give certainty about a proposition but also to explain it, making comprehension of proof a gateway to explanatory understanding in mathematics.

Moreover, proof is sometimes an explicit part of the secondary curriculum. For example, when studying geometry, students in secondary school in many U.S. states are routinely expected to learn various theorems and proofs related to triangle congruence and similarity. Assessing such comprehension is also possible in the secondary context. For instance, Yang and Lin (2008) developed a model for conceptualizing and assessing students’ proof comprehension in geometry.

In secondary mathematics education, when proofs appear, they do so in ways that differ from their occurrence in most tertiary settings. Selden (2012) identifies key differences: at the secondary level, the proofs students encounter are usually less rigorous, shorter, and less complex, and require a smaller knowledge base compared to proofs at the tertiary level. Additionally, there may be epistemological differences, such as students more often encountering non-deductive modes of justification (such as empirical or perceptual) and rarely meeting certain types of proof, such as existence proofs. Selden (2012) also identifies differences in what is commonly demanded of students at the two levels. At the tertiary level, students are required to use more formal mathematical language and precise deductive reasoning, such as mastering propositional connectives (e.g., ‘and,’ ‘or,’ ‘implies’), quantifiers (e.g., ‘there exist,’ ‘for all’), and understanding the relationships between related conditionals such as implication, converse, inverse, and contrapositive. As students come to grips with these elements of reasoning and language, they also face challenges with notation. Such differences may all be influential in the current study, where we implement proof-related innovations tested in tertiary settings in our secondary school.

Secondary school students often hold empirical or other non-deductive conceptions of mathematical proof, here referred to as proof schemes (Harel & Sowder, 1998). In order for students to benefit from working with proof, they need to first understand what mathematical proof is and how it provides certainty beyond empirical testing. In response to this need, Stylianides and Stylianides (2009) developed an instructional sequence that uses counterexamples and reflection prompts to show students that empirical testing, even of many cases, cannot give absolute certainty, whereas a deductive method can. The implementation of this instructional sequence is the first implementation case in this paper.

Because all school mathematics has at some historical point been proven, proof could be part of all curriculum topics. Being able to assess students’ proof comprehension would make it easier to integrate proof within teaching and learning generally, not just in specific proof topics or courses. There is, therefore, a pressing need to develop viable strategies for assessing proof comprehension. In response to this need, Mejía-Ramos et al. (2012) built on the work of Yang and Lin (2008) and Conradie and Frith (2000) to develop a framework for tertiary-level proof comprehension tests, including multiple types of questions and the assessment purposes served by each type. This is the proof comprehension framework that is used in the current study. Mejía-Ramos et al. (2017) later presented a process that researchers could follow to generate such proof comprehension tests in multiple-choice format. This process uses a number of steps to make effective multiple-choice questions and answer options. This is the process implemented in the current study.

The Mejía-Ramos et al. (2017) process of generating multiple-choice proof comprehension assessments has been a starting point for multiple research efforts seeking to develop assessment of proof comprehension. Davies et al. (2020) found that ‘[s]uch test development, however, requires time- and resource-intensive iterative work for every new proof. In the present article, we investigate an alternative approach …’ (p. 182) and proceeded to instead develop assessment through comparative judgment of proof summaries. Likewise, Cooley et al. (2024) report on a long-term researcher-educator collaboration in which tertiary-level mathematics teachers sought to develop assessment of proof comprehension. They too found it unrealistic and impractical in their context to use the Mejía-Ramos et al. (2017) process for generating multiple-choice tests and opted instead to create free-response tests based on the Mejía-Ramos et al. (2012) framework of local and holistic questions. Even this attempt led to challenges when students’ responses did not align with the dimension or facet intended in the question. As a result, Cooley et al. (2024) report on a rubric assessment of students’ responses, a method that allows the instructors to better capture facets of student comprehension shown in student responses.

In each of these cases, researchers and educators did not implement the Mejía-Ramos et al. (2017) process of generating multiple-choice proof comprehension assessment but opted instead for other modes of assessment. There is an open question, therefore, whether the Mejía-Ramos et al. (2017) process could be adapted for use by teachers. If the process is relaxed, does it still succeed in generating useful assessments of proof comprehension?

Together, facilitating students’ development of proof schemes and assessing students’ comprehension of existing proofs (proof comprehension) are two central components of teaching and learning proof. These components are not exhaustive; others include specific proof techniques, conventional notation, and communication skills for structuring deductive arguments. But they form a useful foundation because, without a basic appreciation of the deductive nature of mathematical proof, there is no ground on which to build. Similarly, without a way to assess comprehension, we cannot determine whether our efforts are fruitful.

3.1 Method and Analysis

I participated, as a researcher and simultaneously as one of the teachers, with a team of teachers implementing the two innovations over a span of six months. To make sense of the lived experience of this group, which cooperated closely over an extended time and implemented the innovations in multiple classes and year groups, narrative inquiry (Clandinin & Caine, 2008) was used as the organizing research method. Practically, this meant tracing implementation stories across planning meetings, classroom enactments, and post-lesson reflections, while attending to relationships within the group (particularly the potential for tension between the researcher’s and teachers’ interests) and maintaining flexibility and sensitivity towards different teacher and student groups’ needs.

The narrative inquiry method requires one to attend to the social, place, and time aspects of an unfolding narrative. Socially, the total of five teachers involved in the two implementation attempts had all known each other and collaborated closely on different development projects over the past several years. Three of these teachers had backgrounds in other countries and educational systems. All the teachers had at least eight years of teaching experience, and in several cases experience of teaching multiple stages (lower secondary as well as upper secondary), and multiple subjects (such as science/psychology/philosophy, and mathematics).

The two proof-related designs were implemented in a secondary school in Stockholm, Sweden, which serves as the local context in this study. In lower-secondary, ages twelve through fifteen, the students are learning mathematics following the International Baccalaureate (IB) Middle Years Programme, which does not include proof but does require, for the highest achievement levels, that students reason about and verify their conclusions to mathematical pattern-seeking conjecturing problems. In upper secondary school, students follow the IB Diploma Programme, in this case, the Mathematics Analysis and Approaches curriculum (International Baccalaureate Organization, 2019). This curriculum requires simple deductive proof at the standard level of the course, and more sophisticated forms (e.g., proof by contradiction, mathematical induction) at the higher level of the course. Additionally, all students complete an independent research project in mathematics involving research of unfamiliar mathematical relationships and their deductive justifications. The school is international, with students from many nations around the world coming and going yearly, and with a more stable student body in the upper secondary part. This means that students, like teachers, have very diverse prior experiences of school mathematics.

A range of data was used as field texts in this study. The data included written student responses on the proof comprehension test tasks, teacher observation notes on students’ behavior and engagement, short oral feedback from students when asked how they perceived the experience, and transcripts of recordings of the collegial meetings. Based on these transcripts, each meeting was summarized in a page or two of meeting notes, including links to or examples of student work and teacher-produced materials discussed or produced during those meetings. The written meeting notes thus constituted interim research texts, which were shared with and improved together with the teachers before the next meeting.

I then read these interim research texts (meeting notes) with a focus on challenges encountered, adaptations made, and ideas noted for improvement of the innovation. I noted any meaningful information relevant to these overarching categories, summarized the interim texts for each of the two interventions, wrote them into the research texts presented in the current study, and confirmed via conversations with the other teachers to make sure that they mirrored the teachers’ recollection. This all happened within a few months of the implementation attempts, suggesting that my own and the teachers’ recollections were still fresh and reliable.

3.2 Case 1: the Stylianides Instructional Sequence

This section begins with a description of the Stylianides and Stylianides (2009) teaching sequence, which is the innovation whose implementation is described here. Hereafter, I refer to it as the Stylianides sequence. Then, I will describe the aim and method of implementation along with the adaptations made. This section concludes with an analysis of the obstacles encountered, in relation to potential opportunities for improvement in the design itself, as well as relevant contextual factors and adaptations.

The Stylianides sequence is designed to help students transition from relying on empirical observations to developing formal mathematical proofs. The sequence was designed and iteratively tested in an undergraduate mathematics course for prospective elementary (K-6) teachers, who were not otherwise required to have any undergraduate mathematics experience (Stylianides & Stylianides, 2009). This group of participants was in tertiary education, yet their mathematical background might be comparable to that of upper-secondary school students. Stylianides and Stylianides (2009) argue, citing numerous research articles, that prospective elementary teachers, similarly to even high-attaining secondary school students, often hold empirical proof schemes. Thus, while the design was developed in tertiary education, the stated ambition of the researchers (p. 349) can be applied in other contexts, to help other groups of students transition from an empirical proof scheme to a deductive one.

The process begins with the Squares task, where students identify patterns in counting squares within larger grids, initially justifying their findings by checking specific cases, i.e., using an empirical proof scheme. In a later book (Arbaugh et al., 2019), Stylianides and co-authors suggest that the Squares task can be adapted for different levels of sophistication. The easier task requires counting the number of 3 × 3 squares, while the harder task requires counting the total number of squares, of any size. In either case, it is expected that students will solve this task by counting squares in smaller grids such as 4 × 4 and 5 × 5 first, see a pattern, and formulate a general rule. Students are then asked to reflect on how they found the pattern and their certainty of the veracity of their conjecture: ‘Can we be sure that this expression (1² + 2² + 3² + … + 59² + 60²) will give us the right answer for the 60-by-60 square? Why?’ (Stylianides & Stylianides, 2009, p. 336). Then, the instructor tells the students that the most common response in class was to say that the pattern from the smaller squares can be trusted to give the right number for the larger squares.
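The empirical approach students are expected to take in the Squares task can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the brute-force count (enumerating every possible top-left corner) is one assumed way of modelling "counting squares", not a procedure taken from the cited materials.

```python
def count_squares(grid_size: int, side: int) -> int:
    """Count side-by-side squares in a grid_size-by-grid_size grid.

    Brute force: try every cell as a top-left corner and keep it
    only if a square of the given side still fits inside the grid.
    """
    return sum(
        1
        for row in range(grid_size)
        for col in range(grid_size)
        if row + side <= grid_size and col + side <= grid_size
    )


def count_all_squares(grid_size: int) -> int:
    """Harder version of the task: squares of any size."""
    return sum(count_squares(grid_size, side) for side in range(1, grid_size + 1))


def conjectured_total(grid_size: int) -> int:
    """The pattern students are expected to conjecture: 1² + 2² + ... + n²."""
    return sum(k * k for k in range(1, grid_size + 1))


if __name__ == "__main__":
    # Compare the brute-force count with the conjecture for small grids
    # and for the 60-by-60 grid mentioned in the reflection prompt.
    for n in (4, 5, 60):
        print(n, count_all_squares(n), conjectured_total(n))
    # Easier version of the task: only 3-by-3 squares.
    print(count_squares(60, 3))
```

Agreement between the two columns for the grids tried is itself only empirical evidence, which is the limitation the rest of the sequence is designed to expose.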

The Circle and Spots task, next, has students consider the number of regions a circle can be divided into when n points are placed on the circumference and connected with chords. Students count the initial cases and arrive at the sequence 2, 4, 8, 16 for 2, 3, 4, and 5 points on the circumference, respectively. Students are expected to then formulate the rule that the number of regions is R = 2ⁿ⁻¹. This is then challenged when, for n = 6, R = 31. This task is meant to demonstrate that empirically established patterns do not always hold generally, leading students to question whether checking more cases is sufficient to establish a general rule. Students are asked to reflect on what they learned from this problem. Finally, they are asked to consider what they think of a fictitious student response: ‘This problem teaches us that checking 5 cases is not enough to trust a pattern in a problem. Next time I work with a pattern problem, I’ll check 20 cases to be sure’ (Stylianides & Stylianides, 2009, p. 341).
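The breakdown of the doubling pattern can be reproduced with a short Python sketch. The closed form used below, C(n, 4) + C(n, 2) + 1, is the standard result for the maximum number of regions when no three chords meet at one interior point. It is not stated in the text above and is included here as an assumption for illustration, as are the function names.

```python
from math import comb


def max_regions(points: int) -> int:
    """Maximum number of regions formed by joining `points` points on a circle.

    Assumed closed form (points in general position, so that no three
    chords pass through the same interior point): C(n, 4) + C(n, 2) + 1.
    """
    return comb(points, 4) + comb(points, 2) + 1


def doubling_conjecture(points: int) -> int:
    """The rule students are expected to conjecture from the first cases."""
    return 2 ** (points - 1)


if __name__ == "__main__":
    # The two columns agree for 2 to 5 points and part ways at 6 points.
    for n in range(2, 8):
        print(n, max_regions(n), doubling_conjecture(n))
```

The first four rows match the sequence 2, 4, 8, 16 described above, and the row for six points gives 31 rather than 32.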

The Monstrous Counterexample task further disrupts the empirical proof scheme, illustrating that no finite set of examples can confirm a conjecture’s validity. Students are presented with a formula that, the instructor claims, never produces a square integer. They are expected to try many input integers, but each time the output is not a square. Until, that is, n = 30 693 385 322 765 657 197 397 208. Students are asked to reflect on what they can learn from this fact, and whether they can be sure of anything in mathematics.
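A minimal Python sketch can show what students experience in this task. The text above does not state the formula, so the expression 1 + 1141n² used below is an assumption: it is the classic expression associated with this particular counterexample value, and may differ from the exact wording in the cited task. The function names are hypothetical.

```python
from math import isqrt

# Counterexample value quoted in the text above.
MONSTER_N = 30693385322765657197397208


def expression(n: int) -> int:
    """ASSUMED formula for the task: 1 + 1141 * n**2 (not given in the text)."""
    return 1 + 1141 * n * n


def is_perfect_square(value: int) -> bool:
    """Exact integer test, safe for arbitrarily large Python integers."""
    root = isqrt(value)
    return root * root == value


if __name__ == "__main__":
    # Empirical testing: list any of the first 10,000 inputs whose output is a square.
    small_hits = [n for n in range(1, 10_001) if is_perfect_square(expression(n))]
    print("squares found for n = 1..10000:", small_hits)

    # The single enormous input that breaks the pattern.
    print("square at the quoted n:", is_perfect_square(expression(MONSTER_N)))
```

Under this assumption, no amount of feasible case-checking would reach the counterexample, which is the point the task is meant to convey.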

Finally, students revisit the Squares problem with a new perspective and are shown a deductive proof which establishes the result with certainty, reinforcing the advantages of rigorous mathematical reasoning over empirical verification.

An illustration of the instructional sequence is found in Figure 1.

Figure 1. Illustration summary of the Stylianides sequence

Citation: Implementation and Replication Studies in Mathematics Education 5, 2 (2025) ; 10.1163/26670127-bja10032

4 The Implementation Attempt of the Stylianides Sequence

Fostering an appreciation of the deductive nature of mathematical proof was the shared aim that drove both our classroom intervention and the original researchers’ design of the instructional sequence (Stylianides & Stylianides, 2009). In our grades 6–11 classrooms (students aged 11–18 years), we had recently investigated students’ understanding of proof and mathematical explanation by asking them to justify familiar results — exponent rules, the triangle-area formula, and the Pythagorean theorem — following the design conjectures explained in Tsygan and Helenius (2024). As Harel and Sowder’s (1998) framework on proof schemes predicts, many students relied on empirical arguments. Wanting to guide students toward a more deductively oriented view of mathematical knowledge and proof, we sought to implement the Stylianides sequence.

Stylianides and Stylianides (2009) originally developed this instructional sequence with undergraduate mathematics learners in tertiary-level education who were studying to become teachers. The current implementation was to be in secondary school, spanning from grade 6 to grade 11. The teachers for grades 8 through 10 used the simpler version of the Squares task (‘how many 3 × 3 squares?’) as a starter activity, while the teacher for grade 11 used the more sophisticated version (‘how many squares of any size?’). A productive adaptation was made by the teacher for grades 6 and 7, who chose a wholly different task, judging that even the simpler version would be too difficult for her young students. Instead of the Squares task, she used a simple arithmetic pattern (the sum of two odd numbers is always even), and then continued the sequence as originally designed.

We carried out the sequence as intended, in a standardized fashion. I developed PowerPoint slides imitating the materials presented in Stylianides and Stylianides (2009) and Arbaugh et al. (2019), both of which contained detailed illustrations and explanations of each part of the sequence. The teachers and I encountered no difficulties in interpreting the instructions in these materials. The resulting slides then provided the tasks and reflection questions exactly as presented in the Stylianides sequence. The teacher for grades 6 and 7 replaced the Squares task with the simpler one about sums of odd integers. All teachers agreed that carrying out the sequence went smoothly, without challenges.

Reflecting on student learning, the teachers all felt that the instructional sequence had been successful in getting students to question the usefulness of checking many examples. In particular, the Monstrous Counterexample part of the sequence made a strong impression on students. This was not entirely positive, however, as some students then came to believe that nothing can be certain in mathematics. For instance, when asked after the sequence whether the possibility of a counterexample meant that we cannot trust Pythagoras’ rule to always hold for right triangles, some otherwise successful mathematics learners in grade 11 expressed that yes, Pythagoras’ rule could potentially fail to hold at some point, even though we haven’t yet found a counterexample to it. This kind of comment signaled confusion about what, really, the monstrous counterexample meant, and the students did not seem to be able to reach a useful decision about this question on their own. Some students, especially in the higher grades, suggested that there must be a different method of investigating patterns, such as logical thinking, but these remarks were not consistently convincing to other students.

A suggestion for improvement of the Stylianides sequence is therefore to include a task and reflection on using specific examples. While we cannot prove things by using specific examples, we can (i) disprove things (by counterexample), (ii) use multiple examples to formulate a conjecture, (iii) remind ourselves of a forgotten fact by using specific examples with generic numbers, and finally (iv) use specific examples to illustrate and thereby understand the meaning of an abstract formula or rule. We do not want students to come away from this instructional sequence thinking that specific examples are worthless in mathematics, when in fact the use of specific examples has many purposes. So, the Stylianides sequence needs to be supplemented with learning about the proper use of examples in mathematics.

Had the sequence worked as intended, the students would have come away from the lesson knowing that although no number of specific examples can give certainty, deductive reasoning can. However, the Squares task did not seem to have the desired effect, which was to provoke students to activate their empirical proof schemes and later replace them with a deductive one. Several teachers independently identified the problem: counting squares in the 5 × 5 grid requires a systematic approach that is closely analogous to the deductive argument revealed in the sequence’s final step. Because of this congruence, some students actually found the deductive justification quickly, before the subsequent parts of the Stylianides sequence. Those who did not find it could not, in many cases, later distinguish between the empirical systematic counting they had done before and the deductive systematic counting argument which they were later told was proof. This difficulty in distinguishing between the empirical and the deductive in the Squares task made it hard to replace the former type of reasoning with the latter.

5 Suggested Improvements or Supplements to the Stylianides Sequence

Based on the implementation attempt, we suggested multiple improvements. First, the Squares task needs to be replaced with some other task in which (i) the students are likely to explore a pattern by counting or other empirical means, (ii) there exists a deductive proof which is not congruent with systematic counting, and (iii) this deductive proof is cognitively or mathematically accessible for the students with whom the sequence is used. Such a task could involve showing that the sum of three consecutive numbers is divisible by 3, or more generally, that the sum of n consecutive numbers is divisible by n whenever n is odd. Alternatively, the Squares task can be abandoned completely, instead starting the instructional sequence with the Circle and Spots problem. This would have the added benefit of being less time-consuming, thereby giving the sequence higher practicality in content-heavy secondary curricula. However, an unfortunate aspect of the Circle and Spots problem is that the actual formula and the proof for it are not accessible to most secondary school students. Therefore, one would need to inelegantly follow the Stylianides sequence with other examples of patterns, conjectures, and proofs that are accessible to students but disconnected from the tasks in the sequence.
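
As an indication of how short the deductive argument for the consecutive-numbers task can be, one possible version is sketched below. The notation (a for the first number, n for how many numbers are added) is introduced here for illustration.

```latex
% a is the first of the consecutive integers; n is how many are added.
\begin{align*}
a + (a+1) + (a+2) &= 3a + 3 = 3(a+1), \\
\sum_{k=0}^{n-1} (a+k) &= na + \frac{n(n-1)}{2} = n\left(a + \frac{n-1}{2}\right).
\end{align*}
```

The bracketed factor in the second line is an integer exactly when n is odd, so the general statement holds for odd n; for even n it fails (for example, 1 + 2 = 3 is not divisible by 2). Neither line relies on systematic counting, which is the property the replacement task is meant to have.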

We unanimously agreed that the sequence led students to question any empirical proof scheme they may have held. As one teacher put it, the sequence ‘chipped away at’ their empirical proof schemes. But in a few cases, the effect was transient and did not transfer to other tasks and lessons. In particular, the grade 6 and 7 students later during the same lesson argued from an empirical proof scheme again, demonstrating limited transfer of their learning from the instructional sequence. It might therefore be useful to think of students holding different proof schemes in different contexts and times, thereby making it essential to continue highlighting the role of deductive proof in regular concept learning and skill acquisition rather than thinking that a single lesson will make a pivotal difference.

In summary, this implementation of the Stylianides and Stylianides (2009) teaching sequence required little adaptation to a secondary school context but revealed several opportunities for improvement and the need to supplement the sequence with instruction on the usefulness of specific examples and further integration of proof and proving throughout the curriculum.

5.1 Case 2: Assessment of Proof Comprehension

This section reports on the implementation of one innovation consisting of two parts. First, Mejía-Ramos et al. (2012) developed a framework for proof comprehension tests (hereafter called the Mejía-Ramos assessment framework) that was aligned with the many purposes that proof comprehension is seen as fulfilling. This assessment framework, therefore, consists of different kinds of questions, to be explained below. Second, Mejía-Ramos et al. (2017) also developed a process through which such tests can be systematically designed and validated, henceforth referred to as the Mejía-Ramos design process, also described below. In the current study, we sought to develop a test according to the Mejía-Ramos assessment framework following the Mejía-Ramos design process.

Following the description of the assessment framework and the design process through which such tests can be constructed, our implementation efforts are described. The section concludes with an analysis of the obstacles encountered, in relation to potential improvement of the design itself as well as relevant contextual factors.

In their assessment framework, Mejía-Ramos et al. identified two key dimensions of proof comprehension: local and holistic. Local comprehension involves understanding individual statements, recognizing definitions, assumptions, and the specific logical dependencies among claims within a proof. Holistic comprehension, by contrast, focuses on the overall structure or big idea, which may include summarizing the proof in simple terms, illustrating the proof via graphical means, or using a specific example, and considering how certain proof techniques might transfer to other contexts. The Mejía-Ramos assessment framework thus consists of seven types of questions, summarized together with their purpose in Table 1.

Table 1. Types of questions and their functions for proof comprehension assessment in the Mejía-Ramos et al. (2012) assessment framework

6 The Implementation Attempt of Mejía-Ramos Framework and Design Process

The teachers who sought to implement the Mejía-Ramos assessment framework did so for a reason extraneous to proof comprehension itself. As a group, we were working on developing methods for assessing students’ understanding of mathematical results, such as theorems, rules, and algorithms. Understanding here means that the student has a valid mental model for how that result builds deductively from previously known definitions and results. We hypothesized that a student who understands a mathematical result will be better able to comprehend a proof based on the same mental model that underpins the student’s understanding. For example, a student who has some understanding that partitioning a triangle into two right-angle triangles and applying the Pythagorean theorem can lead to the Cosine rule, will be better able to comprehend a proof of the Cosine rule based on the same approach. Thus, as one way to assess students’ understanding of results, we implemented the Mejía-Ramos assessment framework using the Mejía-Ramos design process.
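
For readers who want the Cosine rule example spelled out, one possible version of the argument is sketched below. It assumes a triangle ABC with sides a = BC, b = CA, c = AB, an acute angle C, and the altitude from A meeting BC at D; this labelling is chosen here for illustration and is not taken from the proofs used in the implementation.

```latex
% h = AD is the altitude, x = CD = b cos C, so BD = a - x.
\begin{align*}
h^2 &= b^2 - x^2 && \text{(Pythagorean theorem in triangle } ACD\text{)} \\
c^2 &= h^2 + (a - x)^2 && \text{(Pythagorean theorem in triangle } ABD\text{)} \\
    &= b^2 - x^2 + a^2 - 2ax + x^2 \\
    &= a^2 + b^2 - 2ab\cos C && \text{(since } x = b\cos C\text{)}
\end{align*}
```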

In our implementation of the process for generating proof comprehension tests, it was clear from the outset that the limited time at our disposal (about one hour per week, outside of regular work hours) would not allow us to follow the design process as rigorously as it was elaborated in Mejía-Ramos et al. (2017). Neither could we ask our students to contribute as much of their time as students in tertiary education taking transition-to-proof classes. We therefore hoped that a less rigorous process would also lead to useful results. Specifically, we removed the second and third rounds of having a smaller number of students take the test. Instead, we added a short debriefing interview with students (in groups) to find out about their experience of the proof comprehension test. Additionally, we conducted the pre-test small-group assessment in writing instead of through an oral interview. The simplified process we decided to follow is presented in Table 2, for comparison with the process described by Mejía-Ramos et al. (2017).

Table 2. Comparison of instructions in Mejía-Ramos et al. (2017) and implementation

Each step in the implementation, together with the obstacles encountered and adaptations made, is explained in the next section.

7 Preliminaries: Selecting Results and Writing Proofs

Definitions of mathematical proof vary, but Stylianides’ (2007) school-friendly conceptualization emphasizes that a proof is (i) an argument built upon statements accepted as true without further justification, such as axioms and definitions, (ii) reliant on forms of reasoning that are valid within the classroom community’s conceptual reach, and (iii) communicated with expressive modes familiar to that community (ranging from natural language to symbolic, graphical, or tabular representations). This definition serves as a reminder that proof must be aligned with the learners’ existing resources and modes of understanding, a point especially crucial when proof-related innovations developed in tertiary education are implemented in secondary contexts.

Compared to tertiary students, students in secondary school are often less familiar with mathematical notation (such as ∀ and ∃) involved in rigorous proofs. In addition, they are less likely to know certain definitions and theorems usually encountered at the tertiary level. Following recommendations by Selden (2012) and Stylianides et al. (2023), we therefore used natural language instead of dense mathematical notation and relaxed the rigor of proofs normally found in university textbooks. For instance, to prove the Fundamental Theorem of Calculus, we avoided the use of the Mean Value Theorem (with which students were unfamiliar) and relied on a less rigorous approximation-based argument. Thus, if we subscribe to a flexible definition of proof, the innovation itself offers room for productive adaptations to the secondary school context.

In search of results to prove, we encountered a shortage of theorems in curricula. While mathematical claims are abundant, these are often denoted as formulas (e.g., formulas for calculating the volume of 3D geometrical objects) or rules (e.g., exponent rules). This prompted us to broaden the scope of what would be proven from mathematical theorems to anything for which one could meaningfully ask ‘why?’ or ‘how do we know that?’. For example, one of the proofs we considered was about the claim that if

Equation

This way, we ensured that students would still engage with a structured argument that would be relevant to a mathematical fact or relationship that they were expected to learn.

The above two considerations made it difficult for us to pick proofs present in university mathematics textbooks. Instead, we designed the proofs ourselves, based on explanations we knew or adapted from proofs we researched elsewhere. As we designed the proofs, we wrestled with what level of transparency and detail to include. If we were too transparent about the deductive steps from one claim to the next, the proof might become too long for many students to read. It would also not make sense to ask students comprehension questions about the logical connections between the steps of the proof if these connections were too clearly stated in the proof itself, e.g., if in Proof 1 (see Appendix 1) we had included the statement that the chain rule was used between steps 2 and 3. We tried to determine how Mejía-Ramos et al. (2017) addressed the difficulty of deciding which steps of reasoning to make explicit and which to leave implicit. This issue seems as relevant to tertiary-level education as it is to secondary-level education, but the Mejía-Ramos et al. (2017) paper does not describe the design of the proofs themselves. This omission made it difficult for us to determine whether we were implementing the design process as intended, and in what ways the design of the proofs themselves could make a difference.

The three proofs are presented in the Appendix.

8 Developing Assessment Questions

Initially, for each proof, we developed 8 to 10 questions targeting the local dimension of comprehension, and another two targeting the holistic dimension. It was straightforward to come up with the questions for the local dimension, because questions about definitions, about which later line in the proof a certain claim would be useful for, and other typical local dimension questions applied readily to the proofs we developed. Questions for the holistic dimension, however, seemed more difficult to write. Because we wanted the proofs to be relatively simple for our secondary-level learners, we did not include proofs with a modular structure. Because we were primarily interested in proofs’ explanatory potential, we wanted to focus on the big ideas in the proofs and how they explained the fact that was proven. We therefore focused our holistic questions on illustrating with a diagram and summarizing the proof in the student’s own words. This adaptation may be seen as a narrowing of the design process in order to focus on one specific benefit of proof comprehension.

Some example questions for Proof 2 are presented below:

Local:

  1. Write down an expression for the area of the fifth rectangle.

  2. Why, in line 3, do we introduce approximate areas if we are trying to prove exact equality?

  3. What kind of proof is this?

Holistic:

  1. Summarize the proof in three steps.

  2. Draw a diagram to illustrate the proof.

We then piloted the questions with a small group of 5–6 students, far fewer than the recommended 12. The reason was practicality. We knew that it would be time-consuming for students to take the proof comprehension test in the free-response format, and asked just a few volunteers. They joined us after lessons and in each case spent about 30 minutes reading the proof and writing the answers. We did not think that it would be ethically justifiable or practical to ask more students for their time, given that these students follow an academically challenging curriculum and often lack time for their strenuous workload. Additionally, we could not offer students any monetary or other incentives for participating in this initial pilot phase. In tertiary education, particularly in a mathematics program with transition-to-proof courses, asking a greater number of students for their time seems more warranted, as students are there, presumably, because they have chosen to engage with mathematics specifically.

To administer the proof and questions to the students, we did not entirely follow the method laid out by Mejía-Ramos et al. (2017). Whereas they interviewed the students individually, this would have been logistically too taxing and time-consuming in a secondary school context in which teachers have very little time to plan and reflect on assessments. Instead, we gave the students the proof and the questions and asked them to read the proof and complete the questions in writing, individually, without speaking to each other.

From this small pilot trial of the questions, we found discouraging results. For the local dimension questions, students either answered correctly, did not answer at all, or produced an answer that was difficult to comprehend. If they answered correctly, they sometimes did so in such wordy ways that the response was difficult to turn into a multiple-choice answer option. Sometimes it was hard to understand what the student meant. For example, this response from a student to local dimension question 1 above was difficult for us to find a use for in designing distractors:

The addition of the distances between each subinterval (F (x₁) − F (x₀) or F (x₂) − F (x₁)) ultimately leads to an overall subtraction of the designated x subinterval [a, b].

So if the interval was divided into 6 subintervals, the simplification would be:

Area ≈ (F (x₆) − F (x₀)

When a pattern of miscomprehension was apparent, it was seen in questions of definition. For example, several students thought that ‘signed area’ in the proof referred to ‘designated area’, rather than its actual meaning: the integral as an area that takes into account whether the region lies above or below the x-axis, and whether the interval [a, b] has a > b or a < b. When such a misconception was visible, it lent itself to designing one distractor for that question. Another example is that all the students who took this free-response test identified the proof as ‘deductive’, saying that there are logical deductions from each step to the next. While this raises concern about student proof schemes (what other kinds of proofs do these students believe exist in mathematics?), it helped us design one distractor. Because our multiple-choice questions had four answer options, we still did not know how to design the other two distractors.

Might we have generated better distractors by orally interviewing twelve students, as was done in the Mejía-Ramos design process, rather than having half that many produce answers in writing? It is difficult to imagine that a larger sample of students would have given more distractors, because our students tended to answer either correctly or with some phrase we could not comprehend. It was unclear whether students had trouble using terminology correctly or if they were confused about the concepts and logical flow in the proof. With oral interviews, we could perhaps have learned more about what was going on for the students. However, we worry that the follow-up questions in the interview would have led students to answer something they would not otherwise have thought of. It would have been very useful to understand better the nature of the interviews carried out by Mejía-Ramos et al. (2017) to successfully generate the distractors, but this information was not available in the paper.

While we experimented with using the two types of multiple-choice questions suggested in the Mejía-Ramos process, questions that ask students for all correct alternatives, and questions that ask for the best alternative, we quickly abandoned this approach and opted for a multiple-choice test in which each question had just one correct answer. The reason was that we found it too difficult to identify multiple correct answers to the local questions, and when we collegially discussed multiple correct answer alternatives, we could not always agree, even among us teachers, which alternative was the best. Additionally, we did not understand the rationale for including these two types of questions and when each type would be useful, as this was not explained in the Mejía-Ramos et al. (2017) paper.

Another finding, pertaining to the holistic questions, was that students did reasonably well on summarizing the proof in three short statements, but no one could draw an illustration of the proof. The drawings also did not lend themselves to creating visual multiple-choice distractors. We therefore decided to exclude visual illustrations from the holistic assessment questions.

9 Developing Multiple-Choice Distractors

Because the written responses of our small groups of students, in most cases, did not help us create multiple-choice distractors, we decided to invent the distractors ourselves. On one hand, this was difficult because we could find no background research on students’ misconceptions pertaining to our specific proofs. On the other hand, we reasoned that a student is unlikely to chance on the correct answer on a multiple-choice question as long as the alternatives are all phrased in a believable way (similar in length, complexity, and notation).

Designing the distractors ourselves is, in a sense, a productive adaptation in the secondary school context because it is more aligned with how teachers would normally design assessments. Typically, teachers do not have the possibility to pilot assessments to hone questions and answer options. We consulted with each other, gave each other feedback and revised the tests accordingly, and administered them to our students with the intention to iteratively learn from our students’ responses which parts of the test were working well, and which weren’t.

The full process to create the multiple-choice alternatives for the local dimension questions was intended to be as follows.

  • (1) Upload the proof and free-response questions to an AI service (we used OpenAI’s ChatGPT) to generate an initial list of multiple-choice options for each question.

  • (2) Individually, analyze and improve the options created in step 1.

  • (3) Meet with the other two teachers participating in this implementation to further analyze and improve the alternatives.

  • (4) Administer the test to a larger number of students, roughly 20–30 each time.

  • (5) Analyze each question’s discrimination power.

  • (6) Improve distractors or remove questions based on the analysis in step 5.

In actuality, the results from step 5 made step 6 irrelevant, as we shall see.

An example of a multiple-choice question and alternatives from the proof on the Fundamental Theorem of Calculus, part 2, is seen below:

In this proof, what is the purpose of subdividing the interval [a, b] into subintervals?

  • (1) To make the calculation easier

  • (2) To approximate the area under the curve with rectangles

  • (3) To find the average value of the function

  • (4) To apply the Mean Value Theorem

An example question from the proof of the derivative of y = ln x is seen below:

In what line further down in the proof are we using the statement in line 2?

  • (1) In line 4

  • (2) In line 5

  • (3) In line 6

  • (4) In line 7

For the holistic dimension questions, ultimately, we decided to focus on asking the students to write a short summary of the proof. We did not find a way to make this into a multiple-choice question, and, in fact, previous research (Davies et al., 2020) likewise opted to replace multiple-choice assessment with a proof summary for the assessment of comprehension of key ideas in the proof. We reached out to the lead author of Mejía-Ramos et al. (2017), who kindly provided us with the three complete tests from that study. From these, it was evident that none of the three tests in which the design process had culminated had multiple-choice questions for recognizing the big ideas in a proof. Thus, we felt that a free-response question was more suitable for assessing students’ comprehension of the big ideas in the proof. An example of a rather good, student-produced proof summary of Proof 2 is presented below:

Divide the area under curve into rectangles of width dx

Set up an expression for area as a sum of the area of the rectangles

Using the definition of the derivative, substitute this into the area expression and simplify using definition of a sum.

We found a few surprising patterns by discussing work from students in our different classes and of different ages. First, students did surprisingly well on the local dimension questions. While the classes are generally quite heterogeneous in prior learning and grades, the results on the local dimension questions generally demonstrated high performance on almost all questions. There were a few questions on which students had lower success rates. Upon analysis and discussion with students, these questions were found to be ambiguous either in the wording of the question or in the wording of the options.

The discriminating power of each question was also quite low. Whereas in the Mejía-Ramos et al. (2017) paper, most questions had acceptable discrimination power, showing a biserial correlation above 0.3, our results on the proof comprehension test on Bayes’ theorem showed low biserial correlation for most questions, as seen in Table 3.
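
How such an item analysis can be carried out is sketched below in Python. This is a minimal illustration under stated assumptions, not the procedure used in either study: it computes a point-biserial (Pearson) correlation between each item score and the student's score on the remaining items, which is a closely related but not identical statistic to the biserial correlation reported above. The function names and the use of the rest score are my own choices; only the 0.3 cutoff comes from the text.

```python
from math import isnan, sqrt


def pearson(xs, ys):
    """Pearson correlation; returns NaN if either variable has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def item_discrimination(responses):
    """Point-biserial discrimination per item.

    `responses` holds one list per student with a 0/1 score for each item.
    ASSUMPTION: each item is correlated with the score on the *other* items
    (the rest score), so an item is not correlated with itself.
    """
    n_items = len(responses[0])
    values = []
    for j in range(n_items):
        item_scores = [student[j] for student in responses]
        rest_scores = [sum(student) - student[j] for student in responses]
        values.append(pearson(item_scores, rest_scores))
    return values


def flag_weak_items(discriminations, threshold=0.3):
    """Indices of items below the threshold (or with undefined correlation)."""
    return [j for j, r in enumerate(discriminations) if isnan(r) or r < threshold]


if __name__ == "__main__":
    # Hypothetical data: 6 students, 3 items (not results from this study).
    demo = [[1, 1, 1], [1, 1, 0], [1, 0, 1], [0, 1, 0], [0, 0, 1], [0, 0, 0]]
    d = item_discrimination(demo)
    print(d, flag_weak_items(d))
```

An item that nearly every student answers correctly has almost no variance and therefore little chance of a high correlation, which is one reason uniformly high performance on the local questions and low discrimination tend to appear together.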

Table 3. Results from proof comprehension test on Bayes’ theorem

Student performance on the holistic free-response question was assessed by a small rubric we developed for this purpose, presented in Table A1 in the Appendix. Results from the rubric assessment of proof summaries were much more heterogeneous than the results from multiple-choice tests, and matched, to a greater extent, students’ general mathematics proficiency (i.e., performance on class tests, grades). Some students gave clear and complete summaries, but some did not answer at all, some did not show evidence of recognizing the central concepts, and some did not write in a way that made it possible for us to assess whether the student had comprehended the deductive flow in the proof.

We also noted student engagement with the task itself. This was generally very good with the local dimension questions, but declined with the free-response question. Overall, student engagement with the task was very strong in the higher mathematics group. Students persisted with the free-response proof summary beyond the time they had originally been allotted. In the standard mathematics group, however, few students attempted or completed the free-response proof summary.

We then asked the students for their reflections on the experience of taking the proof comprehension test. Most students in the higher group volunteered that it was a fun and interesting experience. Some students also volunteered, unprompted, that seeing the options in the multiple-choice questions improved their understanding of the proof itself. ‘I didn’t quite get it until I saw the questions and thought about the options’, one student said. Once this had been voiced, we asked all students whether this was so, and they almost unanimously agreed that the multiple-choice question options had helped them comprehend the proof.

In Mejía-Ramos et al.’s (2017) paper, the authors also report a final interview with students, but this is a validation check in which students think out loud as they try to answer the test. The authors do not report having asked students about whether the presence of multiple-choice options helped the students understand the proof.

10 Summary Reflection

Our implementation attempt using the Mejía-Ramos design process to create tests according to the Mejía-Ramos assessment framework was not successful. The two main obstacles were the difficulty of creating viable multiple-choice alternatives and our students’ insistence that the alternatives helped them better comprehend the proof. Ultimately, it is very difficult to untangle to what extent these obstacles reflect flaws in the design process or assessment framework, and to what extent they are due to the changes we made, necessitated by our limited time and the unavailability of more student volunteers. The design of multiple-choice options was not a problem for Mejía-Ramos et al. (2017), and the high biserial correlations of their questions indicate that the response options probably did not unduly help students figure out the correct answers. It could therefore simply be that the tests we designed were of insufficient quality and would have been better had we followed the Mejía-Ramos design process more closely. If so, it is difficult to determine which of our adaptations made the negative difference. Was it the fact that we used written instead of oral responses in our pilot group? That we did not consult mathematicians? That we did not use the instructions to choose the ‘best’ option or ‘all the correct’ options? Had the core components of the design process been clearer, it would have been easier to analyze and determine which adaptations, if any, were unproductive.

It is, of course, also possible that the Mejía-Ramos design process has certain flaws that offer opportunities for improvement. A closer replication study of the design process is necessary to determine whether the process, under the conditions used in the original study, leads to the intended results. Additionally, validation of the tests themselves in different educational contexts would help determine the usefulness of these tests in contexts other than that of the original study. Following this implementation attempt, the teacher team and I remain confident in the value of the Mejía-Ramos assessment framework but not in the Mejía-Ramos design process, or at least not in our secondary school setting.

10.1 Discussion

This study asked how analysis of small-scale, teacher-led implementations may inform (a) innovation and theory refinement and (b) evaluation of the feasibility of theory in context and clarity of published guidance. I will consider the findings by returning to the lenses of implementation and implementability.

11 Implementation: Probing the Soundness of the Innovation

The analysis of the two implementations sheds light on implementation-related concerns such as the effectiveness of the innovations for their stated goals. For example, implementing the Stylianides sequence confirmed its power in dislodging students’ empirical proof scheme, but suggested that it does not fully succeed in replacing it with a deductive proof scheme. This constitutes teaching for theory (Cai & Hwang, 2021), in that researchers may be inspired to add more steps to the hypothetical learning trajectories from Stylianides and Stylianides (2009). Similarly, the analysis identified that the Mejía-Ramos assessment framework and design process produced assessments that tended to teach proof comprehension rather than assess it. This suggests that educators should prefer free-response options for high-stakes assessments of proof comprehension, using a rubric such as PRIUM (Cooley et al., 2024). Moreover, researchers could investigate using multiple-choice to promote proof comprehension rather than to measure it. These findings were made possible by teachers’ close attention to students’ discussions at the end of each implementation and benefited from trusting relationships that enabled students to speak freely.

The teachers’ enactments of the innovations contained purposeful adaptations that served as small design probes. For instance, teachers tried simpler and harder starting problems in the Stylianides sequence, swapped interviews for short written responses in the Mejía-Ramos design process, and compared multiple-choice to free-response summaries. Those moves generated new insights about the innovations: the Stylianides sequence works well with other starter tasks, even with students as young as 12; using written rather than oral responses to design multiple-choice distractors may not work; and it is possible to supplement the multiple-choice proof comprehension tests with quick rubric-based ratings for assessing students’ proof summaries. Given classroom-specific contexts and teacher agency to adapt, numerous improvement ideas were tried.

Finally, the teachers in this study made and collected various artifacts such as lesson plans, student work, tests, and rubrics. These artifacts make it possible for other teams to learn from and improve on these efforts indefinitely, suggesting the value of an open, teacher-facing artifact repository for future implementations.

12 Implementability: Gauging Interpretability and Feasibility

Implementing from the guidance offered in academic papers and a teacher-facing book, with minimal contact with the original teams of researcher-designers, forced a material-centered implementation that exposed interpretation gaps rarely seen in designer-led trials. Core active ingredients and theory of change had to be inferred from texts without the assistance of the innovation proponents. It thus became apparent whether the texts were clear about how each part of the innovation contributed to the desired outcomes. Such naturalistic small-scale studies may therefore serve as iterations to improve the written articulation of the innovation before larger-scale implementation.

Contrasting the implementation of two different innovations highlighted feasibility issues. An instructional sequence, especially one of such short duration as the Stylianides sequence (stretching over at most two lessons), is feasible where school culture permits it. Because the task was well-explained, with detailed rationale and illustrations for each part of the sequence, and readily convertible to teaching materials, there were no institutional or logistical barriers. Implementing the Mejía-Ramos assessment design, on the other hand, was constrained by logistics (limited time; few student volunteers to help derive realistic distractors) and by communication demands (students struggled to articulate reasoning clearly). In general, the design of valid and reliable assessments may be less implementable than a teaching sequence in a secondary school context, which welcomes innovative teaching methods but lacks resources for lengthy instrument construction processes.

Regarding interpretability, while these implementations were teacher-led, they were unlikely to have occurred without one teacher (the author) embedded in mathematics education research. Even articulating the learning of proof in terms of proof schemes is far from obvious to teachers whose training did not include this perspective. While Arbaugh et al. (2019) have published a teacher-facing book on proof, the team and I were initially unaware of it, and the Mejía-Ramos framework and process are both available only in academic articles whose length and language are prohibitive for many teachers. Thus a teacher-researcher partnership appears essential for interpretability, at least for innovations that are novel to most secondary school contexts or those available predominantly through academic articles.

13 Analytic Heuristics for Small-Scale Trials

Even small, teacher-led studies can produce large and rich data sets that are demanding to analyze. Throughout the process of the current study, some patterns of data became apparent as sources of specific kinds of insights. They are offered here in the hope that they may be helpful in future studies.

First, by attending to obstacles or negative results encountered during implementation, researchers may learn which design elements would benefit from productive adaptations or improvements. For example, a productive adaptation in the Stylianides sequence might be to replace the Squares task with a simpler task for learners younger than those for whom the sequence was originally designed. A productive adaptation encountered in the implementation of the Mejía-Ramos design process was to write proofs so that they were more suitable for secondary school students. In both cases, productive adaptations were relatively straightforward to identify when the issue was the age and mathematical experience of the students. An opportunity for improvement was also identified in the Stylianides sequence, where teachers’ reflections on the lack of success using the Squares task revealed the need to replace the task with one that had less overlap between empirical exploration and deductive proof.

Second, by attending to implementers’ puzzlement, researchers can learn what parts of the delivery or communication of the innovation should be improved. For example, in the Mejía-Ramos design process, teachers were stumped as to how to phrase the proofs that would be used for the assessment. Lack of puzzlement about how to adapt an innovation to contextual differences can likewise indicate that the core active ingredients are clearly articulated. When each part is clearly motivated, as it is in the Stylianides and Stylianides (2009) article and later book (Arbaugh et al., 2019), it is easier to find different ways to reach the intended outcomes, as the team and I did when we replaced one of the tasks with another one for the youngest students. Such moments of puzzlement, or lack thereof, are therefore especially useful for analyzing aspects of implementability that concern the communication of the innovation to would-be implementers.

Third, by noting the nature of changes made, researchers may learn where an innovation is flexible and where implementers perceive limits. In attempting to extend the multiple-choice format to holistic proof questions, teachers encountered an impasse. The team therefore supplemented the multiple-choice test with a free-response proof summary question assessed by a qualitative rubric. This departure from the multiple-choice format breaks with the stated goal of the Mejía-Ramos process. It thus represents a decision that the process could not be adapted to the implementers’ purpose of assessing explanatory understanding. Analyzing the changes made may therefore inform us about the suitability of an innovation for different but closely related purposes.

Finally, by listening to students’ experiences, researchers may learn about unintended effects of the innovation. Students reported that the test items themselves appeared to teach them the proofs, a consequence that, though familiar from many assessments, was striking in its consistency. This highlights the need to debrief students and otherwise collect their reflections on their experience of an educational innovation.

14 Conclusion

This study examined what researchers can learn from small-scale, teacher-led implementations about implementation and implementability. In both cases, analysis of classroom enactments probed the innovations’ effectiveness and feasibility, while the material-centered nature of these implementations exposed how clearly (or opaquely) innovations are articulated for would-be adopters. The study promotes a bidirectional account of small-scale implementation. On the one hand, teacher-led trials can inform our evaluation of core components and theory of change, thereby refining theory and guiding subsequent design. On the other hand, the same trials reveal whether the way theory is communicated makes it usable for teaching: which articulations are clear and sufficient, and where additional explanatory supports are needed.

These findings were enabled by a teaching for theory approach within a small-scale study in which teachers, with in-depth knowledge of their curriculum, the focal problem, and students’ prior knowledge, led planning, enactment, interpretation, and cross-classroom comparisons. The resulting data were rich yet manageable precisely because of the small scale and direct access to teachers’ perspectives. To understand an innovation’s promise for improving teaching and learning, larger-scale implementation studies should incorporate or be supplemented by small case studies that center teachers’ and students’ perspectives.

References

  • Aguilar, M. S. (2020). Replication studies in mathematics education: What kind of questions would be productive to explore? International Journal of Science and Mathematics Education, 18(1, Suppl.), S37–S50. https://doi.org/10.1007/s10763-020-10069-7.

  • Arbaugh, F., Smith, M. S., Boyle, J. D., Stylianides, G. J., & Steele, M. D. (2019). We reason & we prove for all mathematics: Building students’ critical thinking. Corwin.

  • Cai, J., & Hwang, S. (2021). What does it mean to make implementation integral to research? ZDM – Mathematics Education, 53(5), 1149–1162. https://doi.org/10.1007/s11858-021-01301-x.

  • Century, J., & Cassata, A. (2016). Implementation research: Finding common ground on what, how, why, where, and who. Review of Research in Education, 40, 169–215. https://doi.org/10.3102/0091732X16665332.

  • Clandinin, D. J., & Caine, V. (2008). Narrative inquiry. In The SAGE encyclopedia of qualitative research methods. SAGE. https://doi.org/10.4135/9781412963909.

  • Conradie, J., & Frith, J. (2000). Comprehension tests in mathematics. Educational Studies in Mathematics, 42(3), 225–235. https://doi.org/10.1023/A:1017502919000.

  • Cooley, L., Dorfmeister, J., Miller, V., Duncan, B., Littmann, F., Martin, W., Vidakovic, D., & Yao, Y. (2024). The PRIUM qualitative framework for assessment of proof comprehension: A result of collaboration among mathematicians and mathematics educators. ZDM – Mathematics Education, 56(7), 1513–1566. https://doi.org/10.1007/s11858-024-01628-1.

  • Davies, B., Alcock, L., & Jones, I. (2020). Comparative judgement, proof summaries and proof comprehension. Educational Studies in Mathematics, 105(2), 181–197. https://doi.org/10.1007/s10649-020-09984-x.

  • Davies, B., & Jones, I. (2022). Assessing proof reading comprehension using summaries. International Journal of Research in Undergraduate Mathematics Education, 8(3), 469–489. https://doi.org/10.1007/s40753-021-00157-6.

  • Debarger, A. H., Choppin, J., Beauvineau, Y., & Moorthy, S. (2013). Designing for productive adaptations of curriculum interventions. Teachers College Record, 115(14), 298–319. https://doi.org/10.1177/016146811311501407.

  • Hanna, G., & Barbeau, E. (2010). Proofs as bearers of mathematical knowledge. In G. Hanna, H. N. Jahnke, & H. Pulte (Eds.), Explanation and proof in mathematics: Philosophical and educational perspectives (pp. 85–100). Springer. https://doi.org/10.1007/978-1-4419-0576-5_7.

  • Harel, G. (2013). Intellectual need. In K. R. Leatham (Ed.), Vital directions for mathematics education research (pp. 119–151). Springer. https://doi.org/10.1007/978-1-4614-6977-3_6.

  • Harel, G., & Sowder, L. (1998). Students’ proof schemes: Results from exploratory studies. American Mathematical Society, 7, 234–283.

  • International Baccalaureate Organization. (2019). Diploma programme. Mathematics: Analysis and approaches guide. https://dp.uwcea.org/docs/Mathematics%20-%20Analysis%20and%20Approaches%20Subject%20Guide.pdf.

  • Jankvist, U. T., Gregersen, R. M., & Lauridsen, S. D. (2021). Illustrating the need for a ‘Theory of Change’ in implementation processes. ZDM – Mathematics Education, 53(5), 1047–1057. https://doi.org/10.1007/s11858-021-01238-1.

  • Jankvist, U. T., Aguilar, M. S., Misfeldt, M., & Koichu, B. (2021). What to replicate? Implementation and Replication Studies in Mathematics Education, 1(2), 141–153. https://doi.org/10.1163/26670127-01010015.

  • Koichu, B., Aguilar, M. S., & Misfeldt, M. (2021). Implementation-related research in mathematics education: The search for identity. ZDM – Mathematics Education, 53(5), 975–989. https://doi.org/10.1007/s11858-021-01302-w.

  • Koichu, B., Misfeldt, M., Aguilar, M. S., & Jankvist, U. T. (2024). About implementability. Implementation and Replication Studies in Mathematics Education, 4(2), 161–176. https://doi.org/10.1163/26670127-00402001.

  • Mejia-Ramos, J. P., Fuller, E., Weber, K., Rhoads, K., & Samkoff, A. (2012). An assessment model for proof comprehension in undergraduate mathematics. Educational Studies in Mathematics, 79(1), 3–18. https://doi.org/10.1007/s10649-011-9349-7.

  • Mejía-Ramos, J. P., Lew, K., de la Torre, J., & Weber, K. (2017). Developing and validating proof comprehension tests in undergraduate mathematics. Research in Mathematics Education, 19(2), 130–146. https://doi.org/10.1080/14794802.2017.1325776.

  • Meland, E. A., & Brion-Meisels, G. (2023). Integrity over fidelity: Transformational lessons from youth participatory action research to nurture SEL with adolescents. Frontiers in Psychology, 14, Article 1059317. https://doi.org/10.3389/fpsyg.2023.1059317.

  • Pinto, A., & Koichu, B. (2021). Implementation of mathematics education research as crossing the boundary between disciplined inquiry and teacher inquiry. ZDM – Mathematics Education, 53(5), 1085–1096. https://doi.org/10.1007/s11858-021-01286-7.

  • Selden, A. (2012). Transitions and proof and proving at tertiary level. In G. Hanna & M. de Villiers (Eds.), Proof and proving in mathematics education: The 19th ICMI Study (pp. 391–420). Springer. https://doi.org/10.1007/978-94-007-2129-6_17.

  • Stylianides, A. J. (2007). Proof and proving in school mathematics. Journal for Research in Mathematics Education, 38(3), 289–321. https://doi.org/10.2307/30034869.

  • Stylianides, G. J., & Stylianides, A. J. (2009). Facilitating the transition from empirical arguments to proof. Journal for Research in Mathematics Education, 40(3), 314–352. https://doi.org/10.5951/jresematheduc.40.3.0314.

  • Stylianides, G. J., Stylianides, A. J., & Moutsios-Rentzos, A. (2023). Proof and proving in school and university mathematics education research: A systematic review. ZDM – Mathematics Education, 56(1), 47–59. https://doi.org/10.1007/s11858-023-01518-y.

  • Thomas, G. (2011). A typology for the case study in social science following a review of definition, discourse, and structure. Qualitative Inquiry, 17(6), 511–521. https://doi.org/10.1177/1077800411409884.

  • Tsygan, J., & Helenius, O. (2024). Functional explanations for the understanding of mathematical results. In C. K. Skott, M. Blomhøj, A. Eckert, R. Elicer, R. Herheim, B. Kristinsdottir, D. M. Larsen, G. A. Nortvedt, P. Nyström, J. Ö. Sigurjónsson, & A. L. Tamborg (Eds.), Proceedings of NORMA24, the tenth Nordic conference on mathematics education, Copenhagen, 2024 (pp. 361–368). SMDF.

  • Weber, K. (2012). Mathematicians’ perspectives on their pedagogical practice with respect to proof. International Journal of Mathematical Education in Science and Technology, 43(4), 463–482. https://doi.org/10.1080/0020739X.2011.622803.

  • Yang, K.-L., & Lin, F.-L. (2008). A model of reading comprehension of geometry proof. Educational Studies in Mathematics, 67(1), 59–76. https://doi.org/10.1007/s10649-007-9080-6.

  • Zaslavsky, O., Nickerson, S. D., Stylianides, A. J., Kidron, I., & Winicki-Landman, G. (2012). The need for proof and proving: Mathematical and pedagogical perspectives. In G. Hanna & M. de Villiers (Eds.), Proof and proving in mathematics education: The 19th ICMI Study (pp. 215–229). Springer. https://doi.org/10.1007/978-94-007-2129-6_9.


Appendix

A1 Results and Proofs Used in Implementation

These are three examples of the results and proofs we used in our implementation attempt of the Mejía-Ramos design process and assessment framework.

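Result 1: If $y = \ln x$, with $x > 0$, then $\frac{dy}{dx} = \frac{1}{x}$.

Proof 1:

  1. We know that $x = e^{\ln x}$.

  2. We also know that [equation missing in source]

  3. We know that [equation missing in source]

  4. From the above follows:

  5. [equation missing in source]

    Citation: Implementation and Replication Studies in Mathematics Education 5, 2 (2025); 10.1163/26670127-bja10032

  6. So, [equation missing in source], which is equivalent to [equation missing in source].

This concludes the proof.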
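Because the displayed equations of Proof 1 are not available here, the following is one plausible way to fill in the argument, offered as a sketch and not as the wording used in the implementation. It assumes that the missing steps invoke the derivative of the exponential function and the chain rule.

```latex
\begin{align*}
x &= e^{\ln x}
  && \text{(identity for } x > 0\text{)} \\
\frac{d}{dx}\,e^{u} &= e^{u}\,\frac{du}{dx}
  && \text{(chain rule with } u = \ln x\text{; assumed step)} \\
\frac{d}{dx}\,x &= \frac{d}{dx}\,e^{\ln x}
  \;\Longrightarrow\; 1 = e^{\ln x}\cdot\frac{d}{dx}\ln x \\
1 &= x\cdot\frac{dy}{dx}
  \;\Longleftrightarrow\; \frac{dy}{dx} = \frac{1}{x}
\end{align*}
```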

Result 2: The Fundamental Theorem of Calculus, Part 2.

If $f$ is a function on the domain $[a, b]$, $\int_a^b f(x)\,dx$ represents the signed area between $f(x)$ and the $x$-axis between $x = a$ and $x = b$, and $F$ is an antiderivative of $f$ on $(a, b)$, then

$$\int_a^b f(x)\,dx = F(b) - F(a).$$

Proof 2:

  1. We can divide the interval $[a, b]$ into $n$ small subintervals of equal width $\Delta x$. The width of each subinterval is then $\Delta x = \frac{b - a}{n}$.

  2. Let us denote the endpoints of these subintervals as $x_i$, where $x_i = a + i\Delta x$ for $i = 0, 1, 2, \dots, n$.

  3. The signed area between $f(x)$ and the $x$-axis from $x = a$ to $x = b$ can be approximated by

    [equation missing in source]

  4. Since $f(x)$ is the derivative of $F(x)$, for small [expression missing in source]

  5. Therefore [equation missing in source]

  6. Simplifying: [equation missing in source].

  7. As $n$ approaches infinity, $\Delta x$ approaches 0, and the approximation becomes exact:

    $$\int_a^b f(x)\,dx = F(b) - F(a).$$

This concludes the proof.
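The intermediate equations of Proof 2 are likewise not available here. The sketch below shows one way the telescoping argument can be written out. It assumes a left-endpoint Riemann sum, which is a choice made for illustration and may differ from the version given to students. As in the proof itself, the approximation in the second line is informal.

```latex
\begin{align*}
\text{Area} &\approx \sum_{i=0}^{n-1} f(x_i)\,\Delta x
  && \text{(left-endpoint sum; assumed form)} \\
f(x_i) &\approx \frac{F(x_{i+1}) - F(x_i)}{\Delta x}
  && \text{(since } F' = f\text{, for small } \Delta x\text{)} \\
\sum_{i=0}^{n-1} f(x_i)\,\Delta x
  &\approx \sum_{i=0}^{n-1} \bigl(F(x_{i+1}) - F(x_i)\bigr) \\
  &= F(x_n) - F(x_0) = F(b) - F(a)
  && \text{(the sum telescopes)} \\
\int_a^b f(x)\,dx &= F(b) - F(a)
  && \text{(as } n \to \infty\text{)}
\end{align*}
```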

Result 3: Bayes’ Theorem

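For $n$ mutually exclusive and collectively exhaustive events $A_1, A_2, \dots, A_n$ and complementary events $B$ and $B'$ with $P(B) \neq 0$, Bayes’ theorem for $A_i$ given $B$ is:

$$P(A_i \mid B) = \frac{P(A_i)\,P(B \mid A_i)}{\sum_{j=1}^{n} P(A_j)\,P(B \mid A_j)}$$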

Proof 3:

  1. Imagine a tree diagram with $n$ main branches for the events $A_1, A_2, \dots, A_n$, each representing a mutually exclusive and exhaustive outcome. Each branch splits into sub-branches for events $B$ and its complement $B'$.

  2. Therefore, for each $i \in \{1, 2, \dots, n\}$, $P(A_i \cap B) = P(A_i)P(B \mid A_i)$.

  3. We can express $P(B)$ as: [equation missing in source]

  4. We have: [equation missing in source]

  5. From this, we obtain: [equation missing in source]

This concludes the proof.
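As with the other proofs, the displayed equations of Proof 3 are not available here. A plausible completion, assuming that step 3 uses the law of total probability and step 4 the definition of conditional probability, is sketched below. The summation index $j$ is introduced only for this illustration.

```latex
\begin{align*}
P(A_i \cap B) &= P(A_i)\,P(B \mid A_i)
  && \text{(multiplying along a branch)} \\
P(B) &= \sum_{j=1}^{n} P(A_j \cap B)
      = \sum_{j=1}^{n} P(A_j)\,P(B \mid A_j)
  && \text{(total probability; assumed step)} \\
P(A_i \mid B) &= \frac{P(A_i \cap B)}{P(B)}
  && \text{(conditional probability; assumed step)} \\
P(A_i \mid B) &= \frac{P(A_i)\,P(B \mid A_i)}
                      {\sum_{j=1}^{n} P(A_j)\,P(B \mid A_j)}
\end{align*}
```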

A2 Assessment Rubric

Table 4: Assessment rubric for analysis of students’ proof summaries
