Relation of Student Achievement to the Quality of Their Teachers and Instructional Quality
2.3.1 Sample
This study is based on grade four student and teacher data from the majority of countries participating in TIMSS 2011. Five countries were excluded because there were no data on one or more predictors (Austria, Belgium, Kazakhstan and Russia) or there were very high levels of missing values for most of the variables included in the analysis (Australia). For students with more than one mathematics teacher, data from only one of the teachers were included at random, resulting in a data set with a simple hierarchical structure, where students were nested in one specific class with one specific teacher. The amount of data excluded by this procedure was negligibly small (for details see Chap. 1). The final sample included 205,515 students from 47 countries nested in 10,059 classrooms/teachers with an average classroom size of 20 students. Student sample sizes per country varied between 1423 and 11,228, with the number of classrooms/teachers ranging from 67 to 538, and an average classroom size between 12 and 34 students. The school level was omitted from the analyses to avoid overly complex hierarchical models. Moreover, for many countries the classroom and school level cannot be analyzed separately, since only one grade four classroom was drawn per school.
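The random selection of a single teacher per student described above can be sketched in a few lines. This is an illustrative Python sketch only (the identifiers and data layout are hypothetical; the actual data preparation was not necessarily done this way):

```python
import random

def select_one_teacher(student_teacher_links, seed=0):
    """For each student linked to several teachers, keep one link at random,
    yielding a strictly hierarchical students-in-classrooms structure."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    return {student_id: rng.choice(teacher_ids)
            for student_id, teacher_ids in student_teacher_links.items()}

# Hypothetical linkage: student s2 has two mathematics teachers.
links = {"s1": ["t1"], "s2": ["t1", "t2"], "s3": ["t2"]}
kept = select_one_teacher(links)
```

After this step every student appears exactly once, nested in a single classroom/teacher.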
2.3.2 Variables
A structural model was developed to reflect the hypothesized relations between teacher quality, instructional quality and student achievement (Fig. 2.1). Furthermore, the internationally-pooled descriptives of all variables, including their ranges across countries, were inspected (Table 2.1).
Fig. 2.1
Model of the hypothesized relations of teacher quality (left hand side of the figure) in terms of years of teaching experience (Years exp), teacher education degree (Degree), major focus of teacher education (Major), professional development represented by three indicators (PDmath, PDspec and Collabor), and sense of preparedness represented by three indicators (PrepNumb, PrepGeo and PrepData), to instructional quality (InQuaCI, InQuaCA, and InQuaSC), and to student achievement represented by five plausible values (PV1–5; right hand side of the figure). All abbreviations are explained in Table 2.1, and the numbers labeling the hypothesized relations correspond to columns in Table 2.2, where the actual estimates can be found
Table 2.1 Descriptives of the variables used in the model
Teacher quality measures
Teacher quality is represented by three central dimensions in our model, namely teacher education background, participation in professional development (PD) activities, and teachers’ sense of preparedness. Teacher education background is described by teachers’ years of experience and their formal initial education. These characteristics were included as separate categorical and manifest variables because they do not reflect a joint and theoretically derived latent construct. Instead, they represent different, and not necessarily related, dimensions of teacher quality.
The variation between countries for these variables was remarkably large. Across all countries, the modal category of number of years of experience (“By the end of this school year, how many years will you have been teaching altogether?”) was more than 20 years. The Eastern European countries were particularly pronounced in having many teachers with extensive teaching experience, indicating an older teaching force than elsewhere (see Appendix A, Table A.1). But there were also countries in the data set where the largest group of teachers that taught mathematics at grade four had less than 10 years of experience, and, in some countries, less than 5 years of experience. The Arabian countries were most pronounced in having a relatively young teaching force.
Teachers provided information about their degree from teacher education (“What is the highest level of formal education you have completed?”) out of six options from “did not complete ISCED level 3” to “finished ISCED level 5A, second degree or higher”. Across all countries, the modal category was “ISCED level 5A, first degree”, indicating that many countries had a large proportion of teachers with a bachelor degree. But there were also some countries where the largest group of teachers did not have university degrees, but had completed practically-based programs at ISCED level 3. Italy and the African countries were most pronounced in this respect (see Appendix A, Table A.2). In contrast, there were countries where the largest group of teachers held a university degree at least equivalent to a master degree (“ISCED level 5A, second degree or higher”). The Eastern European countries were most pronounced in this respect.
A dichotomous variable was created by combining teachers’ responses to two questions regarding their specialization in mathematics. This variable identifies teachers with a major in mathematics or in mathematics education (“During your <post-secondary> education, what was your major or main area(s) of study?” and “If your major or main area of study was education, did you have a <specialization> in any of the following?”). On average, slightly fewer than 40 % of all teachers across all countries had a major with a specialization in mathematics. However, in some countries the proportion was below 10 % (for example in some of the Eastern European countries), whereas in other countries the proportion was more than 80 % (for example in several Arabian countries) (see Appendix A, Table A.3).
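The derivation of this dichotomous indicator from the two questions can be sketched as follows. The response labels are hypothetical simplifications of the actual TIMSS response codes:

```python
def has_math_major(major_area, education_specialization=None):
    """Return 1 if the teacher majored in mathematics or mathematics
    education, or majored in education with a mathematics specialization.
    Response labels are illustrative, not the actual TIMSS codes."""
    if major_area in ("mathematics", "mathematics education"):
        return 1
    if major_area == "education" and education_specialization == "mathematics":
        return 1
    return 0
```

Combining the two questions this way ensures that teachers whose major was general education, but who specialized in mathematics, are also counted as mathematics specialists.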
Furthermore, there were measures of teachers’ participation in PD activities. One set of questions asked the teachers whether or not they had participated in PD during the last two years. These questions are represented in the model by two item parcels reflecting either broad PD activities covering, for example, “mathematics content” in general, or reflecting PD activities preparing for specific challenges, for example, “integrating information technology into mathematics”. Across all countries, approximately 40 % of the teachers had participated in broad or specific PD activities, respectively. However, the between-country variation was large, from countries having as few as 10 % of the teachers taking part in broad or specific PD, to countries where more than two-thirds of the teachers had taken part in one or both forms of PD activities. It is difficult to discern any systematic cultural pattern in these differences (see Appendix A, Table A.4).
In addition, there was a set of questions regarding whether teachers had taken part in collaborative activities representing continuous, collaborative and school-based PD (“How often do you have the following types of interactions with other teachers?”, with “Visit another classroom to learn more about teaching” as an example of such interaction). Across all countries, teachers commonly participated in these types of activities two to three times each month. However, in some countries the largest group of teachers participated in collaborative PD daily or almost daily. These questions were included as the third item parcel defining the latent construct of PD.
The third teacher quality dimension included in the model reflects teachers’ self-efficacy. The indicator used was their self-reported sense of preparedness to teach specific topics in mathematics within the three domains of number, geometric shapes and measures, as well as data display (“How well prepared do you feel you are to teach the following mathematics topics?”, with “Adding and subtracting with decimals” included as an exemplary topic). For each domain, teachers were asked to rate these topics on a three-point Likert scale from “Not well prepared” (0) to “Very well prepared” (2). Teachers were also invited to use a “not applicable” response category if the topic was not covered in their curriculum. In our analysis, the items marked as not applicable were treated as missing. To simplify the final model, the three domains were represented as item-parcel indicators of the latent construct of preparedness. Across all countries, the mean of each of the three item parcels was around 1.8 and, thus, close to the maximum category of the Likert scale. This suggests that there was little discrimination evident in the items. The international variation was also more limited within this dimension than in others included in the model. The lowest means were around 1.5 and, thus, fell between the categories “Somewhat prepared” and “Very well prepared”. Interestingly, slightly lower self-efficacy was most evident in Japan and Thailand (see Appendix A, Table A.5).
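The handling of the “not applicable” category and the averaging of items into a domain parcel can be sketched as follows. The numeric code for “not applicable” is hypothetical; the actual TIMSS codebook values may differ:

```python
import math

NOT_APPLICABLE = 9  # hypothetical code for the "not applicable" category

def parcel_mean(item_responses):
    """Mean of 0-2 Likert items for one domain, treating 'not applicable'
    responses as missing rather than as a scale point."""
    valid = [r for r in item_responses
             if r is not None and r != NOT_APPLICABLE]
    return sum(valid) / len(valid) if valid else math.nan

# One teacher's responses on a domain: the NA item is dropped, not scored.
p = parcel_mean([2, 2, 9, 1])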
Instructional quality measures
The measure of InQua applied in this chapter is based on the teacher questionnaire in TIMSS where six questions asked teachers to report how often they perform various activities in this class (“How often do you do the following in teaching this class?”). This measure was preferred over other measures available (see Sect. 2.5) since it has a more explicit relation to three of the four characteristics of high quality instruction (Table 2.1). Teachers were asked to rate these activities on a four-point Likert scale from “Never” (0) to “Every or almost every lesson” (3). These items are represented by three item parcels with two items in each parcel covering different aspects of the latent construct InQua. The first parcel reflected teaching characteristics that were intended to deepen students’ understanding through clear instruction (such as “Use questioning to elicit reasons and explanations”). The second parcel pursued this objective through cognitive activation (through questions such as “Relate the lesson to students’ daily lives”). The final parcel covered a supportive climate (for example “Praise students for good effort”). Across all countries, the indicators for a supportive climate appeared to be widely present, as the mean was close to the maximum of the scale. The mean of the other two parcels was slightly lower. Interestingly, Scandinavian countries had the lowest means on the cognitive-activation item-parcel (see Appendix A, Table A.6). Some international variation existed on all three item parcels.
Outcome measure
We selected student achievement in mathematics represented by five plausible values as our outcome measure. The scale was defined by setting the international mean to 500 and the standard deviation to 100. Country means varied between 248 and 606 points, which is a difference of more than 3.5 standard deviations (for more information, see Martin and Mullis 2012).
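Analyses with plausible values are typically run once per plausible value and the results then pooled. A minimal sketch of the standard combining (Rubin's) rules for m plausible values, with illustrative numbers that are not from the study:

```python
from statistics import mean, variance

def pool_plausible_values(estimates, sampling_vars):
    """Combine per-PV results: pooled point estimate, and total variance
    as within-imputation variance plus inflated between-imputation
    variance (Rubin's rules)."""
    m = len(estimates)
    point = mean(estimates)                 # pooled point estimate
    within = mean(sampling_vars)            # average sampling variance
    between = variance(estimates)           # sample variance across the m PVs
    total = within + (1 + 1 / m) * between  # total error variance
    return point, total

# Hypothetical estimates and sampling variances from five PV runs.
est, var_tot = pool_plausible_values([501.0, 499.0, 500.0, 502.0, 498.0],
                                     [4.0, 4.2, 3.9, 4.1, 3.8])
```

The between-imputation component captures the measurement uncertainty that the five plausible values are designed to represent.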
Control variables
Data about gender and socioeconomic background were gathered through students’ self-reports to the questions “Are you a girl or a boy?” and the frequently used proxy measure of home background “About how many books are there in your home?”
2.3.3 Analysis
The research questions were examined using multi-level structural equation modeling (MLSEM). The intra-class correlation (ICC) for students’ achievement in the pooled international data set (ICC = 0.30) and within countries (ICC = 0.07–0.56) were all above the threshold at which multi-level modeling is recommended (Snijders and Bosker 2012).
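The variance decomposition behind the ICC can be illustrated with a short sketch. This is a simplified descriptive version that splits total variance into between- and within-group components, not the exact estimator used by the modeling software:

```python
from statistics import mean

def intraclass_correlation(groups):
    """Share of total variance lying between groups (classrooms):
    ICC = between-group variance / (between + within variance).
    groups: list of lists of individual student scores."""
    grand = mean(x for g in groups for x in g)
    group_means = [mean(g) for g in groups]
    n_total = sum(len(g) for g in groups)
    between = sum(len(g) * (gm - grand) ** 2
                  for g, gm in zip(groups, group_means)) / n_total
    within = sum((x - gm) ** 2
                 for g, gm in zip(groups, group_means) for x in g) / n_total
    return between / (between + within)
```

An ICC of 0.30, as in the pooled data, means that 30 % of the achievement variance lies between classrooms, which clearly warrants multi-level modeling.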
Item-parcels were used as indicators, as recommended when structural characteristics of the constructs are the focus of interest (Little et al. 2002), as applies in the present investigation, and when sample size is limited in comparison to the number of parameters to be estimated (Bandalos and Finney 2001). The latter also applies to the present investigation given that there are only about 140 to 260 classrooms in most of the countries. By using parcels as indicators for the latent variables, the number of free parameters to be estimated was significantly reduced. The items were combined into parcels based on theoretical expectations confirmed by initial exploratory analysis of sub-dimensions in the latent variables included in the model.
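The parceling step itself is a simple aggregation: items assigned to the same theoretically expected sub-dimension are averaged into one indicator. The item names and the assignment below are hypothetical placeholders for the actual TIMSS items:

```python
from statistics import mean

# Hypothetical item-to-parcel assignment following the theoretically
# expected sub-dimensions of instructional quality described in the text.
PARCELS = {
    "clear_instruction": ["item1", "item2"],
    "cognitive_activation": ["item3", "item4"],
    "supportive_climate": ["item5", "item6"],
}

def build_parcels(responses):
    """Average the items in each parcel into a single indicator score,
    reducing six observed items to three indicators per latent variable."""
    return {name: mean(responses[i] for i in items)
            for name, items in PARCELS.items()}

parcels = build_parcels({"item1": 3, "item2": 2, "item3": 1,
                         "item4": 2, "item5": 3, "item6": 3})
```

Each latent construct is then measured by three parcel indicators instead of six items, which reduces the number of free parameters to be estimated.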
Data analysis was carried out using the software Mplus 7.4. The clustered data structure was taken into account by using a maximum-likelihood estimator with robust (sandwich) standard errors, which protects against standard errors that are too liberal (Muthén and Muthén 2008–2012). Missing data were handled by using the full-information-maximum-likelihood (FIML) procedure. The model fit was evaluated with the chi-square deviance and a range of fit indices.
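Two of the most commonly reported fit indices can be derived directly from the model and baseline chi-square values. The following sketch uses the standard textbook formulas; the modeling software reports these indices itself, so this is purely for illustration:

```python
import math

def rmsea(chi2, df, n):
    """Root mean square error of approximation: misfit per degree of
    freedom, rescaled by sample size; 0 when chi-square <= df."""
    return math.sqrt(max(chi2 - df, 0) / (df * (n - 1)))

def cfi(chi2, df, chi2_base, df_base):
    """Comparative fit index: improvement of the target model over the
    baseline (independence) model, bounded between 0 and 1."""
    d_model = max(chi2 - df, 0)
    d_base = max(chi2_base - df_base, d_model)
    return 1 - d_model / d_base if d_base > 0 else 1.0
```

Conventional cut-offs (e.g. RMSEA below about 0.06, CFI above about 0.95) are then applied to judge acceptable fit.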
Before the final model was run, measurement invariance (MI) across countries was tested for the latent constructs in the model. Comparing constructs and their relations across countries produces meaningful results only if the instruments measure the same construct in all countries (Van de Vijver and Leung 1997). In order to ascertain such equivalence, MI was established using multiple-group confirmatory factor analysis (MG-CFA; Chen 2008). As instructional quality and the teacher constructs were measured at the classroom level, we tested for measurement invariance at the school level (which coincides with the classroom level here, since only one grade four classroom was sampled per school). First, configural invariance was examined, meaning that in each country the same items had to be associated with the same latent factors. Second, we tested for metric invariance by studying whether the factor loadings were invariant across countries. Invariance of factor loadings enabled us to compare the relationships between latent variables across groups. It was possible to establish metric invariance for all latent constructs included in the present model (see Appendix B).
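Nested invariance models (metric vs. configural) are commonly compared with a chi-square difference test. The sketch below illustrates the idea using the Wilson–Hilferty normal approximation to the chi-square upper tail, so that it runs with the standard library alone; it is an approximation, not the exact test the software would report:

```python
import math

def chi2_sf_approx(x, df):
    """Approximate upper-tail chi-square probability via the
    Wilson-Hilferty cube-root normal approximation."""
    z = ((x / df) ** (1 / 3) - (1 - 2 / (9 * df))) / math.sqrt(2 / (9 * df))
    return 0.5 * math.erfc(z / math.sqrt(2))

def metric_vs_configural(chi2_metric, df_metric, chi2_config, df_config):
    """p-value of the chi-square difference test: the metric model (equal
    loadings) is more restrictive, so it has the larger chi-square and df.
    A non-significant p supports metric invariance."""
    return chi2_sf_approx(chi2_metric - chi2_config, df_metric - df_config)
```

A non-significant difference means the equal-loadings constraint does not worsen fit appreciably, supporting comparison of structural relations across countries.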
To examine our research questions, a single-group model was first applied before country-by-country analyses were carried out. In the multi-group model, factor loadings were constrained to be the same for all countries, reflecting the metric invariance criterion referred to above, in order to ensure comparability. Indirect relations at the between-level were estimated by multiplying the coefficients of the respective direct relations. In the single-group model, the two control variables, gender and books at home, were grand-mean centered on the international mean, whereas all predictors, the mediator InQua, and the dependent variable, student achievement in mathematics, were group-mean centered on the country means. In the multi-group model, the control variables were again grand-mean centered (now on the country mean), whereas the predictors, the mediator and the dependent variable remained unaltered. Relations were regarded as significant on the within-level if p < 0.05; however, given the relatively small number of units at the between-level compared to the number of parameters to be estimated, a more liberal decision rule of p < 0.10 was applied for significance testing at this level.
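The two centering schemes used above differ only in the reference mean that is subtracted. A minimal sketch of both, with country labels as the grouping variable:

```python
from statistics import mean

def grand_mean_center(values):
    """Center every observation on the overall (pooled) mean."""
    m = mean(values)
    return [v - m for v in values]

def group_mean_center(values, group_ids):
    """Center each observation on its own group's (e.g. country's) mean,
    so that within-group deviations are separated from group-level means."""
    group_means = {g: mean(v for v, gid in zip(values, group_ids) if gid == g)
                   for g in set(group_ids)}
    return [v - group_means[g] for v, g in zip(values, group_ids)]
```

Group-mean centering removes country-level differences from the predictors, so that the within-level coefficients reflect relations among classrooms within countries rather than differences between countries.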