Software Article | Open Access

# lsasim: an R package for simulating large-scale assessment data

Tyler H. Matta^{1, 2} (email author), Leslie Rutkowski^{2, 3}, David Rutkowski^{2, 3} and Yuan-Ling Liaw^{2}

*Large-scale Assessments in Education* **6**:15

https://doi.org/10.1186/s40536-018-0068-8

© The Author(s) 2018

**Received:** 9 May 2018 | **Accepted:** 10 November 2018 | **Published:** 19 November 2018

## Abstract

This article provides an overview of the R package **lsasim**, designed to facilitate the generation of data that mimic a large-scale assessment context. The package features functions for simulating achievement data according to a number of common IRT models with known parameters. A clear advantage of **lsasim** over other simulation software is that the achievement data, in the form of item responses, can arise from multiple-matrix sampled test designs. Furthermore, **lsasim** offers the possibility of simulating data that adhere to general properties found in the background questionnaire (mostly ordinal, correlated variables that are also related to varying degrees with some latent trait). Although the background questionnaire data can be linked to the test responses, all aspects of **lsasim** can function independently, affording researchers a high degree of flexibility in terms of possible research questions and the part of an assessment that is of most interest.

## Introduction

### Large-scale assessments in education

An important tool for monitoring educational systems around the world, international large-scale assessments (ILSAs) are cross-national, comparative studies of achievement. ILSAs are used to measure educational achievement in select content domains for representative samples of students enrolled in primary and secondary educational systems. The achievement tests that students take are intended to be an adequate representation of what students know and can do in the relevant content areas (e.g., math, science, and reading). The results of these assessments are used to compare educational systems (usually, but not exclusively, countries) and to inform policy, practice, and educational research both nationally and internationally. In terms of numbers of participants, these studies have grown tremendously over the past few decades. Today, two-thirds of all countries with populations greater than 30,000 have participated in one or more international or regional large-scale assessments (Lockheed et al. 2015). Among the most well-known ILSAs are the Trends in International Mathematics and Science Study (TIMSS) and the Programme for International Student Assessment (PISA). On a 4-year cycle, beginning in 1995, TIMSS measures mathematics and science in representative samples of fourth and eighth grade students. Starting in 2000 and every 3 years afterward, PISA assesses 15-year-olds enrolled in school in math, science, and reading. Besides the subject assessments (e.g., math, science, and reading tests), these studies also solicit information from students, their teachers, principals, and their parents regarding beliefs, attitudes, experiences, and the context of schooling. With over half a million students from 70 educational systems taking part, PISA is now the largest such study (OECD 2017).
New versions of PISA, such as PISA for Development, targeting developing economies, and PISA for Schools, focused on providing participating schools with internationally comparable results, will only increase these numbers and bring international assessments to new contexts and audiences. The quantitative nature, scale, and scope of these and related modern educational surveys necessitate a fairly sophisticated approach to survey design, sampling, analysis, and reporting. We elaborate subsequently.

**Table 1** 2011 TIMSS booklet design

| Booklet | Part 1, Block 1 | Part 1, Block 2 | Part 2, Block 3 | Part 2, Block 4 |
|---|---|---|---|---|
| 1 | M01 | M02 | S01 | S02 |
| 2 | S02 | S03 | M02 | M03 |
| 3 | M03 | M04 | S03 | S04 |
| 4 | S04 | S05 | M04 | M05 |
| 5 | M05 | M06 | S05 | S06 |
| 6 | S06 | S07 | M06 | M07 |
| 7 | M07 | M08 | S07 | S08 |
| 8 | S08 | S09 | M08 | M09 |
| 9 | M09 | M10 | S09 | S10 |
| 10 | S10 | S11 | M10 | M11 |
| 11 | M11 | M12 | S11 | S12 |
| 12 | S12 | S13 | M12 | M13 |
| 13 | M13 | M14 | S13 | S14 |
| 14 | S14 | S01 | M14 | M01 |

Only a fraction of the students in the sample take any one item, and any selected student takes only a fraction of the total available items. As a result, the actual distribution of student proficiency cannot be approximated by its empirical estimate (Mislevy et al. 1992b). Further, traditional methods of estimating individual achievement introduce an unacceptable level of uncertainty and the possibility of serious aggregate-level bias (Little and Rubin 1983; Mislevy et al. 1992a). As one means of overcoming the methodological challenges associated with multiple-matrix sampling, large-scale assessment programs adopted a population or latent regression modeling approach that uses marginal estimation techniques to generate population- and subpopulation-level achievement estimates (Mislevy 1991; Mislevy et al. 1992a, b).

More specifically, using information from background questionnaires, other demographic variables of interest and responses to the cognitive portion of the test, student achievement is estimated via a latent regression model, where achievement \((\theta )\) is treated as a latent or unobserved variable for all examinees. Essentially, the limited achievement test responses, complete student background questionnaires responses, and select demographic information are used in conjunction with a measurement model-based extension of Rubin’s (1987) multiple imputation approach to generate a proficiency distribution for the population (or sub-population) of interest (Beaton and Johnson 1992; Mislevy et al. 1992a, b; von Davier et al. 2006). A short, slightly more technical description follows.

As in multiple imputation methods, an imputation model (called a “conditioning model”) is used to derive posterior distributions of student achievement. This model uses all available student data (cognitive as well as background information) to generate a conditional proficiency distribution from which to draw a number of plausible values (usually five) for each student on each latent trait (e.g. mathematics, science, and associated sub-domains).

Any population statistic *t* can then be estimated as \(\hat{t}(\mathbf {X}, \mathbf {Y})=E[t(\theta , \mathbf {Y})|\mathbf {X}, \mathbf {Y}]=\int t(\theta ,\mathbf {Y})\,p(\theta |\mathbf {X}, \mathbf {Y})\,d\theta ,\) where \(\mathbf {X}\) is a matrix of achievement item responses for all examinees and \(\mathbf {Y}\) is the matrix of responses of all examinees to the set of administered background questions. Because closed-form solutions are typically not available, random draws from the conditional distribution are taken for each sampled examinee *j* (Mislevy et al. 1992b). In line with missing data practices (Rubin 1976, 1987), values for each examinee are drawn multiple times. These are typically referred to as plausible values in LSA terminology or multiple imputations in the missing data literature. Using Bayes’ theorem and the IRT assumption of conditional independence, the conditional distribution for examinee *j* can be written as \(p(\theta |\mathbf {x}_{j}, \mathbf {y}_{j}) \propto p(\mathbf {x}_{j}|\theta )\,p(\theta |\mathbf {y}_{j}).\)

### The role of simulations in large-scale assessment research and development

In the past 20 years, national and international assessments have expanded significantly in terms of the number of national- or system-level participants, platforms (computerized in addition to paper and pencil), content domains (e.g., collaborative problem solving), and the degree to which participating countries differ in economic, cultural, linguistic, geographic, and other terms. To that end, two areas of research in large-scale assessment are evaluating the performance of currently used methods given the changing nature of LSAs and developing new methods. In both cases, areas of emphasis can conceivably include test design, administration, data collection, sampling, or other relevant areas. As study administrators are naturally cautious about implementing new designs and methods without evidence of their merit and worth, a viable option for testing new methods is through simulation. Further, using empirical data to examine the performance of current and new methods is limited by the fact that we can never know the true, underlying population values of item or person parameters. Simulation is thus a low-cost, powerful means for conducting methodological research in the area of large-scale assessment. Examples include Adams et al. (2013) and Rutkowski and Zhou (2015).

Traditionally, the mandate of large-scale assessments centers on measuring and reporting achievement across populations of interest. As such, large-scale assessment developers prioritize achievement measures in terms of framework development, psychometric quality, analysis, and reporting (OECD 2014; Martin et al. 2016). Nevertheless, background questionnaires serve to contextualize educational achievement and provide opportunities to understand correlates of learning. To that end, the background questionnaire and achievement measures have distinct frameworks, and different teams work to develop and innovate in each respective area. This (in some cases arbitrary) distinction between the achievement test and background questionnaires frequently leads researchers to regard each component separately for many methodological investigations. Therefore, **lsasim** (Matta et al. 2017) simulates data in a way that treats background questionnaire responses as separate from but related to the achievement test.

### Software for generating large-scale assessment data

The goal of **lsasim** is to provide a set of functions that enable users to design and modify test designs commonly utilized in large-scale educational surveys. These goals are similar to those of **catR** (Magis and Raiche 2012) for generating item response patterns from computer adaptive tests and **mstR** (Magis et al. 2017) for generating item response patterns from multi-stage tests. The difference, however, is that the multiple-matrix sampling designs utilized in large-scale assessments are not (yet) adaptive; **lsasim** can thus depend on other packages to estimate item parameters and achievement.

Although the generation of item responses for a fixed test, given a set of item parameters and a true score, is not unique, none of the existing **R** (R Core Team 2017) IRT packages provide a means to establish multiple-matrix sampling designs. Two of the most common IRT packages are **TAM** (Robitzsch et al. 2017) and **mirt** (Chalmers 2012). Both packages include functions to simulate item response patterns, but every “observation” is given a generated response to every item. Users would have to delete item responses post hoc to arrive at data that resemble a matrix sampling design.

In addition to the inability to generate item responses under multiple-matrix sampling designs, none of the IRT packages reviewed provide a means for generating responses to “background questionnaires,” data that are commonly used in the estimation of achievement. To include responses to background variables, one would need to develop their own functions or utilize an alternative package for generating correlated mixed normal and ordinal data, such as **GenOrd** (Barbiero and Ferrari 2015).

Because **lsasim** is designed to generate item responses, it has no functionality to estimate item parameters or achievement. For this, users should turn to existing packages, for example **TAM**, as demonstrated later in this article. The data output from **lsasim** is formatted to be used with **TAM** or **mirt** without any further data manipulation. Furthermore, the **ibd** package (Mandal 2018) can be used in tandem with **lsasim** to generate balanced incomplete block designs.

## Simulation methodology

### Generating correlated questionnaire data

Let \(X = \{X_{1}, X_{2}, \ldots , X_{p}, \ldots , X_{P}\}\) be a set of continuous random variables and \(W = \{W_{1}, W_{2}, \ldots , W_{q}, \ldots , W_{Q}\}\) be a set of ordinal (possibly dichotomous) random variables. For any \(W_{q}\), let there be \(1, \ldots , k_{q}, \ldots , K_{q}\) ordered response categories where \(p(W_{q} = k_{q}) = \pi _{q,k}\) such that \(\sum _{k=1}^{K_{q}} \pi _{q,k} = 1\). Furthermore, let \(\mathbf {R}\) be a \((P+Q) \times (P+Q)\) possibly heterogeneous correlation matrix which includes (a) Pearson product-moment correlations \(\rho (X_{p}, X_{p^{\prime }})\); (b) polychoric correlations \(\rho (W_{q}, W_{q^{\prime }})\); and (c) polyserial correlations \(\rho (X_{p}, W_{q})\).

Let \(\tau _{q,k}\) denote the *k*th threshold for the latent variable \(W^{\star }_{q}\), delineating responses \(k-1\) and *k* on the scale of \(W^{\star }_{q}\).

To simulate correlated mixed-type data, we need only a \((P+Q) \times (P+Q)\) data-generating correlation matrix, \(\mathbf {R}\), and the \(K_{q}\) marginal probabilities \(\pi _{q}\) corresponding to each ordinal variable \(W_{q}\). First, generate *N* replicates from \(P+Q\) independent standard normal random variables \(\{Z(X_{1}), \ldots , Z(X_{P}), Z(W^{\star }_{1}), \ldots , Z(W^{\star }_{Q})\}\), such that an \(N \times (P+Q)\) data matrix, \(\mathbf {Z}\), is obtained. Second, let \(\mathbf {L}\) be the lower triangular matrix of the Cholesky factorization of \(\mathbf {R}\) where \(\mathbf {R} = \mathbf {L} \mathbf {L}^{\prime }\). We can transform \(\mathbf {Z}\) to \(\{\mathbf {X}, \mathbf {W^{\star }}\}\) using \(\mathbf {L}\) such that \(\{\mathbf {X}, \mathbf {W^{\star }} \} = \mathbf {Z} \mathbf {L}^{\prime }\). Finally, we transform the latent variables \(\mathbf {W^{\star }}\) to \(\mathbf {W}\) by coarsening based on Eq. 3.
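The two-step procedure just described (Cholesky transform, then coarsening at the normal-quantile thresholds) is language-agnostic; the package itself is written in R, but a minimal illustrative sketch in Python may clarify the algorithm. The function name `gen_mixed` and the convention of marking continuous variables with `None` are ours, not part of **lsasim**:

```python
import numpy as np
from statistics import NormalDist

def gen_mixed(n, R, cat_props, seed=42):
    """Draw n observations of correlated mixed-type data.

    R: (P+Q) x (P+Q) data-generating correlation matrix.
    cat_props: one entry per variable; None marks a continuous variable,
    otherwise a vector of marginal category probabilities summing to 1.
    """
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(R)                 # R = L L'
    Z = rng.standard_normal((n, R.shape[0]))  # independent standard normals
    latent = Z @ L.T                          # rows now have correlation R
    out = latent.copy()
    for j, props in enumerate(cat_props):
        if props is None:
            continue                          # continuous X: keep as-is
        # thresholds tau_{q,k} = Phi^{-1}(cumulative proportions)
        cum = np.cumsum(props)[:-1]
        tau = np.array([NormalDist().inv_cdf(p) for p in cum])
        # coarsen the latent W* into ordered categories 1..K
        out[:, j] = np.searchsorted(tau, latent[:, j]) + 1
    return out
```

For the two-variable example used later in the text (one continuous variable correlated 0.7 with a four-category item), `gen_mixed(1000, np.array([[1, .7], [.7, 1]]), [None, [0.23, 0.31, 0.27, 0.19]])` returns a 1000 × 2 matrix whose second column takes values 1 through 4 with roughly the specified marginal proportions.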

### Generating IRT-based data

Item responses are generated from models that subsume the one-, two-, and three-parameter logistic models and the (generalized) partial credit model. For a dichotomously scored item, the probability of a correct response is

\[ P(U_{ij} = 1|\theta _{j}) = c_{i} + (1 - c_{i})\frac{1}{1 + \exp [-Da_{i}(\theta _{j} - b_{i})]}, \]

while for a polytomously scored item the category probabilities follow the generalized partial credit model,

\[ P(U_{ij} = k|\theta _{j}) = \frac{\exp \sum _{u=0}^{k} Da_{i}(\theta _{j} - b_{i} - d_{iu})}{\sum _{v=0}^{K_{i}} \exp \sum _{u=0}^{v} Da_{i}(\theta _{j} - b_{i} - d_{iu})}, \]

where *k* is the response to item *i* by respondent *j*, \(\theta _{j}\) is the respondent’s true score, and \(K_{i}\) is the maximum score on item *i*. Furthermore, \(b_{i}\) is the average difficulty for item *i*, \(d_{iu}\) is the threshold parameter between scores *u* and \(u-1\) for item *i* (with \(d_{i0} \equiv 0\)), \(a_{i}\) is the item’s discrimination parameter, \(c_{i}\) is the item’s pseudo-guessing parameter, and *D* is a scaling constant for the item.
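The item parameters just defined (difficulty *b*, thresholds *d*, discrimination *a*, pseudo-guessing *c*, and scaling constant *D*) can be turned into response probabilities in a few lines. The following Python sketch is ours, not package code; the helper names `p_3pl` and `p_gpcm` are hypothetical:

```python
import math

def p_3pl(theta, b, a=1.0, c=0.0, D=1.0):
    """P(correct) under the three-parameter logistic model; with c = 0 it
    reduces to the 2PL, and with a = 1 and c = 0 to the 1PL/Rasch model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def p_gpcm(theta, b, d, a=1.0, D=1.0):
    """Category probabilities for scores 0..K under the generalized partial
    credit model; d holds the K threshold deviations d_{i1}, ..., d_{iK}."""
    steps = [0.0] + [D * a * (theta - b - du) for du in d]  # d_{i0} = 0
    logits = [sum(steps[:k + 1]) for k in range(len(steps))]
    expl = [math.exp(v) for v in logits]
    total = sum(expl)
    return [v / total for v in expl]
```

For example, `p_3pl(0.0, 0.0, a=1.0, c=0.2)` is 0.6: the examinee sits exactly at the item’s difficulty, so the logistic part contributes 0.5, which the guessing floor lifts by \(c_i(1 - 0.5)\). A simulated item response is then a Bernoulli (or categorical) draw from these probabilities.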

## The **lsasim** package

The **lsasim** package contains a set of functions that facilitate the generation of large-scale assessment data. The package can be divided into two interrelated sets of functions: one set for generating background questionnaire data and another set for generating the cognitive data. This section provides a description of each function within the package and demonstrates how they can be used. To start, we set a seed for replicability purposes and load the **lsasim** package and the **polycor** package (Fox 2016), both available on CRAN.

### Background questionnaire data

The code above provides an example of cumulative proportions for two background items. The first background variable has one category, indicating a continuous response. The second background item has four response categories with marginal population proportions of 0.23, 0.31, 0.27, and 0.19, respectively.

In the above example, we specify a polyserial correlation of 0.7 between the discrete background item and the continuous item. Notice that the size of ex_cor equals the length of ex_prop, as the size and order of cor_matrix correspond to cat_prop.

The generating correlation matrix can be recovered from the simulated data using the **polycor** package. It is important to note that converting the factor variables to numeric and estimating a Pearson correlation will not recover the generating correlation matrix.

The optional arguments c_mean and c_sd specify the means and standard deviations of the continuous variables, causing continuous background item *i* to be distributed \({\mathcal {N}}\)(c_mean[i], c_sd[i]). Finally, theta is a logical argument where theta = TRUE results in the first continuous background item being named “theta” in the resulting data frame. This optional argument is only for convenience when generating both background questionnaire data and cognitive data.

Notice in the example above that the continuous variable is now named theta and has a mean and standard deviation close to those specified by c_mean and c_sd.

**lsasim** provides two functions for generating the correlation matrix and the list of marginal cumulative proportions. The cor_gen function generates a random correlation matrix given the number of variables specified via the n_var argument.

### Cognitive data

The cognitive assessments for LSAs are much more involved than the background questionnaire. As mentioned above, the cognitive assessments use an IRT measurement model administered using a multi-matrix sampling scheme. The package **lsasim** has been designed to provide researchers with extensive flexibility in varying these design features while providing valid default solutions as an option. There are five functions that make up the cognitive data generation, which we organize here under three categories (a) item parameter generation: item_gen; (b) test assembly and administration: block_design, booklet_design, and booklet_sample; and (c) item response generation: response_gen.

#### Item parameter generation

The item_gen function generates item parameters, with the arguments b_bounds, a_bounds, and c_bounds setting the bounds for the *b*, *a*, and *c* parameters, respectively. Note that a_bounds are only applied to the two- and three-parameter items and c_bounds are applied to three-parameter items only.

The above example shows the item information for the 15 items in item_pool. All 15 items have a \(b_{i}\) parameter, which is the average difficulty for the item. The five two-parameter items were specified as generalized partial credit items with two thresholds. Thus, item 1 through item 5 have two *d* parameters, d1 and d2, such that \(b_{i} + d_{ik}\) is the *k*th threshold for item *i*. All 15 items have a discrimination parameter, \(a_{i}\), while only item 6 through item 15 have a *c* parameter (pseudo-guessing). The last two variables in item_pool, k and p, are indicators that identify the number of thresholds and whether the item is from a 1PL, 2PL, or 3PL model, respectively.

#### Test assembly and administration

#### Block design

The first step in the test assembly is to determine the number of blocks and the assignment of items to those blocks. The function block_design facilitates this process with two required arguments and one optional argument. The n_blocks argument specifies the number of blocks while the item_parameters argument takes a data frame of item parameters. The default allocation of items to blocks is a spiraling design. For \(1, 2, \ldots , H\) item blocks, item 1 is assigned to block 1, item 2 is assigned to block 2, and item *H* is assigned to block *H*. The process continues such that item \(H+1\) is assigned to block 1, item \(H+2\) is assigned to block 2, and item \(2H\) is assigned to block *H*, until all items are assigned to a block.

The column names of block_assignment begin with b to indicate block while the rows begin with i to indicate item. For block b1, the first item, i1, is the first item from item_pool, the second item, i2, is the fifth item from item_pool, the third item, i3, is the ninth item from item_pool, and the fourth item, i4, is the 13th item from item_pool. Because the 15 items do not evenly distribute across 4 blocks, the fourth block only contains three items. To avoid dealing with ragged matrices, all shorter blocks are filled with zeros.
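The spiraling allocation just described is simple modular arithmetic. A minimal sketch (our own helper, mirroring the default behavior; not the package's block_design function):

```python
def spiral_blocks(n_items, n_blocks):
    """Assign items 1..n_items to blocks 1..n_blocks in a spiral:
    item 1 -> block 1, item 2 -> block 2, ..., item H+1 -> block 1, etc."""
    blocks = [[] for _ in range(n_blocks)]
    for item in range(1, n_items + 1):
        blocks[(item - 1) % n_blocks].append(item)
    return blocks
```

With 15 items and 4 blocks this reproduces the allocation described above: block 1 receives items 1, 5, 9, and 13, while block 4 receives only items 4, 8, and 12.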

This table indicates the number of items in each block and the average difficulty for each block. Again, notice blocks 1 through 3 each have four items while block 4 only has three items. Furthermore, the easiest block is b4 with an average difficulty of −0.267 while the most difficult block is b1 with an average difficulty of 0.718. Note that for partial credit items, \(b_{i}\) is used in the calculation of the average difficulty.

#### Booklet design

**Table 2** Default booklet design

| Booklet | \(b_{1}\) | \(b_{2}\) | \(b_{3}\) | \(b_{4}\) | \(\ldots\) | \(b_{H-2}\) | \(b_{H-1}\) | \(b_{H}\) |
|---|---|---|---|---|---|---|---|---|
| \(B_{1}\) | 1 | 1 | 0 | 0 | \(\ldots\) | 0 | 0 | 0 |
| \(B_{2}\) | 0 | 1 | 1 | 0 | \(\ldots\) | 0 | 0 | 0 |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\ddots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
| \(B_{N-1}\) | 0 | 0 | 0 | 0 | \(\ldots\) | 0 | 1 | 1 |
| \(B_{N}\) | 1 | 0 | 0 | 0 | \(\ldots\) | 0 | 0 | 1 |

book_ex uses the default item-block assembly of block_ex. In the above output, booklet B1 contains the items from blocks 1 and 2, booklet B2 contains the items from blocks 2 and 3, booklet B3 contains the items from blocks 3 and 4, and booklet B4 contains the items from blocks 1 and 4. Notice that the first two test booklets contain eight items while the last two contain seven items. This is because block 4 only has three items whereas blocks 1, 2, and 3 have four items each. Like block_design, booklet_design avoids ragged matrices by filling shorter booklets with zeros.
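The default design pairs adjacent blocks and wraps the final booklet around to the first block. A sketch of that pairing rule (a hypothetical helper of ours, not the package's booklet_design function):

```python
def default_booklets(n_blocks):
    """Booklet h contains blocks h and h+1; the final booklet wraps
    around to contain blocks 1 and H, yielding H booklets in total."""
    books = [(h, h + 1) for h in range(1, n_blocks)]
    books.append((1, n_blocks))
    return books
```

`default_booklets(4)` gives `[(1, 2), (2, 3), (3, 4), (1, 4)]`, matching booklets B1 through B4 described above; every block appears in exactly two booklets.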

Notice how booklets B1 and B5 contain 11 items each, while booklets B2 and B6 contain four items each.

#### Booklet administration

The final component of the test assembly and administration is the administration of test booklets to examinees. The function booklet_sample facilitates the distribution of test booklets to examinees. The two required arguments are n_subj, the number of examinees, and book_item_design, the output from booklet_design. The default sampling scheme makes all booklets equally likely, but the optional argument book_prob takes a vector of probabilities to make some booklets more or less likely to be administered. The logical argument resample will resample booklets until the difference in the number of times each booklet is sampled is less than e, or until iter attempts have been made. The resampling functionality may be useful when n_subj is small and only one dataset is being generated.

The result is a data frame with three columns: subject, book, and item. The data frame is organized in a long (univariate) format where there is one row for each subject-item combination. The long format is required for generating item responses with the response_gen function. As can be seen in the output above, subject 1 has been administered booklet 2 while subject 2 has been administered booklet 4.
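The administration step thus reduces to sampling one booklet per examinee and expanding to the long format just described. A sketch under the same assumptions (the helper name and the dict-based booklet layout are ours, not the package's):

```python
import random

def sample_booklets(n_subj, booklet_items, book_prob=None, seed=1):
    """Assign each examinee one booklet at random (optionally with unequal
    probabilities) and expand to long format: one (subject, book, item)
    row per administered item."""
    rng = random.Random(seed)
    books = list(booklet_items)               # booklet ids
    rows = []
    for subj in range(1, n_subj + 1):
        book = rng.choices(books, weights=book_prob)[0]
        rows.extend((subj, book, item) for item in booklet_items[book])
    return rows
```

Each row of the result is one subject-item administration, which is the shape that response_gen expects.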

#### Item response generation

Both arguments subject and item take length-*N* vectors that provide the subject-by-item information, where \(N = \sum _{j=1}^{J} n_{j}\) and \(n_{j}\) is the number of items in the test for examinee *j*. The argument theta takes a *J*-length vector of true latent proficiency values for the examinees, where *J* is the total number of examinees. Finally, because the simplest model is the one-parameter model, only the b_par argument is required. The b_par argument takes an *I*-length vector of item difficulties where *I* is the total number of items in the item pool. The optional arguments a_par and c_par also take *I*-length vectors for the corresponding item parameters, while d_par takes an *I*-length list where each element in the list is a \((K_{i} - 1)\)-length vector containing the thresholds for each item. The argument item_no is used only when a subset of items from a given item pool is used. Finally, the argument ogive allows the user to omit the scaling constant for a logistic ogive (the default) or to include a normal ogive, ogive = “Normal”.

The result of response_gen is a wide (multivariate) data frame where there is one row per examinee and each item is a column, with the final variable indicating the subject ID. Because not every examinee sees every item, items not administered are coded as missing.

### Combining cognitive and background questionnaire data

The **lsasim** package is designed such that both generated datasets have an equal number of rows for easy merging using the subject variable in each data frame. The resulting dataset, ex_full_data, is the generated large-scale assessment data.

Note that this section provided a general overview of the **lsasim** package. Interested readers can turn to the **lsasim** wiki for further documentation and testing results. In particular, one can find vignettes that further illustrate item parameter generation, test assembly, and test administration.

## Generating data from PISA 2012

In this section, we demonstrate how existing background information, item parameters, and booklet designs can be used to generate data. The **lsasim** package includes prepared data from the 2012 administration of the Programme for International Student Assessment (PISA).

### PISA 2012 background questionnaire

The **lsasim** package includes the cumulative probabilities and heterogeneous correlation matrix for 18 background questionnaire items and a single mathematics plausible value. The 18 items comprise three scales: perseverance, openness to problem solving, and attitudes toward school. The cumulative proportions and correlation matrix were estimated using Switzerland student questionnaire data available from the PISA 2012 database. It is important to note, however, that this background information is included for demonstration purposes only and is not suitable for valid inferences.

### PISA 2012 mathematics assessment

Each included mathematics item has a *b* parameter and, for the partial credit items, two *d* parameters.

The included booklet design covers the *standard* booklets and does not include the *easy* booklets. Each row indicates a booklet while each column indicates an item block.

## Test design simulation study

We now conduct a simple simulation to demonstrate that the default test generating functions operate as intended.

The generated item responses were scaled with **TAM** (Robitzsch et al. 2017). The model constrains the intercept for country 1 to zero so that the group difference is estimated directly. As seen in Table 3, the country means and variances were recovered.

**Table 3** Simulation results: means and standard deviations of country-specific parameter estimates based on 100 replications

| | Country 1 \(\bar{\theta }\) | Country 1 \(\mathrm {var}(\theta )\) | Country 2 \(\bar{\theta }\) | Country 2 \(\mathrm {var}(\theta )\) |
|---|---|---|---|---|
| Generating value | 0 | 1 | 0.25 | 1 |
| Mean | 0.0000 | 1.0000 | 0.2514 | 1.0014 |
| Std. dev. | 0.0000 | 0.0255 | 0.0205 | 0.0253 |

## Discussion

ILSAs are tasked with obtaining sound measures of what students from around the world know and can do in the relevant content areas, as well as obtaining a host of background variables to aid in the contextualization of those measures. These tasks place a unique set of requirements on the test design and psychometric modeling. Although innovations in ILSAs can come from the re-analysis of past assessments, those data are fundamentally constricted to a particular design and by extant data. Due to the scope of ILSAs, pilot testing is highly restricted, leaving simulation studies as the primary means for understanding issues and possible solutions within the ILSA arena. The intention for **lsasim** was to develop a minimal number of functions to facilitate the generation of data that mimic the large-scale assessment context to the extent possible. To that end, **lsasim** offers the possibility of simulating data that adhere to general properties found in the background questionnaire (mostly ordinal, correlated variables that are also related to varying degrees with some latent trait). The package also features functions for simulating achievement data according to a number of common IRT models with known parameters. A clear advantage of **lsasim** over other simulation software is that the *achievement* data, in the form of item responses, can arise from multiple-matrix sampled test designs. Although the background questionnaire data can be linked to the test responses, all aspects of **lsasim** can function independently, affording researchers a high degree of flexibility in terms of possible research questions and the part of an assessment that is of most interest. Built-in default functionality also allows researchers to opt for randomly chosen population parameters. Alternatively, users can specify their own test design specifications and population parameters, offering the possibility of full control over the research design and data generation process.

By way of introduction, the paper described and briefly illustrated the eight functions that make up the package. Because researchers will in many circumstances use information from previous assessments for simulation purposes, the paper went on to demonstrate how LSA data can be generated from parameter estimates and design features of PISA 2012. Finally, a small simulation showed that using the default test assembly functions recovered known population proficiency parameters for two groups. Although we demonstrated the soundness of the package’s default settings, we expect users to rely on those default settings only for aspects of a given simulation design that are considered to be nuisances. For example, a study designed to examine the efficiency of various item block designs could rely on the default background questionnaire functions without loss of generality. Alternatively, a researcher interested in studying background questionnaire invariance across heterogeneous populations might utilize many of the default test assembly functions. Otherwise, we generally assume that users will bring a set of known or plausible population parameters that will provide the basis for further investigations. We believe that **lsasim** can be a useful tool for operational test developers and basic and applied measurement researchers. As national and international assessments branch into new platforms and populations, it is important that researchers with a solid background in measurement and large-scale test design have a ready means for evaluating the performance of new and existing methods. Finally, as a reminder, the default PISA parameters that are included with **lsasim** are not intended to be used to infer anything about the 2012 PISA administration. Rather, they provide an illustrative example of a test design and associated parameters that approximates an operational setting.

## Declarations

### Authors' contributions

The concept for **lsasim** was fostered by LR and DR. THM carried out the development of **lsasim**. YL led the testing of **lsasim**. THM, LR and DR contributed to the writing of the manuscript. All authors read and approved the final manuscript.

### Acknowledgements

The authors acknowledge Dr. Eugenio Gonzalez for his feedback during development as well as the Norwegian Research Council for supporting this research.

### Competing interests

None of the authors have competing interests that could be interpreted as influencing the research, and ethical standards were followed in the development of **lsasim** and the writing of this manuscript.

### Availability of data and materials

The package **lsasim**, including the data used in this manuscript, can be found on the Comprehensive R Archive Network (CRAN).

### Funding

This manuscript was partially funded by the Norwegian Research Council, FINNUT program, Grant 255246.

### Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Open Access**This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

## Authors’ Affiliations

## References

- Adams, R. J., Lietz, P., & Berezner, A. (2013). On the use of rotated context questionnaires in conjunction with multilevel item response models. *Large-scale Assessments in Education*, *1*(1), 5.
- American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). *Standards for educational and psychological testing*. Washington, DC: AERA.
- Barbiero, A., & Ferrari, P. A. (2015). *GenOrd: Simulation of discrete random variables with given correlation matrix and marginal distributions* [Computer software manual]. Retrieved from https://CRAN.R-project.org/package=GenOrd (R package version 1.4.0)
- Beaton, A. E., & Johnson, E. G. (1992). Overview of the scaling methodology used in the national assessment. *Journal of Educational Measurement*, *29*(2), 163–175.
- Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. *Journal of Statistical Software*, *48*(6), 1–29. Retrieved from http://www.jstatsoft.org/v48/i06/
- Foshay, A. W., Thorndike, R., Hotyat, F., Pidgeon, D., & Walker, D. (1962). *Educational achievement of thirteen-year-olds in twelve countries* (Tech. Rep.). Hamburg: UNESCO Institute for Education. Retrieved from http://unesdoc.unesco.org/images/0013/001314/131437eo.pdf
- Fox, J. (2016). *polycor: Polychoric and polyserial correlations* [Computer software manual]. Retrieved from https://CRAN.R-project.org/package=polycor (R package version 0.7-9)
- Gonzalez, E., & Rutkowski, L. (2010). Principles of matrix booklet designs and parameter recovery in large-scale assessments. *IERI Monograph Series*, *3*, 125–156.
- Little, R. J. A., & Rubin, D. B. (1983). On jointly estimating parameters and missing data by maximizing the complete-data likelihood. *The American Statistician*, *37*(3), 218–220.
- Lockheed, M., Prokic-Breuer, T., & Shadrova, A. (2015). *The experience of middle-income countries participating in PISA 2000–2015*. Washington, DC: World Bank Publications. https://doi.org/10.1787/9789264246195-en
- Magis, D., & Raiche, G. (2012). Random generation of response patterns under computerized adaptive testing with the R package catR. *Journal of Statistical Software*, *48*(8), 1–31. https://doi.org/10.18637/jss.v048.i08
- Magis, D., Yan, D., & von Davier, A. (2017). *mstR: Procedures to generate patterns under multistage testing* [Computer software manual]. Retrieved from https://CRAN.R-project.org/package=mstR (R package version 1.0)
- Mandal, B. N. (2018).
*ibd: Incomplete block designs [Computer software manual]*. Retrieved from https://CRAN.R-project.org/package=ibd (R package version 1.4) - Martin, M. O., Mullis, I. V. S., & Hooper, M. (Eds.). (2016).
*Methods and procedures in TIMSS 2015*. Boston: TIMSS & PIRLS International Study Center, Boston College. Retrieved from http://timssandpirls.bc.edu/publications/timss/2015-methods.html - Matta, T., Rutkowski, L., Rutkowski, D., & Liaw, Y. (2017).
*lsasim: Simulate large scale assessment data [Computer software manual]*. Retrieved from https://CRAN.R-project.org/package=lsasim (R package version 1.0.0) - Mislevy, R. J. (1991). Randomization-based inference about latent variables from complex samples.
*Psychometrika*,*56*(2), 177–196.View ArticleGoogle Scholar - Mislevy, R. J., Beaton, A. E., Kaplan, B., & Sheehan, K. M. (1992a). Estimating population characteristics from sparse matrix samples of item responses.
*Journal of Educational Measurement*,*29*(2), 133–161.View ArticleGoogle Scholar - Mislevy, R. J., Johnson, E. G., & Muraki, E. (1992b). Scaling procedures in NAEP.
*Journal of Educational and Behavioral Statistics*,*17*(2), 131–154.Google Scholar - OECD. (2014).
*PISA 2012 technical report (Tech. Rep.)*. Paris: OECD Publishing. Retrieved from https://www.oecd.org/pisa/pisaproducts/PISA-2012-technical-report-final.pdf - OECD. (2017). PISA 2015 technical report (draft). Paris: OECD Publishing.Google Scholar
- R Core Team. (2017).
*R: A language and environment for statistical computing [Computer software manual]*. Vienna: R Core Team. Retrieved from https://www.R-project.org/ - Robitzsch, A., Kiefer, T., & Wu, M. (2017).
*TAM: Test analysis modules [Computer software manual]*. Retrieved from https://CRAN.R-project.org/package=TAM (R package version 2.0-37) - Rubin, D. B. (1976). Inference and missing data.
*Biometrika*,*63*(3), 581–592.View ArticleGoogle Scholar - Rubin, D. B. (1987).
*Multiple imputation for nonresponse in surveys*. Hoboken, NJ: Wiley.View ArticleGoogle Scholar - Rutkowski, L., von Davier, M., Gonzalez, E., & Zhou, Y. (2014). Assessment design for international large-scale assessment. In L. Rutkowski, M. von Davier, & D. Rutkowski (Eds.),
*Handbook of international large-scale assessment: Background, technical issues, and methods of data analysis*. Boca Raton: Chapman & Hall/CRC Press.Google Scholar - Rutkowski, L., & Zhou, Y. (2015). The impact of missing and error-prone auxiliary information on sparse-matrix sub-population parameter estimates.
*Methodology*,*11*(3), 89–99.View ArticleGoogle Scholar - Shoemaker, D. M. (1973).
*Principles and procedures of multiple matrix sampling*. Oxford: Ballinger.Google Scholar - von Davier, M., Sinharay, S., Oranje, A., & Beaton, A. (2006). The statistical procedures used in national assessment of educational progress: Recent developments and future directions. In C. R. Rao & S. Sinharay (Eds.),
*Handbook of statistics*(pp. 1039–1055). Amsterdam: Elsevier.Google Scholar