
Simulating Education Data based on the R Wakefield Package

Unlike point-and-click or GUI-based statistical programs, the R language requires regular practice so that programming skills keep developing and stay committed to memory. Programming skill relates directly to R as a language, and computing languages only do work when humans use them to solve problems. In the process of conducting research, however, a solo researcher must call upon many skills, of which statistical computing is only one. When a solo author pulls together an entire project and begins analyzing data, certain skills, such as statistical computing, are donned and doffed at particular moments in the process. Because programming skills will inevitably be put down and picked back up during the research process, deliberate computing practice is essential for refreshing the ability to call upon the R language.

At its simplest, R can be thought of as a language of nouns and verbs, with respective activities belonging to each class of language component (Wickham & Grolemund, 2017). The nouns are data frames, vectors, and lists. The verbs are the functions called to act on those nouns. One could argue that doing anything in R requires nouns and verbs coming together, with the verb supplying the correct conjugation. If R had only nouns, data frames would sit as ontological things with no motion. If verbs, or functions, were written and accumulated but had no nouns to work upon, there would be no action within the R language.

Some people work with R mostly to solve problems dealing with simulated data (synthetic nouns) and the accompanying issues that such data present to scientists invoking verbs to solve them. Other people work with R to solve problems related to verbs, which assumes the nouns are already present. To be sure, each form of work (noun-heavy and verb-heavy) assumes that the user will work with nouns and verbs in tandem at least to some degree.

The working assumption here is that unless you belong to a narrow group of individuals outside the world of applied education science (mathematicians and computer programming teachers, for instance), you will focus mostly on R's verb functions. Those outside that narrow group, education scientists, rely on social scientific methodologies. In the realm of generating inferences, education scientists are in many cases more focused on reaching conclusions than on making data with problems built into them. The emphasis falls on "cleaning" data quickly and applying tests in order to get on with hypothesis testing or hypothesis discovery.

As it is understood here, applying statistical tests with R requires practice. Case studies can be invoked in which nouns (data frames, lists, vectors) have already been created and are the subject of speculation. To get the noun component of a hypothetical case study in place, education scientists can create simulated data either through base R or through other R packages. Working in base R alone can be complicated, however: Likert scales, for example, would theoretically be administered to participants and therefore need to be generated in data simulations, yet they can be tricky to build by hand in base R.
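For contrast, the lines below are a minimal base R sketch of one way to simulate a single Likert item; the item labels and response probabilities are illustrative assumptions rather than values from any study.

# Simulating one 5-point Likert item in base R (illustrative sketch)
set.seed(10)

likert_levels <- c("Strongly Disagree", "Disagree", "Neutral",
                   "Agree", "Strongly Agree")

# Draw 1,000 responses; the probabilities are invented for this example
responses <- sample(likert_levels, size = 1000, replace = TRUE,
                    prob = c(0.10, 0.20, 0.30, 0.25, 0.15))

# Store as an ordered factor so the scale order is preserved
responses <- factor(responses, levels = likert_levels, ordered = TRUE)
table(responses)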

The wakefield package helps to simulate data in a way that lets verb-heavy users get ahold of nouns quickly (Rinker, 2018a). At the very least, it can generate columns such as dummy codes for ANOVA, pseudo ID numbers, education level, IQ, and Likert scales, which makes it easy to create the nouns used for structural equation modeling and related maneuvers.
Identification numbers can be added with the package's id variable function (Rinker, 2018b).
Additionally, there is a name data set (gender-neutral at the time of writing) that provides names which can then be bound to case numbers (with cbind()). In many instances, real data are not de-identified, and the user must either take great pains to de-identify them or work carefully under privacy regulations such as FERPA and HIPAA in the United States. Adding pseudo-names to simulated cases lets education scientists practice thinking about how to handle classified information.
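The sketch below pulls these pieces together, assuming the wakefield variable functions id, education, iq, dummy, likert, and name behave as documented in Rinker (2018b); binding a Name column with cbind() is one illustrative way to attach pseudo-names to cases.

# Simulate a handful of cases and attach pseudo-names (sketch)
library(wakefield)

set.seed(11)

cases <- r_data_frame(
  n = 25,
  id,         # pseudo identification numbers
  education,  # education level
  iq,         # IQ scores
  dummy,      # a 0/1 dummy code for ANOVA-style practice
  likert      # a Likert-type item
)

# Bind a Name column onto the cases to practice de-identification work
cases_named <- cbind(Name = name(25), cases)
head(cases_named)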

Getting started with wakefield in R is straightforward. The code below (Example 1) creates a data frame of 1,000 cases with variables such as an ID, grade level, Likert scale responses, and grades on a 0-100 scale. This quick noun generation can help the verb-forward education scientist build practice modules that reinforce code recall. Example 1 is only a small taste of what wakefield can accomplish; the package is worth exploring for time series and other kinds of simulated data.
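The following is a minimal sketch of Example 1, assuming the wakefield package (Rinker, 2018a) is installed; the specific variable functions chosen (id, grade_level, likert, grade) are illustrative selections from the package's documentation.

# Example 1: simulate 1,000 cases with wakefield's r_data_frame()
library(wakefield)

set.seed(10)  # make the simulated "nouns" reproducible

sim_dat <- r_data_frame(
  n = 1000,      # 1,000 simulated cases
  id,            # pseudo identification numbers
  grade_level,   # school grade level
  likert,        # Likert-type item responses
  grade          # percent grades on a 0-100 style scale
)

sim_dat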


If we call str() on the R object (as done in Example 2), we find that the code has created columns for grades (0-100), Likert scales, and an ID variable. For a statistics teacher trying to jumpstart students into data, these simulations can be of great benefit, because the code quickly lets a teacher move into classical statistics and beyond.
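As a sketch of Example 2, calling str() on the object created above displays each column and its type.

# Example 2: inspect the structure of the simulated data frame
str(sim_dat)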

Results

Running the example returns a tibble along with information about the data. Viewing the top of the data frame shows most of the variables ready to be worked upon.
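As a brief sketch, that view of the top rows can be produced with head() on the Example 1 object.

# Peek at the first rows of the simulated tibble
head(sim_dat)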

Conclusion

R is an open-source statistical programming language, and programming principles must be practiced to excel with its grammar. For education scientists, this means working with data to practice verb-heavy work. "Verb-heavy" refers to actively working with functions or creating them, in contrast to noun-heavy work, where the focus is on creating simulated data from the functions in base R, which can take more time and effort to compute. As part of the solution for generating data, the wakefield package has been described, with only one function (r_data_frame()) discussed for researchers interested in classical statistics. The function has been discussed in terms familiar to education researchers interested in practicing their skills starting from verb-intensive work.

References

Rinker, T. (2018a). wakefield: Generate random data [R package version 0.3.3]. https://github.com/trinker/wakefield

Rinker, T. (2018b). Package 'wakefield'. https://cran.r-project.org/web/packages/wakefield/wakefield.pdf

Wickham, H., & Grolemund, G. (2017). R for data science. Sebastopol, CA: O'Reilly Media.

