
Simulating Education Data based on the R Wakefield Package

Unlike point-and-click or GUI-based statistical programs, the R language requires regular practice so that programming skills keep developing and stay committed to memory. Programming skill relates directly to R as a language, and computing languages only do work when humans use them to solve problems. In the process of conducting research, however, a solo researcher must call upon many skills, of which statistical computing is only one. When a solo author pulls together an entire project and begins analyzing data, certain skills, such as statistical computing, are donned and doffed at particular moments in the process. Because programming skills will inevitably be put down and picked back up during the research process, deliberate computing practice is essential for refreshing the ability to call upon the R language.

At its simplest, R can be thought of as a language of nouns and verbs, with respective activities belonging to each class of language component (Wickham & Grolemund, 2017). The nouns are data frames, vectors, and lists. The verbs are the functions called to act on those nouns. One could argue that doing anything in R requires nouns and verbs coming together, with the verb supplying the correct conjugation. If R had only nouns, data frames would sit as ontological things with no motion. If verbs, or functions, were written and accumulated but had no nouns to work upon, there would be no action within the R language.

Some people work with R mostly to solve problems dealing with simulated data (synthetic nouns) and the accompanying issues that such data present to scientists invoking verbs to solve them. Other people work with R to solve problems related to verbs, which assumes the nouns are already present. To be sure, each form of work (noun-heavy and verb-heavy) assumes that the user will work with nouns and verbs in tandem at least to some degree.

The working assumption here is that unless you belong to a narrow group of individuals outside the world of applied education science (mathematicians and computer programming teachers, for instance), you will focus mostly on R's verb functions. Those outside that narrow group, education scientists, rely on social scientific methodologies. In the realm of generating inferences, education scientists are in many cases more focused on reaching conclusions than on making data with problems built into them. The emphasis falls on "cleaning" data quickly and applying tests in order to get on with hypothesis testing or hypothesis discovery.

As it is understood here, applying statistical tests with R requires practice. Case studies can be invoked in which nouns (data frames, lists, vectors) have already been created and are the subject of speculation. To get the noun component of a hypothetical case study in place, education scientists can create simulated data either through base R or through other R packages. Working in base R alone can be complicated, however: Likert scales, for example, would theoretically be administered to participants and therefore need to be generated in data simulations, yet they can be tricky to build by hand in base R.
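For contrast, the lines below are a minimal base R sketch of one way to simulate a single Likert item; the item labels and response probabilities are illustrative assumptions rather than values from any study.

# Simulating one 5-point Likert item in base R (illustrative sketch)
set.seed(10)

likert_levels <- c("Strongly Disagree", "Disagree", "Neutral",
                   "Agree", "Strongly Agree")

# Draw 1,000 responses; the probabilities are invented for this example
responses <- sample(likert_levels, size = 1000, replace = TRUE,
                    prob = c(0.10, 0.20, 0.30, 0.25, 0.15))

# Store as an ordered factor so the scale order is preserved
responses <- factor(responses, levels = likert_levels, ordered = TRUE)
table(responses)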

The wakefield package helps to simulate data in a way that lets verb-heavy users get ahold of nouns quickly (Rinker, 2018a). At the very least, it can generate columns such as dummy codes for ANOVA, pseudo ID numbers, education level, IQ, and Likert scales, which makes it easy to create the nouns used for structural equation modeling and related maneuvers.
Identification numbers can be added with the package's id variable function (Rinker, 2018b).
Additionally, there is a name data set (gender-neutral at the time of writing) that provides names which can then be bound to case numbers (with cbind()). In many instances, real data are not de-identified, and the user must either take great pains to de-identify them or work carefully under privacy regulations such as FERPA and HIPAA in the United States. Adding pseudo-names to simulated cases lets education scientists practice thinking about how to handle classified information.
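The sketch below pulls these pieces together, assuming the wakefield variable functions id, education, iq, dummy, likert, and name behave as documented in Rinker (2018b); binding a Name column with cbind() is one illustrative way to attach pseudo-names to cases.

# Simulate a handful of cases and attach pseudo-names (sketch)
library(wakefield)

set.seed(11)

cases <- r_data_frame(
  n = 25,
  id,         # pseudo identification numbers
  education,  # education level
  iq,         # IQ scores
  dummy,      # a 0/1 dummy code for ANOVA-style practice
  likert      # a Likert-type item
)

# Bind a Name column onto the cases to practice de-identification work
cases_named <- cbind(Name = name(25), cases)
head(cases_named)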

Getting started with wakefield in R is straightforward. The code below (Example 1) creates a data frame of 1,000 cases with variables such as an ID, grade level, Likert scale responses, and grades on a 0-100 scale. This quick noun generation can help the verb-forward education scientist build practice modules that reinforce code recall. Example 1 is only a small taste of what wakefield can accomplish; the package is worth exploring for time series and other kinds of simulated data.
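The following is a minimal sketch of Example 1, assuming the wakefield package (Rinker, 2018a) is installed; the specific variable functions chosen (id, grade_level, likert, grade) are illustrative selections from the package's documentation.

# Example 1: simulate 1,000 cases with wakefield's r_data_frame()
library(wakefield)

set.seed(10)  # make the simulated "nouns" reproducible

sim_dat <- r_data_frame(
  n = 1000,      # 1,000 simulated cases
  id,            # pseudo identification numbers
  grade_level,   # school grade level
  likert,        # Likert-type item responses
  grade          # percent grades on a 0-100 style scale
)

sim_dat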


If we call str() on the R object (as done in Example 2), we find that the code has created columns for grades (0-100), Likert scales, and an ID variable. For a statistics teacher trying to jumpstart students into data, these simulations can be of great benefit, because the code quickly lets a teacher move into classical statistics and beyond.
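As a sketch of Example 2, calling str() on the object created above displays each column and its type.

# Example 2: inspect the structure of the simulated data frame
str(sim_dat)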

Results

Running the example returns a tibble along with information about the data. Viewing the top of the data frame shows most of the variables ready to be worked upon.
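As a brief sketch, that view of the top rows can be produced with head() on the Example 1 object.

# Peek at the first rows of the simulated tibble
head(sim_dat)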

Conclusion

R is an open-source statistical programming language, and programming principles must be practiced to excel with its grammar. For education scientists, this means working with data to practice verb-heavy work. "Verb-heavy" refers to actively working with functions or creating them, in contrast to noun-heavy work, where the focus is on creating simulated data from the functions in base R, which can take more time and effort to compute. As part of the solution for generating data, the wakefield package has been described, with only one function (r_data_frame()) discussed for researchers interested in classical statistics. The function has been discussed in terms familiar to education researchers interested in practicing their skills starting from verb-intensive work.

References

Rinker, T. (2018a). wakefield: Generate random data [R package version 0.3.3]. https://github.com/trinker/wakefield

Rinker, T. (2018b). Package 'wakefield'. https://cran.r-project.org/web/packages/wakefield/wakefield.pdf

Wickham, H., & Grolemund, G. (2017). R for data science. Sebastopol, CA: O'Reilly Media.

