Design-Based Survey Analysis

"Case Damascus Barlow Knife" by Michael E. Cumpston CC-BY-SA 3.0

One of the persistent problems for researchers who rely on secondary analysis is building a statistical model from data that generalize only to a fixed population (Lumley, 2010). A key difference between making statistical inferences about similar populations and estimating from a sample to a fixed population is that the latter requires several preemptive steps to ensure that the design-based sampling is replicated. Bell, Onwuegbuzie, Ferron, Jiao, Hibbard, and Kromrey (2012) reported that many investigators relying on large surveys of adolescent health are unclear about how they remain faithful to the survey design. Reporting on international survey data suffers from the same issues: sampling weights are either left out of the analysis or not discussed thoroughly in the methodology sections of reports. The reasons for these incomplete discussions have not been definitively established, but a working guess is that design-based anti-bias mechanisms are left out because of the mystery surrounding how they work, the nature of survey design itself, and unfamiliarity with the statistical packages that can help.
Two of the most common ways of remaining faithful to a survey design are to use, and report on, survey clusters and sampling weights. The former approach, with survey clusters or primary sampling units (PSUs), can be slightly overwhelming to the novice secondary data researcher, especially when importing data from several .sav files into R. Gathering separate data files into one data frame can take many lines of code if the analyst is not using a specialty package like ‘intsvy’ to choose the appropriate variables and concatenate them into a workable data frame (Caro & Biecek, 2017). Sometimes data sets do not provide PSUs, and it can be especially difficult to program the appropriate calculations behind the scenes so that a well-formed estimate of the population parameters can be achieved. A function with a survey ‘weights’ argument lets the statistical programmer place the correct variable in the appropriate slot. However, finding an R package with both a weight slot and the ability to perform bootstrap or jackknife procedures can be challenging.
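For readers doing the gathering step by hand, a minimal base R sketch of stacking per-country files might look like the following. The two small data frames stand in for files that would normally be read with something like foreign::read.spss(file, to.data.frame = TRUE) or haven::read_sav(file); the variable names and values are invented for illustration and are not taken from any particular survey release.

```r
# Stand-ins for two per-country .sav files (illustrative values only).
country_a <- data.frame(IDSCHOOL = c(1, 1, 2),
                        TOTWGT   = c(10.5, 10.5, 8.2),
                        SCORE    = c(510, 495, 530),
                        EXTRA_A  = 1:3)
country_b <- data.frame(IDSCHOOL = c(7, 8),
                        TOTWGT   = c(12.1, 9.9),
                        SCORE    = c(480, 502),
                        EXTRA_B  = c("x", "y"))

files <- list(A = country_a, B = country_b)

# Keep only the variables needed downstream: cluster id, weight, outcome.
keep <- c("IDSCHOOL", "TOTWGT", "SCORE")

stacked <- do.call(rbind, lapply(names(files), function(nm) {
  d <- files[[nm]][, keep, drop = FALSE]
  d$COUNTRY <- nm          # remember which file each row came from
  d
}))
rownames(stacked) <- NULL

print(stacked)
```

The point of the sketch is only that the cluster identifier and the weight variable have to survive the merge; if either is dropped at this stage, no later function can recover the design.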
Using the R programming language for secondary data analysis can make the issue of survey replication look deceptively simple, especially when running a linear model with the base lm() function. Assigning a sampling-weight variable to the ‘weights’ argument does not work the way one would intuitively expect, and it can get the programmer into trouble. A few R packages, such as ‘EdSurvey’ (Bailey et al., 2018) and ‘intsvy’, accept a survey weights argument.
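One way to see the problem is with a small simulation in base R. lm() treats ‘weights’ as precision weights for independent observations, so its standard errors ignore the clustering in the design even when the weighted point estimates are usable. The sketch below compares those standard errors with a delete-one-PSU jackknife. The data, the variable names (psu, wt), and the cluster structure are all invented for illustration, and real surveys usually prescribe their own replicate scheme.

```r
set.seed(1)

n_psu <- 20   # number of clusters (PSUs), assumed for illustration
m     <- 10   # sampled units per PSU
psu   <- rep(seq_len(n_psu), each = m)

dat <- data.frame(
  psu = psu,
  x   = rnorm(n_psu * m),
  wt  = runif(n_psu, 5, 50)[psu]   # made-up sampling weights, constant within PSU
)
# Outcome with a shared cluster effect, so units in a PSU are correlated.
dat$y <- 1 + 0.5 * dat$x + rnorm(n_psu)[psu] + rnorm(n_psu * m)

# Weighted least squares: standard errors assume independent observations.
fit      <- lm(y ~ x, data = dat, weights = wt)
naive_se <- coef(summary(fit))[, "Std. Error"]

# Delete-one-PSU jackknife: refit with each PSU removed in turn.
full <- coef(fit)
reps <- t(sapply(seq_len(n_psu), function(k) {
  coef(lm(y ~ x, data = dat[dat$psu != k, ], weights = wt))
}))
jk_se <- sqrt((n_psu - 1) / n_psu * colSums(sweep(reps, 2, full)^2))

print(cbind(naive = naive_se, jackknife = jk_se))
```

The replicates here are centred on the full-sample estimate; centring on the mean of the replicates is an equally common convention and gives slightly different numbers.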
There are two solutions, depending on how fine-grained you want the analysis to be. One answer is to use the ‘BIFIEsurvey’ package (BIFIE, 2018). It is based on the work of Breit and Schreiner (2016), has slots for weighting variables, and can handle the replicate variables in the TIMSS and PIRLS surveys. Outputs include standard errors, degrees of freedom, p-values, and Wald statistics, derived from jackknife replicates or bootstrap procedures. The output is intuitive and can be exported directly to LaTeX with the help of the str() command in R and the ‘xtable’ package (Dahl, 2016). The other solution is to use the ‘survey’ package by Lumley (2018). This package requires more information to produce an object that can then be carried forward for further analysis, but for analysts who can identify the cluster id variables in their data set, it gives greater control over the analysis. Both packages give statistical output based on estimation toward a fixed population, along with standard errors (a sine qua non of estimating in this circumstance). An example of the practical usage of the ‘survey’ package can be found in Murray (2015). Novices will probably want to start with the ‘BIFIEsurvey’ package and advance to the ‘survey’ package once they have mastered the full set of complex options in estimating toward the fixed population.
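As a rough illustration of the ‘survey’ route, the sketch below uses the api example data that ships with the package rather than TIMSS or PIRLS files, so the variable names (dnum for the cluster id, pw for the weight, fpc for the finite population correction) belong to that example and would need to be swapped for the ones in your own data. It is meant only to show the shape of the workflow: declare the design once, then carry the design object into every estimate.

```r
library(survey)

# Example data bundled with 'survey'; apiclus1 is a one-stage cluster
# sample of school districts.
data(api)

# Declare the design: cluster id, sampling weight, finite population correction.
dclus1 <- svydesign(id = ~dnum, weights = ~pw, fpc = ~fpc, data = apiclus1)

# Design-based mean and standard error.
print(svymean(~api00, dclus1))

# Design-based linear model, in place of lm(..., weights = pw).
fit <- svyglm(api00 ~ ell + meals, design = dclus1)
print(summary(fit))

# The same design converted to jackknife replicate weights.
rclus1 <- as.svrepdesign(dclus1, type = "JK1")
print(svymean(~api00, rclus1))
```

Keeping the design in one object is what gives the extra control mentioned above: the same dclus1 or rclus1 can be passed to means, totals, and models without restating the weights or clusters each time.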
Both packages take on the problems reported by Bell et al. (2012) in their review of the literature, and to good effect. They offer several options for estimating from a sample toward a fixed population, with one-step, two-step, and possibly further treatments of the sample design. Both depend on the statistical programmer first creating a basic data frame that captures the variables of interest for parameter estimation. Both also support statistical procedures that combine mathematical modeling with estimation toward the fixed population.

References

Bailey, P., C'deBaca, R., Emad, A., Huo, H., Lee, M., Liao, Y., Nguyen, T., Xie, Q., Yu, J., and Zhang, T. (2018). EdSurvey: Analysis of NCES Education Survey and Assessment Data. R package version 2.0.3. https://CRAN.R-project.org/package=EdSurvey
Bell, B., Onwuegbuzie, A., Ferron, J., Jiao, Q., Hibbard, S., and Kromrey, J. (2012). Use of design effects and sample weights in complex health survey data: A review of published articles using data from 3 commonly used adolescent health surveys. American Journal of Public Health, 102(7), 1399-1405. http://doi.org/10.2105/AJPH.2011.300398
BIFIE (2018). BIFIEsurvey: Tools for survey statistics in educational assessment. R package version 2.191-12. https://CRAN.R-project.org/package=BIFIEsurvey
Breit, S., and Schreiner, C. (2016). Large-Scale Assessment mit R: Methodische Grundlagen der österreichischen Bildungsstandardüberprüfung. Facultas: Vienna, Austria.
Caro, D. H., and Biecek, P. (2017). intsvy: An R package for analyzing international large-scale assessment data. Journal of Statistical Software, 81(7), 1-44. http://doi.org/10.18637/jss.v081.i07
Dahl, D. (2016). xtable: Export Tables to LaTeX or HTML. R package version 1.8-2. https://CRAN.R-project.org/package=xtable
Lumley, T. (2010). Complex surveys: A guide to analysis using R. Hoboken, NJ: John Wiley.
Lumley, T. (2018). survey: Analysis of complex survey samples. R package version 3.34. https://cran.r-project.org/web/packages/survey/
Murray, J. (2015). Intro to the survey R package (36-303). Retrieved from https://www.andrew.cmu.edu/user/jsmurray/teaching/303/files/lab.html