Striking a Balance: How Much Should I Cast My Own Site Variables, Versus Referring to the Literature, When Doing Predictive Machine Learning as a Data Scientist?

Recently I came back from the TAIR 2025 conference, and I was struck by the number of presenters who focused on using either automated machine learning or artificial intelligence to create models for predictive analytics in higher education. One striking feature of the work presented was that the independent variables were broadly similar from model to model, yet different enough to raise two questions. How much consistency should there be between predictive machine learning models? And how generalizable should any given model be?

These two questions strike at the limits of what local work should aim for. One way to look at the issue is through the pressing need to examine all the variables available locally and use them to forge a way forward in predicting outcomes such as retention and enrollment at the university level. To a certain degree the question of consistency is moot, since some would argue that data science is about creating actionable insights for the institution at hand.

That was the case, at least, until literature reviews about predictive analytics using machine learning recently began to emerge. Another way to interrogate the problem might be to ask: what is the purpose of a literature review in the field of data science? An example of a systematic review (a kind of literature review) in the field is Cardona, Cudney, Hoerl, and Snyder (2023). Their work examines various kinds of retention models in higher education, covering methodologies that fall under machine learning and data mining techniques.

One might ask: what do we gain by using literature reviews as data scientists? One benefit is that they offer starting points for the independent variables we feed into predictive models. Looking at prominent models might inform what could go into our local model, provided we have access to those variables. It also helps us see how our local plans compare with what has preceded us nationally or internationally, and it gives us a feel for whether our model might be considered normative or not. In other words, it's a good check on the model.
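As a rough illustration of that check, here is a minimal Python sketch that compares the variables reported in a literature review with the variables available at a local site. The variable names are hypothetical placeholders I made up for the example; they are not taken from Cardona et al. (2023) or from any particular institution.

```python
# Hypothetical variable names, for illustration only.
# They are NOT drawn from any specific review or institution.
literature_variables = {
    "high_school_gpa",
    "first_term_gpa",
    "credits_attempted",
    "financial_aid_status",
    "first_generation",
}
local_variables = {
    "first_term_gpa",
    "credits_attempted",
    "financial_aid_status",
    "distance_from_campus",
    "orientation_attended",
}


def compare_variable_sets(literature, local):
    """Split variables into shared, literature-only, and local-only groups."""
    return {
        "shared": sorted(literature & local),
        "literature_only": sorted(literature - local),
        "local_only": sorted(local - literature),
    }


comparison = compare_variable_sets(literature_variables, local_variables)
for label, names in comparison.items():
    print(f"{label}: {', '.join(names)}")
```

The shared group suggests candidate starting points, the literature-only group shows what we might want but cannot access, and the local-only group is where the site-specific character of the model lives.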

One might also ask: what hinders us when we use literature reviews to inform a predictive machine learning model? This is very important. The tendency is to compare a local set of variables to another set of site-specific variables. In that case, our site-specific variables cannot be compared directly, because they are not generalizable to other populations. From this angle, literature reviews may be viewed as interesting but not informative to our model.

In the final analysis, a literature review on machine learning in predictive models can guide us as a record of previous approaches. Ultimately, though, if we are creating a local model with our own individual set of variables, we will have to use what is available to us, knowing that a detailed, carefully written, and peer-reviewed literature review can help us sidestep some of the pitfalls others have already encountered.


References:

Cardona, T., Cudney, E., Hoerl, R., and Snyder, J. (2023). Data mining and machine learning in retention models in higher education. Journal of College Student Retention: Research, Theory and Practice, 25(1), 51-75. DOI: 10.1177/1521025120964920.

