Striking a Balance: How Much Should I Cast My Own Site Variables Versus Referring to the Literature When Doing Predictive Machine Learning as a Data Scientist?

I recently came back from the TAIR 2025 conference, and I was struck by the number of presenters who focused on using either automated machine learning or artificial intelligence to create models for predictive analytics in higher education. One striking thing about the works presented is that the independent variables were similar from model to model, yet different enough to raise two questions. How much consistency should there be between predictive machine learning models? And how generalizable should any given model be?

These two questions strike at the limits of what local work should aim for. One way to look at the issue is through the pressing need to examine all variables available locally and use them to forge a way forward in predicting outcomes such as retention and enrollment at the university level. To a certain degree the questions are moot, as some would argue that data science is about creating actionable insights.

That is, until literature reviews about predictive analytics using machine learning recently began to emerge. Another way to interrogate the problem might be to ask: what is the purpose of a literature review in the field of data science? An example of a systematic review (a kind of literature review) in the field of data science is Cardona, Cudney, Hoerl, and Snyder (2023). Their work examines various kinds of retention models in higher education whose methodologies fall under machine learning and data mining techniques.

One might ask: what do we gain by using literature reviews as data scientists? One benefit is that they offer starting points for independent variables to feed into predictive models. Looking at prominent models might inform what could go into our local model, if we have access to those variables. It might also help us see how our local plans compare with what has preceded us nationally or internationally. It gives us a feel for whether our model might be considered normative or not. In other words, it's a good check on the model.
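As a rough illustration of that check, here is a minimal Python sketch that compares variables reported in the literature with the variables available locally. Everything in it is an assumption made for illustration: the function name compare_variables and all of the variable names are hypothetical and are not taken from Cardona et al. (2023) or from any conference presentation.

```python
def compare_variables(literature_vars, local_vars):
    """Split variable names into shared, literature-only, and local-only groups."""
    # Normalize names so that case and stray whitespace do not hide a match.
    lit = {v.strip().lower() for v in literature_vars}
    loc = {v.strip().lower() for v in local_vars}
    return {
        "shared": sorted(lit & loc),           # in the literature and available locally
        "literature_only": sorted(lit - loc),  # in the literature, not available locally
        "local_only": sorted(loc - lit),       # site-specific variables
    }


# Hypothetical variable names, invented for this example only.
literature_vars = ["high_school_gpa", "first_term_gpa", "credits_attempted", "financial_aid"]
local_vars = ["first_term_gpa", "credits_attempted", "residence_hall", "orientation_attended"]

result = compare_variables(literature_vars, local_vars)
for group, names in result.items():
    print(f"{group}: {', '.join(names)}")
```

The shared group suggests where a local model lines up with earlier work, while the local-only group marks the site-specific variables whose generalizability is an open question.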

One might also ask: what hinders us when we use literature reviews to inform a predictive machine learning model? This is very important. The tendency is to compare a local set of variables to other sets of site-specific variables. When our variables are specific to our site, they cannot be compared directly; they are not generalizable to other populations. In that case, literature reviews may be viewed as interesting but not informative to our model.

In the final analysis, a literature review of machine learning in predictive models can guide us as a record of previous approaches. Ultimately, though, if we are creating a local model with our own individual set of variables, we will have to use what is available to us. A detailed, carefully written, peer-reviewed literature review can still help us sidestep some of the pitfalls encountered by those who came before us.

References:

Cardona, T., Cudney, E., Hoerl, R., and Snyder, J. (2023). Data mining and machine learning in retention models in higher education. Journal of College Student Retention: Research, Theory and Practice, 25(1), 51-75. DOI: 10.1177/1521025120964920.

Researching Education
