Speakers: Dr. Amanda Licastro (University of Pennsylvania), Dr. Benjamin Miller (University of Pittsburgh), Dr. Kyle McIntosh (University of Tampa), Dr. David Reamer (University of Tampa), Dr. Duncan Buell (University of South Carolina), Dr. Andrew Kulak (Virginia Tech), Dr. Kathryn Lambrecht (Arizona State University).
This roundtable discussion features contributors to Composition and Big Data (University of Pittsburgh Press, 2021), edited by Dr. Amanda Licastro and Dr. Benjamin Miller. The speakers discussed the affordances and challenges of using big data in composition for writing program assessment purposes. Throughout the roundtable discussion, Dr. Kyle McIntosh and Dr. David Reamer shared ways in which big data informed the questions that guided their evaluation of the writing program at the University of Tampa, offered as a case study for this approach. The roundtable guides participants through the process and concerns of collecting and analyzing big data sets for the purposes of writing program evaluation; however, many of these guiding questions can be applied to other datasets in other interdisciplinary contexts.
Collecting Big Data on Writing Programs
Dr. Licastro starts the session by framing the discussion with two questions: 1.) How can we safely collect and analyze data about writing? 2.) What challenges will we face? Dr. McIntosh follows up on these questions by recommending that researchers always seek IRB approval through their institutions prior to collecting data. In their evaluation of the writing program at the University of Tampa, Dr. McIntosh and Dr. Reamer proposed looking at students’ performance before and after a series of writing courses in order to conduct an analysis of the academic writing program. To tailor the data to the questions they were exploring, they excluded some courses from their dataset (namely, the courses in the writing credit sequence). Because the dataset contained many data points about each student, such as grades and gender, they focused on measures to protect students’ privacy by anonymizing the data.
Two concerns raised in this study were the level of technical knowledge needed to conduct a statistical analysis and the amount of incomplete or redundant data gathered. Dr. McIntosh and Dr. Reamer collaborated with colleagues who had technical expertise in analyzing large datasets. Additionally, the researchers gathered data from multiple sources and consolidated it, which informed the types of questions they were able to explore. Overall, Dr. McIntosh and Dr. Reamer recommend anonymizing the data and avoiding analyses that disaggregate the data into sample sizes so small that students’ privacy could be at risk.
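Neither speaker walks through an implementation, but a minimal sketch of this kind of privacy-preserving workflow might look like the following, assuming a hypothetical CSV export; the file name, column names, salt, and minimum cell size are illustrative assumptions, not details from the talk.

```python
import hashlib
import pandas as pd

# Hypothetical course-records export; column names are assumptions,
# not the actual University of Tampa dataset.
records = pd.read_csv("writing_program_records.csv")

# Replace student IDs with salted one-way hashes so rows can be linked
# across semesters without storing direct identifiers.
SALT = "replace-with-a-secret-salt"
records["student_key"] = records["student_id"].astype(str).apply(
    lambda sid: hashlib.sha256((SALT + sid).encode()).hexdigest()
)
records = records.drop(columns=["student_id", "student_name"])

# Suppress any subgroup smaller than a minimum cell size before reporting,
# so disaggregation cannot single out individual students.
MIN_CELL = 10
summary = (
    records.groupby(["course", "gender"])["final_grade"]
    .agg(["count", "mean"])
    .reset_index()
)
summary = summary[summary["count"] >= MIN_CELL]
print(summary)
```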
Bias in Big Data
The second section of the roundtable was framed by two questions: 1.) Is big data ever ethically neutral, or colorblind? 2.) Where are the openings for ethical intervention? This part of the roundtable was led by Dr. Kulak, author of “Ethics, Information Security, and Algorithmic Accountability”. Dr. Kulak begins by defining big data as datasets so large that they require computational or algorithmic processing. He asserts that big data is never ethically neutral because of the decisions made throughout the collection, reduction, and analysis of the data. Although the rhetoric around big data often connotes neutrality, algorithms learn from biased data; this bias commonly enters through “training sets”, the sample datasets on which algorithms are trained. In response to the second question, Dr. Kulak refers to his book chapter on cybersecurity and suggests that researchers consider the risks that data collection can pose to participants. He also raises the questions of whether all of the data is needed and how to properly secure the data that is collected. Finally, Dr. Kulak invites participants to consider the rhetoric surrounding big data and the transparency researchers can offer when working with biased algorithms.
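Dr. Kulak’s point about training sets can be made concrete with a small synthetic demonstration; this sketch is not from his chapter, and the data, features, and groups are invented purely for illustration. A classifier trained on a sample that over-represents one group learns that group’s patterns and performs noticeably worse on the under-represented group.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)

def sample_group(n, rule):
    """Synthetic two-feature data; `rule` decides which feature carries the label."""
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] > 0).astype(int) if rule == "feature_0" else (X[:, 1] > 0).astype(int)
    return X, y

# Training set heavily over-represents group A (900 vs. 100 examples).
X_a, y_a = sample_group(900, "feature_0")
X_b, y_b = sample_group(100, "feature_1")
model = LogisticRegression().fit(np.vstack([X_a, X_b]), np.concatenate([y_a, y_b]))

# Evaluate on balanced held-out samples from each group: the model does well
# on the over-represented group and barely better than chance on the other.
X_a_test, y_a_test = sample_group(1000, "feature_0")
X_b_test, y_b_test = sample_group(1000, "feature_1")
print("accuracy on over-represented group:", accuracy_score(y_a_test, model.predict(X_a_test)))
print("accuracy on under-represented group:", accuracy_score(y_b_test, model.predict(X_b_test)))
```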
Context and Ethics in Data Analysis
The next part of the roundtable was led by Dr. Kathryn Lambrecht, author of “Integrating Corpus Analysis with Qualitative Work”. The two questions that framed this section were: 1.) How should we prepare students to engage in/with data analysis? 2.) How can we best convey the importance of context and ethics? Dr. Lambrecht’s research focuses on corpus analysis in interdisciplinary programs. She began by describing how she has students draw upon data attached to themselves: students collect papers they have written and run them through a digital program to identify trends in their own writing, such as how many times and where they use certain modifiers or words in their essays. Dr. Lambrecht then asks students how the data represents their own perceptions of language and has them attach a narrative to the data they collect. This starts conversations about ethics and allows students to consider both quantitative and qualitative data. As a result of this exercise, students can see patterns in the words or phrases they commonly use at different points in their papers and attach personal stories to their data.
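The specific program Dr. Lambrecht’s students use is not named in the session, but a minimal sketch of this kind of self-corpus analysis could be written in a few lines of Python; the folder name and modifier list below are illustrative assumptions.

```python
import re
from collections import Counter
from pathlib import Path

# Hypothetical folder of a student's own essays saved as plain text files.
ESSAY_DIR = Path("my_essays")
MODIFIERS = {"very", "really", "quite", "somewhat", "rather", "clearly", "arguably"}

for essay in sorted(ESSAY_DIR.glob("*.txt")):
    words = re.findall(r"[a-z']+", essay.read_text(encoding="utf-8").lower())
    counts = Counter(w for w in words if w in MODIFIERS)
    density = sum(counts.values()) / max(len(words), 1)
    print(f"{essay.name}: {len(words)} words, "
          f"modifier density {density:.2%}, top modifiers {counts.most_common(3)}")
```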
Inclusion in Research Design
This next section was introduced by Dr. Duncan Buell, author of “Stylistic Complexity in a FYC Corpus”. Dr. Buell begins by examining two questions: 1.) How can we invite more perspectives in research design? 2.) Can we measure improvements in equitable inclusion? He asks researchers to consider the nuances of “what might be relevant and what might be explored”. Dr. Buell gives the example of collecting 20,000 first-year essays at the University of South Carolina with Dr. Chris Holcomb. He anonymized most of the essays himself, but wished that they had collected more data on majors, gender, and other demographics. Regarding the second question, Dr. Buell suggests that researchers consider the “before”/input data just as much as the “after”/output data. If a study focuses on measuring improvements, he suggests keeping track of as many pieces of data as possible, because researchers may later need to look back at causal data. In response to Dr. Buell’s suggestions, Dr. Kulak also pointed out the tension between collecting more data in order to be equitable to more groups of people and the privacy risks associated with securing such a large dataset.
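Dr. Buell does not describe how the essays were anonymized; as an illustration only, one simple approach to scrubbing a known roster of names from a corpus of essay files might look like the sketch below, where the roster, folder names, and replacement token are all hypothetical.

```python
import re
from pathlib import Path

# Hypothetical roster of student names to remove before analysis.
roster = {"Jane Doe", "John Smith"}
name_pattern = re.compile("|".join(re.escape(n) for n in roster), re.IGNORECASE)

src, dst = Path("raw_essays"), Path("anon_essays")
dst.mkdir(exist_ok=True)
for i, essay in enumerate(sorted(src.glob("*.txt"))):
    # Replace every roster name with a neutral token and save under a numbered filename.
    text = name_pattern.sub("[STUDENT]", essay.read_text(encoding="utf-8"))
    (dst / f"essay_{i:05d}.txt").write_text(text, encoding="utf-8")
```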
Takeaways
Regarding Dr. McIntosh and Dr. Reamer’s evaluation of the writing program, the speakers discussed ways in which the researchers took ownership of the data and created narratives around it. Suggestions from the speakers include asking how you’re going to use the data, how the data is disaggregated, and what narratives are attached to the data. Resources shared throughout the roundtable include Cathy O’Neil’s Weapons of Math Destruction and Catherine D’Ignazio and Lauren Klein’s Data Feminism, both of which emphasize the importance of constructing a transparent narrative around large datasets. Additionally, the speakers suggest asking questions such as how you’re going to act on the data and how to invite participants into the research process.