“The amount of data available affects everything we do. Whether you call it data wrangling, cleaning, munging, translating or manipulating, it involves moving data from the form it’s given in to another form that will reveal information.” –Statistics professor Daniel Kaplan
In 2016 we are swimming in data. The sequence of 3 billion bases in a person’s genome can be read in about a day. When you use a credit card, the company checks your transaction for signs of fraud, drawing on a database of hundreds of millions of previous transactions. Amazon knows what people like you are buying and suggests a next purchase for you.
No matter what your field of interest, complex collections of data are likely to be a part of it. Some students arrive at college eager to dig into data sets; others may wish there were a way around it. If you’re not necessarily inclined toward wrangling data, it can be intimidating, but never fear: there are classes to prepare you to take on data analysis, statistics, and other methods of making numbers give up their secrets.
Katherine Kinnaird, visiting professor in Mathematics, Statistics, and Computer Science, runs the Data Science TRAin (Try, Read, Ask, and Incorporate) Lab, which welcomes students with broad data experience, or none at all. They begin by reading papers in computer science, developing a glossary as they go and making no assumptions of prior knowledge, so no one is left behind. Students build on these readings by working on their own self-directed data science projects.
“There’s no tolerance for technical snobbery,” says Kinnaird. “It takes two to three weeks to work our way through the first paper.” As the students learn about techniques for using data, they consider a question and what techniques could be used to answer that question. “For projects, students with more technical skills have tended to pair with those with more background in a particular subject matter, political science, for example.”
Psychology professor Steve Guglielmo regularly teaches a course in statistics that is required for psychology majors—Research in Psychology I, also known as RIP I. The course is an introduction to the principles of research with an emphasis on statistical techniques used in psychological science to quantify behaviors and relationships.
“On the first day, I acknowledge that some students are probably apprehensive,” says Guglielmo. “They may not be comfortable with or interested in math and formulas and may question the purpose of statistics in psychology.
“I show them that they already have a lot of the background they need. They can bring formula ‘cheat sheets’ to exams because the goal is to build a base of concepts, not math per se. They learn what formulas do, when to use them and how to interpret what they tell us.” Before the course is over, the students have run an experiment, collected data, used statistical analysis and written a report.
“If psychology students are interested in improving people’s lives, actions have to be grounded in what works,” says Guglielmo. “To know what works, we need good research techniques.”
Macalester alumnus J.J. Allaire ’91 brought his Internet entrepreneurial talents to the problem of providing sophisticated data analysis. His open-source RStudio.com organization provides the leading way to work with the statistics language R. Among his clients: NASA, Eli Lilly, and eBay. The very first users of his software were the students in Macalester’s Introduction to Statistical Modeling, one of the first courses in the world to be taught with R and taken by almost 60% of Macalester students.
Biology professor Paul Overvoorde served as the initial program director of the Howard Hughes Medical Institute grant at Macalester, a $1.3 million grant dedicated to transforming data into knowledge by promoting the computational skills needed to draw dependable conclusions from large and increasingly complex data sets.
One result of the grant is the new Data and Computation Fundamentals (DCF) course.
Macalester statistics professor Daniel Kaplan spearheaded the development of the course and wrote the textbook for it: Data Computing: An Introduction to Wrangling and Data Visualization with R.
DCF is a one-credit, eight-week course with no prerequisites. In the course, students learn data-wrangling techniques, including how to combine data from multiple sources. Working on in-class projects and out-of-class homework, students learn how to approach real-world questions such as: If people in New York City shared taxicabs, how many fewer cabs would be needed? How should a library manage its collections in order to provide needed services most efficiently?
A commitment of only 10 hours of class time, DCF allows students to develop important data skills in eight weeks and complete the course before their other courses come into the home stretch of finals and papers.
“We are living in a revolutionary period,” says Kaplan. “The amount of data available affects everything we do. Whether you call it data wrangling, cleaning, munging, translating or manipulating, it involves moving data from the form it’s given in to another form that will reveal information.”
Mengdie Wang ’16, a math and computer science major from Wuhu, China, has seen more and more students become proficient in the tools of data science. “The Biology Department is encouraging their students, especially the cell biology lab, to use RStudio to handle data. To help smooth the transition process they hire stats fellows, students who have expertise in RStudio, to help students with their assignments.
“Also, the Mathematics and Computer Science Department is offering a new data science minor that encourages students to integrate a big data approach with their expertise in a particular area.”
According to Overvoorde, “Whatever your major, and whatever your level of experience with data science, Macalester offers many resources to help develop the skills you need.”
March 23, 2016