When you have worked hard to code a model in R, how easy is it for somebody else to pick up what you have done and repurpose it? Can they easily understand its inner workings, or will you have left them with a black box that is hard to unpick?
The concept of reproducible computing is gaining ground, not least thanks to the efforts of Mine Çetinkaya-Rundel, associate professor of the practice at Duke University. Delegates will be able to learn more about it from her presentation at DataTech on 14th March, part of this year’s DataFest.
“The idea is that you get a data set, carry out analysis, and then make choices about the computational options to operationalise your output,” she told DataIQ in an interview last December. “Some of those are better than others for collaboration or to help somebody else rerun the same analysis.”
Decisions made by analysts, including which programming language to use and how well they document their process, have an impact on the level of collaboration which is possible and just how reproducible a piece of work will be. Given that many analytical queries need to be rerun when the client comes back with different parameters, adopting a working practice which supports this will make those iterations run more smoothly.
That is part of what Çetinkaya-Rundel will be expounding on in Edinburgh. One of the key learning points is the simple instruction to “take notes”. Documenting the journey from a smart idea to an adopted output helps to create value and impact, not just in large organisations, but in any realm in which analytics is active. Tools exist to help manage the process, not least version control for tracking models as they evolve, since that evolution can easily trip up an incomer trying to pick up and rerun an existing analysis.
“No-one thinks this is a bad idea. But if it takes up a lot of their time, they won’t do it. So having the right tools early on can be very helpful,” noted Çetinkaya-Rundel. In her teaching at Duke University, which she is about to transport across to the University of Edinburgh, she delivers the same message to two distinct audiences.
The first of these are undergraduates who don’t know any other way of working. They do not have fixed habits, so can have the concept drilled into them early on, thereby avoiding problems later in their careers, something of which that famous fictional resident of Edinburgh, Miss Jean Brodie, would have approved.
The second group are data scientists and researchers who are already established in their careers. “It is hard for an experienced person to change,” she acknowledged. “But as time passes, the tooling they use will change anyway, so this is an idea that will not go away. That said, new tooling is easier to learn than the notion that somebody needs to reproduce their work.”
Çetinkaya-Rundel has a PhD in Statistics and is a faculty member at Duke. It was while teaching statistics there that she realised the course did not fundamentally involve computing and that this was not realistic given the real-world way in which statistics is now used. “It is no longer about using printed tables, it is about recognising there is data that needs to be cleaned up, whereas we were only teaching particular techniques on clean data. I wanted to do a more realistic course.”
Most of the 100-plus students she sees each term have never done any computing before, so they are introduced to new-generation programming languages like R, which has become a widespread analytical tool. More importantly, she educates them on best practice in analytics.
“The feedback I get from undergraduates is that it feels like a lot to learn. But when they go into internships, they realise it is the same tooling, the same notions of reproducibility and controls. So they appreciate having been exposed to that. It is not easy to learn and it takes time to get good, but the earlier they start, the better,” stated Çetinkaya-Rundel. Analysts and data scientists eager to make their own early start could do worse than book a place at her talk.
DataIQ is a media partner for DataFest.