Seth Dobrin, vice president and chief data officer at IBM, has led the 60-strong Data Science Elite Team since 2017. Dobrin began at the multinational IT company by speaking to different clients to understand their challenges with data science projects. He came to realise that there were three reasons why clients were not getting value from their data teams: they were experimentation-driven rather than case-driven, they saw ‘data scientist’ as a single person instead of a team of people, and they were not applying agile methodologies.
As a result, the Data Science Elite Team was officially launched in February 2017. It is composed of experts who have successfully operationalised and implemented data science and AI projects. “We have a proven methodology that we show the clients. We help them figure out how to identify a decision or outcome they want to drive, and how to break that outcome into the component parts or models. Most outcomes worth doing are not one machine learning or AI model but several,” said Dobrin. In helping clients to figure out their outcomes, the Elite Team has found that there are four principles of best practice for success with data science projects.
“Take the bias out.”
The first principle of best practice is ‘take the bias out.’ Dobrin said that when thinking about bias, it is first and foremost important to set the context. He said that data science projects use math to make decisions, and that decision making is by nature an activity of discriminating one option from another.
“AI is very good at saying ‘you should do A over B’ and it is also very good at identifying underlying components of that. We call it features,” explained Dobrin. He said that bias comes from the underlying data and gave an example of a financial services model that was making a decision of whether or not to offer mortgages to applicants.
Because the model used zip codes as one of the features on which it based its decision, and because in the United States residential neighbourhoods are generally divided down racial lines, zip code became a proxy for race, resulting in people from minority ethnic groups being less likely to be offered mortgages. “There was unintentional bias towards specific ethnic groups,” said Dobrin. However, he said that as soon as they realised that was happening, zip code was removed from the dataset.
Fortunately, he explained, once it is known which groups should not be discriminated against and which features could act as proxies for them, modellers can actively monitor bias with simple statistical distributions. He added that it is important to have diverse teams because this can help the team to see issues of bias sooner.
While he admits that it is not easy to build a diverse team, he said it should be standard practice to have a diverse pool of qualified, viable candidates to choose from, because “diversity is incredibly important to making good decisions and building good models.”
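The kind of monitoring described above, comparing simple outcome distributions across groups and across a suspected proxy feature, might look something like the following. This is a minimal sketch rather than the Elite Team's method: the function names, the field names (`group`, `zip_code`, `approved`) and the sample records are all hypothetical, invented purely for illustration.

```python
from collections import defaultdict


def approval_rates(records, group_key, outcome_key="approved"):
    """Return the share of positive outcomes for each value of group_key."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for row in records:
        group = row[group_key]
        totals[group] += 1
        if row[outcome_key]:
            positives[group] += 1
    return {group: positives[group] / totals[group] for group in totals}


def rate_ratio(rates):
    """Ratio of the lowest group rate to the highest (1.0 means parity)."""
    highest = max(rates.values())
    return min(rates.values()) / highest if highest else 1.0


# Hypothetical, made-up mortgage decisions (not real data).
decisions = [
    {"group": "A", "zip_code": "11111", "approved": True},
    {"group": "A", "zip_code": "11111", "approved": True},
    {"group": "A", "zip_code": "22222", "approved": True},
    {"group": "A", "zip_code": "11111", "approved": False},
    {"group": "B", "zip_code": "22222", "approved": True},
    {"group": "B", "zip_code": "22222", "approved": False},
    {"group": "B", "zip_code": "22222", "approved": False},
    {"group": "B", "zip_code": "22222", "approved": False},
]

# Outcome distribution by the group we do not want to discriminate against.
by_group = approval_rates(decisions, "group")
# Outcome distribution by a feature suspected of acting as a proxy.
by_zip = approval_rates(decisions, "zip_code")
ratio = rate_ratio(by_group)

print("Approval rate by group:", by_group)
print("Approval rate by zip code:", by_zip)
print("Lowest-to-highest group rate ratio:", round(ratio, 2))
```

A ratio well below 1.0, together with approval rates that differ sharply by zip code, would be a prompt to investigate whether the feature is standing in for group membership. What counts as "well below" is a judgement for the team and is not specified in the source.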
“Go agile or go home.”
The second principle of best practice is to ‘go agile or go home’, which Dobrin considers essential because, in his view, using agile methodologies is the only way that these projects will be successful. For him, being agile means being able to show value quickly, being able to pivot quickly, and having direct input from a subject matter expert throughout the duration of the project.
He said that each project should be broken into sprints of between one and four weeks in length, though the ideal sprint is two weeks. “The ultimate goal is to show value after the second sprint regardless of the duration. We are definitely not going to show full value but in more than 95% of cases, you can start showing value after the second.”
“Get outside the comfort zone.”
The Elite Team also advises team members delivering a data science project to ‘get outside their comfort zone.’ Dobrin explained that this is about team members in different job roles learning from each other. “When you build an agile team, there are different skillsets, and we expect to have T-shaped skills,” said Dobrin. This is where a person has a depth of understanding in one area and a broader understanding of adjacent areas such as machine learning, decision optimization, data engineering, and data visualisation for data journalism. The likelihood of a successful project increases when there is a cross-pollination of skills. For example, a machine learning expert working with a data engineer will learn about data engineering and vice versa, so both will be able to add more value to the project.
“Tell the story, sell the story.”
The final piece of best practice advice is to ‘tell the story, sell the story’. This is because data scientists must present their insights in a compelling way to internal and external stakeholders, which requires data visualisation and data journalism skills.
This means having a data visualisation engineer or data journalist on the team, whose job is to show the team how to communicate the outcome of the project to senior executives. Dobrin said that to help traverse this ‘last mile of data science,’ he has brought in people whose backgrounds combine machine learning with art, journalism or graphic design.
Dobrin said that everyone on the team should take part in the presentation at the end of the project, with as many end users as possible watching, because it is about good visualisation and good storytelling, not just building a dashboard at the end.