As more organizations heed the call of big data and the competitive advantage it promises, they need the right people to make sense of it all. Rising to the challenge is the data scientist—a specialist who provides deep analytical insights to inform better decision-making across an organization.

Several developments indicate that the data scientist role is more than a flashy, faddish new title for the person who looks after your data.

A 2012 Harvard Business Review article dubbed the data scientist position “the sexiest job” of the 21st century. Engineering schools have been steadily adding undergraduate and postgraduate degree programs for data scientists over the last half decade. Employment forecasters report the demand for data scientists is surpassing the available supply.

However, despite the rise in popularity, the role of the data scientist is undergoing growing pains. Many companies struggle to clearly define what a data scientist is and how to recruit the best talent. Academia and analytics companies alike are coming together to address some of these challenges.

Who Is a Data Scientist?

The emergence of the data scientist aligns with the evolution of the IT department, says Rob Thomas, vice president of product development for IBM Analytics. “We have entered the data era, which is redefining what it means to be a skilled worker in IT,” he says.

Once organizations start to look at how analytics and concepts like the cloud are reshaping enterprises, “there is going to be a lot less need for some of those traditional skills we have seen in IT environments, like systems administration and architecture,” Thomas says.

This IT transition includes shifting the role of data or business analyst to that of data scientist. While there is no industry-wide standard that distinguishes the two jobs, leaders in the field are reaching a consensus on key differentiators.

By most accounts, the data scientist collects data, finds relationships in both structured and unstructured data, extracts meaning from that data, and communicates his or her findings to non-technical stakeholders in a way that is both accessible and actionable.

“Data science combines statistics, probability and math with a business lens to drive unique insights, and ultimately business outcomes,” Thomas says. That contrasts with what most business analysts do, which is to focus on analyzing the data in front of them, he says. Put another way, the data analyst typically focuses on what has happened, whereas the data scientist forecasts what might happen, says Kirk Borne, principal data scientist for Booz Allen Hamilton.

Beyond this shift from hindsight to foresight, the data scientist also provides prescriptive analytics. Prescriptive analytics suggest what may be the best course of action to take given the predicted outcome.

Tools for Better Insights

To deliver the insights expected of them, data scientists rely on a variety of tools. One such tool is Apache Spark, an open-source large-scale data processing engine focusing on speed, agility and ease of use. In June 2015, IBM committed significant resources to the Apache Spark project because “it hits some of the major pain points our clients face in getting more people in the company access to the right data at the right time,” Thomas says.

Spark offers a processing framework that is intended to enable organizations to look at all of their data and perform real-time analytics, as opposed to building models on top of a specific repository of data, which can be challenging, Thomas says.

In joining the Apache Spark community, IBM will embed the framework into its analytics and commerce business units, offer Spark as a service on IBM Cloud and donate its machine learning technology. Additionally, IBM plans to educate more than 1 million data scientists and data engineers on Spark.

Current uses of Spark by IBM clients include projects such as optimizing public transportation planning, developing new services for health insurance customers and analyzing terabytes of deep-space radio signals in the hunt for extraterrestrial life.

Other types of software also are emerging to benefit data scientists. Tableau Software, for instance, helps users and decision-makers visualize data of any size via a drag-and-drop interface.

The Automatic Statistician project of the University of Cambridge, meanwhile, aims to automate many of the tasks that currently fall to the data scientist. The initiative, billed as “artificial intelligence for data science,” uses machine learning and statistical methods to make predictions and automatically generate reports. To further development, in late 2014, Google awarded a $750,000 grant to the project’s developer, engineering professor Zoubin Ghahramani, and his team.

Educators in the field of advanced analytics see these tools as a way to help the data scientist improve efficiency, rather than as a way to replace the job altogether. “By definition, you cannot really automate the process of ‘deriving insight,’” says Daniel Apley, professor in the department of Industrial Engineering and Management Sciences at Northwestern University.

However, what can be automated in many situations is the process of collecting data, putting it in the right format for analysis, executing statistical algorithms, and displaying the results of the analysis. This automation lays the groundwork to facilitate the insight, Apley says.

Finding the Right People

Organizations looking to leverage the insights delivered by a data scientist may first need to overcome some common challenges. The very nature of data now, often described in terms of its volume, velocity and variety, poses one of the biggest obstacles. “It is hard to keep up in terms of figuring out all the useful things you might be able to do with the data, and then developing analytical tools that will help you do it,” says Northwestern’s Apley.

Another issue that can lead to challenges is expecting that technology will “magically produce great analytical insights,” says Jack Phillips, CEO of the International Institute for Analytics, a research and advisory firm.

Companies should also expect to make an upfront investment to find the right people, though that alone does not guarantee success. “Companies will put together a couple of data analysts or scientists to produce some interesting results, but they get overrun with demand inside the firm,” Phillips says.

IBM’s Thomas says that while companies recognize the need for a higher level of data analysis, they are less certain about who should fill those roles. The responsibility often falls to IT personnel who may not have the proper skills to be effective. However, Thomas says that organizations will become more serious about training once they realize they may not be getting the business impact from the people they “anointed as data scientists.”

To that end, the IBM analytics unit will launch Datapalooza, a three-day workshop in November in San Francisco that Thomas describes as “an immersion camp on how to become a data scientist.” Similar events are planned in 10 other cities around the world through the first half of 2016.

While such events may address some of the short-term headaches facing enterprises, managers have begun to realize the importance of educating the next generation of data scientists. Booz Allen’s Borne came to this conclusion in the late 1990s after spending nearly two decades working with large data systems related to NASA’s astronomy missions. He recognized how much data was growing in a variety of sectors and hence the need for an educated workforce capable of managing data, interpreting it and predicting outcomes to benefit organizations.

Borne joined George Mason University in 2003 as a professor of astrophysics and computational science with the intent of creating a bachelor’s degree in data science. The program—among the first of its kind in the United States—launched in 2007. Although the program enjoyed little uptake at the start, it began to flourish after President Barack Obama announced the Big Data Research and Development Initiative. The goal of that government initiative is to extract insights from large and complex collections of data “to help solve some of the nation’s most pressing challenges,” according to a statement released at the time.

The push for this initiative was the result of a 2011 report indicating that the U.S. faced a shortage of as many as 200,000 knowledge workers with deep analytical expertise and another 1.5 million individuals who could make decisions based on big data analysis. Since then, Borne estimates that hundreds of degree programs have opened at universities across the U.S. and throughout the world.

Two factors will likely drive continued interest in data science among young learners, says Borne. One is making the subject exciting to learn as part of STEM initiatives in elementary, middle and high school. The other is making the subject truly tangible at the undergraduate level. During his tenure at George Mason, Borne saw that students were much more engaged in the math and science behind data science when they worked with real, live data. “They felt more empowered to make discoveries,” he says.

It is up to industry to hire and retain the best of these new professionals to manage the entire data and analytics lifecycle. As IBM’s Thomas says, “data scientists will dictate a company’s ability to be competitive.”

To contact the author of this article, email engineering360editors@ihs.com