How IT Supports the Data Science Operation - InformationWeek

InformationWeek is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them.Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.

Commentary
11/4/2019 08:00 AM

How IT Supports the Data Science Operation

Data science and IT no longer are separate disciplines. Think of it as a partnership.

The data science world, in its purest state, is populated by parallel processing servers that primarily run Hadoop and execute in batch mode; by the large troves of data those servers operate on; and by statistically and scientifically trained data scientists who know little about IT or about the requirements of maintaining an IT operation.

While some organizations include data science specialties within IT, and therefore have IT management and support expertise close at hand, just as many run their data science departments independently of IT. These departments often have little grasp of the IT disciplines needed to maintain and support the health of a big data ecosystem.

This is also why many organizations are discovering how critical it is to have data science and IT work hand in hand.

Image: Pdusit - stock.adobe.com

For CIOs and data center leaders, who by necessity should be heavily involved in an IT-data science partnership, what are the important bases that need to be covered to ensure IT support of a data science operation?

Hardware

Two or three years ago, it was a basic rule of thumb that Hadoop, the dominant big data/data science platform in companies, ran in batch mode. This made it easy for organizations to run big data applications on commodity computing hardware. Now, with the move to more real-time processing of big data, organizations are migrating from commodity hardware to in-memory processing, SSD storage and the Apache Spark cluster computing framework. This requires robust processing that can't necessarily be performed by commodity servers. It also requires IT know-how for configuring hardware components for optimal processing. Accustomed to fixed-record, transactional computing environments, not all IT departments have the resident skills for working with or fine-tuning in-memory parallel processing. This is a technical area that IT may need to cross-train or recruit for.
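Much of that fine-tuning lives in cluster configuration rather than application code. As a hedged illustration, a Spark administrator sizes settings like the following in spark-defaults.conf to the memory, cores and SSD storage actually installed (the values shown are placeholders, not recommendations):

```
spark.executor.memory    16g
spark.executor.cores     4
spark.memory.fraction    0.6
spark.local.dir          /mnt/ssd/spark-tmp
```

Pointing spark.local.dir at SSD storage, for example, keeps shuffle spills fast when a working set exceeds executor memory.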

Software

In the Hadoop world, MapReduce is the dominant programming model for processing and generating big data sets with a parallel, distributed algorithm on a cluster. Apache Spark, by contrast, processes data in memory, enabling real-time big data processing. Organizations are moving to more real-time processing, but they also understand the value that Hadoop delivers in a batch environment. From a software standpoint, IT must be able to support both platforms.
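To see what the MapReduce model actually asks of a programmer, here is a minimal single-machine sketch of its three phases in plain Python. Real Hadoop distributes these phases across a cluster; the function names below are illustrative, not Hadoop APIs.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group intermediate pairs by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big insights", "big data pipelines"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["big"])  # 3
```

Because the map and reduce steps are independent per key, the framework can run them on many nodes at once; Spark keeps the intermediate results in memory instead of writing them to disk between phases, which is where its real-time advantage comes from.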

Infrastructure

Most IT departments function with a hybrid computing infrastructure that consists of in-house systems and applications in the data center, coupled with private and public cloud systems. This has required IT to think outside of the data center, and to implement management policies, procedures and operations for systems, applications and data that may be in-house, in the cloud, or both. Operationally, this means IT must continue to manage its internal technology assets in-house, but must also work with the cloud vendors to which technology asset management is outsourced, or manage assets in the cloud itself when those assets are merely hosted and the enterprise retains responsibility for them.

Support for data science and big data in this more complicated infrastructure takes the IT technology management responsibility one step further, because the management goals for big data differ from those of traditional, fixed data.

Among the support issues for big data that IT must decide on are:

  • How much big data, which is voluminous and constantly building, should be archived, and which data should be discarded?
  • What are the storage and processing price points of cloud vendors, and at what point do cloud storage and processing become more expensive than their in-house equivalents?
  • What is the disaster recovery plan for big data and its applications, which are becoming mission critical for organizations?
  • Who is responsible for SLAs, especially in the cloud world, when a big data production problem occurs?
  • How is data shuttled safely and securely between the cloud and the data center?
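The price-point question above is ultimately a break-even calculation: cloud storage costs scale roughly linearly with volume, while in-house capacity carries a high fixed cost but a lower marginal cost per terabyte. A hedged sketch, using invented placeholder prices rather than any vendor's actual rates:

```python
def monthly_cloud_cost(tb_stored, price_per_tb=23.0):
    """Pay-as-you-go: cost scales linearly with data stored."""
    return tb_stored * price_per_tb

def monthly_inhouse_cost(tb_stored, fixed_cost=5000.0, price_per_tb=8.0):
    """Owned hardware: high fixed cost, lower marginal cost per TB."""
    return fixed_cost + tb_stored * price_per_tb

# Walk up the volume scale to find where in-house becomes cheaper.
tb = 0
while monthly_cloud_cost(tb) < monthly_inhouse_cost(tb):
    tb += 1
print(f"Break-even at roughly {tb} TB/month")
```

With these placeholder numbers the crossover lands in the hundreds of terabytes per month; since big data volumes are "constantly building," IT should expect to revisit the calculation as the data grows.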

Insights

Data scientists have expertise in statistical analysis and algorithm development, but they don't necessarily know how much or which data is available for them to operate on. This is an area where IT excels, because its organizational charter is to track all of the data in enterprise storage, as well as data that is incoming and outgoing.

If a marketing manager wants to develop customer analytics that take into account facts stored internally on customer records, as well as customers' purchasing and service histories with the company -- and the manager also wants to know what customers are interested in by tracking their activity on websites and social media -- IT is best positioned to determine all of the paths to a total picture of customer information. And it's the database group, working in tandem with other IT departments, that develops the joins of data sets that aggregate all of the data, so that the algorithms data scientists develop can operate on it and produce the most accurate results.
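The aggregation the database group performs amounts to joining records on a shared key. A minimal sketch in plain Python, analogous to a SQL GROUP BY followed by a LEFT JOIN; the tables, field names and figures here are invented for illustration:

```python
# Two internal data sources keyed by customer_id (invented sample data).
customers = [
    {"customer_id": 1, "name": "Acme Corp"},
    {"customer_id": 2, "name": "Globex"},
]
purchases = [
    {"customer_id": 1, "amount": 250.0},
    {"customer_id": 1, "amount": 100.0},
    {"customer_id": 2, "amount": 75.0},
]

# Aggregate purchase totals per customer (GROUP BY customer_id).
totals = {}
for p in purchases:
    totals[p["customer_id"]] = totals.get(p["customer_id"], 0.0) + p["amount"]

# Join the aggregate back onto each customer record (LEFT JOIN on the key).
picture = [
    {**c, "total_spend": totals.get(c["customer_id"], 0.0)}
    for c in customers
]
print(picture[0]["total_spend"])  # 350.0
```

Each joined record now carries the total picture the data scientist's algorithm needs, rather than fragments scattered across separate systems.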

Without IT’s expertise in knowing where the data is and how to access and aggregate it, analytics and data science engineers would be hard pressed to arrive at accurate insights that benefit the business.

IT support of the data science operation is a key pillar of corporate analytics success.

IT enables data scientists to do what they do best -- design algorithms that mine the best information from data. At the same time, IT operates squarely in its own wheelhouse -- knowing where to find the data and how to aggregate it.

Mary E. Shacklett is an internationally recognized technology commentator and President of Transworld Data, a marketing and technology services firm. Prior to founding her own company, she was Vice President of Product Research and Software Development for Summit Information ...