Data Science — what is it, why do businesses need it, and what career prospects does it open up for you
Maryia Ivanina, former Senior Data Scientist at EPAM (now a Software Engineer at Google), answers in detail the questions “What is Data Science?” and “Is it true that the profession of Data Scientist is the future?” Spoiler alert: yes, it’s true!
What is Data Science?
— Data Science is the science of data. More specifically, it is the study of data in its various manifestations, carried out throughout the data science life cycle, to identify insights and patterns and to model processes. Machine learning algorithms and AI are used to extract useful information and create predictive models that are, in turn, applied to new data.
Today, each of our online actions (writing a message in a messenger, watching a video, ordering on the Internet, etc.) is data that can be used by specialists to understand user behavior and improve recommender systems. The purpose of collecting this data is to help users find what they need much faster. When you say “Hey Google” to your voice assistant on your phone, and ask it to check the weather forecast for today, this launches many models, including speech recognition and algorithms for understanding textual information, searching for an answer to your question, generating an answer, and several other subtasks. And yes, even the weather forecast that you receive is modeled using machine learning.
Why is Data Science important for business?
— With the help of Data Science, you can better understand your customers, build a development strategy, and improve your product faster. It used to be difficult for businesses to collect feedback from their customers. In years past, it was necessary to experiment almost blindly, based on customer surveys and comments. Now, by monitoring user interactions, businesses can quickly understand: which features customers use more often; which ones are used less frequently; whether customers like to interact with the interface or not; when to introduce promotions; how to calculate their performance and predict future growth; etc.
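As a rough illustration of the first question in that list (which features customers use more often), here is a minimal sketch. It assumes a hypothetical interaction log in which every record is a (user_id, feature) pair; the names and the data are invented for the example and do not come from any real product.

```python
from collections import Counter

# Hypothetical interaction log: (user_id, feature) pairs, invented for illustration.
events = [
    ("u1", "search"), ("u2", "search"), ("u1", "wishlist"),
    ("u3", "search"), ("u2", "checkout"), ("u3", "checkout"),
]

def feature_usage(log):
    """Count how many times each feature appears in the log."""
    return Counter(feature for _, feature in log)

usage = feature_usage(events)

# most_common() orders features from most used to least used.
for feature, count in usage.most_common():
    print(f"{feature}: {count}")
```

In practice the log would come from product analytics rather than a hard-coded list, but the counting step stays the same.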
What industries is Data Science used in today?
Search engines, organization of workflows based on chat bots and voice assistants.
Calculation of the probability of an accident and assessment of the potential risk for each client.
Recommender systems for finding the right products, product purchase calculations, marketing campaigns, customer churn prediction.
Prediction of diseases and recommendations for maintaining health.
Transport and logistics
Optimization of delivery routes, calculation of the waiting time for the delivery of products, and even the introduction of unmanned vehicles.
Automated content placement and targeting.
Scoring a client to make a decision about issuing a loan, fraud detection and prevention, micro trading.
Searching for and identifying the most suitable properties for the buyer.
Health tracking for personalized training, selection of promising players, development of game strategies.
How are large companies using Data Science today?
Data Science is being implemented in large companies to harness the power of analytics to optimize business processes. They try to bring in Data Science specialists as early as possible in their process to create a strategy and determine what data to collect and how to organize the user action feedback received. In some cases, simple product and sales data can serve as the starting point for analysis, and more data can be accumulated over time so that the specialist has something to work with. The more relevant data collected, the better. More good data means that the analysis will describe circumstances more accurately, and more powerful algorithm models can be used.
Many large IT companies, such as Google, Amazon, Meta, and Microsoft, collect terabytes to petabytes of data. Doing so allows them to train and use state-of-the-art machine learning models. Because they release some of these models as open source, smaller companies can adapt them to their own needs through transfer learning and build models for tasks where their own data would otherwise be insufficient.
Who are Data Scientists?
A Data Scientist is a developer-analyst with a good mathematical and algorithmic background, who understands what the business problem is and how data collection can be structured, and then analyzes the data. The difference between a Data Scientist and a data analyst is that a Data Scientist does not simply analyze, prepare reports, and describe what the data says. Data Scientists also use data for predictive modeling.
A software developer, by contrast, implements functionality based on customer requirements, such as certain functions on a website or in a mobile application.
One main difference between Data Science and software development is the degree of uncertainty in the results of the work. For example, if the task is to write a website, this is an achievable task. There are, of course, many questions about the purpose of the site, the technologies to be used, the features it should have, etc., but the fact remains that it is quite possible to create the site. In Data Science, on the other hand, until you have examined the data and fully understood the goals and objectives of the project, you cannot know whether it is possible to solve the customer's problem or not. Perhaps the customer needs a certain accuracy for the model, and with the data and algorithms that exist, achieving the requested quality is simply unrealistic. That is why a project often starts with a PoC (proof of concept) stage, so that we roughly understand the task and the direction we can go in to solve it. There are many moments, however, when Data Science intersects with standard IT; when it is necessary to deliver the results of models by creating a web service, for example.
To be or not to be: prospects for the Data Scientist profession
Data Science is the fusion of several components: good mathematical knowledge, software development skills, and expertise in business processes or some domain area.
A strong mathematical base is needed, including: probability; statistics; knowledge of mathematical analysis and differentiation; optimization methods; and many more subjects that are part of a mathematical course of study in higher education. Many students question the need to study these areas, because the skills are not generally used in software development, but in Data Science this knowledge is useful and very important. This background will help you understand how to: solve a problem correctly; know why something does not work in an algorithm; test hypotheses; and draw the right conclusions. Fortunately, this knowledge is not exclusive to universities. There are now online courses on these topics on Coursera, Stepik, lectures from the Yandex School of Data Analysis, and other sources of the information that you need.
The main programming language for Data Science is Python, so a solid knowledge of the basics of the language is required. In-depth knowledge is a huge plus, because if you understand more of the features of the language, you can write solutions more efficiently and effectively. In addition to Python, R and Julia are also used, as are other programming languages that have machine learning libraries.
Also useful are a knowledge of programming patterns, an understanding of how to write clean and optimized code, and a knowledge of CI/CD. The use of cloud technologies is still popular, so the ability to work with some of the cloud services (Microsoft Azure, AWS, GCP) will also be an asset.
Finally, the last component, but not the least important: understanding business processes and tasks. It will help you communicate with the customer in their language and quickly understand the problem they want to solve, why they need to solve it, how to measure business success, and how to convey the solution. This, unfortunately, is not taught at university for programmers and mathematicians, and most likely you will need to do some supplemental reading. One of the best books for those who want to start learning data science from scratch, and even for those with some experience in the field, is “Data Science for Business” by Foster Provost and Tom Fawcett. The book provides examples of tasks and how to approach them, including what questions you can ask the customer and what some solutions might be. This is a good starting point, but ultimately the skills needed for the Data Science component of the job come with experience and working on a variety of projects.
To understand the problem the customer wants to solve, you will need to communicate with them effectively and, in most cases, the client will be from a country other than your own. Therefore, a good command of English is necessary, because the more successfully you communicate and understand each other, the higher the client’s confidence will be, and the faster you will determine the scope of work.
Another soft skill that a Data Science specialist needs is basic presentation skills. A Data Scientist often needs to make presentations to show proposed solutions to the customer, convince them of the need for certain changes, or share best practices with colleagues at a conference or meetup. Generally speaking, the development of any soft skill will be a big advantage.
Basic tools for working with data
When we talk about big data, we mean data so large that it cannot be processed on a single computer; instead, we need a cluster of machines to perform the necessary calculations efficiently and quickly. The most common tools for working with Big Data are Apache Hadoop and Apache Spark. Hadoop consists of several parts:
- HDFS — distributed file system;
- MapReduce — a distributed computing model for parallel processing, in which a job is split into two kinds of tasks. The first represents the data as key-value pairs (the Map operation). The second performs aggregation over the values collected for each key, such as summation, taking a maximum, calculating indicators, or other more complex operations (the Reduce operation);
- YARN — technology for managing clusters; and
- Various libraries for other modules to work with HDFS.
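The Map and Reduce operations described in the list can be sketched in plain Python. This is a single-process illustration of the idea only, not Hadoop's API: the function names are hypothetical, and the grouping step in the middle, which a real framework performs across the machines of a cluster, is reduced here to a dictionary.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (key, value) pair for every word."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def shuffle(pairs):
    """Group values by key (a real framework does this across machines)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the values of each key, here by summation."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data needs big clusters", "data science uses data"]
word_counts = reduce_phase(shuffle(map_phase(lines)))
print(word_counts)
```

Replacing sum with max or another aggregation in reduce_phase gives the other kinds of Reduce operations mentioned above.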
Apache Spark can run on top of Hadoop (for example, reading data from HDFS and using YARN), but it is a separate engine that improves on the MapReduce concept. In this technology, distributed computing takes place largely in RAM, which increases the processing speed.
Data Scientists most often use Python to work with data. The libraries most often used in everyday tasks include:
- NumPy — used for manipulating arrays of data. Pandas, a library that helps you work with data in tabular form, was built on NumPy;
- SciPy — a library for mathematical, scientific, and engineering calculations; it includes algorithms for integration, differentiation, and optimization;
- Matplotlib is the most popular visualization library. The Seaborn library was created based on it, and it provides more beautiful graphics with a simplified syntax;
- Plotly is a library for creating interactive and publish-ready plots;
- Scikit-learn brings together most of the methods of classical machine learning, as well as convenient utilities for data preprocessing.
Neural networks are most often trained using TensorFlow or PyTorch. TensorFlow is perhaps a more difficult framework to get started with because there are a lot of concepts to understand. Its advantages include a large developer community, a convenient tool for visualizing training, and many examples for solving problems.
PyTorch is still quite young but developing quickly. It is less popular than TensorFlow, but is catching up with the older framework thanks to its adoption in academia, since you can quickly start experimenting with PyTorch. For now, though, TensorFlow is more optimized for convenient deployment to production and for training monitoring, which undoubtedly influences its choice as the technology for large projects.
Artificial intelligence is a broad concept that describes a system that can mimic human behavior to perform certain tasks and can gradually learn using information received. It includes machine learning and deep learning.
Data Mining can also be done using the Python libraries described above, but there are other products that allow you to load data, analyze it, and explore it from a convenient graphical user interface to create graphs and dashboards for interactive use. These options include systems such as SAS Data Mining, RapidMiner, Knime, Qlik, etc.
In simple terms, Data Science is working with data and using it to model solutions to problems. With the growth in the amount of data, Data Scientists are increasingly in demand, so there are many vacancies. Some specialists from development, analytics, and even non-IT spheres are moving into this area. The threshold for entering this field can, of course, be high due to the skills and knowledge required, but you can certainly learn what you need to know, especially now, in the age of online learning.
You can learn more about Data Science in the IT Beard Shorts issue on the Anywhere Club YouTube channel.