Data Science — What Is It, Why Do Businesses Need It, and What Career Prospects Does It Open up for You?
Maryia Ivanina, former Senior Data Scientist at EPAM (now a Software Engineer at Google), answers in detail the questions “What is Data Science?” and “Is it true that the profession of Data Scientist is the future?” Spoiler alert: yes, it’s true!
What is Data Science?
Data Science is the science of data. More specifically, it is the study of data in its various forms in order to identify insights and patterns and to model processes, at every stage of the data science life cycle. Machine learning algorithms and AI are used to extract useful information and build predictive models that are, in turn, applied to new data.
Today, each of our online actions (writing a message in a messenger, watching a video, ordering on the Internet, etc.) is data that can be used by specialists to understand user behavior and improve recommender systems. The purpose of collecting this data is to help users find what they need much faster. When you say “Hey Google” to your voice assistant on your phone, and ask it to check the weather forecast for today, this launches many models, including speech recognition and algorithms for understanding textual information, searching for an answer to your question, generating an answer, and several other subtasks. And yes, even the weather forecast that you receive is modeled using machine learning.
Why is Data Science important for business?
With the help of Data Science, you can better understand your customers, build a development strategy, and improve your product faster. It used to be difficult for businesses to collect feedback from their customers: they had to experiment almost blindly, relying on customer surveys and comments. Now, by monitoring user interactions, businesses can quickly understand which features customers use more often and which less frequently, whether customers like interacting with the interface, when to introduce promotions, how to calculate performance and predict future growth, and so on.
What industries is Data Science used in today?
- Search engines, organization of workflows based on chatbots and voice assistants.
- Calculation of the probability of an accident and assessment of the potential risk for each client.
- Recommender systems for finding the right products, product purchase calculations, marketing campaigns, customer churn prediction.
- Prediction of diseases and recommendations for maintaining health.
- Transport and logistics: optimization of delivery routes, calculation of the waiting time for the delivery of products, and even the introduction of unmanned vehicles.
- Automated content placement and targeting.
- Scoring a client to make a decision about issuing a loan, fraud detection and prevention, micro trading.
- Searching for and identifying the most suitable properties for the buyer.
- Health tracking for personalized training, selection of promising players, development of game strategies.
How are large companies using Data Science today?
Data Science is being implemented in large companies to harness the power of analytics to optimize business processes. They try to bring in Data Science specialists as early as possible in their process to create a strategy and determine what data to collect and how to organize the user action feedback received. In some cases, simple product and sales data can serve as the starting point for analysis, and more data can be accumulated over time so that the specialist has something to work with. The more relevant data collected, the better. More good data means that the analysis will describe circumstances more accurately, and more powerful algorithm models can be used.
Many large IT companies, such as Google, Amazon, Meta, and Microsoft, collect terabytes to petabytes of data. Doing so allows them to train and use state-of-the-art machine learning models. Because they release some of these models as open source, smaller companies can adapt them to their own needs through transfer learning, fine-tuning a pretrained model even when they have far less data of their own.
Who are Data Scientists?
A Data Scientist is a developer-analyst with a good mathematical and algorithmic background, who understands what the business problem is and how data collection can be structured, and then analyzes the data. The difference between a Data Scientist and a data analyst is that a Data Scientist does not simply analyze, prepare reports, and describe what the data says. Data Scientists also use data for predictive modeling.
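The distinction can be sketched in a few lines of plain Python. The monthly sales figures below are invented for illustration, and the helper name `fit_line` is hypothetical. The first part is the kind of descriptive summary an analyst might report; the second fits a least-squares line and applies it to a month that has not happened yet.

```python
from statistics import mean

# Hypothetical monthly sales (units); invented purely for illustration.
months = [1, 2, 3, 4, 5, 6]
sales = [100, 110, 120, 130, 140, 150]

# Descriptive analysis: summarize what the data already says.
average_sales = mean(sales)
best_month = months[sales.index(max(sales))]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar


# Predictive modeling: learn a trend and apply it to new data (month 7).
slope, intercept = fit_line(months, sales)
forecast_month_7 = slope * 7 + intercept

print(f"Average so far: {average_sales}, best month: {best_month}")
print(f"Forecast for month 7: {forecast_month_7:.1f}")
```

Real projects would use far richer models and validate them on held-out data; the point here is only the step from describing past data to predicting unseen data.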
A software developer, by contrast, implements functionality based on customer requirements, such as certain features of a website or a mobile application.
One main difference between Data Science and software development is the degree of uncertainty in the results of the work. For example, if the task is to write a website, this is an achievable task. There are, of course, many questions about the purpose of the site, the technologies to be used, the features it should have, etc., but the fact remains that it is quite possible to create the site. In Data Science, on the other hand, until you have examined the data and fully understood the goals and objectives of the project, you cannot know whether it is possible to solve the customer's problem or not. Perhaps the customer needs a certain accuracy for the model, and with the data and algorithms that exist, achieving the requested quality is simply unrealistic. That is why a project often starts with a PoC (proof of concept) stage, so that we roughly understand the task and the direction we can go in to solve it. There are many moments, however, when Data Science intersects with standard IT; when it is necessary to deliver the results of models by creating a web service, for example.
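A PoC often comes down to a check like the one sketched below: measure a quick baseline on held-out data and compare it with what the customer requires. The labels, the predictions, and the 0.9 threshold are all invented for illustration, and `accuracy` is a hypothetical helper rather than part of any library.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical held-out labels and the output of a first baseline model.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

required_accuracy = 0.9  # assumed customer requirement
baseline_accuracy = accuracy(y_true, y_pred)
meets_requirement = baseline_accuracy >= required_accuracy

print(f"Baseline accuracy: {baseline_accuracy:.2f}")
if meets_requirement:
    print("Requirement met")
else:
    print("Requirement not met: more data or a different approach is needed")
```

If the gap between the baseline and the requirement is large, that is a signal to discuss with the customer whether the target is realistic before committing to a full project.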
To be or not to be: prospects for the Data Scientist profession
Data Science is the fusion of several components: good mathematical knowledge, software development skills, and expertise in business processes or some domain area.
A strong mathematical base is needed, including probability, statistics, mathematical analysis (calculus, including differentiation), optimization methods, and many other subjects that are part of a university mathematics curriculum. Many students question the need to study these areas, because the skills are not generally used in software development, but in Data Science this knowledge is useful and very important. This background will help you solve a problem correctly, understand why something does not work in an algorithm, test hypotheses, and draw the right conclusions. Fortunately, this knowledge is not exclusive to universities. There are now online courses on these topics on Coursera and Stepik, lectures from the Yandex School of Data Analysis, and other sources of the information that you need.
The main programming language for Data Science is Python, so a solid knowledge of the basics of the language is required. In-depth knowledge is a huge plus, because if you understand more of the features of the language, you can write solutions more efficiently and effectively. In addition to Python, there is a use for R and Julia as well, and for other programming languages with machine learning libraries that you can use.
Also useful are a knowledge of programming patterns, an understanding of how to write clean and optimized code, and a knowledge of CI/CD. The use of cloud technologies is still popular, so the ability to work with some of the cloud services (Microsoft Azure, AWS, GCP) will also be an asset.
Finally, the last component, but not the least important: understanding business processes and tasks. It will help you communicate with the customer in their language and quickly understand the problem they want to solve, why they need to solve it, how to measure business success, and how to convey the solution. Unfortunately, this is not taught in university programs for programmers and mathematicians, so you will most likely need to do some supplemental reading. One of the best books for those who want to start learning data science from scratch, and even for those with some experience in the field, is “Data Science for Business” by Foster Provost and Tom Fawcett. The book provides examples of tasks and how to approach them, including what questions you can ask the customer and what some solutions might be. This is a good starting point, but ultimately the skills needed for the Data Science component of the job come with experience and working on a variety of projects.
In order to understand the problem the customer wants to solve, you will need to communicate with them effectively and, in most cases, the client will be from a country other than your own. Therefore, a good command of English is necessary, because the more successfully you communicate and understand each other, the higher the client’s confidence will be, and the faster you will determine the scope of work.
Another soft skill that a Data Science specialist needs is basic presentation skills. A Data Scientist often needs to make presentations to show proposed solutions to the customer, convince them of the need for certain changes, or share best practices with colleagues at a conference or meetup. Generally speaking, the development of any soft skill will be a big advantage.
Basic tools for working with data
Big data usually means data too large to be processed on a single computer: a cluster of machines is needed to carry out the necessary calculations quickly and efficiently. The most common tools for working with big data are Apache Hadoop and Apache Spark. Hadoop consists of several parts:
- HDFS: a distributed file system;
- MapReduce: a distributed computing model for parallel processing, in which a job is split into two kinds of tasks. Map tasks represent the data as key-value pairs. Reduce tasks then aggregate the values for each key, for example by summing them, taking a maximum, calculating indicators, or performing more complex operations;
- YARN: a technology for managing cluster resources and scheduling jobs; and
- Hadoop Common: libraries and utilities that the other modules rely on, for example to work with HDFS.
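The Map and Reduce steps described above can be illustrated with a toy word count in plain Python. This is a single-process sketch of the idea, not the Hadoop API: the function names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical, and on a real cluster the framework would distribute these steps across many machines.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Map: emit a (key, value) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as a framework does between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(values):
    """Reduce: aggregate all values of one key (here, by summation)."""
    return reduce(lambda a, b: a + b, values)


# Two made-up input lines standing in for a large distributed dataset.
lines = ["big data needs big clusters", "data science uses data"]

mapped = [pair for line in lines for pair in map_phase(line)]
counts = {key: reduce_phase(values) for key, values in shuffle(mapped).items()}

print(counts)
```

Swapping the summation in `reduce_phase` for `max` or another aggregation changes what is computed per key without touching the Map step, which is what makes the model easy to parallelize.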
Apache Spark can run on top of Hadoop (using HDFS for storage and YARN for cluster management) and improves on the MapReduce concept: it keeps intermediate results in RAM wherever possible instead of writing them to disk between steps, which increases processing speed.
Data Scientists most often use Python to work with data. The libraries most often used in everyday tasks include:
- NumPy: used for manipulating arrays of data. Pandas, a library that helps you work with data in tabular form, is built on NumPy;
- SciPy: a library for mathematical, scientific, and engineering calculations that includes algorithms for integration, differentiation, and optimization;
- Matplotlib: the most popular visualization library. The Seaborn library is built on top of it and provides more attractive graphics with a simpler syntax;
- Plotly: a library for creating interactive, publication-ready plots;
- Scikit-learn: a collection of classical machine learning methods, along with convenient utilities for data preprocessing.
Neural networks are most often trained using TensorFlow or PyTorch. TensorFlow is perhaps the more difficult framework to get started with, because there are a lot of concepts to understand. Its advantages include a large developer community, a convenient tool for visualizing training (TensorBoard), and many examples of solved problems.
PyTorch is younger and still developing quickly. It is less popular than TensorFlow but is catching up with the older framework thanks to its adoption in academia, where it lets you start experimenting quickly. For now, though, TensorFlow is better optimized for deployment to production and for training monitoring, which undoubtedly influences its choice as the technology for large projects.
Artificial intelligence is a broad concept that describes a system that can mimic human behavior to perform certain tasks and can gradually learn using information received. It includes machine learning and deep learning.
Data Mining can also be done using the Python libraries described above, but there are other products that let you load, analyze, and explore data from a convenient graphical user interface and create graphs and dashboards for interactive use. These options include systems such as SAS Data Mining, RapidMiner, KNIME, Qlik, etc.
In simple terms, Data Science is working with data and using it to model solutions to problems. With the growth in the amount of data, Data Scientists are increasingly in demand, so there are many vacancies. Some specialists from development, analytics, and even non-IT spheres are moving into this area. The threshold for entering this field can, of course, be high due to the skills and knowledge required, but you can certainly learn what you need to know, especially now, in the age of online learning.
You can learn more about Data Science in the IT Beard Shorts issue on the Anywhere Club YouTube channel.