
Demystifying Data Science

Meet the Author

Prof. R. P. Suresh

Professor of Practice - Data Science
Vidyashilp University


Dr. Rajagopalan Padmanabha Suresh is an award-winning researcher in the area of Statistics and Reliability Modeling with over three decades of experience in both Academics and Industry. He holds a Ph.D. in Statistics from the University of Pune and was awarded the “Best Young Statistician Award” (1991) by the Indian Society for Probability and Statistics. He is currently the Principal Director of Supply Chain Analytics at Accenture Applied Intelligence.

Dr. Suresh was a Professor at the Indian Institute of Management Kozhikode (IIMK) for over a decade. He also served as a Visiting Professor at Victoria Management School of Victoria University of Wellington, New Zealand. He moved from Academics to Industry as a Staff Researcher with General Motors Research & Development Centre India Science Lab, Bangalore, where he received the GM R&D Innovation Award (McCuen Special Achievement Award).

He has delivered invited lectures at leading institutions such as the Indian Institute of Science, the Indian Statistical Institute, the Department of Mathematics and Statistics at Victoria University of Wellington, New Zealand, and Statistics New Zealand, besides many universities in India. His training and consultancy engagements include organizations such as the Department of Telecommunications (Government of India), Hindustan Newsprints Ltd., WIPRO Ltd., Kirloskar Cummins, Tata Tea Ltd., and the IPS and IES Officers Training Institutes.

Dr. Suresh has established himself as an innovative thought leader in the Analytics Industry over the past decade, and many of his ideas and methodologies have been implemented in the industry yielding greater benefits and savings.

As Professor of Practice – Data Science at Vidyashilp University, Dr. Suresh will propel the academic and research efforts as the University pursues expertise and excellence in the interdisciplinary domain of Data Science.

Introduction

Data Science has been one of the most widely used fields in industry in recent years; it stands at the forefront of new scientific discoveries while also playing a pivotal role in our daily lives. In a sense, Data Science is not very new: it is something we routinely use to make decisions. For example, when determining the best time to travel to work or school (or to return home) so as to avoid heavy traffic, we inadvertently use the traffic we have experienced in the past to analyze travel duration at different times of day, and then arrive at the best time to travel. Similarly, before opening a new shop in a locality, a prospective shop owner would naturally collect data on the number of households in the locality, the number of similar shops nearby, and so on, and based on the information extracted from these data, decide whether the venture is likely to succeed. These are simple situations in which we take decisions in a scientific manner using data-driven insights.

In more complex situations, however, one may need additional data to be more confident about the predictions or decisions. For traffic on Sundays, holidays, or days when public or sporting events take place, one may have to carry out a more detailed analysis using more relevant data such as social media feeds and real-time alerts before arriving at a decision. Similarly, the prospective shop owner may have to understand what kinds of products are already sold in the locality, who the potential buyers of the product or service are, and how to offer differentiated products or services in order to succeed. All these situations involve uncertain environments, and we need to take decisions under uncertainty. The scientific use of data and information is the central theme of Data Science. Let us try to understand what modern Data Science is, how it has evolved, and why it is important to study.
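As an illustration of the commuting example above, here is a minimal sketch in Python, using entirely hypothetical travel-time records, of how past observations can be averaged to pick the best departure time:

    # A minimal sketch of the "best time to travel" decision, using
    # hypothetical past travel durations (in minutes) recorded by departure hour.
    from statistics import mean

    past_trips = {
        7:  [35, 40, 38],   # departure hour -> observed durations on past days
        8:  [55, 60, 58],
        9:  [45, 50, 47],
        10: [30, 32, 31],
    }

    # Average the historical durations for each departure hour and pick the
    # hour with the smallest expected travel time.
    avg_duration = {hour: mean(times) for hour, times in past_trips.items()}
    best_hour = min(avg_duration, key=avg_duration.get)
    print(f"Best departure hour: {best_hour}:00 "
          f"(about {avg_duration[best_hour]:.0f} minutes)")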

What is Data Science?

Data Science is a combination of three major disciplines, viz., Mathematics, Statistics and Computer Science. Mathematics, the mother discipline of all Computational Sciences, has laid strong foundations for developing methodologies that analyze data and lead to robust conclusions. The foundations of Data Science derive from mathematical concepts such as Algebra, Geometry, Analytical Geometry, Functions, Derivatives, Linear Algebra, Set Theory, Logic, etc. Sound mathematical methods developed by famous mathematicians in the 16th and 17th centuries provided the basis for developments in the 19th and 20th centuries that led to the foundations of Data Science (Statistics). For example, Blaise Pascal’s 17th-century work on games of chance and basic probability provided a way to quantify uncertainty through probability distributions. The Newton-Raphson method of finding roots of an equation, based on Newton’s pioneering work in the 17th century, evolved into algorithmic, computational approaches to problem solving and optimization. The work on Geometry and Algebra by René Descartes laid the foundation for Calculus, which is central to many methodologies developed in the 20th century. These contributions formed the basis for several methodologies in Data Science such as Machine Learning and Artificial Intelligence.
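As a simple illustration of the Newton-Raphson idea mentioned above, the following minimal Python sketch finds the square root of 2 by repeatedly applying the update x → x − f(x)/f′(x); the function and starting point are chosen purely for illustration:

    # Newton-Raphson iteration for a root of f(x) = x^2 - 2, i.e. sqrt(2).
    def newton_raphson(f, f_prime, x0, tol=1e-10, max_iter=50):
        x = x0
        for _ in range(max_iter):
            step = f(x) / f_prime(x)
            x -= step
            if abs(step) < tol:   # stop when the update becomes negligible
                break
        return x

    root = newton_raphson(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0)
    print(root)   # approximately 1.41421356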

Evolution of Data Science

The discipline of Statistics is the earliest version of Data Science. Statistics started in the 17th and 18th centuries as a methodology to estimate unknown parameters of a population, such as the mortality rate or the sex ratio. Graphical representations of data were introduced around the same time to provide a visual display of the data, primarily to understand variability across different time points or across other variables or categories. The problem of estimation arose in scientific investigations in Earth Sciences in the 18th century. To estimate the length of the seconds pendulum for different cities, Laplace in 1799 and Adrain in 1818 (independently) fitted an equation expressing the relationship between the length and the altitude based on observed values for known cities, and used this fitted equation to estimate the length of the seconds pendulum for other cities. This methodology is known as the “Regression Modelling” approach, which is now very commonly used in industry to solve business problems such as determining the price elasticity of demand, determining the effect of advertisement on sales, or predicting sales of ice cream when the temperature is 40°C. In biomedical or clinical research, the researcher often tries to understand the various risk factors that affect the chance of occurrence of a disease. Applications of statistics can be found in many other fields as well. In the early 19th century, Laplace estimated the population of France using the number of births in the previous year and census data for three communities in France. This is known as the “Sampling Technique” of estimating population parameters using a random sample from the population, a concept that reappears today in “Random Forest”, a Machine Learning method.
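To make the regression idea concrete, here is a minimal sketch, assuming Python with NumPy and using made-up temperature and sales figures chosen purely for illustration, of fitting a straight line and using it to predict ice cream sales at 40°C:

    # Simple linear regression: sales = a * temperature + b, fitted by least squares.
    import numpy as np

    temperature = np.array([22, 25, 28, 31, 34, 37])        # degrees Celsius
    sales = np.array([180, 210, 260, 300, 360, 410])         # units sold (illustrative)

    a, b = np.polyfit(temperature, sales, deg=1)             # fit the line

    # Use the fitted equation to predict sales at 40 degrees Celsius.
    predicted = a * 40 + b
    print(f"sales ≈ {a:.1f} * temperature + {b:.1f}; predicted at 40°C: {predicted:.0f}")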
Statistical methodologies developed at a brisk pace in the early 20th century, with major contributions from pioneering statisticians including Ronald A. Fisher, Karl Pearson, W. Edwards Deming, Florence Nightingale, Gertrude Cox, Thomas Bayes, Walter Shewhart, William Gosset (“Student”), P. C. Mahalanobis, C. R. Rao and several others, and these developments provided a great deal of confidence in applying statistical techniques to problem solving and decision making. Around the middle of the 20th century, during the Second World War, several new statistical and optimization methodologies were developed, such as the Simplex Method for Linear Programming, Statistical Quality Control techniques, and Acceptance Sampling. All of these paved the way for the use of Statistics in almost all fields of Business and Science.

Given below are some of the fields/areas, with examples of problems in each, where Statistics is applied.

Agriculture: improving yield by understanding the effect of fertilizers (treatments) and of soil (blocks) using “Design of Experiments”.
Quality: statistical control charts and acceptance sampling to improve product and process quality.
Health Policy Formulation: detailed analysis of health issues, such as infant mortality, across groups/communities to improve the health of the population.
Management: determining which investment option one should choose so as to maximize the Return on Investment.
Social Studies: collecting relevant data and analysing them to understand relationships between individuals, communities, etc.
Official Statistics: detailed analysis of census data to help formulate policy on education, employment, skill development, women’s empowerment, etc.
Economics: understanding the relationship between Price and Demand.
Medical Science: collecting relevant data and using Regression or Logit methods to understand risk factors for a disease (e.g., lung cancer or heart disease); a small sketch follows this list.
Insurance and Actuarial Studies: detailed analysis of demographic and lifestyle data to understand the effect of risk factors such as age, gender, etc., with a view to providing affordable insurance products to customers.
Engineering: using appropriate statistical models to improve reliability and warranty performance.
Genetics: estimating the effect of Genetic Selection.
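Following up on the Medical Science item above, here is a minimal sketch of a logit (logistic regression) model, assuming Python with scikit-learn; the age/smoking records and disease labels below are synthetic and purely illustrative:

    # Logistic regression relating risk factors (age, smoking) to disease occurrence.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Columns: age (years), smoker (0/1); target: disease present (0/1). Synthetic data.
    X = np.array([[45, 1], [50, 1], [38, 0], [60, 1], [35, 0],
                  [55, 0], [62, 1], [40, 0], [58, 1], [33, 0]])
    y = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0])

    model = LogisticRegression(max_iter=1000).fit(X, y)

    # Coefficients indicate the direction and strength of each risk factor's
    # association with the disease (on the log-odds scale).
    print(dict(zip(["age", "smoker"], model.coef_[0])))
    print("P(disease | age=50, smoker=1):", model.predict_proba([[50, 1]])[0, 1])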
Though the applications of Statistics were expanding into various fields, and field-specific Statistical, Mathematical and Operations Research models were being developed, the use of these models in day-to-day decision making remained very limited. One reason was the difficulty of collecting and compiling relevant data: most of the time, the data had to be gathered manually through surveys and similar means, which led to data errors and quality issues and required several iterations before the data were analysis-ready. Another important reason was the heavy computation needed to evaluate sophisticated statistical formulas.