What is Data Science?
Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. It is closely related to data mining, machine learning, and big data.
Data science is a “concept to unify statistics, data analysis and their related methods” in order to “understand and analyze actual phenomena” with data. It draws techniques and theories from many fields, including mathematics, statistics, computer science, information science, and domain knowledge.
Why data science?
Data science plays an important role in every field of the IT industry. It is used to develop and improve systems that handle emerging issues in every industry, business, and organization.
A system that resolves complex problems should be advanced enough to provide simple solutions. The purpose of data science is to find patterns within data, using various statistical techniques to analyze the data and draw insights from it.
From data extraction, wrangling, and pre-processing, a Data Scientist must scrutinize the data thoroughly. They are then responsible for making predictions from the data. The goal of a Data Scientist is to derive conclusions from the data, and through these conclusions to assist companies in making smarter business decisions.
Data science vs Data analytics
Types of data in data science
Data is the foundation of data science; it is the material on which all the analyses are based. In the context of data science, there are two types of data: traditional, and big data.
Traditional data is structured data stored in databases that analysts can manage from one computer; it is in table format, containing numeric or text values. The label “traditional” serves to emphasize the distinction from big data.
Big data, on the other hand, is… bigger than traditional data, and not in the trivial sense. From variety (numbers, text, but also images, audio, mobile data, etc.), to velocity (retrieved and computed in real time), to volume (measured in tera-, peta-, exa-bytes), big data is usually distributed across a network of computers.
Read more about: https://www.thinkdataanalytics.com/data-analytics-vs-data-science/
What do you do to data in data science?
Traditional data in Data Science
Traditional data is stored in relational database management systems.
That said, before being ready for processing, all data goes through pre-processing. This is a necessary group of operations that convert raw data into a format that is more understandable and hence, useful for further processing. Common processes are:
- Collect raw data and store it on a server
This is untouched data that scientists cannot analyze straight away. This data can come from surveys, or through the more popular automatic data collection paradigm, like cookies on a website.
- Class-label the observations
This consists of arranging data by category, or labelling data points with the correct data type: for example, numerical or categorical.
- Data cleansing / data scrubbing
Dealing with inconsistent data, like misspelled categories and missing values.
- Data balancing
If the data is unbalanced, such that the categories contain unequal numbers of observations and are therefore not representative, data balancing methods, like extracting an equal number of observations for each category, fix the issue.
- Data shuffling
Re-arranging data points to eliminate unwanted patterns and improve predictive performance further on. This is applied when, for example, the first 100 observations in the data are from the first 100 people who have used a website; the data isn’t randomized, and patterns due to sampling emerge.
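As an illustrative sketch, the steps above can be applied with pandas; the survey dataset, column names, and values here are hypothetical:

```python
import pandas as pd

# Hypothetical raw survey data: a misspelled category and a missing value
raw = pd.DataFrame({
    "age": [25, 31, None, 42, 29, 35, 38, 27],
    "plan": ["basic", "premium", "basic", "premum",
             "basic", "premium", "basic", "basic"],
})

# Data cleansing: fix the misspelled category, fill the missing value
raw["plan"] = raw["plan"].replace("premum", "premium")
raw["age"] = raw["age"].fillna(raw["age"].mean())

# Class-label: mark the column as categorical
raw["plan"] = raw["plan"].astype("category")

# Data balancing: sample an equal number of observations per category
n = raw["plan"].value_counts().min()
balanced = pd.concat(
    g.sample(n, random_state=0) for _, g in raw.groupby("plan", observed=True)
)

# Data shuffling: randomize row order to remove sampling patterns
shuffled = balanced.sample(frac=1, random_state=0).reset_index(drop=True)
```

Here balancing simply under-samples the larger category; other methods (such as over-sampling the smaller one) follow the same idea.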
Big Data in Data Science
When it comes to big data and data science, there is some overlap of the approaches used in traditional data handling, but there are also a lot of differences.
First of all, big data is stored on many servers and is infinitely more complex.
In order to do data science with big data, pre-processing is even more crucial, as the complexity of the data is a lot larger. You will notice that conceptually, some of the steps are similar to traditional data pre-processing, but that’s inherent to working with data.
- Collect the data
- Class-label the data
Keep in mind that big data is extremely varied, therefore instead of ‘numerical’ vs ‘categorical’, the labels are ‘text’, ‘digital image data’, ‘digital video data’, ‘digital audio data’, and so on.
- Data cleansing
The methods here are massively varied, too; for example, you must verify that a digital image observation is ready for processing, or a digital video, and so on.
- Data masking
When collecting data on a mass scale, this aims to ensure that any confidential information in the data remains private, without hindering the analysis and extraction of insight. The process involves concealing the original data with random or false data, allowing the scientist to conduct their analyses without compromising private details. Naturally, this can be done to traditional data too, and sometimes is, but with big data the information can be much more sensitive, which makes masking much more urgent.
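A minimal sketch of data masking, on hypothetical customer records, where a one-way hash conceals the confidential field while the analysis columns stay intact:

```python
import hashlib

# Hypothetical customer records containing a confidential field
records = [
    {"email": "ada@example.com", "purchases": 12},
    {"email": "alan@example.com", "purchases": 7},
]

def mask(value: str) -> str:
    """Replace a confidential value with an irreversible token."""
    return hashlib.sha256(value.encode()).hexdigest()[:12]

# The analysis column ("purchases") is untouched; identities are concealed
masked = [{**r, "email": mask(r["email"])} for r in records]
```

Because the same input always maps to the same token, the scientist can still count, group, and join on the masked field without ever seeing the original value.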
Where does data come from?
Traditional data may come from basic customer records, or historical stock price information.
Big data, however, is all-around us. A consistently growing number of companies and industries use and generate big data. Consider online communities, for example, Facebook, Google, and LinkedIn; or financial trading data. Temperature measuring grids in various geographical locations also amount to big data, as well as machine data from sensors in industrial equipment. And, of course, wearable tech.
Need of data science
Traditionally, the data we had was mostly structured and small in size, and could be analyzed using simple BI tools. Unlike the mostly structured data in traditional systems, today most data is unstructured or semi-structured; industry estimates suggested that by 2020 more than 80% of all data would be unstructured.
This data is generated from different sources like financial logs, text files, multimedia forms, sensors, and instruments. Simple BI tools are not capable of processing this huge volume and variety of data. This is why we need more complex and advanced analytical tools and algorithms for processing, analyzing and drawing meaningful insights out of it.
This is not the only reason why Data Science has become so popular. Let’s dig deeper and see how Data Science is being used in various domains.
- What if you could understand the precise requirements of your customers from existing data, like their past browsing history, purchase history, age, and income? No doubt you had all this data earlier too, but now, with the vast amount and variety of data, you can train models more effectively and recommend products to your customers with more precision. Wouldn’t it be amazing, since it will bring more business to your organization?
- Let’s take a different scenario to understand the role of Data Science in decision making. How about if your car had the intelligence to drive you home? A self-driving car collects live data from sensors, including radars, cameras, and lasers, to create a map of its surroundings. Based on this data, it decides when to speed up, when to slow down, when to overtake, and where to take a turn, making use of advanced machine learning algorithms.
- Let’s see how Data Science can be used in predictive analytics. Let’s take weather forecasting as an example. Data from ships, aircraft, radars, satellites can be collected and analyzed to build models. These models will not only forecast the weather but also help in predicting the occurrence of any natural calamities. It will help you to take appropriate measures beforehand and save many precious lives.
Business intelligence vs Data science
Data science is a field in which knowledge and insights are extracted from data using various scientific methods, algorithms, and processes.
It can be defined as a combination of mathematical tools, algorithms, statistics, and machine learning techniques that are used to find hidden patterns and insights in data, which help in the decision-making process.
Data science deals with both structured as well as unstructured data. It is related to both data mining and big data. Data science involves studying the historic trends and thus using its conclusions to redefine present trends and also predict future trends.
Business intelligence (BI) is a set of technologies, applications, and processes used by enterprises for business data analysis.
It is used to convert raw data into meaningful information, which is then used for business decision making and profitable actions.
It deals with the analysis of structured and sometimes unstructured data which paves the way for new and profitable business opportunities.
It supports decision making based on facts rather than assumption-based decision making. Thus it has a direct impact on the business decisions of an enterprise. Business intelligence tools enhance the chances of an enterprise to enter a new market as well as help in studying the impact of marketing efforts.
Life cycle of a data science project
- DATA COLLECTION
- DATA PREPARATION
- EXPLORATORY DATA ANALYSIS
- MODEL BUILDING
- MODEL ANALYSIS
In this first step, you will need to query databases, using database software like MySQL, to process the data.
You may also retrieve the data in file formats like Microsoft Excel. If you are using Python or R, they have specific packages that can read data from these data sources directly into your data science programs.
Some great web scraping tools are BeautifulSoup, Scrapy, etc. Another popular option to gather data is connecting to Web APIs.
Websites such as Facebook and Twitter allow users to connect to their web servers and access their data. All you need to do is use their Web API to crawl their data.
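As a sketch of the Web API pattern (the endpoint and fields below are hypothetical; real APIs such as Facebook’s or Twitter’s require authentication and have their own response formats):

```python
import json
from urllib.request import urlopen

API_URL = "https://api.example.com/v1/posts"  # hypothetical endpoint

def fetch_posts(url: str = API_URL):
    """Fetch and decode JSON records from a Web API endpoint."""
    with urlopen(url) as resp:
        return json.loads(resp.read().decode())

# Parsing works the same whether the JSON came over the wire or not;
# here we decode a sample response instead of making a network call
sample_response = '[{"id": 1, "text": "hello"}, {"id": 2, "text": "world"}]'
posts = json.loads(sample_response)
texts = [p["text"] for p in posts]
```

Libraries like `requests` wrap the same pattern more conveniently; either way, the result is a list of records ready for your data science program.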
Data collected in the first step of a data science project is usually not in a usable format to run the required analysis, and might contain missing entries, inconsistencies, and semantic errors.
Data preparation work is often done by information technology (IT) and business intelligence (BI) teams as they integrate data sets to load into a data warehouse, NoSQL database, or Hadoop data lake repository.
One of the biggest benefits of instituting a formal data preparation process is that users can spend less time finding and structuring their data. This work is also commonly done by data analysts, and sometimes by data scientists themselves.
EXPLORATORY DATA ANALYSIS
Data analysis is defined as the process of cleaning, transforming, and modeling data to discover useful information for the business decision-making.
The purpose of Data Analysis is to extract useful information from data and taking the decision based upon the data analysis.
Exploratory analysis is often described as a philosophy, and there are no fixed rules for how you approach it.
There are no limitations on data exploration. Remember that the quality of your inputs decides the quality of your output. Below are some of the standard practices involved in understanding, cleaning, and preparing your data for building a predictive model:
- Variable identification
- Univariate analysis
- Bi-variate analysis
- Missing values treatment
- Outlier treatment
- Variable transformation
- Variable creation
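Several of these practices can be sketched in Python with pandas; the dataset, columns, and thresholds here are hypothetical:

```python
import pandas as pd

# Hypothetical dataset for exploration (note one extreme income value)
df = pd.DataFrame({
    "income": [32, 45, 51, 48, 39, 250, 44, 41],
    "spend":  [10, 15, 18, 16, 12, 19, None, 13],
})

# Univariate analysis: distribution of a single variable
summary = df["income"].describe()

# Bi-variate analysis: relationship between two variables
corr = df["income"].corr(df["spend"])

# Missing values treatment: impute with the median
df["spend"] = df["spend"].fillna(df["spend"].median())

# Outlier treatment: flag values outside 1.5 * IQR
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]

# Variable creation: derive a new feature from existing ones
df["spend_ratio"] = df["spend"] / df["income"]
```

Whether flagged outliers are removed, capped, or kept is a judgment call that depends on the business problem, which is exactly why exploration has no fixed rules.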
Modeling is the stage in the data science methodology where the data scientist gets to taste the sauce and determine whether it is just right or needs more seasoning.
Once again, before reaching this stage, bear in mind that the scrubbing and exploring stage is equally crucial to building useful models. Model Building is the core activity of a data science project.
It is carried out either through statistical analysis or using machine learning techniques.
In the machine learning world, modeling is divided into three distinct stages: training, validation, and testing.
These stages change if the mode of learning is unsupervised. In any case, once we have modeled the data then we can derive insights from it. It is the stage where we can finally start evaluating our complete data science system.
The end of modeling is characterized by model evaluation, where you measure:
- Accuracy: how the model performs, i.e. does it describe the data accurately?
- Relevance: does it answer the original question that you set out to answer?
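The training/validation/testing split and the accuracy measure can be sketched with scikit-learn on a synthetic dataset; all data and split sizes below are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 200 samples, 2 features, binary target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Training / validation / testing split (60% / 20% / 20%)
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=0)

# Train on one split, then measure accuracy on data the model never saw
model = LogisticRegression().fit(X_train, y_train)
val_acc = model.score(X_val, y_val)
test_acc = model.score(X_test, y_test)
```

The validation score guides seasoning (tuning), while the test score is held back for the final verdict, so the model is never evaluated on data it trained on.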
Finally, all data science projects must be deployed in the real world. The deployment could be through an Android or an iOS app.
Machine learning models might have to be recoded before deployment, because data scientists may favor the Python programming language while the production environment supports only Java.
After this, the machine learning models are first deployed in a pre-production or test environment before actually deploying them into production.
Whatever the shape or form in which your data model is deployed it must be exposed to the real world.
Once real humans use it, you are bound to get feedback, and capturing this feedback can be the difference between life and death for any project. The more accurately you capture the feedback, the more effective the changes you make to your model will be, and the more accurate your final results will be.
Who is a data scientist?
In simple terms, a Data Scientist is one who practices the art of Data Science. The highly popular term ‘Data Scientist’ was coined by DJ Patil and Jeff Hammerbacher.
Data scientists are those who crack complex data problems with their strong expertise in certain scientific disciplines.
Data scientists are analytical experts who utilize their skills in both technology and social science to find trends and manage data.
What does a data scientist do?
Data science sits at the intersection of statistics, software engineering, and domain or business knowledge.
Data scientists work closely with business stakeholders to understand their goals and determine how data can be used to achieve those goals.
They design data modeling processes, create algorithms and predictive models to extract the data the business needs, and help analyze the data and share insights with peers.
Different types of Data Science Technique
In the following few paragraphs, we look at common data science techniques used in almost every project.
Sometimes a technique is specific to a business problem and does not fall into the categories below; it is perfectly fine to term those miscellaneous.
At a high level, we divide the techniques into supervised (we know the target variable we are trying to predict) and unsupervised (we do not have a target variable). At the next level, the techniques can be divided in terms of:
- The output we would get or what is the intent of the business problem
- Type of data used.
Let us first look at segregation based on intent.
1. Unsupervised Learning
In this type of technique, we identify unexpected occurrences in a dataset. Since their behaviour differs from that of the rest of the data, the underlying assumptions are:
- The occurrence of these instances is very small in number.
- The difference in behaviour is significant.
Anomaly detection algorithms, such as the Isolation Forest, provide a score for each record in a dataset. The Isolation Forest is a tree-based model. Such detection techniques are widely used in various business cases, for example, web page views, churn rate, revenue per click, etc. Anomalies deviate from the regular trend line and occur rarely.
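An Isolation Forest sketch with scikit-learn, using hypothetical page-view counts; the contamination rate is an assumed parameter:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Hypothetical daily page-view counts: a regular trend plus a few spikes
normal = rng.normal(loc=100, scale=5, size=(97, 1))
spikes = np.array([[300.0], [310.0], [5.0]])
views = np.vstack([normal, spikes])

# Isolation Forest scores each record; predict() returns -1 for anomalies
clf = IsolationForest(contamination=0.03, random_state=0).fit(views)
labels = clf.predict(views)
anomalies = views[labels == -1]
```

The model isolates records with short tree paths, so the rare, far-from-trend values surface without anyone defining “normal” in advance.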
Through this analysis, the main task is to segregate the entire dataset into groups so that the traits of the data points in one group are quite similar to each other. In data science terminology, we call these groups clusters.
For example, a retail business planning to scale needs to know how new customers in a new region will behave, based on the past data it has.
It is impossible to devise a strategy for each individual in a population, but it is useful to bucket the population into clusters, so that a strategy is effective for a whole group and is scalable.
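Cluster analysis can be sketched with scikit-learn’s KMeans on hypothetical customer data; the two regions and the features (age, yearly spend) are invented for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical customers from two regions: columns are (age, yearly spend)
group_a = rng.normal(loc=[25, 200], scale=5, size=(50, 2))
group_b = rng.normal(loc=[55, 800], scale=5, size=(50, 2))
customers = np.vstack([group_a, group_b])

# Bucket the population into clusters so one strategy serves each group
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
labels = km.labels_
```

No group labels were given to the algorithm; it recovers the two buckets purely from the similarity of the data points, which is what makes clustering unsupervised.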
This analysis helps us build interesting relationships between items in a dataset. It uncovers hidden relationships and represents dataset items in the form of association rules or sets of frequent items. Association rule mining is broken down into two steps:
- Frequent Itemset Generation: a set is generated in which frequently co-occurring items are grouped together.
- Rule Generation: the sets built above are passed through different layers of rule formation to uncover hidden relationships between items. For example, a set can fall into conceptual, implementation, or application issues; these are then branched down their respective trees to build the association rules.
For example, Apriori is a well-known association rule building algorithm.
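A minimal, pure-Python sketch of the two steps on hypothetical market-basket data; the support threshold is an assumption, and a real Apriori implementation iterates over growing itemset sizes:

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]
min_support = 3  # an itemset is "frequent" if it appears in >= 3 baskets

# Step 1, frequent itemset generation (pairs only, for brevity)
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1
frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}

# Step 2, rule generation: confidence of the rule "bread -> milk"
bread = sum("bread" in b for b in transactions)
both = sum({"bread", "milk"} <= b for b in transactions)
confidence = both / bread  # P(milk | bread)
```

A rule like “bread → milk” is kept only if its confidence clears a chosen threshold; libraries such as mlxtend automate this over itemsets of every size.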
2. Supervised Learning
In regression analysis, we define the dependent/target variable, treat the remaining variables as independent variables, and hypothesize how one or more independent variables influence the target variable.
Regression with one independent variable is called univariate; with more than one, it is known as multivariate. Let us understand the univariate case first and then scale to multivariate.
For example, y is the target variable and x1 is the independent variable. So, from the knowledge of the straight line,
we can write the equation as y = mx1 + c. Here, “m” determines how strongly y is influenced by x1. If “m” is very close to zero, a change in x1 does not affect y strongly. With a magnitude greater than 1, the impact gets stronger, and a small change in x1 leads to a big variation in y. Similar to the univariate case, the multivariate equation can be written as y = m1x1 + m2x2 + m3x3 + …, where the impact of each independent variable is determined by its corresponding “m”.
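A univariate regression sketch, fitting y = mx1 + c by least squares on synthetic data generated with known coefficients:

```python
import numpy as np

# Synthetic data generated from y = 3*x1 + 2 plus noise,
# so the fitted "m" and "c" should land near 3 and 2
rng = np.random.default_rng(1)
x1 = rng.uniform(0, 10, size=50)
y = 3 * x1 + 2 + rng.normal(scale=0.5, size=50)

# Least-squares fit of y = m*x1 + c; "m" measures x1's influence on y
m, c = np.polyfit(x1, y, deg=1)
```

For the multivariate case y = m1x1 + m2x2 + …, the same idea carries over with `np.linalg.lstsq` or scikit-learn’s `LinearRegression`, which estimate one “m” per independent variable.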
Similar to clustering analysis, Classification algorithms are built having the target variable in the form of classes.
The difference between clustering and classification lies in the fact that in clustering we don’t know which group the data points fall in, whereas in classification we know which group it belongs to.
It differs from regression in that the number of groups is a fixed number, whereas in regression the target is continuous. There are a bunch of algorithms in classification analysis, for example, Support Vector Machines, Logistic Regression, Decision Trees, etc.
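A classification sketch with a decision tree, where the class of every training point is known in advance; the dataset and its decision boundary are invented for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)
# Hypothetical data: 2 features, and the class of each point is known
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0] > 0).astype(int)  # a fixed number of groups: 0 or 1

# Fit a shallow tree, then predict the group of two unseen points
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
pred = clf.predict([[0.5, 0.0], [-0.5, 0.0]])
```

Unlike clustering, the groups here were supplied as labels during training, which is exactly what makes classification a supervised technique.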
In conclusion, each type of analysis is vast in itself, but here we have provided a small flavor of the different techniques. In the next few notes, we will take each of them separately and go into detail on the different sub-techniques employed in each parent technique.