Big Data: what it is, how to search, store and use

Blog: Think Data Analytics Blog

In this article, we look at what counts as big data and what does not, and at how to store, process and benefit from this information.

Definition of Big Data

Big data means petabytes (and more) of complex, raw information that is constantly being updated.

Examples include data from IoT sensors on industrial equipment in factories, records of bank customers' transactions, or search queries from different devices. Sometimes the term also covers the methods and technologies used to process this data.

The term “big data” appeared in 2008, but companies were working with big data even before it had a name. For example, business analysts at VimpelCom worked with big data in 2005, according to Viktor Bulgakov, head of the management information department.

To judge more accurately whether data counts as big data, analysts look at three properties of the information, defined by the Meta Group in 2001: volume, velocity and variety.

Two more factors are often added to the listed factors:

Note. These definitions are loose, because no one knows exactly how to define big data. Some Western experts even believe that the term has been discredited and suggest abandoning it.

How Big Data is collected

Sources can be:

Collection. The technology and the process of collecting data are called data mining.

Collection is carried out through services such as Vertica, Tableau, Power BI and Qlik. The collected data can come in different formats: text, Excel tables, SAS files.

During collection, the system finds petabytes of information, which is then processed with data mining methods that reveal patterns. These methods include neural networks, clustering algorithms, algorithms for detecting associative links between events, decision trees, and some machine learning methods.
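To make "detecting associative links between events" more concrete, here is a minimal sketch of the two basic measures used in association analysis, support and confidence. The purchase records and the function names are invented for illustration; real tools compute the same measures over far larger datasets.

```python
def support(transactions, items):
    """Share of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """How often `consequent` occurs among transactions containing `antecedent`."""
    both = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return both / base if base else 0.0


# Hypothetical purchase records, invented purely for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(transactions, {"bread", "milk"}))       # 0.5
print(confidence(transactions, {"bread"}, {"milk"}))  # about 0.67
```

A rule such as "bread implies milk" is considered interesting when both numbers are high enough; the thresholds are a business decision rather than a property of the algorithm.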

Briefly, the process of collecting and processing information looks like this:

How Big Data is stored

Most often, raw data is stored in a data lake. The data is kept there in different formats and with different degrees of structure:

Different tools are used to store and process information in the data lake:

A data lake is more than storage. It can also include a software platform (for example, Hadoop), clusters of storage and processing servers, tools for integrating with sources and consumers of information, systems for data preparation and management, and sometimes machine learning tools. A data lake can also be scaled up to thousands of servers without stopping the cluster.

From the lake, information flows into the “sandboxes” – areas of data exploration. At this stage, scenarios are developed to solve various business problems.

A data lake is more often located in the cloud than on a company's own servers. For example, 73% of companies use cloud services to work with big data, according to the report “Overview of Trends and Issues of Big Data 2018”. Big data processing requires a lot of computing power, and cloud technologies can reduce the cost of this work, which is why companies turn to cloud storage.

Cloud technologies can become an alternative to your own data service, because it is difficult to predict the exact load on the infrastructure. If you buy equipment “in reserve”, it sits idle and causes losses. And if the equipment is underpowered, it will not be enough for storage and processing.

How big data works

When the data is received and saved, it must be analyzed and presented in a form that is understandable for the client: graphs, tables, images or ready-made algorithms. Traditional methods are not suitable due to the volume and complexity of processing. With big data, you need to:

Therefore, separate technologies have been developed to work with big data.


Initially, these were tools for processing loosely structured data: NoSQL DBMSs, MapReduce algorithms and Hadoop.

MapReduce is a framework for parallel computation over very large datasets (up to several petabytes). It was introduced by Google in 2004.
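The idea behind MapReduce is easier to see in a small example. The sketch below runs the three classic phases (map, shuffle, reduce) on one machine to count words. It only illustrates the programming model: the function names are our own, and a real framework would distribute each phase across many nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse the list of values for each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    # Each document could be mapped on a different node; here we just loop.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle_phase(mapped))


counts = word_count(["big data is big", "data lake stores raw data"])
print(counts["data"], counts["big"])  # 3 2
```

Because each call to `map_phase` depends only on its own document, and each key is reduced independently, both phases can run in parallel, which is what makes the model suitable for very large datasets.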

NoSQL (from “Not Only SQL”) is a family of approaches that helps to work with disparate data and solves scalability and availability problems at the expense of strict data atomicity and consistency.

Hadoop is a project of the Apache Software Foundation. It is a set of utilities, libraries and frameworks used to develop and run distributed programs on clusters of hundreds or thousands of nodes. We have already mentioned it above, and for good reason: almost no big data project can do without Hadoop.

Other technologies include the R and Python programming languages and further Apache products.
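The trade-off between availability and consistency that many NoSQL systems make can be shown with a toy model. The class below is hypothetical and does not reflect how any particular database works: it acknowledges a write after updating a single replica and copies it to the others later, so a read from another replica may briefly return stale data.

```python
class EventuallyConsistentStore:
    """Toy key-value store with lazily synchronized replicas (illustration only)."""

    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # writes not yet copied to the other replicas

    def write(self, key, value):
        # Acknowledge after updating one replica: fast and always available.
        self.replicas[0][key] = value
        self.pending.append((key, value))

    def read(self, key, replica=0):
        # A read may hit any replica, including one that is not yet updated.
        return self.replicas[replica].get(key)

    def sync(self):
        # Background replication: apply pending writes to the other replicas.
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("user:1", "Alice")
print(store.read("user:1", replica=2))  # None: replica 2 is stale
store.sync()
print(store.read("user:1", replica=2))  # Alice
```

A relational database would typically make the write visible everywhere before acknowledging it; the sketch gives up that guarantee in exchange for a write path that never waits on other nodes.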

Methods and tools for working with big data

These are data mining, machine learning, crowdsourcing, predictive analytics, visualization, simulation. Dozens of techniques:

For example, machine learning is an AI method that teaches a computer to “think” on its own, analyze information and make decisions after learning, rather than following a human-programmed command.

Learning algorithms need structured data from which the computer can learn. For example, if you play checkers against a machine and win, the machine only memorizes the winning moves but does not analyze the game itself. If you let the computer play against itself, it will learn how the game unfolds and develop a strategy, and a human opponent will start losing to it. In that case the machine does not just make moves, it “thinks”.

Deep learning is a separate type of machine learning in which programs are created that are capable of self-learning. It relies on artificial neural networks that mimic the neural networks of the human brain. Computers process unstructured data, analyze it, draw conclusions, sometimes make mistakes and learn from them, almost like humans.
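To show what "learning from structured data" means in the simplest possible case, here is a sketch of a nearest-centroid classifier. Nobody writes the decision rule by hand: it is derived from labeled examples. The numbers and labels are invented for illustration, and this is one of the simplest learning methods rather than what production systems typically use.

```python
def fit_centroids(samples, labels):
    """Learn one average point (centroid) per label from the training data."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def predict(centroids, x):
    """Assign `x` the label of the closest learned centroid."""
    def squared_distance(center):
        return sum((a - b) ** 2 for a, b in zip(x, center))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Hypothetical training data: [visits per month, purchases per month].
samples = [[1, 1], [2, 1], [8, 9], [9, 8]]
labels = ["low", "low", "high", "high"]

centroids = fit_centroids(samples, labels)
print(predict(centroids, [2, 2]))  # low
print(predict(centroids, [7, 8]))  # high
```

If the training data changes, the centroids change with it and so do the predictions, without anyone editing the code.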

The result of deep learning is used in image processing, speech recognition algorithms, computer translation and other technologies. The pictures drawn by Yandex neural networks and Alice’s witty answers to your questions are the result of deep learning.

Data Engineer

This is the “human” part of working with big data. A Data Engineer is a specialist in data processing who prepares the infrastructure and the data for the Data Scientist:

After the Data Engineer, the Data Scientist steps in: they create and train predictive (and other) models using machine learning algorithms and neural networks, helping businesses find hidden patterns, predict how events will develop and optimize business processes.

Where Big Data is used

The main principle of big data is to quickly give the user information about objects, phenomena or events. To do this, machines are able to build variable models of the future and track results, which is useful for commercial companies.


The banking industry uses big data technologies for fraud prevention, process optimization and risk management. For example, VTB, Sberbank or Tinkoff are already using big data to check the reliability of borrowers (scoring), manage staff and predict queues at branches.

Collecting big data helps to more accurately assess the client’s risk profile, which ultimately reduces the likelihood of loan defaults.
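A scoring model of this kind can be sketched as a logistic function over a few applicant features. Everything in the example is hypothetical: the feature names, weights, bias and threshold are invented for illustration, whereas a real bank would learn them from historical loan data and use far more inputs.

```python
import math

# Hypothetical weights; a real model would be trained on historical loans.
WEIGHTS = {"late_payments": 0.8, "debt_to_income": 2.0, "years_employed": -0.3}
BIAS = -2.0


def default_probability(applicant):
    """Estimate the probability of default with a logistic model."""
    z = BIAS + sum(WEIGHTS[name] * applicant[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def decide(applicant, threshold=0.5):
    """Send risky applications to manual review, approve the rest."""
    return "review" if default_probability(applicant) >= threshold else "approve"


reliable = {"late_payments": 0, "debt_to_income": 0.2, "years_employed": 5}
risky = {"late_payments": 3, "debt_to_income": 0.6, "years_employed": 1}

print(decide(reliable))  # approve
print(decide(risky))     # review
```

The more data a bank has about past borrowers, the better such weights can be estimated, which is where big data comes in.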

Tinkoff uses EMC Greenplum, SAS Visual Analytics and Hadoop to analyze risks, identify customer needs, and leverage big data in scoring, marketing and sales.

VTB uses big data to make decisions about opening new offices. The bank has created its own internal geo-analytical platform. Machine learning methods have made it possible to identify the demand for banking services in different areas of the city.


The choice of a business development strategy is based on the results of information analysis. Here, big data will help process huge amounts of data and identify the direction of development. Using the results of the analysis, you can identify which products are in demand in the market, and increase customer loyalty.

Hypermarket Hoff uses big data to create personalized offers for customers.

The CarPrice service reduces costs by optimizing traffic: thanks to big data, users now make decisions faster, and the quality of service has improved.

The Zarina brand increased its revenue by 28% by personalizing the delivery of recommendations to the customers of the online store.

Netflix deserves a special mention here, because personalization is at its core. The service, with its audience of millions, offers content that in 80% of cases is based on the viewer's behavior and on information from Facebook and Twitter. To optimize search results, it uses the user's search queries, browsing history, and information about repeated views, pauses and rewinds. Netflix uses Hadoop, Teradata and its own solutions (Lipstick and Genie) to process data.

For example, when Netflix created House of Cards, based on the analysis, it ordered two seasons at once, and not just the pilot. And the series was an overwhelming success: data analysis showed that viewers were delighted with actor Kevin Spacey and producer David Fincher.
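Personalization based on viewing history can be illustrated with a very small recommender. This is a simplified sketch and not a description of how Netflix or any other service actually works: it suggests titles watched by users whose history overlaps with yours, weighted by the size of the overlap. The user names and titles are placeholders.

```python
from collections import Counter


def recommend(history, user, top_n=2):
    """Suggest unseen titles watched by users with overlapping history."""
    seen = history[user]
    scores = Counter()
    for other, titles in history.items():
        if other == user:
            continue
        overlap = len(seen & titles)  # how similar the two histories are
        if overlap == 0:
            continue
        for title in titles - seen:
            scores[title] += overlap
    return [title for title, _ in scores.most_common(top_n)]


# Hypothetical viewing histories, invented for illustration.
history = {
    "ann": {"Title A", "Title B"},
    "bob": {"Title A", "Title B", "Title C"},
    "eve": {"Title B", "Title D"},
    "dan": {"Title E"},
}

print(recommend(history, "ann"))  # ['Title C', 'Title D']
```

Signals such as pauses, rewinds and repeated views could be folded in as extra weights on each title; the sketch uses only the presence of a title in the history.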


Big data provides a great toolbox for marketers. Data analysis helps to identify customer needs, test new ways to increase loyalty and find which products will be in demand.

For example, the RTB service helps you set up retargeting: cross-channel, search, and product retargeting. So companies can advertise products not to everyone, but only to the target audience.

Services such as Crossss, Alytics and 1C-Bitrix BigData make it possible to run end-to-end analytics, increase the average check, improve ad conversion and make offers more personalized, all with the help of big data.

Problems and prospects of Big Data

The problems are the amount of information, processing speed and lack of structure.

Storing large amounts of data requires special conditions, while processing speed requires new methods of analysis. The world still lacks sufficient experience in accumulating big data. At the same time, the data is scattered and sometimes unreliable, which makes it harder to solve business problems effectively.

The big data industry is only gaining momentum, and there are not enough specialists such as Data Engineers, because this profession did not exist until recently.

Prospects. Big data is evolving: it helps to recognize fraud in banks, calculate the effectiveness of advertising campaigns, recommend a movie, and even diagnose a patient based on the collected medical history. Banks, process manufacturing and companies from the professional services industry invest the most in big data.

In 2016, the world market for software, equipment and services in business intelligence and big data amounted to $130.1 billion, of which $17 billion came from the banking sector. The share of investments from government bodies and commercial companies was measured at approximately 7.5%. In 2018, revenue from sales of programs and services in the global market amounted to $42 billion, and the market continues to grow.

Experts believe that the technology will soon be used in the transport sector, oil production, and energy. IDC predicts that revenues related to big data will exceed $260 billion by 2022, with annual market growth of 11.9%. The largest market segments will be manufacturing, finance, healthcare, environmental protection and retail, according to Frost & Sullivan forecasts.

The development of big data will change our daily life. The systems will be able to analyze daily routes, frequent orders and recurring payments. Probably, in the future, technologies will make it possible to automatically pay for loans and utilities, call a car from work to home, where dinner from your favorite dishes will already be ready on the table.

The post Big Data: what it is, how to search, store and use appeared first on ThinkDataAnalytics.
