Natural Language Processing: Opportunities and Challenges
Blog: NASSCOM Official Blog
Natural Language Processing (NLP) is an extension of AI and ML technologies that deals with linguistic analysis. In simple terms, it allows machines to understand text. By analyzing text, computers can identify relations, entities, emotions and other useful information. This is a breakthrough, because computers can now understand more than 0s and 1s, or simply put, machine language. NLP has gone beyond understanding simple text to understanding entities, concepts, themes and sentiment. We have also seen it applied to social media to understand issues and opinions.
According to a TDWI research report, AI and ML are not new technologies; they have been around since the 1950s, when the terms were first coined. However, thanks to the increased processing power of computers since then, these technologies have now become mainstream.
In India’s context, however, NLP is at a nascent stage, largely because of the country’s many languages and dialects. Even though English is an official language, the majority of Indians prefer speaking regional languages. Given that NLP works by converting text into machine-readable data and then using that data to find patterns, making NLP mainstream in India would require a deep understanding of regional languages.
Components of NLP
Broadly, there are two components of NLP – Natural Language Understanding (NLU) and Natural Language Generation (NLG). TechSagar breaks these broad components further into various sub-components.
According to experts, natural language understanding (NLU) is more challenging than natural language generation (NLG). The reason most commonly cited is the contextualization of text. Let’s look at NLU in detail, through the TechSagar lens:
Computational Linguistics: There are only two input sources for an NLP engine – speech and text. Computational linguistics studies the recognition of speech and text to develop a better understanding of applications in speech recognition, intelligent web surfing and machine translation.
Document Analysis: Analyzing the text from various input documents and finding trends or common patterns.
NLU Applications: Some application areas include spam filtering, question answering, intelligent personal/virtual assistants, machine translation, optical character recognition (OCR), social network analytics, search engines, speech processing, search and information retrieval, and information extraction.
Morphology, Phonology and Semantics: These are three distinct, yet related, terms used in NLP. Morphology deals with the formation of words from morphemes – the units of a word that cannot be broken down any further. Phonology, in turn, studies how sounds (phonemes) are organized and combined in a language.
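To illustrate morpheme segmentation, here is a toy suffix-stripping sketch. The suffix list and the length rule are invented for illustration; a real morphological analyzer handles spelling changes, prefixes and irregular forms:

```python
# Toy morphological segmentation: strip common English suffixes (morphemes).
# Illustrative sketch only; the suffix list is a made-up sample.
SUFFIXES = ["ation", "ness", "ing", "ful", "ed", "s"]

def segment(word):
    """Return (stem, suffix) for the first matching suffix, else (word, '')."""
    for suf in SUFFIXES:
        # Require a stem of at least 3 letters to avoid over-stripping.
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[:-len(suf)], suf
    return word, ""

print(segment("walking"))      # -> ('walk', 'ing')
print(segment("unhappiness"))  # -> ('unhappi', 'ness')  (no spelling repair)
```

Note how the second example exposes the limits of pure suffix stripping: the stem "unhappi" still needs a spelling rule to recover "unhappy".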
To understand speech, NLP developers use Hidden Markov Models (HMMs). An HMM breaks the user’s speech into small units; each unit is then compared with pre-recorded speech to identify the phoneme it contains. A phoneme is the smallest unit of speech. The model then detects series of phonemes, and statistical analysis is used to decide which words or sentences the user spoke.
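The decoding step described above can be sketched with the Viterbi algorithm, the classic way to recover the most likely state sequence from an HMM. The two phoneme states, the transition/emission probabilities and the "acoustic" observations below are all invented toy values:

```python
# Toy Viterbi decoding over a two-phoneme HMM.
# All states, probabilities and observations are made up for illustration.
states = ["/k/", "/ae/"]
start_p = {"/k/": 0.6, "/ae/": 0.4}
trans_p = {"/k/": {"/k/": 0.3, "/ae/": 0.7},
           "/ae/": {"/k/": 0.4, "/ae/": 0.6}}
emit_p = {"/k/": {"lowE": 0.7, "highE": 0.3},
          "/ae/": {"lowE": 0.2, "highE": 0.8}}

def viterbi(obs):
    """Return the most likely phoneme sequence for a list of observations."""
    # Column 0: start probability times emission probability.
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        V.append({})
        for s in states:
            # Best previous state extending its path into s.
            prob, path = max(
                (V[-2][p][0] * trans_p[p][s] * emit_p[s][o], V[-2][p][1] + [s])
                for p in states)
            V[-1][s] = (prob, path)
    return max(V[-1].values())[1]

print(viterbi(["lowE", "highE"]))  # -> ['/k/', '/ae/']
```

Real speech recognizers work the same way in spirit, but over thousands of context-dependent phoneme states with probabilities learned from audio corpora.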
Semantics is concerned with understanding the meaning of words and sentences. Semantic tasks include named entity recognition, relationship extraction, topic modelling, recognizing textual entailment and word-sense disambiguation.
Pragmatics, Syntax and Transliteration: Pragmatics determines how sentences are used in different situations and how that use changes their meaning. Syntax deals with the arrangement of, and relationships among, the words, phrases and clauses that form sentences. NLP processes such as part-of-speech (POS) tagging, named entity recognition and parsing are used to identify the relationships between the words in a sentence.
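As a minimal sketch of POS tagging and named entity recognition, here is a dictionary-lookup version. The tag dictionary and entity list are invented samples; production systems use statistical or neural models trained on annotated corpora:

```python
# Toy POS tagging and named entity recognition via dictionary lookup.
# The dictionaries below are tiny made-up samples for illustration only.
POS_LEXICON = {"the": "DET", "cat": "NOUN", "sat": "VERB",
               "on": "ADP", "mat": "NOUN", "promotes": "VERB", "in": "ADP"}
GAZETTEER = {"India": "LOC", "NASSCOM": "ORG"}

def tag(sentence):
    """Return (pos_tags, entities) for a whitespace-tokenized sentence."""
    tokens = sentence.split()
    pos = [(t, POS_LEXICON.get(t.lower(), "X")) for t in tokens]  # X = unknown
    ents = [(t, GAZETTEER[t]) for t in tokens if t in GAZETTEER]
    return pos, ents

pos, ents = tag("NASSCOM promotes NLP in India")
print(ents)  # -> [('NASSCOM', 'ORG'), ('India', 'LOC')]
```

Even this toy shows why context matters: a lookup table cannot tell whether "book" is a noun or a verb, which is exactly what trained taggers disambiguate.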
Transliteration is the process of transferring a word from the alphabet of one language to another. It only gives you an idea of how the word is pronounced, by putting it in a familiar alphabet: the letters of the word’s original alphabet are changed to similar-sounding letters in a different one.
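A character-mapping sketch of transliteration follows. It covers only four Devanagari consonants and one vowel sign, chosen for illustration; real schemes such as ITRANS or ISO 15919 are far richer:

```python
# Toy Devanagari-to-Latin transliteration via a character map.
# Only a handful of characters are handled, for illustration only.
CONSONANTS = {"न": "n", "म": "m", "स": "s", "त": "t"}
VIRAMA = "्"                 # suppresses the inherent 'a' vowel
MATRAS = {"े": "e", "ि": "i"}  # vowel signs that replace the inherent 'a'

def transliterate(text):
    out = []
    for ch in text:
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch] + "a")      # consonant + inherent 'a'
        elif ch == VIRAMA and out:
            out[-1] = out[-1][:-1]                # strip the inherent vowel
        elif ch in MATRAS and out:
            out[-1] = out[-1][:-1] + MATRAS[ch]   # replace it with the matra
        else:
            out.append(ch)                        # pass unknown chars through
    return "".join(out)

print(transliterate("नमस्ते"))  # -> 'namaste'
```

The inherent-vowel handling is the interesting part: Devanagari consonants carry an implicit "a" unless a virama or vowel sign says otherwise, which is why naive letter-for-letter mapping fails.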
NLP in Indian context
According to a recent report published by NASSCOM, AI and data have the potential to add more than USD 500 bn to India’s GDP by 2025; about 45% of this value will come from three sectors – consumer goods and retail, banking, and agriculture.
Putting NLP in the context of India raises a few challenges, though. According to research cited on Towards Data Science, only 10% of Indians speak and understand English; the remaining 90% speak and understand regional languages. This poses a language barrier for technologies such as NLP, a concern compounded by the very limited resources available for many of these languages.
Three Python libraries cover Indian languages in detail:
- iNLTK: Hindi, Punjabi, Sanskrit, Gujarati, Kannada, Malayalam, Nepali, Odia, Marathi, Bengali, Tamil, Urdu
- Indic NLP Library: Assamese, Sindhi, Sinhala, Sanskrit, Konkani, Kannada, Telugu
- StanfordNLP: Many of the above languages
AnalyticsVidhya provides a detailed overview of these libraries, including text processing and building NLP applications for Indian languages. Based on the key use cases of NLP, the following figure summarizes some of the start-ups in India. The figure is by no means exhaustive, but it helps establish that NLP as a field is getting attention in India. Many of the start-ups shown below are featured on the TechSagar platform.
The following sections provide an overview of developments at the global scale:
1. NLP market
The global market for NLP is projected to reach USD 27.6 bn by 2026, up from USD 9.9 bn in 2020, according to a Valuates report. This growth is attributed to the following factors:
- Increased use of smart devices
- Rapid adoption of cloud-based technologies
- NLP-based applications for customer support
According to the report, North America leads in NLP innovation and revenue generation. Asia-Pacific is expected to catch up very soon because of technological innovation in the region.
2. Technology innovation
A very recent development is the GPT-3 language model, the most powerful language model built to date. GPT-3 was trained with 175 billion parameters, 10x more than any previous non-sparse language model. However, according to OpenAI, learning without human-labeled data remains a long-standing challenge in machine learning.
OpenAI is planning to commercialize GPT-3 very soon. However, this raises a few concerns in the expert community. Two of the leading concerns are:
- Lack of contextualization and understanding of the real world
- The risk of anti-Semitism, racism, homophobia and misogyny in the output, as the model picks up its interpretations from the internet
NLP – TechSagar Context
TechSagar is supported by the Office of the National Cyber Security Coordinator. It is a platform to discover India’s cybertech capabilities through a portal that lists the business and research capabilities of various entities from the IT industry, start-ups, academia and R&D institutes. TechSagar also lists individual researchers along with the scope of their past and future research. The portal can be accessed at www.techsagar.in.
TechSagar has identified 83 capability definitions within NLP. Some of the notable ones are context-aware site search, document analysis, NLP generation, morphology, NLP applications and so on.
Use-cases and benefits of NLP
1. Financial services sector: NLP offers unique use cases for financial services, especially around sentiment analysis and content enrichment, which can help the sector make informed decisions about investments and risk management. Applied to unstructured data, NLP can help pull insights from under-used data points.
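The simplest form of the sentiment analysis mentioned above is a lexicon-based scorer. The word lists and the sample headline below are invented for illustration; real financial-sentiment systems use domain-tuned lexicons or trained classifiers:

```python
# Toy lexicon-based sentiment scoring for financial text.
# The word lists are tiny made-up samples, not a real financial lexicon.
POSITIVE = {"growth", "profit", "beat", "strong"}
NEGATIVE = {"loss", "miss", "risk", "weak"}

def sentiment(text):
    """Score text by counting positive vs negative lexicon hits."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("Strong profit growth despite currency risk"))  # -> 'positive'
```

The limitation is visible immediately: the scorer cannot handle negation ("no growth") or domain nuance ("risk appetite"), which is what pushes practitioners toward trained models.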
2. Healthcare and pharma sector: By the end of 2020, total healthcare data in the world is expected to be around 2,314 exabytes (1 EB = 1 bn GB). Leveraging emerging technologies such as NLP can therefore help healthcare providers in the following ways:
- Improving patient care using NLP
- NLP can help in generating thematic map for various qualitative answers
- McKinsey research suggests that NLP can help reduce medical administrative costs through efficient billing, prior-authorization approval and accurate prediction of post-surgery complications
3. Retail sector: Similarly, NLP offers the retail sector benefits such as text recognition and semantics to understand user behaviour; sentiment analysis to learn what customers think about the products they have bought; chatbots for 24/7 customer engagement and support; and, finally, advertising that targets potential buyers.
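The chatbot benefit mentioned above can be sketched at its most basic as keyword matching against canned replies. The keywords and answers below are invented; commercial retail chatbots use intent classifiers and dialogue management on top of this idea:

```python
# Toy rule-based retail chatbot: match a keyword, return a canned reply.
# Keywords and replies are made-up samples for illustration only.
RULES = {
    "refund": "You can request a refund from the Orders page.",
    "delivery": "Standard delivery takes 3-5 business days.",
}

def reply(message):
    """Return the first matching canned reply, else escalate to a human."""
    text = message.lower()
    for keyword, answer in RULES.items():
        if keyword in text:
            return answer
    return "Sorry, I didn't understand. A human agent will assist you."

print(reply("Where is my delivery?"))
```

The fallback branch is the important design choice: a bot that admits it does not understand and hands off to a human keeps customer support usable even when the rules fail.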
Visit the TechSagar website to download the full paper: https://www.techsagar.in/whitepapers