Among the various branches of artificial intelligence, Natural Language Processing (NLP) has played an increasingly important role in recent years. This discipline focuses on the interaction between computers and human language, enabling machines to understand, interpret, and generate text much as humans do.
Initially, before the advent of Machine Learning techniques, natural language processing relied mainly on symbolic approaches and manually defined rules created by linguists and programmers working together. However, these approaches had several limitations:
· Scalability: Creating rules manually was laborious and difficult to scale to complex languages and domains.
· Ambiguity: Handling linguistic ambiguity was very complicated with fixed rules.
· Adaptability: The systems were rigid and could not easily adapt to new data or contexts.
The turning point, or rather the revolution, came with the introduction of Machine Learning, which allowed NLP systems to learn from data rather than relying solely on manual rules. This made it easier to handle the ambiguity and variety of natural language and led to much more efficient and flexible systems.
Natural Language Processing and Machine Learning
A crucial phase of Machine Learning is pre-processing. Once a significant amount of data has been collected, and before any algorithm is applied, a series of techniques transforms the raw data so that Machine Learning algorithms can analyze it more easily and usefully.
Some examples of these techniques are:
· Tokenization: Dividing the text into smaller parts, called tokens. In the simplest case each token is a single word, and words are split on defined separators such as spaces, commas, and periods.
· Normalization: Cleaning individual tokens, such as removing numerical or special characters.
· Stop word removal: Eliminating very frequent words that provide little information for text classification. These typically include articles, prepositions, and pronouns.
· Lemmatization: Reducing each word to its lemma (its dictionary form), such as transforming a verb into its infinitive.
· Stemming: Identifying and retrieving the roots of words to recover only their significant part.
· Generation of n-grams: Grouping the words of a sentence into partially overlapping sets of n elements. The best results are usually obtained with n = 2 or n = 3.
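The steps listed above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a production pipeline: the stop word list is a tiny hypothetical sample, the `stem` function is a naive suffix stripper standing in for a real stemmer, and lemmatization is left out because it needs a dictionary of word forms. All function names are assumptions made for this example.

```python
import re

# Hypothetical, deliberately tiny stop word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "are", "for", "on"}

def tokenize(text):
    """Split text into tokens on whitespace and common punctuation."""
    return [t for t in re.split(r"[\s,.;:!?]+", text) if t]

def normalize(tokens):
    """Lowercase each token and strip digits and special characters."""
    cleaned = (re.sub(r"[^a-z]", "", t.lower()) for t in tokens)
    return [t for t in cleaned if t]

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Naive suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def ngrams(tokens, n):
    """Return the partially overlapping groups of n consecutive tokens."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

def preprocess(text):
    """Tokenize, normalize, remove stop words, then stem."""
    tokens = remove_stop_words(normalize(tokenize(text)))
    return [stem(t) for t in tokens]

tokens = preprocess("The invoices attached to the email are waiting for approval.")
bigrams = ngrams(tokens, 2)
print(tokens)   # ['invoic', 'attach', 'email', 'wait', 'approval']
print(bigrams)  # first pair: ('invoic', 'attach')
```

Note how the naive stemmer produces roots such as "invoic" that are not real words; this is typical of stemming, which only aims to map related word forms to the same string.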
Once the pre-processing phase is completed, the data will be in a uniform and more understandable format for the learning algorithm and can be used to train the Machine Learning model. Depending on the nature of the problem and the type of learning used, these models can be grouped into different categories, the main ones being:
· Supervised: These require labeled historical data for the model training phase, as in classification.
· Unsupervised: These do not require labeled examples for the model training phase, as in clustering, where the goal is to group similar data without predefined labels.
Once the model is created, the data to which it will be applied must be pre-processed in the same way as the training data to ensure consistency and accuracy in its predictions.
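As an illustration of supervised learning with consistent pre-processing, the sketch below trains a small multinomial Naive Bayes classifier in pure Python. Naive Bayes is chosen here only because it is compact; the text does not say which algorithm a real system would use. The class name, the `preprocess` helper, and the four training emails are all invented for the example. The key point is that `preprocess` is called both in `fit` and in `predict_proba`, so training and prediction see the data in the same form.

```python
import math
import re
from collections import Counter, defaultdict

def preprocess(text):
    """Hypothetical minimal pre-processing: lowercase, keep alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.doc_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(preprocess(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict_proba(self, text):
        # Same pre-processing as in fit; words never seen in training are ignored.
        tokens = [t for t in preprocess(text) if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for t in tokens:
                score += math.log((counts[t] + 1) / denom)
            log_scores[label] = score
        # Convert log scores into probabilities that sum to 1.
        top = max(log_scores.values())
        weights = {label: math.exp(s - top) for label, s in log_scores.items()}
        norm = sum(weights.values())
        return {label: w / norm for label, w in weights.items()}

    def predict(self, text):
        proba = self.predict_proba(text)
        return max(proba, key=proba.get)

# Invented labeled examples (the "historical data" of supervised learning).
train_texts = [
    "Please find the invoice for last month attached",
    "Payment reminder for the overdue invoice",
    "I cannot log in to my account",
    "Password reset is not working for my account",
]
train_labels = ["billing", "billing", "support", "support"]

model = NaiveBayes().fit(train_texts, train_labels)
label = model.predict("INVOICE payment overdue!!!")
print(label)  # billing
```

Because the new message is lowercased and stripped of punctuation exactly as the training texts were, "INVOICE" matches the "invoice" counts learned during training.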
Natural language processing: a use case
By employing sophisticated algorithms and the advanced text analysis techniques described above, it is possible to develop software solutions that efficiently and accurately automate complex and time-consuming processes, such as the multi-level classification of numerous incoming emails in a large organization.
In the increasingly digital landscape of institutions, whether in the financial, insurance, legal, or human resources sectors, efficient communication management has become a fundamental priority. However, tedious and repetitive procedures such as email classification continue to pose a significant challenge. This process not only requires considerable time and personnel but is also prone to human errors that can compromise the overall efficiency of the institution.
The Aiom Solution
In this context, Revelis has developed Aiom, a software solution capable of simultaneously analyzing multiple email inboxes, both certified (PEC) and traditional, providing an innovative response to this critical need. With its ability to learn from data and to adapt to changes in communication by retraining its learning models, this system offers a tailored approach for organizations seeking to optimize their operations, minimizing the time required to perform tasks and enabling more efficient email management. This frees up valuable human resources, allowing them to dedicate time to other activities.
Training can be done using various algorithms. If probabilistic algorithms are chosen, Aiom can also report the reliability of each classification as a percentage, and it can be configured to reply automatically to the sender when this reliability exceeds a configurable threshold.
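The threshold logic can be illustrated with a short sketch. It is not Aiom's actual code: the function name `decide`, the category names, and the default threshold of 0.9 are assumptions made for the example. The function takes the per-category probabilities produced by a probabilistic model, picks the most likely category, and flags the email for automatic reply only if the confidence exceeds the threshold.

```python
def decide(probabilities, threshold=0.9):
    """Return (label, confidence, auto_reply) for one classified email.

    `probabilities` maps each category to the model's estimated probability.
    `threshold` is the configurable reliability cut-off; 0.9 is an arbitrary
    default chosen for this example.
    """
    label = max(probabilities, key=probabilities.get)
    confidence = probabilities[label]
    return label, confidence, confidence > threshold

confident = decide({"billing": 0.95, "support": 0.04, "legal": 0.01})
uncertain = decide({"billing": 0.55, "support": 0.40, "legal": 0.05})

print(confident)  # ('billing', 0.95, True)
print(uncertain)  # ('billing', 0.55, False)
```

In the second case the email is still classified, but the low confidence means it would be left for a human operator to answer.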
Optionally, Aiom can also incorporate so-called Golden Rules, which are rules defined by the client, fully configurable at any time, allowing classification based on:
· The sender’s address or the addresses in To or Cc.
· Regular expressions, which define the rule activation criteria by searching for relevant text or patterns in:
  · The email subject
  · The email body
  · Any attachments
· Any other client-specified requirement.
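A rule of this kind could be modeled as follows. This is a hypothetical sketch, not Aiom's real data model: the `Email` and `GoldenRule` classes, their field names, and the example addresses and pattern are all invented for illustration. Each rule may constrain the sender, a recipient in To or Cc, and a regular expression searched in the subject, the body, and the text extracted from attachments; every criterion that is set must hold for the rule to fire.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Email:
    sender: str
    recipients: list          # addresses in To and Cc
    subject: str
    body: str
    attachment_texts: list = field(default_factory=list)  # text extracted from attachments

@dataclass
class GoldenRule:
    category: str
    sender: Optional[str] = None     # sender address to match, if set
    recipient: Optional[str] = None  # address that must appear in To or Cc, if set
    pattern: Optional[str] = None    # regex searched in subject, body, attachments

    def matches(self, email):
        """True if every criterion set on this rule is satisfied."""
        if self.sender and email.sender.lower() != self.sender.lower():
            return False
        if self.recipient and self.recipient.lower() not in [r.lower() for r in email.recipients]:
            return False
        if self.pattern:
            fields = [email.subject, email.body, *email.attachment_texts]
            if not any(re.search(self.pattern, f, re.IGNORECASE) for f in fields):
                return False
        return True

def apply_rules(email, rules):
    """Return the category of the first matching rule, or None if none match."""
    for rule in rules:
        if rule.matches(email):
            return rule.category
    return None

rules = [
    GoldenRule(category="invoices", pattern=r"\binvoice\s+no\.?\s*\d+"),
    GoldenRule(category="complaints", sender="complaints@example.com"),
]

invoice_mail = Email("client@example.com", ["info@example.com"],
                     "Invoice no. 1234", "See attachment.")
other_mail = Email("client@example.com", ["info@example.com"],
                   "Hello", "Just checking in.")

print(apply_rules(invoice_mail, rules))  # invoices
print(apply_rules(other_mail, rules))    # None
```

Rules are evaluated in order, so the first match wins; a rule with no criteria set would match every email, which a real configuration would probably want to reject.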
This is useful for several reasons:
· Organizations sometimes need to classify according to specific criteria. In such cases, the Machine Learning model is queried only if a classification cannot be determined by this initial filter.
· Sometimes there is no historical dataset of already classified data. With the described method, the software can still function, classifying incoming emails while building a dataset that can later be used to train a Machine Learning model. Such a dataset may be more precise than one based on human classifications, because it maintains uniformity in classification, which becomes increasingly difficult as the number of categories grows, especially when several people are involved.
Consider the case where multiple human operators are tasked with reading and classifying emails: the likelihood that the same message receives different classifications is not low and increases with the number of categories, resulting in a low-quality dataset for building a Machine Learning model.
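The two points above, rules first with the model as fallback and rule-based labeling as a way to build a training set, can be sketched together. The code below is only an assumed illustration: the `classify` function, the representation of rules and model as plain callables, and the example categories are not taken from Aiom.

```python
def classify(text, rules, model=None, dataset=None):
    """Try client-defined rules first; query the model only if no rule applies.

    `rules` is a list of callables returning a category or None.
    `model` is an optional callable returning a category.
    `dataset`, if given, collects (text, category) pairs labeled by the rules.
    """
    for rule in rules:
        category = rule(text)
        if category is not None:
            if dataset is not None:
                dataset.append((text, category))  # rule-labeled example for later training
            return category, "rule"
    if model is not None:
        return model(text), "model"
    return None, "unclassified"

# Hypothetical rule: any text mentioning "invoice" goes to the invoices category.
rules = [lambda t: "invoices" if "invoice" in t.lower() else None]
dataset = []

first = classify("Invoice 42 attached", rules, dataset=dataset)
second = classify("Hello there", rules, dataset=dataset)
third = classify("Hello there", rules, model=lambda t: "other", dataset=dataset)

print(first)    # ('invoices', 'rule')
print(second)   # (None, 'unclassified')
print(third)    # ('other', 'model')
print(dataset)  # [('Invoice 42 attached', 'invoices')]
```

Because every rule-labeled example comes from the same deterministic criteria, the collected dataset avoids the disagreement between operators described above.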
The Aiom Solution: benefits
The solution described above is currently in use at a major international bank, where it manages the incoming flow of several email inboxes. For the client, the advantages of introducing this software can be summarized as:
· Performance optimization: Estimated reduction of processing time by 70%.
· Effort reduction: Fewer personnel allocated to reading, sorting, classifying, and responding to incoming emails.
· Precision increase: 94% of emails are handled correctly.
Authors: Carmelo Pitrelli and Domenico Rodilosso