Article: Natural Language Processing with Java – Second Edition: Book Review and Interview
Article originally posted on InfoQ.
- Natural Language Processing (NLP) is a specialized form of machine learning that is tailored for text. Since humans work with text, often in a verbal form, it is a good problem domain for neural networks.
- An NLP training model is essentially a neural network that has been trained to handle specific problems with a specific type of data.
- In the future, NLP will enable more automated responses and a better understanding of human conversations. Commands to our phones and computers will be handled with a higher degree of sophistication.
- Improvements to sentiment analysis will come from NLP neural network improvements and better data. Often overlooked is the quality of data.
- Selecting the right neural network for the right problem is a challenge. It’s important to use the correct type of neural network for the problem at hand.
The book Natural Language Processing with Java – Second Edition, authored by Richard M. Reese and Ashish Singh Bhatia, covers Natural Language Processing (NLP) and the various tools developers can use in their applications.
Technologies discussed in the book include Apache OpenNLP, Stanford NLP, LingPipe, GATE, UIMA, and Apache Lucene Core.
The authors discuss the NLP model, which includes the following steps:
- Identifying the task
- Selecting a model
- Training the model
- Verifying and using the model
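These steps can be sketched in miniature with plain Java. The toy classifier below is illustrative, not from the book: a scored word table stands in for a real trained network, and the class and data are invented for the example. It shows the train/verify/use cycle end to end.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the identify/select/train/verify/use cycle
// using a naive bag-of-words sentiment scorer (not a real neural network).
public class NlpPipelineSketch {
    private final Map<String, Integer> wordScores = new HashMap<>();

    // Train: learn a score for each word from labeled examples.
    void train(Map<String, Boolean> labeledTexts) {
        labeledTexts.forEach((text, positive) -> {
            for (String w : text.toLowerCase().split("\\s+")) {
                wordScores.merge(w, positive ? 1 : -1, Integer::sum);
            }
        });
    }

    // Use: classify unseen text by summing learned word scores.
    boolean isPositive(String text) {
        int score = 0;
        for (String w : text.toLowerCase().split("\\s+")) {
            score += wordScores.getOrDefault(w, 0);
        }
        return score > 0;
    }

    public static void main(String[] args) {
        NlpPipelineSketch model = new NlpPipelineSketch();
        Map<String, Boolean> training = new HashMap<>();
        training.put("great book very helpful", true);
        training.put("terrible confusing waste", false);
        model.train(training);
        // Verify: check the model on held-out text before using it.
        System.out.println(model.isPositive("very helpful examples")); // true
    }
}
```

The "select a model" step here is trivial (one hard-coded scorer); with a real toolkit such as OpenNLP it means choosing and loading an appropriate pre-trained or custom model.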
Other topics covered in the book include finding parts of text, finding sentences, finding people and things, and detecting parts of speech.
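As a flavor of sentence finding, the JDK alone can do a rough version with `java.text.BreakIterator`; the book itself demonstrates dedicated libraries such as OpenNLP and the Stanford API for this, so treat the sketch below as a baseline, not the book's approach.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Rough sentence detection using only the JDK's BreakIterator.
public class SentenceFinder {
    static List<String> findSentences(String text) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        int start = it.first();
        int end = it.next();
        while (end != BreakIterator.DONE) {
            sentences.add(text.substring(start, end).trim());
            start = end;
            end = it.next();
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> result = findSentences("NLP is useful. It powers chatbots!");
        System.out.println(result.size());  // 2
        System.out.println(result.get(0));  // NLP is useful.
    }
}
```

BreakIterator is rule-based and locale-aware but easily fooled by abbreviations, which is one reason trained statistical sentence detectors exist.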
The authors also discuss Deep Learning for Java as well as classifying documents and text, which can be used for spam detection and sentiment analysis. The OpenNLP, Stanford API, and LingPipe frameworks are used to classify text.
Reese also authored video lessons on this topic.
InfoQ spoke with Reese about the book and video lessons and how NLP can be used in enterprise applications.
InfoQ: Can you describe how NLP works and how it is different from traditional machine learning (ML) techniques?
Richard M Reese: Simple natural language processing can be supported by the Java core SDK using numerous standard classes and methods. However, the more sophisticated NLP tasks require the use of specialized libraries. Popular libraries include OpenNLP, the Stanford NLP API, and LingPipe.
Many NLP techniques use neural networks to implement an NLP task. Models are trained against sample data and can then be used for specific problems. For common problems and natural languages, numerous models are available that can be readily used by a developer. In unique situations, models need to be trained using specialized data sets. Even for a language such as English, there are specialized domains, such as medical journals and textese, that require unique models. Given a trained model, similar data is submitted to the neural network which then performs the analysis.
NLP differs from traditional machine learning in several ways. NLP is a specialized form of machine learning that is tailored for text. Since humans work with text, often in a verbal form, it is a good problem domain for neural networks. Machine learning is concerned with other tasks such as analyzing visual images and audio input. It is also useful for supporting the manipulation of objects commonly used for robot type applications.
InfoQ: Can you discuss the NLP training models?
Reese: A training model is essentially a neural network that has been trained to handle specific problems with a specific type of data. For example, we can train a neural network to determine the sentiment of text by feeding it data representing the problem. Once trained, similar data can be supplied and the neural network will perform the analysis, hopefully with good results.
There are different types of neural networks which differ based on the number of layers used and the type of interconnections. An artificial neural network mimics the network of neurons found in the brain, though they are not nearly that complex. The various weights assigned to neurons change as the model is being trained.
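The weight-adjustment idea can be seen in a single artificial neuron. This minimal perceptron is a hypothetical example, far simpler than the networks discussed above, but it shows concretely how weights change as the model is trained.

```java
// A single artificial neuron (perceptron) whose weights are adjusted
// during training, illustrating how training reshapes a network.
public class PerceptronSketch {
    double w1 = 0, w2 = 0, bias = 0;

    // Fire (1) if the weighted sum of inputs exceeds zero.
    int predict(int x1, int x2) {
        return (w1 * x1 + w2 * x2 + bias) > 0 ? 1 : 0;
    }

    // The classic perceptron update rule: nudge each weight in
    // proportion to the prediction error.
    void train(int[][] inputs, int[] labels, int epochs) {
        double rate = 0.1;
        for (int e = 0; e < epochs; e++) {
            for (int i = 0; i < inputs.length; i++) {
                int error = labels[i] - predict(inputs[i][0], inputs[i][1]);
                w1 += rate * error * inputs[i][0];
                w2 += rate * error * inputs[i][1];
                bias += rate * error;
            }
        }
    }

    public static void main(String[] args) {
        PerceptronSketch p = new PerceptronSketch();
        int[][] inputs = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
        int[] labels = {0, 1, 1, 1};          // logical OR
        p.train(inputs, labels, 20);
        System.out.println(p.predict(1, 0));  // 1
        System.out.println(p.predict(0, 0));  // 0
    }
}
```

Deep networks differ in scale and layering, but the core mechanism is the same: weights start out uninformative and are repeatedly adjusted against training data.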
InfoQ: What are some enterprise applications of NLP?
Reese: There are numerous applications that use NLP. For example, customer service can be improved by automatically analyzing customer feedback and interactions. Chatbots are being used to engage a customer and determine the specifics of their concerns. Sentiment analysis can determine how a customer feels about a product or service. The placement of ads can be influenced by analyzing comments a potential customer makes.
In the future we will witness improvements in the ability to derive meaning from communications. This will enable more automated responses and a better understanding of human conversations. Commands to our phones and computers will be handled with a higher degree of sophistication.
InfoQ: Can you discuss how NLP can help with sentiment analysis related use cases?
Reese: A traditional use of sentiment analysis has been to determine whether a review is positive or negative. Based on the analysis, tweaks can be made to a product or service, or a user can better determine whether a product is the right one for them. Services such as Netflix currently provide show or movie recommendations. From my personal experience they are not very accurate. As improvements in sentiment analysis occur and as multiple sources of user input become available, such recommendations will improve.
The results of the analysis are determined by the quality of the model and the quality of the data. Improvements to sentiment analysis will come from NLP neural network improvements and better data. Often overlooked is the quality of data. Data must be cleaned and put into the proper format before it is used for training and analysis purposes.
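A minimal sketch of this kind of cleanup in plain Java follows; the rules here (lowercasing, stripping punctuation, collapsing whitespace) are illustrative, and real pipelines are considerably more involved.

```java
import java.util.Locale;

// Simple text normalization of the kind typically done before
// data is used for training and analysis.
public class TextCleaner {
    static String clean(String raw) {
        return raw.toLowerCase(Locale.US)
                  .replaceAll("[^a-z0-9\\s]", " ")  // drop punctuation and symbols
                  .replaceAll("\\s+", " ")          // collapse runs of whitespace
                  .trim();
    }

    public static void main(String[] args) {
        System.out.println(clean("  GREAT product!!  5/5, would buy again... "));
        // great product 5 5 would buy again
    }
}
```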
InfoQ: How can we use NLP for classifying text and documents?
Reese: Sample sets of data are used to train models. Often these sets contain not only the text to be classified, but also the desired output. That is, if a specific text message is known to be positive, then a positive attribute is assigned to it. This is known as supervised learning. With a large enough set of data, the model can be trained to recognize similar reviews, either positive or negative. The larger the set, and the more reflective its contents are of the problem at hand, the better the quality of the results.
When the data set does not contain an attribute specifying the output, then this is called unsupervised learning. The training process will organize what it considers to be similar types of documents and assign tags to them. It is a time-consuming process to create data sets that have been assigned outcomes. Avoiding this process is the chief advantage of unsupervised models, though the classification is a more difficult process.
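A toy illustration of unsupervised grouping follows. This naive word-overlap scheme is a stand-in for real clustering algorithms, not anything from the book: documents arrive with no labels, get assigned to whichever existing cluster shares the most vocabulary, and each cluster receives an arbitrary tag.

```java
import java.util.*;

// Unsupervised grouping sketch: documents are clustered by shared
// vocabulary, with no labels supplied; each cluster gets a numeric tag.
public class DocGrouper {
    static Map<Integer, List<String>> group(List<String> docs) {
        Map<Integer, List<String>> clusters = new LinkedHashMap<>();
        Map<Integer, Set<String>> vocab = new HashMap<>();
        int nextTag = 0;
        for (String doc : docs) {
            Set<String> words =
                new HashSet<>(Arrays.asList(doc.toLowerCase().split("\\s+")));
            int bestTag = -1, bestOverlap = 0;
            for (Map.Entry<Integer, Set<String>> e : vocab.entrySet()) {
                Set<String> shared = new HashSet<>(words);
                shared.retainAll(e.getValue());   // words in common with cluster
                if (shared.size() > bestOverlap) {
                    bestOverlap = shared.size();
                    bestTag = e.getKey();
                }
            }
            if (bestTag < 0) {                    // no overlap: start a new cluster
                bestTag = nextTag++;
                clusters.put(bestTag, new ArrayList<>());
                vocab.put(bestTag, new HashSet<>());
            }
            clusters.get(bestTag).add(doc);
            vocab.get(bestTag).addAll(words);
        }
        return clusters;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> result = group(Arrays.asList(
            "the game ended in overtime",
            "markets fell on rate fears",
            "the game drew a record crowd"));
        System.out.println(result.size()); // 2
    }
}
```

No one told the program that two documents were about sports; the grouping emerged from the data, which is the essence of the unsupervised approach, along with its difficulty: the tags still need human interpretation.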
InfoQ: What are the challenges of NLP?
Reese: Challenges can be found at multiple levels. The neural network architecture, that is, the number of layers and how they are interconnected, continues to evolve. One challenge is to design better neural network frameworks.
Selecting the right neural network for the right problem is another challenge. The old saying about using the right tool for the right job fits well here. We don't want to use a sledgehammer to hang a picture on the wall. Likewise, it is important to use the correct type of neural network for the problem at hand.
The model trained is only as good as the data. The data needs to be comprehensive, correct, and relatively free of bad data points. Preparing the data is often the most time-consuming and important part of the process.
Another important factor is correct interpretation of the results. Sometimes the analysis output is represented by a set of numbers measuring different aspects of the results. If these are interpreted incorrectly, then the overall effort may deliver less value than it otherwise would.
InfoQ: What are Natural Language Understanding (NLU) and Natural-Language Generation (NLG)? How are they different from NLP?
Reese: NLU is concerned with deriving meaning from text and producing data that reflects this meaning. NLG involves the creation of text that sounds and flows naturally. NLU attempts to understand what a human may mean by a statement such as, “Send the message to Sue.” Which message is the command referring to? How should it be sent? If there are multiple Sues, which one? Answering these questions is not always easy for a computer. Advances in NLU improve the ability of computers to derive meaning from text.
When a computer needs to communicate with a user, the text generated should be clear and natural. The old Mad Libs type of text, where a template is filled with often randomly chosen words, typifies an approach that does not generate the type of text most people would like to hear. Instead, NLG works to generate text that is more pleasing to the human ear. NLU and NLG are subfields of NLP.
InfoQ: What are the emerging trends happening now in NLP space?
Reese: It is a continually evolving game. We will see improvements in NLU/NLG which will give rise to new capabilities and applications. Personal assistants similar to Alexa and OK Google will assist humans in all sorts of endeavors. More companies will be rolling out NLP applications that will often be “home grown”, that is, they will not be Amazon or Google based. Instead they may well rely upon technology produced by other NLP vendors such as IBM.
Many NLP applications will incorporate hybrid approaches where analysis techniques are paired with human intervention to provide a more meaningful and satisfying response. When the NLP techniques reach their limits, a human will intervene. Hand crafted responses are currently being used for specific, limited problem domains. For example, personal assistants can only answer certain types of queries. The seemingly more capable ones are structured to handle a tightly defined set of interactions.
NLP processing will become more distributed. Both the training and the data sets may be distributed across a variety of platforms. Smartphones and similar devices will have ML functionality built into them in the form of specialized processors. This again will usher in new uses for NLP technology. Data will come from a more diverse set of sources as sensors and actuators become more prevalent in society.
About the Book Author
Richard M. Reese has worked in both industry and academia. For 17 years, he worked in the telephone and aerospace industries, serving in several capacities, including research and development, software development, supervision, and training. He currently teaches at Tarleton State University. Richard has written several Java books and a book on C pointers. He uses a concise, easy-to-follow approach to teaching. His Java books have addressed EJB 3.1, updates to Java 7 and 8, certification, functional programming, jMonkeyEngine, and natural language processing.