Natural Language Processing with TensorFlow
Teach language to machines using Python's deep learning library
Thushan Ganegedara
BIRMINGHAM - MUMBAI
Natural Language Processing with TensorFlow
Copyright © 2018 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Acquisition Editor: Frank Pohlmann
Project Editor: Radhika Atitkar
Content Development Editor: Chris Nelson
Technical Editor: Bhagyashree Rai
Copy Editor: Tom Jacob
Proofreader: Safis Editing
Indexer: Rekha Nair
Graphics: Tom Scaria
Production Coordinator: Nilesh Mohite
First published: May 2018
Production reference: 2310518
Published by Packt Publishing Ltd.
Livery Place, 35 Livery Street, Birmingham B3 2PB, UK.
ISBN 978-1-78847-831-1
www.packtpub.com
mapt.io
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Why subscribe?
• Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
• Learn better with Skill Plans built especially for you
• Get a free eBook or video every month
• Mapt is fully searchable
• Copy and paste, print, and bookmark content
PacktPub.com
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Contributors
About the author
Thushan Ganegedara is currently a third-year Ph.D. student at the University of Sydney, Australia. He specializes in machine learning and has a liking for deep learning. He lives dangerously and runs algorithms on untested data. He also works as the chief data scientist for AssessThreat, an Australian start-up. He received his BSc (Hons) from the University of Moratuwa, Sri Lanka. He frequently writes technical articles and tutorials about machine learning. Additionally, he strives for a healthy lifestyle by including swimming in his daily schedule.
I would like to thank my parents, my siblings, and my wife for the faith they had in me and the support they have given, as well as all my teachers and my Ph.D. advisor for the guidance he provided me.
About the reviewers
Motaz Saad holds a Ph.D. in computer science from the University of Lorraine. He loves data and likes to play with it. He has over 10 years of professional experience in NLP, computational linguistics, data science, and machine learning. He currently works as an assistant professor at the Faculty of Information Technology, IUG.
Dr Joseph O'Connor is a data scientist with a deep passion for deep learning. His company, Deep Learn Analytics, a UK-based data science consultancy, works with businesses to develop machine learning applications and infrastructure from concept to deployment. He was awarded a Ph.D. from University College London for his work analyzing data from the MINOS high-energy physics experiment. Since then, he has developed ML products for a number of companies in the private sector, specializing in NLP and time series forecasting. You can find him at http://deeplearnanalytics.com/.
Packt is searching for authors like you
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Table of Contents

Preface

Chapter 1: Introduction to Natural Language Processing
    What is Natural Language Processing?
    Tasks of Natural Language Processing
    The traditional approach to Natural Language Processing
        Understanding the traditional approach
            Example – generating football game summaries
        Drawbacks of the traditional approach
    The deep learning approach to Natural Language Processing
        History of deep learning
        The current state of deep learning and NLP
        Understanding a simple deep model – a Fully-Connected Neural Network
    The roadmap – beyond this chapter
    Introduction to the technical tools
        Description of the tools
        Installing Python and scikit-learn
        Installing Jupyter Notebook
        Installing TensorFlow
    Summary

Chapter 2: Understanding TensorFlow
    What is TensorFlow?
        Getting started with TensorFlow
        TensorFlow client in detail
        TensorFlow architecture – what happens when you execute the client?
        Cafe Le TensorFlow – understanding TensorFlow with an analogy
    Inputs, variables, outputs, and operations
        Defining inputs in TensorFlow
            Feeding data with Python code
            Preloading and storing data as tensors
            Building an input pipeline
        Defining variables in TensorFlow
        Defining TensorFlow outputs
        Defining TensorFlow operations
            Comparison operations
            Mathematical operations
            Scatter and gather operations
            Neural network-related operations
    Reusing variables with scoping
    Implementing our first neural network
        Preparing the data
        Defining the TensorFlow graph
        Running the neural network
    Summary

Chapter 3: Word2vec – Learning Word Embeddings
    What is a word representation or meaning?
    Classical approaches to learning word representation
        WordNet – using an external lexical knowledge base for learning word representations
            Tour of WordNet
            Problems with WordNet
        One-hot encoded representation
        The TF-IDF method
        Co-occurrence matrix
    Word2vec – a neural network-based approach to learning word representation
        Exercise: is queen = king – he + she?
        Designing a loss function for learning word embeddings
    The skip-gram algorithm
        From raw text to structured data
        Learning the word embeddings with a neural network
        Formulating a practical loss function
        Efficiently approximating the loss function
        Implementing skip-gram with TensorFlow
    The Continuous Bag-of-Words algorithm
        Implementing CBOW in TensorFlow
    Summary

Chapter 4: Advanced Word2vec
    The original skip-gram algorithm
        Implementing the original skip-gram algorithm
        Comparing the original skip-gram with the improved skip-gram
    Comparing skip-gram with CBOW
        Performance comparison
        Which is the winner, skip-gram or CBOW?
    Extensions to the word embeddings algorithms
        Using the unigram distribution for negative sampling
        Implementing unigram-based negative sampling
        Subsampling – probabilistically ignoring the common words
        Implementing subsampling
        Comparing the CBOW and its extensions
    More recent algorithms extending skip-gram and CBOW
        A limitation of the skip-gram algorithm
        The structured skip-gram algorithm
        The loss function
        The continuous window model
    GloVe – Global Vectors representation
        Understanding GloVe
        Implementing GloVe
    Document classification with Word2vec
        Dataset
        Classifying documents with word embeddings
        Implementation – learning word embeddings
        Implementation – word embeddings to document embeddings
        Document clustering and t-SNE visualization of embedded documents
        Inspecting several outliers
        Implementation – clustering/classification of documents with K-means
    Summary

Chapter 5: Sentence Classification with Convolutional Neural Networks
    Introducing Convolutional Neural Networks
        CNN fundamentals
        The power of Convolutional Neural Networks
    Understanding Convolutional Neural Networks
        Convolution operation
            Standard convolution operation
            Convolving with stride
            Convolving with padding
            Transposed convolution
        Pooling operation
            Max pooling
            Max pooling with stride
            Average pooling
        Fully connected layers
        Putting everything together
    Exercise – image classification on MNIST with CNN
        About the data
        Implementing the CNN
        Analyzing the predictions produced with a CNN
    Using CNNs for sentence classification
        CNN structure
            Data transformation
            The convolution operation
            Pooling over time
        Implementation – sentence classification with CNNs
    Summary

Chapter 6: Recurrent Neural Networks
    Understanding Recurrent Neural Networks
        The problem with feed-forward neural networks
        Modeling with Recurrent Neural Networks
        Technical description of a Recurrent Neural Network
    Backpropagation Through Time
        How backpropagation works
        Why we cannot use BP directly for RNNs
        Backpropagation Through Time – training RNNs
        Truncated BPTT – training RNNs efficiently
        Limitations of BPTT – vanishing and exploding gradients
    Applications of RNNs
        One-to-one RNNs
        One-to-many RNNs
        Many-to-one RNNs
        Many-to-many RNNs
    Generating text with RNNs
        Defining hyperparameters
        Unrolling the inputs over time for Truncated BPTT
        Defining the validation dataset
        Defining weights and biases
        Defining state persisting variables
        Calculating the hidden states and outputs with unrolled inputs
        Calculating the loss
        Resetting state at the beginning of a new segment of text
        Calculating validation output
        Calculating gradients and optimizing
        Outputting a freshly generated chunk of text
    Evaluating text results output from the RNN
    Perplexity – measuring the quality of the text result
    Recurrent Neural Networks with Context Features – RNNs with longer memory
        Technical description of the RNN-CF
        Implementing the RNN-CF
            Defining the RNN-CF hyperparameters
            Defining input and output placeholders
            Defining weights of the RNN-CF
            Variables and operations for maintaining hidden and context states
            Calculating output
            Calculating the loss
            Calculating validation output
            Computing test output
            Computing the gradients and optimizing
        Text generated with the RNN-CF
    Summary

Chapter 7: Long Short-Term Memory Networks
    Understanding Long Short-Term Memory Networks
        What is an LSTM?
        LSTMs in more detail
        How LSTMs differ from standard RNNs
    How LSTMs solve the vanishing gradient problem
        Improving LSTMs
            Greedy sampling
            Beam search
            Using word vectors
            Bidirectional LSTMs (BiLSTM)
    Other variants of LSTMs
        Peephole connections
        Gated Recurrent Units
    Summary

Chapter 8: Applications of LSTM – Generating Text
    Our data
        About the dataset
        Preprocessing data
    Implementing an LSTM
        Defining hyperparameters
        Defining parameters
        Defining an LSTM cell and its operations
        Defining inputs and labels
        Defining sequential calculations required to process sequential data
        Defining the optimizer
        Decaying learning rate over time
        Making predictions
        Calculating perplexity (loss)
        Resetting states
        Greedy sampling to break unimodality
        Generating new text
        Example generated text
    Comparing LSTMs to LSTMs with peephole connections and GRUs
        Standard LSTM
            Review
            Example generated text
        Gated Recurrent Units (GRUs)
            Review
            The code
            Example generated text
        LSTMs with peepholes
            Review
            The code
            Example generated text
        Training and validation perplexities over time
    Improving LSTMs – beam search
        Implementing beam search
        Examples generated with beam search
    Improving LSTMs – generating text with words instead of n-grams
        The curse of dimensionality
        Word2vec to the rescue
        Generating text with Word2vec
        Examples generated with LSTM-Word2vec and beam search
        Perplexity over time
    Using the TensorFlow RNN API
    Summary

Chapter 9: Applications of LSTM – Image Caption Generation
    Getting to know the data
        ILSVRC ImageNet dataset
        The MS-COCO dataset
    The machine learning pipeline for image caption generation
    Extracting image features with CNNs
    Implementation – loading weights and inferencing with VGG-16
        Building and updating variables
        Preprocessing inputs
        Inferring VGG-16
        Extracting vectorized representations of images
        Predicting class probabilities with VGG-16
    Learning word embeddings
    Preparing captions for feeding into LSTMs
    Generating data for LSTMs
    Defining the LSTM
    Evaluating the results quantitatively
        BLEU
        ROUGE
        METEOR
        CIDEr
        BLEU-4 over time for our model
    Captions generated for test images
    Using TensorFlow RNN API with pretrained GloVe word vectors
        Loading GloVe word vectors
        Cleaning data
        Using pretrained embeddings with TensorFlow RNN API
            Defining the pretrained embedding layer and the adaptation layer
            Defining the LSTM cell and softmax layer
            Defining inputs and outputs
            Processing images and text differently
            Defining the LSTM output calculation
            Defining the logits and predictions
            Defining the sequence loss
            Defining the optimizer
    Summary

Chapter 10: Sequence-to-Sequence Learning – Neural Machine Translation
    Machine translation
    A brief historical tour of machine translation
        Rule-based translation
        Statistical Machine Translation (SMT)
        Neural Machine Translation (NMT)
    Understanding Neural Machine Translation
        Intuition behind NMT
        NMT architecture
            The embedding layer
            The encoder
            The context vector
            The decoder
    Preparing data for the NMT system
        At training time
        Reversing the source sentence
        At testing time
    Training the NMT
    Inference with NMT
    The BLEU score – evaluating the machine translation systems
        Modified precision
        Brevity penalty
        The final BLEU score
    Implementing an NMT from scratch – a German to English translator
        Introduction to data
        Preprocessing data
        Learning word embeddings
        Defining the encoder and the decoder
        Defining the end-to-end output calculation
        Some translation results
    Training an NMT jointly with word embeddings
        Maximizing matchings between the dataset vocabulary and the pretrained embeddings
        Defining the embeddings layer as a TensorFlow variable
    Improving NMTs
        Teacher forcing
        Deep LSTMs
    Attention
        Breaking the context vector bottleneck
        The attention mechanism in detail
        Implementing the attention mechanism
            Defining weights
            Computing attention
        Some translation results – NMT with attention
        Visualizing attention for source and target sentences
    Other applications of Seq2Seq models – chatbots
        Training a chatbot
        Evaluating chatbots – Turing test
    Summary

Chapter 11: Current Trends and the Future of Natural Language Processing
    Current trends in NLP
        Word embeddings
            Region embedding
            Probabilistic word embedding
            Ensemble embedding
            Topic embedding
        Neural Machine Translation (NMT)