Machine Learning for Text Prediction

Dave Page November 5, 2020

In a previous blog post, I talked about using Machine Learning for Capacity Management as I began a journey exploring how machine learning techniques can be used with and as part of PostgreSQL. Machine Learning has numerous applications of course, and the idea of text prediction piqued my interest.

Those who know me may be aware that not only do I lead the pgAdmin project (part of which involves developing and running the website), but I'm also part of the PostgreSQL web and sysadmin teams who look after the PostgreSQL website. As I was researching natural language processing, I started thinking that maybe machine learning techniques could be used to generate suggestions for search criteria based on the contents of our product documentation, the idea being that by offering intelligent auto-completion based on the actual site content as the user types, we could offer a far better user experience. An example of this is the way that, when you start typing into the Google search box, it offers you suggestions to complete your search query.
 

Data Preparation

Preparing the data is perhaps the most important (and possibly the most complex) step when training a model to perform text prediction or other natural language processing functions. There are a couple of phases to this:

First, we need to extract the data and clean it up in order to create the corpus (a structured textual dataset) that the network will be trained with. In my experimental script this involved a number of steps:

  • Iterate through all the HTML files in the source directory, and for each:
    • Extract the data from the <p> tags and convert it to lower case.
    • Break up each paragraph into a set of sentences.
    • Remove any punctuation.
    • Append each resulting sentence to a list if it contains more than one word.

At the end of this process we're left with a list of sentences, all in lowercase, consisting of at least two words and no punctuation.
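
The cleanup steps above can be sketched roughly as follows. This is a minimal illustration rather than the code from my script: it assumes the standard library's html.parser in place of whatever HTML library you prefer, uses a deliberately naive rule for splitting sentences, and the names ParagraphExtractor, html_to_sentences and corpus_from_directory are made up for this example.

```python
import re
import string
from html.parser import HTMLParser
from pathlib import Path


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside each <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_paragraph = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            self.paragraphs.append("".join(self._chunks))
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self._chunks.append(data)


# Translation table that deletes every ASCII punctuation character.
STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def html_to_sentences(html):
    """Return lowercase, punctuation-free sentences of two or more words."""
    parser = ParagraphExtractor()
    parser.feed(html)
    sentences = []
    for paragraph in parser.paragraphs:
        text = " ".join(paragraph.lower().split())
        # Naive assumption: a sentence ends at '.', '!' or '?' plus whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            words = sentence.translate(STRIP_PUNCTUATION).split()
            if len(words) > 1:
                sentences.append(" ".join(words))
    return sentences


def corpus_from_directory(directory):
    """Build the corpus from every .html file below the given directory."""
    corpus = []
    for path in sorted(Path(directory).rglob("*.html")):
        html = path.read_text(encoding="utf-8", errors="ignore")
        corpus.extend(html_to_sentences(html))
    return corpus
```

Text outside the <p> tags (headings, navigation and so on) is ignored, and single-word fragments are dropped because they offer nothing to predict from.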

Now that we have a corpus of text to work with, we need to get it into a format that we can process with a TensorFlow model. There are a number of steps to this process as well:

  • Tokenize the text, and create a dictionary of numeric word IDs and the corresponding words.
  • For each sentence in the corpus, create sequences of word IDs representing every leading sub-sentence, up to and including the full sentence.
  • Pre-pad the sequences with as many zeros as are required to ensure that all the sequences have the same length, i.e. the length of the longest sentence.
  • Break off the last word ID from each sequence, so we're left with the list of all the preceding sequences (the inputs) and a separate list of the final words (the result or label) from each sequence.
  • One-hot encode the list of labels.

Once we've done this, we have the data we need to train a model: a set of numeric input values that represent the input strings, and a corresponding set of numeric result values that represent the expected final word for each sequence.
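
To make those steps concrete, here is a small pure-Python sketch of the same transformation. It is an illustration under stated assumptions, not the code from my script: in practice a library tokenizer and padding helper would normally do this work, word IDs here are assigned in order of first appearance rather than by frequency, and build_word_index and make_training_data are hypothetical names.

```python
def build_word_index(sentences):
    """Map each word to an integer ID, starting at 1 (0 is kept for padding)."""
    word_index = {}
    for sentence in sentences:
        for word in sentence.split():
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    return word_index


def make_training_data(sentences, word_index):
    """Return (inputs, one-hot labels) built from every leading sub-sentence."""
    # Each prefix of two or more words becomes one training sequence.
    sequences = []
    for sentence in sentences:
        ids = [word_index[word] for word in sentence.split()]
        for end in range(2, len(ids) + 1):
            sequences.append(ids[:end])

    # Pre-pad with zeros so every sequence is as long as the longest one.
    max_len = max(len(sequence) for sequence in sequences)
    padded = [[0] * (max_len - len(seq)) + seq for seq in sequences]

    # The last ID of each sequence is the label; the rest is the input.
    inputs = [seq[:-1] for seq in padded]
    vocab_size = len(word_index) + 1
    labels = []
    for seq in padded:
        one_hot = [0] * vocab_size
        one_hot[seq[-1]] = 1
        labels.append(one_hot)
    return inputs, labels


sentences = ["create a table", "drop table"]
word_index = build_word_index(sentences)
inputs, labels = make_training_data(sentences, word_index)
```

With those two sentences, "create a table" yields the sequences "create a" and "create a table", so the network sees both "create" predicting "a" and "create a" predicting "table".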
 

Model

In my previous experiment with time series prediction, I used a model that implemented the WaveNet architecture: multiple one-dimensional convolutional layers with increasing dilation that allowed it to detect and learn seasonality in the data. In this case, I'm using a Recurrent Neural Network (RNN) consisting of multiple bidirectional layers of Long Short-Term Memory (LSTM) units. The idea of an RNN is that it can handle "long term dependencies" by using past information to help provide context to the present. This works well with relatively small gaps between the past and present, but not so well when the gaps become longer. This is where LSTM units help, as they are able to remember (and forget) data from much earlier in the sequence, enabling the network to better connect the past data with the present.

Christopher Olah has an excellent blog post describing RNNs and LSTMs that can provide more in-depth information.

In this case, I'm creating a model that consists of the following layers:

  • An embedding layer.
  • Two bidirectional LSTM layers.
  • A dropout layer.
  • A dense layer.

The embedding layer converts the input data to fixed size dense vectors that match the size of the following layer.

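The LSTM layers are responsible for the actual work of learning the sequences. As the output shapes in the summary further down show, the first passes on an output for every position in the sequence, whilst the second reduces the whole sequence to a single vector.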

The dropout layer will randomly set some input units to zero during training. This helps prevent overfitting, where the model memorises the actual training data rather than learning the general characteristics of the data.
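
As a rough illustration of what happens during training, the sketch below applies "inverted" dropout to a plain list of numbers: each value is zeroed with the given probability and the survivors are scaled up so that the expected total stays the same. The dropout function here is a hypothetical stand-in written for this post, not the library layer, which operates on tensors and is only active during training.

```python
import random


def dropout(values, rate, rng):
    """Zero each value with probability `rate`; scale the rest by 1 / (1 - rate)."""
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else value * keep_scale
            for value in values]


rng = random.Random(0)  # seeded so that repeated runs drop the same units
activations = [1.0] * 1000
dropped = dropout(activations, 0.5, rng)
```

With a rate of 0.5, roughly half of the values come back as zero and the rest are doubled, so the network cannot rely on any single unit always being present.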

The dense layer provides us with a final vector of probabilities, one for each entry in the word index, from which the most likely next word can be chosen.

The model looks like this:

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 55, 256)           768000    
_________________________________________________________________
bidirectional (Bidirectional (None, 55, 512)           1050624   
_________________________________________________________________
bidirectional_1 (Bidirection (None, 512)               1574912   
_________________________________________________________________
dropout (Dropout)            (None, 512)               0         
_________________________________________________________________
dense (Dense)                (None, 3000)              1539000   
=================================================================
Total params: 4,932,536
Trainable params: 4,932,536
Non-trainable params: 0
_________________________________________________________________
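
The parameter counts in that summary can be reproduced with a little arithmetic, which is a handy check on how the layers fit together. The sketch below assumes a vocabulary of 3000 words, 256-dimensional embeddings and 256 LSTM units per direction (all inferred from the output shapes above), and the usual LSTM layout of four gates, each with input weights, recurrent weights and a bias. The function names are made up for this example.

```python
VOCAB_SIZE = 3000   # width of the dense layer in the summary
EMBED_DIM = 256     # last dimension of the embedding output
LSTM_UNITS = 256    # per direction; two directions give the 512 shown


def embedding_params(vocab_size, dim):
    # One vector of `dim` weights per word ID.
    return vocab_size * dim


def bidirectional_lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias,
    # doubled because a bidirectional layer contains two LSTMs.
    one_direction = 4 * ((input_dim + units) * units + units)
    return 2 * one_direction


def dense_params(input_dim, outputs):
    # A weight per input/output pair plus one bias per output.
    return input_dim * outputs + outputs


counts = [
    embedding_params(VOCAB_SIZE, EMBED_DIM),
    bidirectional_lstm_params(EMBED_DIM, LSTM_UNITS),
    bidirectional_lstm_params(2 * LSTM_UNITS, LSTM_UNITS),
    dense_params(2 * LSTM_UNITS, VOCAB_SIZE),
]
print(counts, sum(counts))
# prints [768000, 1050624, 1574912, 1539000] 4932536
```

The second LSTM layer is larger than the first only because its input is the 512-wide output of the first layer rather than the 256-wide embedding.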

 

Training

I split the code for this experiment into two parts: html-train.py, which is responsible for creating the model, training it, and saving both the model and the tokenizer data that contains the word index etc., and test-model.py, which will load a previously saved model and tokenizer data and allow it to be tested by hand. The code can be found on GitHub.

I immediately ran into problems when training the model; despite EDB providing me with a very high-spec MacBook Pro, it was going to take an extremely long time to run the training. This is largely because TensorFlow is not GPU optimised on macOS. Instead I spun up a GPU optimised Linux instance on Amazon AWS, which proved able to run the training at something like ten times the speed of my laptop. I was able to use the smallest machine type available (g4dn.xlarge); the code doesn't require huge amounts of RAM or CPU, and using multiple GPUs would require the code to be changed to support parallelism, which would complicate it more than seems worthwhile for this experiment.

Once I had run the training using a copy of the pgAdmin documentation in HTML format, I was left with a TensorFlow model file and the JSON file representing the tokenizer, both of which I copied to my laptop for testing.
 

Testing

As you'll recall, the aim of the experiment was to see if it is viable to offer users of the pgAdmin or PostgreSQL websites auto-complete options for their searches in the documentation. The test program loads the model and tokenizer and then prompts the user for an input word (or words), and offers a user-specified number of follow-on words. If the user requests more than one word, it doesn't predict them all at once. Instead, it predicts one word, adds it to the word(s) provided by the user, and then predicts the next word and so on, thus predicting each word based on the entirety of the sentence as it's constructed.
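
The word-at-a-time loop can be sketched as follows. To keep the example runnable without a trained model, predict_next is a hypothetical callable standing in for the network; in the real program that step would involve converting the words to IDs, padding them to the model's input length, and picking the most probable entry from the model's output. The toy lookup table is invented purely for illustration.

```python
def generate(seed_text, num_words, predict_next):
    """Extend seed_text one word at a time.

    predict_next is a hypothetical stand-in for the trained model: it is
    given the list of words so far and returns the single next word.
    """
    words = seed_text.lower().split()
    for _ in range(num_words):
        # Each prediction sees everything generated so far, not just the seed.
        words.append(predict_next(words))
    return " ".join(words)


def toy_predictor(words):
    """Invented lookup keyed on the last word, used only for demonstration."""
    table = {"trigger": "function", "function": "is", "is": "not"}
    return table.get(words[-1], "the")


result = generate("Trigger", 3, toy_predictor)
```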

Here's an example of a test session:

python3 test-model.py -d pgadmin-docs.json -m pgadmin-docs.h5

Enter text (blank to quit): trigger

Number of words to generate (default: 1): 

Results: trigger date

Enter text (blank to quit): table

Number of words to generate (default: 1): 

Results: table you

Enter text (blank to quit): trigger function

Number of words to generate (default: 1): 

Results: trigger function is

Enter text (blank to quit): trigger function

Number of words to generate (default: 1): 5

Results: trigger function is not the default of

Enter text (blank to quit): creating a table

Number of words to generate (default: 1): 10

Results: creating a table is no restrictions to the server and confirm deletion of

Enter text (blank to quit):

It's clear that those results are quite disappointing, and similar results were seen with various other tests with the pgAdmin documentation and also with the PostgreSQL documentation:

python3 test-model.py -d postgresql-docs.json -m postgresql-docs.h5

Enter text (blank to quit): trigger

Number of words to generate (default: 1): 

Results: trigger exclusion

Enter text (blank to quit): constraint

Number of words to generate (default: 1): 

Results: constraint name

Enter text (blank to quit): create table

Number of words to generate (default: 1): 5

Results: create table is also not to the

Enter text (blank to quit): max aggregate

Number of words to generate (default: 1): 5

Results: max aggregate page size of the data

Enter text (blank to quit):

It's safe to say that offering auto-complete suggestions such as those generated would almost certainly not improve the user experience for most users.
 

Conclusion

The results of this experiment were quite disappointing, though I have to say that wasn't entirely unexpected. Searching can be something of an art form. Whilst many search engines try to make the experience as natural as possible, users will almost certainly get the best results by using specific search terms and operators supported by each engine, rather than natural phrasing. Even so, it's clear in this experiment that the model really wasn't generating naturally phrased strings, which isn't overly surprising considering how complex and nuanced the English language is, and the fact that this is relatively simple code that was put together and experimented with over a couple of weeks. It's certainly possible that better results could be achieved by fine tuning the model or by improving the quality of the data preparation (maybe by including the content of the <title> tags, for example). However, is it worth it?

As a software developer given the task of providing auto-completion suggestions for search users, I would almost certainly approach the problem in a different way: log the queries that users execute, and when a user is typing in their criteria, perform a string prefix search of the logged queries using the text they've typed so far, thus offering suggestions based on what real people have searched for in the past. In a framework such as Django, the backend code to do that could be written in minutes.
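
As a rough idea of how little code that needs, here is an in-memory sketch of the approach. It is a sketch under assumptions rather than production code: a real site would keep the logged queries in a database table and use a prefix lookup there, and the QuerySuggester class and its method names are invented for this example.

```python
import bisect


class QuerySuggester:
    """Suggest previously logged search queries that start with a prefix."""

    def __init__(self):
        self._queries = []   # sorted list of distinct logged queries
        self._counts = {}    # query -> number of times it was searched

    def log(self, query):
        """Record one executed search, normalising case and whitespace."""
        query = " ".join(query.lower().split())
        if not query:
            return
        if query not in self._counts:
            bisect.insort(self._queries, query)
        self._counts[query] = self._counts.get(query, 0) + 1

    def suggest(self, prefix, limit=5):
        """Return up to `limit` matching queries, most popular first."""
        prefix = prefix.lower().lstrip()
        # In a sorted list, all matches sit together starting at this index.
        start = bisect.bisect_left(self._queries, prefix)
        matches = []
        for query in self._queries[start:]:
            if not query.startswith(prefix):
                break
            matches.append(query)
        matches.sort(key=lambda q: (-self._counts[q], q))
        return matches[:limit]


suggester = QuerySuggester()
for logged in ["create table", "create index", "Create Table",
               "trigger function", "create table"]:
    suggester.log(logged)
```

Ranking by how often each query has been searched means the most common phrasing rises to the top without any training at all.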

Another option might be to employ Markov chains to predict the text based on the training corpus. A nice example of that is shown in Ashwin M J's blog.
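
For comparison with the neural network, a first-order, word-level Markov chain can be sketched in a few lines. This is my own minimal illustration rather than the code from that blog: it looks only at the last word typed, always picks the most frequent follower instead of sampling one at random, and build_chain and predict are hypothetical names.

```python
from collections import Counter, defaultdict


def build_chain(sentences):
    """Count, for every word, which words follow it in the corpus."""
    chain = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for current, following in zip(words, words[1:]):
            chain[current][following] += 1
    return chain


def predict(chain, text, num_words=1):
    """Append up to num_words words, each the likeliest follower of the last."""
    words = text.lower().split()
    for _ in range(num_words):
        followers = chain.get(words[-1]) if words else None
        if not followers:
            break  # the last word was never seen with a follower
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)


corpus = [
    "create a table",
    "create a trigger function",
    "create a table constraint",
]
chain = build_chain(corpus)
```

Because it only ever replays word pairs that actually occur in the corpus, every two-word step it suggests is one that appears somewhere in the documentation, although longer strings can still wander.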

It was a fun experiment, but using non-machine learning techniques for this particular task would be far easier to implement and would almost certainly yield much better results for individual websites.


Dave Page, Vice President & Chief Architect, Database Infrastructure

Dave Page is Vice President and Chief Architect, Database Infrastructure, currently working in the CTO team on research and development, best practices with Postgres, and providing high-level guidance and support for key customers. Dave has been working with PostgreSQL since 1998 and is one of five members of the open source project's Core Team, as well as serving as Secretary of the Board of PostgreSQL Europe and Chairman of the PostgreSQL Community Association of Canada. He joined EDB in 2007 and has been influential in the company’s direction and development of critical database management tools and product packaging and deployment. Prior to EDB, Dave spent more than a decade with The Vale Housing Association as Head of IT. He joined the organization after spending four years as an electronics technician with the Department of Particle and Nuclear Physics at the University of Oxford. Dave holds a Higher National Certificate in electronic engineering from Oxford Brookes University and a Master’s degree in information technology from the University of Liverpool.