So you are trying to learn both the theory and the implementation of machine learning (ML) algorithms in general and recurrent neural networks (RNNs) and Long Short-Term Memory networks (LSTMs) in particular. You have located both the various blogs (examples here and here) and research papers on Long Short-Term Memory (LSTM) (examples here, here and here), as well as the (excellent) Keras-based practical guides by Jason Brownlee of Machine Learning Mastery (MLM). You have even studied (or at least glanced at) Andrew Ng's "Coursera" lectures (link to the RNN one here) that go through sequential ML models step by step. But you still don't quite get it.
In particular, what you don't get is located somewhere in the gap between theory (as in Olah's blog) and practice (as in Brownlee's exercises). Specifically: what are the input parameters for Keras LSTMs? What other parameters are needed in the MLM exercises and what do they mean? Where can you get some code that runs right off the shelf (as the MLM material generally does) and has parameters whose meanings you understand in terms of the actual equations governing LSTMs, so that you can apply that code to your own sequence problems with varying numbers of input features, timesteps, and memory size (i.e. number of neurons)?
You have begun to despair. Olah is informative. Brownlee is great. But hey, neither of them is God. You know what I’m saying?
Well you’ve come to the right place!
The purpose of this blog entry and the associated github page
https://github.com/mpstopa/LSTM-EZ is to provide a fully explained, simple Python/Keras (TensorFlow) LSTM code for predicting the next member(s) of sequences. Much of the code is taken from MLM examples and so inherits some of their idiosyncrasies. I will not reproduce the information in the blogs, papers or training examples, but rather try to bridge the gap between what the code does and what the equations/diagrams show.
By the end of this blog and github code implementation you should be able to
- Take your sequential csv-file data with a single column to be predicted and an arbitrary number of additional feature columns and reformat it to be read by LSTM-EZ.py.
- Understand and be able to experiment with the “timesteps” argument.
- Understand the "neurons" parameter (which, in the utterly opaque Keras documentation, is called "units" and not otherwise explained in any way) and modify and test it on your data.
- Understand MLM's (i.e. Brownlee's) argument "n_lag" (also called "predict"). In the future we will address changing this for multiple-timestep prediction. Here, predict=1.
- Run the LSTM-EZ code for these and other cases.
Software prerequisites (installed):
- Python 3.6 or later. I use PyCharm.
- Keras 2.2.4 or later
- pandas 0.24.1
- numpy 1.15.4
- sklearn
- matplotlib 3.0.2
Learning prerequisites: you should have read Olah’s blog and pretty much understood it. You should have tried out some of the LSTM examples in Machine Learning Mastery. Most of these work right out of the box, even if they are difficult to interpret or modify (because the parameters are not well-explained).
One final note: so far I am using n_batch=1 for simplicity. Submitting batches is only necessary for efficiency and speed. Thus this code is slow, and for production you will most likely need something else. I will try to extend it to the n_batch>1 case in the future.
Basic RNN and LSTM
First the RNN
In order to make the implementation of Keras-based recurrent neural network algorithms clear, I will begin with a description of the basic recurrent neural network (RNN) and Long Short Term Memory network (LSTM). For the purposes of this blogpost it is important to display the dimensions of the various vectors and matrices – which are often suppressed in other blogs (here and here).
LSTM is significantly more complicated than RNN, both conceptually and mathematically. I will nevertheless provide the detailed math for both RNN and LSTM. I will not, however, provide the motivation for LSTM versus RNN in this blogpost. It is fairly well known that this motivation involves the vanishing of gradients in the back propagation (through time) algorithm, which causes the RNN to lose memory over a (too) small number of time steps. However, precisely how the additive update to memory in LSTM, as opposed to the repeated matrix multiplication in RNN, ameliorates that problem is the subject of a future post. (Also, see related post here.)
Also, even though LSTM is much more complex than RNN, the essential requirement for understanding how to use the algorithms (at least for our purposes here) is understanding the inputs and outputs of the elementary cell (for one timestep). The only difference between the two is that LSTM has two memory channels (often called the working memory, denoted h, and the long-term memory, denoted C). These two memory vectors, however, have the same dimension, and hence the input to the Keras routines – specifically the parameter which the Keras documentation calls "units" – is the same for keras.layers.SimpleRNN, keras.layers.GRU and keras.layers.LSTM.
Since the usage in the code is the same for RNN and LSTM, I will describe RNN first in some detail and then later give the description of LSTM (which will require, as I say, no further understanding of code parameters than RNN, but will differ in the internal structure).
When it is “unrolled” in time, the RNN cell at time t looks like this:
The input at time t is the vector x<t> with length M; the incoming memory from t-1 is the vector h<t-1> with length R; and the output vector y-hat<t>, formed by the softmax operation on the matrix product of V with h<t>, has length alpha (the first index of V), which is typically 1 – i.e. it is a scalar, the value of the variable that we are trying to predict at time t (e.g. the closing price of a stock).
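To keep the dimensions straight, here is the standard RNN cell written out. This is the generic textbook form in my notation; the weight-matrix names W, U, V and the biases are labels I have chosen, so treat it as a sketch of what the figure shows rather than a transcription of it:

\[
h^{\langle t\rangle} = \tanh\big(W\,x^{\langle t\rangle} + U\,h^{\langle t-1\rangle} + b_h\big), \qquad
\hat{y}^{\langle t\rangle} = \operatorname{softmax}\big(V\,h^{\langle t\rangle} + b_y\big),
\]

with \(x^{\langle t\rangle}\in\mathbb{R}^{M}\), \(h^{\langle t\rangle}\in\mathbb{R}^{R}\), \(W\in\mathbb{R}^{R\times M}\), \(U\in\mathbb{R}^{R\times R}\), \(V\in\mathbb{R}^{\alpha\times R}\) and \(\hat{y}^{\langle t\rangle}\in\mathbb{R}^{\alpha}\) (typically \(\alpha=1\)).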
For the simplest sequences M=1, in which case a sequence consists of one variable changing in time (in the simple MLM example this is "shampoo sales" per month). Naturally we might want to input more information than only the time-varying variable we are trying to predict. For example, if we are trying to predict the price of a stock we might want not only the historical time evolution of the stock price but also the corresponding trade volume (and perhaps the evolving prices of other related stocks). Thus often M>1.
[Note: when we get to the code, the input data, in the form of a csv file, will consist of M columns and it will be assumed that the first column is the one whose value we wish to predict and the remaining M-1 columns are these “auxiliary” data].
The value of R is referred to as "units" in the Keras documentation. Brownlee refers to it as "neurons." It is incomprehensible why neither of these sources makes it clear that what they are talking about is the size of the vector that carries the memory, h. But that is all it is.
R=units=neurons is very much a tunable parameter. Brownlee is cavalier, saying at one point:
“The final import parameter in defining the LSTM layer is the number of neurons, also called the number of memory units or blocks. This is a reasonably simple problem and a number between 1 and 5 should be sufficient,”
without saying just what it (neurons) is. The point is that the whole operation of the RNN depends on how many internal parameters you want to fit with your training – and that includes the weight matrices as well as the number of memory units. Too many parameters and you will overfit. Too few and you will get diminishing accuracy. It's an art. So the lesson is that you have to experiment with R – it is not fixed by the nature of your problem (as M is, for instance). Final note: as seen from the equations in Figure 1, the sizes of all of the weight matrices also depend on R in one or both dimensions.
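To make the connection to code concrete, here is a minimal Keras sketch (my own illustration, not a verbatim excerpt from LSTM-EZ; the values echo the defaults used later in this post). The only thing "units" controls is R, the length of the memory vector h (and, for LSTM, C):

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense

timesteps = 3   # T: how many past steps the network sees per sample
features  = 6   # M: number of input columns per timestep
neurons   = 50  # R: what Keras calls "units" -- the length of h (and C)

model = Sequential()
# input_shape is (timesteps, features); units=neurons sets the memory size R
model.add(LSTM(neurons, input_shape=(timesteps, features)))
# a Dense layer maps the final h<T> (length R) to the single predicted value
model.add(Dense(1))
model.compile(loss='mae', optimizer='adam')
model.summary()  # the LSTM layer's output shape will show as (None, neurons)
```

Changing neurons changes only R (and hence the sizes of the internal weight matrices); the rest of the call is untouched.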
The LSTM
The corresponding diagram for an LSTM cell is here:
The symbols in the LSTM diagram are defined as follows:
Clearly LSTM is far more complicated than RNN. For the purposes of this post, however, the difference in the input and output of a single cell is simply that LSTM carries two memory lines instead of one. However, the dimensions of these vectors, h and C, are the same, i.e. R (the aforementioned “units” or “neurons”), so the calls to the Keras routines (for this parameter) are the same.
Note that my diagram and notation differ from those of Olah's blog. I split the input x and the "working" memory h each into four independent lines. Olah runs those lines together before shunting them into sigma or tanh functions. This obscures the fact that each time a memory line meets an input line, the two are combined by first multiplying each with a weight matrix and then adding, and, here is the point, the weight matrices *differ* for the so-called "forget" line, the two "input" lines, and the "output" line.
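For reference, here are the standard LSTM cell equations, written with the weight matrices kept separate so that the point above is explicit (again in my notation; a sketch of the textbook form, not a transcription of the figure):

\[
\begin{aligned}
f^{\langle t\rangle} &= \sigma\big(W_f\,x^{\langle t\rangle} + U_f\,h^{\langle t-1\rangle} + b_f\big) &&\text{(forget)}\\
i^{\langle t\rangle} &= \sigma\big(W_i\,x^{\langle t\rangle} + U_i\,h^{\langle t-1\rangle} + b_i\big) &&\text{(input gate)}\\
\tilde{C}^{\langle t\rangle} &= \tanh\big(W_c\,x^{\langle t\rangle} + U_c\,h^{\langle t-1\rangle} + b_c\big) &&\text{(input candidate)}\\
o^{\langle t\rangle} &= \sigma\big(W_o\,x^{\langle t\rangle} + U_o\,h^{\langle t-1\rangle} + b_o\big) &&\text{(output)}\\
C^{\langle t\rangle} &= f^{\langle t\rangle}\odot C^{\langle t-1\rangle} + i^{\langle t\rangle}\odot \tilde{C}^{\langle t\rangle}\\
h^{\langle t\rangle} &= o^{\langle t\rangle}\odot \tanh\big(C^{\langle t\rangle}\big)
\end{aligned}
\]

Each \(W_*\) is \(R\times M\), each \(U_*\) is \(R\times R\), and \(f\), \(i\), \(\tilde{C}\), \(o\), \(C\) and \(h\) all have length R. This is why "units" means exactly the same thing for RNN and LSTM.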
The github code LSTM-EZ
At this stage you should, if you have not already done so, look at my github repository:
https://github.com/mpstopa/LSTM-EZ.
The LSTM-EZ code is borrowed in large part from three examples given in Brownlee’s MLM:
https://machinelearningmastery.com/time-series-forecasting-long-short-term-memory-network-python/
https://machinelearningmastery.com/multi-step-time-series-forecasting-long-short-term-memory-networks-python/
https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/
If your environment is set up properly the code should simply run out-of-the-box using the sample dataset (closing stock prices of Nvidia per day for one year along with high, low, open, and two volume measures…six columns altogether). The point of this blog/github repo is to teach you to run this code on your own dataset and adjust the parameters (in the code) timesteps, neurons and features.
Study the following diagram carefully:
First, note that the component cells are RNN cells (only one memory line) rather than LSTM cells. But ignore that fact! The difference is not relevant for this example.
This diagram shows a recurrent network unrolled in time. There are T=timesteps values of x that are input. Note that "timesteps" is NOT the number of lines in your data file for which you have entries. For example, in the provided data NVDA.csv there are 253 lines (plus one for column headings). What you use for timesteps is usually much smaller, e.g. timesteps=3. Thus the network is trained to look at three timesteps of x and to predict the fourth, x<T+1>, which we define as y_hat<T>, using as starting points x<1>, x<2>, …, x<250>. (The final starting x has to be number 250 = 253 - timesteps.)
Assume for the moment that the input data is scalar: i.e. there is only a single column of time-varying data and no auxiliary features (i.e. the variable features=1; recall that in the diagrams above features=M). Then, in the code, the function series_to_supervised takes the sequence of input data values and creates timesteps+1 columns, starting from the original column. Each additional column, from 1 to timesteps, is shifted up by one more row than the previous one and appended. Note that this shifts undefined values into the bottom rows (since there are no values for x<254>, x<255> etc.). The routine series_to_supervised puts NaN into these slots and then removes every row that has any NaN values. No muss, no fuss.
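The actual series_to_supervised in LSTM-EZ is borrowed from MLM; the following is a sketch in the same spirit rather than a verbatim copy. The column names follow the var1(t-1)-style convention that appears in the figures below; note that this sketch lags the inputs by shifting the copies down (putting the NaN values at the top rather than the bottom), but the resulting table is the same once the NaN rows are dropped:

```python
from pandas import DataFrame, concat

def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    """Frame a time series as a supervised-learning table.

    data:  2D array-like, one column per feature (M columns).
    n_in:  number of lagged input steps (the "timesteps" parameter).
    n_out: number of forecast steps (the "predict" parameter; 1 in LSTM-EZ).
    """
    df = DataFrame(data)
    n_vars = df.shape[1]
    cols, names = [], []
    # lagged input columns: var1(t-n_in) ... varM(t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += ['var%d(t-%d)' % (j + 1, i) for j in range(n_vars)]
    # forecast columns: var1(t) ... varM(t+n_out-1)
    for i in range(n_out):
        cols.append(df.shift(-i))
        names += ['var%d(t)' % (j + 1) if i == 0
                  else 'var%d(t+%d)' % (j + 1, i) for j in range(n_vars)]
    agg = concat(cols, axis=1)
    agg.columns = names
    # rows containing NaN (from shifting past the ends of the data) are dropped
    if dropnan:
        agg.dropna(inplace=True)
    return agg
```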
[Note that this code allows the user to change timesteps to vary the number of inputs to the network. However the number of steps into the future which are predicted is only one. The variable “predict” in the code is set to one and the code will fail if it is set to anything else. To predict multiple timesteps into the future – which will be the subject of a future post – it is necessary to modify the model to output more than one variable. This is generally achieved by putting in a Dense layer with parameter “predict,” i.e. model.add(Dense(predict)). Changes to the input training data (specifically test_y) are also necessary. See MLM post here.]
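For orientation, the relevant part of the model definition would look something like the sketch below (illustrative only; LSTM-EZ as shipped fixes predict=1, and going beyond that also requires reshaping the target data as noted above):

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense

timesteps, features, neurons = 3, 6, 50
predict = 1  # number of future steps output by the model; must be 1 in LSTM-EZ

model = Sequential()
model.add(LSTM(neurons, input_shape=(timesteps, features)))
model.add(Dense(predict))   # output layer width = number of steps predicted
model.compile(loss='mae', optimizer='adam')
```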
Suppose that features>1; what happens then? Well, you still need timesteps values of your M features. Each value of x<t> is an M-vector. But what about x<T+1>, the expected output of the model based on timesteps inputs? There is no need to keep the M features for x<T+1>=y_hat<T>. Thus, LSTM-EZ eliminates the additional features (i.e. columns) from the (T+1)st timestep.
A couple of pictures will make this clearer. First, (Fig. 6) the initial data with two features, indexed by date. (This data is also taken from the NVDA.csv file using read_csv, but in this case I have thrown away all but the two columns "Open" and "High"; the "Open" data, being the first column, is the thing we are trying to predict.)
Next, (Fig. 7) the output from series_to_supervised using timesteps=2 and predict=1 gives six columns. Note however that in LSTM-EZ, after returning from series_to_supervised, the final column, which would be var2(t), is thrown away. var1(t) is our output ground truth; we don't need, and are not trying to predict, var2(t).
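One way to do this pruning in code (a sketch; the variable names reframed and features are illustrative, not necessarily the names used inside LSTM-EZ):

```python
# reframed: the DataFrame returned by series_to_supervised.
# Drop the auxiliary feature columns var2(t) ... varM(t) of the final timestep,
# keeping only var1(t) as the prediction target.
if features > 1:
    reframed.drop(reframed.columns[-(features - 1):], axis=1, inplace=True)
```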
The final picture for the line-by-line training of the LSTM model (again, we picture an RNN here but the difference for this purpose is unimportant) is given in figure 8. Back propagation through time, which I will not discuss, is used for each instance to adjust the various weight matrices. The fitting is done with the input data split into a train and a test set to reveal, among other things, if the model is being overfit. See the code for details.
Once the model is trained (the Keras call is model.fit) the model can be called with arbitrary input lines of the same form as the training lines to predict outputs using model.predict. In LSTM-EZ we simply predict the test_X dataset again to illustrate the format of the call to model.predict.
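The array shapes matter here, so here is a hedged sketch of the fit/predict calls (variable names are illustrative; train_X, train_y, test_X and test_y are assumed to be numpy arrays already split from the supervised table):

```python
# Keras recurrent layers expect 3D input: (samples, timesteps, features)
train_X = train_X.reshape((train_X.shape[0], timesteps, features))
test_X  = test_X.reshape((test_X.shape[0], timesteps, features))

history = model.fit(train_X, train_y,
                    epochs=50, batch_size=1,      # n_batch = 1, as noted earlier
                    validation_data=(test_X, test_y),
                    verbose=2, shuffle=False)

# any array of shape (n, timesteps, features) can now be fed to the model;
# LSTM-EZ simply re-predicts the test set to illustrate the call
yhat = model.predict(test_X)
```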
Note: the LSTM-EZ code as provided (using the file NVDA.csv) is only slightly more complicated. It has T=timesteps=3 and M=features=6. Therefore series_to_supervised creates 4×6=24 columns of data, each 250 rows in length once the rows with NaN entries have been deleted (253 - timesteps = 250). Each set of six is offset vertically from the preceding set of six by one timestep. After returning from series_to_supervised, LSTM-EZ eliminates columns 20, 21, 22, 23 and 24, leaving a total of 19 columns. Thus, for each row there are six <t-3> columns (var1, var2, …, var6), six <t-2> columns, six <t-1> columns and one <t> (var1 only) column.
To summarize:
- create a sequential csv dataset with the first column being the variable that you wish to predict based on the preceding timesteps values. You may include an arbitrary number of additional columns of data that you think might be relevant to the prediction.
- download and test the LSTM-EZ code on the NVDA.csv dataset provided in the repo.
- modify the read_csv command to point to your csv file and change the value of the variable "features" to the total number of columns in your dataset. If you have an index column (such as a sequence of dates), set index_col=0 in the read_csv command; otherwise leave it out. (A sketch of these adjustments follows this list.)
- Run the code. Timesteps is set by default to 3, neurons to 50, and epochs to 50. You should be able to vary all of these (within reason) to see how your results change. The code produces two plots: one shows the loss as a function of epoch for the train and test data; the other shows the predicted variable versus the actual variable throughout the dataset.
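A minimal sketch of the data-loading adjustments described in the list above (the file name is a placeholder for your own data; the parameter names follow the conventions used in this post and may differ slightly from the variable names in LSTM-EZ):

```python
from pandas import read_csv

# point read_csv at your own file; keep index_col=0 only if the first
# column is an index (such as dates) rather than data
dataset = read_csv('your_data.csv', header=0, index_col=0)

features  = dataset.shape[1]  # total number of data columns (M)
timesteps = 3                 # LSTM-EZ default
neurons   = 50                # LSTM-EZ default ("units" in Keras)
epochs    = 50                # LSTM-EZ default
```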
Please leave comments on your satisfaction or lack thereof with the code and these notes. I will attempt to update this to address any flaws. Thanks for reading. When I have sponsors, please visit them!