This is a blog about artificial intelligence. Some of the content is designed for students of machine learning and is fairly technical in nature; however, most of the articles are for non-specialists. Non-specialists include readers with business or public policy interests, or anyone with a general intellectual curiosity about what AI is and how it is affecting the world around us.
I have included some posts – in addition to strictly AI-oriented posts – that examine other issues related to data and statistics. In particular, I have written some pieces on COVID-19 and the analysis of the epidemic.
Please feel free to make suggestions about topics you would like to see covered and please share a link to AIVoyager with your colleagues and friends.
In an upcoming American Greatness article I will be discussing the contentious issue of which political party is responsible for the spread of COVID-19 and which party is suffering more deaths from the disease. That article is based on a calculation which is presented on my github page here.
The purpose of this post is to provide somewhat more detail on the technical calculation, short of simply providing the code and data (which is the purpose of the github page). In the end, the github page is the final source on what is in these calculations. I am nevertheless happy to answer any technical questions posed in response to this post.
The first question which I ask is whether more Republicans or more Democrats are dying/have died from the disease.*
As a proxy for Republican versus Democrat (and perhaps a more interesting sample), I instead used the number of Trump voters versus Clinton voters by county in 2016, accounting as well for the number of non-voters.
To calculate this I took the data on the daily U.S. deaths due to COVID-19 as compiled in the Johns Hopkins University Center for Systems Science and Engineering database (henceforth called the JHU data). This data is compiled by day, since January 22, 2020, and by county (3145 in the United States).
The election data by county is available in many places. I used the dataset found here.
Additionally, in the second half of the analysis described below I also included data on the party which holds the governorship in each state. That dataset is included on my github page for convenience, though of course it is widely available.
Finally, I needed a dataset that gave the population in 2016 of each county. This was needed in order to determine the number of children in the county and thus the number of eligible voters who did not vote in 2016. That dataset is also included in the github page. The original source is the U.S. Census data which is here.
Here I present three figures based on these calculations.
Figure 1
The number of deaths on a given day in a given county is multiplied by the fraction of Trump voters, the fraction of Clinton voters, and the fraction of voting-age non-voters in the given county to obtain the number of Trump voter/Clinton voter/non-voter deaths on that day in that county. These fractions are computed as percentages of all eligible voters (not all citizens). Thus I specifically exclude citizens under the age of 18 (who number approximately 100 million in the United States).
To explain this: if children are included in the analysis, the number of Trump voters and Clinton voters (indeed, the number of voters) becomes a significantly smaller portion of the total population. In 2016 roughly 55% of eligible voters cast a ballot, but only about 38% of all citizens did. Since the number of deaths among children from COVID-19 is thankfully very small, it is sensible for the purposes of this argument to assume that it is zero and that all deaths occur within the voting-age public.
Once the numbers of deaths per county per day are computed, they are simply summed over counties for each day, giving nationwide daily numbers of deaths of Trump voters, Clinton voters and non-voters. The result is figure 1.
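For readers who want to see the mechanics without digging through the repository, here is a minimal sketch of this apportionment step in pandas. The file names and column names are illustrative assumptions, not necessarily those used in the actual github code.

# Minimal sketch of the apportionment step (file and column names are
# illustrative, not necessarily those in the repository).
import pandas as pd

# deaths: one row per (county, date) with daily 'deaths'
# votes:  one row per county with 2016 vote counts and adult eligible voters
deaths = pd.read_csv("county_daily_deaths.csv")   # hypothetical file
votes = pd.read_csv("county_votes_2016.csv")      # hypothetical file

votes["frac_trump"] = votes["trump_votes"] / votes["eligible_voters"]
votes["frac_clinton"] = votes["clinton_votes"] / votes["eligible_voters"]
votes["frac_nonvoter"] = 1.0 - votes["frac_trump"] - votes["frac_clinton"]

df = deaths.merge(
    votes[["fips", "frac_trump", "frac_clinton", "frac_nonvoter"]], on="fips")

for group in ["trump", "clinton", "nonvoter"]:
    df[f"{group}_deaths"] = df["deaths"] * df[f"frac_{group}"]

# Sum over counties to get the nationwide daily series plotted in figure 1.
national = df.groupby("date")[
    ["trump_deaths", "clinton_deaths", "nonvoter_deaths"]].sum()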
Figure 2
If we assume that the vote proportion in a given county is indicative of the overall support for the candidates – even among citizens who did not cast a ballot – then we can calculate the number of deaths of Clinton supporters as compared to Trump supporters. This is shown in figure 2, where all the deaths in a given county are attributed to one or the other candidate's supporters. The absolute numbers are thus higher for each group, but the ratio of the "D" to "R" numbers does not change from figure 1.
Figure 3
As described in my forthcoming article, there is considerable political interest in assigning blame to one party or the other for the 185,000+ COVID-19 deaths. The analysis presented here gives some evidence as to who is to blame, but it is only very sparse evidence and cannot be relied on to produce an absolute attribution of blame.
In other words, even though figures 1 & 2 show that more Clinton voters – and probably more Clinton supporters – have died from COVID-19 than Trump voters and supporters, that does not prove that Democrats are at fault. It is obvious that under any set of political policies many people would have died. It is also obvious that the disease entered the United States mainly through coastal areas and that it spread before preventative measures were in place to stop it.
So, do figures 1 & 2 provide any information on culpability? Of course they do. But they are only one small piece of the puzzle, a piece that could weigh very little in comparison to other factors.
That said, I attempted to extend the calculation above to more specifically target the effect of political control on the mortality rate.** In this final calculation of the post I defined a political-control variable for each county simply by whether the county had been won by Trump or Clinton. Further, I defined state political control simply by which party holds the governorship of the state. These are obviously both very crude approximations requiring refinement (e.g., who controls the state legislatures?), so of course the results need to be taken with a grain of salt.
Under this scheme, if one party held both county and state political control, then all of the deaths in that county were attributed to that party. If control was split between county and state, then half of the deaths were attributed to Republicans and half to Democrats.
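As a sketch of this attribution rule (the github page remains the actual implementation), the logic amounts to something like the following, where county_gop and state_gop are illustrative flags:

# Illustrative attribution rule; a sketch, not the repository's code.
# county_gop: True if Trump won the county in 2016.
# state_gop:  True if the state has a Republican governor.
def attribute_deaths(deaths, county_gop, state_gop):
    """Split a county's deaths between the parties by political control."""
    if county_gop and state_gop:
        return {"R": deaths, "D": 0.0}
    if not county_gop and not state_gop:
        return {"R": 0.0, "D": deaths}
    # Control split between county and state: half to each party.
    return {"R": deaths / 2.0, "D": deaths / 2.0}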
Note that this is at least somewhat more informative than simply counting the number of deaths in terms of the governor’s party alone. For example, Massachusetts has a Republican governor (my friend Charlie Baker) but it is a highly liberal/Democratic state and they also had a substantial number of COVID-19 deaths.
With all of those caveats and explanations, the result of this calculation is shown in figure 3.
Here the data indicate that an even larger percentage of deaths in the first wave occurred in Democratic-controlled jurisdictions than in figure 2, which splits the data by Trump supporters versus Clinton supporters. However, the second wave in early August is more evenly distributed and actually shows a slight bias toward Republican-controlled states and counties.
* The motivation for this investigation is an interview (March 2020, MSNBC) with, and subsequent articles by, Washington Post columnist Jennifer Rubin, who claimed that, due to denial of the seriousness of COVID-19 and concomitant dangerously rapid reopening policies, many more Republicans would die from the disease than Democrats.
** Note that while statistics are available for mortality, number of cases, hospitalizations and various other disease metrics I have limited my study simply to the number of deaths. This is in part because it is less ambiguous than number of cases (how many cases depends, for example, on how many tests are done). It is not completely unambiguous, of course, since dying of COVID-19 and dying with COVID-19 have notoriously been conflated in places at times.
The outbreak of the coronavirus has been accompanied by a dizzying volume of scientific research and a concomitant volume of explanatory articles that communicate the assumptions, limitations and conclusions of that research, either to a lay audience or to an audience of scientists whose specialties lie in other areas. The research – or a large portion of it, at least – can be divided into two main categories: biomedical work aimed at producing vaccines, cures and therapies on the one hand, and statistical, epidemiological research aimed at understanding and predicting the progress of the disease through society on the other. The repository for many of the research papers has, to a large degree, defaulted to the medical preprint archive medRxiv.org.
In this post I wish to begin examining the epidemiological models, with the hope of clarifying the predictions of the most noteworthy (in particular the IHME model from Christopher Murray at the University of Washington). I will attempt to run a middle course between highly technical and layman's descriptions, while trying to avoid the unhappy middle ground the Japanese call "chuuto-hanpa" (approximately, the worst of both worlds). Therefore, while I will include equations, I will try to make the discussion sufficiently lucid that even if you can't parse differential calculus formulas you will nevertheless get the relevant ideas.
One of the most prominent COVID-19 models (discussed in this Nature News Feature) was developed at Imperial College in the group of Dr. Neil Ferguson. It famously predicted in mid-March an ultimate potential death count in Britain of 500,000, as well as 2.2 million fatalities in the United States. This one work arguably led Prime Minister Boris Johnson to opt for a lockdown strategy rather than letting the virus spread until it died out through so-called "herd immunity." The model was also shared with the White House and led promptly to new guidance on social distancing. In other words, it was quite influential.
The Imperial College model begins from the most traditional of epidemic models known as the SIR (Susceptible-Infected-Recovered) model, which I describe below. The Imperial College researchers substantially elaborated on that model using a type of simulation of “agent” motion rather than trying to write and solve all of the pertinent differential equations (which is only feasible in the simple form of the SIR model).
A second highly cited model – frequently referred to by White House coronavirus advisors Deborah Birx and Anthony Fauci – was developed by Dr. Christopher Murray and co-workers at the University of Washington's Institute for Health Metrics and Evaluation (IHME). As we will show below, the premises and theory of the IHME model differ significantly from the more standard approach of the Imperial College research. This second approach is based entirely on the statistics of deaths and attempts to fit the data to specific functional forms. The premise is that there is some relevant functional form whose features up to the present determine its features in the future.
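To make the curve-fitting idea concrete, here is a generic sketch of fitting a sigmoidal (error-function) form to cumulative deaths and extrapolating forward. This is for illustration only and is not the IHME model's actual parameterization; the synthetic "observed" series exists purely to make the sketch runnable.

# Generic curve-fit-and-extrapolate sketch, in the spirit of (but not
# identical to) the IHME approach.
import numpy as np
from scipy.optimize import curve_fit
from scipy.special import erf

def cumulative_deaths(t, p, alpha, beta):
    # p: eventual total deaths, alpha: steepness, beta: midpoint (days)
    return 0.5 * p * (1.0 + erf(alpha * (t - beta)))

# Synthetic series standing in for real cumulative death counts.
rng = np.random.default_rng(0)
t = np.arange(30, dtype=float)
observed = cumulative_deaths(t, 2000.0, 0.15, 15.0) + rng.normal(0.0, 20.0, t.size)

params, _ = curve_fit(cumulative_deaths, t, observed,
                      p0=[observed[-1] * 2, 0.1, t.mean()])
forecast = cumulative_deaths(np.arange(90, dtype=float), *params)  # extrapolation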
I am particularly dubious of the IHME model simply because one can at least imagine situations where it would fail spectacularly. For example, if a cure were found the death rate would obviously collapse, but no parameter in the original functional estimate would have any way of predicting that. Of course one could argue that a cure would thwart any predictive model, so IHME is just as good as any. But there could be features already present in the data that indicate some non-linear change in the future (other than the discovery of a cure) which the parameter-based IHME model would be unable to capture. More on that below.
The Basic SIR model
The SIR or Susceptible-Infected-Recovered model is a simple set of rate equations with parameters describing the probability of transmission and the rate at which infected people either die or recover. Slightly more sophisticated models can incorporate the latency period from infection to transmissibility.
The model places all the members of the group (city, country, globe) into one of the three categories: S, I or R. The number of those who are infected changes in time as either susceptible members become infected, or as infected members either die or recover – in either case leaving the infected group. The simplest assumption then is that those recovered can never be infected again (they are not susceptible).
The parameter beta controls the fraction of encounters – between a susceptible and an infected person, hence the product of I and S – in which the susceptible becomes infected. The parameter gamma controls the combined recovery and death rate.
Graphical representation of SIR model. Susceptible can be infected, infected can recover (which, sadly, includes those who die).
Typical numerical solution of SIR equations showing S (yellow), I (red) and R (blue) as a function of time.
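For the curious, the basic SIR equations can be solved numerically in a few lines; the sketch below (with placeholder parameter values chosen only to produce a curve shaped like the one in the figure) is one standard way to do it.

# Numerical solution of the basic SIR equations:
#   dS/dt = -beta * S * I / N
#   dI/dt =  beta * S * I / N - gamma * I
#   dR/dt =  gamma * I
import numpy as np
from scipy.integrate import solve_ivp

N = 1_000_000            # population size (placeholder)
beta, gamma = 0.3, 0.1   # transmission and recovery(+death) rates (placeholders)

def sir(t, y):
    S, I, R = y
    dS = -beta * S * I / N
    dI = beta * S * I / N - gamma * I
    dR = gamma * I
    return [dS, dI, dR]

sol = solve_ivp(sir, (0, 200), [N - 10, 10, 0],
                t_eval=np.linspace(0, 200, 400))
S, I, R = sol.y   # time series for the three compartments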
The Imperial College model – which based its initial analysis on data from China – obviously went far beyond this simple model. First, the SIR model puts everyone into the same three categories, but one might subdivide the population into different towns or different age groups, for example. Further subdivision into social groups makes the model still more complicated. Each parameter in the model (the transmission rate and the death-or-recovery rate) can then depend on those "cohorts."
Further, the SIR model is deterministic. It could be recast as an equation for a probability distribution, but the Imperial College study instead used a stochastic simulation based on agents with particular behaviors circulating through society. (They also used the complicated set of rate equations directly.)
Data sources: JHU, NY Times, RealClearPolitics, NextStrain. Social distancing data: Descartes Labs, SafeGraph, Google COVID-19 Community Mobility Reports, unacast.com.
So you are trying to learn both the theory and the implementation of machine learning (ML) algorithms in general, and recurrent neural networks (RNNs) and Long Short Term Memory networks (LSTMs) in particular. You have located the various blogs (examples here and here) and research papers on LSTM (examples here, here and here), as well as the (excellent) Keras-based practical guides by Jason Brownlee of Machine Learning Mastery (MLM). You have even studied (or at least glanced at) Andrew Ng's Coursera lectures (link to the RNN one here) that go through sequential ML models step by step. But you still don't quite get it.
In particular, what you don't get is located somewhere in the gap between theory, as in Olah's blog, and practice, as in Brownlee's exercises. Specifically, what are the input parameters for Keras LSTMs? What other parameters are needed in MLM exercises and what do they mean? Where can you get some code that runs right off the shelf (as the MLM code generally does) and whose parameters you understand in terms of the actual equations governing LSTMs, so that you can apply it to your own sequence problems with varying numbers of input features, timesteps, and memory size (i.e. number of neurons)?
You have begun to despair. Olah is informative. Brownlee is great. But hey, neither of them is God. You know what I’m saying?
Well you’ve come to the right place!
The purpose of this blog entry and the associated github page https://github.com/mpstopa/LSTM-EZ is to provide a fully explained, simple Python/Keras (TensorFlow) LSTM code for predicting the next member(s) of sequences. Much of the code is taken from MLM examples and so inherits some of their idiosyncrasies. I will not reproduce the information in the blogs, papers or training examples, but rather try to bridge the gap between what the code does and what the equations/diagrams show.
By the end of this blog and github code implementation you should be able to
Take your sequential csv-file data with a single column to be predicted and an arbitrary number of additional feature columns and reformat it to be read by LSTM-EZ.py.
Understand and be able to experiment with the “timesteps” argument.
Understand the “neurons” parameter (which, in the utterly opaque Keras documentation is called “units” and not otherwise explained in any way) and modify and test it on your data.
Understand MLM's (i.e. Brownlee's) argument "n_lag" (also called "predict"). In a future post we will address changing this for multiple-timestep prediction. Herein predict=1.
Run the LSTM-EZ code for these and other cases.
Software prerequisites (installed):
Python 3.6 or later. I use PyCharm.
Keras 2.2.4 or later
pandas 0.24.1
numpy 1.15.4
sklearn
matplotlib 3.0.2
Learning prerequisites: you should have read Olah’s blog and pretty much understood it. You should have tried out some of the LSTM examples in Machine Learning Mastery. Most of these work right out of the box, even if they are difficult to interpret or modify (because the parameters are not well-explained).
One final note: so far I am using n_batch=1 for simplicity. Submitting batches is only necessary for efficiency and speed, so this code is slow and for production you will most likely need something else. I will try to expand to n_batch>1 cases in the future.
Basic RNN and LSTM
First the RNN
In order to make the implementation of Keras-based recurrent neural network algorithms clear, I will begin with a description of the basic recurrent neural network (RNN) and Long Short Term Memory network (LSTM). For the purposes of this blogpost it is important to display the dimensions of the various vectors and matrices – which are often suppressed in other blogs (here and here).
LSTM is significantly more complicated than RNN, both conceptually and mathematically. I will nevertheless provide the detailed math for both RNN and LSTM. I will not, however, provide in this blogpost the motivation for LSTM versus RNN. It is fairly well known that this motivation involves the vanishing of gradients in the backpropagation (through time) algorithm, leading to a loss of memory in the RNN over a (too) small number of timesteps. However, precisely how the additive update of memory (LSTM), as opposed to the repetitive matrix multiplication (RNN), ameliorates that is the subject of a future post. (Also, see related post here.)
Also, even though LSTM is much more complex than RNN, the essential requirement for understanding how to use the algorithms (at least for our purposes here) is understanding the inputs and outputs of the elementary cell (for one timestep). The only difference between the two is that LSTM has two memory channels (often called the working memory, denoted h, and the long-term memory, denoted C). These two memory vectors, however, have the same dimension, and hence the relevant input to the Keras routines – specifically the parameter which the Keras documentation calls "units" – is the same for keras.layers.SimpleRNN, keras.layers.GRU and keras.layers.LSTM.
Since the usage in the code is the same for RNN and LSTM, I will describe RNN first in some detail and then later give the description of LSTM (which will require, as I say, no further understanding of code parameters than RNN, but will differ in the internal structure).
When it is “unrolled” in time, the RNN cell at time t looks like this:
Figure 1: Basic RNN. Note that dimensions of x (M) and h (R) and the matrices W and U are included explicitly. Dimension of output y and the first dimension of V depend on case (see text).
The input at time t is the vector x<t>, of length M; the incoming memory from t-1 is the vector h<t-1>, of length R. The output vector y-hat is formed from the softmax operation on the matrix product of V with h<t>; its dimension (the first index of V, or alpha) is typically 1 – i.e. it is the value of the variable that we are trying to predict at time t (e.g. the closing price of a stock).
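For reference, the standard RNN update consistent with this notation can be written as follows (a transcription of the usual equations; the bias term b_h may be absorbed or suppressed in Figure 1):

h^{\langle t\rangle} = \tanh\left(W\,x^{\langle t\rangle} + U\,h^{\langle t-1\rangle} + b_h\right),
\qquad
\hat{y}^{\langle t\rangle} = \mathrm{softmax}\left(V\,h^{\langle t\rangle}\right),

with x^{\langle t\rangle}\in\mathbb{R}^{M}, h^{\langle t\rangle}\in\mathbb{R}^{R}, W\in\mathbb{R}^{R\times M}, U\in\mathbb{R}^{R\times R} and V\in\mathbb{R}^{\alpha\times R}.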
For the simplest sequences M=1 in which case a sequence consists of one variable changing in time (in the simple MLM example this is “shampoo sales” per month). Naturally we might want to input more information than only the time varying variable we are trying to predict. For example, if we are trying to predict the price of a stock we might want not only the historical time evolution of the stock price but we might want the corresponding trade volume as well (and perhaps the evolving price of other related stocks). Thus often M>1.
[Note: when we get to the code, the input data, in the form of a csv file, will consist of M columns and it will be assumed that the first column is the one whose value we wish to predict and the remaining M-1 columns are these “auxiliary” data].
The value of R is referred to as "units" in the Keras documentation. Brownlee refers to it as "neurons." It is incomprehensible why neither of these sources makes it clear that what they are talking about is the size of the vector that carries the memory, h. But that is all it is.
R=units=neurons is very much a tunable parameter. Brownlee is cavalier saying at one point:
“The final import parameter in defining the LSTM layer is the number of neurons, also called the number of memory units or blocks. This is a reasonably simple problem and a number between 1 and 5 should be sufficient,”
without ever saying just what it (neurons) is. The point is that the whole operation of the RNN depends on how many internal parameters you want to fit with your training – and that includes the weight matrices as well as the number of memory units. Too many parameters and you will overfit; too few and you will get diminishing accuracy. It's an art. So the lesson is that you have to experiment with R – it is not fixed by the nature of your problem (as M is, for instance). Final note: as seen from the equations in Figure 1, the sizes of all of the weight matrices also depend on R in one or both dimensions.
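Concretely, here is where R enters a Keras call; the numerical values are illustrative only:

from keras.layers import LSTM

# M and timesteps come from how you frame your data;
# R (= Keras "units" = Brownlee's "neurons") is the length of h and is tunable.
R, timesteps, M = 50, 3, 6
layer = LSTM(units=R, input_shape=(timesteps, M))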
The LSTM
The corresponding diagram for an LSTM cell is here:
Figure 2: Basic Long Short Term Memory cell, unrolled in time.
The symbols in the LSTM diagram are defined as follows:
Figure 3: Legend for figure 2. (A) Memory h<t-1> and input x<t> are multiplied by weight matrices W and U, the results are added and then run through an element-wise sigma function. Note that there are different W and U matrices for the f_t, i_t, C̃_t and o_t data lines. (B) The input-gate line (combining i_t and C̃_t) uses the tanh function rather than the sigma function but is otherwise identical. (C) The dot function is an element-wise product. (D) The plus function is an element-wise addition.
Clearly LSTM is far more complicated than RNN. For the purposes of this post, however, the difference in the input and output of a single cell is simply that LSTM carries two memory lines instead of one. However, the dimensions of these vectors, h and C, are the same, i.e. R (the aforementioned “units” or “neurons”), so the calls to the Keras routines (for this parameter) are the same.
Note that my diagram and notation differ from those of Olah's blog. I split the input x and the "working" memory h each into four independent lines. Olah runs those lines together before shunting them into sigma or tanh functions. This obscures the fact that each time a memory line meets an input line, the two are combined by first multiplying by weight matrices and then adding – and, here is the point, the weight matrices *differ* for the so-called "forget," the two "input," and the "output" lines.
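Written out, the standard LSTM cell update that figures 2 and 3 depict is the following (bias terms, which the figures may suppress, are included here for completeness):

\begin{aligned}
f^{\langle t\rangle} &= \sigma\left(W_f x^{\langle t\rangle} + U_f h^{\langle t-1\rangle} + b_f\right)\\
i^{\langle t\rangle} &= \sigma\left(W_i x^{\langle t\rangle} + U_i h^{\langle t-1\rangle} + b_i\right)\\
\tilde{C}^{\langle t\rangle} &= \tanh\left(W_C x^{\langle t\rangle} + U_C h^{\langle t-1\rangle} + b_C\right)\\
o^{\langle t\rangle} &= \sigma\left(W_o x^{\langle t\rangle} + U_o h^{\langle t-1\rangle} + b_o\right)\\
C^{\langle t\rangle} &= f^{\langle t\rangle}\odot C^{\langle t-1\rangle} + i^{\langle t\rangle}\odot \tilde{C}^{\langle t\rangle}\\
h^{\langle t\rangle} &= o^{\langle t\rangle}\odot \tanh\left(C^{\langle t\rangle}\right)
\end{aligned}

Here each gate has its own W and U, which is exactly the point made above.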
If your environment is set up properly the code should simply run out-of-the-box using the sample dataset (closing stock prices of Nvidia per day for one year along with high, low, open, and two volume measures…six columns altogether). The point of this blog/github repo is to teach you to run this code on your own dataset and adjust the parameters (in the code) timesteps, neurons and features.
Study the following diagram carefully:
Figure 5: full network for time series problem LSTM-EZ. Generally h^<0> is initialized randomly (and automatically) by LSTM.
First, note that the component cells are RNN cells (only one memory line) rather than LSTM cells. But ignore that fact! The difference is not relevant for this example.
This diagram shows a recurrent network unrolled in time. There are T=timesteps values of x that are input. Note that "timesteps" is NOT the number of lines in your data file for which you have entries. For example, in the provided data NVDA.csv there are 253 lines (plus one for column headings). What you use for timesteps is usually much smaller, e.g. timesteps=3. The network is then trained to look at three timesteps of x and to predict the fourth, x<T+1>, which we define as y_hat<T>, using as starting points x<1>, x<2>,…,x<250>. (The final starting x has to be number 250 = 253 - timesteps.)
Assume for the moment that the input data is scalar: i.e. there is only a single column of time-varying data and no auxiliary features (i.e. the variable features=1; recall that in the diagrams above features=M). Then, in the code, the function series_to_supervised takes the sequence of input data values and creates timesteps+1 columns starting from the original column. Additional columns, from 1,…,timesteps, are each shifted up by one more row and appended. Note that this shifts undefined values into the bottom rows (since there are no values for x<254>, x<255>, etc.). The routine series_to_supervised puts NaN into these slots and then removes any rows containing NaN values. No muss, no fuss.
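As a rough sketch of this reframing step (not the MLM series_to_supervised function verbatim, just the idea described above, with a hypothetical function name):

import pandas as pd

def frame_sequence(df, timesteps):
    """Append timesteps copies of the data, each shifted up by one more row,
    then drop the bottom rows that acquire NaN values."""
    frames, names = [], []
    for i in range(timesteps + 1):
        frames.append(df.shift(-i))          # shift 0 = the original columns
        lag = timesteps - i
        suffix = f"(t-{lag})" if lag else "(t)"
        names += [f"var{j+1}{suffix}" for j in range(df.shape[1])]
    framed = pd.concat(frames, axis=1)
    framed.columns = names
    return framed.dropna()                   # drops the NaN-containing bottom rows

After this step, LSTM-EZ additionally drops the trailing feature columns of the (t) block, keeping only var1(t) as the prediction target, as discussed below.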
[Note that this code allows the user to change timesteps to vary the number of inputs to the network. However the number of steps into the future which are predicted is only one. The variable “predict” in the code is set to one and the code will fail if it is set to anything else. To predict multiple timesteps into the future – which will be the subject of a future post – it is necessary to modify the model to output more than one variable. This is generally achieved by putting in a Dense layer with parameter “predict,” i.e. model.add(Dense(predict)). Changes to the input training data (specifically test_y) are also necessary. See MLM post here.]
Suppose that features>1: what happens then? Well, you still need timesteps values of your M features; each value of x<t> is an M-vector. But what about x<T+1>, the expected output of the model based on timesteps inputs? There is no need to keep the M features for x<T+1>=y_hat<T>. Thus, LSTM-EZ eliminates the additional features (i.e. columns) from the (T+1)st timestep.
A couple of pictures will make this clearer. First, (Fig. 6) the initial data with two features, indexed by date. (This data is also taken from the NVDA.csv file using read_csv, but in this case I have thrown away all but the two columns "Open" and "High"; the "Open" data, being the first column, is the thing we are trying to predict.)
Figure 6: A two feature dataset imported using pandas read_csv. Date is the index column and so when this data is converted to a numpy array using pandas conversion “data.values” the date column disappears.
Next, (Fig. 7) the output from series_to_supervised using timesteps=2 and predict=1 gives six columns. Note however that (actually, in LSTM-EZ, after returning from series_to_supervised) the final column, which would be var2(t), has been thrown away. var1(t) is our output ground truth. We don’t need and are not trying to predict var2(t).
Figure 7: After calling series_to_supervised with timesteps=2 and predict=1 and then discarding the last column, the above dataset is obtained. Note that the column headings are now replaced with generic names that indicate the timestep: t-2, t-1 and t. The two rows, 250 and 251 (which contained NaN entries), have been automatically deleted. The second pair of columns is offset vertically by one timestep from the corresponding first pair of columns, and the last column is offset two timesteps from the first. Note also that the last "High" column (var2(t)), which we are not trying to predict, has been eliminated.
Figure 8: The final picture looks like this. For the actual code LSTM-EZ there are actually six features (rather than the two shown here), so six columns are fed, line by line, into x<1>, six into x<2>, etc., for however many timesteps. The final multiplication of h<2> is actually done by a Dense layer created in the code by model.add(Dense(1)).
The final picture for the line-by-line training of the LSTM model (again, we picture an RNN here but the difference for this purpose is unimportant) is given in figure 8. Back propagation through time, which I will not discuss, is used for each instance to adjust the various weight matrices. The fitting is done with the input data split into a train and a test set to reveal, among other things, if the model is being overfit. See the code for details.
Once the model is trained (the Keras call is model.fit) the model can be called with arbitrary input lines of the same form as the training lines to predict outputs using model.predict. In LSTM-EZ we simply predict the test_X dataset again to illustrate the format of the call to model.predict.
Note: the LSTM-EZ code as provided (using the file NVDA.csv) is only slightly more complicated. It has T=timesteps=3 and M=features=6. Therefore “series_to_supervised” creates 4×6=24 columns of data, each 251 rows in length. Each set of six is offset vertically from the preceding set of six by one timestep. After returning from series_to_supervised LSTM-EZ eliminates columns 20,21,22,23 and 24, leaving a total of 19 columns. Furthermore, all rows with NaN entries are deleted by series_to_supervised. Thus, for each row there are six <t-3> columns (var1,var2,…,var6), six <t-2> columns, six <t-1> columns and one <t> (var1 only) column.
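Putting the pieces together, the core of the model-building step looks roughly like the following condensed sketch, using the shapes just described. Variable names and the random placeholder arrays are illustrative; the github repository remains the authoritative version of LSTM-EZ.

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

timesteps, features, neurons = 3, 6, 50   # the defaults described above

# Random placeholder arrays standing in for the reshaped NVDA data:
# X has shape (samples, timesteps, features); y has shape (samples, 1).
train_X = np.random.rand(200, timesteps, features)
train_y = np.random.rand(200, 1)
test_X = np.random.rand(50, timesteps, features)
test_y = np.random.rand(50, 1)

model = Sequential()
model.add(LSTM(neurons, input_shape=(timesteps, features)))
model.add(Dense(1))                       # predict = 1 value per input window
model.compile(loss='mae', optimizer='adam')

history = model.fit(train_X, train_y, epochs=50, batch_size=1,
                    validation_data=(test_X, test_y), verbose=2)
yhat = model.predict(test_X)              # same call form for any new inputs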
To summarize:
Create a sequential csv dataset with the first column being the variable that you wish to predict based on the preceding timesteps values. You may include an arbitrary number of additional columns of data that you think might be relevant to the prediction.
Download and test the LSTM-EZ code on the NVDA.csv dataset provided in the repo.
Modify the read_csv command to point to your csv file and change the value of the variable "features" to the total number of columns in your dataset. If you have an index column (such as a sequence of dates), set index_col=0 in the read_csv command. Otherwise leave it out.
Run the code. Timesteps is set by default to 3, neurons is set by default to 50, epochs is set by default to 50. You should be able to vary all of these (within reason) to see how your results change. The code produces two plots: one that shows the loss as a function of epoch in train and test data, the other of which shows the predicted variable versus the actual variable throughout the dataset.
Please leave comments on your satisfaction, or lack thereof, with the code and these notes. I will attempt to update it to address any flaws. Thanks for reading. When I have sponsors, please visit them!