Danielle Navarro
library(tidyverse)
Last week’s exercise covered the Rescorla-Wagner model. In this week’s exercise we’ll expand this to cover the backpropagation rule for learning in connectionist networks. For this exercise, however, we’ll only cover the simplest kind of backpropagation network, with just two layers.
At the moment the scripts don’t do anything other than learn a classification rule. The goal for the full exercise will (eventually) be to examine what the model is learning across the series of “epochs”, and consider the relationship between this connectionist network and a probabilistic logistic regression model, but for now it’s a bit simpler than that!
In this tutorial we’ll only cover a very simple version of a backpropagation network, the two-layer “perceptron” model. There are two versions of the code posted above. The code in the iris_twolayer.R script is probably the more intuitive version, as it updates the association weights one at a time, but R code runs much faster when you express the learning rule using matrix operations, which is what the iris_twolayer2.R version does. Let’s start with a walkthrough of the more intuitive version…
First, let’s take a look at the training data. I’m going to use the classic “iris” data set that comes bundled with R, but I’ve reorganised the data in a form that is a little bit more useful for thinking about the learning problem involved, and expressed it as a numeric matrix.
irises <- read_csv("http://djnavarro.github.io/minds-machines-4103/iris_recode.csv") %>% as.matrix()
## Parsed with column specification:
## cols(
## sepal_length = col_double(),
## sepal_width = col_double(),
## petal_length = col_double(),
## petal_width = col_double(),
## context = col_double(),
## species_setosa = col_double(),
## species_versicolor = col_double(),
## species_virginica = col_double()
## )
This data set has eight columns, one per feature. First there are the input features, which consist of two features relating to the petals, two features relating to the sepals, and a context feature that is 1 for every flower. Additionally, there are three binary-valued output features corresponding to the species of each flower, dummy coded so that only the correct species has value 1 and the incorrect species have value 0. Here are the names:
input_names <- c("sepal_length", "sepal_width", "petal_length", "petal_width", "context")
output_names <- c("species_setosa", "species_versicolor", "species_virginica")
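If you want to convince yourself that the outputs really are dummy coded this way, one quick check (not something the original scripts do) is to verify that the three species columns sum to exactly 1 for every flower:
# every row should have exactly one species column set to 1
all(rowSums(irises[, output_names]) == 1)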
So for the first flower, the network would be given this pattern as input:
input <- irises[1, input_names]
input
## sepal_length sepal_width petal_length petal_width context
## 5.1 3.5 1.4 0.2 1.0
and we need to train it to produce this target pattern as the output:
target <- irises[1, output_names]
target
## species_setosa species_versicolor species_virginica
## 1 0 0
In its simplest form we can describe the knowledge possessed by our network as a set of associative strengths between every input feature and every output feature. In that sense we can think of it as a generalisation of how the Rescorla-Wagner model represents knowledge:
n_input <- length(input_names)
n_output <- length(output_names)
n_weights <- n_input * n_output
So what we’ll do is create a weight matrix in which the initial associative strengths are essentially zero: each weight is just a tiny bit of random noise:
weight <- matrix(
  data = rnorm(n_weights) * .01,
  nrow = n_input,
  ncol = n_output,
  dimnames = list(input_names, output_names)
)
weight
## species_setosa species_versicolor species_virginica
## sepal_length -0.002504672 -0.004840322 0.0008046437
## sepal_width 0.003742853 0.008551854 0.0110337623
## petal_length -0.001587042 0.001736071 -0.0029740646
## petal_width 0.005471515 -0.009401521 -0.0110717540
## context 0.023489872 0.009406889 0.0157307615
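One small aside: because the initial weights are random, the exact numbers you see when you run this will differ slightly from the ones printed here. If you wanted an exactly reproducible run, you could fix the random seed before creating the weight matrix (the seed value below is arbitrary, and this isn’t something the original script does):
# optional: make the "random" initial weights come out the same on every run
set.seed(1234)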
Here’s the network we want to code:
While we’re at it, store a copy for later:
old_weight <- weight
In the Rescorla-Wagner model, when the learner is shown a compound stimulus with elements A and B with individual associative strengths \(v_A\) and \(v_B\), the association strength for the compound AB is assumed to be additive, \(v_{AB} = v_{A} + v_{B}\). We could do this for our backpropagation network too, but it is much more common to assume a logistic activation function, which squashes the summed input into the range 0 to 1. So we’ll need to define this activation function:
# logistic (sigmoid) activation function: maps any real-valued input onto (0, 1)
logit <- function(x) {
  y <- 1 / (1 + exp(-x))
  return(y)
}
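As a quick sanity check (not part of the original script), the function behaves the way we want: an input of zero maps to exactly 0.5, large positive inputs approach 1, and large negative inputs approach 0:
logit(0)    # exactly 0.5
logit(10)   # very close to 1
logit(-10)  # very close to 0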
So what we do is take the weighted sum of the inputs and then pass it through our new logit function. So let’s say we want to compute the strength associated with the first species:
output_1 <- sum(input * weight[,1]) %>% logit()
output_1
## [1] 0.5056719
More generally though we can loop over the three species:
# initialise the output nodes at zero
output <- rep(0, n_output)
names(output) <- output_names

# feed forward to every output node by taking a weighted sum of
# the inputs and passing it through a logit function
for(o in 1:n_output) {
  output[o] <- sum(input * weight[,o]) %>% logit()
}
# print the result
output
## species_setosa species_versicolor species_virginica
## 0.5056719 0.5038007 0.5130157
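As an aside, this loop is exactly the kind of computation that can be collapsed into a single matrix operation, which is the approach taken in iris_twolayer2.R. A rough equivalent (a sketch, not necessarily the exact line in that script) looks like this:
# multiply the input vector (1 x 5) by the weight matrix (5 x 3), then pass
# each of the three weighted sums through the logistic function
drop(logit(input %*% weight))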
As you can see, initially the model has no knowledge at all! Because every weight is close to zero, the summed input to each output node is close to zero, so the model is predicting a value of about 0.5 for every category!
The prediction error is very familiar: just as in the Rescorla-Wagner model, it’s the target value minus the output value:
prediction_error <- target - output
prediction_error
## species_setosa species_versicolor species_virginica
## 0.4943281 -0.5038007 -0.5130157
Here is the code implementing the learning rule. What we’re doing is looping over every weight in the network, and adjusting each one in proportion to the prediction error:
learning_rate <- .1

# for each of the weights connecting to an output node...
for(o in 1:n_output) {
  for(i in 1:n_input) {

    # associative learning for this weight scales in a manner that depends on
    # both the input value and output value. this is similar to the way that
    # Rescorla-Wagner has CS scaling (alpha) and US scaling (beta) parameters
    # but the specifics are slightly different (Equation 5 & 6 in the paper)
    scale_io <- input[i] * output[o] * (1 - output[o])

    # adjust the weights proportional to the error and the scaling (Equation 8)
    weight[i,o] <- weight[i,o] + (prediction_error[o] * scale_io * learning_rate)
  }
}
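For completeness, here is roughly what the same update looks like when written with matrix operations, in the style of iris_twolayer2.R (a sketch rather than a line-for-line copy of that script; the names scale_o and weight_change are mine). The scaling term for each output node is combined with the prediction error, and the full table of weight changes is just the outer product of the input vector with that combined vector:
# gradient-style scaling term for each output node
scale_o <- output * (1 - output)

# all 15 weight changes in one step: the outer product of the 5 inputs with
# the 3 scaled prediction errors
weight_change <- learning_rate * outer(input, prediction_error * scale_o)

# adding weight_change to the old weights reproduces the loop above
# weight <- old_weight + weight_change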
(Let’s not worry too much about the scale_io factor for now.) So let’s look at the input, output, target, and prediction error:
input
## sepal_length sepal_width petal_length petal_width context
## 5.1 3.5 1.4 0.2 1.0
output
## species_setosa species_versicolor species_virginica
## 0.5056719 0.5038007 0.5130157
target
## species_setosa species_versicolor species_virginica
## 1 0 0
prediction_error
## species_setosa species_versicolor species_virginica
## 0.4943281 -0.5038007 -0.5130157
Now let’s look at how the weights changed:
weight - old_weight
## species_setosa species_versicolor species_virginica
## sepal_length 0.063018726 -0.064230873 -0.06536518
## sepal_width 0.043248145 -0.044080011 -0.04485846
## petal_length 0.017299258 -0.017632004 -0.01794338
## petal_width 0.002471323 -0.002518858 -0.00256334
## context 0.012356613 -0.012594289 -0.01281670
Not surprisingly, every weight leading to setosa has gone up and the weights leading to the other two species have gone down. But notice the scale: the size of each change is proportional to the value of the corresponding input feature, so sepal_length moves the most and petal_width barely moves at all.
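One way to see this (not something the original walkthrough does) is to divide each column of the change matrix by the input vector; because the error and scaling terms are fixed for a given output node, every entry within a column collapses to the same constant:
# each column of the result contains a single repeated value, because the
# change for input i and output o is input[i] times a per-column constant
(weight - old_weight) / input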
For the actual simulation I’ll set the learning rate to .01, run it for 5000 epochs, and average across 100 independent runs just to smooth out any artifacts of the randomisation.
How the weights change over epochs:
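Although the full simulation code isn’t reproduced in this walkthrough, conceptually it just wraps the feedforward and weight-update steps above inside a loop over flowers and a loop over epochs. Here’s a minimal sketch of a single run using the parameters just mentioned (my reconstruction, not necessarily the exact code in iris_twolayer.R):
learning_rate <- .01
n_epochs <- 5000
n_flowers <- nrow(irises)

# start from small random weights, as before
weight <- matrix(
  data = rnorm(n_weights) * .01,
  nrow = n_input,
  ncol = n_output,
  dimnames = list(input_names, output_names)
)

for(epoch in 1:n_epochs) {
  for(f in 1:n_flowers) {

    # feed forward: compute the output activations for this flower
    input <- irises[f, input_names]
    target <- irises[f, output_names]
    output <- drop(logit(input %*% weight))

    # error-driven update, using the matrix form of the learning rule
    prediction_error <- target - output
    scale_o <- output * (1 - output)
    weight <- weight + learning_rate * outer(input, prediction_error * scale_o)
  }
}
Averaging across 100 independent runs just means wrapping all of this in one more loop and averaging the resulting weight matrices (or learning curves) across runs.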