MULTI-LAYER PERCEPTRON FOR REGRESSION IN JULIA: USING THE MOCHA FRAMEWORK

With the rise of machine learning techniques to analyze data, many frameworks for building such models have appeared. Today, most machine learning techniques are based on deep learning models, which in turn are based on artificial neural networks (ANNs). In this post, I'm going to introduce a deep learning framework for the Julia programming language by constructing a very simple model called the multi-layer perceptron, which is useful for many different tasks, such as estimating a mathematical function (regression). This post is going to be the foundation for future posts where we're going to build real deep learning models for real-life problems ;).
THE MULTI-LAYER PERCEPTRON (MLP)
According to Wikipedia, the multi-layer perceptron is a feedforward neural network that maps a set of input data onto a set of appropriate outputs. Mathematically, we can see an artificial neural network as a composition of other mathematical functions. We can represent a neural network visually using a graph, like the next example:
From the previous example, we can note that an ANN is composed of layers, which in turn are composed of neurons. The first layer is the input layer, where we feed data into the net. The next layer is a fully connected layer (every node from the previous layer is connected with every node of the current layer), and we call it a hidden layer. The final layer is the output layer, and in the example it has only one output. Neurons are functions that receive some input and return an output. Each connection between neurons on different layers has a weight, so a neuron receives a weighted input from the neurons on the previous layer. For example, take neuron a1: it receives a weighted input from every neuron on the input layer.
With the previous statements, we can define the multi-layer perceptron as a feedforward (connections between layers only go forward), fully connected (every neuron on one layer is connected with every neuron on the next layer) neural network. This kind of model is pretty useful for many different supervised learning tasks. Here, we are going to focus on estimating a mathematical function.
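To make this concrete, here is a minimal sketch (plain Julia, not Mocha code, and with arbitrary example values) of what a single neuron computes: a weighted sum of its inputs plus a bias, passed through an activation function such as tanh:

# illustrative only: a single artificial neuron computes a weighted sum of its
# inputs plus a bias and passes the result through an activation function
neuron(inputs, weights, bias) = tanh(sum(weights .* inputs) + bias)

# e.g. a hidden neuron receiving the two inputs of our function
# (the weights and bias here are arbitrary example values)
neuron([0.5, -1.2], [0.3, 0.8], 0.1)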
For a deeper explanation of neural networks, you can refer to http://deeplearning.net/tutorial/.
JULIA AND THE MOCHA FRAMEWORK
Julia is a high-level dynamic programming language designed with scientific computing and data analysis in mind. It has good overall performance and really shines in distributed and concurrent computation. Despite being relatively new (first release in 2012), it has a vibrant community and is updated regularly. Because it is focused on data analysis, there are many libraries for data-related tasks. In this post, we are going to work with a deep learning framework called Mocha.
Mocha has a pretty simple structure for building complex networks and makes it easy to use GPU capabilities (if you have a GPU), so to build our multi-layer perceptron we're going to use the Mocha library.
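For reference, and only as a sketch based on Mocha's documentation (I run everything on the CPU in this post, so I haven't tested this), enabling the GPU backend looks roughly like this:

# sketch based on Mocha's documentation: requires a CUDA-capable GPU and the
# CUDA extension of Mocha built beforehand; not used in this post
ENV["MOCHA_USE_CUDA"] = "true"

using Mocha

backend = GPUBackend()
init(backend)
# ... build and train the network exactly as with the CPUBackend ...
shutdown(backend)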
OUR MULTI-LAYER PERCEPTRON FOR REGRESSION
First, we need to define which mathematical function we're going to estimate. In particular, we choose the following function:
and in Julia:
f1(x1, x2) = sin(x1).*sin(x2)./(x1.*x2)
This is a two-input function that maps a pair (x, y) to a real value. We can plot our function in a simple way using the PyPlot package for Julia. Using 10,000 (x, y) points, our function looks like:
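For reference, here is a sketch of how such a plot can be produced; it mirrors the PyCall/matplotlib approach used later in this post (PyPlot would work similarly if it runs on your setup):

using Distributions
using PyCall

@pyimport mpl_toolkits.mplot3d as mplot3d
@pyimport matplotlib.pyplot as plt

f1(x1, x2) = sin(x1).*sin(x2)./(x1.*x2)

# sample 10,000 (x, y) points from a bivariate standard normal and plot f1
points = rand(MvNormal([0.0;0.0], [1.0 0.0;0.0 1.0]), 10000)

fig = plt.figure()
ax = mplot3d.Axes3D(fig)
ax[:scatter](points[1,:], points[2,:], f1(points[1,:], points[2,:]))
plt.show()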
So, our multi-layer perceptron needs two neurons in the input layer (because the function we're trying to estimate has two inputs). Then, it is going to have a fully connected hidden layer and, finally, an output layer with a single neuron (the network's estimate of our function). In Mocha, this network definition looks like:
using Mocha
using Distributions

# by fixing the random seed we can replicate our results
srand(500)

# we're going to use 5000 points to estimate our function
tam = 5000

# generate inputs, those are the x and y points on our function
# we generate them using the normal distribution
generate_dataset(media, var, tam) = rand(MvNormal(media, var), tam)

# generate outputs
f1(x1, x2) = sin(x1).*sin(x2)./(x1.*x2)

datasetinput = generate_dataset([0.0;0.0], [1.0 0.0;0.0 1.0], tam)
datasetoutput = f1(datasetinput[1,:], datasetinput[2,:])

# i don't have a gpu :(
backend = CPUBackend()
init(backend)

# the first layer is a data layer and receives as input our dataset (the 5000 points)
# and also we pass the outputs from our function, those are needed to train our network
data_layer = MemoryDataLayer(name="data", data=Array[datasetinput, datasetoutput], batch_size=100)
# then we have our fully connected hidden layer, here I use 35 hidden neurons
ip_layer = InnerProductLayer(name="ip", output_dim=35, bottoms=[:data], tops=[:ip], neuron=Neurons.Tanh())
# the final layer is also a fully connected layer but with only one neuron, the output one
aggregator = InnerProductLayer(name="aggregator", output_dim=1, tops=[:aggregator], bottoms=[:ip])
The code has a lot of comments, so I'm only going to explain the basics. The idea here is to generate a set of inputs (using the normal distribution, but since we need two inputs we use the multivariate normal) and their corresponding outputs using our function. The outputs are what we call the labels of the input set. With the input and label sets, we pass this information to our net. The network then processes every input and returns an output according to the weights of the connections between neurons and the specific function that every neuron computes. Then, the network compares its output with the corresponding label and computes an error using a loss function (such as the mean squared error). With these errors, the network is able (via backpropagation) to adjust its weights to correct its future outputs. This process is what we call supervised learning.
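As a quick illustration of the loss idea (plain Julia, not Mocha's internal implementation), the squared-error loss over a batch is just the average squared difference between the predictions and the labels:

# illustrative only: mean squared error over a batch of predictions and labels
square_loss(predictions, labels) = sum((predictions .- labels).^2) / length(labels)

# e.g. a batch of three predictions against their labels
square_loss([0.9, 0.1, 0.4], [1.0, 0.0, 0.5])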
So, in our Mocha neural network structure we have our input layer, which receives the input set and the labels. Then we have our fully connected hidden layer, where each neuron uses the Tanh function. For a full list of supported neuron functions in the Mocha framework you can check http://mochajl.readthedocs.io/en/latest/user-guide/neuron.html. Now, we need to define our loss function. To do this, in Mocha we add a loss layer to our structure:
layer_loss = SquareLossLayer(name="loss", bottoms=[:aggregator, :label])
This is going to be the last layer in our network structure. In my opinion, the loss function should not be called a layer because it is actually not one, but the Mocha framework works with the concept of "loss layers". At this point, we have a full neural network with its loss layer, along with its inputs and labels. Now we need to actually train it:
# first, with the layers we construct our final neural network
common_layers = [ip_layer, aggregator]
net = Net("MLP", backend, [data_layer, common_layers, layer_loss])

# when we train our network, we also perform validation of the training
# for this, we define a twin neural network where the only difference is
# the input layer (because we pass validation inputs and labels)
input_test = generate_dataset([0.0;0.0], [1.0 0.0;0.0 1.0], 5000)
output_test = f1(input_test[1,:], input_test[2,:])
data_test = MemoryDataLayer(data=Array[input_test, output_test], batch_size=100)
accuracy = SquareLossLayer(name="acc", bottoms=[:aggregator, :label])
net_test = Net("test", backend, [data_test, common_layers, accuracy])

# we tell Mocha that this "twin" network is for validation purposes
test_performance = ValidationPerformance(net_test)

# to train the network we use stochastic gradient descent
method = SGD()
# the max. number of iterations of SGD is 1000
params = make_solver_parameters(method, max_iter=1000)
solver = Solver(method, params)
In the last code block, there are two important things:
- Validation: we define a new neural network similar to the previous one, but with a different MemoryDataLayer where we pass a new input and label set. This new network is going to be used by Mocha to validate the training of our original network. To accomplish this, we use the ValidationPerformance class of Mocha.
- Solver: to actually train our network we need to define a solver, the algorithm used for training. In Mocha we have several choices, but for the MLP stochastic gradient descent (SGD) works well (see the sketch below for how its parameters can be tuned).
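As a reference, the solver's behavior can also be tuned explicitly. The following is only a sketch based on the options shown in Mocha's documentation; the learning-rate and momentum values are illustrative and not the ones used in this post:

# sketch: SGD with an explicit learning-rate policy and fixed momentum
# (illustrative values, in the style of Mocha's own examples)
method = SGD()
params = make_solver_parameters(method, max_iter=1000,
                                lr_policy=LRPolicy.Inv(0.01, 0.0001, 0.75),
                                mom_policy=MomPolicy.Fixed(0.9))
solver = Solver(method, params)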
Mocha has the nice ability to save a trained network, so when we need to actually use it we only load a "snapshot" of the trained network and save time by not having to train it again. To do this, we use what Mocha calls "coffee breaks":
add_coffee_break(solver, TrainingSummary(), every_n_iter=10)
add_coffee_break(solver, Snapshot("snapshots"), every_n_iter=1000)
add_coffee_break(solver, test_performance, every_n_iter=10)
Here, we defined three coffee breaks: one to print a summary of the current training status every 10 iterations, a second one to save the state of the network every 1000 iterations, and a last one that runs the validation network every 10 iterations. As we use only 1000 iterations of the SGD algorithm, the snapshot saved by the second coffee break is going to be the fully trained network.
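On a related note, and only as a sketch based on Mocha's documented solver options (I don't use it in this post), a saved snapshot directory can also be used to resume training via the load_from parameter of make_solver_parameters:

# sketch: if a snapshot exists in the "snapshots" directory, the solver can
# pick up training from it instead of starting from scratch
params = make_solver_parameters(method, max_iter=2000, load_from="snapshots")
solver = Solver(method, params)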
The last step in our training is to actually tell Mocha to train the network:
# train the network
solve(solver, net)

# dump some useful statistics
Mocha.dump_statistics(solver.coffee_lounge, get_layer_state(net, "loss"), true)

# free resources and shutdown the backend
destroy(net)
destroy(net_test)
shutdown(backend)
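As a side note, if you don't have the required packages yet, they can be installed first from the Julia REPL (a sketch using the package manager of the Julia versions Mocha targets):

# one-time setup: install the packages used by the training script
Pkg.add("Mocha")
Pkg.add("Distributions")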
Save this file as neural_network.jl and execute it from the command line. You’re going to see something like:
Configuring Mocha...
 * CUDA       disabled by default
 * Native Ext disabled by default
Mocha configured, continue loading module...
DefaultBackend = Mocha.CPUBackend
04-Jul 12:18:10:INFO:root:Constructing net MLP on Mocha.CPUBackend...
04-Jul 12:18:10:INFO:root:Topological sorting 4 layers...
04-Jul 12:18:10:INFO:root:Setup layers...
04-Jul 12:18:10:INFO:root:Network constructed!
04-Jul 12:18:10:INFO:root:Constructing net test on Mocha.CPUBackend...
04-Jul 12:18:10:INFO:root:Topological sorting 4 layers...
04-Jul 12:18:10:INFO:root:Setup layers...
04-Jul 12:18:10:DEBUG:root:InnerProductLayer(ip): sharing weights and bias
04-Jul 12:18:10:DEBUG:root:InnerProductLayer(aggregator): sharing weights and bias
04-Jul 12:18:10:INFO:root:Network constructed!
04-Jul 12:18:13:DEBUG:root:#DEBUG Checking network topology for back-propagation
04-Jul 12:18:13:DEBUG:root:Init network MLP
04-Jul 12:18:13:DEBUG:root:Init parameter weight for layer ip
04-Jul 12:18:13:DEBUG:root:Init parameter bias for layer ip
04-Jul 12:18:13:DEBUG:root:Init parameter weight for layer aggregator
04-Jul 12:18:13:DEBUG:root:Init parameter bias for layer aggregator
04-Jul 12:18:14:DEBUG:root:#DEBUG Initializing coffee breaks
04-Jul 12:18:14:INFO:root:Snapshot directory snapshots already exists
04-Jul 12:18:14:DEBUG:root:Init network test
04-Jul 12:18:14:INFO:root: TRAIN iter=000000 obj_val=0.34167932
04-Jul 12:18:14:INFO:root:Saving snapshot to snapshot-000000.jld...
04-Jul 12:18:14:WARNING:root:Overwriting snapshots/snapshot-000000.jld...
04-Jul 12:18:14:DEBUG:root:Saving parameters for layer ip
04-Jul 12:18:15:DEBUG:root:Saving parameters for layer aggregator
04-Jul 12:18:16:INFO:root:
04-Jul 12:18:16:INFO:root:## Performance on Validation Set after 0 iterations
04-Jul 12:18:16:INFO:root:---------------------------------------------------------
04-Jul 12:18:17:INFO:root:  Square-loss (avg over 5000) = 0.3189
04-Jul 12:18:17:INFO:root:---------------------------------------------------------
04-Jul 12:18:17:INFO:root:
04-Jul 12:18:17:DEBUG:root:#DEBUG Entering solver loop
04-Jul 12:18:17:INFO:root: TRAIN iter=000010 obj_val=0.16180613
04-Jul 12:18:17:INFO:root:
04-Jul 12:18:17:INFO:root:## Performance on Validation Set after 10 iterations
04-Jul 12:18:17:INFO:root:---------------------------------------------------------
04-Jul 12:18:17:INFO:root:  Square-loss (avg over 5000) = 0.1800
04-Jul 12:18:17:INFO:root:---------------------------------------------------------
...
...
...
04-Jul 12:18:18:INFO:root:## Performance on Validation Set after 1000 iterations
04-Jul 12:18:18:INFO:root:---------------------------------------------------------
04-Jul 12:18:18:INFO:root:  Square-loss (avg over 5000) = 0.0062
04-Jul 12:18:18:INFO:root:---------------------------------------------------------
04-Jul 12:18:18:INFO:root:
04-Jul 12:18:18:INFO:root:  Square-loss (avg over 100100) = 0.0150
04-Jul 12:18:18:DEBUG:root:Destroying network MLP
04-Jul 12:18:18:DEBUG:root:Destroying network test
If you look into the snapshots folder, you're going to see a file called snapshot-001000.jld. This is our saved, trained multi-layer perceptron.
So, we have trained our network. Now we are going to actually use it! First, create a new file called test_network.jl. To see if our network is correctly predicting our function, we're going to plot the predictions of our network against the correct outputs (the labels) of the original function. To plot the predictions, I used the matplotlib library directly from Python via PyCall. In Julia you can use a wrapper called PyPlot, but it didn't work for me 🙁 (something related to this issue on GitHub: https://github.com/stevengj/PyPlot.jl/issues/103). So, the code to test our net:
using Mocha
using Distributions
using PyCall

@pyimport mpl_toolkits.mplot3d as mplot3d
@pyimport matplotlib.pyplot as plt

srand(500)

# generate inputs
generate_dataset(media, var, tam) = rand(MvNormal(media, var), tam)
# generate outputs
f1(x1, x2) = sin(x1).*sin(x2)./(x1.*x2)

# number of examples (tam)
tam = 10000
datasetinput = generate_dataset([0.0;0.0], [1.0 0.0;0.0 1.0], tam)
datasetoutput = f1(datasetinput[1,:], datasetinput[2,:])

backend = CPUBackend()
init(backend)

# network definition
data_layer = MemoryDataLayer(name="data", data=Array[datasetinput, datasetoutput], batch_size=10000, tops=[:data, :label])
ip_layer = InnerProductLayer(name="ip", output_dim=35, bottoms=[:data], tops=[:ip], neuron=Neurons.Tanh())
aggregator = InnerProductLayer(name="aggregator", output_dim=1, tops=[:aggregator], bottoms=[:ip])

common_layers = [ip_layer, aggregator]
net = Net("MLP", backend, [data_layer, common_layers])

# we load the trained network
load_snapshot(net, "snapshots/snapshot-001000.jld")

# the forward function does the prediction in the network
forward(net)

# plot the correct outputs in red and the predictions in blue
fig = plt.figure()
ax = mplot3d.Axes3D(fig)
ax[:set_xlabel]("X Label")
ax[:set_ylabel]("Y Label")
ax[:set_zlabel]("Z Label")
# plot original data
ax[:scatter](datasetinput[1,:], datasetinput[2,:], datasetoutput, c="r", marker="o")
# plot predictions
ax[:scatter](datasetinput[1,:], datasetinput[2,:], net.output_blobs[:aggregator].data, c="b", marker="^")
plt.show()

# free resources and shutdown the backend
destroy(net)
shutdown(backend)
Note that in this case we don't need a SquareLossLayer because we aren't training our neural network. If you execute this file, you are going to see a plot like the following:
The blue marks are the predictions made by our neural network and the red points are the actual values of the estimated function. Here we can see that our neural network does a relatively good job of estimating our function. There are errors, mostly on the edges of the function, but as I said it is a decent estimation. We can improve our predictions by adjusting some parameters of the net, for example the number of hidden neurons (we used 35), the parameters of the solver, etc.
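For example (just a sketch; the value 100 is illustrative, not something I tested here), enlarging the hidden layer only requires changing the output_dim of the hidden InnerProductLayer:

# sketch: a wider hidden layer with 100 neurons instead of 35
ip_layer = InnerProductLayer(name="ip", output_dim=100, bottoms=[:data], tops=[:ip],
                             neuron=Neurons.Tanh())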
CONCLUSION
In this post, I introduced the Mocha framework for the Julia programming language to build, train, and use deep neural networks. As an introductory example, we built a simple multi-layer perceptron (not a deep learning model) to estimate a simple mathematical function. In future posts, I'm going to show more complex examples using the framework, mainly with deep learning models, so take this post as the foundation for upcoming posts.
The complete source code of this article is available on github: https://github.com/diegoacuna/mlp-regression-mocha