MULTI-LAYER PERCEPTRON FOR REGRESSION IN JULIA: USING THE MOCHA FRAMEWORK

With the rise of machine learning techniques for analyzing data, many frameworks for building those models have appeared. Today, many of the most prominent machine learning techniques are deep learning models, which are based on artificial neural networks (ANNs). In this post, I’m going to introduce a deep learning framework for the Julia programming language by constructing a very simple model, the multi-layer perceptron, which is useful for many different tasks, such as estimating a mathematical function (regression). This post lays the foundation for future posts where we’re going to build real deep learning models for real-life problems ;).


According to Wikipedia, the multi-layer perceptron is a feedforward neural network that maps a set of input data onto a set of appropriate outputs. Mathematically, we can see an artificial neural network as a composition of simpler mathematical functions. We can represent a neural network visually as a graph, like the next example:

From the previous example, we can see that an ANN is composed of layers, which in turn are composed of neurons. The first layer is the input layer, where we feed data into the net. The next layer is a fully connected layer (every node in the previous layer is connected to every node in the current layer), and we call it a hidden layer. The final layer is the output layer, which in the example has only one output. Neurons are functions that receive some input and return an output. Each connection between neurons in different layers has a weight, so a neuron receives a weighted input from the neurons in the previous layer. For example, take the neuron a1: it receives a weighted input from every neuron in the input layer.
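To make the idea of a weighted input concrete, here is a minimal sketch of a single neuron in plain Julia (Julia 1.x assumed, no Mocha involved). The name `neuron`, the bias term `b`, and the example weights are assumptions made for illustration; the post itself does not give them.

```julia
# One neuron: weighted sum of its inputs plus a bias, passed through an activation.
# The bias term and the default tanh activation are illustrative assumptions.
neuron(w, x, b; activation = tanh) = activation(sum(w .* x) + b)

x = [0.5, -1.0]   # outputs of the two input-layer neurons
w = [0.8, 0.3]    # hypothetical connection weights into a1
a1 = neuron(w, x, 0.1)
```

Here `a1` is the activation of the weighted sum 0.8·0.5 + 0.3·(−1.0) + 0.1.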

With these definitions, we can describe the multi-layer perceptron as a feedforward (connections only go forward from one layer to the next), fully connected (every neuron in one layer is connected to every neuron in the next layer) neural network. This kind of model is useful for many supervised learning tasks. Here, we are going to focus on estimating a mathematical function.
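The definition above can be sketched as a forward pass in plain Julia (Julia 1.x assumed). This is not Mocha code: the helper `mlp_forward`, the random initial weights, and the linear output layer are assumptions for illustration, while the tanh hidden layer and the 35 hidden neurons match what this post uses later.

```julia
# Forward pass of a one-hidden-layer perceptron:
# x -> tanh.(W1 * x .+ b1) -> W2 * h .+ b2
function mlp_forward(x, W1, b1, W2, b2)
    h = tanh.(W1 * x .+ b1)   # hidden layer: every neuron sees every input
    return W2 * h .+ b2        # output layer, kept linear for regression (assumption)
end

n_hidden = 35
W1 = 0.1 .* randn(n_hidden, 2)   # weights from the 2 inputs to the hidden layer
b1 = zeros(n_hidden)
W2 = 0.1 .* randn(1, n_hidden)   # weights from the hidden layer to the single output
b2 = zeros(1)

y = mlp_forward([0.5, -1.0], W1, b1, W2, b2)   # one-element vector
```

Each matrix row holds the incoming weights of one neuron, so "fully connected" simply means the weight matrices have no structural zeros.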

For a deeper explanation on neural networks you can refer to


Julia is a high-level, dynamic programming language designed with scientific computing and data analysis in mind. It has good overall performance, and it really shines in distributed and concurrent computation tasks. Although it is relatively new (first released in 2012), it has a vibrant community and is updated regularly. Because it is focused on data analysis, there are many libraries for data-related tasks. In this post, we are going to work with a deep learning framework called Mocha.

Mocha has a simple structure for building complex networks, and it makes it easy to use GPU capabilities (if you have one), so we’re going to use the Mocha library to build our multi-layer perceptron.


First, we need to define which mathematical function we’re going to estimate. In particular, we choose the following function:

f(x,y) = \frac{\sin(x)\cdot \sin(y)}{x\cdot y}

and in Julia:
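A minimal one-line sketch of such a definition, assuming the name `f` simply mirrors the formula above:

```julia
# Target function: sin(x) * sin(y) / (x * y).
# It evaluates to NaN when x or y is exactly zero (0/0).
f(x, y) = (sin(x) * sin(y)) / (x * y)
```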

This is a two-input function that maps a pair (x,y) to a real value. We can easily plot our function using Julia’s PyPlot package. Using 10,000 (x,y) points, our function looks like:

Function to estimate

So, our multi-layer perceptron needs two neurons in the input layer (because the function we’re trying to estimate has two inputs). Then it has a fully connected hidden layer and, finally, an output layer with a single neuron (the network’s final estimate of our function). In Mocha, this network definition looks like:

The code has a lot of comments, so I’m only going to explain the basics. The idea is to generate a set of inputs (using the normal distribution; since we need two inputs, we use the multivariate normal) and their corresponding outputs using our function. The outputs are what we call the labels of the input set. We pass the input set and the label set to our net. The network then processes every input and returns an output that depends on the weights of the connections between neurons and on the specific function that each neuron computes. Next, the network compares its output with the corresponding label and computes an error using a loss function (such as the mean squared error). With these errors, the network is able to adjust its weights (via backpropagation) to correct its future outputs. This process is what we call supervised learning.
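The data generation and the loss computation described above can be sketched in plain Julia (Julia 1.x assumed, independent of Mocha). The sample count `n = 1000`, the use of `randn` (a standard normal with identity covariance), and the helper name `mse` are assumptions for illustration, not values taken from the original listing.

```julia
f(x, y) = (sin(x) * sin(y)) / (x * y)   # function we want to estimate

n = 1000                                  # hypothetical number of training samples
X = randn(2, n)                           # inputs: each column is one (x, y) pair
Y = reshape(f.(X[1, :], X[2, :]), 1, n)   # labels: f evaluated on every column, 1 × n

# Mean squared error between network outputs and labels.
mse(pred, target) = sum((pred .- target) .^ 2) / length(target)
```

During training, `mse` would be evaluated on the network’s outputs for `X` against `Y`, and backpropagation would push that value down.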

So, in our Mocha neural network structure we have an input layer, which receives the input set and the labels. Then we have our fully connected hidden layer, where each neuron uses the Tanh function. For the full list of neuron functions supported by the framework, see the Mocha documentation. Now we need to define our loss function. To do this, in Mocha we add a loss layer to our structure:

This is going to be the last layer in our network structure. In my opinion, the loss function should not be called a layer, because it isn’t really one, but the Mocha framework works with the concept of “loss layers”. We now have a full neural network with its loss layer, along with its inputs and labels. Next, we need to actually train it:

In the code above, there are two important things:

  • Validation: we define a new neural network similar to the previous one, except that its MemoryDataLayer receives a new input and label set. Mocha uses this new network to validate that our original network is being trained correctly. To accomplish this, we use Mocha’s ValidationPerformance class.
  • Solver: to actually train our network, we need to define the solver algorithm used for training. Mocha offers several choices, but for the MLP, stochastic gradient descent (SGD) works well.
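To give a feeling for what the solver does, here is a minimal sketch of the SGD update rule in plain Julia (Julia 1.x assumed). It does not use Mocha’s solver: the helper `sgd_fit`, the learning rate, the number of epochs, and the toy data are all hypothetical, and for simplicity the samples are visited in a fixed order instead of being shuffled.

```julia
# Plain-Julia sketch of stochastic gradient descent (not Mocha's solver).
# Fits a single weight w so that w * x approximates y, one sample at a time.
function sgd_fit(xs, ys; lr = 0.01, epochs = 200)
    w = 0.0
    for epoch in 1:epochs
        for (x, y) in zip(xs, ys)
            grad = 2 * (w * x - y) * x   # derivative of (w*x - y)^2 with respect to w
            w -= lr * grad               # step against the gradient
        end
    end
    return w
end

xs = [1.0, 2.0, 3.0, 4.0]
ys = 2.0 .* xs          # toy labels generated by y = 2x
w = sgd_fit(xs, ys)     # converges toward 2.0
```

In the real network the same kind of update is applied to every weight, with the gradients supplied by backpropagation.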

Mocha has the nice ability to save a trained network, so when we need to use it we only have to load a “snapshot” of it and save the time of training the network again. To do this, we use what Mocha calls “coffee breaks”:

Here, we defined three coffee breaks. The first prints a summary of the training status every 10 iterations. The second saves the state of the network every 1000 iterations. As we only run 1000 iterations of the SGD algorithm, the snapshot saved by this coffee break is the trained network. The last coffee break runs the validation network every 10 iterations.
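The scheduling idea behind these three coffee breaks can be sketched without Mocha (Julia 1.x assumed). The helper `run_with_breaks` and the `(every, action)` pairs are hypothetical stand-ins, not Mocha’s API; the actions only record the iteration at which they would fire.

```julia
# Hypothetical stand-in for coffee breaks; this is not Mocha's API.
# Each break is a pair (every, action): action(iter) runs every `every` iterations.
function run_with_breaks(max_iter, breaks)
    for iter in 1:max_iter
        # one training step would happen here
        for (every, action) in breaks
            if iter % every == 0
                action(iter)
            end
        end
    end
end

summaries = Int[]     # iterations at which a training summary would be printed
snapshots = Int[]     # iterations at which the network would be saved
validations = Int[]   # iterations at which the validation network would run

run_with_breaks(1000, [
    (10,   iter -> push!(summaries, iter)),
    (1000, iter -> push!(snapshots, iter)),
    (10,   iter -> push!(validations, iter)),
])
```

With 1000 iterations, the snapshot action fires exactly once, at the final iteration, which is why the single saved snapshot is the fully trained network.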

The last step in our training is to actually tell Mocha to train the network:

Save this file as neural_network.jl and execute it from the command line. You’re going to see something like:

If you look into the snapshot folder you’re going to see a file called snapshot-001000.jld. This is our saved trained multi-layer perceptron.

So, we have trained our network. Now we are going to actually use it! First, create a new file called test_network.jl. To see whether our network predicts our function correctly, we’re going to plot its predictions against the correct outputs (the labels) of the original function. To plot the predictions, I used the matplotlib library directly from Python through PyCall. In Julia you can use a wrapper called PyPlot, but it didn’t work for me 🙁 (something related to an issue on GitHub). So, the code to test our net:

Note that in this case we don’t need a SquareLossLayer because we aren’t training our neural network. If you execute this file, you are going to see a plot like the next one:

Network predictions

The blue marks are the predictions made by our neural network, and the red points are the actual values of the function being estimated. Here we can see that our neural network does a relatively good job of estimating our function. There are errors, mostly at the edges of the function, but as I said, it is a decent estimate. We can improve our predictions by adjusting some parameters of the net, for example the number of hidden neurons (we used 35), the parameters of the solver, etc.


In this post, I introduced the Mocha framework for the Julia programming language, which lets us build, train and use deep neural networks. As an introductory example, we built a simple multi-layer perceptron (not a deep learning model) to estimate a simple mathematical function. In future posts, I’m going to show more complex examples with the framework, mainly using deep learning models, so take this post as the foundation for upcoming posts.

The complete source code of this article is available on github: