Normalization

Although Neataptic networks accept non-normalized values as input, normalizing your input makes your network converge faster. I see a lot of questions about how to normalize data correctly, so I decided to write a guide.

Example data

You have gathered this information, now you want to use it to train/activate a neural network:

{ stock: 933, sold: 352, price: 0.95,   category: 'drinks',      id: 40 }
{ stock: 154, sold: 103, price: 5.20,   category: 'foods',       id: 67 }
{ stock: 23,  sold: 5,   price: 121.30, category: 'electronics', id: 150 }

Some information on the above data:

  • stock: the amount of this item in stock
  • sold: the amount of this item sold (in the last month)
  • price: the price of this item
  • category: the type of product
  • id: the id of the product

Normalize

We want to represent each of these inputs as a number between 0 and 1; however, we must not change the relative differences between the values. So every input of the same type has to be treated identically (each item's stock value is divided by the same maximum, for example).

We have two types of values in our input data: numerical values and categorical values. These should always be treated differently.

Numerical values

Numerical values are values where the distance between two values matters. For example, price: 0.95 is half of price: 1.90. But not all integers/decimals are numerical values: IDs are often represented with numbers, yet there is no quantitative relation between id: 4 and id: 8, so these should be treated as categorical values.

Normalizing numerical values is quite easy: we just need to determine a maximum value by which we divide each input. For example, we have the following data:

stock: 933
stock: 154
stock: 23

We need to choose a value >= 933 by which we divide all the stock values. We could choose 933, but what if we later get new data where the stock value is higher than 933? Then we would have to renormalize all the data and retrain the network.

So we need a value that is >= 933 and >= any future value, but it also shouldn't be excessively large, or all normalized values end up squeezed near 0. Assuming the stock will never exceed 2000, we choose 2000 as our maximum value. We now normalize our data with this maximum value:

// Normalize the data with a maximum value (=2000)
stock: 933 -> 933/2000 -> 0.4665
stock: 154 -> 154/2000 -> 0.077
stock: 23  ->  23/2000 -> 0.0115
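In code, this boils down to a single division. Here is a minimal sketch in JavaScript; the normalizeNumerical helper name is just illustrative, and the maximum of 2000 is the assumption we made above:

// Minimal sketch: normalize a numerical value by dividing it by a chosen maximum.
// The maximum (2000 here) is an assumption about future data, not a fixed rule.
function normalizeNumerical (value, max) {
  return value / max;
}

normalizeNumerical(933, 2000);  // 0.4665
normalizeNumerical(154, 2000);  // 0.077
normalizeNumerical(23, 2000);   // 0.0115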

Categorical data

Categorical data shows no relation between different categories, so each category should be treated as a separate input. This is called one-hot encoding: you create a separate input for each category, set all these inputs to 0, and set only the input matching the sample's category to 1. This is the one-hot encoding for our above training data:

Sample  Drinks  Foods  Electronics
1       1       0      0
2       0       1      0
3       0       0      1

One-hot encoding also allows the addition of new categories over time: you just need a new input. This has no effect on the network's performance on past training data, because whenever the new category's input is 0 it contributes nothing (weight * 0 = 0). A minimal encoder is sketched below.
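A one-hot encoder is only a few lines of JavaScript. This is a minimal sketch; the oneHot function name and the fixed category list are assumptions for illustration:

// Minimal sketch: one-hot encode a value given a fixed list of categories.
function oneHot (value, categories) {
  return categories.map(function (category) {
    return category === value ? 1 : 0;
  });
}

oneHot('foods', ['drinks', 'foods', 'electronics']); // [0, 1, 0]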

Normalized data

Applying what I have explained above produces our normalized data. Note that the relative differences between inputs have not changed, and that some values in the table below are rounded.

{ stock: 0.4665, sold: 0.352, price: 0.00317, drinks: 1, foods: 0, electronics: 0, id40: 1, id67: 0, id150: 0 }
{ stock: 0.077,  sold: 0.103, price: 0.01733, drinks: 0, foods: 1, electronics: 0, id40: 0, id67: 1, id150: 0 }
{ stock: 0.0115, sold: 0.005, price: 0.40433, drinks: 0, foods: 0, electronics: 1, id40: 0, id67: 0, id150: 1 }

Max values:

  • stock: 2000
  • sold: 1000
  • price: 300

Please note that these inputs must be provided as arrays for neural networks in Neataptic:

[ 0.4665, 0.352, 0.00317, 1, 0, 0, 1, 0, 0 ]
[ 0.077,  0.103, 0.01733, 0, 1, 0, 0, 1, 0 ]
[ 0.0115, 0.005, 0.40433, 0, 0, 1, 0, 0, 1 ]
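Putting everything together, one raw record can be turned into such an array with a small conversion function. This is a sketch under the assumptions made in this guide (maxima of 2000/1000/300 and the oneHot helper from above); it is not part of the Neataptic API:

// Minimal sketch: convert a raw record into a normalized input array.
// Reuses the oneHot helper defined earlier; maxima are the assumed values above.
var CATEGORIES = ['drinks', 'foods', 'electronics'];
var IDS = [40, 67, 150];

function toInput (record) {
  return [
    record.stock / 2000,  // assumed maximum stock
    record.sold / 1000,   // assumed maximum sold
    record.price / 300    // assumed maximum price
  ]
  .concat(oneHot(record.category, CATEGORIES))
  .concat(oneHot(record.id, IDS));
}

toInput({ stock: 154, sold: 103, price: 5.20, category: 'foods', id: 67 });
// -> [ 0.077, 0.103, 0.01733..., 0, 1, 0, 0, 1, 0 ]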