Although Neataptic networks accept non-normalized values as input, normalizing your input makes your network converge faster. I see a lot of questions where people ask how to normalize their data correctly, so I decided to make a guide.
You have gathered this information, and now you want to use it to train/activate a neural network:
{ stock: 933, sold: 352, price: 0.95, category: 'drinks', id: 40 }
{ stock: 154, sold: 103, price: 5.20, category: 'foods', id: 67 }
{ stock: 23, sold: 5, price: 121.30, category: 'electronics', id: 150 }
Some information on the above data:
stock
: the amount of this item in stock
sold
: the amount of this item sold (in the last month)
price
: the price of this item
category
: the type of product
id
: the id of the product
We want to represent each of these inputs as a number between 0 and 1; however, we must not change the relative differences between the values. So each input has to be treated consistently (stock gets normalized the same way for every item, for example).
We have two types of values in our input data: numerical values and categorical values. These should always be treated differently.
Numerical values are values where the distance between two values matters. For example, price: 0.95 is half of price: 1.90. But not all integers/decimals are numerical values. IDs are often represented with numbers, yet there is no relation between id: 4 and id: 8, so these should be treated as categorical values.
Normalizing numerical values is quite easy: we just determine a maximum value and divide each input by it. For example, we have the following data:
stock: 933
stock: 154
stock: 23
We need to choose a value >= 933 by which we divide all the stock values. We could choose 933 itself, but what if we get new data where the stock value is higher than 933? Then we would have to renormalize all the data and retrain the network. So we need to choose a value that is >= 933 and >= future values, but it shouldn't be too big a number either: dividing by a needlessly large maximum squashes all normalized values close to 0. We could make the assumption that the stock will never get larger than 2000, so we choose 2000 as our maximum value. We now normalize our data with this maximum value:
// Normalize the data with a maximum value (=2000)
stock: 933 -> 933/2000 -> 0.4665
stock: 154 -> 154/2000 -> 0.077
stock: 23 -> 23/2000 -> 0.0115
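In code, this normalization is a single division. Here is a minimal sketch in JavaScript; the helper name normalize and the maximum of 2000 are assumptions for illustration, not part of Neataptic:

```js
// Map a raw value into [0, 1] by dividing by an assumed maximum.
// The maximum (2000 for stock) reflects our assumption about future data.
function normalize (value, max) {
  return value / max;
}

normalize(933, 2000); // 0.4665
normalize(154, 2000); // 0.077
normalize(23, 2000);  // 0.0115
```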
Categorical data shows no relation between different categories, so each category should be treated as a separate input; this is called one-hot encoding. Basically, you create a separate input for each category. You set all the inputs to 0, except for the input that matches the sample's category. This is the one-hot encoding for our above training data:
| Sample | Drinks | Foods | Electronics |
|---|---|---|---|
| 1 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 |
But this also allows the addition of new categories over time: you just add a new input. It has no effect on the network's performance on the past training data, as the new category's input is 0 for those samples and therefore contributes nothing (weight * 0 = 0).
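A one-hot encoder fits in a few lines of plain JavaScript. This is a sketch under the assumption of a fixed, known category list; the function name oneHot is hypothetical:

```js
// Turn a category into a one-hot array, given the list of known categories.
function oneHot (category, categories) {
  return categories.map(function (c) {
    return c === category ? 1 : 0;
  });
}

var categories = ['drinks', 'foods', 'electronics'];
oneHot('foods', categories);       // [0, 1, 0]
oneHot('electronics', categories); // [0, 0, 1]
```

Adding a new category later is then just a matter of appending it to the categories array; older samples encode it as 0.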
Applying what I have explained above produces our normalized data. Note that the relative differences between inputs have not changed. Also note that some values below may be rounded.
{ stock: 0.4665, sold: 0.352, price: 0.00317, drinks: 1, foods: 0, electronics: 0, id40: 1, id67: 0, id150: 0 }
{ stock: 0.077, sold: 0.103, price: 0.01733, drinks: 0, foods: 1, electronics: 0, id40: 0, id67: 1, id150: 0 }
{ stock: 0.0115, sold: 0.005, price: 0.40433, drinks: 0, foods: 0, electronics: 1, id40: 0, id67: 0, id150: 1 }
Max values:
stock
: 2000
sold
: 1000
price
: 300
Please note that these inputs should be provided as arrays to neural networks in Neataptic:
[ 0.4665, 0.352, 0.00317, 1, 0, 0, 1, 0, 0 ]
[ 0.077, 0.103, 0.01733, 0, 1, 0, 0, 1, 0 ]
[ 0.0115, 0.005, 0.40433, 0, 0, 1, 0, 0, 1 ]
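Putting it all together, a sample object can be converted into such an array and fed to a network. The toInput helper, the chosen maxima, and the perceptron layout below are my own illustration, not a setup prescribed by Neataptic:

```js
var neataptic = require('neataptic');

// Assumed maxima for the numerical inputs, as chosen above.
var MAX = { stock: 2000, sold: 1000, price: 300 };
var CATEGORIES = ['drinks', 'foods', 'electronics'];
var IDS = [40, 67, 150];

// Hypothetical helper: convert a raw sample into a normalized input array.
function toInput (sample) {
  return [
    sample.stock / MAX.stock,
    sample.sold / MAX.sold,
    sample.price / MAX.price
  ].concat(
    CATEGORIES.map(function (c) { return c === sample.category ? 1 : 0; }),
    IDS.map(function (id) { return id === sample.id ? 1 : 0; })
  );
}

// 9 inputs: 3 numerical values + 3 category inputs + 3 id inputs.
var network = new neataptic.architect.Perceptron(9, 5, 1);
var output = network.activate(
  toInput({ stock: 933, sold: 352, price: 0.95, category: 'drinks', id: 40 })
);
```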