Use of Data Distribution in Machine Learning
Data Distribution in Machine Learning
Collecting data from the real world is a difficult task, especially at an early stage of a project, when the data is not yet in a limited, well-defined form and can arrive in a broad range of formats.
In some previous articles, we worked with small amounts of data to explain the concepts in different ways.
How to Get Big Data Sets?
To construct big data sets for testing and examination, we can use the Python module NumPy.
It comes with a number of methods for creating random data sets of any size.
Example: create an array containing 150 random floats between 0 and 8:
import numpy
# Draw 150 random floats uniformly from the range 0.0 to 8.0
x = numpy.random.uniform(0.0, 8.0, 150)
print(x)
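Printing 150 raw numbers is hard to read, so a quick sanity check can help confirm that the array looks as expected. The following is a minimal sketch that uses only standard NumPy array attributes and methods; the particular checks chosen here are an illustration, not part of the original example.

```python
import numpy

# Draw 150 floats uniformly from the range 0.0 to 8.0
x = numpy.random.uniform(0.0, 8.0, 150)

# Basic checks on the generated array
print(x.size)    # number of values in the array
print(x.min())   # smallest value, not below 0.0
print(x.max())   # largest value, not above 8.0
print(x.mean())  # tends to be near 4.0, but varies between runs
```

The size is fixed by the third argument, while the minimum, maximum, and mean change every time the script runs.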
What Is a Histogram?
Once the data set is created, we can draw a histogram of the values to see how they are distributed.
Using the Python module Matplotlib, we can draw a histogram.
Example to draw a histogram:
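import numpy
import matplotlib.pyplot as plt
# Draw 150 random floats uniformly from the range 0.0 to 8.0
x = numpy.random.uniform(0.0, 8.0, 150)
# Plot the values as a histogram with 8 bars
plt.hist(x, 8)
plt.show()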
Note: The array values are random numbers, so the result on your computer will not match these examples exactly.
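If repeatable results are needed, for example when comparing two runs, the random generator can be seeded. The sketch below assumes NumPy's `numpy.random.default_rng` generator interface; the seed value 42 is an arbitrary choice made for illustration.

```python
import numpy

# Two generators created with the same seed (42 is an arbitrary choice)
rng_a = numpy.random.default_rng(42)
rng_b = numpy.random.default_rng(42)

# Each generator draws 150 floats uniformly from the range 0.0 to 8.0
a = rng_a.uniform(0.0, 8.0, 150)
b = rng_b.uniform(0.0, 8.0, 150)

# Same seed, same sequence of values
print(numpy.array_equal(a, b))  # True
```

Without a seed, every run produces a different array, which is why the numbers in these examples differ from machine to machine.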
Explanation of Histogram
In the above example, we use the array to draw a histogram with 8 bars.
The first bar represents how many values in the array are between 0 and 1.
The second bar represents how many values are between 1 and 2.
And so on.
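Which gives a result of this form (the counts below are illustrative; the actual numbers depend on the array size and change on every run):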
50 values are between 0 and 1
45 values are between 1 and 2
49 values are between 2 and 3
40 values are between 3 and 4
50 values are between 4 and 5
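The counts behind each bar can also be computed directly, without drawing anything. The following is a minimal sketch using `numpy.histogram`. Fixing the bin range to exactly 0 to 8 is an assumption made here so that the bin boundaries fall on whole numbers; `plt.hist(x, 8)` instead derives its range from the smallest and largest values in the data.

```python
import numpy

# Draw 150 floats uniformly from the range 0.0 to 8.0
x = numpy.random.uniform(0.0, 8.0, 150)

# counts[i] is the number of values that fall into bin i;
# edges holds the 9 boundaries of the 8 bins
counts, edges = numpy.histogram(x, bins=8, range=(0.0, 8.0))

for i, count in enumerate(counts):
    print(f"{count} values are between {edges[i]:.0f} and {edges[i + 1]:.0f}")
```

The eight counts always add up to 150, the size of the array, even though the individual counts change from run to run.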
Big Data Distributions
An array containing 150 values is not considered very big, but now that you know how to create a random set of values, you can change the parameters to build a data set as big as you need.
Example
Construct an array with 10000 random numbers, and display them using a histogram with 100 bars:
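import numpy
import matplotlib.pyplot as plt
# Draw 10000 random floats uniformly from the range 0.0 to 5.0
y = numpy.random.uniform(0.0, 5.0, 10000)
# Plot the values as a histogram with 100 bars
plt.hist(y, 100)
plt.show()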
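To make the size and range easy to change, the call can be wrapped in a small function. The sketch below is one possible way to do this: `make_uniform_data` is a hypothetical helper introduced only for illustration, and the sizes in the loop are arbitrary choices.

```python
import numpy

def make_uniform_data(size, low=0.0, high=5.0):
    """Hypothetical helper: return `size` random floats drawn uniformly between low and high."""
    return numpy.random.uniform(low, high, size)

for size in (150, 10000, 1000000):
    data = make_uniform_data(size)
    # Count the values in 5 equal-width bins covering the range 0 to 5
    counts, _ = numpy.histogram(data, bins=5, range=(0.0, 5.0))
    # Share of values per bin; for uniform data each share should be near 0.2
    print(size, counts / size)
```

As the size grows, the shares per bin tend to move closer to 0.2, which is why a histogram of a large uniform data set looks flatter than one built from only a few values.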