Exactly! A pair plot is a matrix of plots showing the scatter of data points for every possible pair of features. Personally, I remember it as a scatter plot of every pair of features. That means if there are n dimensions or features, a pair plot simply gives us a matrix of n*n plots. Seaborn has a very simple one-line call for pair plots, its pairplot function.

Let me give you a situation. Imagine you have 100 features and you want their pair plot. Now 100*100 is 10,000 plots. It would be really difficult to go through every plot and make sense of it. So to solve this problem we have something called Dimension Reduction, with techniques such as PCA and T-SNE for visualizing a high-dimensional data set.

I can't cover Dimension Reduction here because it is a very big topic in itself, but I have already written an article on the Geometric Intuition of PCA and T-SNE. You will easily be able to connect the dots after reading that.

Low dimension data (up to 10 features) - use PAIR PLOTS
High dimension data (100 features or more) - use DIMENSION REDUCTION

Now what is a Q-Q plot then, and why do we need it?

Before going further, I am assuming that you know what a Gaussian Distribution or Normal Distribution is. If not, just know some simple facts about it. The mean, median and mode are the same in this distribution. In its standard form, the mean is 0 and the standard deviation is 1. This distribution has been widely studied by scientists, and we have enough information about it to shape our model into a good machine learning model.

So if we somehow get to know that our distribution is Gaussian, we can build a great machine learning model.

BUT HOW DO WE DETERMINE WHETHER A DISTRIBUTION IS GAUSSIAN OR NOT?

There are two methods to determine that. So now we have the intuition for why we need a Q-Q plot.

So let's assume we have a random variable X and we take 500 observations of it, say x1, x2, …, x500. HERE WE DO NOT KNOW THE DISTRIBUTION OF X. AT THE END OF THE Q-Q PLOT WE SHOULD KNOW WHETHER IT IS NORMALLY DISTRIBUTED OR NOT.

Sort the xi's in ascending order and find the percentiles. (If you don't know how to find a percentile or what exactly a percentile is, let's assume I have 100 values and I sort them in ascending order. In this set, say I rank each value from 1 to 100, so the first value gets rank 1 and the last value gets rank 100. Then I can say that below the value x10, that is, the value at rank 10 or the 10th percentile, 10% of the values lie, and above it 90% of the values lie.) So we will get 100 percentile values for the original 500 samples.
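To make the percentile step of the Q-Q plot concrete, here is a minimal sketch in Python. It assumes NumPy is available, and the 500 observations are made up (drawn from a normal distribution only so the example runs). The comparison against standard normal percentiles at the end is my assumption about how the sample percentiles get used, and all variable names are hypothetical.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)

# Hypothetical data: 500 observations x1..x500 of the random variable X.
# In a real problem we would NOT know the distribution of X.
x = rng.normal(loc=50.0, scale=10.0, size=500)

# Step 1: sort the observations in ascending order.
# (np.percentile does not need sorted input; sorting just mirrors the text.)
x_sorted = np.sort(x)

# Step 2: find the 1st, 2nd, ..., 100th percentile of the sample.
ranks = np.arange(1, 101)
sample_percentiles = np.percentile(x_sorted, ranks)

# Assumed next step: the same percentiles of a standard normal distribution.
# The 100th percentile of a normal distribution is infinite, so stop at 99.
theoretical = np.array(
    [NormalDist().inv_cdf(float(p) / 100.0) for p in ranks[:99]]
)

# If X is roughly Gaussian, plotting sample_percentiles[:99] against
# theoretical gives points close to a straight line. The correlation
# between the two is one rough numeric summary of that straightness.
r = np.corrcoef(theoretical, sample_percentiles[:99])[0, 1]
print(f"{len(sample_percentiles)} sample percentiles, correlation {r:.4f}")
```

The 10th entry of `sample_percentiles` is the value below which about 10% of the 500 observations lie, which matches the rank-based picture of a percentile described above.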