March 4, 2024

Correlation and Regression by K Bhanumoorthy

 

A quote from Dr. Napoleon Hill.


“Whatever the mind can conceive and believe, the mind can achieve.”

Correlation is a measure of the relationship or association between two variables, say x and y.

There are two types: positive and negative. If y increases when x increases, or y decreases when x decreases, the correlation is positive (+ve). If y decreases when x increases, or y increases when x decreases, the correlation is negative (-ve). In some instances there is no correlation at all.

Examples: Positive

1. Cost of petrol: if one litre costs Rs. 102, then 4 litres cost Rs. 408; the more petrol bought, the higher the cost.

2. Cost of a bus or train ticket: the longer the travel distance, the higher the fare.

3. Consumption of electricity: the longer an appliance is used, the more electricity is consumed and the higher the bill.

Examples: Negative

1. If you increase your speed while driving a car (or a scooter, or any vehicle for that matter), the travel time decreases.

2. In a circuit, if the resistance R is higher, the current is lower (at a fixed voltage).
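To see the sign of the correlation on two of the examples above, here is a minimal Python sketch. It uses the usual Pearson coefficient (covariance divided by the product of the two standard deviations). The function name pearson_r, the 120 km trip and the chosen speeds are assumptions made only for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation of two equal-length lists (population formulas)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (sd_x * sd_y)


# Petrol example: cost rises with quantity (Rs. 102 per litre).
litres = [1, 2, 3, 4]
cost = [102 * q for q in litres]

# Speed example: assumed 120 km trip, travel time in hours falls as speed rises.
speeds = [20, 30, 40, 60]
times = [120 / s for s in speeds]

print(pearson_r(litres, cost))   # close to +1 (positive correlation)
print(pearson_r(speeds, times))  # about -0.94 (negative correlation)
```

The second value is negative but not exactly -1, because travel time falls along a curve rather than a straight line as speed increases.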

Francis Galton introduced this topic to mathematics in 1888. Correlation is used in psychology and education, in business for financial analysis and decision making, in statistics for the analysis of variates, and in research, where scientists analysing data use it to reduce mistakes.


There are 4 types of correlation.

1. Pearson correlation: It is a correlation coefficient that measures the linear correlation between two sets of data.



2. Kendall rank correlation: It is used to measure the ordinal association between two measured quantities.


3. Spearman correlation: It is a nonparametric measure of rank correlation that assesses how well the relationship between two variables can be described using a monotonic function. It is denoted by the symbol “rho” (ρ) and can take values between -1 and +1. A positive value of rho indicates a positive relationship between the two variables, while a negative value of rho indicates a negative relationship. A rho value of 0 indicates no association between the two variables.

Spearman’s Correlation formula

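\rho = 1 - \frac{6\sum d_{i}^{2}}{n(n^2-1)}

where d_i is the difference between the two ranks of the i-th observation and n is the number of observations. This form of the formula applies when there are no tied ranks.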
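The two rank-based measures (Kendall and Spearman) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are made up, the function names are hypothetical, ties are not handled, and Kendall's coefficient is taken in its simplest form (concordant pairs minus discordant pairs, divided by the total number of pairs).

```python
from itertools import combinations


def ranks(values):
    """Rank values from 1 (smallest) to n; this sketch assumes no ties."""
    if len(set(values)) != len(values):
        raise ValueError("this sketch does not handle tied values")
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman's rho from the rank differences d_i."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


def kendall_tau(x, y):
    """Kendall's tau: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / pairs


# Made-up data: five observations, no ties.
x_values = [10, 20, 30, 40, 50]
y_values = [1, 3, 2, 5, 4]

print(spearman_rho(x_values, y_values))  # about 0.8 (sum of d_i^2 is 4)
print(kendall_tau(x_values, y_values))   # about 0.6 (8 concordant, 2 discordant)
```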

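4. Point-Biserial Correlation: The point-biserial correlation coefficient (r_pb) is a correlation coefficient used when one variable (e.g. Y) is dichotomous. To calculate r_pb, assume that the dichotomous variable Y takes the two values 0 and 1. If we divide the data set into two groups, group 1 which received the value "1" on Y and group 2 which received the value "0" on Y, then the point-biserial correlation coefficient is calculated as follows:

r_{pb} = \frac{M_1 - M_0}{s_n} \sqrt{\frac{n_1 n_0}{n^2}}

where M_1 and M_0 are the means of the other (continuous) variable X in group 1 and group 2, n_1 and n_0 are the numbers of observations in the two groups, n = n_1 + n_0, and s_n is the standard deviation used when data are available for every member of the population:

s_n = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2}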
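A small Python sketch of the point-biserial calculation may help. It assumes the standard form r_pb = (M1 - M0) / s_n × √(n1 n0 / n²), with s_n the population standard deviation of the continuous variable. The data (test scores and a pass/fail flag) and the function name are made up for illustration.

```python
from math import sqrt
from statistics import mean, pstdev


def point_biserial(x, y):
    """Point-biserial correlation; y must contain only the values 0 and 1."""
    group1 = [xi for xi, yi in zip(x, y) if yi == 1]
    group0 = [xi for xi, yi in zip(x, y) if yi == 0]
    n1, n0, n = len(group1), len(group0), len(x)
    s_n = pstdev(x)  # population standard deviation of all x values
    return (mean(group1) - mean(group0)) / s_n * sqrt(n1 * n0 / n ** 2)


# Made-up data: six scores and whether each student passed (1) or not (0).
scores = [1, 2, 3, 4, 5, 6]
passed = [0, 0, 0, 1, 1, 1]

print(point_biserial(scores, passed))  # about 0.88
```

Here M1 = 5, M0 = 2 and n1 = n0 = 3, so the result is strongly positive: higher scores go with the value 1.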

Let me recall what was learnt in earlier classes. In the chapter on measures of central tendency, we studied the mean, median and mode (the harmonic mean and geometric mean are also averages of this kind). These averages give only a rough idea of where the observations are centred; they do not tell us to what extent the observations are scattered. So we go further and study the measures of dispersion, i.e. 1. range, 2. mean deviation about the mean or the median, 3. standard deviation (SD), 4. variance. These measures of dispersion indicate the degree to which the observations are scattered. We use the standard deviation for our analysis.

Variance: Given a set of numbers, variance is a measure of how far each number is from the mean. It is calculated by taking the difference between each number and the mean, squaring these differences, and dividing the sum of the squares by the number of values in the set.
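The steps just described translate directly into Python. This is a minimal sketch with made-up numbers; it divides by the number of values n, as in the description above, and the function name is hypothetical.

```python
from math import sqrt


def population_variance(values):
    """Mean of the squared differences from the mean (divides by n)."""
    n = len(values)
    mean = sum(values) / n
    squared_differences = [(v - mean) ** 2 for v in values]
    return sum(squared_differences) / n


# Made-up data set with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

variance = population_variance(data)
print(variance)        # 4.0
print(sqrt(variance))  # 2.0, the standard deviation
```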

Variance is used in statistical inference and hypothesis testing. It is also used in managing investment portfolios, to understand and improve investments.


Formulae are given for all these types. From the given data one forms the tables and then substitutes into the formula. It is a matter of the 4 fundamental operations, squares and square roots. One has to remember the formula. Let us work out an example.

Before we move to the problems, there is another term connected with this topic: covariance.

Covariance is a measure of the directional relationship between two variables, i.e. the extent to which the variables change together. The correlation coefficient is a pure number lying between -1 and 1, whereas covariance is measured in units (the product of the units of the two variables). Variance is a measure of magnitude and is never negative, whereas covariance can be positive (+) or negative (-).

Formula for covariance:

\mathrm{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n}

where X_i denotes the values of variable X, Y_i denotes the values of variable Y, \bar{X} is the mean of variable X, \bar{Y} is the mean of variable Y, and n is the number of data entries (units).
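The worked examples further down use two shortcut forms of this formula. As a sketch of why they agree with the definition, expand the product and use the fact that the sum of the X values is n times their mean (and likewise for Y):

```latex
\begin{aligned}
\mathrm{Cov}(X, Y)
  &= \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) \\
  &= \frac{1}{n}\sum_{i=1}^{n} X_i Y_i
     - \bar{Y}\,\frac{1}{n}\sum_{i=1}^{n} X_i
     - \bar{X}\,\frac{1}{n}\sum_{i=1}^{n} Y_i
     + \bar{X}\bar{Y} \\
  &= \frac{1}{n}\sum_{i=1}^{n} X_i Y_i - \bar{X}\bar{Y} \\
  &= \frac{n\sum X_i Y_i - \left(\sum X_i\right)\left(\sum Y_i\right)}{n^2}
\end{aligned}
```

The third line is convenient when the means are easy to compute; the last line is convenient when only the totals are given.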

Variance is the square of the standard deviation (SD) and is denoted by σ² (sigma squared). One has to be familiar with the universally accepted symbols used in mathematics.

Correlation coefficient:

r(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}

where r(X, Y) is the correlation between X and Y, Cov(X, Y) is the covariance between X and Y, σ_X is the standard deviation of X and σ_Y is the standard deviation of Y.

 

Note: The correlation coefficient always lies between -1 and 1. If it is 0 (zero), there is no linear correlation between the two given variables.

Covariance: Examples.

1. Find Cov(X, Y) between the two variables X and Y, given

X: 3 4 5 6 7
Y: 8 7 6 5 4

From the given data, ∑XY = 24 + 28 + 30 + 30 + 28 = 140, ∑X = 25, ∑Y = 30 and n = 5.

Solution: Cov(X, Y) = {n∑XY - (∑X)(∑Y)} / n². Substituting the values, we get (5 × 140 - 25 × 30) / 25 = (700 - 750) / 25 = -50 / 25 = -2. Since the covariance is negative, we can conclude that the variables are negatively correlated.
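As a cross-check of example 1, the covariance can also be computed straight from the definition (mean of the products of deviations) in Python. The function and variable names are only illustrative.

```python
def covariance(x, y):
    """Population covariance from the definition: mean product of deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n


# Data from example 1.
x_data = [3, 4, 5, 6, 7]
y_data = [8, 7, 6, 5, 4]

# Totals used in the hand calculation.
sum_x = sum(x_data)                                # 25
sum_y = sum(y_data)                                # 30
sum_xy = sum(a * b for a, b in zip(x_data, y_data))  # 140

print(sum_x, sum_y, sum_xy)
print(covariance(x_data, y_data))  # -2.0, matching the hand calculation
```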

2. Find the correlation coefficient for the given data: Cov(X, Y) = -16.5, Var(X) = 2.89, Var(Y) = 100.

r(X, Y) = Cov(X, Y) / √(Var(X) · Var(Y)) = -16.5 / √(2.89 × 100) = -16.5 / (1.7 × 10) = -16.5 / 17.

The correlation coefficient is calculated to be -0.97 (negative).

3. Given ∑X = 15, ∑Y = 36, ∑XY = 110 and n = 5, find Cov(X, Y).

Ans: Cov(X, Y) = (1/n) ∑XiYi - X̄ Ȳ, where X̄ = 15/5 = 3 and Ȳ = 36/5 = 7.2.

Cov(X, Y) = (1/5) × 110 - 3 × 7.2 = 22 - 21.6 = 0.4.
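When only the totals are known, as in example 3, the same calculation can be written as a small Python function. This is a sketch; the function name and parameter names are assumptions for illustration.

```python
def covariance_from_totals(n, sum_x, sum_y, sum_xy):
    """Covariance as (1/n) * sum(XY) minus the product of the two means."""
    mean_x = sum_x / n
    mean_y = sum_y / n
    return sum_xy / n - mean_x * mean_y


# Example 3: n = 5, sum X = 15, sum Y = 36, sum XY = 110.
print(covariance_from_totals(5, 15, 36, 110))  # about 0.4

# Example 1 totals: n = 5, sum X = 25, sum Y = 30, sum XY = 140.
print(covariance_from_totals(5, 25, 30, 140))  # -2.0
```

The first result may print with a tiny floating-point rounding error rather than exactly 0.4.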

 

4. Given Cov(X, Y) = -13.5, Var(X) = 2.25 and Var(Y) = 100, find the correlation coefficient r(X, Y).

Ans: r(X, Y) = Cov(X, Y) / (√Var(X) · √Var(Y)) = -13.5 / (1.5 × 10) = -13.5 / 15 = -0.9 (negative).
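Examples 2 and 4 both divide the covariance by the square root of the product of the two variances. A minimal Python sketch of that step, with a hypothetical function name, can be used to check the arithmetic:

```python
from math import sqrt


def correlation_from_cov(cov_xy, var_x, var_y):
    """Correlation coefficient from a covariance and the two variances."""
    return cov_xy / sqrt(var_x * var_y)


# Example 2: Cov = -16.5, Var(X) = 2.89, Var(Y) = 100.
print(round(correlation_from_cov(-16.5, 2.89, 100), 2))  # -0.97

# Example 4: Cov = -13.5, Var(X) = 2.25, Var(Y) = 100.
print(round(correlation_from_cov(-13.5, 2.25, 100), 2))  # -0.9
```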

 

 

(To be continued)

 

 

 

 

 

 

