
3.5 Correlation Coefficient and Coefficient of Determination

We have already shown how to use the sum of the squares of the individual errors to measure how well a function $f$ predicts $y$ from $x$. Suppose that we have a set of data points of the form $(x_i, y_i)$, for $i = 1, 2, \ldots, n$. First consider trying to fit the data with a constant function (a function of the form $y = c$). The best function of this form, in the least squares sense, is the one for which $c$ is the average of the $y_i$. As an example, consider fitting the data given in Table 3.2 with a constant function. We would like to find the function of the form $y = c$ that fits the data best in the least squares sense. In this case,

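$$A = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}, \qquad b = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix},$$

where $A$ is a column of $n$ ones (one per data point) and $b$ lists the $y$-values from Table 3.2.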

In order to determine the value of c, we solve the normal equations:

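$$A^T A\, c = A^T b, \quad \text{that is,} \quad n\, c = \sum_{i=1}^{n} y_i, \quad \text{so} \quad c = \frac{1}{n} \sum_{i=1}^{n} y_i.$$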

We can see from this example that $A^T A$ is the number of data points and $A^T b$ is the sum of the $y$-values of the data points.
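The computation above can be sketched in a few lines of Python. This is only an illustration: the data values are made up (the entries of Table 3.2 are not reproduced here), and the names `ys`, `AtA`, `Atb`, and `sum_sq_error` are hypothetical.

```python
# Hypothetical y-values; NOT the data from Table 3.2.
ys = [2.0, 3.0, 5.0, 6.0]

# A is a column of n ones, so A^T A = n and A^T b = sum of the y-values.
A = [1.0] * len(ys)
AtA = sum(a * a for a in A)
Atb = sum(a * y for a, y in zip(A, ys))

# Solve the 1x1 normal equation (A^T A) c = A^T b.
c = Atb / AtA


def sum_sq_error(c, ys):
    """Sum of squared errors when every y is predicted by the constant c."""
    return sum((y - c) ** 2 for y in ys)


print(c)                    # 4.0, the average of the y-values
print(sum_sq_error(c, ys))  # 10.0
```

Moving `c` away from the average in either direction increases `sum_sq_error`, which is what "best in the least squares sense" means here.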

Now let's return to a discussion of a general data set. Let $f_c$ represent the constant function that best fits the data in the least squares sense, and let $S_c$ represent the sum of the squares of the errors associated with fitting the data set with $f_c$. Since we may not be satisfied with the error obtained by fitting the data with a constant function, we will also consider fitting the data with a linear function of the form

y = mx + b. (3.12)

Let $f_l$ represent the function of the form (3.12) that best fits the data in the least squares sense, and let $S_l$ represent the sum of the squares of the errors associated with fitting the data set with $f_l$. Every constant function is also a function of the form (3.12). We can see this by letting $m = 0$ and $b = c$ in (3.12). That means that the function $f_l$ must fit the data at least as well as $f_c$ does. So the errors in the two fits must satisfy $S_l \le S_c$.
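A small numerical check of this inequality, as a sketch under stated assumptions: the data are made up, the helper names (`fit_constant`, `fit_line`, `sum_sq_errors`) are hypothetical, and the line is fitted with the standard closed-form solution of the two-parameter normal equations rather than by forming $A$ and $b$ explicitly.

```python
def fit_constant(ys):
    """Best constant in the least squares sense: the average of the y-values."""
    return sum(ys) / len(ys)


def fit_line(xs, ys):
    """Least squares slope m and intercept b for y = m*x + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = y_bar - m * x_bar
    return m, b


def sum_sq_errors(f, xs, ys):
    """Sum of squared errors of the predictions f(x) against y."""
    return sum((y - f(x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 5.0, 6.0]

c = fit_constant(ys)
m, b = fit_line(xs, ys)

S_c = sum_sq_errors(lambda x: c, xs, ys)
S_l = sum_sq_errors(lambda x: m * x + b, xs, ys)

# The constant function is the member of the linear family with m = 0, b = c.
S_as_line = sum_sq_errors(lambda x: 0.0 * x + c, xs, ys)

print(S_l <= S_c)        # True
print(S_as_line == S_c)  # True
```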

We can use the ratio of these two error measures to see how well linear changes in the $x$-values predict changes in the $y$-values. The ratio $S_l/S_c$ represents the proportion of the variation in $y$ that is not explained by linear variation in $x$. We will define $R^2 = 1 - S_l/S_c$, assuming $S_c > 0$ (that is, the $y_i$ are not all equal). $R^2$ measures the proportion of the variation that is explained by simple linear regression. If the linear function that best fits the data happens to be a constant function, then $S_c = S_l$ and $R^2 = 0$. On the other hand, if the error obtained by fitting the data using the best linear function is much smaller than the error obtained using the best constant function, then the ratio $S_l/S_c$ will be small and the value of $R^2$ will be close to 1. The value $R^2$ is called the coefficient of determination.
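The two extreme cases described above can be illustrated with a short Python sketch. The function name `r_squared_linear` and both data sets are hypothetical, chosen so that one set lies exactly on a line and the other has a best-fit line with zero slope.

```python
def r_squared_linear(xs, ys):
    """R^2 = 1 - S_l/S_c for the least squares line y = m*x + b.

    Assumes the x-values are not all equal and the y-values are not all equal.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = y_bar - m * x_bar
    S_l = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    S_c = sum((y - y_bar) ** 2 for y in ys)
    return 1 - S_l / S_c


xs = [1.0, 2.0, 3.0, 4.0]

# Points exactly on y = 2x + 1: the line leaves no error, so R^2 is 1.
r2_perfect = r_squared_linear(xs, [3.0, 5.0, 7.0, 9.0])

# Best-fit line has slope 0, so it is the best constant and R^2 is 0.
r2_flat = r_squared_linear(xs, [1.0, 2.0, 2.0, 1.0])

print(r2_perfect, r2_flat)
```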

We can define the coefficient of determination for fits using other types of curves. So far, we have considered only curves that could be written in the form (3.12). We call the set of curves satisfying this relationship a family of curves. If the data seems to follow a different shape, then we might want to consider fitting the data with curves from a different family; for example, we might try the family of cubic functions or even a family involving exponential functions. Let $f_\alpha$ represent the function from our chosen family that best fits the data, and let $S_\alpha$ represent the sum of squared errors obtained by fitting the data using $f_\alpha$. Then we will use the definition $R^2 = 1 - S_\alpha/S_c$. $R^2$ is a relative measure of fit: it compares the best fit using a function from our chosen family with the best fit using a constant function.
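As a rough sketch of this generalization, the following Python code computes $R^2$ for a polynomial family by solving the normal equations directly. Everything here is an assumption made for illustration: the roughly parabolic data are invented, the helper names (`solve`, `fit_poly`, `evaluate`, `r_squared`) are hypothetical, and a quadratic family is used (passing `degree=3` would give the cubic family mentioned above).

```python
def solve(M, v):
    """Solve M z = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - s) / aug[i][i]
    return z


def fit_poly(xs, ys, degree):
    """Least squares polynomial coefficients, highest power first."""
    # Columns of A are x^degree, ..., x, 1.
    A = [[x ** p for p in range(degree, -1, -1)] for x in xs]
    cols = degree + 1
    AtA = [[sum(row[i] * row[j] for row in A) for j in range(cols)]
           for i in range(cols)]
    Atb = [sum(row[i] * y for row, y in zip(A, ys)) for i in range(cols)]
    return solve(AtA, Atb)


def evaluate(coeffs, x):
    """Evaluate the polynomial at x using Horner's rule."""
    result = 0.0
    for a in coeffs:
        result = result * x + a
    return result


def r_squared(xs, ys, degree):
    """R^2 = 1 - S_alpha/S_c for the polynomial family of the given degree."""
    coeffs = fit_poly(xs, ys, degree)
    y_bar = sum(ys) / len(ys)
    S_alpha = sum((y - evaluate(coeffs, x)) ** 2 for x, y in zip(xs, ys))
    S_c = sum((y - y_bar) ** 2 for y in ys)
    return 1 - S_alpha / S_c


# Hypothetical, roughly parabolic data.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [4.1, 0.9, 0.2, 1.1, 3.9]

r2_line = r_squared(xs, ys, 1)  # close to 0: a line explains almost nothing
r2_quad = r_squared(xs, ys, 2)  # close to 1: a quadratic fits well
print(r2_line, r2_quad)
```

For data shaped like this, the linear family does barely better than a constant, while the quadratic family accounts for nearly all of the variation.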

If the family of curves that we are using contains constant functions (and most will), then we will have the relationship $S_\alpha \le S_c$, and then $0 \le R^2 \le 1$.
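To see why the condition matters, consider a family that does not contain the nonzero constant functions, such as lines through the origin, $y = mx$. The sketch below uses invented data and illustrative variable names; with them, the best member of this family fits worse than the best constant, so $R^2$ comes out negative.

```python
# Hypothetical data that are nearly constant and far from the origin.
xs = [1.0, 2.0, 3.0]
ys = [5.0, 6.0, 5.0]

# Least squares slope for y = m*x: here A is the column of x-values,
# so the normal equation is (sum of x^2) * m = (sum of x*y).
m = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

S_alpha = sum((y - m * x) ** 2 for x, y in zip(xs, ys))

y_bar = sum(ys) / len(ys)
S_c = sum((y - y_bar) ** 2 for y in ys)

R2 = 1 - S_alpha / S_c
print(S_alpha > S_c)  # True: the best line through the origin is worse
print(R2 < 0)         # True
```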

If we are fitting a set of data in two variables and we select our function from those of the form (3.12), then we are performing what is called simple linear regression. In this case, it is customary to use the notation $r$ instead of $R$, with $r$ given the sign of the slope $m$, and to refer to $r$ as the correlation coefficient.
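The link between $r$ and $R^2$ can be checked numerically. The sketch below computes $r$ with the standard Pearson formula, which this section does not state, and compares $r^2$ with $1 - S_l/S_c$. The data and variable names are hypothetical.

```python
import math

# Hypothetical data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 5.0, 6.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Pearson correlation coefficient (standard formula, assumed here).
r = sxy / math.sqrt(sxx * syy)

# Coefficient of determination from the two least squares fits.
m = sxy / sxx
b = y_bar - m * x_bar
S_l = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
S_c = syy  # the best constant is y_bar, so S_c is the spread about y_bar
R2 = 1 - S_l / S_c

print(abs(r * r - R2) < 1e-12)  # True: r squared matches R^2
print((r > 0) == (m > 0))       # True: r has the sign of the slope
```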


Send comments on material to Cynthia Lanius

These pages are maintained by Hilena Vargas (hvargas@rice.edu)
Updated: March 1, 2001

 Copyright © 2001 Richard Tapia and Cynthia Lanius