vrf Linear regression & goodness of fit

Question asked by VRFuser on Jun 12, 2006
Andy,

There are several things going on here. VEE does indeed appear to be computing r-squared as claimed. The value r is the correlation coefficient, defined by:

r = (mean(x*y) - mean(x)*mean(y))/(sdev(x)*sdev(y))

and is a standard measure of how strongly x and y are correlated. Your example is problematic in that horizontal and vertical lines actually have r=0, not 1: the numerator above vanishes in both cases. (This makes sense for the horizontal line because y is constant, i.e. does not depend on x; and for the vertical line because y can take on any value for the single value of x.) Complicating this is that your example used perfectly horizontal lines, for which sdev(y)=0, so the denominator vanishes as well. This would normally cause a divide-by-zero error, except that VEE apparently calculates the standard deviation in this particular case using a form that is highly susceptible to roundoff error. Recall that the standard deviation is:

sdev(x) = sqrt(mean((x-mean(x))^2))

which can be rewritten as

sdev(x) = sqrt(mean(x^2) - mean(x)^2)

This second form is more susceptible to roundoff error, and it is apparent from fiddling around that this is what VEE uses for the goodness of fit calculation. The VEE functions sdev() and vari() (variance) do not seem to be susceptible to this problem.
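
To make the roundoff point concrete, here is a minimal Python sketch of the two forms, written as variances so that a slightly negative result cannot break a square root. The function names and the test data are made up for illustration, and this is not VEE's actual code, just the two textbook formulas above evaluated in ordinary double precision.

```python
def var_two_pass(x):
    """Variance as mean((x - mean(x))^2), dividing by N."""
    n = len(x)
    m = sum(x) / n
    return sum((v - m) ** 2 for v in x) / n


def var_one_pass(x):
    """Variance as mean(x^2) - mean(x)^2, dividing by N."""
    n = len(x)
    m = sum(x) / n
    return sum(v * v for v in x) / n - m * m


# Hypothetical data: a small spread riding on a large offset.
# The true divide-by-N variance is 2/3.
data = [1.0e8 + 1.0, 1.0e8 + 2.0, 1.0e8 + 3.0]

print("two-pass:", var_two_pass(data))  # close to 2/3
print("one-pass:", var_one_pass(data))  # far from 2/3: the squares are rounded
```

The second form subtracts two nearly equal large numbers, so whatever rounding error the squares picked up is all that is left. For a constant data set the true answer is zero, and the same residue may decide whether the computed denominator comes out exactly zero or as some tiny non-zero value.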

All of this is a long-winded way of getting to some practical suggestions:

If your expected linear relationships have non-zero, finite slopes, you will probably be OK.

Just to be on the safe side, you could subtract the means from the x and y data before feeding them into the regression object and then adjust the results accordingly: the slope is unchanged, and the intercept for the original data is mean(y) - slope*mean(x), since the centered fit passes through the origin.

Failing that, you could write your own regression function in VEE. The formulas should be in any statistics or linear algebra book. Be aware that the VEE sdev() and vari() functions divide by N-1 rather than by N so as to get an unbiased estimate for sampled data sets, whereas the definition for r uses the divide-by-N form.

Alternatively, you could use MATLAB, which has some pretty concise notation for regression. Check out the backslash (\) notation.
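
As a rough sketch of the "write your own" route, the Python below subtracts the means first, uses the divide-by-N form throughout, and refuses to report r when y is flat to within roundoff. The name linear_fit and the rel_tol threshold are assumptions made for this example, not anything taken from VEE, and the threshold in particular would need tuning for real data.

```python
import math


def linear_fit(x, y, rel_tol=1e-12):
    """Least-squares fit y = slope*x + intercept on mean-centered data.

    Returns (slope, intercept, r). r is None when y is constant to
    within rel_tol (an assumed threshold), since r is then 0/0.
    """
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    dx = [v - mx for v in x]  # centered x
    dy = [v - my for v in y]  # centered y

    # Divide-by-N (population) moments, matching the definition of r.
    sxx = sum(d * d for d in dx) / n
    syy = sum(d * d for d in dy) / n
    sxy = sum(a * b for a, b in zip(dx, dy)) / n

    if sxx == 0.0:
        raise ValueError("x is constant: vertical line, slope undefined")

    slope = sxy / sxx
    intercept = my - slope * mx  # undo the centering

    # Treat y as constant if its deviations are at roundoff level.
    y_scale = max(abs(v) for v in y)
    y_is_constant = max(abs(d) for d in dy) <= rel_tol * y_scale
    r = None if y_is_constant else sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


print(linear_fit([0, 1, 2, 3], [1, 3, 5, 7]))   # an exact line with slope 2
print(linear_fit([0, 1, 2, 3], [1.0 / 3.0] * 4))  # horizontal line, r is None
```

Squaring r gives the goodness-of-fit figure. Returning None for the flat case is a design choice for this sketch; returning 0 instead would match the convention described above for horizontal lines.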

I hope this helps.
--
Bill Ossmann
Philips Ultrasound

"Street, Andy M" <andy.street@tycoelectronics.com> wrote on 06/09/2006
11:43:05 AM:

> Hello VRF,
>
> Please can any of the curve-fitters and regressors out there please
> comment on the attached.
>
> I have some test data that should follow a linear relationship.  I
> curve fit using the linear regression object native to VEE.  I would
> like to use the 'goodness of fit' as a metric of how well my data is
> modeled by the linear regression.
>
> Question: Does anyone know how VEE calculates the GoF metric - the
> help is a little vague (it merely says it's the R-squared value which
> can range -1 to 1).
>
> Question: does anyone know why the GoF is so sensitive to the
> numerical value when the gradient is zero?  See attached.  The
> examples show that whilst the object fits a line to the data, the GoF
> tends to 1 or 0, depending upon the value!  Interesting that when the
> number can be correctly represented in binary (integer, 1/2 etc and
> not 1/3), the GoF is 1, but when it is something like 1/3 the GoF plummets to zero....
>
> Thanks
>
> Andy


