April 11, 2017 --- Class 25 --- Fitting Data, Chi-square Distribution,
Confidence Levels, Polyfit, Start of Verification of Error Estimates
Activities:
Fitting Data
------------
In our last class, we looked at what one should do to fit data
to a model function.
If the data has a Gaussian distribution about its true value, then
we maximize the probability for the set of observations by
adjusting the parameters of the model to minimize \chi^2.
\chi^2 = \sum_{i=1}^{N} (y_i - y_model(x_i))^2 / \sigma_i^2
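As a quick illustration, here is a short Python sketch of the \chi^2 sum
(not the class's code; the data values below are made up):

```python
import numpy as np

def chi_square(y, y_model, sigma):
    """Sum of squared, error-normalized residuals."""
    r = (np.asarray(y) - np.asarray(y_model)) / np.asarray(sigma)
    return np.sum(r**2)

# Toy example: three measurements with errors, compared to a model.
y     = np.array([1.1, 2.0, 2.9])
y_mod = np.array([1.0, 2.0, 3.0])
sigma = np.array([0.1, 0.1, 0.1])
print(chi_square(y, y_mod, sigma))  # two residuals are +-1 sigma, one is 0
```

Adjusting the model parameters to minimize this quantity gives the
maximum-likelihood fit for Gaussian errors.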
Linear Model Function
---------------------
The case of a linear model was considered in great detail. We
obtained specific formulae for the slope and intercept parameters of
the fit. However, even with a formula for the slope and intercept,
there are two important questions:
1) How large is the error in the fitted parameters?
2) How good is the fit?
In this class, using a standard propagation of errors approach,
we found the error on the slope. In addition, I displayed the formulae
for the intercept, and the covariance of slope and intercept. Next, we
turn to the question of how good is the fit.
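The closed-form results for the straight-line fit can be collected in a
short sketch. This Python function implements the standard weighted
least-squares formulas; the sum names S, Sx, etc. are my own shorthand,
not notation from the lecture:

```python
import numpy as np

def linear_fit(x, y, sigma):
    """Weighted least-squares fit of y = a + b*x.
    Returns (a, b, var_a, var_b, cov_ab) from the standard
    closed-form expressions, with weights w_i = 1/sigma_i^2."""
    x, y = np.asarray(x), np.asarray(y)
    w = 1.0 / np.asarray(sigma)**2
    S, Sx, Sy = w.sum(), (w*x).sum(), (w*y).sum()
    Sxx, Sxy = (w*x*x).sum(), (w*x*y).sum()
    delta = S*Sxx - Sx**2
    a = (Sxx*Sy - Sx*Sxy) / delta       # intercept
    b = (S*Sxy - Sx*Sy) / delta         # slope
    return a, b, Sxx/delta, S/delta, -Sx/delta

# Noise-free check: data on the exact line y = 2 + 3x is recovered exactly.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 + 3.0*x
a, b, var_a, var_b, cov_ab = linear_fit(x, y, np.full(4, 0.5))
```

The variances and the covariance come from propagating the \sigma_i
through the formulas for a and b, which is the calculation done in class.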
Chi-squared Distribution
------------------------
In consideration of our second question we note that \chi^2 can be used to
determine the goodness of fit. Since \chi^2 should be the sum of
the squares of N normalized Gaussian variables, it has a known
distribution. If we find that our fit gives a small value of
\chi^2, that is good. If we get a large value, maybe our model is
not the correct description of the data. If the model has p
parameters that we adjust, the fit is said to have N-p degrees of
freedom. Soon we will look at a code to compute
the probability that \chi^2 is greater than the observed value.
This is called the confidence level of the fit. If your
confidence level gets below 5-10%, you begin to wonder if you
really have the right model for describing your data.
Analytic Approach to Chi-square Distribution
---------------------------------------------
We showed how the chi-square distribution can be found by
integrating random variables in spherical coordinates. If we integrate
out all the variables but the radial one, the square of the radial
variable is the random variable chi-squared. In fact, we don't
even have to do the angular integral because it contributes to the
normalization and we can just normalize our probability distribution
function so that when we integrate it from 0 to infinity, we get one.
However, we did the radial integration, getting Euler's gamma
function, and that allows us to calculate the "solid angle" of
an n-dimensional hypersphere, i.e., the area of a unit sphere.
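For reference, carrying out this Gaussian-integral argument gives the
standard result for the total solid angle, i.e., the area of the unit
sphere in n dimensions:

```latex
% Area of the unit sphere in n dimensions (total solid angle),
% from the Gaussian integral / Gamma-function argument above:
S_{n-1} = \frac{2\pi^{n/2}}{\Gamma(n/2)}
% Checks: n = 2 gives 2\pi (circumference of the unit circle),
%         n = 3 gives 4\pi (area of the unit sphere).
```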
Here is a Mathematica expression for the chi-squared distribution:
chisq[x_, n_] := (x^((n - 2)/2)*Exp[-(x/2)])/(Gamma[n/2]*2^(n/2))
When making histograms of chi-square or graphs in Mathematica, we find
that the peak in the function occurs near x = n - 2, which is close to
n, the number of degrees of freedom, when n is large.
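The mode of the chi-square density with n degrees of freedom is at
x = n - 2 (for n >= 2), which approaches n as n grows. A quick numerical
check, sketched in Python assuming SciPy is available:

```python
import numpy as np
from scipy.stats import chi2

# Locate the maximum of the chi-square pdf on a fine grid.
n = 10                                   # degrees of freedom
x = np.linspace(0.01, 30.0, 300000)
mode = x[np.argmax(chi2.pdf(x, n))]
# The analytic mode is n - 2 = 8 here.
```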
Mathematica has the chi-square distribution built in as one of its many
statistical distributions. We explored this and there is a Mathematica notebook
~sg/Documents/chisquaredistribution.nb for you to review.
Confidence Levels
-----------------
As discussed above, \chi^2 can be used to determine the goodness of
fit: it should be the sum of the squares of N normalized Gaussian
variables, and a model with p adjustable parameters leaves a fit with
N-p degrees of freedom. A program confidence2, whose source code is in
~sg/src/misc/confidence2_2.c can be used to determine the confidence
level of a fit. The confidence level depends on \chi^2 and the
number of degrees of freedom. The confidence level is the
probability that \chi^2 would exceed the value for which we are
computing the confidence level. As an example:
% confidence2
6.3 5
6.300000e+00 5.000000e+00 2.781123e-01
This represents the screen display for considering \chi^2=6.3 for
5 degrees of freedom. The confidence level is 0.278 or nearly 28%.
Thus, for a fit with this confidence level, we would realize that
even for a correct model the computed value of \chi^2 would be
larger than this 28% of the time. Usually, statisticians like to
rule things out at the 90% or 95% confidence level. So, when your
confidence level gets below 5-10%, you begin to wonder if you
really have the right model for describing your data. (It is also
possible that you have underestimated the errors.)
BTW, confidence2 takes its input from the standard input, so hit
Control-D or Control-C to terminate it.
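The output of confidence2 can be cross-checked with any implementation
of the chi-square survival function. Here is a Python sketch assuming
SciPy is available (the class program itself is the C code above):

```python
from scipy.stats import chi2

def confidence_level(chisq, ndof):
    """Probability that chi^2 would exceed the observed value,
    i.e. the survival function of the chi-square distribution."""
    return chi2.sf(chisq, ndof)

# Same inputs as the confidence2 example: chi^2 = 6.3 with 5 dof.
print(confidence_level(6.3, 5))   # ~0.2781, matching confidence2
```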
Confidence Level From Mathematica
---------------------------------
Since we have already considered how to get Mathematica to calculate
the normalized chisq distribution, we can just use numerical
integration to compute the confidence level:
chisq[x_, n_] := (x^((n - 2)/2)*Exp[-(x/2)])/(Gamma[n/2]*2^(n/2))
confidence[chisquared_, n_] := NIntegrate[ chisq[y,n], {y,chisquared,Infinity}]
You may also use Mathematica's CDF (cumulative distribution function) or
Quantile function to calculate the confidence level.
Polyfit
-------
Polyfit is a routine that fits data with an nth order polynomial. You
are welcome to use or copy it. You will find the code and a makefile
in ~sg/src/misc. Before class, I made up a code to produce data
close to a quadratic function. It is called almostquad and is
in my bin directory. The source code is in ~sg/src/misc/almostquad.c.
You will also need gausrand.c and newfsr.c to compile almostquad.
almostquad is set to produce a curve:
2.0 + x + 0.1 x*x + noise
The noise is Gaussian with a standard deviation sigma = 0.8.
I ran almostquad and put the output in a file. I then demonstrated
how to use polyfit to fit the data.
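Here, polyfit refers to the instructor's C program in ~sg/src/misc. As
an independent sketch, the same exercise can be reproduced with NumPy's
np.polyfit; the seed and the x grid below are my own choices, not from
class:

```python
import numpy as np

rng = np.random.default_rng(1)          # fixed seed for reproducibility
sigma = 0.8
x = np.arange(0.0, 20.0, 1.0)
y_true = 2.0 + x + 0.1*x*x              # the almostquad curve
y = y_true + rng.normal(0.0, sigma, x.size)

# Fit a 2nd-order polynomial; numpy returns highest power first.
c2, c1, c0 = np.polyfit(x, y, 2)
```

With 20 points and sigma = 0.8, the fitted coefficients should land
reasonably close to the true values 0.1, 1, and 2.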
We can learn quite a bit by doing this once, but we want to explore
the statistics of fitting many data sets that only differ by the
random noise. Students started writing a script to go through
the process of creating data and fitting it many times so that
we can look at the statistical fluctuations on our next class.
We would like to see how the variation of our fitted parameters
compares with the error estimate printed out by polyfit.
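The repeated-fit experiment the students started scripting can be
sketched as follows (again in Python with np.polyfit standing in for
the class's polyfit program; the trial count and grid are my choices):

```python
import numpy as np

rng = np.random.default_rng(7)
sigma, ntrials = 0.8, 500
x = np.arange(0.0, 20.0, 1.0)
y_true = 2.0 + x + 0.1*x*x

# Fit many data sets that differ only in the random noise.
fits = np.empty((ntrials, 3))
for i in range(ntrials):
    y = y_true + rng.normal(0.0, sigma, x.size)
    fits[i] = np.polyfit(x, y, 2)       # [c2, c1, c0]

# Ensemble statistics of the fitted parameters.  The scatter (std dev
# over trials) is what the per-fit error estimate should reproduce.
means = fits.mean(axis=0)
scatter = fits.std(axis=0)
```

Comparing scatter against the errors reported for a single fit is the
verification of error estimates planned for next class.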