December 2, 2014 --- Class 27 --- Chi-square Distribution, Confidence Levels,
Verification of Error Estimates
Activities:
Analytic Approach to Chi-square Distribution
---------------------------------------------
We completed the derivation of the chi squared distribution by doing
the radial integral and also found the surface area of a sphere in
n dimensions.
Here is a Mathematica expression for the chi-squared distribution:
chisq[x_, n_] := (x^((n - 2)/2)*Exp[-(x/2)])/(Gamma[n/2]*2^(n/2))
When making histograms of chi-square or graphs in Mathematica, we find
that the peak in the function occurs where x is about n, the number of
degrees of freedom.
Mathematica has the chi-square built in as one of it many statistical
distributions. We explored this and there is a Mathematica notebook
~sg/Documents/chisquaredistribution.nb for you to review.
Confidence Levels
-----------------
We considered the fact that the quantity \chi^2 can be used to
determine the goodness of fit. Since \chi^2 should be the sum of
the squares of N normalized Gaussian variables, it has a known
distribution. If we find that our fit gives a small value of
\chi^2, that is good. If we get a large value, maybe our model is
not the correct description of the data. If the model has p
parameters that we adjust, the fit is said to have N-p degrees of
freedom. A program confidence2, whose source code is in
~sg/src/misc/confidence2_2.c can be used to determine the confidence
level of a fit. The confidence level depends on \chi^2 and the
number of degrees of freedom. The confidence level is the
probability that \chi^2 would exceed the value for which we are
computing the confidence level. As an example:
confidence2
6.3 5
6.300000e+00 5.000000e+00 2.781123e-01
This represents the screen display for considering \chi^2=6.3 for
5 degrees of freedom. The confidence level is 0.278 or nearly 28%.
Thus, for a fit with this confidence level, we would realize that
even for a correct model the computed value of \chi^2 would be
larger than this 28% of the time. Usually, statistician like to
rule things out at the 90% or 95% confidence level. So, when your
confidence level gets below 5-10%, you begin to wonder if you
really have the right model for describing your data. (It is also
possible that you have underestimated the errors.)
BTW, confidence2 takes its input from the standard input, so hit
Control-D or Control-C to terminate its running.
Confidence Level From Mathmatica
--------------------------------
Since we have already considered how to get Mathmematica to calculate
the normalized chisq distribution, we can just use numerical
integration to compute the confidence level:
chisq[x_, n_] := (x^((n - 2)/2)*Exp[-(x/2)])/(Gamma[n/2]*2^(n/2))
confidence[chisquared_, n_] := NIntegrate[ chisq[y,n], {y,chisquared,Infinity}]
You may also use Mathematica's CDF (cumulative distribution function) or
Quantile function to calculate the confidence level.
Polyfit
-------
Polyfit is a routine that fits data with an nth order polynomial.
You are welcome to use it. You will find the code and a makefile
in ~sg/src/misc. Before class, I made up a code to produce data
close to a quadratic function. It is called almostquad.c and is
set to produce a curve
2.0 + x + 0.1 x*x + noise
The noise is gaussian with a standard deviation sigma = 0.8.
You may run almostquad (it is in ~sg/bin) and put the output
in a file. You may then use polyfit to fit the data.
You can learn quite a bit by doing this once, but I wanted to explore
the statistics of fitting many data sets that only differ by the
random noise. We can ask questions such as:
1) what is the distribution of chi-squared produced by the fit?
2) what is the distribution of confidence level?
3) what is the distribution of each of the fitted parameters and
how does it compare with the error estimated by doing the fit?
Verification of Error Estimates
-------------------------------
~sg/src/misc/almostquad.csh is set up to do this experiment 10000 times
Here is what the script looks like:
#!/bin/csh
# a script to run my almostquad code many times and use polyfit
unset noclobber
set run = 0
while ($run <10000)
almostquad > almostquad.dat
polyfit <$i.out
foreach? end
We made histograms of the three parameters in the model
and used my variance program to find the standard deviation for each
parameters. These values all agree very well with the error on the
parameters reported by polyfit.
You might like to extend this by looking at the covariance of different
parameters and comparing that with the correlations reported by
polyfit.
We also considered what happens when we fit the data with either a
linear or cubic polynomial. With a linear polynomial the chisquared
is huge and the confidence levels are miniscule. With a cubic fit,
the confidence levels are fine, but we find that the coefficient of
the cubic term is not significantly different from zero. We also found
the because of the extra parameter in the model the error estimate of the
other parameters increases.