April 8, 2013 --- Class 25 --- Fitting Data, Chi-squared Distribution
Activities:
Fitting Data
------------
We looked at what one should do to fit data to a model function.
If the data has a Gaussian distribution about its true value, then
we maximize the probability for the set of observations by
adjusting the parameters of the model to minimize \chi^2.
\chi^2= \Sum_{i=0}^N} (y_i- y_model(x_i)^2/ \sigma_i^2
Linear Model Function
---------------------
The case of a linear model was considered in great detail. We
obtained specific formulae for the slope and intercept parameters of
the fit. However, even with a formula for the slope and intercept,
there are two important questions:
1) How large is the error in the fitted parameters?
2) How good is the fit?
Using a standard propagation of errors approach, we found the error
on the slope and intercept, and the covariance of those quantities.
We also considered the fact that the quantity \chi^2 can be used to
determine the goodness of fit. Since \chi^2 should be the sum of
the squares of N normalized Gaussian variables, it has a known
distribution. If we find that our fit gives a small value of
\chi^2, that is good. If we get a large value, maybe our model is
not the correct description of the data. If the model has p
parameters that we adjust, the fit is said to have N-p degrees of
freedom. Soon we will look at a code to compute
the probabily that chisq is greater than the observed value.
This is called the confidence level of the fit. If your
confidence level gets below 5-10%, you begin to wonder if you
really have the right model for describing your data.
To get a feeling for what the chisq distribution looks like we took
a code that prints out gaussian random numbers and created an awk
script to sum squares of N such numbers. The code for creating
gaussian random numbers is in ~sg/src/misc/gausrand.c. gausrand.c uses
an accept-reject algorithm to generate numbers uniformly in the
unit circle and then rescales the radius of the points to create
a gaussian distribution. Like the Box-Muller method, it produces
two numbers at a time. It comes from Sec. 7.2 of Numerical Recipes.
testgaus.c is a simple driver programs that calls gausrand 250000
times and prints the result to standard output.
The awk script chisq.a looks like this:
{sum += $1*$1; if (NR%n == 0) {print sum; sum=0.} }
testgaus |awk -f chisq.a n=6
will produce a stream of numbers according to the chisq distribution
with N=6. Of course, you can pipe this to hist or autocorr if you
want to see a histogram or verify that the autocorrelation is small.
Members of the class were assigned various values of the number of
degrees of freedom and created a histogram of the chi-sq distribution.
Interactively, I wrote a foreach loop to look at values of n from 2 to
50 and display a graph on each case.
Gaussian Integral
-----------------
At the end of class we did a simple Gaussian integral as preparation
for analytic discussion of chi-squared distribution.