December 4, 2007 --- Class 27 --- Fitting Data, Chi-squared Distribution,
Nonlinear Fitting, Polyfit
Activities:
Fitting Data
We looked at what one should do to fit data to a model function.
If the data has a Gaussian distribution about its true value, then
we maximize the probability for the set of observations by
adjusting the parameters of the model to minimize \chi^2.
\chi^2 = \sum_{i=1}^{N} (y_i - y_model(x_i))^2 / \sigma_i^2
The case of a linear model was considered in great detail. We
obtained specific formulas for the slope and intercept parameters of
the fit. However, even with a formula for the slope and intercept,
there are two important questions:
1) How large is the error in the fitted parameters?
2) How good is the fit?
Using a standard propagation of errors approach, we found the error
on the slope and intercept, and the covariance of those quantities.
We also considered the fact that the quantity \chi^2 can be used to
determine the goodness of fit. Since \chi^2 should be the sum of
the squares of N normalized Gaussian variables, it has a known
distribution. If we find that our fit gives a small value of
\chi^2, that is good. If we get a large value, maybe our model is
not the correct description of the data. If the model has p
parameters that we adjust, the fit is said to have N-p degrees of
freedom. Soon we will look at a code to compute
the probability that chisq is greater than the observed value.
This is called the confidence level of the fit. If your
confidence level gets below 5-10%, you begin to wonder if you
really have the right model for describing your data.
To get a feeling for what the chisq distribution looks like, we took
a code that prints out gaussian random numbers and created an awk
script to sum squares of N such numbers. The code for creating
gaussian random numbers is in ~sg/src/misc/gausrand.c. It uses
an accept-reject algorithm to generate numbers uniformly in the
unit circle and then rescales the radius of the points to create
a gaussian distribution. Like the Box-Muller method, it produces
two numbers at a time. It comes from Sec. 7.2 of Numerical Recipes.
testgaus.c is a simple driver program that calls gausrand 250000
times and prints the result to standard output.
The awk script chisq.a looks like this:
{sum += $1*$1; if (NR%n == 0) {print sum; sum=0.} }
testgaus |awk -f chisq.a n=6
will produce a stream of numbers according to the chisq distribution
with N=6. Of course, you can pipe this to hist or autocorr if you
want to see a histogram or verify that the autocorrelation is small.
Members of the class were assigned various values of the number of
degrees of freedom and created a histogram of the chi-sq distribution.
It is easy to do several cases with a foreach loop at the command line:
foreach i ( 2 4 6 8 10 12 20)
testgaus |awk -f chisq.a n=$i|hist -n 50 -g |axis lt $i|xplot
end
Notice the label in the axis command so we can tell which graph
is which.
Chi-Squared, only faster
If you want to know what a chi-squared distribution looks like you
may use my chisq program in directory ~sg/src/misc. (This file
was copied to the Apple cluster after class.)
The program chisq.c calls gausrand to produce numbers with a
chi-sq distribution. This runs more quickly than using the awk
script. Of course, writing the C program
takes more time than typing the awk script, so we have yet another
example of the trade-off between development time and run time.
If you just type chisq, you will get this message:
Usage: chisq dof num
where dof is the number of degrees of freedom
and num is the number of random variables generated
Now you know how to run the code. For example,
chisq 10 10000 |~sg/bin/hist -n 40 -g |axis |xplot
will produce a nice graph of the distribution for 10 degrees
of freedom.
Nonlinear fitting
We must next generalize the discussion to include two
types of non-linearities in the model. For the case of a model that
is a non-linear function of x, but still linear in its parameters,
it is still easy to find the minimum of \chi^2 because \chi^2 is
only a quadratic function of the parameters. Numerical Recipes has
a function lfit to deal with this case. You have to supply a
routine that computes the x dependence of the model function.
y_model(x) = \sum_{k=1}^{p} \lambda_k f_k(x)
Your routine calculates the f_k and the NR routine takes care of
varying lambda_k. Of course, you also have to supply the data with
(optionally) error bars.
We will deal with the case of the model function being non-linear
in its parameters later this week.
Polyfit
Polyfit is a routine that fits data with an nth order polynomial.
You are welcome to use it. You will find the code and a makefile
in ~sg/src/misc. I made up a data file with data close to a
quadratic and fit it. We briefly looked at the output.