December 2, 2004 --- Class 27 --- Fitting Data, Chi-squared Distribution Activities: Fitting Data We looked at what one should do to fit data to a model function. If the data has a Gaussian distribution about its true value, then we maximize the probability for the set of observations by adjusting the parameters of the model to minimize \chi^2. \chi^2= \Sum_{i=0}^N} (y_i- y_model(x_i)^2/ \sigma_i^2 The case of a linear model was considered in great detail. We obtained specific formula for the slope and intercept parameters of the fit. We also found the error on the slope and intercept, and the covariance of those quantities. We also considered the fact that the quantity \chi^2 can be used to determine the goodness of fit. Since \chi^2 should be the sum of the squares of N normalized Gaussian variables, it has a known distribution. If we find that our fit gives a small value of \chi^2, that is good. If we get a large value, maybe our model is not the correct description of the data. If the model has p parameters that we adjust, the fit is said to have N-p degrees of freedom. Soon we will look at a code to compute the probabily that chisq is greater than the observed value. This is called the confidence level of the fit. If your confidence level gets below 5-10%, you begin to wonder if you really have the right model for describing your data. We must next generalize the discussion to include two types of non-linearities in the model. For the case of a model that is a non-linear function of x, but still linear in its parameters, it is still easy to find the minimum of \chi^2 because \chi^2 is only a quadratic function of the parameters. Numerical Recipes has a function lfit to deal with this case. You have to supply a routine that computes the x dependence of the model function. y_model (x) = \Sum_{k=1}^N =lambda_k f_k(x) Your routine calculates the f_k and the NR routine takes care of varying lambda_k. Of course, you also have to supply the data with (optionally) error bars. We will deal with the case of the model function being non-linear in its parameters next week. To get a feeling for what the chisq distribution looks like we took a code that prints out gaussian random numbers and created an awk script to sum squares of N such numbers. The code for creating gaussian random numbers is in ~sg/src/gausrand.c. gausrand.c uses an accept-reject algorithm to generate numbers uniformly in the unit circle and then rescales the radius of the points to create a gaussian distribution. Like the Box-Muller method, it produces two numbers at a time. It comes from Sec. 7.2 of Numerical Recipes. testgaus.c is a simple driver programs that calls gausrand 250000 times and prints the result to standard output. The awk script chisq.a looks like this: {sum += $1*$1; if (NR%n == 0) {print sum; sum=0.} } testgaus |awk -f chisq.a n=6 will produce a stream of numbers according to the chisq distribution with N=6. Of course, you can pipe this to hist or autocorr if you want to see a histogram or verify that the autocorrelation is small. Chi-Squared If you want to know what a chi-squared distribution looks like you may use my chisq program in directory ~sg/gausrand. The program chisq.c calls gausrand to produce numbers with a chi-sq distribution. This runs more quickly than using the awk script. Of course, writing the C program takes more time than typing the awk script. So we have yet another example of trying to compare development time and run time. If you just type chisq, you will get this message: Usage: chisq dof num where dof is the number of degrees of freedom and num is the number of random variables generated Now you know how to run the code. For example, try: chisq 10 10000 |~sg/bin/hist -n 40 -g |axis |xplot will produce a nice graph of the distribution for 10 degrees of freedom.