#1 (permalink)
02-April-2008, 01:01 PM
Lukas
Member
Join Date: Oct 2006
Posts: 66
Statistics question: goodness of fit vs R2

Hi guys,

I am currently working with some exponential decay data.
I am fitting a linear regression through the log of the datapoints.
To judge the goodness of fit, it was suggested that I use R-square.
Some of my profiles are steep, others are rather flat. The R-square obviously tracks the slope (if I have a flat profile, the R-square is lower).

All I want to know though is how well the line represents the data. I.e. if I have no decay at all but all the points sit on the line I would expect a high "goodness of fit" value. Rsquare doesn't seem to do that.

How should I go from here?
Am I missing something?

thanks!
__________________
Life is unfair. But that's ok.. as long as you make sure it's unfair in your favour. -Me

You don't plan sincerity. You have to make it up on the spot. -Denny Crane

I never make predictions, and I never will. -some footballer
#2 (permalink)
02-April-2008, 02:06 PM
Ivan Viehoff
Senior Member
Join Date: Apr 2004
Location: Chalfont St. Giles, England
Posts: 490

What are you trying to do?

Are you (1) (a) certain beyond question that your data is generated by an exponential decay process, and merely concerned with the level of confidence you have in the parameters the regression gives you? And, if so, are you (b) sure that, after taking logs, the errors are iid (independent and identically distributed)? ...OR...

(2) are you actually uncertain whether the data is well modelled by the exponential decay process, or do you want to test the model specification?

If 1 (a) and (b) yes, then a text book will provide the methods for giving you confidence intervals on the parameters, remembering that they will apply to the model you estimated, ie, in log form. In fact you can probably read off what you need from a basic stats package, eg as provided with excel, if you know what you are looking for.

If 1 (a) yes but (b) no, then to use OLS (ordinary least squares, ie, bog-standard regression) you will have to estimate a different model respecified so that your errors are iid. Or find a different estimation method that matches your model. Text book may give you ideas what to do. I'm not saying this will be easy.

If 1(a) no, then 2. You are a bit more on your own now. I suggest you look at a textbook for tests of model specification. Myself, I would probably start, as a first cut, with a test for heteroscedasticity (which can also be spelled with a k). Heteroscedasticity is when the errors do not have constant variance. This can be the result either of the errors not being iid (ie, 1(b) = no), or else of model mis-specification, so that the measured errors are not the true errors.

btw, That's about as much as I am willing to say. I'm not going to do this for you, nor am I going to provide lessons in understanding the above.
#3 (permalink)
02-April-2008, 02:10 PM
Lukas
Member
Join Date: Oct 2006
Posts: 66

I have a better description of my problem and an idea in which direction I want to go.
Below I generated a fictional dataset:
There are two datasets (data1 and data2).
Both had a line fitted to them (slope and intercept shown below the table).
The deviation from the fitted line is shown as resid1 and resid2 (residuals). I generated this data so the residuals would be the same in both cases, but the slope is very different.
Code:
        data1   resid1  data2   resid2
0.5     3.03    -0.23   2.90    -0.23
1.5     3.51    0.23    3.10    0.23
2.5     3.11    -0.18   2.44    -0.18
3.5     3.43    0.12    2.49    0.12
4.5     3.60    0.28    2.38    0.28
5.5     3.21    -0.13   1.72    -0.13
6.5     3.46    0.11    1.70    0.11
7.5     3.16    -0.20   1.14    -0.20

Slope   0.01            -0.26
R2      0.0233          0.9031
Stdev           0.206           0.206
As you can see the R-square (R2) is very different between those two. But it isn't telling me that actually both of these data are equally well fit by their corresponding line.
I used the standard deviation of the residuals to describe this. But I am still kind of longing for some "official" parameter that will "talk" to me in the same way as R-square. I.e. "0.9999 veeery good fit, 1 perfect fit, 0.02 very poor fit".
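
For readers who want to reproduce the comparison, here is a minimal Python sketch that refits both fictional datasets from the table above and reports the slope, R-square and residual standard error. The function name `fit_line` is made up for this illustration, and the residual standard error divides by n - 2, so it will not exactly match a plain standard deviation of the residuals.

```python
import math

# Values copied from the table in this post.
x = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5]
data1 = [3.03, 3.51, 3.11, 3.43, 3.60, 3.21, 3.46, 3.16]
data2 = [2.90, 3.10, 2.44, 2.49, 2.38, 1.72, 1.70, 1.14]


def fit_line(xs, ys):
    """Ordinary least squares fit of ys = intercept + slope * xs."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((xi - x_mean) ** 2 for xi in xs)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(xs, ys))
    sst = sum((yi - y_mean) ** 2 for yi in ys)  # total sum of squares
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(xs, ys)]
    sse = sum(r ** 2 for r in residuals)  # residual sum of squares
    return {
        "slope": slope,
        "intercept": intercept,
        "r_squared": 1.0 - sse / sst,
        # Scatter about the line, in the units of the data (n - 2 dof).
        "resid_se": math.sqrt(sse / (n - 2)),
    }


for name, ys in (("data1", data1), ("data2", data2)):
    fit = fit_line(x, ys)
    print(name, {key: round(value, 4) for key, value in fit.items()})
```

Because R-square is 1 - SSE/SST, two fits with the same residual scatter (SSE) get different R-square values whenever the total spread of the data (SST) differs, and a steeper slope makes SST larger. The residual standard error does not have that dependence, but it is in the units of the data rather than on a 0 to 1 scale.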

Thanks
#4 (permalink)
02-April-2008, 02:55 PM
Ivan Viehoff
Senior Member
Join Date: Apr 2004
Location: Chalfont St. Giles, England
Posts: 490

You keep talking about "goodness of fit". But to investigate it, you have to be more precise. Goodness of fit is not well-defined. There are many different kinds of badness. R2 can be high but the model can be useless, it is easy to exhibit examples of this. In specific circumstances, R2 is a useful measure of something. Also we need to be sure you are using appropriate statistical methods for your model. Hence my Qs 1 and 2.

In the particular case of your data, assuming the answers to Q 1(a) and (b) are yes and yes, then what has happened is that the confidence interval around 0.01 is much larger (in absolute terms) than 0.01. The errors (although the same size as the second example) are large in comparison to this weak relationship - they swamp it. So on this data we cannot even be confident that the slope is positive rather than negative; indeed we are not even sure that there is a relation. That is what the low r2 is (in this case, assuming my assumptions are OK) telling you.

In the second data set, although the errors are the same size, the much stronger relationship (-0.26) means that the confidence interval around -0.26 is fairly small in comparison to 0.26 (absolute value). The relationship is strong enough we can pick it up, even with errors of this size in the data. So we are pretty sure that the slope is negative, there is a relationship, etc.
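
The argument in the two paragraphs above can be checked numerically. The following is a sketch only: it assumes the answers to Q 1(a) and (b) are yes (iid normal errors), uses the fictional data posted earlier in the thread, and hard-codes the two-sided 95% Student t critical value for 6 degrees of freedom. The function name `slope_with_ci` is invented for the illustration.

```python
import math

# Fictional data posted earlier in the thread.
x = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5]
data1 = [3.03, 3.51, 3.11, 3.43, 3.60, 3.21, 3.46, 3.16]
data2 = [2.90, 3.10, 2.44, 2.49, 2.38, 1.72, 1.70, 1.14]

# Two-sided 95% critical value of Student's t with n - 2 = 6 degrees of freedom.
T_CRIT_95_DF6 = 2.447


def slope_with_ci(xs, ys, t_crit):
    """Return (slope, lower, upper) for an OLS line, assuming iid normal errors."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((xi - x_mean) ** 2 for xi in xs)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(xs, ys))
    # Standard error of the slope: residual variance divided by the spread in x.
    se_slope = math.sqrt(sse / (n - 2) / sxx)
    return slope, slope - t_crit * se_slope, slope + t_crit * se_slope


for name, ys in (("data1", data1), ("data2", data2)):
    slope, lower, upper = slope_with_ci(x, ys, T_CRIT_95_DF6)
    print(f"{name}: slope {slope:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

Under these assumptions the interval for data1 straddles zero while the interval for data2 lies entirely below zero, even though the width of the two intervals is nearly the same.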

Loadsa different stat parameters exist that answer different questions in different circumstances. But you have to frame your question accurately and describe your circumstances in order to choose one. I'm not going to do that choosing for you. I have already said more than I said I was going to. I suspect you are in fact looking for the confidence interval. Head for textbook, wikipedia, mathworld, etc.
#5 (permalink)
02-April-2008, 03:03 PM
geonuc
Senior Member
Join Date: Dec 2007
Location: Atlanta
Posts: 1,774

You got an R-squared value of 0.0233 from the first set of data? That doesn't seem right. I put the data in Excel and got 0.993.
#6 (permalink)
02-April-2008, 03:04 PM
aurora
Senior Member
Join Date: Sep 2003
Posts: 2,693

Quote:
Originally Posted by Lukas View Post
As you can see the R-square (R2) is very different between those two. But it isn't telling me that actually both of these data are equally well fit by their corresponding line.
I used the standard deviation of the residuals to describe this. But I am still kind of longing for some "official" parameter that will "talk" to me in the same way as R-square. I.e. "0.9999 veeery good fit, 1 perfect fit, 0.02 very poor fit".

Thanks
Just look at the residuals, especially a graph of the residuals. that's the first step in seeing which model best fits the data. Check to see if the residuals are randomly distributed above and below zero, or if there is some sort of pattern that would indicate that there is systematic variation that is not being completely explained by the model.
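
A graph is the best first look, but the "above and below zero" check can also be roughed out numerically by counting runs of same-signed residuals. This is a sketch of a simple runs count, not a full test; the function name `sign_runs` is invented here, and the residuals are the ones listed in Lukas's table earlier in the thread.

```python
def sign_runs(residuals):
    """Count runs of same-signed residuals and the number expected by chance."""
    signs = [r > 0 for r in residuals if r != 0]  # zeros carry no sign
    n_pos = sum(signs)
    n_neg = len(signs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need residuals on both sides of zero")
    # A new run starts every time the sign flips.
    runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    # Expected number of runs if the signs were in random order.
    expected = 2.0 * n_pos * n_neg / (n_pos + n_neg) + 1.0
    return runs, expected


residuals = [-0.23, 0.23, -0.18, 0.12, 0.28, -0.13, 0.11, -0.20]
runs, expected = sign_runs(residuals)
print(f"observed runs: {runs}, expected if random: {expected:.1f}")
```

Far fewer runs than expected (long stretches on one side of the line) would hint at curvature or some other systematic pattern the straight line is not capturing.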
__________________
"I'm as accurate as any psychic. And I'm a cartoon!" -- Squidward

"Arrrgh, the laws of physics be a harsh mistress!" -- Bender
#7 (permalink)
02-April-2008, 03:11 PM
geonuc
Senior Member
Join Date: Dec 2007
Location: Atlanta
Posts: 1,774

Quote:
Originally Posted by geonuc View Post
You got an R-squared value of 0.0233 from the first set of data? That doesn't seem right. I put the data in Excel and got 0.993.
Oops. Never mind. I had the sign of the residuals reversed.
#8 (permalink)
02-April-2008, 03:39 PM
Lukas
Member
Join Date: Oct 2006
Posts: 66

Thanks for all the replies.
Ivan: I hadn't seen your reply yet when I posted the second time. Thanks for all your input. You gave me some very good pointers (e.g. iid). I'll need some time to look into that further though. To address your questions: I am fairly certain that a linear fit is the right way to go. In my actual data I expect exponential decay; when fitting the line I use the log of the values, which (if exponential decay is correct) gives a straight line.
Quote:
In the particular case of your data, assuming the answers to Q 1(a) and (b) are yes and yes, then what has happened is that the confidence interval around 0.01 is much larger (in absolute terms) than 0.01. The errors (although the same size as the second example) are large in comparison to this weak relationship - they swamp it. So on this data we cannot even be confident that the slope is positive rather than negative; indeed we are not even sure that there is a relation. That is what the low r2 is (in this case, assuming my assumptions are OK) telling you.
The above paragraph really helped me to understand the issue here much better. (Above is just fictional data, in my actual data I have R2s ranging from 0.89 to 0.99)

The problem is that in some cases the decay is relatively slow. A half life calculation done on a line with smaller slope is of course less accurate, but to show that my data is actually close to a straight line, I would like to have some value to present basically showing that the R2 is low not because I screwed up some samples but because the slope is small.

Quote:
Just look at the residuals, ...
Thanks. I'll have to go through my data again and see if I find anything of the sort you're pointing out.

geonuc, no worries. It made me double check, which is always a good idea.

It's bedtime for me. I'll check back tomorrow again. Thanks again for all the help.
#9 (permalink)
03-April-2008, 12:38 AM
KaiYeves
Senior Member
Join Date: Sep 2007
Location: Currently on assignment on planet shown in avatar photo
Posts: 7,766

Blitzak! I thought this thread would be about Star Wars! Y'know, R2 and all...
__________________
"If you think the LHC will create black holes, you might as well believe Hobbits are at the bottom of your garden."- Dr. Mike Inglis
Rovers forever! - ToSeek
"Carl Sagan sent a message to ET,
Neil Armstrong walked in the Sea of Tranquility
Steve Squyers built Spirit and Opportunity
Dan Haylen upchucked in zero gravity." -Brent Simon, The Space Camp Song
#10 (permalink)
03-April-2008, 01:46 AM
Lukas
Member
Join Date: Oct 2006
Posts: 66

It's more like Var Wars.
Unfortunately I'm not quite sure how to use the force.
#11 (permalink)
03-April-2008, 02:53 PM
HenrikOlsen
Moderator
Join Date: Sep 2003
Location: Denmark 55.6773° N 12.3610° E
Posts: 5,259

Quote:
Originally Posted by Lukas View Post
Thanks for all the replies.
Ivan: I hadn't seen your reply yet when I posted the second time. Thanks for all your input. You gave me some very good pointers (e.g. iid). I'll need some time to look into that further though. To address your questions: I am fairly certain that a linear fit is the right way to go. In my actual data I expect exponential decay; when fitting the line I use the log of the values, which (if exponential decay is correct) gives a straight line.
A problem with this approach is that when taking the log of the values, you're also taking the log of the confidence interval of those values, so the fitting algorithm needs to weight closeness differently depending on where in the dataset you are.

See this page for more on the problem and possible other solutions.
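
One way the reweighting could look in practice is weighted least squares on the logged values. The sketch below rests on an assumption the thread does not confirm: that the measurement errors are additive with roughly constant size on the raw scale, in which case the variance of ln(y) grows like 1/y^2 and weights proportional to y^2 compensate. The data are synthetic (true rate 0.5, alternating errors of +1 and -1), and `weighted_line_fit` is an invented helper.

```python
import math


def weighted_line_fit(xs, zs, ws):
    """Weighted least squares fit of zs = intercept + slope * xs."""
    w_sum = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / w_sum
    z_bar = sum(w * z for w, z in zip(ws, zs)) / w_sum
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    sxz = sum(w * (x - x_bar) * (z - z_bar) for w, x, z in zip(ws, xs, zs))
    slope = sxz / sxx
    return slope, z_bar - slope * x_bar


# Synthetic illustration only: none of these numbers come from the thread.
TRUE_RATE = 0.5
times = list(range(9))
noise = [1.0, -1.0] * 4 + [1.0]  # constant-size additive errors
observed = [100.0 * math.exp(-TRUE_RATE * t) + e for t, e in zip(times, noise)]
log_obs = [math.log(y) for y in observed]

plain_slope, _ = weighted_line_fit(times, log_obs, [1.0] * len(times))
# Weights ~ 1 / Var(ln y) ~ y^2 under the additive-error assumption.
weighted_slope, _ = weighted_line_fit(times, log_obs, [y ** 2 for y in observed])

print(f"true slope      {-TRUE_RATE:.4f}")
print(f"unweighted fit  {plain_slope:.4f}")
print(f"weighted fit    {weighted_slope:.4f}")
```

If the errors are instead proportional to the signal (so that they are roughly constant after taking logs), the unweighted fit on the logs is the appropriate one and this reweighting would do harm.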
__________________
And the "driving on the freeway on a scooter" analogy still holds true because the pilots are sitting in 7 to 30 ton aircraft o' doom and you are running around them in your very own Meatbody, Mark I. Beep, beep.
Big Don
Trying to make sense of computers, The Error Log.
#12 (permalink)
04-April-2008, 02:46 PM
Ivan Viehoff
Senior Member
Join Date: Apr 2004
Location: Chalfont St. Giles, England
Posts: 490

I don't think you used the word "confidence interval" in that in a helpful way, but never mind.

But good on you for finding an explicit case of what might be one problem. If your errors are initially iid and you take logs then the errors aren't iid any more. So "take logs and use OLS" can produce a very misleading, possibly nonsense, result, in that case.

But in certain kinds of data it can be more plausible that the errors are iid after you take logs. So in that case, "take logs and use OLS" is fine, indeed best. But you need to know which. It is a question of understanding your model, including the error process.

But I think there are other issues too that can result in this being totally the wrong way to do it.

What concerns me about modelling an exponential decay process is not that there are measurement errors (which there probably are), but that the decay process itself is random. Suppose we have a radioactive source, a small quantity of fast decaying stuff. If by chance rather more than expected amount decays in one period, as it can, then the decay process in effect restarts with a rather smaller quantity of material that is left than was expected. So if our data is measured as "the amount left" or "the amount seen to decay", it is nearly always going to be below the initially expected trend after this point. We need to use special methods to model this kind of process.
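
A toy simulation can make this concrete. The sketch below assumes the simplest possible random decay process, in which every remaining unit decays independently with a fixed probability per time step; the starting count, decay probability and seed are arbitrary choices for illustration, and `simulate_decay` is an invented name.

```python
import random


def simulate_decay(n_start, p_decay, n_steps, rng):
    """Each remaining unit decays independently with probability p_decay per step."""
    remaining = [n_start]
    for _ in range(n_steps):
        survivors = sum(1 for _ in range(remaining[-1]) if rng.random() >= p_decay)
        remaining.append(survivors)
    return remaining


# Arbitrary illustrative settings, not taken from the thread.
N_START, P_DECAY, N_STEPS = 200, 0.3, 10
rng = random.Random(1)
path = simulate_decay(N_START, P_DECAY, N_STEPS, rng)

print("step  remaining  expected_from_start  expected_from_previous")
for step in range(1, len(path)):
    from_start = N_START * (1 - P_DECAY) ** step
    from_previous = path[step - 1] * (1 - P_DECAY)
    print(f"{step:4d}  {path[step]:9d}  {from_start:19.1f}  {from_previous:22.1f}")
```

The last column is the point: the process "restarts" from whatever is actually left, so a chance deviation from the original curve is carried forward rather than averaged away, and successive deviations from the curve are not independent.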

This isn't just theoretic stuff. This kind of issue occurs all the time. I've had a real life example in business modelling where the failure to recognise that errors were not iid produced a completely different estimate of what were typical cost overruns for projects of different sizes, once the data had been transformed to give a reasonable assumption of iid errors.
#13 (permalink)
04-April-2008, 03:02 PM
Disinfo Agent
Senior Member
Join Date: Apr 2004
Posts: 6,534

Quote:
Originally Posted by Ivan Viehoff View Post
If your errors are initially iid and you take logs then the errors aren't iid any more.
I don't think that's right.

The problem is not with the independence of the errors, but rather with their distribution. The confidence intervals are based on the assumption that the errors are normally distributed. But if the raw errors are normal, then their logs are no longer normal.

And vice-versa: if the logs are normal, then the "best" confidence interval for the log-errors does not correspond to the best confidence interval for the raw errors, in general.
__________________
"All your bias are belong to us." Ara Pacis
"A witty saying proves nothing." Voltaire
#14 (permalink)
04-April-2008, 03:49 PM
Ivan Viehoff
Senior Member
Join Date: Apr 2004
Location: Chalfont St. Giles, England
Posts: 490

The second i in iid is "identically". You are correct, the problem is with the identically bit, not the independently bit. I had no intention of implying otherwise.

You are right that changing normal to log normal is a problem. But usually a bigger problem is that after you take logs the variances become quite different from one data point to the next.
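
A first-order (delta-method) sketch of why the variances differ, under the assumption that the raw measurements carry additive errors of constant variance (the thread does not say what the real error process is). Here \mu_i is the true value at point i, \sigma^2 the raw error variance, and A and k the amplitude and rate of an assumed decay curve:

```latex
y_i = \mu_i + \varepsilon_i, \qquad \operatorname{Var}(\varepsilon_i) = \sigma^2

\ln y_i = \ln \mu_i + \ln\!\left(1 + \frac{\varepsilon_i}{\mu_i}\right)
        \approx \ln \mu_i + \frac{\varepsilon_i}{\mu_i}

\operatorname{Var}(\ln y_i) \approx \frac{\sigma^2}{\mu_i^2}
  = \frac{\sigma^2}{A^2}\, e^{2 k t_i}
  \quad \text{for } \mu_i = A e^{-k t_i}
```

So under that assumption the variance on the log scale grows exponentially along the decay curve, and the late, small values are by far the noisiest points in the logged data.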
#15 (permalink)
08-April-2008, 04:34 AM
Lukas
Member
Join Date: Oct 2006
Posts: 66

Thanks for all the help, sorry for taking so long with my reply.
Ivan: It isn't actually radioactive decay I'm working with. Would the randomness still be a problem if all I want to do is calculate the half life?

I went through my data again and recalculated some of the statistical values based on some of the input here. The problem I have now might be a bit easier to explain (hopefully).

- I have exponential decay data, from experiments under two conditions (5 repetitions for each condition).
- For each experiment I calculate the half life based on the fitted curve.
- With a t-test I can check whether the half lives are different under the different conditions.
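
The half-life step in the list above can be sketched as follows. It assumes the line was fitted to the natural log of the values, so that half-life = ln(2) / |slope|; the slope of -0.26 is borrowed from the fictional data2 earlier in the thread, while the standard error of 0.034 and the function name `half_life_with_ci` are illustrative assumptions.

```python
import math


def half_life_with_ci(slope, se_slope, t_crit):
    """Half-life and its interval from the slope of ln(value) versus time.

    Assumes natural logs and a decay (negative slope). The interval comes
    from transforming the two ends of the slope interval.
    """
    steep = slope - t_crit * se_slope    # fastest plausible decay
    shallow = slope + t_crit * se_slope  # slowest plausible decay
    if shallow >= 0:
        raise ValueError("slope interval includes zero: half-life is unbounded")
    half_life = math.log(2) / -slope
    return half_life, math.log(2) / -steep, math.log(2) / -shallow


# Illustrative inputs: slope from the fictional data2, assumed standard error,
# and the two-sided 95% t critical value for 6 degrees of freedom.
hl, hl_low, hl_high = half_life_with_ci(slope=-0.26, se_slope=0.034, t_crit=2.447)
print(f"half-life {hl:.2f}, 95% interval [{hl_low:.2f}, {hl_high:.2f}]")
```

Note that the interval is not symmetric around the half-life, and that it blows up as the slope interval approaches zero. That is one reason it may be easier to compare the two conditions on the slope (rate) scale rather than on the half-life scale.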

The problem I am thinking about now is, that the t-test is using the absolute values and is not taking into account the confidence interval for each individual value.
Is there a commonly used method for this?

It's not like my data look all that random, my half life calculations have R-squares of at least .86 (AVG .97).
If I directly use the half lives in the t-test I get a significant difference between the two conditions.
I was thinking of taking the lowest probable half lives (calculated from the edge of the 95% confidence interval) for the slower condition and t-testing those against the highest probable half lives for the faster condition.

But there are a whole bunch of issues with that method.
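
For reference, a minimal sketch of the plain two-sample comparison, using Welch's unequal-variance t statistic on the per-experiment half lives. It does not weight each half life by its own confidence interval; the scatter between the five replicates already includes each fit's uncertainty along with real experiment-to-experiment variation, so this is a starting point rather than an answer to the weighting question. The half-life values are invented for illustration and `welch_t` is a made-up name.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical half lives, five replicates per condition (not real data).
fast_condition = [2.1, 2.4, 2.2, 2.6, 2.3]
slow_condition = [3.0, 3.4, 2.9, 3.3, 3.1]

t_stat, dof = welch_t(fast_condition, slow_condition)
print(f"t = {t_stat:.2f} with about {dof:.1f} degrees of freedom")
```

The resulting t and degrees of freedom would then be looked up in a t table or passed to a statistics package to get a p-value.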

Thanks