Reddit reviews Data Analysis Using Regression and Multilevel/Hierarchical Models

We found 13 Reddit comments about Data Analysis Using Regression and Multilevel/Hierarchical Models. Here are the top ones, ranked by their Reddit score.

Data Analysis Using Regression and Multilevel/Hierarchical Models (Cambridge University Press)

13 Reddit comments about Data Analysis Using Regression and Multilevel/Hierarchical Models:

u/Croc600 · 12 points · r/sociology

R for Data Science is great, especially because it teaches tidyverse.

Another good book is Learning Statistics with R: A tutorial for psychology students and other beginners, which also teaches the implementation of basic statistical techniques, like ANOVA or linear regression.

If you have some spare time, you can follow it with Data Analysis Using Regression and Multilevel/Hierarchical Models, which is also (mostly) based on R.

The Visual Display of Quantitative Information is a good book on the principles of data visualization. It’s theoretical, so no R examples.

Complex Surveys: A Guide to Analysis Using R is great if you work with survey data, especially if you work with complex designs (which nowadays is pretty much all the time).

Personally, I would also invest some time in learning methodology. Sadly, I can't help you here, because I didn't use a textbook for this, but people seem to like the books by Earl Babbie.

u/tiii · 8 points · r/econometrics

Neither time series nor regression is a strictly econometric method per se, and a range of wonderful statistics textbooks detail them. If you're looking for methods more closely aligned with econometrics (e.g. difference-in-differences, instrumental variables), then the recommendation for Angrist and Pischke's 'Mostly Harmless Econometrics' is a good one. Another oft-prescribed econometric text that goes beyond Angrist is Wooldridge's 'Introductory Econometrics: A Modern Approach'.

For a well-considered, basic approach to statistics up through regression, including an excellent treatment of probability theory and the basic assumptions of statistical methodology, the 'Discovering Statistics Using...' books (SPSS/SAS/R) by Andy Field and co-authors are excellent.

Two excellent all-rounders are Cohen and Cohen's 'Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences' and Gelman and Hill's 'Data Analysis Using Regression and Multilevel/Hierarchical Models', although I would suggest both are more advanced than you need right now.

For time series, I can recommend Rob Hyndman's books on forecasting (an online copy is freely available).

For longitudinal data analysis I really like Judith Singer's book 'Applied Longitudinal Data Analysis'.

It sounds, however, as if you're looking for a book that explains why you would want to use one method over another. In my experience, I wanted to know this when I was just starting. It really comes down to your own research questions and the available data. For example, I had to learn longitudinal/fixed/random effects modelling because I had to do a project with a longitudinal survey. Only after I put it into practice (and completed my stats training) did I come to understand why the modelling I used was appropriate.

u/klaxion · 5 points · r/statistics

Recommendation - don't learn statistics through "statistics for biology/ecology".

Go straight to statistics texts; the applied ones aren't that hard, and they usually have fewer of the lost-in-translation errors (e.g. the abuse of p-values throughout biology).

Try Gelman and Hill -

http://www.amazon.com/Analysis-Regression-Multilevel-Hierarchical-Models/dp/052168689X

Faraway - Practical Regression and Anova using R (free)

http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf

Agresti - Categorical Data Analysis

http://www.amazon.com/Categorical-Data-Analysis-Alan-Agresti/dp/0470463635

u/shujaa-g · 5 points · r/statistics

You're pretty good when it comes to linear vs. generalized linear models--and the comparison is the same regardless of whether you use mixed models or not. I don't agree at all with your "Part 3".

My favorite reference on the subject is Gelman & Hill. That book prefers the terminology of "pooling", and considers models that have "no pooling", "complete pooling", or "partial pooling".

One of the introductory datasets is on radon levels in houses in Minnesota. The response is the (log) radon level, the main explanatory variable is the floor of the house on which the measurement was made (0 for basement, 1 for first floor), and there's also a grouping variable for the county.

Radon comes out of the ground, so, in general, we expect (and see in the data) basement measurements to have higher radon levels than ground-floor measurements, and, given varied soil conditions, different overall levels in different counties.

We could fit two fixed-effect linear models. Using R formula pseudocode, they are:

  1. radon ~ floor
  2. radon ~ floor + county (county as a fixed effect)

The first is the "complete pooling" model. Everything is grouped together into one big pool. You estimate two coefficients. The intercept is the mean value for all the basement measurements, and your "slope", the floor coefficient, is the difference between the ground-floor mean and the basement mean. This model completely ignores the differences between the counties.

The second is the "no pooling" model, where each county is in its own little pool by itself. If there are k counties, you estimate k + 1 coefficients: one intercept (the mean value in your reference county), one "slope", and k - 1 county adjustments, which are the differences between the mean basement measurements in each county and the reference county.

Neither of these models is great. The complete pooling model ignores any information conveyed by the county variable, which is wasteful. A big problem with the second model is that there's a lot of variation in how sure we are about individual counties. Some counties have a lot of measurements, and we feel pretty good about their levels, but some counties have only 2 or 3 data points (or even just 1). What we're doing in the "no pooling" model is taking the average of however many measurements there are in each county, even if there are only 2, and declaring that to be the radon level for that county. Maybe Lincoln County has only two measurements, and they both happen to be pretty high, say 1.5 to 2 standard deviations above the grand mean. Do you really think that this is good evidence that Lincoln County has exceptionally high radon levels? Your model does: its fitted line goes straight between the two Lincoln County points, 1.75 standard deviations above the grand mean. But maybe you're thinking "that could just be a fluke. Flipping a coin twice and seeing two heads doesn't mean the coin isn't fair, and having only two measurements from Lincoln County that are both on the high side doesn't mean radon levels there are twice the state average."

    Enter "partial pooling", aka mixed effects. We fit the model radon ~ floor + (1 | county). This means we'll keep the overall fixed effect for the floor difference, but we'll allow the intercept to vary with county as a random effect. We assume that the intercepts are normally distributed, with each county being a draw from that normal distribution. If a county is above the statewide mean and it has lots of data points, we're pretty confident that the county's Radon level is actually high, but if it's high and has only two data points, they won't have the weight to pull up the measurement. In this way, the random effects model is a lot like a Bayesian model, where our prior is the statewide distribution, and our data is each county.

The only parameters actually estimated are the floor coefficient and the mean and SD of the county-level intercept distribution. Thus, unlike the complete pooling model, the partial pooling model takes the county information into account, but it is far more parsimonious than the no pooling model. If we really care about the effects of each individual county, this may not be the best model to use. But if we care about general county-level variation, and we just want to control reasonably well for county effects, then this is a great model!

Of course, random effects can be extended to more than just intercepts. We could fit models where the floor coefficient varies by county, etc.

Hope this helps! I strongly recommend checking out Gelman and Hill.
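
For concreteness, here is a minimal R sketch of the three models described above, using the lme4 package. The data frame and column names (radon, log_radon, floor, county) are assumptions for illustration, not necessarily the book's exact dataset.

```r
library(lme4)

# Assumed data frame `radon`: one row per measurement, with columns
# log_radon (response), floor (0 = basement, 1 = first floor),
# and county (grouping factor). Names are illustrative.

# Complete pooling: one big pool; two coefficients.
m_complete <- lm(log_radon ~ floor, data = radon)

# No pooling: a separate fixed intercept per county; k + 1 coefficients.
m_nopool <- lm(log_radon ~ floor + county, data = radon)

# Partial pooling: county intercepts modeled as draws from a common
# normal distribution whose mean and SD are estimated from the data.
m_partial <- lmer(log_radon ~ floor + (1 | county), data = radon)

# The partially pooled county intercepts are shrunk toward the statewide
# mean, most strongly for counties with few measurements.
head(coef(m_partial)$county)
```
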
u/gmarceau · 4 points · r/science

u/geneusutwerk · 2 points · r/sociology

So I am a political scientist (though my research crosses into sociology).

What I would recommend is starting by learning generalized linear models (GLMs). Logistic regression is one type, but GLMs are just a way of approaching a bunch of other types of dependent variables.

Gelman and Hill's book is probably the best single textbook that covers it all. I think it provides examples in R, so you could also work on picking up R. It covers GLMs and multilevel models, which are also relatively common in sociology.
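
As a rough sketch of how those two topics combine in R (a multilevel logistic regression fit with lme4), with made-up variable names:

```r
library(lme4)

# Multilevel logistic regression: binary outcome, fixed effects for
# individual-level predictors, and a random intercept per group.
# All variable and data frame names are hypothetical.
m <- glmer(voted ~ age + education + (1 | region),
           data = survey_df, family = binomial)
summary(m)
```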

u/cokechan · 2 points · r/rstats

https://www.amazon.com/Analysis-Regression-Multilevel-Hierarchical-Models/dp/052168689X is the definitive text on the subject. I highly recommend this book to understand the fundamentals of multilevel modeling.

u/Here4TheCatPics · 2 points · r/statistics

I've used a book by Gelman for self-study. Great author, very good at using meaningful graphics -- which may be an effective way to convey ideas to students.

u/mshron · 2 points · r/AskStatistics

It sounds like you want some kind of regression, especially to answer 2. In a GLM, you are not claiming that the data by itself has a Normal/Poisson/Negative Binomial/Binomial distribution, only that it has such a distribution when conditioned on a number of factors.

In a nutshell: you model the mean of the distribution (through a link function) as a linear combination of the inputs. Then you can read the weighting factors on each input to learn about the relationship.

In other words, your data doesn't need to be marginally Poisson or NB for a Poisson or NB regression to be appropriate; what matters is that the outcome, conditional on the predictors, follows such a distribution around the mean given by the mean function. In fact, there may be some simple transformations (like taking the log of the outcome) that let you use a standard linear model, where you can reasonably assume the error is Normal, even if the raw outcome is anything but.
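
A minimal sketch of those options in R, with hypothetical column and data frame names (glm.nb comes from the MASS package):

```r
library(MASS)  # provides glm.nb() for negative binomial regression

# Count outcome modeled directly via a GLM:
m_pois <- glm(events ~ exposure + group, data = dat, family = poisson)
m_nb   <- glm.nb(events ~ exposure + group, data = dat)

# Or transform the outcome so a standard linear model with roughly
# Normal errors becomes plausible:
m_log <- lm(log(events + 1) ~ exposure + group, data = dat)
```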

If your variance is not dependent on any of your inputs, that's a good sign, since heteroskedasticity is a great annoyance when trying to do regressions.

If you have time, the modern classic in this area is http://www.amazon.com/Analysis-Regression-Multilevel-Hierarchical-Models/dp/052168689X. It starts with a pretty gentle introduction to regression and works its way up to the cutting edge by the end.

u/PM_ME_YOUR_WOMBATS · 1 point · r/statistics

Somewhat facetiously, I'd say the probability that an individual who has voted in X of the last 12 elections will vote in the next election is (X+1)/14 (Laplace's rule of succession). That would be my guess if I had no other information.

As the proverb goes, it's difficult to make predictions, especially about the future. We don't have any votes from the next election, so we can't directly estimate what relationship those votes have to the data at hand. Of course, that isn't going to stop people who need to make decisions. I'm not well-versed in predictive modeling (being more acquainted with the "make inferences about the population from the sample" sort of statistics), but I wonder what would happen if you did a logistic regression with the most recent election results as the response and all the other information you have as predictors. See how well you can predict the recent past using the further past, and suppose those patterns will carry forward into the future. Perhaps someone else can propose a more sophisticated solution.
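
A minimal sketch of that suggestion in R, assuming a voters data frame with hypothetical columns:

```r
# All column names are hypothetical. Fit on the most recent election,
# using earlier turnout history and other attributes as predictors.
m <- glm(voted_last ~ prior_turnout_rate + age + years_registered,
         data = voters, family = binomial)

# Check out-of-sample accuracy (e.g. on a held-out subset) before
# trusting the fitted probabilities as forecasts for the next election.
voters$p_vote_next <- predict(m, type = "response")
```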

I'm not sure how this data was collected, but keep in mind that a list of people who have voted before is not a complete list of people who might vote now, since there are some first-time voters in every election.

If you want to get serious about data modeling in social science, you might check out this book by statistician/political scientist Andrew Gelman.

u/mark_bellhorn · 1 point · r/statistics