Top products from r/statistics

We found 119 product mentions on r/statistics. We ranked the 465 resulting products by number of redditors who mentioned them. Here are the top 20.

Top comments that mention products on r/statistics:

u/flight_club · 5 points · r/statistics

This is a huge brain dump but hopefully some of it is useful. Mostly personal opinion so take it with a grain of salt.

Statistical Culture

Go and read a copy of The Lady Tasting Tea. Now.

The typical Stats101 course is like "Wheee!!! A seemingly arbitrary collection of formulas to cookbook your way through!!" Do not be discouraged. Although there is no winner, there are a series of 'philosophies' of statistics which each present a cogent, unified perspective on how to proceed (Fisherian, Neyman-Pearson, Bayesian). The messiness comes from trying to give the engineers a cookbook of results to follow. [Resources for (too?) advanced extra credit: I haven't yet found a good intro to this but maybe look at the personal appendix in "Principles of Statistical Inference" by D Cox. I've been reading this paper recently.]

Mathematical Background

The core mathematical background is probably: Linear Algebra, Multivariable Calculus, Real Analysis. Eventually measure theory too when you get to probability for grown ups.

Applied Stats

The best introduction to applied statistics I have discovered is the Statistical Sleuth.

The most useful activity I can recommend is to do little projects where you get some real, raw data, analyse it and then write up a report. So many issues crop up which you can't really understand from your textbooks. What data is available to me? Is this the data I really want? (Eg, observational vs randomised experiment.) What am I going to do about errors in it? (Eg, missing entries, outliers,...) How do I get it into my statistical analysis program? Do I actually have enough? Are my models at all good? It is a huge lie to say this but in a sense, once you have enough of the right data and it's all scrubbed up and nicely formatted, all the rest is easy.

The standard tools of the trade for applied work are the statistical analysis program/language R and the typesetting program LaTeX. This is the standard text for data analysis using R (R is a free implementation of the S language, which was developed at Bell Labs). There are online tutorials for LaTeX.

Learn to program. The language is less important than learning how programming works. The generic programming advice is to learn Python or Scheme. The former is probably more useful practically, the latter will give you more street cred with computer scientists. Within mathematics people tend to use R/Matlab/Mathematica, or Fortran/C++ if they are doing heavy-duty simulation. I'd recommend just putzing around with Python and R, and learning anything else as needed. The big BUT is if you want to go into finance, then you'll definitely want to dabble with C++. There are probably people who can give better advice but most seem to recommend Accelerated C++ and Effective C++.

Studying Mathematics

Mathematics develops linearly: each step up builds on the preceding material. Even after you have finished a course, go back a couple of times over the next few years to refresh the material.

When studying mathematics I like to work top down and then bottom up. That is, start with a broad understanding of what you are doing and then go fill in the details within that framework.

For getting the high level view I like making mind maps or dependency trees. This isn't legible but hopefully it gives you the idea. I've summarised 24 pages of notes so that I can see the main branches covered, definitions of the basic objects and can quickly find the four super important theorems (Written as 'Theorem 1, p14: Rough idea of what it says'). With a bit more time I'd go through with another colour and draw dependency arrows to show which theorems/lemmas are used to prove which other ones. Having this big map somehow compresses the 'intellectual content' of the course down: making it easier to see interrelationships and not panic.

As an aside. Pure mathematics can be broken into four parts: definitions, important/key theorems, lemmas/propositions needed to prove important/key theorems, applicationy examples/results proved using the important/key theorems.

Then to actually learn the material we fill in details:

0. Find somewhere quiet with no friends/technology. Take your notes, some paper and a pen, and for the sake of cliche a cup of coffee.

  1. Pick one of the early key theorems to work on.

  2. Do you know from memory the definitions of the terms used in the statement of the key theorem? If not, look them up and play around with the definition and some examples until you have it memorised. Eg: an integer n is even if there exists an integer m such that n = 2m. The number 6 is even because 6 = 2(3). However, the number 7 is not even: 2(3) = 6 < 7 < 8 = 2(4).

  3. What is the theorem 'saying'? Get a simple, concrete example down on paper. Eg, for Theorem: if n and m are even integers then so is n+m, take 4 + 6 = 10.

  4. Spend a bit of time trying to prove the theorem is true. If you get stuck sometimes it can help to try drawing a picture, considering a special case or constructing a counter example (figuring out why you can't get your counter example to work can help you see why the theorem must be true). If you succeed, great.

  5. If you didn't come up with a proof, start working through the proof line by line. It can be helpful to keep your concrete simple example in mind as you do this. Inevitably there will be gaps which you should try to fill in ('He says that this implies that but it isn't obvious how. Can I prove that it works?') If the proof involves an earlier lemma, you have two choices: go back and repeat this process on that lemma, or push on regarding the lemma as a 'black box'. My advice is to do the latter but take a bit of time to think about what it is exactly that your lemma is accomplishing within the proof. Something like: "To show f(x) has property Y given condition A, we first need to know that f(x) has property Z and our proof uses that to show property Y. We need Lemma 1 to show that condition A implies f(x) has property Z." One of the problems with mathematics lectures is that they present the material in a logical way and so sometimes lemmas crop up unmotivated because 'we use this later'. By doing this process you put the motivation front and center. If you get totally stuck, make a note and talk to your teacher.

  6. You now in principle understand how to prove the result, perhaps conditionally on assuming some lemmas. Spend some time really making the proof your own. Knowing the result, can you see an easier way to prove it? Can you put the steps in a more sensible order? Can you fill in the gaps in the proof? Can you add in some helpful comments which explain what you are doing? Can you cut anything out? At some point in the future (a few hours+ later) you should sit down and state the theorem from memory and then try to construct either the whole proof or a sketch of the proof, also from memory. The 100% absolute best way to cement something into your mind is to teach it. If you can find a classmate to work with, great, but often I will just find an empty room and lecture the material to the wall. There have been sooo many times when I've said something like "And this is true because... I'm not sure." The most efficient way to study is to test yourself, find out what you don't know, and then focus on filling those gaps. Explaining seems to cause me to think of questions I wouldn't have thought of if I was learning.

  7. Having figured out your tool you now want to use it to make sure you understand it. Solve examples from your problem sheets which use the theorem. Try to use it to prove corollaries.

  8. Now go and back fill any skipped lemmas which were used in the build up to the proof of the theorem.


As Mark notes, it's worthwhile spending some time to learn a bit about a particular discipline. People want you to solve problems. The fact that you're doing it with statistics is irrelevant; if you could get correct answers by divining in chicken guts they'd be quite happy to accept that methodology too. Having a domain in which you can apply your knowledge gives you an idea of what the problems are and gives you the language to talk to the people who have the problems. I think you can sometimes pick this stuff up on the fly, but it's nice to just have it.

Definitely try to get internships over your vacations. Ideally with a company you want to go on and work for.

I don't know much about actuarial work. Apparently there is a series of industry exams you need to pass. Look into that.

u/shujaa-g · 5 points · r/statistics

You're pretty good when it comes to linear vs. generalized linear models--and the comparison is the same regardless of whether you use mixed models or not. I don't agree at all with your "Part 3".

My favorite reference on the subject is Gelman & Hill. That book prefers the terminology of "pooling", and considers models that have "no pooling", "complete pooling", or "partial pooling".

One of the introductory datasets is on Radon levels in houses in Minnesota. The response is the (log) Radon level, the main explanatory variable is the floor of the house where the measurement was made (0 for basement, 1 for first floor), and there's also a grouping variable for the county.

Radon comes out of the ground, so, in general, we expect (and see in the data) basement measurements to have higher Radon levels than ground floor measurements, and based on varied soil conditions, different overall levels in different counties.

We could fit 2 fixed effect linear models. Using R formula pseudocode, they are:

  1. radon ~ floor
  2. radon ~ floor + county (county as a fixed effect)

The first is the "complete pooling" model. Everything is grouped together into one big pool. You estimate two coefficients. The intercept is the mean value for all the basement measurements, and your "slope", the floor coefficient, is the difference between the ground floor mean and the basement mean. This model completely ignores the differences between the counties.

The second is the "no pooling" estimate, where each county is in its own little pool by itself. If there are k counties, you estimate k + 1 coefficients: one intercept (the mean basement value in your reference county), one "slope", and k - 1 county adjustments, which are the differences between the mean basement measurements in each county and in the reference county.

Neither of these models is great. The complete pooling model ignores any information conveyed by the county variable, which is wasteful. A big problem with the second model is that there's a lot of variation in how sure we are about individual counties. Some counties have a lot of measurements, and we feel pretty good about their levels, but some of the counties only have 2 or 3 data points (or even just 1). What we're doing in the "no pooling" model is taking the average of however many measurements there are in each county, even if there are only 2, and declaring that to be the radon level for that county. Maybe Lincoln County has only two measurements, and they both happen to be pretty high, say 1.5 to 2 standard deviations above the grand mean. Do you really think that this is good evidence that Lincoln County has exceptionally high Radon levels? Your model does: its fitted line goes straight between the two Lincoln County points, 1.75 standard deviations above the grand mean. But maybe you're thinking "that could just be a fluke. Flipping a coin twice and seeing two heads doesn't mean the coin isn't fair, and having only two measurements from Lincoln County and they're both on the high side doesn't mean Radon levels there are twice the state average."

Enter "partial pooling", aka mixed effects. We fit the model radon ~ floor + (1 | county). This means we'll keep the overall fixed effect for the floor difference, but we'll allow the intercept to vary with county as a random effect. We assume that the intercepts are normally distributed, with each county being a draw from that normal distribution. If a county is above the statewide mean and it has lots of data points, we're pretty confident that the county's Radon level is actually high, but if it's high and has only two data points, they won't have the weight to pull up the estimate much. In this way, the random effects model is a lot like a Bayesian model, where our prior is the statewide distribution, and our data is each county.

The only parameters that are actually estimated are the floor coefficient, and then the mean and SD of the county-level intercepts. Thus, unlike the complete pooling model, the partial pooling model takes the county info into account, but it is far more parsimonious than the no pooling model. If we really care about the effects of each county, this may not be the best model for us to use. But, if we care about general county-level variation, and we just want to control pretty well for county effects, then this is a great model!

Of course, random effects can be extended to more than just intercepts. We could fit models where the floor coefficient varies by county, etc.
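To make the three kinds of pooling concrete, here is a rough Python sketch. It is not the Gelman & Hill analysis: the county numbers are made up, the floor predictor is left out, and the within-county and between-county SDs (`SIGMA_Y`, `SIGMA_ALPHA`) are simply assumed known, whereas a real mixed model would estimate them. The overall mean of all measurements stands in for the mean of the county distribution. The partial pooling estimate is a precision-weighted average of the county mean and the overall mean.

```python
from statistics import mean

# Hypothetical log-radon measurements (made-up numbers, not the Minnesota data).
radon = {
    "Lincoln": [2.9, 3.1],  # only two measurements, both high
    "Hennepin": [1.2, 1.6, 1.4, 1.5, 1.3, 1.7, 1.4, 1.5],
    "Ramsey": [1.0, 1.3, 1.1, 1.2, 1.4, 1.2],
}

SIGMA_Y = 0.8      # assumed within-county SD (a real model would estimate this)
SIGMA_ALPHA = 0.3  # assumed between-county SD (likewise estimated in practice)


def pooled_estimates(data, sigma_y, sigma_alpha):
    """Return complete, no and partial pooling estimates for each county."""
    all_values = [y for ys in data.values() for y in ys]
    grand_mean = mean(all_values)  # complete pooling: one number for everyone
    estimates = {}
    for county, ys in data.items():
        county_mean = mean(ys)  # no pooling: each county on its own
        weight_data = len(ys) / sigma_y ** 2
        weight_prior = 1.0 / sigma_alpha ** 2
        # Partial pooling: counties with little data get pulled toward the grand mean.
        partial = (weight_data * county_mean + weight_prior * grand_mean) / (
            weight_data + weight_prior
        )
        estimates[county] = {
            "n": len(ys),
            "complete_pooling": grand_mean,
            "no_pooling": county_mean,
            "partial_pooling": partial,
        }
    return estimates


for county, est in pooled_estimates(radon, SIGMA_Y, SIGMA_ALPHA).items():
    print(county, {k: round(v, 2) for k, v in est.items()})
```

With these assumed numbers the two-point county is shrunk most of the way back toward the overall mean, while the county with eight measurements moves much less, which is the behaviour described above.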

Hope this helps! I strongly recommend checking out Gelman and Hill.

u/efrique · 2 points · r/statistics

> she didn't know how useful it would be

probably more employable than geography

> Do you guys have any recommendations on where to develop my knowledge and skills?

There's a bunch of free and inexpensive stuff around .. but there's also a lot of bad free/inexpensive stuff around; you have to be a bit discerning (which is hard when you're trying to learn it).

It might sound a bit old-school but I'd suggest going to a university library and finding some decent stats texts; you probably want to avoid the stuff that says "11th edition".

Find several that you like and work with those for a while

Some you might look for:

Statistics, Freedman, Pisani & Purves (any edition)

Introduction to the Practice of Statistics, Moore & McCabe (5th edition or earlier)

For a bit of theory (you'll need a bit of mathematics for this but not a ton of it):

Introduction to the Theory of Statistics, Mood, Graybill & Boes

These are all old books. You should be able to get them second hand for cheap, or read them in a library. They'll be a good grounding, but you'll need to be able to ask questions as well.

Places like this one and can be handy resources. I've seen determined people teach themselves a lot of statistics with only a bit of guidance so it can certainly be done.

> Do I need programming, if so, what would be the best programming language to learn?

It would be best to learn some, yes, because modern statistics relies on it heavily. You don't necessarily have to do it immediately but getting an early start (and using it to help with learning stats) will be better than leaving it really long.

Two main things are widely used ... R and Python. Both are free. The second is more of a mainstream programming language; the first is a statistics package as well as a more specialized language.

Learn one or both; my own suggestion would be to try R but other people may have different advice.

If you want to be a programmer rather than a statistician who uses code to solve statistical problems, Python would be the better choice.

u/COOLSerdash · 9 points · r/statistics

u/apple-jacks · 2 points · r/statistics

The reference text that I use the most is Tabachnick and Fidell's Using Multivariate Statistics. If you are primarily interested in using Stata, you might still derive value from the content, but the example SAS or SPSS output would not be as helpful. (Disclaimer: I use both SPSS and Stata regularly, and have one semester of SAS experience under my belt.)

Acock's A Gentle Introduction to Stata might be a good similar Stata-based resource (I am resisting the urge to make jokes about the author's name). I've only read the first few chapters but I found it well-written and easy to understand. Stata also has some great specialized topics books, such as Long & Freese's book on categorical dependent variables. And don't forget about Stata's great help section. I know you already know about the UCLA website, but when I encounter Stata questions, I'm usually able to resolve them by looking in the help section and checking the UCLA website.

u/[deleted] · 10 points · r/statistics

"Doing Bayesian Data Analysis" by Kruschke. The instruction is really clear and there are code examples, and a lot of the mainstays of NHST are given a Bayesian analogue, so that should have some relevance to you.

"Bayesian Data Analysis" by Gelman. This one is more rigorous (notice the obvious lack of puppies on the cover) but also very good.

Free stuff:

"Think Bayes" by our own resident Bayesian apostle, Allen Downey. This book introduces Bayesian stats from a computational perspective, meaning it lays out problems and solves them by writing Python code. Very easy to follow, free, and just a great resource.

Lecture: "Bayesian Statistics Made (As) Simple (As Possible)" again by Prof. Downey. He's a great teacher.

u/El-Dopa · 1 point · r/statistics

If you are looking for something very calculus-based, this is the book I am familiar with that is most grounded in that. Though, you will need some serious probability knowledge, as well.

If you are looking for something somewhat less theoretical but still mathematical, I have to suggest my favorite. Statistics by William L. Hays is great. Look at the top couple of reviews on Amazon; they characterize it well. (And yes, the price is heavy for both books.... I think that is the cost of admission for such things. However, considering the comparable cost of much more vapid texts, it might be worth springing for it.)

u/mikethechampion · 3 points · r/statistics

I would highly recommend the following book: Mostly Harmless Econometrics

It is a very problem-driven book and will help build up your knowledge base to know what models are appropriate for a given situation or dataset.

You will then need to start practicing in a statistical program to gain the practical skills of applying those models to real data. Excel works, but I don't know a good book to recommend to guide you through using Excel on real problems.

I recommend Stata to new data analysts and have them pick up "Microeconometrics Using Stata"; once they've worked through these two books they get excited and start grabbing data all over and begin running models. It's exciting to watch new data modellers apply the tools they're learning. R is free and open source but is more difficult to learn; if you're willing to ask, there are tons of people willing to help you through R.

u/M_Bus · 2 points · r/statistics

Not long! For this purpose I highly highly recommend Richard McElreath's Statistical Rethinking (this one here). It's SO good. The math is exceptionally straightforward for someone familiar with regression, and it's huge on developing intuition. Bonus: he sets you up with all the tools you need to do your own analyses, and there are tons of examples that he works from a lot of different angles. He even does hierarchical regression.

It's an easy math book to read cover to cover by yourself, to be honest. He really holds your hand the whole way through.

Jesus, he should pay me to rep his book.

u/berf · 1 point · r/statistics

I don't understand the question. Isn't this easy? Just follow the KISS principle (keep it simple, stupid). They've presumably seen histograms somewhere. Just present the kernel density estimate, presumably with optimal bandwidth chosen by cross-validation or something, which is way too complicated to explain, as a better competitor to the histogram. Explain the kernel density estimate as a better estimate of the "theoretical histogram" (I get this terminology from Freedman, Pisani, and Purves, an excellent "statistics for poets" book), which is what the histogram would be if you had an infinite amount of data. No one believes the theoretical histogram actually has jumps like a histogram (estimate), so why not use a smooth estimate like the kernel density estimate? That's almost ELI5.
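For anyone who wants to see the histogram and the kernel density estimate side by side, here is a small standard-library Python sketch. The sample is simulated, and the bandwidth comes from a simple normal-reference rule of thumb (1.06 * sd * n^(-1/5)) rather than the cross-validation mentioned above, purely to keep the example short; the function names are made up for this illustration.

```python
import math
import random
import statistics


def rule_of_thumb_bandwidth(data):
    """Normal-reference rule of thumb; a stand-in for cross-validation."""
    return 1.06 * statistics.stdev(data) * len(data) ** (-1 / 5)


def gaussian_kde(data, bandwidth):
    """Return a function x -> Gaussian kernel density estimate at x."""
    scale = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return scale * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density


def histogram_density(data, bins):
    """Return bar heights (density scale), left edge and bin width."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in data:
        index = min(int((x - lo) / width), bins - 1)  # put the maximum in the last bin
        counts[index] += 1
    return [c / (len(data) * width) for c in counts], lo, width


rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(200)]  # simulated data

kde = gaussian_kde(sample, rule_of_thumb_bandwidth(sample))
heights, lo, width = histogram_density(sample, bins=10)

# Compare the jumpy histogram with the smooth estimate at each bin centre.
for i, height in enumerate(heights):
    centre = lo + (i + 0.5) * width
    print(f"x={centre:6.2f}  histogram={height:.3f}  kde={kde(centre):.3f}")
```

Both estimates are on the density scale (area one), so the columns are directly comparable; the histogram values jump from bin to bin while the KDE changes smoothly.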

u/yggdrasilly · 3 points · r/statistics

It really depends on your mathematical maturity. Are you more interested in the application of statistics or the theoretical/methodological underpinnings of statistics? What have you covered so far?

My favorite book for theoretical statistics/statistical inference is In All Likelihood. It's an absolutely brilliant introduction to inference for both Bayesian and Frequentist methodologies but you will need some knowledge of probability, calculus, linear algebra, real analysis etc.

For applied statistics I would recommend something like MASS. This book uses R (a popular open source statistics package) to explore a multitude of applications with loads of examples, data etc.

u/Bromskloss · 1 point · r/statistics

> There are some philosophical reasons and some practical reasons that being a "pure" Bayesian isn't really a thing as much as it used to be. But to get there, you first have to understand what a "pure" Bayesian is: you develop reasonable prior information based on your current state of knowledge about a parameter / research question. You codify that in terms of probability, and then you proceed with your analysis based on the data. When you look at the posterior distributions (or posterior predictive distribution), it should then correctly correspond to the rational "new" state of information about a problem because you've coded your prior information and the data, right?

Sounds good. I'm with you here.

> However, suppose you define a "prior" whereby a parameter must be greater than zero, but it turns out that your state of knowledge is wrong?

Isn't that prior then just an error like any other, like assuming that 2 + 2 = 5 and making calculations based on that?

> What if you cannot codify your state of knowledge as a prior?

Do you mean a state of knowledge that is impossible to encode as a prior, or one that we just don't know how to encode?

> What if your state of knowledge is correctly codified but makes up an "improper" prior distribution so that your posterior isn't defined?

Good question. Is it settled how one should construct the strictly correct priors? Do we know that the correct procedure ever leads to improper distributions? Personally, I'm not sure I know how to create priors for any problem other than the one where the prior is spread evenly over a finite set of indistinguishable hypotheses.

The thing about trying different priors, to see if it makes much of a difference, seems like a legitimate approximation technique that needn't shake any philosophical underpinnings. As far as I can see, it's akin to plugging in different values of an unknown parameter in a formula, to see if one needs to figure out the unknown parameter, or if the formula produces approximately the same result anyway.

> read this book. I promise it will only try to brainwash you a LITTLE.

I read it and I loved it so much for its uncompromising attitude. Jaynes made me a militant radical. ;-)

I have an uncomfortable feeling that Gelman sometimes strays from the straight and narrow. Nevertheless, I looked forward to reading the page about Prior Choice Recommendations that he links to in one of the posts you mention. In it, though, I find the puzzling "Some principles we don't like: invariance, Jeffreys, entropy". Do you know why they write that?

u/pgoetz · 1 point · r/statistics

I would try Mathematical Statistics and Data Analysis by Rice. The standard intro text for Mathematical Statistics (this is where you get the proofs) is Wackerly, Mendenhall, and Scheaffer, but I find this book to be a bit too dry and theoretical (and I'm in math). Calculus is less important than a thorough understanding of how random variables work. Rice has a couple of pretty good chapters on this, but it will require some mathematical maturity to read this book. Good luck!

u/jmcq · 2 points · r/statistics

Depending on how strong your math/stats background is you might consider Statistical Inference by Casella and Berger. It's what we use for our first year PhD Mathematical Statistics course.

That might be a little too difficult if you're not very comfortable with probability theory and basic statistics. If you look at the first few chapters on Amazon and it seems like too much, I recommend Mathematical Statistics and Data Analysis by Rice, which I guess I would consider a "prequel" to the Casella text. I worked through this in an advanced statistics undergrad course (along with Mostly Harmless Econometrics and Goldberger's A Course in Econometrics).

Let's see, if you're interested in Stochastic Models (Random Walks, Markov Chains, Poisson Processes etc), I recommend Introduction to Stochastic Modeling by Taylor and Karlin. Also something I worked through as an undergrad.

u/beaverteeth92 · 2 points · r/statistics

The absolute best book I've found for someone with a frequentist background and undergraduate-level math skills is Doing Bayesian Data Analysis by John Kruschke. It's a fantastic book that goes into mathematical depth only when it needs to while also building your intuition.

The second edition is new and I'd recommend it over the first because of its improved code. It uses JAGS and Stan instead of BUGS, which is Windows-only now.

u/shaggorama · 3 points · r/statistics

I'm a fan of Hogg, McKean & Craig. This is a graduate level text so don't feel like you need to understand everything in it, but it could be a good way to get a better understanding of the topics you've already covered but don't quite grok. Also, don't be intimidated just because it's a graduate level textbook: it's fairly accessible, certainly more so than Casella & Berger, which someone else probably would have already suggested if I'd gotten to this later.

u/Here4TheCatPics · 2 points · r/statistics

I've used a book by Gelman for self study. Great author, very good at using meaningful graphics -- which may be an effective way to convey ideas to students.

u/RogerSmithII · 2 points · r/statistics

Thanks. The program is Data Science and prereqs are Calc, Lin Alg and basic stats.

I started my review using but the book assumes you have basic stats. I took these courses 5+ years ago so I only vaguely remember the material.

Good example with hetero/homoskedasticity. I want to make sure I understand things like random variables and different types of distributions.

u/PM_ME_YOUR_WOMBATS · 1 point · r/statistics

Somewhat facetiously, I'd say the probability that an individual who has voted in X/12 of the last elections will vote in the next election is (X+1)/14. That would be my guess if I had no other information.
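The (X+1)/14 guess has the same form as Laplace's rule of succession, (successes + 1) / (trials + 2), which is the posterior mean of a success probability under a uniform prior. A tiny Python sketch, where the function name is just an illustrative choice:

```python
from fractions import Fraction


def rule_of_succession(successes, trials):
    """Posterior mean of a success probability under a uniform Beta(1, 1) prior."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    return Fraction(successes + 1, trials + 2)


# Estimated chance of voting next time, given X votes in the last 12 elections.
for votes in (0, 6, 12):
    estimate = rule_of_succession(votes, 12)
    print(f"voted in {votes}/12 -> {estimate} (about {float(estimate):.2f})")
```

Note that even someone who voted in all 12 elections gets an estimate below 1, and someone who never voted gets one above 0.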

As the proverb goes: it's difficult to make predictions, especially about the future. We don't have any votes from the next election to try to discern what relationship those votes have to any of the data at hand. Of course that isn't going to stop people who need to make decisions. I'm not well-versed in predictive modeling (being more acquainted with the "make inference about the population from the sample" sort of statistics) but I wonder what would happen if you did logistic regression with the most recent election results as the response and all the other information you have as predictors. See how well you can predict the recent past using the further past, and suppose those patterns will carry forward into the future. Perhaps someone else can propose a more sophisticated solution.

I'm not sure how this data was collected, but keep in mind that a list of people who have voted before is not a complete list of people who might vote now, since there are some first-time voters in every election.

If you want to get serious about data modeling in social science, you might check out this book by statistician/political scientist Andrew Gelman.

u/gpark · 1 point · r/statistics

Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences by Cohen, Cohen, West, and Aiken and Using Multivariate Statistics by Tabachnick and Fidell are both good for your situation, I think. They are easy to read, touch on a wide variety of popular methods, and have lots of examples with code and data from popular software (including SPSS).

u/marmle · 4 points · r/statistics

The short version is that in a Bayesian model your likelihood is how you're choosing to model the data, aka P(x|\theta) encodes how you think your data was generated. If you think your data comes from a binomial, e.g. you have something representing a series of success/failure trials like coin flips, you'd model your data with a binomial likelihood. There's no right or wrong way to choose the likelihood; it's entirely based on how you, the statistician, think the data should be modeled. The prior, P(\theta), is just a way to specify what you think \theta might be beforehand, e.g. if you have no clue in the binomial example what your rate of success might be, you put a uniform prior over the unit interval. Then, assuming you understand Bayes' theorem, we find that we can estimate the parameter \theta given the data by calculating P(\theta|x)=P(x|\theta)P(\theta)/P(x). That is the entire Bayesian model in a nutshell.

The problem, and where MCMC comes in, is that given real data, the way to calculate P(x) is usually intractable, as it amounts to integrating or summing P(x|\theta)P(\theta) over \theta, which isn't easy when you have multiple data points (since P(x|\theta) becomes \prod_{i} P(x_i|\theta)). You use MCMC (and other approximate inference methods) to get around calculating P(x) exactly.

I'm not sure where you've learned Bayesian stats from before, but for gaining intuition (which it seems is what you need) I've heard good things about Statistical Rethinking; the author's website includes more resources, including his lectures. Doing Bayesian Data Analysis also seems to be another beginner-friendly book.
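As a toy illustration of how MCMC sidesteps P(x), here is a bare-bones random-walk Metropolis sampler in Python for the binomial example with a uniform prior. It is only a sketch: the data (7 successes in 10 trials), the step size and the number of steps are arbitrary choices, and real work would use something like Stan or JAGS. The sampler only ever uses the ratio of unnormalised posteriors, so P(x) cancels. For this particular model the exact posterior is known to be Beta(k+1, n-k+1), which gives something to check against.

```python
import math
import random


def log_unnormalised_posterior(theta, k, n):
    """log of P(x|theta) * P(theta), dropping constants that do not involve theta."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # the uniform prior puts no mass outside (0, 1)
    return k * math.log(theta) + (n - k) * math.log(1.0 - theta)


def metropolis(k, n, steps=20000, step_size=0.1, seed=0):
    """Random-walk Metropolis samples from the posterior of theta."""
    rng = random.Random(seed)
    theta = 0.5
    current = log_unnormalised_posterior(theta, k, n)
    samples = []
    for _ in range(steps):
        proposal = theta + rng.gauss(0.0, step_size)
        candidate = log_unnormalised_posterior(proposal, k, n)
        # Accept with probability min(1, posterior ratio); P(x) cancels in the ratio.
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        samples.append(theta)
    return samples


k, n = 7, 10  # hypothetical data: 7 successes in 10 trials
draws = metropolis(k, n)[2000:]  # discard the first 2000 draws as burn-in
mcmc_mean = sum(draws) / len(draws)
exact_mean = (k + 1) / (n + 2)  # mean of the exact Beta(k + 1, n - k + 1) posterior

print(f"MCMC posterior mean  {mcmc_mean:.3f}")
print(f"Exact posterior mean {exact_mean:.3f}")
```

The two means should agree closely, which is a useful sanity check before trusting the same machinery on models where no closed form exists.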

u/siddboots · 9 points · r/statistics

It is hard to provide a "comprehensive" view, because there's so much disparate material in so many different fields that draw upon probability theory.

Feller is an approachable classic that covers all of the main results in traditional probability theory. It certainly feels a little dated, but it is full of the deep central limit insights that are rarely explained in full in other texts. Feller is rigorous, but keeps applications at the center of the discussion, and doesn't dwell too much on the measure-theoretical / axiomatic side of things. If you are more interested in the modern mathematical theory of probability, try Probability with Martingales.

On the other hand, if you don't care at all about abstract mathematical insights, and just want to be able to use probability theory directly for every-day applications, then I would skip both of the above, and look into Bayesian probabilistic modelling. Try Gelman et al.

Of course, there's also machine learning. It draws on a lot of probability theory, but often teaches it in a very different way to a traditional probability class. For a start, there is much more emphasis on multivariate models, so linear algebra is much more central. (Bishop is a good text).

u/blair_necessities · 3 points · r/statistics

If you're just looking for a concept overview, The Cartoon Guide to Statistics is great. It's easy to read and filled with great visuals and examples.

If you want to learn how to do intro statistics/practice, look no further than Khan Academy.

u/michaelquinn32 · 1 point · r/statistics

My math stats textbook is Hogg, McKean & Craig. I don't think the math would be too much for a computational statistics major, but it would give you a great overview if you're interested in that direction.

u/blossom271828 · 5 points · r/statistics

The book that you want the person to look up is Applied Linear Statistical Models. It is a great reference book and gets into the nitty gritty calculations for figuring out the appropriate degrees of freedom in some pretty ugly experimental designs.

u/CodeNameSly · 3 pointsr/statistics

Casella and Berger is one of the go-to references. It is at the advanced undergraduate/first year graduate student level. It's more classical statistics than data science, though.

Good statistical texts for data science are Introduction to Statistical Learning and the more advanced Elements of Statistical Learning. Both of these have free pdfs available.

u/Jake_JAM · 6 pointsr/statistics

I like Discovering Statistics Using R. Great book for learning the basics of hypothesis testing, a little bit of math, and you learn how to do it in R; not to mention there are a few bits you’ll chuckle at. There are also other books for other programs in this series (SPSS, SAS).

u/belarius · 3 pointsr/statistics

Casella & Berger is the go-to reference (as Smartless has already pointed out), but you may also enjoy Jaynes. I'm not sure I'd say it's quick but if gaps are your concern, it's pretty drum-tight.

u/coffeecoffeecoffeee · 1 pointr/statistics

One way is picking a distribution with a mode and a "concentration" around that mode that reflects what you have. John Kruschke does an amazing job at explaining how to pick Beta priors based off of that in Doing Bayesian Data Analysis (which, may I note, has the best cover of any statistics book I've ever read).
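As a rough sketch of that idea: a Beta(a, b) prior can be parameterized by its mode and a concentration kappa = a + b, which gives a = mode * (kappa - 2) + 1 and b = (1 - mode) * (kappa - 2) + 1 for kappa > 2. The function name below is a hypothetical helper for illustration, not code from the book.

```python
def beta_from_mode(mode, concentration):
    """Return Beta shape parameters (a, b) with the given mode and a + b = concentration.

    Larger concentration means a prior that is more tightly peaked around the mode.
    """
    if not 0.0 < mode < 1.0:
        raise ValueError("mode must lie strictly between 0 and 1")
    if concentration <= 2.0:
        raise ValueError("concentration must exceed 2 for the mode to be well defined")
    a = mode * (concentration - 2.0) + 1.0
    b = (1.0 - mode) * (concentration - 2.0) + 1.0
    return a, b


if __name__ == "__main__":
    # Prior belief: the rate is most likely around 0.8, worth roughly 12 pseudo-observations.
    print(beta_from_mode(0.8, 12))
```

One way to read the concentration is as a prior sample size: the higher it is, the more data it takes to move the posterior away from the chosen mode.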

u/lenwood · 1 pointr/statistics

I'm doing the same. Here are a couple of resources that you may find helpful.

u/Sarcuss · 6 pointsr/statistics

I would say: Go for it as long as you are interested in the job :)

For study references for remembering R and Statistics, I think all you would need would be:

For R, data cleaning and such: and for basic statistics with R, probably either Dalgaard for applied statistics with R, and something like OpenIntroStats or Freedman for a review of stats.

u/clarinetist001 · 4 pointsr/statistics

If you really need it dumbed down, I would recommend Asimow and Maxwell. This text has a solutions manual. Note that this is specifically tailored toward actuarial exams - i.e., people that have to learn the material quickly but not necessarily for grad school. (And yes, the website is legit. I've done some contract work for them in the past and have ordered books through them.)

If you don't mind something more mathematical, I would recommend Wackerly et al.

u/mrdevlar · 2 pointsr/statistics

If you want a math book with that perspective, I'd recommend E.T. Jaynes's "Probability Theory: The Logic of Science"; he delves into quite a lot of discussion about that topic.

If you want a popular science book on the subject, try "The Theory That Would Not Die".

Bayesian statistics has, in my opinion, been the force that has attempted to reverse this particular historical trend. However, that viewpoint is unlikely to be shared by all in this area. So take my viewpoint with a grain of salt.

u/gatherinfer · 2 pointsr/statistics

A lot of the recommendations in this thread are good; I'd like to add "Bayesian Data Analysis 3rd edition" by Gelman et al. Useful if you encounter Bayesian models, especially hierarchical/multilevel models.

u/glutamate · 2 pointsr/statistics

Data analysis: a Bayesian tutorial is really nice. It starts off with continuous parameter estimation and then moves on to model selection. Unlike Peter Lee's book it feels like a clean break from classical stats.

u/DrGar · 3 pointsr/statistics

Try to get through the first chapter of Bickel and Doksum. It is a great book on mathematical statistics. You need a solid foundation before you can build up.

For a less rigorous, more applied and broad book, I thought this book was alright. Just realize that the more "heavy math" (i.e., mathematical statistics and probability theory) you do, the better prepared you will be to face applied problems later. A lot of people want to jump right into the applications and the latest and greatest algorithms, but if you go this route, you will never be the one greatly improving such algorithms or coming up with the next one (and you might even run the risk of not fully understanding these tools and when they do not apply).

u/marketfailure · 1 pointr/statistics

In my graduate econometrics course we used Mostly Harmless Econometrics. It's focused on the question of causal inference, and specifically how to do empirically rigorous studies when variables aren't exogenous. It covers a bunch of best practices in designing experiments. It's not focused on networks, which is a rapidly emerging field of study in the social sciences. However, it does a very good job of explaining possible sources of error in statistical inference and research design.

u/DiogenicOrder · 8 pointsr/statistics

How would you rather split beginner vs intermediate/advanced?

My feeling was that Ben Lambert's book would be a good intro and that Bayesian Data Analysis would be a good next step?

u/bluecoffee · 2 pointsr/statistics

If you'd like to understand statistical methods as well as apply them, yes. It's a much more consistent, intuitive approach. Personally, a whole pile of frequentist concepts only made sense after I'd worked through a Bayesian-based machine learning textbook.

u/PatsysStone · 4 pointsr/statistics

Andy Field also has a book for learning statistics using R:

I also recommend his book, it is quite a fun read.

u/dabomb4real · 1 pointr/statistics

I don't understand how my example of spurious correlation among randomly generated numbers doesn't already meet that burden. That's a data generating process that is not causal by design but produces your preferred observed signal.

Your additions of "repeated", "different times" and "different places" only reduce the likelihood of finding a set with your preferred signal (or similarly require checking more pairs). There's literally a cottage industry around finding these funny noncausal relationships.
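A small simulation sketch of that point, with made-up sizes (200 candidate series of 10 points each) and hypothetical function names: every series is independent noise, yet searching enough candidates reliably turns up one that correlates strongly with the target.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def strongest_spurious_match(n_series=200, length=10, seed=1):
    """Largest |correlation| between one noise series and many unrelated noise series."""
    rng = random.Random(seed)
    target = [rng.gauss(0.0, 1.0) for _ in range(length)]
    candidates = [
        [rng.gauss(0.0, 1.0) for _ in range(length)] for _ in range(n_series)
    ]
    return max(abs(pearson(target, candidate)) for candidate in candidates)


if __name__ == "__main__":
    print("strongest |r| among independent series:", strongest_spurious_match())
```

Requiring the pattern to hold across repeated samples lowers the hit rate per pair, but checking more pairs raises it again, which is the trade-off described above.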

If you're imagining something more elaborate about what it means to move "reliably" together, Mostly Harmless Econometrics walks through how every single thing you might be thinking of is really just trying to get back to Rubin-style randomized treatment assignment.

u/ViewofDelft · 1 pointr/statistics

Surprisingly effective intro to probability

might be too informal for your purposes though...

u/clm100 · 2 pointsr/statistics

Honestly, ignore the "for engineering" part of "Statistics for Engineering." They're largely the same content.

How much calculus have you taken? Does the class use calculus?

First, The Cartoon Guide to Statistics is surprisingly helpful for some people.

For a more traditional textbook, you might try Devore's main intro book.

Almost every student finds statistics confusing and it's either difficult to teach, or just difficult to learn. It's also a fractal discipline, since you can keep going deeper and deeper, but it's generally just going over the same few concepts with additional depth. If you end up in a class that's not well suited to your mathematical background it's especially frustrating.

Good luck.

u/Adamworks · 1 pointr/statistics

I'm assuming this is some sort of experimental psychology?

Probably everything in this book:

or this website:

Same guy, great book.

u/blimpy_stat · 11 pointsr/statistics

Applied Linear Statistical Models by Kutner is a far better reference for statistical modeling compared to ISLR/ESLR or any kind of "machine learning" text, but it sounds as though you did a stat masters since you're asking about stat modeling instead of the new buzzwords. The latter options are certainly more narrow.
Considered a cornerstone, of sorts.

u/statmama · 9 pointsr/statistics

Seconding /u/khanable_ -- most of statistical theory is built on matrix algebra, especially regression. Entry-level textbooks usually use simulations to explain concepts because it's really the only way to get around assuming your audience knows linear algebra.
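As a small illustration of the matrix algebra underneath regression, here is a sketch of simple linear regression solved through the normal equations (X'X) b = X'y, where X has a column of ones and a column of x values. It is written in plain Python with the 2x2 inverse spelled out; the function name is a hypothetical helper, and real work would normally use a linear algebra library.

```python
def ols_simple(xs, ys):
    """Least-squares intercept and slope from the normal equations (X'X) b = X'y."""
    n = len(xs)
    sum_x = sum(xs)
    sum_xx = sum(x * x for x in xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # X'X = [[n, sum_x], [sum_x, sum_xx]] and X'y = [sum_y, sum_xy].
    det = n * sum_xx - sum_x * sum_x
    if det == 0:
        raise ValueError("x values are all identical, so X'X is singular")
    # Apply the explicit inverse of the 2x2 matrix X'X to X'y.
    intercept = (sum_xx * sum_y - sum_x * sum_xy) / det
    slope = (n * sum_xy - sum_x * sum_y) / det
    return intercept, slope


if __name__ == "__main__":
    print(ols_simple([0, 1, 2, 3], [1, 3, 5, 7]))
```

With more predictors the same equation holds unchanged; only the size of X'X grows, which is why the general theory is stated in matrix form.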

My Ph.D. program uses Casella and Berger as the main text for all intro classes. It's incredibly thorough, beginning with probability and providing rigorous proofs throughout, but you would need to be comfortable with linear algebra and at least the basic principles of real analysis. That said, this is THE book that I refer to whenever I have a question about statistical theory-- it's always on my desk.

u/AdActa · 2 pointsr/statistics

A good bet would be "Mostly Harmless Econometrics".

Not overly theoretic and very focussed on practical applications.

u/wil_dogg · 2 pointsr/statistics

Jaccard and Becker is a neat book and ideal for the level you are looking for:

But Jaccard and Becker may not have SAS programming examples. You can upgrade to Tabachnick and Fidell, which is a more advanced text that I think does include SAS coding examples (can't find an online edition to check on that, but my older editions had SPSS and SAS and, way back in the day, BMDP).

u/gwippold · 3 pointsr/statistics

You could read the IBM manual OR you could buy this much more user friendly book:

u/Slippery_Slope_Guy · 3 pointsr/statistics

It requires study so you might not have any sudden moments of clarity, but this is pretty much the Bible of regression.

Highly recommended.

u/josquindesprez · 1 pointr/statistics

If you want an extremely practical book to complement BDA3, try Statistical Rethinking.

It's got some of the clearest writing I've seen in a stats book, and there are some good R and Stan code examples.

u/ipu0014 · 2 pointsr/statistics

This one is a quite good book: Sivia, Skilling - Data Analysis: A Bayesian Tutorial

It's quite pragmatic, as opposed to the aforementioned Jaynes for instance.

u/TheLeaderIsGood · 1 pointr/statistics

This one? Damn, it's £40-ish. Any highlights or is it just a case of this book is the highlight?

It's on my wishlist anyway. Thanks.

u/PhaethonPrime · 2 pointsr/statistics

Another book is D.S. Sivia's Data Analysis: A Bayesian Tutorial. It's more expensive than when I first got it, though (sorry I don't have a free reference). The examples in the beginning of the book are easily done in PyMC, as well!

u/okcukv · 2 pointsr/statistics

Tabachnick and Fidell is pretty good. Get yourself a used copy - $165 is outrageous.

u/lykonjl · 4 pointsr/statistics

Jaynes: Probability Theory. Perhaps 'rigorous' is not the first word I'd choose to describe it, but it certainly gives you a thorough understanding of what Bayesian methods actually mean.

u/mr0860 · 3 pointsr/statistics

I found Andy Field's Discovering Statistics Using R to be quite helpful.

u/AndersonCoopersDick · 1 pointr/statistics

Good read on the topic and history of the rise of Bayesian Statistics here:

u/RAPhisher · 4 pointsr/statistics

In addition to linear regression, do you need a reference for future use/other topics? Casella/Berger is a good one.

For linear regression, I really enjoyed A Modern Approach to Regression with R.

u/oz0509 · 2 pointsr/statistics

I agree with all of the above. Also, here's the Linear Models tome we used:

u/AllezCannes · 2 pointsr/statistics

They're not free, but Doing Bayesian Data Analysis and Statistical Rethinking are worth their weight in gold.

u/Jimmy_Goose · 1 pointr/statistics

If you want to go deep, try a math stats book. Although there would probably be disagreement in this subreddit on which is the best, they are all pretty much the same. Maybe try an early edition of the Wackerly book (I think that one is the most widely used one). A lot of people would suggest Casella and Berger, but I would suspect those people have never taught a course and forget that that book does require a bit of mathematical maturity. Go with an undergrad book.

For R, I would suggest either going through a tutorial (such as the swirl package), or, as I assume most people learned it, buying an applied stats book and just doing the problems in R. You have to get over the hump of learning it, but you learn programming by doing it. After you are done with math stats, a good next step in the applied direction is Regression and ANOVA/Design. For regression, there are a ton of books. But again, the first few chapters of most books are the same. I would try and find a cheapo book with a modern typeset. For ANOVA, you probably want to go with Montgomery. I don't know the others too well though.

u/gtani · 3 pointsr/statistics

Hmm, not sure what "following" 7 texts means, or why they chose that particular Hogg/Craig (which is now on 7th edition), but Casella and Berger is another standard text, and for Bayesian analysis, Gelman, Carlin, Stern, Rubin (new edition in Nov will be: Gelman, Carlin, Stern, Dunson).

Also you could look at the 6 standard machine learning texts: Murphy, Bishop, Barber etc (The review by Bratieres)


stackexchange has consistently decent book reviews

(or google "intermediate advanced statistics textbook")

u/trngoon · 9 pointsr/statistics

You must learn an application heavy book in 2018. Preferably in R unless you can program, in that case maybe Python.

I will link you two perfect books with very little math that people from any discipline can understand and are very well written. Both heavy on application in R with accompanying websites with all the code. (Don't worry, R code is easy and the vast majority of R users are not programmers in the traditional sense.) The first book I link does go into some more advanced topics, but everything is explained in very plain language. Its accompanying website also has lecture videos from the prof who wrote it.

^^ I emailed Andy some time ago and he wants to release edition 2 next year probably

Trust me, these two books are what you want to look into.

NOTE some idiot is going to try to suggest to you a book called "Introduction to Statistical Learning" (mainly a supervised machine-learning book which is stats-focused) by the Stanford stats team. Do not start with this book if you want to learn traditional stats (like you point out in your post). No one who recommends you this book has considered your needs. I see this recommended every single day for all the wrong reasons. It actually makes me frustrated. It's a great book but has confused many people because of its name. Is it a stats book? Yeah. Is it an ML book? Yeah. Is it a traditional stats book? Nope. Anything that says "_____ learning" is probably a machine learning book. Sorry for the rant.

u/kiwipete · 2 pointsr/statistics

An intermediate resource between the Downey book and the Gelman book is Doing Bayesian Data Analysis. It's a bit more grounded in mathematics and theory than the Downey, but a little less mathy than the Gelman.

u/SupportVectorMachine · 5 pointsr/statistics

A very user-friendly treatment that hits every criterion you mention is John Kruschke's Doing Bayesian Data Analysis, Second Edition.

u/gianisa · 2 pointsr/statistics

found it! Apparently they've gone through several editions and added a coauthor since I bought my copy.

My father is a statistician and he is the one who recommended Hogg and Craig when I was complaining about Casella and Berger. I spent a summer working my way through Hogg and Craig and then reviewed everything from my classes that previous year as my way of studying for the written quals. I passed so it worked. And then I promptly forgot everything.

u/ajmarks · 1 pointr/statistics

This one is fairly standard: After all, it's where the MASS library comes from.

u/CrazyStatistician · 10 pointsr/statistics

Bayesian Data Analysis and Hoff are both well-respected. The first is a much bigger book with lots of applications, the latter is more of an introduction to the theory and methods.

u/lrnz13 · 1 pointr/statistics

I’m finishing up my stats degree this summer. For math, I took 5 courses: single variable calculus, multivariable calculus, and linear algebra.

My stat courses are divided into three blocks.

First block, intro to probability, mathematical stats, and linear models.

Second block, computational stats with R, computation & optimization with R, and Monte Carlo Methods.

Third block, intro to regression analysis, design and analysis of experiments, and regression and data mining.

And two electives of my choice: survey sampling & statistical models in finance.

Here’s a book for intro to probability. There are also lectures available on YouTube: search MIT intro to probability.

For a first course in calculus search on YouTube: UCLA Math 31A. You should also search for Berkeley’s calculus lectures; the professor is so good. Here’s the calc book I used.

For linear algebra, search MIT linear algebra. Here’s the book.

The probability book I listed covers two courses in probability. You’ll also want to check out this book.

If you want to go deeper into stats, for example, measure theory, you’re going to have to take real analysis & a more advanced course on linear algebra.

u/timshoaf · 3 pointsr/statistics

Frequentist statistics does use Bayes' theorem; all of the measure-theoretic results are identical between the philosophies. It is the inclusion of a priori knowledge (or information attempting to express a lack thereof) that demarcates the primary modeling differences.

If you would like a solid background in bayesian statistics I would recommend BDA3 by Andrew Gelman and Machine Learning: A Probabilistic Perspective by Kevin Murphy

One can of course not forget Hastie et al.'s Elements of Statistical Learning as well.

If you would like a general introduction, however, I would recommend the following text by Sivia.

Probability theory itself is consistently axiomatized under the Kolmogorov axioms. But the philosophy regarding how to perform inference is not.

There is not an obvious inconsistency in the mathematical formulations, but there are inconsistencies with how each of the philosophies treats various issues.

A brief overview of differences is here:

In short though, there is nothing mathematically wrong with the Frequentist approach, but I would personally argue there are things that are philosophically wrong with certain applications of those methods, not the least of which are issues where generating processes are non-stationary (though similar issues can be stated for Bayesians) or where simply the formulation tends to lead practitioners to draw mistaken conclusions. You can make a Rube Goldberg machine of computation and still have it preserve all information and be mathematically consistent, but the likelihood of humans misinterpreting it is much higher than with a simpler framework.

u/klaxion · 5 pointsr/statistics

Recommendation - don't learn statistics through "statistics for biology/ecology".

Go straight to statistics texts, the applied ones aren't that hard and they usually have fewer of the lost-in-translation errors (e.g. the abuse of p-values in all of biology).

Try Gelman and Hill -

Faraway - Practical Regression and Anova using R (free)

Categorical data analysis

u/dbzgtfan4ever · 3 pointsr/statistics

If you are running parametric tests (ANOVA and regression families), then you have a set of underlying assumptions that you need to test. You assume normality, homoscedasticity (equal variance/error variance between groups or at each level of your IV), and linearity between variables. You have to test for them. This also means testing for outliers and whether your data are missing completely at random (if you have missing data).
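As a very rough first screen for two of those assumptions, the sketch below computes the ratio of the largest to smallest group variance (a descriptive look at homoscedasticity) and the sample skewness (a descriptive look at asymmetry, one way normality can fail). These are assumptions of mine for illustration, not formal tests and not the procedure from any textbook mentioned here; in practice people usually reach for something like Levene's test, Shapiro-Wilk, and residual plots.

```python
import statistics


def variance_ratio(groups):
    """Ratio of the largest to the smallest sample variance across groups."""
    variances = [statistics.variance(group) for group in groups]
    return max(variances) / min(variances)


def sample_skewness(values):
    """Moment-based skewness: about 0 for symmetric data, positive for a long right tail."""
    n = len(values)
    mean = statistics.fmean(values)
    second_moment = sum((v - mean) ** 2 for v in values) / n
    third_moment = sum((v - mean) ** 3 for v in values) / n
    return third_moment / second_moment ** 1.5


if __name__ == "__main__":
    groups = [[4.1, 5.0, 5.9, 5.2], [3.0, 7.5, 1.2, 9.8]]
    print("variance ratio:", variance_ratio(groups))
    print("skewness of group 2:", sample_skewness(groups[1]))
```

A variance ratio far from 1 or a skewness far from 0 is a prompt to look more closely, not a verdict on its own.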

If your data do not meet these assumptions, then you have to decide how to proceed: should you run the tests anyway noting potential changes to alpha; transform the data (possibly compromise interpretation); run non-parametric tests; or model the non-normality or non-linearity?

I learned all of this in my Multivariate Statistics course, and this course used Tabachnick and Fidell's book called Using Multivariate Statistics.

Good luck! Severe violations of any of these assumptions could severely compromise any conclusions you draw from your research. However, some may just hold the view that violations of these assumptions in your sample may not lead to erroneous conclusions about your population, citing evidence that ANOVA is generally robust (produces similar results) to violations of normality.