(Part 3) Top products from r/datascience

We found 26 product mentions on r/datascience. We ranked the 205 resulting products by number of redditors who mentioned them. Here are the products ranked 41-60. You can also go back to the previous section.

Top comments that mention products on r/datascience:

u/parts_of_speech · 12 pointsr/datascience

Hey, DE here with lots of experience, and I was self taught. I can be pretty specific about the subfield and what is necessary to know and not know. In an inversion of the normal path I did a mid-career M.Sc. in CS, so it was kind of amusing to see what was and was not relevant in traditional CS. Prestigious CS programs prepare you for an academic career in CS theory, but the down-and-dirty work of moving and processing data uses only a specific subset. You can also get a lot done without the theory for a while.

If I had to transition now, I'd look into a bootcamp program like Insight Data Engineering; at least look at their syllabus. They put you in front of employers and force you to finish a demo project. In terms of CS fundamentals, https://teachyourselfcs.com/ offers a list of resources you can use over the years to fill in the blanks.

Data Engineering is more fundamentally operational in nature than most software engineering. You care a lot about things happening reliably across multiple systems, and when you use many systems the fragility increases a lot. A typical pipeline can cross a hundred actual computers and 3 or 4 different frameworks. (Also I'm doing the inverse transition as you... trying to understand multivariate time series right now.)

I have trained junior coders to become data engineers, and I focus a lot on operating system fundamentals: network, memory, processes. Debugging systems is a different skill set than debugging code; it's often much more I/O-centric. It's very useful to be quick on the command line too, as you are often shelling in to diagnose what's happening on this computer or that: checking 'top', 'netstat', grepping through logs. Distributed systems are a pain. Data Eng in production is like 1/4 Linux sysadmin.
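As a tiny code-side analogue of that log-grepping triage step, here's a hedged Python sketch; the log lines and the path mentioned in the comment are invented for illustration:

```python
import re

# Stand-in log excerpt (a real one would come from something like
# /var/log/pipeline/worker.log on the box you shelled into)
log = """\
2024-01-01 INFO  task 42 started
2024-01-01 ERROR task 42 connection timeout to shard-3
2024-01-01 INFO  task 42 retrying
"""

# First triage question when a pipeline stalls: how many error/timeout lines?
pattern = re.compile(r"error|timeout", re.IGNORECASE)
hits = [line for line in log.splitlines() if pattern.search(line)]
print(len(hits))  # -> 1
```

The same pattern scales to `grep -icE 'error|timeout' worker.log` on the command line, which is usually faster when you're already shelled in.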

It's good to be a language polyglot. (python, bash commands, SQL, Java)

Those massive Java stack traces are less intimidating when you know that Java's design encourages lots of deep class hierarchies, and every library you import introduces a few layers to the stack trace. But usually the meat-and-potatoes method you need to look at is at the top of a given thread. Scala is only useful because of Spark, and the level of Scala you need to know for Spark is small compared to the full extent of the language. Mostly you are programmatically configuring a computation graph.
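A rough sense of what "programmatically configuring a computation graph" means, as a toy Python sketch (the class and method names are invented, not Spark's actual API): transformations are recorded lazily and nothing executes until you ask for results, which is the shape Spark's DataFrame/RDD APIs follow.

```python
# Toy lazy-evaluation pipeline: each method records an operation instead of
# running it; work only happens in collect(), like building a Spark plan.
class Dataset:
    def __init__(self, data, ops=()):
        self.data, self.ops = data, ops          # record ops, don't execute

    def map(self, fn):
        return Dataset(self.data, self.ops + (("map", fn),))

    def filter(self, pred):
        return Dataset(self.data, self.ops + (("filter", pred),))

    def collect(self):                           # only here does work happen
        rows = self.data
        for kind, fn in self.ops:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        return rows

ds = Dataset([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
print(ds.collect())  # -> [20, 30, 40]
```

The payoff of this design in a real engine is that the whole recorded plan can be optimized and distributed before anything runs.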

Kleppman's book is a great way to skip to relevant things in large system design.

https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321

It's very worth understanding how relational databases work, because all the big distributed systems are basically subsets of relational database functionality, compromised for the sake of the distributed-ness. The fundamental concepts of partitioning, writing to disk, caching, indexing, query optimization, and transaction handling all apply. Whether the input is SQL or Spark, you are usually generating the same few fundamental operations (google Relational Algebra) and asking the system to execute them the best way it knows how. We face the same data issues now as we did in the 70s, just at a larger scale.
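To make "the same few fundamental operations" concrete, here's a toy Python sketch of three core relational-algebra primitives over lists of dicts; the tables and names are invented, and real engines do this over partitions with much cleverer physical plans:

```python
# Tiny relational-algebra sketch: selection, projection, and join are the
# primitives that both SQL planners and Spark plans bottom out in.
employees = [
    {"id": 1, "name": "Ada",   "dept_id": 10},
    {"id": 2, "name": "Grace", "dept_id": 20},
]
departments = [
    {"dept_id": 10, "dept": "Engineering"},
    {"dept_id": 20, "dept": "Research"},
]

def select(rows, pred):             # sigma: keep rows matching a predicate
    return [r for r in rows if pred(r)]

def project(rows, cols):            # pi: keep a subset of columns
    return [{c: r[c] for c in cols} for r in rows]

def join(left, right, key):         # equi-join on a shared key (nested loop)
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

# Equivalent to: SELECT name, dept FROM employees NATURAL JOIN departments
#                WHERE dept = 'Research'
result = project(
    select(join(employees, departments, "dept_id"),
           lambda r: r["dept"] == "Research"),
    ["name", "dept"],
)
print(result)  # -> [{'name': 'Grace', 'dept': 'Research'}]
```

A query optimizer's job is mostly choosing the order and physical implementation of exactly these steps, e.g. filtering before joining instead of after.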

Keeping up with the framework or storage product fashion show is a lot easier when you have these fundamentals. I used Ramakrishnan, Database Management Systems. But anything that puts you in the position of asking how database systems work from the inside is extremely relevant even for "big data" distributed systems.

https://www.amazon.com/Database-Management-Systems-Raghu-Ramakrishnan/dp/0072465638

I also saw this recently and by the ToC it covers lots of stuff.

https://www.amazon.com/Database-Internals-Deep-Distributed-Systems-ebook/dp/B07XW76VHZ/ref=sr_1_1?keywords=database+internals&qid=1568739274&s=gateway&sr=8-1

But to keep in mind... the designers of these big data systems all had a thorough grounding in the issues of single node relational databases systems. It's very clarifying to see things through that lens.

u/adventuringraw · 4 pointsr/datascience

EVERYTHING always comes down to the task you're actually trying to achieve. As /u/fastestsynapses points out, it's all about the hypothesis.

That said... you might be interested in a side question: when is correlation causation? Or perhaps put another way... what is causation, and under what circumstances can you establish causality? It's a giant fucking rabbit hole that you might not care to venture down, but it's fascinating... to the point where I literally got into Unity so I could try building out a little playground to get some intuition on the concepts, haha.

It's a bizarre corner of math that has more in common with solving Jonathan Blow's 'The Witness'-style line puzzles than it does with doing algebra. But if you can establish enough conditional statistical independences between various variables, you can use Pearl's IC* algorithm to turn them into a partial causal structure. Or, if you're dealing with just a small number of variables and you want to try to identify true causation / possible causation / spurious correlation and so on, there are some tests available if you can find a good set of nodes to condition on (instrumental variables from econometrics and so on are a special case of this). Pearl's 2009 book 'Causality' has a description of those specific tests near the end of chapter 3, if I remember right. Though as a (very large) word of warning, that book isn't practical at all... it's an excellent theoretical introduction, but even after reading the whole thing, it hasn't exactly informed my work yet, haha. Hence picking up this book recently to dive in a little deeper.
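For a quick taste of why those conditional independences matter, here's a toy simulation (assuming numpy is available; all numbers are invented): a hidden common cause makes two variables correlate even though neither causes the other, and conditioning on it, here by regressing it out, makes the correlation vanish. That is the kind of fact constraint-based algorithms exploit.

```python
import numpy as np

# Z is a hidden common cause of both X and Y; X does not cause Y.
rng = np.random.default_rng(0)
n = 100_000
z = rng.normal(size=n)
x = z + rng.normal(scale=0.5, size=n)   # X driven by Z
y = z + rng.normal(scale=0.5, size=n)   # Y driven by Z, not by X

# Marginal correlation is high (around 0.8 for these variances)...
print(round(float(np.corrcoef(x, y)[0, 1]), 2))

# ...but after regressing Z out of each, the residuals are nearly uncorrelated:
rx = x - np.polyval(np.polyfit(z, x, 1), z)
ry = y - np.polyval(np.polyfit(z, y, 1), z)
print(round(float(np.corrcoef(rx, ry)[0, 1]), 2))  # near zero
```

So X and Y are dependent, but conditionally independent given Z, which is the statistical signature of a common cause rather than a causal arrow between them.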

Another very useful corner though... how deep did you get into statistical regression? Parameter interpretation is a whole big thing, there's some cool stuff there to dive into as well if you have the patience and the interest.

u/arbiter_of_tastes · 3 pointsr/datascience

Whoa, there. Healthcare data scientist here, mainly working in areas like clinical epidemiology and with a background in health services research and pharmacoepidemiology.

First, kudos for having questions and reaching out for help. This is my opinion, but health care is different from other sectors. The work you do has the potential to affect people in visceral, fundamentally life-changing ways: recommending that a patient should or should not get treatment, that a patient should or should not be placed on end-of-life care, that a life-threatening complication is or is not related to a pharmaceutical on the market. Point just being - I think this sector carries responsibility that many other sectors don't.

Second, are you at a pharmaceutical/related organization? If so, there should be qualified biostatisticians/epidemiologists/psychometricians/health economists/something similar to sit down with you and help you figure this out.

Third, you said you study 'data science and knowledge engineering', but I'm not sure what your curriculum consists of - do you study causal inference? If you don't, it's the most important topic you need to be familiar with (not necessarily competent in, mind you). Here are several references that can get you familiar with identifying and dealing with bias and confounding, and with designing experiments to assess causal relationships instead of just association. In healthcare you have to know when a question warrants a causal analysis vs. a predictive or associative one. If a causal analysis is needed, an epidemiologist or biostatistician will likely do that work, but it certainly helps to know what a DAG is and how to read one.

https://www.amazon.com/Epidemiology-Introduction-Kenneth-J-Rothman/dp/0199754551

https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/
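On the "know what a DAG is and how to read one" point, here's a minimal sketch of reading one in code; the graph, the variables, and the edges are invented purely for illustration:

```python
# A causal DAG as a parents-dict. In this made-up graph, smoking points into
# both coffee and heart_disease, so it confounds the coffee -> heart_disease
# relationship and must be adjusted for in any causal estimate.
parents = {
    "coffee": ["smoking"],
    "heart_disease": ["smoking", "coffee"],
    "smoking": [],
}

def ancestors(dag, node, seen=None):
    """All nodes with a directed path into `node`."""
    seen = set() if seen is None else seen
    for p in dag.get(node, []):
        if p not in seen:
            seen.add(p)
            ancestors(dag, p, seen)
    return seen

def confounders(dag, cause, effect):
    # Common causes: ancestors of both, excluding the cause itself
    return ancestors(dag, cause) & (ancestors(dag, effect) - {cause})

print(confounders(parents, "coffee", "heart_disease"))  # -> {'smoking'}
```

This is only the simplest reading of a DAG; proper adjustment-set selection (back-door criterion and friends) is what the Hernán & Robins book linked above covers.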

Fourth, I'm hesitant to suggest anything about your dataset, because I still only have a rough idea of the details. Also, it sounds like you've got a psychometric dataset, and I've never studied psychometrics. I will say, though, that the question (hypothesis) being asked should really drive the analytic approach. Is the goal to look at a homogeneous population and find whether there's something about it causing people to require or be adherent to treatment? Do those results then need to be applied to a diverse, heterogeneous population? That's a very high bar to achieve for experimental purposes. Is it enough to look at some data and say that certain characteristics are associated with or predictive of certain outcomes? That's a much lower bar from an experimental standpoint, and probably from an analytic standpoint, too. If there is a selection bias, I think that's only relevant if there's a desire to extrapolate the study results to a different population. As you point out, if the desire is to generalize results to a larger population, it's likely a significant problem that would require an intentional experimental design to address. If the company you're working with doesn't recognize this, or can't have a qualified person explain why it's not a study design problem, you're working with bad people who likely don't know what they're doing. I've collaborated with several software/'health analytic' companies and startups that are like this, and it's why I'm distrustful of all health analytic software until proven otherwise.

Hope this helps!

u/flipstables · 1 pointr/datascience

I'm very much a data science newbie. But here's my attempt to answer your question.

Excel is both very powerful and very limited in data analysis.

Pros: a lot of quick-and-dirty analysis can be done using Excel. Data manipulation is easy and fast. Particularly with Excel 2013, it's very easy to mash together data from different sources. PowerPivot is a nice way to "import" data from SQL sources. A popular data mining add-in is XLMiner.

Cons: limited in the size of data sets it can handle, and it can get extremely slow and unstable with large amounts of data. Analysis is very biased towards financial applications. Programming in Excel is very clunky (I'd rather let Python do the heavy lifting if necessary).

Excel really needs a database engine and/or business intelligence backend to perform a lot of the heavy analytics. Nevertheless, it's a good tool.
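As a small illustration of that division of labor, here's a hedged Python sketch of the kind of streaming aggregation that stays fast long after Excel's 1,048,576-row sheet limit becomes a problem; the CSV content is made up:

```python
import csv
import io

# Stand-in for a file far too big to open in Excel; DictReader streams it
# row by row, so memory use stays flat no matter how long the file is.
raw = io.StringIO("region,sales\nEast,100\nWest,250\nEast,50\n")

totals = {}
for row in csv.DictReader(raw):
    totals[row["region"]] = totals.get(row["region"], 0) + int(row["sales"])

print(totals)  # -> {'East': 150, 'West': 250}
```

The summarized result (a handful of rows) is then small enough to paste back into Excel for charting, which plays to each tool's strength.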

For resources, I was recently introduced to this book, and it seems promising:

http://www.amazon.com/Data-Analysis-Using-SQL-Excel/dp/0470099518

u/sven_ftw · 2 pointsr/datascience

Sounds like you are interested in Operations Research as a discipline.

If you are looking for something to give you ideas about what long-term projects and outcomes look like, something like this book (The Applied Business Analytics Casebook: Applications in Supply Chain Management, Operations Management, and Operations Research) might be good.

If you are looking for something more hands on, then either the Rardin or Nocedal and Wright books might be a good starting point.

u/dsjumpstart · 4 pointsr/datascience

One of my favorite not-super-technical books that can give some insights into the thought process and actionability of analytics and machine learning is "Everybody Lies". https://www.amazon.com/Everybody-Lies-Internet-About-Really/dp/0062390856

It touches on a concept I really like to rely on data for, which is revealed vs. stated intent. People tell you they want what they wish they wanted. Data tells you what they actually want and how they actually behave. There are some good intuitive regression models in there as well.

u/ninepoints · 6 pointsr/datascience

I studied education policy, and causal inference is big and growing in that field as well. Most courses on the subject will begin with randomized controlled trials and then discuss quasi-experimental techniques that can be used to support causal inference in instances where RCTs are not practical, ethical, etc. I teach an advanced quantitative methods course on the subject and cover topics such as regression discontinuity, difference-in-differences, (comparative) interrupted time series, propensity score methods, and the like. Although it's mostly specific to education, a book called Methods Matter by Murnane & Willett (2011) is a very gentle but rigorous introduction to the subject that I used in my course. In addition to epidemiology, I'd say policy schools would be another great place to find courses on the subject.
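Of those techniques, difference-in-differences is the easiest to see in miniature. A sketch with made-up group means (the numbers are purely illustrative):

```python
# Difference-in-differences in one line: subtract the control group's
# before/after change from the treated group's before/after change.
treated_before, treated_after = 10.0, 18.0   # made-up outcome means
control_before, control_after = 11.0, 14.0   # made-up outcome means

did = (treated_after - treated_before) - (control_after - control_before)
print(did)  # -> 5.0
```

Under the parallel-trends assumption, the control group's change (3.0 here) estimates what would have happened to the treated group without treatment, so the remaining 5.0 is attributed to the treatment.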

u/shaggorama · 1 pointr/datascience

This book is pretty sweet if you don't have it already. You might also consider purchasing a private API key or dataset or something like that that's pertinent to your business domain.

u/tmthyjames · 1 pointr/datascience

Check out Thomas Sowell's Basic Economics. It's probably the best place to start for basic econ. But for something more data sciency, check out Carter Hill's Principles of Econometrics

u/ttelbarto · 1 pointr/datascience

Hi, There are so many resources out there I don't know where to start! I would work through some kind of beginner python book (recommendation below). Then maybe try Andrew Ng's Machine Learning Coursera course to get a taste of Machine Learning. Once you have completed both of those I would reassess what you would like to focus on. I will include some other books I would recommend below.

Beginner Python - https://www.amazon.co.uk/Python-Crash-Course-Hands-Project-Based/dp/1593276036/ref=sr_1_3?keywords=python+books&qid=1565035502&s=books&sr=1-3

Machine Learning Coursera - https://www.coursera.org/learn/machine-learning

Python Machine Learning - https://www.amazon.co.uk/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291/ref=sr_1_7?crid=2QF98N9Q9GCJ9&keywords=hands+on+data+science&qid=1565035593&s=books&sprefix=hands+on+data+sc%2Cstripbooks%2C183&sr=1-7

https://www.amazon.co.uk/Data-Science-Scratch-Joel-Grus/dp/1492041130/ref=sr_1_1?crid=PJEJNNUBNQ8N&keywords=data+science+from+scratch&qid=1565035617&s=books&sprefix=data+science+from+s%2Cstripbooks%2C140&sr=1-1

Statistics (intro) - https://www.amazon.co.uk/Naked-Statistics-Stripping-Dread-Data/dp/039334777X/ref=sr_1_1?keywords=naked+statistics&qid=1565035650&s=books&sr=1-1

More stats (I haven't read this but gets recommended) - https://www.amazon.co.uk/Think-Stats-Allen-B-Downey/dp/1491907339/ref=sr_1_1?keywords=think+stats&qid=1565035674&s=books&sr=1-1

u/bmburns98 · 1 pointr/datascience

I took a time series class during undergrad at Virginia Tech, and my professor swore by this book: https://www.amazon.com/gp/product/0470540648/ref=oh_aui_detailpage_o09_s00?ie=UTF8&psc=1. It's a little dense, but I think it explains the basic time series concepts very well.

u/yayo4ayo · 1 pointr/datascience

Used this for an experimental design class at my university and I really liked the way it was written.

u/datadude · 6 pointsr/datascience

I have an excellent statistics text book that I am using to learn stats: Discovering Statistics Using R by Andy Field. My approach is to do the exercise in R first, then try to reproduce the same result in Python. It's slow going, but it's a real learning experience.

u/MonsterMash2017 · 4 pointsr/datascience

>If you Google KNNL it'll know what you're looking for).

Mine certainly didn't, I got two pages of Karnataka Neeravari Nigam Limited and associated projects.

If anyone else is wondering, I'm assuming this is the book, I eventually found it on a CSU syllabus: https://www.amazon.ca/Applied-Linear-Statistical-Models-Student/dp/007310874X

Not to be confused with: https://www.openhub.net/p/knnl

u/wouldeye · 4 pointsr/datascience

Field's "Introduction to Statistics Using R" is the best book for my money.

EDIT: sorry I got the title wrong:

https://www.amazon.com/Discovering-Statistics-Using-Andy-Field/dp/1446200469

u/lab_fly · 2 pointsr/datascience

I found this book extremely accessible and thorough. It's really expensive though. If you live in NYC, I can let you borrow it.

https://www.amazon.com/Introduction-Bootstrap-Monographs-Statistics-Probability/dp/0412042312
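The core idea of that book fits in a few lines. A minimal bootstrap sketch with made-up data (the percentile interval shown is the crudest variant; Efron & Tibshirani cover better ones):

```python
import random

# Bootstrap: resample the observed data with replacement many times to
# estimate the sampling variability of a statistic (here, the mean).
random.seed(0)
data = [2.1, 3.4, 2.9, 4.0, 3.3, 2.7, 3.8, 3.1]   # made-up sample

def bootstrap_means(data, reps=1000):
    n = len(data)
    return [sum(random.choices(data, k=n)) / n for _ in range(reps)]

means = sorted(bootstrap_means(data))
lo, hi = means[25], means[975]   # crude 95% percentile interval
print(round(lo, 2), round(hi, 2))
```

The appeal is that nothing here assumed normality; the data's own empirical distribution stands in for the unknown population.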

u/o_safadinho · 1 pointr/datascience

Time Series Analysis: Forecasting and Control is the most recent edition of THE Box-Jenkins book.

These are the people that literally developed the framework (ARIMA) that a lot of modern time series analysis methods use.
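To see the simplest member of that framework in action, here's a toy AR(1) simulation (parameters and sample size invented): generate x_t = phi * x_{t-1} + noise, then check that the estimated lag-1 autocorrelation recovers phi.

```python
import random

# Simulate an AR(1) process, the building block of the ARIMA family.
random.seed(0)
phi, x, series = 0.8, 0.0, []
for _ in range(5000):
    x = phi * x + random.gauss(0, 1)
    series.append(x)

# The lag-1 autocorrelation of a long AR(1) series should sit near phi.
m = sum(series) / len(series)
num = sum((series[t] - m) * (series[t - 1] - m) for t in range(1, len(series)))
den = sum((v - m) ** 2 for v in series)
rho = num / den
print(round(rho, 2))  # close to phi = 0.8
```

Reading sample autocorrelations like this one (the ACF, plus the partial ACF) to pick model orders is exactly the identification step of the Box-Jenkins method.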

u/Nerdloaf · 2 pointsr/datascience

This is a 700 page book and it's only an introduction to linear regression, do you really think you can read it and fully understand it in two weeks?
https://www.amazon.com/Introduction-Regression-Analysis-Douglas-Montgomery/dp/0470542810
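For perspective on what sits at the core of those 700 pages: simple linear regression with one predictor has a closed form you can compute by hand. A minimal sketch with made-up data:

```python
# Ordinary least squares for y = a + b*x with a single predictor:
# b = cov(x, y) / var(x), a = mean(y) - b * mean(x).
xs = [1.0, 2.0, 3.0, 4.0]   # made-up predictor values
ys = [2.1, 4.1, 5.9, 8.1]   # made-up responses, roughly y = 2x

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx
print(round(b, 1), round(a, 1))  # -> 2.0 0.1
```

The book's length comes from everything around this formula: multiple predictors, inference on the coefficients, diagnostics, and what to do when the assumptions fail, which is why two weeks is optimistic.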