Top products from r/hadoop

We found 9 product mentions on r/hadoop and ranked the resulting products by the number of redditors who mentioned them. Here are all 9.

Top comments that mention products on r/hadoop:

u/recitegod · 2 pointsr/hadoop

Class 10, but it does not really matter. Once you run them 24/7 for 12 weeks, they get physically corrupted at the block level. Not fun (you throw them away). I got the good ones, a combination of Samsung, Memorex, Sony, and Patriot; they all fail at some point. I would highly recommend creating a USB mount point, like so: https://www.raspberrypi.org/blog/pi-3-booting-part-i-usb-mass-storage-boot/

You don't build such a cluster to achieve anything but learning. You stretch every aspect of this cluster design (power, eth0, throughput, writing on the card, interface, everything) and you will definitely have some "gotcha" moments as you troubleshoot along the way. Highly recommend. Building a 5-node cluster is better than 3. Nine nodes are overkill, but still fun.

If I had to redo it all over again, I would get a 4gb pack:
https://www.amazon.com/10-PACK-SanDisk-SDSDQAB-008G-Packaging/dp/B00MHZ70KO/ref=sr_1_6?ie=UTF8&qid=1481181417&sr=8-6&keywords=8gb+micro+sd+card+pack

and 9 of these:
https://www.amazon.com/Samsung-METAL-Flash-MUF-32BA-AM/dp/B013CCTM2E/ref=sr_1_fkmr2_2?ie=UTF8&qid=1481181453&sr=8-2-fkmr2&keywords=usb%2Bkey%2Bsamsung%2Busb3&th=1

I believe I will be able to get a 2mb/sec sustained throughput on Spark with this sort of setup, not sure, possibly more. Got to try. Why the USB3? I image these on the laptop, so it is much faster than over USB2. If not, USB2 keys will suffice. More than 32gb of storage per node is useless.
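For a rough sense of why USB3 matters when reimaging nine keys from a laptop, here is a back-of-the-envelope sketch; the sustained write speeds are assumptions, not measurements:

```java
// Rough reimaging-time estimate for nine 32 GB keys written from a laptop.
// Both sustained write speeds below are assumptions, not measurements.
public class ImagingTime {
    public static void main(String[] args) {
        final double keys = 9;
        final double sizeMB = 32 * 1024;  // 32 GB per key
        final double usb2MBs = 30.0;      // assumed sustained USB2 write speed
        final double usb3MBs = 100.0;     // assumed sustained USB3 write speed
        final double totalMB = keys * sizeMB;
        System.out.printf("USB2: ~%.0f minutes%n", totalMB / usb2MBs / 60);
        System.out.printf("USB3: ~%.0f minutes%n", totalMB / usb3MBs / 60);
    }
}
```

Under those assumed speeds, a full reimage of all nine keys drops from roughly 2.7 hours over USB2 to under an hour over USB3.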

FUCKYOUINYOURFACE is right: even with this setup, it is a slow-as-shit cluster, but you absolutely control this environment in every dimension you can think of. And that is awesome in my book.

u/batoure · 3 pointsr/hadoop

So it's been my experience that, because the meta-ecosystem of Hadoop evolves weirdly and sometimes at breakneck speed, I have yet to find one or two books to rule them all.

That being said, start with "Hadoop Security" by Ben Spivey.

https://www.amazon.com/Hadoop-Security-Protecting-Your-Platform/dp/1491900989/ref=sr_1_1?ie=UTF8&qid=1498101555&sr=8-1&keywords=hadoop+security

One of the blind spots that 80% of the clusters I have interacted with share is securing the ecosystem. Have you ever seen the interview where Berners-Lee says it terrifies him that most of the systems on the International Space Station run over HTTP? Similar situation here. Most big data platforms have trust as a default and require strategy and intelligent thought up front for proper configuration.
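To make that "trust as a default" point concrete: a stock Hadoop client runs with "simple" authentication, meaning the cluster believes whatever identity the client process claims. Here is a minimal sketch of what flipping a client over to Kerberos looks like, assuming an already-Kerberized cluster; the principal and keytab path are hypothetical placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Out of the box these default to "simple" and "false": the cluster
        // trusts whatever username the client process reports.
        conf.set("hadoop.security.authentication", "kerberos");
        conf.set("hadoop.security.authorization", "true");
        UserGroupInformation.setConfiguration(conf);
        // Hypothetical service principal and keytab path.
        UserGroupInformation.loginUserFromKeytab(
                "svc-etl@EXAMPLE.COM",
                "/etc/security/keytabs/svc-etl.keytab");
        System.out.println("Authenticated as: "
                + UserGroupInformation.getLoginUser());
    }
}
```

The cluster side needs the matching settings in core-site.xml, plus KDC setup and keytab distribution, which is the kind of up-front work the book covers.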

Monitoring and managing is where things get complicated; management is essentially what the various flavors of Hadoop are arguing about. The two majors are Cloudera with "Cloudera Manager" and Hortonworks with "Apache Ambari".

This isn't a plug per se, but I have found that, though it often feels like a Frankenstein's monster of knitted-together tools, Cloudera Manager seems to provide a slightly less painful install process. Additionally, their parcels system lets you manage and mess around with various versions of tools pretty easily. At the end of the day none of these tools are going to hold your hand, so expect one of two things to happen:

- You get the cluster up fast and working-ish in "Let's please not call it production" mode.
- You get a clean and stable cluster running and start wondering what data to ingest, after working on this steadily for the next 3 months or so.

The danger of option 1 is just how achievable it is: a couple of months from now the company will start talking about this as vital infrastructure, and you will have a problem on your hands.

u/cleric04 · 1 pointr/hadoop

If you have an O'Reilly subscription, there is a pretty good video with Ben and a few other Cloudera guys that's probably a good quickstart on cluster security: A Practitioner's Guide to Securing Hadoop.

Hadoop Operations is a good book, but it's getting a bit outdated. I'd say just start setting up a cluster and maybe read it if you have some free time.

u/wolf2600 · 1 pointr/hadoop

I don't really want a plug & play solution; I'd like to mimic the real thing as closely as possible: starting with ~3 Linux boxes, configuring them, then installing and configuring Hortonworks on them.

End goal would be the Hortonworks HDP Administrator certification, so I need to know how to do everything myself.

oooh...

u/MrFromEurope · 2 pointsr/hadoop

Well, that sucks. I already have this one: Hadoop Guide

There is a whole chapter on HBase in there but nothing else.

u/e13e7 · 2 pointsr/hadoop

1: You don't need big guns for NN/JT on a small cluster, but I assume you are running something on that box anyway to put it to use.
2: Yes for config, ehh for settings (but you knew that, sorry for pointing it out like you didn't)... ._. Still, be wary of how multi-threading is handled.
3: I didn't, I was an intern for all this - and my DBA mentor hated the whole deal, and long term it's not how you wanna do it, especially if what you're doing may or may not be breaking some privacy policies... Some of the cheaper hardware I was looking at came in at around $6k for 8 nodes with 32 Intel cores and 64G of RAM total. I doubt you need that power, but that's what I was considering to chomp through 5-10G of raw data per day on a timely basis, given that up to 100 Pig queries with multiple joins were the task at hand.

500G of input with group, join, join, order (about 4 maps / 3 reduces per job) took over 3 hours for 3 nodes w/(4cores..6Gmem..1Tdisk as /tmp and /data). Hosting some of the files on mem and not writing to /tmp really cut down on that replicated joins, but I could only do it in rare circumstances, 6G really isn't a lot @_@