LINKS & FURTHER INFO
Hadoop Distributions
If you want to get started with Hadoop, there are a number of options available to you. The local sandboxes tend to be available as Azure or AWS virtual machines as well, so if you don't have a beefy machine at home, you can still get started pretty easily.
Local sandboxes:
- Hortonworks Sandbox. This is probably the easiest way to get started with Hadoop. You get a full VM with a number of tools already installed and configured. The Ambari user interface is pretty nice, and if you are on the .NET stack, Hortonworks tends to present a nicer experience.
- Cloudera QuickStart VM. Like the Hortonworks Sandbox, Cloudera's offering is a fully-featured single-node VM; it is also available as a Docker image.
- MapR Sandbox For Hadoop.
Platform-as-a-Service offerings:
- Azure HDInsight. It's fairly pricey--data nodes can run you a couple hundred dollars per month apiece at the low end and upwards of $2K a month per node on the high end. If you want a fairly simple Platform-as-a-Service Hadoop experience, HDInsight is a good option, as there are good tools available for developers. There are limitations which prevent integration with services like PolyBase, and the common answer to integration questions tends to be "move your output data to Azure Blob Storage and access it from there."
- Elastic MapReduce. Amazon's offering ties to S3 and is less expensive than HDInsight. The marginal cost for Elastic MapReduce is low, but once you factor in the underlying EC2 costs, it's no longer pennies per hour. If your company is integrated with Amazon already, this is a good service.
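To make the pricing point above a bit more concrete, here is a rough Python sketch of how a per-node service surcharge and the underlying instance cost add up over a month. The function name, the 730-hour month, the node count, and both hourly rates are assumptions picked purely for illustration; they are not actual Amazon or Azure prices, so check the current pricing pages before budgeting anything.

```python
def monthly_cluster_cost(nodes, service_rate_per_hour, instance_rate_per_hour,
                         hours_per_month=730):
    """Estimate the monthly cost in dollars of a cluster that runs all month.

    Each node pays the service surcharge plus the instance rate for every hour.
    """
    hourly_per_node = service_rate_per_hour + instance_rate_per_hour
    return nodes * hourly_per_node * hours_per_month


# Hypothetical rates, for illustration only (not real price quotes).
emr_surcharge = 0.05  # assumed per-node service fee, dollars per hour
ec2_rate = 0.20       # assumed per-node instance fee, dollars per hour

# A four-node cluster, first counting only the surcharge, then the full cost.
surcharge_only = monthly_cluster_cost(4, emr_surcharge, 0.0)
full_cost = monthly_cluster_cost(4, emr_surcharge, ec2_rate)

print(f"Service surcharge only: ${surcharge_only:,.2f} per month")
print(f"Surcharge plus instances: ${full_cost:,.2f} per month")
```

With these made-up numbers the surcharge is only a small slice of the bill, which is why a service that looks like pennies per hour still needs the instance cost included. Shutting a cluster down when it is idle lowers the hours, so `hours_per_month` is worth adjusting for your own usage.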
Interesting Links
Learning Resources
Books are hard to recommend because the source material changes so frequently--a book written in 2017 can be out of date by the time it's published in 2018. These are a few books that I have on my to-read list:
Some of the foundational papers do hold up well, as they provide information on the underpinnings of these technologies. Examples include:
I have a few other talks in which I cover elements of Hadoop in detail.
I learned a good deal from the Hortonworks tutorials, which include both written and video tutorials. They are a good place to start.