Hadoop, anyone?

ChristianVirtual

macrumors 601
Original poster
May 10, 2010
4,096
266
It might be a bit off topic, but does anyone here have experience with Big Data? Tools like Hadoop, for example? I wonder if it is worth investing time in learning that toolset, or if eventually only SAS or SAP will be relevant.
At this point it is only for my own "entertainment" and learning, but I think it is good to have some basics for what might come up in my company over the next few years.

My current pool of big data: firewall logs, network traffic and a number of folding logs should give me enough "big data"-like source information to play around with.

Any recommendations on books/Kindle titles to read, or on online training/webinars?
 

rwh202

macrumors regular
Nov 14, 2010
114
11
UK
The day job is data analysis, but typically only in the low TBs in MS SQL Server, so I'm not sure that counts as big data. However, I attended a company lunch and learn on Hadoop this week. It's a big shift in thinking from some of our traditional warehouses, but it seems necessary when trying to sift through PBs of data.

I had already attended a Microsoft Big Data seminar and seen various presentations on their offerings. A presentation from Hortonworks led me to go through their Hadoop tutorials:

http://hortonworks.com/tutorials/

They are not especially in-depth, but they give enough of a flavour of the technologies and provide a nice sandbox to play with.
 

AFEPPL

macrumors 68030
Sep 30, 2014
2,644
1,564
England
I work with some of the largest companies in the world and often get involved in these sorts of projects. I formerly worked for SAP and mainly live in the in-memory world of HANA. I also worked directly on the world's largest data warehouse project.

It all depends on what you want to learn or understand. If you want large amounts of data, you could look at sucking on the firehose of sentiment data from Facebook or Twitter. Firewall logs are not big data.

Analytics is then used to make sense of, or drive value from, that data. You could look at what the likes of eBay/Amazon do around analysing buying patterns and matching buyers to sellers.

You can get some good overview docs; you could look at what IBM and SAP are doing. Take a look at OpenStack and Cloudera too. Take a look at http://www.big-dataforum.com and http://www.dataanalyticsweek.com

As with all projects, you need to understand what you want to do and why and, more importantly, what business problem you are looking to solve or what insight/value you are looking to gain from the data.
 

twoodcc

macrumors P6
Feb 3, 2005
15,307
25
Right side of wrong
Are there any open source tools that allow you to run this on your own computer/server at home to play with? I think I tried to install it a few years ago with no luck, and I never tried again.
 

kage207

macrumors 6502a
Jul 23, 2008
961
8
I've taken a class on Big Data and we learned about using Hadoop. It's a good tool, with things like Hive and so forth to make it easier to work with. IBM and many other big companies use Hadoop. It really allows you to scale big data processing across parallel processes without having to learn how to write the parallelism yourself. It's fairly difficult to learn, as the resources are aimed at people willing to read technical papers and low-level command-line docs / man pages, and to go through trial and error.

I'd like to work with the underlying architecture of Hadoop one day to help with parallel and concurrent processing.
 

iEMH

macrumors member
Apr 20, 2015
39
18
I live in my own little World
I work with Big Data.
You can always download the VMs from Cloudera and Hortonworks and use their tutorials to get the big picture and learn the basics. They both have tons of free resources. Cloudera's Hadoop Essentials series is a great entry point.
I run those VMs even on my old 2009 iMac, and they work fine.

Now, you are talking about logs. Hadoop is great for storing and "batch"-processing large data sets in parallel, but when it comes to real-time and streaming data analysis, Hadoop is not the right solution.
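To give a feel for what that "batch" processing looks like, here is a minimal sketch of the map / shuffle / reduce pattern in plain Python. It does not use Hadoop itself; it only mimics the three steps in a single process. The log format and the function names (`map_line`, `shuffle`, `reduce_counts`, `count_denied_by_source`) are made up for illustration, so adapt them to whatever your firewall actually writes.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical firewall log lines; the format is an assumption for illustration.
LOG_LINES = [
    "2015-05-01T10:00:01 DENY src=10.0.0.5 dst=8.8.8.8",
    "2015-05-01T10:00:02 ALLOW src=10.0.0.7 dst=8.8.4.4",
    "2015-05-01T10:00:03 DENY src=10.0.0.5 dst=1.2.3.4",
    "2015-05-01T10:00:04 DENY src=10.0.0.9 dst=8.8.8.8",
]


def map_line(line):
    """Map step: emit (source_ip, 1) for every denied connection."""
    fields = line.split()
    if len(fields) >= 3 and fields[1] == "DENY":
        yield fields[2].split("=", 1)[1], 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: sum the counts collected for one key."""
    return key, sum(values)


def count_denied_by_source(lines):
    """Run map, shuffle and reduce over all lines in one process."""
    mapped = chain.from_iterable(map_line(line) for line in lines)
    return dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(count_denied_by_source(LOG_LINES))
```

On a real cluster the map and reduce steps would run on many machines at once and the framework would handle the shuffle, which is why the whole input has to be available up front and why this style suits batch jobs better than live streams.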

If you want to go down that path, my recommendation is that you start learning Spark, plus just some basics about Hadoop.
HANA, Elasticsearch, Splunk, etc. all have their pros and cons.
Always think about the V's in Big Data: Volume, Velocity, Variety.
E.g. HANA offers fast in-memory processing, but it requires the data to be structured (unlike Hadoop, which is schema-on-read), you are bound by the size of your memory, and let's not get into SAP cost...
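In case "schema on read" is new to you, here is a rough sketch of the idea in plain Python: the raw lines are stored untouched, and a schema is only applied at the moment you read them. The sample records, the field names and the helper `read_with_schema` are all hypothetical; this is not how Hadoop or Hive implement it, just the concept.

```python
import json

# Hypothetical raw records, stored exactly as they arrived.
RAW_RECORDS = [
    '{"ts": "2015-05-01T10:00:01", "src": "10.0.0.5", "bytes": "1200"}',
    '{"ts": "2015-05-01T10:00:02", "src": "10.0.0.7"}',
    "not json at all",
]


def read_with_schema(raw_lines, schema):
    """Apply `schema` (field name -> converter) at read time.

    Lines that cannot be parsed are skipped; missing fields become None.
    """
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # the raw store may contain junk; ignore it on read
        if not isinstance(record, dict):
            continue
        row = {}
        for field, convert in schema.items():
            value = record.get(field)
            row[field] = convert(value) if value is not None else None
        yield row


if __name__ == "__main__":
    # Two different "views" over the same raw data, decided at query time.
    print(list(read_with_schema(RAW_RECORDS, {"src": str, "bytes": int})))
    print(list(read_with_schema(RAW_RECORDS, {"ts": str})))
```

The point is that nothing had to be cleaned or modelled before it was stored, which is the opposite of a structured store where the table layout is fixed before any data goes in.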

IMHO, Spark is going to be the next level for Big Data. Hadoop is not going to go away as the preferred storage solution for un/multi-structured data, but for data processing and analysis, I believe Spark will take the lead very soon.

Hope that helps. :)