ChristianVirtual

It might be a bit off topic, but does anyone here have experience with Big Data? Tools like Hadoop, for example? I wonder if it is worth investing time in learning that toolset, or if eventually only SAS or SAP will be relevant.
At this point it is only for my own "entertainment" and learning, but I think it would be good to have some basics for what might come up at my company in the next few years.

My current pool of big data: firewall logs, network traffic and a number of folding logs, which should give me enough "big data"-like source material to play around with.

Any recommendations on books/Kindle titles to read, or online training/webinars?
 
My day job is data analysis, but typically only in the low TBs in MS SQL Server, so I'm not sure that counts as big data. However, I attended a company lunch-and-learn on Hadoop this week. It's a big shift in thinking from some of our traditional warehouses, but it seems necessary when trying to sift through PBs of data.

I had already attended a Microsoft Big Data seminar and seen various presentations on their offerings. A presentation from Hortonworks led me to go through their Hadoop tutorials:

http://hortonworks.com/tutorials/

Not especially in depth, but enough to get a flavour of the technologies, and they provide a nice sandbox to play with.
 
I've only done toy work with Hadoop. Any non-trivial datasets I've worked with have been small enough to load onto a high-end GPU. ;)

I used this book.
 
I work with some of the largest companies in the world and often get involved in these sorts of projects. I formerly worked for SAP and mainly live in the in-memory world of HANA. I also worked directly on the world's largest data warehouse project.

It all depends on what you want to learn or understand. If you want large amounts of data, you could look at sucking on the firehose of sentiment data from Facebook or Twitter. Firewall logs are not big data.

Analytics is then used to make sense of, or drive value from, that data. You could look at what the likes of eBay/Amazon do around analysing buying patterns and matching them to buyers/sellers.

You can get some good overview docs; you could look at what IBM and SAP are doing. Take a look at OpenStack and Cloudera too. Also take a look at http://www.big-dataforum.com and http://www.dataanalyticsweek.com

As with any project, you need to understand what you want to do and why, and more importantly what business problem you are looking to solve or what insight/value you are looking to gain from the data.
 
Are there any open source tools that allow you to run this on your own computer/server at home to play with? I think I tried to install it a few years ago with no luck, and I never tried again.
 
I've taken a class on Big Data where we learned to use Hadoop. It's a good tool, with things like Hive and so forth to make it easier to work with. IBM and many other big companies use Hadoop. It really lets you scale big data processing with parallel processes whose mechanics you don't have to implement yourself. It's fairly difficult to learn, as the resources are aimed at people willing to read technical papers and low-level command lines / man pages, and to learn by trial and error.
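
To give a feel for the programming model Hadoop parallelises for you, here is a minimal sketch of the map / shuffle / reduce steps in plain Python. It is not Hadoop's API; the log lines, the function names and the thread pool standing in for a cluster are all assumptions made up for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical firewall-style log lines; on a real cluster these would be
# blocks of a much larger file spread across many machines.
LINES = [
    "DENY tcp 10.0.0.5",
    "ALLOW udp 10.0.0.7",
    "DENY tcp 10.0.0.9",
    "DENY udp 10.0.0.5",
]

def map_phase(line):
    """Emit (key, 1) pairs; here the key is the firewall action."""
    action = line.split()[0]
    return [(action, 1)]

def shuffle(pairs):
    """Group values by key, which the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def run_job(lines, workers=2):
    # Each map call is independent of the others, so they can run in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, lines))
    pairs = [pair for chunk in mapped for pair in chunk]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = run_job(LINES)
print(counts)
```

You only write the map and reduce functions; splitting the input, running the mappers in parallel and grouping by key is the part a framework like Hadoop takes care of.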

I'd like to work with the underlying architecture of Hadoop one day to help with parallel and concurrent processing.
 
I work with Big Data.
You can always download the VMs from Cloudera and Hortonworks and use their tutorials to get the big picture and learn the basics. They both have tons of free resources. Cloudera's Hadoop Essentials series is a great entry point.
I run those VMs even on my old 2009 iMac, and they work fine.

Now, you are talking about logs. Hadoop is great for storing and "batch"-processing large data sets in parallel, but when it comes to real-time and streaming data analysis, Hadoop is not the right solution.
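
A rough way to picture the batch vs. stream difference, as a plain Python sketch. This is neither Hadoop nor Spark code; the event tuples and function names are hypothetical.

```python
from collections import Counter

# Hypothetical events: (timestamp_seconds, source_ip)
EVENTS = [(0, "10.0.0.5"), (1, "10.0.0.7"), (2, "10.0.0.5"), (3, "10.0.0.5")]

def batch_count(events):
    """Batch style: wait until all the data is available, then scan it once."""
    return Counter(ip for _, ip in events)

def stream_count(events):
    """Streaming style: update state per event and emit a result each time."""
    running = Counter()
    for _, ip in events:
        running[ip] += 1
        yield ip, running[ip]

batch_result = batch_count(EVENTS)           # one answer, after the fact
stream_results = list(stream_count(EVENTS))  # one answer per incoming event
```

With the batch version you learn that an IP showed up three times only once the whole data set has been processed; with the streaming version you could react the moment the count passes a threshold.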

If you want to go down that path, my recommendation is that you start by learning Spark, plus just some basics of Hadoop.
There are also HANA, Elasticsearch, Splunk, etc. They all have their pros and cons.
Always think about the V's in Big Data: Volume, Velocity, Variety.
E.g. HANA offers fast in-memory processing, but it requires the data to be structured (unlike Hadoop, which is schema-on-read), you are bound by the size of your memory, and let's not get into SAP cost...
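
In case "schema on read" is unfamiliar: the raw data is stored as-is, and the structure is only applied by whoever reads it. A minimal Python sketch of the idea follows; the records, the `SCHEMA` dict and `read_with_schema` are made up for illustration and are not any product's API.

```python
import json

# Raw records stored untouched; fields and types vary from line to line.
RAW = [
    '{"ts": 1, "src": "10.0.0.5", "action": "DENY"}',
    '{"ts": 2, "src": "10.0.0.7"}',
    '{"ts": "3", "src": "10.0.0.9", "action": "ALLOW", "extra": true}',
]

# The schema lives in the reader, not in the storage layer.
SCHEMA = {"ts": int, "src": str, "action": str}

def read_with_schema(raw_lines, schema):
    """Apply types at read time; missing fields become None, unknown ones are dropped."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows

rows = read_with_schema(RAW, SCHEMA)
```

A schema-on-write store would have rejected or forced a decision on the second and third records at load time; here a different reader could apply a different schema to the same raw lines later.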

IMHO, Spark is going to be the next level for Big Data. Hadoop is not going away as the preferred storage solution for un/multi-structured data, but for data processing and analysis, I believe Spark will take the lead very soon.

Hope that helps. :)
 