Or 40 Minutes at a Developers Conference Taught Me a Whole Lot About Big Data
Recently I had the opportunity to attend a developer’s conference (FWD JS). It’s organizer had contributed to the organization of the chaotic but powerful HTML5 conference in past years and the first edition of this conference last summer was called JS FWD, so this crowd has roots in the front end web technologies. However, this summer David Nugent expanded the scope and I believe we were all better for it. In fact, my favorite talk by far was by Sujee Maniyam where he effectively explained how different open source technologies can be leveraged to handle the tremendous amount of data that might be produced, managed and analyzed by any massive system of devices. He was especially referring to the often talked about but rarely understood universe of the “Internet of Things.” For the benefit of others, I attempt to quickly summarize his talk here.
Sujee started the talk by explaining how much data, often in terra bytes per day, a company such as “Fitbit” might produce. The stages of a system that handles this much data are broken up into:
1.
Capture
2.
Process
3.
Store
4.
Query
To Capture data from such a system, Sujee explained many options were available including MQ, FluentD, Flume and Kafka. These systems will provide redundancy because there is virtually zero tolerance for failure in systems such as these and at many times, the data is arriving at incredibly high volumes. Of course, the capturing process also needs to scale.
Once you’ve captured this data, you will need to process it in real-time. One of the options is an open source computation system created at Twitter: Apache Storm. Their landing page sums up this need well:
“Storm makes it easy to reliably
process unbounded streams of data, doing for real-time processing what Hadoop
did for batch processing.”
He also mentioned Apache Samza and Apache Nifi were options for stream processing.
Of course, for many years, this role of processing might be fulfilled by the almighty Hadoop. Sujee mentioned that Hadoop is the first-generation big data technology that was created for batch analytics. For streaming, we are looking at second-generation technologies like (Spark, Nifi, Samza ..etc) to process data in (near) real time.
Then in terms of storage, he introduced the room to one of the great challenges of large data intensive systems such as these as they have two major requirements in their storage system: to make the data available forever and allow real-time lookup, even of data that was just acquired. And once again this system needs to scale.
The answer is what is known as a “Lambda Architecture.” To put as simply as possible there should be two data pipelines, one to handle rapid responses and one to handle the massive volume that is collected over months and years.
To store massive amounts of data with no plausible limit, the answer is the storage portion of Hadoop ( A.K.A. a HDFS, Hadoop distributed file system). It is still cost effective and has always been a scalable, fault tolerant means of storing massive amounts of data in perpetuity. HDFS is incredibly powerful but it will always respond using a batch processing paradigm.
For the real-time portion of the Lambda architecture, the modern developer has a variety of distributed databases, most of which support NoSQL. The pros and cons of different distributed databases and NoSQL systems are not in the scope of this post or the talk I am summarizing, but I appreciated Sujee mentioning some big names.
Both Apache projects HBase and Cassandra were popular distributed databases mentioned that scale incredibly better than an old RDBS of the previous generation and neither have a single point of failure. The real choice is whether you want to be dependent on Hadoop, which is required for HBase.
The last sphere of the IoT architecture that Sujee presented was the Querying. In terms of recent and high speed queries, the distributed databes will handle the job flawlessly. For batch queries, interfacing the HDFS directly will suffice as well as Hadoop enhancing tools such as HIVE that allow for more SQL-like queries.
So in short, Sujee Maniyam stood in front of about one hundred of what were mostly “front-end” (or browser code) oriented techies and he gave us a very effective overview of what goes on in the piles of servers we have been sending requests to for years. Names like Hadoop, Spark, Kafaka, and Cassandra have been thrown around at my current and previous positions yet I knew little more than “lots of data” and possibly the word “scalable.” Now I know so much more and I am very grateful.
His slides were well written and he should be commended for speaking to the audience at the appropriate level. The only criticism I could possibly make is that he is one of the few humans that is capable of speaking faster than myself and this may, at some point in time, cause emotional damage to those who are wired at a different frequency.