Saturday, December 5, 2015

An IoT Architecture Crash Course

Or 40 Minutes at a Developers Conference Taught Me a Whole Lot About Big Data



Recently I had the opportunity to attend a developers' conference (FWD JS).  Its organizer had contributed to the organization of the chaotic but powerful HTML5 conference in past years, and the first edition of this conference last summer was called JS FWD, so this crowd has roots in front-end web technologies.  However, this summer David Nugent expanded the scope and I believe we were all better for it.  In fact, my favorite talk by far was by Sujee Maniyam, who effectively explained how different open source technologies can be leveraged to handle the tremendous amount of data that might be produced, managed and analyzed by any massive system of devices.  He was referring in particular to the often talked about but rarely understood universe of the “Internet of Things.”  For the benefit of others, I attempt to quickly summarize his talk here.

Sujee started the talk by explaining how much data, often terabytes per day, a company such as Fitbit might produce.  A system that handles this much data is broken up into four stages:

1.     Capture
2.     Process
3.     Store
4.     Query

 

To capture data from such a system, Sujee explained that many options are available, including MQ, FluentD, Flume and Kafka.  These systems provide redundancy, because there is virtually zero tolerance for failure in systems such as these and the data often arrives at incredibly high volumes.  Of course, the capturing process also needs to scale.

Once you’ve captured this data, you will need to process it in real time.  One of the options is Apache Storm, a computation system open sourced by Twitter.  Its landing page sums up this need well:

“Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing.”

He also mentioned Apache Samza and Apache NiFi as options for stream processing.

Of course, for many years, this processing role was filled by the almighty Hadoop.  Sujee mentioned that Hadoop is the first-generation big data technology, created for batch analytics.  For streaming, we are looking at second-generation technologies (Spark, NiFi, Samza, etc.) that process data in (near) real time.
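To make the batch vs streaming distinction concrete for myself, I wrote a minimal sketch in plain Python.  It is not Hadoop, Storm or Spark code; the function names and the heart-rate readings are my own hypothetical examples.  The batch version waits for all the data and answers once, while the streaming version keeps an up-to-date answer as each reading arrives.

```python
from typing import Iterable, Iterator, List


def batch_average(readings: List[float]) -> float:
    """Batch style: wait until all the data is stored, then compute once."""
    return sum(readings) / len(readings)


def streaming_average(readings: Iterable[float]) -> Iterator[float]:
    """Streaming style: update the answer as each reading arrives."""
    count = 0
    mean = 0.0
    for value in readings:
        count += 1
        # Incremental mean update; earlier readings do not need to be kept.
        mean += (value - mean) / count
        yield mean


# Hypothetical heart-rate readings from a wearable device.
heart_rates = [62.0, 64.0, 70.0, 68.0]

print(batch_average(heart_rates))            # one answer, after the fact
print(list(streaming_average(heart_rates)))  # an answer after every reading
```

Both approaches end at the same average; the difference is when the answer becomes available and how much data has to be held to produce it.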

Then, in terms of storage, he introduced the room to one of the great challenges of large, data-intensive systems such as these.  Their storage layer has two major requirements: to make the data available forever and to allow real-time lookup, even of data that was just acquired.  And once again, this system needs to scale.

The answer is what is known as a “Lambda Architecture.”  To put it as simply as possible, there should be two data pipelines: one to handle rapid responses and one to handle the massive volume that is collected over months and years.
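The two-pipeline idea can be sketched as a toy, in-memory Python class.  This is only my illustration of the concept, not code from the talk: the class name, method names and step-count data are all hypothetical, and a real system would use something like HDFS for the batch side and a distributed database for the fast side.

```python
from collections import defaultdict


class LambdaStore:
    """Toy Lambda Architecture: a batch pipeline plus a speed pipeline."""

    def __init__(self):
        self.master_log = []                # batch side: append-only, kept forever
        self.batch_view = {}                # rebuilt occasionally from master_log
        self.speed_view = defaultdict(int)  # events since the last batch run

    def capture(self, device_id, steps):
        """Every event goes down both pipelines."""
        self.master_log.append((device_id, steps))
        self.speed_view[device_id] += steps

    def run_batch_job(self):
        """Slow but complete recomputation over all of history."""
        view = defaultdict(int)
        for device_id, steps in self.master_log:
            view[device_id] += steps
        self.batch_view = dict(view)
        self.speed_view.clear()  # the batch view has now caught up

    def query(self, device_id):
        """Answer = batch view + whatever arrived since the last batch run."""
        return self.batch_view.get(device_id, 0) + self.speed_view.get(device_id, 0)


store = LambdaStore()
store.capture("tracker-1", 500)
store.capture("tracker-1", 250)
store.run_batch_job()            # history so far is now in the batch view
store.capture("tracker-1", 100)  # too recent for the batch view
print(store.query("tracker-1"))  # 850
```

The point of the design is that a query never has to wait for the slow pipeline: the just-acquired data is always available from the fast one.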

To store massive amounts of data with no plausible limit, the answer is the storage portion of Hadoop (a.k.a. HDFS, the Hadoop Distributed File System).  It is still cost effective and has always been a scalable, fault-tolerant means of storing massive amounts of data in perpetuity.  HDFS is incredibly powerful, but it will always respond using a batch processing paradigm.

For the real-time portion of the Lambda architecture, the modern developer has a variety of distributed databases to choose from, most of which are NoSQL systems.  The pros and cons of different distributed databases and NoSQL systems are not in the scope of this post or the talk I am summarizing, but I appreciated Sujee mentioning some big names.

Both HBase and Cassandra, two Apache projects, were mentioned as popular distributed databases that scale far better than an RDBMS of the previous generation, and neither has a single point of failure.  The real choice is whether you want to be dependent on Hadoop, which is required for HBase.

The last sphere of the IoT architecture that Sujee presented was querying.  For recent and high-speed queries, the distributed databases will handle the job flawlessly.  For batch queries, interfacing with HDFS directly will suffice, as will Hadoop-enhancing tools such as Hive that allow for more SQL-like queries.




So in short, Sujee Maniyam stood in front of about one hundred mostly “front-end” (or browser code) oriented techies and gave us a very effective overview of what goes on in the piles of servers we have been sending requests to for years.  Names like Hadoop, Spark, Kafka, and Cassandra have been thrown around at my current and previous positions, yet I knew little more than “lots of data” and possibly the word “scalable.”  Now I know so much more and I am very grateful.

His slides were well written and he should be commended for speaking to the audience at the appropriate level.  The only criticism I could possibly make is that he is one of the few humans capable of speaking faster than I do, and this may, at some point in time, cause emotional damage to those who are wired at a different frequency.


Sujee's company Elephant Scale specializes in Big data training and consulting.  Please note they have also generously shared their slides and the webinar recording.


Thank you Sujee for participating in FWD JS and please encourage others in Big Data to do the same!  Also thank you to Jeremy Mailen for serving as a reviewer of this article.


Friday, December 4, 2015

A Calendar of Data Science Conferences

For those interested in participating in Data Science conferences, I list them by the season you would likely submit your text.*

Submitting in Winter


 ICDM 2016

16th Industrial Conference on Data Mining
http://www.data-mining-forum.de/
July 13-17, 2016, New York, USA
The deadline for paper submission is January 15th, 2016.

ICDM 2017

17th Industrial Conference on Data Mining
http://www.data-mining-forum.de/icdm2017.php
July 12 - 16, 2017, Leipzig, Germany
Estimating Jan 2017 submission, based on previous year

ACM SIGKDD

22nd ACM SIGKDD Conference of Knowledge Discovery and Data Mining
www.kdd.org/kdd2016/
San Francisco, California: August 13-17, 2016
Research Papers
Submission date: February 12, 2016
Notification date: May 12, 2016
Camera Ready: June 10, 2016
Applied Track:  Submit Feb 12, 2016
Other Important Dates: http://www.kdd.org/kdd2016/calls

DBKDA 2016

The Eighth International Conference on Advances in Databases, Knowledge, and Data Applications
http://www.iaria.org/conferences2016/DBKDA16.html
June 26 - 30, 2016 - Lisbon, Portugal
Submission (full paper)
February 9, 2016
Notification - April 2, 2016
Registration - April 17, 2016
Camera ready - May 17,2016


Submitting in Spring


The 10th ACM Recommender Systems Conference 

http://recsys.acm.org/
Boston, MA, USA from Sept 15-19, 2016.
Abstract submission deadline: April 13th, 2016
Paper submission deadline: April 20th, 2016
Notification: June 14th, 2016
Camera-ready paper deadline: July 1st, 2016

CIKM 2016

Conference on Information and Knowledge Management
http://www.cikmconference.org/
http://mohammadalhasan.wix.com/cikm2016
Indianapolis, USA - October 2016
Submit May
Notification July
16.86% acceptance rate

ICIOT 2017

19th International Conference on Internet of Things
http://www.waset.org/conference/2017/12/istanbul/ICIOT/home
Istanbul, Turkey
Conference Dates: Dec 21-22, 2017
Paper Submission: June 21, 2017
Notification of Acceptance: July 21, 2017
Final Submission: August 21, 2017



Submitting in Summer


SIGMOD/PODS 2017

2017 International Conference on Management of Data
http://conference.researchbib.com/view/event/46959
Raleigh, NC, USA - June 25-30, 2017
Deadline: August 7, 2016

KDIR 2016

 International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management Conference
http://www.kdir.ic3k.org/
Porto, Portugal - Nov 12
Regular Papers
Paper Submission: May 12, 2016
Authors Notification: September 5, 2016
Camera Ready and Registration: September 19, 2016
Position Papers
Paper Submission: June 27, 2016
Authors Notification: September 7, 2016
Camera Ready and Registration: September 19, 2016


Submitting in Fall


SIAM International Conference on Data Mining

http://www.siam.org/meetings/sdm16/
Abstract Submission: October 9, 2015
Workshop Proposals: October 9, 2015
Tutorial Proposals: October 9, 2015
Paper Submission: October 16, 2015
Author Notification: December 21, 2015
Camera Ready Papers Due: January 25, 2016


WWW-2016

25TH INTERNATIONAL WORLD WIDE WEB CONFERENCE
http://www2016.ca/
Montreal APRIL 11 TO 15 2016
Tutorial proposals submission deadline: November 16, 2015
Acceptance notification: December 11, 2015
Tutorial dates: 
April 11-12, 2016



Unknown Submission Deadline

WSDM 2016

ACM International Conference on Web Search and Data Mining
http://www.wsdm-conference.org/2016/
Does anyone know when the paper submission deadline was?  It would help to anticipate next year's.
Paper notifications - Oct 14, 2015
Tutorial notifications - Nov 6, 2015
Workshop paper submissions due Nov 30 - Dec 6, 2015
The 2015 WSDM took place in Feb 2015 and papers were due July 2014



* Please note that not all of these seem to follow an exact 12-month cycle, so please comment if you see any inaccuracies in the future.

Web Mining, Some Definitions

According to (Mobasher, Jain, Han, & Srivastava, 1997), Web mining is “the application of data mining and knowledge discovery techniques to data collected in the World Wide Web transactions.”

(Cooley, Mobasher, & Srivastava, 1997) define web mining as “the discovery and analysis of useful information from the World Wide Web” and the “application of data mining techniques to the World Wide Web.”

After a couple of months of reading papers on this subject, my take is that web mining can be defined as the many different ways to gain insight by applying data mining techniques to all the data that a web server accumulates, most often focusing on how users use a given web “site.”  As nearly every web presence in 2015 is some form of web application, mining the data produced on such web servers goes way beyond who requested which “page” and when.  The papers from many years ago focused on those three attributes because they gleaned data almost exclusively from web server logs.
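As an illustration of those three classic attributes, here is a minimal Python sketch that pulls the who, which page and when out of a single web server log line.  I am assuming the log is in the widely used Common Log Format; the sample line, the regular expression and the function name are my own and are not taken from any of the papers.

```python
import re
from datetime import datetime

# Assumed layout (Common Log Format):
# host ident user [time] "METHOD /page PROTOCOL" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+) [^"]*"'
)


def parse_log_line(line):
    """Return (who, page, when) for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    when = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return match.group("host"), match.group("page"), when


# Hypothetical log line; the IP address is from a documentation-only range.
sample = (
    '203.0.113.7 - - [05/Dec/2015:10:15:32 -0800] '
    '"GET /products/42 HTTP/1.1" 200 5120'
)

who, page, when = parse_log_line(sample)
print(who, page, when.isoformat())
```

Everything else an early web mining system knew about a visit had to be inferred from a long list of tuples like this one.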

In 2015, we have the ability to mine new dimensions of a user's experience to form a more elaborate view of the user’s context, as we can gather much more data on the specifics of the usage within a given “page.”  We have also come to rank the importance of different tasks: for instance, viewing an item vs purchasing an item in the case of eCommerce.

So we now have greater sources of data originating from the web server, but we still tend to focus on how the users are using these web applications so that we can improve the experience and create a more valuable application.  This might be why the term “web usage mining” is more common in recent papers than simply “web mining.”

(Mobasher, Cooley, & Srivastava, 2000) explain further that “web usage mining systems run any number of data mining algorithms on usage or clickstream data gathered from one or more Web sites in order to discover user profiles.”  (Yang, Kou, Chen, & Li, 2007) explain that Web usage mining “is the application for data mining techniques to analyze and discover interesting patterns of user’s usage data on the web.”
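To give a feel for what turning clickstream data into user profiles might look like, here is a deliberately simple Python sketch that also ranks tasks such as viewing vs purchasing.  The event names, the weights and the sample clickstream are all assumptions of mine; the cited papers use real data mining algorithms such as clustering, not a weighted count like this.

```python
from collections import defaultdict

# Assumed weights: a purchase says more about a user's interest than a view.
EVENT_WEIGHTS = {"view": 1.0, "add_to_cart": 3.0, "purchase": 5.0}


def build_profiles(clickstream):
    """Map each user to a dict of category -> weighted interest score."""
    profiles = defaultdict(lambda: defaultdict(float))
    for user, category, event in clickstream:
        # Unknown event types contribute nothing to the score.
        profiles[user][category] += EVENT_WEIGHTS.get(event, 0.0)
    return {user: dict(scores) for user, scores in profiles.items()}


# Hypothetical clickstream: (user, item category, event type).
clickstream = [
    ("alice", "shoes", "view"),
    ("alice", "shoes", "purchase"),
    ("alice", "hats", "view"),
    ("bob", "hats", "view"),
    ("bob", "hats", "add_to_cart"),
]

profiles = build_profiles(clickstream)
print(profiles["alice"])  # {'shoes': 6.0, 'hats': 1.0}
print(profiles["bob"])    # {'hats': 4.0}
```

A real system would feed profiles like these into a clustering or recommendation step to find groups of users with common behavior.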

A complete discussion of the processes and methods of web mining is beyond the scope of this post and is probably best covered in a future textbook, but I would like to quote (Arbelaitz et al., 2013) as they summarize this area of data mining research:

“Web mining can be defined as the application of machine learning techniques to data from the Internet. This process requires a data acquisition and pre-processing stage. The machine learning techniques are mainly applied in the pattern discovery and analysis phase to find groups of web users with common characteristics related to the Internet and the corresponding patterns or user profiles. Finally, the patterns detected in the previous steps are used in the operational phase to adapt the system and make navigation more efficient for new users or to extract important information for the service providers.”


Arbelaitz, O., Gurrutxaga, I., Lojo, A., Muguerza, J., Pérez, J. M., & Perona, I. (2013). Web usage and content mining to extract knowledge for modelling the users of the Bidasoa Turismo website and to adapt it. Expert Systems with Applications, 40(18), 7478–7491. doi:10.1016/j.eswa.2013.07.040
Cooley, R., Mobasher, B., & Srivastava, J. (1997). Web mining: Information and pattern discovery on the World Wide Web. IEEE International Conference on Tools with Artificial Intelligence, 558–567. doi:10.1109/TAI.1997.632303
Mobasher, B., Cooley, R., & Srivastava, J. (2000). Web usage mining can help improve the scalability, accuracy, and flexibility of recommender systems. Communications of the ACM, 43(8), 142–151. doi:10.1145/345124.345169
Mobasher, B., Jain, N., Han, E. S., & Srivastava, J. (1997). Web mining: Pattern discovery from World Wide Web transactions. Technical Report, 1–25. Retrieved from http://eolo.cps.unizar.es/docencia/doctorado/Articulos/DataWebMining/webminer-tr96.pdf
Yang, Q., Kou, J., Chen, F., & Li, M. (2007). A New Similarity Measure for Generalized Web Session Clustering. Fourth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2007), 3.

Sunday, August 2, 2015

Why Python for Statistical Studies?

The primary motivation of this blog is to support fellow novices in their attempts to take their existing development skills and apply them to data science.  In that vein, I present an investigation of the language Python and why it is so powerful for building applications focused on data analysis.


Analytical Power

The inner computational strength of Python comes from a library, NumPy, that was added early in its history to allow for incredibly efficient manipulation of very large vectors of data (Van Der Walt, Colbert, & Varoquaux, 2011)(Lutz, 2007).  Previously, this level of computational efficiency had required a compiled language such as C or C++.  However, using a compiled language means that every time a change is made, the developer needs to wait, often many minutes, for the code to be converted to binaries.  Meanwhile, Python is known for its “rapid turnaround” (Lutz, 2007)(Grandell, Peltomäki, Back, & Salakoski, 2006) as it is technically an interpreted language.  This is the main reason that developers prefer interpreted languages such as Python to many other languages in terms of development time.  So over the past decade and a half, Python’s ability to perform large-scale computation, along with its popularity among developers, has led to other Python libraries being developed within the open source community to support a tremendous variety of analysis almost right out of the box (Grandell et al., 2006).

So in combination with these open source mathematical libraries, Python is said to have many of the same statistical capabilities as R, SAS or MATLAB (McKinney, 2012)(Nilsen, 2007), while still maintaining its status as a fully functioning object-oriented language capable of being used to build enormous systems (Lutz, 2007)(Chudoba, Sadílek, Rypl, & Vorechovsky, 2013).  This ability to support “programming-in-the-large” is unlike some of the popular early scripting languages of the past, like Perl (McKinney, 2012).  Also, unlike other well-known mathematical languages such as R and MATLAB, Python is very web savvy, as its most popular frameworks are built on the server-browser paradigm.  This was important to us because we wanted to leverage the established power of the web browser for our user interface and also offer the option of making parts of this system open to anybody.

Generous Community

Python also has an incredibly enthusiastic and generous community that freely shares knowledge via blog posts and forums (Lutz, 2007).  So if we ever do find some sort of computational limitation, the community can point us to packages that tie easily into scientific extensions written in compiled languages like C, C++ and Java.

Ease of use

That Python has a strong community also means that answers to almost any development challenge are easily found, but even this is less of a concern than with other languages because Python is widely regarded as one of the easiest mainstream programming languages to learn (Zelle, n.d.)(Lutz, 2007)(Grandell et al., 2006).  It’s also considered one of the cleanest languages, allowing developers to write it incredibly quickly (a.k.a. it is very “writable”) (Grandell et al., 2006)(McKinney, 2012).  This is primarily because of its simple yet structured syntax but also due to its use of dynamic typing.  We contend that academics who need to write code should learn a bit about proper control structures, and they probably should learn the difference between a hash table and an array.  However, in Python there is no compelling reason to learn many additional computer science concepts, such as the tradeoffs between a “float” and a “double,” as you would if you were developing in Java, C, or C++.  In Python, data types are just a bit simpler.  Similarly, the syntax vaguely resembles English, with its unusual use of “is”, “pass” and “not”, so that it is very easy for one developer to read another developer’s code.  This “readability” is furthered by the community convention of “snake_case”, which encourages very descriptive variable names.  So in short, as many of the authors of the study in question are first and foremost Industrial Engineers, Python’s reduced learning curve, reputable writability and genuine readability were incredibly appealing to us.
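Here is a small sketch of what I mean by readable, dynamically typed code.  The function and the step counts are hypothetical examples of my own, but note how “is”, “not” and the snake_case names let the code read almost like a sentence, and how integers and floating point numbers mix without any type declarations.

```python
def average_daily_steps(daily_step_counts):
    """Return the mean of the readings, skipping days with no data."""
    recorded_days = [steps for steps in daily_step_counts if steps is not None]
    if not recorded_days:
        return None
    return sum(recorded_days) / len(recorded_days)


# An array (a Python list): ordered readings that can mix ints, floats and None.
week_of_readings = [8000, 10500.5, None, 9500]

# A hash table (a Python dict): look up a value by a descriptive key.
steps_by_day = {"monday": 8000, "tuesday": 10500.5, "thursday": 9500}

print(average_daily_steps(week_of_readings))       # 9333.5
print(average_daily_steps(steps_by_day.values()))  # 9333.5
print(average_daily_steps([None, None]))           # None
```

The same function accepts a list or the values of a dict without modification, which is exactly the kind of flexibility that dynamic typing buys.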

In Conclusion 

So in the end, I suggest Python for many reasons, but primarily for its strong and justified reputation as a language capable of building a large and stable system while simultaneously satisfying almost any statistical analysis request, all while being easy to use.

Of course, if I am missing any arguments at all, I would love to hear about them. Please comment below!