Wednesday, February 3, 2016

Getting started with SPMF

Or: Transforming your data for use in SPMF


For the last chapter(s) of my dissertation, I have been excited to use the open source tool SPMF to apply sequential pattern mining algorithms to a large dataset that includes literally millions of events tied to half a million work tickets.

In the process of preparing and transforming my data, I have learned a great deal about how to prepare an input file for SPMF.  I summarize the steps here in hopes they may help other researchers get started with this powerful open source data mining tool.

Please note, I am not claiming to be a pattern mining expert (yet!), and this is simply the approach I have developed to reduce a multi-dimensional data set into the format required by SPMF.  There could be many other legitimate approaches!

To use the (literal) textbook example of associative pattern mining . . .

In the mid-90s, a “major Midwest based retailer” mined their transaction data to find an unexpected relationship between the purchase of beer and diapers.  

If the analysts today were to conduct a similar study with SPMF, I would suggest they:

1.    Reduce every item purchased to a meaningful label
2.    Translate each label into a “retrievable integer”
3.    Assemble every transaction (the set of items purchased by the same person) into the required string format by concatenating the retrievable integers in the proper way and putting them all into one file
4.    Apply an algorithm available in SPMF to this input file


I will now describe how I accomplish these steps using my system, written in Python and built on the Django framework.  Python is an excellent language for both advanced data analysis and application building, as I explain here.

Step 0: Assembling Item Sets

SPMF is capable of seeking all kinds of patterns in data, including association rules, itemsets, sequential patterns, sequential rules, and high-utility patterns, and it can also perform sequence prediction, classification, and clustering.  In my case, I am especially interested in sequential pattern mining, but I have adopted the strategy of acknowledging that many events happen simultaneously or, as I argue, effectively simultaneously.

As in the classic shopping cart analogy, it is relevant that the diapers and the beer were bought in the same trip.  Even if the retailer were able to obtain the customer's path through the store, it is much less interesting[1] that the diapers were put in the cart before the beer.  Similarly, in my data, if a user makes two changes within a few minutes of each other, I do not want to distinguish which change came first.  This is largely because the sequence within one “session[2]” has little effect on the outcome of the work ticket.  Does it really matter whether the project leader changes the priority of a ticket before or after assigning it to a specific contributor?  The assigned developer will probably not see the ticket until long after the project manager has made those changes, so in terms of sequential information, it is much more important that the developer made their changes after the project manager than whether the developer marked the ticket as resolved before attaching a solution[3].  In short, I need to represent itemsets within the sequential data.  To accomplish this, all events from the same user within a certain threshold define a session.  That threshold is currently set at one hour, but further research will determine the best criteria.
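For illustration, here is a minimal sketch of how such a session split might look in plain Python.  The event attribute names, the list-based input, and the one-hour constant are assumptions for the sake of the example; my actual code works on Django QuerySets.


    from datetime import timedelta

    SESSION_GAP = timedelta(hours=1)  # threshold still under evaluation

    def split_into_sessions(events):
        # Group one user's events (assumed sorted by timestamp) into sessions.
        # A new session starts whenever the gap since the previous event
        # exceeds SESSION_GAP.
        sessions = []
        current_session = []
        previous_time = None
        for event in events:
            if previous_time is not None and (event.timestamp - previous_time) > SESSION_GAP:
                sessions.append(current_session)
                current_session = []
            current_session.append(event)
            previous_time = event.timestamp
        if current_session:
            sessions.append(current_session)
        return sessions
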

Within my code, I tried two data structures to represent itemsets.  First, I tried a Python “list of lists” and then a “list of Django QuerySets.”  I found the second much more useful because I could use the QuerySet methods (especially filter and count) throughout the transformation code.

This is a highly simplified example:


    for session_query_set in session_list:
        session_event_count = session_query_set.count()
        session_attachment_event_list = session_query_set.filter(action_name="attachment")

        for attachment_event in session_attachment_event_list:
            # Do something with this attachment event
            pass


Of course, regardless of whether you need to utilize itemsets (or sessions, as I call them in my data), you will still need to reduce each event to a meaningful label.

Step 1: Reducing Events to Meaningful Labels

Based on my summary of steps above, this step may sound simple.  However, somewhere in the process of discovering the infamous beer and diapers relationship, the “Midwestern retail analysts” had to decide:

 . . . were they looking for items purchased along with “Huggies Diapers Overnight Size 3,” “All Huggies Diaper Products,” “All Diaper Products,” “All Baby Doody Management Products,” or possibly even “All Baby Products (Non-Elmo Related)”?

This is a decision often referred to as the choice of granularity (Han, Kamber, & Pei, 2012) (Liu, Zhang, & Xiong, 2014).  Unfortunately, this step is very specific to the data you are studying and is probably best covered in a series of book chapters, but might also be covered in a future post.  For now, consider the events occurring to a work ticket being reduced to labels (in this case, with sessions being represented) such as:


{CREATE_Bug, Attachment} {VERSION_set} {Resolution_Fixed, Attachment} {Status_Closed}


In the case of my code utilizing a list of QuerySets, a grossly simplified example of the code to assemble the list of label lists might be similar to:


    def build_transaction_list(session_list):
        transaction_list = []
        for session_query_set in session_list:
            new_session_labels = []

            session_attachment_event_list = session_query_set.filter(
                action_name="attachment")
            for attachment_event in session_attachment_event_list:
                new_session_labels.append("Attachment")

            # other translations would go here ....

            transaction_list.append(new_session_labels)

        return transaction_list



However, these labels are only needed to make sense of the output later, so there is no reason to store them like this just yet.  The next step is to translate the labels into the format SPMF requires: positive integers.

Step 2 & 3: Translate the Event Labels and Concatenate Retrievable Integers

In the SPMF input format, each event is represented by a positive integer, and the events are listed in the order they occurred, from first to last.  Events are separated by a single space character.  If itemsets are involved, each itemset is followed by a -1.

It is important to note that there can be only one space character between each event or itemset separator (-1), and each line needs to end with a -2 followed by both “end line” characters (carriage return and line feed).

The result might look like[4]:


622 638 621 364 -1 574 -1 575 -1 252 256 -1 250 -1 -2\r\n


(In most IDEs, the carriage return and line feed would not typically be visible.)

Of course, I keep referring to these integers as “retrievable integers” because what good are they if their meaning is not completely consistent throughout your study?

In order to be sure of this retrievability, I created a simple Django model:


from django.db import models

class Sequence_keys(models.Model):
    id = models.AutoField(primary_key=True)
    sequence_lbl = models.CharField(max_length=100, default='uNk')


Of course, the primary_key=True makes sure the integer key is unique.  And then I would fetch the integer via:


def get_sequence_int(seq_lbl):
    seq_obj, created = Sequence_keys.objects.get_or_create(sequence_lbl=seq_lbl)
    return seq_obj.id  # the unique integer that stands in for this label



Subsequently, a simple routine can assemble each transaction into the format shown above, and a (potentially very large) file can be prepared for SPMF.
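A minimal sketch of that routine might look like the following.  The nested structure of transaction_list (each transaction being a list of sessions, each session a list of label strings) and the output file name are assumptions for illustration; get_sequence_int is the helper above.


    def write_spmf_file(transaction_list, output_path="spmf_input.txt"):
        # Each transaction is assumed to be a list of sessions (itemsets),
        # and each session a list of label strings.
        with open(output_path, "w", newline="") as output_file:
            for transaction in transaction_list:
                parts = []
                for session_labels in transaction:
                    parts.extend(str(get_sequence_int(lbl)) for lbl in session_labels)
                    parts.append("-1")   # itemset separator
                parts.append("-2")       # end-of-sequence marker
                output_file.write(" ".join(parts) + "\r\n")


The trailing -1 after each itemset and the -2 before the line break mirror the example line shown earlier.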

In conclusion

I speculate that some of these steps apply to other data mining tools besides SPMF, since the basic principles of associative pattern mining should be common throughout.

After reading all this, you might be taken aback by how much prep work is involved and, of course, in this post I hardly addressed the step that is taking the vast majority of my time: translating my database of events into meaningful labels.  This realization you and I are having reminds me of a quote:

“80-85% of your effort in a data mining based research project will be focused on data exploration and pre-processing; not running the algorithms  . . .”
–A well-regarded ASU professor (name to be revealed when I get his permission)

Up next:

Sometime this year I hope to post some tips on how to run SPMF effectively (Step 4) and (ideally) also share some code on how to partition the output.  I have also discovered the hard way that another major time-consuming task is analyzing the output files.  As I am starting to run scenarios where well over 100,000 output lines might be produced, I have created scripts that parse these files and give a very abridged version of the output to answer very specific questions.  Please stay tuned . . .





[1] See also papers such as (Baena-García & Morales-Bueno, 2012) on objectively measuring “interestingness.”

[2] For my data, I used research pertaining to web sessions (Park, Suresh, & Jeong, 2008)(Xiao & Zhang, 2001)(Bianco, Mardente, Mellia, Munafò, & Muscariello, 2005) to guide my definition of a user session. 

[3] For online retailers, the order of “cart adds” before a purchase is made is readily available.  Does anyone know if this sequential information is utilized?

[4] For those who recognize that the “screen shots” might be a reference to the Apple IIe or Apple IIc, major hats off to you.  And yes, I first learned to code on such epic machines because I was born in the '70s and these machines were “Rad!!”


References

Baena-García, M., & Morales-Bueno, R. (2012). Mining interestingness measures for string pattern mining. Knowledge-Based Systems, 25(1), 45–50. doi:10.1016/j.knosys.2011.01.013
Bianco, A., Mardente, G., Mellia, M., Munafò, M., & Muscariello, L. (2005). Web user session characterization via clustering techniques. GLOBECOM - IEEE Global Telecommunications Conference, 2, 1102–1107. doi:10.1109/GLOCOM.2005.1577807
Han, J., Kamber, M., & Pei, J. (2012). Data Cube Technology. In Data Mining: Concepts and Techniques (3rd ed., pp. 187–242). Boston: Morgan Kaufmann. doi:10.1016/B978-0-12-381479-1.00005-8
Liu, C., Zhang, K., & Xiong, H. (2014). Sequential Pattern Analysis with Right Granularity. doi:10.1109/ICDMW.2014.164
Park, S., Suresh, N. C., & Jeong, B. K. (2008). Sequence-based clustering for Web usage mining: A new experimental framework and ANN-enhanced K-means algorithm. Data and Knowledge Engineering, 65(3), 512–543. doi:10.1016/j.datak.2008.01.002
Xiao, J., & Zhang, Y. (2001). Clustering of web users using session-based similarity measures. In Computer Networks and Mobile Computing, 2001. Proceedings. 2001 International Conference on (pp. 223–228). IEEE. doi:10.1109/ICCNMC.2001.962600


Friday, January 29, 2016

Sequential Pattern Mining: The initial attempts

Or: Why I suggest you consider using SPMF for all kinds of Pattern Mining Studies


This past year, Lei “Erik” Liu, a classmate of mine who is highly skilled in data mining, helped me realize that there is a great opportunity in my dissertation to study unusual workflows using sequential pattern mining.  I will save the details of my research for other posts, but in short, I am working with a vast amount of project management data.

In this post I share my initial findings on two open source tools available for Sequential Pattern Mining in hopes that others might benefit.  I would be thrilled to receive any suggestions others might have in the comments below.

The Problem Statement

Sequential pattern mining is a form of association rule mining that takes into account the specific order of events, as the sequence may be relevant to any conclusions drawn.  The steps of a sequential pattern mining study usually involve reducing the ordered events to discrete labels (often integers), grouping events that occurred simultaneously (or, as I argue, effectively simultaneously), and applying a data mining tool to obtain the useful patterns.


During a Frantic Search: The Open Source Tools Available

I investigated two open source packages that claimed to support sequential pattern mining.  A researcher named Chris Borgelt wrote Seqwog and Sequoia in C, and both appear to be built for scalability, as they handled files of different sizes in virtually the same amount of time[1].  However, they have no user interface, so I cannot control the algorithm parameters (minimum support, for example) as easily as I would like, and I only have the two algorithms to choose from.  Also, most examples of sequential pattern mining allow for simultaneous events, and these two tools do not seem to.  The classic example of a sequential pattern mining study is focused on purchasing patterns: a series of trips to the store by one customer are studied, but the customer usually purchases more than one item per visit (Tan, Steinbach, & Kumar, 2005), so some events are simultaneous.  This is also true of my project management data, where a user can make more than one change to a work ticket at the same time.  There is also a serious lack of documentation for these tools.

Meanwhile, a professor at the Harbin Institute of Technology Shenzhen Graduate School named Philippe Fournier-Viger founded a project that has created a Java-based tool supporting dozens of data mining algorithms, including eight for sequential rule mining[2] alone.  Some of the sequential pattern mining algorithms his tool provides support simultaneous events, and his team has done an excellent job of documenting the use of all of the algorithms.  They even (ATTENTION PhD STUDENTS!) link to the papers associated with these algorithms.  The blogosphere touts his contributions, and I am excited about how easy it is to control the algorithm parameters through an impressive user interface, so I moved forward with “SPMF” by Dr. Fournier-Viger and associates.

SPMF

Of course, being an application developer by day (and an ABD PhD student by night), I am much more familiar with Java than C, so this is part of my bias.  It should also not go unnoticed that Dr. Fournier-Viger provides the source code and is happy to discuss the possibility of others contributing.  I am a tremendous supporter of Open Source Software (OSS).

By Request: The Prototype

Under the pressure of my Research Review,[3] I was asked to prototype my planned sequential pattern mining attempts.  This led to the initial assessment above and then a lot of data prep work so that I could apply at least one of SPMF's algorithms to my data.  What I realized in building the prototype is that the real effort is going to be establishing the proper level of detail to represent in the chain of events.  In many papers (Liu, Zhang, & Xiong, 2014)(Liu, Ge, et al., 2014), this question is discussed in terms of levels of granularity.  In the case of my project management (specifically issue tracking) data, one question might be: will it be advantageous to note every priority change specifically in terms of “from what to what” (e.g., P2 to P1), or should I just capture “priority changed”?  Higher specificity is intuitively more useful, but if I am too specific, there might not be enough data to support potentially useful sequences.  If the specificity is too low, the patterns might be too obvious.  For instance, my data shows that in one project, 56.6% of the time that a ticket is created, a release date change event occurs after the creation.  Perhaps as part of a more detailed study on the circumstances of postponement this may be of interest, but by itself, especially at this level of granularity, it is a bit moot.
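To make that trade-off concrete, here is a hedged sketch of the two options for a single priority-change event.  The attribute names old_value and new_value are placeholders for whatever the real event model exposes.


    def priority_label(event, fine_grained=False):
        if fine_grained:
            # e.g. "PRIORITY_P2_to_P1" -- keeps the "from what to what" detail
            return "PRIORITY_{}_to_{}".format(event.old_value, event.new_value)
        # coarse version: every priority change collapses to one label
        return "PRIORITY_changed"
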

Also, the prototype showed that 1,000 sequences with well under a dozen possible event types per transaction processed in seconds.  Of course, the plan was to add a lot more detail and apply the algorithms to about half a million transactions, so the question of scalability was still unanswered during my exam.


The initial findings: Low Minimum Support is Costly

I have now been working with the SPMF tool for six months, and I can say with great confidence that on my simple Mac mini, it scales incredibly well.  Using a level of granularity where hundreds of events are possible (but the transactions are dominated by probably a few dozen), I can tackle a list of ~12k transactions in 30-90 minutes with standard parameters.  I mention “standard parameters” because many example studies use minimum confidence and minimum support levels in the 50-75% range.  As I am sometimes investigating anomalies, I have experimented with bringing these levels down a lot further, and this seems to be what takes a request of SPMF (on 12-14k transactions of my data, anyway) from about a minute to many minutes.  I understand this is largely because, for frequent pattern mining algorithms in general, the number of patterns (and the search space) can increase exponentially as the minimum support threshold is decreased.  Of course, the legitimacy of bringing these parameters down too low is a very worthwhile topic and should be addressed in another post.
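For what it is worth, my parameter experiments are just repeated calls to the SPMF jar.  A rough sketch of how such a sweep can be scripted is below; the algorithm name, file names, and support values are only examples, and the command format follows my reading of the SPMF documentation.


    import subprocess

    def run_spmf(algorithm, input_path, output_path, *params):
        # SPMF is invoked as: java -jar spmf.jar run <Algorithm> <input> <output> <params...>
        command = ["java", "-jar", "spmf.jar", "run", algorithm,
                   input_path, output_path] + [str(p) for p in params]
        subprocess.call(command)

    # Sweep the minimum support threshold and watch the run time grow
    for minsup in ("75%", "50%", "25%"):
        run_spmf("PrefixSpan", "spmf_input.txt",
                 "output_{}.txt".format(minsup.strip("%")), minsup)
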



[1] Admittedly, these tools are designed to handle a much larger set of transactions, and at the point where I drew this conclusion, I had not yet utilized SPMF on more than 1,000 transactions.

[2] Dr. Philippe Fournier-Viger differentiates between sequential rule mining and sequential pattern mining.  I hope to address this differentiation in a different post, but please note his map of algorithms here.

[3] The Research Review in my department is called a “Comprehensive Exam,” but it might be called a “Prospectus Review” in other programs, as it should be a formal way for the committee to provide life-saving feedback on the research plan midway through the dissertation process.  Mine turned out to be, although very stressful, an excellent way to move from about ~35% done to about ~60-65% in only three months.  If you have one in your program, I hope you will look on it as a way to find the light at the end of the tunnel and not just a form of torture.  Although it may be that as well.

References

Liu, C., Ge, Y., Xiong, H., Xiao, K., Geng, W., & Perkins, M. (2014). Proactive workflow modeling by stochastic processes with application to healthcare operation and management. KDD ’14 The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1593–1602. doi:10.1145/2623330.2623363
Liu, C., Zhang, K., & Xiong, H. (2014). Sequential Pattern Analysis with Right Granularity. doi:10.1109/ICDMW.2014.164

Tan, P.-N., Steinbach, M., & Kumar, V. (2005). Introduction to Data Mining, (First Edition). Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc.

Saturday, December 5, 2015

An IoT Architecture Crash Course

Or: 40 Minutes at a Developers Conference Taught Me a Whole Lot About Big Data



Recently I had the opportunity to attend a developer's conference (FWD JS).  Its organizer had contributed to the organization of the chaotic but powerful HTML5 conference in past years, and the first edition of this conference last summer was called JS FWD, so this crowd has roots in front-end web technologies.  However, this summer David Nugent expanded the scope, and I believe we were all better for it.  In fact, my favorite talk by far was by Sujee Maniyam, who effectively explained how different open source technologies can be leveraged to handle the tremendous amount of data that might be produced, managed, and analyzed by any massive system of devices.  He was especially referring to the often talked about but rarely understood universe of the “Internet of Things.”  For the benefit of others, I attempt to quickly summarize his talk here.

Sujee started the talk by explaining how much data, often terabytes per day, a company such as Fitbit might produce.  The stages of a system that handles this much data are broken up into:

1.     Capture
2.     Process
3.     Store
4.     Query

 

To capture data from such a system, Sujee explained that many options are available, including MQ, FluentD, Flume, and Kafka.  These systems provide redundancy because there is virtually zero tolerance for failure in systems such as these, and at many times the data arrives at incredibly high volumes.  Of course, the capturing process also needs to scale.

Once you've captured this data, you will need to process it in real time.  One of the options is an open source computation system created at Twitter: Apache Storm.  Its landing page sums up this need well:

“Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing.”

He also mentioned Apache Samza and Apache Nifi were options for stream processing.

Of course, for many years, this processing role might have been fulfilled by the almighty Hadoop.  Sujee mentioned that Hadoop is the first-generation big data technology, created for batch analytics.  For streaming, we are looking at second-generation technologies (Spark, Nifi, Samza, etc.) to process data in (near) real time.

Then, in terms of storage, he introduced the room to one of the great challenges of large, data-intensive systems such as these: they have two major requirements of their storage system, to make the data available forever and to allow real-time lookup, even of data that was just acquired.  And once again, this system needs to scale.

The answer is what is known as a “Lambda Architecture.”  To put it as simply as possible, there should be two data pipelines: one to handle rapid responses and one to handle the massive volume that is collected over months and years.

To store massive amounts of data with no plausible limit, the answer is the storage portion of Hadoop (A.K.A. HDFS, the Hadoop Distributed File System).  It is still cost effective and has always been a scalable, fault-tolerant means of storing massive amounts of data in perpetuity.  HDFS is incredibly powerful, but it will always respond using a batch processing paradigm.

For the real-time portion of the Lambda architecture, the modern developer has a variety of distributed databases, most of which support NoSQL.  The pros and cons of different distributed databases and NoSQL systems are not in the scope of this post or the talk I am summarizing, but I appreciated Sujee mentioning some big names.

Both Apache projects HBase and Cassandra were mentioned as popular distributed databases that scale incredibly better than an old RDBMS of the previous generation, and neither has a single point of failure.  The real choice is whether you want to be dependent on Hadoop, which is required for HBase.

The last sphere of the IoT architecture that Sujee presented was querying.  For recent and high-speed queries, the distributed databases will handle the job flawlessly.  For batch queries, interfacing with HDFS directly will suffice, as will Hadoop-enhancing tools such as Hive that allow for more SQL-like queries.




So in short, Sujee Maniyam stood in front of about one hundred of what were mostly “front-end” (or browser code) oriented techies and gave us a very effective overview of what goes on in the piles of servers we have been sending requests to for years.  Names like Hadoop, Spark, Kafka, and Cassandra have been thrown around at my current and previous positions, yet I knew little more than “lots of data” and possibly the word “scalable.”  Now I know so much more, and I am very grateful.

His slides were well written and he should be commended for speaking to the audience at the appropriate level.  The only criticism I could possibly make is that he is one of the few humans that is capable of speaking faster than myself and this may, at some point in time, cause emotional damage to those who are wired at a different frequency.


Sujee's company Elephant Scale specializes in Big data training and consulting.  Please note they have also generously shared their slides and the webinar recording.


Thank you Sujee for participating in FWD JS and please encourage others in Big Data to do the same!  Also thank you to Jeremy Mailen for serving as a reviewer of this article.


Friday, December 4, 2015

A Calendar of Data Science Conferences

For those interested in participating in Data Science conferences, I list them by the season you would likely submit your text.*

Submitting in Winter


 ICDM 2016

16th Industrial Conference on Data Mining
http://www.data-mining-forum.de/
July 13-17, 2016, New York, USA
The deadline for paper submission is January 15th, 2016.

ICDM 2017

17th Industrial Conference on Data Mining
http://www.data-mining-forum.de/icdm2017.php
July 12 - 16, 2017, Leipzig, Germany
Estimating Jan 2017 submission, based on previous year

ACM SIGKDD

22nd ACM SIGKDD Conference of Knowledge Discovery and Data Mining
www.kdd.org/kdd2016/
San Francisco, California: August 13-17, 2016
Research Papers
Submission date: February 12, 2016
Notification date: May 12, 2016
Camera Ready: June 10, 2016
Applied Track:  Submit Feb 12, 2016
Other Important Dates: http://www.kdd.org/kdd2016/calls

DBKDA 2016

The Eighth International Conference on Advances in Databases, Knowledge, and Data Applications
http://www.iaria.org/conferences2016/DBKDA16.html
June 26 - 30, 2016 - Lisbon, Portugal
Submission (full paper)
February 9, 2016
Notification - April 2, 2016
Registration - April 17, 2016
Camera ready - May 17,2016


Submitting in Spring


The 10th ACM Recommender Systems Conference 

http://recsys.acm.org/
Boston, MA, USA from Sept 15-19, 2016.
Abstract submission deadline: April 13th, 2016
Paper submission deadline: April 20th, 2016
Notification: June 14th, 2016
Camera-ready paper deadline: July 1st, 2016

CIKM 2016

Conference on Information and Knowledge Management
http://www.cikmconference.org/
http://mohammadalhasan.wix.com/cikm2016
OCTOBER CIKM 2016 in Indianapolis, USA.
Submit May
Notification July
16.86% acceptance rate

ICIOT 2017

19th International Conference on Internet of Things
http://www.waset.org/conference/2017/12/istanbul/ICIOT/home
Istanbul, Turkey
Important Dates:
Paper Submission: June 21, 2017
Notification of Acceptance: July 21, 2017
Final Submission: August 21, 2017
Conference Dates: Dec 21-22, 2017



Submitting in Summer


SIGMOD/PODS 2017

2017 International Conference on Management of Data
http://conference.researchbib.com/view/event/46959
Raleigh, NC, USA -  2017-06-25 - 2017-06-30
Deadline:2016-08-07

KDIR 2016

 International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management Conference
http://www.kdir.ic3k.org/
Porto, Portugal - Nov 12
Regular Papers
Paper Submission: May 12, 2016
Authors Notification: September 5, 2016
Camera Ready and Registration: September 19, 2016
Position Papers
Paper Submission: June 27, 2016
Authors Notification: September 7, 2016
Camera Ready and Registration: September 19, 2016


Submitting in Fall


SIAM International Conference on Data Mining

http://www.siam.org/meetings/sdm16/
Abstract Submission: October 9, 2015
Workshop Proposals: October 9, 2015
Tutorial Proposals: October 9, 2015
Paper Submission: October 16, 2015
Author Notification: December 21, 2015
Camera Ready Papers Due: January 25, 2016


WWW-2016

25TH INTERNATIONAL WORLD WIDE WEB CONFERENCE
http://www2016.ca/
Montreal APRIL 11 TO 15 2016
Tutorial proposals submission deadline: November 16, 2015
Acceptance notification: December 11, 2015
Tutorial dates: 
April 11-12, 2016



Unknown Submission Deadline

WSDM 2016

ACM International Conference on Web Search and Data Mining
http://www.wsdm-conference.org/2016/
Does anyone know when this deadline was, so we can anticipate next year's?
Paper notifications - Oct 14, 2015
Tutorial notifications - Nov 6, 2015
Workshop paper submissions due Nov 30 - Dec 6, 2015
The 2015 WSDM took place in Feb 2015 and papers were due July 2014



* Please note that not all of these seem to follow an exact 12-month cycle, so please comment if you see any inaccuracies in the future.

Web Mining, Some Definitions

According to  (Mobasher, Jain, Han, & Srivastava, 1997) Web mining is “the application of data mining and knowledge discovery techniques to data collected in the World Wide Web transactions.”

(Cooley, Mobasher, & Srivastava, 1997) define web mining as “the discovery and analysis of useful information from the World Wide Web” and the “application of data mining techniques to the World Wide Web.”

After a couple of months of reading papers on this subject, my take is that web mining can be defined as the many different ways to gain insight, most often into how users use a given web “site,” by applying data mining techniques to all of the data that a web server accumulates.  As nearly every web presence in 2015 is some form of web application, mining the data produced on such web servers goes well beyond who requested which “page” and when, though the early papers focused on those three attributes because they gleaned their data almost exclusively from server web logs.

In 2015, we have the ability to mine new dimensions of a user's experience to form a more elaborate view of the user's context, as we can gather much more data on the specifics of usage within a given “page.”  We have also come to rank the importance of different tasks: for instance, viewing an item versus purchasing an item in the case of eCommerce.

So we now have richer sources of data originating from the web server, but we still tend to focus on how users are using these web applications so that we can improve the experience and create a more valuable application.  This might be why the term “web usage mining” is more common in recent papers than simple “web mining.”

(Mobasher, Cooley, & Srivastava, 2000) explain further that “web usage mining systems run any number of data mining algorithms on usage or clickstream data gathered from one or more Web sites in order to discover user profiles.“  In (Yang, Kou, Chen, & Li, 2007) they explain that Web usage mining, “is the application for data mining techniques to analyze and discover interesting patterns of user’s usage data on the web.”

A complete discussion of the processes and methods of web mining is beyond the scope of this post and probably are best covered in a future text book, but I would like to quote (Arbelaitz et al., 2013) as they summarize this area of data mining research:  

“Web mining can be defined as the application of machine learning techniques to data from the Internet. This process requires a data acquisition and pre-processing stage. The machine learning techniques are mainly applied in the pattern discovery and analysis phase to find groups of web users with common characteristics related to the Internet and the corresponding patterns or user profiles. Finally, the patterns detected in the previous steps are used in the operational phase to adapt the system and make navigation more efficient for new users or to extract important information for the service providers.”


Arbelaitz, O., Gurrutxaga, I., Lojo, A., Muguerza, J., Pérez, J. M., & Perona, I. (2013). Web usage and content mining to extract knowledge for modelling the users of the Bidasoa Turismo website and to adapt it. Expert Systems with Applications, 40(18), 7478–7491. doi:10.1016/j.eswa.2013.07.040
Cooley, R., Mobasher, B., & Srivastava, J. (1997). Web mining: information and pattern discovery on the World Wide Web. IEEE International Conference on Tools with Artificial Intelligence, 558–567. doi:10.1109/TAI.1997.632303
Mobasher, B., Cooley, R., & Srivastava, J. (2000). Web usage mining can help improve the scalability, accuracy, and flexibility of recommender systems. Communications of the ACM, 43(8), 142 – 151. doi:10.1145/345124.345169
Mobasher, B., Jain, N., Han, E. S., & Srivastava, J. (1997). Web Mining: Pattern Discovery from World Wide Web Transactions. Technical Report, 1–25. Retrieved from http://eolo.cps.unizar.es/docencia/doctorado/Articulos/DataWebMining/webminer-tr96.pdf

Yang, Q., Kou, J., Chen, F., & Li, M. (2007). A New Similarity Measure for Generalized Web Session Clustering. Fourth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2007), 3.

Sunday, August 2, 2015

Why Python for Statistical Studies?

The primary motivation of this blog is to support fellow novices in their attempts to take their existing development skills and apply them to data science.  In that vein, I present an investigation of the language Python and why it is so powerful for building applications focused on data analysis.


Analytical Power

The inner computational strength of Python comes from a library, NumPy, which was added early in its history to allow for incredibly efficient manipulation of very large vectors of data (Van Der Walt, Colbert, & Varoquaux, 2011)(Lutz, 2007).  Previously, this level of computational efficiency had required a compiled language such as C or C++.  However, using a compiled language requires that every time a change is made, the developer wait, often many minutes, for the code to be compiled to binaries.  Meanwhile, Python is known for its “rapid turnaround” (Lutz, 2007)(Grandell, Peltomäki, Back, & Salakoski, 2006), as it is technically an interpreted language.  This is the main reason developers prefer interpreted languages such as Python to many other languages in terms of development time.  So over the past decade and a half, Python's ability to perform large-scale computation, along with its popularity among developers, has led to other Python libraries being developed within the open source community to support a tremendous variety of analysis almost right out of the box (Grandell et al., 2006).
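As a tiny, purely illustrative example of that vectorized style (the array size and arithmetic below are arbitrary):


    import numpy as np

    values = np.random.rand(10000000)  # ten million floats in one contiguous array

    # A single vectorized expression; the loop runs in optimized C inside NumPy
    scaled = values * 2.5 + 1.0
    total = scaled.sum()
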

So in combination with these mathematical open source libraries, Python is said to have many of the same statistical capabilities as R, SAS, or MATLAB (McKinney, 2012)(Nilsen, 2007), while still maintaining its status as a fully functioning object-oriented language capable of being used to build enormous systems (Lutz, 2007)(Chudoba, Sadílek, Rypl, & Vorechovsky, 2013).  Also, unlike other well-known mathematical languages such as R and MATLAB, Python is simply very web savvy, as the most popular frameworks built in Python are based on the server-browser paradigm.  This was important to us because we wanted to leverage the established power of the web browser for our user interface and also offer the option of making parts of this system open to anybody.  This ability to support “programming-in-the-large” is unlike some of the popular early scripting languages of the past like Perl (McKinney, 2012).

Generous Community

Python also has an incredibly enthusiastic and generous community that freely evangelizes knowledge via blog posts and forums (Lutz, 2007).  So if we ever do find some sort of computational limitation, the community can point us to packages that are easily tied to scientific plugins written in compiled languages like C, C++, and Java.

Ease of use

The fact that Python has a strong community also means that answers to questions on almost any development challenge are easily found, but even this is less of a concern compared to other languages because Python is universally accepted as one of the easiest mainstream programming languages to learn (Zelle, n.d.)(Lutz, 2007)(Grandell et al., 2006).  It is also considered one of the cleanest languages, allowing developers to write it incredibly quickly (A.K.A. it is very “writable”) (Grandell et al., 2006)(McKinney, 2012).  This is primarily because of its simple yet structured syntax, but also due to its use of dynamic typing.  We contend that academics who need to write code should learn a bit about proper control structures, and they probably should learn the difference between a hash table and an array.  However, in Python there is no compelling reason to learn many additional computer science concepts, such as the tradeoffs between a floating point number and a “double,” as you would if you were developing in Java, C, or C++.  In Python, data types are just a bit simpler.  Similarly, the syntax vaguely resembles English, with the unusual use of “is”, “pass” and “not”, so it is very easy for one developer to read another developer's code.  This “readability” is furthered by the community convention of “snake_case”, which encourages very descriptive variable names.  So in short, as many of the authors of the study in question are first and foremost Industrial Engineers, Python's reduced learning curve, reputable writability, and genuine readability were incredibly appealing to us.

In Conclusion 

So in the end, I suggest Python for many reasons, but primarily for its strong and justified reputation as a language capable of building a large and stable system while simultaneously being capable of satisfying almost any statistical analysis request, all while being easy to use.

Of course, if I am missing any arguments at all, I would love to hear about them. Please comment below!