Friday, January 29, 2016

Sequential Pattern Mining: The initial attempts

Or: Why I suggest you consider using SPMF for all kinds of Pattern Mining Studies


This past year a classmate of mine, who is highly skilled in data mining, Lei “Erik” Liu, helped me realize that there is a great opportunity in my dissertation to study unusual workflows using sequential pattern mining.  I will save the details of my research for other posts, but in short I am working with a vast amount of project management data. 

In this post I share my initial findings on two open source tools available for Sequential Pattern Mining in hopes that others might benefit.  I would be thrilled to receive any suggestions others might have in the comments below.

The Problem Statement

Sequential pattern mining is a form of associate rule mining that takes into account the specific order of certain events as the sequence may be relevant to any conclusions drawn.  The steps for a sequential pattern mining study usually involve reducing the ordered events into discrete labels (often integers), grouping them by events that occurred simultaneously (or as I argue, effectively simultaneously), and applying a data mining tool to obtain the useful patterns.


During a Frantic Search: The Open Source Tools Available

I investigated two open source packages that claimed to support sequential pattern mining. A Spanish researcher name Chris Borgelt wrote Seqwog and Sequoia in C and they both would appear to be built for scalability as they handled different size files with what would appear to be virtually the same amount of time[1]. However, it has no user interface so I cannot control the algorithm parameters (minimum support, for example) as easily as I would like and I only have the two algorithms to choose from. Also, most examples of sequential pattern mining allow for simultaneous events and these two tools do not seem to. The classic example of a sequential pattern mining study is focused on purchasing patterns. A series of trips to the store for one customer are studied but they usually purchase more than on item per visit (Tan, Steinbach, & Kumar, 2005) so some events are simultaneous. This is also true with my project management data. A user can make more than one change to a work ticket at the same time. There is also a serious lack of documentation with these tools.

Meanwhile, a Professor at Harbin Institute of Technology Shenzhen Graduate School named Philippe Fournier-Viger founded a project that has created a tool based in Java to support dozens of data mining algorithms including eight for sequential rule mining[2] alone.  Some of the sequential pattern mining algorithms his tool provides support simultaneous events and his team has done an excellent job of documenting the use of all of the algorithms.  They even (ATTENTION PhD STUDENTS!) link to the papers associated with these algorithms.   The blogosphere touts his contributions and I am excited about how easy it is to control the algorithm parameters through an impressive user interface so I moved forward with “SPMF” by Dr Fournier-Viger and associates. 

SPMF

Of course, being an application developer by day (and ABD PhD student by night) I am much more familiar with Java than C so this is part of my bias.  It should also not go unnoticed that Dr Fournier-Viger provides the sourcecode  and is happy to discuss the possibility of others contributing. I am tremendous supporter of Open Source Software (OSS).

By Request: The Prototype

Under the pressure of my Research Review,[3] I was asked to prototype my planned sequential pattern mining attempts.  This lead to the initial assessment above and then a lot of data prep work so that I could apply at least one of SPMFs algorithms to my data. What I realized in building the prototype is that the real effort is going to be establishing the proper level of detail to represent in the chain of events.  In many papers (Liu, Zhang, & Xiong, 2014)(Liu, Ge, et al., 2014), this question is referred to in terms of levels of granularity.  In the case of my project management (specifically issue tracking) data, one question might be: will it be advantageous to note every priority change specifically in terms of “from what to what” (E.G. P2 to P1) or should I just capture “priority changed?”  Higher specificity is intuitively more useful, but if I am too specific then there might not be enough data to support potentially useful sequences.  If it is too low, the patterns might be too obvious.  For instance, my data shows that in one project, 56.6% of the time that a ticket is created, a release date change event will occur after the creation.  Perhaps as a part of a more detailed study on the circumstances of postponement this may be of interest, but by itself, especially at this level of granularity, it is a bit moot.

Also, the prototype showed that 1000 sequences with well under a dozen possible events in the transactions processed in seconds.  Of course, the plan was to add a lot more detail and apply the algorithms to about a half of a million transactions so the question of scalability was still unanswered during my exam.


The initial findings: Low Minimum Support is Costly

I have now been working with the SPMF tool for 6 months and I can say with great confidence that on my simple Mac mini, it scales incredibly well.  Using a level of granularity where there are hundreds of events possible (but the transactions are dominated by probably a few dozen), I can tackle a list of ~12k transactions in a 30-90 minutes with standard parameters. I mention “standard parameters” as many example studies mention minimum confidence and minimum support levels in the 50-75% range.  As I am sometimes investigating anomalies, I have experimented with bringing these levels down a lot further and this seems to be what will bring a request of SPMF (on 12-14k transactions of my data, anyway) from about a minute to many minutes.  I understand this is largely due to the fact that for frequent pattern mining algorithm  in general, the number of patterns (an the search space) can increase exponentially  when the minimum support threshold is decreased.  Of course, the legitimacy of bringing these parameters down too low is a very worthwhile topic and should be addressed in another post.



[1] Admittedly, these tools are designed to handle a much larger set of transactions and the point where I drew this conclusion, I had only utilized SPMF at more than 1000 transactions.

[2] Dr Philippe Fournier-Viger differentiates between Sequential Rule Mining and Sequential Pattern Mining.  I hope to address this differentiation in a different post but please note his map of algorithms here

[3] The Research Review for my department is called a “Comprehensive Exam” but it might also be called a “Prospectus Review” in other programs as it should be a formal way for the committee to provide life saving feedback on the research plan midway through the dissertation process.  Mine turned out to be, although very stressful, an excellent way to move from about ~35% done to about ~60-65% in only 3 months.  If you have one in your program, I hope you will look on it as a way to find the light at the end of the tunnel and not just a form of torture.  ‘Although it may be that as well.

References

Liu, C., Ge, Y., Xiong, H., Xiao, K., Geng, W., & Perkins, M. (2014). Proactive workflow modeling by stochastic processes with application to healthcare operation and management. KDD ’14 The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1593–1602. doi:10.1145/2623330.2623363
Liu, C., Zhang, K., & Xiong, H. (2014). Sequential Pattern Analysis with Right Granularity 1. doi:10.1109/ICDMW.2014.164

Tan, P.-N., Steinbach, M., & Kumar, V. (2005). Introduction to Data Mining, (First Edition). Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc.

No comments:

Post a Comment