Or: Why I suggest you consider using SPMF for all kinds of Pattern Mining Studies
This past year a classmate of mine,
who is highly skilled in data mining, Lei “Erik” Liu, helped me realize that
there is a great opportunity in my dissertation to study unusual workflows using
sequential pattern mining. I will save
the details of my research for other posts, but in short I am working with a
vast amount of project management data.
In this post I share my initial findings on two open source tools available for Sequential Pattern Mining in hopes that others might benefit. I would be thrilled to receive any suggestions others might have in the comments below.
The Problem Statement
Sequential pattern mining is a form of associate rule mining that takes into account the specific order of certain events as the sequence may be relevant to any conclusions drawn. The steps for a sequential pattern mining study usually involve reducing the ordered events into discrete labels (often integers), grouping them by events that occurred simultaneously (or as I argue, effectively simultaneously), and applying a data mining tool to obtain the useful patterns.
During a Frantic Search: The Open Source Tools Available
I investigated two open source packages that claimed to support sequential pattern mining. A Spanish researcher name Chris Borgelt wrote Seqwog and Sequoia in C and they both would appear to be built for scalability as they handled different size files with what would appear to be virtually the same amount of time[1]. However, it has no user interface so I cannot control the algorithm parameters (minimum support, for example) as easily as I would like and I only have the two algorithms to choose from. Also, most examples of sequential pattern mining allow for simultaneous events and these two tools do not seem to. The classic example of a sequential pattern mining study is focused on purchasing patterns. A series of trips to the store for one customer are studied but they usually purchase more than on item per visit (Tan, Steinbach, & Kumar, 2005) so some events are simultaneous. This is also true with my project management data. A user can make more than one change to a work ticket at the same time. There is also a serious lack of documentation with these tools.
Meanwhile, a Professor at Harbin
Institute of Technology Shenzhen Graduate School named Philippe Fournier-Viger
founded a project that has created a tool based in Java to support dozens of
data mining algorithms including eight for sequential rule mining[2]
alone. Some of the sequential pattern
mining algorithms his tool provides support simultaneous events and his team
has done an excellent job of documenting the use of all of the algorithms. They even (ATTENTION PhD STUDENTS!) link to
the papers associated with these algorithms.
The blogosphere touts his contributions
and I am excited about how easy it is to control the algorithm parameters
through an impressive user interface so I moved forward with “SPMF” by Dr Fournier-Viger
and associates.
Of course, being an application
developer by day (and ABD PhD student by night) I am much more familiar with
Java than C so this is part of my bias. It should also not go unnoticed that Dr Fournier-Viger provides the sourcecode and is happy to discuss the possibility of others contributing. I am tremendous supporter of Open Source Software (OSS).
By Request: The Prototype
Under the pressure of my Research Review,[3] I was
asked to prototype my planned sequential pattern mining attempts. This lead to the initial assessment above and then a lot of data prep work so that I could apply at least one of SPMFs algorithms to my data. What I realized in building the prototype is that the real effort is going to be establishing the proper level of detail to represent in the chain of events. In many papers (Liu, Zhang, & Xiong, 2014)(Liu, Ge, et al., 2014), this question is referred to in terms of levels of granularity. In the case of my project management (specifically issue tracking) data, one question might be: will it be advantageous to note every priority change specifically in terms of “from what to what” (E.G. P2 to P1) or should I just capture “priority changed?” Higher specificity is intuitively more useful, but if I am too specific then there might not be enough data to support potentially useful sequences. If it is too low, the patterns might be too obvious. For instance, my data shows that in one project, 56.6% of the time that a ticket is created, a release date change event will occur after the creation. Perhaps as a part of a more detailed study on the circumstances of postponement this may be of interest, but by itself, especially at this level of granularity, it is a bit moot.
Also, the prototype showed that 1000 sequences with well under a dozen possible events in the transactions processed in seconds. Of course, the plan was to add a lot more detail and apply the algorithms to about a half of a million transactions so the question of scalability was still unanswered during my exam.
The initial findings: Low Minimum Support is Costly
I have now been working with the
SPMF tool for 6 months and I can say with great confidence that on my simple
Mac mini, it scales incredibly well. Using
a level of granularity where there are hundreds of events possible (but the transactions are dominated by probably a few dozen), I can tackle a list of ~12k transactions
in a 30-90 minutes with standard parameters. I mention “standard parameters”
as many example studies mention minimum confidence and minimum support levels
in the 50-75% range. As I am sometimes
investigating anomalies, I have experimented with bringing these levels down a
lot further and this seems to be what will bring a request of SPMF (on 12-14k transactions of my data, anyway) from about a minute to many minutes. I understand this is largely due to the fact that for frequent pattern mining algorithm in general, the number of patterns (an the search space) can increase exponentially when the minimum support threshold is decreased. Of course, the legitimacy of bringing these
parameters down too low is a very worthwhile topic and should be addressed in another post.
[1]
Admittedly, these tools are designed to handle a much larger set of transactions and the point where I drew this conclusion, I had only utilized SPMF at more than 1000 transactions.
[2] Dr Philippe
Fournier-Viger differentiates between Sequential Rule Mining and Sequential
Pattern Mining. I hope to address this
differentiation in a different post but please note his map of algorithms here
[3] The
Research Review for my department is called a “Comprehensive Exam” but it might
also be called a “Prospectus Review” in other programs as it should be a formal
way for the committee to provide life saving feedback on the research plan
midway through the dissertation process.
Mine turned out to be, although very stressful, an excellent way to move
from about ~35% done to about ~60-65% in only 3 months. If you have one in your program, I hope you
will look on it as a way to find the light at the end of the tunnel and not
just a form of torture. ‘Although it may
be that as well.
References
Liu, C.,
Ge, Y., Xiong, H., Xiao, K., Geng, W., & Perkins, M. (2014). Proactive
workflow modeling by stochastic processes with application to healthcare
operation and management. KDD ’14 The 20th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, 1593–1602.
doi:10.1145/2623330.2623363
Liu, C.,
Zhang, K., & Xiong, H. (2014). Sequential Pattern Analysis with Right
Granularity 1. doi:10.1109/ICDMW.2014.164
Tan, P.-N.,
Steinbach, M., & Kumar, V. (2005). Introduction to Data Mining, (First
Edition). Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc.
No comments:
Post a Comment