Wednesday, February 3, 2016

Getting started with SPMF

Or: Transforming your data for use in SPMF


For the last chapter(s) of my dissertation, I have been excited to use the open source tool SPMF to apply sequential pattern mining algorithms to a large dataset that includes literally millions of events tied to half a million work tickets.

In the process of preparing and transforming my data, I have learned a great deal about how to prepare an input file for SPMF.  I summarize the steps here in hopes they may help other researchers get started with this powerful open source data mining tool.

Please note, I am not claiming to be a pattern mining expert (yet!); this is simply the approach I have developed to reduce a multi-dimensional data set into the format required by SPMF.  There could be many other legitimate approaches!

To use the (literal) text book example of associative pattern mining  . . .

In the mid-90s, a “major Midwest based retailer” mined their transaction data to find an unexpected relationship between the purchase of beer and diapers.  

If analysts today were to conduct a similar study with SPMF, I would suggest they:

1.    Reduce every item purchased to a meaningful label
2.    Translate each label into a “retrievable integer”
3.    Assemble every transaction (the set of items purchased by the same person) into the required string format by concatenating the retrievable integers in the proper way and putting them all into one file
4.    Apply an algorithm available in SPMF to this input file


I will now describe how I accomplish these steps using my system, written in Python and built on the Django framework.   Python is an excellent language for both advanced data analysis and application building, as I explain here.

Step 0: Assembling Item Sets

SPMF is capable of seeking many kinds of patterns in data, including association rules, itemsets, sequential patterns, sequential rules, and high-utility patterns, and it can also perform sequence prediction, classification, and clustering.  In my case, I am especially interested in sequential pattern mining, but I have adopted a strategy that acknowledges that many events happen simultaneously or, as I argue, effectively simultaneously.

As in the classic shopping cart analogy, it is relevant that the diapers and the beer were bought on the same trip.  Even if the retailer were able to obtain the customer's path through the store, it is much less interesting[1] whether the diapers were put in the cart before the beer.  Similarly, if a user in my data makes two changes within a few minutes of each other, I do not want to distinguish which change came first, largely because the order within one “session[2]” has no effect on the outcome of the work ticket.  Does it really matter whether the project leader changes the priority of a ticket before or after assigning it to a specific contributor?  The assigned developer will probably not see the ticket until long after the project manager has made their changes, so in terms of sequential information it is much more important that the developer made their changes after the project manager than that the developer might mark the ticket as resolved before attaching a solution[3].

In short, I need to represent itemsets within the sequential data.  To accomplish this, all events from the same user within a certain threshold define a session.  That threshold is currently set at one hour, but further research will determine the best criteria.
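The session-building step described above can be sketched in plain Python.  This is a minimal illustration, assuming simple (user, timestamp) tuples rather than my Django QuerySets, pre-sorted by user and then by time, with the one-hour threshold:

```python
from datetime import datetime, timedelta

def build_sessions(events, threshold=timedelta(hours=1)):
    """Group events into sessions: a new session starts whenever the
    user changes or the gap to the previous event exceeds the threshold.
    `events` is a list of (user, timestamp) tuples, assumed pre-sorted
    by user and then by timestamp."""
    sessions = []
    current = []
    for user, when in events:
        if current and (user != current[-1][0] or when - current[-1][1] > threshold):
            sessions.append(current)
            current = []
        current.append((user, when))
    if current:
        sessions.append(current)
    return sessions
```

In the real system each session would be a QuerySet instead of a list of tuples, but the grouping logic is the same.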

Within my code, I tried two data structures to represent itemsets: first a Python “list of lists,” then a “list of Django QuerySets.”  I found the second much more useful because I could use the QuerySet methods (especially filter and count) throughout the transformation code.

This is a highly simplified example:


    for session_query_set in session_list:
        session_event_count = session_query_set.count()
        session_attachment_event_list = session_query_set.filter(action_name="attachment")

        for attachment_event in session_attachment_event_list:
            # Do something with this attachment event
            pass


Of course, regardless of whether or not you need to utilize itemsets (or sessions, as I call them in my data), you will still need to reduce each event to a meaningful label.

Step 1: Reducing Events to Meaningful Labels

Based on my summary of steps above, this step may sound simple.  However, somewhere in the process of discovering the infamous beer-and-diapers relationship, the “Midwestern retail analysts” had to decide:

 . . . were they looking for items purchased along with “Huggies Diapers Overnight Size 3,” “All Huggies Diaper Products,” “All Diaper Products,” “All Baby Doody Management Products,” or possibly even “All Baby Products (Non-Elmo Related)”?

This decision is often referred to as the choice of granularity (Han, Kamber, & Pei, 2012) (Liu, Zhang, & Xiong, 2014).  Unfortunately, this step is very specific to the data you are studying and is probably best covered in a series of book chapters, though it might also be covered in a future post.  For now, consider the events occurring to a work ticket being reduced to labels (in this case, with sessions represented) such as:


{CREATE_Bug, Attachment} {VERSION_set} {Resolution_Fixed, Attachment} {Status_Closed}


In the case of my code utilizing a list of QuerySets, a grossly simplified example of the code to assemble the list of label lists might be similar to:


    transaction_list = []
    for session_query_set in session_list:
        new_session_labels = []

        session_attachment_event_list = session_query_set.filter(
            action_name="attachment")
        for attachment_event in session_attachment_event_list:
            new_session_labels.append("Attachment")

            ## other translations would go here ....

        transaction_list.append(new_session_labels)

    return transaction_list



However, these labels are meant to make sense of the output later, and there is no reason to store them like this just yet.  So the next step is to translate the labels into the format SPMF requires: positive integers.

Step 2 & 3: Translate the Event Labels and Concatenate Retrievable Integers

In the SPMF input format, each event is represented by a positive integer, and the events are listed in order from first to last.  Events are separated by a single space character.  If itemsets are involved, each itemset is terminated by a -1.

It is important to note that there can be only one space character between each event or itemset separator (-1), and each line must end with a -2 plus both “end line” characters.

The result might look like[4]:


622 638 621 364 -1 574 -1 575 -1 252 256 -1 250 -1 -2\r\n


(In most IDEs, the carriage return and line-ending characters would not be visible.)

Of course, I keep referring to these integers as “retrievable integers” because what good are they if their meaning is not completely consistent throughout your study?

In order to be sure of this retrievability, I created a simple Django model:


class Sequence_keys(models.Model):
    id = models.AutoField(primary_key=True)
    sequence_lbl = models.CharField(max_length=100, default='uNk')


Of course, primary_key=True makes sure the integer key is unique.  I then fetch the integer via:


def get_sequence_int(seq_lbl):
    seq_obj, created = Sequence_keys.objects.get_or_create(sequence_lbl=seq_lbl)
    return seq_obj.id


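For readers without Django, the same “retrievable integer” idea can be sketched with a plain in-memory registry (a hypothetical stand-in for the model above, not my actual code):

```python
class LabelRegistry:
    """Django-free sketch of the retrievable-integer idea: each label
    is assigned a stable positive integer the first time it is seen,
    and the mapping can be reversed to interpret SPMF output later."""

    def __init__(self):
        self._by_label = {}
        self._by_id = {}

    def get_int(self, label):
        if label not in self._by_label:
            new_id = len(self._by_label) + 1  # SPMF requires positive integers
            self._by_label[label] = new_id
            self._by_id[new_id] = label
        return self._by_label[label]

    def get_label(self, i):
        return self._by_id[i]
```

The database-backed version has the advantage that the mapping survives between runs, which is essential if the study is performed in stages.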

Subsequently, a simple routine can assemble each transaction into the format shown above, and a (potentially very large) file can be prepared for SPMF.
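Such a routine might look like the following sketch, assuming each transaction is a list of itemsets and each itemset is already a list of retrievable integers (the get_sequence_int lookups having been done):

```python
def to_spmf_line(transaction):
    """Serialize one transaction (a list of itemsets, each a list of
    positive integers) into the SPMF input format: single spaces between
    items, -1 terminating each itemset, -2 terminating the line."""
    parts = []
    for itemset in transaction:
        parts.extend(str(i) for i in itemset)
        parts.append("-1")
    parts.append("-2")
    return " ".join(parts)

def write_spmf_file(transactions, path):
    # newline="" prevents Python from translating the explicit \r\n
    with open(path, "w", newline="") as f:
        for transaction in transactions:
            f.write(to_spmf_line(transaction) + "\r\n")
```

Feeding it the example transaction from above would reproduce the line shown earlier.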

In conclusion

I speculate that some of these steps are similar for other data mining tools besides SPMF, as the basic principles of associative pattern mining should be common throughout.

After reading all this, you might be taken aback by how much prep work is involved and, of course, in this post I hardly addressed the step that is taking the vast, vast majority of my time: translating my database of events into meaningful labels.  This realization you and I are having reminds me of a quote:

“80-85% of your effort in a data mining based research project will be focused on data exploration and pre-processing; not running the algorithms  . . .”
–A Regarded ASU professor (name to be revealed when I get his permission)

Up next:

Sometime this year I hope to post some tips on how to run SPMF effectively (Step 4) and (ideally) also share some code on how to partition the output.  I have also discovered the hard way that another majorly time-consuming task is analyzing the output files.  As I am starting to run scenarios where over 100,000 output lines might be produced, I have created scripts that parse these files and give a very abridged version of the output to answer very specific questions.  Please stay tuned . . .





[1] See also papers such as (Baena-García & Morales-Bueno, 2012) on objectively measuring “interestingness.”

[2] For my data, I used research pertaining to web sessions (Park, Suresh, & Jeong, 2008)(Xiao & Zhang, 2001)(Bianco, Mardente, Mellia, Munafò, & Muscariello, 2005) to guide my definition of a user session. 

[3] For online retailers, the order of “cart adds” before a purchase is made is readily available.  Does anyone know if this sequential information is utilized?

[4] For those who recognize that the “screen shots” might be a reference to the Apple IIe or Apple IIc, major hats off to you.  And yes, I first learned to code on such epic machines because I was born in the '70s and these machines were “Rad!!”


References

Baena-García, M., & Morales-Bueno, R. (2012). Mining interestingness measures for string pattern mining. Knowledge-Based Systems, 25(1), 45–50. doi:10.1016/j.knosys.2011.01.013
Bianco, A., Mardente, G., Mellia, M., Munafò, M., & Muscariello, L. (2005). Web user session characterization via clustering techniques. GLOBECOM - IEEE Global Telecommunications Conference, 2, 1102–1107. doi:10.1109/GLOCOM.2005.1577807
Han, J., Kamber, M., & Pei, J. (2012). Data cube technology. In Data Mining: Concepts and Techniques (3rd ed., pp. 187–242). Boston: Morgan Kaufmann. doi:10.1016/B978-0-12-381479-1.00005-8
Liu, C., Zhang, K., & Xiong, H. (2014). Sequential pattern analysis with right granularity. doi:10.1109/ICDMW.2014.164
Park, S., Suresh, N. C., & Jeong, B. K. (2008). Sequence-based clustering for Web usage mining: A new experimental framework and ANN-enhanced K-means algorithm. Data and Knowledge Engineering, 65(3), 512–543. doi:10.1016/j.datak.2008.01.002
Xiao, J., & Zhang, Y. (2001). Clustering of web users using session-based similarity measures. In Proceedings of the 2001 International Conference on Computer Networks and Mobile Computing (pp. 223–228). IEEE. doi:10.1109/ICCNMC.2001.962600


Friday, January 29, 2016

Sequential Pattern Mining: The initial attempts

Or: Why I suggest you consider using SPMF for all kinds of Pattern Mining Studies


This past year, a classmate of mine who is highly skilled in data mining, Lei “Erik” Liu, helped me realize that there is a great opportunity in my dissertation to study unusual workflows using sequential pattern mining.  I will save the details of my research for other posts, but in short, I am working with a vast amount of project management data.

In this post I share my initial findings on two open source tools available for Sequential Pattern Mining in hopes that others might benefit.  I would be thrilled to receive any suggestions others might have in the comments below.

The Problem Statement

Sequential pattern mining is a form of association rule mining that takes into account the specific order of events, as the sequence may be relevant to any conclusions drawn.  The steps of a sequential pattern mining study usually involve reducing the ordered events to discrete labels (often integers), grouping events that occurred simultaneously (or, as I argue, effectively simultaneously), and applying a data mining tool to obtain the useful patterns.


During a Frantic Search: The Open Source Tools Available

I investigated two open source packages that claim to support sequential pattern mining. A Spain-based researcher named Christian Borgelt wrote Seqwog and Sequoia in C, and both appear to be built for scalability, as they handled files of different sizes in virtually the same amount of time[1]. However, they have no user interface, so I could not control the algorithm parameters (minimum support, for example) as easily as I would like, and I would have only two algorithms to choose from. Also, most examples of sequential pattern mining allow for simultaneous events, and these two tools do not seem to. The classic example of a sequential pattern mining study focuses on purchasing patterns: a series of trips to the store by one customer is studied, but the customer usually purchases more than one item per visit (Tan, Steinbach, & Kumar, 2005), so some events are simultaneous. This is also true of my project management data; a user can make more than one change to a work ticket at the same time. There is also a serious lack of documentation for these tools.

Meanwhile, a professor at Harbin Institute of Technology Shenzhen Graduate School named Philippe Fournier-Viger founded a project that has created a Java-based tool supporting dozens of data mining algorithms, including eight for sequential rule mining[2] alone.  Some of the sequential pattern mining algorithms his tool provides support simultaneous events, and his team has done an excellent job of documenting the use of all of them.  They even (ATTENTION PhD STUDENTS!) link to the papers associated with these algorithms.  The blogosphere touts his contributions, and I am excited about how easy it is to control the algorithm parameters through an impressive user interface, so I moved forward with “SPMF” by Dr. Fournier-Viger and associates.

SPMF

Of course, being an application developer by day (and ABD PhD student by night), I am much more familiar with Java than C, so this is part of my bias.  It should also not go unnoticed that Dr. Fournier-Viger provides the source code and is happy to discuss the possibility of others contributing. I am a tremendous supporter of Open Source Software (OSS).

By Request: The Prototype

Under the pressure of my Research Review,[3] I was asked to prototype my planned sequential pattern mining attempts.  This led to the initial assessment above and then a lot of data prep work so that I could apply at least one of SPMF's algorithms to my data. What I realized in building the prototype is that the real effort will be establishing the proper level of detail to represent in the chain of events.  In many papers (Liu, Zhang, & Xiong, 2014)(Liu, Ge, et al., 2014), this question is framed in terms of levels of granularity.  In the case of my project management (specifically issue tracking) data, one question might be: will it be advantageous to note every priority change specifically in terms of “from what to what” (e.g., P2 to P1), or should I just capture “priority changed”?  Higher specificity is intuitively more useful, but if I am too specific, there might not be enough data to support potentially useful sequences.  If the specificity is too low, the patterns might be too obvious.  For instance, my data shows that in one project, 56.6% of the time a ticket is created, a release date change event occurs after the creation.  Perhaps as part of a more detailed study on the circumstances of postponement this may be of interest, but by itself, especially at this level of granularity, it is a bit moot.
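To make the granularity trade-off concrete, here is a toy sketch (the field values and label names are hypothetical, not my actual labeling scheme) reducing the same priority-change event at two levels:

```python
def priority_label(old_priority, new_priority, fine_grained=False):
    """Reduce a priority-change event to a label at one of two
    granularity levels: a coarse "something changed" label, or a
    fine-grained "from what to what" label."""
    if fine_grained:
        return "PRIORITY_{}_to_{}".format(old_priority, new_priority)
    return "PRIORITY_changed"
```

With the fine-grained form, "P2 to P1" and "P3 to P4" become distinct events in the sequence alphabet, which multiplies the number of possible labels and thins the support for each one.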

Also, the prototype showed that 1,000 sequences, with well under a dozen possible events per transaction, processed in seconds.  Of course, the plan was to add much more detail and apply the algorithms to about half a million transactions, so the question of scalability remained unanswered during my exam.


The initial findings: Low Minimum Support is Costly

I have now been working with the SPMF tool for six months, and I can say with great confidence that even on my simple Mac mini it scales incredibly well.  Using a level of granularity with hundreds of possible events (though the transactions are dominated by probably a few dozen), I can tackle a list of ~12k transactions in 30–90 minutes with standard parameters. I mention “standard parameters” because many example studies use minimum confidence and minimum support levels in the 50–75% range.  As I am sometimes investigating anomalies, I have experimented with bringing these levels down much further, and this seems to be what takes a run of SPMF (on 12–14k transactions of my data, anyway) from about a minute to many minutes.  I understand this is largely because, for frequent pattern mining algorithms in general, the number of patterns (and the search space) can increase exponentially as the minimum support threshold is decreased.  Of course, the legitimacy of bringing these parameters down too low is a very worthwhile topic and should be addressed in another post.
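That blow-up is easy to see even at toy scale.  The sketch below is a brute-force frequent-itemset counter (not one of SPMF's algorithms) run on a handful of hypothetical shopping transactions; it simply counts how many itemsets clear the support threshold as it drops:

```python
from itertools import combinations

def count_frequent_itemsets(transactions, min_support):
    """Brute-force count of itemsets meeting min_support (a fraction of
    the transaction count).  Exponential in the number of distinct
    items -- tolerable here only because the example is tiny."""
    items = sorted({item for t in transactions for item in t})
    frequent = 0
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            matches = sum(1 for t in transactions if set(candidate) <= t)
            if matches / len(transactions) >= min_support:
                frequent += 1
    return frequent

# Hypothetical shopping data, four transactions
transactions = [{"beer", "diapers", "chips"},
                {"beer", "diapers"},
                {"beer", "chips", "soda"},
                {"diapers", "soda"}]
```

Dropping min_support from 0.75 to 0.25 multiplies the number of frequent itemsets several times over even in this four-transaction example; real algorithms prune the search cleverly, but the pattern count itself still grows sharply as the threshold falls.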



[1] Admittedly, these tools are designed to handle a much larger set of transactions, and at the point where I drew this conclusion, I had not yet utilized SPMF on more than 1,000 transactions.

[2] Dr. Philippe Fournier-Viger differentiates between Sequential Rule Mining and Sequential Pattern Mining.  I hope to address this differentiation in a different post, but please note his map of algorithms here.

[3] The Research Review in my department is called a “Comprehensive Exam,” but it might be called a “Prospectus Review” in other programs, as it should be a formal way for the committee to provide life-saving feedback on the research plan midway through the dissertation process.  Mine turned out to be, although very stressful, an excellent way to move from roughly 35% done to 60–65% done in only three months.  If you have one in your program, I hope you will look on it as a way to find the light at the end of the tunnel and not just a form of torture.  Although it may be that as well.

References

Liu, C., Ge, Y., Xiong, H., Xiao, K., Geng, W., & Perkins, M. (2014). Proactive workflow modeling by stochastic processes with application to healthcare operation and management. In KDD '14: The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1593–1602). doi:10.1145/2623330.2623363
Liu, C., Zhang, K., & Xiong, H. (2014). Sequential pattern analysis with right granularity. doi:10.1109/ICDMW.2014.164
Tan, P.-N., Steinbach, M., & Kumar, V. (2005). Introduction to Data Mining (First Edition). Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc.