Gigantic Data: Getting started with SPMF

Or: Transforming your data for use in SPMF

For the last chapter(s) of my dissertation, I have been excited to use the open source tool SPMF to apply sequential pattern mining algorithms to a large dataset that includes literally millions of events tied to half a million work tickets.

In the process of preparing and transforming my data, I have learned a great deal about how to prepare an input file for SPMF. I summarize the steps here in hopes they may help other researchers get started with this powerful open source data mining tool.

Please note, I am not claiming to be a pattern mining expert(yet!) and this is simply the manner in which I have developed to reduce a multi-dimensional data set into the format required for SPMF. There could be many other legitimate approaches!

To use the (literal) text book example of associative pattern mining . . .

In the mid-90s, a “major Midwest based retailer” mined their transaction data to find an unexpected relationship between the purchase of beer and diapers.

If the analysts today were to conduct a similar study with SPMF, I would suggest they:

1. Reduce every item purchased to a meaningful label

2. Translate the label into a “retrievable integer”

3. Assemble every transaction (set of items purchased by the same person) into the string format needed by concatenating the retrievable integers in the proper way and putting them all into one file

4. Applying an algorithm available in SPMF to this input file

I will now describe how I accomplish these steps using my system written in Python and built in the Django Framework. Python is an excellent language for both advanced data analysis and application building as I explain here.

Step 0: Assembling Item Sets

SPMF is capable of seeking all kinds of patterns in data including association rules, itemsets, sequential patterns, sequential rules, high-utility patterns, and perform sequence prediction, classification, and clustering. In my case, I am especially interested in sequential pattern mining but I have adopted the strategy of acknowledging that many events happen simultaneously or, as I argue, effectively simultaneously.

As in the classic shopping cart analogy, it is relevant that the diapers and the beer were bought in the same trip. Even if the retailer were able to obtain the customers path in the store, it is much less interesting[1] that the diapers were put in his cart before the beer. Similarly, in my data if a user makes two changes within a few minutes of each other, I do not want to distinguish whether either change came first. This is largely due to the lack of effect this sequence within one “session[2]” has on the outcome of the work ticket. Does it really matter whether the project leader changes the priority of a ticket before or after he assigns a ticket to a specific contributor? The assigned developer will probably not see the ticket until long after the project manager has made their changes so in terms of sequential information, it is much more important that the developer made their changes after the project manager and not that the developer might mark the ticket as resolved before attaching a solution[3]. So in short, I need to represent itemsets within the sequential data. To accomplish this, all events from the same user within a certain threshold define a session. That threshold is currently set at one hour but further research will determine the best criteria.

Within my code, I tried two data structures to represent itemsets. First, I tried a python “list of lists” and then a “list of Django QuerySets.” I found the second much more useful as I could use the queryset methods (especially filter and count) throughout the transformation code.

This is a highly simplified example:

for session_query_set in session_list:

session_event_count = session_query_set.count()

session_attachment_event_list = session_query_set.filter(action_name="attachment")

for attachment_event in session_attachment_event_list:

# Do something with this attachment event

Of course, regardless of whether or not it is necessary for you to utilize itemsets (or sessions as I call them in my data), you will still need to reduce each event into a meaningful label.

Step 1: Reducing Events to Meaningful Labels

Based on my summary of steps above, this step may sound simple. However, somewhere in the process of discovering the infamous beer and diapers relationship, the “Midwestern retail analysts” had to decide:

. . were they looking for items purchased along with “Huggies Diapers Overnight Size 3”, “All Huggies Diaper Products”, “All Diaper Products,” “All Baby Doody Management Products”, or possibly even “All Baby Products (Non-Elmo Related).”

This is a decision often referred to as the choice of granularity (Jiawei Han, Kamber, & Pei, 2012) (Liu, Zhang, & Xiong, 2014). Unfortunately, this step is very specific to the data you are studying and is probably best covered in a series of book chapters, but might also be covered in a future post. For now, consider the events occurring to a work ticket being reduced to labels (in this case, with sessions being represented) such as:

{CREATE_Bug, Attachment, } {VERSION_set, } {Resolution_Fixed, Attachment, } {Status_Closed, }

In the case of my code utilizing a list of QuerySets, a grossly simplified example of the code to assemble the list of label lists might be similar to:

transaction_list = []

for session_query_set in session_list:

new_session_labels = []

session_attachment_event_list =

session_query_set.filter(action_name="attachment")

for attachment_event in session_attachment_event_list:

new_session_labels.append("Attachment")

## other translations would go here ....

transaction_list.append(new_session_labels)

return transaction_list

However, these labels are to be used to make sense of the output later and there is no reason to store them like this just yet. So the next step is to translate the labels into the format SPMF requires: positive integers.

Step 2 & 3: Translate the Event Labels and Concatenate Retrievable Integers

In the SPMF input format, each event is represented by a positive integer and they are listed in order from first to last occurring. All events are separated by one “space” character. If there are itemsets involved, each itemset is separated by a -1.

It is important to note that you can only have one space character between each event or itemset separator (-1) and each line needs to end with a -2 plus both “end line” characters.

The result might look like[4]:

622 638 621 364 -1 574 -1 575 -1 252 256 -1 250 -1 -2\r\n

(In most ide's, the carriage return and endline would not typically be visible)

Of course, I keep referring to these integers as “retrievable integers” because what good are they if their meaning is not completely consistent throughout your study?

In order to be sure of this retrievability, I created a simple Django model:

class Sequence_keys(models.Model):

id = models.AutoField(primary_key=True)

sequence_lbl = models.CharField(max_length=100, default='uNk')

Of course, the primary_key=True makes sure the integer key is unique. And then I would fetch the integer via:

def get_sequence_int(seq_lbl):

seq_obj, cr = Sequence_keys.objects.get_or_create(sequence_lbl=seq_lbl)

Subsequently a simple routine can assemble each transaction into the format shown above and a (potentially very large) file could be prepared for SPMF.

In conclusion

I speculate some of these steps are probably similar with other data mining tools besides SPMF for the basic principles of associative pattern mining should be common throughout.

After reading all this, you might be taken aback by how much prep work this is and, of course in this post, I hardly addressed the step that is taking the vast, vast majority of my time: translating my database of events into meaningful labels. This realization you and I are having reminds me of a quote from theAZ:

“80-85% of your effort in a data mining based research project will be focused on data exploration and pre-processing; not running the algorithms . . .”

–A Regarded ASU professor (name to be revealed when I get his permission)

Up next:

Sometime this year I hope to post some tips on how to run SPMF effectively (Step 4) and (ideally) also share some code on how to partition the output. I have also discovered the hard way that another majorly time consuming task is to analyze the output files. As I am starting to use scenarios where literally over a 100,000 output lines might be produced, I have created scripts that will parse these files and give a very abridged version of the output to answer very specific questions. Please stay tuned . ..

[1] See also papers such as (Baena-Garcı´a & Morales-Bueno, 2012) on objectively measuring “interestingness.”

[2] For my data, I used research pertaining to web sessions (Park, Suresh, & Jeong, 2008)(Xiao & Zhang, 2001)(Bianco, Mardente, Mellia, Munafò, & Muscariello, 2005) to guide my definition of a user session.

[3] For online retailers, the order of “cart adds” before a purchase is made is readily available. Does anyone know if this sequential information is utilized?

[4] For those that recognize the “screen shots” might be a reference to the Apple IIe or Apple IIc, major hats off to you. And yes, I first learned to code on such epic machines bc I am was born in the 70s and these machines were “Rad!!”

References

Baena-Garcı´a, M., & Morales-Bueno, R. (2012). Mining interestingness measures for string pattern mining. Knowledge-Based Systems, 25(1), 45–50. doi:10.1016/j.knosys.2011.01.013

Bianco, a., Mardente, G., Mellia, M., Munafò, M., & Muscariello, L. (2005). Web user session characterization via clustering techniques. GLOBECOM - IEEE Global Telecommunications Conference, 2, 1102–1107. doi:10.1109/GLOCOM.2005.1577807

Han, J., Kamber, M., & Pei, J. (2012). 5 - Data Cube Technology. In J. H. Kamber & J. Pei (Eds.), Data Mining (Third Edition) (Third Edit., pp. 187–242). Boston: Morgan Kaufmann. doi:http://dx.doi.org/10.1016/B978-0-12-381479-1.00005-8

Liu, C., Zhang, K., & Xiong, H. (2014). Sequential Pattern Analysis with Right Granularity 1. doi:10.1109/ICDMW.2014.164

Park, S., Suresh, N. C., & Jeong, B. K. (2008). Sequence-based clustering for Web usage mining: A new experimental framework and ANN-enhanced K-means algorithm. Data and Knowledge Engineering, 65(3), 512–543. doi:10.1016/j.datak.2008.01.002

Xiao, J., & Zhang, Y. (2001). Clustering of web users using session-based similarity measures. In Computer Networks and Mobile Computing, 2001. Proceedings. 2001 International Conference on (pp. 223–228). IEEE. doi:10.1109/ICCNMC.2001.962600

Gigantic Data

Wednesday, February 3, 2016

Getting started with SPMF