Sunday, August 2, 2015

Why Python for Statistical Studies?

The primary motivation of this blog is to support fellow novices in their attempts to take their existing development skills and apply them to data science.  In that vein, I present an investigation of the language Python and why it is so powerful for building applications focused on data analysis.


Analytical Power

Much of Python’s computational strength comes from a library, NumPy, that was added early in its history to allow for incredibly efficient manipulation of very large vectors of data (Van Der Walt, Colbert, & Varoquaux, 2011)(Lutz, 2007).  Previously, this level of computational efficiency had generally required working directly in a compiled language such as C or C++.  However, using a compiled language means that every time a change is made the developer has to wait, sometimes many minutes on a large project, for the code to be converted to binaries.  Python, by contrast, is known for its “rapid turnaround” (Lutz, 2007)(Grandell, Peltomäki, Back, & Salakoski, 2006) because it is an interpreted language: code can be run as soon as it has been edited.  This is a main reason that developers favor interpreted languages such as Python when development time matters.  Over the past decade and a half, Python’s ability to perform large scale computation, along with its popularity among developers, has led the open source community to develop other Python libraries that support a tremendous variety of analysis almost right out of the box (Grandell et al., 2006).
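To make the point about vectors concrete, here is a minimal sketch of the same calculation written two ways: once as a single NumPy expression and once as an ordinary Python loop.  The data, the seed and all of the names are made up for illustration and are not from any study mentioned here.

```python
import numpy as np

# Hypothetical data: one million simulated measurements (illustrative only).
rng = np.random.default_rng(seed=42)
measurements = rng.normal(loc=50.0, scale=5.0, size=1_000_000)

# Vectorized standardization: one expression covers the whole array,
# and the looping happens inside NumPy's compiled code.
z_scores = (measurements - measurements.mean()) / measurements.std()


def standardize_with_loop(values):
    """Same calculation as above, written as explicit Python loops."""
    n = len(values)
    mean = sum(values) / n
    # Population standard deviation, matching NumPy's default (ddof=0).
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]
```

Both versions give the same numbers.  On large arrays the vectorized form is usually much faster, though the exact difference depends on the machine, so it is worth timing on your own data.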

Combined with these open source mathematical libraries, Python is said to have many of the same statistical capabilities as R, SAS or MATLAB (McKinney, 2012)(Nilsen, 2007), while remaining a fully functioning object-oriented language capable of being used to build enormous systems (Lutz, 2007)(Chudoba, Sadílek, Rypl, & Vorechovsky, 2013).  This ability to support “programming-in-the-large” sets it apart from some of the popular early scripting languages such as Perl (McKinney, 2012).  Also, unlike other well-known mathematical languages such as R and MATLAB, Python is very web savvy, as the most popular frameworks built in Python are based on the server-browser paradigm.  This was important to us because we wanted to leverage the established power of the web browser for our user interface and also offer the option of making parts of this system open to anybody.

Generous Community

Python also has an incredibly enthusiastic and generous community that freely shares knowledge via blog posts and forums (Lutz, 2007).  So if we ever do find some sort of computational limitation, the community can point us to packages that tie easily into scientific extensions written in compiled languages like C, C++ and Java.

Ease of use

Python’s strong community also means that answers to almost any development question are easy to find, but this is less of a concern than with other languages because Python is widely regarded as one of the easiest mainstream programming languages to learn (Zelle, n.d.)(Lutz, 2007)(Grandell et al., 2006).  It’s also considered one of the cleanest languages, allowing developers to write it incredibly quickly (in other words, it is very “writable”) (Grandell et al., 2006)(McKinney, 2012).  This is primarily because of its simple yet structured syntax, but also due to its use of dynamic typing.  We contend that academics who need to write code should learn a bit about proper control structures, and they probably should learn the difference between a hash table and an array.  However, in Python there is no compelling reason to learn many additional computer science concepts, such as the tradeoffs between a “float” and a “double”, as you would if you were developing in Java, C, or C++.  In Python, data types are just a bit simpler.  Similarly, the syntax vaguely resembles English, with keywords such as “is”, “pass” and “not”, so it is very easy for one developer to read another developer’s code.  This “readability” is furthered by the community convention of “snake_case” naming, which lends itself to very descriptive variable names.  So in short, as many of the authors of the study in question are first and foremost Industrial Engineers, Python’s reduced learning curve, reputable writability and genuine readability were incredibly appealing to us.
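As a small illustration of the points above about dynamic typing, English-like keywords and snake_case names, here is a short sketch.  The function and the cycle-time data are hypothetical, invented only for this example.

```python
def summarize_cycle_times(cycle_times_in_minutes=None):
    """Return the average cycle time, or None when there is nothing to average."""
    # "is" and "not" read almost like plain English.
    if cycle_times_in_minutes is None or not cycle_times_in_minutes:
        return None
    total_minutes = sum(cycle_times_in_minutes)
    return total_minutes / len(cycle_times_in_minutes)


# No type declarations are needed: the same function accepts whole numbers,
# decimals, or a mix of both, and there is only one built-in "float" type.
average_from_integers = summarize_cycle_times([12, 15, 9])     # 12.0
average_from_mixed = summarize_cycle_times([12.5, 15, 9.5])
nothing_to_average = summarize_cycle_times([])                 # None
```

The descriptive names carry most of the documentation here, which is what makes code like this easy for a colleague to pick up.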

In Conclusion 

So in the end, I suggest Python for many reasons, but primarily for its strong and justified reputation as a language capable of building a large and stable system while also satisfying almost any statistical analysis request, all while being easy to use.

Of course, if I am missing any arguments at all, I would love to hear about them. Please comment below!