Atoms 2 Bits

Applied Analytics in Retail and Supply Chain Management

Kernel Density Estimation in Python

Last week Michael Lerner posted a nice explanation of the relationship between histograms and kernel density estimation (KDE). I've made some attempts in this direction before (both in the scikit-learn documentation and in our upcoming textbook), but Michael's use of interactive javascript widgets makes the relationship extremely intuitive. I had been planning to write a similar post on the theory behind KDE and why it's useful, but Michael took care of that part. Instead, I'm going to focus here on comparing the actual implementations of KDE currently available in Python. If you're unsure what kernel density estimation is, read Michael's post and then come back here.

There are several options available for computing kernel density estimates in Python. The question of the optimal KDE implementation for any situation, however, is not entirely straightforward, and depends a lot on what your particular goals are. Here are the four KDE implementations I'm aware of in the SciPy/Scikits stack:

Each has advantages and disadvantages, and each has its area of applicability. I'll start with a table summarizing the strengths and weaknesses of each, before discussing each feature in more detail and running some simple benchmarks to gauge their computational cost:

The Big Data Brain Drain: Why Science is in Trouble

Regardless of what you might think of the ubiquity of the "Big Data" meme, it's clear that the growing size of datasets is changing the way we approach the world around us. This is true in fields from industry to government to media to academia and virtually everywhere in between. Our increasing abilities to gather, process, visualize, and learn from large datasets is helping to push the boundaries of our knowledge.

But where scientific research is concerned, this recently accelerated shift to data-centric science has a dark side, which boils down to this: the skills required to be a successful scientific researcher are increasingly indistinguishable from the skills required to be successful in industry. While academia, with typical inertia, gradually shifts to accommodate this, the rest of the world has already begun to embrace and reward these skills to a much greater degree. The unfortunate result is that some of the most promising upcoming researchers are finding no place for themselves in the academic community, while the for-profit world of industry stands by with deep pockets and open arms.

Understanding the FFT Algorithm

The Fast Fourier Transform (FFT) is one of the most important algorithms in signal processing and data analysis. I've used it for years, but having no formal computer science background, It occurred to me this week that I've never thought to ask how the FFT computes the discrete Fourier transform so quickly. I dusted off an old algorithms book and looked into it, and enjoyed reading about the deceptively simple computational trick that JW Cooley and John Tukey outlined in their classic 1965 paper introducing the subject.

The goal of this post is to dive into the Cooley-Tukey FFT algorithm, explaining the symmetries that lead to it, and to show some straightforward Python implementations putting the theory into practice. My hope is that this exploration will give data scientists like myself a more complete picture of what's going on in the background of the algorithms we use.

The Game of Life in Python

In 1970 the British Mathematician John Conway created his "Game of Life" -- a set of rules that mimics the chaotic yet patterned growth of a colony of biological organisms. The "game" takes place on a two-dimensional grid consisting of "living" and "dead" cells, and the rules to step from generation to generation are simple:

  • Overpopulation: if a living cell is surrounded by more than three living cells, it dies.
  • Stasis: if a living cell is surrounded by two or three living cells, it survives.
  • Underpopulation: if a living cell is surrounded by fewer than two living cells, it dies.
  • Reproduction: if a dead cell is surrounded by exactly three cells, it becomes a live cell.

By enforcing these rules in sequential steps, beautiful and unexpected patterns can appear.

I was thinking about classic problems that could be used to demonstrate the effectiveness of Python for computing and visualizing dynamic phenomena, and thought back to a high school course I took where we had an assignment to implement a Game Of Life computation in C++. If only I'd had access to IPython and associated tools back then, my homework assignment would have been a whole lot easier!

Here I'll use Python and NumPy to compute generational steps for the game of life, and use my JSAnimation package to animate the results.

XKCD Plots in Matplotlib: Going the Whole Way

One of the most consistently popular posts on this blog has been my XKCDify post, where I followed in the footsteps of others to write a little hack for xkcd-style plotting in matplotlib. In it, I mentioned the Sketch Path Filter pull request that would eventually supersede my ugly little hack.

Well, "eventually" has finally come. Observe:

In [1]:
%pylab inline

Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.kernel.zmq.pylab.backend_inline].
For more information, type 'help(pylab)'.

In [2]:
plt.xkcd()  # Yes...
plt.plot(sin(linspace(0, 10)))
plt.title('Whoo Hoo!!!')
Out[2]:
<matplotlib.text.Text at 0x2fade10>

The plt.xkcd() function enables some rcParam settings which can automatically convert any matplotlib plot into XKCD style. You can peruse the matplotlib xkcd gallery here for inspiration, or read on where I'll show off some of my favorite of the possibilities.

Numba vs. Cython: Take 2

Last summer I wrote a post comparing the performance of Numba and Cython for optimizing array-based computation. Since posting, the page has received thousands of hits, and resulted in a number of interesting discussions. But in the meantime, the Numba package has come a long way both in its interface and its performance.

Here I want to revisit those timing comparisons with a more recent Numba release, using the newer and more convenient autojit syntax, and also add in a few additional benchmarks for completeness. I've also written this post entirely within an IPython notebook, so it can be easily downloaded and modified.

As before, I'll use a pairwise distance function. This will take an array representing M points in N dimensions, and return the M x M matrix of pairwise distances. This is a nice test function for a few reasons. First of all, it's a very clean and well-defined test. Second of all, it illustrates the kind of array-based operation that is common in statistics, datamining, and machine learning. Third, it is a function that results in large memory consumption if the standard numpy broadcasting approach is used (it requires a temporary array containing M * M * N elements), making it a good candidate for an alternate approach.

IPython Notebook: Javascript/Python Bi-directional Communication

I've been working with javascript and the IPython notebook recently, and found myself in need of a way to pass data back and forth between the Javascript runtime and the IPython kernel. There's a bit of information about this floating around on various mailing lists and forums, but no real organized tutorial on the subject. Partly this is because the tools are relatively specialized, and partly it's because the functionality I'll outline here is planned to be obsolete in the 2.0 release of IPython.

Nevertheless, I thought folks might be interested to hear what I've learned. What follows are a few basic examples of moving data back and forth between the IPython kernel and the browser's javascript. Note that if you're viewing this statically (i.e. on the blog or on nbviewer) then the javascript calls below will not work: to see the code in action, you'll need to download the notebook and open it in IPython.

A Simple Animation: The Magic Triangle

I've been spending a lot of time lately playing with animations in IPython notebook. Here's my latest one - a slightly puzzling rearrangement of shapes into two different triangles:

Where does the extra block go? Look close and you might be able to figure it out.

A Javascript Viewer for Matplotlib Animations

I'll cut to the chase: here's what I've created: a javascript-based animation viewer, with hooks to embed it in IPython. It's best viewed in a modern browser (unfortunately Firefox does not currently qualify as "modern" due to its lack of HTML5 support)

In [1]:
%pylab inline

Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.kernel.zmq.pylab.backend_inline].
For more information, type 'help(pylab)'.

In [2]:
# get the JSAnimation import at https://github.com/jakevdp/JSAnimation
from JSAnimation import examples
examples.basic_animation()
Out[2]:


Once Loop Reflect

I think the result is pretty good, if I do say so myself.

Embedding Matplotlib Animations in IPython Notebooks

I've spent a lot of time on this blog working with matplotlib animations (see the basic tutorial here, as well as my examples of animating a quantum system, an optical illusion, the Lorenz system in 3D, and recreating Super Mario). Up until now, I've not have not combined the animations with IPython notebooks. The problem is that so far the integration of IPython with matplotlib is entirely static, while animations are by their nature dynamic. There are some efforts in the IPython and matplotlib development communities to remedy this, but it's still not an ideal setup.

I had an idea the other day about how one might get around this limitation in the case of animations. By creating a function which saves an animation and embeds the binary data into an HTML string, you can fairly easily create automatically-embedded animations within a notebook.