(A Creative Blog Name Here)

Code, math, and other things I find useful

Setting up python in virtualenv for data analysis on OSX

I have decided to use Python for my research code because I am fed up with needing an internet connection to work with Matlab and with not being able to check out a license for the toolboxes I need (e.g. the statistics toolbox) right before a paper deadline. I have used Python for a while for munging raw data for input to Matlab, but only recently have I felt that I could work with Python as effectively as I could with Matlab.

I wanted my Python installation to use the newest versions of the various modules used for data analysis (e.g. numpy, scipy, ipython, matplotlib, etc.). Since I already had the source for many of these projects, I decided to build the current master branches (after a git pull to get the latest sources) with all of the bells and whistles, like the ipython qt console and notebook. Additionally, I wanted to put everything in a virtualenv to isolate it from my system Python in case I broke anything. In the process I ran into a few problems and eventually figured out how to get a working system. In this post I want to document the process in case I need to reinstall any of it in the future.

Note: Not everything is built from source. The main modules I work with are, but some dependencies are installed with pip. I assume that you are using Python 2.7, that the directory where you store source code is SRC, that the directory with virtualenvs is VENV, and that the name of the virtualenv is myvenv (you should name yours something meaningful, though). I also assume that all git clone ... commands are run in the SRC directory.

Install virtualenv and virtualenvwrapper

There are a number of introductions on this. The one here is good.

Create a virtualenv

virtualenvwrapper makes this really easy; see the documentation for instructions. Name it whatever you'd like. As long as the virtualenv is active, any modules you install (via pip or setup.py install) will be installed for that Python.
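Before installing anything, it is worth confirming that the virtualenv really is the active interpreter. The snippet below is a minimal sketch of one way to check; the helper name `in_virtualenv` is my own, and it relies only on `sys.real_prefix` (set by classic virtualenv) and `sys.base_prefix` (set by newer virtualenv releases and the Python 3 venv module).

```python
import sys


def in_virtualenv():
    """Return True if the running interpreter belongs to a virtualenv."""
    # Classic virtualenv sets sys.real_prefix. Newer tools instead make
    # sys.base_prefix differ from sys.prefix.
    if hasattr(sys, "real_prefix"):
        return True
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


print("virtualenv active: %s" % in_virtualenv())
print("packages will be installed under: %s" % sys.prefix)
```

If the first line prints False, activate the virtualenv before running any of the install commands in this post.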

IMPORTANT: Fix the python executable

virtualenv only copies the python executable. However, for the application manager in OSX to recognize running Python processes, python must be called from an application bundle (also see pythonw in the Python framework).
This step is necessary so that figures opened with matplotlib can be brought to the front via cmd-tab (or by clicking the icon in the dock). If this step is not performed, matplotlib figures open behind all other windows and the only way to find them is to move all the other open windows out of the way.

$ git clone git://github.com/gldnspud/virtualenv-pythonw-osx.git
$ cd virtualenv-pythonw-osx
$ python install_pythonw.py `which python`/../..
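The `which python`/../.. argument is just a compact way of naming the root of the active virtualenv. As a rough illustration (assuming the path is simplified lexically, and using a helper name, `venv_root`, that I made up for this sketch), the same computation in Python looks like this:

```python
import posixpath


def venv_root(python_path):
    """Strip the trailing 'bin/python' from an interpreter path."""
    # Joining "../.." and normalizing removes the last two path components.
    return posixpath.normpath(posixpath.join(python_path, "..", ".."))


# The placeholder path from this post, not a real location.
print(venv_root("VENV/myvenv/bin/python"))  # VENV/myvenv
```

So with myvenv active, the script is handed VENV/myvenv as the virtualenv to patch.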

numpy

$ cd your/src/directory
$ git clone git://github.com/numpy/numpy.git  # or git pull
$ cd numpy
$ python setup.py build
$ python setup.py install

nose

$ pip install nose

Test numpy

    python -c 'import numpy; numpy.test()'

scipy

$ cd your/src/directory
$ git clone git://github.com/scipy/scipy.git  # or git pull
$ cd scipy
$ python setup.py build
$ python setup.py install
$ python -c 'import scipy; scipy.test()'

readline

$ easy_install readline

We use easy_install because readline won't be picked up if we install it with pip.
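To see which line-editing library the interpreter actually ended up with, you can inspect the readline module's docstring. This is only a sketch: the helper name `readline_backend` is hypothetical, and the check relies on the fact that libedit-backed builds (common for the stock Python on OSX) mention libedit in `readline.__doc__`.

```python
def readline_backend():
    """Report which line-editing library the readline module is linked against."""
    try:
        import readline
    except ImportError:
        return None  # no readline module available at all
    # libedit-backed builds mention "libedit" in the module docstring.
    if "libedit" in (readline.__doc__ or ""):
        return "libedit"
    return "GNU readline"


print("readline backend: %s" % readline_backend())
```

Run it with the virtualenv active. If it reports libedit, the readline package installed above is probably not the one being imported.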

Install other dependencies

$ pip install python-dateutil sphinx pygments tornado

Install the ZMQ library

Describing how is beyond the scope of this post, but there is a lot of information available elsewhere.

pyzmq

$ pip install pyzmq

Install PySide system-wide

There is a dmg for this; instructions can be found online.
Then create a symbolic link in this virtualenv to the system installation:

    ln -s path-to-sys-PySide VENV/myvenv/lib/python2.7/site-packages/PySide
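If you rebuild virtualenvs often, the same link can be created from Python. The following is a small sketch rather than part of the original procedure; `link_package` is a name I invented, and the paths in the example call are the placeholders used throughout this post.

```python
import os


def link_package(system_pkg_dir, site_packages_dir, name="PySide"):
    """Symlink a system-wide package directory into a virtualenv's site-packages."""
    link = os.path.join(site_packages_dir, name)
    if os.path.lexists(link):
        return link  # something is already there; leave it alone
    os.symlink(system_pkg_dir, link)
    return link


# Example call with the placeholder paths from this post:
# link_package("path-to-sys-PySide", "VENV/myvenv/lib/python2.7/site-packages")
```

The function refuses to overwrite an existing entry, so running it twice is harmless.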

ipython

By default the build script will use the system installation of python.
Instead, use the following so that the correct version of python is used.

    python setup.py build --executable "VENV/myvenv/bin/python" 
    python setup.py install

See this discussion for more information. Alternatively `which python` can be used to specify the executable.
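One way to confirm that the build picked up the right interpreter is to look at the shebang line of the installed ipython script. The helper below is a sketch with a made-up name (`script_interpreter`); it assumes the installed script is a plain text file whose first line names the interpreter.

```python
def script_interpreter(script_path):
    """Return the interpreter named on a script's shebang line, or None."""
    with open(script_path, "rb") as f:
        first_line = f.readline().decode("utf-8", "replace").strip()
    if not first_line.startswith("#!"):
        return None
    return first_line[2:].strip()


# Example with the placeholder path from this post:
# script_interpreter("VENV/myvenv/bin/ipython")
```

If the result points at the system python rather than the one in VENV/myvenv/bin, rebuild with the --executable option shown above.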

Also, install MathJax if you'd like to use it in the notebook. In an IPython session, run

from IPython.external.mathjax import install_mathjax
install_mathjax()

You should now have a working ipython with qtconsole and notebook functionality (though you can't plot until matplotlib is installed).

matplotlib

Building matplotlib on OSX is a pain. Installing using pip might be best

$ pip install -e git+https://github.com/matplotlib/matplotlib.git#egg=matplotlib

However, if you really want to build matplotlib from source then

$ git clone git://github.com/matplotlib/matplotlib.git

and follow the instructions in README.osx and make.osx

cython

I chose to build the latest stable version of cython rather than the master branch since so many other modules use it.

$ git clone git://github.com/cython/cython.git
$ cd cython
$ make local
$ python setup.py install
$ python runtests.py -vv

Just a heads up that the tests take a very long time to run.

scikit-learn

$ git clone git://github.com/scikit-learn/scikit-learn.git
$ cd scikit-learn
$ make all
$ python setup.py install

Update: If building scikit-learn from source, it is better not to install it system-wide with the last line above, but rather to add the repository directory to your PYTHONPATH after the make all.
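The PYTHONPATH approach works because Python adds each directory listed in PYTHONPATH to sys.path at startup. Here is a short sketch for checking that a repository directory really is importable; both helper names are hypothetical, and the commented example path is a placeholder.

```python
import os
import sys


def pythonpath_entries(environ=None):
    """Split PYTHONPATH into its directories, dropping empty entries."""
    environ = os.environ if environ is None else environ
    raw = environ.get("PYTHONPATH", "")
    return [p for p in raw.split(os.pathsep) if p]


def on_search_path(directory):
    """True if `directory` is on sys.path, i.e. importable without installing."""
    target = os.path.realpath(directory)
    # An empty sys.path entry stands for the current directory.
    return any(os.path.realpath(p or os.curdir) == target for p in sys.path)


print(pythonpath_entries())
# Example with a placeholder path: on_search_path("SRC/scikit-learn")
```

If on_search_path returns False for the repository, the export of PYTHONPATH probably has not reached the shell that started Python.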

pandas

First install the dependencies.

numexpr

$ git clone git://github.com/erdc-cm/numexpr.git
$ cd numexpr
$ python setup.py build
$ python setup.py install
$ python -c "import numexpr; numexpr.test()"

PyTables

Make sure that you have the HDF5 libraries installed on your system and that they are on your PATH. Then clone the repository from GitHub and follow the instructions reproduced below:

$ git clone git://github.com/PyTables/PyTables.git
$ cd PyTables
$ python setup.py build_ext --inplace
$ python -c 'import tables; tables.test()'
$ python setup.py install

h5py

I've also found it useful to build h5py, which makes it easy to load version 7.3 mat files. Do this as follows:

$ git clone git://github.com/qsnake/h5py.git
$ cd h5py
$ python setup.py build --hdf5=/usr/local
$ python setup.py install

You should replace /usr/local above with the path that contains the include, lib, etc. directories of your HDF5 installation.
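If you are not sure which directory to pass, a quick check like the one below can help. It is only a sketch: `looks_like_hdf5_prefix` is a name I made up, and it assumes the usual layout in which the main header hdf5.h sits under include/ next to a lib/ directory.

```python
import os


def looks_like_hdf5_prefix(prefix):
    """Check that `prefix` has the include/ and lib/ layout the build expects."""
    header = os.path.join(prefix, "include", "hdf5.h")
    libdir = os.path.join(prefix, "lib")
    return os.path.isfile(header) and os.path.isdir(libdir)


print(looks_like_hdf5_prefix("/usr/local"))
```

A False result for your candidate directory suggests that --hdf5 should point somewhere else.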

rpy2

$ pip install rpy2

Now, we can build and install pandas.

$ git clone git://github.com/pydata/pandas.git
$ cd pandas
$ python setup.py build_ext --inplace
$ nosetests pandas
$ python setup.py install

Update: Again, rather than installing you can just build in place and add the path to the pandas repository to your PYTHONPATH. This way every time you pull a new version and build it you will see the effects immediately.

statsmodels

First we install the patsy module which statsmodels uses for formulas

$ git clone git://github.com/pydata/patsy.git
$ cd patsy
$ python setup.py install

Then, statsmodels itself

$ git clone git://github.com/statsmodels/statsmodels.git
$ cd statsmodels
$ python setup.py install

PyMC

$ git clone git://github.com/pymc-devs/pymc.git
$ cd pymc
$ python setup.py config_fc --fcompiler gnu95 build
$ python setup.py install
$ python -c 'import pymc; pymc.test()'

bottleneck

This module contains Cythonized versions of certain numpy functions, such as nansum, to speed them up. By now the drill is standard. Beware: this takes a very long time to build on my laptop.

$ git clone git://github.com/kwgoodman/bottleneck.git
$ cd bottleneck
$ python setup.py install
$ python -c 'import bottleneck; bottleneck.test()'
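For anyone unfamiliar with nansum, here is a pure-Python illustration of roughly what it computes: a sum that skips NaN values. This is not bottleneck's implementation (which is compiled and works on arrays), just a sketch of the behaviour for a flat list of floats.

```python
import math


def nansum(values):
    """Sum the values, skipping any NaNs."""
    return sum(v for v in values if not math.isnan(v))


print(nansum([1.0, float("nan"), 2.5]))  # 3.5
```

A plain sum over the same list would return NaN, which is why the NaN-aware versions are so handy for messy data.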

CythonGSL

This wraps the GSL library for use with Cython. This is handy if you need things like the gamma function, or you can use it to generate random numbers (just be careful if you're also using np.random).

$ git clone git://github.com/twiecki/CythonGSL.git 
$ cd CythonGSL
$ python setup.py build
$ python setup.py install

Finished

Ok, that's it, we have a pretty complete system for data analysis and research.
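As a final sanity check, it can be useful to list which version of each module is being imported and from where, so you can confirm they all come from the virtualenv or your source directories. The script below is a minimal sketch; the `report` helper and the `STACK` list are my own, and the list uses the import names (for example sklearn for scikit-learn and tables for PyTables).

```python
import importlib


def report(module_names):
    """Map each module name to (version, file), or None if it cannot be imported."""
    found = {}
    for name in module_names:
        try:
            mod = importlib.import_module(name)
        except Exception:  # a broken build can raise more than ImportError
            found[name] = None
            continue
        found[name] = (getattr(mod, "__version__", "unknown"),
                       getattr(mod, "__file__", "built-in"))
    return found


STACK = ["numpy", "scipy", "IPython", "matplotlib", "Cython", "sklearn",
         "pandas", "statsmodels", "pymc", "bottleneck", "h5py", "tables"]

if __name__ == "__main__":
    for name, info in sorted(report(STACK).items()):
        status = "MISSING" if info is None else "%s  %s" % info
        print("%-12s %s" % (name, status))
```

Anything reported as MISSING, or loaded from outside VENV/myvenv and SRC, is worth a second look.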