January 22, 2015

Titus Brown

Our MEGAHIT review

It may not surprise people to learn that I was one of the reviewers on the MEGAHIT metagenome assembly paper... which is now published!

Below is my review, edited to remove all of the stuff they addressed in their revision.

Please also see our first blog post on MEGAHIT and Adam Rivers' blog post on metatranscriptome assembly with MEGAHIT!

It's worth noting that the evaluation I did for the first blog post was used in the final paper ;). Successful adventures in peer review!

Li et al describe a new approach to metagenome assembly and provide an implementation of this approach in the MEGAHIT software. The new approach constructs a succinct de Bruijn graph using multiple k-mers, and uses a novel "mercy k-mer" approach that preserves low-abundance regions of reads. They also use GPUs to accelerate the graph construction.

The paper reads well. The "mercy k-mer" approach is novel to my knowledge, but makes intuitive sense. The GPU implementation seems to yield a significant improvement (although we did not test it). The software is available, licensed under GPL, is fairly straightforward to compile, and runs cleanly. In our hands, it produced good results on known datasets and ran extraordinarily fast and in low memory (see below). The authors are to be commended on all of this!

[ ... ]

See http://ivory.idyll.org/blog/2014-how-good-is-megahit.html for a blog post on the evaluation we did. The authors should feel free to take our makefile & results and build off of them; it's all under BSD license.

In sum, I find this to be a great piece of software, with good algorithmic insights, good implementation (at least from a cursory usage), and lots of promise.


by C. Titus Brown at January 22, 2015 11:00 PM

A Reddit ask-us-anything with the Moore Data Driven Discovery Investigators

Today at 3pm EST, the Moore Data Driven Discovery Investigators will be answering questions on reddit, in the science "ask me anything (AMA)" series. This is an opportunity to ask us anything you want about our research, data-driven discovery more generally, or ...well, you tell us!

Go here to see the questions and our answers.

A (slightly updated) list of the Investigators and some of their proposals is below.

Dr. Ethan White, University of Florida

Proposal: Data-intensive forecasting and prediction for ecological systems

Source code repository: https://github.com/weecology

Google Scholar: http://scholar.google.com/citations?user=pHmra8cAAAAJ&hl=en

ImpactStory: https://impactstory.org/EthanWhite

Blog: http://jabberwocky.weecology.org/

Twitter: @ethanwhite

Dr. Laura Waller, UC Berkeley

Source code repository: https://github.com/Waller-Lab/CalOptrics

Dr. Matt Turk, University of Illinois at Urbana-Champaign

Proposal: Changing How We Conduct Inquiry

Source code repository: http://bitbucket.org/MatthewTurk/

Google Scholar: http://scholar.google.com/citations?user=QTmv2p0AAAAJ&hl=en

Blog: https://plus.google.com/+MatthewTurk

Twitter: @powersoffour

Dr. Blair D. Sullivan, North Carolina State University

Proposal title: Enabling Science via Structural Graph Algorithms

Source code repository: https://github.com/bdsullivan/

Google Scholar: http://scholar.google.com/citations?user=oZBtvZMAAAAJ&hl=en&oi=ao

Twitter: @BlairDSullivan

Dr. Matthew Stephens, University of Chicago

Proposal title: Gene Regulation and Dynamic Statistical Comparisons

Source code repository: https://github.com/stephens999

Google Scholar: http://scholar.google.com/citations?user=qOQFhUkAAAAJ&hl=en

Blog: http://randomdeviation.blogspot.com/

Twitter: @mstephens999

Dr. Amit Singer, Princeton University

Proposal title: Maximum Likelihood Estimation by Semidefinite Programming: Application to Cryo-Electron Microscopy

Google Scholar: http://scholar.google.com/citations?user=h67w8QsAAAAJ&hl=en

Dr. Kimberly Reynolds, UT Southwestern

Proposal title: Decoding the Genome: Finding Effective Variables from Evolutionary Ensembles

Google Scholar: http://scholar.google.com/citations?user=6bWFU7MAAAAJ&hl=en&oi=ao

Dr. Chris Re, Stanford

Google Scholar: http://scholar.google.com/citations?user=DnnCWN0AAAAJ&hl=en

Dr. Laurel Larsen, UC Berkeley

Proposal: Developing the informationscape approach to environmental change detection.

Dr. Carl Kingsford, Carnegie Mellon University

Proposal title: Lightweight Algorithms for Petabase-scale Genomics

Google Scholar: http://scholar.google.com/citations?user=V_cvqKcAAAAJ

Twitter: @ckingsford

Dr. Jeffrey Heer, U. Washington Seattle

Proposal: Interactive Data Analysis

Google Scholar: http://scholar.google.com/citations?user=vlgs4G4AAAAJ

Twitter: @jeffrey_heer

Dr. Casey Greene, Dartmouth

Proposal title: Learning the context of publicly available genome-wide data

Google Scholar: http://scholar.google.com/citations?user=ETJoidYAAAAJ&hl=en

Twitter: @GreeneScientist

Dr. C. Titus Brown, UC Davis

Proposal: Infrastructure for Data Intensive Biology

Source code repository: http://github.com/ged-lab/

Google Scholar: http://scholar.google.com/citations?user=O4rYanMAAAAJ

ImpactStory: https://impactstory.org/TitusBrown

Blog: http://ivory.idyll.org/blog/

Twitter: @ctitusbrown

Dr. Joshua Bloom, UC Berkeley

Proposal title: Efficient Data-Driven Astrophysical Inquiry with Machine Learning

Google Scholar: http://scholar.google.com/citations?user=fHkUYk0AAAAJ

Blog: http://5nf5.blogspot.com/

Twitter: @profjsb


by C. Titus Brown at January 22, 2015 11:00 PM

Continuum Analytics

Automatic generation of vectorized functions with Blaze and Numba

tl;dr Using Blaze in conjunction with Numba helps speed up code by reducing temporary variables from intermediate computations while allowing you to express computations at a high level.

by Phillip Cloud at January 22, 2015 08:04 AM

January 20, 2015

Titus Brown

Baltis, Bioinformatics, and Google Chromebox

I participated in my second Balti and Bioinformatics on Wednesday - unlike the first one, which ended with only slightly sketchy Indian food in Birmingham, this one was entirely online. The technology worked really well and I think this is a great way to do talks!

For those that haven't seen this before, it works as follows:

  1. Nick Loman sets up a Google Hangout "on the air".
  2. The speakers connect into the Google Hangout and do their thing (screen sharing, video, etc.)
  3. Viewers can connect in via what is essentially YouTube, but live streamed to watch the talks.
  4. The talks are also recorded for consumption later on.

There are a few technical issues to deal with; the main one is that full screen doesn't work, so Nick had everyone produce PDFs that they then paged through. I also found it a bit disconcerting to give a talk without any audience feedback!

One thing I'd like to explore is doing this for an in-person conference. Nick did that once, too, but I don't know how well it worked. Anyway, it would be a great way to open conferences up to the world, and we should do that :)

...would that we could use a Google Chromebox for Meetings to do so!

The Google Chromebox for Meetings

A while back, we bought a Google Chromebox for Meetings, together with a 60" TV. The idea was to use it for teleconferencing and meetings; most of the teleconferencing systems at MSU didn't (and don't) interface well with remote individuals, and Michael Crusoe and I thought this might do the trick.

For those who haven't seen it, the Google Chromebox for Meetings is basically marketed as a cheap teleconferencing solution. It's a Google Hangout in a box, with a nice microphone and a decent video camera. We bought it, registered it, and then connected it to our TV. How's it worked out?

tl;dr? I do not think I will buy another one, nor would I recommend it to anyone.

There have been a few problems:

  1. They don't readily support our configuration, where we are at a university but have our own organization. -0
  2. At least initially, we ran into several configuration and installation problems that took a few hours of Michael's time to resolve. -0
  3. They don't support all of the functionality of Google Hangouts -- in particular, they don't support the sharing-with-YouTube functionality that Nick used for Balti & Bioinformatics.

Overall it's been a somewhat poor experience, and Michael has had to put in an annoyingly large amount of time to get it to work. Once it works, it works pretty well, though. So it's not a total loss. But it's not a better solution than just buying your own cheap driver computer, hooking it up to a TV with a nice camera and mic, and putting a wireless keyboard & mouse on it.


by C. Titus Brown at January 20, 2015 11:00 PM

January 18, 2015

Titus Brown

On benchmarking: when should you publish a comparison of two data analysis pipelines?

A while back, someone else's graduate student asked me (slightly edited to protect the innocent :) --

I already have two independent sets of de novo transcriptome assemblies and annotations of the NGS data [...] 1) from the company who did the sequencing and analysis, and 2) from our pipeline here. It would be great to have a third set from your lab to compare. Do you think such a comparison of independent analyses is worthwhile?

My response, somewhat edited:

Well, yes, and no -- first, just to make my hypocrisy clear, see:


(which is about exactly this!)

I definitely think evaluating pipelines and assemblies is worthwhile, but at the end of the day it's time consuming and I worry that it distracts from the science. My experience is that it takes a lot more time and effort than anyone realizes before they start the project -- either that, or the researchers do a bad job of evaluation :). So my two guidelines (for myself) are --

  1. how reproducible is the analysis by you, and by others? If you're simply evaluating what one person did against what some other person did, without having carefully recorded what was actually done, then it's all anecdatal. In this case, no one is going to learn anything long term, and you shouldn't bother.

    A good way to self-evaluate here: suppose one of your reviewers said, "tweak these parameters because I believe that they make a big difference." How long would it take you to reanalyze everything? If the answer is going to be "an annoyingly long time" then don't write the paper in the first place!

  2. Is there anything to be learned about the underlying science? In our case we pushed a bit to show that though Trinity seems to have recovered more low abundance isoforms, we got essentially no new biologically relevant information out of it - something that I think will hold generally true for a wide range of data sets. To me, that's reasonably important and useful information. Not sure I would have liked the paper as much if we'd just said "hey, look, we get basically the same results, with some small differences. Cool?"

    A good self-evaluation question: is anyone going to find anything you say surprising or useful for building intuition and understanding? If not, why bother doing the work?


So that was my response. You'll see some of these sentiments echoed in my blog post about the Critical Assessment of Metagenome Interpretation; if you're not automating it and it's not reproducible, why bother? The software is going to change tomorrow and your benchmark will be outdated.

This is also why I'm a huge fan of nucleotid.es. These principles are also guiding our development of computational protocols and paper pipelines. And, fundamentally, it's why I'm planning to rain swift and sudden death down on benchmarking and comparison papers that aren't highly reproducible.


by C. Titus Brown at January 18, 2015 11:00 PM

January 16, 2015

Matthew Rocklin

Towards Out-of-core ND-Arrays -- Spilling to Disk

tl;dr We implement a dictionary that spills to disk when we run out of memory. We connect this to our scheduler.


This is the fifth in a sequence of posts constructing an out-of-core nd-array using NumPy, Blaze, and dask. You can view these posts here:

  1. Simple task scheduling,
  2. Frontend usability
  3. A multi-threaded scheduler
  4. Matrix Multiply Benchmark

We now present chest, a dict-like type that spills to disk when we run out of memory. We show how it prevents large computations from flooding memory.

Intermediate Data

A case where our scheduling algorithm fails to avoid intermediates

If you read the post on scheduling you may recall our goal to minimize intermediate storage during a multi-worker computation. The image on the right shows a trace of our scheduler as it traverses a task dependency graph. We want to compute the entire graph quickly while keeping only a small amount of data in memory at once.

Sometimes we fail and our scheduler stores many large intermediate results. In these cases we want to spill excess intermediate data to disk rather than flooding local memory.


Chest is a dict-like object that writes data to disk once it runs out of memory.

>>> from chest import Chest
>>> d = Chest(available_memory=1e9)  # only use a gigabyte

It satisfies the MutableMapping interface so it looks and feels exactly like a dict. Below we show an example using a chest with only enough data to store one Python integer in memory.

>>> d = Chest(available_memory=24)  # enough space for one Python integer

>>> d['one'] = 1
>>> d['two'] = 2
>>> d['three'] = 3
>>> d['three']
3
>>> d.keys()
['one', 'two', 'three']

We keep some data in memory

>>> d.inmem
{'three': 3}

While the rest lives on disk

>>> d.path
>>> os.listdir(d.path)
['one', 'two']

By default we store data with pickle, but chest supports any protocol with the dump/load interface (pickle, json, cbor, joblib, …).

A quick point about pickle. Frequent readers of my blog may know of my sadness at how this library serializes functions and the crippling effect on multiprocessing. That sadness does not extend to normal data. Pickle is fine for data if you use the protocol= keyword to pickle.dump correctly. Pickle isn’t a good cross-language solution, but that doesn’t matter in our application of temporarily storing numpy arrays on disk.
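To illustrate the protocol= point, here is a small sketch comparing the old ASCII protocol with the newest binary one (a plain list of integers stands in for a large array; the variable names are ours):

```python
import pickle

data = list(range(100000))  # stand-in for a large dataset

# Protocol 0 is the original ASCII format; the highest available
# protocol is a compact binary format, much better for bulk data.
old = pickle.dumps(data, protocol=0)
new = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

assert len(new) < len(old)  # binary protocol is far more compact
```

The same keyword works with pickle.dump when writing to a file, which is the case that matters for spilling to disk.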

Recent tweaks

To use chest alongside dask with any reasonable success I had to make the following improvements to the original implementation:

  1. A basic LRU mechanism to write only infrequently used data
  2. A policy to avoid writing the same data to disk twice if it hasn’t changed
  3. Thread safety
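A toy version of such a mapping -- LRU eviction, spilling to disk with pickle, the MutableMapping interface -- might look like the following (a sketch for illustration only, not chest's actual implementation; the class and attribute names are invented):

```python
import os
import pickle
import tempfile
from collections import OrderedDict
from collections.abc import MutableMapping


class SpillDict(MutableMapping):
    """Keep the most recently used items in memory; pickle the least
    recently used ones to disk once a size budget is exceeded."""

    def __init__(self, max_items=2, path=None):
        self.max_items = max_items
        self.path = path or tempfile.mkdtemp()
        self.inmem = OrderedDict()   # key -> value, oldest first
        self.ondisk = set()          # keys currently pickled to disk

    def _filename(self, key):
        return os.path.join(self.path, str(key))

    def _evict(self):
        # spill least-recently-used items until we fit the budget
        while len(self.inmem) > self.max_items:
            key, value = self.inmem.popitem(last=False)
            with open(self._filename(key), 'wb') as f:
                pickle.dump(value, f, protocol=pickle.HIGHEST_PROTOCOL)
            self.ondisk.add(key)

    def __setitem__(self, key, value):
        self.ondisk.discard(key)
        self.inmem[key] = value
        self.inmem.move_to_end(key)  # now the most recently used
        self._evict()

    def __getitem__(self, key):
        if key in self.inmem:
            self.inmem.move_to_end(key)
            return self.inmem[key]
        if key in self.ondisk:
            with open(self._filename(key), 'rb') as f:
                value = pickle.load(f)
            self.ondisk.discard(key)
            self.inmem[key] = value  # promote back into memory
            self._evict()
            return value
        raise KeyError(key)

    def __delitem__(self, key):
        if key in self.inmem:
            del self.inmem[key]
        elif key in self.ondisk:
            os.remove(self._filename(key))
            self.ondisk.discard(key)
        else:
            raise KeyError(key)

    def __iter__(self):
        yield from self.inmem
        yield from self.ondisk

    def __len__(self):
        return len(self.inmem) + len(self.ondisk)
```

Because it is a MutableMapping it drops straight into any code expecting a dict, which is exactly what makes it usable as a scheduler cache.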

Now we can execute more dask workflows without risk of flooding memory

A = ...
B = ...
expr = A.T.dot(B) - B.mean(axis=0)

cache = Chest(available_memory=1e9)

into('myfile.hdf5::/result', expr, cache=cache)

Now we incur only moderate slowdown when we schedule poorly and run into large quantities of intermediate data.


Chest is only useful when we fail to schedule well. We can still improve scheduling algorithms to avoid keeping data in memory but it’s nice to have chest as a backup for when these algorithms fail. Resilience is reassuring.

January 16, 2015 12:00 AM

January 14, 2015

Matthew Rocklin

Towards Out-of-core ND-Arrays -- Benchmark MatMul

tl;dr We benchmark dask on an out-of-core dot product. We also compare and motivate the use of an optimized BLAS.

Here are the results

Performance (GFLOPS)    In-Memory   On-Disk
Reference BLAS                  6        18
OpenBLAS one thread            11        23
OpenBLAS four threads          22        11

Disclaimer: This post is on experimental buggy code. This is not ready for public use.


This is the fourth in a sequence of posts constructing an out-of-core nd-array using NumPy, Blaze, and dask. You can view these posts here:

  1. Simple task scheduling,
  2. Frontend usability
  3. A multi-threaded scheduler

We now give performance numbers on out-of-core matrix-matrix multiplication.

Matrix-Matrix Multiplication

Dense matrix-matrix multiplication is compute-bound, not I/O bound. We spend most of our time doing arithmetic and relatively little time shuffling data around. As a result we may be able to read large data from disk without performance loss.

When multiplying two $n\times n$ matrices we read $n^2$ elements from each input but perform $n^3$ computations. That is $n$ computations per element read, so, relatively speaking, I/O is cheap.
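The back-of-the-envelope arithmetic is worth writing out (a trivial sketch; the function name is ours):

```python
def compute_to_io_ratio(n):
    """Naive n-by-n matrix multiply: n**3 multiply-adds against
    n**2 elements read per input matrix."""
    operations = n ** 3
    elements_read = n ** 2
    return operations / elements_read

# For n = 1000 we do a thousand arithmetic operations per element
# read, so disk bandwidth is unlikely to be the bottleneck.
print(compute_to_io_ratio(1000))  # 1000.0
```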

We normally measure speed for single CPUs in Giga Floating Point Operations Per Second (GFLOPS). Let's look at how my laptop does on single-threaded in-memory matrix-matrix multiplication using NumPy.

>>> import numpy as np
>>> x = np.ones((1000, 1000), dtype='f8')
>>> %timeit x.dot(x)  # Matrix-matrix multiplication
10 loops, best of 3: 176 ms per loop

>>> 1000 ** 3 / 0.176 / 1e9  # n cubed computations / seconds / billion
5.681818181818182

OK, so NumPy’s matrix-matrix multiplication clocks in at 6 GFLOPS, more or less. The np.dot function ties in to the GEMM operation in the BLAS library on my machine. Currently my numpy just uses the reference BLAS (you can check this with np.show_config()).

Matrix-Matrix Multiply From Disk

For matrices too large to fit in memory we compute the solution one part at a time, loading blocks from disk when necessary. We parallelize this with multiple threads. Our last post demonstrates how NumPy+Blaze+Dask automates this for us.

We perform a simple numerical experiment, using HDF5 as our on-disk store.

We install stuff

$ conda install -c blaze blaze
$ pip install git+https://github.com/ContinuumIO/dask

We set up a fake dataset on disk

>>> import h5py
>>> f = h5py.File('myfile.hdf5')
>>> A = f.create_dataset(name='A', shape=(200000, 4000), dtype='f8',
...                                chunks=(250, 250), fillvalue=1.0)
>>> B = f.create_dataset(name='B', shape=(4000, 4000), dtype='f8',
...                                chunks=(250, 250), fillvalue=1.0)

We tell Dask+Blaze how to interact with that dataset

>>> from dask.obj import Array
>>> from blaze import Data, into

>>> a = into(Array, 'myfile.hdf5::/A', blockshape=(1000, 1000))  # dask things
>>> b = into(Array, 'myfile.hdf5::/B', blockshape=(1000, 1000))
>>> A = Data(a)  # Blaze things
>>> B = Data(b)

We compute our desired result, storing back onto disk

>>> %time into('myfile.hdf5::/result', A.dot(B))
2min 49s

>>> 200000 * 4000 * 4000 / 169 / 1e9

18.9 GFLOPS, roughly 3 times faster than the in-memory solution. At first glance this is confusing - shouldn’t we be slower coming from disk? Our speedup is due to our use of four cores in parallel. This is good, we don’t experience much slowdown coming from disk.

It’s as if all of our hard drive just became memory.


OpenBLAS

Reference BLAS is slow; it was written long ago. OpenBLAS is a modern implementation. I installed OpenBLAS with my system installer (apt-get) and then reconfigured and rebuilt numpy. OpenBLAS supports many cores. We’ll show timings with one and with four threads.

$ OMP_NUM_THREADS=1 ipython
>>> import numpy as np
>>> x = np.ones((1000, 1000), dtype='f8')
>>> %timeit x.dot(x)
10 loops, best of 3: 89.8 ms per loop

>>> 1000 ** 3 / 0.0898 / 1e9  # compute GFLOPS -- about 11

$ OMP_NUM_THREADS=4 ipython
>>> import numpy as np
>>> x = np.ones((1000, 1000), dtype='f8')
>>> %timeit x.dot(x)
10 loops, best of 3: 46.3 ms per loop

>>> 1000 ** 3 / 0.0463 / 1e9  # compute GFLOPS -- about 22

This is about four times faster than reference. If you’re not already parallelizing in some other way (like with dask) then you should use a modern BLAS like OpenBLAS or MKL.

OpenBLAS + dask

Finally we run on-disk our experiment again, now with OpenBLAS. We do this both with OpenBLAS running with one thread and with many threads.

We’ll skip the code (it’s identical to what’s above) and give a comprehensive table of results below.

Sadly the out-of-core solution doesn’t improve much by using OpenBLAS. Actually, when both OpenBLAS and dask try to parallelize we lose performance.


Performance (GFLOPS)    In-Memory   On-Disk
Reference BLAS                  6        18
OpenBLAS one thread            11        23
OpenBLAS four threads          22        11

tl;dr When doing compute-intensive work, don’t worry about using disk; just don’t use two mechanisms of parallelism at the same time.

Main Take-Aways

  1. We don’t lose much by operating from disk in compute-intensive tasks.
  2. Actually we can improve performance when an optimized BLAS isn’t available.
  3. Dask doesn’t benefit much from an optimized BLAS. This is sad and surprising. I expected performance to scale with single-core in-memory performance. Perhaps this is indicative of some other limiting factor.
  4. One shouldn’t extrapolate too far with these numbers. They’re only relevant for highly compute-bound operations like matrix-matrix multiply.

Also, thanks to Wesley Emeneker for finding where we were leaking memory, making results like these possible.

January 14, 2015 12:00 AM

January 06, 2015


Spyder 2.3 is released!

Hi all,

On behalf of Spyder's development team, I'm pleased to announce that Spyder 2.3 has been released and is available for Windows XP/Vista/7/8, GNU/Linux and Mac OS X here: https://bitbucket.org/spyder-ide/spyderlib/downloads

This release represents 14 months of development since version 2.2 and introduces major enhancements and new features:

  • Python 3 support (versions 3.2, 3.3 and 3.4 are supported).
  • Drop Python 2.5 support.
  • Various Editor improvements:
    • Use the Tab key to do code completions
    • Highlight cells
    • First-class support for Enaml files
    • Improve how calltips are shown
    • Better looking Object Inspector

Spyder 2.2 has been a huge success (downloaded more than 400,000 times!) and we hope 2.3 will be equally successful. To that end we fixed 70 important bugs, merged 30 pull requests from 11 authors, and added almost 1000 commits between these two releases.

- Carlos

by Carlos (noreply@blogger.com) at January 06, 2015 01:34 PM

Spyder 2.3.2 is released!

Hi all,

On behalf of Spyder's development team, I'm pleased to announce that Spyder 2.3.2 has been released and is available for Windows XP/Vista/7/8, GNU/Linux and Mac OS X: https://bitbucket.org/spyder-ide/spyderlib/downloads

This release represents more than 2 months of development since 2.3.1 and introduces major enhancements and new features:

- Editor
  - Improve cells visualization
  - Add support for drag selection and improve look of line number area
  - Open any text file present in the Variable Explorer
  - View and edit IPython notebooks as JSON files
  - Syntax highlighting for JSON and YAML files

- Variable Explorer:
  - Import csv files as Pandas DataFrames
  - Improve browsing speed for NumPy arrays and DataFrames with more than 1e5 elements

- IPython Console
  - Add a stop button to easily stop computations

We fixed almost 40 bugs, merged 13 pull requests from 8 authors and added about 150 commits between these two releases.

This is a very important bugfix release which solves a lot of unicode problems in our consoles, the variable explorer and the main interface, so everyone is encouraged to update.

For a full list of fixes see our changelog

by Carlos (noreply@blogger.com) at January 06, 2015 01:31 PM


PyNN 0.8 beta 2 released

We're happy to announce the second beta release of PyNN 0.8.

With this release we are getting close to a first release candidate: the NEST and NEURON backends are essentially feature complete, although with a few bugs remaining. If you're using one of these simulators as your PyNN backend we recommend using this beta release for new projects; now would also be a good time to think about upgrading existing projects. The Brian backend is less well developed, but considerable progress has been made.

For a list of the main changes between PyNN 0.7 and 0.8, see the release notes for the 0.8 alpha 1 release.

For the changes in this beta release see the release notes. In particular note that this release requires NEST version 2.4, and for the first time supports Python 3.2+ as well as Python 2.6 and 2.7. Support for NEST 2.6 will be provided in an upcoming release.

The source package is available from the INCF Software Center.

What is PyNN?

PyNN (pronounced 'pine') is a simulator-independent language for building neuronal network models.

In other words, you can write the code for a model once, using the PyNN API and the Python programming language, and then run it without modification on any simulator that PyNN supports (currently NEURON, NEST and Brian).

Even if you don't wish to run simulations on multiple simulators, you may benefit from writing your simulation code using PyNN's powerful, high-level interface. In this case, you can use any neuron or synapse model supported by your simulator, and are not restricted to the standard models.

PyNN is also being used as a user-friendly interface to neuromorphic hardware systems.

The code is released under the CeCILL licence (GPL-compatible).

by Andrew Davison (noreply@blogger.com) at January 06, 2015 10:21 AM

Matthew Rocklin

Towards Out-of-core ND-Arrays -- Multi-core Scheduling

tl;dr We show off a multi-threaded shared-memory task scheduler. We share two techniques for space-constrained computing. We end with pretty GIFs.

Disclaimer: This post is on experimental buggy code. This is not ready for public use.

Disclaimer 2: This post is technical and intended for people who care about task scheduling, not for traditional users.


My last two posts (post 1, post 2) construct an ND-Array library out of a simple task scheduler, NumPy, and Blaze.

In this post we outline a more sophisticated scheduler: a less elegant but more effective one that uses multiple threads and caching to achieve performance on an interesting class of array operations.

We create scheduling policies to minimize the memory footprint of our computation.


First, we establish value by doing a hard thing well. Given two large arrays stored in HDF5:

import h5py
f = h5py.File('myfile.hdf5')
A = f.create_dataset(name='A', shape=(4000, 2000000), dtype='f8',
                     chunks=(250, 250), fillvalue=1.0)
B = f.create_dataset(name='B', shape=(4000, 4000), dtype='f8',
                     chunks=(250, 250), fillvalue=1.0)

We do a transpose and dot product.

from blaze import Data, into
from dask.obj import Array

f = h5py.File('myfile.hdf5')

a = into(Array, f['A'], blockshape=(1000, 1000), name='A')
b = into(Array, f['B'], blockshape=(1000, 1000), name='B')

A = Data(a)
B = Data(b)

expr = A.T.dot(B)

result = into('myfile.hdf5::/C', expr)

This uses all of our cores and can be done with only 100MB or so of ram. This is impressive because neither the inputs, outputs, nor any intermediate stage of the computation can fit in memory.

We failed to achieve this exactly (see note at bottom) but still, in theory, we’re great!

Avoid Intermediates

To keep a small memory footprint we avoid holding on to unnecessary intermediate data. The full computation graph of a smaller problem might look like the following:

Un-inlined dask

Boxes represent data, circles represent functions that run on that data, and arrows specify which functions produce/consume which data.

The top row of circles represents the actual blocked dot products (note the many data dependence arrows originating from them). The bottom row of circles represents pulling blocks of data from the A HDF5 dataset into in-memory numpy arrays. The second row transposes the blocks from the first row and adds more blocks from B.

Naively performed, this graph can be very bad; we replicate our data four times here, once in each of the rows. We pull out all of the chunks, transpose each of them, and then finally do a dot product. If we couldn’t fit the original data in memory then there is no way that this will work.

Function Inlining

We resolve this in two ways. First, we don’t cache intermediate values for fast-running functions (like np.transpose). Instead we inline fast functions into tasks involving expensive functions (like np.dot). We may end up running the same quick function twice, but at least we won’t have to store the result. We trade computation for memory.

The result of the graph above with all access and transpose operations inlined looks like the following:

inlined dask

Now our tasks nest (see below). We run all functions within a nested task as a single operation. (non-LISP-ers avert your eyes):

('x_6', 6, 0): (dotmany, [(np.transpose, (ndslice, 'A', (1000, 1000), 0, 6)),
                          (np.transpose, (ndslice, 'A', (1000, 1000), 1, 6))],
                          [(ndslice, 'B', (1000, 1000), 0, 0),
                           (ndslice, 'B', (1000, 1000), 1, 0)]),

This effectively shoves all of the storage responsibility back down to the HDF5 store. We end up pulling out the same blocks multiple times, but repeated disk access is inevitable on large complex problems.

This is automatic. Dask now includes an inline function that does this for you. You just give it a set of “fast” functions to ignore, e.g.

dsk2 = inline(dsk, [np.transpose, ndslice, add, mul, ...])
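What such an inline step does can be sketched in a few lines (a simplification for illustration, not dask's actual code: here keys are plain strings and tasks are (function, *args) tuples, with a string argument that is also a key marking a data dependency):

```python
def inline(dsk, fast_functions):
    """Splice tasks headed by 'fast' functions directly into the
    tasks that consume them, so their results are never cached."""
    fast = set(fast_functions)
    fast_keys = {k for k, v in dsk.items()
                 if isinstance(v, tuple) and v and v[0] in fast}

    def expand_arg(a):
        if isinstance(a, str) and a in fast_keys:
            return expand_task(dsk[a])   # splice the cheap task in
        return a

    def expand_task(task):
        if isinstance(task, tuple) and task:
            return (task[0],) + tuple(expand_arg(a) for a in task[1:])
        return task

    # drop the inlined keys; their work now lives inside consumers
    return {k: expand_task(v) for k, v in dsk.items()
            if k not in fast_keys}


from operator import add, neg

dsk = {'a': (neg, 1),        # cheap; recompute rather than cache
       'b': (add, 'a', 10)}  # expensive; keep as its own task
dsk2 = inline(dsk, [neg])
assert dsk2 == {'b': (add, (neg, 1), 10)}
```

The cheap task for 'a' disappears as a cached intermediate and is re-run inside the task for 'b' instead: computation traded for memory, as described above.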


Now that we have a nice dask to crunch on, we run those tasks with multiple worker threads. This is the job of a scheduler.

Thread pool, courtesy of sewingmantra.com

We build and document such a scheduler here. It targets a shared-memory single-process multi-threaded environment. It replaces the elegant 20 line reference solution with a large blob of ugly code filled with locks and mutable state. Still, it manages the computation sanely, performs some critical optimizations, and uses all of our hardware cores (Moore’s law is dead! Long live Moore’s law!)

Many NumPy operations release the GIL and so are highly amenable to multi-threading. NumPy programs do not suffer the single-active-core-per-process limitation of most Python code.


We follow a fairly standard model. We create a ThreadPool with a fixed number of workers. We analyze the dask to determine “ready to run” tasks. We send a task to each of our worker threads. As they finish they update the run-state, marking jobs as finished, inserting their results into a shared cache, and then marking new jobs as ready based on the newly available data. This update process is fully indexed and handled by the worker threads themselves (with appropriate locks) making the overhead negligible and hopefully scalable to complex workloads.
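The update loop just described might look like the following toy scheduler (an illustrative sketch, not dask's implementation; it reuses the simplified convention of string keys and (function, *args) task tuples):

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def threaded_get(dsk, result_key, nthreads=4):
    """Run a task graph with a thread pool, marking tasks ready as
    their dependencies finish.  No error handling: a sketch only."""
    # which keys each task still waits on
    waiting = {k: {a for a in t[1:] if isinstance(a, str) and a in dsk}
               for k, t in dsk.items()}
    cache = {}                      # finished key -> result
    submitted = set()
    lock = threading.Lock()
    finished = threading.Condition(lock)

    def run(key):
        func, *args = dsk[key]
        with lock:
            vals = [cache[a] if isinstance(a, str) and a in dsk else a
                    for a in args]
        result = func(*vals)        # do the real work outside the lock
        with lock:
            cache[key] = result
            for deps in waiting.values():
                deps.discard(key)   # dependents are one step closer
            finished.notify_all()

    with ThreadPoolExecutor(nthreads) as pool:
        with lock:
            while result_key not in cache:
                ready = [k for k, deps in waiting.items()
                         if not deps and k not in submitted]
                for k in ready:
                    submitted.add(k)
                    pool.submit(run, k)
                if result_key not in cache:
                    finished.wait()
    return cache[result_key]
```

For example, `threaded_get({'x': (add, 1, 2), 'y': (mul, 'x', 10)}, 'y')` computes x on one worker, wakes the main loop, and then schedules y. The real scheduler adds the crucial piece discussed next: *which* ready task to hand to a free worker.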

When a newly available worker selects a new ready task they often have several to choose from. We have a choice. The choice we make here is very powerful. The next section talks about our selection policy:

Select tasks that immediately free data resources on completion


Our policy to prefer tasks that free resources lets us run many computations in a very small space. We now show three expressions, their resulting schedules, and an animation showing the scheduler walk through those schedules. These were taken from live runs.
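The selection heuristic itself is easy to sketch (illustrative names and structures, not dask's internals): among the ready tasks, prefer the one that is the last remaining consumer of the most cached data.

```python
def select_task(ready, consumes, dependents, done):
    """Pick the ready task whose completion releases the most data.

    consumes:   task -> set of data keys it reads
    dependents: data key -> set of tasks that read it
    done:       set of tasks already finished
    """
    def data_freed(task):
        # a datum can be released once all of its consumers are done
        return sum(1 for d in consumes[task]
                   if all(t == task or t in done for t in dependents[d]))
    return max(ready, key=data_freed)


consumes = {'t1': {'d1'}, 't2': {'d1'}, 't3': {'d2'}}
dependents = {'d1': {'t1', 't2'}, 'd2': {'t3'}}
# Running t3 releases d2 immediately; running t1 frees nothing
# (t2 still needs d1), so the heuristic picks t3.
assert select_task({'t1', 't3'}, consumes, dependents, set()) == 't3'
```

Greedy and local, which is exactly why it produces the "vertical" execution in the first example below and fails in the last one.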

Example: Embarrassingly parallel computation

Embarassingly parallel dask

On the right we show an animated GIF of the progress of the following embarrassingly parallel computation:

expr = (((A + 1) * 2) ** 3)

Circles represent computations, boxes represent data.

Red means actively taking up resources. Red is bad.

  • Red circles: tasks currently executing in a thread
  • Red boxes: data currently residing in the cache occupying precious memory

Blue means finished or released. Blue is good.

  • Blue circles: finished tasks
  • Blue boxes: data released from memory because we no longer need it for any task

We want to turn all nodes blue while minimizing the number of red boxes we have at any given time.

The policy to execute tasks that free resources causes “vertical” execution when possible. In this example our approach is optimal because the number of red boxes is kept small throughout the computation. We have only one red box for each of our four threads.

Example: More complex computation with Reductions

More complex dask

We show a more complex expression:

expr = (B - B.mean(axis=0)) + (B.T / B.std())

This extends the class of expressions that we’ve seen so far to reductions and reductions along an axis. The per-chunk reductions start at the bottom and depend only on the chunk from which they originated. These per-chunk results are then concatenated together and re-reduced with the large circles (zoom in to see the text concatenate in there). The next level takes these (small) results and the original data again (note the long edges back down to the bottom data resources), which results in per-chunk subtractions and divisions.

From there on out the work is embarrassing, resembling the computation above. In this case we have relatively little parallelism, so the frontier of red boxes covers the entire image; fortunately the dataset is small.

Example: Fail Case

A case where our scheduling algorithm fails to avoid intermediates

We show a case where our greedy solution fails miserably:

expr = (A.T.dot(B) - B.mean(axis=0))

The two large functions on the second level from the bottom are the computation of the mean. These are cheap and, once completed, allow each of the blocks of the dot product to terminate quickly and release memory.

Tragically these mean computations execute at the last possible moment, keeping all of the dot product result trapped in the cache. We have almost a full row of red boxes at one point in the computation.

In this case our greedy solution was short-sighted; a slightly more global solution would select these large circles to run early. Perhaps betweenness centrality would resolve this problem.

On-disk caching

We’ll never build a good enough scheduler. We need to be able to fall back to on-disk caching. Actually this isn’t terrible. High performance SSDs get close to 1 GB/second throughput and in the complex cases where data-aware scheduling fails we probably compute slower than that anyway.

I have a little disk-backed dictionary project, chest, for this but it’s immature. In general I’d like to see more projects implement the dict interface with interesting policies.
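For illustration, here is a toy disk-backed MutableMapping in the spirit of chest. This is my own minimal sketch, not chest's implementation; a real version needs an eviction policy, key-collision handling, and locking:

```python
import os
import pickle
import tempfile
from collections.abc import MutableMapping

class DiskDict(MutableMapping):
    """Toy dict that pickles every value to its own file on disk."""
    def __init__(self, path):
        self.path = path
        os.makedirs(path, exist_ok=True)

    def _file(self, key):
        # One file per key; a real store must handle hash collisions
        return os.path.join(self.path, 'k-%d' % abs(hash(key)))

    def __setitem__(self, key, value):
        with open(self._file(key), 'wb') as f:
            pickle.dump((key, value), f)

    def __getitem__(self, key):
        try:
            with open(self._file(key), 'rb') as f:
                return pickle.load(f)[1]
        except FileNotFoundError:
            raise KeyError(key)

    def __delitem__(self, key):
        try:
            os.remove(self._file(key))
        except FileNotFoundError:
            raise KeyError(key)

    def __iter__(self):
        for fn in os.listdir(self.path):
            with open(os.path.join(self.path, fn), 'rb') as f:
                yield pickle.load(f)[0]

    def __len__(self):
        return len(os.listdir(self.path))

d = DiskDict(tempfile.mkdtemp())
d['block-0'] = list(range(3))
d['block-0']  # read back from disk -> [0, 1, 2]
```

Because it implements the dict interface, a scheduler can swap it in wherever a plain dict cache was used.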

Trouble with binary data stores

I have a confession: the first computation, the very large dot product, sometimes crashes my machine. While the scheduler manages memory well, I have a memory leak somewhere. I suspect that I use HDF5 improperly.

I also tried doing this with bcolz. Sadly nd-chunking is not well supported. email thread here and here.

Expression Scope

Blaze currently generates dasks for the following:

  1. Elementwise operations (like +, *, exp, log, …)
  2. Dimension shuffling like np.transpose
  3. Tensor contraction like np.tensordot
  4. Reductions like np.mean(..., axis=...)
  5. All combinations of the above

We also comply with the NumPy API on these operations. At the time of writing, notable missing elements include the following:

  1. Slicing (though this should be easy to add)
  2. Solve, SVD, or any more complex linear algebra. There are good solutions to this implemented in other linear algebra software (Plasma, Flame, Elemental, …) but I’m not planning to go down that path until lots of people start asking for it.
  3. Anything that NumPy can’t do.

I’d love to hear what’s important to the community. Re-implementing all of NumPy is hard; re-implementing a few choice pieces of NumPy is relatively straightforward. Knowing what those few choice pieces are requires community involvement.

Bigger ideas

My experience building dynamic schedulers is limited and my approach is likely suboptimal. It would be great to see other approaches. None of the logic in this post is specific to Blaze or even to NumPy. To build a scheduler you only need to understand the model of a graph of computations with data dependencies.

If we were ambitious we might consider a distributed scheduler to execute these task graphs across many nodes in a distributed-memory situation (like a cluster). This is a hard problem but it would open up a different class of computational solutions. The Blaze code to generate the dasks would not change; the graphs that we generate are orthogonal to our choice of scheduler.


I could use help with all of this, either as open source work (links below) or for money. Continuum has funding for employees and ample interesting problems.


January 06, 2015 12:00 AM

January 04, 2015

Juan Nunez-Iglesias


Understatement: I’m not much of a web developer. However, we all have to become a little bit versed in web-dev if we want to publish things these days. GitHub Pages makes it really easy to publish a site (check out the official guide, and Thinkful’s truly excellent interactive getting started guide).

If you just want to publish a static set of html files using absolute paths, you’ll be fine. However, Pages uses Jekyll, a server that can transform collections of Markdown, HTML, and other files into full-fledged websites. The process is definitely full of gotchas, though, and you’ll run into issues for anything other than single pages. I’m making this list for my own future reference, and so that I can finally close the umpteen tabs I have open on the topic! But I hope someone else will find it useful.

1. Jekyll lets you write Markdown, but it’s not GitHub-Flavored Markdown

You might assume, since you are writing markdown hosted on GitHub, that GitHub-Flavored Markdown (GFM) would work. But you would be wrong. You can configure Jekyll on GH-Pages to use either Redcarpet or Kramdown, both of which have significant syntactic differences with GFM. For example, fenced code blocks use ~~~ in Kramdown, instead of GFM’s ```.

Anyway, to set the Markdown engine used to render your site, set the markdown attribute in the Jekyll configuration file, _config.yml, placed at the root of your repo:

# Dependencies
markdown:    kramdown
highlighter: pygments

2. … And syntax highlighting doesn’t work properly in Jekyll on GH-Pages

Even using the correct delimiters for Kramdown is not good enough to get your code snippets to render properly, because of this and this. So, instead of using backtick- or tilde-fenced blocks, you have to use slightly clunkier Jekyll “liquid tags”, which have the following fairly ugly syntax:

{% highlight python %}
    from scipy import ndimage as nd
{% endhighlight %}

3. Linking is a bit weird

My expectation when I started working with this was that I could write in markdown, and link directly to other markdown files using relative URLs, and Jekyll would just translate these to HTML links when serving the site. This is how MkDocs and local viewers such as Marked 2 work, for example.

That’s not at all how Jekyll does things. Every markdown page that you want served needs to have a header called the Front Matter. Other markdown files will be publicly accessible but they will be treated as plaintext files.

How the markdown files get served also varies depending on your _config.yml and the permalink variable. The annoying thing is that the documentation on this is entirely centred on blog posts, with little indication about what happens to top-level pages. In the case of the pretty setting, filename.markdown, for example, gets served at base.url/filename/, unless you set the page’s permalink attribute manually in the Front Matter, for example to title.html or title/.

4. Relative paths don’t work on GH-Pages

This is a biggy, and super-annoying, because even if you follow GitHub’s guide for local development, all your style and layout files will be missing once you deploy to Pages.

The problem is that Jekyll assumes that your site lives at the root URL (i.e. username.github.io), when it is in fact in username.github.io/my-site.

The solution was developed by Matt Swensen here: You need to set a baseurl variable in your _config.yml. This should be your repository name. For example, if you post your site to the gh-pages branch of the repository https://github.com/username/my-jekyll-site, which gets served at username.github.io/my-jekyll-site/, you should set baseurl to my-jekyll-site.

Once you’ve set the baseurl, you must use it when linking to CSS and posts, using liquid tags: {{ site.baseurl }}/path/to/style.css and {{ site.baseurl }}{{ post.url }}.

When you’re testing the site locally, you can either manually set the baseurl to the empty string (bundle exec jekyll serve --baseurl ''), so you can navigate the site as you would normally with Jekyll; or you can leave it untouched and navigate to the local URL that mirrors the structure of GitHub Pages, including the baseurl path.

This is all documented in a specific Jekyll docs page that is unfortunately buried after a lot of other, less-relevant Jekyll docs. As a reference, look here, and search for “Project Page URL Structure”.

5. You must provide an index.html or index.md file

Otherwise you get a rather unhelpful Forbidden Error.

And 6. There is a “console” syntax highlighter for bash sessions

I’d always used “bash” as my highlighter whenever I wanted to show a shell session, but it turns out that “console” does a slightly better job, by highlighting $ prompts and commands, while graying output slightly. Christian Jann, who has written his own more-specific highlighter, ShellSession, has a nice comparison of the three. Although ShellSession is cool, I couldn’t get it to work out-of-the-box on Jekyll.

That should be enough to get you started… (Don’t forget to check out Thinkful’s guide!) Happy serving!

by Juan Nunez-Iglesias at January 04, 2015 12:09 PM

January 02, 2015


Python(x, y) Released!

Hi All,

I'm happy to announce that Python(x, y) is available for immediate download from any of the mirrors. The full change log can be viewed here. Please post your comments and suggestions to the mailing list.

Work on the 64 bit & 3.x versions progresses slowly but surely. They will happen eventually :)

What's new in general:

  • The official Visual C++ for Python 2.7 has replaced MinGW as the bundled compiler. Finally everyone can enjoy native Python binary extensions for win32. MinGW is still available as an additional plugin.
  • Console2 has been replaced with ConsoleZ, an actively developed fork which adds split views and many other improvements.
  • Maintaining a 2/3 code base is much easier with the inclusion of python-modernize. All backports have also been updated and expanded with unittest2.
  • All packages updated to their latest versions (at the time): IPython, GDAL, SciPy, ITK, VTK, SQLAlchemy, Astropy, SWIG, etc.
  • User Local install support has been revived - it is still experimental though.
  • Numpy has been held back at 1.8.2 to avoid potential compatibility issues.

New noteworthy packages:

  • cvxpy - A domain-specific language for modeling convex optimization problems in Python.
  • grin - Utility which searches directories of source code better than grep or find.
Have fun!

-Gabi Davar

by Gabi Davar (noreply@blogger.com) at January 02, 2015 10:08 AM

December 31, 2014

Philip Herron


The WordPress.com stats helper monkeys prepared a 2014 annual report for this blog.

Here’s an excerpt:

A New York City subway train holds 1,200 people. This blog was viewed about 8,100 times in 2014. If it were a NYC subway train, it would take about 7 trips to carry that many people.

Click here to see the complete report.

by redbrain812 at December 31, 2014 06:11 PM

December 30, 2014

Matthew Rocklin

Towards Out-of-core ND-Arrays -- Frontend

tl;dr Blaze adds usability to our last post on out-of-core ND-Arrays

Disclaimer: This post is on experimental buggy code. This is not ready for public use.


This follows my last post designing a simple task scheduler for use with out-of-core (or distributed) nd-arrays. We encoded tasks-with-data-dependencies as simple dictionaries. We then built functions to create dictionaries that describe blocked array operations. We found that this was an effective-but-unfriendly way to solve some important-but-cumbersome problems.

This post sugars that programming experience with the blaze and into projects to give a numpy-like experience out-of-core.

Old low-level code

Here is the code we wrote for an out-of-core transpose/dot-product (actually a symmetric rank-k update).

Create random array on disk

import bcolz
import numpy as np
b = bcolz.carray(np.empty(shape=(0, 1000), dtype='f8'),
                 rootdir='A.bcolz', chunklen=1000)
for i in range(1000):
    b.append(np.random.rand(1000, 1000))

Define computation A.T * A

d = {'A': b}
d.update(getem('A', blocksize=(1000, 1000), shape=b.shape))

# Add A.T into the mix
d.update(top(np.transpose, 'At', 'ij', 'A', 'ji', numblocks={'A': (1000, 1)}))

# Dot product A.T * A
d.update(top(dotmany, 'AtA', 'ik', 'At', 'ij', 'A', 'jk',
         numblocks={'A': (1000, 1), 'At': (1, 1000)}))

New pleasant feeling code with Blaze

Targeting users

The last section “Define computation” is written in a style that is great for library writers and automated systems but is challenging to users accustomed to Matlab/NumPy or R/Pandas style.

We wrap this process with Blaze, an extensible front-end for analytic computations.

Redefine computation A.T * A with Blaze

from dask.obj import Array  # a proxy object holding on to a dask dict
from blaze import *

# Load data into dask dictionaries
dA = into(Array, 'A.bcolz', blockshape=(1000, 1000))
A = Data(dA)  # Wrap with blaze.Data

# Describe computation in friendly numpy style
expr = A.T.dot(A)

# Compute results
>>> %time compute(expr)
CPU times: user 2min 57s, sys: 6.4 s, total: 3min 3s
Wall time: 2min 50s
array([[ 334071.93541158,  250297.16968262,  250404.87729587, ...,
         250436.85274716,  250330.64262904,  250590.98832611],
       [ 250297.16968262,  333451.72293343,  249978.2751824 , ...,
         250103.20601281,  250014.96660956,  250251.0146828 ],
       [ 250404.87729587,  249978.2751824 ,  333279.76376277, ...,
         249961.44796719,  250061.8068036 ,  250125.80971858],
       [ 250436.85274716,  250103.20601281,  249961.44796719, ...,
         333444.797894  ,  250021.78528189,  250147.12015207],
       [ 250330.64262904,  250014.96660956,  250061.8068036 , ...,
         250021.78528189,  333240.10323875,  250307.86236815],
       [ 250590.98832611,  250251.0146828 ,  250125.80971858, ...,
         250147.12015207,  250307.86236815,  333467.87105673]])

Under the hood

Under the hood, Blaze creates the same dask dicts we created by hand last time. I’ve doctored the result rendered here to include suggestive names.

>>> compute(expr, post_compute=False).dask
{'A': carray((10000000, 1000), float64), ...
 ('A', 0, 0): (ndget, 'A', (1000, 1000), 0, 0),
 ('A', 1, 0): (ndget, 'A', (1000, 1000), 1, 0),
 ('At', 0, 0): (np.transpose, ('A', 0, 0)),
 ('At', 0, 1): (np.transpose, ('A', 1, 0)),
 ('AtA', 0, 0): (dotmany, [('At', 0, 0), ('At', 0, 1), ('At', 0, 2), ...],
                         [('A', 0, 0),  ('A', 1, 0),  ('A', 2, 0), ...])}

We then compute this sequentially on a single core. However we could have passed this on to a distributed system. This result contains all necessary information to go from on-disk arrays to computed result in whatever manner you choose.

Separating Backend from Frontend

Recall that Blaze is an extensible front-end to data analytics technologies. It lets us wrap messy computational APIs with a pleasant and familiar user-centric API. Extending Blaze to dask dicts was the straightforward work of an afternoon. This separation allows us to continue to build out dask-oriented solutions without worrying about user-interface. By separating backend work from frontend work we allow both sides to be cleaner and to progress more swiftly.

Future work

I’m on vacation right now. Work for recent posts has been done in evenings while watching TV with the family. It isn’t particularly robust. Still, it’s exciting how effective this approach has been with relatively little effort.

Perhaps now would be a good time to mention that Continuum has ample grant funding. We’re looking for people who want to create usable large-scale data analytics tools. For what it’s worth, I quit my academic postdoc to work on this and couldn’t be happier with the switch.


This code is experimental and buggy. I don’t expect it to stay around forever in its current form (it’ll improve). Still, if you’re reading this when it comes out then you might want to check out the following:

  1. master branch on dask
  2. array-expr branch on my blaze fork

December 30, 2014 12:00 AM

December 27, 2014

Matthew Rocklin

Towards Out-of-core ND-Arrays

tl;dr We propose a system for task-centered computation, show an example with out-of-core nd-arrays, and ask for comments.

Note: This post is not user-focused. It is intended for library writers.


Recent notebooks (links 1, 2) describe how Blaze handles out-of-core single-dataset tabular computations in the following stages.

  1. Partition the dataset into chunks
  2. Apply some computation on each chunk
  3. Concatenate the results (hopefully smaller)
  4. Apply another computation into the concatenated result

Steps 2 and 4 require symbolic analysis of what work should be done; Blaze does this well. Steps 1 and 3 are more about coordinating where data goes and when computation executes.

This setup is effective for a broad class of single-dataset tabular computations. It fails for more complex cases. Blaze doesn’t currently have a good target for describing complex inter-block data dependencies. The model for breaking apart data and arranging computations (1 and 3) is too simple.

A good example of a complex case is an nd-array matrix-matrix multiply / dot product / tensor contraction. In this case a blocked approach has a more complex communication pattern. This post is about finding a simple framework that allows us to express these patterns. It’s about finding a replacement for steps 1 and 3 above.

Task Scheduling

The common solution to this problem is to describe the computation as a bipartite directed acyclic graph where nodes are computations and data and edges indicate which pieces of data a computation takes as input and delivers as output.

Many solutions to this problem exist, both theoretical algorithms and implemented software. Forgive me for describing yet-another system.


We use a low-tech representation of a task dependency graph. We use a dictionary of key-value pairs where keys are any hashable identifier and values are one of the following:

  1. A value, like 1
  2. A tuple containing a function and arguments, like (inc, 1). This is like an s-expression and should be interpreted as an unevaluated inc(1)
  3. A tuple containing a function and arguments. Arguments may include other keys, (inc, 'my_key')

This is clearer with an example, which we show on the right.

d = {'x': 1,
     'y': (inc, 'x'),
     'z': (add, 'y', 10)}

The dask library contains a small reference implementation to get values associated to keys in this task graph.

>>> import dask
>>> dask.get(d, 'x')
>>> dask.get(d, 'y')  # Triggers computation
>>> dask.get(d, 'z')  # Triggers computation

In principle this could be executed by a variety of different implementations each with different solutions for distributed computing, caching, etc..
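A reference implementation can be only a few lines. The following is my own sketch of the recursive evaluation implied by the spec above, not dask's actual code (which differs in details like caching and error handling):

```python
def inc(x):
    return x + 1

def add(x, y):
    return x + y

def get(d, key):
    """Recursively evaluate the task stored at `key` in graph `d`."""
    task = d[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, args = task[0], task[1:]
        # Arguments that are themselves keys trigger recursive computation;
        # anything else passes through as a literal value
        resolved = [get(d, a) if isinstance(a, (str, tuple)) and a in d else a
                    for a in args]
        return func(*resolved)
    return task  # a plain value

d = {'x': 1,
     'y': (inc, 'x'),
     'z': (add, 'y', 10)}

get(d, 'z')  # -> add(inc(1), 10) = 12
```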

Dask also includes convenience functions to help build this graph.

>>> dask.set(d, 'a', add, args=['x', 'y'])

This is mainly to help those who feel uncomfortable putting the parenthesis on the left side of a function call to avoid immediate execution:

>>> # d['a'] =  add( 'x', 'y')  # intend this
>>>   d['a'] = (add, 'x', 'y')  # but write this to avoid immediate execution

Why low tech?

These “graphs” are just dictionaries of tuples. Notably, we imported dask after we built our graph. The framework investment is very light.

  • Q: Why don’t we build Task and Data classes and construct a Python framework to represent these things formally?
  • A: Because people have to learn and buy in to that framework and that’s hard to sell. Dictionaries are easier to sell. They’re also easy to translate into other systems. Additionally, I was able to write a reference implementation in a couple dozen lines.

It’s easy to build functions that create dicts like this for various applications. There is a decent chance that, if you’ve made it this far in this blogpost, you already understand the spec.


I want to encode out-of-core ND-Array algorithms as data. I’ve written a few functions that create dask-style dictionaries to help me describe a decent class of blocked nd-array computations.

The following section is a specific example applying these ideas to the domain of array computing. This is just one application and not core to the idea of task scheduling. The core ideas to task scheduling and the dask implementation have already been covered above.

Getting blocks from an array

First, we break apart a large, possibly out-of-core array into blocks. For convenience in these examples we work with in-memory numpy arrays rather than on-disk arrays. Jump to the end if you’d like to see a real OOC dot product on on-disk data.

We make a function ndget to pull out a single block

>>> x = np.arange(24).reshape((4, 6))
>>> x
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23]])

>>> # Cutting into (2, 3) shaped blocks, get the (0, 0)th block
>>> ndget(x, (2, 3), 0, 0)
array([[0, 1, 2],
       [6, 7, 8]])

>>> # Cutting into (2, 3) shaped blocks, get the (1, 0)th block
>>> ndget(x, (2, 3), 1, 0)
array([[12, 13, 14],
       [18, 19, 20]])

We now make a function getem that builds a dict using this ndget function to pull out all of the blocks. This creates more keys in our dictionary, one for each block. We name each key by the key of the array followed by a block index.

  • getem: Given a large possibly out-of-core array and a blocksize, pull apart that array into many small blocks
>>> d = {'X': x}  # map the key 'X' to the data x

>>> getem('X', blocksize=(2, 3), shape=(4, 6))
{('X', 0, 0): (ndget, 'X', (2, 3), 0, 0),
 ('X', 1, 0): (ndget, 'X', (2, 3), 1, 0),
 ('X', 1, 1): (ndget, 'X', (2, 3), 1, 1),
 ('X', 0, 1): (ndget, 'X', (2, 3), 0, 1)}

>>> d.update(_)  # dump in getem dict

So we have a single original array, x, and using getem we describe how to get many blocks out of x, using the function ndget on each block.

  • ndget actually does work on data
  • getem creates a dask dict that describes what ndget should operate on

We haven’t done work yet. We only do work when we finally call dask.get on the appropriate key for one of the blocks.

>>> dask.get(d, ('X', 1, 0))  # Get block corresponding to key ('X', 1, 0)
array([[12, 13, 14],
       [18, 19, 20]])

We use numpy.ndarrays for convenience. This would have worked with anything that supports numpy-style indexing, including out-of-core structures like h5py.Dataset, tables.Array, or bcolz.carray.
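As a sketch of how these two helpers might be implemented (the actual dask versions differ, e.g. in handling ragged edge blocks), assuming block sizes divide the shape evenly:

```python
from itertools import product

import numpy as np

def ndget(arr, blocksize, *index):
    # Slice out the block at `index` from an array cut into `blocksize` pieces
    return arr[tuple(slice(i * n, (i + 1) * n)
                     for i, n in zip(index, blocksize))]

def getem(key, blocksize, shape):
    # One ndget task per block, keyed by the array name plus a block index
    numblocks = [s // b for s, b in zip(shape, blocksize)]
    return {(key,) + ijk: (ndget, key, blocksize) + ijk
            for ijk in product(*map(range, numblocks))}

x = np.arange(24).reshape((4, 6))
graph = getem('X', blocksize=(2, 3), shape=(4, 6))
graph[('X', 1, 0)]  # -> (ndget, 'X', (2, 3), 1, 0)
```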

Example: Embarrassingly Parallel Computation

If we have a simple function

def inc(x):
    return x + 1

That we want to apply to all blocks of the dataset we could, in principle, add the following to our dictionary.

>>> d.update({('X-plus-1', i, j): (inc, ('X', i, j)) for i in range(2)
                                                     for j in range(2)})
>>> dask.get(d, ('X-plus-1', 0, 0))
array([[1, 2, 3],
       [7, 8, 9]])

Our use of keys like ('name', i, j) to refer to the i,jth block of an array is an incidental convention and not intrinsic to dask itself. This use of tuples as keys should not be confused with the use of tuples in values to encode unevaluated functions.

Index expressions

A broad class of array computations can be written with index expressions

Z[j, i] = X[i, j]   – Matrix transpose

Z[i, k] = Σ_j X[i, j] · Y[j, k]   – Matrix-matrix multiply

Fortunately, the blocked versions of these algorithms look pretty much the same. To leverage this structure we made the function top for tensor operations (ideas for a better name welcome). This writes index operations like the following for blocked transpose:

>>> top(np.transpose, 'Z', 'ji', 'X', 'ij', numblocks={'X': (2, 2)})
{('Z', 0, 0): (numpy.transpose, ('X', 0, 0)),
 ('Z', 0, 1): (numpy.transpose, ('X', 1, 0)),
 ('Z', 1, 0): (numpy.transpose, ('X', 0, 1)),
 ('Z', 1, 1): (numpy.transpose, ('X', 1, 1))}

The first argument np.transpose is the function to apply to each block. The second and third arguments are the name and index pattern of the output. The succeeding arguments are the key and index pattern of the inputs. In this case the index pattern is reversed: we map the ijth block to the jith block of the output after we call np.transpose. Finally, the numblocks keyword argument gives the block structure of the inputs. The index structure can be any iterable.
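A simplified sketch of this index matching for the single-input, no-contraction case (top_unary is a hypothetical name of my own; the real top also handles multiple inputs and dummy indices):

```python
from itertools import product

import numpy as np

def top_unary(func, out_name, out_ind, in_name, in_ind, numblocks):
    """Build one task per output block, wiring each to its matching input block."""
    # Number of blocks along each named index, taken from the input
    dims = dict(zip(in_ind, numblocks[in_name]))
    graph = {}
    for coords in product(*(range(dims[i]) for i in out_ind)):
        pos = dict(zip(out_ind, coords))            # index name -> block position
        in_key = (in_name,) + tuple(pos[i] for i in in_ind)
        graph[(out_name,) + coords] = (func, in_key)
    return graph

graph = top_unary(np.transpose, 'Z', 'ji', 'X', 'ij', {'X': (2, 2)})
graph[('Z', 0, 1)]  # -> (np.transpose, ('X', 1, 0))
```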

Matrix Multiply

We represent tensor contractions like matrix-matrix multiply with indices that are repeated in the inputs and missing in the output like the following. In the following example the index 'j' is a contracted dummy index.

>>> top(..., 'Z', 'ik', 'X', 'ij', 'Y', 'jk', numblocks=...)

In this case the function receives an iterator of blocks of data that iterate over the dummy index, j. We make such a function to take iterators of square array blocks, dot product the pairs, and then sum the results. This is the inner-most loop of a conventional blocked-matrix-matrix multiply algorithm.

def dotmany(A, B):
    return sum(map(np.dot, A, B))

By combining this per-block function with top we get an out-of-core dot product.

>>> top(dotmany, 'Z', 'ik', 'X', 'ij', 'Y', 'jk',  numblocks={'X': (2, 2),
                                                              'Y': (2, 2)})
{('Z', 0, 0): (dotmany, [('X', 0, 0), ('X', 0, 1)],
                        [('Y', 0, 0), ('Y', 1, 0)]),
 ('Z', 0, 1): (dotmany, [('X', 0, 0), ('X', 0, 1)],
                        [('Y', 0, 1), ('Y', 1, 1)]),
 ('Z', 1, 0): (dotmany, [('X', 1, 0), ('X', 1, 1)],
                        [('Y', 0, 0), ('Y', 1, 0)]),
 ('Z', 1, 1): (dotmany, [('X', 1, 0), ('X', 1, 1)],
                        [('Y', 0, 1), ('Y', 1, 1)])}

The top function inspects the index structure of the inputs and outputs and constructs dictionaries that reflect this structure, matching indices between keys and creating lists of keys over dummy indices like j.

And that was it, we have an out-of-core dot product. Calling dask.get on these keys results in out-of-core execution.
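As a small in-memory sanity check (my own toy verification, not from the original post), one output block computed with dotmany matches the corresponding slice of the full product:

```python
import numpy as np

def dotmany(A, B):
    # Pairwise dot products of block sequences, summed over the dummy index
    return sum(map(np.dot, A, B))

X = np.arange(16.0).reshape(4, 4)
# Cut X into four 2x2 blocks
X00, X01 = X[:2, :2], X[:2, 2:]
X10, X11 = X[2:, :2], X[2:, 2:]

# Block (0, 0) of X.dot(X) is X00·X00 + X01·X10
out00 = dotmany([X00, X01], [X00, X10])
np.allclose(out00, X.dot(X)[:2, :2])  # -> True
```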

Full example

Here is a tiny proof of concept for an out-of-core dot product. I wouldn’t expect users to write this. I would expect libraries like Blaze to write this.

Create random array on disk

import bcolz
import numpy as np
b = bcolz.carray(np.empty(shape=(0, 1000), dtype='f8'),
                 rootdir='A.bcolz', chunklen=1000)
for i in range(1000):
    b.append(np.random.rand(1000, 1000))

Define computation A.T * A

d = {'A': b}
d.update(getem('A', blocksize=(1000, 1000), shape=b.shape))

# Add A.T into the mix
d.update(top(np.transpose, 'At', 'ij', 'A', 'ji', numblocks={'A': (1000, 1)}))

# Dot product A.T * A
d.update(top(dotmany, 'AtA', 'ik', 'At', 'ij', 'A', 'jk',
         numblocks={'A': (1000, 1), 'At': (1, 1000)}))

Do work

>>> dask.get(d, ('AtA', 0, 0))
CPU times: user 2min 57s, sys: 6.59 s, total: 3min 4s
Wall time: 2min 49s
array([[ 334071.93541158,  250297.16968262,  250404.87729587, ...,
         250436.85274716,  250330.64262904,  250590.98832611],
       [ 250297.16968262,  333451.72293343,  249978.2751824 , ...,
         250103.20601281,  250014.96660956,  250251.0146828 ],
       [ 250404.87729587,  249978.2751824 ,  333279.76376277, ...,
         249961.44796719,  250061.8068036 ,  250125.80971858],
       [ 250436.85274716,  250103.20601281,  249961.44796719, ...,
         333444.797894  ,  250021.78528189,  250147.12015207],
       [ 250330.64262904,  250014.96660956,  250061.8068036 , ...,
         250021.78528189,  333240.10323875,  250307.86236815],
       [ 250590.98832611,  250251.0146828 ,  250125.80971858, ...,
         250147.12015207,  250307.86236815,  333467.87105673]])

Three minutes for a 7GB dot product. This runs at about half the FLOPS of a normal in-memory matmul; I’m not yet sure why the discrepancy. Also, this isn’t using an optimized BLAS, and we have yet to leverage multiple cores.

This isn’t trivial to write, but it’s not bad either.

Complexity and Usability

This system is not appropriate for users; it’s unPythonic, low-level, and LISP-y. However I believe that something like this would be an appropriate standard for infrastructural libraries. It’s a simple and easy standard for code to target.

Using projects like into and blaze we can build a usable high-level front-end onto dask for the subproblems of arrays and tables. Blaze could generate these dictionaries and then hand them off to other systems to execute.


Using the reference implementation, multithreading, HDF5/BColz, and out-of-core caching systems like chest I think that we can build a decent out-of-core ndarray solution that fully leverages a large workstation.

Ideally other people come along and build better execution engines / task schedulers. This might be an appropriate application for IPython parallel.


This could use design and technical feedback. What would encourage community buy-in to a system like this?

December 27, 2014 12:00 AM

December 25, 2014

Titus Brown

On gaining tenure as an open scientist

On December 10th, 2014, I was formally awarded tenure at UC Davis, where I will start as an Associate Professor in the School of Veterinary Medicine on January 5th, 2015. In my research statement for my job application, I wrote:

Open science and scientific reproducibility: I am a strong advocate of open science, open source, open data, open access, and the use of social media in research as a way to advance research more broadly.

I've been an advocate of "open" for decades, and since becoming an Assistant Professor at Michigan State University in spring 2008, I have explored a variety of approaches to doing science more openly:

While being open, I achieved several career milestones, including publishing several senior author papers, graduating several doctoral students, giving an invited talk at a Gordon Conference, keynoting several international conferences, having several highly ranked universities recruit me, getting an NIH R01 (and overall bringing in more than $2m in grants while at MSU), getting recruited and hired at an Associate Professor level, gaining tenure, and becoming a Moore Investigator with sustained medium-term funding. There are certainly many more successful people than me out there, but I personally don't know what else I could wish for - my plate is full and I'm pursuing research I believe to be important.

Here are some things I've observed over the years.

It is possible to achieve some measure of traditional success while being open. Grants; publications; tenure. 'nuff said.

Being open has become an increasingly positive driver of my career. Over the years, the benefit of openness has become increasingly obvious. I can't actually point to many negatives - the biggest specific problem I saw was that various people in my administrative chain at MSU didn't understand why I was pursuing open, but it's not clear that that had any negative consequences on my career. Meanwhile, thousands of people are using our software, I have a reasonably large and very enthusiastic blog and Twitter following, a number of papers with lots of citations, more invitations to speak than I can take, and excellent applications for grad student and postdoc positions.

My current research program developed independently of open. I think I would have been doing approximately the same things at a research level whether or not I had been so open about it. (With my Moore award, that's changing in the future.) In particular, some grant awards resulted from connections made through my blog, but most did not.

Open is not the point, and maybe shouldn't be. I would still rather hire an excellent closed scientist than a mediocre open scientist. But I would definitely push hiring an excellent open scientist over an excellent closed scientist. In my case, Davis definitely didn't hire me because I was an open scientist; they hired me because they liked my research and training efforts, and because there was a need for them at Davis. Open simply didn't figure in.

Some people will argue that your time is better spent elsewhere... Of the six or so letters evaluating my tenure case at Davis, I reportedly had one letter that was rather negative and argued that I was not living up to my potential. My chair agreed with my suspicion that one reason for this backhanded compliment was that the author of the letter felt I was wasting my time by investing in social media, open science, and reproducible research.

...but people can always find a reason to criticize you. There are very few science faculty who I think are doing everything right, and even fewer to whom I wouldn't offer friendly advice if asked. Certainly, if I wanted to tear someone down I could probably do so. The majority of my tenure letter writers, the entirety of my new department, and (presumably) the rest of the hierarchy at Davis all strongly supported my tenure case.

Most of my colleagues have been very supportive. Virtually none of the active research faculty in my departments at MSU question the need for change, or the utility of open science/access/source/data. However, they do get stuck on the details of how to incentivize, drive, and implement open science. The majority of negative or "WTF?" comments I've received have been from a subset of administrators; my most charitable perspective on this is that administrators are generally much more concerned with policy and incentives than individual faculty, and consequently weight the obstacles higher than the opportunities.

Open science will (eventually) win out in basic research... There are too many reasons for granting agencies to support open science for it to remain closed, absent significant monetary drivers pushing the other way.

...but there's a fundamental asymmetry in closed vs open that's retarding progress towards open, and scientific progress more broadly. A closed scientist can make use of preprints, open source and open data; an open scientist cannot make use of closed science products until they are published. See: Prisoner's Dilemma. I will have more to say on this over the next years...

By remaining so closed, science is ignoring the role of serendipity in progress. I regularly read articles bemoaning the cost of openness, and I think many of these people are choosing the somewhat certain (but suboptimal) consequences of being closed over the insecurity of the uncertain (but potentially very positive) consequences of being open. As scientific funding becomes ever more stringent and competitiveness grows, the advantages of being conservative will evaporate for all but the academic 1%. (As Shirley Tilghman says, for many "who have succeeded in the system, there appears to be little to be gained from messing with it". That's going to change quickly as the data science asteroid hits science, among other things; I expect to see fairly rapid turnover in that 1%.)

Open science needs more practitioners. A few years back I made a conscious decision to be less of a cheerleader and more of a practitioner. I enjoy doing science, and I enjoy talking about it, and I think the experiments we do in my lab on how to promulgate and accelerate our science through openness are just as important as the policy making, the grant funding, the infrastructure building, and yes, the publicity and cheerleading that is going on. We need it all!

Scientific culture is generational; change will come, but slowly. Most senior scientists (the ones who sit on editorial boards, review panels, and tenure committees) are already overwhelmed and are unlikely to change -- as Mike Eisen says, "the publishing behavior of most established scientists has proven [...] to be beyond amendment." But there's hope; here's a great quote in the Atlantic article, from Virginia Barbour:

"There's a generation of scientists who are running labs and running tenure committees who were brought up on a very different system," said Barbour. "I have huge hopes on the generation that's coming up, because they’re a generation built on openness and availability, and for a lot of things we're talking about it may well take a generational change. It may not be until those people are in positions of authority that things will really change," she said.



Thanks to Tracy Teal for reading and commenting on a draft of this post!

by C. Titus Brown at December 25, 2014 11:00 PM

December 23, 2014

Luis Pedro Coelho


Just released mahotas 1.2.4. The single improvement over 1.2.3 is that PIL (or Pillow) based IO is automatically used as a fallback. This will hopefully make the package easier to use for new users who have one fewer dependency to install.
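The general shape of such an IO fallback (a hypothetical sketch of the pattern, not mahotas' actual code; the loader names here are made up) is to try backends in order and use the first one that works:

```python
def load_with_fallback(path, loaders):
    """Try each (name, loader) pair in order; return the first result.

    Sketch of a native-IO-with-PIL-fallback scheme: if the preferred
    backend is missing or fails, quietly move on to the next one.
    """
    errors = []
    for name, loader in loaders:
        try:
            return loader(path)
        except Exception as exc:
            errors.append((name, exc))
    raise IOError("no IO backend could read %r: %r" % (path, errors))

# Stand-in loaders for illustration: the "native" one is unavailable,
# the PIL-style one succeeds.
def native_load(path):
    raise ImportError("native IO backend not installed")

def pil_load(path):
    return "pixels-from-%s" % path

result = load_with_fallback("photo.png",
                            [("native", native_load), ("pil", pil_load)])
print(result)  # pixels-from-photo.png
```

With a scheme like this, users who only have Pillow installed still get working `imread`/`imsave`, which is the "one fewer dependency" win the release notes describe.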

by luispedro at December 23, 2014 01:02 PM

December 22, 2014

Titus Brown

Three movies, plus two more

Here are three movies that I can recommend, starting with the best.

Jiro Dreams of Sushi - a great movie about the pursuit of perfection, one slice of fish at a time. Truly inspiring. Software engineers and scientists will recognize and identify with Jiro's aspirations.

The Angels' Share - a fun, laid back movie about a scotch whisky caper. Family fun. Plus the lead male character is just plain cute!

Somm - a somewhat intense movie about preparing for the Master Sommelier exam. I was hoping it would be another Jiro Dreams of Sushi, but instead it was much more about competition and study. On that front, anyone who has done a PhD will be a bit amused about the statements of hard labor - sure, it's more intense, but for a shorter period of time. Of the three, Somm was my least favorite, but it was still very good.

I'm also halfway through two more movies that I hope to finish soon:

Chef - Jon Favreau moves from being a restaurant chef to running a food truck. Great chemistry, good characters, delicious-looking food.

To Be Takei - a surprisingly engrossing movie about George Takei's life and career.


by C. Titus Brown at December 22, 2014 11:00 PM

December 19, 2014

Gaël Varoquaux

PRNI 2016: call for organization

PRNI (Pattern Recognition for NeuroImaging) is an IEEE conference about applying pattern recognition and machine learning to brain imaging. It is a mid-sized conference (about 150 attendees), and is a satellite of OHBM (the annual “Human Brain Mapping” meeting).

The steering committee is calling for bids to organize the conference in June 2016, in Europe, as a satellite of the OHBM meeting in Geneva.

by Gaël Varoquaux at December 19, 2014 04:00 AM

December 18, 2014

Stefan Pfenninger

Calliope, an open-source energy systems modeling framework

Energy system models are used by analysts in academia, government and industry. They allow researchers to build internally coherent simulations and scenarios of the extraction, conversion, transportation and use of energy, either today or in the future, at scales ranging from cities to entire countries or continents. These models are particularly important because they can help with understanding and planning the transition from a fossil-fuel dominated energy system to one primarily consisting of clean and renewable energy.

Calliope Logo

For about a year, I’ve been working on Calliope, a framework to develop energy system models using a modern and open source Python-based toolchain (the name derives from the idea of “multi-scale energy systems modeling” which leads to the clever abbreviation MUSES).

Basically, I started to work on Calliope because I wanted something that ticks all these boxes:

  • Capable of using data with arbitrary (to some degree) resolution in space and time in an intelligent way
  • Able to define a number of model parameters to iterate over, generate a range of models to do so, and deploy these to a computing cluster, then process the results in a unified manner
  • Allows the specification of energy models in a text-based, modular, hierarchical data format (based on YAML and CSV files)
  • Written in a modern high-level language (Python)
  • Modular and easily extensible
  • Uses a permissive Open Source license (Apache 2.0)

The goal of Calliope is not to compete with established large modeling systems such as TIMES. Rather, it is to provide a framework that enables the rapid construction of (small or large) models built to answer specific research questions. Furthermore, by combining its model specification format and the open code base, its aim is also to contribute to more open sharing of data and code in the energy modeling community.

It has its own website at www.callio.pe (alas, with little content so far).

It’s still a work in progress, with significant further functionality to be added in future releases. Currently, I’m the only developer, but contributions from others for whom this might be a useful tool are very welcome. I’m using Calliope for several ongoing research projects and it is undergoing continued development, so more material including academic papers will become available over the coming months. For now, there is a brief tutorial in the documentation that explains some of the things that Calliope does.

Comments, criticism and contributions are very welcome!

by Stefan Pfenninger at December 18, 2014 04:19 PM

December 17, 2014

Titus Brown

My review of "Determining the quality and complexity of NGS..."

I was a reviewer on Determining the quality and complexity of next-generation sequencing data without a reference genome by Anvar et al., PDF here. Here is the top bit of my review.

One interesting side note - the authors originally named their tool kMer and I complained about it in my review. And they renamed it to kPal! Which is much less confusing.

The authors show that a specific set of low-k k-mer profile analysis tools can identify biases and errors in sequencing samples as well as determine sample distances between metagenomic samples. All of this is done independently of reference genomes/transcriptomes, which is very important.

The paper is well written and quite clear. I found it easy to read and easy to understand. The work is also novel, I believe.

Highlights of the paper for me included a solid discussion of k-mer size selection, a thorough exploration of how to compare various k-mer-based statistics, and the excellent quality evaluation (Figure 3).

I was a bit surprised by the shift from quality assessment to metagenomic analysis, but there is an underlying continuity in the approach that makes this a reasonable transition. There might be a way to update the text to make this transition easier for the non-bioinformatic reader.

It's hard to pick out one particularly important result; the two biggest results are (a) k-mer based and reference free quality evaluation works quite well, and (b) k-mer analysis does a great job of grouping metagenome samples. The theory work on transitioning between k-mer sizes is potentially of great technical interest as well.
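For readers unfamiliar with the underlying idea: a k-mer profile is just a count vector over all length-k substrings of a sample, and sample distances are computed between those vectors. A toy sketch (not kPal's actual implementation, and ignoring its clever statistics):

```python
from collections import Counter

def kmer_profile(seq, k):
    """Count all overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def profile_distance(p, q):
    """A simple L1 distance between two k-mer count profiles."""
    kmers = set(p) | set(q)
    return sum(abs(p[m] - q[m]) for m in kmers)

a = kmer_profile("ACGTACGT", 2)
b = kmer_profile("ACGTTGCA", 2)
print(a["AC"])              # 2: "AC" occurs twice in ACGTACGT
print(profile_distance(a, b))  # 8
```

The reference-free property falls out immediately: nothing here ever consults a genome, only the reads themselves.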

by C. Titus Brown at December 17, 2014 11:00 PM

December 16, 2014

Titus Brown

The post-apocalyptic world of binary containers

The apocalypse is nigh. Soon, binary executables and containers in object stores will join the many Web-based pipelines and the several virtual machine images on the dystopic wasteland of "reproducible science."


I had a conversation a few weeks back with a senior colleague about container-based approaches (like Docker) wherein they advocated the shipping around of executable binary blobs with APIs. I pointed out that blobs of executable code were not a good replacement for understanding or a path to progress (see my blog post on that) and they vehemently disagreed, saying that they felt it was an irrelevant point to the progress of science.

That made me sad.

One of the things I really like about Docker is that the community emphasizes devops-style clean installs and configurations over deployment and distribution of binary blobs (images, VMs, etc.) Let's make sure to keep that; I think it's important for scientific progress to be able to remix software.

I'll just close with this comment:

The issue of whether I can use your algorithm is largely orthogonal to the issue of whether I can understand your algorithm. The former is engineering progress; the latter is scientific progress.


p.s. While I do like to pick on the Shock/Awe/MG-RAST folk because their pipeline is utterly un-reusable by anyone, anywhere, ever, I am being extremely unfair in linking to their paper as part of this blog post. They're doing something neat that I am afraid will ultimately lead in bad directions, but they're not espousing a binary-only view of the world at all. I'm just being funny. Honest.

p.p.s. Bill Howe and I also agree. So I'm being multiply unfair. I know.

by C. Titus Brown at December 16, 2014 11:00 PM

December 15, 2014


Plotting in Excel with PyXLL and Matplotlib

Author: Tony Roberts, creator of PyXLL, a Python library that makes it possible to write add-ins for Microsoft Excel in Python. Download a FREE 30 day trial of PyXLL here. Python has a broad range of tools for data analysis and visualization. While Excel is able to produce various types of plots, sometimes it’s either […]

by admin at December 15, 2014 07:27 PM

December 11, 2014

Filipe Saraiva

Moving to Stockholm


[This post is off-topic for some Planet readers, sorry for it. I just expect to get some help with free software communities.]

Exciting times are coming for me before the end of the year. Next week (probably) I am going to Stockholm to live for 5 months. I am getting a visiting researcher position at KTH – Royal Institute of Technology as part of my PhD. I will work with the PSMIX group – Power System Management with related Information Exchange, led by Prof. Lars Nordström.

The subject of my project is the modelling and simulation of smart grid features (a power system plus a layer of communication and decision-making automation) using multi-agent systems. I expect to work with the simulation platform developed by PSMIX, based on Raspberry Pi and the SPADE framework. The platform is described in this paper.

Well, I am very anxious about this trip because of two things: communicating in English and the Swedish winter. The second is my main concern. Gosh, going from everyday Brazilian summer straight to the Nordic winter! :'( But ♫ I will survive ♪ (I expect). ;)

If you know someone who can help me with tips about renting an apartment, I will be very glad. Renting accommodation in Stockholm is very hard and expensive.

Thanks and I will send news!

by Filipe Saraiva at December 11, 2014 12:41 PM

December 08, 2014

Philip Herron


Cython is an epic project, let's just be up front about it. I have made some use of it professionally in previous projects.

There is one thing that keeps coming up and it really bugs me: “Cython to speed up your Python programs”. This in my opinion is a misrepresentation of what Cython is, though I am admittedly being “hyper-critical”.

Cython has many uses, and I have illustrated them, probably more than most, across different cases aimed at the hacker. Extending pure C/C++ projects directly, without writing any C/C++ and doing it all in Python, is where it can seriously shine; it is insane how powerful it can become if you set up your project accordingly. My frustration comes from some small criticism about how I don't really talk about speeding up _your_ Python code.

There are several reasons why I seriously avoided talking about this in the book and in my recent Cython article for Linux Format. But to be clear, you need to understand what Cython is and what it does.

Cython provides a way of natively working with C/C++ types and code. The side effect is that you can compile Cython code, which just so happens to look a lot like Python, to be really efficient and strip out the use of the Python runtime, since you are directly using C types. This is fine, but let's be clear: Cython and Python are two very different programming languages. It just so happens that Cython can compile (most) Python code, but Python cannot run Cython code. This is essentially the problem.

So when you brand Cython as a way to speed up Python programs, you are essentially saying that you need to rewrite your code in Cython. And outside of toy examples of single Cython files, that will not work. Remember, Cython compiles single files into single Python modules; you cannot therefore point Cython at a package or set of modules and compile it all down magically without some serious downsides for project structure and common sense.

Cython just so happens to fit extremely well into scientific computing, which is an entirely different topic from normal software engineering; not only that, scientific code is an entirely different beast of engineering. In the scientific world, having some Python scripts alongside single files that do heavy computation, using Cython to work with C types and compiling them as shared modules to be imported into Python code, works extremely well. Even on lists or maps, scientific people will use the libpython API to access data directly.

I feel there were two sides to the coin: the scientific crowd, who will want this material, and another set of people who want to know what they can do with Cython in their own code (games, servers, applications) - really anything outside of statistical computation programs. Not only that, but a solid understanding of the uses I demonstrate makes that side seem trivial, apart from the more fine-grained ways of making things just that little bit faster using Cython and libpython specifics - which would fit better in a paper than in a technical book, since it would be subject to much more change.

by redbrain812 at December 08, 2014 03:37 PM

Continuum Analytics

Bokeh 0.7 Released!

We are excited to announce the release of version 0.7 of Bokeh, an interactive web plotting library for Python! This release includes many major new features:

  • IPython widgets and animations without a Bokeh server
  • Touch UI working for tools on mobile devices
  • Vastly improved linked data table
  • More new (and improving) bokeh.charts (high level charting interface)
  • Color mappers on the python side
  • Improved toolbar
  • Many new tools: lasso, poly, and point selection, crosshair inspector

by Bryan Van de Ven at December 08, 2014 08:30 AM

December 07, 2014

Titus Brown

Letter of resignation

Dear <chairs>,

I am resigning my Assistant Professor position at Michigan State University effective January 2nd, 2015.



Anticipated FAQ:

  • Why? I'm moving to UC Davis.
  • Do you have an employment contract with UC Davis?? Nope. But I'm starting there in January, anyway. Or that's the plan. And yes, that's how this kind of thing happens. I'm carefully refraining from geopolitical commentary on Twitter at the moment :).
  • Do you have tenure there? Nope. Wheels are still turning.
  • Aren't you nervous about that? Not really. Even if I don't get tenure, they've promised to hire me. And if that doesn't work out, someone will hire me... right? Guys? Guys? Why did it go silent all of a sudden?
  • Why didn't you use 'I am hereby...' at the beginning? I don't know, seemed pretentious.


by C. Titus Brown at December 07, 2014 11:00 PM

December 04, 2014

Fabian Pedregosa

Data-driven hemodynamic response function estimation

My latest research paper[1] deals with the estimation of the hemodynamic response function (HRF) from fMRI data.

This is an important topic since the knowledge of a hemodynamic response function is what makes it possible to extract the brain activation maps that are used in most of the impressive applications of machine learning to fMRI, such as (but not limited to) the reconstruction of visual images from brain activity [2] [3] or the decoding of numbers [4].

Besides the more traditional paper that describes the method, I've put online the code I used for the experiments. The code at this stage is far from perfect but it should help in reproducing the results or improving the method. I've also put online an ipython notebook with the analysis of a small piece of data. I'm obviously glad to receive feedback/bug reports/patches for this code.

  1. Pedregosa, Fabian, et al. "Data-driven HRF estimation for encoding and decoding models.", Neuroimage, (2014). Also available here as an arXiv preprint. 

  2. Miyawaki, Yoichi et al., "Visual Image Reconstruction from Human Brain Activity using a Combination of Multiscale Local Image Decoders", Neuron (2008)

  3. Kay, Kendrick N., et al. "Identifying natural images from human brain activity." Nature 452.7185 (2008). 

  4. Eger, Evelyn, et al. "Deciphering cortical number coding from human brain activity patterns." Current Biology 19.19 (2009). 

by Fabian Pedregosa at December 04, 2014 11:00 PM

December 01, 2014

Titus Brown

What version of Python should we use for computational science courses?

Brian O'Shea (a physics prof at Michigan State) asked me the following, and I thought I'd post it on my blog to get a broader set of responses. I know the answer is "Python 3", but I would appreciate specific thoughts from people with experience either with the specific packages below, or in teaching Python 3 more generally.

(For the record, I continue to teach and develop in Python 2, but each year comes closer to switching to Python 3...)

We're going to start teaching courses for the new CMSE [computational science] department next year, and we're using Python as the language. I need to decide whether to teach it using python 2.x or python 3.x. I'm curious about which one you have chosen to use when teaching your own courses, and why you made that choice. (Also, it'd be interesting to know if/why you regret that choice?)

The intro courses are aimed at students who plan to use computational and data science for research, other classes, and ultimately in their academic/industrial careers. We anticipate that it'll mostly be science/math and engineering students in the beginning, but there's significant interest from social science and humanities folks on campus. Given the audience and goals, my choice of programming language is fairly utilitarian - we want to introduce students to standard numerical packages (numpy, scipy, matplotlib, h5py, pandas, ipython) and also some of the discipline-specific packages, and get them into the mindset of "I have a problem, somebody has probably already written a tool to address this, let's not reinvent the wheel." So, I want to choose the version of Python that's likely to work well with pretty much any relatively widely-used Python package. My impression, based on a variety of blog posts and articles that I've found, is that the mainstream libraries work just fine with Python 3 (e.g., matplotlib), but a lot of other stuff simply doesn't work at this point.

This course is going to be the gateway course for our new minor/major, and a lot of later courses will be based on it (the graduate version of the course will be the gateway course for the certificate, and presumably taken by lots of grad students here at MSU). I'd like to make the most sensible choice given that we'll be creating course materials that will be used by other faculty, and which may stick around for a while... Anyway, any thoughts you have would be appreciated!

Brian sent me these links as examples of the discussion:





My strongest personal advice at this point is that Brian should invest in the Anaconda distribution as a teaching foundation.
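One practical hedge while the ecosystem catches up (my suggestion here, not Brian's): teach idioms that run unchanged on both versions via the `__future__` imports, so course materials don't have to be rewritten when the switch happens.

```python
# These two imports make Python 2 behave like Python 3 for the most
# student-visible differences: print is a function, and / is true division.
from __future__ import print_function, division

print("3 / 2 =", 3 / 2)    # 1.5 on both versions (with the import)
print("3 // 2 =", 3 // 2)  # floor division is spelled explicitly: 1
```

Under Python 3 the imports are no-ops, so the same file works in either interpreter.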

Thoughts? Comments?


by C. Titus Brown at December 01, 2014 11:00 PM

November 29, 2014

Titus Brown

Annotating papers with pipeline steps - suggestions?

A few months ago, I wrote a short description of how we make our papers replicable in the lab. One problem with this process is that for complex pipelines, it's not always obvious how to connect a number in the paper to the steps in the pipeline that produced it -- there are lots of files and outputs involved in all of this. You can sometimes figure it out by looking at the numbers and correlating with what's in the paper, and/or trying to understand the process from the bottom up, but that's quite difficult. Even my students and I (who can meet in person) sometimes have trouble tracking things down quickly.

Now, on my latest paper, I've started annotating results in the paper source code with Makefile targets. For example, if you are interested in how we got these results, you can use the comment in the LaTeX file to go straight to the associated target in the Makefile.

That's pretty convenient, but then I got to thinking -- how do we communicate this to the reader more directly? Not everyone wants to go to the github repo, read the LaTeX source, and then go find the relevant target in the Makefile (and "not everyone" is a bit of an understatement :). But there's no reason we couldn't link directly to the Makefile target in the PDF, is there? And, separately, right now it is a reasonably large burden to copy the results from the output of the scripts into the LaTeX file. Surely there's a way to get the computer to do this, right?

So, everyone -- two questions!

First, assuming that I'm going the nouveau traditional publishing route of producing a PDF for people to download from a journal site, is there a good, or suggested, or even better yet journal-supported way to link directly to the actual computational methods? (Yes, yes, I know it's suboptimal to publish into PDFs. Let me know when you've got that figured out, and I'll be happy to try it out. 'til then kthxbye.) I'd like to avoid intermediaries like individual DOIs for each method, if at all possible; direct links to github FTW.

Ideally, it would make it through the publishing process. I could, of course, make it available as a preprint, and point interested people at that.

Second, what's a good way to convey results from a script into a LaTeX (or more generally any text document)? With LaTeX I could make use of the 'include' directive; or I could do something with templating; or...?
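On the templating option for that second question, a minimal stdlib sketch (file names and numbers are hypothetical): have the pipeline substitute computed values into a LaTeX fragment, which the paper then pulls in with an include.

```python
from string import Template

# A fragment of LaTeX with named placeholders; in practice this would
# live in its own file (e.g. results.tex.in -- name is made up here).
# "$$" is string.Template's escape for a literal "$".
template = Template("We assembled $$${n_contigs}$$ contigs totalling $$${total_bp}$$ bp.")

# Values computed by the pipeline (invented for illustration).
results = {"n_contigs": "12,416", "total_bp": "48,221,903"}

fragment = template.substitute(results)
print(fragment)
# A Makefile target would write `fragment` to results.tex, and the
# paper would read it with \input{results} -- no hand-copied numbers.
```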

Well, and third, is there any (good, scriptable) way to do both (1) and (2) at the same time?

Anyway, hit me with your wisdom; would love to hear that there's already a good solution!


by C. Titus Brown at November 29, 2014 11:00 PM

November 21, 2014

Filipe Saraiva

Presenting a Season of KDE 2014 student – Minh Ngo

Season of KDE is an outreach program hosted by the KDE community. This year I am working as a mentor on a long-requested project related to Cantor – the development of a Python 3 backend. You can read more about Cantor on my blog (texts in English and Portuguese). So, let’s say welcome and good luck to Minh Ngo, the student behind this project!


My name is Minh,

Minh Ngo

I’m a BSc graduate. I’m Vietnamese, but unlike other Vietnamese students I spent most of my life in Ukraine. Currently, I’m preparing for the Master’s degree that I will start next semester.

Open source is my free-time hobby, so I would like to make something that is useful for the community. Previously, I participated in the GSoC 2013 program and in several open source projects. Some of my personal projects are available on my github page https://github.com/Ignotus; they are not as popular as other cool projects, but several are used by other people, and this fact makes me very happy :).

Cantor is one of the opportunities to spend time creating a useful thing and to win an exclusive KDE T-shirt :). I decided to start my contribution with the Python 3 backend because, a few months ago, I studied several courses related to Machine Learning and was looking for a stable desktop frontend for IPython. I do not entirely like the IPython notebook, and its qtconsole version doesn’t satisfy me in terms of functionality, so I decided to find some existing frontend for IPython that I could tune for myself. And the story with Cantor began after that :)

Happy hacking!

by Filipe Saraiva at November 21, 2014 04:32 PM

November 19, 2014

Matthew Rocklin

Blaze Datasets

tl;dr Blaze aids exploration by supporting full databases and collections of datasets.

This post was composed using Blaze version 0.6.6. You can follow along with the following conda command.

conda install -c blaze blaze=0.6.6

When we encounter new data we need to explore broadly to find what exists before we can meaningfully perform analyses. The tools for this task are often overlooked.

This post outlines how Blaze explores collections and hierarchies of datasets, cases where one entity like a database or file or directory might hold many tables or arrays. We use examples from HDF5 files and SQL databases. Blaze understands how the underlying libraries work so that you don’t have to.

Motivating problem - Intuitive HDF5 File Navigation

For example, if we want to understand the contents of this set of HDF5 files encoding meteorological data then we need to navigate a hierarchy of arrays. This is common among HDF5 files.

Typically we navigate these files in Python with h5py or pytables.

>>> import h5py
>>> f = h5py.File('OMI-Aura_L2-OMAERO_2014m1105t2304-o54838_v003-2014m1106t215558.he5')

HDF5 files organize datasets with an internal file system. The h5py library accesses this internal file system through successive calls to .keys() and item access.

>>> f.keys()

>>> f['/HDFEOS'].keys()

>>> f['/HDFEOS/SWATHS'].keys()

>>> f['/HDFEOS/SWATHS/ColumnAmountAerosol'].keys()
['Data Fields', 'Geolocation Fields']

>>> f['/HDFEOS/SWATHS/ColumnAmountAerosol/Data Fields'].keys()

>>> f['/HDFEOS/SWATHS/ColumnAmountAerosol/Data Fields/TerrainPressure']
<HDF5 dataset "TerrainPressure": shape (1643, 60), type "<i2">

>>> f['/HDFEOS/SWATHS/ColumnAmountAerosol/Data Fields/TerrainPressure'][:]
array([[1013, 1013, 1013, ..., 1013, 1013, 1013],
       [1013, 1013, 1013, ..., 1013, 1013, 1013],
       [1013, 1013, 1013, ..., 1013, 1013, 1013],
       [1010,  992,  986, ..., 1013, 1013, 1013],
       [ 999,  983,  988, ..., 1013, 1013, 1013],
       [1006,  983,  993, ..., 1013, 1013, 1013]], dtype=int16)

This interaction between programmer and interpreter feels like a long and awkward conversation.
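What Blaze automates here is essentially a recursive walk of that internal file system, collecting every dataset path up front. A toy sketch of the idea, with nested dicts standing in for h5py groups (this is an illustration, not Blaze's or h5py's code):

```python
def walk_datasets(group, prefix=""):
    """Yield (path, leaf) pairs for every dataset in a nested hierarchy.

    Nested dicts play the role of HDF5 groups; anything that is not a
    dict is treated as a dataset.
    """
    for name, node in sorted(group.items()):
        path = prefix + "/" + name
        if isinstance(node, dict):
            for item in walk_datasets(node, path):
                yield item
        else:
            yield path, node

# A mock of the file layout navigated above.
mock_file = {
    "HDFEOS": {
        "SWATHS": {
            "ColumnAmountAerosol": {
                "Data Fields": {"TerrainPressure": "<array (1643, 60) int16>"},
                "Geolocation Fields": {},
            }
        }
    }
}

for path, dataset in walk_datasets(mock_file):
    print(path, "->", dataset)
# /HDFEOS/SWATHS/ColumnAmountAerosol/Data Fields/TerrainPressure -> <array (1643, 60) int16>
```

(h5py users can get a similar one-shot listing with `f.visititems`; the point is that nobody should have to type five successive `.keys()` calls to find one array.)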

Blaze improves the exploration user experience by treating the entire HDF5 file as a single Blaze variable. This allows users to both explore and compute on larger collections of data in the same workflow. This isn’t specific to HDF5 but works anywhere many datasets are bundled together.

Exploring a SQL Database

For example, a SQL database can be viewed as a collection of tables. Blaze makes it easy to access a single table in a database using a string URI specifying both the database and the table name.

>>> iris = Data('sqlite:////my.db::iris')  # protocol://database::table-name
>>> iris
    sepal_length  sepal_width  petal_length  petal_width      species
0            5.1          3.5           1.4          0.2  Iris-setosa
1            4.9          3.0           1.4          0.2  Iris-setosa
2            4.7          3.2           1.3          0.2  Iris-setosa
3            4.6          3.1           1.5          0.2  Iris-setosa
4            5.0          3.6           1.4          0.2  Iris-setosa

This only works if we know what table we want ahead of time. The approach above assumes that the user is already familiar with their data. To resolve this problem we omit the table name and access the database as a variable instead. We use the same interface to access the entire database as we would a specific table.

>>> db = Data('sqlite:////my.db')  # protocol://database
>>> db
Data:       Engine(sqlite:////home/mrocklin/workspace/blaze/blaze/examples/data/iris.db)
DataShape:  {
  iris: var * {
    sepal_length: ?float64,
    sepal_width: ?float64,
    petal_length: ?float64,
    petal_width: ?float64,
    species: ?string

The db expression points to a SQLAlchemy engine. We print the engine alongside a truncated datashape showing the metadata of the tables in that database. Note that the datashape maps table names to datashapes of those tables, in this case a variable length collection of records with fields for measurements of flowers.

Blaze isn’t doing any of the grunt work here; SQLAlchemy is. SQLAlchemy is a mature Python library that interacts with a wide variety of SQL databases. It provides both database reflection (as we see above) along with general querying (as we see below). Blaze provides a convenient front-end.

We seamlessly transition from exploration to computation. We query for the shortest and longest sepal per species.

>>> by(db.iris.species, shortest=db.iris.sepal_length.min(),
...                      longest=db.iris.sepal_length.max())
           species  longest  shortest
0      Iris-setosa      5.8       4.3
1  Iris-versicolor      7.0       4.9
2   Iris-virginica      7.9       4.9

Blaze doesn’t pull data into local memory; instead it generates SQLAlchemy expressions, which generate SQL, which executes on the foreign database. The (much smaller) result is pulled back into memory and rendered nicely using Pandas.
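
The grouped reduction never leaves the database. Here is a sketch of what the generated query might look like, using the stdlib sqlite3 module and toy data (the exact SQL Blaze emits may differ):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE iris (sepal_length REAL, species TEXT)')
conn.executemany('INSERT INTO iris VALUES (?, ?)',
                 [(5.1, 'Iris-setosa'), (4.3, 'Iris-setosa'),
                  (7.0, 'Iris-versicolor')])

# Grouped min/max computed entirely inside the database
rows = conn.execute(
    'SELECT species, MIN(sepal_length) AS shortest, '
    'MAX(sepal_length) AS longest '
    'FROM iris GROUP BY species ORDER BY species').fetchall()
print(rows)
```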

A Larger Database

Improved metadata discovery on SQL databases overlaps with the excellent work done by yhat on db.py. We borrow their example, the Lahman Baseball statistics database, a richer database with a greater variety of tables.

>>> lahman = Data('sqlite:///baseball-archive-2012.sqlite')
>>> lahman.dshape  # ask for the entire datashape
  battingpost: var * {
    yearID: ?int32,
    round: ?string,
    playerID: ?string,
    teamID: ?string,
    lgID: ?string,
    G: ?int32,
    AB: ?int32,
    R: ?int32,
    H: ?int32,
    2B: ?int32,
    3B: ?int32,
    HR: ?int32,
    RBI: ?int32,
    SB: ?int32,
    CS: ?int32,
    BB: ?int32,
    SO: ?int32,
    IBB: ?int32,
    HBP: ?int32,
    SH: ?int32,
    SF: ?int32,
    GIDP: ?int32
  awardsmanagers: var * {
    managerID: ?string,
    awardID: ?string,
    yearID: ?int32,
    lgID: ?string,
    tie: ?string,
    notes: ?string
  hofold: var * {
    hofID: ?string,
    yearid: ?int32,
    votedBy: ?string,
    ballots: ?int32,
    votes: ?int32,
    inducted: ?string,
    category: ?string
  salaries: var * {
    yearID: ?int32,
    teamID: ?string,
    lgID: ?string,
    playerID: ?string,
    salary: ?float64
  pitchingpost: var * {
    playerID: ?string,
    yearID: ?int32,
    round: ?string,
    teamID: ?string,
    lgID: ?string,
    W: ?int32,
    L: ?int32,
    G: ?int32,
    GS: ?int32,
    CG: ?int32,
    SHO: ?int32,
    SV: ?int32,
    IPouts: ?int32,
    H: ?int32,
    ER: ?int32,
    HR: ?int32,
    BB: ?int32,
    SO: ?int32,
    BAOpp: ?float64,
    ERA: ?float64,
    IBB: ?int32,
    WP: ?int32,
    HBP: ?int32,
    BK: ?int32,
    BFP: ?int32,
    GF: ?int32,
    R: ?int32,
    SH: ?int32,
    SF: ?int32,
    GIDP: ?int32
  managers: var * {
    managerID: ?string,
    yearID: ?int32,
    teamID: ?string,
    lgID: ?string,
    inseason: ?int32,
    G: ?int32,
    W: ?int32,
    L: ?int32,
    rank: ?int32,
    plyrMgr: ?string
  teams: var * {
    yearID: ?int32,
    lgID: ?string,
    teamID: ?string,
    franchID: ?string,
    divID: ?string,
    Rank: ?int32,
    G: ?int32,
    Ghome: ?int32,
    W: ?int32,
    L: ?int32,
    DivWin: ?string,
    WCWin: ?string,
    LgWin: ?string,
    WSWin: ?string,
    R: ?int32,
    AB: ?int32,
    H: ?int32,
    2B: ?int32,
    3B: ?int32,
    HR: ?int32,
    BB: ?int32,
    SO: ?int32,
    SB: ?int32,
    CS: ?int32,
    HBP: ?int32,
    SF: ?int32,
    RA: ?int32,
    ER: ?int32,
    ERA: ?float64,
    CG: ?int32,
    SHO: ?int32,
    SV: ?int32,
    IPouts: ?int32,
    HA: ?int32,
    HRA: ?int32,
    BBA: ?int32,
    SOA: ?int32,
    E: ?int32,
    DP: ?int32,
    FP: ?float64,
    name: ?string,
    park: ?string,
    attendance: ?int32,
    BPF: ?int32,
    PPF: ?int32,
    teamIDBR: ?string,
    teamIDlahman45: ?string,
    teamIDretro: ?string
  teamshalf: var * {
    yearID: ?int32,
    lgID: ?string,
    teamID: ?string,
    Half: ?string,
    divID: ?string,
    DivWin: ?string,
    Rank: ?int32,
    G: ?int32,
    W: ?int32,
    L: ?int32
  fieldingpost: var * {
    playerID: ?string,
    yearID: ?int32,
    teamID: ?string,
    lgID: ?string,
    round: ?string,
    POS: ?string,
    G: ?int32,
    GS: ?int32,
    InnOuts: ?int32,
    PO: ?int32,
    A: ?int32,
    E: ?int32,
    DP: ?int32,
    TP: ?int32,
    PB: ?int32,
    SB: ?int32,
    CS: ?int32
  master: var * {
    lahmanID: ?int32,
    playerID: ?string,
    managerID: ?string,
    hofID: ?string,
    birthYear: ?int32,
    birthMonth: ?int32,
    birthDay: ?int32,
    birthCountry: ?string,
    birthState: ?string,
    birthCity: ?string,
    deathYear: ?int32,
    deathMonth: ?int32,
    deathDay: ?int32,
    deathCountry: ?string,
    deathState: ?string,
    deathCity: ?string,
    nameFirst: ?string,
    nameLast: ?string,
    nameNote: ?string,
    nameGiven: ?string,
    nameNick: ?string,
    weight: ?int32,
    height: ?float64,
    bats: ?string,
    throws: ?string,
    debut: ?string,
    finalGame: ?string,
    college: ?string,
    lahman40ID: ?string,
    lahman45ID: ?string,
    retroID: ?string,
    holtzID: ?string,
    bbrefID: ?string
  fieldingof: var * {
    playerID: ?string,
    yearID: ?int32,
    stint: ?int32,
    Glf: ?int32,
    Gcf: ?int32,
    Grf: ?int32
  pitching: var * {
    playerID: ?string,
    yearID: ?int32,
    stint: ?int32,
    teamID: ?string,
    lgID: ?string,
    W: ?int32,
    L: ?int32,
    G: ?int32,
    GS: ?int32,
    CG: ?int32,
    SHO: ?int32,
    SV: ?int32,
    IPouts: ?int32,
    H: ?int32,
    ER: ?int32,
    HR: ?int32,
    BB: ?int32,
    SO: ?int32,
    BAOpp: ?float64,
    ERA: ?float64,
    IBB: ?int32,
    WP: ?int32,
    HBP: ?int32,
    BK: ?int32,
    BFP: ?int32,
    GF: ?int32,
    R: ?int32,
    SH: ?int32,
    SF: ?int32,
    GIDP: ?int32
  managershalf: var * {
    managerID: ?string,
    yearID: ?int32,
    teamID: ?string,
    lgID: ?string,
    inseason: ?int32,
    half: ?int32,
    G: ?int32,
    W: ?int32,
    L: ?int32,
    rank: ?int32
  appearances: var * {
    yearID: ?int32,
    teamID: ?string,
    lgID: ?string,
    playerID: ?string,
    G_all: ?int32,
    G_batting: ?int32,
    G_defense: ?int32,
    G_p: ?int32,
    G_c: ?int32,
    G_1b: ?int32,
    G_2b: ?int32,
    G_3b: ?int32,
    G_ss: ?int32,
    G_lf: ?int32,
    G_cf: ?int32,
    G_rf: ?int32,
    G_of: ?int32,
    G_dh: ?int32,
    G_ph: ?int32,
    G_pr: ?int32
  halloffame: var * {
    hofID: ?string,
    yearid: ?int32,
    votedBy: ?string,
    ballots: ?int32,
    needed: ?int32,
    votes: ?int32,
    inducted: ?string,
    category: ?string
  awardsplayers: var * {
    playerID: ?string,
    awardID: ?string,
    yearID: ?int32,
    lgID: ?string,
    tie: ?string,
    notes: ?string
  schoolsplayers: var * {
    playerID: ?string,
    schoolID: ?string,
    yearMin: ?int32,
    yearMax: ?int32
  schools: var * {
    schoolID: ?string,
    schoolName: ?string,
    schoolCity: ?string,
    schoolState: ?string,
    schoolNick: ?string
  teamsfranchises: var * {
    franchID: ?string,
    franchName: ?string,
    active: ?string,
    NAassoc: ?string
  allstarfull: var * {
    playerID: ?string,
    yearID: ?int32,
    gameNum: ?int32,
    gameID: ?string,
    teamID: ?string,
    lgID: ?string,
    GP: ?int32,
    startingPos: ?int32
  tmp_batting: var * {
    playerID: ?string,
    yearID: ?int32,
    stint: ?int32,
    teamID: ?string,
    lgID: ?string,
    G: ?int32,
    G_batting: ?int32,
    AB: ?int32,
    R: ?int32,
    H: ?int32,
    2B: ?int32,
    3B: ?int32,
    HR: ?int32,
    RBI: ?int32,
    SB: ?int32,
    CS: ?int32,
    BB: ?int32,
    SO: ?int32,
    IBB: ?int32,
    HBP: ?int32,
    SH: ?int32,
    SF: ?int32,
    GIDP: ?int32,
    G_old: ?int32
  seriespost: var * {
    yearID: ?int32,
    round: ?string,
    teamIDwinner: ?string,
    lgIDwinner: ?string,
    teamIDloser: ?string,
    lgIDloser: ?string,
    wins: ?int32,
    losses: ?int32,
    ties: ?int32
  awardssharemanagers: var * {
    awardID: ?string,
    yearID: ?int32,
    lgID: ?string,
    managerID: ?string,
    pointsWon: ?int32,
    pointsMax: ?int32,
    votesFirst: ?int32
  awardsshareplayers: var * {
    awardID: ?string,
    yearID: ?int32,
    lgID: ?string,
    playerID: ?string,
    pointsWon: ?float64,
    pointsMax: ?int32,
    votesFirst: ?float64
  fielding: var * {
    playerID: ?string,
    yearID: ?int32,
    stint: ?int32,
    teamID: ?string,
    lgID: ?string,
    POS: ?string,
    G: ?int32,
    GS: ?int32,
    InnOuts: ?int32,
    PO: ?int32,
    A: ?int32,
    E: ?int32,
    DP: ?int32,
    PB: ?int32,
    WP: ?int32,
    SB: ?int32,
    CS: ?int32,
    ZR: ?float64

Seeing at once all the tables in the database, all the columns in those tables, and all the types of those columns provides a clear and comprehensive overview of our data. We represent this information as a datashape, a type system that covers everything from numpy arrays to SQL databases and Spark collections.

We use standard Blaze expressions to navigate more deeply. Things like auto-complete work fine.

>>> lahman.<tab>
lahman.allstarfull          lahman.fieldingpost         lahman.pitchingpost
lahman.appearances          lahman.fields               lahman.relabel
lahman.awardsmanagers       lahman.halloffame           lahman.salaries
lahman.awardsplayers        lahman.hofold               lahman.schema
lahman.awardssharemanagers  lahman.isidentical          lahman.schools
lahman.awardsshareplayers   lahman.like                 lahman.schoolsplayers
lahman.battingpost          lahman.managers             lahman.seriespost
lahman.data                 lahman.managershalf         lahman.teams
lahman.dshape               lahman.map                  lahman.teamsfranchises
lahman.fielding             lahman.master               lahman.teamshalf
lahman.fieldingof           lahman.pitching             lahman.tmp_batting

And we finish with a fun multi-table computation, joining two tables on year, team, and league and then computing the average salary by team name and year:

>>> joined = join(lahman.teams, lahman.salaries, ['yearID', 'teamID', 'lgID'])

>>> by(joined[['name', 'yearID']], average_salary=joined.salary.mean())
                    name  yearID  average_salary
0         Anaheim Angels    1997  1004370.064516
1         Anaheim Angels    1998  1214147.058824
2         Anaheim Angels    1999  1384704.150000
3         Anaheim Angels    2000  1715472.233333
4         Anaheim Angels    2001  1584505.566667
5         Anaheim Angels    2002  2204345.250000
6         Anaheim Angels    2003  2927098.777778
7         Anaheim Angels    2004  3723506.185185
8   Arizona Diamondbacks    1998   898527.777778
9   Arizona Diamondbacks    1999  2020705.852941

Looks good. We compute and store the result to a CSV file with into:

>>> into('salaries.csv',
...      by(joined[['name', 'yearID']], average_salary=joined.salary.mean()))

(Final result here: salaries.csv)
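
For readers more familiar with Pandas, the same join-then-aggregate pattern looks like this on toy rows (hypothetical values, not the real Lahman data):

```python
import pandas as pd

teams = pd.DataFrame({'yearID': [1997, 1998],
                      'teamID': ['ANA', 'ANA'],
                      'lgID': ['AL', 'AL'],
                      'name': ['Anaheim Angels', 'Anaheim Angels']})
salaries = pd.DataFrame({'yearID': [1997, 1997, 1998],
                         'teamID': ['ANA', 'ANA', 'ANA'],
                         'lgID': ['AL', 'AL', 'AL'],
                         'salary': [900000.0, 1100000.0, 1200000.0]})

# Join on the three shared key columns, then average salary per team-year
joined = teams.merge(salaries, on=['yearID', 'teamID', 'lgID'])
result = joined.groupby(['name', 'yearID'])['salary'].mean().reset_index()
print(result)
```

Blaze expresses this computation once and compiles it to SQL or Pandas depending on where the data lives.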

Beyond SQL

SQL isn’t unique; many systems hold collections or hierarchies of datasets. Blaze supports navigation over Mongo databases, Blaze servers, HDF5 files, or even just dictionaries of pandas DataFrames or CSV files.

>>> d = {'accounts 1': CSV('accounts_1.csv'),
...      'accounts 2': CSV('accounts_2.csv')}

>>> accounts = Data(d)

>>> accounts.dshape
  accounts 1: var * {id: ?int64, name: string, amount: ?int64},
  accounts 2: var * {id: ?int64, name: string, amount: ?int64}

None of this behavior was special-cased in Blaze. The same mechanisms that select a table from a SQL database select a column from a Pandas DataFrame.
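
A toy version of that shared mechanism recurses over a dict of datasets to describe each leaf (Blaze's real discovery machinery is richer than this sketch):

```python
import pandas as pd

def discover(data):
    # Dictionaries map names to sub-datasets; recurse into them
    if isinstance(data, dict):
        return {k: discover(v) for k, v in data.items()}
    # DataFrames describe themselves via their column dtypes
    if isinstance(data, pd.DataFrame):
        return {col: str(dtype) for col, dtype in data.dtypes.items()}
    return type(data).__name__

d = {'accounts 1': pd.DataFrame({'id': [1], 'name': ['Alice'], 'amount': [100]}),
     'accounts 2': pd.DataFrame({'id': [2], 'name': ['Bob'], 'amount': [200]})}
print(discover(d))
```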

An HDF5 Example

To conclude we revisit our motivating problem, HDF5 file navigation.

Raw H5Py

Recall that we previously had a long back-and-forth conversation with the Python interpreter.

>>> import h5py
>>> f = h5py.File('OMI-Aura_L2-OMAERO_2014m1105t2304-o54838_v003-2014m1106t215558.he5')

>>> f.keys()

>>> f['/HDFEOS'].keys()

H5Py with Blaze expressions

With Blaze we get a quick high-level overview and an API that is shared with countless other systems.

>>> from blaze import Data
>>> d = Data(f)  # Wrap h5py file with Blaze interactive expression
>>> d
Data:       <HDF5 file "OMI-Aura_L2-OMAERO_2014m1105t2304-o54838_v003-2014m1106t215558.he5" (mode r+)>
DataShape:  {
    SWATHS: {
      ColumnAmountAerosol: {
        Data Fields: {
          AerosolIndexUV: 1643 * 60 * int16,

By default we see the data and a truncated datashape.

We ask for the datashape explicitly to see the full picture.

>>> d.dshape
    SWATHS: {
      ColumnAmountAerosol: {
        Data Fields: {
          AerosolIndexUV: 1643 * 60 * int16,
          AerosolIndexVIS: 1643 * 60 * int16,
          AerosolModelMW: 1643 * 60 * uint16,
          AerosolModelsPassedThreshold: 1643 * 60 * 10 * uint16,
          AerosolOpticalThicknessMW: 1643 * 60 * 14 * int16,
          AerosolOpticalThicknessMWPrecision: 1643 * 60 * int16,
          AerosolOpticalThicknessNUV: 1643 * 60 * 2 * int16,
          AerosolOpticalThicknessPassedThreshold: 1643 * 60 * 10 * 9 * int16,
          AerosolOpticalThicknessPassedThresholdMean: 1643 * 60 * 9 * int16,
          AerosolOpticalThicknessPassedThresholdStd: 1643 * 60 * 9 * int16,
          CloudFlags: 1643 * 60 * uint8,
          CloudPressure: 1643 * 60 * int16,
          EffectiveCloudFraction: 1643 * 60 * int8,
          InstrumentConfigurationId: 1643 * uint8,
          MeasurementQualityFlags: 1643 * uint8,
          NumberOfModelsPassedThreshold: 1643 * 60 * uint8,
          ProcessingQualityFlagsMW: 1643 * 60 * uint16,
          ProcessingQualityFlagsNUV: 1643 * 60 * uint16,
          RootMeanSquareErrorOfFitPassedThreshold: 1643 * 60 * 10 * int16,
          SingleScatteringAlbedoMW: 1643 * 60 * 14 * int16,
          SingleScatteringAlbedoMWPrecision: 1643 * 60 * int16,
          SingleScatteringAlbedoNUV: 1643 * 60 * 2 * int16,
          SingleScatteringAlbedoPassedThreshold: 1643 * 60 * 10 * 9 * int16,
          SingleScatteringAlbedoPassedThresholdMean: 1643 * 60 * 9 * int16,
          SingleScatteringAlbedoPassedThresholdStd: 1643 * 60 * 9 * int16,
          SmallPixelRadiancePointerUV: 1643 * 2 * int16,
          SmallPixelRadiancePointerVIS: 1643 * 2 * int16,
          SmallPixelRadianceUV: 6783 * 60 * float32,
          SmallPixelRadianceVIS: 6786 * 60 * float32,
          SmallPixelWavelengthUV: 6783 * 60 * uint16,
          SmallPixelWavelengthVIS: 6786 * 60 * uint16,
          TerrainPressure: 1643 * 60 * int16,
          TerrainReflectivity: 1643 * 60 * 9 * int16,
          XTrackQualityFlags: 1643 * 60 * uint8
        Geolocation Fields: {
          GroundPixelQualityFlags: 1643 * 60 * uint16,
          Latitude: 1643 * 60 * float32,
          Longitude: 1643 * 60 * float32,
          OrbitPhase: 1643 * float32,
          SolarAzimuthAngle: 1643 * 60 * float32,
          SolarZenithAngle: 1643 * 60 * float32,
          SpacecraftAltitude: 1643 * float32,
          SpacecraftLatitude: 1643 * float32,
          SpacecraftLongitude: 1643 * float32,
          TerrainHeight: 1643 * 60 * int16,
          Time: 1643 * float64,
          ViewingAzimuthAngle: 1643 * 60 * float32,
          ViewingZenithAngle: 1643 * 60 * float32
    ArchiveMetadata.0: string[65535, 'A'],
    CoreMetadata.0: string[65535, 'A'],
    StructMetadata.0: string[32000, 'A']

When we see metadata for everything in this dataset all at once, the full structure becomes apparent. We have two collections of arrays, mostly shaped (1643, 60); the collection in Data Fields holds what appear to be atmospheric measurements while the collection in Geolocation Fields holds coordinate information.

Moreover we can quickly navigate this structure to compute relevant information.

>>> d.HDFEOS.SWATHS.ColumnAmountAerosol.Geolocation_Fields.Latitude
array([[-67., -67., -67., ..., -61., -60., -60.],
       [-67., -67., -67., ..., -61., -60., -60.],
       [-67., -67., -68., ..., -61., -61., -60.],
       [ 69.,  70.,  71., ...,  85.,  84.,  84.],
       [ 69.,  70.,  71., ...,  85.,  85.,  84.],
       [ 69.,  70.,  71., ...,  85.,  85.,  84.]], dtype=float32)

>>> d.HDFEOS.SWATHS.ColumnAmountAerosol.Geolocation_Fields.Longitude
array([[  46.,   43.,   40., ...,   -3.,   -5.,   -7.],
       [  46.,   43.,   40., ...,   -3.,   -5.,   -7.],
       [  46.,   43.,   40., ...,   -4.,   -5.,   -7.],
       [ 123.,  124.,  124., ..., -141., -131., -121.],
       [ 123.,  124.,  124., ..., -141., -130., -120.],
       [ 123.,  123.,  124., ..., -140., -130., -119.]], dtype=float32)

>>> d.HDFEOS.SWATHS.ColumnAmountAerosol.Data_Fields.TerrainPressure
array([[1013, 1013, 1013, ..., 1013, 1013, 1013],
       [1013, 1013, 1013, ..., 1013, 1013, 1013],
       [1013, 1013, 1013, ..., 1013, 1013, 1013],
       [1010,  992,  986, ..., 1013, 1013, 1013],
       [ 999,  983,  988, ..., 1013, 1013, 1013],
       [1006,  983,  993, ..., 1013, 1013, 1013]], dtype=int16)

>>> d.HDFEOS.SWATHS.ColumnAmountAerosol.Data_Fields.TerrainPressure.min()
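
That final reduction is an ordinary NumPy-style aggregation. With toy values standing in for the real swath, it behaves the same way:

```python
import numpy as np

# A few TerrainPressure-like values (int16, as in the file)
terrain = np.array([[1013, 1013, 1013],
                    [1010,  992,  986],
                    [ 999,  983,  988]], dtype=np.int16)
print(terrain.min())  # minimum over the whole array
```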

Final Thoughts

Often the greatest challenge is finding what you already have.

Discovery and exploration are just as important as computation. By extending Blaze’s expression system to hierarchies of datasets we create a smooth user experience, from first introductions to data all the way to analytic queries and saved results.

This work was easy. The pluggable architecture of Blaze made it surprisingly simple to extend the Blaze model from tables and arrays to collections of tables and arrays. We wrote about 40 significant lines of code for each supported backend.

November 19, 2014 12:00 AM

November 18, 2014

Titus Brown

RFC: The khmer project: what are we, and what are our goals?

As we think about the next few years of khmer development, it is helpful to explore what khmer is, and what our goals for khmer development are. This can provide guiding principles for development, refactoring, extension, funding requests, and collaborations.

Comments solicited!



khmer is an open source project that serves as:

  • a stable research platform for novel CS/bio research on data structures and algorithms, mostly k-mer based;
  • a set of command line tools for various kinds of data transformations;
  • a test bed for software engineering practice in science;
  • a Python library for working with k-mers and graph structures;
  • an exercise in community building in scientific software engineering;
  • an effort to participate in and sustainably grow the bioinformatics ecosystem.


Our long term goals for khmer, in some rough order of priority, are:

  • Keep khmer versatile and agile enough to easily enable the CS and bio we want to do.

    Practical implications: limit complexity of internals as much as possible.

  • Continue community building.

    Practical implications: run khmer as a real open source project, with everything done in the open; work nicely with other projects.

  • Build, sustain, and maintain a set of protocols and recipes around khmer.

    Practical implications: take workflow design into account.

  • Improve the efficiency (time/memory) of khmer implementations.

    Practical implications: optimize, but not at expense of clean code.

    Some specifics: streaming; variable sized counters.

  • Lower barriers to an increasing user base.

    Practical implications: find actual pain points, address if it’s easy or makes good sense.

    Some specifics: hash function k > 32, stranded hash function, integrate efficient k-mer cardinality counting, implement dynamically sized data structures.

  • Keep khmer technologically up to date.

    Practical implications: transition to Python 3.


p.s. Thanks to Qingpeng Zhang, Travis Collier, and Matt MacManes for comments in the draft stage!

by C. Titus Brown at November 18, 2014 11:00 PM

Thomas Wiecki

A modern guide to getting started with Data Science and Python

Python has an extremely rich and healthy ecosystem of data science tools. Unfortunately, to outsiders this ecosystem can look like a jungle (cue snake joke). In this blog post I will provide a step-by-step guide to venturing into this PyData jungle.

What's wrong with the many lists of PyData packages out there already, you might ask? I think that providing too many options can easily overwhelm someone who is just getting started. So instead, I will keep a very narrow scope and focus on the 10% of tools that allow you to do 90% of the work. After you've mastered these essentials you can browse the long lists of PyData packages to decide which to try next.

The upside is that the few tools I will introduce already allow you to do most of what a data scientist does in their day-to-day work (i.e. data i/o, data munging, and data analysis).


It has happened quite a few times that people came up to me and said "I heard Python is amazing for data science so I wanted to start learning it but spent two days installing Python and all the other modules!". It's quite reasonable to think that you have to install Python if you want to use it, but installing the full PyData stack manually when you don't know which tools you actually need is quite an undertaking. So I strongly recommend against doing that.

Fortunately for you, the fine folks at Continuum have created the Anaconda Python distribution that installs most of the PyData stack for you, and the modules it doesn't provide out of the box can easily be installed via a GUI. The distribution is also available for all major platforms so save yourself the two days and just use that!

IPython Notebook

After Python is installed, most people start by launching it. Again, very reasonable but unfortunately dead wrong. I don't know a single SciPythonista who uses the Python command shell directly (YMMV). Instead, IPython, and specifically the IPython Notebook, are incredibly powerful Python shells that are used ubiquitously in PyData. I strongly recommend you start with the IPython Notebook (IPyNB) and don't bother with anything else; you won't regret it. In brief, the IPyNB is a Python shell that you access via your web browser. It allows you to mix code, text, and graphics (even interactive ones). This blog post was written in an IPyNB, and it's rare to find a talk at a Python conference that does not use the IPython Notebook. It comes preinstalled with Anaconda, so you can just start using it. Here's an example of what it looks like:

In [1]:
print('Hello World')
Hello World

This thing is a rocket -- every time I hear one of the core devs talk at a conference I am flabbergasted by all the new things they cooked up. To get an idea for some of the advanced capabilities, check out this short tutorial on IPython widgets. These allow you to attach sliders to control a plot interactively:

In [1]:
from IPython.display import YouTubeVideo
YouTubeVideo('wxVx54ax47s') # Yes, it can also embed youtube videos.


Normally, people recommend you start by learning NumPy (pronounced num-pie, not num-pee!), the library that provides multi-dimensional arrays. Certainly this was the way to go a few years ago, but I hardly use NumPy at all today. The reason is that NumPy has become more of a core library that's used by other libraries which provide a much nicer interface. Thus, the main library to use for working with data is Pandas. It can input and output data from all kinds of formats (including databases), do joins and other SQL-like operations for shaping the data, handle missing values with ease, support time series, and has basic plotting and statistical capabilities, and much more. There is certainly a learning curve to all its features, but I strongly suggest you go through most of the documentation as a first step. Trust me, the time you invest will pay off a thousandfold by making you more efficient at data munging. Here are a few quick tricks to whet your appetite:

In [18]:
import pandas as pd

df = pd.DataFrame({ 'A' : 1.,
                    'B' : pd.Timestamp('20130102'),
                    'C' : pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D' : pd.Series([1, 2, 1, 2], dtype='int32'),
                    'E' : pd.Categorical(["test", "train", "test", "train"]),
                    'F' : 'foo' })
In [19]:
df
   A          B  C  D      E    F
0  1 2013-01-02  1  1   test  foo
1  1 2013-01-02  1  2  train  foo
2  1 2013-01-02  1  1   test  foo
3  1 2013-01-02  1  2  train  foo

Columns can be accessed by name:

In [17]:
df.B
0   2013-01-02
1   2013-01-02
2   2013-01-02
3   2013-01-02
Name: B, dtype: datetime64[ns]

Compute the sum of D for each category in E:

In [21]:
df.groupby('E')['D'].sum()
test     2
train    4
Name: D, dtype: int32

Doing this in NumPy (or *gasp* Matlab!) would be much more clunky.
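
For comparison, here is a minimal NumPy version of the category sum above:

```python
import numpy as np

D = np.array([1, 2, 1, 2])
E = np.array(['test', 'train', 'test', 'train'])

# Boolean masks and an explicit loop replace a one-line groupby
sums = {str(cat): int(D[E == cat].sum()) for cat in np.unique(E)}
print(sums)
```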

There's a ton more. If you're not convinced, check out 10 minutes to pandas where I borrowed this from.


The main plotting library of Python is Matplotlib. However, I don't recommend using it directly, for the same reason I don't recommend spending time learning NumPy initially. While Matplotlib is very powerful, it is its own jungle and requires lots of tweaking to make your plots look shiny. So instead, I recommend starting with Seaborn. Seaborn essentially treats Matplotlib as a core library (just like Pandas does with NumPy). I will briefly illustrate the main benefits of seaborn. Specifically, it:

  1. creates aesthetically pleasing plots by default (for one thing, it does not default to the jet colormap),
  2. creates statistically meaningful plots, and
  3. understands the pandas DataFrame so the two work well together.

While pandas comes prepackaged with anaconda, seaborn is not directly included but can easily be installed with conda install seaborn.

Statistically meaningful plots

In [5]:
# IPython magic to create plots within cells
%matplotlib inline
In [7]:
import seaborn as sns

# Load one of the data sets that come with seaborn
tips = sns.load_dataset("tips")

sns.jointplot("total_bill", "tip", tips, kind='reg');

As you can see, with just one line we create a pretty complex statistical plot including the best-fitting linear regression line along with confidence intervals, marginals and the correlation coefficients. Recreating this plot in matplotlib would take quite a bit of (ugly) code, including calls to scipy to run the linear regression and manually applying the linear regression formula to draw the line (and I don't even know how to do the marginal plots and confidence intervals off the top of my head). This and the next example are taken from the tutorial on quantitative linear models.

Works well with Pandas DataFrame

Data has structure. Often, there are different groups or categories we are interested in (pandas' groupby functionality is amazing in this case). For example, the tips data set looks like this:

In [9]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

We might ask if smokers tip differently than non-smokers. Without seaborn, this would require a pandas groupby together with the complex code for plotting a linear regression. With seaborn, we can provide the column name we wish to split by as the col keyword argument:

In [11]:
sns.lmplot("total_bill", "tip", tips, col="smoker");

Pretty neat, eh?

As you dive deeper you might want to control certain details of these plots at a more fine grained level. Because seaborn is just calling into matplotlib you probably will want to start learning this library at that point. For most things I'm pretty happy with what seaborn provides, however.


The idea of this blog post was to provide a very select number of packages which maximize your efficiency when starting with data science in Python.

Further reading


Thanks to Katie Green and Andreas Dietzel for helpful feedback on an earlier draft.

by Thomas Wiecki at November 18, 2014 03:00 PM

Continuum Analytics

Blaze Datasets

tl;dr Blaze aids exploration by supporting full databases and collections of datasets.

by Matt Rocklin at November 18, 2014 08:00 AM

November 14, 2014

Titus Brown

A Saturday morning conversation about publishing inconclusions

Here's an excerpt from an e-mail to a student whose committee I'm on; they were asking me about a comment their advisor had made that they shouldn't put a result in a paper because "It'll confuse the reviewer."

One thing to keep in mind is that communicating the results _is_ important. "It'll confuse the reviewer" is not always shorthand for "well, it's true but we can't say it because it's too confusing", which is how I think you're interpreting your advisor's comment; it is also shorthand for "well, we don't really have the depth of data/information to understand why the result is this way, so there's something more complicated going on, but we can't talk about it because we don't understand it ourselves." These can lead to the most productive bits of a project, if followed up (but it takes a lot of time to do so...)

Their response:

I'm curious, however, if we don't understand what's going on ourselves, shouldn't that be all the more reason to publish something? Because then other people with more knowledge can read it and they may know what's going on? Or at least help us?

And my response:

Well, that's definitely not how it works. Let me see if I can think of some reasons why...

First, it's much easier to get confused than it is to fight your way out of confusion - and you usually learn something in the process, obviously. So you have a lot more people who are confused than have learned something, and the latter is considered more interesting and worthy of communication than the former.

Second, writing papers is a lot of work and takes a lot of time. If you're confused (and hence probably wrong about any conclusions) it's probably not worth the effort...

Third, reading papers is also a lot of work! Most papers are never read seriously by anyone other than the reviewers. So doubling or tripling the number of papers to include confusing or inconclusive data would probably not be helpful.

Fourth, most conclusions require multiple lines of evidence, most of which are often not developed until you have at least some solid hypotheses about why you're seeing what you're seeing from one line of evidence. So a lot of those pubs would be wrong.

Fifth, failure is frequently underdetermined -- that is, we may have wrong or incorrect results and not know it, or not know why. It's rarely worth chasing down exactly why something didn't work unless it fails reproducibly and is critical to your results.

That having been said, there's a movement to make publishing data easier and bring data citations into standard practice, which I think is partly targeted at the same problem you're asking about.

What am I missing?

thanks, --titus

by C. Titus Brown at November 14, 2014 11:00 PM

William Stein

SageMathCloud Notifications are Now Better

I just made live a new notifications systems for  SageMathCloud, which I spent all week writing.  

These notifications are what you see when you click the bell in the upper right.   This new system replaces the one I made live two weeks ago.     Whenever somebody actively *edits* (using the web interface) any file in any project you collaborate on, a notification will get created or updated.    If a person *comments* on any file in any project you collaborate on (using the chat interface to the right), then not only does the notification get updated, there is also a little red counter on top of the bell and also in the title of the  SageMathCloud tab.   In particular, people will now be much more likely to see the chats you make on files.

NOTE: I have not yet enabled any sort of daily email notification summary, but that is planned. 

Some technical details: Why did this take all week? It's because the technology that makes it work behind the scenes was fairly difficult for me to figure out how to implement. I implemented a way to create an object that can be used simultaneously by many clients and supports realtime synchronization... but is stored by the distributed Cassandra database instead of a file in a project. Any changes to that object get synchronized around very quickly. It's similar to how synchronized text editing (with several people at once) works, but I rethought differential synchronization carefully, and also figured out how to synchronize using an eventually consistent database.

This will be useful for implementing a lot of other things in SageMathCloud that operate at a different level than "one single project". For example, I plan to add functions so you can access these same "synchronized databases" from Python processes -- then you'll be able to have sage worksheets (say) running on several different projects, but all saving their data to some common synchronized place (backed by the database). Another application will be a listing of the last 100 (say) files you've opened, with easy ways to store extra info about them. It will also be easy to make account and project settings more realtime, so when you change something, it automatically takes effect and is also synchronized across other browser tabs you may have open.

If you're into modern Single Page App web development, this might remind you of Angular or React or Hoodie or Firebase -- what I did this week is probably kind of like some of the sync functionality of those frameworks, but I use Cassandra (instead of MongoDB, say) and differential synchronization.

I BSD-licensed the differential synchronization code  that I wrote as part of the above. 
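To give a flavor of the idea, the core loop of differential synchronization can be sketched in a few lines of Python. This is a toy illustration of mine, not SageMathCloud's actual code: each side keeps a shadow copy of the shared object, and only an edit script travels over the wire. (A real implementation applies patches fuzzily so concurrent edits can be merged; this sketch only handles the exact-match case.)

```python
import difflib

def diff(old, new):
    """Edit script: a list of (op, arg) pairs that turns old into new."""
    script = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(a=old, b=new).get_opcodes():
        if tag == 'equal':
            script.append(('keep', i2 - i1))
        elif tag == 'insert':
            script.append(('ins', new[j1:j2]))
        elif tag == 'delete':
            script.append(('del', i2 - i1))
        else:  # 'replace' is a delete followed by an insert
            script.append(('del', i2 - i1))
            script.append(('ins', new[j1:j2]))
    return script

def patch(text, script):
    """Apply an edit script. In this toy version, text must equal the
    string the diff was made against; real diffsync patches fuzzily."""
    out, pos = [], 0
    for op, arg in script:
        if op == 'keep':
            out.append(text[pos:pos + arg])
            pos += arg
        elif op == 'del':
            pos += arg
        else:  # 'ins'
            out.append(arg)
    return ''.join(out)

# One round of client -> server sync: only the edit script is transmitted.
shadow = server = "norm-core style"
client = "hipster style"
edits = diff(shadow, client)
server = patch(server, edits)
shadow = client                  # both sides update their shadow copies
print(server)  # hipster style
```

The interesting engineering (which this sketch omits) is exactly what the post describes: making the patch step safe when the two sides have diverged, and persisting the result in an eventually consistent database.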

by William Stein (noreply@blogger.com) at November 14, 2014 02:31 PM

November 12, 2014

Jake Vanderplas

The Hipster Effect: An IPython Interactive Exploration

This week I started seeing references all over the internet to this paper: The Hipster Effect: When Anticonformists All Look The Same. It describes a simple mathematical model of conformity and non-conformity among a mutually interacting population, and finds some interesting results: namely, conformity among a population of self-conscious non-conformists is similar to a phase transition in a time-delayed thermodynamic system. In other words, with enough hipsters around responding to delayed fashion trends, a plethora of facial hair and fixed-gear bikes is a natural result.

Also naturally, upon reading the paper I wanted to try to reproduce the work. The paper solves the problem analytically for a continuous system and shows the precise values of certain phase transitions within the long-term limit of the postulated system. Though such theoretical derivations are useful, I often find it more intuitive to simulate systems like this in a more approximate manner to gain hands-on understanding. By the end of this notebook, we'll be using IPython's incredible interactive widgets to explore how the inputs to this model affect the results.

Mathematically Modeling Hipsters

We'll start by defining the problem, and going through the notation suggested in the paper. We'll consider a group of \(N\) people, and define the following quantities:

  • \(\epsilon_i\) : this value is either \(+1\) or \(-1\). \(+1\) means person \(i\) is a hipster, while \(-1\) means they're a conformist.
  • \(s_i(t)\) : this is also either \(+1\) or \(-1\). This indicates person \(i\)'s choice of style at time \(t\). For example, \(+1\) might indicate a bushy beard, while \(-1\) indicates clean-shaven.
  • \(J_{ij}\) : The influence matrix. This is a value greater than zero which indicates how much person \(j\) influences person \(i\).
  • \(\tau_{ij}\) : The delay matrix. This is an integer telling us the length of delay for the style of person \(j\) to affect the style of person \(i\).

The idea of the model is this: on any given day, person \(i\) looks at the world around him or her, and sees some previous day's version of everyone else. This information is \(s_j(t - \tau_{ij})\).

The amount that person \(j\) influences person \(i\) is given by the influence matrix, \(J_{ij}\), and after putting all the information together, we see that person \(i\)'s mean impression of the world's style is

\[ m_i(t) = \frac{1}{N} \sum_j J_{ij} \cdot s_j(t - \tau_{ij}) \]

Given the problem setup, we can quickly check whether this impression matches their own current style:

  • if \(m_i(t) \cdot s_i(t) > 0\), then person \(i\) matches those around them
  • if \(m_i(t) \cdot s_i(t) < 0\), then person \(i\) looks different than those around them
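As a quick sanity check (a toy example of my own, not from the original notebook), we can evaluate the mean impression and the matching test with plain NumPy for three people, uniform influence, and no delay. Note that here I normalize by the total influence \(\sum_j J_{ij}\) rather than \(N\), matching the implementation later in the post:

```python
import numpy as np

# Three people: person 0 has style +1, persons 1 and 2 have style -1.
s = np.array([1, -1, -1])
J = 1 - np.eye(3)              # J[i, j]: influence of person j on person i
m = (J * s).sum(1) / J.sum(1)  # each person's mean impression of the others
print(m)            # [-1.  0.  0.]
print(m[0] * s[0])  # -1.0: negative, so person 0 looks different from the others
```

Persons 1 and 2 see a perfectly mixed world (one beard, one clean-shaven face), so their mean impression is zero.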

A hipster who notices that their style matches that of the world around them will risk giving up all their hipster cred if they don't change quickly; a conformist will have the opposite reaction. Because \(\epsilon_i\) = \(+1\) for a hipster and \(-1\) for a conformist, we can encode this observation in a single value which tells us which way the person will lean that day:

\[ x_i(t) = -\epsilon_i m_i(t) s_i(t) \]

Simple! If \(x_i(t) > 0\), then person \(i\) will more likely switch their style that day, and if \(x_i(t) < 0\), person \(i\) will more likely maintain the same style as the previous day. So we have a formula for how to update each person's style based on their preferences, their influences, and the world around them.

But the world is a noisy place. Each person might have other things going on that day, so instead of using this value directly, we can turn it into a probabilistic statement. Consider the function

\[ \phi(x;\beta) = \frac{1 + \tanh(\beta \cdot x)}{2} \]

We can plot this function quickly:

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# Use seaborn styles for nice-looking plots
import seaborn; seaborn.set()
In [2]:
x = np.linspace(-1, 1, 1000)
for beta in [0.5, 1, 5]:
    plt.plot(x, 0.5 * (1 + np.tanh(beta * x)),
             label=r'$\beta = {0:.1f}$'.format(beta))
plt.legend(loc='upper left', fontsize=18)
plt.xlabel('$x$', size=18); plt.ylabel(r'$\phi(x;\beta)$', size=18);

This gives us a nice way to move from our preference \(x_i\) to a probability of switching styles. Here \(\beta\) is inversely related to noise. For large \(\beta\), the noise is small and we basically map \(x > 0\) to a 100% probability of switching, and \(x<0\) to a 0% probability of switching. As \(\beta\) gets smaller, the probabilities get less and less distinct.
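To put numbers on this (a quick check of my own, not a cell from the original notebook): for a mild preference to switch, \(x = 0.5\), switching is nearly certain at large \(\beta\) but close to a coin flip at small \(\beta\):

```python
import numpy as np

def phi(x, beta):
    return 0.5 * (1 + np.tanh(beta * x))

print(phi(0.5, beta=5))    # ~0.993: low noise, switching is nearly certain
print(phi(0.5, beta=0.5))  # ~0.622: high noise, barely better than a coin flip
print(phi(0.0, beta=5))    # 0.5: no preference at all is always a coin flip
```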

The Code

Let's see this model in action. We'll start by defining a class which implements everything we've gone through above:

In [3]:
class HipsterStep(object):
    """Class to implement hipster evolution

    Parameters
    ----------
    initial_style : length-N array
        values > 0 indicate one style, while values <= 0 indicate the other.
    is_hipster : length-N array
        True or False, indicating whether each person is a hipster
    influence_matrix : N x N array
        Array of non-negative values. influence_matrix[i, j] indicates
        how much influence person j has on person i
    delay_matrix : N x N array
        Array of positive integers. delay_matrix[i, j] indicates the
        number of days delay between person j's influence on person i.
    """
    def __init__(self, initial_style, is_hipster,
                 influence_matrix, delay_matrix,
                 beta=1, rseed=None):
        self.initial_style = initial_style
        self.is_hipster = is_hipster
        self.influence_matrix = influence_matrix
        self.delay_matrix = delay_matrix

        self.rng = np.random.RandomState(rseed)
        self.beta = beta

        # make s array consisting of -1 and 1
        self.s = -1 + 2 * (np.atleast_2d(initial_style) > 0)
        N = self.s.shape[1]

        # make eps array consisting of -1 and 1
        self.eps = -1 + 2 * (np.asarray(is_hipster) > 0)

        # create influence_matrix and delay_matrix
        self.J = np.asarray(influence_matrix, dtype=float)
        self.tau = np.asarray(delay_matrix, dtype=int)

        # validate all the inputs
        assert self.s.ndim == 2
        assert self.eps.shape == (N,)
        assert self.J.shape == (N, N)
        assert np.all(self.J >= 0)
        assert np.all(self.tau > 0)

    def phi(self, x, beta):
        return 0.5 * (1 + np.tanh(beta * x))

    def step_once(self):
        N = self.s.shape[1]

        # iref[i, j] gives the index for the j^th individual's
        # time-delayed influence on the i^th individual
        iref = np.maximum(0, self.s.shape[0] - self.tau)

        # sref[i, j] gives the previous state of the j^th individual
        # which affects the current state of the i^th individual
        sref = self.s[iref, np.arange(N)]

        # m[i] is the mean of weighted influences of other individuals
        m = (self.J * sref).sum(1) / self.J.sum(1)

        # From m, we use the sigmoid function to compute a transition probability
        transition_prob = self.phi(-self.eps * m * self.s[-1], beta=self.beta)

        # Now choose steps stochastically based on this probability
        new_s = np.where(transition_prob > self.rng.rand(N), -1, 1) * self.s[-1]

        # Add this to the results, and return
        self.s = np.vstack([self.s, new_s])
        return self.s

    def step(self, N):
        for i in range(N):
            self.step_once()
        return self.s

    def trend(self, hipster=True, return_std=True):
        if hipster:
            subgroup = self.s[:, self.eps > 0]
        else:
            subgroup = self.s[:, self.eps < 0]
        if return_std:
            return subgroup.mean(1), subgroup.std(1)
        return subgroup.mean(1)

Now we'll create a function which plots the trend for a certain number of time steps:

In [4]:
def plot_results(Npeople=500, Nsteps=200,
                 hipster_frac=0.8, initial_state_frac=0.5,
                 delay=20, log10_beta=0.5, rseed=42):
    rng = np.random.RandomState(rseed)

    initial_state = (rng.rand(1, Npeople) > initial_state_frac)
    is_hipster = (rng.rand(Npeople) > hipster_frac)

    influence_matrix = abs(rng.randn(Npeople, Npeople))
    influence_matrix.flat[::Npeople + 1] = 0  # zero out self-influence
    delay_matrix = 1 + rng.poisson(delay, size=(Npeople, Npeople))

    h = HipsterStep(initial_state, is_hipster,
                    influence_matrix, delay_matrix=delay_matrix,
                    beta=10 ** log10_beta, rseed=rseed)
    h.step(Nsteps)

    def beard_formatter(y, loc):
        if y == 1:
            return 'bushy-\nbeard'
        elif y == -1:
            return 'clean-\nshaven'
        else:
            return ''

    t = np.arange(Nsteps + 1)

    fig, ax = plt.subplots(2, sharex=True, figsize=(8, 6))
    ax[0].imshow(h.s.T, cmap='binary', interpolation='nearest')

    mu, std = h.trend(True)
    ax[1].plot(t, mu, c='red', label='hipster')
    ax[1].fill_between(t, mu - std, mu + std, color='red', alpha=0.2)

    mu, std = h.trend(False)
    ax[1].plot(t, mu, c='blue', label='conformist')
    ax[1].fill_between(t, mu - std, mu + std, color='blue', alpha=0.2)

    ax[1].yaxis.set_major_formatter(plt.FuncFormatter(beard_formatter))
    ax[1].legend(loc='best')
    ax[1].set_ylim(-1.1, 1.1)
    ax[1].set_xlim(0, Nsteps)

Exploring the Result

With this code in place, we can now explore the result. We'll start by seeing what happens when just 10% of the population is made up of non-conformist hipsters:

In [5]:

Let's describe this plot briefly: the top panel has 500 rows and 200 columns: each row represents an individual person, and the color (white or black) along the row represents the style of that person at that time.

In the bottom panel, we see the mean and standard deviation of the styles of all hipsters (red) and all conformists (blue).

This plot shows something relatively unsurprising: when there are only a few hipsters in the population, we quickly reach an equilibrium where hipsters all have one style (a bushy beard) while the norm-core conformists have the opposite (clean shaven faces).

Let's see what happens if there are more hipsters in the population:

In [6]:

With half the population made up of hipsters, the trend washes out. There is so much noise and confusion about styles, that both the hipsters and the conformists have a wide range of styles at any given time.

Now let's see what happens when we have even more hipsters:

In [7]:

Now this is getting interesting! With a population dominated by hipsters, we end up approaching steady cycles in trends. The hipsters start trying to distance themselves from the "normal" style, and then the normal style moves to catch up with them. The hipsters then swing the other way, and the process repeats. This is an example of the "phase transition" that the author of the original paper talked about in analogy to thermodynamic systems: above certain critical values of the model parameters, a qualitatively new type of behavior appears out of the noise. This oscillation can be thought of as a rough and simplistic mathematical model for recurrent cycles of cultural and fashion trends that anyone over a couple decades old has probably noticed over time.

But let's explore this even more.

Fully Interactive

One of the nice pieces of the IPython notebook is the ability to quickly create interactive visualizations. Unfortunately this only works when you're viewing the notebook live (i.e. a static HTML view on a blog post won't give you any interaction). If you're reading this on my blog or on nbviewer, then you can download the notebook and run it with IPython to see these features.

What we'll do is to call IPython's interactive tools on our Python function, which will create javascript sliders allowing us to explore the parameter space of this hipster conformity model. I'd encourage you to download the notebook and try it out!

In [8]:
from IPython.html.widgets import interact, fixed

interact(plot_results, hipster_frac=(0.0, 1.0), delay=(1, 50),
         initial_state_frac=(0.0, 1.0), log10_beta=(-2.0, 2.0),
         Npeople=fixed(500), Nsteps=fixed(200), rseed=fixed(42));

Again, unless you download the notebook and run it on a local Python kernel, all you'll see is a static graphic above. But with the interactive version, you can really start to see how these various parameters affect the system.


This has been a lot of fun, and if you've read this far I hope this helped you understand the mathematics of Hipster-dom! For more information and analysis, go read the paper. It goes much deeper than the rough, discrete approximation I've used here.

For further ideas, I'd love to see a simulation of how this looks if we add-in spatial information, and create a delay related to that information. Would you start to see pockets of people adapting similar styles? My guess is yes, but I'm not entirely sure... there's only one way to find out.

Happy coding!

This post was written entirely in the IPython notebook. You can download this notebook, or see a static view here.

by Jake Vanderplas at November 12, 2014 05:00 AM

November 10, 2014

Titus Brown

Some thoughts on Journal Impact Factor

A colleague just e-mailed me to ask me how I felt about journal impact factor being such a big part of the Academic Ranking of World Universities - they say that 20% of the ranking weight comes from # of papers published in Nature and Science. So what do I think?

On evaluations

I'm really not a big fan of rankings and evaluations in the first place. This is largely because I feel that evaluations are rarely objective. For one very specific example, last year at MSU I got formal evaluations from both of my departments. Starting with the same underlying data (papers published, students graduated, grants submitted/awarded, money spent, classes taught, body mass/height ratio, gender weighting, eye color, miles flown, talks given, liquid volume of students' tears generated, committees served on, etc.) one department gave me a "satisfactory/satisfactory/satisfactory" while the other department gave me an "excellent/excellent/excellent." What fun! (I don't think the difference was the caliber of the departments.)

Did I mention that these rankings helped determine my raise for the year?

Anyhoo, I find the rating and ranking scheme within departments at MSU to be largely silly. It's done in an ad hoc manner by untrained people, and as far as I can tell, is biased against people who are not willing to sing their own praises. (I brought up the Dunning-Kruger effect in my last evaluation meeting. Heh.) That's not to say there's not serious intent -- they are factored into raises, and at least one other purpose of evaluating assistant professors is so that once you fire their ass (aka "don't give them tenure") there's a paper trail of critical evaluations where you explained that they were in trouble.

Metrics are part of the job, though; departments evaluate their faculty so they can see who, if anyone, needs help or support or mentoring, and to do this, they rely at least in part on metrics. Basically, if someone has lots of funding and lots of papers, they're probably not failing miserably at being a research professor; if they're grant-poor and paper-poor, they're targets for further investigation. There are lots of ways to evaluate, but metrics seem like an inextricable part of it.

Back to the impact factor

Like faculty evaluations, ranking by the impact factor of the journals that university faculty publish in is an attempt to predict future performance using current data.

But impact factor is extremely problematic for many reasons. It's based on citations, which (over the long run) may be an OK measure of impact, but are subject to many confounding factors, including field-specific citation patterns. It's an attempt to predict the success of individual papers on a whole-journal basis, which falls apart in the face of variable editorial decisions. High-impact journals are also often read more widely than low-impact journals, which yields a troubling circularity in terms of citation numbers (you're more likely to cite a paper you've read!). Worse, the whole system is prone to being gamed in various ways, which is leading to high rates of retractions for high-impact journals, as well as outright fraud.

Impact factor is probably a piss-poor proxy for paper impact, in other words.

If impact factor was just a thing that didn't matter, I wouldn't worry. The real trouble is that impact factors have real-world effect - many countries use impact factor of publications as a very strong weight in funding and promotion decisions. Interestingly, the US is not terribly heavy handed here - most universities seem pretty enlightened about considering the whole portfolio of a scientist, at least anecdotally. But I can name a dozen countries that care deeply about impact factor for promotions and raises.

And apparently impact factor affects university rankings, too!

Taking a step back, it's not clear that any good ranking scheme can exist, and if it does, we're certainly not using it. All of this is a big problem if you care about fostering good science.

The conundrum is that many people like rankings, and it seems futile to argue against measuring and ranking people and institutions. However, any formalized ranking system can be gamed and perverted, which ends up sometimes rewarding the wrong kind of people, and shutting out some of the right kind of people. (The Reed College position on the US News & World Report ranking system is worth reading here.) More generally, in any ecosystem, the competitive landscape is evolving, and a sensible measure today may become a lousy measure tomorrow as the players evolve their strategies; the stricter the rules of evaluation, and the more entrenched the evaluation system, the less likely it is to adapt, and the more misranking will result. So ranking systems need to evolve continuously.

At its heart, this is a scientific management challenge. Rankings and metrics do pretty explicitly set the landscape of incentives and competition. If our goal in science is to increase knowledge for the betterment of mankind, then the challenge for scientific management is to figure out how to incentivize behaviors that trend in that direction in the long term. If you use bad or outdated metrics, then you incentivize the wrong kind of behavior, and you waste precious time, energy, and resources. Complicating this is the management structure of academic science, which is driven by many things that include rankings and reputation - concepts that range from precise to fuzzy.

My position on all of this is always changing, but it's pretty clear that the journal system is kinda dumb and rewards the wrong behavior. (For the record, I'm actually a big fan of publications, and I think citations are probably not a terribly bad measure of impact when measured on papers and individuals, although I'm always happy to engage in discussions on why I'm wrong.) But the impact factor is especially horrible. The disproportionate effect that high-IF glamour mags like Cell, Nature and Science have on our scientific culture is clearly a bad thing - for example, I'm hearing more and more stories about editors at these journals warping scientific stories directly or indirectly to be more press-worthy - and when combined with the reproducibility crisis I'm really worried about the short-term future of science. Journal Impact Factor and other simple metrics are fundamentally problematic and are contributing to the problem, along with the current peer review culture and a whole host of other things. (Mike Eisen has written about this a lot; see e.g. this post.)

In the long term I think a much more experimental culture of peer review and alternative metrics will emerge. But what do we do for now?

More importantly:

How can we change?

I think my main advice to faculty is "lead, follow, or get out of the way."

Unless you're a recognized big shot, or willing to take somewhat insane gambles with your career, "leading" may not be productive - following or getting out of the way might be best. But there are a lot of things you can do here that don't put you at much risk, including:

  • be open to a broader career picture when hiring and evaluating junior faculty;
  • argue on behalf of alternative metrics in meetings on promotion and tenure;
  • use sites like Google Scholar to pick out some recent papers to read in depth when hiring faculty and evaluating grants;
  • avoid making (or push back at) cheap shots at people who don't have a lot of high-impact-factor papers;
  • invest in career mentoring that is more nuanced than "try for lots of C-N-S papers or else" - you'd be surprised how often this is the main advice assistant professors take away...
  • believe in and help junior faculty that seem to have a plan, even if you don't agree with the plan (or at least leave them alone ;)

What if you are a recognized big shot? Well, there are lots of things you can do. You are the people who set the tone in the community and in your department, and it behooves you to think scientifically about the culture and reward system of science. The most important thing you can do is think and investigate. What evidence is there behind the value of peer review? Are you happy with C-N-S editorial policies, and have you talked to colleagues who get rejected at the editorial review stage more than you do? Have you thought about per-article metrics? Do you have any better thoughts on how to improve the system than 'fund more people', and how would you effect changes in this direction by recognizing alternate metrics during tenure and grant review?

The bottom line is that the current evaluation systems are the creation of scientists, for scientists. It's our responsibility to critically evaluate them, and perhaps evolve them when they're inadequate; we shouldn't just complain about how the current system is broken and wait for someone else to fix it.

Addendum: what would I like to see?

Precisely predicting the future importance of papers is obviously kind of silly - see this great 1994 paper by Gans and Shepherd on rejected classics papers, for example -- and is subject to all sorts of confounding effects. But this is nonetheless what journals are accustomed to doing: editors at most journals, especially the high impact factor ones, select papers based on projected impact before sending them out for review, and/or ask the reviewers to review impact as well.

So I think we should do away with impact review and review for correctness instead. This is why I'm such a big fan of PLOS One and PeerJ, who purport to do exactly that.

But then, I get asked, what do we do about selecting out papers to read? Some (many?) scientists claim that they need the filtering effect of these selective journals to figure out what they should be reading.

There are a few responses to this.

First, it's fundamentally problematic to outsource your attention to editors at journals, for reasons mentioned above. There's some evidence that you're being drawn into a manipulated and high-retraction environment by doing that, and that should worry you.

But let's say you feel you need something to tell you what to read.

Well, second, this is technologically solvable - that's what search engines already do. There's a whole industry of search engines that give great results based on integrating free text search, automatic content classification, and citation patterns. Google Scholar does a great job here, for example.

Third, social media (aka "people you know") provides some great recommendation systems! People who haven't paid much attention to Twitter or blogging may not have noticed, but in addition to person-to-person recommendations, there are increasingly good recommendation systems coming on line. I personally get most of my paper recs from online outlets (mostly people I follow, but I've found some really smart people to follow on Twitter!). It's a great solution!

Fourth, if one of the problems is that many journals review for correctness AND impact together, why not separate them? For example, couldn't journals like Science or Nature evolve into literature overlays that highlight papers published in impact-blind journals like PLOS One or PeerJ? I can imagine a number of ways that this could work, but if we're so invested in having editors pick papers for us, why not have them pick papers that have been reviewed for scientific correctness first, and then elevate them to our attention with their magic editorial pen?

I don't see too many drawbacks to this vs the current approach, and many improvements. (Frankly this is where I see most of scientific literature going, once preprint archives become omnipresent.)

So that's where I want and expect to see things going. I don't see ranking based on predicted impact going away, but I'd like to see it more reflective of actual impact (and be measured in more diverse ways).


p.s. People looking for citations of high retraction rate, problematic peer review, and the rest could look at one of my earlier blog posts on problems with peer review. I'd be interested in more citations, though!

by C. Titus Brown at November 10, 2014 11:00 PM

November 06, 2014

Fabian Pedregosa

Plot memory usage as a function of time

One of the lesser known features of the memory_profiler package is its ability to plot memory consumption as a function of time. This was implemented by my friend Philippe Gervais, previously a colleague at INRIA and now working for Google.

With this feature it is very easy to generate a plot of the memory consumption as a function of time. The result will be something like this:

where you can see the memory used (on the y-axis) as a function of time (on the x-axis). Furthermore, we have used two functions, test1 and test2, and the colored brackets make it possible to see at what time these functions start and finish.

This plot was generated with the following simple script:

import time


@profile        # made available by `mprof run`; optional (see below)
def test1():
    n = 10000
    a = [1] * n
    time.sleep(1)
    return a


@profile
def test2():
    n = 100000
    b = [1] * n
    time.sleep(1)
    return b


if __name__ == "__main__":
    test1()
    test2()

What happens here is that we have two functions, test1() and test2(), in which we create two lists of different sizes (the one in test2 is bigger). We call time.sleep() for one second so that the functions do not return too soon and we have time to get reliable memory measurements.

The decorator @profile is optional: it is useful so that memory_profiler knows when the function has been called, so it can plot the brackets indicating that. If you don't use the decorator, the example will work just fine, except that the brackets will not appear in your plot.

Suppose we have saved the script as test1.py. We run the script as

$ mprof run test1.py

where mprof is an executable provided by memory_profiler. If the above command was successful it will print something like this

$ mprof run test1.py
mprof: Sampling memory every 0.1s
running as a Python program...

The above command will create a .dat file on your current working directory, something like mprofile_20141108113511.dat. This file (you can inspect it, it's a text file) contains the memory measurements for your program.

You can now plot the memory measurements with the command

$ mprof plot

This will open a matplotlib window and show you the plot:

As you can see, attention has been paid to the default values so that the plot it generates already looks decent without much effort. The not-so-nice part is that, at least as of November 2014, if you want to customize the plot you'll have to look at and modify the mprof script. Some refactoring is still needed in order to make the plots easier to customize (work in progress).
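In the meantime, the .dat file is plain text and easy to parse yourself if you want full control over the figure. The format isn't formally documented, so treat this as a sketch: in my experience, the memory samples are lines of the form `MEM <MiB> <unix-timestamp>`, alongside header and function-marker lines that we can skip.

```python
def read_mprof(path):
    """Collect (timestamp, MiB) samples from an mprof .dat file,
    skipping header (CMDLINE) and function-marker (FUNC) lines."""
    t, mem = [], []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if fields and fields[0] == 'MEM':
                mem.append(float(fields[1]))
                t.append(float(fields[2]))
    return t, mem

# A tiny fabricated example of what the file contents look like:
with open('mprof_demo.dat', 'w') as f:
    f.write("CMDLINE python test1.py\n"
            "MEM 10.2 1415445311.1\n"
            "MEM 35.7 1415445312.2\n")

t, mem = read_mprof('mprof_demo.dat')
print(mem)  # [10.2, 35.7]
```

From there you can hand t and mem to matplotlib directly and style the plot however you like, rather than patching the mprof script.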

by Fabian Pedregosa at November 06, 2014 11:00 PM

November 02, 2014

Titus Brown

Doing biology, however it needs to be done

Sean Eddy wrote an interesting blog post on how scripting is something every biologist should learn to do. This spurred a few discussions on Twitter and elsewhere, most of which devolved into the usual arguments about what, precisely, biologists should be taught.

I always find these discussions not merely predictable but rather beside the point. Statisticians will always complain that people need a better appreciation of stats; bioinformaticians will point to alignment or sequence comparison or whatnot; evolutionary biologists will talk about how important evolution is; physicists will point to more math; etc. But the truth is there's very, very little that is necessary to be a biologist.

My perspective on this is informed by my own background, which is a tad idiosyncratic. Despite being an assistant professor in a Microbiology department and (soon) a professor in a VetMed program, I have taken more physics courses than biology courses; heck, I've taught more biology courses than I've taken. I've never taken a chemistry course and know virtually nothing about biochemistry, in particular. Most of my evolutionary knowledge derives from my reading for my research on Avida; I've never been taught anything about ecology and still know very little. I failed my first qual exam at Caltech because I didn't actually know anything about cells (which event, ahem, spurred me to learn). Despite this utter lack of formal background in biology, I spent 5-10 years of my life doing experimental molecular biology for developmental biology research (after 3-5 years learning basic molecular biology), and I've learned what I think is a reasonable amount about cell biology into the bargain. I know a fair bit about developmental biology, evo-devo, molecular biology, and genomics. My PhD is actually in Biology, with a Developmental Biology focus. But I learned it all as I needed it. (My undergrad degree is in Unapplied Mathematics.)

My weakest formal training is probably in stats, where I know enough to point out where whatever system I'm studying is violating standard statistical requirements, but not enough to point out how to rescue our approach.

Despite having "run" a bioinformatics lab for the last few years, my formal bioinformatics background is basically nil - I took a strong programming background, learned a bunch of biology and genomics, and then realized that much of bioinformatics is fairly obvious at that point. I don't really understand Hidden Markov Models or sequence alignment (but shh, don't tell anyone!)

With all of this, what do I call myself? Well, I definitely consider myself a biologist, as do at least a few different hiring panels, including one at UC Davis :). And when I talk to other biologists, I think that at least some of them agree - I'm focused on answering biological questions. I do so primarily in collaboration at this point, and primarily from the standpoint of data, but: biology.

So here's my conclusion: to be a biologist, one must be seriously trying to study biology. Period. Clearly you must know something about biology in order to be effective, and critical thinking is presumably pretty important; I think "basic competency in scientific practice" is probably the minimum bar, but even there you can imagine lab techs or undergraduates putting in useful work at a pretty introductory level. I think there are many useful skills to have, but I have a hard time concluding that any of them are strictly necessary.

The more interesting question, to my mind, is what should we be teaching undergraduates and graduate students in biology? And there I unequivocally agree with the people who prioritize some reasonable background in stats, and some reasonable background in data analysis (with R or Python - something more than Excel). What's more important than teaching any one thing in specific, though, is that the whole concept that biologists can avoid math or computing in their training (be it stats, modeling, simulation, programming, data science/data analysis, or whatever) needs to die. That is over. Dead, done, over.

One particular challenge we are facing now is that we don't have many people capable of teaching these younger biologists the appropriate data analysis skills, because most biologists (including the non-research-active faculty that do most teaching) don't know anything about them, and data analysis in biology is about data analysis in biology -- you can't just drop in a physicist or an engineer to teach this stuff.

At the end of the day, though, a scientist either learns what they need to know in order to do their research, or they collaborate with others to do it. As data becomes ever more important in biology, I expect more and more biologists will learn how to do their own analysis. One of my interests is in figuring out how to help biologists to make this transition if they want to.

So perhaps we can shift from talking about what you must know in order to practice biology, and talk about what we're going to teach, to whom, and when, to people who are the biologists of the future?


by C. Titus Brown at November 02, 2014 11:00 PM

October 27, 2014

Juan Nunez-Iglesias

GitHub's hidden PR comments

It’s time to draw my “continuous integration in Python” series to a close. This final post ties all six previous posts together and is the preferred write-up to share more widely and on which to provide feedback.

Almost everything I know about good Python development I’ve learned from Stéfan van der Walt, Tony Yu, and the rest of the scikit-image team. But a few weeks ago, I was trying to emulate the scikit-image CI process for my own project: cellom2tif, a tool to liberate images from a rather useless proprietary format. (I consider this parenthetical comment sufficient fanfare to announce the 0.2 release!) As I started copying and editing config files, I found that even from a complete template, getting started was not very straightforward. First, scikit-image has much more complicated requirements, so that a lot of the .travis.yml file was just noise for my purposes. And second, as detailed in the previous posts, a lot of the steps are not found or recorded anywhere in the repository, but rather must be navigated to on the webpages of GitHub, Travis, and Coveralls. I therefore decided to write this series as both a notetaking exercise and a guide for future CI novices. (Such as future me.)

To recap, here are my six steps to doing continuous integration in Python with pytest, Travis, and Coveralls:

If you do all of the above at the beginning of your projects, you’ll be in a really good place one, two, five years down the line, when many academic projects grow far beyond their original scope in unpredictable ways and end up with much broken code. (See this wonderful editorial by Zeeya Merali for much more on this topic.)

Reducing the boilerplate with PyScaffold

But it’s a lot of stuff to do for every little project. I was about to make myself some minimal setup.cfg and .travis.yml template files so that I could have these ready for all new projects, when I remembered PyScaffold, which sets up a Python project’s basic structure automatically (setup.py, package_name/__init__.py, etc.). Sure enough, PyScaffold has a --with-travis option that implements all my recommendations, including pytest, Travis, and Coveralls. If you set up your projects with PyScaffold, you’ll just have to turn on Travis-CI on your GitHub repo admin and Coveralls on coveralls.io, and you’ll be good to go.

When Travises attack

I’ve made a fuss about how wonderful Travis-CI is, but it breaks more often than I’d like. You’ll make some changes locally, and ensure that the tests pass, but when you push them to GitHub, Travis fails. This can happen for various reasons:

  • your environment is different (e.g. NumPy versions differ between your local build and Travis’s VMs).
  • you’re testing a function that depends on random number generation and have failed to set the seed.
  • you depend on some web resource that was temporarily unavailable when you pushed.
  • Travis has updated its VMs in some incompatible way.
  • you have more memory/CPUs locally than Travis allows.
  • some other, not-yet-understood-by-me reason.

Of these, the first three are acceptable. You can use conda to match your environments both locally and on Travis, and you should always set the seed for randomised tests. For network errors, Travis provides a special function, travis_retry, that you can prefix your commands with.
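For the seeding point in particular, the fix is a one-liner at the top of the test. Here is a minimal sketch (the function under test, noisy_mean, is a hypothetical example, not from any real project):

```python
import numpy as np

def noisy_mean(n):
    # hypothetical function under test: mean of n random draws
    return np.random.normal(loc=5.0, size=n).mean()

def test_noisy_mean():
    # fix the seed so the test behaves identically locally and on Travis
    np.random.seed(0)
    result = noisy_mean(1000)
    assert abs(result - 5.0) < 0.2
```

Without the seed call, this test would pass or fail depending on the draw, which is exactly the kind of spurious Travis failure described above.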

Travis VM updates should theoretically be benign and not cause any problems, but, in recent months, they have been a significant source of pain for the scikit-image team: every monthly update by Travis broke our builds. That’s disappointing, to say the least. For simple builds, you really shouldn’t run into this. But for major projects, this is an unnecessary source of instability.

Further, Travis VMs don’t have unlimited memory and disk space for your builds (naturally), but the limits are not strictly defined (unnaturally). This means that builds requiring “some” memory or disk space randomly fail. Again, disappointing. Travis could, for example, guarantee some minimal specs that everyone could program against — and request additional space either as special exemptions or at a cost.

Finally, there’s the weird failures. I don’t have any examples on hand but I’ll just note that sometimes Travis builds fail, where your local copy works fine every single time. Sometimes rebuilding fixes things, and other times you have to change some subtle but apparently inconsequential thing before the build is fixed. These would be mitigated if Travis allowed you to clone their VM images so you could run them on a local VM or on your own EC2 allocation.

A too-common Travis occurrence: randomly failing tests (a Heisenbug)

In all though, Travis is a fantastic resource, and you shouldn’t let my caveats stop you from using it. They are just something to keep in mind before you pull all your hair out.

The missing test: performance benchmarks

Testing helps you maintain the correctness of your code. However, as Michael Droettboom eloquently argued at SciPy 2014, all projects are prone to feature creep, which can progressively slow code down. Airspeed Velocity is to benchmarks what pytest is to unit tests, and allows you to monitor your project’s speed over time. Unfortunately, benchmarks are a different beast to tests, because you need to keep the testing computer’s specs and load constant for each benchmark run. Therefore, a VM-based CI service such as Travis is out of the question.
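For a flavour of what Airspeed Velocity benchmarks look like: by convention, asv discovers methods whose names start with time_ in files under the project's benchmarks/ directory and times them across commits. A minimal sketch, with a made-up workload:

```python
import numpy as np

class TimeSuite:
    """Benchmarks that asv discovers via the time_ prefix."""

    def setup(self):
        # runs before each benchmark; excluded from the timing
        self.data = np.random.rand(1000, 1000)

    def time_matrix_product(self):
        # asv times this call repeatedly and tracks it across commits
        self.data @ self.data
```

Running `asv run` over a range of commits then gives you a timeline of this benchmark's speed, which is how regressions from feature creep get caught.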

If your project has any performance component, it may well be worth investing in a dedicated machine only to run benchmarks. The machine could monitor your GitHub repo for changes and PRs, check them out when they come in, run the benchmarks, and report back. I have yet to do this for any of my projects, but will certainly consider this strongly in the future.

Some reservations about GitHub

The above tools all work great as part of GitHub’s pull request (PR) development model. It’s a model that is easy to grok, works well with new programmers, and has driven massive growth in the open-source community. Lately, I recommend it with a bit more trepidation than I used to, because it does have a few high-profile detractors, notably Linux and git creator Linus Torvalds, and OpenStack developer Julien Danjou. To paraphrase Julien, there are two core problems with GitHub’s chosen workflow, both of which are longstanding and neither of which shows any sign of improving.

First, comments on code diffs are buried by subsequent changes, whether the changes are a rebase or they simply change the diff. This makes it very difficult for an outside reviewer to assess what discussion, if any, resulted in the final/latest design of a PR. This could be a fairly trivial fix (colour-code outdated diffs, rather than hiding them), so I would love to see some comments from GitHub as to what is taking so long.

GitHub's hidden PR comments: expect to see a lot of these when using pull requests.

Second, bisectability is broken by fixup commits. The GitHub development model is not only geared towards small, incremental commits being piled on to a history, but it actively encourages these with their per-commit badging of a user’s contribution calendar. Fixup commits make bug hunting with git bisect more difficult, because some commits will not be able to run a test suite at all. This could be alleviated by considering only commits merging GitHub PRs, whose commit messages start with Merge pull request #, but I don’t know how to get git to do this automatically (ideas welcome in the comments).

I disagree with Julien that there is “no value in the social hype [GitHub] brings.” In fact, GitHub has dramatically improved my coding skills, and no doubt countless others’. For many, it is their first experience with code review. Give credit where it is due: GitHub is driving the current, enormous wave of open-source development. But there is no doubt it needs improvement, and it’s sad to see GitHub’s developers apparently ignoring their critics. I hope the latter will be loud enough soon that GitHub will have no choice but to take notice.

Final comments

This series, including this post, sums up my current thinking on CI in Python. It’s surely incomplete: I recently came across a curious “Health: 88%” badge on Mitchell Stanton-Cook’s BanzaiDB README. Clicking it took me to the project’s landscape.io page, which appears to do for coding style what Travis does for builds/tests and Coveralls does for coverage. How it measures “style” is not yet clear to me, but it might be another good CI tool to keep track of. Nevertheless, since it’s taken me a few years to get to this stage in my software development practice, I hope this series will help other scientists get there faster.

If any more experienced readers think any of my advice is rubbish, please speak up in the comments! I’ll update the post(s) accordingly. CI is a big rabbit hole and I’m still finding my way around.

by Juan Nunez-Iglesias at October 27, 2014 02:03 PM

October 22, 2014

Titus Brown

Infrastructure for Data Intensive Biology - a statement of work

Since being chosen as a Moore Foundation Data Driven Discovery Investigator, I've been putting together the paperwork at UC Davis to actually receive the money. Part of that is putting together a budget and a Statement of Work to help guide the conversation between me, Davis, and the Moore Foundation. Here's what I sent Chris Mentzel at Moore:

Title: Infrastructure for Data Intensive Biology

OUTCOME: In support of demonstrating the high level of scientific impact that data scientists deliver through their focus on interdisciplinary data-driven research, funds from this award will be used to better understand gene function in non-model organisms through the development of new workflows and better data sharing technology for large-scale data analysis.

Research direction 1: Develop and extend protocols for non-model genomic and transcriptomic analysis.

Research direction 2: Integrate and extend existing workflow and data analysis software into a cloud-enabled deployment system with a Web interface for executing protocols.

Research direction 3: Investigate and develop a distributed graph database system and distributed query functionality to support distributed data-driven discovery. (DDDD :)

For more of the background, see my full award submission, my presentation, and a science fiction story that I would like to enable.

Comments and pointers welcome!


by C. Titus Brown at October 22, 2014 10:00 PM

October 17, 2014

Juan Nunez-Iglesias

Readme badges

We’re finally ready to wrap up this topic. By now you can:

But, much as exercise is wasted if your bathroom scale doesn’t automatically tweet about it, all this effort is for naught if visitors to your GitHub page can’t see it!

Most high-profile open-source projects these days advertise their CI efforts. Above, I cheekily called this showing off, but it’s truly important: anyone who lands on your GitHub page is a potential user or contributor, and if they see evidence that your codebase is stable and well-tested, they are more likely to stick around.

Badging your README is easy. (You do have a README, don’t you?) In Travis, go to your latest build. Near the top right, click on the “build passing” badge:

Travis-CI badge

You’ll get an overlay, with a pull-down menu for all the different options for getting the badge. You can grab the image URL, or you can directly grab the Markdown to put into your markdown-formatted README, or a bunch of other options, including RST:

Travis-CI badge URLs

Just copy and paste the appropriate code and add it to your README file wherever you please.

Meanwhile, on your repository’s Coveralls page, on the right-hand side, you will find another pull-down menu with the appropriate URLs:

Coveralls badge URLs

Again, just grab whichever URL is appropriate for your needs (I prefer Markdown-formatted READMEs), and add it to your README, usually next to the Travis badge.

And you’re done! You’ve got automated tests, tons of test coverage, you’re running everything correctly thanks to configuration files, and all this is getting run on demand thanks to Travis and Coveralls. And thanks to badging, the whole world knows about it:

Readme badges

by Juan Nunez-Iglesias at October 17, 2014 01:37 PM

William Stein

A Non-technical Overview of the SageMathCloud Project

What problems is the SageMathCloud project trying to solve? What pain points does it address? Who are the competitors and what is the state of the technology right now?

What problems you’re trying to solve and why are these a problem?

  • Computational Education: How can I teach a course that involves mathematical computation and programming?
  • Computational Research: How can I carry out collaborative computational research projects?
  • Cloud computing: How can I get easy user-friendly collaborative access to a remote Linux server?

What are the pain points of the status quo and who feels the pain?

  • Student/Teacher pain:
    • Getting students to install software needed for a course on their computers is a major pain; sometimes it is just impossible, because no major math software (not even Sage) supports all recent versions of Windows/Linux/OS X/iOS/Android.
    • Getting university computer labs to install the software you need for a course is frustrating and expensive (time and money).
    • Even if computer labs worked, they are often being used by another course, stuffy, and students can't possibly do all their homework there, so computation gets short shrift. Lab keyboards, hardware, etc., all hard to get used to. Crappy monitors.
    • Painful confusing problems copying files around between teachers and students.
    • Helping a student or collaborator with their specific problem is very hard without physical access to their computer.
  • Researcher pain:
    • Making backups every few minutes of the complete state of everything when doing research is hard and distracting, but important for reproducibility.
    • Copying around documents, emailing or pushing/pulling them to revision control is frustrating and confusing.
    • Installing obscure software is frustrating and distracts from the research they really want to do.
  • Everybody:
    • It is frustrating not having LIVE working access to your files wherever you are. (Dropbox/Github doesn't solve this, since files are static.)
    • It is difficult to leave computations running remotely.

Why is your technology poised to succeed?

  • When it works, SageMathCloud solves every pain point listed above.
  • The timing is right, due to massive improvements in web browsers during the last 3 years.
  • I am on full sabbatical this year, so at least success isn't totally impossible due to not working on the project.
  • I have been solving the above problems in less scalable ways for myself, colleagues and students since the 1990s.
  • SageMathCloud has many users that provide valuable feedback.
  • We have already solved difficult problems since I started this project in Summer 2012 (and launched first version in April 2013).

Who are your competitors?

There are no competitors with a similar range of functionality. However, there are many webapps that have some overlap in capabilities:
  • Mathematical overlap: Online Mathematica: "Bring Mathematica to life in the cloud"
  • Python overlap: Wakari: "Web-based Python Data Analysis"; also PythonAnywhere
  • Latex overlap: ShareLaTeX, WriteLaTeX
  • Web-based IDE's/terminals: target writing webapps (not research or math education): c9.io, nitrous.io, codio.com, terminal.com
  • Homework: WebAssign and WebWork
Right now, SageMathCloud gives away for free far more than any other similar site, due to very substantial temporary financial support from Google, the NSF and others.

What’s the total addressable market?

Though our primary focus is the college mathematics classroom, there is a larger market:
  • Students: all undergrad/high school students in the world taking a course involving programming or mathematics
  • Teachers: all teachers of such courses
  • Researchers: anybody working in areas that involve programming or data analysis
Moreover, the web-based platform for computing that we're building lends itself to many other collaborative applications.

What stage is your technology at?

  • The site is up and running and has 28,413 monthly active users
  • There are still many bugs
  • I have a precise todo list that would take me at least 2 months fulltime to finish.

Is your solution technically feasible at this point?

  • Yes. It is only a matter of time until the software is very polished.
  • Moreover, we have compute resources to support significantly more users.
  • But without money (from paying customers or investment), if growth continues at the current rate then we will have to clamp down on free quotas for users.

What technical milestones remain?

  • Infrastructure for creating automatically-graded homework problems.
  • Fill in lots of details and polish.

Do you have external credibility with technical/business experts and customers?

  • Business experts: I don't even know any business experts.
  • Technical experts: I founded the Sage math software, which is 10 years old and relatively well known by mathematicians.
  • Customers: We have no customers; we haven't offered anything for sale.

Market research?

  • I know about math software and its users as a result of founding the Sage open source math software project, NSF-funded projects I've been involved in, etc.

Is the intellectual property around your technology protected?

  • The IP is software.
  • The website software is mostly new Javascript code we wrote that is copyright Univ. of Washington and mostly not open source; it depends on various open source libraries and components.
  • The Sage math software is entirely open source.

Who are the team members to move this technology forward?

I am the only person working on this project fulltime right now.
  • Everything: William Stein -- UW professor
  • Browser client code: Jon Lee, Andy Huchala, Nicholas Ruhland -- UW undergrads
  • Web design, analytics: Harald Schilly -- Austrian grad student
  • Hardware: Keith Clawson

Why are you the ideal team?

  • We are not the ideal team.
  • If I had money maybe I could build the ideal team, leveraging my experience and connections from the Sage project...

by William Stein (noreply@blogger.com) at October 17, 2014 01:04 PM

October 16, 2014

Jake Vanderplas

How Bad Is Your Colormap?

(Or, Why People Hate Jet – and You Should Too)

I made a little code snippet that I find helpful, and you might too:

In [1]:
import numpy as np
import matplotlib.pyplot as plt

def grayify_cmap(cmap):
    """Return a grayscale version of the colormap"""
    cmap = plt.cm.get_cmap(cmap)
    colors = cmap(np.arange(cmap.N))
    # convert RGBA to perceived greyscale luminance
    # cf. http://alienryderflex.com/hsp.html
    RGB_weight = [0.299, 0.587, 0.114]
    luminance = np.sqrt(np.dot(colors[:, :3] ** 2, RGB_weight))
    colors[:, :3] = luminance[:, np.newaxis]
    return cmap.from_list(cmap.name + "_grayscale", colors, cmap.N)

What this function does is to give you a luminance-correct grayscale version of any matplotlib colormap. I've found this useful for quickly checking how my plots might appear if printed in black and white, but I think it's probably even more useful for stoking the flame of the internet's general rant against jet.

If you want to take a step toward joining the in-crowd of chromatically-sensitive data viz geeks, your best bet is to start by bashing jet. Even if you don't know it by name, I can guarantee that if you've read many scientific papers, you've seen jet before. For example, here's a snapshot of a plot from neuroscience journal which is skewered by an appropriately ranty blogpost on the subject:

Jet is the default colormap originally used by matlab, and this default was inherited in the early days of Python's matplotlib package. The reasons not to use jet are numerous, and you can find good arguments against it across the web. For some more subdued and nuanced arguments, I'd start with the paper Rainbow Color Map (Still) Considered Harmful and, for more general visualization tips, Ten Simple Rules for Better Figures.

So what do I have to add to this discussion that hasn't been already said? Well, nothing really, except the code snippet I shared above. Let me show you what it does.

Taking the Color Out of Jet

Let's start by defining some data and seeing how it looks with the default jet colormap:

In [2]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 6)
y = np.linspace(0, 3)[:, np.newaxis]
z = 10 * np.cos(x ** 2) * np.exp(-y)

We'll use matplotlib's imshow command to visualize this. By default, it will use the "jet" colormap:

In [4]:
plt.imshow(z)
At first glance this might look OK. But upon closer examination, you might notice that jet's luminance profile is incredibly complicated. Because your eye has different levels of sensitivity to light of different colors, luminance is not simply the sum of the RGB values as you might naively expect, but a weighted Euclidean sum of the individual values. You can find more information than you'd ever need to know on imagemagick's website.
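As a concrete illustration of that weighting (using the same HSP coefficients as grayify_cmap), the three pure primaries have the same RGB magnitude but very different perceived brightness:

```python
import numpy as np

# perceptual weights from the HSP model (http://alienryderflex.com/hsp.html)
RGB_weight = [0.299, 0.587, 0.114]

def luminance(rgb):
    """Perceived luminance of an RGB triple with channels in [0, 1]."""
    return np.sqrt(np.dot(np.asarray(rgb, dtype=float) ** 2, RGB_weight))

print(luminance([1, 0, 0]))  # red:   ~0.55
print(luminance([0, 1, 0]))  # green: ~0.77, the brightest primary
print(luminance([0, 0, 1]))  # blue:  ~0.34, the dimmest
```

This is why the green and yellow segments of jet jump out at the eye while the blue end recedes.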

When you take the jet colormap used above and convert it to luminance using the code snippet above, you get this:

In [5]:
plt.imshow(z, cmap=grayify_cmap('jet'))

It's a mess! The greyscale-only version of this colormap has strange luminance spikes in the middle, and makes it incredibly difficult to figure out what's going on in a plot with a modicum of complexity. Much better is to use a colormap with a uniform luminance gradient, such as the built-in grayscale colormap. Let's plot this beside the previous two:

In [6]:
cmaps = [plt.cm.jet, grayify_cmap('jet'), plt.cm.gray]
fig, axes = plt.subplots(1, 3, figsize=(12, 3))

for cmap, ax in zip(cmaps, axes):
    im = ax.imshow(z, cmap=cmap)
    fig.colorbar(im, ax=ax)

In particular, notice that in the left panel, your eye is drawn to the yellow and cyan regions, because the luminance is higher. This can have the unfortunate side-effect of highlighting "features" in your data which may not actually exist!

We can see this luminance spike more clearly if we look at the color profile of jet up close:

In [7]:
def show_colormap(cmap):
    im = np.outer(np.ones(10), np.arange(100))
    fig, ax = plt.subplots(2, figsize=(6, 1.5),
                           subplot_kw=dict(xticks=[], yticks=[]))
    ax[0].imshow(im, cmap=cmap)
    ax[1].imshow(im, cmap=grayify_cmap(cmap))

Once you have the grayscale lined-up with the color version, it's easy to point out these luminance spikes in the jet spectrum. By comparison, take a look at the Cube Helix colormap:

In [8]:
show_colormap('cubehelix')
This is a rainbow-like colormap which – by design – has a uniform luminance gradient across its progression of colors. It's certainly not the best choice in all situations, but you could easily argue that it's always a better choice than jet.
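We can check that claim numerically: run each colormap's entries through the same luminance conversion and see whether the result increases monotonically. A quick sketch, reusing the HSP weights from grayify_cmap (the try/except is an assumption to cover newer matplotlib versions, where plt.cm.get_cmap was removed in favour of the colormap registry):

```python
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

def luminance_profile(name):
    """Perceived luminance at each entry of a named matplotlib colormap."""
    try:
        cmap = plt.cm.get_cmap(name)       # older matplotlib, as in this post
    except AttributeError:
        cmap = matplotlib.colormaps[name]  # matplotlib >= 3.9 registry
    colors = cmap(np.arange(cmap.N))
    RGB_weight = [0.299, 0.587, 0.114]
    return np.sqrt(np.dot(colors[:, :3] ** 2, RGB_weight))

def is_monotonic(values, tol=2e-3):
    # allow tiny numerical dips from colormap quantization
    return bool(np.all(np.diff(values) > -tol))

print(is_monotonic(luminance_profile('cubehelix')))  # smoothly increasing
print(is_monotonic(luminance_profile('jet')))        # rises and falls
```

The same check makes a handy unit test if you want to enforce perceptually sane colormaps in your own plotting code.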

All the Colormaps!

It's useful to see this sort of visualization for all the available colormaps in matplotlib. Here's a quick script that does this:

In [18]:
fig, axes = plt.subplots(36, 6, figsize=(10, 7))
fig.subplots_adjust(left=0, right=1, bottom=0, top=1,
                    hspace=0.1, wspace=0.1)

im = np.outer(np.ones(10), np.arange(100))

cmaps = [m for m in plt.cm.datad if not m.endswith("_r")]

axes = axes.T.ravel()
for ax in axes:
    ax.axis('off')
for cmap, color_ax, gray_ax, null_ax in zip(cmaps, axes[1::3], axes[2::3], axes[::3]):
    del null_ax
    color_ax.set_title(cmap, fontsize=10)
    color_ax.imshow(im, cmap=cmap)
    gray_ax.imshow(im, cmap=grayify_cmap(cmap))