## March 27, 2015

### Gaël Varoquaux

#### EuroSciPy 2015: Call for papers

EuroSciPy 2015, the annual conference on Python in science, will take place in Cambridge, UK, on 26-30 August 2015. The conference features two days of tutorials followed by two days of scientific talks & posters and an extra day dedicated to developer sprints. It is the major event in Europe in the field of technical/scientific computing within the Python ecosystem. Scientists, PhDs, students, data scientists, analysts, and quants from more than 20 countries attended the conference last year.

The topics presented at EuroSciPy are very diverse, with a focus on advanced software engineering and original uses of Python and its scientific libraries, either in theoretical or experimental research, from both academia and the industry.

Submissions for posters, talks & tutorials (beginner and advanced) are welcome on our website at http://www.euroscipy.org/2015/. Sprint proposals should be addressed directly to the organisers at euroscipy-org@python.org.

Important dates:

• Apr 30, 2015 Talk and tutorials submission deadline
• May 1, 2015 Registration opens
• May 30, 2015 Final program announced
• Jun 15, 2015 Early-bird registration ends
• Aug 26-27, 2015 Tutorials
• Aug 28-29, 2015 Main conference
• Aug 30, 2015 Sprints

We look forward to an exciting conference and hope to see you in Cambridge!

The EuroSciPy 2015 Team - http://www.euroscipy.org/2015/

## March 26, 2015

### Titus Brown

#### NGS course 2015 - 2+1 weeks of computing, at a lake!

I've just opened applications for the 2015 summer course on Analyzing Next Generation Sequencing Data. The course will run from August 10th through August 21st; for details, see the course information page.

This year there will also be a third week of the course that is invitation-only. This third week is devoted to the topics of up/re-skilling, reproducibility, and data integration. It will be much more loosely organized than the first two weeks, with a few lectures and practicals around a lot of open time. I hope the third week will serve three purposes: it should be

1. an opportunity for alumni of the course to engage with updated materials; we now have 120 or so alumni, and I know that a reasonably significant number of them now do this sort of thing for a living. They can come hang out and go through the newer materials with some support.
2. a way for some of the course instructors to evaluate and trial the teaching of newer technology (e.g. Docker) for enabling reproducible computing.
3. a meeting of the minds on where the course (and training more generally) should go in the future.

I plan to prioritize alumni for this third week, but I will save some space for other participants who are already doing regular sequencing data analysis. If you feel reasonably comfortable with the material taught in a Software Carpentry workshop and you work with DNA sequencing data, please consider applying!

The application form for the third week is here. The form will be open until April 17th, or later if space is available.

For this third week, we are also looking for a few instructors/lecturers; travel and lodging would be covered. Drop me an e-mail if you are a trained Software Carpentry or Data Carpentry instructor and are interested in coming and hanging out with us for a week to develop materials and approaches!

--titus

#### A new preprint on semi-streaming analysis of large data sets

We just posted a new preprint (well, OK, a few weeks back)! The preprint title is "Crossing the streams: a framework for streaming analysis of short DNA sequencing reads", by Qingpeng Zhang, Sherine Awad, and myself. Note that like our other recent papers, this paper is 100% reproducible, with all source code on GitHub. We haven't finished writing up the instructions yet, but we're happy to share (see AWS.md for notes).

This paper's major point is that you can use the massive redundancy of deep short-read sequencing to analyze data as it comes in, and that this could easily be integrated into existing k-mer based error correctors and variant callers. Conveniently the algorithm doesn't care what you're sequencing - no assumptions are made about uniformity of coverage, so you can apply the same ol' spectral approaches you use for genomes to transcriptomes, metagenomes, and probably single-cell data. Which is, and let's be frank here, super awesome.

The paper also provides two useful tools, both implemented as part of khmer: one is an efficient approach to k-mer-abundance-based error trimming of short reads, and the other is a streaming approach to looking at per-position error profiles of short-read sequencing experiments.
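As a toy illustration of the k-mer-abundance idea (this is a from-scratch sketch, not the khmer implementation; khmer counts k-mers probabilistically rather than with an exact table, and all parameter values here are made up): count k-mers as reads stream past, then cut a read just before its first low-abundance k-mer.

```python
from collections import Counter

K, CUTOFF = 4, 2          # toy parameters; real reads use a much larger k
counts = Counter()        # exact counts; khmer uses a probabilistic sketch

def kmers(read):
    """All overlapping k-mers of a read."""
    return [read[i:i + K] for i in range(len(read) - K + 1)]

def consume(read):
    """Count a read's k-mers as it streams past."""
    counts.update(kmers(read))

def trim(read):
    """Cut a read just before its first k-mer rarer than CUTOFF."""
    for i, km in enumerate(kmers(read)):
        if counts[km] < CUTOFF:
            return read[:i + K - 1]  # keep only bases before the rare k-mer
    return read
```

Because trimming only consults counts accumulated so far, it can run in a single streaming pass, which is the point of the paper.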

A colleague, Erich Schwarz, suggested that I more strongly make the point that is really behind this work: we're in for more data. Scads and scads of it. Coming up with ways of efficiently dealing with it at an algorithmic level is important. (We didn't strengthen this point in the posted preprint -- the feedback came too late -- but we will hopefully get a chance to stick it in during revisions.)

The really surprising thing for me is that the general semi-streaming approach has virtually no drawbacks - the results are very similar to the full two-pass offline approaches used currently for error correction etc. Without implementing a huge amount of stuff we had to make this argument transitively, but I think it's solid.

For entertainment, take a look at the error profile in Figure 6. That's from a real data set, published in Nature something or other...

And, finally, dear prospective reviewers: the biggest flaws that I see are these:

• we chose to make most of our arguments by analyzing real data, and we didn't spend any time developing theory. This is a choice that our lab frequently makes -- to implement effective methods rather than developing the underlying theory -- but it leaves us open for a certain type of criticism.
• to extend the previous point, the algorithmic performance depends critically on details of the data set. We didn't know how to discuss this and so the discussion is maybe a bit weak. We'd love reviewers to ask pointed questions that we can address in order to shore it up.
• Repeats! How does all this stuff work with repeats!? I did a lot of work simulating repetitive sequences and couldn't find any place where repeats actually caused problems. My intuition now tells me that repeats are not actually a problem for methods that interrogate De Bruijn graphs using entire reads as an index into the graph, but I'd welcome someone telling me I'm wrong and either telling me where to look, or asking concrete questions that illuminate better directions to take it.

--titus

## March 25, 2015

### Matthew Rocklin

#### Partition and Shuffle

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

This post primarily targets developers.

tl;dr We partition out-of-core dataframes efficiently.

## Partition Data

Many efficient parallel algorithms require intelligently partitioned data.

For time-series data we might partition into month-long blocks. For text-indexed data we might have all of the “A”s in one group and all of the “B”s in another. These divisions let us arrange work with foresight.
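Once the division boundaries are known, routing a record is just a binary search against them. A minimal sketch of the idea (the alphabetical division values here are made up):

```python
import bisect

# Hypothetical alphabetical divisions: keys below "F" go to partition 0,
# "F"-"M" to partition 1, and so on (four output partitions).
divisions = ["F", "M", "T"]

def partition_of(key, divisions):
    """Find a key's output partition by binary search on the boundaries."""
    return bisect.bisect_right(divisions, key)
```

With known divisions, every record can be routed independently in O(log n), without consulting any other record -- that is what makes the foresight pay off.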

To extend Pandas operations to larger-than-memory data, efficient partitioning algorithms are critical. This is tricky when the data doesn’t fit in memory.

## Partitioning is fundamentally hard

Data locality is the root of all performance
-- A Good Programmer


Partitioning/shuffling is inherently non-local. Every block of input data needs to be split up, with bits sent to every block of output data. If we have a thousand partitions, then that’s a million little partition shards to communicate. Ouch.

Consider the following setup

  100GB dataset
/ 100MB partitions
= 1,000 input partitions


To partition, we need to shuffle data from the input partitions into a similar number of output partitions

  1,000 input partitions
* 1,000 output partitions
= 1,000,000 partition shards


If our communication/storage of those shards has even a millisecond of latency then we run into problems.

  1,000,000 partition shards
x 1ms
≈ 17 minutes
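The back-of-the-envelope numbers above are easy to check in a few lines:

```python
# Sanity-check the shard-count and latency arithmetic above.
dataset = 100 * 10**9          # 100 GB
partition = 100 * 10**6        # 100 MB
n = dataset // partition       # 1,000 input partitions
shards = n * n                 # 1,000,000 partition shards
seconds = shards * 0.001       # 1 ms of latency per shard
minutes = seconds / 60         # roughly a quarter of an hour, all latency
```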


Previously I stored the partition-shards individually on the filesystem using cPickle. This was a mistake. It was very slow because it treated each of the million shards independently. Now we aggregate shards headed for the same out-block and write out many at a time, bundling overhead. We balance this against memory constraints. This stresses both Python latencies and memory use.
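The bundling idea can be sketched in a few lines of Python (the names and the flush threshold are hypothetical; the real code budgets memory in bytes rather than counting shards):

```python
from collections import defaultdict

FLUSH_AT = 4        # hypothetical threshold; real code tracks bytes, not counts
buffers = defaultdict(list)
disk_writes = []    # stand-in for appends to the on-disk out-blocks

def send(out_block, shard):
    """Buffer a shard instead of writing it to disk immediately."""
    buffers[out_block].append(shard)
    if len(buffers[out_block]) >= FLUSH_AT:
        flush(out_block)

def flush(out_block):
    """One bulk write amortises per-write latency over many shards."""
    disk_writes.append((out_block, buffers.pop(out_block)))
```

Eight shards headed to the same out-block now cost two writes instead of eight, at the price of holding the buffers in memory -- exactly the latency/memory trade-off described above.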

## BColz, now for very small data

Fortunately we have a nice on-disk chunked array container that supports append in Cython. BColz (formerly BLZ, formerly CArray) does this for us. It wasn’t originally designed for this use case but performs admirably.

Briefly, BColz is…

• A binary store (like HDF5)
• With columnar access (useful for tabular computations)
• That stores data in cache-friendly sized blocks
• With a focus on compression
• Written mostly by Francesc Alted (PyTables) and Valentin Haenel

It includes two main objects:

• carray: An on-disk numpy array
• ctable: A named collection of carrays to represent a table/dataframe

## Partitioned Frame

We use carray to make a new data structure pframe with the following operations:

• Append DataFrame to collection, and partition it along the index on known block divisions blockdivs
• Extract DataFrame corresponding to a particular partition

Internally we invent two new data structures:

• cframe: Like ctable this stores column information in a collection of carrays. Unlike ctable this maps perfectly onto the custom block structure used internally by Pandas. For internal use only.
• pframe: A collection of cframes, one for each partition.

Through bcolz.carray, cframe manages efficient incremental storage to disk. PFrame partitions incoming data and feeds it to the appropriate cframe.
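The routing step inside append can be sketched in pure Python (split_on_divisions is a hypothetical helper, not the pframe API; the real code slices whole DataFrame blocks rather than looping over rows):

```python
import bisect
from collections import defaultdict

def split_on_divisions(records, blockdivs):
    """Group (index, value) pairs by the partition their index falls in."""
    parts = defaultdict(list)
    for idx, value in records:
        parts[bisect.bisect_right(blockdivs, idx)].append((idx, value))
    return dict(parts)
```

With blockdivs=[5, 15], an index of 4 lands in partition 0, 10 in partition 1, and 20 in partition 2, matching the example below.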

## Example

Create test dataset

In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'a': [1, 2, 3, 4],
...                        'b': [1., 2., 3., 4.]},
...                       index=[1, 4, 10, 20])

Create pframe like our test dataset, partitioning on divisions 5, 15. Append the single test dataframe.

In [3]: from pframe import pframe
In [4]: pf = pframe(like=df, blockdivs=[5, 15])
In [5]: pf.append(df)

Pull out partitions

In [6]: pf.get_partition(0)
Out[6]:
a  b
1  1  1
4  2  2

In [7]: pf.get_partition(1)
Out[7]:
a  b
10  3  3

In [8]: pf.get_partition(2)
Out[8]:
a  b
20  4  4

Continue to append data…

In [9]: df2 = pd.DataFrame({'a': [10, 20, 30, 40],
...                         'b': [10., 20., 30., 40.]},
...                        index=[1, 4, 10, 20])
In [10]: pf.append(df2)

… and partitions grow accordingly.

In [12]: pf.get_partition(0)
Out[12]:
a   b
1   1   1
4   2   2
1  10  10
4  20  20

We can continue this until our disk fills up. This runs near peak I/O speeds (on my low-power laptop with admittedly poor I/O.)

## Performance

I’ve partitioned the NYCTaxi trip dataset a lot this week, posting my results to the Continuum chat with messages like the following

I think I've got it to work, though it took all night and my hard drive filled up.
Down to six hours and it actually works.
Three hours!
By removing object dtypes we're down to 30 minutes
20!  This is actually usable.
OK, I've got this to six minutes.  Thank goodness for Pandas categoricals.
Five.
Down to about three and a half with multithreading, but only if we stop blosc from segfaulting.


And that’s where I am now. It’s been a fun week. Here is a tiny benchmark.

>>> import pandas as pd
>>> import numpy as np
>>> from pframe import pframe

>>> df = pd.DataFrame({'a': np.random.random(1000000),
'b': np.random.poisson(100, size=1000000),
'c': np.random.random(1000000),
'd': np.random.random(1000000).astype('f4')}).set_index('a')

Set up a pframe to match the structure of this DataFrame. Partition the index into divisions of size 0.1.

>>> pf = pframe(like=df,
...             blockdivs=[.1, .2, .3, .4, .5, .6, .7, .8, .9],
...             chunklen=2**15)

Dump the random data into the Partition Frame one hundred times and compute effective bandwidths.

>>> for i in range(100):
...     pf.append(df)

CPU times: user 39.4 s, sys: 3.01 s, total: 42.4 s
Wall time: 40.6 s

>>> pf.nbytes
2800000000

>>> pf.nbytes / 40.6 / 1e6  # MB/s
68.9655172413793

>>> pf.cbytes / 40.6 / 1e6  # Actual compressed bytes on disk
41.5172952955665

We partition and store on disk random-ish data at 68MB/s (cheating with compression). This is on my old small notebook computer with a weak processor and hard drive I/O bandwidth at around 100 MB/s.

## Theoretical Comparison to External Sort

There isn’t much literature to back up my approach. That concerns me. There is, however, a lot of literature on external sorting, and it often cites our partitioning problem as a use case. Perhaps we should do an external sort?

I thought I’d quickly give some reasons why I think the current approach is theoretically better than an out-of-core sort; hopefully someone smarter can come by and tell me why I’m wrong.

We don’t need a full sort; we need something far weaker. External sort requires at least two passes over the data, while the method above requires one full pass through the data plus one additional pass through the index column to determine good block divisions. These divisions should be of approximately equal size, but the approximation can be pretty rough. I don’t think we would notice a variation of a factor of five in block sizes. Task scheduling lets us be pretty sloppy with load imbalance as long as we have many tasks.
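One way to get approximately equal block divisions from that single extra pass over the index column is to sample it and take quantiles. A hypothetical sketch (reservoir sampling keeps memory bounded regardless of data size; none of this is the actual dask code):

```python
import random

def approximate_divisions(index_stream, npartitions, sample_size=10_000):
    """Estimate npartitions-1 block divisions from one pass over an index.

    Keeps a bounded uniform sample (reservoir sampling), then reads the
    division values off the sample's quantiles.
    """
    reservoir = []
    for i, x in enumerate(index_stream):
        if len(reservoir) < sample_size:
            reservoir.append(x)
        else:
            j = random.randrange(i + 1)
            if j < sample_size:
                reservoir[j] = x   # replace with decreasing probability
    reservoir.sort()
    step = len(reservoir) / npartitions
    return [reservoir[int(step * k)] for k in range(1, npartitions)]
```

Rough divisions are fine here: as noted above, a factor-of-five imbalance between blocks barely matters, so a modest sample suffices.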

I haven’t implemented a good external sort though so I’m only able to argue theory here. I’m likely missing important implementation details.

• PFrame code lives in a dask branch at the moment. It depends on a couple of BColz PRs (#163, #164)

## March 24, 2015

### Martin Fitzpatrick

#### Live Demo - Wooey: Web UIs for Python scripts

A new live demo of Wooey is now up and running with a few simple example scripts. Features:

• Web UIs for Python scripts generated automatically from argparse
• An automated background worker to run scripts and collect outputs
• Run Queue to schedule (and prioritise) jobs

Try it out, fork it or leave feedback below.

### Juan Nunez-Iglesias

#### Photo by Ian Rees (from the SciPy 2012 conference website)

SciPy is my favourite conference. My goal with this post is to convince someone to go who hasn’t had that chance yet.

## Why SciPy?

Most scientists go to conferences in their own field: neuroscientists go to the monstrous Society for Neuroscience (SfN); bioinformaticians go to RECOMB, ISMB, or PSB; and so on.

People go to these to keep up with the latest advances in their field, and often, to do a bit of networking.

SciPy is a different kind of conference. It changes the way you do science. You learn about the latest free and open source software to help you with your work. You learn to write functions and interfaces instead of scripts, and to write tests so you don’t break your code. You also learn to contribute these to bigger projects, maximising the reach and impact of your work (see “sprints”, below).

And you learn these things by doing them, with the people who are the best at this, rather than by reading books and blog posts. (Which maybe I shouldn’t knock, since I’m writing a book about all this and you are reading my blog!)

Attendees to SciPy have scientific software in common, but come from diverse fields, including physics, pure maths, data visualisation, geosciences, climatology, and yes, biology and bioinformatics. Mingling with such a diverse group is a fantastic way to get your creative juices flowing!

The conference lasts a full week and is broken up into three parts: tutorials, main conference, and sprints.

### the tutorials

With a few exceptions, you won’t learn about your own field. But you will learn an enormous amount about tools that will help you be a better scientist. If you are a newbie to Python, you can go to the beginner tutorials track and learn about the fantastic scientific libraries available in Python. If you already use NumPy, SciPy, pandas, and the IPython notebook, you can go to the intermediate or advanced tracks, and learn new things about those. Even as an advanced SciPy user I still get tons of value from the tutorials. (Last year Min RK gave a wild demo of IPython parallel’s capabilities by crawling wikipedia remotely while building up a graph visualisation on his live notebook.) (Fast-forward to the 1h mark to see just the payoff.) Here’s last year’s tutorial schedule for an idea of what to expect.

### the main conference track

You will also hear about the latest advances in the scientific libraries you know and use, and about libraries you didn’t know about but will find useful (such as scikit-bio, yt or epipy). The main conference track features software advances written by presenters from many different fields. Hearing about these from the authors of the software lets you ask much different questions compared to hearing someone say, for example, “we used the Matlab image processing toolbox”. If you ever had a feature request for your favourite library, or you wondered why they do something in a particular way, there’s no better opportunity to get some closure.

The crosstalk between different fields is phenomenal. Hearing how a diverse set of people deal with their software problems really opens your mind to completely different approaches to what you had previously considered.

### the sprints

Finally, there are two days of coding sprints. Even if you are a complete beginner in software development, do yourself a favour and participate in one of these.

Two and a half years after my first SciPy in 2012, I’m writing a scientific Python book for O’Reilly, and I can 100% trace it to participating in the scikit-image sprint that year. With their guidance, I wrote my first ever GitHub pull request and my first ever unit test. Both were tiny and cute, and I would consider them trivial now, but that seed grew into massive improvements in my code-writing practice and many more contributions to open source projects.

And this is huge: now, instead of giving up when a software package doesn’t do what I need it to do, I just look at the source code and figure out how I can add what I want. Someone else probably wants that functionality, and by putting it into a major software library instead of in my own code, I get it into the hands of many more users. It’s a bit counterintuitive but there is nothing more gratifying than having some random person you’ve never met complain when you break something! This never happens when all your code is in your one little specialised repository containing functionality for your one paper.

## How SciPy

The SciPy calls for tutorials, talks, posters, and its plotting contest are all out. There are specialised tracks, and most of you reading this are probably interested in the computational biology and medicine track. It’s taken me a while to write this post, so there’s just one week left to submit something: the deadline is April 1st.

Even if you don’t get something in, I encourage you to participate. Everything I said above still applies if you’re not presenting. You might have a bit more trouble convincing your funders to pay for your travels, but if that’s the case I encourage you to apply for financial assistance from the conference.

I’ve written about SciPy’s diversity problem before, so I’m happy to report that this year there’s specific scholarships for women and minorities. (This may have been true last year, I forget.) And, awesomely, Matt Davis has offered to help first-time submitters with writing their proposals.

Give SciPy a try: submit here and/or register here. And feel free to email me or comment below if you have questions!

Update: A colleague pointed out that I should also mention the awesomeness of the conference venue, so here goes: Austin in July is awesome. If you love the heat like I do, well, it doesn’t get any better. If you don’t, don’t worry: the AT&T Conference Center AC is on friggin overdrive the whole time. Plus, there’s some nearby cold springs to swim in. The center itself is an excellent hotel and the conference organises massive discounts for attendees. There’s a couple of great restaurants on-site; and the Mexican and Texas BBQ in the area are incredible — follow some Enthought and Continuum folks around to experience amazing food. Finally, Austin is a great city to bike in: last time I rented a road bike for the whole week from Mellow Johnny’s, and enjoyed quite a few lunchtime and evening rides.

## March 23, 2015

### Titus Brown

#### Metagenomics related/relevant workshops - a list

The other day I was contacted by someone whose student wants to attend the MSU NGS course in 2015, because they are interested in learning how to do data integration with (among other things) metagenome data. My response was "we don't cover that in the course", which isn't very helpful ;).

So, I went hunting, and got the following list of metagenome-relevant workshops from a program manager. Note that I asked them to cast a broad net, so this goes far beyond "mere" computational analysis of environmental microbes and their function -- but it should be pretty inclusive of what's out there. If I'm missing something relevant, please let me know!

JGI 2015 workshops - fungal genomics, genomic technologies, KBase, sample QC, and synthetic biology. (Ongoing now, but keep an eye out for next year.)

QIIME workshops, ~monthly and/or by invitation.

The International GeoBiology workshop series, application deadline in February - keep an eye out for next year. (This is the workshop I need/want to attend myself, so I can learn more biogeochemistry!)

Bigelow Third Microbial Single Cell Genomics Workshop, in June - deadline Mar 29th!!

iMicrobe workshops at ASM this year - see Workshop WS18. Note, registration deadline Mar 30th!!

MBL STAMPS (Applications closed for 2015)

MBL Microbial Diversity (applications closed for 2015) - the course is looking interesting this year, with some focus on intersections between sequencing and good ol' fashioned physiology/biogeochemistry.

EDAMAME (applications closed for 2015) - mostly focused on microbial ecology.

My only frustration with this list is that it seems like there's very little out there that really digs into the bioinformatics of shotgun metagenomics and biogeochemistry - the MicDiv course at MBL may well dip a good portion of a leg into this pond, and I'll find out more this year because I'm participating in it. But that's it. Am I wrong?

--titus

#### "Open Source, Open Science" Meeting Report - March 2015

On March 19th and 20th, the Center for Open Science hosted a small meeting in Charlottesville, VA, convened by COS and co-organized by Kaitlin Thaney (Mozilla Science Lab) and Titus Brown (UC Davis). People working across the open science ecosystem attended, including publishers, infrastructure non-profits, public policy experts, community builders, and academics.

Open Science has emerged into the mainstream, primarily due to concerted efforts from various individuals, institutions, and initiatives. This small, focused gathering brought together several of those community leaders. The purpose of the meeting was to define common goals, discuss common challenges, and coordinate on common efforts.

We had good discussions about several issues at the intersection of technology and social hacking including badging, improving standards for scientific APIs, and developing shared infrastructure. We also talked about coordination challenges due to the rapid growth of the open science community. At least three collaborative projects emerged from the meeting as concrete outcomes to combat the coordination challenges.

A repeated theme was how to make the value proposition of open science more explicit. Why should scientists become more open, and why should institutions and funders support open science? We agreed that incentives in science are misaligned with practices, and we identified particular pain points and opportunities to nudge incentives. We focused on providing information about the benefits of open science to researchers, funders, and administrators, and emphasized reasons aligned with each stakeholder's interests. We also discussed industry interest in "open", both in making good use of open data, and also in participating in the open ecosystem. One of the collaborative projects emerging from the meeting is a paper or papers to answer the question "Why go open?" for researchers.

Many groups are providing training for tools, statistics, or workflows that could improve openness and reproducibility. We discussed methods of coordinating training activities, such as a training "decision tree" defining potential entry points and next steps for researchers. For example, Center for Open Science offers statistics consulting, rOpenSci offers training on tools, and Software Carpentry, Data Carpentry, and Mozilla Science Lab offer training on workflows. A federation of training services could be mutually reinforcing and bolster collective effectiveness, and facilitate sustainable funding models.

The challenge of supporting training efforts was linked to the larger challenge of funding the so-called "glue" - the technical infrastructure that is only noticed when it fails to function. One such collaboration is the SHARE project, a partnership between the Association of Research Libraries, its academic association partners, and the Center for Open Science. There is little glory in training and infrastructure, but both are essential elements for providing knowledge to enable change, and tools to enact change.

Another repeated theme was the "open science bubble". Many participants felt that they were failing to reach people outside of the open science community. Training in data science and software development was recognized as one way to introduce people to open science. For example, data integration and techniques for reproducible computational analysis naturally connect to discussions of data availability and open source. Re-branding was also discussed as a solution - rather than "post preprints!", say "get more citations!" Another important realization was that researchers who engage with open practices need not, and indeed may not want to, self-identify as "open scientists" per se. The identity and behavior need not be the same.

A number of concrete actions and collaborative activities emerged at the end, including a more coordinated effort around badging, collaboration on API connections between services and producing an article on best practices for scientific APIs, and the writing of an opinion paper outlining the value proposition of open science for researchers. While several proposals were advanced for "next meetings" such as hackathons, no decision has yet been reached. But, a more important decision was clear - the open science community is emerging, strong, and ready to work in concert to help the daily scientific practice live up to core scientific values.

People

Tal Yarkoni, University of Texas at Austin

Elliott Shore, Association of Research Libraries

Karthik Ram, rOpenSci and Berkeley Institute for Data Science

Min Ragan-Kelley, IPython and UC Berkeley

Brian Nosek, Center for Open Science and University of Virginia

Erin C. McKiernan, Wilfrid Laurier University

C. Titus Brown, UC Davis

## March 22, 2015

### Titus Brown

#### What to do with lots of (sequencing) data

On a recent west coast speaking junket where I spoke at OSU, OHSU, and VanBUG (Brown PNW '15!), I put together a new talk that tried to connect our past work on scaling metagenome assembly with our future work on driving data sharing and data integration. As you can maybe guess from the first few talk slides, the motivating chain was something like

1. We want to help biologists move more quickly to hypotheses;
2. This can in part be done by aiding in hypothesis generation and refinement;
3. Right now it's painful to analyze large sequencing data sets;
4. Let's make it less painful!

At both OHSU and OSU, where I gave very similar talks, I got important and critical feedback on these points. The three most valuable points of feedback were,

• what exactly do you mean by data integration, anyway, Titus?
• you never talked about generating hypotheses!
• no, seriously, you never talked about how to actually generate hypotheses!?

The culmination point of this satori-like experience was when Stephen Giovannoni leaned across the table at dinner and said, "Perhaps you can tell me what all this data is actually good for?" This led to a very robust and energetic dinner conversation which led to this blog post :). (Note, Stephen clearly had his own ideas, but wanted to hear mine!)

The underlying point is this: I, and others, are proselytizing the free and open availability of large data sets; we're pushing the development of big data analytic tools; and we're arguing that this is important to the future of science. Why? What good is it all? Why should we worry about this, rather than ignoring it and focusing instead on (for example) physiological characterization of complex environments, clinical trials, etc. etc.?

## What is all this data potentially good for?

Suppose you set out to use a body of sequencing data in your scientific research. This sequencing data is maybe something you generated yourself, or perhaps it's from a public data set, or from multiple public data sets. Either way, it's a bunch of data that you would like to make use of. What could you do with it?

(This isn't specific to sequencing data, although I think the exploratory approaches are particularly important in biology and sequencing data are well suited to exploratory analysis.)

1. Computational hypothesis falsification.

"I thought A was happening. If A is happening, then when I looked at my data I should have seen B. I didn't see B. Therefore A is probably not happening."

For example, if you are convinced that a specific biogeochemical process is at work, but can't find the relevant molecules in a metagenomic survey, then either you did something wrong, had insufficient sensitivity, or your hypothesis is incorrect in the first place.

This is one place where pre-existing data can really accelerate the scientific process, and where data availability is really important.

2. Determination of model sufficiency.

"I have a Boolean or quantitative model that I believe captures the essential components of my system under investigation. When I fit my actual data to this model, I see several important mismatches. Therefore my model needs more work."

For gene regulatory networks, or metabolic modeling, this kind of approach is where we need to go. See, for example, work from my graduate lab on sea urchin GRNs - this approach is used there implicitly to drive forward the investigation of underdetermined parts of the GRN.

3. Comparison with a null or neutral model.

"If interesting interactions were happening, I would see patterns that deviated from my model of what an uninteresting system should look like. I don't, therefore my model of what is interesting or uninteresting needs to change."

Somewhat of an elaboration of the above "model sufficiency", here we are choosing an explicit "null model" to interpret our data and concluding that our data is either interesting or boring. For me, the difference is that these models need not be mechanistic, while the models in the previous point are often mechanistic. One example I'd point to is Ethan White's work on maximum entropy models.

4. Hypothesis generation (or, "fishing expedition.")

"I have no idea what processes are at work here. Let's look at the data and come up with some ideas."

A thoroughly underappreciated yet increasingly default approach in various areas of biology, fishing expeditions can feed the masses. (Get it? Hah!)

But, seriously, this is an important part of biology; I wrote about why at some length back in 2011. All of the previous points rely on us already knowing or believing something, while in reality most of biology is poorly understood and in many cases we have almost no idea what is going on mechanistically. Just looking at systems can be very informative in this situation.

So, this is my first take on the reasons why I think large-scale data generation, availability, analysis, and integration can and should be first class citizens in biology. But I'd be interested in pushback and other thoughts, as well as references to places where this approach has worked well (or poorly) in biology!

--titus

p.s. Thanks to Stephen Giovannoni, Thomas Sharpton, Ryan Mueller, and
David Koslicki for the dinner conversation at OSU!

## March 20, 2015

### Titus Brown

#### A personal perspective on the Open Source, Open Science meeting (2015)

I'm returning from a small, excellent meeting on "Open Source, Open Science", held at the Center for Open Science in Charlottesville, VA. We'll post a brief meeting report soon, but I wanted to share my particular highlights --

First, I got a chance to really dig into what the Center for Open Science is doing technically with the Open Science Framework. I still have to explore it a bit more myself, but the OSF has built some excellent infrastructure around collaborative workflows in scientific practice, and I'm tremendously impressed. Meeting Brian Nosek and Jeff Spies, and having the chance to speak at length with several members of the OSF team, was worth the trip all on its own. They were also wonderful hosts, and everything ran incredibly smoothly!

Second, I finally met a bunch of tweeps in person, including Kara Woo of NCEAS (@kara_woo), Erin McKiernan (@emckiernan13), Mark Hahnel of Figshare (@markhahnel), Jason Priem of ImpactStory (@jasonpriem), and probably others that I'm forgetting (sorry ;(. I spent a lot of time talking with Erin McKiernan in particular - we chatted at length about career challenges in open science and science more generally.

Third, the willingness and enthusiasm of funders (both federal and foundation) to engage with open science was palpable. We got some great advice on "thinking like funders" - it's all fine and well to say that open science is better, but it's important to make the case to people who are open-science naive and want to know exactly which of their priorities and interests are better enabled by open science practices.

Fourth, I was completely blown away by what Figshare is doing. Mark Hahnel gave a great 5 minute talk on how Figshare is growing into the data sharing space being created by funder mandates. Tremendously impressive.

--titus

## March 19, 2015

### Continuum Analytics

#### Microsoft Chooses Anaconda for Azure Machine Learning Service

Anaconda, the Python distribution for large-scale data processing, is now included in Microsoft’s Azure Machine Learning service. Microsoft announced on February 18 at the Strata Hadoop World Conference the general availability of Azure ML, including some enhancements to the platform.

## March 17, 2015

### Titus Brown

#### An mRNAseq workshop at UC Davis - my first as a Davis prof

Two weeks ago, I ran a workshop at UC Davis on mRNAseq analysis for semi-model organisms, which focused on building new gene models ab initio -- with a reference genome. This was a milestone for me - the first time I taught a workshop at UC Davis as a professor there! My co-instructors were Tamer Mansour, a new postdoc in my lab, and Isabelle Laforest-Lapointe, a microbial ecologist visiting Jonathan Eisen's lab from Montreal - about whom more, later.

This is the third workshop in my planned workshop series -- the first workshop was a Software Carpentry instructor training workshop given by Greg Wilson and Tracy Teal, and the second was a Data Carpentry workshop given by Tracy Teal.

Tracy Teal will be running a one-day workshop in April, on mothur, and I will be giving a two-day workshop in early May (May 4-5), on de novo mRNAseq analysis, for transcriptome analysis in organisms without a reference genome. I will also be teaching at the MBL Microbial Diversity Course and attending the beginning of the MBL STAMPS course in Woods Hole. Finally, I will once again host the two week summer NGS course back in Michigan in August. Then, in September, I return to UC Davis and get to actually spin up my full workshop series -- these early ones are just tasters :).

## Note: Software Carpentry instructors are awesome

I needed some help with an R section in the mRNAseq workshop, so I advertised on the Software Carpentry instructors mailing list; this is a mailing list for everyone who is an accredited Software Carpentry instructor, meaning they've been through the training and a bit more. Lo and behold, one of the people who responded was Isabelle, who was visiting UC Davis to do some work in the Eisen Lab - no travel needed, and willing to help out both days. We corresponded a bit to make sure she could make the given days, and then just arranged to meet up at 8:30am on the first day of the workshop.

We showed up, she showed up, and it was a great success ;).

Now, I ask you - where else in the world can you e-mail a few hundred competent, capable, friendly people, and find one who will reliably show up at a given place and time to teach?

Software Carpentry, peeps. It's awesome.

--titus

## March 16, 2015

### Matthew Rocklin

#### Efficiently Store Pandas DataFrames

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr We benchmark several options to store Pandas DataFrames to disk. Good options exist for numeric data but text is a pain. Categorical dtypes are a good option.

## Introduction

For dask.frame I need to read and write Pandas DataFrames to disk. Both disk bandwidth and serialization speed limit storage performance.

• Disk bandwidth, between 100MB/s and 800MB/s for a notebook hard drive, is limited purely by hardware. Not much we can do here except buy better drives.
• Serialization cost though varies widely by library and context. We can be smart here. Serialization is the conversion of a Python variable (e.g. DataFrame) to a stream of bytes that can be written raw to disk.

Typically we use libraries like pickle to serialize Python objects. For dask.frame we really care about doing this quickly so we’re going to also look at a few alternatives.

## Contenders

• pickle - The standard library pure Python solution
• cPickle - The standard library C solution
• pickle.dumps(data, protocol=2) - pickle and cPickle support multiple protocols. Protocol 2 is good for numeric data.
• json - using the standard library json module, we encode the values and index as lists of ints/strings
• json-no-index - Same as above except that we don’t encode the index of the DataFrame, e.g. 0, 1, ... We’ll find that JSON does surprisingly well on pure text data.
• msgpack - A binary JSON alternative
• CSV - The venerable pandas.read_csv and DataFrame.to_csv
• hdfstore - Pandas’ custom HDF5 storage format

Additionally we mention but don’t include the following:

• dill and cloudpickle - formats commonly used for function serialization. These perform about the same as cPickle
• hickle - A pickle interface over HDF5. This does well on NumPy data but doesn’t support Pandas DataFrames well.

## Experiment

Disclaimer: We’re about to issue performance numbers on a toy dataset. You should not trust that what follows generalizes to your data. You should look at your own data and run benchmarks yourself. My benchmarks lie.

We create a DataFrame with two columns, one with numeric data, and one with text. The text column has repeated values (1000 unique values, each repeated 1000 times) while the numeric column is all unique. This is fairly typical of data that I see in the wild.

df = pd.DataFrame({'text': [str(i % 1000) for i in range(1000000)],
                   'numbers': range(1000000)})

Now we time the various dumps and loads methods of the different serialization libraries and plot the results below.
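For readers who want to reproduce the idea, here is a stdlib-only sketch of such a dumps/loads timing harness; the plain lists stand in for the DataFrame columns, and `bench` is my name for the helper, not part of the post's actual benchmark code:

```python
import json
import pickle
import time

# Stand-ins for the two columns (plain lists; the post benchmarks pandas objects).
numbers = list(range(100000))
text = [str(i % 1000) for i in range(100000)]

def bench(name, dumps, loads, payload, repeat=3):
    """Time dumps and loads separately, averaged over `repeat` runs."""
    t0 = time.perf_counter()
    for _ in range(repeat):
        blob = dumps(payload)
    t1 = time.perf_counter()
    for _ in range(repeat):
        loads(blob)
    t2 = time.perf_counter()
    print("%-20s dump %.4fs  load %.4fs"
          % (name, (t1 - t0) / repeat, (t2 - t1) / repeat))

bench("pickle protocol=2", lambda x: pickle.dumps(x, protocol=2), pickle.loads, numbers)
bench("json (numbers)", lambda x: json.dumps(x).encode(), lambda b: json.loads(b), numbers)
bench("json (text)", lambda x: json.dumps(x).encode(), lambda b: json.loads(b), text)
```

The absolute numbers will vary by machine; only the relative differences are interesting.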

As a point of reference writing the serialized result to disk and reading it back again should take somewhere between 0.05s and 0.5s on standard hard drives. We want to keep serialization costs below this threshold.

Thank you to Michael Waskom for making those charts (see the twitter conversation and his alternative charts).

Gist to recreate plots here: https://gist.github.com/mrocklin/4f6d06a2ccc03731dd5f

Further Disclaimer: These numbers average from multiple repeated calls to loads/dumps. Actual performance in the wild is likely worse.

## Observations

We have good options for numeric data but not for text. This is unfortunate; serializing ASCII text should be cheap. We lose here because we store text in a Series with the NumPy dtype ‘O’ for generic Python objects. We don’t have a dedicated variable length string dtype. This is tragic.

For numeric data the successful systems record a small amount of metadata and then dump the raw bytes. The main takeaway from this is that you should use the protocol=2 keyword argument to pickle. This option isn’t well known but strongly impacts performance.

Note: Aaron Meurer notes in the comments that for Python 3 users protocol=3 is already default. Python 3 users can trust the default protocol= setting to be efficient and should not specify protocol=2.

### Some thoughts on text

1. Text should be easy to serialize. It’s already text!

2. JSON-no-index serializes the text values of the dataframe (not the integer index) as a list of strings. This assumes that the data are strings which is why it’s able to outperform the others, even though it’s not an optimized format. This is what we would gain if we had a string dtype rather than relying on the NumPy Object dtype, 'O'.

3. MsgPack is surprisingly fast compared to cPickle

4. MsgPack is oddly unbalanced, it can dump text data very quickly but takes a while to load it back in. Can we improve msgpack load speeds?

5. CSV text loads are fast. Hooray for pandas.read_csv.

### Some thoughts on numeric data

1. Both pickle(..., protocol=2) and msgpack dump raw bytes. These are well below disk I/O speeds. Hooray!

2. There isn’t much reason to compare performance below this level.

## Categoricals to the Rescue

Pandas recently added support for categorical data. We use categorical data when our values take on a fixed number of possible options with potentially many repeats (like stock ticker symbols). We enumerate these possible options (AAPL: 1, GOOG: 2, MSFT: 3, ...) and use those numbers in place of the text. This works well when there are many more observations/rows than there are unique values. Recall that in our case we have one million rows but only one thousand unique values. This is typical for many kinds of data.
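The enumeration idea can be sketched with plain dicts (my illustration of the principle, not pandas' actual implementation):

```python
# Replace repeated strings with small integer codes plus one shared
# lookup table -- the essence of a categorical dtype.
values = [str(i % 1000) for i in range(1000000)]

categories = sorted(set(values))                      # 1000 unique strings
code_of = {cat: code for code, cat in enumerate(categories)}
codes = [code_of[v] for v in values]                  # 1,000,000 small ints

# Decoding is just a table lookup; the round trip is exact.
decoded = [categories[c] for c in codes[:5]]
print(decoded)  # ['0', '1', '2', '3', '4']
```

Only `categories` holds strings; `codes` is cheap-to-serialize numeric data, which is exactly the win described below.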

This is great! We’ve shrunk the amount of text data by a factor of a thousand, replacing it with cheap-to-serialize numeric data.

>>> df['text'] = df['text'].astype('category')
>>> df.text
0      0
1      1
2      2
3      3
...
999997    997
999998    998
999999    999
Name: text, Length: 1000000, dtype: category
Categories (1000, object): [0 < 1 < 10 < 100 ... 996 < 997 < 998 < 999]

Let's consider the cost of doing this conversion and of serializing the categorical result, relative to the cost of just serializing the raw text.

                             seconds
Serialize Original Text     1.042523
Convert to Categories       0.072093
Serialize Categorical Data  0.028223

When our data is amenable to categories then it’s cheaper to convert-then-serialize than it is to serialize the raw text. Repeated serializations are just pure-win. Categorical data is good for other reasons too; computations on object dtype in Pandas generally happen at Python speeds. If you care about performance then categoricals are definitely something to roll in to your workflow.

## Final Thoughts

1. Several excellent serialization options exist, each with different strengths.
2. A combination of good serialization support for numeric data and Pandas categorical dtypes enable efficient serialization and storage of DataFrames.
3. Object dtype is bad for PyData. String dtypes would be nice. I’d like to shout out to DyND a possible NumPy replacement that would resolve this.
4. MsgPack provides surprisingly good performance over custom Python solutions; why is that?
5. I suspect that we could improve performance by special casing Object dtypes and assuming that they contain only text.

## March 13, 2015

### Titus Brown

#### Request for suggestions on a generator API for khmer

I've been putting together a streaming API for khmer that would let us use generators to do sequence analysis, and I'd be interested in thoughts on how to do it in a good Pythonic way.

Some background: a while back, Alex Jironkin asked us for high level APIs, which turned into an issue on GitHub. More recently, we posted a paper on streaming and semi-streaming approaches to spectral sequence analysis, and so for my VanBUG talk I was inspired to put together a little API using generators.

The code looks like this, currently (see versioned permalink to khmer_api.py for full implementation):

graph = khmer.new_counting_hash(20, 1e7, 4)
out_fp = open(os.path.basename(filename) + '.abundtrim', 'w')

## khmer scripts/trim-low-abund.py -V, using generators
input_iter = screed.open(filename)
input_iter = streamtrim(input_iter, graph, 20, 3)


Briefly, what this code does is

1. create the core data structure we use
2. open an output file
3. open an input file
4. create a generator that takes in a stream of data and groups records;
5. create a generator that takes in records, does necessary preprocessing, and outputs them;
6. create a generator that does our semi-streaming error trimming (from the semi-streaming preprint);
7. outputs the reads to the given output fp.
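The steps above can be sketched with plain generators (toy functions of my own; khmer's real streamtrim does far more than this):

```python
import io

def read_records(fp):
    """Steps 3-4: stream records off the file (here, one line = one record)."""
    for line in fp:
        yield line.strip()

def preprocess(records):
    """Step 5: per-record preprocessing (upper-casing stands in for real work)."""
    for rec in records:
        yield rec.upper()

def write_records(records, out_fp):
    """Step 7: drain the pipeline into the output file."""
    for rec in records:
        out_fp.write(rec + "\n")

src = io.StringIO("acgt\nttga\n")
dst = io.StringIO()
write_records(preprocess(read_records(src)), dst)
print(dst.getvalue())
```

Because each stage is a generator, only the record currently in flight is in memory; nothing is materialized until the final drain.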

The key bit is that this uses generators, so all of this is happening with full-on streaming. The one exception to this is the 'streamtrim' which has to cache a subset of the reads for processing more than once.

Interestingly, this is an implementation of the core functionality for the 'trim-low-abund.py' script that we will be releasing with the next version of khmer (the release after khmer 1.3 - not sure if it's 1.3.1 or 1.4 yet).

You can also replace the 'streamtrim' line with:

input_iter = diginorm(input_iter, graph, 20)


if you want to do digital normalization. That turns this into an implementation of the core functionality for the 'normalize-by-median.py' script that has been in khmer for a while.

Obviously these generator implementations are not yet production-ready, although they do give identical results to the current command line scripts.

The question I have, though, is what should the actual API look like?

The two basic options I've come up with are method chaining and UNIX-style pipes.

Method chaining might look like this:

read(in_fp). \
streamtrim(graph, 20, 3). \
output(out_fp)


and piping would be

read(in_fp) | \
streamtrim(graph, 20, 3) | \
output(out_fp)


...so I guess that's really pretty minor. I don't know which is more Pythonic, though, or which would allow more flexibility in the underlying implementation. Thoughts?
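For what it's worth, one common way to get the pipe syntax is a tiny wrapper class that overrides __or__ to compose generator transformations; a sketch of the mechanism, not a proposal for khmer's actual API:

```python
class Stage:
    """Wrap a function from iterator to iterator so stages compose with |."""
    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # left | right applies left first, then right
        return Stage(lambda it: other.func(self.func(it)))

    def __call__(self, iterable):
        return self.func(iterable)

double = Stage(lambda it: (x * 2 for x in it))
plus_one = Stage(lambda it: (x + 1 for x in it))

pipeline = double | plus_one
print(list(pipeline([1, 2, 3])))  # [3, 5, 7]
```

Method chaining needs the stages to be methods on a shared class instead, which couples them more tightly; the | form lets third parties add stages without touching the core class.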

There are some other decisions to be made -- configuration and parameters.

For configuration, ideally we would be able to specify multiple input and output files and how they are paired, and/or set default parameters for multiple steps. Runtime reporting, error handling, and benchmarking should all be put into the mix. Should there be a simple object with hooks to handle all of this, or what? For example,

s = Streamer(...)  # configure inputs and outputs

s | clean_reads() | streamtrim(graph, 20, 3) | output()


where 's' could help by holding metadata or the like. I'm not sure - I've never done this before ;).

As for parameters, I personally like named parameters, so streamtrim above would better be streamtrim(graph, coverage=20, cutoff=3). But perhaps passing in a dictionary of parameters would be more flexible - should we support both? (Yes, I'm aware that in Python you can use ** - is there a preference?)
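Supporting both styles is cheap in Python, since a parameter dict unpacks straight into keyword arguments; a sketch using a hypothetical signature (not khmer's real API, and with the trimming logic elided):

```python
# Hypothetical signature for illustration only.
def streamtrim(reads, graph, coverage=20, cutoff=3):
    # real trimming logic elided in this sketch; just pass reads through
    return (r for r in reads)

params = {"coverage": 20, "cutoff": 3}

a = streamtrim([], None, coverage=20, cutoff=3)  # named parameters
b = streamtrim([], None, **params)               # dict unpacked into keywords
```

So named parameters can be the documented interface, while callers who build configurations programmatically get dict support for free.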

I'd love some examples of more mature APIs that have had the rough edges rubbed off 'em; this is really a UX question and I'm just not that well equipped to do this kind of design.

Thoughts? Help? Examples?

cheers, --titus

p.s. Down the road we're going to add correct_errors and eventually assemble to the steps, and perhaps also adapter trimming, quality trimming, and mapping. Won't that be fun? Imagine:

assembly = stream_input(filename1).trim().assemble()


to do your assembly... betcha we can stream it all, too.

## March 12, 2015

### NeuralEnsemble

#### Students: spend the summer improving brain research software tools in Google Summer of Code

From Malin Sandström at the INCF:

Are you a student interested in brain research and software development? Or do you know one?
This year again, INCF is participating as mentoring organization in the Google Summer of Code, a global program that offers students stipends to spend the summer writing code for open source projects. INCF has 27 project proposals offered by mentors from the international research community, many of them with a computational neuroscience slant. All projects deal with development and/or improvement of open source tools that are used in the neuroscience community.

You can see our full list of projects here: https://incf.org/gsoc/2015/proposals

To be eligible, students must fulfill the Google definition of 'student': an individual enrolled in or accepted into an accredited institution including (but not necessarily limited to) colleges, universities, masters programs, PhD programs and undergraduate programs.

Student applications open on Monday, March 16.

GSoC questions welcome to: gsoc@incf.org

## March 11, 2015

### Matthew Rocklin

#### Towards Out-of-core DataFrames

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

This post primarily targets developers. It is on experimental code that is not ready for users.

tl;dr Can we build dask.frame? One approach involves indexes and a lot of shuffling.

Over the last two months we’ve watched the creation of dask, a task scheduling specification, and dask.array, a project to implement out-of-core nd-arrays using blocked algorithms. (blogposts: 1, 2, 3, 4, 5, 6). This worked pretty well. Dask.array is available on the main conda channel and on PyPI and, for the most part, is a pleasant drop-in replacement for a subset of NumPy operations. I’m really happy with it.

conda install dask

or

pip install dask

There is still work to do, in particular I’d like to interact with people who have real-world problems, but for the most part dask.array feels ready.

Can we do for Pandas what we’ve just done for NumPy?

Question: Can we represent a large DataFrame as a sequence of in-memory DataFrames and perform most Pandas operations using task scheduling?

Answer: I don’t know. Let's try.

## Naive Approach

If we represent a dask.array as an N-d grid of NumPy ndarrays, then maybe we should represent a dask.frame as a 1-d grid of Pandas DataFrames; they’re kind of like arrays.

This approach supports the following operations:

• Elementwise operations df.a + df.b
• Row-wise filtering df[df.a > 0]
• Reductions df.a.mean()
• Some split-apply-combine operations that combine with a standard reduction like df.groupby('a').b.mean(). Essentially anything you can do with df.groupby(...).agg(...)

The reductions and split-apply-combine operations require some cleverness. This is how Blaze works now and how it does out-of-core operations in these notebooks: Blaze and CSVs, Blaze and Binary Storage.
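The block-wise idea can be sketched with plain lists standing in for the in-memory DataFrames; note how a reduction like mean needs per-block partials combined at the end rather than a naive mean-of-means:

```python
# A "frame" as a 1-d list of in-memory blocks.
blocks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]

# Elementwise operations and row-wise filtering act block-by-block:
filtered = [[x for x in b if x > 2.0] for b in blocks]

# A reduction like mean collects per-block partials (sum, count)
# and combines them once at the end:
total = sum(sum(b) for b in blocks)
count = sum(len(b) for b in blocks)
mean = total / count
print(mean)  # 3.5
```

Each block-level step here is an independent task, which is exactly what a task scheduler like dask can parallelize.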

However this approach does not support the following operations:

• Joins
• Split-apply-combine with more complex transform or apply combine steps
• Sliding window or resampling operations
• Anything involving multiple datasets

## Partition on the Index values

Instead of partitioning based on the size of blocks we instead partition on value ranges of the index.

(Figure: partitioning on block size vs. partitioning on index value)

This opens up a few more operations

• Joins are possible when both tables share the same index. Because we have information about index values we know which blocks from one side need to communicate to which blocks from the other.
• Split-apply-combine with transform/apply steps are possible when the grouper is the index. In this case we’re guaranteed that each group is in the same block. This opens up general df.groupby(...).apply(...)
• Rolling or resampling operations are easy on the index if we share a small amount of information between blocks as we do in dask.array for ghosting operations.

We note the following theme:

Complex operations are easy if the logic aligns with the index

And so a recipe for many complex operations becomes:

1. Re-index your data along the proper column
2. Perform easy computation

## Re-indexing out-of-core data

To be explicit, imagine we have a large time-series of transactions indexed by time and partitioned by day. The data for every day is in a separate DataFrame.

Block 1
-------
credit    name
time
2014-01-01 00:00:00     100     Bob
2014-01-01 01:00:00     200   Edith
2014-01-01 02:00:00    -300   Alice
2014-01-01 03:00:00     400     Bob
2014-01-01 04:00:00    -500  Dennis
...

Block 2
-------
credit    name
time
2014-01-02 00:00:00     300    Andy
2014-01-02 01:00:00     200   Edith
...


We want to reindex this data and shuffle all of the entries so that we now partition on the name of the person. Perhaps all of the A’s are in one block while all of the B’s are in another.

Block 1
-------
time  credit
name
Alice   2014-04-30 00:00:00     400
Alice   2014-01-01 00:00:00     100
Andy    2014-11-12 00:00:00    -200
Andy    2014-01-18 00:00:00     400
Andy    2014-02-01 00:00:00    -800
...

Block 2
-------
time  credit
name
Bob     2014-02-11 00:00:00     300
Bob     2014-01-05 00:00:00     100
...


Re-indexing and shuffling large data is difficult and expensive. We need to find good values on which to partition our data so that we get regularly sized blocks that fit nicely into memory. We also need to shuffle entries from all of the original blocks to all of the new ones. In principle every old block has something to contribute to every new one.

We can’t just call DataFrame.sort because the entire data might not fit in memory and most of our sorting algorithms assume random access.

We do this in two steps

1. Find good division values to partition our data. These should partition the data into blocks of roughly equal size.
2. Shuffle our old blocks into new blocks along the new partitions found in step one.

## Find divisions by external sorting

One approach to find new partition values is to pull out the new index from each block, perform an out-of-core sort, and then take regularly spaced values from that array.

1. Pull out new index column from each block

indexes = [block['new-column-index'] for block in blocks]

2. Perform out-of-core sort on that column

sorted_index = fancy_out_of_core_sort(indexes)

3. Take values at regularly spaced intervals, e.g.

partition_values = sorted_index[::1000000]


We implement this using parallel in-block sorts, followed by a streaming merge process using the heapq module. It works but is slow.
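A toy version of that merge step, using heapq on plain lists (the real implementation streams blocks of index values rather than small lists):

```python
import heapq

# Each block sorts its own index column in memory, producing sorted runs...
runs = [sorted([5, 1, 9]), sorted([2, 8]), sorted([3, 7, 4])]

# ...then a streaming k-way merge yields one globally sorted sequence
# without ever holding all of the data at once.
merged = list(heapq.merge(*runs))
print(merged)  # [1, 2, 3, 4, 5, 7, 8, 9]

# Regularly spaced samples of the merged sequence become partition values.
partition_values = merged[::3]
print(partition_values)  # [1, 4, 8]
```

heapq.merge consumes its inputs lazily, so each run can itself be a generator reading from disk.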

### Possible Improvements

This could be accelerated through one of the following options:

1. A streaming numeric solution that works directly on iterators of NumPy arrays (numtoolz anyone?)
2. Not sorting at all. We only actually need approximate regularly spaced quantiles. A brief literature search hints that there might be some good solutions.

## Shuffle

Now that we know the values on which we want to partition we ask each block to shard itself into appropriate pieces and shove all of those pieces into a spill-to-disk dictionary. Another process then picks up these pieces and calls pd.concat to merge them into the new blocks.
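A sketch of the shuffle on plain lists, using bisect to pick each value's destination partition (my illustration; the real version shards DataFrames and spills the pieces to disk):

```python
import bisect

partition_values = [3, 6]      # divisions found in the previous step
old_blocks = [[1, 7, 4], [2, 8, 5], [9, 3, 6]]

# Each old block shards itself by destination partition; with bisect_left,
# values up to and including the first division land in shard 0, and so on.
shards = {}
for block in old_blocks:
    for x in block:
        dest = bisect.bisect_left(partition_values, x)
        shards.setdefault(dest, []).append(x)

# A second pass concatenates the shards into the new blocks (the post uses
# pd.concat and a spill-to-disk dictionary for this step).
new_blocks = [sorted(shards[i]) for i in sorted(shards)]
print(new_blocks)  # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Note that every old block contributed pieces to every new block, which is why the shuffle is communication-heavy.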

For the out-of-core dict we’re currently using Chest. It turns out that serializing DataFrames and writing them to disk can be tricky. There are several good methods with about an order of magnitude performance difference between them.

## This works but my implementation is slow

Here is an example with a snippet of the NYCTaxi data (this is small):

In [1]: import dask.frame as dfr

In [2]: d = dfr.read_csv('/home/mrocklin/data/trip-small.csv', chunksize=10000)

In [3]: d.head(3)   # This is fast
Out[3]:
0  89D227B655E5C82AECF13C3F540D4CF4  BA96DE419E711691B9445D6A6307C170
1  0BD7C8F5BA12B88E0B67BED28BEA73D8  9FD8F69F0804BDB5549F40E9DA1BE472
2  0BD7C8F5BA12B88E0B67BED28BEA73D8  9FD8F69F0804BDB5549F40E9DA1BE472

vendor_id  rate_code store_and_fwd_flag      pickup_datetime  \
0       CMT          1                  N  2013-01-01 15:11:48
1       CMT          1                  N  2013-01-06 00:18:35
2       CMT          1                  N  2013-01-05 18:49:41

dropoff_datetime  passenger_count  trip_time_in_secs  trip_distance  \
0  2013-01-01 15:18:10                4                382            1.0
1  2013-01-06 00:22:54                1                259            1.5
2  2013-01-05 18:54:23                1                282            1.1

pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude
0        -73.978165        40.757977         -73.989838         40.751171
1        -74.006683        40.731781         -73.994499         40.750660
2        -74.004707        40.737770         -74.009834         40.726002

In [4]: d2 = d.set_index(d.passenger_count, out_chunksize=10000)   # This takes some time

In [5]: d2.head(3)
Out[5]:
medallion  \
passenger_count
0                3F3AC054811F8B1F095580C50FF16090
1                4C52E48F9E05AA1A8E2F073BB932E9AA
1                FF00E5D4B15B6E896270DDB8E0697BF7

passenger_count
1                307D1A2524E526EE08499973A4F832CF       VTS          1
1                0E8CCD187F56B3696422278EBB620EFA       VTS          1

store_and_fwd_flag      pickup_datetime     dropoff_datetime  \
passenger_count
0                              NaN  2013-01-13 03:25:00  2013-01-13 03:42:00
1                              NaN  2013-01-13 16:12:00  2013-01-13 16:23:00
1                              NaN  2013-01-13 15:05:00  2013-01-13 15:15:00

passenger_count  trip_time_in_secs  trip_distance  \
passenger_count
0                              0               1020           5.21
1                              1                660           2.94
1                              1                600           2.18

pickup_longitude  pickup_latitude  dropoff_longitude  \
passenger_count
0                      -73.986900        40.743736         -74.029747
1                      -73.976753        40.790123         -73.984802
1                      -73.982719        40.767147         -73.982170

dropoff_latitude
passenger_count
0                       40.741348
1                       40.758518
1                       40.746170

In [6]: d2.blockdivs  # our new partition values
Out[6]: (2, 3, 6)

In [7]: d.blockdivs   # our original partition values
Out[7]: (10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000)

## Some Problems

• First, we have to evaluate the dask as we go. Every set_index operation (and hence many groupbys and joins) forces an evaluation. We can no longer, as in the dask.array case, endlessly compound high-level operations to form more and more complex graphs and then only evaluate at the end. We need to evaluate as we go.

• Sorting/shuffling is slow. This is for a few reasons including the serialization of DataFrames and sorting being hard.

• How feasible is it to frequently re-index a large amount of data? When do we reach the stage of “just use a database”?

• Pandas doesn’t yet release the GIL, so this is all single-core. See post on PyData and the GIL.

• My current solution lacks basic functionality. I’ve skipped the easy things to first ensure that the hard stuff is doable.

## Help!

I know less about tables than about arrays. I’m ignorant of the literature and common solutions in this field. If anything here looks suspicious then please speak up. I could really use your help.

Additionally the Pandas API is much more complex than NumPy’s. If any experienced devs out there feel like jumping in and implementing fairly straightforward Pandas features in a blocked way I’d be obliged.

## March 10, 2015

### Matthew Rocklin

#### PyData and the GIL

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr Many PyData projects release the GIL. Multi-core parallelism is alive and well.

## Introduction

Machines grow more cores every year. My cheap laptop has four cores and a heavy workstation rivals a decent cluster without the hardware hassle. When I bring this up in conversation people often ask about the GIL and whether or not this poses a problem to the PyData ecosystem and shared-memory parallelism.

Q: Given the growth of shared-memory parallelism, should the PyData ecosystem be concerned about the GIL?

For those who aren’t familiar, the Global Interpreter Lock (GIL) is a CPython feature/bug that stops threads from manipulating Python objects in parallel. This cripples Pure Python shared-memory parallelism.

This sounds like a big deal but it doesn’t really affect the PyData stack (NumPy/Pandas/SciKits). Most PyData projects don’t spend much time in Python code. They spend 99% of their time in C/Fortran/Cython code. This code can often release the GIL. The following projects release the GIL at various stages:

• NumPy
• SciPy
• Numba (if requested) (example docs)
• SciKit Learn
• Anything that mostly uses the above projects
• if you add more in the comments then I will post them here

Our software stack has roots in scientific computing which has an amazing relationship with using all-of-the-hardware. I would like to see the development community lean in to the use of shared-memory parallelism. This feels like a large low-hanging fruit.

As a quick example, we compute a large random dot product with dask.array and look at top. Dask.array computes large array operations by breaking arrays up into many small NumPy arrays and then executing those array operations in multiple threads.

In [1]: import dask.array as da

In [2]: x = da.random.random((10000, 10000), blockshape=(1000, 1000))

In [3]: float(x.dot(x.T).sum())
Out[3]: 250026827523.19141

Technical note: my BLAS is set to use one thread only, the parallelism in the above example is strictly due to multiple Python worker threads, and not due to parallelism in the underlying native code.

Note the 361.0% CPU utilization in the ipython process.

Because the PyData stack is fundamentally based on native compiled code, multiple Python threads can crunch data in parallel without worrying about the GIL. The GIL does not have to affect us in a significant way.

## That’s not true, the GIL hurts Python in the following cases

### Text

We don’t have a good C/Fortran/Cython solution for text. When given a pile-of-text-files we often switch from threads to processes and use the multiprocessing module. This limits inter-worker communication but this is rarely an issue for this kind of embarrassingly parallel work.
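A minimal sketch of that threads-to-processes switch for text work, using the stdlib multiprocessing module (the in-memory strings here stand in for a pile of text files):

```python
# Pure-Python text processing holds the GIL, so we parallelize with
# processes instead of threads.
from multiprocessing import Pool

def count_words(text):
    return len(text.split())

if __name__ == "__main__":
    texts = ["a b c", "d e", "f g h i"]   # stand-ins for file contents
    with Pool(2) as pool:
        counts = pool.map(count_words, texts)
    print(counts)  # [3, 2, 4]
```

Each worker process pays a serialization cost for its inputs and outputs, which is why this pattern suits embarrassingly parallel jobs with little inter-worker communication.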

### Pandas

Pandas does not yet release the GIL in computationally intensive code. It probably could though. This requires momentum from the community and some grunt-work by some of the Pandas devs. I have a small issue here and I think that Phil Cloud is looking into it.

## PyData <3 Shared Memory Parallelism

If you’re looking for more speed in compute-bound applications then consider threading and heavy workstation machines. I personally find this approach to be more convenient than spinning up a cluster.

## March 09, 2015

### Martin Fitzpatrick

#### Wooey: Simple, automated web UIs for Python scripts

Wooey is a simple web interface (built on Flask) to run command line Python scripts. Think of it as an easy way to get your scripts up on the web for routine data analysis, file processing, or anything else.

Inspired by what Gooey can do, turning ArgumentParser-based command-line scripts into wxWidgets-based GUIs, I thought I’d see if I could do the same for the web. I’m still not sure if the result is beautiful or horrific.

Wooey (see what I did there?) uses the same, albeit slightly modified, back-end conversion of ArgumentParser instances to JSON definitions. These definitions are used to construct a web-based UI with type-dependent widgets. Submitted configurations are parsed, using the JSON definition, to command line arguments that are then submitted to a job queue.
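I don't know Wooey's actual JSON schema, but the core trick of introspecting an ArgumentParser into a JSON-serializable definition can be sketched as follows (the field names here are illustrative, not Wooey's format, and `_actions` is a private argparse attribute):

```python
import argparse
import json

def parser_to_json(parser):
    """Dump an ArgumentParser's arguments to a JSON-friendly structure."""
    fields = []
    for action in parser._actions:  # note: _actions is private argparse API
        if action.dest == 'help':
            continue  # skip the automatic -h/--help action
        fields.append({
            'name': action.dest,
            'flags': action.option_strings,   # empty for positionals
            'required': action.required,
            'default': action.default,
            'help': action.help,
            'type': getattr(action.type, '__name__', None),
        })
    return json.dumps({'program': parser.prog, 'fields': fields}, indent=2)

parser = argparse.ArgumentParser(prog='analyze')
parser.add_argument('infile')
parser.add_argument('--threshold', type=float, default=0.5, help='cutoff')
print(parser_to_json(parser))
```

A web front-end can then walk `fields` and pick a widget per entry (file picker for positionals, number input where `type` is `float`, and so on).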

Jobs in the queue are run automatically and the results made available in the job view, with smart handling of outputs such as images in a tabbed output viewer (CSV etc. to be supported via pandas, possibly through some kind of plugin system). Support for downloading zipped output files is to follow.

The use case for myself was as a simple platform to allow running of routine data-processing and analysis scripts within a research group, but I’m sure there are other possibilities. However, I wouldn’t recommend putting this on the public web just yet (pre-alpha warning). It’s somewhat comparable to things like Shiny for R, except multi-user out of the box. Support for multiple command-line formats is on my todo.

## Walkthrough

The front page of a Wooey install presents a list of installed scripts:

Each script has its own UI form based on the config parameters defined in the ArgumentParser:

Documentation can be specified either manually via the JSON, or by providing a Markdown-format file alongside the script or config file.

Logged-in users get a nice listing of their previous jobs:

The output from successful jobs is available via an inline viewer (images only presently, .csv support via Pandas to follow):

Errors are output to the inline console:

### Juan Nunez-Iglesias

#### jnuneziglesias

Prompted in part by some discussions with Ed Schofield, creator of python-future.org, I’ve been going on a bit of a porting spree to Python 3. I just finished with my gala segmentation library. (Find it on GitHub and ReadTheDocs.) Overall, the process is nowhere near as onerous as you might think it is. Getting started really is the hardest part. If you have more than yourself as a user, you should definitely just get on with it and port.

The second hardest part is the testing. In particular, you will need to be careful with dictionary iteration, pickled objects, and file persistence in general. I’ll go through these gotchas in more detail below.

## Reminder: the order of dictionary items is undefined

This is one of those duh things that I forget over and over and over. In my porting, some tests that depended on a scikit-learn RandomForest object were failing. I assumed that there was some difference between the random seeding in Python 2 and Python 3, leading to slightly different models between the two versions of the random forest.

This was a massive red herring that took me forever to figure out. In actuality, the seeding was completely fine. However, gala uses networkx as its graph backend, which itself uses an adjacency dictionary to store edges. So when I asked for graph.edges() to get a set of training examples, I was getting the edges in a random order that was deterministic within Python 2.7: the edges returned were always in the same shuffled order. This went out the window when switching to Python 3.4, with the training examples now in a different order, resulting in a different random forest and thus a different learning outcome… And finally a failed test.
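One general defense against this class of bug is to impose an explicit order before consuming dict-backed collections (dict iteration order only became insertion order in Python 3.7). A minimal illustration of my own, with a plain adjacency dict standing in for networkx's edge storage:

```python
# An adjacency dict, standing in for how networkx stores a graph's edges.
adjacency = {'b': ['c'], 'a': ['b', 'c'], 'c': []}

# Iteration order of a dict is an implementation detail we should not
# rely on across Python versions, so sort the edges before using them
# as training examples.
edges = sorted((u, v) for u, vs in adjacency.items() for v in vs)
print(edges)  # [('a', 'b'), ('a', 'c'), ('b', 'c')]
```

Sorting once before training makes the example order, and hence the fitted model, identical on Python 2 and 3.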

The solution should have been to use a classifier that is not sensitive to ordering of the training data. However, although many classifiers satisfy this property, in practice they suffer from slight numerical instability which is sufficient to throw the test results off between shufflings of the training data.

So I trained a Naive Bayes classifier in Python 2.7, which I then load up in Python 3.4 and check whether its parameters are close to those of a newly trained one. The actual classification results can differ slightly, and this becomes much worse in gala, where classification tasks are sequential, so a single misstep can throw off everything that comes after it.

## When pickling, remember to open files in binary mode

I’ve always felt that the pickle module was deficient for not accepting filenames as input to dump. Instead, it takes an open, writeable file. This is all well and good but it turns out that you should always open files in binary mode when using pickle! I got this far without knowing that, surely an indictment of pickle’s API!

Additionally, you’ll have to specify encoding='bytes' when loading a Python 2 saved file in the Python 3 version of pickle.
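A minimal round-trip showing both points; the encoding='bytes' argument only matters when the file was actually written by Python 2:

```python
import os
import pickle
import tempfile

data = {'weights': [0.1, 0.2], 'label': 'spam'}
path = os.path.join(tempfile.gettempdir(), 'model.pkl')

# Always open the file in binary mode -- pickle writes bytes, not text.
with open(path, 'wb') as fout:
    pickle.dump(data, fout, protocol=2)  # protocol 2 can be read by Python 2

with open(path, 'rb') as fin:
    # encoding='bytes' would be needed here if the file had been written
    # by Python 2; for a same-version round-trip the default is fine.
    restored = pickle.load(fin)

print(restored == data)  # True
```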

## Even when you do, objects may not map cleanly between Python 2 and 3 (for some libraries)

In Python 2:

>>> import pickle as pck
>>> from sklearn.datasets import load_iris
>>> from sklearn.ensemble import RandomForestClassifier as RF
>>> iris = load_iris()
>>> rf = RF()
>>> rf = rf.fit(iris.data, iris.target)
>>> with open('rf', 'wb') as fout:
...     pck.dump(rf, fout, protocol=2)


Then, in Python 3:

>>> with open('rf', 'rb') as fin:
...     rf = pck.load(fin, encoding='bytes')
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-9-674ee92b354d> in <module>()
      1 with open('rf', 'rb') as fin:
----> 2     rf = pck.load(fin, encoding='bytes')
      3

/Users/nuneziglesiasj/anaconda/envs/py3k/lib/python3.4/site-packages/sklearn/tree/_tree.so in sklearn.tree._tree.Tree.__setstate__ (sklearn/tree/_tree.c:18115)()

KeyError: 'node_count'


## When all is said and done, your code will probably run slower on Python 3

I have to admit: this just makes me angry. After a lot of hard work ironing out all of the above kinks, gala’s tests run about 2x slower in Python 3.4 than in 2.7. I’d heard quite a few times that Python 3 is slower than 2, but that’s just ridiculous.

Nick Coghlan’s enormous Q&A has been cited as required reading before complaining about Python 3. Well, I’ve read it (which took days), and I’m still angry that the CPython core development team are generally dismissive of anyone wanting faster Python. Meanwhile, Google autocompletes “why is Python” with “so slow”. And although Nick asserts that those of us complaining about this “misunderstand the perspective of conservative users”, community surveys show a whopping 40% of Python 2 users citing “no incentive” as the reason they don’t switch.

## In conclusion…

In the end, I’m glad I ported my code. I learned a few things, and I feel like a better Python “citizen” for having done it. But that’s the point: those are pretty weak reasons. Most people just want to get their work done and move on. Why would they bother porting their code if it’s not going to help them do that?

## March 05, 2015

### Continuum Analytics

#### Continuum Analytics - March Tech Events

This month, the Continuum team will be at a variety of meetups and conferences talking about the power of Python in analytics. Take a look at where you can find us, and reach out at info@continuum.io if you’re interested in meeting up, or if you would like us to participate in your event.

## February 28, 2015

### Matthew Rocklin

#### Ising models and Numba

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr I play with Numba and find it effective.

## Introduction

Confession: I’ve never actually used Numba. Well, that’s not quite true; I’ve indirectly used Numba thousands of times because Blaze auto-generates numba ufuncs. Still, I’ve never used it on a particular problem. I usually define problems with large array operations and compile those down. Numba takes a different approach and translates Python for loops to efficient LLVM code. This is all lower in the stack than where I usually think.

But when I was looking for applications to motivate recent work in nearest-neighbor communications in dask a friend pointed me towards the Ising model, a simple physical system that is both easy to code up and has nice macro-scale properties. I took this as an example to play with Numba. This post details my experience.

## Ising Model

Disclaimer: I am not a physicist

The Ising model represents a regular grid of points where each point has two possible states, spin up and spin down. States like to have the same spin as their immediate neighbors so when a spin-down state is surrounded by more spin-up states it will switch to spin-up and vice versa. Also, due to random fluctuations, points might switch spins, even if this switch is not favorable. In pseudocode an Ising update step might look like the following

for every point in the grid:
    energy = my spin * sum of all of the spins (+1 or -1) of neighboring points
    if energy is improved by switching:
        switch
    else if we're particularly unlucky:
        switch anyway


For this kind of algorithm, imperative for-loopy code is probably the clearest. You can do this with high-level array operations too (e.g. NumPy/Blaze/Theano), but it’s a mess.
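For contrast, here is a sketch of the array-operation style (my illustration, not the post's code) using np.roll for periodic boundaries. It computes every site's energy change at once, but a correct update would still need the even/odd checkerboard passes, which is where the mess begins:

```python
import numpy as np

def neighbor_sum(x):
    # Sum the 8 neighbors of every site at once, with periodic
    # boundaries, by adding shifted copies of the whole grid.
    return sum(np.roll(np.roll(x, di, axis=0), dj, axis=1)
               for di in (-1, 0, 1) for dj in (-1, 0, 1)
               if (di, dj) != (0, 0))

x = np.random.randint(2, size=(100, 100)).astype('i1')
x[x == 0] = -1                 # spins are +1 / -1
dE = 2 * x * neighbor_sum(x)   # energy change if each site flipped
print(dE.shape)  # (100, 100)
```

This gives all the dE values in one shot, but flipping sites based on them in parallel would let neighbors flip simultaneously, so the whole-array version still has to split the grid into non-interacting sublattices.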

## Numba code

Here is my solution to the problem with Numba, along with a gif of the result

import numba
import numpy as np
from math import exp, log, e, sqrt

kT = 2 / log(1 + sqrt(2), e)

@numba.jit(nopython=True)
def _update(x, i, j):
    n, m = x.shape
    dE = 2 * x[i, j] * (
          x[(i-1) % n, (j-1) % m]
        + x[(i-1) % n,  j       ]
        + x[(i-1) % n, (j+1) % m]

        + x[ i,        (j-1) % m]
        + x[ i,        (j+1) % m]

        + x[(i+1) % n, (j-1) % m]
        + x[(i+1) % n,  j       ]
        + x[(i+1) % n, (j+1) % m]
        )
    if dE <= 0 or exp(-dE / kT) > np.random.random():
        x[i, j] *= -1

@numba.jit(nopython=True)
def update(x):
    n, m = x.shape

    for i in range(n):
        for j in range(0, m, 2):  # Even columns first to avoid overlap
            _update(x, i, j)

    for i in range(n):
        for j in range(1, m, 2):  # Odd columns second to avoid overlap
            _update(x, i, j)

if __name__ == '__main__':
    x = np.random.randint(2, size=(1000, 1000)).astype('i1')
    x[x == 0] = -1
    for i in range(100):
        update(x)

My old beater laptop executes one update step on a 1000x1000 grid in 50ms. Without Numba this takes 12s. This wasn’t a canned demo by an expert user / numba developer; this was just my out-of-the-box experience.

## Thoughts

I really like the following:

• I can remove @numba.jit and use the Python debugger
• I can assert that I’m only using LLVM with nopython=True
• I can manage data with NumPy (or dask.array) separately from managing computation with Numba

I ran in to some issues and learned some things too:

• random is only present in the developer preview builds of Numba (conda install -c numba numba). It will be officially released in the 0.18 version this March.
• You don’t have to provide type signature strings. I tried providing these at first but I didn’t know the syntax and so repeatedly failed to write down the type signature correctly. Turns out the cost of not writing it down is that Numba will jit whenever it sees a new signature. For my application this is essentially free.

## February 26, 2015

### NeuralEnsemble

#### Workshop Announcement - "HBP Hippocamp CA1: Collaborative and Integrative Modeling of Hippocampal Area CA1"

Registration is now open for the workshop "HBP Hippocamp CA1: Collaborative and Integrative Modeling of Hippocampal Area CA1", to be held March 31st - April 1st, 2015 at UCL School of Pharmacy in London, supported by the Human Brain Project (www.humanbrainproject.eu).

In short, the aims of the workshop are two-fold. First, to engage the larger community of experimentalists and modelers working on hippocampus, and highlight existing modeling efforts and strategic datasets for modeling hippocampal area CA1. Second, to define and bootstrap an inclusive community-driven model and data-integration process to achieve open pre-competitive reference models of area CA1 (and, ultimately, the rest of the hippocampus), which are well documented, validated, and released at regular intervals (supported in part by IT infrastructure funded by HBP). Involvement from the community interested in characterization and modeling of the hippocampus is highly encouraged. To keep the meeting focused on the task, participation will be limited to ~30 people, so registration is required.

Please consult the meeting website at http://neuralensemble.org/meetings/HippocampCA1/ for registration and further details.

Organizing committee: Jo Falck (UCL), Szabolcs Káli (Hungarian Academy of Sciences), Sigrun Lange (UCL), Audrey Mercer (UCL), Eilif Muller (EPFL), Armando Romani (EPFL) and Alex Thomson (UCL).

## February 24, 2015

### Jake Vanderplas

#### Optimizing Python in the Real World: NumPy, Numba, and the NUFFT

Donald Knuth famously quipped that "premature optimization is the root of all evil." The reasons are straightforward: optimized code tends to be much more difficult to read and debug than simpler implementations of the same algorithm, and optimizing too early leads to greater costs down the road. In the Python world, there is another cost to optimization: optimized code often is written in a compiled language like Fortran or C, and this leads to barriers to its development, use, and deployment.

Too often, tutorials about optimizing Python use trivial or toy examples which may not map well to the real world. I've certainly been guilty of this myself. Here, I'm going to take a different route: in this post I will outline the process of understanding, implementing, and optimizing a non-trivial algorithm in Python, in this case the Non-uniform Fast Fourier Transform (NUFFT). Along the way, we'll dig into the process of optimizing Python code, and see how a relatively straightforward pure Python implementation, with a little help from Numba, can be made to nearly match the performance of a highly-optimized Fortran implementation of the same algorithm.

## Why a Python Implementation?

First, I want to answer the inevitable question: why spend the time to make a Python implementation of an algorithm that's already out there in Fortran? The reason is that I've found in my research and teaching that pure-Python implementations of algorithms are far more valuable than C or Fortran implementations, even if they might be a bit slower. This is for a number of reasons:

• Pure-Python code is easier to read, understand, and contribute to. Good Python implementations are much higher-level than C or Fortran, and abstract-away loop indices, bit twiddling, workspace arrays, and other sources of code clutter. A typical student reading good Python code can immediately understand and modify the algorithm, while the same student would be lost trying to understand typical optimized Fortran code.

• Pure-python packages are much easier to install than Python-wrapped C or Fortran code. This is especially true on non-Linux systems. Fortran in particular can require some installation prerequisites that are non-trivial for many users. In practice, I've seen people give up on better tools when there is an installation barrier for those tools.

• Pure-python code often works for many data types. Because of the way it is written, pure Python code is often automatically applicable to single or double precision, and perhaps even to extensions to complex numbers. For compiled packages, supporting and compiling for all possible types can be a burden.

• Pure-python is easier to use at scale. Because it does not require complicated installation, pure Python packages can be much easier to install on cloud VMs and/or shared clusters for computation at scale. If you can easily pip-install a pure-Python package on a VM, then services like AWS and TravisCI are much easier to set up.

Certainly code speed will overcome these considerations if the performance gap is great enough, but I've found that for many applications a pure Python package, cleverly designed and optimized, can be made fast enough that these larger considerations win-out. The challenge is making the Python fast. We'll explore this below.

## Background: The Non-Uniform Fast Fourier Transform

The Fast Fourier Transform (FFT) is perhaps the most important and fundamental of modern numerical algorithms. It provides a fast, $$O[N\log N]$$ method of computing the discrete Fourier transform: $Y_k^\pm = \sum_{n=0}^{N-1} y_n e^{\pm i k n / N}$ You can read more about the FFT in my previous post on the subject.

One important limitation of the FFT is that it requires that input data be evenly-spaced: that is, we can think of the values $$y_n$$ as samples of a function $$y_n = y(x_n)$$ where $$x_n = x_0 + n\Delta x$$ is a regular grid of points. But what about when your grid is not uniform? That is, what if you want to compute this result: $Y_k^\pm = \sum_{j=1}^N y(x_j) e^{\pm i k x_j}$ where $$y(x)$$ is evaluated at an arbitrary set of points $$x_j$$? In this case, the FFT is no longer directly applicable, and you're stuck using a much slower $$O[N^2]$$ direct summation.

Stuck, that is, until the NUFFT came along.

The NUFFT is an algorithm which converts the non-uniform transform into an approximate uniform transform, not with error-prone interpolation, but instead using a clever "gridding" operation motivated by the convolution theorem. If you'd like to read about the algorithm in detail, the Courant Institute's NUFFT page has a nice set of resources.

Below we'll take a look at implementing this algorithm in Python.

## Direct Non-Uniform Fourier Transform

When developing optimized code, it is important to start with something easy to make sure you're on the right track. Here we'll start with a straightforward direct version of the non-uniform Fourier transform. We'll allow non-uniform inputs $$x_j$$, but compute the output on a grid of $$M$$ evenly-spaced frequencies in the range $$-M/2 \le f/\delta f < M/2$$. This is what the NUFFT group calls the Type-1 NUFFT.

First we'll implement nufftfreqs(), which returns the frequency grid for a given $$M$$, and nudft() which computes the non-uniform discrete Fourier transform using a slow direct method. The arguments for the latter include iflag, which is a positive or negative number indicating the desired sign of the exponent:

In [1]:
from __future__ import print_function, division
import numpy as np

def nufftfreqs(M, df=1):
    """Compute the frequency range used in nufft for M frequency bins"""
    return df * np.arange(-(M // 2), M - (M // 2))

def nudft(x, y, M, df=1.0, iflag=1):
    """Non-Uniform Direct Fourier Transform"""
    sign = -1 if iflag < 0 else 1
    return (1 / len(x)) * np.dot(
        y, np.exp(sign * 1j * nufftfreqs(M, df) * x[:, np.newaxis]))


Again, I can't emphasize this enough: when writing fast code, start with a slow-and-simple version of the code which you know gives the correct result, and then optimize from there.

## Comparing to the Fortran NUFFT

We can double-check that this is producing the desired result by comparing to the Fortran NUFFT implementation, using Python wrappers written by Dan Foreman-Mackey, available at http://github.com/dfm/python-nufft/:

In [2]:
# Install nufft from http://github.com/dfm/python-nufft/
from nufft import nufft1 as nufft_fortran

x = 100 * np.random.random(1000)
y = np.sin(x)

Y1 = nudft(x, y, 1000)
Y2 = nufft_fortran(x, y, 1000)

np.allclose(Y1, Y2)

Out[2]:
True


The results match! A quick check shows that, as we might expect, the Fortran algorithm is orders of magnitude faster:

In [3]:
%timeit nudft(x, y, 1000)
%timeit nufft_fortran(x, y, 1000)

10 loops, best of 3: 47.9 ms per loop
1000 loops, best of 3: 265 µs per loop



On top of this, for $$N$$ points and $$N$$ frequencies, the Fortran NUFFT will scale as $$O[N\log N]$$, while our simple implementation will scale as $$O[N^2]$$, making the difference even greater as $$N$$ increases! Let's see if we can do better.

## NUFFT with Python

Here we'll attempt a pure-Python version of the fast, FFT-based NUFFT. We'll follow the basics of the algorithm presented on the NUFFT page, using NumPy broadcasting tricks to push loops into the compiled layer of NumPy. For later convenience, we'll start by defining a utility to compute the grid parameters as detailed in the NUFFT paper.

In [4]:
def _compute_grid_params(M, eps):
    # Choose Msp & tau from eps following Dutt & Rokhlin (1993)
    if eps <= 1E-33 or eps >= 1E-1:
        raise ValueError("eps = {0:.0e}; must satisfy "
                         "1e-33 < eps < 1e-1.".format(eps))
    ratio = 2 if eps > 1E-11 else 3
    Msp = int(-np.log(eps) / (np.pi * (ratio - 1) / (ratio - 0.5)) + 0.5)
    Mr = max(ratio * M, 2 * Msp)
    lambda_ = Msp / (ratio * (ratio - 0.5))
    tau = np.pi * lambda_ / M ** 2
    return Msp, Mr, tau

def nufft_python(x, c, M, df=1.0, eps=1E-15, iflag=1):
    """Fast Non-Uniform Fourier Transform with Python"""
    Msp, Mr, tau = _compute_grid_params(M, eps)
    N = len(x)

    # Construct the convolved grid
    ftau = np.zeros(Mr, dtype=c.dtype)
    Mr = ftau.shape[0]
    hx = 2 * np.pi / Mr
    mm = np.arange(-Msp, Msp)
    for i in range(N):
        xi = (x[i] * df) % (2 * np.pi)
        m = 1 + int(xi // hx)
        spread = np.exp(-0.25 * (xi - hx * (m + mm)) ** 2 / tau)
        ftau[(m + mm) % Mr] += c[i] * spread

    # Compute the FFT on the convolved grid
    if iflag < 0:
        Ftau = (1 / Mr) * np.fft.fft(ftau)
    else:
        Ftau = np.fft.ifft(ftau)
    Ftau = np.concatenate([Ftau[-(M//2):], Ftau[:M//2 + M % 2]])

    # Deconvolve the grid using convolution theorem
    k = nufftfreqs(M)
    return (1 / N) * np.sqrt(np.pi / tau) * np.exp(tau * k ** 2) * Ftau


Let's compare this to the previous results. For convenience, we'll define a single routine which validates the results and times the execution:

In [5]:
from time import time

def test_nufft(nufft_func, M=1000, Mtime=100000):
    # Test vs the direct method
    print(30 * '-')
    name = {'nufft1': 'nufft_fortran'}.get(nufft_func.__name__,
                                           nufft_func.__name__)
    print("testing {0}".format(name))
    rng = np.random.RandomState(0)
    x = 100 * rng.rand(M + 1)
    y = np.sin(x)
    for df in [1, 2.0]:
        for iflag in [1, -1]:
            F1 = nudft(x, y, M, df=df, iflag=iflag)
            F2 = nufft_func(x, y, M, df=df, iflag=iflag)
            assert np.allclose(F1, F2)
    print("- Results match the DFT")

    # Time the nufft function
    x = 100 * rng.rand(Mtime)
    y = np.sin(x)
    times = []
    for i in range(5):
        t0 = time()
        F = nufft_func(x, y, Mtime)
        t1 = time()
        times.append(t1 - t0)
    print("- Execution time (M={0}): {1:.2g} sec".format(Mtime, np.median(times)))

In [6]:
test_nufft(nufft_python)
test_nufft(nufft_fortran)

------------------------------
testing nufft_python
- Results match the DFT
- Execution time (M=100000): 1.6 sec
------------------------------
testing nufft_fortran
- Results match the DFT
- Execution time (M=100000): 0.043 sec



The good news is that our Python implementation works; the bad news is that it remains several orders of magnitude slower than the Fortran result!

Let's make it faster.

### Data

Beautiful code is kind of a goal unto itself, but we really want to demonstrate how useful SciPy is in real scientific analysis. Therefore, although cute examples on synthetic data can illustrate quite well what a piece of code does, it would be extremely useful for us if you could point to examples with real scientific data behind them.

Thank you, and we hope to hear from you!

## February 17, 2015

### Continuum Analytics

#### Bokeh 0.8 Released

We are excited to announce the release of version 0.8 of Bokeh, an interactive web plotting library for Python… and other languages! This release includes many major new features:

• New and updated language bindings: R, JavaScript, Julia, Scala, and Lua now available
• More bokeh-server features geared towards production deployments
• Live gallery of server examples and apps!
• Simpler, more easily extensible design for charts API, plus new Horizon chart
• New build automation and substantial documentation improvements
• Shaded grid bands, configurable hover tool, and pan/zoom for categorical plots
• Improved and more robust crossfilter application
• AjaxDataSource for clients to stream data without a Bokeh server

### Matthew Rocklin

#### Towards Out-of-core ND-Arrays -- Dask + Toolz = Bag

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr We use dask to build a parallel Python list.

## Introduction

This is the seventh in a sequence of posts constructing an out-of-core nd-array using NumPy and dask. You can view these posts here:

Today we take a break from ND-Arrays and show how task scheduling can attack other collections like the simple list of Python objects.

## Unstructured Data

Often, before we have an ndarray or a table/DataFrame, we have unstructured data like log files. In this case tuned subsets of the language (e.g. numpy/pandas) aren’t sufficient; we need the full Python language.

My usual approach to the inconveniently large dump of log files is to use Python streaming iterators along with multiprocessing or IPython Parallel on a single large machine. I often write/speak about this workflow when discussing toolz.
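The streaming-iterator half of that workflow needs nothing beyond the standard library. Here is a small sketch of my own (toolz offers nicer spellings of the same ideas) that counts a field across lazily chained line streams:

```python
from collections import Counter
from itertools import chain

def parse(line):
    # Hypothetical parser: pull the third whitespace-separated field.
    return line.split()[2]

# Stand-ins for open file handles; real code would use open(fn) per file.
files = [iter(["a b 200", "a b 404"]), iter(["a b 200"])]

lines = chain.from_iterable(files)  # lazily concatenate the streams
codes = map(parse, lines)           # lazily parse each line
counts = Counter(codes)             # a single pass, constant memory
print(counts['200'], counts['404'])  # 2 1
```

Nothing is materialized until the final Counter consumes the pipeline, which is what lets a single machine stream through data much larger than memory.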

This workflow grows complex for most users when you introduce many processes. To resolve this we build our normal tricks into a new dask.Bag collection.

## Bag

In the same way that dask.array mimics NumPy operations (e.g. matrix multiply, slicing), dask.bag mimics functional operations like map, filter, reduce found in the standard library as well as many of the streaming functions found in toolz.

• Dask bag = Python/Toolz + processes

## Example

Here’s the obligatory wordcount example

>>> from dask.bag import Bag

>>> b = Bag.from_filenames('data/*.txt')

>>> def stem(word):
...     """ Stem word to primitive form """
...     return word.lower().rstrip(",.!:;'-\"").lstrip("'\"")

>>> dict(b.map(str.split).concat().map(stem).frequencies())
{...}

We use all of our cores and stream through memory on each core. We use multiprocessing but could get fancier with some work.

There are much larger, much more powerful systems with a similar API, notably Spark and DPark. If you have Big Data and need to use very many machines then you should stop reading this and go install them. Still, I went ahead with dask.bag for a few reasons:

1. It was very easy given the work already done on dask.array
2. I often only need multiprocessing + a heavy machine
3. I wanted something trivially pip installable that didn’t use the JVM

But again, if you have Big Data, then this isn’t for you.

## Design

As before, a Bag is just a dict holding tasks, along with a little metadata.

>>> d = {('x', 0): (range, 5),
...      ('x', 1): (range, 5),
...      ('x', 2): (range, 5)}

>>> b = Bag(d, 'x', npartitions=3)

In this way we break up one collection

[0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4]


into three independent pieces

[0, 1, 2, 3, 4]
[0, 1, 2, 3, 4]
[0, 1, 2, 3, 4]


When we abstractly operate on the large collection…

>>> b2 = b.map(lambda x: x * 10)

… we generate new tasks to operate on each of the components.

>>> b2.dask
{('x', 0): (range, 5),
 ('x', 1): (range, 5),
 ('x', 2): (range, 5),
 ('bag-1', 0): (map, lambda x: x * 10, ('x', 0)),
 ('bag-1', 1): (map, lambda x: x * 10, ('x', 1)),
 ('bag-1', 2): (map, lambda x: x * 10, ('x', 2))}

And when we ask for concrete results (the call to list) we spin up a scheduler to execute the resulting dependency graph of tasks.

>>> list(b2)
[0, 10, 20, 30, 40, 0, 10, 20, 30, 40, 0, 10, 20, 30, 40]

More complex operations yield more complex dasks. Beware, dask code is pretty Lispy. Fortunately these dasks are internal; users don’t interact with them.

>>> iseven = lambda x: x % 2 == 0
{'bag-3': (sum, [('bag-2', 1), ('bag-2', 2), ('bag-2', 0)]),
 ('bag-2', 0): (count, (filter, iseven, (range, 5))),
 ('bag-2', 1): (count, (filter, iseven, (range, 5))),
 ('bag-2', 2): (count, (filter, iseven, (range, 5)))}

The current interface for Bag has the following operations:

all             frequencies         min
any             join                product
count           map                 std
filter          map_partitions      sum
fold            max                 topk
foldby          mean                var


Manipulations of bags create task dependency graphs. We eventually execute these graphs in parallel.

## Execution

We repurpose the threaded scheduler we used for arrays to support multiprocessing to provide parallelism even on pure Python code. We’re careful to avoid unnecessary data transfer. None of the operations listed above requires significant communication. Notably we don’t have any concept of shuffling or scatter/gather.

We use dill to serialize functions properly and to collect and report errors, two issues that plague naive use of multiprocessing in Python.
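The function-serialization problem is easy to demonstrate: the standard pickle module refuses lambdas, which is one reason naive multiprocessing code breaks and why dill is used here. A minimal check:

```python
import pickle

# Lambdas (and closures) cannot be pickled by the standard library,
# which is why naive multiprocessing.Pool.map(lambda ..., ...) fails;
# dill can serialize them by value instead of by name.
try:
    pickle.dumps(lambda x: 2 * x)
    lambda_pickled = True
except Exception:
    lambda_pickled = False

print(lambda_pickled)  # False
```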

>>> list(b.map(lambda x: x * 10))  # This works!
[0, 10, 20, 30, 40, 0, 10, 20, 30, 40, 0, 10, 20, 30, 40]

>>> list(b.map(lambda x: x / 0))   # This errs gracefully!
ZeroDivisionError:  Exception in remote Process

integer division or modulo by zero

Traceback:
...

These tricks remove the need for user expertise.

## Productive Sweet Spot

I think that there is a productive sweet spot in the following configuration

1. Pure Python functions
2. Streaming/lazy data
3. Multiprocessing
4. A single large machine or a few machines in an informal cluster

This setup is common and it’s capable of handling terabyte scale workflows. In my brief experience people rarely take this route. They use single-threaded in-memory Python until it breaks, and then seek out Big Data Infrastructure like Hadoop/Spark at relatively high productivity overhead.

Your workstation can scale bigger than you think.

## Example

Here is about a gigabyte of network flow data, recording which computers made connections to which other computers on the UC-Berkeley campus in 1996.

846890339:661920 846890339:755475 846890340:197141 168.237.7.10:1163 83.153.38.208:80 2 8 4294967295 4294967295 846615753 176 2462 39 GET 21068906053917068819..html HTTP/1.0

846890340:989181 846890341:2147 846890341:2268 13.35.251.117:1269 207.83.232.163:80 10 0 842099997 4294967295 4294967295 64 1 38 GET 20271810743860818265..gif HTTP/1.0

846890341:80714 846890341:90331 846890341:90451 13.35.251.117:1270 207.83.232.163:80 10 0 842099995 4294967295 4294967295 64 1 38 GET 38127854093537985420..gif HTTP/1.0


This is actually relatively clean. Many of the fields are space delimited (not all) and I’ve already compiled and run the decade-old C code needed to decompress it from its original format.

Let’s use Bag and regular expressions to parse this.

In [1]: from dask.bag import Bag, into

In [2]: b = Bag.from_filenames('UCB-home-IP*.log')

In [3]: import re

In [4]: pattern = """
...: (?P<request_time>\d+:\d+)
...: (?P<response_start>\d+:\d+)
...: (?P<response_end>\d+:\d+)
...: (?P<client_ip>\d+\.\d+\.\d+\.\d+):(?P<client_port>\d+)
...: (?P<server_ip>\d+\.\d+\.\d+\.\d+):(?P<server_port>\d+)
...: (?P<if_modified_since>\d+)
...: (?P<response_data_length>\d+)
...: (?P<request_url_length>\d+)
...: (?P<expires>\d+)
...: (?P<last_modified>\d+)
...: (?P<method>\w+)
...: (?P<domain>\d+..)\.(?P<extension>\w*)(?P<rest_of_url>\S*)
...: (?P<protocol>.*)""".strip().replace('\n', '\s+')

In [5]: prog = re.compile(pattern)

In [6]: records = b.map(prog.match).map(lambda m: m.groupdict())

This returns instantly. We only compute results when necessary. We trigger computation by calling list.

In [7]: list(records.take(1))
Out[7]:
[{'client_ip': '168.237.7.10',
  'client_port': '1163',
  'domain': '21068906053917068819.',
  'expires': '2462',
  'extension': 'html',
  'if_modified_since': '4294967295',
  'last_modified': '39',
  'method': 'GET',
  'protocol': 'HTTP/1.0',
  'request_time': '846890339:661920',
  'request_url_length': '176',
  'response_data_length': '846615753',
  'response_end': '846890340:197141',
  'response_start': '846890339:755475',
  'rest_of_url': '',
  'server_ip': '83.153.38.208',
  'server_port': '80'}]

Because bag operates lazily this small result also returns immediately.
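As a self-contained illustration of the named-group technique, here is a stripped-down sketch with a simplified, hypothetical pattern and a made-up line (not the full log format above):

```python
import re

# Simplified pattern: three whitespace-separated fields, each a named group.
pattern = re.compile(r"""
    (?P<client_ip>\d+\.\d+\.\d+\.\d+):(?P<client_port>\d+)\s+
    (?P<method>\w+)\s+
    (?P<url>\S+)
""", re.VERBOSE)

line = "168.237.7.10:1163 GET index.html"
record = pattern.match(line).groupdict()
print(record)
# {'client_ip': '168.237.7.10', 'client_port': '1163', 'method': 'GET', 'url': 'index.html'}
```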

To demonstrate depth we find the ten client/server pairs with the most connections.

In [8]: counts = records.pluck(['client_ip', 'server_ip']).frequencies()

In [9]: %time list(counts.topk(10, key=lambda x: x[1]))
CPU times: user 11.2 s, sys: 1.15 s, total: 12.3 s
Wall time: 50.4 s
Out[9]:
[(('247.193.34.56', '243.182.247.102'), 35353),
(('172.219.28.251', '47.61.128.1'), 22333),
(('240.97.200.0', '108.146.202.184'), 17492),
(('229.112.177.58', '47.61.128.1'), 12993),
(('146.214.34.69', '119.153.78.6'), 12554),
(('17.32.139.174', '179.135.20.36'), 10166),
(('97.166.76.88', '65.81.49.125'), 8155),
(('55.156.159.21', '157.229.248.255'), 7533),
(('55.156.159.21', '124.77.75.86'), 7506),
(('55.156.159.21', '97.5.181.76'), 7501)]
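In spirit, frequencies followed by topk is a grouped count plus a partial sort. The same computation on a toy in-memory list of (client, server) pairs (made-up addresses) looks like this in plain Python:

```python
from collections import Counter

# Toy (client_ip, server_ip) pairs; real data would come from the parsed records.
pairs = [("10.0.0.1", "10.0.0.9")] * 4 + \
        [("10.0.0.2", "10.0.0.9")] * 2 + \
        [("10.0.0.3", "10.0.0.8")]

counts = Counter(pairs)        # like records.frequencies()
top = counts.most_common(2)    # like .topk(2, key=lambda x: x[1])
print(top)
# [(('10.0.0.1', '10.0.0.9'), 4), (('10.0.0.2', '10.0.0.9'), 2)]
```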

## Comparison with Spark

First, it is silly and unfair to compare with PySpark running locally. PySpark offers much more in a distributed context.

In [1]: import pyspark

In [2]: sc = pyspark.SparkContext('local')

In [3]: from glob import glob
In [4]: filenames = sorted(glob('UCB-home-*.log'))
In [5]: rdd = sc.parallelize(filenames, numSlices=4)

In [6]: import re
In [7]: pattern = ...
In [8]: prog = re.compile(pattern)

In [9]: lines = rdd.flatMap(lambda fn: list(open(fn)))
In [10]: records = lines.map(lambda line: prog.match(line).groupdict())
In [11]: ips = records.map(lambda rec: (rec['client_ip'], rec['server_ip']))

In [12]: from toolz import topk
In [13]: %time dict(topk(10, ips.countByValue().items(), key=1))
CPU times: user 1.32 s, sys: 52.2 ms, total: 1.37 s
Wall time: 1min 21s
Out[13]:
{('146.214.34.69', '119.153.78.6'): 12554,
('17.32.139.174', '179.135.20.36'): 10166,
('172.219.28.251', '47.61.128.1'): 22333,
('229.112.177.58', '47.61.128.1'): 12993,
('240.97.200.0', '108.146.202.184'): 17492,
('247.193.34.56', '243.182.247.102'): 35353,
('55.156.159.21', '124.77.75.86'): 7506,
('55.156.159.21', '157.229.248.255'): 7533,
('55.156.159.21', '97.5.181.76'): 7501,
('97.166.76.88', '65.81.49.125'): 8155}

So, given a compute-bound, mostly embarrassingly parallel task (regexes are comparatively expensive) on a single machine, the two are comparable.

Reasons you would want to use Spark

• You want to use many machines and interact with HDFS
• Shuffling operations

Reasons you would want to use dask.bag

• Trivial installation
• No mucking about with JVM heap sizes or config files
• Nice error reporting. Spark error reporting includes the typical giant Java stack trace that comes with any JVM solution.
• Easier/simpler for Python programmers to hack on. The implementation is 350 lines including comments.

Again, this is really just a toy experiment to show that the dask model isn’t just about arrays. I absolutely do not want to throw Dask in the ring with Spark.

## Conclusion

However I do want to stress the importance of single-machine parallelism. Dask.bag targets this application well and leverages common hardware in a simple way that is both natural and accessible to most Python programmers.

A skilled developer could extend this to work in a distributed memory context. The logic to create the task dependency graphs is separate from the scheduler.
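To make that separation concrete, here is a toy sketch of the dask model: the graph is an ordinary dict mapping keys to data or (function, args) tasks, and a scheduler is any function that walks it. This is an illustrative mini-scheduler, not dask's actual implementation:

```python
def inc(x):
    return x + 1

def add(x, y):
    return x + y

# A dask-style task graph: keys map to data or to (function, *args) tasks,
# where string arguments that are keys refer to other tasks.
graph = {
    'x': 1,
    'y': (inc, 'x'),       # y = inc(x)
    'z': (add, 'y', 10),   # z = add(y, 10)
}

def get(dsk, key):
    """Naively execute the task for `key`, recursing into its dependencies."""
    task = dsk[key]
    if isinstance(task, tuple):
        func, *args = task
        args = [get(dsk, a) if a in dsk else a for a in args]
        return func(*args)
    return task

print(get(graph, 'z'))
# 12
```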

Special thanks to Erik Welch for finely crafting the dask optimization passes that keep the data flowing smoothly.

## February 13, 2015

### Matthew Rocklin

#### Towards Out-of-core ND-Arrays -- Slicing and Stacking

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr Dask.arrays can slice and stack. This is useful for weather data.

## Introduction

This is the sixth in a sequence of posts constructing an out-of-core nd-array using NumPy and dask. You can view these posts here:

Now we talk about slicing and stacking. We use meteorological data as an example use case.

## Slicing

Dask.array now supports most of the NumPy slicing syntax. In particular it supports the following:

• Slicing by integers and slices x[0, :5]
• Slicing by a list/np.ndarray of integers x[[1, 2, 4]]
• Slicing by a list/np.ndarray of booleans x[[False, True, True, False, True]]

It does not currently support the following:

• Slicing one dask.array with another x[x > 0]
• Slicing with lists in multiple axes x[[1, 2, 3], [3, 2, 1]]
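The supported forms follow NumPy semantics, so the resulting shapes are easy to check against an in-memory array (plain NumPy here; with dask.array the same expressions would apply to a dask array):

```python
import numpy as np

x = np.arange(20).reshape(4, 5)

print(x[0, :5])                             # integers and slices
# [0 1 2 3 4]
print(x[[1, 2, 3]].shape)                   # list of integers selects rows 1, 2, 3
# (3, 5)
print(x[[False, True, True, False]].shape)  # boolean mask along axis 0
# (2, 5)
```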

## Stack and Concatenate

We often store large arrays on disk in many different files. We want to stack or concatenate these arrays together into one logical array. Dask solves this problem with the stack and concatenate functions, which stitch many arrays together into a single array, either creating a new dimension with stack or along an existing dimension with concatenate.

### Stack

We stack many existing dask arrays into a new array, creating a new dimension as we go.

>>> import numpy as np
>>> import dask.array as da
>>> arrays = [da.from_array(np.ones((4, 4)), blockshape=(2, 2))
...            for i in range(3)]  # A small stack of dask arrays

>>> da.stack(arrays, axis=0).shape
(3, 4, 4)

>>> da.stack(arrays, axis=1).shape
(4, 3, 4)

>>> da.stack(arrays, axis=2).shape
(4, 4, 3)

This creates a new dimension with length equal to the number of slices.

### Concatenate

We concatenate existing arrays into a new array, extending them along an existing dimension.

>>> import numpy as np
>>> import dask.array as da
>>> arrays = [da.from_array(np.ones((4, 4)), blockshape=(2, 2))
...            for i in range(3)]  # small stack of dask arrays

>>> da.concatenate(arrays, axis=0).shape
(12, 4)

>>> da.concatenate(arrays, axis=1).shape
(4, 12)
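The same shape arithmetic holds for in-memory NumPy arrays, which makes the behavior easy to verify (np.stack and np.concatenate here stand in for their dask.array counterparts):

```python
import numpy as np

arrays = [np.ones((4, 4)) for _ in range(3)]

# stack adds a new dimension of length len(arrays) at the given axis
print(np.stack(arrays, axis=0).shape)        # (3, 4, 4)
print(np.stack(arrays, axis=2).shape)        # (4, 4, 3)

# concatenate extends an existing dimension
print(np.concatenate(arrays, axis=0).shape)  # (12, 4)
print(np.concatenate(arrays, axis=1).shape)  # (4, 12)
```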

## Case Study with Meteorological Data

To test this new functionality we download meteorological data from the European Centre for Medium-Range Weather Forecasts. In particular, we have the temperature for the Earth every six hours for all of 2014, with a spatial resolution of a quarter degree. We download this data using this script (please don’t hammer their servers unnecessarily). Thanks to Stephan Hoyer for pointing me to this dataset.

As a result, I now have a bunch of netCDF files!

$ into myfile.csv ssh://hostname:myfile.json --delimiter ','

We also have docs! http://into.readthedocs.org/en/latest/

## February 09, 2015

### numfocus

#### NumFOCUS 2014 Review

While not represented in our blog activity, 2014 turned out to be a very busy year for NumFOCUS. We sponsored more projects, gained new board members, and added programs. It has been hard to keep up with all the activity within the organization, and perhaps much harder for those outside. Let me go through a few of the highlights from the year to get folks caught up.

First and foremost, we held our first election, which allowed us to welcome Lorena Barba, Brian Granger, Stefan Karpinski, and Cindee Madison to the board of directors. In an effort to grow the number of people helping out, we have also created two new ways for people to contribute: the Advisory Council and Vice President roles. The inaugural advisory council consists of Fernando Perez and Travis Oliphant. They will help guide the organization as it grows in both scope and operations. The new Vice Presidents are Sheila Miguez, James Powell, and Ben Zaitlen, all committed community members who have helped in various ways to get us started. We are always looking for more volunteers to help build our organization, but with these added team members we hope to continue to deliver high-quality services to our community.

Over the year we held four PyData conferences, supported one John Hunter Technology Fellow, led Women in Technology bootcamps, helped between 20 and 30 conference attendees travel to important conferences, and more. As the new year starts, we will be doing many of these things plus working with the Training Up program. We see the need not only to sponsor projects but also to create important programs that bring the community together. As always, we are looking for more volunteers for these events, so please do get in touch.
Perhaps one of the most exciting things we did this year was to expand our language scope from a Python-centric organization to include several non-Python projects: ROpenSci and Julia are now part of our lineup. Joining us on the education front are Software Carpentry and its sister organization Data Carpentry. We also finished signing the Python projects we had been working with for a while, including SymPy, IPython, and AstroPy. While fiscal sponsorship doesn’t make sense for every project, it is a service we happily provide to help projects sustain and grow in natural ways.

In the coming year, we plan to be a great deal more vocal about our activities. Personally, two major goals I have are to expand our reach to Europe and to start making headway on project governance recommendations. But that is another story.

## February 08, 2015

### Titus Brown

#### Review: "Accurate multi-kb reads resolve complex populations and detect rare microorganisms"

Here is the top bit of a review I wrote of a very nice paper by Itai Sharon et al., from Jill Banfield's lab, on using Illumina TruSeq long reads (a.k.a. Moleculo) to look at complex metagenomes. The paper is newly available here, although it is behind a paywall.

Citation: Accurate, multi-kb reads resolve complex populations and detect rare microorganisms. Genome Res. gr.183012.114. Published in Advance February 9, 2015. doi: 10.1101/gr.183012.114

This is an excellent application of new long-read technology to further illuminate the characteristics of several medium-to-high complexity microbial communities. The methods are expert, the results are fascinating, and the discussion is well done.

Objectives:

1. test the efficacy of assembling Moleculo reads to improve short-read contigs;
2. evaluate accuracy of curated short-read assemblies;
3. look at organisms present at very low abundance levels;
4. evaluate levels of sequence variation & genomic content in strains that could not otherwise be resolved by short-read assembly.

Results:

1. Long-read data revealed many very abundant organisms...
2. ...that were entirely missed in short-read assemblies.
3. Genome architecture and metabolic potential were reconstructed using a new synteny-based method.
4. The "long tail" of low-abundance organisms belongs to phyla represented by highly abundant organisms.
5. Diversity of closely related strains & rare organisms accounts for a major portion of the communities.

The portion of the results that is most novel and most fascinating is the extensive analysis of rare sequences and the disparity in observations from Illumina (assemblies) and Moleculo (long reads and assemblies). The basic results are, on first examination, counter-intuitive: many long-read sequences are obtained from abundant organisms that simply don't show up in Illumina short-read assemblies. The statement is made that this is because of strain variation in the community, i.e. that Illumina assemblies are fragmented due to strain variation and this blocks the observation of the majority of the community. This is to some extent borne out by the low mapping percentages (which is the primary evidence offered by the authors), and also matches our own observations.

--titus

## February 07, 2015

### Titus Brown

#### How we develop software (2015 version)

A colleague who is starting their own computational lab just asked me for some advice on how to run software projects, and I wrote up the following. Comments welcome!

A brief summary of what we've converged on for our own needs is this:

--titus

## February 05, 2015

### Continuum Analytics

#### Continuum Analytics - February Tech Events

The Continuum Analytics team believes in engaging with our community and fostering relationships throughout the tech world.
To keep you all in the loop, we’re updating our blog each month with a list of events where you can find us attending, hosting, or participating!

## February 04, 2015

### Juan Nunez-Iglesias

#### jnuneziglesias

Update: See also the clarifications to this post, and submit code by creating an issue in our GitHub repo!

It’s official! Harriet Dashnow, Stéfan van der Walt, and I will be writing an O’Reilly book about the SciPy library and the surrounding ecosystem. The book is called Elegant SciPy, and is intended to teach SciPy to fledgling Pythonistas, guided by the most elegant SciPy code examples we can find.

So, if you recently came across scientific Python code that made you go “Wow!” with its elegance, simplicity, cleverness, or power, please point us to it! As an example, have a look at Vighnesh Birodkar’s code to build a region adjacency graph from an n-dimensional image, which I highlighted previously here.

Each chapter will have one or two snippets that we will work towards. Each of these will be credited as “written by/nominated by”, and needs to be published under a permissive license such as MIT, BSD, or public domain to be considered for inclusion. We would especially like nominations in the following categories:

• statistics (using scipy.stats)
• image processing and computer vision
• Fourier transforms
• sparse matrix operations
• eigendecomposition and linear algebra
• optimization
• streaming data analysis
• spatial and geophysical data analysis

We’ll also consider other parts of the SciPy library and ecosystem. We invite you to submit code snippets for inclusion in the book. We’d also appreciate a small description of about one paragraph explaining what the code is used for and why you think it’s elegant, even though this is often self-evident. =)

### How to submit

Thank you, Juan, Harriet, and Stéfan.
## February 03, 2015

### Matthew Rocklin

#### ReIntroducing Into

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr into efficiently migrates data between formats.

## Motivation

We spend a lot of time migrating data from common interchange formats, like CSV, to efficient computation formats like an array, a database, or a binary store. Worse, many don’t migrate data to efficient formats because they don’t know how or can’t manage the particular migration process for their tools.

Your choice of data format is important. It strongly impacts performance (10x is a good rule of thumb) and who can easily use and interpret your data.

When advocating for Blaze I often say “Blaze can help you query your data in a variety of formats.” This assumes that you’re able to actually get it into that format.

## Enter the into project

The into function efficiently migrates data between formats. These formats include both in-memory data structures like the following:

    list, set, tuple, Iterator
    numpy.ndarray, pandas.DataFrame, dynd.array
    Streaming sequences of any of the above

as well as persistent data living outside of Python like the following:

    CSV, JSON, line-delimited-JSON
    Remote versions of the above
    HDF5 (both standard and Pandas formatting), BColz, SAS
    SQL databases (anything supported by SQLAlchemy), Mongo

The into project migrates data between any pair of these formats efficiently by using a network of pairwise conversions (visualized towards the bottom of this post).

## How to use it

The into function takes two arguments, a source and a target. It moves data in the source to the target. The source and target can take the following forms:

    Target  Source  Example
    Object  Object  A particular DataFrame or list
    String  String  'file.csv', 'postgresql://hostname::tablename'
    Type            Like list or pd.DataFrame

So the following would be valid calls to into:

    >>> into(list, df)  # create new list from Pandas DataFrame
    >>> into([], df)  # append onto existing list
    >>> into('myfile.json', df)  # dump dataframe to line-delimited JSON
    >>> into(Iterator, 'myfiles.*.csv')  # stream through many CSV files
    >>> into('postgresql://hostname::tablename', df)  # migrate dataframe to Postgres
    >>> into('postgresql://hostname::tablename', 'myfile.*.csv')  # load CSVs to Postgres
    >>> into('myfile.json', 'postgresql://hostname::tablename')  # dump Postgres to JSON
    >>> into(pd.DataFrame, 'mongodb://hostname/db::collection')  # dump Mongo to DataFrame

Note that into is a single function. We’re used to doing this with various to_csv, from_sql methods on various types. The into API is very small. Here is what you need in order to get started:

$ pip install into

>>> from into import into

See into on github

## Examples

We now show some of those same examples in more depth.

Turn list into numpy array

>>> import numpy as np
>>> into(np.ndarray, [1, 2, 3])
array([1, 2, 3])

Load CSV file into Python list

>>> into(list, 'accounts.csv')
[(1, 'Alice', 100),
(2, 'Bob', 200),
(3, 'Charlie', 300),
(4, 'Denis', 400),
(5, 'Edith', 500)]

Translate CSV file into JSON

>>> into('accounts.json', 'accounts.csv')
$ head accounts.json
{"balance": 100, "id": 1, "name": "Alice"}
{"balance": 200, "id": 2, "name": "Bob"}
{"balance": 300, "id": 3, "name": "Charlie"}
{"balance": 400, "id": 4, "name": "Denis"}
{"balance": 500, "id": 5, "name": "Edith"}

Translate line-delimited JSON into a Pandas DataFrame

>>> import pandas as pd
>>> into(pd.DataFrame, 'accounts.json')
   balance  id     name
0      100   1    Alice
1      200   2      Bob
2      300   3  Charlie
3      400   4    Denis
4      500   5    Edith

## How does it work?

This is challenging. Robust and efficient conversion between any two pairs of formats is fraught with special cases and bizarre libraries. The common solution is to convert through a common format like a DataFrame, or streaming in-memory lists, dicts, etc. (see dat), or through a serialization format like ProtoBuf or Thrift. These are excellent options and often what you want. Sometimes, however, this can be slow, particularly when dealing with live computational systems or with finicky storage solutions.

Consider, for example, migrating between a numpy.recarray and a pandas.DataFrame. We can migrate this data very quickly in place. The bytes of data don’t need to change, only the metadata surrounding them. We don’t need to serialize to an interchange format or translate to intermediate pure Python objects.

Consider migrating data from a CSV file to a PostgreSQL database. Using Python iterators through SQLAlchemy we rarely exceed migration speeds greater than 2000 records per second. However, using direct CSV loaders native to PostgreSQL we can achieve speeds greater than 50000 records per second. This is the difference between an overnight job and a cup of coffee. However, this requires that we’re flexible enough to use special code in special situations.

Expert pairwise interactions are often an order of magnitude faster than generic solutions. Into is a network of these pairwise migrations. We visualize that network below:

[Figure: Into's decentralized migration scheme. Complex but powerful.]

Each node is a data format. Each directed edge is a function that transforms data between two formats. A single call to into may traverse multiple edges and multiple intermediate formats. For example, when we migrate a CSV file to a Mongo database we might take the following route:

• Load into a DataFrame (pandas.read_csv)
• Convert to np.recarray (DataFrame.to_records)
• Then to a Python Iterator (np.ndarray.tolist)
• Finally to Mongo (pymongo.Collection.insert)

Alternatively we could write a special function that uses MongoDB’s native CSV loader and shortcut this entire process with a direct edge CSV -> Mongo.

To find the most efficient route we weight the edges of this network with relative costs (measured ad hoc). We use networkx to find the shortest path before we start the migration. If for some reason an edge fails (raises NotImplementedError) we can reroute automatically. In this way we’re both efficient and robust to failure.

Note that we color some nodes red. These nodes can be larger than memory. When we migrate between two red nodes (both the input and output may be larger than memory) we limit our path to the red subgraph to ensure that we don’t blow up mid-migration. One format to note is chunks(...), like chunks(DataFrame), which is an iterable of in-memory DataFrames. This convenient meta-format allows us to use compact data structures like numpy arrays and pandas DataFrames on large data while keeping only a few tens of megabytes in memory at a time.

The networked approach allows developers to write specialized code for special situations and know that this code will only be used in the right situation. This approach allows us to handle a very complex problem in an isolated and separable manner. The central dispatching system keeps us sane.

## History

I wrote about into long ago in connection to Blaze. I then promptly shut up about it. This was because the old implementation (before the network approach) was difficult to extend/maintain and wasn’t ready for prime time.

I am very happy with this network. Unexpected applications very often just work, and into is now ready for prime time. It’s also available independently from Blaze, both via conda and via pip. The major dependencies are NumPy, Pandas, and NetworkX, so it’s relatively lightweight for most people who read my blog. If you want to take advantage of some of the higher-performing formats, like HDF5, you’ll need to install those libraries as well (pro-tip: use conda).

## How do I get started?

You should download a recent version.

$ pip install --upgrade git+https://github.com/ContinuumIO/into
or
\$ conda install into --channel blaze


You then might want to go through the first half of this tutorial

Or just give it a shot without reading anything. My hope is that the interface is simple enough (just one function!) that users can pick it up naturally. If you run into issues then I’d love to hear about them at blaze-dev@continuum.io
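The routing machinery described above can be sketched in a few lines of plain Python. The formats and converter functions below are made up for illustration; the real into weights edges by cost and uses networkx rather than this unweighted breadth-first search:

```python
from collections import deque

# Hypothetical pairwise converters: (source_format, target_format) -> function.
converters = {
    ('csv', 'dataframe'): lambda data: f"df({data})",
    ('dataframe', 'records'): lambda data: f"records({data})",
    ('records', 'mongo'): lambda data: f"mongo({data})",
}

def find_path(source, target):
    """Breadth-first search for a chain of conversions from source to target."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        fmt, path = queue.popleft()
        if fmt == target:
            return path
        for (src, dst), func in converters.items():
            if src == fmt and dst not in seen:
                seen.add(dst)
                queue.append((dst, path + [func]))
    raise ValueError(f"no conversion path from {source} to {target}")

def convert(data, source, target):
    # Apply each conversion step along the discovered route.
    for step in find_path(source, target):
        data = step(data)
    return data

print(convert("file.csv", 'csv', 'mongo'))
# mongo(records(df(file.csv)))
```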

## February 02, 2015

### Titus Brown

#### khmer development sprint: Feb 19-20 and 23-25th, 2015

Michael R. Crusoe and I are throwing a sprint!

Somewhat in the vein of last year's mini-Hackathon, Michael and I and other members of the lab are going to focus in on reviewing contributions and closing issues on the khmer project for a 5 day period.

To track the sprint, subscribe to the github issue.

Michael and I will be working ~9-5pm Eastern, Thu/Fri/Mon/Tues/Wed, Feb 19-25th (so weekend excepted), and people are welcome to drop in anytime. We are planning to focus on khmer, screed, khmer-protocols, and khmer-recipes; other lab projects (like paper pipelines or workshop materials) are fair game, too. We'll have a list of issues posted for people who are looking for a small task.

We don't have any particular physical location reserved, but you're welcome to join us at Michigan State University to hang out in person. We also plan to be fully online-enabled within the 9-5pm EDT period, and will have a Google Hangout running that anyone can join. We can also set up one-to-one video conferences and screen sharing with people who need help. (We will not be working outside of those hours: family, sanity, etc.)

• you're interested in seeing the "github flow" model in action, for scientific software (including automated tests, continuous integration, code review, and a pre-merge checklist!);
• you have a particular bug or problem you want to fix in any of our software;
• you work for or with the lab;
• you want to see some of the technical underpinnings of our approach(es) to open science;

Again, subscribe here to be kept in the loop as our plans progress. And check out our WSSPE report on last July's hackathon!

cheers, --titus

## February 01, 2015

### Titus Brown

#### Lab for Data Intensive Biology at UC Davis joins Software Carpentry as an Affiliate

We are pleased to announce that the Laboratory for Data Intensive Biology at UC Davis has joined the Software Carpentry Foundation as an Affiliate Member for three years, starting in January 2015.

"We've been long-term supporters of Software Carpentry, and Affiliate status lets us support the Software Carpentry Foundation in a tangible way," said Dr. C. Titus Brown, the lab director. "This status also gives us the opportunity to include Software Carpentry as part of a larger biological data science training program at UC Davis."

Software Carpentry is a volunteer organisation whose goal is to make scientists more productive, and their work more reliable, by teaching them basic computing skills. Founded in 1998, it runs short, intensive workshops that cover program design, version control, testing, and task automation. It has taught over 10,000 learners, and primarily uses a two-day workshop format.

In October 2014, a non-profit Software Carpentry Foundation was created to act as a governing body for the project, with Dr. Greg Wilson as the founding Executive Director. Affiliate status with the SCF provides preferential access to instructor training as well as waiving per-workshop fees.

The Laboratory for Data Intensive Biology (DIB) is newly started with Dr. Brown's move to UC Davis in January 2015. The lab is in the Department of Population Health and Reproduction in the School of Veterinary Medicine, and is part of both the Data Science Initiative and the Genome Center at UC Davis. DIB does fundamental and applied research on making use of the increasing volume and variety of biological data, and Dr. Brown currently has funding from the NIH, the USDA, and the Moore Foundation.

Read more about Software Carpentry at the Software Carpentry blog, and also see Software Carpentry's announcement of UC Davis' affiliate status.

## January 31, 2015

### The Nipy blog

#### Ugly speaking

I just skip-read this post about Python 3 by Armin Ronacher. I skip-read it because I found it so ugly.

There was a BBC sitcom called One foot in the grave. It had a character called Victor Meldrew who was always angry. His catchphrase was an enraged I can't believe it.

For some reason, this genre seems to be a standard for blog posts. The author just can't believe that X (a language or operating system) could be so stupid as to do Y.

It is a little puzzling, because I guess most people reading someone saying how stupid they are, do not say "ah yes, thank you" but "f#@& you asshole". So, if we do write in that style, we make it less likely the problem will be fixed. I suppose that's obvious to the person writing, so there must be some other thing that is attractive about labeling the other person / system / idea as a loser.

We technical types don't like to think about soft issues like tone. Like the drunk under the lamp-post we prefer technical problems, and ignore other important problems such as the way we are making our readers feel.

When I am reading a blog post of the "I can't believe it" type, I feel like someone with bad breath is shouting in my face, and my first and strong reaction is to move away and avoid that person whenever I see them. For example, I suppose I won't click on a link to blog post by Armin Ronacher again, because I predict it will make me feel bad. I am sure that is a shame for me, because I have every reason to think Mr Ronacher is competent, intelligent and well-informed. It is just I don't want to feel that way. I find it harder to work when I feel that way. I am less gentle with my friends when I feel that way.

### Randy Olson

#### Python usage survey 2014

Remember that Python usage survey that went around the interwebs late last year? Well, the results are finally out and I’ve visualized them below for your perusal. This survey has been running for two years now (2013-2014), so where we have data for both years, I’ve charted the results so we can see the changes in Python usage over time.

I’ll note that a big focus of this survey is to find out if Python users are transitioning over to Python 3, and if they aren’t, then why they aren’t making that transition. If you’re on the fence about switching from Python 2 to 3, there are some great articles out there about the key differences between the two versions and the many, many, …many advantages that Python 3 offers over Python 2.

If you want to check out the underlying data yourself, head on over to Bruno Cauet’s page.

The first big question that we all want to know is: How many Python users have actually used Python 2 and 3?

As expected, nearly every Python user has used Python 2 at some point in their career. We also see good news for Python 3 over the past year: Python 3 usage increased 12 percentage points in 2014, up to nearly 3/4 of the surveyed Python users.

Of course, “writing code with Python 3 once” doesn’t mean that they actually use it regularly. The question below gets at that more directly.

Here we see yet more good news for Python 3: As much as 10% more Python users are primarily using Python 3 than in 2013, now accounting for 1/3 of all Python users. It seems the transition to Python 3 will be slow but steady for the next few years.

The transition we see above may be caused by project managers at work. What version do Python users go to when working on their personal projects?

Here we see a closer divide between the two versions: Nearly half of all users will start up their own projects with Python 3, whereas the other half remains ardently pro-Python 2.

Let’s break these usage patterns down by the specific version now.

Somehow, Python v. 2.5 and 2.6 are still in use in some places, but 2.7 still dominates the Python 2 landscape. We also see telltale signs that Python 3 is becoming a mature language in its own right, with users stuck in the older 3.2 and 3.3 versions.

To summarize so far: Over 2/3 of all Python users are still using Python 2, with the majority of them sitting at 2.7. This tells us that Python is still very much a divided language, with a large portion of the user base unwilling to upgrade to the latest version despite the fact that Python 2 is nearly 5 years old now. (That’s ancient in programming language years!)

A common complaint of the ardent Python 2 users is that Python 3 was a huge mistake. How does that turn out in this survey?

Surprisingly, it seems the complainers are a minority: Only 12% of the Python users surveyed think Python 3 was a mistake. 1/3 of Python users think Python 3 is great (probably the ones who made the switch!), whereas over half of Python users think the Python developers could’ve made the Python 2 -> 3 transition more fluid.

So, most Python users don’t think Python 3 was a mistake, but 2 in 3 of them still haven’t made the switch. That leaves us to wonder: Why haven’t the Python 2 users made the switch yet?

Package dependencies are — by far — the most-cited reasons for refusing to switch to Python 3. Many of us rely on specific packages — such as numpy, scipy, pandas, etc. — for our day-to-day work. If the packages haven’t been upgraded to Python 3 yet, then why should we? Lucky for us, there’s a web site dedicated to tracking the packages that are Python 3 ready. Are all of your packages on there? Then maybe it’s time to make the switch.

Another common reason for refusing to switch to Python 3 is that there’s no incentive. We’re comfortable with Python 2, we know its ins and outs, and everything works fine. Why bother upgrading? If you fall in this category, I’ll point you to this article again that compares the key differences between Python 2 and 3. And don’t forget about the 2 pounds of reasons why Python 3 is a huge upgrade over Python 2.

The last two major reasons for refusing to switch to Python 3 are a little harder to address. If you have a large legacy code base or your manager simply refuses to make the upgrade, then it’s up to you to convince management (or yourself) that the upgrade is worth the work. The links I included in the above paragraph are a good start.

Finally, the survey wanted to look at cross-compatibility between Python 2 and 3.

Here we start to see more good news for the future of Python 3: We saw a significant increase in the number of users who have ported their code from Python 2 to 3.

It’s not very well-known that Python has packages for converting code from Python 2 to 3 and Python 3 to 2. When the survey asked users whether they’d used either of these packages even once, well, see for yourself…

It seems like the Python devs need to do a better job of raising awareness of these code conversion packages.

Because Python 2 and 3 really aren’t that different, it’s not necessary to write code for just one version. As the famous taco shell commercial goes: “¿Por qué no los dos?” (Why not both?)

To that end, the survey also polled users on whether they write Python 2- and 3-compatible code.

Apparently only 1 in 3 Python developers bother to write multi-version compatible code, and there hasn’t been much of a change since 2013.

I look forward to seeing how the Python 3 transition progresses in 2014. If you have any more resources for helping (and convincing!) Python developers to make the transition, please share them in the comments below.

# Advice for applying Machine Learning¶

This post is based on a tutorial given in a machine learning course at University of Bremen. It summarizes some recommendations on how to get started with machine learning on a new problem. This includes

• ways of visualizing your data
• choosing a machine learning method suitable for the problem at hand
• identifying and dealing with over- and underfitting
• dealing with large (read: not very small) datasets
• pros and cons of different loss functions.

The post is based on "Advice for applying Machine Learning" from Andrew Ng. The purpose of this notebook is to illustrate the ideas in an interactive way. Some of the recommendations are debatable. Take them as suggestions, not as strict rules.

In [1]:
import time
import numpy as np
np.random.seed(0)

In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [3]:
# Modified from http://scikit-learn.org/stable/auto_examples/plot_learning_curve.html
from sklearn.learning_curve import learning_curve
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum y-values plotted.

    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds (defaults to 3).
        Specific cross-validation objects can be passed, see
        sklearn.cross_validation module for the list of possible objects
    """
    plt.figure()
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=1, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    # Shade one standard deviation around the mean train/test scores
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.xlabel("Training examples")
    plt.ylabel("Score")
    plt.legend(loc="best")
    plt.grid(True)
    if ylim:
        plt.ylim(ylim)
    plt.title(title)

### Dataset¶

We will generate some simple toy data using sklearn's make_classification function:

In [4]:
from sklearn.datasets import make_classification
X, y = make_classification(1000, n_features=20, n_informative=2,
                           n_redundant=2, n_classes=2, random_state=0)

from pandas import DataFrame
df = DataFrame(np.hstack((X, y[:, None])),
               columns=list(range(20)) + ["class"])


Notice that we generate a dataset for binary classification consisting of 1000 datapoints and 20 feature dimensions. We have used the DataFrame class from pandas to encapsulate the data and the classes into one joint data structure. Let's take a look at the first 5 datapoints:

In [5]:
df[:5]

Out[5]:
0 1 2 3 4 5 6 7 8 9 ... 11 12 13 14 15 16 17 18 19 class
0 -1.063780 0.676409 1.069356 -0.217580 0.460215 -0.399167 -0.079188 1.209385 -0.785315 -0.172186 ... -0.993119 0.306935 0.064058 -1.054233 -0.527496 -0.074183 -0.355628 1.057214 -0.902592 0
1 0.070848 -1.695281 2.449449 -0.530494 -0.932962 2.865204 2.435729 -1.618500 1.300717 0.348402 ... 0.225324 0.605563 -0.192101 -0.068027 0.971681 -1.792048 0.017083 -0.375669 -0.623236 1
2 0.940284 -0.492146 0.677956 -0.227754 1.401753 1.231653 -0.777464 0.015616 1.331713 1.084773 ... -0.050120 0.948386 -0.173428 -0.477672 0.760896 1.001158 -0.069464 1.359046 -1.189590 1
3 -0.299517 0.759890 0.182803 -1.550233 0.338218 0.363241 -2.100525 -0.438068 -0.166393 -0.340835 ... 1.178724 2.831480 0.142414 -0.202819 2.405715 0.313305 0.404356 -0.287546 -2.847803 1
4 -2.630627 0.231034 0.042463 0.478851 1.546742 1.637956 -1.532072 -0.734445 0.465855 0.473836 ... -1.061194 -0.888880 1.238409 -0.572829 -1.275339 1.003007 -0.477128 0.098536 0.527804 0

5 rows × 21 columns

It is hard to get any clue about the problem by looking at the raw feature values directly, even in this low-dimensional example. There are, however, many ways of providing more accessible views of your data; a small subset of these is discussed in the next section.

### Visualization¶

The first step when approaching a new problem should almost always be visualization, i.e., looking at your data.

Seaborn is a great package for statistical data visualization. We use some of its functions to explore the data.

The first step is to generate scatter plots and histograms using seaborn's pairplot function. The two colors correspond to the two classes, and we use a subset of the features and only the first 50 datapoints to keep things simple.

In [6]:
_ = sns.pairplot(df[:50], vars=[8, 11, 12, 14, 19], hue="class", size=1.5)


Based on the histograms, we can see that some features are more helpful for distinguishing the two classes than others. In particular, features 11 and 14 seem to be informative. The scatterplot of those two features shows that the classes are almost linearly separable in this 2D space. A further thing to note is that features 12 and 19 are highly anti-correlated. We can explore correlations more systematically using corrplot:

In [7]:
plt.figure(figsize=(12, 10))
_ = sns.corrplot(df, annot=False)


We can see our observations from before confirmed here: features 11 and 14 are strongly correlated with the class (they are informative). They are also strongly correlated with each other (via the class). Furthermore, feature 12 is highly anti-correlated with feature 19, which in turn is correlated with feature 14. We thus have some redundant features. This can be problematic for some classifiers, e.g., naive Bayes, which assumes that the features are independent given the class. The remaining features are mostly noise; they are neither correlated with each other nor with the class.
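If you only want to double-check a suspected correlation between a specific pair of features, numpy's corrcoef is enough; this sketch uses synthetic anti-correlated data rather than the dataset above:

```python
import numpy as np

rng = np.random.RandomState(0)
a = rng.randn(1000)
b = -a + 0.1 * rng.randn(1000)  # strongly anti-correlated with a

# The off-diagonal entry of the 2x2 correlation matrix is the
# Pearson correlation coefficient between a and b
r = np.corrcoef(a, b)[0, 1]
print(r)  # close to -1
```

On the dataset above, the same call on two feature columns (e.g. `np.corrcoef(X[:, 12], X[:, 19])[0, 1]`) gives the value plotted in the corresponding corrplot cell.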

Notice that data visualization becomes more challenging as the number of feature dimensions grows and the number of datapoints shrinks. We give an example of visualizing high-dimensional data later.

### Choice of the method¶

Once we have visually explored the data, we can start applying machine learning to it. Given the wealth of methods for machine learning, it is often not easy to decide which method to try first. This simple cheat-sheet (credit goes to Andreas Müller and the sklearn-team) can help to select an appropriate ML method for your problem (see http://dlib.net/ml_guide.svg for an alternative cheat sheet).

In [8]:
from IPython.display import Image
Image(filename='ml_map.png', width=800, height=600)

Out[8]: