## May 21, 2015

### Titus Brown

#### Comparing and evaluating assembly with graph alignment

One of our long-term interests has been in figuring out what the !$!$!#!#%! assemblers actually do to real data, given all their heuristics. A continuing challenge in this space is that short-read assemblers deal with really large amounts of noisy data, and it can be extremely hard to look at assembly results without running into this noise head-on. It turns out that being able to label De Bruijn graphs efficiently and align reads to graphs can help us explore assemblies in a variety of ways.

The two basic challenges are noisy data and lots of data. When (for example) looking at what fraction of reads has been incorporated into an assembly, noise causes problems because a read may have been corrected during assembly. This is where graph alignment comes in handy, because we can use it to align reads to the full graph and get rid of much of this noise. Lots of data complicates things because it's very hard to look at reads individually - you need to treat them in aggregate, and it's much easier just to look at the reads that match to your assembly than to investigate the oddball reads that don't assemble. And this is where the combination of graph alignment and labeling helps, because it's easy to count and extract reads based on overlaps with labels, as well as to summarize those overlaps.

The main question we will be asking below is: can we measure overlaps and disjoint components in graph extents, that is, in unique portions of assembly graphs? We will be doing this using our sparse graph instead of counting nodes or k-mers, for two reasons: first, the covering is largely independent of coverage, and second, the number of sparse nodes is a lot smaller than the total number of k-mers.

The underlying approach is straightforward:

• load contigs or reads from A into the graph, tagging sparse nodes as we go;
• load contigs or reads from B into the graph, tagging sparse nodes as we go;
• count the number of tagged nodes that are unique to A, unique to B, and in the overlap;
• optionally do graph alignment as you load in reads, to ignore errors.

## Some basics

Let's start with simulations, as usual. We'll set up two randomly generated chromosomes, a and b, of equal size, both in genomes.fa, and look at genome-a extent within the context of both (target 'fake_a' in Makefile):

./compare-graphs.py genomes.fa genome-b.fa
all tags: 52
n tags in A: 52
n tags in B: 26
tags in A but not in B 26
tags in B but not in A 0


So far so good -- there's a 50% overlap between one of the chromosomes and the total.

If we now generate reads from genome-b.fa and do the graph comparison with the reads, we get silly results (target 'fake_b' in Makefile):

./compare-graphs.py genomes.fa reads-b.fa
all tags: 135
n tags in A: 109
n tags in B: 107
tags in A but not in B 28
tags in B but not in A 26


Despite knowing by construction that all of the reads came from genome-b, we're getting results that there's a lot of tags in the reads that aren't in the genome. This is because of errors in the reads, which introduce many spurious branches in the graph.

This is now where the read aligner comes in; we can do the same comparison, but this time we can ask that the reads be aligned to the genome, thus eliminating most of the errors in the comparison:

./compare-graphs.py genomes.fa reads-b.fa --align-b
all tags: 99
n tags in A: 99
n tags in B: 72
tags in A but not in B 27
tags in B but not in A 0


At this point we can go in and look at the original tags in A that aren't covered in B (there are 52) and note that B is missing approximately half of the graph extent in A.

## Trying it out on some real data

Let's try evaluating a reference against some low-coverage reads. Using the same mouse reference transcriptome & subset of reads that we've been using in previous blog posts, we can ask "how many sparse nodes are unaccounted for in the mouse transcriptome when we look at the reads?" (Note, the mouse transcriptome was not generated from this data set; this is the reference transcriptome.)

The answer (target rna-compare-noalign.txt in the Makefile) is:

all tags: 1959121
n tags in A: 1878475
n tags in B: 644963
tags in A but not in B 1314158
tags in B but not in A 80646


About 12.5% of the reads in (B; 80646 / 644963) don't pick up tags in the official reference transcriptome (A).

Interestingly, the results with alignment are essentially the same (target rna-compare-align.txt):

all tags: 1958219
n tags in A: 1877685
n tags in B: 643655
tags in A but not in B 1314564
tags in B but not in A 80534


suggesting that, by and large, these reads are disjoint from the existing assembly, and not mere sequencing errors. (This may be because we require that the entire read be mappable to the graph in order to count it, though.)

## Evaluating trimming

One of the interesting questions that's somewhat hard to investigate in terms of transcriptome assembly is, how beneficial is read trimming to the assembly? The intuition here (that I agree with) is that generally sequence trimming lowers the effective coverage for assembly, and hence loses you assembled sequence. Typically this is measured by running an assembler against the reads, which is slightly problematic because the assembler could have all sorts of strange interactions with the trimming.

So, can we look at the effect of trimming in terms of sparse nodes? Sure!

Suppose we do a stringent round of trimming on our RNAseq (Trimmomatic SLIDINGWINDOW:4:30) - what do we lose?

On this low coverage data set, where A is the graph formed from the trimmed reads and B is the graph from the raw reads, we see (target rseq-hardtrim-ba-noalign.txt):

all tags: 588615
n tags in A: 518980
n tags in B: 588615
tags in A but not in B 0
tags in B but not in A 69635


we see about 12% of the sparse nodes missing from the trimmed data.

If we run the read aligner with a low coverage cutoff (target rseq-hardtrim-ba-align1.txt), we see:

all tags: 569280
n tags in A: 519396
n tags in B: 561757
tags in A but not in B 7523
tags in B but not in A 49884


Basically, we recover about 20,000 tags in B (69,635 - 49,884) with alignment vs exact matches, so a few percent; but we also lose about half that (7,500) for reasons that we don't entirely understand (wiggle in the graph aligner?)

We have no firm conclusions here, except to say that this should be a way to evaluate the effect of different trimming on graph extent, which should be more reliable than looking at the effect on assemblies.

## Notes and miscellany

• There is no inherent coverage model embedded here, so as long as we can correct for the density of tags, we can apply these approaches to genomes, metagenomes, and transcriptomes.
• It's actually very easy to extract the reads that do or don't match, but our current scripts don't let us do so based on labels.
• We aren't really using the labeling here, just the tagging - but labeling can enable n-way comparisons between e.g. different assemblies and different treatments, because it lets us examine which tags show up in different combinations of data sets.

### Appendix: Running this code

The computational results in this blog post are Rather Reproducible (TM). Please see https://github.com/dib-lab/2015-khmer-wok5-eval/blob/master/README.rst for instructions on replicating the results on a virtual machine or using a Docker container.

### Continuum Analytics

#### Conda for Data Science

tl; dr: We discuss how data scientists working with Python, R, or both can benefit from using conda in their workflow.

Conda is a package and environment manager that can help data scientists manage their project dependencies and easily share environments with their peers. Conda works with Linux, OSX, and Windows, and is language agnostic, which allows us to use it with any programming language or even multi-language projects.

This post explores how to use conda in a multi-language data science project. We’ll use a project named topik, which combines Python and R libraries, as an example.

## May 20, 2015

### Titus Brown

#### Labeling a sparse covering of a De Bruijn graph, and utility thereof

So far, in this week of khmer blog posts (1, 2, 3), we've been focusing on the read-to-graph aligner ("graphalign"), which enables sequence alignments to a De Bruijn graph. One persistent challenge with this functionality as introduced is that our De Bruijn graphs nodes are anonymous, so we have no way of knowing the sources of the graph sequences to which we're aligning.

Without being able to label the graph with source sequences and coordinates, we can't do some pretty basic things, like traditional read mapping, counting, and variant calling. It would be nice to be able to implement those in a graph-aware manner, we think.

To frame the problem, graphalign lets us query into graphs in a flexible way, but we haven't introduced any way to link the matches back to source sequences. There are several things we could do -- one basic idea is to annotate each node in the graph -- but what we really want is a lightweight way to build a labeled graph (aka "colored graph" in Iqbal parlance).

This is where some nice existing khmer technology comes into play.

## Partitioning, tagging, and labelhash

Back in 2012, we published a paper (Pell et al., 2012) that introduced a lightweight representation of implicit De Bruijn graphs. Our main purpose for this representation was something called "partitioning", in which we identified components (disconnected subgraphs) of metagenome assembly graphs for the purpose of scaling metagenome assembly.

A much underappreciated part of the paper is buried in the Materials,

For discovering large components we tag the graph at a minimum density by using the underlying reads as a guide. We then exhaustively explore the graph around these tags in order to connect tagged k-mers based on graph connectivity. The underlying reads in each component can then be separated based on their partition.

The background is that we were dealing with extremely large graphs (30-150 billion nodes), and we needed to exhaustively explore the graphs in order to determine if any given node was transitively connected to any other node; from this, we could determine which nodes belonged to which components. We didn't want to label all the nodes in the graph, or traverse from all the nodes, because this was prohibitive computationally.

### A sparse graph covering

To solve this problem, we built what I call a sparse graph covering, in which we chose a subset of graph nodes called "tags" such that every node in the graph was within a distance 'd' of a tag. We then used this subset of tags as a proxy for the graph structure overall, and could do things like build "partitions" of tags representing disconnected components. We could guarantee the distance 'd' by using the reads themselves as guides into the graph (Yes, this was one of the trickiest bits of the paper. ;)

Only later did I realize that this tagging was analogous to sparse graph representations like succinct De Bruijn graphs, but that's another story.

The long and short of it is this: we have a nice, simple, robust, and somewhat lightweight way to label graph paths. We also have functionality already built in to exhaustively explore the graph around any node and collect all tagged nodes within a given distance.

What was missing was a way to label these nodes efficiently and effectively, with multiple labels.

### Generic labeling

Soon after Camille Scott, a CS graduate student at MSU (and now at Davis), joined the lab, she proposed an expansion to the tagging code to enable arbitrary labels on the tags. She implemented this within khmer, and built out a nice Python API called "labelhash".

With labelhash, we can do things like this:

lh = khmer.CountingLabelHash(...)
lh.consume_fasta_and_tag_with_labels(sequence_file)


and then query labelhash with specific sequences:

labels = lh.sweep_label_neighborhood(query, dist)


where 'labels' now contains the labels of all tags that overlap with 'query', including tags that are within an optional distance 'dist' of any node in query.

Inconveniently, however, this kind of query was only useful when what you were looking for was in the graph already; it was a way to build an index of sequences, but fuzzy matching wasn't possible. With the high error rate of sequencing and high polymorphism rates in things we worked on, we were worried about its poor effectiveness.

### Querying via graphalign, retrieving with labelhash

This is where graphalign comes in - we can query into the graph in approximate ways, and retrieve a path that's actually in the graph from the query. This is essentially like doing a BLASTN query into the graph. And, combined with labelhash, we can retrieve the reference sequence(s) that match to the query.

This is roughly what it looks like, once you've built a labelhash as above. First, run the query:

aligner = khmer.ReadAligner(lh.graph, trusted_coverage, 1.0)
score, graph_path, query_path, is_truncated = aligner.align(query)


and then retrieve the associated labels:

labels = lh.sweep_label_neighborhood(graph_path)


...which you can then use with a preexisting database of the sequence.

### Why would you do any of this?

If this seems like an overly complicated way of doing a BLAST, here are some things to consider:

• when looking at sequence collections that share lots of sequence this is an example of "compressive computing", in which the query is against a compressed representation of the database. In particular, this type of solution might be good when we have many, many closely related genomes and we want to figure out which of them have a specific variant.
• graphs are notoriously heavyweight in general, but these graphs are actually quite low memory.
• you can do full BLASTX or protein HMM queries against these graphs as well. While we haven't implemented that in khmer, both a BLAST analog and a HMMER analog have been implemented on De Bruijn graphs.
• another specific use case is retrieving all of the reads that map to a particular region of an assembly graph; this is something we were very interested in back when we were trying to figure out why large portions of our metagenomes were high coverage but not assembling.

One use case that is not well supported by this scheme is labeling all reads - the current label storage scheme is too heavyweight to readily allow for millions of labels, although it's something we've been thinking about.

## Some examples

We've implemented a simple (and, err, somewhat hacky) version of this in make-index.py and do-align.py.

To see them in action, you'll need the 2015-wok branch of khmer, and a copy of the prototype (https://github.com/dib-lab/2015-khmer-wok4-multimap) -- see the README for full install instructions.

Then, type:

make fake


and you should see something like this (output elided):

./do-align.py genomes reads-a.fa



showing that we can correctly assign reads sampled from randomly constructed genomes - a good test case :).

### Assigning reads to reference genomes

We can also index a bunch of bacterial genomes and map against all of them simultaneously -- target 'ecoli' will map reads from E. coli P12B against all Escherichia genomes in NCBI. (Spoiler alert: all of the E. coli strains are very closely related, so the reads map to many references!)

It turns out to be remarkably easy to implement a counting-via-mapping approach -- see do-counting.py. To run this on the same RNAseq data set as in the counting blog post, run build the 'rseq.labelcount' target.

## Flaws in our current implementation

A few points --

• we haven't introduced any positional labeling in the above labels, so all we can do is retrieve the entire sequence around submatches. This is enough to do some things (like counting transcripts) but for many purposes (like pileups / variant calling via mapping) we would need to do something with higher resolution.
• there's no reason we couldn't come up with different tagging and labeling schemes that focus on features of interests - specific variants, or branch points for isoforms, or what have you. Much of this is straightforward and can be done via the Python layer, too.
• "labeled De Bruijn graphs" are equivalent in concept to "colored De Bruijn graphs", but we worry that "colored" is already a well-used term in graph theory and we are hoping that we can drop "colored" in favor of "labeled".

### Appendix: Running this code

The computational results in this blog post are Rather Reproducible (TM). Please see https://github.com/dib-lab/2015-khmer-wok4-labelhash/blob/master/README.rst for instructions on replicating the results on a virtual machine or using a Docker container.

## May 19, 2015

### Titus Brown

#### Abundance counting of sequences in graphs with graphalign

De Bruijn graph alignment should also be useful for exploring concepts in transcriptomics/mRNAseq expression. As with variant calling graphalign can also be used to avoid the mapping step in quantification; and, again, as with the variant calling approach, we can do so by aligning our reference sequences to the graph rather than the reads to the reference sequences.

The basic concept here is that you build a (non-abundance-normalized) De Bruijn graph from the reads, and then align transcripts or genomic regions to the graph and get the k-mer counts across the alignment. This is nice because it gives you a few options for dealing with multimapping issues as well as variation across the reference. You can also make use of the variant calling code to account for certain types of genomic/transcriptomic variation and potentially address allelic bias issues.

Given the existence of Sailfish/Salmon and the recent posting of Kallisto, I don't want to be disingenuous and pretend that this is any way a novel idea! It's been clear for a long time that using De Bruijn graphs in RNAseq quantification is a worthwhile idea. Also, whenever someone uses k-mers to do something in bioinformatics, there's an overlap with De Bruijn graph concepts (...pun intended).

What we like about the graphalign code in connection with transcriptomics is that it makes a surprisingly wide array of things easy to do. By eliminating or at least downgrading the "noisiness" of queries into graphs, we can ask all sorts of questions, quickly, about read counts, graph structure, isoforms, etc. Moreover, by building the graph with error corrected reads, the counts should in theory become more accurate. (Note that this does have the potential for biasing against low-abundance isoforms because low-coverage reads can't be error corrected.)

For one simple example of the possibilities, let's compare mapping counts (bowtie2) against transcript graph counts from the graph (khmer) for a small subset of a mouse mRNAseq dataset. We measure transcript graph counts here by walking along the transcript in the graph and averaging over k-mer counts along the path. This is implicitly a multimapping approach; to get results comparable to bowtie2's default parameters (which random-map), we divide out the number of transcripts in which each k-mer appears (see count-median-norm.py, 'counts' vs 'counts2').

This graph shows some obvious basic level of correlation, but it's not great. What happens if we use corrected mRNAseq reads (built using graphalign)?

This looks better - the correlation is about the same, but when we inspect individual counts, they have moved further to the right, indicating (hopefully) greater sensitivity. This is to be expected - error correction is collapsing k-mers onto the paths we're traversing, increasing the abundance of each path on average.

What happens if we now align the transcripts to the graph built from the error corrected reads?

Again, we see mildly greater sensitivity, due to "correcting" transcripts that may differ only by a base or two. But we also see increased counts above the main correlation, especially above the branch of counts at x = 0 (poor graph coverage) but with high mapping coverage - what gives? Inspection reveals that these are reads with high mapping coverage but little to no graph alignment. Essentially, the graph alignment is getting trapped in a local region. There are at least two overlapping reasons for this -- first, we're using the single seed/local alignment approach (see error correction) rather than the more generous multiseed alignment, and so if the starting point for graph alignment is poorly chosen, we get trapped into a short alignment. Second, in all of these cases, the transcript isn't completely covered by reads, a common occurrence due to both low coverage data as well as incomplete transcriptomes.

In this specific case, the effect is largely due to low coverage; if you drop the coverage further, it's even more exacerbated.

Two side notes here -- first, graphalign will align to low coverage (untrusted) regions of the graph if it has to, although the algorithm will pick trusted k-mers when it can. As such it avoids the common assembler problem of only recovering high abundance paths.

And second, no one should use this code for counting. This is not even a proof of concept, but rather an attempt to see how well mapping and graph counting fit with an intentionally simplistic approach.

## Isoform structure and expression

Another set of use cases worth thinking about is looking at isoform structure and expression across data sets. Currently we are somewhat at the mercy of our reference transcriptome, unless we re-run de novo assembly every time we get a new data set. Since we don't do this, for some model systems (especially emerging model organisms) isoform families may or may not correspond well to the information in the individual samples. This leads to strange-looking situations where specific transcripts have high coverage in one region and low coverage in another (see SAMmate for a good overview of this problem.)

Consider the situation where a gene with four exons, 1-2-3-4, expresses isoform 1-2-4 in tissue A, but expresses 1-3-4 in tissue B. If the transcriptome is built only from data from tissue A, then when we map reads from tissue B to the transcriptome, exon 2 will have no coverage and counts from exon 3 will (still) be missing. This can lead to poor sensitivity in detecting low-expressed genes, weird differential splicing results, and other scientific mayhem.

(Incidentally, it should be clear from this discussion that it's kind of insane to build "a transcriptome" once - what we really want do is build a graph of all relevant RNAseq data where the paths and counts are labeled with information about the source sample. If only we had a way of efficiently labeling our graphs in khmer! Alas, alack!)

With graph alignment approaches, we can short-circuit the currently common ( mapping-to-reference->summing up counts->looking at isoforms ) approach, and go directly to looking directly at counts along the transcript path. Again, this is something that Kallisto and Salmon also enable, but there's a lot of unexplored territory here.

We've implemented a simple, short script to explore this here -- see explore-isoforms-assembled.py, which correctly picks out the exon boundaries from three simulated transcripts (try running it on 'simple-mrna.fa').

### Other thoughts

• these counting approaches can be used directly on metagenomes as well, for straight abundance counting as well as analysis of strain variation. This is of great interest to our lab.
• calculating differential expression on an exonic level, or at exon-exon junctions, is also an interesting direction.

### References and previous work

• Kallisto is the first time I've seen paths in De Bruin graphs explicitly used for RNAseq quantification rather than assembly. Kallisto has some great discussion of where this can go in the future (allele specific expression being one very promising direction).
• There are lots of De Bruijn graph based assemblers for mRNAseq (Trinity, Oases, SOAPdenovo-Trans, and Trans-ABySS.

### Appendix: Running this code

The computational results in this blog post are Rather Reproducible (TM). Please see https://github.com/dib-lab/2015-khmer-wok3-counting/blob/master/README.rst for instructions on replicating the results on a virtual machine or using a Docker container.

### Matthew Rocklin

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr We lay out the pieces of Dask, a system for parallel computing

## Introduction

Dask started five months ago as a parallel on-disk array; it has since broadened out. I’ve enjoyed writing about its development tremendously. With the recent 0.5.0 release I decided to take a moment to give an overview of dask’s various pieces, their state, and current development.

## Collections, graphs, and schedulers

Dask modules can be separated as follows:

On the left there are collections like arrays, bags, and dataframes. These copy APIs for NumPy, PyToolz, and Pandas respectively and are aimed towards data science users, allowing them to interact with larger datasets. Operations on these dask collections produce task graphs which are recipes to compute the desired result using many smaller computations that each fit in memory. For example if we want to sum a trillion numbers then we might break the numbers into million element chunks, sum those, and then sum the sums. A previously impossible task becomes a million and one easy ones.

On the right there are schedulers. Schedulers execute task graphs in different situations, usually in parallel. Notably there are a few schedulers for a single machine, and a new prototype for a distributed scheduler.

In the center is the directed acyclic graph. This graph serves as glue between collections and schedulers. The dask graph format is simple and doesn’t include any dask classes; it’s just functions, dicts, and tuples and so is easy to build on and low-tech enough to understand immediately. This separation is very useful to dask during development; improvements to one side immediately affect the other and new developers have had surprisingly little trouble. Also developers from a variety of backgrounds have been able to come up to speed in about an hour.

This separation is useful to other projects too. Directed acyclic graphs are popular today in many domains. By exposing dask’s schedulers publicly, other projects can bypass dask collections and go straight for the execution engine.

A flattering quote from a github issue:

dask has been very helpful so far, as it allowed me to skip implementing all of the usual graph operations. Especially doing the asynchronous execution properly would have been a lot of work.

Dask developers work closely with a few really amazing users:

1. Stephan Hoyer at Climate Corp has integrated dask.array into xray a library to manage large volumes of meteorlogical data (and other labeled arrays.)

2. Scikit image now includes an apply_parallel operation (github PR) that uses dask.array to parallelize image processing routines. (work by Blake Griffith)

3. Mariano Tepper a postdoc at Duke, uses dask in his research on matrix factorizations. Mariano is also the primary author of the dask.array.linalg module, which includes efficient and stable QR and SVD for tall and skinny matrices. See Mariano’s paper on arXiv.

4. Finally I personally use dask on daily work related to the XData project. This tends to drive some of the newer features.

A few other groups pop up on github from time to time; I’d love to know more detail about how people use dask.

## What works and what doesn’t

Dask is modular. Each of the collections and each of the schedulers are effectively separate projects. These subprojects are at different states of development. Knowing the stability of each subproject can help you to determine how you use and depend on dask.

Dask.array and dask.threaded work well, are stable, and see constant use. They receive relatively minor bug reports which are dealt with swiftly.

Dask.bag and dask.multiprocessing undergo more API churn but are mostly ready for public use with a couple of caveats. Neither dask.dataframe nor

dask.distributed are ready for public use; they undergo significant API churn and have known errors.

## Current work

The current state of development as I see it is as follows:

1. Dask.bag and dask.dataframe are progressing nicely. My personal work depends on these modules, so they see a lot of attention.
• At the moment I focus on grouping and join operations through fast shuffles; I hope to write about this problem soon.
• The Pandas API is large and complex. Reimplementing a subset of it in a blocked way is straightforward but also detailed and time consuming. This would be a great place for community contributions.
2. Dask.distributed is new. It needs it tires kicked but it’s an exciting development.
• For deployment we’re planning to bootstrap off of IPython parallel which already has decent coverage of many parallel job systems, (see #208 by Blake)
3. Dask.array development these days focuses on outreach. We’ve found application domains where dask is very useful; we’d like to find more.
4. The collections (Array, Bag, DataFrame) don’t cover all cases. I would like to start finding uses for the task schedulers in isolation. They serve as a release valve in complex situations.

You can install dask with conda

conda install dask


or with pip

pip install dask
or
or


## May 18, 2015

### Titus Brown

#### Graph alignment and variant calling

There's an interesting and intuitive connection between error correction and variant calling - if you can do one well, it lets you do (parts of) the other well. In the previous blog post on some new features in khmer, we introduced our new "graphalign" functionality, that lets us align short sequences to De Bruijn graphs, and we discussed how we use it for error correction. Now, let's try it out for some simple variant calling!

Graphalign can potentially be used for variant calling in a few different ways - by mapping reads to the reference graph and then using a pileup approach, or by error correcting reads against the graph with a tunable threshold for errors and then looking to see where all the reads disagree - but I've become enamored of an approach based on the concept of reference-guided assembly.

The essential idea is to build a graph that contains the information in the reads, and then "assemble" a path through the graph using a reference sequence as a guide. This has the advantage of looking at the reads only once (to build a DBG, which can be done in a single pass), and also potentially being amenable to a variety of heuristics. (Like almost all variant calling, it is limited by the quality of the reference, although we think there are probably some ways around that.)

## Basic graph-based variant calling

Implementing this took a little bit of extra effort beyond the basic read aligner, because we want to align past gaps in the graph. The way we implemented this was to break the reference up into a bunch of local alignments, each aligned independently, then stitched together.

Again, we tried to keep the API simple. After creating a ReadAligner object,

aligner = khmer.ReadAligner(graph, trusted_cutoff, bits_theta)


there's a single function that takes in the graph and the sequence (potentially genome/chr sized) to align:

score, alignment = align_long(graph, aligner, sequence)


What is returned is a score and an alignment object that gives us access to the raw alignment, some basic stats, and "variant calling" functionality - essentially, reporting of where the alignments are not identical. This is pretty simple to implement:

for n, (a, b) in enumerate(zip(graph_alignment, read_alignment)):
if a != b:
yield n, a, b


The current implementation of the variant caller does nothing beyond reporting where an aligned sequence differs from the graph; this is kind of like guided assembly. In the future, the plan is to extend it with reference-free assembly.

To see this in action for a simulated data set, look at the file sim.align.out -- we get alignments like this, highlighting mismatches:

ATTTTGTAAGTGCTCTATCCGTTGTAGGAAGTGAAAGATGACGTTGCGGCCGTCGCTGTT
|||||||||||||||||||| |||||||||||||||||||||||||||||||||||||||
ATTTTGTAAGTGCTCTATCCCTTGTAGGAAGTGAAAGATGACGTTGCGGCCGTCGCTGTT


(Note that the full alignment shows there's a bug in the read aligner at the ends of graphs. :)

It works OK for whole-genome bacterial stuff, too. If we take an E. coli data set (the same one we used in the semi-streaming paper) and just run the reads against the known reference genome, we'll get 74 differences between the graph and the reference genome, out of 4639680 positions -- an identity of 99.998% (variants-ecoli.txt). On the one hand, this is not that great (consider that for something the size of the human genome, with this error rate we'd be seeing 50,000 false positives!); on the other hand, as with error correction, the whole analysis stack is surprisingly simple, and we haven't spent any time tuning it yet.

## Simulated variants, and targeted variant calling

With simulated variants in the E. coli genome, it does pretty well. Here, rather than changing up the genome and generating synthetic reads, we went with the same real reads as before, and instead changed the reference genome we are aligning to the reads. This was done with the patch-ecoli.py script, which changes an A to a C at position 500,000, removes two bases at position 2m, and adds two bases at position 3m.

When we align the "patched" E. coli genome against the read graph, we indeed recover all three alignments (see variants-patched.txt) in the background of the same false positives we saw in the unaltered genome. So that's kind of nice.

What's even neater is that we can do targeted variant calling directly against the graph -- suppose, for example, that we're interested in just a few regions of the reference. With the normal mapping-based variant calling, you need to map all the reads first before querying for variants by location, because mapping requires the use of the entire reference. Here, you are already looking at all the reads in the graph form, so you can query just the regions you're interested in.

So, for example, here you can align just the patched regions (in ecoli-patched-segments.fa) against the read graph and get the same answer you got when aligning the entire reference (target ecoli-patched-segments.align.out). This works in part because we're stitching together local alignments, so there are some caveats in cases where different overlapping query sequences might lead to different optimal alignments - further research needed.

## Speed considerations

Once you've created the graph (which is linear time with respect to the number of reads), things are pretty fast. For the E. coli data set, it takes about 25 seconds to do a full reference-to-graph alignment on my Mac laptop. Much of the code is still written in Python so we hope to get this under 5 seconds.

In the future, we expect to get much faster. Since the alignment is guided and piecewise, it should be capable of aligning through highly repetitive repeats and is also massively parallelizable. We think that the main bottleneck is going to be loading in the reads. We're working on optimizing the loading separately, but we're hoping to get down to about 8 hours for a full ~50x human genome variant calling with this method on a single CPU.

## Memory considerations

The memory is dominated by graph size, which in turn is dominated by the errors in short-read Illumina data. We have efficient ways of trimming some of these errors, and/or compressing down the data, even if we don't just correct them; the right approach will depend on details of the data (haploid? diploid? polyploid?) and will have to be studied.

For E. coli, we do the above variant calling in under 400 MB of RAM. We should be able to get that down to under 100 MB of RAM easily enough, but we will have to look into exactly what happens as we compress our graph down.

From the Minia paper, we can place some expectations on the memory usage for diploid human genome assembly. (We don't use cascading Bloom filters, but our approaches are approximately equivalent.) We believe we can get down to under 10 GB of RAM here.

As with most of our methods, this approach should work directly for variant calling on RNAseq and metagenomic data with little alteration. We have a variety of graph preparation methods (straight-up graph loading as well as digital normalization and abundance slicing) that can be applied to align to everything while favoring high-coverage reads, or only to high coverage, or to error-trimmed reads, or...

In effect, what we're doing is (rather boring) reference-guided assembly. Wouldn't it be nice if we extended it to longer indels, as in Holtgrewe et al., 2015? Yes, it would. Then we could ask for an assembly to be done between two points... This would enable the kinds of approaches that (e.g.) Rimmer et al., 2014 describe.

One big problem with this approach is that we're only returning positions in the reference where the graph has no agreement - this will cause problems when querying diploid data sets with a single reference, where we really want to know all variants, including heterozygous ones where the reference contains one of the two. We can think of several approaches to resolving this, but haven't implemented them yet.

A related drawback of this approach so far is that we have (so far) presented no way of representing multiple data sets in the same graph; this means that you can't align to many different data sets all at once. You also can't take advantage of things like the contiguity granted by long reads in many useful ways, nor can you do haplotyping with the long reads. Stay tuned...

## References and previous work

A number of people have done previous work on graph-based variant calling --

• Zam Iqbal and Mario Caccamo's Cortex is the first article that introduced me to this area. Since then, Zam's work as well as some of the work that Jared Simpson is doing on FM indices has been a source of inspiration.

(See especially Zam's very nice comment on our error correction post!)

• Heng Li's FermiKit does something very similar to what we're proposing to do, although it seems like he effectively does an assembly before calling variants. This has some positives and some negatives that we'll have to explore.

• Kimura and Koike (2015) do variant calling on a Burrows- Wheeler transform of short-read data, which is very similar to what we're doing.

• Using k-mers to find variation is nothing new. Two articles that caught my eye -- BreaKmer (Abo et al, 2015) and kSNP3 (Gardner et al., 2015) both do this to great effect.

• the GA4GH is working on graph-based variant calling, primarily for human. So far it seems like they are planning to rely on well curated genomes and variants; I'm going to be working with (much) poorer quality genomes, which may account for some differences in how we're thinking about things.

## Appendix: Running this code

The computational results in this blog post are Rather Reproducible (TM). Please see https://github.com/dib-lab/2015-khmer-wok2-vc/blob/master/README.rst for instructions on replicating the results on a virtual machine or using a Docker container.

## May 17, 2015

### Titus Brown

#### Read-to-graph alignment and error correction

One of the newer features in khmer that we're pretty excited about is the read-to-graph aligner, which gives us a way to align sequences to a De Bruijn graph; our nickname for it is "graphalign."

Briefly, graphalign uses a pair-HMM to align a sequence to a k-mer graph (aka De Bruijn graph) allowing both mismatches and indels, and taking into account coverage using a binary model (trusted and untrusted k-mers). The core code was written by Jordan Fish when he was a graduate student in the lab, based on ideas stemming from Jason Pell's thesis work on error correction. It was then refactored by Michael Crusoe.

Graphalign actually lets us do lots of things, including align both short and long sequences to DBG graphs, error correct, and call variants. We've got a simple Python API built into khmer, and we're working to extend it.

The core graphalign API is based around the concept of a ReadAligner object:

aligner = khmer.ReadAligner(graph, trusted_cov, bits_theta)


where 'graph' is a De Bruijn graph (implemented as a counting table in khmer), 'trusted_cov' defines what the trusted k-mer coverage is, and 'bits_theta' adjusts a scoring parameter used to extend alignments.

The 'aligner' object can be used to align short sequences to the graph:

score, graph_alignment, read_alignment, truncated = \


Here, 'graph_alignment' and 'read_alignment' are strings; if 'truncated' is false, then they are of the same length, and constitute a full gapped alignment of the DNA sequence in 'read' to the graph.

The approach used by 'align' is to seed an alignment at the first trusted k-mer, and then extend the alignment along the graph in both directions. Thus, it's effectively a local aligner.

## Error correction

Our initial motivation for graphalign was to use it to do error correction, with specific application to short-read sequences. There was (and to some extent still is) a dearth of error correction approaches that can be used for metagenome and transcriptome data sets, and since that kind of data is what our lab works on, we needed an error correction approach for those data. We also wanted something a bit more programmable than the existing error correctors, which were primarily command-line tools; we've found a lot of value in building libraries, and wanted to use that approach here, too.

The basic idea is this: we build a graph from our short-read data, and then go back through and align each short read to the graph. A successful alignment is then the corrected read. The basic code looks like this:

graph = build_graph(dataset)



In conjunction with our work on semi-streaming algorithms, we can directly convert this into a semi-streaming algorithm that works on genomes, metagenomes, and transcriptomes. This is implemented in the correct-reads script.

## Some results

If we try this out on a simulated data set (random genome, randomly chosen reads - see target compare-sim.txt in Makefile), it takes the simulated data from an error rate of around 1% to about 0.1%; see compare-sim.txt.

Applying this to a ~7m read subset of mRNAseq that we tackled in the semi-streaming paper (the data itself is from the Trinity paper, Grabherr et al, 2011), we take the data from an error rate of about 1.59% to 0.98% (see target rseq-compare.txt in Makefile). There are several reasons why this misses so many errors - first, error correction depends on high coverage, and much of this RNAseq data set is low coverage; second, this data set has a lot of errors; and third, RNAseq may have a broader k-mer abundance distribution than genomic sequencing.

One important side note: we use exactly the same script for error correcting RNAseq data as we do for genomic data.

## How good is the error correction?

tl; dr? It's pretty good but still worse than current methods. When we compare to Quake results on an E. coli data set (target compare-ecoli.txt in the Makefile), we see:

Data set Error rate
Uncorrected 1.587%
Quake 0.009%
khmer 0.013%

This isn't too bad - two orders of magnitude decrease in error rate! - but we'd like to at least be able to beat Quake :).

(Note that here we do a fair comparison by looking only at errors on sequences that Quake doesn't discard; to get comparable results on your data with khmer, you'd also have to trim your reads. We are also making use of the approach developed in the streaming paper where we digitally normalize the graph in advance, in order to decrease the number of errors and the size of the graph.)

## Concluding thoughts

What attracts us to this approach is that it's really simple. The basic error correction is a few lines, although it's surrounded by a bunch of machinery for doing semi-streaming analysis and keeping pairing intact. (The two-pass/offline script for error correction is much cleaner, because it omits all of this machinery.)

It's also nice that this applies to all shotgun sequencing, not just genomic; that's a trivial extension of our semi-streaming paper.

We also suspect that this approach is quite tunable, although we are just beginning to investigate the proper way to build parameters for the pair-HMM, and we haven't nailed down the right coverage/cutoff parameters for error correction either. More work to be done!

In any case, there's also more than error correction to be done with the graphalign approach -- stay tuned!

## References and previous work

This is by no means novel - we're building on a lot of ideas from a lot of people. Our interest is in bridging from theory to practice, and providing a decent tunable implementation in an open-source package, so that we can explore these ideas more widely.

Here is short summary of previous work, surely incomplete --

• Much of this was proximally inspired by Jordan's work on Xander, software to do HMM-guided gene assembly from metagenomic data. (An accompanying paper has been accepted for publication; will blog about that when it hits.)
• More generally, my MSU colleague Yanni Sun has had several PhD students that have worked on HMMs and graph alignment, and she and her students have been great sources of ideas! (She co-advised Jordan.)
• BlastGraph, like Xander, built on the idea of graph alignment. It is the earliest reference I know of to graph alignment, but I haven't looked very hard.
• Yuzhen Ye and Haixu Tang at Indiana have developed very similar functionality that I became aware of when reviewing their nice paper on graph alignment for metatranscriptomics.
• Jared Simpson has been doing nice work on aligning Nanopore reads to a reference sequence. My guess is that the multiple sequence alignment approach described in Jonathan Dursi's blog post is going to prove relevant to us.
• The error corrector Coral (Salmela and Schroder, 2011) bears a strong philosophical resemblance to graphalign in its approach to error correction, if you think of a De Bruijn graph as a kind of multiple-sequence alignment.

## Appendix: Running this code

The computational results in this blog post are Rather Reproducible (TM). Please see https://github.com/dib-lab/2015-khmer-wok1-ec/blob/master/README.rst for instructions on replicating the results on a virtual machine or using a Docker container.

### Gaël Varoquaux

#### Software for reproducible science: let’s not have a misunderstanding

Note

tl;dr:   Reproducibilty is a noble cause and scientific software a promising vessel. But excess of reproducibility can be at odds with the housekeeping required for good software engineering. Code that “just works” should not be taken for granted.

This post advocates for a progressive consolidation effort of scientific code, rather than putting too high a bar on code release.

Titus Brown recently shared an interesting war story in which a reviewer refuses to review a paper until he can run the code on his own files. Titus’s comment boils down to:

“Please destroy this software after publication”.

Note

Reproducible science: Does the emperor have clothes?

In other words, code for a publication is often not reusable. This point of view is very interesting from someone like Titus, who is a vocal proponent of reproducible science. His words triggered some surprises, which led Titus to wonder if some of the reproducible science crowd folks live in a bubble. I was happy to see the discussion unroll, as I think that there is a strong risk of creating a bubble around reproducible science. Such a bubble will backfire.

## Replication is a must for science and society

Science advances by accumulating knowledge built upon observations. It’s easy to forget that these observations, and the corresponding paradigmatic conclusions, are not always as simple to establish as the fact that hot air rises: replicating many times the scientific process transforms an evidence into a truth.

One striking example of scientific replication is the on-going effort in psychology to replay the evidence behind well-accepted findings central to current line of thoughts in psychological sciences. It implies setting up the experiments accordingly to the seminal publications, acquiring the data, and processing it to come up to the same conclusions. Surprisingly, not everything that was taken for granted holds.

Note

Findings later discredited backed economic policy

Another example, with massive consequences on Joe Average’s everyday, is the failed replication of Reinhart and Rogoff’s “Growth in a Time of Debt” publication. The original paper, published in 2010 in the American Economic Review, claimed empirical findings linking important public debt to failure of GDP growth. In a context of economical crisis, it was used by policy makers as a justification for restricted public spending. However, while pursuing a mere homework assignment to replicate these findings, a student uncovered methodological flaws with the paper. Understanding the limitations of the original study took a while, and discredited the academic backing to the economical doctrine of austerity. Critically, the analysis of the publication was possible only because Reinhart and Rogoff released their spreadsheet, with data and analysis details.

## Sharing code can make science reproducible

A great example of sharing code to make a publication reproducible is the recent paper on orthogonalization of regressors in fMRI models, by Mumford, Poline and Poldrack. The paper is a didactic refutation of non-justified data processing practices. The authors made their point much stronger by giving an IPython notebook to reproduce their figures. The recipe works perfectly here, because the ideas underlying the publication are simple and can be illustrated on synthetic data with relatively inexpensive computation. A short IPython notebook is all it takes to convince the reader.

Note

Sharing complex code… chances are it won’t run on new data.

At the other end of the spectrum, a complex analysis pipeline will not be as easy to share. For instance, a feat of strength such as Miyawaki et al’s visual image reconstruction from brain activity requires complex statistical signal processing to extract weak signatures. Miyawaki et al shared the data. They might share the code, but it would be a large chunk of code, probably fragile to changes in the environment (Matlab version, OS…). Chances are that it wouldn’t run on new data. This is the scenario that prompted Titus’s words:

“Please destroy this software after publication”.

I have good news: you can reproduce Miyawaki’s work with an example in nilearn, a library for machine learning on brain images. The example itself is concise, readable and it reliably produces figures close to that of the paper.

Note

Maintained libraries make feats of strength routinely reproducible.

This easy replication is only possible because the corresponding code leverages a set of libraries that encapsulate the main steps of the analysis, mainly scikit-learn and nilearn here. These libraries are tested, maintained and released. They enable us to go from a feat of strength to routine replication.

## Reproducibility is not sustainable for everything

Thinking is easy, acting is difficult       —       Goethe

Note

Keeping a physics apparatus running for replication years later?

I started my scientific career doing physics, and fairly “heavy” physics: vacuum systems, lasers, free-falling airplanes. In such settings, the cost of maintaining an experiment is apparent to the layman. No-one is expected to keep an apparatus running for replication years later. The pinnacle of reproducible research is when the work becomes doable in a students lab. Such progress is often supported by improved technology, driven by wider applications of the findings.

However, not every experiment will give rise to a students lab. Replicating the others will not be easy. Even if the instruments are still around the lab, they will require setting up, adjusting and wiring. And chances are that connectors or cables will be missing.

Software is no different. Storing and sharing it is cheaper. But technology evolves very fast. Every setup is different. Code for a scientific paper has seldom been built for easy maintenance: lack of tests, profusion of exotic dependencies, inexistent documentation. Robustness, portability, isolation, would be desirable, but it is difficult and costly.

Software developers know that understanding the constraints to design a good program requires writing a prototype. Code for a scientific paper is very much a prototype: it’s a first version of an idea, that proves its feasibility. Common sense in software engineering says that prototypes are designed to be thrown away. Prototype code is fragile. It’s untested, probably buggy for certain usage. Releasing prototypes amounts to distributing semi-functioning code. This is the case for most code accompanying a publication, and it is to be expected given the very nature of research: exploration and prototyping [1].

## No success without quality, …

Note

Highly-reliable is more useful than state-of-the-art.

My experience with scientific code has taught me that success require quality. Having a good implementation of simple, well-known, methods seems to matter more than doing something fancy. This is what the success of scikit-learn has taught us: we are really providing classic “old” machine learning methods, but with a good API, good docs, computational performance, and stable numerics controlled by stringent tests. There exists plenty of more sophisticated machine-learning methods, including some that I have developed specifically for my data. Yet, I find myself advising my co-workers to use the methods in scikit-learn, because I know that the implementation is reliable and that they will be able to use them [2].

This quality is indeed central to doing science with code. What good is a data analysis pipeline if it crashes when I fiddle with the data? How can I draw conclusions from simulations if I cannot change their parameters? As soon as I need trust in code supporting a scientific finding, I find myself tinkering with its input, and often breaking it. Good scientific code is code that can be reused, that can lead to large-scale experiments validating its underlying assumptions.

Sqlite is so much used that its developers have been woken up at night by users.

You might say that I am putting the bar too high; that slightly buggy code is more useful than no code. But I frown at the idea of releasing code for which I am unable to do proper quality assurance. I may have done too much of that in the past. And because I am a prolific coder, many people are using code that has been through my hands. My mailbox looks like a battlefield, and when I go the coffee machine I find myself answering questions.

## … and making difficult choices

Note

Achieving quality requires making choices. Not only because time is limited, but also because the difficulty to maintain and improve a codebase increases much quicker than the numbers of features [3]. This phenomena is actually frightening to watch: adding a feature in scikit-learn these days is much much harder than what it used to be in the early days. Interactions between features is a killer: when you modify something, something else unrelated breaks. For a given functionality, nothing makes the code more incomprehensible than cyclomatic complexity: the multiplicity of branching, if/then clauses, for loops. This complexity naturally appears when supporting different input types, or minor variants of a same method.

The consequence is that ensuring quality for many variants of a method is prohibitory. This limit is a real problem for reproducible science, as science builds upon comparing and opposing models. However, ignoring it simply leads to code that fails doing what it claims to do. What this is telling us, is that if we are really trying to do long-term reproducibility, we need to identify successful and important research and focus our efforts on it.

If you agree with my earlier point that the code of a publication is a prototype, this iterative process seems natural. Various ideas can be thought of as competing prototypes. Some will not lead to publication at all, while others will end up having a high impact. Knowing before-hand is impossible. Focusing too early on achieving high quality is counter productive. What matters is progressively consolidating the code.

## Reproducible science, a rich trade-off space

Note

Verbatim replication or reuse?

Does Reinhart and Rogoff’s “Growth in a Time of Debt” paper face the same challenges as the manuscript under review by Titus? One is describing mechanisms while the other is introducing a method. The code of the former is probably much simpler than that of the latter. Different publications come with different goals and code that is more or less easy to share. For verbatim replication of the analysis of a paper, a simple IPython notebook without tests or API is enough. To go beyond requires applying the analysis to different problems or data: reuse. Reuse is very difficult and cannot be a requirement for all publications.

Conventional wisdom in academia is that science builds upon ideas and concepts rather than methods and code. Galileo is known for his contribution to our understanding of the cosmos. Yet, methods development underpins science. Galileo is also the inventor of the telescope, which was a huge technical achievement. He needed to develop it to back his cosmological theories. Today, Galileo’s measurements are easy to reproduce because telescopes are readily-available as consumer products.

Standing on the shoulders of giants     —     Isaac Newton, on software libraries

 [1] To make my point very clear, releasing buggy untested code is not a good thing. However, it is not possible to ask for all research papers to come with industial-quality code. I am trying here to push for a collective, reasoned, undertaking of consolidation.
 [2] Theory tells us that there is there is no universal machine learning algorithm. Given a specific machine-learning application, it is always possible to devise a custom strategy that out-performs a generic one. However, do we need hundreds of classifiers to solve real world classification problems? Empirical results [Delgado 2014] show that most of the benefits can be achieved with a small number of strategies. Is it desirable and sustainable to distribute and keep alive the code of every machine learning paper?
 [3] Empirical studies on the workload for programmers to achieve a given task showed that 25 percent increase in problem complexity results in a 100 percent increase in programming complexity: An Experiment on Unit increase in Problem Complexity, Woodfield 1979.

I need to thank my colleague Chris Filo Gorgolewski and my sister Nelle Varoquaux for their feedback on this note.

## May 13, 2015

### Titus Brown

#### Adventures in replicable scientific papers: Docker

About a month ago, I took some time to try out Docker, a container technology that lets you bundle together, distribute, and execute applications in a lightweight Linux container. It seemed neat but I didn't apply it to any real problems. (Heng Li also tried it out, and came to some interesting conclusions -- note especially the packaging discussion in the comments.)

At the sprint, I decided to try building a software container for our latest paper submission on semi-streaming algorithms for DNA sequence analysis, but I got interrupted by other things. Part of the problem was that I had a tough time conceptualizing exactly what my use case for Docker was. There are a lot of people starting to use Docker in science, but so far only nucleotid.es has really demonstrated its utility.

Fast forward to yesterday, when I talked with Michael Crusoe about various ideas. We settled on using Docker to bundle together the software needed to run the full paper pipeline for the streaming paper. The paper was already highly replicable because we had used my lab's standard approach to replication (first executed three years ago!) This wasn't a terribly ambitious use of Docker but seemed like it could be useful.

In the end, it turned out to be super easy! I installed Docker on an AWS m3.xlarge, create a Dockerfile, and wrote up some instructions.

The basic idea we implemented is this:

• install all the software in a Docker container (only needs to be done once, of course);
• clone the repository on the host machine;
• copy the raw data into the pipeline/ sub-directory of the paper repository;
• run the docker container with the root of the paper repository (on the host, wherever it might be) bound to a standard location ('/paper') in the image;
• voila, raw data in, analyzed results out!

(The whole thing takes about 15 hours to run.)

## The value proposition of Docker for data-intensive papers

So what are my conclusions?

I get the sense that this is not really the way people are thinking about using Docker in science. Most of what I've seen has to do with workflows, and I get the sense that the remaining people are trying to avoid issues with software packaging. In this case, it simply didn't make sense to me to break our workflow steps for this paper out into different Docker images, since our workflow only depends on a few pieces of software that all work together well. (I could have broken out one bit of software, the Quake/Jellyfish code, but that was really it.)

I'm not sure how to think about the volume binding, either - I'm binding a path on the Docker container directly to a local disk, so the container isn't self-sufficient. The alternative was to package the data in the container, but in this case, it's 15-20 GB, which seemed like too much! This dependence on external data does limit our ability to deploy the container to compute farms though, and it also means that we can't put the container on the Docker hub.

The main value that I see for this container is in not polluting my work environment on machines where I can run Docker. (Sadly this does not yet include our HPC at MSU.) I could also use a Project Jupyter container to build our figures, and perhaps use a separate Latex container to build the paper... overkill? :).

One nice outcome of the volume binding is that I can work on the Makefile and workflow outside of the docker container, run it all inside the container, and then examine the artifacts outside of the container. (Is there a more standard way to do this?)

I also really like the explicit documentation of the install and execution steps. That's super cool and probably the most important bit for paper replication. The scientific world would definitely be a better place if the computational setup for data analysis and modeling components of papers came in a Dockerfile-style format! "Here's the software you need, and the command to run; put the data here and push the 'go' button!"

I certainly see the value of docker for running many different software packages, like nucleotid.es does. I think we should re-tool our k-mer counting benchmark paper to use containers to run each k-mer counting package benchmark. In fact, that may be my next demo, unless I get sidetracked by my job :).

## Next steps

I'm really intrigued by two medium-term directions -- one is the bioboxes-style approach for connecting different Docker containers into a workflow, and the other is the nucleotid.es approach for benchmarking software. If this benchmarking can be combined with github repos ("go benchmark the software in this github project!") then that might enable continuously running testing and benchmarks on a wide range of software.

Longer term, I'd like to have a virtual computing environment in which I can use my Project Jupyter notebook running in a Docker environment to quickly and easily spin up a data-intensive workflow involving N docker containers running on M machines with data flowing through them like so. I can already do this with AWS but it's a bit clunky; I foresee a much lighter-weight future for ultra-configurable computing.

In the shorter term, I'm hoping we can put some expectations in place for what dockerized paper replication pipelines might look like. (Hint: binary blobs should not be acceptable!) If we have big data sets, we probably don't want to put them on the Docker Hub; is the right solution to combine use of a data repository (e.g. figshare) with a docker container (to run all the software) and a tag in a github repository (for the paper pipeline/workflow)?

Now, off to review that paper that comes with a Docker container... :)

--titus

#### Modifications to our development process

After a fair amount of time thinking about software's place in science (see blog posts 1, 2, 3, and 4), and thinking about khmer's short- and long-term future, we're making some changes to our development process.

Semantic versioning: The first change, and most visible one, is that we are going to start bumping version numbers a lot faster. One of the first things Michael Crusoe put in place was semantic versioning, which places certain compatibility guarantees on version numbers used. These compatibility guarantees (on the command line API only, for khmer) are starting to hold us back from sanding down the corners. Moving forward, we're going to bump version numbers as quickly as needed for the code we've merged, rather than holding off on cleanup.

Michael just released khmer v1.4; my guess is that 2.0 will follow soon after. We'll try to batch major versions a little bit, but when in doubt we'll push forward rather than holding back, I think. We'll see how it goes.

Improving the command-line user experience. At the same time, we're going to be focusing more on user experience issues; see #988 for an example. Tamer Mansour, one of my new postdocs at Davis, took a fresh look at the command line and argued strenuously for a number of changes, and this aligns pretty well with our interests.

Giving more people explicit merge authority. 'til now, it was mostly Michael and myself doing merges; we've asked Luiz Irber and Camille Scott to step up and do not only code review but merges on their own recognizance. This should free up Michael to focus more on coding, as well as speeding up response times when Michael and I are both busy or traveling. I'm also asking mergers to fix minor formatting issues and update the ChangeLog for pull requests that are otherwise good - this will accelerate the pace of change and decrease frustration around quick fixes.

This is part of my long-term plan to involve more of the lab in software engineering. Most experimental labs have lab duties for grad students and postdocs; I'd like to try out the model where the grad students and postdocs have software engineering duties, independent of their research.

Deferring long-term plans and deprecating sprint/training efforts. We will defer our roadmap and decrease our sprint and training interactions. As a small project trying to get more funding, we can't afford the diversion of energy at this point. That having been said, both the roadmap planning and the sprints thus far were tremendously valuable for thinking ahead and making our contribution process more robust, and we hope to pursue both in the future.

Paying technical debt maintenance fees, instead of decreasing debt. We still have lots of issues that are burdening the codebase, especially at the Python and C++ interface levels, but we're going to ignore them for now and focus instead on adding new features (hopefully without increasing technical debt, note - we're keeping the code review and continuous integration and test coverage and ...). Again, we're a small project trying to get more funding... hard choices must be made.

I'm writing a grant now to ask for sustained funding on a ~5 year time scale, for about 3 employees - probably a software engineer / community manager, a super-postdoc/software engineer, and a grad student. If we can get another round of funding, we will reactivate the roadmap and think about how best to tackle technical debt.

--titus

p.s. Special thanks to Ethan White, Greg Wilson, and Neil Chue Hong for their input!

## May 08, 2015

### Titus Brown

#### My review of a review of &quot;Influential Works in Data Driven Discovery&quot;

I finally got a chance to more thoroughly read Mark Stalzer and Chris Mentzel's arxiv preprint, "A Preliminary Review of Influential Works in Data-Driven Discovery". This is a short review paper that discusses concepts highlighted by the 1,000+ "influential works" lists submitted to the Moore Foundation's Data Driven Discovery (DDD) Investigator Competition. (Note, I was one of the awardees.)

The core of this arxiv preprint is the section on "Clusters of Influential Works", in which Stalzer & Mentzel go in detail through the eight different concept clusters that emerged from their analysis of the submissions. This is a fascinating section that should be at the top of everyone's reading list. The topics covered are, in the order presented in the paper, as follows:

• Foundational theory, including Bayes' Theorem, information theory, and Metropolis sampling;
• Astronomy, and specifically the Sloan Digital Sky Survey;
• Genomics, focused around the Human Genome Project and methods for searching and analyzing sequencing data;
• Classical statistical methods, including the lasso, bootstrap methods, boosting, expectation-maximization, random forests, false discovery rate, and "isomap" (which I'd never heard of!);
• Machine learning, including Support Vector Machines, artificial Neural Networks (and presumably deep learning?), logistic belief networks, and hidden Markov models;
• The Google! Including PageRank, MapReduce, and "the overall anatomy" of how Google does things; specific implementations included Hadoop, BigTable, and Cloud DataFlow.
• General tools, programming languages, and computational methods, including Numerical Recipes, the R language, the IPython Notebook (Project Jupyter), the Visual Display of Quantitative Information, and SQL databases;
• Centrality of the Scientific Method (as opposed to specific tools or concepts). Here the discussion focused around the Fourth Paradigm book which lays out the expansion of the scientific method from empirical observation to theory to simulation to "big data science"; here, I thought the point that computers were used for both theory and observation was well-made. This section is particularly worth reading, in my opinion.

This collection of concepts is simply delightful - Stalzer and Mentzel provide both a summary of the concepts and a fantastic curated set of high-level references.

Since I don't know many of these areas that well (I've heard of most of the subtopics, but I'm certainly not expert in ... any of them? yikes) I evaluated the depth of their discussion by looking at the areas I was most familiar with - genomics and tools/languages/methods. My sense from this was that they covered the highlights of tools better than the highlights of genomics, but this may well be because genomics is a much larger and broader field at the moment.

## Data-Driven Discovery vs Data Science

One interesting question that comes up frequently is what the connection and overlap is between data-driven discovery, data science, big data, data analysis, computational science, etc. This paper provides a lot of food for thought and helps me draw some distinctions. For example, it's clear that computational science includes or at least overlaps with all of the concepts above, but computational science also includes things like modeling that I don't think clearly fit with the "data-driven discovery" theme. Similarly, in my experience "data science" encompasses tools and methods, along with intelligent application of them to specific problems, but practically speaking does not often integrate with theory and prediction. Likewise, "big data", in the sense of methods and approaches designed to scale to analysis and integration of large data set, is clearly one important aspect of data-driven discovery - but only in the sense that in many cases more data seems to be better.

Ever since the "cage match" round of the Moore DDD competition, where we discussed these issues in breakout groups, I've been working towards the internal conclusion that data-driven discovery is the exploration and acceleration of science through development of new data science theory, methods, and tools. This paper certainly helps nail that down by summarizing the components of "data driven discovery" in the eyes of its practitioners.

## Is this a framework for a class or graduate training theme?

I think a lot about research training, in several forms. I do a lot of short-course peer instruction form (e.g. Data Carpentry, Software Carpentry, and my DIB efforts); I've been talking with people about graduate courses and graduate curricula, with especial emphasis on data science (e.g. the Data Science Initiative at UC Davis); and, most generally, I'm interested in "what should graduate students know if they want to work in data-driven discovery"?

From the training perspective, this paper lays out the central concepts that could be touched on either in a survey course or in an entire graduate program; while my sense is that a PhD would require coupling to a specific domain, I could certainly imagine a Master's program or a dual degree program that touched on the theory and practice of data driven discovery.

For one example, I would love to run a survey course on these topics, perhaps in the area of biology. Such a course could go through each of the subsections above, and discuss them in relation to biology - for example, how Bayes' Theorem is used in medicine, or how concepts from the Sloan Digital Sky Survey could be applied to genomics, or where Google-style infrastructure could be used to support research.

There's more than enough meat in there to have a whole graduate program, though. One or two courses could integrate theory and tools, another course could focus on practical application in a specific domain, a third course could talk about general practice and computing tools, and a fourth course could discuss infrastructure and scaling.

## The missing bits - "open science" and "training"

Something that I think was missing from the paper was an in-depth perspective on the role that open source, open data, and open science can play. While these concepts were directly touched on in a few of the subsections - most of the tools described were open source, for example, and Michael Nielsen's excellent book "Reinventing Discovery" was mentioned briefly in the context of network effects in scientific communication and access - I felt that "open science" was an unacknowledged undercurrent throughout.

It's clear that progress in science has always relied on sharing ideas, concepts, methods, theory, and data. What I think is not yet as clear to many is the extent to which practical, efficient, and widely available implementations of methods have become important in the computer age. And, for data-driven discovery, an increasingly critical aspect is the infrastructure to support data sharing, collaboration, and the application of these methods to large data sets. These two themes -- sharing of implementation and importance of infrastructure cut across many of the subsections in this paper, including the specific domains of astronomy and human genomics, as well as the Google infrastructure and languages/tools/implementation subsections. I think the paper could usefully add a section on this.

Interestingly, the Moore Foundation DDD competition implicitly acknowledged this importance by enriching for open scientists in their selection of the awardees -- a surprising fraction of the Investigators are active in open science, including myself and Ethan White, and virtually all the Investigators are openly distributing their research methodology. In that sense, open science is a notable omission from the paper.

It's also interesting to note that training is missing from the paper. If you believe data-driven discovery is part of the future of science, then training is important because there's a general lack of researchers and institutions that cover these topics. I'd guess that virtually no one researcher is well versed in a majority of the topics, especially since many of the topics are entire scientific super-fields, and the rest are vast technical domains. In academic research we're kind of used to the idea that we have to work in collaboration (practice may be different...), but here academia really fails to cover the entire data-driven discovery spectrum because of the general lack of emphasis on expert use of tools and infrastructure in universities.

So I think that investment in training is where the opportunities lie for universities that want to lead in data-driven discovery, and this is the main chance for funders that want to enable the network effect.

## Training in open science, tools, and infrastructure as competitive advantages

Forward-thinking universities who are in it for the long game & interested in building a reputation in data-driven discovery, might consider the following ideas:

• scientists trained in open science, tool use, and how to use existing infrastructure, are more likely to be able to quickly take advantages of new data and methods.
• scientists trained in open science are more likely to produce results that can be built on.
• scientists trained in open science are more likely to produce useful data sets.
• scientists trained in open science and tool building are more likely to produce useful tools.
• funding agencies are increasingly interested in maximizing impact by requiring open source, open data, and open access.

All of these should lead to more publications, more important publications, a better reputation, and more funding.

In sum, I think investments in training in the most ignored bits of data-driven discovery (open science, computational tool use and development, and scalable infrastructure use and development) should be a competitive advantage for institutions. And, like most competitive advantages, those who ignore it will be at a significant disadvantage. This is also an opportunity for foundations to drive progress by targeted investments, although (since they are much more nimble than universities) they are already doing this to some extent.

In the end, what I like most about this paper is that it outlines and summarizes the concepts in which we need to invest in order to advance science through data-driven discovery. I think it's an important contribution and I look forward to its further development and ultimate publication!

--titus

## May 07, 2015

### Continuum Analytics

#### Continuum Analytics - May Tech Events

The Continuum team is gearing up for a summer full of conferences, including PyData Seattle, taking place July 24-26, hosted by Microsoft. But first we’ve got a few May conferences to keep an eye out for, all over the globe! Join us in Austin, Argentina, Berlin, and Boston this month.

## May 06, 2015

### Abraham Escalante

#### My GSoC experience

Hello all,

My name is Abraham Escalante and I'm a mexican software engineer. The purpose of this blog is to relate my experiences and motivations to participate in the 2015 Google Summer of Code.

I am not much of a blogger (in fact, this is my first blog entry ever) but if you got here, then chances are you are interested in either the GSoC, the personal experience of a GSoCer or maybe we have a relationship of some sort and you have a personal interest (I'm looking at you Hélène). Either way, I will do my best to walk you through my experience with the hope that this may turn out to be useful for someone in the future, be it to help you get into the GSoC programme or just to get to know me a little better if you find that interesting enough.

I have some catching up to do because this journey started for me several months ago. The list of selected student proposals has already been published (**spoiler alert** I got selected) and the coding period will start in about three weeks time but for now I just wanted to write a first entry to get the ball rolling and so you get an idea of what you can expect, should you choose to continue reading these blog entries. I will begin my storytelling soon.

Cheers,
Abraham.

## May 05, 2015

### Titus Brown

#### A workshop report from the May 2015 non-model RNAseq workshop at UC Davis

We just finished teaching the second of my RNAseq workshops at UC Davis -- the fifth workshop I've hosted since I took a faculty position here in VetMed. In order, we've done a Train the Trainers, a Data Carpentry, a reference-guided RNAseq assembly workshop, a mothur (microbial ecology) workshop, and a de novo RNAseq assembly workshop -- you can see all of the links at the Data Intensive Biology Training Program Web site. This workshop was the May de novo mRNAseq assembly workshop, which I co-taught with Tamer Mansour and Camille Scott.

The workshops are still maturing, and I'm trying to figure out how to keep this going for the medium term, but so far I think we're doing an OK job. We can always improve the material and the delivery, but I think at least we're on a good trajectory.

This workshop (and the many excellent questions raised by the attendees) reminded me how much of RNAseq analysis is still research -- it's not just a question of what assembler and quantification method to use, but much more fundamental questions of data evaluation, assembly evaluation, and how to tie this into the biology you're trying to do. My lab works on this a lot, and too much of the time we have to say "we just don't know" - often because the experts don't agree, or because the answer is just unknown.

I also despair sometimes that the energy and effort we're putting into this isn't enough. There is a huge demand, and these two day workshops are at best a stopgap measure, and I really have no idea whether they're going to help biologists starting from scratch to analyze their own data.

I do have other arrows in my quiver. Once my lab "lands" at Davis (sometime between June and September) I expect to start up a biology "data space" of some sort, where every week people who have been through one of my workshops can come work on their data analysis; the hope is that, much like the Davis R Users Group, we can start to build a community around biological data analysis. Stay tuned.

I'm also planning to start running more advanced workshops. One great idea that Tamer pitched to me this morning was to run a follow-on workshop entitled "publishing your transcriptome", which would focus on quality measures and analysis downstream of your first-blush transcriptome assembly/annotation/quantification. I'm also hoping to put together an "automation and reproducibility" workshop in the fall, along with a variety of more focused workshops on specific platforms and questions.

And, of course, we'll continue running the intro workshops. In addition to the mRNASeq workshops, in the fall I'd like to do workshops on microbial genome assembly and annotation, metagenome and metatranscriptome assembly, and advanced UCSC genome browser use/misuse (think assembly hubs etc.).

--titus

### Juan Nunez-Iglesias

#### jnuneziglesias

I use Twitter favourites almost exclusively to mark posts that I know will be useful in some not-too-distant future; kind of like a Twitter Evernote. Recently I was looking through my list in search of this excellent blog post detailing how to build cross-platform binary distributions for conda.

I came across two other tweets from the EuroSciPy 2014 conference: this one by Ian Ozsvald about his IPython memory usage profiler, right next to this one by Alexandre Chabot about Aaron O’Leary’s notedown. I’d forgotten that this was how I came across these two tools, but since then I have contributed code to both (1, 2). I’d met Ian at EuroSciPy 2013, but I’ve never met Aaron, yet nevertheless there is my code in the latest version of his notedown library.

How remarkable the open-source Python community has become. Talks from Python conferences are posted to YouTube, usually as the conference is happening. (Add to that plenty of live tweeting.) Thus, even when I can’t attend the conferences, I can keep up with the latest open source libraries, from the other side of the world. And then I can grab the source code on GitHub, fiddle with it to my heart’s content, and submit a pull request to the author with my changes. After a short while, code that I wrote for my own utility is available for anyone else to use through PyPI or conda.

My point is: join us! Make your code open source, and conversely, when you need some functionality, don’t reinvent the wheel. See if there’s a library that almost meets your needs, and contribute!

## April 28, 2015

### Matthieu Brucher

#### Book review: scikit-learn Cookbook

There are now a few books on sickit-learn, for instance a general one on machine learning systems, and a cookbook. I was a technical reviewer for the first one, and now I’m reviewing the cookbook.

#### Content and opinions

A cookbook is a collection of recipes, it is not intended to help you understand how your oven works. It is the same for this book, it won’t help you install your oven or set it up, you will have to know how to install the required packages.

It will help you decide what tool to use for which problem. It is complementary to the tutorials and the gallery on the scikit website as it adds some thoughts on what the algorithm does and where to pay attention. If Building Machine Learning Systems in Python is quite broad and goes from the installation to specific algorithms, this book tries to cover more algorithms, with explanations of what you are doing, but with less depth, and it is more or less only focused on scikit-learn.

#### Conclusion

If you know a little bit about machine learning and Python, a cookbook may be more appropriate than a more “vertical” book. As such this book covers quite a bit of the scikit, with some useful tips. But as it doesn’t go in too many details, you still need to confront data and parameters against a book like Bishop’s Pattern Recognition and Machine Learning.

## April 26, 2015

### Titus Brown

#### Proposal: Integrating the OSF into Galaxy as a remote data store

Note - this was an internal funding request solicited by the Center for Open Science. It's been funded!

Brief: We propose to integrate OSF into Galaxy as a data store. For this purpose, we request 3 months of funding (6 months, half-time) for one developer, plus travel.

Introduction and summary: Galaxy is a commonly used open source biomedical/biological sequence data analysis platform that enables biologists to put together reproducible pipelines and execute analyses locally or in the cloud. Galaxy has a robust and sophisticated Web-based user interface for setting up these pipelines and analyzing data. One particular challenge for Galaxy is that on cloud instances, data storage and publication must be done using local filesystems and remote URLs, which adds a significant amount of complexity for biologists interested in doing reproducible computing. Recently, Galaxy gained a data abstraction layer that permits object stores to be used instead of local filesystems. The Center for Open Science’s Open Science Framework (OSF), in turn, is a robust platform for storing, manipulating, and sharing scientific data, and provides APIs for accessing such data; the OSF can also act as a broker for accessing and managing remote data stores, on e.g. cloud providers. Integrating the OSF’s object store into Galaxy would let Galaxy use OSF for data persistence and reproducibility, and would let Galaxy users take advantage of OSF’s data management interface, APIs, and authentication to expand their reproducible biomedical science workflows. This integration would also rigorously test and exercise newly developed functionality in both Galaxy and the OSF, providing valuable use cases and testing.

Our “stretch” goal would be to expand beyond Galaxy and work with Project Jupyter/IPython Notebook’s data abstraction layer to provide an OSF integration for Project Jupyter.

We note with enthusiasm that all groups mentioned here are robust participants in the open source/open science ecosystem, and all projects are full open source projects with contributor guidelines and collaboration workflows!

Broader impacts: If successful, the proposed project addresses several broader issues. First, the OSF would have an external consumer of its APIs for data access, which would drive the maturation of these APIs with use cases. Second, the OSF would expand to support connections with a visible project in a non-psychology domain, giving COS a proof-of-concept demonstration for expansion into new communities. Third, the Galaxy biomedical community would gain connections to the OSF’s functionality, which would help in execution, storage, and publication of biomedical data analyses. Fourth, the Brown Lab would then be able to explore further work to build their Moore-DDD-funded data analysis portal on top of both Galaxy and the OSF, leveraging the functionality of both projects to advance open science and reproducibility. Even a partial failure would be informative by exposing faults in the OSF or Galaxy public APIs and execution models, which could then be addressed by the projects individually. This project would also serve as a “beta test” of the COS as an incubator of open science software projects.

Longer-term outcomes: the Brown Lab and the COS are both interested in exploring the OSF as a larger hub for data storage for workflow execution, teaching and training in data-intensive science, and hosting the reproducible publications. This proposed project is a first step in those directions.

#### Popping the open source/open science bubble.

One of the things that became clear to me over the last two weeks is just how much of a open source/open science bubble my blog and Twitter commenters live in. Don't take that as a negative -- I'm in here with you, and it's a great place to live :). But it's still a bubble.

Two specific points brought this home to me.

First, a lot of the Twitter and blog commentary on Please destroy this software after publication. kthxbye. expressed shock and dismay that I would be OK with non-OSS software being published. (Read Mick Watson's blog post and Kai Blin's comment.) Many really good reasons why I was wrong were brought up, and, well, I have to say it was terrifically convincing and I'm going to change my own policy as a reviewer. So far, so good. But it turns out that only a few journals require an actual open source license (Journal of Open Research Software and Journal of Statistical Software). So there is a massive disparity between what some of my tweeps (and now me) believe, and what is codified practice.

Second, many eloquent points were made about software as a major product and enabler of research -- see especially the comments on "software as communication" and "software as experimental design" by others (linked to here - see "Software as..." section). These points were very convincing as well, although I'm still trying to figure out how exactly to evolve my own views. And yet here again I think we can be quite clear that most biologists and perhaps even some bioinformaticians would have either no considered opinion on software, or be outright dismissive of the idea that software itself is intellectual output. Again, very different from what the people on Twitter and my blog think.

I was already pretty surprised with how strong the case was for open source software as a requirement (go read the links above). I was even more surprised with how eloquently and expansively people defended the role of software in research. Many, many strong arguments were put forth.

So, how do we evolve current practice??

But first...

## If software is so important, software is fair game for peer review

I promise this wasn't a stealth goal of my original blog post but people realize that an obvious conclusion here is that software is fully fair game for in depth peer review, right? (Never mind that most scientists probably aren't capable of doing good peer review of code, or that any reasonably strong code review requirements would mean that virtually no more software would be published - an effective but rather punitive way to ensure only good software is published in science :)

A few weeks back I received a response to my review of an application note, and the senior author objected strenuously to my reviewing their actual software in any way. It really pissed me off, frankly -- I was pretty positive about their packaged software and made some suggestions for how they could improve its presentation to others, and basically got back a punch to the nose asking how dare I make such suggestions. As part of my own rather intemperate response, I said:

This is an application note. The application itself is certainly fair game for review...

How much angrier would this person have been if I'd rejected the paper because I actually had comments on edge cases in the source code??

Two years ago now we had another big eruption ("big" in the Twitter sense, at least) around code review. A year even before that I proposed optional review criteria for bioinformatics papers that my students, at least, have started to use to do reviews.

In all that time very little has changed. There are three objections that I've heard in these last three years that bear up over time --

First, scientists neither know how to review code nor how to write reasonable code; this would lead at best to inconsistency in reviews, or at worst simply lead to a massive waste of time.

Second, I am not aware of any code review guidelines or standards for scientific code. Code review in industry has at least some basic good practices; code review in science is a different beast.

Third, code review can be used to unfairly block publication. This came up again recently (READ THAT COMMENT) and I think it's a great reason to worry about code review as a way to block publication. I still don't know how to deal with this but we need some guidelines for editors.

The bottom line is that if software is fair game for peer review, then we need a trained and educated body of reviewers - just as we do for molecular methods, biological sequencing, and statistics. This will inevitably involve the evolution of the community of practice around both software generation (s...l...o...w...l...y... happening) and software peer review (<envision birds chirping in the absence of conversation>).

(One solution I think I'm going to try is this: I'm going to ask the Software Carpentry community for a volunteer to do code review for every computational paper I edit, and I will provide suggested (optional) guidelines. Evil? Maybe so. Effective? I hope so.)

## We need some guidelines and position papers.

Of the discussion around computation as a primary research product, Dan Katz asked,

"I wonder if a collaborative paper on this would find a home somewhere?"

Yes. To break out of the bubble, I think we need a bunch of position papers and guidelines on this sort of thing, frankly. It's clear to me that the online community has a tremendous amount of wisdom to offer, but we are living in a bubble, and we need to communicate outside of that -- just as the open access and open data folk are.

One important note: we need simple, clear, minimum requirements, with broadly relevant justifications. Otherwise we will fail to convince or be useful to anyone, including our own community.

A few ideas:

• We need a clear, concise, big-tent writeup of "why software is important, and why it should be OSS and reviewed when published";
• We need to discuss good minimum requirements in the near term for code review, and figure out what some end goals are;
• We need some definitions of what "responsible conduct of computational research" looks like (Responsible Conduct of Research is a big thing in the US, now; I think it's a useful concept to employ here).
• We need some assessment metrics (via @kaythaney) that disentangle "responsible conduct of research" (a concept that nobody should disagree with) from "open science" (which some people disagree with :).

and probably a bunch of other things... what else do we need, and how should we move forward?

--titus

### Filipe Saraiva

#### Cantor in KDE Applications 15.04

KDE Applications 15.04 release brings a new version of the scientific programming software Cantor, with a lot of news. I am specially happy with this release because I worked in several parts of these new features. =)

Come with me™ and let’s see what is new in Cantor.

## Cantor ported to Qt5/KF5

Cantor Qt5/KF5 + Breeze theme. In the image it is possible to see the terminal/worksheet, variable management panel, syntax highlighting, code completion, and the standard interface

I started the Cantor port to Qt5/KF5 during previous LaKademy and I continued the development along the year. Maybe I had pushed code from 5 different countries since the beginning of this work.

The change for this new technology was successfully completed, and for the moment we don’t notice any feature missed or new critical bug. All the backends and plugins were ported, and some new bugs created during this work were fixed.

We would like to ask for Cantor users to report any problem or bug in bugzilla. Anyway, the software is really very stable.

When you run Cantor Qt5/KF5 version on the first time, the software will look for Cantor Qt4 configurations and, if it exists, the configurations will be automagically migrated to Cantor Qt5/KF5.

## Backend for Python 3

In Season of KDE 2014 I was the mentor of Minh Ngo in the project to create a backend for Python 3, increasing the number of backends in Cantor to 10!

Backend selection screen: Python 3 and their 9 brothers

The backend developed by Minh uses D-Bus protocol to allow communication between Cantor and Python 3. This architecture is different of Python 2, but it is present in others backends, as in the backend for R.

The cool thing is Cantor can be interesting for pythonistas using Python 2 and/or Python 3 now. We would like to get feedback from you, guys!

## Icon!

Cantor first release was originally in 2009, with KDE SC 4.4. Since that date the software did not have an icon.

The Cantor Qt5/KF5 release marks a substantial change in the development of the application, then it is also a good time to release an icon to the software.

Cantor icon

The art is excellent! It presents the idea of Cantor: a blackboard to you write and develop your equations and formulas while scratches his head and think “and now, what I need to do to solve it?”. =)

Thank you Andreas Kainz and Uri Herrera, members of VDG team and authors of Cantor icon!

## Other changes and bug fixes

Most bugs added in the Qt5/KF5 port were fixed before the release.

There are some small changes to be cited: in KNewStuff categories world, “Python2″ category was changed to “Python 2″ and “Python 3″ category was added; the automatic loading of pylab module in Python backends was dropped; now it is possible to run Python commands mixed with comments in the worksheet; and more.

You can see a complete log of commits, bugfixes, and new features added in this release in this page.

## Future works

As future work maybe the high-priority for this moment is to drop KDELibs4Support from Cantor. Lucas developed part of this work and we would like to finish it for the next release.

I intend to test if D-Bus communication can be a good solution for Scilab backend. Another task is to redesign the graphical generation assistants of Python backends. A long-term work is to follow the creation of Jupyter project, the future of IPython notebooks. If Cantor can to be compatible with Jupyter, it will be really nice for users and to encourage the collaboration between different communities interested in scientific programming and open science.

I will take advantage of the Cantor Qt5/KF5 release to write about how to use Cantor in two different ways: the Matlab way and the IPython notebooks way. Keep your eyes in the updates from this blog! =)

If you would like to help in Cantor development, please contact me or mail kde-edu maillist and let’s talk about bug fixes, development of new features, and more.

## Donations to KDE Brasil – LaKademy 2015!

If you would like to support my work, please make a donation to KDE Brasil. We will host the KDE Latin-American Summit – LaKademy and we need some money to put some latin-american contributors to work together face-to-face. I will focus my LaKademy work in the previously mentioned future works.

You can read more about LaKademy in this dot.KDE history. This page in English explain how to donate. There is other page with the same content in Spanish.

## April 23, 2015

### Titus Brown

#### More on scientific software

So I wrote this thing that got an awful lot of comments, many telling me that I'm just plain wrong. I think it's impossible to respond comprehensively :). But here are some responses.

## What is, what could be, and what should be

In that blog post, I argued that software shouldn't be considered a primary output of scientific research. But I completely failed to articulate a distinction between what we do today with respect to scientific software, what we could be doing in the not-so-distant future, and what we should be doing. Worse, I mixed them all up!

Peer reviewed publications and grants are the current coin of the realm. When we submit papers and grants for peer review, we have to deal with what those reviewers think right now. In bioinformatics, this largely means papers get evaluated on their perceived novelty and impact (even in impact-blind journals). Software papers are generally evaluated poorly on these metrics, so it's hard to publish bioinformatics software papers in visible places, and it's hard to argue in grants to the NIH (and most of the biology-focused NSF) that pure software development efforts are worthwhile. This is what is, and it makes it hard for methods+software research to get publications and funding.

Assuming that you agree that methods+software research is important in bioinformatics, what could we be doing in the near distant future to boost the visibility of methods+software? Giving DOIs to software is one way to accrue credit to software that is highly used, but citations take a long time to pile up, reviewers won't know what to expect in terms of numbers (50 citations? is that a lot?), and my guess is that they will be poorly valued in important situations like funding and advancement. It's an honorable attempt to hack the system and software DOIs are great for other purposes, but I'm not optimistic about their near- or middle-term impact.

We could also start to articulate values and perspectives to guide reviewers and granting systems. And this is what I'd like to do. But first, let me rant a bit.

I think people underestimate the hidden mass in the scientific iceberg. Huge amounts of money are spent on research, and I would bet that there are at least twenty thousand PI-level researchers around the world in biology. In biology-related fields, any of these people may be called upon to review your grant or your paper, and their opinions will largely be, well, their own. To get published, funded, or promoted, you need to convince some committee containing these smart and opinionated researchers that what you're doing is both novel and impactful. To do that, you have to appeal largely to values and beliefs that they already hold.

Moreover, this set of researchers - largely made of people who have reached tenured professor status - sits on editorial boards, funding agency panels, and tenure and promotion committees. None of these boards and funding panels exist in a vacuum, and while to some extent program managers can push in certain directions, they are ultimately beholden to the priorities of the funding agency, which are (in the best case) channeled from senior scientists.

If you wonder why open access took so damn long to happen, this is one reason - the cultural "mass" of researchers that needs to shift their opinions is huge and unwieldy and resistant to change. And they are largely invisible, and subject to only limited persuasion.

One of the most valuable efforts we can make is to explore what we should be doing, and place it on a logical and sensical footing, and put it out there. For example, check out the CRA's memo on best practices in Promotion and Tenure of Interdisciplinary Faculty - great and thoughtful stuff, IMO. We need a bunch of well thought out opinions in this vein. What guidelines do we want to put in place for evaluating methods+software? How should we evaluate methods+software researchers for impact? When we fund software projects, what should we be looking for?

And that brings me to what we should be doing, which is ultimately what I am most interested in. For example, I must admit to deep confusion about what a maturity model for bioinformatics software should look like; this feeds into funding requests, which ultimately feeds into promotion and tenure. I don't know how to guide junior faculty in this area either; I have lots of opinions, but they're not well tested in the marketplace of ideas.

I and others are starting to have the opportunity to make the case for what we should be doing in review panels; what case should we make?

It is in this vein, then, that I am trying to figure out what value to place on software itself, and I'm interested in how to promote methods+software researchers and research. Neil Saunders had an interesting comment that I want to highlight here: he said,

My own feeling is that phrases like "significant intellectual contribution" are just unhelpful academic words,

I certainly agree that this is an imprecise concept, but I can guarantee that in the US, this is one of the three main questions for researchers at hiring, promotion, and tenure. (Funding opportunities and fit are my guesses for the other two.) So I would push on this point: researchers need to appear to have a clear intellectual contribution at every stage of the way, whatever that means. What it means is what I'm trying to explore.

## Software is a tremendously important and critical part of the research endeavor

...but it's not enough. That's my story, and I'm sticking to it :).

I feel like the conversation got a little bit sidetracked by discussions of Nobel Prizes (mea partly culpa), and I want to discuss PhD theses instead. To get a PhD, you need to do some research; if you're a bioinformatics or biology grad student who is focused on methods+software, how much of that research can be software, and what else needs to be there?

And here again I get to dip into my own personal history.

I spent 9 years in graduate school. About 6 years into my PhD, I had a conversation with my advisor that went something like this:

Me, age ~27 - "Hey, Eric, I've got ~two first-author papers, and another one or two coming, along with a bunch of papers. How about I defend my PhD on the basis of that work, and stick around to finish my experimental work as a postdoc?"

Eric - blank look "All your papers are on computational methods. None of them count for your PhD."

Me - "Uhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhmmmmmmmmmm..."

(I did eventually graduate, but only after three more years of experiments.)

In biology, we have to be able to defend our computational contributions in the face of an only slowly changing professoriate. And I'm OK with that, but I think we should make it clear up front.

Since then, I've graduated three (soon to be five, I hope!) graduate students, one in biology and two in CS. In every single case, they've done a lot of hacking. And in every single case they've been asked to defend their intellectual contribution. This isn't just people targeting my students - I've sat on committees where students have produced masses of experimental data, and if they weren't prepared to defend their experimental design, their data interpretation, and the impact and significance of their data interpretation, they weren't read to defend. This is a standard part of the PhD process at Caltech, at MSU, and presumably at UC Davis.

So: to successfully receive a PhD, you should have to clearly articulate the problem you're tackling, its place in the scientific literature, the methods and experiments you're going to use, the data you got, the interpretation you place on that data, and the impact of their results on current scientific thinking. It's a pretty high bar, and one that I'm ok with.

One of the several failure modes I see for graduate students is the one where graduate students spend a huge amount of time developing software and more or less assume that this work will lead to a PhD. Why would they be thinking that?

• Their advisor may not be particularly computational and may be giving poor guidance (which includes poorly explained criteria).
• Their advisor may be using them (intentionally or unintentionally) - effective programmers are hard to find.
• The grad student may be resistant to guidance.

I ticked all of these as a graduate student, but I had the advantage of being a 3rd-generation academic, so I knew the score. (And I still ran into problems.) In my previous blog post, I angered and upset some people by my blunt words (I honestly didn't think "grad student hacker fallacy" was so rude ;( but it's a real problem that I confront regularly.

Computational PhD students need to do what every scientific PhD student needs to do: clearly articulate their problem, place it in the scientific literature, define the computational methods and experiments they're going to do/have done, explain the data and their interpretation of it, and explore how it impacts science. Most of this involves things other than programming and running software! It's impossible to put down percent effort estimates that apply broadly, but my guess is that PhD students should spend at least a year understanding your results and interpreting and explaining their work.

Conveniently, however, once you've done that for your PhD, you're ready to go in the academic world! These same criteria (expanded in scope) apply to getting a postdoc, publishing as a postdoc, getting a faculty position, applying for grants, and getting tenure. Moreover, I believe many of the same criteria apply broadly to research outside of academia (which is one reason I'm still strongly +1 on getting a PhD, no matter your ultimate goals).

(Kyle Cranmer's comment on grad student efforts here was perfect.)

## Software as...

As far as software being a primary product of research -- Konrad Hinsen nails it. It's not, but neither are papers, and I'm good with both statements :). Read his blog post for the full argument. The important bit is that very little stands on its own; there always needs to be communication effort around software, data, and methods.

Ultimately, I learned a lot by admitting confusion! Dan Katz and Konrad Hinsen pointed out that software is communication, and Kai Blin drew a great analogy between software and experimental design. These are perspectives that I hadn't seen said so clearly before and they've really made me think differently; both are interesting and provocative analogies and I'm hoping that we can develop them further as a community.

## How do we change things?

Kyle Cranmer and Rory Kirchner had a great comment chain on broken value systems and changing the system. I love the discussion, but I'm struggling with how to respond. My tentative and mildly unhappy conclusion is that I may have bought into the intellectual elitism of academia a bit too much (see: third generation academic), but this may also be how I've gotten where I am, so... mixed bag? (Rory made me feel old and dull, too, which is pretty cool in a masochistic kind of way.)

One observation is that, in software, novelty is cheap. It's very, very easy to tweak something minorly, and fairly easy to publish it without generating any new understanding. How do we distinguish a future Heng Li or an Aaron Quinlan (who have enabled new science by cleanly solving "whole classes of common problems that you don't even have to think about anymore") from humdrum increment, and reward them properly in the earlier stages of their career? I don't know, but the answer has to be tied to advancing science, which is hard to measure on any short timescale. (Sean Eddy's blog post has the clearest view on solutions that I've yet seen.)

Another observation (nicely articulated by Daisie Huang) is that (like open data) this is another game theoretic situation, where the authors of widely used software sink their time and energy into the community but don't necessarily gain wide recognition for their efforts. There's a fat middle ground of software that's reasonably well used but isn't samtools, and this ecosystem needs to be supported. This is much harder to argue - it's a larger body of software, it's less visible, and it's frankly much more expensive to support. (Carl Boettiger's comment is worth reading here.) The funding support isn't there, although that might change in the next decade. (This is the proximal challenge for me, since I place my own software, khmer, in this "fat middle ground"; how do I make a clear argument for funding?)

Kyle Cranmer and others pointed to some success in "major instrumentation" and methods-based funding and career paths in physics (help, can't find link/tweets!). This is great, but I think it's also worth discussing the overall scale of things. Physics has a few really big and expensive instruments, with a few big questions, and with thousands of engineers devoted to them. Just in sequencing, biology has thousands (soon millions) of rather cheap instruments, devoted to many thousands of questions. If my prediction that software will "eat" this part of the world becomes true, we will need tens of thousands of data intensive biologists at a minimum, most working to some large extent on data analysis and software. I think the scale of the need here is simply much, much larger than in physics.

I am supremely skeptical of the idea that universities as we currently conceive of them are the right home for stable, mature software development. We either need to change universities in the right way (super hard) or find other institutions (maybe easier). Here, the model to watch may well be the Center for Open Science, which produces the Open Science Framework (among others). My interpretation is that they are trying to merge scientific needs with the open source development model. (Tellingly, they are doing so largely with foundation funding; the federal funding agencies don't have good mechanisms for funding this kind of thing in biology, at least.) This may be the right model (or at least on the path towards one) for sustained software development in the biological sciences: have an institution focused on sustainability and quality, with a small diversity of missions, that can afford to spend the money to keep a number of good software engineers focused on those missions.

Thanks, all, for the comments and discussions!

--titus

## April 22, 2015

### Gaël Varoquaux

#### MLOSS: machine learning open source software workshop @ ICML 2015

Note

This year again we will have an exciting workshop on the leading-edge machine-learning open-source software. This subject is central to many, because software is how we propagate, reuse, and apply progress in machine learning.

Want to present a project? The deadline for the call for papers is Apr 28th, in a few days : http://mloss.org/workshop/icml15/

The workshop will be help at the ICML conference, in Lille France, on July 10th. ICML –International Conference in Machine Learning– is the leading venue for academic research in machine learning. It’s a fantastic place to hold such a workshop, as the actors of theoretical progress are all around. Software is the bridge that brings this progress beyond papers.

There is a long tradition of MLOSS workshop, with one every year and a half. Last time, at NIPS 2013, I could feel a bit of a turning point, as people started feeling that different software slotted together, to create an efficient and state-of-the art working environment. For this reason, we have entitled this year’s workshop ‘open ecosystems’, stressing that contributions in the scope of the workshop, that build a thriving work environment, are not only machine learning software, but also better statistics or numerical tools.

We have two keynotes with important contributions to such ecosystems:

• John Myles White (Facebook), lead developer of Julia statistics and machine learning: “Julia for machine learning: high-level syntax with compiled-code speed”
• Matthew Rocklin (Continuum Analytics), developer of Python computational tools, in particular Blaze (confirmed): “Blaze, a modern numerical engine with out-of-core and out-of-order computations”.

There will be also a practical presentation on how to set up an open-source project, discussing hosting, community development, quality assurance, license choice, by yours truly.

## April 21, 2015

### Titus Brown

#### Is software a primary product of science?

Update - I've written Yet Another blog post, More on scientific software on this topic. I think this blog post is a mess so you should read that one first ;).

This blog post was spurred by a simple question from Pauline Barmby on Twitter. My response didn't, ahem, quite fit in 144 characters :).

First, a little story. (To paraphrase Greg Wilson, "I tell a lot of stories. Some of them aren't true. But this one is!")

When we were done writing Best Practices for Scientific Computing, we tried submitting it to a different high-profile journal than the one that ultimately accepted it (PLoS Biology, where it went on to become the most highly read article of 2014 in PLoS Biology). The response from the editor went something like this: "We recognize the importance of good engineering, but we regard writing software as equivalent to building a telescope - it's important to do it right, but we don't regard a process paper on how to build telescopes better as an intellectual contribution." (Disclaimer: I can't find the actual response, so this is a paraphrase, but it was definitely a "no" and for about that reason.)

## Is scientific software like instrumentation?

When I think about scientific software as a part of science, I inevitably start with its similarities to building scientific instruments. New instrumentation and methods are absolutely essential to scientific progress, and it is clear that good engineering and methods development skills are incredibly helpful in research.

So, why did the editors at High Profile Journal bounce our paper? I infer that they drew exactly this parallel and thought no further.

But scientific software is only somewhat like new methods or instrumentation.

First, software can spread much faster and be used much more like a black box than most methods, and instrumentation inevitably involves either construction or companies that act as middlemen. With software, it's like you're shipping kits or plans for 3-D printing - something that is as close to immediately usable as it comes. If you're going to hand someone an immediately usable black box (and pitch it as such), I would argue that you should take a bit more care in building said black box.

Second, complexity in software scales much faster than in hardware (citation needed). This is partly due to human nature & a failure to think long-term, and partly due to the nature of software - software can quickly have many more moving parts than hardware, and at much less (short term) cost. Frankly, most software stacks resemble massive Rube Goldberg machines (read that link!) This means that different processes are needed here.

Third, at least in my field (biology), we are undergoing a transition to data intensive research, and software methods are becoming ever more important. There's no question that software is going to eat biology just like it's eating the rest of the world, and an increasingly large part of our primary scientific output in biology is going to hinge directly on computation (think: annotations. 'nuff said).

If we're going to build massively complex black boxes that under-pin all of our science, surely that means that the process is worth studying intellectually?

## Is scientific software a primary intellectual output of science?

No.

I think concluding that it is is an example of the logical fallacy "affirming the consequent" - or, "confusion of necessity and sufficiency". I'm not a logician, but I would phrase it like this (better phrasing welcome!) --

Good software is necessary for good science. Good science is an intellectual contribution. Therefore good software is an intellectual contribution.

Hopefully when phrased that way it's clear that it's nonsense.

I'm naming this "the fallacy of grad student hackers", because I feel like it's a common failure mode of grad students that are good at programming. I actually think it's a tremendously dangerous idea that is confounding a lot of the discussion around software contributions in science.

To illustrate this, I'll draw the analog to experimental labs: you may have people who are tremendously good at doing certain kinds of experiments (e.g. expert cloners, or PCR wizards, or micro-injection aficionados, or WMISH bravados) and with whom you can collaborate to rapidly advance your research. They can do things that you can't, and they can do them quickly and well! But these people often face dead ends in academia and end up as eterna-postdocs, because (for better or for worse) what is valued for first authorship and career progression is intellectual contribution, and doing experiments well is not sufficient to demonstrate an intellectual contribution. Very few people get career advancement in science by simply being very good at a technique, and I believe that this is OK.

Back to software - writing software may become necessary for much of science but I don't think it should ever be sufficient as a primary contribution. Worse, it can become (often becomes?) an engine of procrastination. Admittedly, that procrastination leads to things like IPython Notebook, so I don't want to ding it, but neither are all (or even most ;) grad students like Fernando Perez, either.

## Let's admit it, I'm just confused

This leaves us with a conundrum.

Software is clearly a force multiplier - "better software, better research!.

However, I don't think it can be considered a primary output of science. Dan Katz said, "Nobel prizes have been given for inventing instruments. I'm eagerly awaiting for one for inventing software [sic]" -- but I think he's wrong. Nobels have been given because of the insight enabled by inventing instruments, not for inventing instruments. (Corrections welcome!) So while I, too, eagerly await the explicit recognition that software can push scientific insight forward in biology, I am not holding my breath - I think it's going to look much more like the 2013 Chemistry Nobel, which is about general computational methodology. (My money here would be on a Nobel in Medicine for genome assembly methods, which should follow on separately from massively parallel sequencing methods and shotgun sequencing - maybe Venter, Church, and Myers/Pevzner deserve three different Nobels?)

Despite that, we do need to incentivize it, especially in biology but also more generally. Sean Eddy wrote AN AWESOME BLOG POST ON THIS TOPIC in 2010 (all caps because IT'S AWESOME AND WHY HAVEN'T WE MOVED FURTHER ON THIS <sob>). This is where DOIs for software usually come into play - hey, maybe we can make an analogy between software and papers! But I worry that this is a flawed analogy (for reasons outlined above) and will simply support the wrong idea that doing good hacking is sufficient for good science.

We also have a new problem - the so-called Big Data Brain Drain, in which it turns out that the skills that are needed for advancing science are also tremendously valuable in much more highly paid jobs -- much like physics number crunchers moving to finance, research professors in biology face a future where all our grad students go on to make more than us in tech. (Admittedly, this is only a problem if we think that more people clicking on ads is more important than basic research.) Jake Vanderplas (the author of the Big Data Brain Drain post) addressed potential solutions to this in Hacking Academia, about which I have mixed feelings. While I love both Jake and his blog post (platonically), there's a bit too much magical thinking in that post -- I don't see (m)any of those solutions getting much traction in academia.

The bottom line for me is that we need to figure it out, but I'm a bit stuck on practical suggestions. Natural selection may apply -- whoever figures this out in biology (basic research institutions and/or funding bodies) will have quite an edge in advancing biomedicine -- but natural selection works across multiple generations, and I could wish for something a bit faster. But I don't know. Maybe I'll bring it up at SciFoo this year - "Q: how can we kill off the old academic system faster?" :)

I'll leave you with two little stories.

## The problem, illustrated

In 2009, we started working on what would ultimately become Pell et al., 2012. We developed a metric shit-ton of software (that's a scientific measure, folks) that included some pretty awesomely scalable sparse graph labeling approaches. The software worked OK for our problem, but was pretty brittle; I'm not sure whether or not our implementation of this partitioning approach is being used by anyone else, nor am I sure if it should be :).

However, the paper has been a pretty big hit by traditional scientific metrics! We got it into PNAS by talking about the data structure properties and linking physics, computer science, and biology together. It helped lead directly to Chikhi and Rizk (2013), and it has been cited a whole bunch of times for (I think) its theoretical contributions. Yay!

Nonetheless, the incredibly important and tricky details of scalably partitioning 10 bn node graphs were lost from that paper, and the software was not a big player, either. Meanwhile, Dr. Pell left academia and moved on to a big software company where (on his first day) he was earning quite a bit more than me (good on him! I'd like a 5% tithe, though, in the future :) :). Trust me when I say that this is a net loss to academia.

Summary: good theory, useful ideas, lousy software. Traditional success. Lousy outcomes.

## A contrapositive

In 2011, we figured out that linear compression ratios for sequence data simply weren't going to cut it in the face of the continued rate of data generation, and we developed digital normalization, a deceptively simple idea that hasn't really been picked up by the theoreticians. Unlike the Pell work above, it's not theoretically well studied at all. Nonetheless, the preprint has a few dozen citations (because it's so darn useful) and the work is proving to be a good foundation for further research for our lab. Perhaps the truest measure of its memetic success is that it's been reimplemented by at least three different sequencing centers.

The software is highly used, I think, and many of our efforts on the khmer software have been aimed at making diginorm and downstream concepts more robust.

Summary: lousy theory, useful ideas, good software. Nontraditional success. Awesome outcomes.

## Ways forward?

I simply don't know how to chart a course forward. My current instinct (see below) is to shift our current focus much more to theory and ideas and further away from software, largely because I simply don't see how to publish or fund "boring" things like software development. (Josh Bloom has an excellent blog post that relates to this particular issue: Novelty Squared)

I've been obsessing over these topics of software and scientific focus recently (see The three porridge bowls of scientific software development and Please destroy this software after publication. kthxbye) because I'm starting to write a renewal for khmer's funding. My preliminary specific aims look something like this:

Aim 1: Expand low memory and streaming approaches for biological sequence analysis.

Aim 2: Develop graph-based approaches for analyzing genomic variation.

Aim 3: Optimize and extend a general purpose graph analysis library

Importantly, everything to do with software maintenance, support, and optimization is in Aim 3 and is in fact only a part of that aim. I'm not actually saddened by that, because I believe that software is only interesting because of the new science it enables. So I need to sell that to the NIH, and there software quality is (at best) a secondary consideration.

On the flip side, by my estimate 75% of our khmer funding is going to software maintenance, most significantly in paying down our technical debt. (In the grant I am proposing to decrease this to ~50%.)

I'm having trouble justifying this dichotomy mentally myself, and I can only imagine what the reviewers might think (although hopefully they will only glance at the budget ;).

So this highlights one conundrum: given my estimates and my priorities, how would you suggest I square these stated priorities with my funding allocations? And, in these matters, have I been wrong to focus on software quality, or should I have focused instead on accruing technical debt in the service of novel ideas and functionality? Inquiring minds want to know.

--titus

### Matthew Rocklin

#### Profiling Data Throughput

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

Disclaimer: This post is on experimental/buggy code.

tl;dr We measure the costs of processing semi-structured data like JSON blobs.

## Semi-structured Data

Semi-structured data is ubiquitous and computationally painful. Consider the following JSON blobs:

{'name': 'Alice',   'payments': [1, 2, 3]}
{'name': 'Bob',     'payments': [4, 5]}
{'name': 'Charlie', 'payments': None}


This data doesn’t fit nicely into NumPy or Pandas and so we fall back to dynamic pure-Python data structures like dicts and lists. Python’s core data structures are surprisingly good, about as good as compiled languages like Java, but dynamic data structures present some challenges for efficient parallel computation.

## Volume

Semi-structured data is often at the beginning of our data pipeline and so often has the greatest size. We may start with 100GB of raw data, reduce to 10GB to load into a database, and finally aggregate down to 1GB for analysis, machine learning, etc., 1kB of which becomes a plot or table.

Data Bandwidth (MB/s) In Parallel (MB/s)
Disk I/O500500
Decompression100500
Deserialization50250
In-memory computation2000oo
Shuffle930

Common solutions for large semi-structured data include Python iterators, multiprocessing, Hadoop, and Spark as well as proper databases like MongoDB and ElasticSearch. Two months ago we built dask.bag, a toy dask experiment for semi-structured data. Today we’ll strengthen the dask.bag project and look more deeply at performance in this space.

We measure performance with data bandwidth, usually in megabytes per second (MB/s). We’ll build intuition for why dealing with this data is costly.

## Dataset

As a test dataset we play with a dump of GitHub data from https://www.githubarchive.org/. This data records every public github event (commit, comment, pull request, etc.) in the form of a JSON blob. This data is representative fairly representative of a broader class of problems. Often people want to do fairly simple analytics, like find the top ten committers to a particular repository, or clean the data before they load it into a database.

We’ll play around with this data using dask.bag. This is both to get a feel for what is expensive and to provide a cohesive set of examples. In truth we won’t do any real analytics on the github dataset, we’ll find that the expensive parts come well before analytic computation.

Items in our data look like this:

>>> import json
>>> path = '/home/mrocklin/data/github/2013-05-0*.json.gz'
({u'actor': u'mjcramer',
u'actor_attributes': {u'gravatar_id': u'603762b7a39807503a2ee7fe4966acd1',
u'type': u'User'},
u'created_at': u'2013-05-01T00:01:28-07:00',
u'master_branch': u'master',
u'ref': None,
u'ref_type': u'repository'},
u'public': True,
u'repository': {u'created_at': u'2013-05-01T00:01:28-07:00',
u'description': u'',
u'fork': False,
u'forks': 0,
u'has_issues': True,
u'has_wiki': True,
u'id': 9787210,
u'master_branch': u'master',
u'name': u'settings',
u'open_issues': 0,
u'owner': u'mjcramer',
u'private': False,
u'pushed_at': u'2013-05-01T00:01:28-07:00',
u'size': 0,
u'stargazers': 0,
u'url': u'https://github.com/mjcramer/settings',
u'watchers': 0},
u'type': u'CreateEvent',
u'url': u'https://github.com/mjcramer/settings'},)

## Disk I/O and Decompression – 100-500 MB/s

Data Bandwidth (MB/s)
Parallel Read from disk with gzip.open500

A modern laptop hard drive can theoretically read data from disk to memory at 800 MB/s. So we could burn through a 10GB dataset in fifteen seconds on our laptop. Workstations with RAID arrays can do a couple GB/s. In practice I get around 500 MB/s on my personal laptop.

In [1]: import json
In [2]: import dask.bag as db
In [3]: from glob import glob
In [4]: path = '/home/mrocklin/data/github/2013-05-0*.json.gz'

In [5]: %time compressed = '\n'.join(open(fn).read() for fn in glob(path))
CPU times: user 75.1 ms, sys: 1.07 s, total: 1.14 s
Wall time: 1.14 s

In [6]: len(compressed) / 0.194 / 1e6  # MB/s
508.5912175438597

To reduce storage and transfer costs we often compress data. This requires CPU effort whenever we want to operate on the stored values. This can limit data bandwidth.

In [7]: import gzip
In [8]: %time total = '\n'.join(gzip.open(fn).read() for fn in glob(path))
CPU times: user 12.2 s, sys: 18.7 s, total: 30.9 s
Wall time: 30.9 s

In [9]: len(total) / 30.9 / 1e6         # MB/s  total bandwidth
Out[9]: 102.16563844660195

In [10]: len(compressed) / 30.9 / 1e6   # MB/s  compressed bandwidth
Out[10]: 18.763559482200648

So we lose some data bandwidth through compression. Where we could previously process 500 MB/s we’re now down to only 100 MB/s. If we count bytes in terms of the amount stored on disk then we’re only hitting 18 MB/s. We’ll get around this with multiprocessing.

## Decompression and Parallel processing – 500 MB/s

Fortunately we often have more cores than we know what to do with. Parallelizing reads can hide much of the decompression cost.

In [12]: import dask.bag as db

In [13]: %time nbytes = db.from_filenames(path).map(len).sum().compute()
CPU times: user 130 ms, sys: 402 ms, total: 532 ms
Wall time: 5.5 s

In [14]: nbytes / 5.5 / 1e6
Out[14]: 573.9850932727272

Dask.bag infers that we need to use gzip from the filename. Dask.bag currently uses multiprocessing to distribute work, allowing us to reclaim our 500 MB/s throughput on compressed data. We also could have done this with multiprocessing, straight Python, and a little elbow-grease.

## Deserialization – 30 MB/s

Data Bandwidth (MB/s)

Once we decompress our data we still need to turn bytes into meaningful data structures (dicts, lists, etc..) Our GitHub data comes to us as JSON. This JSON contains various encodings and bad characters so, just for today, we’re going to punt on bad lines. Converting JSON text to Python objects explodes out in memory a bit, so we’ll consider a smaller subset for this part, a single day.

In [20]: def loads(line):
...          except: return None

In [21]: path = '/home/mrocklin/data/github/2013-05-01-*.json.gz'
In [22]: lines = list(db.from_filenames(path))

In [23]: %time blobs = list(map(loads, lines))
CPU times: user 10.7 s, sys: 760 ms, total: 11.5 s
Wall time: 11.3 s

In [24]: len(total) / 11.3 / 1e6
Out[24]: 33.9486321238938

In [25]: len(compressed) / 11.3 / 1e6
Out[25]: 6.2989179646017694

So in terms of actual bytes of JSON we can only convert about 30MB per second. If we count in terms of the compressed data we store on disk then this looks more bleak at only 6 MB/s.

### This can be improved by using faster libraries – 50 MB/s

The ultrajson library, ujson, is pretty slick and can improve our performance a bit. This is what Pandas uses under the hood.

In [28]: import ujson
...          except: return None

In [30]: %time blobs = list(map(loads, lines))
CPU times: user 6.37 s, sys: 1.17 s, total: 7.53 s
Wall time: 7.37 s

In [31]: len(total) / 7.37 / 1e6
Out[31]: 52.05149837177748

In [32]: len(compressed) / 7.37 / 1e6
Out[32]: 9.657771099050203

### Or through Parallelism – 150 MB/s

This can also be accelerated through parallelism, just like decompression. It’s a bit cumbersome to show parallel deserializaiton in isolation. Instead we’ll show all of them together. This will under-estimate performance but is much easier to code up.

In [33]: %time db.from_filenames(path).map(loads).count().compute()
CPU times: user 32.3 ms, sys: 822 ms, total: 854 ms
Wall time: 2.8 s

In [38]: len(total) / 2.8 / 1e6
Out[38]: 137.00697964285717

In [39]: len(compressed) / 2.8 / 1e6
Out[39]: 25.420633214285715

## Mapping and Grouping - 2000 MB/s

Data Bandwidth (MB/s)
Simple Python operations1400
Complex CyToolz operations2600

Once we have data in memory, Pure Python is relatively fast. Cytoolz moreso.

In [55]: %time set(d['type'] for d in blobs)
CPU times: user 162 ms, sys: 123 ms, total: 285 ms
Wall time: 268 ms
Out[55]:
{u'CommitCommentEvent',
u'CreateEvent',
u'DeleteEvent',
u'FollowEvent',
u'ForkEvent',
u'GistEvent',
u'GollumEvent',
u'IssueCommentEvent',
u'IssuesEvent',
u'MemberEvent',
u'PublicEvent',
u'PullRequestEvent',
u'PullRequestReviewCommentEvent',
u'PushEvent',
u'WatchEvent'}

In [56]: len(total) / 0.268 / 1e6
Out[56]: 1431.4162052238805

In [57]: import cytoolz
In [58]: %time _ = cytoolz.groupby('type', blobs)  # CyToolz FTW
CPU times: user 144 ms, sys: 0 ns, total: 144 ms
Wall time: 144 ms

In [59]: len(total) / 0.144 / 1e6
Out[59]: 2664.024604166667

So slicing and logic are essentially free. The cost of compression and deserialization dominates actual computation time. Don’t bother optimizing fast per-record code, especially if CyToolz has already done so for you. Of course, you might be doing something expensive per record. If so then most of this post isn’t relevant for you.

## Shuffling - 5-50 MB/s

Data Bandwidth (MB/s)
Naive groupby with on-disk Shuffle25
Clever foldby without Shuffle250

For complex logic, like full groupbys and joins, we need to communicate large amounts of data between workers. This communication forces us to go through another full serialization/write/deserialization/read cycle. This hurts. And so, the single most important message from this post:

That being said, people will inevitably ignore this advice so we need to have a not-terrible fallback.

In [62]: %time dict(db.from_filenames(path)
...                   .groupby('type')
...                   .map(lambda (k, v): (k, len(v))))
CPU times: user 46.3 s, sys: 6.57 s, total: 52.8 s
Wall time: 2min 14s
Out[62]:
{'CommitCommentEvent': 17889,
'CreateEvent': 210516,
'DeleteEvent': 14534,
'FollowEvent': 35910,
'ForkEvent': 67939,
'GistEvent': 7344,
'GollumEvent': 31688,
'IssueCommentEvent': 163798,
'IssuesEvent': 102680,
'MemberEvent': 11664,
'PublicEvent': 1867,
'PullRequestEvent': 69080,
'PullRequestReviewCommentEvent': 17056,
'PushEvent': 960137,
'WatchEvent': 173631}

In [63]: len(total) / 134 / 1e6  # MB/s
Out[63]: 23.559091

This groupby operation goes through the following steps:

2. Decompress GZip
3. Deserialize with ujson
4. Do in-memory groupbys on chunks of the data
5. Reserialize with msgpack (a bit faster)
6. Append group parts to disk
7. Read in new full groups from disk
8. Deserialize msgpack back to Python objects
9. Apply length function per group

Some of these steps have great data bandwidths, some less-so. When we compound many steps together our bandwidth suffers. We get about 25 MB/s total. This is about what pyspark gets (although today pyspark can parallelize across multiple machines while dask.bag can not.)

Disclaimer, the numbers above are for dask.bag and could very easily be due to implementation flaws, rather than due to inherent challenges.

>>> import pyspark
>>> sc = pyspark.SparkContext('local[8]')
>>> rdd = sc.textFile(path)
...         .keyBy(lambda d: d['type'])
...         .groupByKey()
...         .map(lambda (k, v): (k, len(v)))
...         .collect())

I would be interested in hearing from people who use full groupby on BigData. I’m quite curious to hear how this is used in practice and how it performs.

## Creative Groupbys - 250 MB/s

Don’t use groupby. You can often work around it with cleverness. Our example above can be handled with streaming grouping reductions (see toolz docs.) This requires more thinking from the programmer but avoids the costly shuffle process.

In [66]: %time dict(db.from_filenames(path)
...                   .foldby('type', lambda total, d: total + 1, 0, lambda a, b: a + b))
Out[66]:
{'CommitCommentEvent': 17889,
'CreateEvent': 210516,
'DeleteEvent': 14534,
'FollowEvent': 35910,
'ForkEvent': 67939,
'GistEvent': 7344,
'GollumEvent': 31688,
'IssueCommentEvent': 163798,
'IssuesEvent': 102680,
'MemberEvent': 11664,
'PublicEvent': 1867,
'PullRequestEvent': 69080,
'PullRequestReviewCommentEvent': 17056,
'PushEvent': 960137,
'WatchEvent': 173631}
CPU times: user 322 ms, sys: 604 ms, total: 926 ms
Wall time: 13.2 s

In [67]: len(total) / 13.2 / 1e6  # MB/s
Out[67]: 239.16047181818183

We can also spell this with PySpark which performs about the same.

>>> dict(rdd.map(loads)  # PySpark equivalent
...         .keyBy(lambda d: d['type'])
...         .combineByKey(lambda d: 1, lambda total, d: total + 1, lambda a, b: a + b)
...         .collect())

## Use a Database

By the time you’re grouping or joining datasets you probably have structured data that could fit into a dataframe or database. You should transition from dynamic data structures (dicts/lists) to dataframes or databases as early as possible. DataFrames and databases compactly represent data in formats that don’t require serialization; this improves performance. Databases are also very clever about reducing communication.

Tools like pyspark, toolz, and dask.bag are great for initial cleanings of semi-structured data into a structured format but they’re relatively inefficient at complex analytics. For inconveniently large data you should consider a database as soon as possible. That could be some big-data-solution or often just Postgres.

## Better data structures for semi-structured data?

Dynamic data structures (dicts, lists) are overkill for semi-structured data. We don’t need or use their full power but we inherit all of their limitations (e.g. serialization costs.) Could we build something NumPy/Pandas-like that could handle the blob-of-JSON use-case? Probably.

DyND is one such project. DyND is a C++ project with Python bindings written by Mark Wiebe and Irwin Zaid and historically funded largely by Continuum and XData under the same banner as Blaze/Dask. It could probably handle the semi-structured data problem case if given a bit of love. It handles variable length arrays, text data, and missing values all with numpy-like semantics:

>>> from dynd import nd
>>> data = [{'name': 'Alice',                       # Semi-structured data
...          'location': {'city': 'LA', 'state': 'CA'},
...          'credits': [1, 2, 3]},
...         {'name': 'Bob',
...          'credits': [4, 5],
...          'location': {'city': 'NYC', 'state': 'NY'}}]

>>> dtype = '''var * {name: string,
...                   location: {city: string,
...                              state: string[2]},
...                   credits: var * int}'''        # Shape of our data

>>> x = nd.array(data, type=dtype)                  # Create DyND array

>>> x                                               # Store compactly in memory
nd.array([["Alice", ["LA", "CA"], [1, 2, 3]],
["Bob", ["NYC", "NY"], [4, 5]]])

>>> x.location.city                                 # Nested indexing
nd.array([ "LA", "NYC"],
type="strided * string")

>>> x.credits                                       # Variable length data
nd.array([[1, 2, 3],    [4, 5]],
type="strided * var * int32")

>>> x.credits * 10                                  # And computation
nd.array([[10, 20, 30],     [40, 50]],
type="strided * var * int32")

Sadly DyND has functionality gaps which limit usability.

>>> -x.credits                                      # Sadly incomplete :(
TypeError: bad operand type for unary -

I would like to see DyND mature to the point where it could robustly handle semi-structured data. I think that this would be a big win for productivity that would make projects like dask.bag and pyspark obsolete for a large class of use-cases. If you know Python, C++, and would like to help DyND grow I’m sure that Mark and Irwin would love the help

## Comparison with PySpark

1. Doesn’t engage the JVM, no heap errors or fiddly flags to set
2. Conda/pip installable. You could have it in less than twenty seconds from now.
3. Slightly faster in-memory implementations thanks to cytoolz; this isn’t important though
4. Good handling of lazy results per-partition
5. Faster / lighter weight start-up times
6. (Subjective) I find the API marginally cleaner

PySpark pros:

1. Supports distributed computation (this is obviously huge)
2. More mature, more filled out API
3. HDFS integration

Dask.bag reinvents a wheel; why bother?

1. Given the machinery inherited from dask.array and toolz, dask.bag is very cheap to build and maintain. It’s around 500 significant lines of code.
2. PySpark throws Python processes inside a JVM ecosystem which can cause some confusion among users and a performance hit. A task scheduling system in the native code ecosystem would be valuable.
3. Comparison and competition is healthy
4. I’ve been asked to make a distributed array. I suspect that distributed bag is a good first step.

## April 20, 2015

### Titus Brown

#### Statistics from applications to the 2015 course on NGS analysis

Here are some statistics from this year's applications to the NGS course. Briefly, this is a two-week workshop on sequence analysis at the command line and in the cloud.

The short version is that demand remains high; note that we admit only 24 applicants, so generally < 20%...

Year Number of applications Note
2010 33
2011 133
2012 170
2013 210
2014 170 (shifted the timing to Aug)
2015 155 (same timing as 2014)

The demand is still high, although maybe starting to dip?

Status: Number Percent
1st or 2nd year graduate student 20 12.6%
3rd year+ graduate student 40 25.2%
Post-doctoral researcher 36 22.6%
Non-tenure faculty or staff 20 12.6%
Tenure-line faculty 24 15.1%
Other 19 11.9%

Lots of tenure-line faculty feel they need this training...

Primary training/background: Number Percent
Bioinformatics 11 6.9%
Biology 112 70.4%
Computer Science 3 1.9%
Physics 0 0%
Other 33 20.8%

I should look into "Other"!

--titus

## April 19, 2015

### Titus Brown

#### Dear Doc Brown: how can I find a postdoc where I can do open science?

I got the following e-mail from a student I know -- lightly edited to protect the innocent:

I am at the stage where I am putting together a list of people that I want to post-doc with.

So, a question to you:

1. How can I find people who do open science?

2. Say I go for an interview, would it be "polite" to ask them to see if I can do open science even if they're not doing it themselves? Do you have any suggestions on how, exactly, to ask?

The reason why I am asking is because I rarely hear about openly doing (or even talking about) science in biomedical fields, outside of the standard communication methods (e.g. presenting at a meeting). Most of the people in my field seem somewhat conservative in this regard. Plus, I really don't want to butt heads with my postdoc mentor on this kind of topic.

Any advice? I have some but I'll save it for later so I can incorporate other people's advice ;).

thanks,

--titus

p.s. Yes, I have permission to post this question!

### Paul Ivanov

#### My first 200 K

Yesterday, I rode my longest bike ride to date - the El Cerrito-Davis 200K - with the San Francisco Randonneurs. A big thank you to all the volunteers and randos who made my first 200k so much fun.

First, for the uninitiated, an aside about randonneuring:

I discovered the sport because I'm a cheapskate. I had gotten more and more into cycling over the past 2 years or so, and though I was riding through the East Bay hills mostly alone, I wanted to do a "Century" - a 100 mile ride. Looking up local rides I found out that, while most centuries cost a nontrivial amount of money for the poor grad student I was back then ($60-$200), the San Francisco Randonneur rides were all $10-$20. As I dug deeper and learned about the sport, I found out that The reason for low cost, is that rando rides are unsupported - randonneuring is all about self sufficiency. You are expected to bring gear to fix your own flats, as well as carry or procure your own snacks and beverages.

The Randonneurs USA (RUSA) website succinctly summarizes the sport.

Randonneuring is long-distance unsupported endurance cycling. This style of riding is non-competitive in nature, and self-sufficiency is paramount. When riders participate in randonneuring events, they are part of a long tradition that goes back to the beginning of the sport of cycling in France and Italy. Friendly camaraderie, not competition, is the hallmark of randonneuring.

This description is strikingly similar to my beliefs both about cycling and computing, so I knew I found a new activity I would greatly enjoy. This has been borne out on three previous occasions when I have participated in the ~100 km Populaire rides, which are intended as a way to introduce riders to the sport, yet still be within reach of wide variety of cycling abilities.

We moved in late December, and between unpacking, rainy weekends, and being sick - I haven't been able to get much riding in. However, I've been wanting to do a 200K for a long time - my century, coveted for years, seemed within reach more than ever. To seal my commitment to the ride, I went ahead and ordered a spiffy looking SFR cycling jersey:

Here's the 213km (132 miles) ride map and elevation profile, what follows is my ride report.

## Start Control

The ride started at 8am at a Starbucks by El Cerrito BART station, so I just rode there from my house as I frequently do on my morning commute. I prepared my bike, packed my gear and food the night before, and the only thing I needed to do was to fill up my water bottle before the ride started. I showed up a dozen minutes before the start, with most riders already assembled, got my short drip and a heated up croissant, and totally failed to get any water. It's not a super terrible thing, I frequently don't end up drinking much at the beginning of my rides, but I had now set myself up to ride to the second control (44 km / 27 miles) without any water.

Randonneuring isn't about racing - as you traverse from control to control, there's just a window of time that you have for each control, and so long as you make it through each control withing that time window, you complete the event! There is no ordinal placement, the first rider to finish has just as much bragging rights as the last. The only time people talk about time is when they're trying to make a new personal best. The second control was open roughly between 9 and 11am, so I could have stopped off somewhere to pickup water, but that didn't fit into my ride plan.

You see, the cue sheet is two pages - with the first page dedicated to getting you to the second control, as there are a lot of turns to make in the East bay until you get to Benicia. Accordingly, my plan was to stick with the "fast" group who know the route well, so that I wouldn't have to look at the cue sheet at all. This was a success - and I got to chat with Jesse, whom I rode a fair chunk of the 111 km Lucas Valley Populaire back in October (here's a good video of the first chunk of that ride).

## Control #2

When 7 of us got to the Benicia control at 9:45, I volunteered to guard the bikes as folks filed into the gas station mart to buy something (getting a timestamped receipt is how you prove that you did the ride from one control to the next within the allotted time). Lesson learned: I should have used that time to shed my warm second jersey and long pants, as it had warmed up by then. It just so happens that by the time I went inside, there was a line for the bathroom, and by the time I got out, the fast group was just heading out, yet I still needed to change.

Because keeping up with the "fast" group wasn't that big of a deal and actually rather fun, I really wanted to try to catch up to them, so I stepped it up, and told a couple of other riders that I'd do my best to pull us to them (in case you didn't know - the aerodynamics of cycling make it much easier for those behind the leader to keep up the pace, even if that pace isn't something they could comfortably do on their own). We didn't succeed in catching them during the first 10 mile stretch of road, but I kept pushing with a high cadence in my highest gear, and about 5 miles later, we caught up with them! The only problem was, by this point, I had wasted so much energy, that I couldn't make the last 20 feet to them, though I was happy to see the two guys who'd been letting my push the whole time, use their fresher legs and join the group. So there I was, a few dozen feet from my intended riding partners, but as more and more pavement went past our wheels, the distance between us slowly widened.

Somewhat disheartened by this (though, again, happy that I all of my pulling wasn't all for naught, since I bridged two other to the group), and now overheating, I decide to stop by the side of the road, removed my long sleeved shirt, put on sunscreen (Lesson learned: don't forget your ears, too - they're the only part of my that burned). A fellow rando rider Eric went past,to my smiling cheer of "Go get 'em!". Then when pair of riders asked if I'm alright, I nodded, and decided to hop back on the bike and ride with them for a bit. It was nice to let someone else pull for a bit, but shortly thereafter, we started the first serious ascent, and my heart was pounding to hard from exhaustion and the heat - and I had to stop to again to catch my breath and get some more food in me.

Luckily, when I resumed riding up that hill, I had a mental shift, gave myself a break, and given how tired my legs were already, even though I wasn't even half-way through the ride, I reminded myself that I'll just spin in a low gear if I have to, there's no rush, I'm not racing anyone, and though I know this won't end up being as good of a 200k as it could have been had I trained more in recent months, it was still up to me to enjoy the ride. One of the highlights of the ride were all of the different butterflies I got to see along the way that I started noticing after this change in mental attitude.

After the long climb, followed by a very nice descent, I got onto the Silverado Trail for 14 miles of a straight road with minimal elevation changes. Though my legs were again cooperating more, it was starting to get kind of old, and then out of nowhere, one of the riders I had pulled earlier rides past me, but then proceeds to slow down and ride along side me for a chat. He hadn't been able to keep up with the fast group for very long, and ended up stopping somewhere along the way to eat, which is why I didn't notice that I had passed him. We took turns pulling for each other, which took away from the monotony of Silverado, but he had a lot more in the tank, and I again wasn't able to keep up, losing him with 4 miles to go to St Helena.

Needing another break, with 4 miles to the next control, I decided I would need to spend a while there, recuperating, if I am to make it through the rest of the ride.

## Control #3

I got to Model Bakery at 1:10pm, with many familiar faces from earlier in the day already enjoying their food, but ended up staying there until 2:30 - eating my food, drinking water, just letting my legs rest.

I ended up riding out solo, and really enjoyed the early parts of 128 (after missing the turnoff by a hundred feet) - luckily this was just the spot I had my last stop at, so I quickly turned around and got back on the road.

The problem was it kept getting hotter and hotter - it seemed that I couldn't go a mile without taking a good gulp of water. I still stuck to my strategy of just spinning fast without really pushing hard, since recovering from being out of breath is way faster than waiting for exhausted legs to obey your commands. I made it a good chunk of the way to Winters, but still ended up having to stop half way up the climb near Lake Berryessa. Another SFR rider, Julie, climbed past me, checking if I was OK as she went by. It's great to have that kind of camaraderie along the ride, a couple of people even gently expressed their concern that my rear wheel was out of true - which I knew but kept putting off getting it fixed. These nudges gave me more resolve to get that taken care of.

Finishing off the last of the Haribo Gummi Candy Gold-Bears that I brought with me (and you know you're tired and dehydrated when it takes effort to just chew), I got back on the bike and headed further up the hill. Then, finally, I didn't think I could ever be so cheered up by road sign (hint: they only put "TRUCKS USE LOWER GEAR" signs at the top of big hills).

## Control #4

I finally got to Winters at 5:37pm, got myself a Chai Smoothie, and Julie, who unfortunately was going to miss her train home, proposed that we ride together the rest of the way to Davis. Again - though I really enjoy my alone time while cycling, it's also quite fun to have strong riders to ride with, so as the sun started descending behind us, and no longer scorchingly hot, we set out for the final 17 miles to Davis at a good clip, given how much riding we had already done.

## Finish Control

Dodging drunk college kids (it was Picnic Day at UC Davis) was the last challenge of the ride. As a UCD alum, this was a homecoming of sorts, so I lead the way through town as we made our way to the Amtrak station. We finished just before 7, and I caught the 7:25pm train back to the Bay Area - enjoying the company of a handful of other randonneurs.

Thanks for reading my ride report!

                   _
/ \
A*   \^   -
,./   _.\\ / \
/ ,--.S    \/   \
/  "~,_     \    \
__o           ?
_ \<,_         /:\
--(_)/-(_)----.../ | \
--------------.......J


## April 18, 2015

### Titus Brown

#### First thoughts on Galileo's Middle Finger

I'm reading Galileo's Middle Finger by Dr. Alice Dreger (@alicedreger), and it's fantastic. It's a paean to evidence-based popular discourse on scientific issues -- something I am passionate about -- and it's very well written.

I bought the book because I ran across Dr. Dreger's excellent and hilarious live-tweeting of her son's sex-ed class (see storify here), which reminded me that I'd first read about her in the article Reluctant Crusader, on "Why Alice Dreger's writing on sex and science makes liberals so angry." While I'm pretty liberal in outlook, I'm also a scientist by inclination and training, and I often see the kinds of conflicts that Dr. Dreger talks about (where what people want to be true is unsupported by evidence, or directly conflicts with evidence).

The book is chock full of examples where Dr. Dreger examines controversies in science. The common theme is that some plucky scientist or set of scientists publish some perspective that is well supported by their data, but that runs against some commonly held perspective (or at least some perspective that an activist group holds). Vested interests of some kind then follow with scurrilous public or academic attacks that take years or decades to figure out. Dr. Dreger spends much of the book (so far) exploring the "playbook" used by these attackers to smear, harass, and undermine the original researchers.

At this point, I think it's important to note that most of the controversies that Dr. Dreger discusses are not "settled" -- science rarely settles things quickly -- but that in all the cases, there appears to be strong empirical evidence to support the conclusions being published. What Dr. Dreger never argues is that the particular science she is discussing is settled; rather, she often argues that it's not settled, and that the attackers are trying to make it look like it is, as a way to shut down further investigation. The kind of "double negative" espoused by Dr. Dreger ("my research doesn't show that gravity works, it shows that there is no reason to believe that gravity doesn't work") is how I try to operate in my own research, and I have an awful lot of sympathy for this general strategy as I think it's how science should work.

There are a few things about Dr. Dreger's book that rub me the wrong way, and I may or may not blog about them in detail when I'm done with the book. (Two brief items: despite showing how easily peer review can be manipulated to support personal vendettas, she consistently uses "peer reviewed" as a label to put work that supports her own positions beyond doubt. Also, she's so impassioned about the issues that she comes off as wildly un-objective at times. I think she also downplays just how complicated a lot of the research she's examining is to do and understand.) Most of my concerns can be attributed to this being an advocacy book aimed at a popular audience, where careful and objective presentations of the underlying science need to be weighed against the audience, and in compensation for this Dr. Dreger provides plenty of footnotes and citations that let interested people follow up on specific items.

But! What I wanted to write about in this blog post is two things -- first, MONEY. And second, media, the Internet, and Internet harassment.

## First, MONEY.

I'm only about halfway through the book, but it's striking that Dr. Dreger has so far not talked about some really big issues like global warming, which is kind of a poster child for science denialism these days. All of the issues in the book have to do with controversies where relatively small amounts of money were on the line. Unlike global warming, or tobacco and cancer, "all" that is at stake in the controversies in the book is sociopolitical agendas and human identity - crucially, nothing that hits at a big industry's bottom line.

What Dr. Dreger points out, though, is that even in circumstances where money is not the main issue, it is very hard for evidence to even get a fair hearing. (Even discounting those research imbroglios that Dr. Dreger is herself involved in, she presents plenty of data that the way our society handles contentious situations is just broken. More on that below.) But it doesn't take much imagination to guess that when real money is on the line, the "plucky scientist" faces even more massive obstacles. We've seen exactly how this plays out in the case of tobacco and cancer, where it took decades for scientists and patient advocates to overcome the industry-funded nonsense.

The same thing seems to be happening in climate change policy. I'm minded of a comment on Twitter that I've since lost track of -- badly paraphrased, it goes something like this: "Giant oil companies would have us believe that scientists are the ones with the overwhelming conflict of interest in the global warming discussion. What?"

I like to think about these things in terms of Bayesian priors. Who am I more likely to believe in a disagreement? The interest group which has billions of dollars at stake, or the scientists who are at least trained in objective inquiry? It's almost insane to fail to take into account the money stakes. Moreover, there are plenty of indications that, even when wrong initially, scientists are self-correcting; but I've never seen an interest group go "oops, I guess we got that wrong, let's rethink." So I guess you know my position here :). But thinking of it in terms of priors, however strong they may be, means that I'm more alert to the possibility of counter-evidence.

Anyway, I'm not an expert on any of this; my area of expertise is confined to a few areas in genomics and computational biology at this point. So it's very hard for me to evaluate the details of some of Dr. Dreger's cases. But I think that's one of the points she's getting at in the book -- who do we trust when seemingly trustworthy academic societies get manipulated by activist agendas? How do we reach some sort of conclusion, if not broad consensus then at least academic consensus, on issues that are (at the least) not well understood yet? And (one place where Dr. Dreger excels) how do we evaluate and decide on the ethical standards to apply to biomedical (or other) research? All very relevant to many things going on today, and all very tricky.

## Media, and the Internet

At the point I'm at in the book, Dr. Dreger has undergone a transition from being press-driven to being Internet driven. She points out that in the recession, the big media companies essentially lost the staff to pursue deep, technically tricky stories; coincident, "mainstream media" lost a lot of trust as the Internet came along and helped enable audience fragmentation. This meant that she had to take her fights to the Internet to convince the masses (or at least the Google search engine) much more so than relying on deep investigation by arbiters of "official" social policy like the NY Times or New Yorker.

I think in many ways this is a massive improvement -- I hate the fact that, in science, we've got these same kinds of official arbiters -- but it's interesting to read through the book and recognize Dr. Dreger's shift in thinking and tactics. It's especially ironic given that the same kind of tactics she starts to use in the dexamethasone investigation are the ones used against her in some of her intersex work. But, again, this is kind of the point of her book - it's not enough to rely on someone's credentials and sense of justice to evaluate their message, you need to actually look into the evidence yourself.

It's particularly interesting to fold Dr. Dreger's observations about Internet harassment and shaming into my general body of knowledge about how the Internet is used to, well, harass, shame, derail, and otherwise sandbag particular messages and people. Which reminds me, I need to watch all of Dr. Gabriella Coleman's PyCon 2015 keynote on Anonymous...

## A Not-Really Conclusion

At the end of the day, I worry that the "trust no one, investigate yourself" message is too challenging for our culture to grasp in a productive way. And yet I'm firmly convinced it's the only way forward. How can we do this better as scientists and educators?

--titus

p.s. I'm thinking about instituting a commenting policy, perhaps one based on Captain Awkward's policy. Thoughts?

p.p.s. There's some sort of irony in me leaving Michigan State just as I discover that Dr. Dreger is local. I may try to track her down for coffee while I'm still in town, although I'm sure she's super busy...

## April 16, 2015

### Titus Brown

#### Please destroy this software after publication. kthxbye.

tl;dr? A while back I wrote that there are three uses of research software: replication, reproduction, and reuse. The world of computational science would be better off if people clearly delineated whether or not they wanted anyone else to reuse their software, and I think it's a massive mistake to expect that everyone's software should be reusable.

A few months back, I reviewed a pretty exciting paper - one I will probably highlight on my blog, when it comes out. The paper outlined a fairly simple concept for comparing sequences and then used that to develop some new ultra-scalable functionality. The theory seemed novel, the computational results were pretty good, and I recommended acceptance (or minor revisions). This was in spite of the fact that the authors stated quite clearly that they had produced largely unusable software.

Other reviewers were not quite so forgiving, however -- one reviewer declined to review the paper until they could run the software on their own data.

This got me thinking - I think I have a reputation as wanting people to release their software in a useful manner, but I've actually shied away from requiring it on several occasions. Here was a situation that was a pretty direct conflict: neat result, with software that was not intended for reuse. Interestingly, I've drawn this line before, although without much personal comment. In my blog post on review criteria for bioinformatics papers, there's nothing there about whether or not the software is reusable - it must just be legally readable and executable. But I'm also pretty loud-mouthed about wanting good quality (or at least better quality) software out there in the bioinformatics world!

So what gives? I felt that the new theory looked pretty awesome, and would be tremendously useful, while the implementation was (as stated) unlikely to be something I (or others) used. So what? Publish!

I think this highlights that there are two different possible goals for bioinformatics papers. One goal is the standard scientific goal: to demonstrate a new method or technique, whether it be mathematical or computational. The other goal is different, and in some ways much harder: to provide a functioning tool for use and reuse. These should have different review standards, and that maybe the authors should be given the opportunity to distinguish clearly between the two goals.

There's actually a lot of commonality between what I would request of the software from either kind of paper, a technique paper or a tool paper.

• Both need to be accessible for download and viewing - otherwise, how can I understand the details of the implementation?
• Both types of software need to be usable enough to reproduce the results in the paper, in theory (e.g. given sufficient access to compute infrastructure).
• Both should be in a publicly accessible and archived format, to avoid loss of the software from personal Web sites, etc.
• Both should show evidence of decent principles of basic software engineering, including the use of version control, some form of testing (albeit unit testing or functional testing or even just defined input with known good output), release/version information, installation/dependency information, and the like.

However, there are some interesting differences. Off the top of my head, I'm thinking that:

• Crucially, the software from the technique paper would not need to be open source - by the OSI definition, the technique code would not need to be freely modifiable or re-sharable.

(To be clear, I know of neither any formal (journal) requirements nor ethical requirements that a particular implementation be modifiable or redistributable.)

• Nor need the software from the technique paper be written in a general way (e.g. to robustly process different formats), or for broader re-use. In particular, this means that documentation and unit/functional tests might be minimal - enough to support replication but no more.

• The software from the technique paper should be accessible to basic review, but should not be subject to code review on style or implementation - correctness only.

• Software from a "tools" paper, by contrast, should be held to much higher standards, and be subject to code review (in parts) and examination for documentation and installation and ... oh, heck, just start with the sustainability evaluation checklist at the SSI!

I'm aware that by having such relaxed constraints on technique publication I'm more or less directly contradicting myself in my earlier blog post on automated testing and research software - all of that definitely holds for software that you hope to be reused.

I'm not sure how or where to draw the line here, exactly. It's certainly reasonable to say that software that doesn't have unit tests is likely to be wrong, and therefore unit tests should be required - but, in science, we should never rely on a single study to prove something anyway, so I'm not sure why it matters if software is wrong in some details. This is where the difference between "replicability" and "reproducibility" becomes important. If I can't replicate your computation (at least in theory) then you have no business publishing it; but reproducing it is something that is a much larger task, outside the scope of any given paper.

I want to quote David States, who wrote a comment two years ago on my blog:

Too often, developers work in isolation, and this creates a high risk for code errors and science errors. Good code needs to be accessible and this includes not just sharing of the source code itself but also use of effective style, inclusion of tests and validation procedures and appropriate documentation.

I think I agree - but what's the minimum, here, for a technique paper that is meant to be a demonstration of a technique and no more?

One final point: in all of this we should recognize that the current situation is quite poor, in that quite a bit of software is simply inaccessible for replication purposes. (This mirrors my personal experiences in bioinformatics review, too.)

Improving this situation is important, but I think we need to be precise about what the minimim is. I don't think we're going to get very far by insisting that all code be held to high standards; that's a generational exercise (and part of why I'm so bullish on Software Carpentry).

So: what's the minimum necessary for decent science?

--titus

p.s. In case anyone is wondering, I don't think our software really meets my own criteria for tool publication, although it's getting closer.

p.p.s. Drawing this distinction leads in some very good directions for publishers and funding bodies to think about, too. More on that in another blog post, if I get the chance.

p.p.p.s. My 2004 paper (Brown and Callan) has a table that's wrong due to a fencepost error. But it's not seriously wrong. shrug

#### The PyCon 2015 Ally's Workshop

At PyCon 2015, I had the pleasure of attending the Ally Skills Workshop, organized by @adainitiative (named after Ada Lovelace).

The workshop was a 3 hour strongly guided discussion centering around 4-6 person group discussion of short scenarios. There's a guide to running them here, although I personally would not have wanted to run one without attending one first!

I attended the workshop for at least three reasons --

First, I want to do better myself. I have put some effort into (and received a lot of encouragement for) making my lab an increasingly open and welcoming place. While I have heard concerns about being insufficiently critical and challenging of bad ideas in science (and I have personally experienced a few rather odd situations where obviously bad ideas weren't called out in my past labs), I don't see any inherent conflict between being welcoming and being intellectually critical - in fact, I rather suspect they are mutually supportive, especially for the more junior people.

But, doing better is surprisingly challenging; everyone needs a mentor, or at least guideposts. So when I heard about this workshop, I leapt at the chance to attend!

Second, I am interested in connecting these kinds of things to my day job in academia, where I am now a professor at UC Davis. UC Davis is the home of the somewhat notorious Jonathan Eisen, who is notorious for many reasons that include boycotting and calling out conferences that have low diversity. UC Davis also has an effort to increase diversity at the faculty level, and I think that this is an important effort. I'm hoping to be involved in this when I actually take up residence in Davis, and learning to be a male ally is one way to help. More, I think that Davis would be a natural home to some of these ally workshops, and so I attended the Ally Skills workshop to explore this.

And third, I was just curious! It's surprisingly tricky to confront and talk about sexism effectively, and I thought seeing how the the pros did it would a good way to start.

Interestingly, 2/3 of my lab attended the workshop, too - without me requesting it. I think they found it valuable, too.

## The workshop itself

Valerie Aurora ran the workshop, and it's impossible to convey how good it was, but I'll try by picking out some choice quotes:

"You shouldn't expect praise or credit for behaving like a decent human being."

"Sometimes, you just need a flame war to happen." (paraphrase)

"It's not up to the victim whether you enforce your code of conduct."

"The physiological effects of alcohol are actually limited, and most effects of alcohol are socially and/or culturally mediated."

"Avoid rules lawyering. I don't now if you've ever worked with lawyers, but software engineers are almost as bad."

"One problem for male allies is the assumption that you are only talking to a woman because you are sexually interested in them."

"Trolls are good at calibrating their level of awfulness to something that you will feel guilty about moderating."

Read the blog post "Tone policing only goes one way.

Overall, a great experience and something I hope to help host more of at UC Davis.

--titus

### Continuum Analytics

#### Find Continuum at PyData Dallas

PyData Dallas, the first PyData conference in Texas, is taking place next week, April 24-26. PyData has been a wonderful conference for fostering the Python community and giving developers and other Python enthusiasts the opportunity to share their ideas, projects and the future of Python. Continuum Analytics is proud to be a founding sponsor for such an innovative, community-driven conference.

## April 15, 2015

### Titus Brown

#### The three porridge bowls of sustainable scientific software development

(The below issues are very much on my mind as I think about how to apply for another NIH grant to fund continued development on the khmer project.)

Imagine that we have a graph of novel functionality versus software engineering effort for a particular project, cast in the shape of a tower or pyramid, i.e. a support structure for cool science.

The more novel functionality implemented, the taller the building, and the broader the software engineering base needs to be to support the building. If you have too much novel functionality with too little software engineering base, the tower will have too little support and catastrophe can ensue - either no new functionality can be added past a certain point, or we discover that much of the implemented functionality is actually unstable and incorrect.

Since everybody likes novel functionality - for example, it's how we grade grants in science -- this is a very common failure mode. It is particularly problematic in situations where we have built a larger structure by placing many of these individual buildings on top of others; the entire structure is not much stronger than its weakest (least supported) component.

Another possible failure mode is if the base becomes too big too soon:

That is, if too much effort is spent on software engineering at the expense of building novel functionality on top of it, then the building remains the same height while the base broadens. This is a failure for an individual project, because no new functionality gets built, and the project falls out of funding.

In the worst case, the base can become over-wrought and be designed to support functionality that doesn't yet exist. In most situations, this work will be entirely wasted, either because the base was designed for the wrong functionality, or because the extra work put into the base will delay the work put into new features.

Where projects are designed to be building blocks from the start, as opposed to a leap into the unknown like most small-lab computational science projects, a different structure is worth investing in -- but I'm skeptical that this is ever the way to start a project.

Supporting this kind of project is something that Dan Katz has written and presented about; see (for example) A Method to Select e-Infrastructure Components to Sustain.

And, of course, the real danger is that we end up in a situation where a poorly engineered structure is used to support a much larger body of scientific work:

The question that I am trying to understand is this: what are the lifecycle stages for research software, and how should we design for them (as researchers), and how should we think about funding them (as reviewers and program officers)?

To bring things back to the title, how do we make sure we mix the right amount of software development (cold porridge) with novel functionality (hot porridge) to make something edible for little bears?

--titus

### Matthieu Brucher

#### Announcement: ATKStereoCompressor 1.0.0

I’m happy to announce the release of a stereo compressor based on the Audio Toolkit. It is available on Windows and OS X (min. 10.8) in different formats. This stereo compressor can work on two channels, left/right or middle/side, possibly in linked mode (only one set of parameters), and can be set up to mix the input signal with the compressed signal (serial/parallel compression).

The supported formats are:

• VST2 (32bits/64bits on Windows, 64bits on OS X)
• VST3 (32bits/64bits on Windows, 64bits on OS X)
• Audio Unit (64bits, OS X)

The files as well as the previous plugins can be downloaded on SourceForge, as well as the source code.

ATK SD1, ATKCompressor and ATKUniversalDelay were upgraded after AU validation failed. This is now fixed.

## April 14, 2015

### Jan Hendrik Metzen

#### Probability calibration

As a follow-up of my previous post on reliability diagrams, I have worked jointly with Alexandre Gramfort, Mathieu Blondel and Balazs Kegl (with reviews by the whole team, in particular Olivier Grisel) on adding probability calibration and reliability diagrams to scikit-learn. Those have been added in the recent 0.16 release of scikit-learn as CalibratedClassifierCV and calibration_curve.

This post contains an interactive version of the documentation in the form of an IPython notebook; parts of the text/code are thus due to my coauthors.

Note that the 0.16 release of scikit-learn contains a bug in IsotonicRegression, which has been fixed in the 0.16.1 release. For obtaining correct results with this notebook, you need to use 0.16.1 or any later version.

## Reliability curves

In [1]:
Expand Code
import numpy as np
np.random.seed(0)

import matplotlib
matplotlib.use("svg")
import matplotlib.pyplot as plt
from matplotlib import cm
%matplotlib inline

from sklearn import datasets
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.calibration import calibration_curve, CalibratedClassifierCV
from sklearn.metrics import (brier_score_loss, precision_score, recall_score,
f1_score, log_loss)
from sklearn.cross_validation import train_test_split


When performing classification you often want not only to predict the class label, but also obtain a probability of the respective label. This probability gives you some kind of confidence on the prediction. Some models can give you poor estimates of the class probabilities and some even do not not support probability prediction. The calibration module allows you to better calibrate the probabilities of a given model, or to add support for probability prediction.

Well calibrated classifiers are probabilistic classifiers for which the output of the predict_proba method can be directly interpreted as a confidence level. For instance, a well calibrated (binary) classifier should classify the samples such that among the samples to which it gave a predict_proba value close to 0.8, approximately 80% actually belong to the positive class. The following plot compares how well the probabilistic predictions of different classifiers are calibrated:

In [2]:
Expand Code
X, y = datasets.make_classification(n_samples=100000, n_features=20,
n_informative=2, n_redundant=2)

train_samples = 100  # Samples used for training the models

X_train = X[:train_samples]
X_test = X[train_samples:]
y_train = y[:train_samples]
y_test = y[train_samples:]

# Create classifiers
lr = LogisticRegression()
gnb = GaussianNB()
svc = LinearSVC(C=1.0)
rfc = RandomForestClassifier(n_estimators=100)

In [3]:
Expand Code
plt.figure(figsize=(9, 9))
ax1 = plt.subplot2grid((3, 1), (0, 0), rowspan=2)
ax2 = plt.subplot2grid((3, 1), (2, 0))

ax1.plot([0, 1], [0, 1], "k:", label="Perfectly calibrated")
for clf, name in [(lr, 'Logistic'),
(gnb, 'Naive Bayes'),
(svc, 'Support Vector Classification'),
(rfc, 'Random Forest')]:
clf.fit(X_train, y_train)
if hasattr(clf, "predict_proba"):
prob_pos = clf.predict_proba(X_test)[:, 1]
else:  # use decision function
prob_pos = clf.decision_function(X_test)
prob_pos = \
(prob_pos - prob_pos.min()) / (prob_pos.max() - prob_pos.min())
fraction_of_positives, mean_predicted_value = \
calibration_curve(y_test, prob_pos, n_bins=10)

ax1.plot(mean_predicted_value, fraction_of_positives, "s-",
label="%s" % (name, ))

ax2.hist(prob_pos, range=(0, 1), bins=10, label=name,
histtype="step", lw=2)

ax1.set_ylabel("Fraction of positives")
ax1.set_ylim([-0.05, 1.05])
ax1.legend(loc="lower right")
ax1.set_title('Calibration plots  (reliability curve)')

ax2.set_xlabel("Mean predicted value")
ax2.set_ylabel("Count")
ax2.legend(loc="upper center", ncol=2)

plt.tight_layout()


LogisticRegression returns well calibrated predictions by default as it directly optimizes log-loss. In contrast, the other methods return biased probabilities; with different biases per method:

• Naive Bayes (GaussianNB) tends to push probabilties to 0 or 1 (note the counts in the histograms). This is mainly because it makes the assumption that features are conditionally independent given the class, which is not the case in this dataset which contains 2 redundant features.

• RandomForestClassifier shows the opposite behavior: the histograms show peaks at approximately 0.2 and 0.9 probability, while probabilities close to 0 or 1 are very rare. An explanation for this is given by Niculescu-Mizil and Caruana [4]: "Methods such as bagging and random forests that average predictions from a base set of models can have difficulty making predictions near 0 and 1 because variance in the underlying base models will bias predictions that should be near zero or one away from these values. Because predictions are restricted to the interval [0,1], errors caused by variance tend to be one-sided near zero and one. For example, if a model should predict p = 0 for a case, the only way bagging can achieve this is if all bagged trees predict zero. If we add noise to the trees that bagging is averaging over, this noise will cause some trees to predict values larger than 0 for this case, thus moving the average prediction of the bagged ensemble away from 0. We observe this effect most strongly with random forests because the base-level trees trained with random forests have relatively high variance due to feature subseting." As a result, the calibration curve shows a characteristic sigmoid shape, indicating that the classifier could trust its "intuition" more and return probabilties closer to 0 or 1 typically.

• Linear Support Vector Classification (LinearSVC) shows an even more sigmoid curve as the RandomForestClassifier, which is typical for maximum-margin methods (compare Niculescu-Mizil and Caruana [4]), which focus on hard samples that are close to the decision boundary (the support vectors).

## Calibration of binary classifiers¶

Two approaches for performing calibration of probabilistic predictions are provided: a parametric approach based on Platt's sigmoid model and a non-parametric approach based on isotonic regression (sklearn.isotonic). Probability calibration should be done on new data not used for model fitting. The class CalibratedClassifierCV uses a cross-validation generator and estimates for each split the model parameter on the train samples and the calibration of the test samples. The probabilities predicted for the folds are then averaged. Already fitted classifiers can be calibrated by CalibratedClassifierCV via the paramter cv="prefit". In this case, the user has to take care manually that data for model fitting and calibration are disjoint.

The following images demonstrate the benefit of probability calibration. The first image present a dataset with 2 classes and 3 blobs of data. The blob in the middle contains random samples of each class. The probability for the samples in this blob should be 0.5.

In [4]:
Expand Code
n_samples = 50000
n_bins = 3  # use 3 bins for calibration_curve as we have 3 clusters here

# Generate 3 blobs with 2 classes where the second blob contains
# half positive samples and half negative samples. Probability in this
# blob is therefore 0.5.
centers = [(-5, -5), (0, 0), (5, 5)]
X, y = datasets.make_blobs(n_samples=n_samples, n_features=2, cluster_std=1.0,
centers=centers, shuffle=False, random_state=42)

y[:n_samples // 2] = 0
y[n_samples // 2:] = 1
sample_weight = np.random.RandomState(42).rand(y.shape[0])

# split train, test for calibration
X_train, X_test, y_train, y_test, sw_train, sw_test = \
train_test_split(X, y, sample_weight, test_size=0.9, random_state=42)

plt.figure()
y_unique = np.unique(y)
colors = cm.rainbow(np.linspace(0.0, 1.0, y_unique.size))
for this_y, color in zip(y_unique, colors):
this_X = X_train[y_train == this_y]
this_sw = sw_train[y_train == this_y]
plt.scatter(this_X[:, 0], this_X[:, 1], s=this_sw * 50, c=color, alpha=0.5,
label="Class %s" % this_y)
plt.legend(loc="best")
plt.title("Data")

Out[4]:
<matplotlib.text.Text at 0x5b37b10>

The following image shows on the data above the estimated probability using a Gaussian naive Bayes classifier without calibration, with a sigmoid calibration and with a non-parametric isotonic calibration. One can observe that the non-parametric model provides the most accurate probability estimates for samples in the middle, i.e., 0.5.

In [5]:
Expand Code
# Gaussian Naive-Bayes with no calibration
clf = GaussianNB()
clf.fit(X_train, y_train)  # GaussianNB itself does not support sample-weights
prob_pos_clf = clf.predict_proba(X_test)[:, 1]

# Gaussian Naive-Bayes with isotonic calibration
clf_isotonic = CalibratedClassifierCV(clf, cv=2, method='isotonic')
clf_isotonic.fit(X_train, y_train, sw_train)
prob_pos_isotonic = clf_isotonic.predict_proba(X_test)[:, 1]

# Gaussian Naive-Bayes with sigmoid calibration
clf_sigmoid = CalibratedClassifierCV(clf, cv=2, method='sigmoid')
clf_sigmoid.fit(X_train, y_train, sw_train)
prob_pos_sigmoid = clf_sigmoid.predict_proba(X_test)[:, 1]

print("Brier scores: (the smaller the better)")

clf_score = brier_score_loss(y_test, prob_pos_clf, sw_test)
print("No calibration: %1.3f" % clf_score)

clf_isotonic_score = brier_score_loss(y_test, prob_pos_isotonic, sw_test)
print("With isotonic calibration: %1.3f" % clf_isotonic_score)

clf_sigmoid_score = brier_score_loss(y_test, prob_pos_sigmoid, sw_test)
print("With sigmoid calibration: %1.3f" % clf_sigmoid_score)

Brier scores: (the smaller the better)
No calibration: 0.104
With isotonic calibration: 0.084
With sigmoid calibration: 0.109

In [6]:
Expand Code
plt.figure()
order = np.lexsort((prob_pos_clf, ))
plt.plot(prob_pos_clf[order], 'r', label='No calibration (%1.3f)' % clf_score)
plt.plot(prob_pos_isotonic[order], 'g', linewidth=3,
label='Isotonic calibration (%1.3f)' % clf_isotonic_score)
plt.plot(prob_pos_sigmoid[order], 'b', linewidth=3,
label='Sigmoid calibration (%1.3f)' % clf_sigmoid_score)
plt.plot(np.linspace(0, y_test.size, 51)[1::2],
y_test[order].reshape(25, -1).mean(1),
'k', linewidth=3, label=r'Empirical')
plt.ylim([-0.05, 1.05])
plt.xlabel("Instances sorted according to predicted probability "
"(uncalibrated GNB)")
plt.ylabel("P(y=1)")
plt.legend(loc="upper left")
plt.title("Gaussian naive Bayes probabilities")

Out[6]:
<matplotlib.text.Text at 0x623ce90>

The following experiment is performed on an artificial dataset for binary classification with 100.000 samples (1.000 of them are used for model fitting) with 20 features. Of the 20 features, only 2 are informative and 10 are redundant. The figure shows the estimated probabilities obtained with logistic regression, a linear support-vector classifier (SVC), and linear SVC with both isotonic calibration and sigmoid calibration. The calibration performance is evaluated with Brier score brier_score_loss, reported in the legend (the smaller the better).

In [7]:
Expand Code
# Create dataset of classification task with many redundant and few
# informative features
X, y = datasets.make_classification(n_samples=100000, n_features=20,
n_informative=2, n_redundant=10,
random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.99,
random_state=42)

In [8]:
Expand Code
def plot_calibration_curve(est, name, fig_index):
"""Plot calibration curve for est w/o and with calibration. """
# Calibrated with isotonic calibration
isotonic = CalibratedClassifierCV(est, cv=2, method='isotonic')

# Calibrated with sigmoid calibration
sigmoid = CalibratedClassifierCV(est, cv=2, method='sigmoid')

# Logistic regression with no calibration as baseline
lr = LogisticRegression(C=1., solver='lbfgs')

fig = plt.figure(fig_index, figsize=(9, 9))
ax1 = plt.subplot2grid((3, 1), (0, 0), rowspan=2)
ax2 = plt.subplot2grid((3, 1), (2, 0))

ax1.plot([0, 1], [0, 1], "k:", label="Perfectly calibrated")
for clf, name in [(lr, 'Logistic'),
(est, name),
(isotonic, name + ' + Isotonic'),
(sigmoid, name + ' + Sigmoid')]:
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
if hasattr(clf, "predict_proba"):
prob_pos = clf.predict_proba(X_test)[:, 1]
else:  # use decision function
prob_pos = clf.decision_function(X_test)
prob_pos = \
(prob_pos - prob_pos.min()) / (prob_pos.max() - prob_pos.min())

clf_score = brier_score_loss(y_test, prob_pos, pos_label=y.max())
print("%s:" % name)
print("\tBrier: %1.3f" % (clf_score))
print("\tPrecision: %1.3f" % precision_score(y_test, y_pred))
print("\tRecall: %1.3f" % recall_score(y_test, y_pred))
print("\tF1: %1.3f\n" % f1_score(y_test, y_pred))

fraction_of_positives, mean_predicted_value = \
calibration_curve(y_test, prob_pos, n_bins=10)

ax1.plot(mean_predicted_value, fraction_of_positives, "s-",
label="%s (%1.3f)" % (name, clf_score))

ax2.hist(prob_pos, range=(0, 1), bins=10, label=name,
histtype="step", lw=2)

ax1.set_ylabel("Fraction of positives")
ax1.set_ylim([-0.05, 1.05])
ax1.legend(loc="lower right")
ax1.set_title('Calibration plots  (reliability curve)')

ax2.set_xlabel("Mean predicted value")
ax2.set_ylabel("Count")
ax2.legend(loc="upper center", ncol=2)

plt.tight_layout()

In [9]:
# Plot calibration cuve for Linear SVC
plot_calibration_curve(LinearSVC(), "SVC", 2)

Logistic:
Brier: 0.099
Precision: 0.872
Recall: 0.851
F1: 0.862

SVC:
Brier: 0.163
Precision: 0.872
Recall: 0.852
F1: 0.862

SVC + Isotonic:
Brier: 0.100
Precision: 0.853
Recall: 0.878
F1: 0.865

SVC + Sigmoid:
Brier: 0.099
Precision: 0.874
Recall: 0.849
F1: 0.861



One can observe here that logistic regression is well calibrated as its curve is nearly diagonal. Linear SVC's calibration curve has a sigmoid curve, which is typical for an under-confident classifier. In the case of LinearSVC, this is caused by the margin property of the hinge loss, which lets the model focus on hard samples that are close to the decision boundary (the support vectors). Both kinds of calibration can fix this issue and yield nearly identical results. The next figure shows the calibration curve of Gaussian naive Bayes on the same data, with both kinds of calibration and also without calibration.

In [10]:
# Plot calibration cuve for Gaussian Naive Bayes
plot_calibration_curve(GaussianNB(), "Naive Bayes", 1)

Logistic:
Brier: 0.099
Precision: 0.872
Recall: 0.851
F1: 0.862

Naive Bayes:
Brier: 0.118
Precision: 0.857
Recall: 0.876
F1: 0.867

Naive Bayes + Isotonic:
Brier: 0.098
Precision: 0.883
Recall: 0.836
F1: 0.859

Naive Bayes + Sigmoid:
Brier: 0.109
Precision: 0.861
Recall: 0.871
F1: 0.866



One can see that Gaussian naive Bayes performs very badly but does so in an other way than linear SVC: While linear SVC exhibited a sigmoid calibration curve, Gaussian naive Bayes' calibration curve has a transposed-sigmoid shape. This is typical for an over-confident classifier. In this case, the classifier's overconfidence is caused by the redundant features which violate the naive Bayes assumption of feature-independence.

Calibration of the probabilities of Gaussian naive Bayes with isotonic regression can fix this issue as can be seen from the nearly diagonal calibration curve. Sigmoid calibration also improves the brier score slightly, albeit not as strongly as the non-parametric isotonic calibration. This is an intrinsic limitation of sigmoid calibration, whose parametric form assumes a sigmoid rather than a transposed-sigmoid curve. The non-parametric isotonic calibration model, however, makes no such strong assumptions and can deal with either shape, provided that there is sufficient calibration data. In general, sigmoid calibration is preferable if the calibration curve is sigmoid and when there is few calibration data while isotonic calibration is preferable for non- sigmoid calibration curves and in situations where many additional data can be used for calibration.

## Multi-class classification¶

CalibratedClassifierCV can also deal with classification tasks that involve more than two classes if the base estimator can do so. In this case, the classifier is calibrated first for each class separately in an one-vs-rest fashion. When predicting probabilities for unseen data, the calibrated probabilities for each class are predicted separately. As those probabilities do not necessarily sum to one, a postprocessing is performed to normalize them.

The next image illustrates how sigmoid calibration changes predicted probabilities for a 3-class classification problem. Illustrated is the standard 2-simplex, where the three corners correspond to the three classes. Arrows point from the probability vectors predicted by an uncalibrated classifier to the probability vectors predicted by the same classifier after sigmoid calibration on a hold-out validation set. Colors indicate the true class of an instance (red: class 1, green: class 2, blue: class 3).

In [11]:
Expand Code
np.random.seed(0)

# Generate data
X, y = datasets.make_blobs(n_samples=1000, n_features=2, random_state=42,
cluster_std=5.0)
X_train, y_train = X[:600], y[:600]
X_valid, y_valid = X[600:800], y[600:800]
X_train_valid, y_train_valid = X[:800], y[:800]
X_test, y_test = X[800:], y[800:]

In [12]:
Expand Code
# Train uncalibrated random forest classifier on whole train and validation
# data and evaluate on test data
clf = RandomForestClassifier(n_estimators=25)
clf.fit(X_train_valid, y_train_valid)
clf_probs = clf.predict_proba(X_test)
score = log_loss(y_test, clf_probs)

# Train random forest classifier, calibrate on validation data and evaluate
# on test data
clf = RandomForestClassifier(n_estimators=25)
clf.fit(X_train, y_train)
clf_probs = clf.predict_proba(X_test)
sig_clf = CalibratedClassifierCV(clf, method="sigmoid", cv="prefit")
sig_clf.fit(X_valid, y_valid)
sig_clf_probs = sig_clf.predict_proba(X_test)
sig_score = log_loss(y_test, sig_clf_probs)

In [13]:
Expand Code
# Plot changes in predicted probabilities via arrows
plt.figure(0, figsize=(10, 8))
colors = ["r", "g", "b"]
for i in range(clf_probs.shape[0]):
plt.arrow(clf_probs[i, 0], clf_probs[i, 1],
sig_clf_probs[i, 0] - clf_probs[i, 0],
sig_clf_probs[i, 1] - clf_probs[i, 1],

# Plot perfect predictions
plt.plot([1.0], [0.0], 'ro', ms=20, label="Class 1")
plt.plot([0.0], [1.0], 'go', ms=20, label="Class 2")
plt.plot([0.0], [0.0], 'bo', ms=20, label="Class 3")

# Plot boundaries of unit simplex
plt.plot([0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], 'k', label="Simplex")

# Annotate points on the simplex
plt.annotate(r'($\frac{1}{3}$, $\frac{1}{3}$, $\frac{1}{3}$)',
xy=(1.0/3, 1.0/3), xytext=(1.0/3, .23), xycoords='data',
arrowprops=dict(facecolor='black', shrink=0.05),
horizontalalignment='center', verticalalignment='center')
plt.plot([1.0/3], [1.0/3], 'ko', ms=5)
plt.annotate(r'($\frac{1}{2}$, $0$, $\frac{1}{2}$)',
xy=(.5, .0), xytext=(.5, .1), xycoords='data',
arrowprops=dict(facecolor='black', shrink=0.05),
horizontalalignment='center', verticalalignment='center')
plt.annotate(r'($0$, $\frac{1}{2}$, $\frac{1}{2}$)',
xy=(.0, .5), xytext=(.1, .5), xycoords='data',
arrowprops=dict(facecolor='black', shrink=0.05),
horizontalalignment='center', verticalalignment='center')
plt.annotate(r'($\frac{1}{2}$, $\frac{1}{2}$, $0$)',
xy=(.5, .5), xytext=(.6, .6), xycoords='data',
arrowprops=dict(facecolor='black', shrink=0.05),
horizontalalignment='center', verticalalignment='center')
plt.annotate(r'($0$, $0$, $1$)',
xy=(0, 0), xytext=(.1, .1), xycoords='data',
arrowprops=dict(facecolor='black', shrink=0.05),
horizontalalignment='center', verticalalignment='center')
plt.annotate(r'($1$, $0$, $0$)',
xy=(1, 0), xytext=(1, .1), xycoords='data',
arrowprops=dict(facecolor='black', shrink=0.05),
horizontalalignment='center', verticalalignment='center')
plt.annotate(r'($0$, $1$, $0$)',
xy=(0, 1), xytext=(.1, 1), xycoords='data',
arrowprops=dict(facecolor='black', shrink=0.05),
horizontalalignment='center', verticalalignment='center')
plt.grid("off")
for x in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]:
plt.plot([0, x], [x, 0], 'k', alpha=0.2)
plt.plot([0, 0 + (1-x)/2], [x, x + (1-x)/2], 'k', alpha=0.2)
plt.plot([x, x + (1-x)/2], [0, 0 + (1-x)/2], 'k', alpha=0.2)

plt.title("Change of predicted probabilities after sigmoid calibration")
plt.xlabel("Probability class 1")
plt.ylabel("Probability class 2")
plt.xlim(-0.05, 1.05)
plt.ylim(-0.05, 1.05)
plt.legend(loc="best")

print("Log-loss of")
print(" * uncalibrated classifier trained on 800 datapoints: %.3f "
% score)
print(" * classifier trained on 600 datapoints and calibrated on "
"200 datapoint: %.3f" % sig_score)

Log-loss of
* uncalibrated classifier trained on 800 datapoints: 1.280
* classifier trained on 600 datapoints and calibrated on 200 datapoint: 0.536


The base classifier is a random forest classifier with 25 base estimators (trees). If this classifier is trained on all 800 training datapoints, it is overly confident in its predictions and thus incurs a large log-loss. Calibrating an identical classifier, which was trained on 600 datapoints, with method='sigmoid' on the remaining 200 datapoints reduces the confidence of the predictions, i.e., moves the probability vectors from the edges of the simplex towards the center:

In [14]:
Expand Code
# Illustrate calibrator
plt.figure(1, figsize=(10, 8))
# generate grid over 2-simplex
p1d = np.linspace(0, 1, 20)
p0, p1 = np.meshgrid(p1d, p1d)
p2 = 1 - p0 - p1
p = np.c_[p0.ravel(), p1.ravel(), p2.ravel()]
p = p[p[:, 2] >= 0]

calibrated_classifier = sig_clf.calibrated_classifiers_[0]
prediction = np.vstack([calibrator.predict(this_p)
for calibrator, this_p in
zip(calibrated_classifier.calibrators_, p.T)]).T
prediction /= prediction.sum(axis=1)[:, None]

# Ploit modifications of calibrator
for i in range(prediction.shape[0]):
plt.arrow(p[i, 0], p[i, 1],
prediction[i, 0] - p[i, 0], prediction[i, 1] - p[i, 1],
# Plot boundaries of unit simplex
plt.plot([0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], 'k', label="Simplex")

plt.grid("off")
for x in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]:
plt.plot([0, x], [x, 0], 'k', alpha=0.2)
plt.plot([0, 0 + (1-x)/2], [x, x + (1-x)/2], 'k', alpha=0.2)
plt.plot([x, x + (1-x)/2], [0, 0 + (1-x)/2], 'k', alpha=0.2)

plt.title("Illustration of sigmoid calibrator")
plt.xlabel("Probability class 1")
plt.ylabel("Probability class 2")
plt.xlim(-0.05, 1.05)
plt.ylim(-0.05, 1.05)

Out[14]:
(-0.05, 1.05)