## June 30, 2015

### Matthieu Brucher

#### Audio Toolkit: Anatomy of a transient shaper

When I first about transient shaper, I was like “what’s the difference with a compressor? Is there one?”. And I tried to see how to get these transient without relying on the transient energy, with a relative power (ratio between the instant power and the mean power) filter, or its derivative, but nothing worked. Until someone explained that the gain was driven by the difference between a fast attack filtered power and a slower one. So here it goes.

First, let’s have a look on the filter graph:
Transient shaper pipeline

I’ve surrounded the specific transient shaper part in with a dotted line. This is the difference with a compressor/limiter/expander: the way the signal steered the gain computation is generated.

Let’s start from a kick. The fast envelope follower can then be generated (red curve) as well as the slow envelope follower (green curve). The difference is always positive (if the two follower have the same release time value), so we can use it to compute a gain through our usual GainCompressorFilter.
Depending on whether you want to increase the transient or attenuate it, the ratio will be below or higher than 1 (respectively), which is what is shown in the last two plots here:
Transient shaper plot

In the end, it’s all about the right algorithms. If you have a proper framework, you may already have everything you need for some filters. In the case of a transient shaper, I already had all the filters in my toolbox. Now I just need to make a plugin out of this simple pipeline!

The code for the last plots can be found on Github here: https://github.com/mbrucher/AudioTK/blob/master/Examples/dynamic/display_transient_shaper.py

## June 29, 2015

### Wei Xue

#### GSoC Week 5

The week 5 began with a discussion with whether we should deprecate params. I fixed some bugs in checking functions, random number generator and one of covariance updating methods. In the following days, I completed the main functions of GaussianMixutre and all test cases, except AIC, BIC and sampling functions. The tests are some kind of challenging, sine the current implementation in the master branch contains very old test cases imported from Weiss's implementation which is never got improved. I simplified the test cases, and wrote more tests that are not covered by the current implementation, such as covariance estimation, ground truth parameter prediction, and other user-friendly warnings and errors.

Next week, I will begin to code BayesianGaussianMixture.

## June 26, 2015

### Matthew Rocklin

#### Write Complex Parallel Algorithms

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr: We discuss the use of complex dask graphs for non-trivial algorithms. We show off an on-disk parallel SVD.

## Most Parallel Computation is Simple

Most parallel workloads today are fairly trivial:

>>> import dask.bag as db
>>> b = db.from_s3('githubarchive-data', '2015-01-01-*.json.gz')
.map(lambda d: d['type'] == 'PushEvent')
.count()

Graphs for these computations look like the following:

This is great; these are simple problems to solve efficiently in parallel. Generally these simple computations occur at the beginning of our analyses.

## Sophisticated Algorithms can be Complex

Later in our analyses we want more complex algorithms for statistics , machine learning, etc.. Often this stage fits comfortably in memory, so we don’t worry about parallelism and can use statsmodels or scikit-learn on the gigabyte result we’ve gleaned from terabytes of data.

However, if our reduced result is still large then we need to think about sophisticated parallel algorithms. This is fresh space with lots of exciting academic and software work.

## Example: Parallel, Stable, Out-of-Core SVD

I’d like to show off work by Mariano Tepper, who is responsible for dask.array.linalg. In particular he has a couple of wonderful algorithms for the Singular Value Decomposition (SVD) (also strongly related to Principal Components Analysis (PCA).) Really I just want to show off this pretty graph.

>>> import dask.array as da
>>> x = da.ones((5000, 1000), chunks=(1000, 1000))
>>> u, s, v = da.linalg.svd(x)

This algorithm computes the exact SVD (up to numerical precision) of a large tall-and-skinny matrix in parallel in many small chunks. This allows it to operate out-of-core (from disk) and use multiple cores in parallel. At the bottom we see the construction of our trivial array of ones, followed by many calls to np.linalg.qr on each of the blocks. Then there is a lot of rearranging of various pieces as they are stacked, multiplied, and undergo more rounds of np.linalg.qr and np.linalg.svd. The resulting arrays are available in many chunks at the top and second-from-top rows.

The dask dict for one of these arrays, s, looks like the following:

>>> s.dask
{('x', 0, 0): (np.ones, (1000, 1000)),
('x', 1, 0): (np.ones, (1000, 1000)),
('x', 2, 0): (np.ones, (1000, 1000)),
('x', 3, 0): (np.ones, (1000, 1000)),
('x', 4, 0): (np.ones, (1000, 1000)),
('tsqr_2_QR_st1', 0, 0): (np.linalg.qr, ('x', 0, 0)),
('tsqr_2_QR_st1', 1, 0): (np.linalg.qr, ('x', 1, 0)),
('tsqr_2_QR_st1', 2, 0): (np.linalg.qr, ('x', 2, 0)),
('tsqr_2_QR_st1', 3, 0): (np.linalg.qr, ('x', 3, 0)),
('tsqr_2_QR_st1', 4, 0): (np.linalg.qr, ('x', 4, 0)),
('tsqr_2_R', 0, 0): (operator.getitem, ('tsqr_2_QR_st2', 0, 0), 1),
('tsqr_2_R_st1', 0, 0): (operator.getitem,('tsqr_2_QR_st1', 0, 0), 1),
('tsqr_2_R_st1', 1, 0): (operator.getitem, ('tsqr_2_QR_st1', 1, 0), 1),
('tsqr_2_R_st1', 2, 0): (operator.getitem, ('tsqr_2_QR_st1', 2, 0), 1),
('tsqr_2_R_st1', 3, 0): (operator.getitem, ('tsqr_2_QR_st1', 3, 0), 1),
('tsqr_2_R_st1', 4, 0): (operator.getitem, ('tsqr_2_QR_st1', 4, 0), 1),
('tsqr_2_R_st1_stacked', 0, 0): (np.vstack,
[('tsqr_2_R_st1', 0, 0),
('tsqr_2_R_st1', 1, 0),
('tsqr_2_R_st1', 2, 0),
('tsqr_2_R_st1', 3, 0),
('tsqr_2_R_st1', 4, 0)])),
('tsqr_2_QR_st2', 0, 0): (np.linalg.qr, ('tsqr_2_R_st1_stacked', 0, 0)),
('tsqr_2_SVD_st2', 0, 0): (np.linalg.svd, ('tsqr_2_R', 0, 0)),
('tsqr_2_S', 0): (operator.getitem, ('tsqr_2_SVD_st2', 0, 0), 1)}

So to write complex parallel algorithms we write down dictionaries of tuples of functions.

The dask schedulers take care of executing this graph in parallel using multiple threads. Here is a profile result of a larger computation on a 30000x1000 array:

## Low Barrier to Entry

Looking at this graph you may think “Wow, Mariano is awesome” and indeed he is. However, he is more an expert at linear algebra than at Python programming. Dask graphs (just dictionaries) are simple enough that a domain expert was able to look at them say “Yeah, I can do that” and write down the very complex algorithms associated to his domain, leaving the execution of those algorithms up to the dask schedulers.

You can see the source code that generates the above graphs on GitHub.

## Randomized Parallel Out-of-Core SVD

A few weeks ago a genomics researcher asked for an approximate/randomized variant to SVD. Mariano had a solution up in a few days.

>>> import dask.array as da
>>> x = da.ones((5000, 1000), chunks=(1000, 1000))
>>> u, s, v = da.linalg.svd_compressed(x, k=100, n_power_iter=2)

I’ll omit the full dict for obvious space reasons.

## Final Thoughts

Dask graphs let us express parallel algorithms with very little extra complexity. There are no special objects or frameworks to learn, just dictionaries of tuples of functions. This allows domain experts to write sophisticated algorithms without fancy code getting in their way.

## June 24, 2015

### Titus Brown

#### A review of &quot;Large-Scale Search of Transcriptomic Read Sets with Sequence Bloom Trees&quot;

(This is a review of Large-Scale Search of Transcriptomic Read Sets with Sequence Bloom Trees, Solomon and Kingsford, 2015.)

In this paper, Solomon and Kingsford present Sequence Bloom Trees (SBTs). SBT provides an efficient method for indexing multiple sequencing datasets and finding in which datasets a query sequence is present.

The new method is based on using multiple Bloom filters and organizing them in a binary tree, where leaves represent specific datasets and internal nodes contain all the k-mers present in their subtrees. A query starts by breaking the sequence into a set of k-mers and checking if they are present in the node Bloom filter at a specific threshold. If yes then the query is repeated for children nodes, but if it isn't the subtree is pruned and search proceeds on other nodes. If all searches are pruned before reaching a leaf then the sequence is not present in any dataset. They prove the false positive rate for a k-mer can be quite higher than traditional applications of Bloom filters, since they are interested in finding if the whole set of k-mers is over a threshold. This leads to very small data structures that remain capable of approximating the correct answer.

Compared to alternative software (like SRA-BLAST or STAR) it has both decreased runtime and memory consumption, and it also can be used as a filter to make these tools faster.

## Overall review

The paper is well written, clear, mostly expert in the area (but see below), and lays out the approach and tool well.

The approach is novel within bioinformatics, as far as we know. More, we think it's a tremendously important approach; it's by far the most succinct representation of large data sets we've seen (and Bloom filters are notoriously efficient), and it permits efficient indexing, storage of indices, and queries of indices.

A strange omission is the work that has been done by our group and others with Bloom filters. Pell et al., 2012 (pmid 22847406), showed that implicit De Bruijn graphs could be stored in Bloom filters in exactly the way the authors are doing here; work by Chikhi and Rizk, 2013 (pmid 24040893) implemented exact De Bruijn graphs efficiently using Bloom filters; and Salikhov et al, 2014 (pmid 24565280) further used Cascading Bloom filters. Our group has also used the median k-mer abundance (which, in a Bloom filter, equals median k-mer presence) to estimate read presence and coverage in a very similar way to Solomon and Kingsford (Brown et al., 2012, "digital normalization"). We also showed experimentally that this is very robust to high false positive rates (Zhang et al., 2014, pmid 25062443, buried in the back).

There are three points to make here --

1. Previous work has been done connecting Bloom filters and k-mer storage, in ways that seem to be ignored by this paper; the authors should cite some of this literature. Given citation space limitations, this doesn't need to be exhaustive, but either Salikhov or Pell seems particularly relevant.
2. The connection between Bloom filters and implicit De Bruijn graphs should be explicitly made in the paper, as it's a powerful theoretical connection.
3. All of our previous result support the conclusions reached in this paper, and this paper makes the false-positive robustness argument much more strongly, which is a nice conclusion!

---

We have found that users are often very confused about how to pick the size of Bloom filters. My sense here is that the RRR compression means that very large Bloom filters will be stored efficiently, so you might as well start big, because there's no way to do progressive size increases on the Bloom filter; do the authors agree with that conclusion, or am I missing something?

One possible writing improvement is to add another level under the leaves in Supp Fig 1 to make it clear that traditional alignment or other alternatives are required, since SBT only finds if the query is present in the dataset (but not where). The speed comparisons in the paper could be qualified a bit more to make it clear that this is only for basic search, although some of us think it's already clear enough so it's advice, not a requested or required change.

However, there is a solid point to be made that (in our opinion) the true value of the SBT approach is not necessarily in speeding up the overall process (3.5x speedup) but in doing the search in very low memory across an index that can be distributed independently of the data.

page 16, Theorem 2 says the probability that ... is nearly 0 when FPR is << theta, fraction threshold. But next in the example, theta is 0.5 and FPR is also 0.5, so here the FPR is NOT << theta, as in Theorem 2. How to conclude that "by Theorem 2, we will be unlikely to observe > theta fraction of false positive kmers in the filter."?

## Software and tool publication

Bioinformatics paper checklist (http://ivory.idyll.org/blog/blog-review-criteria-for-bioinfo.html):

The software source code is available to readers: Yes, https://github.com/Kingsford-Group/bloomtree

The data for replication is available for download: Yes, public data from SRA; it's listed on supp materials, but could be added to the tool site too.

The data format is either standard, or straightforward, or documented. Yes

Recommendations:

• we strongly recommend that a lab-independent URL be used as the official URL for the software (e.g. the github page, instead of the CMU page). Lab Web sites tend to fall out of date or otherwise decay.
• One of the big drawbacks to Bloom filters is that they are fixed in size. Guidance on choosing Bloom filter size would be welcome. One way to do this is to use an efficient method to calculate cardinality, and khmer has a BSD- licensed implementation of the HyperLogLog cardinality counter that they'd be welcome to copy wholesale.

Signed,

C. Titus Brown

Luiz Irber

Qingpeng Zhang

#### Thoughts on Sequence Bloom Trees

We just submitted our review of the paper Large-Scale Search of Transcriptomic Read Sets with Sequence Bloom Trees., by Brad Solomon and Carl Kingsford.

The paper outlines a fairly simple and straightforward way to query massive amounts of sequence data (5 TB of mRNAseq!) in very small disk (~70 GB) and memory (~under a GB), fairly quickly (~2.5 days).

The short version is that I think this is an incredibly powerful approach that I am now planning to build on for our own Moore DDD project.

The review was a lot of fun; it's right up our alley, and we've thought a lot about related (but somewhat distinct) issues. Here are some extended comments that I thought were probably not appropriate for the official review because they're somewhat forward looking.

First, we did the review in approved Open Science style. Since Brad and Carl did us the favor of using GitHub (source code) and bioRxiv (preprint), and I sign my reviews, there need be no mystery in this process. Therefore I am posting our review publicly with only minor edits. Moreover, I filed three issues on GitHub (#1, #2, and #4) and submitted two pull requests (#3 and #5), both of which were merged.

Second, because we work in this area and are very interested in it, I put together both a demo of their software (see 2015-sbt-demo) and also did a simple reimplementation in khmer (see 2015-khmer-sequence-bloom-trees) to make sure that their software worked, and that I thoroughly understood the basic issues involved.

Note that we are unable to use their implementation as licensed, as it is under the GPL and would contaminate our source tree, which is under BSD :(.

Third, I have a lot of suggestions and thoughts! For example,

• The use of RRR compression is awesome and is something we should look into for khmer (dib-lab/khmer#1074).
• We get a 20% performance increase from a simple optimization applied to our k-mer lookups that could apply to SBTs -- see just-merged dib-lab/khmer#862, by Camille Scott.
• The authors might also be interested in making use of our HyperLogLog implementation for k-mer cardinality counting, which could help users choose the right size for their Bloom filters.
• Streaming diginorm/semi-streaming in general (see Crossing the streams) could be a very useful pre-filter for building SBTs. My guess is that with k-mer prefiltering a la digital normalization, there would be no loss to sensitivity but a substantial decrease in memory requirements.
• It would be really interesting to brainstorm about how far this can be taken. We have reasonably strong evidence & intuition that you can do straight abundance estimation directly off of counting Bloom filters, and it doesn't seem like a stretch to say, "hey, let's store EVERYTHING in an sequence Bloom tree analog and do comparative expression analysis that way!" We don't have an immediate use case for this ourselves, but I'm sure one will present itself...
• Qingpeng Zhang and I immediately started talking about how to apply this to metagenomics, and the main worry is that the method seems to depend on low diversity of true k-mers. This can partly be mitigated by diginorm and/or semi-streaming k-mer abundance trimming, but ultimately things are going to scale at best with the true size of the De Bruijn graph. It will be interesting to see how this plays out.

--titus

## June 23, 2015

### Matthieu Brucher

#### Announcement: Audio TK 0.6.0

The main changes for this release are first trials at modulated filters, C++11 usage (nullptr, override and final), and some API changes (the main process_impl function is now const).

Changelog:

0.6.0
* Added override and final keywords in virtual calls
* Changed the API so that process_impl is now const
* Exposed full_setup to the user (direct reset of the internal state, already called when changing sample rate)
* Fix resetting the internal state of all delays by using full_setup
* Added a CustomFIRFilter with Python wrapper

0.5.1
* Added time-varying IIR filters (variable frequency, coded as transposed direct form II)
* Added second order time varying filter implementations
* Added a RelativePowerFilter with Python wrappers
* Added a DerivativeFilter with Python wrappers
* Added Python wrappers for the InWavFilter
* Fixed some warnings during compilation

### Matthew Rocklin

#### Distributed Scheduling

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr: We evaluate dask graphs with a variety of schedulers and introduce a new distributed memory scheduler.

Most dask users use the dask collections, Array, Bag, and DataFrame. These collections are convenient ways to produce dask graphs. A dask graph is a dictionary of tasks. A task is a tuple with a function and arguments.

The graph comprising a dask collection (like a dask.array) is available through its .dask attribute.

>>> import dask.array as da
>>> x = da.arange(15, chunks=(5,))  # 0..14 in three chunks of size five

>>> x.dask  # dask array holds the graph to create the full array
{('x', 0): (np.arange, 0, 5),
('x', 1): (np.arange, 5, 10),
('x', 2): (np.arange, 10, 15)}

Further operations on x create more complex graphs

>>> z = (x + 100).sum()
{('x', 0): (np.arange, 0, 5),
('x', 1): (np.arange, 5, 10),
('x', 2): (np.arange, 10, 15),
('y', 0): (add, ('x', 0), 100),
('y', 1): (add, ('x', 1), 100),
('y', 2): (add, ('x', 2), 100),
('z', 0): (np.sum, ('y', 0)),
('z', 1): (np.sum, ('y', 1)),
('z', 2): (np.sum, ('y', 2)),
('z',): (sum, [('z', 0), ('z', 1), ('z', 2)])}

We can make dask graphs by hand without dask collections. This involves creating a dictionary of tuples of functions.

>>> def add(a, b):
...     return a + b

>>> # x = 1
>>> # y = 2
>>> # z = add(x, y)

>>> dsk = {'x': 1,
...        'y': 2,
...        'z': (add, 'x', 'y')}

We evaluate these graphs with one of the dask schedulers

>>> from dask.threaded import get
>>> get(dsk, 'z')   # Evaluate graph with multiple threads
3

>>> get(dsk, 'z')   # Evaluate graph with multiple processes
3

We separate the evaluation of the graphs from their construction.

## Distributed Scheduling

The separation of graphs from evaluation allows us to create new schedulers. In particular there exists a scheduler that operates on multiple machines in parallel, communicating over ZeroMQ.

This system has a single centralized scheduler, several workers, and potentially several clients.

Clients send graphs to the central scheduler which farms out those tasks to workers and coordinates the execution of the graph. While the scheduler centralizes metadata, the workers themselves handle transfer of intermediate data in a peer-to-peer fashion. Once the graph completes the workers send data to the scheduler which passes it through to the appropriate user/client.

## Example

And so now we can execute our dask graphs in parallel across multiple machines.

$ipython # On your laptop$ ipython  # Remote Process #1:  Scheduler
...     return a + b                        >>> s = Scheduler(port_to_workers='4444',
...               port_to_clients='5555',
>>> dsk = {'x': 1,                          ...               hostname='notebook')
...        'y': 2,
...        'z': (add, 'x', 'y')}            $ipython # Remote Process #2: Worker >>> from dask.distributed import Worker >>> from dask.threaded import get >>> w = Worker('tcp://notebook:4444') >>> get(dsk, 'z') # use threads 3$ ipython  # Remote Process #3:  Worker
>>> w = Worker('tcp://notebook:4444')

>>> c = Client('tcp://notebook:5555')

>>> c.get(dsk, 'z') # use distributed network
3

This graph is small. We didn’t need a distributed network of machines to compute it (a single thread would have been much faster) but this simple example can be easily extended to more important cases, including generic use with the dask collections (Array, Bag, DataFrame). You can control the scheduler with a keyword argument to any compute call.

>>> import dask.array as da
>>> x = da.random.normal(10, 0.1, size=(1000000000,), chunks=(1000000,))

>>> x.mean().compute(get=c.get)  # use distributed network

Alternatively you can set the default scheduler in dask with dask.set_options

>>> import dask
>>> dask.set_options(get=c.get)  # use distributed scheduler by default

## Known Limitations

We intentionally made the simplest and dumbest distributed scheduler we could think of. Because dask separates graphs from schedulers we can iterate on this problem many times; building better schedulers after learning what is important. This current scheduler learns from our single-memory system but is the first dask scheduler that has to think about distributed memory. As a result it has the following known limitations:

1. It does not consider data locality. While linear chains of tasks will execute on the same machine we don’t think much about executing multi-input tasks on nodes where only some of the data is local.
2. In particular, this scheduler isn’t optimized for data-local file-systems like HDFS. It’s still happy to read data from HDFS, but this results in unnecessary network communication. We’ve found that it’s great when paired with S3.
3. This scheduler is new and hasn’t yet had its tires kicked. Vocal beta users are most welcome.
4. We haven’t thought much about deployment. E.g. somehow you need to ssh into a bunch of machines and start up workers, then tear them down when you’re done. Dask.distributed can bootstrap off of an IPython Parallel cluster, and we’ve integrated it into anaconda-cluster but deployment remains a tough problem.

The dask.distributed module is available in the last release but I suggest using the development master branch. There will be another release in early July.

## Further Information

Blake Griffith has been playing with dask.distributed and dask.bag together on data from http://githubarchive.org. He plans to write a blogpost to give a better demonstration of the use of dask.distributed on real world problems. Look for that post in the next week or two.

You can read more about the internal design of dask.distributed at the dask docs.

## Thanks

Special thanks to Min Regan-Kelley, John Jacobsen, Ben Zaitlen, and Hugo Shi for their advice on building distributed systems.

Also thanks to Blake Griffith for serving as original user/developer and for smoothing over the user experience.

## June 22, 2015

### Titus Brown

#### We're throwing a Software Carpentry! And it's already full...

Yesterday morning, we announced a Software Carpentry workshop here at UC Davis, running July 6-7 -- see the Web site for more information. I'm organizing, and Easton White and Noam Ross are co-lead instructors. (This is the first workshop I'm running since we became an affiliate!)

I'd love it if you could all come!

I announced the workshop on Monday morning via the dib-training mailing list. We opened up 30 seats to all comers, and reserved 30 seats for people from the UC Davis School of Veterinary Medicine, which is hosting the workshop (by employing me).

Within 12 hours, the 30 open seats were filled. Wow.

The VetMed seats will be opened to the waiting list next Monday, and (if no one else signs up ;) there is still plenty of room. So if you're in the area, take a look and sign yourself up!

--titus

## June 19, 2015

### Titus Brown

#### Some personal perspectives on academic freedom and free speech

Some background: I'm a white, male, tenured faculty member at UC Davis, and a 3rd generation academic. I work in relatively uncontroversial areas of science (primarily bioinformatics & genomics) at a university that is about as protective of academic freedom as you can get these days. I also live in a country that is (at least formally and legally if not always in practice) more protective of free speech than most countries in the world. For these reasons, I have less to fear from expressing my opinions -- on my blog, on Twitter, or in person -- than virtually anyone, anywhere, ever has.

A week ago, I tweeted Dr. Alice Dreger's article, Wondering if I'm the Next Tim Hunt. I was surprised, frustrated, and upset by the response of a number of colleagues on Twitter, who essentially said "speech that I disagree with is not protected by academic freedom." (This is a bit of a paraphrase, but I think an accurate one; judge for yourselves.)

I must say that I don't really care about Dr. Tim Hunt per se, and that whole issue has been covered very well elsewhere. He made clearly sexist and harmful remarks and is not a credible role model. To the extent that I have anything useful to say, I am worried about the reported actions of UCL in this interview with Mary Collins.

What I am much more worried about is the degree to which academics seem oblivious to the issue of academic freedom. It takes a special kind of obliviousness and subjectivity to look at the history of science and argue that academic researchers should be restricted in the scope of their opinions by their employer. For more on the sordid history of academic freedom, see the wikipedia page.

But, you know what? I don't work on academic freedom, and I'm not a lawyer, so I can't comment on nuances of employment contracts vs teaching vs publication, speech vs action, etc. All I can do is tell you why I care about free speech and academic freedom, and what my personal experiences are in this area, and try to explain my perspective.

## Communism and my family

In a very real sense, my brothers and sisters and I only exist because of practical limits on free speech.

My father left the United States after receiving his PhD because he'd briefly been a member of the Communist Party. From personal conversations with him, I believe he was keenly aware of the danger of staying in the US, and knew that his academic career would have been in danger had he remained here.

In England, he met his first wife with whom he had my three oldest siblings, all of whom were born in Europe. (He met my mother when he returned to Princeton.)

Ironically, he managed to run afoul of both the US Government (by being a former member of the Communist Party and leaving their reach) and the Communist Party (from which he was evicted for asking too many questions).

I grew up on stories of him being called into British government buildings to meet with US officials who wanted him to surrender his passport (which would have prevented him from traveling); when he wouldn't surrender his passport, the US officials tried to get the Brits to take it from him. That never happened; the Brit response was to ask for a letter from the US, which of course wasn't forthcoming because all of this was unofficial persecution.

I also grew up on stories of him being wined and dined by the Stasi when he was visiting his first wife's family in Eastern Europe, in attempts to recruit him. This, together with his habit of sending theoretical nuclear physics journals to colleagues in Russia, led to a frightening-in-hindsight visit by several serious men in dark sunglasses from the FBI in the early 80s (I was about 10).

Interestingly, despite having been very politically active early on, my father was completely apolitical during my life. In the last years of his life, I got some minimal insight into this from his recounting of the Peekskill riots, but I never got the whole story.

## Academic freedom doesn't protect you from being sued personally

Some 20-odd years ago (I'm feeling old today) I mirrored a spam blacklist site on my employee account at Caltech. (This was back when the Internet was new enough that such things could be maintained somewhat manually. ;) One of the people on the spam blacklist got very upset and sent some very nasty e-mails threatening everyone and everything at Caltech with lawsuits unless we removed it.

The resulting conversation escalated all the way to the provost (who I was actually doing research for at the time - see below), and I had the awkward conversation where I was told:

• we have no problems hosting this, if you make the following modifications to make it clear this isn't Caltech's official opinion;
• we're not going to fire you or anything like that;
• but we won't protect you yourself from libel litigation, so good luck!

Nothing ever happened, but it made the indelible point on me that academic freedom is a thin reed to clutch - people can still bring legal action against you for what you said, even if you face no blowback from your employer.

## Climate studies and modeling

I worked for a few years on climate studies with Dr. Steven Koonin, back when he was provost at Caltech (and before he took up his position at BP). My scientific career exists in part because of the work I did with him. I regard him as a mentor, colleague, and friend.

In 2014, Dr. Koonin wrote an op-ed on climate change (here) that I think makes many good points. Knowing him personally, I trust his judgement; having worked (a small bit) in this area, and having a reasonable amount of experience in modeling, I am in agreement with many of his central points.

This measured response is a good example of true scientific debate in action. We need more of this, in general. (I'm a particular fan of Dr. Koonin's suggestion on model evaluation; he's a smart scientist.)

Several people privately told me that they thought Dr. Koonin was an idiot for writing this, and others told me it was our responsibility as scientists to toe the climate change line for fear of doing further damage to the environment. I disagree with both of these groups of people, even though I believe that climate change is anthropogenic and we need to do something about it. I think Dr. Koonin made some good points that needed to be made.

## Blogging and intellectual community

About 2-3 times a year, I get a request to change something in a blog post. Very rarely is it because what I've said is wrong; it's usually because it makes someone uncomfortable or unhappy. As a matter of policy, I refuse to do so (plus my blog is under version control, and I'm certainly not going to rewrite my git history :).

(I don't have any problem with posting explicit corrections when I'm wrong, obviously.)

A key point is that I don't expect to be fired for anything I say in my blog posts. Completely apart from having an awful lot of privilege (white, male, tenure, supportive family, no health problems), there's an expectation that what I say on my blog is subject to academic freedom. I've never gotten any pushback from my employer and I don't expect to, no matter how critical I am of them.

Joe Pickrell makes a very good point that intellectual community is key to academia. How can we have robust discussion and without academic freedom? (Rebecca Schuman makes an excellent related point about adjuncts, job security and academic freedom, here, with which I greatly sympathize.)

## Privilege, and free speech, and academic freedom

(I'm not a lawyer, so please correct me. This is my understanding.)

Free speech is a constitutional right in the US; as such it only applies to government action. If my employer is upset with my speech, they are free to fire me; Twitter is under no obligation to allow me to tweet whatever I want; etc.

Academic freedom is, essentially, free speech commuted to academic employees: basically, universities should not fire people for something they said. While I am still individually liable for what I say under the law of the country I'm in, my employer cannot fire me without some substantial process (if at all) for what I say.

There are a lot of tricky bits in there, though.

For example, when I wrote on Twitter, "academic ideal: I should be able to hold & defend ideas w/o fear of losing my job", I got a very important response from a colleague -- White men exercising their entitlement to this ideal seems to be at odds with marginalized people gaining the same privileges.

I don't have a sophisticated response to offer; as a tenured white guy whose research isn't in this area, I am only slowly learning about this area, and a large part of that learning is being open to colleagues who tell me about their experiences (latest horrific example, of many: Julie Libarkin, with whom I work on learning evaluation). For this reason I tend to simply stay quiet and do what I can to foster a welcoming environment. I certainly don't feel qualified to say anything intelligent on the specific question of marginalization.

I do have two tentative thoughts that I keep on coming back to, though, and I'd welcome feedback.

One thought is this: we can only have conversations about sexism and privilege and systemic oppression because of free speech, and, in the university, because discussions of these controversial topics are protected by academic freedom. I have colleagues and mentees who come from "free speech challenged" countries (I'm not being more specific in order to protect them), and the stories they tell me of government and institutional oppression are horrifying. With respect to one actual real-life example that happened to the family of a colleague, I can confirm that I would say virtually anything you want me to if you took my children, put them in a jail cell, and threatened them until I acquiesced. We are fairly far from that in the US (with national security and terrorism being one horrible counterexample), and I value that tremendously. I would hate to see that weakened even in the service of efforts that I believe in passionately.

My other thought is this: limits to academic freedom and free speech are and always have been a double edged sword. This is almost the definition of a "slippery slope" situation - it's very hard to enact precise limitations on free speech that don't have seriously unintended consequences. It's pretty easy to find pairs of examples to juxtapose -- consider gun rights vs animal rights. I bet relatively few people are sympathetic to both lawsuits on any grounds other than academic freedom! But most people will be sympathetic to at least one. How else to square this but academic freedom??

So inasmuch as I have anything to say, it's this: we should be careful what we wish for, because your well-intentioned limits on free speech and academic freedom today will be used used against you tomorrow. And if you don't agree that happens, you are taking an ahistorical position.

## Concluding thoughts

There's a long and righteous history of defending the most disgusting and horrifying actions based on due process. For one example, Miranda rights rest on a despicable character, Ernesto Miranda, who was later convicted of some horrible crimes. Presumably most of my readers would agree that Miranda rights are a net win for the rights of the accused, but note that it was controversial -- for example, the Supreme Court decision was 5-4. (The wikipedia page is a very good read.)

So, ultimately, I don't think there's any conflict in arguing for due process or legal protections of free speech, academic freedom, or anything else, no matter how heinous the speech being protected is. And if you disagree, then I think you're not only wrong but dangerously so.

That having been said, I'm unsympathetic to people who want me to host their obnoxious speech. I can't see any reason why I, personally, am required to pay attention to what anyone else is saying. I don't have any reason to put up with (say) sexist speech within my lab, or on my blog. Nor do I have to engage with, pay attention to, or promote, those who have opinions I find to be silly or nonsensical. (One exception here - academic norms require me to engage with those opinions that bear on my own academic research.)

--titus

p.s. Respectful comments only, abiding by the Principle of Charity; others may be deleted without notice, and commenters may be banned. My blog, my rules. Read the above if you're confused :).

### Abraham Escalante

#### Scipy and the first few GSoC weeks

Hi all,

We're about three (and a half) weeks into the GSoC and it's been one crazy ride so far. Being my first experience working in OpenSource projects and not being much of an expert in statistics was challenging at first, but I think I might be getting the hang of it now.

First off, for those of you still wondering what I'm actually doing, here is an abridged version of the abstract from my proposal to the GSoC (or you can click here for the full proposal):

"scipy.stats is one of the largest and most heavily used modules in Scipy. [...] it must be ensured that the quality of this module is up to par and [..] there are still some milestones to be reached. [...] Milestones include a number of enhancements and [...] maintenance issues; most of the scope is already outlined and described by the community in the form of open issues or proposed enhancements."

So basically, the bulk of my project consists on getting to work on open issues for the StatisticsCleanup milestone within the statistics module of SciPy (a Python-based OpenSource library for scientific computing). I suppose this is an unusual approach for a GSoC project since it focuses on maintaining and streamlining an already stable module (in preparation for the release of SciPy 1.0), rather than adding a new module or a specific function within.

The unusual approach allows me to make several small contributions and it gives me a wide (although not as deep) scope, rather than a narrow one. This is precisely the reason why I chose it. I feel like I can benefit (and contribute) a lot more this way, while I get acquainted with the OpenSource way and it also helps me to find new personal interests (win-win).

However, there are also some nuances that may be uncommon. During the first few weeks I have discovered that my proposal did not account for the normal life-cycle of issues and PRs in scipy; my estimations we're too hopeful.

One of OpenSource's greatest strengths is the community getting involved in peer reviews; this allows a developer to "in the face of ambiguity, refuse the temptation to guess". If you didn't get that [spoiler alert] it was a reference to the zen of python (and if you're still reading this and your name is Hélène, I love you).

The problem with this is that even the smooth PRs can take much longer than one week to be merged because of the back and forth with feedback from the community and code update (if it's a controversial topic, discussions can take months). Originally, I had planned to work on four or five open issues a week, have the PRs merged and then continue with the next four or five issues for the next week but this was too naive so I have had to make some changes.

I spent the last week compiling a list of next steps for pretty much all of the open issues and I am now trying to work on as many as I can at a time, thus minimising the impact of waiting periods between feedback cycles for each PR. I can already feel the snowball effect it is having on the project and on my motivation. I am learning a lot more (and in less time) than before which was the whole idea behind doing the Summer of Code.

I will get back in touch soon. I feel like I have rambled on for too long, so I will stop and let you continue to be awesome and get on with your day.

Cheers,
Abraham.

## June 18, 2015

### Matthew Rocklin

#### Pandas Categoricals

tl;dr: Pandas Categoricals efficiently encode and dramatically improve performance on data with text categories

Disclaimer: Categoricals were created by the Pandas development team and not by me.

## There is More to Speed Than Parallelism

I usually write about parallelism. As a result people ask me how to parallelize their slow computations. The answer is usually just use pandas in a better way

• Q: How do I make my pandas code faster with parallelism?
• A: You don’t need parallelism, you can use Pandas better

This is almost always simpler and more effective than using multiple cores or multiple machines. You should look towards parallelism only after you’ve made sane choices about storage format, compression, data representation, etc..

Today we’ll talk about how Pandas can represent categorical text data numerically. This is a cheap and underused trick to get an order of magnitude speedup on common queries.

## Categoricals

Often our data includes text columns with many repeated elements. Examples:

• Stock symbols – GOOG, APPL, MSFT, ...
• Gender – Female, Male, ...
• Experiment outcomes – Healthy, Sick, No Change, ...
• States – California, Texas, New York, ...

We usually represent these as text. Pandas represents text with the object dtype which holds a normal Python string. This is a common culprit for slow code because object dtypes run at Python speeds, not at Pandas’ normal C speeds.

Pandas categoricals are a new and powerful feature that encodes categorical data numerically so that we can leverage Pandas’ fast C code on this kind of text data.

>>> # Example dataframe with names, balances, and genders as object dtypes
>>> df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie', 'Danielle'],
...                    'balance': [100.0, 200.0, 300.0, 400.0],
...                    'gender': ['Female', 'Male', 'Male', 'Female']},
...                    columns=['name', 'balance', 'gender'])

>>> df.dtypes                           # Oh no!  Slow object dtypes!
name        object
balance    float64
gender      object
dtype: object

We can represent columns with many repeats, like gender, more efficiently by using categoricals. This stores our original data in two pieces

• Original data

 Female, Male, Male, Female

1. Index mapping each category to an integer

Female: 0
Male: 1
...

2. Normal array of integers

0, 1, 1, 0


This integer array is more compact and is now a normal C array. This allows for normal C-speeds on previously slow object dtype columns. Categorizing a column is easy:

In [5]: df['gender'] = df['gender'].astype('category')  # Categorize!

Lets look at the result

In [6]: df                          # DataFrame looks the same
Out[6]:
name  balance  gender
0     Alice      100  Female
1       Bob      200    Male
2   Charlie      300    Male
3  Danielle      400  Female

In [7]: df.dtypes                   # But dtypes have changed
Out[7]:
name         object
balance     float64
gender     category
dtype: object

In [8]: df.gender                   # Note Categories at the bottom
Out[8]:
0    Female
1      Male
2      Male
3    Female
Name: gender, dtype: category
Categories (2, object): [Female, Male]

In [9]: df.gender.cat.categories    # Category index
Out[9]: Index([u'Female', u'Male'], dtype='object')

In [10]: df.gender.cat.codes        # Numerical values
Out[10]:
0    0
1    1
2    1
3    0
dtype: int8                         # Stored in single bytes!

Notice that we can store our genders much more compactly as single bytes. We can continue to add genders (there are more than just two) and Pandas will use new values (2, 3, …) as necessary.

Our dataframe looks and feels just like it did before. Pandas internals will smooth out the user experience so that you don’t notice that you’re actually using a compact array of integers.

## Performance

Lets look at a slightly larger example to see the performance difference.

We take a small subset of the NYC Taxi dataset and group by medallion ID to find the taxi drivers who drove the longest distance during a certain period.

In [1]: import pandas as pd

In [3]: %time df.groupby(df.medallion).trip_distance.sum().sort(ascending=False,
CPU times: user 161 ms, sys: 0 ns, total: 161 ms
Wall time: 175 ms

Out[3]:
medallion
1E76B5DCA3A19D03B0FB39BCF2A2F534    870.83
6945300E90C69061B463CCDA370DE5D6    832.91
4F4BEA1914E323156BE0B24EF8205B73    811.99
191115180C29B1E2AF8BE0FD0ABD138F    787.33
B83044D63E9421B76011917CE280C137    782.78
Name: trip_distance, dtype: float64

That took around 170ms. We categorize in about the same time.

In [4]: %time df['medallion'] = df['medallion'].astype('category')
CPU times: user 168 ms, sys: 12.1 ms, total: 180 ms
Wall time: 197 ms

Now that we have numerical categories our computaion runs 20ms, improving by about an order of magnitude.

In [5]: %time df.groupby(df.medallion).trip_distance.sum().sort(ascending=False,
CPU times: user 16.4 ms, sys: 3.89 ms, total: 20.3 ms
Wall time: 20.3 ms

Out[5]:
medallion
1E76B5DCA3A19D03B0FB39BCF2A2F534    870.83
6945300E90C69061B463CCDA370DE5D6    832.91
4F4BEA1914E323156BE0B24EF8205B73    811.99
191115180C29B1E2AF8BE0FD0ABD138F    787.33
B83044D63E9421B76011917CE280C137    782.78
Name: trip_distance, dtype: float64

We see almost an order of magnitude speedup after we do the one-time-operation of replacing object dtypes with categories. Most other computations on this column will be similarly fast. Our memory use drops dramatically as well.

## Conclusion

Pandas Categoricals efficiently encode repetitive text data. Categoricals are useful for data like stock symbols, gender, experiment outcomes, cities, states, etc.. Categoricals are easy to use and greatly improve performance on this data.

We have several options to increase performance when dealing with inconveniently large or slow data. Good choices in storage format, compression, column layout, and data representation can dramatically improve query times and memory use. Each of these choices is as important as parallelism but isn’t overly hyped and so is often overlooked.

Jeff Reback gave a nice talk on categoricals (and other featuress in Pandas) at PyData NYC 2014 and is giving another this weekend at PyData London.

## June 17, 2015

### Wei Xue

#### GSoC Week 4: Progress Report

Updated in Jun 24.

1. [x] Completes derivation report.
2. [x] Adds new classes. One abstract class _BasesMixture. Three derived classes GaussianMixture, BayesianGaussianMixture, DirichletProcessGaussianMixture
3. [ ] Decouples large functions, especially in DirichletProcessGaussianMixture and BayesianGaussianMixture
4. [x] Removes numerical stability fixes for HMM. It seems that whenever there is a numerical issue, the code always adds 10*EPS in the computation. I think in some cases there is a better way to address the problem, such as normalization the extremely small variables earlier, or we just simply remove 10*EPS which is only for HMM.
5. [ ] Writes updating functions for BayesianGaussianMixtureModel and DirichletProcessGaussianMixtureModel according to the report.
6. [x] Provides methods that allow users to initialize the model with user-provided data
7. [x] Corrects kmeans initialization. It is weird when using kmeans initialization, only means is initialized. The weights and covariances are initialized by averaging.
8. [x] Writes several checking functions for the initialization data
9. [x] Adjusts the verbose messages. When verbose>1, it display log-likelihood and time used in the same line of the message Iteration x
10. [ ] Adjusts the time to compute log-likelihood. The code in the current master branch compute the log-likelihood of the model after E-step which is actually the score of the last iteration, and misses the score immediately after the initialization.
11. [x] Simplify fit_predict
12. [x] Adds warning for params!='wmc'
13. [ ] Studies and contrasts the convergence of classical MLE / EM GMM with Bayesian GMM against the number of samples and the number of components
14. [ ] Friendly warning and error messages, or automatically addressing if possible (e.g. random re-init of singular components)
15. [ ] Examples that shows how models can over-fit by comparing likelihood on training and validation sets (normalized by the number of samples). For instance extend the BIC score example with a cross-validated likelihood plot
16. [ ] Testing on 1-D dimensions
17. [ ] Testing on Degenerating cases
18. [ ] AIC, BIC for VBGMM DPGMM
19. [ ] Old faithful geyser data set
20. [optional] add a partial_fit function for incremental / out-of-core fitting of (classical) GMM, for instance http://arxiv.org/abs/0712.4273
21. [optional] ledoit_wolf covariance estimation

The most important progress I have done is the derivation report which include the updating functions, log-probability, and predictive distribution for all three models, and the implementation of the base class. Compared with the current scikit-learn math derivation documents, my report is consistent to PRML. It clearly depicts the updating functions of three models share a lot of patterns. We could abstract common functions into the abstract base class _MixtureBase. The three models could inherit it and override the updating methods.

Next week I will finish the GaussianMixture model with necessary testing functions.

## June 16, 2015

### Matthieu Brucher

#### Audio Toolkit: Anatomy of a middle-side compressor

Sometimes images are worth a thousand words, so let’s look at some pictures of a middle-side compressor behavior.

A middle-side compressor like ATKStereoCompressor can work in a middle-side workflow. This means that the stereo signal is split in a center/middle channel and a side (L-R) channel. Then each channel is processed through the compressor independently and then recreated after:

Stereo compressor pipeline

So let’s take a stereo signal (I won’t extract the middle/side channels, it is easy to get them from AudioTK):
Stereo signal input

From this, it is easy to generate the RMS power for each channel:
Middle-side power signals (RMS)

Now, after an attack-release filter, we can apply the gain stage with the desired ratio, knee…
Middle-side gain
The gain processed for the two channels are similar, following the same trend, but with vastly different features.

Finally, we can apply the gain on the middle and side channels before regenerating the stereo channels and get the following result:
Stereo output signal

Of course, changing attack/release values will also change the shape of the signals, as well as the ratio: with a higher ratio on the side, the signal will sound more like a mono signal, whereas a higher ratio on the middle, the stereo image will be broadened.

The code to generate these images for any stereo signal is available on github.

## June 15, 2015

### Wei Xue

#### GSoC Week 3

The week 3 has a very exciting start. I finished the derivation of DPGMM, as well as the lower bound and the predictive probability for each model.

The difference between my derivation and the current document is that the current models assume a simpler approximation. The model defined in PRML is more accurate and provides more knobs. The two approximations both appear in the literature. Maybe we should do some experiments to decide which one is better.

With regards to the new names of DPGMM and VBGMM, I think these two names are not suitable, just like someone calls SVM as SMO. Actually, the models are Bayesian GMM, Dirichlet Process Bayesian GMM (DPGMM is often used) respectively. Both of them are solved by variational inference. In other words, VBGMM is not a good name. The new names, I think, should have the meaning of 'Bayesian GMM solved by VB', 'DP(B)GMM solved by VB'.

I also took a close look at the code base. The code is not maintained well. The problem I am going to address is as follows.

• decouple some large functions, such as _fit
• use abstract class and inheritance to reduce code redundancy
• numerical stability. It seems that whenever there is a numerical issue. The code always like to add EPS. I think in some place there is a better way to address the problem, such as normalization the extremely small variables earlier.
• write updating functions for BayesianGaussianMixtureModel and DirichletProcessGaussianMixtureModel
• provide methods that allow users to initialize the model before fit
• correct kmeans initialization. It is weird when using kmean initialization, only means is initialized. The weights and covariances are initialized by averaging.
• write several checking functions for the initialization data
• [optional] add a partial_fit function for incremental / out-of-core fitting of (classical) GMM, for instance http://arxiv.org/abs/0712.4273
• [optional] ledoit_wolf covariance estimation

The last days of this week I implemented the structure of new classes. _MixtureModelBase, GaussianMixtureModel, BayesianMixtureModel, DirichletProcessMixtureModel. It provides us a big picture of the classes I am going to implement. I am looking forward the feedback.

## June 13, 2015

### Jake Vanderplas

#### Fast Lomb-Scargle Periodograms in Python

Image source: astroML. Source code here

The Lomb-Scargle periodogram (named for Lomb (1976) and Scargle (1982)) is a classic method for finding periodicity in irregularly-sampled data. It is in many ways analogous to the more familiar Fourier Power Spectral Density (PSD) often used for detecting periodicity in regularly-sampled data.

Despite the importance of this method, until recently there have not been any (in my opinion) solid implementations of the algorithm available for easy use in Python. That has changed with the introduction of the gatspy package, which I recently released. In this post, I will compare several available Python implementations of the Lomb-Scargle periodogram, and discuss some of the considerations required when using it to analyze data.

To cut to the chase, I'd recommend using the gatspy package for Lomb-Scargle periodograms in Python, and particularly its gatspy.periodic.LombScargleFast algorithm which implements an efficient pure-Python version of Press & Rybicki's $$O[N\log N]$$ periodogram. Below, I'll dive into the reasons for this recommendation.

## Example: Lomb-Scargle on Variable Stars

As an motivation, let's briefly consider some data from my own field: observations of an RR Lyrae-type variable star. RR Lyrae are small stars – about 50% the mass of our sun – which pulsate with a regular period on order half a day. Their relatively consistent peak intrinsic brightness allows for an accurate estimation of their distance from the sun, and thus they are important for studies such as understanding the substructure of the Milky Way galaxy. Because of this and other similar applications, detecting the telltale periodic variation of RR Lyrae stars within noisy data is an important statistical task for astronomers.

Here we will quickly demonstrate what this looks like in practice, using tools from the astroML package to download some data, and tools from the gatspy package to detect the periodicity.

In [1]:
# Do preliminary imports and notebook setup
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# use seaborn for plot styles
import seaborn; seaborn.set()


Now we'll download some data from the LINEAR dataset, using tools in astroML. We'll plot the data to see what we're working with:

In [2]:
from astroML.datasets import fetch_LINEAR_sample
LINEAR_data = fetch_LINEAR_sample()
star_id = 10040133
t, mag, dmag = LINEAR_data.get_light_curve(star_id).T

fig, ax = plt.subplots()
ax.errorbar(t, mag, dmag, fmt='.k', ecolor='gray')
ax.set(xlabel='Time (days)', ylabel='magitude',
title='LINEAR object {0}'.format(star_id))
ax.invert_yaxis();


This data has around 250 observations spread across about 2000 days, and we're hoping to detect a period of order 0.5 days. If the series were regularly-sampled, we'd be far above the Nyquist limit and all hope would be lost. Fortunately for astronomers, the assumptions behind the Nyquist sampling limit do not hold for irregular sampling rates, and we can proceed with no problem.

Let's start by computing and plotting the Lomb-Scargle Periodogram for this data, using tools from gatspy:

In [3]:
from gatspy.periodic import LombScargleFast
model = LombScargleFast().fit(t, mag, dmag)
periods, power = model.periodogram_auto(nyquist_factor=100)

fig, ax = plt.subplots()
ax.plot(periods, power)
ax.set(xlim=(0.2, 1.4), ylim=(0, 0.8),
xlabel='period (days)',
ylabel='Lomb-Scargle Power');


The periodogram gives a measure of periodic content as a function of period; we see here a strong peak at around 0.61 days. Other lower peaks are due to some combination of higher-order harmonics in the data and effects of the irregular survey window. While we could find this maximum manually from the above grid, gatspy provides a better way: a built-in two-stage grid-search that accurately determines the best period in a specified range:

In [4]:
# set range and find period
model.optimizer.period_range=(0.2, 1.4)
period = model.best_period
print("period = {0}".format(period))

Finding optimal frequency:
- Estimated peak width = 0.0032
- Using 5 steps per peak; omega_step = 0.00064
- User-specified period range:  0.2 to 1.4
- Computing periods at 42104 steps
Zooming-in on 5 candidate peaks:
- Computing periods at 1000 steps
period = 0.6105387801103276



We see that the optimizer determined that it needed a grid of over 40,000 points to adequately cover the frequency grid (more on this below), and in the end arrived at a best period of 0.6105 days. Given this detected period, we can fold the input data and over-plot a best-fit empirical RR Lyrae template to see the fit:

In [5]:
# Compute phases of the obsevations
phase = (t / period) % 1

# Compute best-fit RR Lyrae template
from gatspy.periodic import RRLyraeTemplateModeler
model = RRLyraeTemplateModeler('r').fit(t, mag, dmag)
phase_fit = np.linspace(0, 1, 1000)
mag_fit = model.predict(period * phase_fit, period=period)

# Plot the phased data & model
fig, ax = plt.subplots()
ax.errorbar(phase, mag, dmag, fmt='.k', ecolor='gray', alpha=0.5)
ax.plot(phase_fit, mag_fit, '-k')
ax.set(xlabel='Phase', ylabel='magitude')
ax.invert_yaxis();


This very close template fit gives a strong indication that the star in question is an RR Lyrae.

## Computational Considerations for Lomb-Scargle

The Lomb-Scargle periodogram involves the computation of a power $$P(\omega)$$ at a set of frequencies $$\omega_i$$. For data $$\{y_k\}$$ pre-centered such that $$\sum_k y_k = 0$$, the expression for the power is:

$P(\omega) \propto \frac{\left[\sum_k y_k \cos\omega(t_k - \tau)\right]^2} {\sum_k \cos^2\omega(t_k - \tau)} + \frac{\left[\sum_k y_k \sin\omega(t_k - \tau)\right]^2} {\sum_k \sin^2\omega(t_k - \tau)}$

where $$\tau$$ is an easily computed time-offset which orthogonalizes the model and makes $$P(\omega)$$ independent of a translation in $$t$$.

Rather than get lost in the math, I want to emphasize the key feature of this expression: for any frequency $$\omega$$, the power is an $$O[N]$$ computation involving simple trigonometric sums over the data, where $$N$$ is the number of observed data poitns. The main computational question then becomes: how many frequencies must you compute? In my experience, the most common mistake people make when doing this sort of periodic analysis is not thinking hard enough about the frequency grid. It turns out that the grid-spacing question is very important. If you choose too fine a grid, you do much more computation than is required. Worse, if you choose too coarse a grid, the periodogram peak may fall between grid points and you'll miss it entirely!

Let's think about the required frequency range and frequency spacing for Lomb-Scargle.

### Frequency spacing

First we'll choose the spacing of the frequency grid. If you're asking about a candidate frequency $$f$$, then data with range $$T = t_{max} - t_{min}$$ contains $$T \cdot f$$ complete cycles. If our error in frequency is $$\delta f$$, then $$T\cdot\delta f$$ is the error in number of cycles between the endpoints of the data. It's clear that this error must not be a significant fraction of a cycle, or the fit could be drastically affected. This leads to an approximate grid-spacing criterion:

$T\cdot\delta f \ll 1$

Commonly, we'll choose some oversampling factor (say, 5) and use $$\delta f = (5T)^{-1}$$ as our frequency grid spacing.

### Frequency limits

Next, we need to choose the upper and lower limits of the frequency grid. On the low end, $$f=0$$ is suitable, but causes some numerical problems – we'll go one step away and use $$\delta f$$ as our minimum frequency. But on the high end, we need to make a choice: what's the highest frequency we'd trust our data to be sensitive to? At this point, many people are tempted to mis-apply the Nyquist-Shannon sampling theorem, and choose some version of the Nyquist limit for the data (based on, say, the minimum or mean spacing between observations). But this is entirely wrong! The Nyquist frequency is derived from special properties of regularly-sampled data, and does not apply – even approximately – to irregularly-sampled time-series. In fact, as we saw above, irregularly-sampled data can be sensitive to much, much higher frequencies than even the minimum spacing between observations. With this in mind, the upper limit for frequencies should be determined based on what kind of signal you are looking for.

Still, a common (if dubious) rule-of-thumb is that the high frequency is some multiple of what Press & Rybicki call the "average" Nyquist frequency,

$\hat{f}_{Ny} = \frac{N}{2T}$

This means that the "typical" number of frequencies you'll need is

$N_{freq} \sim O\left[\frac{\hat{f}_{Ny}}{\delta f}\right] \sim O\left[\frac{N/(2T)}{1/T}\right] \sim O[N]$

That is, the number of frequencies to search will scale with the number of data points!

### Computational Complexity

From the above considerations, we see that the determination of the optimal Lomb-Scargle period within $$N$$ points requires computing an $$O[N]$$ expression for power across $$O[N]$$ grid points; that is, Lomb-Scargle is naively an $$O[N^2]$$ algorithm.

This computational complexity can be improved in one of several ways. Most notably, in a 1989 paper, Press and Rybicki proposed a clever method whereby a Fast Fourier Transform is used on a grid extirpolated from the original data, such that this naively $$O[N^2]$$ problem can be solved in $$O[N\log N]$$ time. The broad idea is that when you compute sums of sines and cosines for one frequency, this gives you some amount of information about those sums computed at another frequency, and by carefully using all information across a frequency grid, you can significantly reduce the number of required operations.

Thus the fundamental divide between Lomb-Scargle implementations is whether they use the naive $$O[N^2]$$ algorithm or the $$O[N\log N]$$ algorithm of Press & Rybicki and other similar approaches.

## Lomb-Scargle Algorithms in Python

Now we get to the meat of this post: Lomb-Scargle implementations written in Python. If you search this on Google, you'll currently find links to several available implementations. Here I'm going to delve into and compare the following four implementations:

• scipy.signal.lombscargle, an $$O[N^2]$$ implementation from SciPy.
• astroML.time_series.lomb_scargle, an $$O[N^2]$$ implementation from astroML.
• gatspy.periodic.LombScargle, an $$O[N^2]$$ implementation from gatspy.
• gatspy.periodic.LombScargleFast, an $$O[N\log N]$$ implementation, also from gatspy.

Let's see some examples of the above tools:

#### scipy.signal.lombscargle

The SciPy Lomb-Scargle periodogram is a C implementation of the naive $$O[N^2]$$ algorithm. The algorithm cannot account for noise in the data, and has some other quirks as well:

• it requires you to center your data (by subtracting the mean) before computing the periodogram. If you do not, the results will be garbage.
• it computes the unnormalized periodogram, which can be normalized manually as we'll see below.
• it takes angular frequencies as the argument.

Let's use scipy's algorithm to plot the periodogram of the data shown above. Note that the results will not be identical, because this algorithm ignores the noise in the data and doesn't fit for the data mean.

Against the above recommendations, we'll choose a simple regular grid in period for the plot:

In [6]:
from scipy.signal import lombscargle

# Choose a period grid
periods = np.linspace(0.2, 1.4, 4000)
ang_freqs = 2 * np.pi / periods

# compute the (unnormalized) periodogram
# note pre-centering of y values!
power = lombscargle(t, mag - mag.mean(), ang_freqs)

# normalize the power
N = len(t)
power *= 2 / (N * mag.std() ** 2)

# plot the results
fig, ax = plt.subplots()
ax.plot(periods, power)
ax.set(ylim=(0, 0.8), xlabel='period (days)',
ylabel='Lomb-Scargle Power');


Comparing to the first periodogram plot, we see that becuase our period grid here is too coarse at low frequencies, some of the peak structure is missed by this visualization. Consider this a warning against arbitrarily choosing a period gridding!

#### astroML.time_series.lomb_scargle

AstroML has two $$O[N^2]$$ implementations of Lomb-Scargle: one in astroML and one in astroML_addons, which is a collection of C extensions which replace slower functionality in the pure-python astroML package. In order to use the faster version, make sure you install both packages; e.g.

$pip install astroML$ pip install astroML_addons

Some important features of astroML's Lomb Scargle periodogram:

• unlike scipy, it uses an extended periodogram model which can correctly account for uncorrelated Gaussian measurement error.
• like scipy, it takes angular frequencies as its argument.
• unlike scipy, it implements a floating mean periodogram, meaning that the data centering required for scipy is not required here, but it goes beyond simple centering: the mean of the data is fit as part of the model, which has advantages in many real-world scenarios. To directly compare to scipy's standard Lomb Scargle pass generalized=False.

Let's repeat the above plot with this periodogram:

In [7]:
from astroML.time_series import lomb_scargle
power = lomb_scargle(t, mag, dmag, ang_freqs)

# plot the results
fig, ax = plt.subplots()
ax.plot(periods, power)
ax.set(ylim=(0, 0.8), xlabel='period (days)',
ylabel='Lomb-Scargle Power');


#### gatspy.periodic.LombScargle

Gatspy's basic Lomb-Scargle algorithm is an $$O[N^2]$$ implementation, but is implemented differently than either of the above versions. It uses a direct linear algebra approach which carries some additional computational and memory overhead. The reason for this approach is that it naturally accommodates several extensions to the periodogram, including floating mean, multiple terms, regularization, and multi-band models (more details in VanderPlas & Ivezic (2015), the paper that inspired gatspy).

Gatspy is a pure python package, and thus installation is easy and requires no compilation of C or Fortran code:

## May 26, 2015

### Titus Brown

#### DIB jclub: Spaced Seed Data Structures for De Novo Assembly

Note: at the Lab for Data Intensive Biology, We're trying out a new journal club format where we summarize our thoughts on the paper in a blog post. For this blog post, Camille wrote the majority of the text and the rest of us added questions and comments.

Inanç Birol, Justin Chu, Hamid Mohamadi, et al., “Spaced Seed Data Structures for De Novo Assembly,” International Journal of Genomics, Article ID 196591, in press.

The paper:

Some relevant background:

## Summary

The authors describe several data structures for integrating a special case of spaced seeds into de Bruijn Graphs. Traditionally, a spaced seed is a mask that's applied to an indexing sequence, increasing the sensitivity of seeding approaches. For example, we could use 110110110 to mask out the 3rd base in each codon of our seed, thus allowing for a wider range of matches given the high variability of that position in protein coding regions. The special case here has the mask follow the form 1..1..0..0..0..1..1, where the left and right pass regions are k-mers, and the mask in the middle is of length Δ; or in other words, given a region of length 2k+Δ, the seed comprises the prefix and suffix k-mers.

Three data structures are introduced. The first is a naive hash table formulated in the same manner as the current ABySS implementation, which stores:

• The concatenated seeds as a bit string, represented with the notation [ k : k ]
• The extensions of both k-mers in the seed pair, stored as a 16-bit string
• Observation frequencies for the two strands of the seeds
• Bookkeeping junk

The second is a “spaced seeds bloom filter,” which as one might expect, stores the seeds in a bloom filter. The novelty here is the way they hash the seeds. Given a string of length 2k representing the concatenated k-mers of the seed, they hash:

• The left k-mer (i.e., the first k bases)
• The right k-mer (i.e., the last k bases)
• The string formed from pulling out every other base starting at position 0 (i.e., the “odd” bases)
• The string formed from pulling out every other base starting at position 1 (i.e., the “even” bases)
• For each of the four described values, they actually hash the XOR of the forward and revcomp’d 2-bit encodings

One immediately useful application of this method is in its ability to infer the existence of single-base errors. For example, if we check a seed and find that our filter already contains the “left” and “odd” k-mers, but not the “right” and “even” ones, there’s a good chance that our query seed just has a single base error in one of the “right” and “even” positions (see Table 3 for complete listing).

Finally, they describe a “spaced seeds counting bloom filter” which, you guessed it, stores seeds in a counting bloom filter. This is a particularly nifty implementation, as it uses the minifloat standard to exactly count up to 15, and probabilistically count values from 15-~128,000. They use the bloom filter to first count existence, and then fall over to the counting filter when a seed is observed multiple times. The usefulness of a better counting bloom filter should be obvious to our group.

1. The seeding methodology allows for dBG’s to be scaled up to longer, error prone reads - a very important advancement to make if we want to dBG’s to continue to be relevant. The question remains as to whether we ought to be piling more and more duct tape on to dBG's to keep them in use.
2. The seeding also allows more accurate resolution of complex graph regions by retaining longer range correlations from within reads.
3. The aforementioned error identification. The hashing method allows one to quickly restrict the set of possible/likely erroneous k-mers in a read, which should speed up spectral error correction.
4. Generally, spaced seeds have better fault tolerance and uniqueness than long k-mers. Fig. 1 shows that spaced seeds of length 16 with Δ>100 have better uniqueness than k-mers of length 64, and are obviously less prone to single base errors because they are composed of less sequence.

## Other notes

Interesting, they routinely use a k size of 120 when assembling 150 bp reads.

The staged Bloom filter reminds us of the BFCounter implementation, which uses exact counting for k-mers seen two or more times.

For Illumina reads (100 - 150 bp long), the middle part of each read are ignored if delta is big like 100 bp. So, were the y axis in figure 1 changed to absolute number, unique 2k-mers should be much higher in number than unique spaced seeds.

A generally question we have is which one is the most memory efficient data structure for DBG, sDBG or bloom filters?

Spaced seeds let you take advantage of longer reads; the alternative, using longer k, would reduce coverage dramatically, be sensitive to errors as well, and consume more memory.

## Confusions

• The authors claim that this is better than the method used in the PathSet paper because spaced seeds “allow” for fixed distance between seed pairs. This is confusing to us, because the variable distance used in PathSet seems to be described as a feature — yet these authors posit that variable-distance seeds are sensitive to “read coverage fluctuations” for reasons. No justification was given for this statement.

• We don't see how spaced seeds are useful with the higher error rate of uncorrected long 'pore reads; granted the error correction for both nanopore and pacbio has gotten a lot better lately.

More specifically, this seems targeted at long, erroneous reads. What effect do indels etc have from pac bio? Do you need error corrected reads? If you have error corrected long reads aren't you already mostly done assembling and no longer need to use DBG? And what's the effect of indels vs effect of high substitution, especially given the spaced seeds with fixed spacing?

## May 25, 2015

### Titus Brown

#### Notes on the &quot;week of khmer&quot;

Last week we wrote five blog posts about some previously un-publicized features in the khmer software - most specifically, read-to-graph alignment and sparse graph labeling -- and what they enabled. We covered some half-baked ideas on graph-based error correction, variant calling, abundance counting, graph labeling, and assembly evaluation.

It was, to be frank, an immense writing and coding effort and one from which I'm still recovering!

## Some details on khmer and replicating results

For anyone interested in following up on implementation details or any other details of the analyses, all of the results we wrote up last week can be replicated from scratch using khmer and publicly available data & scripts. You can also use a Docker container to run everything. To try this all out, use the links at the bottom of each blog post and follow the instructions there.

The majority of the khmer codebase is C++ with a CPython wrapping that provides a Python interface to the data structures and algorithms. Some people are already using it primarily via the C++ interface, while our own group mainly uses the Python interface.

One wonderful outcome of the blog posts was a bunch of things to read! A few I was already aware of, others were new to me, and I was thoroughly reminded of my lack of knowledge in this area.

In no particular order,

Lex Nederbragt has a wonderful blog post introducing the concept of graph-based genomics, On graph-based representations of a (set of) genomes. The references at the bottom are good for people that want to dive into this more.

Heng Li wrote a nicely technical blog post with a bunch more references.

Zam Iqbal left a nice comment on my first post that largely reiterated the references from Lex and Heng's blog posts (which I should have put in there in the first place, sorry).

Several people pointed me at BGREAT, Read Mapping on de Bruijn graph. I need to read it thoroughly.

Rob Patro pointed me at several papers, including Compression of high throughput sequencing data with probabilistic de Bruijn graph and Reference-based compression of short-read sequences using path encoding. More to read.

Erik Garrison pointed me at 'vg', tools for working with variant graphs. To quote, "It includes SIMD-based "banded" string to graph alignment. Can read and write GFA." See the github repo.

## So what was the point?

I had many reasons for investing effort in the the blog posts, but, as with many decisions I make, the reasoning became less clear as I pushed forward. Here are some things I wrote down while thinking about the topic and writing things up --

• we've had a lot of this basic functionality implemented for a while, but had never really applied it to anything. This was an attempt to drive a vertical spike through some problems and see how things worked out.
• taking existing ideas and bridging them to practice is always a good way to understand those ideas better.
• from writing this up, I developed more mature use cases, found broken aspects of the implementation, provided minimal documentation for a bunch of features in khmer, and hopefully sharpened our focus a bit.
• not enough people realize how fundamental a concept graphs (in general) are, and (more specifically) how powerful De Bruijn graphs are! It was fun to write that up in a bit more detail.
• I've found it virtually impossible to think concretely about publishing any of this. Very little of it is particularly novel and I'm not so interested in micro-optimizing the code for specific use cases so that we can publish a "10% better" paper." So writing them up as blog posts seemed like a good way to go, even had that not been my native inclination.
• Providing low-memory and scalable implementations seems like a good idea, especially when it's as simple as ours.

So far I'm quite happy with the results of the blogging (quiet interest, more references, some real improvements in the code base, etc. etc.). For now, I don't have anything more to say than that I'd like to try more technical blogging as a way to release potentially interesting computational bits and bobs to the community, and discuss them openly. It seems like a good way to advance science.

--titus

# VBGMM

This week, I studied variational inference described in Chapter 10 of Pattern Recognition and Machine Learning (PRML) on GMM model. I derived the updating functions of VBGMM with "full" type covariance matrix. There are so many equations. Download the file from Dropbox. Currently, I have done the updating functions with other three covariance matrix "tied", "diag" and "sphere", but I have not typed into the latex file yet.

I also studied the adjustment of GMM API. The discussion on issue #2473, #4062 points out the inconsistency on score_ sample, score. So I changed and made a new API interface of some functions in the ipython notebook.

## May 21, 2015

### Titus Brown

#### Comparing and evaluating assembly with graph alignment

One of our long-term interests has been in figuring out what the !$!$!#!#%! assemblers actually do to real data, given all their heuristics. A continuing challenge in this space is that short-read assemblers deal with really large amounts of noisy data, and it can be extremely hard to look at assembly results without running into this noise head-on. It turns out that being able to label De Bruijn graphs efficiently and align reads to graphs can help us explore assemblies in a variety of ways.

The two basic challenges are noisy data and lots of data. When (for example) looking at what fraction of reads has been incorporated into an assembly, noise causes problems because a read may have been corrected during assembly. This is where graph alignment comes in handy, because we can use it to align reads to the full graph and get rid of much of this noise. Lots of data complicates things because it's very hard to look at reads individually - you need to treat them in aggregate, and it's much easier just to look at the reads that match to your assembly than to investigate the oddball reads that don't assemble. And this is where the combination of graph alignment and labeling helps, because it's easy to count and extract reads based on overlaps with labels, as well as to summarize those overlaps.

The main question we will be asking below is: can we measure overlaps and disjoint components in graph extents, that is, in unique portions of assembly graphs? We will be doing this using our sparse graph instead of counting nodes or k-mers, for two reasons: first, the covering is largely independent of coverage, and second, the number of sparse nodes is a lot smaller than the total number of k-mers.

The underlying approach is straightforward:

• load contigs or reads from A into the graph, tagging sparse nodes as we go;
• load contigs or reads from B into the graph, tagging sparse nodes as we go;
• count the number of tagged nodes that are unique to A, unique to B, and in the overlap;
• optionally do graph alignment as you load in reads, to ignore errors.

## Some basics

Let's start with simulations, as usual. We'll set up two randomly generated chromosomes, a and b, of equal size, both in genomes.fa, and look at genome-a extent within the context of both (target 'fake_a' in Makefile):

./compare-graphs.py genomes.fa genome-b.fa
all tags: 52
n tags in A: 52
n tags in B: 26
tags in A but not in B 26
tags in B but not in A 0


So far so good -- there's a 50% overlap between one of the chromosomes and the total.

If we now generate reads from genome-b.fa and do the graph comparison with the reads, we get silly results (target 'fake_b' in Makefile):

./compare-graphs.py genomes.fa reads-b.fa
all tags: 135
n tags in A: 109
n tags in B: 107
tags in A but not in B 28
tags in B but not in A 26


Despite knowing by construction that all of the reads came from genome-b, we're getting results that there's a lot of tags in the reads that aren't in the genome. This is because of errors in the reads, which introduce many spurious branches in the graph.

This is now where the read aligner comes in; we can do the same comparison, but this time we can ask that the reads be aligned to the genome, thus eliminating most of the errors in the comparison:

./compare-graphs.py genomes.fa reads-b.fa --align-b
all tags: 99
n tags in A: 99
n tags in B: 72
tags in A but not in B 27
tags in B but not in A 0


At this point we can go in and look at the original tags in A that aren't covered in B (there are 52) and note that B is missing approximately half of the graph extent in A.

## Trying it out on some real data

Let's try evaluating a reference against some low-coverage reads. Using the same mouse reference transcriptome & subset of reads that we've been using in previous blog posts, we can ask "how many sparse nodes are unaccounted for in the mouse transcriptome when we look at the reads?" (Note, the mouse transcriptome was not generated from this data set; this is the reference transcriptome.)

The answer (target rna-compare-noalign.txt in the Makefile) is:

all tags: 1959121
n tags in A: 1878475
n tags in B: 644963
tags in A but not in B 1314158
tags in B but not in A 80646


About 12.5% of the reads in (B; 80646 / 644963) don't pick up tags in the official reference transcriptome (A).

Interestingly, the results with alignment are essentially the same (target rna-compare-align.txt):

all tags: 1958219
n tags in A: 1877685
n tags in B: 643655
tags in A but not in B 1314564
tags in B but not in A 80534


suggesting that, by and large, these reads are disjoint from the existing assembly, and not mere sequencing errors. (This may be because we require that the entire read be mappable to the graph in order to count it, though.)

## Evaluating trimming

One of the interesting questions that's somewhat hard to investigate in terms of transcriptome assembly is, how beneficial is read trimming to the assembly? The intuition here (that I agree with) is that generally sequence trimming lowers the effective coverage for assembly, and hence loses you assembled sequence. Typically this is measured by running an assembler against the reads, which is slightly problematic because the assembler could have all sorts of strange interactions with the trimming.

So, can we look at the effect of trimming in terms of sparse nodes? Sure!

Suppose we do a stringent round of trimming on our RNAseq (Trimmomatic SLIDINGWINDOW:4:30) - what do we lose?

On this low coverage data set, where A is the graph formed from the trimmed reads and B is the graph from the raw reads, we see (target rseq-hardtrim-ba-noalign.txt):

all tags: 588615
n tags in A: 518980
n tags in B: 588615
tags in A but not in B 0
tags in B but not in A 69635


we see about 12% of the sparse nodes missing from the trimmed data.

If we run the read aligner with a low coverage cutoff (target rseq-hardtrim-ba-align1.txt), we see:

all tags: 569280
n tags in A: 519396
n tags in B: 561757
tags in A but not in B 7523
tags in B but not in A 49884


Basically, we recover about 20,000 tags in B (69,635 - 49,884) with alignment vs exact matches, so a few percent; but we also lose about half that (7,500) for reasons that we don't entirely understand (wiggle in the graph aligner?)

We have no firm conclusions here, except to say that this should be a way to evaluate the effect of different trimming on graph extent, which should be more reliable than looking at the effect on assemblies.

## Notes and miscellany

• There is no inherent coverage model embedded here, so as long as we can correct for the density of tags, we can apply these approaches to genomes, metagenomes, and transcriptomes.
• It's actually very easy to extract the reads that do or don't match, but our current scripts don't let us do so based on labels.
• We aren't really using the labeling here, just the tagging - but labeling can enable n-way comparisons between e.g. different assemblies and different treatments, because it lets us examine which tags show up in different combinations of data sets.

### Appendix: Running this code

The computational results in this blog post are Rather Reproducible (TM). Please see https://github.com/dib-lab/2015-khmer-wok5-eval/blob/master/README.rst for instructions on replicating the results on a virtual machine or using a Docker container.

### Continuum Analytics

#### Conda for Data Science

tl; dr: We discuss how data scientists working with Python, R, or both can benefit from using conda in their workflow.

Conda is a package and environment manager that can help data scientists manage their project dependencies and easily share environments with their peers. Conda works with Linux, OSX, and Windows, and is language agnostic, which allows us to use it with any programming language or even multi-language projects.

This post explores how to use conda in a multi-language data science project. We’ll use a project named topik, which combines Python and R libraries, as an example.

## May 20, 2015

### Titus Brown

#### Labeling a sparse covering of a De Bruijn graph, and utility thereof

So far, in this week of khmer blog posts (1, 2, 3), we've been focusing on the read-to-graph aligner ("graphalign"), which enables sequence alignments to a De Bruijn graph. One persistent challenge with this functionality as introduced is that our De Bruijn graphs nodes are anonymous, so we have no way of knowing the sources of the graph sequences to which we're aligning.

Without being able to label the graph with source sequences and coordinates, we can't do some pretty basic things, like traditional read mapping, counting, and variant calling. It would be nice to be able to implement those in a graph-aware manner, we think.

To frame the problem, graphalign lets us query into graphs in a flexible way, but we haven't introduced any way to link the matches back to source sequences. There are several things we could do -- one basic idea is to annotate each node in the graph -- but what we really want is a lightweight way to build a labeled graph (aka "colored graph" in Iqbal parlance).

This is where some nice existing khmer technology comes into play.

## Partitioning, tagging, and labelhash

Back in 2012, we published a paper (Pell et al., 2012) that introduced a lightweight representation of implicit De Bruijn graphs. Our main purpose for this representation was something called "partitioning", in which we identified components (disconnected subgraphs) of metagenome assembly graphs for the purpose of scaling metagenome assembly.

A much underappreciated part of the paper is buried in the Materials,

For discovering large components we tag the graph at a minimum density by using the underlying reads as a guide. We then exhaustively explore the graph around these tags in order to connect tagged k-mers based on graph connectivity. The underlying reads in each component can then be separated based on their partition.

The background is that we were dealing with extremely large graphs (30-150 billion nodes), and we needed to exhaustively explore the graphs in order to determine if any given node was transitively connected to any other node; from this, we could determine which nodes belonged to which components. We didn't want to label all the nodes in the graph, or traverse from all the nodes, because this was prohibitive computationally.

### A sparse graph covering

To solve this problem, we built what I call a sparse graph covering, in which we chose a subset of graph nodes called "tags" such that every node in the graph was within a distance 'd' of a tag. We then used this subset of tags as a proxy for the graph structure overall, and could do things like build "partitions" of tags representing disconnected components. We could guarantee the distance 'd' by using the reads themselves as guides into the graph (Yes, this was one of the trickiest bits of the paper. ;)

Only later did I realize that this tagging was analogous to sparse graph representations like succinct De Bruijn graphs, but that's another story.

The long and short of it is this: we have a nice, simple, robust, and somewhat lightweight way to label graph paths. We also have functionality already built in to exhaustively explore the graph around any node and collect all tagged nodes within a given distance.

What was missing was a way to label these nodes efficiently and effectively, with multiple labels.

### Generic labeling

Soon after Camille Scott, a CS graduate student at MSU (and now at Davis), joined the lab, she proposed an expansion to the tagging code to enable arbitrary labels on the tags. She implemented this within khmer, and built out a nice Python API called "labelhash".

With labelhash, we can do things like this:

lh = khmer.CountingLabelHash(...)
lh.consume_fasta_and_tag_with_labels(sequence_file)


and then query labelhash with specific sequences:

labels = lh.sweep_label_neighborhood(query, dist)


where 'labels' now contains the labels of all tags that overlap with 'query', including tags that are within an optional distance 'dist' of any node in query.

Inconveniently, however, this kind of query was only useful when what you were looking for was in the graph already; it was a way to build an index of sequences, but fuzzy matching wasn't possible. With the high error rate of sequencing and high polymorphism rates in things we worked on, we were worried about its poor effectiveness.

### Querying via graphalign, retrieving with labelhash

This is where graphalign comes in - we can query into the graph in approximate ways, and retrieve a path that's actually in the graph from the query. This is essentially like doing a BLASTN query into the graph. And, combined with labelhash, we can retrieve the reference sequence(s) that match to the query.

This is roughly what it looks like, once you've built a labelhash as above. First, run the query:

aligner = khmer.ReadAligner(lh.graph, trusted_coverage, 1.0)
score, graph_path, query_path, is_truncated = aligner.align(query)


and then retrieve the associated labels:

labels = lh.sweep_label_neighborhood(graph_path)


...which you can then use with a preexisting database of the sequence.

### Why would you do any of this?

If this seems like an overly complicated way of doing a BLAST, here are some things to consider:

• when looking at sequence collections that share lots of sequence this is an example of "compressive computing", in which the query is against a compressed representation of the database. In particular, this type of solution might be good when we have many, many closely related genomes and we want to figure out which of them have a specific variant.
• graphs are notoriously heavyweight in general, but these graphs are actually quite low memory.
• you can do full BLASTX or protein HMM queries against these graphs as well. While we haven't implemented that in khmer, both a BLAST analog and a HMMER analog have been implemented on De Bruijn graphs.
• another specific use case is retrieving all of the reads that map to a particular region of an assembly graph; this is something we were very interested in back when we were trying to figure out why large portions of our metagenomes were high coverage but not assembling.

One use case that is not well supported by this scheme is labeling all reads - the current label storage scheme is too heavyweight to readily allow for millions of labels, although it's something we've been thinking about.

## Some examples

We've implemented a simple (and, err, somewhat hacky) version of this in make-index.py and do-align.py.

To see them in action, you'll need the 2015-wok branch of khmer, and a copy of the prototype (https://github.com/dib-lab/2015-khmer-wok4-multimap) -- see the README for full install instructions.

Then, type:

make fake


and you should see something like this (output elided):

./do-align.py genomes reads-a.fa



showing that we can correctly assign reads sampled from randomly constructed genomes - a good test case :).

### Assigning reads to reference genomes

We can also index a bunch of bacterial genomes and map against all of them simultaneously -- target 'ecoli' will map reads from E. coli P12B against all Escherichia genomes in NCBI. (Spoiler alert: all of the E. coli strains are very closely related, so the reads map to many references!)

It turns out to be remarkably easy to implement a counting-via-mapping approach -- see do-counting.py. To run this on the same RNAseq data set as in the counting blog post, run build the 'rseq.labelcount' target.

## Flaws in our current implementation

A few points --

• we haven't introduced any positional labeling in the above labels, so all we can do is retrieve the entire sequence around submatches. This is enough to do some things (like counting transcripts) but for many purposes (like pileups / variant calling via mapping) we would need to do something with higher resolution.
• there's no reason we couldn't come up with different tagging and labeling schemes that focus on features of interests - specific variants, or branch points for isoforms, or what have you. Much of this is straightforward and can be done via the Python layer, too.
• "labeled De Bruijn graphs" are equivalent in concept to "colored De Bruijn graphs", but we worry that "colored" is already a well-used term in graph theory and we are hoping that we can drop "colored" in favor of "labeled".

### Appendix: Running this code

The computational results in this blog post are Rather Reproducible (TM). Please see https://github.com/dib-lab/2015-khmer-wok4-labelhash/blob/master/README.rst for instructions on replicating the results on a virtual machine or using a Docker container.

## May 19, 2015

### Titus Brown

#### Abundance counting of sequences in graphs with graphalign

De Bruijn graph alignment should also be useful for exploring concepts in transcriptomics/mRNAseq expression. As with variant calling graphalign can also be used to avoid the mapping step in quantification; and, again, as with the variant calling approach, we can do so by aligning our reference sequences to the graph rather than the reads to the reference sequences.

The basic concept here is that you build a (non-abundance-normalized) De Bruijn graph from the reads, and then align transcripts or genomic regions to the graph and get the k-mer counts across the alignment. This is nice because it gives you a few options for dealing with multimapping issues as well as variation across the reference. You can also make use of the variant calling code to account for certain types of genomic/transcriptomic variation and potentially address allelic bias issues.

Given the existence of Sailfish/Salmon and the recent posting of Kallisto, I don't want to be disingenuous and pretend that this is any way a novel idea! It's been clear for a long time that using De Bruijn graphs in RNAseq quantification is a worthwhile idea. Also, whenever someone uses k-mers to do something in bioinformatics, there's an overlap with De Bruijn graph concepts (...pun intended).

What we like about the graphalign code in connection with transcriptomics is that it makes a surprisingly wide array of things easy to do. By eliminating or at least downgrading the "noisiness" of queries into graphs, we can ask all sorts of questions, quickly, about read counts, graph structure, isoforms, etc. Moreover, by building the graph with error corrected reads, the counts should in theory become more accurate. (Note that this does have the potential for biasing against low-abundance isoforms because low-coverage reads can't be error corrected.)

For one simple example of the possibilities, let's compare mapping counts (bowtie2) against transcript graph counts from the graph (khmer) for a small subset of a mouse mRNAseq dataset. We measure transcript graph counts here by walking along the transcript in the graph and averaging over k-mer counts along the path. This is implicitly a multimapping approach; to get results comparable to bowtie2's default parameters (which random-map), we divide out the number of transcripts in which each k-mer appears (see count-median-norm.py, 'counts' vs 'counts2').

This graph shows some obvious basic level of correlation, but it's not great. What happens if we use corrected mRNAseq reads (built using graphalign)?

This looks better - the correlation is about the same, but when we inspect individual counts, they have moved further to the right, indicating (hopefully) greater sensitivity. This is to be expected - error correction is collapsing k-mers onto the paths we're traversing, increasing the abundance of each path on average.

What happens if we now align the transcripts to the graph built from the error corrected reads?

Again, we see mildly greater sensitivity, due to "correcting" transcripts that may differ only by a base or two. But we also see increased counts above the main correlation, especially above the branch of counts at x = 0 (poor graph coverage) but with high mapping coverage - what gives? Inspection reveals that these are reads with high mapping coverage but little to no graph alignment. Essentially, the graph alignment is getting trapped in a local region. There are at least two overlapping reasons for this -- first, we're using the single seed/local alignment approach (see error correction) rather than the more generous multiseed alignment, and so if the starting point for graph alignment is poorly chosen, we get trapped into a short alignment. Second, in all of these cases, the transcript isn't completely covered by reads, a common occurrence due to both low coverage data as well as incomplete transcriptomes.

In this specific case, the effect is largely due to low coverage; if you drop the coverage further, it's even more exacerbated.

Two side notes here -- first, graphalign will align to low coverage (untrusted) regions of the graph if it has to, although the algorithm will pick trusted k-mers when it can. As such it avoids the common assembler problem of only recovering high abundance paths.

And second, no one should use this code for counting. This is not even a proof of concept, but rather an attempt to see how well mapping and graph counting fit with an intentionally simplistic approach.

## Isoform structure and expression

Another set of use cases worth thinking about is looking at isoform structure and expression across data sets. Currently we are somewhat at the mercy of our reference transcriptome, unless we re-run de novo assembly every time we get a new data set. Since we don't do this, for some model systems (especially emerging model organisms) isoform families may or may not correspond well to the information in the individual samples. This leads to strange-looking situations where specific transcripts have high coverage in one region and low coverage in another (see SAMmate for a good overview of this problem.)

Consider the situation where a gene with four exons, 1-2-3-4, expresses isoform 1-2-4 in tissue A, but expresses 1-3-4 in tissue B. If the transcriptome is built only from data from tissue A, then when we map reads from tissue B to the transcriptome, exon 2 will have no coverage and counts from exon 3 will (still) be missing. This can lead to poor sensitivity in detecting low-expressed genes, weird differential splicing results, and other scientific mayhem.

(Incidentally, it should be clear from this discussion that it's kind of insane to build "a transcriptome" once - what we really want do is build a graph of all relevant RNAseq data where the paths and counts are labeled with information about the source sample. If only we had a way of efficiently labeling our graphs in khmer! Alas, alack!)

With graph alignment approaches, we can short-circuit the currently common ( mapping-to-reference->summing up counts->looking at isoforms ) approach, and go directly to looking directly at counts along the transcript path. Again, this is something that Kallisto and Salmon also enable, but there's a lot of unexplored territory here.

We've implemented a simple, short script to explore this here -- see explore-isoforms-assembled.py, which correctly picks out the exon boundaries from three simulated transcripts (try running it on 'simple-mrna.fa').

### Other thoughts

• these counting approaches can be used directly on metagenomes as well, for straight abundance counting as well as analysis of strain variation. This is of great interest to our lab.
• calculating differential expression on an exonic level, or at exon-exon junctions, is also an interesting direction.

### References and previous work

• Kallisto is the first time I've seen paths in De Bruin graphs explicitly used for RNAseq quantification rather than assembly. Kallisto has some great discussion of where this can go in the future (allele specific expression being one very promising direction).
• There are lots of De Bruijn graph based assemblers for mRNAseq (Trinity, Oases, SOAPdenovo-Trans, and Trans-ABySS.

### Appendix: Running this code

The computational results in this blog post are Rather Reproducible (TM). Please see https://github.com/dib-lab/2015-khmer-wok3-counting/blob/master/README.rst for instructions on replicating the results on a virtual machine or using a Docker container.

### Matthew Rocklin

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr We lay out the pieces of Dask, a system for parallel computing

## Introduction

Dask started five months ago as a parallel on-disk array; it has since broadened out. I’ve enjoyed writing about its development tremendously. With the recent 0.5.0 release I decided to take a moment to give an overview of dask’s various pieces, their state, and current development.

## Collections, graphs, and schedulers

Dask modules can be separated as follows:

On the left there are collections like arrays, bags, and dataframes. These copy APIs for NumPy, PyToolz, and Pandas respectively and are aimed towards data science users, allowing them to interact with larger datasets. Operations on these dask collections produce task graphs which are recipes to compute the desired result using many smaller computations that each fit in memory. For example if we want to sum a trillion numbers then we might break the numbers into million element chunks, sum those, and then sum the sums. A previously impossible task becomes a million and one easy ones.

On the right there are schedulers. Schedulers execute task graphs in different situations, usually in parallel. Notably there are a few schedulers for a single machine, and a new prototype for a distributed scheduler.

In the center is the directed acyclic graph. This graph serves as glue between collections and schedulers. The dask graph format is simple and doesn’t include any dask classes; it’s just functions, dicts, and tuples and so is easy to build on and low-tech enough to understand immediately. This separation is very useful to dask during development; improvements to one side immediately affect the other and new developers have had surprisingly little trouble. Also developers from a variety of backgrounds have been able to come up to speed in about an hour.

This separation is useful to other projects too. Directed acyclic graphs are popular today in many domains. By exposing dask’s schedulers publicly, other projects can bypass dask collections and go straight for the execution engine.

A flattering quote from a github issue:

dask has been very helpful so far, as it allowed me to skip implementing all of the usual graph operations. Especially doing the asynchronous execution properly would have been a lot of work.

Dask developers work closely with a few really amazing users:

1. Stephan Hoyer at Climate Corp has integrated dask.array into xray a library to manage large volumes of meteorlogical data (and other labeled arrays.)

2. Scikit image now includes an apply_parallel operation (github PR) that uses dask.array to parallelize image processing routines. (work by Blake Griffith)

3. Mariano Tepper a postdoc at Duke, uses dask in his research on matrix factorizations. Mariano is also the primary author of the dask.array.linalg module, which includes efficient and stable QR and SVD for tall and skinny matrices. See Mariano’s paper on arXiv.

4. Finally I personally use dask on daily work related to the XData project. This tends to drive some of the newer features.

A few other groups pop up on github from time to time; I’d love to know more detail about how people use dask.

## What works and what doesn’t

Dask is modular. Each of the collections and each of the schedulers are effectively separate projects. These subprojects are at different states of development. Knowing the stability of each subproject can help you to determine how you use and depend on dask.

Dask.array and dask.threaded work well, are stable, and see constant use. They receive relatively minor bug reports which are dealt with swiftly.

Dask.bag and dask.multiprocessing undergo more API churn but are mostly ready for public use with a couple of caveats. Neither dask.dataframe nor

dask.distributed are ready for public use; they undergo significant API churn and have known errors.

## Current work

The current state of development as I see it is as follows:

1. Dask.bag and dask.dataframe are progressing nicely. My personal work depends on these modules, so they see a lot of attention.
• At the moment I focus on grouping and join operations through fast shuffles; I hope to write about this problem soon.
• The Pandas API is large and complex. Reimplementing a subset of it in a blocked way is straightforward but also detailed and time consuming. This would be a great place for community contributions.
2. Dask.distributed is new. It needs it tires kicked but it’s an exciting development.
• For deployment we’re planning to bootstrap off of IPython parallel which already has decent coverage of many parallel job systems, (see #208 by Blake)
3. Dask.array development these days focuses on outreach. We’ve found application domains where dask is very useful; we’d like to find more.
4. The collections (Array, Bag, DataFrame) don’t cover all cases. I would like to start finding uses for the task schedulers in isolation. They serve as a release valve in complex situations.

You can install dask with conda

conda install dask


or with pip

pip install dask
or
or


## May 18, 2015

### Titus Brown

#### Graph alignment and variant calling

There's an interesting and intuitive connection between error correction and variant calling - if you can do one well, it lets you do (parts of) the other well. In the previous blog post on some new features in khmer, we introduced our new "graphalign" functionality, that lets us align short sequences to De Bruijn graphs, and we discussed how we use it for error correction. Now, let's try it out for some simple variant calling!

Graphalign can potentially be used for variant calling in a few different ways - by mapping reads to the reference graph and then using a pileup approach, or by error correcting reads against the graph with a tunable threshold for errors and then looking to see where all the reads disagree - but I've become enamored of an approach based on the concept of reference-guided assembly.

The essential idea is to build a graph that contains the information in the reads, and then "assemble" a path through the graph using a reference sequence as a guide. This has the advantage of looking at the reads only once (to build a DBG, which can be done in a single pass), and also potentially being amenable to a variety of heuristics. (Like almost all variant calling, it is limited by the quality of the reference, although we think there are probably some ways around that.)

## Basic graph-based variant calling

Implementing this took a little bit of extra effort beyond the basic read aligner, because we want to align past gaps in the graph. The way we implemented this was to break the reference up into a bunch of local alignments, each aligned independently, then stitched together.

Again, we tried to keep the API simple. After creating a ReadAligner object,

aligner = khmer.ReadAligner(graph, trusted_cutoff, bits_theta)


there's a single function that takes in the graph and the sequence (potentially genome/chr sized) to align:

score, alignment = align_long(graph, aligner, sequence)


What is returned is a score and an alignment object that gives us access to the raw alignment, some basic stats, and "variant calling" functionality - essentially, reporting of where the alignments are not identical. This is pretty simple to implement:

for n, (a, b) in enumerate(zip(graph_alignment, read_alignment)):
if a != b:
yield n, a, b


The current implementation of the variant caller does nothing beyond reporting where an aligned sequence differs from the graph; this is kind of like guided assembly. In the future, the plan is to extend it with reference-free assembly.

To see this in action for a simulated data set, look at the file sim.align.out -- we get alignments like this, highlighting mismatches:

ATTTTGTAAGTGCTCTATCCGTTGTAGGAAGTGAAAGATGACGTTGCGGCCGTCGCTGTT
|||||||||||||||||||| |||||||||||||||||||||||||||||||||||||||
ATTTTGTAAGTGCTCTATCCCTTGTAGGAAGTGAAAGATGACGTTGCGGCCGTCGCTGTT


(Note that the full alignment shows there's a bug in the read aligner at the ends of graphs. :)

It works OK for whole-genome bacterial stuff, too. If we take an E. coli data set (the same one we used in the semi-streaming paper) and just run the reads against the known reference genome, we'll get 74 differences between the graph and the reference genome, out of 4639680 positions -- an identity of 99.998% (variants-ecoli.txt). On the one hand, this is not that great (consider that for something the size of the human genome, with this error rate we'd be seeing 50,000 false positives!); on the other hand, as with error correction, the whole analysis stack is surprisingly simple, and we haven't spent any time tuning it yet.

## Simulated variants, and targeted variant calling

With simulated variants in the E. coli genome, it does pretty well. Here, rather than changing up the genome and generating synthetic reads, we went with the same real reads as before, and instead changed the reference genome we are aligning to the reads. This was done with the patch-ecoli.py script, which changes an A to a C at position 500,000, removes two bases at position 2m, and adds two bases at position 3m.

When we align the "patched" E. coli genome against the read graph, we indeed recover all three alignments (see variants-patched.txt) in the background of the same false positives we saw in the unaltered genome. So that's kind of nice.

What's even neater is that we can do targeted variant calling directly against the graph -- suppose, for example, that we're interested in just a few regions of the reference. With the normal mapping-based variant calling, you need to map all the reads first before querying for variants by location, because mapping requires the use of the entire reference. Here, you are already looking at all the reads in the graph form, so you can query just the regions you're interested in.

So, for example, here you can align just the patched regions (in ecoli-patched-segments.fa) against the read graph and get the same answer you got when aligning the entire reference (target ecoli-patched-segments.align.out). This works in part because we're stitching together local alignments, so there are some caveats in cases where different overlapping query sequences might lead to different optimal alignments - further research needed.

## Speed considerations

Once you've created the graph (which is linear time with respect to the number of reads), things are pretty fast. For the E. coli data set, it takes about 25 seconds to do a full reference-to-graph alignment on my Mac laptop. Much of the code is still written in Python so we hope to get this under 5 seconds.

In the future, we expect to get much faster. Since the alignment is guided and piecewise, it should be capable of aligning through highly repetitive repeats and is also massively parallelizable. We think that the main bottleneck is going to be loading in the reads. We're working on optimizing the loading separately, but we're hoping to get down to about 8 hours for a full ~50x human genome variant calling with this method on a single CPU.

## Memory considerations

The memory is dominated by graph size, which in turn is dominated by the errors in short-read Illumina data. We have efficient ways of trimming some of these errors, and/or compressing down the data, even if we don't just correct them; the right approach will depend on details of the data (haploid? diploid? polyploid?) and will have to be studied.

For E. coli, we do the above variant calling in under 400 MB of RAM. We should be able to get that down to under 100 MB of RAM easily enough, but we will have to look into exactly what happens as we compress our graph down.

From the Minia paper, we can place some expectations on the memory usage for diploid human genome assembly. (We don't use cascading Bloom filters, but our approaches are approximately equivalent.) We believe we can get down to under 10 GB of RAM here.

As with most of our methods, this approach should work directly for variant calling on RNAseq and metagenomic data with little alteration. We have a variety of graph preparation methods (straight-up graph loading as well as digital normalization and abundance slicing) that can be applied to align to everything while favoring high-coverage reads, or only to high coverage, or to error-trimmed reads, or...

In effect, what we're doing is (rather boring) reference-guided assembly. Wouldn't it be nice if we extended it to longer indels, as in Holtgrewe et al., 2015? Yes, it would. Then we could ask for an assembly to be done between two points... This would enable the kinds of approaches that (e.g.) Rimmer et al., 2014 describe.

One big problem with this approach is that we're only returning positions in the reference where the graph has no agreement - this will cause problems when querying diploid data sets with a single reference, where we really want to know all variants, including heterozygous ones where the reference contains one of the two. We can think of several approaches to resolving this, but haven't implemented them yet.

A related drawback of this approach so far is that we have (so far) presented no way of representing multiple data sets in the same graph; this means that you can't align to many different data sets all at once. You also can't take advantage of things like the contiguity granted by long reads in many useful ways, nor can you do haplotyping with the long reads. Stay tuned...

## References and previous work

A number of people have done previous work on graph-based variant calling --

• Zam Iqbal and Mario Caccamo's Cortex is the first article that introduced me to this area. Since then, Zam's work as well as some of the work that Jared Simpson is doing on FM indices has been a source of inspiration.

(See especially Zam's very nice comment on our error correction post!)

• Heng Li's FermiKit does something very similar to what we're proposing to do, although it seems like he effectively does an assembly before calling variants. This has some positives and some negatives that we'll have to explore.

• Kimura and Koike (2015) do variant calling on a Burrows- Wheeler transform of short-read data, which is very similar to what we're doing.

• Using k-mers to find variation is nothing new. Two articles that caught my eye -- BreaKmer (Abo et al, 2015) and kSNP3 (Gardner et al., 2015) both do this to great effect.

• the GA4GH is working on graph-based variant calling, primarily for human. So far it seems like they are planning to rely on well curated genomes and variants; I'm going to be working with (much) poorer quality genomes, which may account for some differences in how we're thinking about things.

## Appendix: Running this code

The computational results in this blog post are Rather Reproducible (TM). Please see https://github.com/dib-lab/2015-khmer-wok2-vc/blob/master/README.rst for instructions on replicating the results on a virtual machine or using a Docker container.

## May 17, 2015

### Titus Brown

#### Read-to-graph alignment and error correction

One of the newer features in khmer that we're pretty excited about is the read-to-graph aligner, which gives us a way to align sequences to a De Bruijn graph; our nickname for it is "graphalign."

Briefly, graphalign uses a pair-HMM to align a sequence to a k-mer graph (aka De Bruijn graph) allowing both mismatches and indels, and taking into account coverage using a binary model (trusted and untrusted k-mers). The core code was written by Jordan Fish when he was a graduate student in the lab, based on ideas stemming from Jason Pell's thesis work on error correction. It was then refactored by Michael Crusoe.

Graphalign actually lets us do lots of things, including align both short and long sequences to DBG graphs, error correct, and call variants. We've got a simple Python API built into khmer, and we're working to extend it.

The core graphalign API is based around the concept of a ReadAligner object:

aligner = khmer.ReadAligner(graph, trusted_cov, bits_theta)


where 'graph' is a De Bruijn graph (implemented as a counting table in khmer), 'trusted_cov' defines what the trusted k-mer coverage is, and 'bits_theta' adjusts a scoring parameter used to extend alignments.

The 'aligner' object can be used to align short sequences to the graph:

score, graph_alignment, read_alignment, truncated = \


Here, 'graph_alignment' and 'read_alignment' are strings; if 'truncated' is false, then they are of the same length, and constitute a full gapped alignment of the DNA sequence in 'read' to the graph.

The approach used by 'align' is to seed an alignment at the first trusted k-mer, and then extend the alignment along the graph in both directions. Thus, it's effectively a local aligner.

## Error correction

Our initial motivation for graphalign was to use it to do error correction, with specific application to short-read sequences. There was (and to some extent still is) a dearth of error correction approaches that can be used for metagenome and transcriptome data sets, and since that kind of data is what our lab works on, we needed an error correction approach for those data. We also wanted something a bit more programmable than the existing error correctors, which were primarily command-line tools; we've found a lot of value in building libraries, and wanted to use that approach here, too.

The basic idea is this: we build a graph from our short-read data, and then go back through and align each short read to the graph. A successful alignment is then the corrected read. The basic code looks like this:

graph = build_graph(dataset)



In conjunction with our work on semi-streaming algorithms, we can directly convert this into a semi-streaming algorithm that works on genomes, metagenomes, and transcriptomes. This is implemented in the correct-reads script.

## Some results

If we try this out on a simulated data set (random genome, randomly chosen reads - see target compare-sim.txt in Makefile), it takes the simulated data from an error rate of around 1% to about 0.1%; see compare-sim.txt.

Applying this to a ~7m read subset of mRNAseq that we tackled in the semi-streaming paper (the data itself is from the Trinity paper, Grabherr et al, 2011), we take the data from an error rate of about 1.59% to 0.98% (see target rseq-compare.txt in Makefile). There are several reasons why this misses so many errors - first, error correction depends on high coverage, and much of this RNAseq data set is low coverage; second, this data set has a lot of errors; and third, RNAseq may have a broader k-mer abundance distribution than genomic sequencing.

One important side note: we use exactly the same script for error correcting RNAseq data as we do for genomic data.

## How good is the error correction?

tl; dr? It's pretty good but still worse than current methods. When we compare to Quake results on an E. coli data set (target compare-ecoli.txt in the Makefile), we see:

Data set Error rate
Uncorrected 1.587%
Quake 0.009%
khmer 0.013%

This isn't too bad - two orders of magnitude decrease in error rate! - but we'd like to at least be able to beat Quake :).

(Note that here we do a fair comparison by looking only at errors on sequences that Quake doesn't discard; to get comparable results on your data with khmer, you'd also have to trim your reads. We are also making use of the approach developed in the streaming paper where we digitally normalize the graph in advance, in order to decrease the number of errors and the size of the graph.)

## Concluding thoughts

What attracts us to this approach is that it's really simple. The basic error correction is a few lines, although it's surrounded by a bunch of machinery for doing semi-streaming analysis and keeping pairing intact. (The two-pass/offline script for error correction is much cleaner, because it omits all of this machinery.)

It's also nice that this applies to all shotgun sequencing, not just genomic; that's a trivial extension of our semi-streaming paper.

We also suspect that this approach is quite tunable, although we are just beginning to investigate the proper way to build parameters for the pair-HMM, and we haven't nailed down the right coverage/cutoff parameters for error correction either. More work to be done!

In any case, there's also more than error correction to be done with the graphalign approach -- stay tuned!

## References and previous work

This is by no means novel - we're building on a lot of ideas from a lot of people. Our interest is in bridging from theory to practice, and providing a decent tunable implementation in an open-source package, so that we can explore these ideas more widely.

Here is short summary of previous work, surely incomplete --

• Much of this was proximally inspired by Jordan's work on Xander, software to do HMM-guided gene assembly from metagenomic data. (An accompanying paper has been accepted for publication; will blog about that when it hits.)
• More generally, my MSU colleague Yanni Sun has had several PhD students that have worked on HMMs and graph alignment, and she and her students have been great sources of ideas! (She co-advised Jordan.)
• BlastGraph, like Xander, built on the idea of graph alignment. It is the earliest reference I know of to graph alignment, but I haven't looked very hard.
• Yuzhen Ye and Haixu Tang at Indiana have developed very similar functionality that I became aware of when reviewing their nice paper on graph alignment for metatranscriptomics.
• Jared Simpson has been doing nice work on aligning Nanopore reads to a reference sequence. My guess is that the multiple sequence alignment approach described in Jonathan Dursi's blog post is going to prove relevant to us.
• The error corrector Coral (Salmela and Schroder, 2011) bears a strong philosophical resemblance to graphalign in its approach to error correction, if you think of a De Bruijn graph as a kind of multiple-sequence alignment.

## Appendix: Running this code

The computational results in this blog post are Rather Reproducible (TM). Please see https://github.com/dib-lab/2015-khmer-wok1-ec/blob/master/README.rst for instructions on replicating the results on a virtual machine or using a Docker container.

### Gaël Varoquaux

#### Software for reproducible science: let’s not have a misunderstanding

Note

tl;dr:   Reproducibilty is a noble cause and scientific software a promising vessel. But excess of reproducibility can be at odds with the housekeeping required for good software engineering. Code that “just works” should not be taken for granted.

This post advocates for a progressive consolidation effort of scientific code, rather than putting too high a bar on code release.

Titus Brown recently shared an interesting war story in which a reviewer refuses to review a paper until he can run the code on his own files. Titus’s comment boils down to:

“Please destroy this software after publication”.

Note

Reproducible science: Does the emperor have clothes?

In other words, code for a publication is often not reusable. This point of view is very interesting from someone like Titus, who is a vocal proponent of reproducible science. His words triggered some surprises, which led Titus to wonder if some of the reproducible science crowd folks live in a bubble. I was happy to see the discussion unroll, as I think that there is a strong risk of creating a bubble around reproducible science. Such a bubble will backfire.

## Replication is a must for science and society

Science advances by accumulating knowledge built upon observations. It’s easy to forget that these observations, and the corresponding paradigmatic conclusions, are not always as simple to establish as the fact that hot air rises: replicating many times the scientific process transforms an evidence into a truth.

One striking example of scientific replication is the on-going effort in psychology to replay the evidence behind well-accepted findings central to current line of thoughts in psychological sciences. It implies setting up the experiments accordingly to the seminal publications, acquiring the data, and processing it to come up to the same conclusions. Surprisingly, not everything that was taken for granted holds.

Note

Findings later discredited backed economic policy

Another example, with massive consequences on Joe Average’s everyday, is the failed replication of Reinhart and Rogoff’s “Growth in a Time of Debt” publication. The original paper, published in 2010 in the American Economic Review, claimed empirical findings linking important public debt to failure of GDP growth. In a context of economical crisis, it was used by policy makers as a justification for restricted public spending. However, while pursuing a mere homework assignment to replicate these findings, a student uncovered methodological flaws with the paper. Understanding the limitations of the original study took a while, and discredited the academic backing to the economical doctrine of austerity. Critically, the analysis of the publication was possible only because Reinhart and Rogoff released their spreadsheet, with data and analysis details.

## Sharing code can make science reproducible

A great example of sharing code to make a publication reproducible is the recent paper on orthogonalization of regressors in fMRI models, by Mumford, Poline and Poldrack. The paper is a didactic refutation of non-justified data processing practices. The authors made their point much stronger by giving an IPython notebook to reproduce their figures. The recipe works perfectly here, because the ideas underlying the publication are simple and can be illustrated on synthetic data with relatively inexpensive computation. A short IPython notebook is all it takes to convince the reader.

Note

Sharing complex code… chances are it won’t run on new data.

At the other end of the spectrum, a complex analysis pipeline will not be as easy to share. For instance, a feat of strength such as Miyawaki et al’s visual image reconstruction from brain activity requires complex statistical signal processing to extract weak signatures. Miyawaki et al shared the data. They might share the code, but it would be a large chunk of code, probably fragile to changes in the environment (Matlab version, OS…). Chances are that it wouldn’t run on new data. This is the scenario that prompted Titus’s words:

“Please destroy this software after publication”.

I have good news: you can reproduce Miyawaki’s work with an example in nilearn, a library for machine learning on brain images. The example itself is concise, readable and it reliably produces figures close to that of the paper.

Note

Maintained libraries make feats of strength routinely reproducible.

This easy replication is only possible because the corresponding code leverages a set of libraries that encapsulate the main steps of the analysis, mainly scikit-learn and nilearn here. These libraries are tested, maintained and released. They enable us to go from a feat of strength to routine replication.

## Reproducibility is not sustainable for everything

Thinking is easy, acting is difficult       —       Goethe

Note

Keeping a physics apparatus running for replication years later?

I started my scientific career doing physics, and fairly “heavy” physics: vacuum systems, lasers, free-falling airplanes. In such settings, the cost of maintaining an experiment is apparent to the layman. No-one is expected to keep an apparatus running for replication years later. The pinnacle of reproducible research is when the work becomes doable in a students lab. Such progress is often supported by improved technology, driven by wider applications of the findings.

However, not every experiment will give rise to a students lab. Replicating the others will not be easy. Even if the instruments are still around the lab, they will require setting up, adjusting and wiring. And chances are that connectors or cables will be missing.

Software is no different. Storing and sharing it is cheaper. But technology evolves very fast. Every setup is different. Code for a scientific paper has seldom been built for easy maintenance: lack of tests, profusion of exotic dependencies, inexistent documentation. Robustness, portability, isolation, would be desirable, but it is difficult and costly.

Software developers know that understanding the constraints to design a good program requires writing a prototype. Code for a scientific paper is very much a prototype: it’s a first version of an idea, that proves its feasibility. Common sense in software engineering says that prototypes are designed to be thrown away. Prototype code is fragile. It’s untested, probably buggy for certain usage. Releasing prototypes amounts to distributing semi-functioning code. This is the case for most code accompanying a publication, and it is to be expected given the very nature of research: exploration and prototyping [1].

## No success without quality, …

Note

Highly-reliable is more useful than state-of-the-art.

My experience with scientific code has taught me that success require quality. Having a good implementation of simple, well-known, methods seems to matter more than doing something fancy. This is what the success of scikit-learn has taught us: we are really providing classic “old” machine learning methods, but with a good API, good docs, computational performance, and stable numerics controlled by stringent tests. There exists plenty of more sophisticated machine-learning methods, including some that I have developed specifically for my data. Yet, I find myself advising my co-workers to use the methods in scikit-learn, because I know that the implementation is reliable and that they will be able to use them [2].

This quality is indeed central to doing science with code. What good is a data analysis pipeline if it crashes when I fiddle with the data? How can I draw conclusions from simulations if I cannot change their parameters? As soon as I need trust in code supporting a scientific finding, I find myself tinkering with its input, and often breaking it. Good scientific code is code that can be reused, that can lead to large-scale experiments validating its underlying assumptions.

Sqlite is so much used that its developers have been woken up at night by users.

You might say that I am putting the bar too high; that slightly buggy code is more useful than no code. But I frown at the idea of releasing code for which I am unable to do proper quality assurance. I may have done too much of that in the past. And because I am a prolific coder, many people are using code that has been through my hands. My mailbox looks like a battlefield, and when I go the coffee machine I find myself answering questions.

## … and making difficult choices

Note

Achieving quality requires making choices. Not only because time is limited, but also because the difficulty to maintain and improve a codebase increases much quicker than the numbers of features [3]. This phenomena is actually frightening to watch: adding a feature in scikit-learn these days is much much harder than what it used to be in the early days. Interactions between features is a killer: when you modify something, something else unrelated breaks. For a given functionality, nothing makes the code more incomprehensible than cyclomatic complexity: the multiplicity of branching, if/then clauses, for loops. This complexity naturally appears when supporting different input types, or minor variants of a same method.

The consequence is that ensuring quality for many variants of a method is prohibitory. This limit is a real problem for reproducible science, as science builds upon comparing and opposing models. However, ignoring it simply leads to code that fails doing what it claims to do. What this is telling us, is that if we are really trying to do long-term reproducibility, we need to identify successful and important research and focus our efforts on it.

If you agree with my earlier point that the code of a publication is a prototype, this iterative process seems natural. Various ideas can be thought of as competing prototypes. Some will not lead to publication at all, while others will end up having a high impact. Knowing before-hand is impossible. Focusing too early on achieving high quality is counter productive. What matters is progressively consolidating the code.

## Reproducible science, a rich trade-off space

Note

Verbatim replication or reuse?

Does Reinhart and Rogoff’s “Growth in a Time of Debt” paper face the same challenges as the manuscript under review by Titus? One is describing mechanisms while the other is introducing a method. The code of the former is probably much simpler than that of the latter. Different publications come with different goals and code that is more or less easy to share. For verbatim replication of the analysis of a paper, a simple IPython notebook without tests or API is enough. To go beyond requires applying the analysis to different problems or data: reuse. Reuse is very difficult and cannot be a requirement for all publications.

Conventional wisdom in academia is that science builds upon ideas and concepts rather than methods and code. Galileo is known for his contribution to our understanding of the cosmos. Yet, methods development underpins science. Galileo is also the inventor of the telescope, which was a huge technical achievement. He needed to develop it to back his cosmological theories. Today, Galileo’s measurements are easy to reproduce because telescopes are readily-available as consumer products.

Standing on the shoulders of giants     —     Isaac Newton, on software libraries

 [1] To make my point very clear, releasing buggy untested code is not a good thing. However, it is not possible to ask for all research papers to come with industial-quality code. I am trying here to push for a collective, reasoned, undertaking of consolidation.
 [2] Theory tells us that there is there is no universal machine learning algorithm. Given a specific machine-learning application, it is always possible to devise a custom strategy that out-performs a generic one. However, do we need hundreds of classifiers to solve real world classification problems? Empirical results [Delgado 2014] show that most of the benefits can be achieved with a small number of strategies. Is it desirable and sustainable to distribute and keep alive the code of every machine learning paper?
 [3] Empirical studies on the workload for programmers to achieve a given task showed that 25 percent increase in problem complexity results in a 100 percent increase in programming complexity: An Experiment on Unit increase in Problem Complexity, Woodfield 1979.

I need to thank my colleague Chris Filo Gorgolewski and my sister Nelle Varoquaux for their feedback on this note.

## May 13, 2015

### Titus Brown

#### Adventures in replicable scientific papers: Docker

About a month ago, I took some time to try out Docker, a container technology that lets you bundle together, distribute, and execute applications in a lightweight Linux container. It seemed neat but I didn't apply it to any real problems. (Heng Li also tried it out, and came to some interesting conclusions -- note especially the packaging discussion in the comments.)

At the sprint, I decided to try building a software container for our latest paper submission on semi-streaming algorithms for DNA sequence analysis, but I got interrupted by other things. Part of the problem was that I had a tough time conceptualizing exactly what my use case for Docker was. There are a lot of people starting to use Docker in science, but so far only nucleotid.es has really demonstrated its utility.

Fast forward to yesterday, when I talked with Michael Crusoe about various ideas. We settled on using Docker to bundle together the software needed to run the full paper pipeline for the streaming paper. The paper was already highly replicable because we had used my lab's standard approach to replication (first executed three years ago!) This wasn't a terribly ambitious use of Docker but seemed like it could be useful.

In the end, it turned out to be super easy! I installed Docker on an AWS m3.xlarge, create a Dockerfile, and wrote up some instructions.

The basic idea we implemented is this:

• install all the software in a Docker container (only needs to be done once, of course);
• clone the repository on the host machine;
• copy the raw data into the pipeline/ sub-directory of the paper repository;
• run the docker container with the root of the paper repository (on the host, wherever it might be) bound to a standard location ('/paper') in the image;
• voila, raw data in, analyzed results out!

(The whole thing takes about 15 hours to run.)

## The value proposition of Docker for data-intensive papers

So what are my conclusions?

I get the sense that this is not really the way people are thinking about using Docker in science. Most of what I've seen has to do with workflows, and I get the sense that the remaining people are trying to avoid issues with software packaging. In this case, it simply didn't make sense to me to break our workflow steps for this paper out into different Docker images, since our workflow only depends on a few pieces of software that all work together well. (I could have broken out one bit of software, the Quake/Jellyfish code, but that was really it.)

I'm not sure how to think about the volume binding, either - I'm binding a path on the Docker container directly to a local disk, so the container isn't self-sufficient. The alternative was to package the data in the container, but in this case, it's 15-20 GB, which seemed like too much! This dependence on external data does limit our ability to deploy the container to compute farms though, and it also means that we can't put the container on the Docker hub.

The main value that I see for this container is in not polluting my work environment on machines where I can run Docker. (Sadly this does not yet include our HPC at MSU.) I could also use a Project Jupyter container to build our figures, and perhaps use a separate Latex container to build the paper... overkill? :).

One nice outcome of the volume binding is that I can work on the Makefile and workflow outside of the docker container, run it all inside the container, and then examine the artifacts outside of the container. (Is there a more standard way to do this?)

I also really like the explicit documentation of the install and execution steps. That's super cool and probably the most important bit for paper replication. The scientific world would definitely be a better place if the computational setup for data analysis and modeling components of papers came in a Dockerfile-style format! "Here's the software you need, and the command to run; put the data here and push the 'go' button!"

I certainly see the value of docker for running many different software packages, like nucleotid.es does. I think we should re-tool our k-mer counting benchmark paper to use containers to run each k-mer counting package benchmark. In fact, that may be my next demo, unless I get sidetracked by my job :).

## Next steps

I'm really intrigued by two medium-term directions -- one is the bioboxes-style approach for connecting different Docker containers into a workflow, and the other is the nucleotid.es approach for benchmarking software. If this benchmarking can be combined with github repos ("go benchmark the software in this github project!") then that might enable continuously running testing and benchmarks on a wide range of software.

Longer term, I'd like to have a virtual computing environment in which I can use my Project Jupyter notebook running in a Docker environment to quickly and easily spin up a data-intensive workflow involving N docker containers running on M machines with data flowing through them like so. I can already do this with AWS but it's a bit clunky; I foresee a much lighter-weight future for ultra-configurable computing.

In the shorter term, I'm hoping we can put some expectations in place for what dockerized paper replication pipelines might look like. (Hint: binary blobs should not be acceptable!) If we have big data sets, we probably don't want to put them on the Docker Hub; is the right solution to combine use of a data repository (e.g. figshare) with a docker container (to run all the software) and a tag in a github repository (for the paper pipeline/workflow)?

Now, off to review that paper that comes with a Docker container... :)

--titus

#### Modifications to our development process

After a fair amount of time thinking about software's place in science (see blog posts 1, 2, 3, and 4), and thinking about khmer's short- and long-term future, we're making some changes to our development process.

Semantic versioning: The first change, and most visible one, is that we are going to start bumping version numbers a lot faster. One of the first things Michael Crusoe put in place was semantic versioning, which places certain compatibility guarantees on version numbers used. These compatibility guarantees (on the command line API only, for khmer) are starting to hold us back from sanding down the corners. Moving forward, we're going to bump version numbers as quickly as needed for the code we've merged, rather than holding off on cleanup.

Michael just released khmer v1.4; my guess is that 2.0 will follow soon after. We'll try to batch major versions a little bit, but when in doubt we'll push forward rather than holding back, I think. We'll see how it goes.

Improving the command-line user experience. At the same time, we're going to be focusing more on user experience issues; see #988 for an example. Tamer Mansour, one of my new postdocs at Davis, took a fresh look at the command line and argued strenuously for a number of changes, and this aligns pretty well with our interests.

Giving more people explicit merge authority. 'til now, it was mostly Michael and myself doing merges; we've asked Luiz Irber and Camille Scott to step up and do not only code review but merges on their own recognizance. This should free up Michael to focus more on coding, as well as speeding up response times when Michael and I are both busy or traveling. I'm also asking mergers to fix minor formatting issues and update the ChangeLog for pull requests that are otherwise good - this will accelerate the pace of change and decrease frustration around quick fixes.

This is part of my long-term plan to involve more of the lab in software engineering. Most experimental labs have lab duties for grad students and postdocs; I'd like to try out the model where the grad students and postdocs have software engineering duties, independent of their research.

Deferring long-term plans and deprecating sprint/training efforts. We will defer our roadmap and decrease our sprint and training interactions. As a small project trying to get more funding, we can't afford the diversion of energy at this point. That having been said, both the roadmap planning and the sprints thus far were tremendously valuable for thinking ahead and making our contribution process more robust, and we hope to pursue both in the future.

Paying technical debt maintenance fees, instead of decreasing debt. We still have lots of issues that are burdening the codebase, especially at the Python and C++ interface levels, but we're going to ignore them for now and focus instead on adding new features (hopefully without increasing technical debt, note - we're keeping the code review and continuous integration and test coverage and ...). Again, we're a small project trying to get more funding... hard choices must be made.

I'm writing a grant now to ask for sustained funding on a ~5 year time scale, for about 3 employees - probably a software engineer / community manager, a super-postdoc/software engineer, and a grad student. If we can get another round of funding, we will reactivate the roadmap and think about how best to tackle technical debt.

--titus

p.s. Special thanks to Ethan White, Greg Wilson, and Neil Chue Hong for their input!

## May 08, 2015

### Titus Brown

#### My review of a review of &quot;Influential Works in Data Driven Discovery&quot;

I finally got a chance to more thoroughly read Mark Stalzer and Chris Mentzel's arxiv preprint, "A Preliminary Review of Influential Works in Data-Driven Discovery". This is a short review paper that discusses concepts highlighted by the 1,000+ "influential works" lists submitted to the Moore Foundation's Data Driven Discovery (DDD) Investigator Competition. (Note, I was one of the awardees.)

The core of this arxiv preprint is the section on "Clusters of Influential Works", in which Stalzer & Mentzel go in detail through the eight different concept clusters that emerged from their analysis of the submissions. This is a fascinating section that should be at the top of everyone's reading list. The topics covered are, in the order presented in the paper, as follows:

• Foundational theory, including Bayes' Theorem, information theory, and Metropolis sampling;
• Astronomy, and specifically the Sloan Digital Sky Survey;
• Genomics, focused around the Human Genome Project and methods for searching and analyzing sequencing data;
• Classical statistical methods, including the lasso, bootstrap methods, boosting, expectation-maximization, random forests, false discovery rate, and "isomap" (which I'd never heard of!);
• Machine learning, including Support Vector Machines, artificial Neural Networks (and presumably deep learning?), logistic belief networks, and hidden Markov models;
• The Google! Including PageRank, MapReduce, and "the overall anatomy" of how Google does things; specific implementations included Hadoop, BigTable, and Cloud DataFlow.
• General tools, programming languages, and computational methods, including Numerical Recipes, the R language, the IPython Notebook (Project Jupyter), the Visual Display of Quantitative Information, and SQL databases;
• Centrality of the Scientific Method (as opposed to specific tools or concepts). Here the discussion focused around the Fourth Paradigm book which lays out the expansion of the scientific method from empirical observation to theory to simulation to "big data science"; here, I thought the point that computers were used for both theory and observation was well-made. This section is particularly worth reading, in my opinion.

This collection of concepts is simply delightful - Stalzer and Mentzel provide both a summary of the concepts and a fantastic curated set of high-level references.

Since I don't know many of these areas that well (I've heard of most of the subtopics, but I'm certainly not expert in ... any of them? yikes) I evaluated the depth of their discussion by looking at the areas I was most familiar with - genomics and tools/languages/methods. My sense from this was that they covered the highlights of tools better than the highlights of genomics, but this may well be because genomics is a much larger and broader field at the moment.

## Data-Driven Discovery vs Data Science

One interesting question that comes up frequently is what the connection and overlap is between data-driven discovery, data science, big data, data analysis, computational science, etc. This paper provides a lot of food for thought and helps me draw some distinctions. For example, it's clear that computational science includes or at least overlaps with all of the concepts above, but computational science also includes things like modeling that I don't think clearly fit with the "data-driven discovery" theme. Similarly, in my experience "data science" encompasses tools and methods, along with intelligent application of them to specific problems, but practically speaking does not often integrate with theory and prediction. Likewise, "big data", in the sense of methods and approaches designed to scale to analysis and integration of large data set, is clearly one important aspect of data-driven discovery - but only in the sense that in many cases more data seems to be better.

Ever since the "cage match" round of the Moore DDD competition, where we discussed these issues in breakout groups, I've been working towards the internal conclusion that data-driven discovery is the exploration and acceleration of science through development of new data science theory, methods, and tools. This paper certainly helps nail that down by summarizing the components of "data driven discovery" in the eyes of its practitioners.

## Is this a framework for a class or graduate training theme?

I think a lot about research training, in several forms. I do a lot of short-course peer instruction form (e.g. Data Carpentry, Software Carpentry, and my DIB efforts); I've been talking with people about graduate courses and graduate curricula, with especial emphasis on data science (e.g. the Data Science Initiative at UC Davis); and, most generally, I'm interested in "what should graduate students know if they want to work in data-driven discovery"?

From the training perspective, this paper lays out the central concepts that could be touched on either in a survey course or in an entire graduate program; while my sense is that a PhD would require coupling to a specific domain, I could certainly imagine a Master's program or a dual degree program that touched on the theory and practice of data driven discovery.

For one example, I would love to run a survey course on these topics, perhaps in the area of biology. Such a course could go through each of the subsections above, and discuss them in relation to biology - for example, how Bayes' Theorem is used in medicine, or how concepts from the Sloan Digital Sky Survey could be applied to genomics, or where Google-style infrastructure could be used to support research.

There's more than enough meat in there to have a whole graduate program, though. One or two courses could integrate theory and tools, another course could focus on practical application in a specific domain, a third course could talk about general practice and computing tools, and a fourth course could discuss infrastructure and scaling.

## The missing bits - "open science" and "training"

Something that I think was missing from the paper was an in-depth perspective on the role that open source, open data, and open science can play. While these concepts were directly touched on in a few of the subsections - most of the tools described were open source, for example, and Michael Nielsen's excellent book "Reinventing Discovery" was mentioned briefly in the context of network effects in scientific communication and access - I felt that "open science" was an unacknowledged undercurrent throughout.

It's clear that progress in science has always relied on sharing ideas, concepts, methods, theory, and data. What I think is not yet as clear to many is the extent to which practical, efficient, and widely available implementations of methods have become important in the computer age. And, for data-driven discovery, an increasingly critical aspect is the infrastructure to support data sharing, collaboration, and the application of these methods to large data sets. These two themes -- sharing of implementation and importance of infrastructure cut across many of the subsections in this paper, including the specific domains of astronomy and human genomics, as well as the Google infrastructure and languages/tools/implementation subsections. I think the paper could usefully add a section on this.

Interestingly, the Moore Foundation DDD competition implicitly acknowledged this importance by enriching for open scientists in their selection of the awardees -- a surprising fraction of the Investigators are active in open science, including myself and Ethan White, and virtually all the Investigators are openly distributing their research methodology. In that sense, open science is a notable omission from the paper.

It's also interesting to note that training is missing from the paper. If you believe data-driven discovery is part of the future of science, then training is important because there's a general lack of researchers and institutions that cover these topics. I'd guess that virtually no one researcher is well versed in a majority of the topics, especially since many of the topics are entire scientific super-fields, and the rest are vast technical domains. In academic research we're kind of used to the idea that we have to work in collaboration (practice may be different...), but here academia really fails to cover the entire data-driven discovery spectrum because of the general lack of emphasis on expert use of tools and infrastructure in universities.

So I think that investment in training is where the opportunities lie for universities that want to lead in data-driven discovery, and this is the main chance for funders that want to enable the network effect.

## Training in open science, tools, and infrastructure as competitive advantages

Forward-thinking universities who are in it for the long game & interested in building a reputation in data-driven discovery, might consider the following ideas:

• scientists trained in open science, tool use, and how to use existing infrastructure, are more likely to be able to quickly take advantages of new data and methods.
• scientists trained in open science are more likely to produce results that can be built on.
• scientists trained in open science are more likely to produce useful data sets.
• scientists trained in open science and tool building are more likely to produce useful tools.
• funding agencies are increasingly interested in maximizing impact by requiring open source, open data, and open access.

All of these should lead to more publications, more important publications, a better reputation, and more funding.

In sum, I think investments in training in the most ignored bits of data-driven discovery (open science, computational tool use and development, and scalable infrastructure use and development) should be a competitive advantage for institutions. And, like most competitive advantages, those who ignore it will be at a significant disadvantage. This is also an opportunity for foundations to drive progress by targeted investments, although (since they are much more nimble than universities) they are already doing this to some extent.

In the end, what I like most about this paper is that it outlines and summarizes the concepts in which we need to invest in order to advance science through data-driven discovery. I think it's an important contribution and I look forward to its further development and ultimate publication!

--titus

### Wei Xue

#### GSoC Prelude

It is fortunate that my proposal about Gaussian mixture model is accepted by Google Summer of Code 2015. I am very grateful to scikit-learn, Python Software Foundation and Google Summer of Code. As a PhD student studying in Machine Learning and Data Mining, I frequently process various kinds of data using Matlab, Python and scikit-learn. Scikit-learn is a powerful and easy-to- use machine learning library for Python. Though I only have been using it for about one year, I cannot leave it in my many projects now.

I first heard of GSoC in 2012, when my colleague pluskid participated in Shogun project. The post he wrote about his experience is quite interesting and fun. Since I missed GSoC 2014 because of too much course projects, I began to read some code of scikit-learn and learn git. Anyway, I really looking forward to a wonderful journey this summer.

# Introduction

This summer, I focus on Gaussian mixture model and other two variances. Compared with other two GSoC projects, my project looks a bit different, since it is kind of fixing / refactoring rather than introducing new features. The following text is from my proposal.

Gaussian mixture model (GMM) is a popular unsupervised clustering method. GMM corresponds a linear combination of several Gaussian distributions to represent the probability distribution of observations. In GMM, with the prefix number of Gaussian component, a set of parameters should be estimated to represent the distribution of the training data. It includes means, covariances and the coefficients of the linear combination. Expectation Maximization (EM) is usually used to find the maximum likelihood parameters of the mixture model. In each iteration, E-step estimates the conditional distribution of the latent variables. M-step finds the model parameters that maximize the likelihood.

In variational Bayesian Gaussian mixture model (VBGMM), M-step is generalized into full Bayesian estimation, where the parameters are represented by the posterior distribution, instead of only single value like in maximum- likelihood estimation.

On the other hand, Dirichlet process Gaussian mixture model (DPGMM) allows a mixture of infinite Gaussian distributions. It uses Dirichlet process as a nonparametric prior of the distribution parameters, and the number of components could vary according to the data. Therefore, one does not have to preset the number of components ahead of time. The simplest way to infer DPGMM is Monte-Carlo Markov chain (MCMC), but it is generally slow to converge. In Blei's paper, truncated variational inference is proposed, which converges faster than MCMC.

However, in scikit-learn, the implementation suffers from interface incompatibility, incorrect model training and incompleteness of testing, which prohibits the widely use of these models.

# Next

In the rest of bonding time, I will continue reading the related papers. The next post will be about mathematical derivation. Stay tuned.

## May 07, 2015

### Continuum Analytics

#### Continuum Analytics - May Tech Events

The Continuum team is gearing up for a summer full of conferences, including PyData Seattle, taking place July 24-26, hosted by Microsoft. But first we’ve got a few May conferences to keep an eye out for, all over the globe! Join us in Austin, Argentina, Berlin, and Boston this month.

## May 06, 2015

### Abraham Escalante

#### My GSoC experience

Hello all,

My name is Abraham Escalante and I'm a mexican software engineer. The purpose of this blog is to relate my experiences and motivations to participate in the 2015 Google Summer of Code.

I am not much of a blogger (in fact, this is my first blog entry ever) but if you got here, then chances are you are interested in either the GSoC, the personal experience of a GSoCer or maybe we have a relationship of some sort and you have a personal interest (I'm looking at you Hélène). Either way, I will do my best to walk you through my experience with the hope that this may turn out to be useful for someone in the future, be it to help you get into the GSoC programme or just to get to know me a little better if you find that interesting enough.

I have some catching up to do because this journey started for me several months ago. The list of selected student proposals has already been published (**spoiler alert** I got selected. You can take a look at my proposal here) and the coding period will start in about three weeks time but for now I just wanted to write a first entry to get the ball rolling and so you get an idea of what you can expect, should you choose to continue reading these blog entries. I will begin my storytelling soon.

Cheers,
Abraham.

## May 05, 2015

### Titus Brown

#### A workshop report from the May 2015 non-model RNAseq workshop at UC Davis

We just finished teaching the second of my RNAseq workshops at UC Davis -- the fifth workshop I've hosted since I took a faculty position here in VetMed. In order, we've done a Train the Trainers, a Data Carpentry, a reference-guided RNAseq assembly workshop, a mothur (microbial ecology) workshop, and a de novo RNAseq assembly workshop -- you can see all of the links at the Data Intensive Biology Training Program Web site. This workshop was the May de novo mRNAseq assembly workshop, which I co-taught with Tamer Mansour and Camille Scott.

The workshops are still maturing, and I'm trying to figure out how to keep this going for the medium term, but so far I think we're doing an OK job. We can always improve the material and the delivery, but I think at least we're on a good trajectory.

This workshop (and the many excellent questions raised by the attendees) reminded me how much of RNAseq analysis is still research -- it's not just a question of what assembler and quantification method to use, but much more fundamental questions of data evaluation, assembly evaluation, and how to tie this into the biology you're trying to do. My lab works on this a lot, and too much of the time we have to say "we just don't know" - often because the experts don't agree, or because the answer is just unknown.

I also despair sometimes that the energy and effort we're putting into this isn't enough. There is a huge demand, and these two day workshops are at best a stopgap measure, and I really have no idea whether they're going to help biologists starting from scratch to analyze their own data.

I do have other arrows in my quiver. Once my lab "lands" at Davis (sometime between June and September) I expect to start up a biology "data space" of some sort, where every week people who have been through one of my workshops can come work on their data analysis; the hope is that, much like the Davis R Users Group, we can start to build a community around biological data analysis. Stay tuned.

I'm also planning to start running more advanced workshops. One great idea that Tamer pitched to me this morning was to run a follow-on workshop entitled "publishing your transcriptome", which would focus on quality measures and analysis downstream of your first-blush transcriptome assembly/annotation/quantification. I'm also hoping to put together an "automation and reproducibility" workshop in the fall, along with a variety of more focused workshops on specific platforms and questions.

And, of course, we'll continue running the intro workshops. In addition to the mRNASeq workshops, in the fall I'd like to do workshops on microbial genome assembly and annotation, metagenome and metatranscriptome assembly, and advanced UCSC genome browser use/misuse (think assembly hubs etc.).

--titus

### Juan Nunez-Iglesias

#### jnuneziglesias

I use Twitter favourites almost exclusively to mark posts that I know will be useful in some not-too-distant future; kind of like a Twitter Evernote. Recently I was looking through my list in search of this excellent blog post detailing how to build cross-platform binary distributions for conda.

I came across two other tweets from the EuroSciPy 2014 conference: this one by Ian Ozsvald about his IPython memory usage profiler, right next to this one by Alexandre Chabot about Aaron O’Leary’s notedown. I’d forgotten that this was how I came across these two tools, but since then I have contributed code to both (1, 2). I’d met Ian at EuroSciPy 2013, but I’ve never met Aaron, yet nevertheless there is my code in the latest version of his notedown library.

How remarkable the open-source Python community has become. Talks from Python conferences are posted to YouTube, usually as the conference is happening. (Add to that plenty of live tweeting.) Thus, even when I can’t attend the conferences, I can keep up with the latest open source libraries, from the other side of the world. And then I can grab the source code on GitHub, fiddle with it to my heart’s content, and submit a pull request to the author with my changes. After a short while, code that I wrote for my own utility is available for anyone else to use through PyPI or conda.

My point is: join us! Make your code open source, and conversely, when you need some functionality, don’t reinvent the wheel. See if there’s a library that almost meets your needs, and contribute!

## April 28, 2015

### Matthieu Brucher

#### Book review: scikit-learn Cookbook

There are now a few books on sickit-learn, for instance a general one on machine learning systems, and a cookbook. I was a technical reviewer for the first one, and now I’m reviewing the cookbook.

#### Content and opinions

A cookbook is a collection of recipes, it is not intended to help you understand how your oven works. It is the same for this book, it won’t help you install your oven or set it up, you will have to know how to install the required packages.

It will help you decide what tool to use for which problem. It is complementary to the tutorials and the gallery on the scikit website as it adds some thoughts on what the algorithm does and where to pay attention. If Building Machine Learning Systems in Python is quite broad and goes from the installation to specific algorithms, this book tries to cover more algorithms, with explanations of what you are doing, but with less depth, and it is more or less only focused on scikit-learn.

#### Conclusion

If you know a little bit about machine learning and Python, a cookbook may be more appropriate than a more “vertical” book. As such this book covers quite a bit of the scikit, with some useful tips. But as it doesn’t go in too many details, you still need to confront data and parameters against a book like Bishop’s Pattern Recognition and Machine Learning.

## April 26, 2015

### Titus Brown

#### Proposal: Integrating the OSF into Galaxy as a remote data store

Note - this was an internal funding request solicited by the Center for Open Science. It's been funded!

Brief: We propose to integrate OSF into Galaxy as a data store. For this purpose, we request 3 months of funding (6 months, half-time) for one developer, plus travel.

Introduction and summary: Galaxy is a commonly used open source biomedical/biological sequence data analysis platform that enables biologists to put together reproducible pipelines and execute analyses locally or in the cloud. Galaxy has a robust and sophisticated Web-based user interface for setting up these pipelines and analyzing data. One particular challenge for Galaxy is that on cloud instances, data storage and publication must be done using local filesystems and remote URLs, which adds a significant amount of complexity for biologists interested in doing reproducible computing. Recently, Galaxy gained a data abstraction layer that permits object stores to be used instead of local filesystems. The Center for Open Science’s Open Science Framework (OSF), in turn, is a robust platform for storing, manipulating, and sharing scientific data, and provides APIs for accessing such data; the OSF can also act as a broker for accessing and managing remote data stores, on e.g. cloud providers. Integrating the OSF’s object store into Galaxy would let Galaxy use OSF for data persistence and reproducibility, and would let Galaxy users take advantage of OSF’s data management interface, APIs, and authentication to expand their reproducible biomedical science workflows. This integration would also rigorously test and exercise newly developed functionality in both Galaxy and the OSF, providing valuable use cases and testing.

Our “stretch” goal would be to expand beyond Galaxy and work with Project Jupyter/IPython Notebook’s data abstraction layer to provide an OSF integration for Project Jupyter.

We note with enthusiasm that all groups mentioned here are robust participants in the open source/open science ecosystem, and all projects are full open source projects with contributor guidelines and collaboration workflows!

Broader impacts: If successful, the proposed project addresses several broader issues. First, the OSF would have an external consumer of its APIs for data access, which would drive the maturation of these APIs with use cases. Second, the OSF would expand to support connections with a visible project in a non-psychology domain, giving COS a proof-of-concept demonstration for expansion into new communities. Third, the Galaxy biomedical community would gain connections to the OSF’s functionality, which would help in execution, storage, and publication of biomedical data analyses. Fourth, the Brown Lab would then be able to explore further work to build their Moore-DDD-funded data analysis portal on top of both Galaxy and the OSF, leveraging the functionality of both projects to advance open science and reproducibility. Even a partial failure would be informative by exposing faults in the OSF or Galaxy public APIs and execution models, which could then be addressed by the projects individually. This project would also serve as a “beta test” of the COS as an incubator of open science software projects.

Longer-term outcomes: the Brown Lab and the COS are both interested in exploring the OSF as a larger hub for data storage for workflow execution, teaching and training in data-intensive science, and hosting the reproducible publications. This proposed project is a first step in those directions.

#### Popping the open source/open science bubble.

One of the things that became clear to me over the last two weeks is just how much of a open source/open science bubble my blog and Twitter commenters live in. Don't take that as a negative -- I'm in here with you, and it's a great place to live :). But it's still a bubble.

Two specific points brought this home to me.

First, a lot of the Twitter and blog commentary on Please destroy this software after publication. kthxbye. expressed shock and dismay that I would be OK with non-OSS software being published. (Read Mick Watson's blog post and Kai Blin's comment.) Many really good reasons why I was wrong were brought up, and, well, I have to say it was terrifically convincing and I'm going to change my own policy as a reviewer. So far, so good. But it turns out that only a few journals require an actual open source license (Journal of Open Research Software and Journal of Statistical Software). So there is a massive disparity between what some of my tweeps (and now me) believe, and what is codified practice.

Second, many eloquent points were made about software as a major product and enabler of research -- see especially the comments on "software as communication" and "software as experimental design" by others (linked to here - see "Software as..." section). These points were very convincing as well, although I'm still trying to figure out how exactly to evolve my own views. And yet here again I think we can be quite clear that most biologists and perhaps even some bioinformaticians would have either no considered opinion on software, or be outright dismissive of the idea that software itself is intellectual output. Again, very different from what the people on Twitter and my blog think.

I was already pretty surprised with how strong the case was for open source software as a requirement (go read the links above). I was even more surprised with how eloquently and expansively people defended the role of software in research. Many, many strong arguments were put forth.

So, how do we evolve current practice??

But first...

## If software is so important, software is fair game for peer review

I promise this wasn't a stealth goal of my original blog post but people realize that an obvious conclusion here is that software is fully fair game for in depth peer review, right? (Never mind that most scientists probably aren't capable of doing good peer review of code, or that any reasonably strong code review requirements would mean that virtually no more software would be published - an effective but rather punitive way to ensure only good software is published in science :)

A few weeks back I received a response to my review of an application note, and the senior author objected strenuously to my reviewing their actual software in any way. It really pissed me off, frankly -- I was pretty positive about their packaged software and made some suggestions for how they could improve its presentation to others, and basically got back a punch to the nose asking how dare I make such suggestions. As part of my own rather intemperate response, I said:

This is an application note. The application itself is certainly fair game for review...

How much angrier would this person have been if I'd rejected the paper because I actually had comments on edge cases in the source code??

Two years ago now we had another big eruption ("big" in the Twitter sense, at least) around code review. A year even before that I proposed optional review criteria for bioinformatics papers that my students, at least, have started to use to do reviews.

In all that time very little has changed. There are three objections that I've heard in these last three years that bear up over time --

First, scientists neither know how to review code nor how to write reasonable code; this would lead at best to inconsistency in reviews, or at worst simply lead to a massive waste of time.

Second, I am not aware of any code review guidelines or standards for scientific code. Code review in industry has at least some basic good practices; code review in science is a different beast.

Third, code review can be used to unfairly block publication. This came up again recently (READ THAT COMMENT) and I think it's a great reason to worry about code review as a way to block publication. I still don't know how to deal with this but we need some guidelines for editors.

The bottom line is that if software is fair game for peer review, then we need a trained and educated body of reviewers - just as we do for molecular methods, biological sequencing, and statistics. This will inevitably involve the evolution of the community of practice around both software generation (s...l...o...w...l...y... happening) and software peer review (<envision birds chirping in the absence of conversation>).

(One solution I think I'm going to try is this: I'm going to ask the Software Carpentry community for a volunteer to do code review for every computational paper I edit, and I will provide suggested (optional) guidelines. Evil? Maybe so. Effective? I hope so.)

## We need some guidelines and position papers.

Of the discussion around computation as a primary research product, Dan Katz asked,

"I wonder if a collaborative paper on this would find a home somewhere?"

Yes. To break out of the bubble, I think we need a bunch of position papers and guidelines on this sort of thing, frankly. It's clear to me that the online community has a tremendous amount of wisdom to offer, but we are living in a bubble, and we need to communicate outside of that -- just as the open access and open data folk are.

One important note: we need simple, clear, minimum requirements, with broadly relevant justifications. Otherwise we will fail to convince or be useful to anyone, including our own community.

A few ideas:

• We need a clear, concise, big-tent writeup of "why software is important, and why it should be OSS and reviewed when published";
• We need to discuss good minimum requirements in the near term for code review, and figure out what some end goals are;
• We need some definitions of what "responsible conduct of computational research" looks like (Responsible Conduct of Research is a big thing in the US, now; I think it's a useful concept to employ here).
• We need some assessment metrics (via @kaythaney) that disentangle "responsible conduct of research" (a concept that nobody should disagree with) from "open science" (which some people disagree with :).

and probably a bunch of other things... what else do we need, and how should we move forward?

--titus

### Filipe Saraiva

#### Cantor in KDE Applications 15.04

KDE Applications 15.04 release brings a new version of the scientific programming software Cantor, with a lot of news. I am specially happy with this release because I worked in several parts of these new features. =)

Come with me™ and let’s see what is new in Cantor.

## Cantor ported to Qt5/KF5

Cantor Qt5/KF5 + Breeze theme. In the image it is possible to see the terminal/worksheet, variable management panel, syntax highlighting, code completion, and the standard interface

I started the Cantor port to Qt5/KF5 during previous LaKademy and I continued the development along the year. Maybe I had pushed code from 5 different countries since the beginning of this work.

The change for this new technology was successfully completed, and for the moment we don’t notice any feature missed or new critical bug. All the backends and plugins were ported, and some new bugs created during this work were fixed.

We would like to ask for Cantor users to report any problem or bug in bugzilla. Anyway, the software is really very stable.

When you run Cantor Qt5/KF5 version on the first time, the software will look for Cantor Qt4 configurations and, if it exists, the configurations will be automagically migrated to Cantor Qt5/KF5.

## Backend for Python 3

In Season of KDE 2014 I was the mentor of Minh Ngo in the project to create a backend for Python 3, increasing the number of backends in Cantor to 10!

Backend selection screen: Python 3 and their 9 brothers

The backend developed by Minh uses D-Bus protocol to allow communication between Cantor and Python 3. This architecture is different of Python 2, but it is present in others backends, as in the backend for R.

The cool thing is Cantor can be interesting for pythonistas using Python 2 and/or Python 3 now. We would like to get feedback from you, guys!

## Icon!

Cantor first release was originally in 2009, with KDE SC 4.4. Since that date the software did not have an icon.

The Cantor Qt5/KF5 release marks a substantial change in the development of the application, then it is also a good time to release an icon to the software.

Cantor icon

The art is excellent! It presents the idea of Cantor: a blackboard to you write and develop your equations and formulas while scratches his head and think “and now, what I need to do to solve it?”. =)

Thank you Andreas Kainz and Uri Herrera, members of VDG team and authors of Cantor icon!

## Other changes and bug fixes

Most bugs added in the Qt5/KF5 port were fixed before the release.

There are some small changes to be cited: in KNewStuff categories world, “Python2″ category was changed to “Python 2″ and “Python 3″ category was added; the automatic loading of pylab module in Python backends was dropped; now it is possible to run Python commands mixed with comments in the worksheet; and more.

You can see a complete log of commits, bugfixes, and new features added in this release in this page.

## Future works

As future work maybe the high-priority for this moment is to drop KDELibs4Support from Cantor. Lucas developed part of this work and we would like to finish it for the next release.

I intend to test if D-Bus communication can be a good solution for Scilab backend. Another task is to redesign the graphical generation assistants of Python backends. A long-term work is to follow the creation of Jupyter project, the future of IPython notebooks. If Cantor can to be compatible with Jupyter, it will be really nice for users and to encourage the collaboration between different communities interested in scientific programming and open science.

I will take advantage of the Cantor Qt5/KF5 release to write about how to use Cantor in two different ways: the Matlab way and the IPython notebooks way. Keep your eyes in the updates from this blog! =)

If you would like to help in Cantor development, please contact me or mail kde-edu maillist and let’s talk about bug fixes, development of new features, and more.

## Donations to KDE Brasil – LaKademy 2015!

If you would like to support my work, please make a donation to KDE Brasil. We will host the KDE Latin-American Summit – LaKademy and we need some money to put some latin-american contributors to work together face-to-face. I will focus my LaKademy work in the previously mentioned future works.

You can read more about LaKademy in this dot.KDE history. This page in English explain how to donate. There is other page with the same content in Spanish.

## April 23, 2015

### Titus Brown

#### More on scientific software

So I wrote this thing that got an awful lot of comments, many telling me that I'm just plain wrong. I think it's impossible to respond comprehensively :). But here are some responses.

## What is, what could be, and what should be

In that blog post, I argued that software shouldn't be considered a primary output of scientific research. But I completely failed to articulate a distinction between what we do today with respect to scientific software, what we could be doing in the not-so-distant future, and what we should be doing. Worse, I mixed them all up!

Peer reviewed publications and grants are the current coin of the realm. When we submit papers and grants for peer review, we have to deal with what those reviewers think right now. In bioinformatics, this largely means papers get evaluated on their perceived novelty and impact (even in impact-blind journals). Software papers are generally evaluated poorly on these metrics, so it's hard to publish bioinformatics software papers in visible places, and it's hard to argue in grants to the NIH (and most of the biology-focused NSF) that pure software development efforts are worthwhile. This is what is, and it makes it hard for methods+software research to get publications and funding.

Assuming that you agree that methods+software research is important in bioinformatics, what could we be doing in the near distant future to boost the visibility of methods+software? Giving DOIs to software is one way to accrue credit to software that is highly used, but citations take a long time to pile up, reviewers won't know what to expect in terms of numbers (50 citations? is that a lot?), and my guess is that they will be poorly valued in important situations like funding and advancement. It's an honorable attempt to hack the system and software DOIs are great for other purposes, but I'm not optimistic about their near- or middle-term impact.

We could also start to articulate values and perspectives to guide reviewers and granting systems. And this is what I'd like to do. But first, let me rant a bit.

I think people underestimate the hidden mass in the scientific iceberg. Huge amounts of money are spent on research, and I would bet that there are at least twenty thousand PI-level researchers around the world in biology. In biology-related fields, any of these people may be called upon to review your grant or your paper, and their opinions will largely be, well, their own. To get published, funded, or promoted, you need to convince some committee containing these smart and opinionated researchers that what you're doing is both novel and impactful. To do that, you have to appeal largely to values and beliefs that they already hold.

Moreover, this set of researchers - largely made of people who have reached tenured professor status - sits on editorial boards, funding agency panels, and tenure and promotion committees. None of these boards and funding panels exist in a vacuum, and while to some extent program managers can push in certain directions, they are ultimately beholden to the priorities of the funding agency, which are (in the best case) channeled from senior scientists.

If you wonder why open access took so damn long to happen, this is one reason - the cultural "mass" of researchers that needs to shift their opinions is huge and unwieldy and resistant to change. And they are largely invisible, and subject to only limited persuasion.

One of the most valuable efforts we can make is to explore what we should be doing, and place it on a logical and sensical footing, and put it out there. For example, check out the CRA's memo on best practices in Promotion and Tenure of Interdisciplinary Faculty - great and thoughtful stuff, IMO. We need a bunch of well thought out opinions in this vein. What guidelines do we want to put in place for evaluating methods+software? How should we evaluate methods+software researchers for impact? When we fund software projects, what should we be looking for?

And that brings me to what we should be doing, which is ultimately what I am most interested in. For example, I must admit to deep confusion about what a maturity model for bioinformatics software should look like; this feeds into funding requests, which ultimately feeds into promotion and tenure. I don't know how to guide junior faculty in this area either; I have lots of opinions, but they're not well tested in the marketplace of ideas.

I and others are starting to have the opportunity to make the case for what we should be doing in review panels; what case should we make?

It is in this vein, then, that I am trying to figure out what value to place on software itself, and I'm interested in how to promote methods+software researchers and research. Neil Saunders had an interesting comment that I want to highlight here: he said,

My own feeling is that phrases like "significant intellectual contribution" are just unhelpful academic words,

I certainly agree that this is an imprecise concept, but I can guarantee that in the US, this is one of the three main questions for researchers at hiring, promotion, and tenure. (Funding opportunities and fit are my guesses for the other two.) So I would push on this point: researchers need to appear to have a clear intellectual contribution at every stage of the way, whatever that means. What it means is what I'm trying to explore.

## Software is a tremendously important and critical part of the research endeavor

...but it's not enough. That's my story, and I'm sticking to it :).

I feel like the conversation got a little bit sidetracked by discussions of Nobel Prizes (mea partly culpa), and I want to discuss PhD theses instead. To get a PhD, you need to do some research; if you're a bioinformatics or biology grad student who is focused on methods+software, how much of that research can be software, and what else needs to be there?

And here again I get to dip into my own personal history.

I spent 9 years in graduate school. About 6 years into my PhD, I had a conversation with my advisor that went something like this:

Me, age ~27 - "Hey, Eric, I've got ~two first-author papers, and another one or two coming, along with a bunch of papers. How about I defend my PhD on the basis of that work, and stick around to finish my experimental work as a postdoc?"

Eric - blank look "All your papers are on computational methods. None of them count for your PhD."

Me - "Uhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhmmmmmmmmmm..."

(I did eventually graduate, but only after three more years of experiments.)

In biology, we have to be able to defend our computational contributions in the face of an only slowly changing professoriate. And I'm OK with that, but I think we should make it clear up front.

Since then, I've graduated three (soon to be five, I hope!) graduate students, one in biology and two in CS. In every single case, they've done a lot of hacking. And in every single case they've been asked to defend their intellectual contribution. This isn't just people targeting my students - I've sat on committees where students have produced masses of experimental data, and if they weren't prepared to defend their experimental design, their data interpretation, and the impact and significance of their data interpretation, they weren't read to defend. This is a standard part of the PhD process at Caltech, at MSU, and presumably at UC Davis.

So: to successfully receive a PhD, you should have to clearly articulate the problem you're tackling, its place in the scientific literature, the methods and experiments you're going to use, the data you got, the interpretation you place on that data, and the impact of their results on current scientific thinking. It's a pretty high bar, and one that I'm ok with.

One of the several failure modes I see for graduate students is the one where graduate students spend a huge amount of time developing software and more or less assume that this work will lead to a PhD. Why would they be thinking that?

• Their advisor may not be particularly computational and may be giving poor guidance (which includes poorly explained criteria).
• Their advisor may be using them (intentionally or unintentionally) - effective programmers are hard to find.
• The grad student may be resistant to guidance.

I ticked all of these as a graduate student, but I had the advantage of being a 3rd-generation academic, so I knew the score. (And I still ran into problems.) In my previous blog post, I angered and upset some people by my blunt words (I honestly didn't think "grad student hacker fallacy" was so rude ;( but it's a real problem that I confront regularly.

Computational PhD students need to do what every scientific PhD student needs to do: clearly articulate their problem, place it in the scientific literature, define the computational methods and experiments they're going to do/have done, explain the data and their interpretation of it, and explore how it impacts science. Most of this involves things other than programming and running software! It's impossible to put down percent effort estimates that apply broadly, but my guess is that PhD students should spend at least a year understanding your results and interpreting and explaining their work.

Conveniently, however, once you've done that for your PhD, you're ready to go in the academic world! These same criteria (expanded in scope) apply to getting a postdoc, publishing as a postdoc, getting a faculty position, applying for grants, and getting tenure. Moreover, I believe many of the same criteria apply broadly to research outside of academia (which is one reason I'm still strongly +1 on getting a PhD, no matter your ultimate goals).

(Kyle Cranmer's comment on grad student efforts here was perfect.)

## Software as...

As far as software being a primary product of research -- Konrad Hinsen nails it. It's not, but neither are papers, and I'm good with both statements :). Read his blog post for the full argument. The important bit is that very little stands on its own; there always needs to be communication effort around software, data, and methods.

Ultimately, I learned a lot by admitting confusion! Dan Katz and Konrad Hinsen pointed out that software is communication, and Kai Blin drew a great analogy between software and experimental design. These are perspectives that I hadn't seen said so clearly before and they've really made me think differently; both are interesting and provocative analogies and I'm hoping that we can develop them further as a community.

## How do we change things?

Kyle Cranmer and Rory Kirchner had a great comment chain on broken value systems and changing the system. I love the discussion, but I'm struggling with how to respond. My tentative and mildly unhappy conclusion is that I may have bought into the intellectual elitism of academia a bit too much (see: third generation academic), but this may also be how I've gotten where I am, so... mixed bag? (Rory made me feel old and dull, too, which is pretty cool in a masochistic kind of way.)

One observation is that, in software, novelty is cheap. It's very, very easy to tweak something minorly, and fairly easy to publish it without generating any new understanding. How do we distinguish a future Heng Li or an Aaron Quinlan (who have enabled new science by cleanly solving "whole classes of common problems that you don't even have to think about anymore") from humdrum increment, and reward them properly in the earlier stages of their career? I don't know, but the answer has to be tied to advancing science, which is hard to measure on any short timescale. (Sean Eddy's blog post has the clearest view on solutions that I've yet seen.)

Another observation (nicely articulated by Daisie Huang) is that (like open data) this is another game theoretic situation, where the authors of widely used software sink their time and energy into the community but don't necessarily gain wide recognition for their efforts. There's a fat middle ground of software that's reasonably well used but isn't samtools, and this ecosystem needs to be supported. This is much harder to argue - it's a larger body of software, it's less visible, and it's frankly much more expensive to support. (Carl Boettiger's comment is worth reading here.) The funding support isn't there, although that might change in the next decade. (This is the proximal challenge for me, since I place my own software, khmer, in this "fat middle ground"; how do I make a clear argument for funding?)

Kyle Cranmer and others pointed to some success in "major instrumentation" and methods-based funding and career paths in physics (help, can't find link/tweets!). This is great, but I think it's also worth discussing the overall scale of things. Physics has a few really big and expensive instruments, with a few big questions, and with thousands of engineers devoted to them. Just in sequencing, biology has thousands (soon millions) of rather cheap instruments, devoted to many thousands of questions. If my prediction that software will "eat" this part of the world becomes true, we will need tens of thousands of data intensive biologists at a minimum, most working to some large extent on data analysis and software. I think the scale of the need here is simply much, much larger than in physics.

I am supremely skeptical of the idea that universities as we currently conceive of them are the right home for stable, mature software development. We either need to change universities in the right way (super hard) or find other institutions (maybe easier). Here, the model to watch may well be the Center for Open Science, which produces the Open Science Framework (among others). My interpretation is that they are trying to merge scientific needs with the open source development model. (Tellingly, they are doing so largely with foundation funding; the federal funding agencies don't have good mechanisms for funding this kind of thing in biology, at least.) This may be the right model (or at least on the path towards one) for sustained software development in the biological sciences: have an institution focused on sustainability and quality, with a small diversity of missions, that can afford to spend the money to keep a number of good software engineers focused on those missions.

Thanks, all, for the comments and discussions!

--titus

## April 22, 2015

### Gaël Varoquaux

#### MLOSS: machine learning open source software workshop @ ICML 2015

Note

This year again we will have an exciting workshop on the leading-edge machine-learning open-source software. This subject is central to many, because software is how we propagate, reuse, and apply progress in machine learning.

Want to present a project? The deadline for the call for papers is Apr 28th, in a few days : http://mloss.org/workshop/icml15/

The workshop will be help at the ICML conference, in Lille France, on July 10th. ICML –International Conference in Machine Learning– is the leading venue for academic research in machine learning. It’s a fantastic place to hold such a workshop, as the actors of theoretical progress are all around. Software is the bridge that brings this progress beyond papers.

There is a long tradition of MLOSS workshop, with one every year and a half. Last time, at NIPS 2013, I could feel a bit of a turning point, as people started feeling that different software slotted together, to create an efficient and state-of-the art working environment. For this reason, we have entitled this year’s workshop ‘open ecosystems’, stressing that contributions in the scope of the workshop, that build a thriving work environment, are not only machine learning software, but also better statistics or numerical tools.

We have two keynotes with important contributions to such ecosystems:

• John Myles White (Facebook), lead developer of Julia statistics and machine learning: “Julia for machine learning: high-level syntax with compiled-code speed”
• Matthew Rocklin (Continuum Analytics), developer of Python computational tools, in particular Blaze (confirmed): “Blaze, a modern numerical engine with out-of-core and out-of-order computations”.

There will be also a practical presentation on how to set up an open-source project, discussing hosting, community development, quality assurance, license choice, by yours truly.

## April 21, 2015

### Titus Brown

#### Is software a primary product of science?

Update - I've written Yet Another blog post, More on scientific software on this topic. I think this blog post is a mess so you should read that one first ;).

This blog post was spurred by a simple question from Pauline Barmby on Twitter. My response didn't, ahem, quite fit in 144 characters :).

First, a little story. (To paraphrase Greg Wilson, "I tell a lot of stories. Some of them aren't true. But this one is!")

When we were done writing Best Practices for Scientific Computing, we tried submitting it to a different high-profile journal than the one that ultimately accepted it (PLoS Biology, where it went on to become the most highly read article of 2014 in PLoS Biology). The response from the editor went something like this: "We recognize the importance of good engineering, but we regard writing software as equivalent to building a telescope - it's important to do it right, but we don't regard a process paper on how to build telescopes better as an intellectual contribution." (Disclaimer: I can't find the actual response, so this is a paraphrase, but it was definitely a "no" and for about that reason.)

## Is scientific software like instrumentation?

When I think about scientific software as a part of science, I inevitably start with its similarities to building scientific instruments. New instrumentation and methods are absolutely essential to scientific progress, and it is clear that good engineering and methods development skills are incredibly helpful in research.

So, why did the editors at High Profile Journal bounce our paper? I infer that they drew exactly this parallel and thought no further.

But scientific software is only somewhat like new methods or instrumentation.

First, software can spread much faster and be used much more like a black box than most methods, and instrumentation inevitably involves either construction or companies that act as middlemen. With software, it's like you're shipping kits or plans for 3-D printing - something that is as close to immediately usable as it comes. If you're going to hand someone an immediately usable black box (and pitch it as such), I would argue that you should take a bit more care in building said black box.

Second, complexity in software scales much faster than in hardware (citation needed). This is partly due to human nature & a failure to think long-term, and partly due to the nature of software - software can quickly have many more moving parts than hardware, and at much less (short term) cost. Frankly, most software stacks resemble massive Rube Goldberg machines (read that link!) This means that different processes are needed here.

Third, at least in my field (biology), we are undergoing a transition to data intensive research, and software methods are becoming ever more important. There's no question that software is going to eat biology just like it's eating the rest of the world, and an increasingly large part of our primary scientific output in biology is going to hinge directly on computation (think: annotations. 'nuff said).

If we're going to build massively complex black boxes that under-pin all of our science, surely that means that the process is worth studying intellectually?

## Is scientific software a primary intellectual output of science?

No.

I think concluding that it is is an example of the logical fallacy "affirming the consequent" - or, "confusion of necessity and sufficiency". I'm not a logician, but I would phrase it like this (better phrasing welcome!) --

Good software is necessary for good science. Good science is an intellectual contribution. Therefore good software is an intellectual contribution.

Hopefully when phrased that way it's clear that it's nonsense.

I'm naming this "the fallacy of grad student hackers", because I feel like it's a common failure mode of grad students that are good at programming. I actually think it's a tremendously dangerous idea that is confounding a lot of the discussion around software contributions in science.

To illustrate this, I'll draw the analog to experimental labs: you may have people who are tremendously good at doing certain kinds of experiments (e.g. expert cloners, or PCR wizards, or micro-injection aficionados, or WMISH bravados) and with whom you can collaborate to rapidly advance your research. They can do things that you can't, and they can do them quickly and well! But these people often face dead ends in academia and end up as eterna-postdocs, because (for better or for worse) what is valued for first authorship and career progression is intellectual contribution, and doing experiments well is not sufficient to demonstrate an intellectual contribution. Very few people get career advancement in science by simply being very good at a technique, and I believe that this is OK.

Back to software - writing software may become necessary for much of science but I don't think it should ever be sufficient as a primary contribution. Worse, it can become (often becomes?) an engine of procrastination. Admittedly, that procrastination leads to things like IPython Notebook, so I don't want to ding it, but neither are all (or even most ;) grad students like Fernando Perez, either.

## Let's admit it, I'm just confused

This leaves us with a conundrum.

Software is clearly a force multiplier - "better software, better research!.

However, I don't think it can be considered a primary output of science. Dan Katz said, "Nobel prizes have been given for inventing instruments. I'm eagerly awaiting for one for inventing software [sic]" -- but I think he's wrong. Nobels have been given because of the insight enabled by inventing instruments, not for inventing instruments. (Corrections welcome!) So while I, too, eagerly await the explicit recognition that software can push scientific insight forward in biology, I am not holding my breath - I think it's going to look much more like the 2013 Chemistry Nobel, which is about general computational methodology. (My money here would be on a Nobel in Medicine for genome assembly methods, which should follow on separately from massively parallel sequencing methods and shotgun sequencing - maybe Venter, Church, and Myers/Pevzner deserve three different Nobels?)

Despite that, we do need to incentivize it, especially in biology but also more generally. Sean Eddy wrote AN AWESOME BLOG POST ON THIS TOPIC in 2010 (all caps because IT'S AWESOME AND WHY HAVEN'T WE MOVED FURTHER ON THIS <sob>). This is where DOIs for software usually come into play - hey, maybe we can make an analogy between software and papers! But I worry that this is a flawed analogy (for reasons outlined above) and will simply support the wrong idea that doing good hacking is sufficient for good science.

We also have a new problem - the so-called Big Data Brain Drain, in which it turns out that the skills that are needed for advancing science are also tremendously valuable in much more highly paid jobs -- much like physics number crunchers moving to finance, research professors in biology face a future where all our grad students go on to make more than us in tech. (Admittedly, this is only a problem if we think that more people clicking on ads is more important than basic research.) Jake Vanderplas (the author of the Big Data Brain Drain post) addressed potential solutions to this in Hacking Academia, about which I have mixed feelings. While I love both Jake and his blog post (platonically), there's a bit too much magical thinking in that post -- I don't see (m)any of those solutions getting much traction in academia.

The bottom line for me is that we need to figure it out, but I'm a bit stuck on practical suggestions. Natural selection may apply -- whoever figures this out in biology (basic research institutions and/or funding bodies) will have quite an edge in advancing biomedicine -- but natural selection works across multiple generations, and I could wish for something a bit faster. But I don't know. Maybe I'll bring it up at SciFoo this year - "Q: how can we kill off the old academic system faster?" :)

I'll leave you with two little stories.

## The problem, illustrated

In 2009, we started working on what would ultimately become Pell et al., 2012. We developed a metric shit-ton of software (that's a scientific measure, folks) that included some pretty awesomely scalable sparse graph labeling approaches. The software worked OK for our problem, but was pretty brittle; I'm not sure whether or not our implementation of this partitioning approach is being used by anyone else, nor am I sure if it should be :).

However, the paper has been a pretty big hit by traditional scientific metrics! We got it into PNAS by talking about the data structure properties and linking physics, computer science, and biology together. It helped lead directly to Chikhi and Rizk (2013), and it has been cited a whole bunch of times for (I think) its theoretical contributions. Yay!

Nonetheless, the incredibly important and tricky details of scalably partitioning 10 bn node graphs were lost from that paper, and the software was not a big player, either. Meanwhile, Dr. Pell left academia and moved on to a big software company where (on his first day) he was earning quite a bit more than me (good on him! I'd like a 5% tithe, though, in the future :) :). Trust me when I say that this is a net loss to academia.

Summary: good theory, useful ideas, lousy software. Traditional success. Lousy outcomes.

## A contrapositive

In 2011, we figured out that linear compression ratios for sequence data simply weren't going to cut it in the face of the continued rate of data generation, and we developed digital normalization, a deceptively simple idea that hasn't really been picked up by the theoreticians. Unlike the Pell work above, it's not theoretically well studied at all. Nonetheless, the preprint has a few dozen citations (because it's so darn useful) and the work is proving to be a good foundation for further research for our lab. Perhaps the truest measure of its memetic success is that it's been reimplemented by at least three different sequencing centers.

The software is highly used, I think, and many of our efforts on the khmer software have been aimed at making diginorm and downstream concepts more robust.

Summary: lousy theory, useful ideas, good software. Nontraditional success. Awesome outcomes.

## Ways forward?

I simply don't know how to chart a course forward. My current instinct (see below) is to shift our current focus much more to theory and ideas and further away from software, largely because I simply don't see how to publish or fund "boring" things like software development. (Josh Bloom has an excellent blog post that relates to this particular issue: Novelty Squared)

I've been obsessing over these topics of software and scientific focus recently (see The three porridge bowls of scientific software development and Please destroy this software after publication. kthxbye) because I'm starting to write a renewal for khmer's funding. My preliminary specific aims look something like this:

Aim 1: Expand low memory and streaming approaches for biological sequence analysis.

Aim 2: Develop graph-based approaches for analyzing genomic variation.

Aim 3: Optimize and extend a general purpose graph analysis library

Importantly, everything to do with software maintenance, support, and optimization is in Aim 3 and is in fact only a part of that aim. I'm not actually saddened by that, because I believe that software is only interesting because of the new science it enables. So I need to sell that to the NIH, and there software quality is (at best) a secondary consideration.

On the flip side, by my estimate 75% of our khmer funding is going to software maintenance, most significantly in paying down our technical debt. (In the grant I am proposing to decrease this to ~50%.)

I'm having trouble justifying this dichotomy mentally myself, and I can only imagine what the reviewers might think (although hopefully they will only glance at the budget ;).

So this highlights one conundrum: given my estimates and my priorities, how would you suggest I square these stated priorities with my funding allocations? And, in these matters, have I been wrong to focus on software quality, or should I have focused instead on accruing technical debt in the service of novel ideas and functionality? Inquiring minds want to know.

--titus

### Matthew Rocklin

#### Profiling Data Throughput

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

Disclaimer: This post is on experimental/buggy code.

tl;dr We measure the costs of processing semi-structured data like JSON blobs.

## Semi-structured Data

Semi-structured data is ubiquitous and computationally painful. Consider the following JSON blobs:

{'name': 'Alice',   'payments': [1, 2, 3]}
{'name': 'Bob',     'payments': [4, 5]}
{'name': 'Charlie', 'payments': None}


This data doesn’t fit nicely into NumPy or Pandas and so we fall back to dynamic pure-Python data structures like dicts and lists. Python’s core data structures are surprisingly good, about as good as compiled languages like Java, but dynamic data structures present some challenges for efficient parallel computation.

## Volume

Semi-structured data is often at the beginning of our data pipeline and so often has the greatest size. We may start with 100GB of raw data, reduce to 10GB to load into a database, and finally aggregate down to 1GB for analysis, machine learning, etc., 1kB of which becomes a plot or table.

Data Bandwidth (MB/s) In Parallel (MB/s)
Disk I/O500500
Decompression100500
Deserialization50250
In-memory computation2000oo
Shuffle930

Common solutions for large semi-structured data include Python iterators, multiprocessing, Hadoop, and Spark as well as proper databases like MongoDB and ElasticSearch. Two months ago we built dask.bag, a toy dask experiment for semi-structured data. Today we’ll strengthen the dask.bag project and look more deeply at performance in this space.

We measure performance with data bandwidth, usually in megabytes per second (MB/s). We’ll build intuition for why dealing with this data is costly.

## Dataset

As a test dataset we play with a dump of GitHub data from https://www.githubarchive.org/. This data records every public github event (commit, comment, pull request, etc.) in the form of a JSON blob. This data is representative fairly representative of a broader class of problems. Often people want to do fairly simple analytics, like find the top ten committers to a particular repository, or clean the data before they load it into a database.

We’ll play around with this data using dask.bag. This is both to get a feel for what is expensive and to provide a cohesive set of examples. In truth we won’t do any real analytics on the github dataset, we’ll find that the expensive parts come well before analytic computation.

Items in our data look like this:

>>> import json
>>> path = '/home/mrocklin/data/github/2013-05-0*.json.gz'
({u'actor': u'mjcramer',
u'actor_attributes': {u'gravatar_id': u'603762b7a39807503a2ee7fe4966acd1',
u'type': u'User'},
u'created_at': u'2013-05-01T00:01:28-07:00',
u'master_branch': u'master',
u'ref': None,
u'ref_type': u'repository'},
u'public': True,
u'repository': {u'created_at': u'2013-05-01T00:01:28-07:00',
u'description': u'',
u'fork': False,
u'forks': 0,
u'has_issues': True,
u'has_wiki': True,
u'id': 9787210,
u'master_branch': u'master',
u'name': u'settings',
u'open_issues': 0,
u'owner': u'mjcramer',
u'private': False,
u'pushed_at': u'2013-05-01T00:01:28-07:00',
u'size': 0,
u'stargazers': 0,
u'url': u'https://github.com/mjcramer/settings',
u'watchers': 0},
u'type': u'CreateEvent',
u'url': u'https://github.com/mjcramer/settings'},)

## Disk I/O and Decompression – 100-500 MB/s

Data Bandwidth (MB/s)
Parallel Read from disk with gzip.open500

A modern laptop hard drive can theoretically read data from disk to memory at 800 MB/s. So we could burn through a 10GB dataset in fifteen seconds on our laptop. Workstations with RAID arrays can do a couple GB/s. In practice I get around 500 MB/s on my personal laptop.

In [1]: import json
In [2]: import dask.bag as db
In [3]: from glob import glob
In [4]: path = '/home/mrocklin/data/github/2013-05-0*.json.gz'

In [5]: %time compressed = '\n'.join(open(fn).read() for fn in glob(path))
CPU times: user 75.1 ms, sys: 1.07 s, total: 1.14 s
Wall time: 1.14 s

In [6]: len(compressed) / 0.194 / 1e6  # MB/s
508.5912175438597

To reduce storage and transfer costs we often compress data. This requires CPU effort whenever we want to operate on the stored values. This can limit data bandwidth.

In [7]: import gzip
In [8]: %time total = '\n'.join(gzip.open(fn).read() for fn in glob(path))
CPU times: user 12.2 s, sys: 18.7 s, total: 30.9 s
Wall time: 30.9 s

In [9]: len(total) / 30.9 / 1e6         # MB/s  total bandwidth
Out[9]: 102.16563844660195

In [10]: len(compressed) / 30.9 / 1e6   # MB/s  compressed bandwidth
Out[10]: 18.763559482200648

So we lose some data bandwidth through compression. Where we could previously process 500 MB/s we’re now down to only 100 MB/s. If we count bytes in terms of the amount stored on disk then we’re only hitting 18 MB/s. We’ll get around this with multiprocessing.

## Decompression and Parallel processing – 500 MB/s

Fortunately we often have more cores than we know what to do with. Parallelizing reads can hide much of the decompression cost.

In [12]: import dask.bag as db

In [13]: %time nbytes = db.from_filenames(path).map(len).sum().compute()
CPU times: user 130 ms, sys: 402 ms, total: 532 ms
Wall time: 5.5 s

In [14]: nbytes / 5.5 / 1e6
Out[14]: 573.9850932727272

Dask.bag infers that we need to use gzip from the filename. Dask.bag currently uses multiprocessing to distribute work, allowing us to reclaim our 500 MB/s throughput on compressed data. We also could have done this with multiprocessing, straight Python, and a little elbow-grease.

## Deserialization – 30 MB/s

Data Bandwidth (MB/s)

Once we decompress our data we still need to turn bytes into meaningful data structures (dicts, lists, etc..) Our GitHub data comes to us as JSON. This JSON contains various encodings and bad characters so, just for today, we’re going to punt on bad lines. Converting JSON text to Python objects explodes out in memory a bit, so we’ll consider a smaller subset for this part, a single day.

In [20]: def loads(line):
...          except: return None

In [21]: path = '/home/mrocklin/data/github/2013-05-01-*.json.gz'
In [22]: lines = list(db.from_filenames(path))

In [23]: %time blobs = list(map(loads, lines))
CPU times: user 10.7 s, sys: 760 ms, total: 11.5 s
Wall time: 11.3 s

In [24]: len(total) / 11.3 / 1e6
Out[24]: 33.9486321238938

In [25]: len(compressed) / 11.3 / 1e6
Out[25]: 6.2989179646017694

So in terms of actual bytes of JSON we can only convert about 30MB per second. If we count in terms of the compressed data we store on disk then this looks more bleak at only 6 MB/s.

### This can be improved by using faster libraries – 50 MB/s

The ultrajson library, ujson, is pretty slick and can improve our performance a bit. This is what Pandas uses under the hood.

In [28]: import ujson
...          except: return None

In [30]: %time blobs = list(map(loads, lines))
CPU times: user 6.37 s, sys: 1.17 s, total: 7.53 s
Wall time: 7.37 s

In [31]: len(total) / 7.37 / 1e6
Out[31]: 52.05149837177748

In [32]: len(compressed) / 7.37 / 1e6
Out[32]: 9.657771099050203

### Or through Parallelism – 150 MB/s

This can also be accelerated through parallelism, just like decompression. It’s a bit cumbersome to show parallel deserializaiton in isolation. Instead we’ll show all of them together. This will under-estimate performance but is much easier to code up.

In [33]: %time db.from_filenames(path).map(loads).count().compute()
CPU times: user 32.3 ms, sys: 822 ms, total: 854 ms
Wall time: 2.8 s

In [38]: len(total) / 2.8 / 1e6
Out[38]: 137.00697964285717

In [39]: len(compressed) / 2.8 / 1e6
Out[39]: 25.420633214285715

## Mapping and Grouping - 2000 MB/s

Data Bandwidth (MB/s)
Simple Python operations1400
Complex CyToolz operations2600

Once we have data in memory, Pure Python is relatively fast. Cytoolz moreso.

In [55]: %time set(d['type'] for d in blobs)
CPU times: user 162 ms, sys: 123 ms, total: 285 ms
Wall time: 268 ms
Out[55]:
{u'CommitCommentEvent',
u'CreateEvent',
u'DeleteEvent',
u'FollowEvent',
u'ForkEvent',
u'GistEvent',
u'GollumEvent',
u'IssueCommentEvent',
u'IssuesEvent',
u'MemberEvent',
u'PublicEvent',
u'PullRequestEvent',
u'PullRequestReviewCommentEvent',
u'PushEvent',
u'WatchEvent'}

In [56]: len(total) / 0.268 / 1e6
Out[56]: 1431.4162052238805

In [57]: import cytoolz
In [58]: %time _ = cytoolz.groupby('type', blobs)  # CyToolz FTW
CPU times: user 144 ms, sys: 0 ns, total: 144 ms
Wall time: 144 ms

In [59]: len(total) / 0.144 / 1e6
Out[59]: 2664.024604166667

So slicing and logic are essentially free. The cost of compression and deserialization dominates actual computation time. Don’t bother optimizing fast per-record code, especially if CyToolz has already done so for you. Of course, you might be doing something expensive per record. If so then most of this post isn’t relevant for you.

## Shuffling - 5-50 MB/s

Data Bandwidth (MB/s)
Naive groupby with on-disk Shuffle25
Clever foldby without Shuffle250

For complex logic, like full groupbys and joins, we need to communicate large amounts of data between workers. This communication forces us to go through another full serialization/write/deserialization/read cycle. This hurts. And so, the single most important message from this post:

That being said, people will inevitably ignore this advice so we need to have a not-terrible fallback.

In [62]: %time dict(db.from_filenames(path)
...                   .groupby('type')
...                   .map(lambda (k, v): (k, len(v))))
CPU times: user 46.3 s, sys: 6.57 s, total: 52.8 s
Wall time: 2min 14s
Out[62]:
{'CommitCommentEvent': 17889,
'CreateEvent': 210516,
'DeleteEvent': 14534,
'FollowEvent': 35910,
'ForkEvent': 67939,
'GistEvent': 7344,
'GollumEvent': 31688,
'IssueCommentEvent': 163798,
'IssuesEvent': 102680,
'MemberEvent': 11664,
'PublicEvent': 1867,
'PullRequestEvent': 69080,
'PullRequestReviewCommentEvent': 17056,
'PushEvent': 960137,
'WatchEvent': 173631}

In [63]: len(total) / 134 / 1e6  # MB/s
Out[63]: 23.559091

This groupby operation goes through the following steps:

2. Decompress GZip
3. Deserialize with ujson
4. Do in-memory groupbys on chunks of the data
5. Reserialize with msgpack (a bit faster)
6. Append group parts to disk
7. Read in new full groups from disk
8. Deserialize msgpack back to Python objects
9. Apply length function per group

Some of these steps have great data bandwidths, some less-so. When we compound many steps together our bandwidth suffers. We get about 25 MB/s total. This is about what pyspark gets (although today pyspark can parallelize across multiple machines while dask.bag can not.)

Disclaimer, the numbers above are for dask.bag and could very easily be due to implementation flaws, rather than due to inherent challenges.

>>> import pyspark
>>> sc = pyspark.SparkContext('local[8]')
>>> rdd = sc.textFile(path)
...         .keyBy(lambda d: d['type'])
...         .groupByKey()
...         .map(lambda (k, v): (k, len(v)))
...         .collect())

I would be interested in hearing from people who use full groupby on BigData. I’m quite curious to hear how this is used in practice and how it performs.

## Creative Groupbys - 250 MB/s

Don’t use groupby. You can often work around it with cleverness. Our example above can be handled with streaming grouping reductions (see toolz docs.) This requires more thinking from the programmer but avoids the costly shuffle process.

In [66]: %time dict(db.from_filenames(path)
...                   .foldby('type', lambda total, d: total + 1, 0, lambda a, b: a + b))
Out[66]:
{'CommitCommentEvent': 17889,
'CreateEvent': 210516,
'DeleteEvent': 14534,
'FollowEvent': 35910,
'ForkEvent': 67939,
'GistEvent': 7344,
'GollumEvent': 31688,
'IssueCommentEvent': 163798,
'IssuesEvent': 102680,
'MemberEvent': 11664,
'PublicEvent': 1867,
'PullRequestEvent': 69080,
'PullRequestReviewCommentEvent': 17056,
'PushEvent': 960137,
'WatchEvent': 173631}
CPU times: user 322 ms, sys: 604 ms, total: 926 ms
Wall time: 13.2 s

In [67]: len(total) / 13.2 / 1e6  # MB/s
Out[67]: 239.16047181818183

We can also spell this with PySpark which performs about the same.

>>> dict(rdd.map(loads)  # PySpark equivalent
...         .keyBy(lambda d: d['type'])
...         .combineByKey(lambda d: 1, lambda total, d: total + 1, lambda a, b: a + b)
...         .collect())

## Use a Database

By the time you’re grouping or joining datasets you probably have structured data that could fit into a dataframe or database. You should transition from dynamic data structures (dicts/lists) to dataframes or databases as early as possible. DataFrames and databases compactly represent data in formats that don’t require serialization; this improves performance. Databases are also very clever about reducing communication.

Tools like pyspark, toolz, and dask.bag are great for initial cleanings of semi-structured data into a structured format but they’re relatively inefficient at complex analytics. For inconveniently large data you should consider a database as soon as possible. That could be some big-data-solution or often just Postgres.

## Better data structures for semi-structured data?

Dynamic data structures (dicts, lists) are overkill for semi-structured data. We don’t need or use their full power but we inherit all of their limitations (e.g. serialization costs.) Could we build something NumPy/Pandas-like that could handle the blob-of-JSON use-case? Probably.

DyND is one such project. DyND is a C++ project with Python bindings written by Mark Wiebe and Irwin Zaid and historically funded largely by Continuum and XData under the same banner as Blaze/Dask. It could probably handle the semi-structured data problem case if given a bit of love. It handles variable length arrays, text data, and missing values all with numpy-like semantics:

>>> from dynd import nd
>>> data = [{'name': 'Alice',                       # Semi-structured data
...          'location': {'city': 'LA', 'state': 'CA'},
...          'credits': [1, 2, 3]},
...         {'name': 'Bob',
...          'credits': [4, 5],
...          'location': {'city': 'NYC', 'state': 'NY'}}]

>>> dtype = '''var * {name: string,
...                   location: {city: string,
...                              state: string[2]},
...                   credits: var * int}'''        # Shape of our data

>>> x = nd.array(data, type=dtype)                  # Create DyND array

>>> x                                               # Store compactly in memory
nd.array([["Alice", ["LA", "CA"], [1, 2, 3]],
["Bob", ["NYC", "NY"], [4, 5]]])

>>> x.location.city                                 # Nested indexing
nd.array([ "LA", "NYC"],
type="strided * string")

>>> x.credits                                       # Variable length data
nd.array([[1, 2, 3],    [4, 5]],
type="strided * var * int32")

>>> x.credits * 10                                  # And computation
nd.array([[10, 20, 30],     [40, 50]],
type="strided * var * int32")

Sadly DyND has functionality gaps which limit usability.

>>> -x.credits                                      # Sadly incomplete :(
TypeError: bad operand type for unary -

I would like to see DyND mature to the point where it could robustly handle semi-structured data. I think that this would be a big win for productivity that would make projects like dask.bag and pyspark obsolete for a large class of use-cases. If you know Python, C++, and would like to help DyND grow I’m sure that Mark and Irwin would love the help

## Comparison with PySpark

1. Doesn’t engage the JVM, no heap errors or fiddly flags to set
2. Conda/pip installable. You could have it in less than twenty seconds from now.
3. Slightly faster in-memory implementations thanks to cytoolz; this isn’t important though
4. Good handling of lazy results per-partition
5. Faster / lighter weight start-up times
6. (Subjective) I find the API marginally cleaner

PySpark pros:

1. Supports distributed computation (this is obviously huge)
2. More mature, more filled out API
3. HDFS integration

Dask.bag reinvents a wheel; why bother?

1. Given the machinery inherited from dask.array and toolz, dask.bag is very cheap to build and maintain. It’s around 500 significant lines of code.
2. PySpark throws Python processes inside a JVM ecosystem which can cause some confusion among users and a performance hit. A task scheduling system in the native code ecosystem would be valuable.
3. Comparison and competition is healthy
4. I’ve been asked to make a distributed array. I suspect that distributed bag is a good first step.

## April 20, 2015

### Titus Brown

#### Statistics from applications to the 2015 course on NGS analysis

Here are some statistics from this year's applications to the NGS course. Briefly, this is a two-week workshop on sequence analysis at the command line and in the cloud.

The short version is that demand remains high; note that we admit only 24 applicants, so generally < 20%...

Year Number of applications Note
2010 33
2011 133
2012 170
2013 210
2014 170 (shifted the timing to Aug)
2015 155 (same timing as 2014)

The demand is still high, although maybe starting to dip?

Status: Number Percent
1st or 2nd year graduate student 20 12.6%
3rd year+ graduate student 40 25.2%
Post-doctoral researcher 36 22.6%
Non-tenure faculty or staff 20 12.6%
Tenure-line faculty 24 15.1%
Other 19 11.9%

Lots of tenure-line faculty feel they need this training...

Primary training/background: Number Percent
Bioinformatics 11 6.9%
Biology 112 70.4%
Computer Science 3 1.9%
Physics 0 0%
Other 33 20.8%

I should look into "Other"!

--titus

## April 19, 2015

### Titus Brown

#### Dear Doc Brown: how can I find a postdoc where I can do open science?

I got the following e-mail from a student I know -- lightly edited to protect the innocent:

I am at the stage where I am putting together a list of people that I want to post-doc with.

So, a question to you:

1. How can I find people who do open science?

2. Say I go for an interview, would it be "polite" to ask them to see if I can do open science even if they're not doing it themselves? Do you have any suggestions on how, exactly, to ask?

The reason why I am asking is because I rarely hear about openly doing (or even talking about) science in biomedical fields, outside of the standard communication methods (e.g. presenting at a meeting). Most of the people in my field seem somewhat conservative in this regard. Plus, I really don't want to butt heads with my postdoc mentor on this kind of topic.

Any advice? I have some but I'll save it for later so I can incorporate other people's advice ;).

thanks,

--titus

p.s. Yes, I have permission to post this question!