September 03, 2015

Continuum Analytics

Continuum Analytics - September Tech Events

Our team is gearing up for a big presence at the Strata Hadoop World Conference at the end of this month, where we’ll be presenting theatre talks from our founder and developers, as well as demos, giveaways, and much more. Take a look at where we’ll be all month and let us know at info@continuum.io if you’d like to schedule a meeting.

September 01, 2015

Titus Brown

Requiescat in pace: Dr. Eric H. Davidson, 1937-2015

I just heard the sad news that Eric Davidson, my PhD advisor, passed away.

Eric was a giant in the field of developmental biology and gene regulatory networks. His work spanned more than fifty years, and had an indelible impact on gene regulation studies. (You can read up on his research on his wikipedia page, or see the lab Web site.)

I joined the lab around 1999, and started working in earnest in 2000. It's the lab where I learned all of the molecular biology, developmental biology, and genomics that I know. It's had an incredible impact on my life, and my career, as well as the lives and careers of dozens of others. The people I met in the lab have been a lifelong scientific community with which to engage, and I'm regularly in touch with many lab alumni.

From talking with others, it takes many graduates from Eric's lab about a decade to recover from the experience and re-engage with him; since I defended in 2006, I still hadn't quite gotten there. In fact, I'd only seen Eric a few times in the intervening years, because when I graduated I moved on to working in other fields. It's sad to think that I won't have a chance to talk with him ever again.

One strong memory of Eric was his availability: he was always available to talk science. Eric was incredibly busy with running a lab of 25-30 people and inevitably had several papers and grants on the burner, but he always made time for his lab members, and never seemed rushed when talking about science. As a graduate student, this was incredibly welcoming - you never got the sense that he didn't have time for you and your science. Come to think of it, this was pretty much Eric's defining characteristic: he was all about the science. I don't think I've ever been in as rigorous an intellectual space as when I was in his lab for graduate school, and that level of intellectualism is something I strive to recreate in myself and in my own lab on a daily basis. From the Facebook comments on his passing, this is a common theme for those who went through his lab.

It's hard to convey the totality of someone as complex as Eric, but this video does a pretty good job. Watch it all the way through to the end if you can -- it's a good story. https://www.youtube.com/watch?v=EtCHJ0cyYXE

RIP, boss.

--titus

William Stein

React, Flux, RethinkDB and SageMathCloud -- Summer 2015 update

I've been using databases and doing web development for over 20 years, and I've never really loved any database before and definitely didn't love any web development frameworks either. That all changed for me this summer...

SageMathCloud

SageMathCloud is a web application in which you collaboratively use Python, LaTeX, Markdown, Sage worksheets (sophisticated mathematics), task lists, R, Jupyter Notebooks, manage courses, write C programs, make chatrooms, and more. It is hosted on Google Compute Engine, but is also entirely open source, and there is a pre-made virtual machine that you can download. A project in SMC is a Linux account, with resources constrained using cgroups and quotas. Many SMC users can collaborate on the same project, with equal privileges in that project. Interaction with all file types (including Jupyter notebooks, task lists, and course management) is synchronized in realtime, like Google Docs. There is also a global notifications feed that shows all editing activity on all files in all projects on which the user collaborates -- a sort of highly technical version of Facebook's feed.

Rewrite motivation

I originally wrote the SageMathCloud frontend using progressive-refinement jQuery (no third-party framework beyond that) and the Cassandra database. These were reasonable choices when I started. There are much better approaches now, which are critical to dramatically improving the user experience with SMC, and also growing the developer base. So far SMC has had no nontrivial outside contributions, probably due to the difficulty of understanding the code. In fact, I think nobody besides me has ever even installed SMC, despite these install notes.

We (Jon Lee, Nicholas Ruhland, and I) are currently completely rewriting the entire frontend of SMC using React.js, Flux, and RethinkDB. We started this rewrite in June 2015, with Jon supported by Google Summer of Code (2015), Nich supported in part by NSF grants from Randy Leveque and Rekha Thomas, and me unemployed.

Terrible funding situation

I'm living on credit cards -- I have no NSF grant support anymore, SageMathCloud is still losing a lot of money every month, and I'm unhappy about this situation. The choice was either to completely quit working on SMC and instead teach or consult a lot, or to lose tens of thousands of dollars. I am doing the latter right now. I was very caught off guard, since this is my first summer without NSF support since I got my Ph.D. in 2000, and I didn't expect to have all my grant proposals denied (which happened in June). There is some modest angel investment in SageMath, Inc., but I can't bring myself to burn through that money on salary, since it would run out quickly, and I don't want to have to shut down the site due to not being able to pay the hosting bill. I've failed to get any significant free hosting, due to already having received free hosting in the past, and SageMath, Inc. not being in any incubators. For example, we tried very hard to get hosting from Google, but they flatly refused for these two reasons (they gave $60K in hosting to the UW/Sage project in 2012). I'm clearly having trouble transitioning from an academic to an industry funding model. But if there are enough paying customers by January 2016, things will turn around.

Jon, Nich, and I have been working on this rewrite for three months, and hope to finish it by the end of September, when Jon and Nich will become busy with classes again. However, it seems unlikely we'll be able to finish at the current rate. Fortunately, I don't start teaching fulltime again until January, and we put a lot of work into doing a release in mid-August that fully uses RethinkDB and partly uses React.js, so that we can finish the second stage of the rewrite iteratively, without any major technical surprises.

RethinkDB

Cassandra is an excellent database for many applications, but it is not the right database for SMC, and I'm making no further use of it.
SMC is a realtime application that does a lot more reading than writing to the database, and it greatly benefits from realtime push updates from the database. I've tried quite hard in the past to build an appropriate architecture for SMC on top of Cassandra, but it is the wrong tool for the job. RethinkDB scales up linearly (with sharding and replication), and has high availability and automatic failover as of version 2.1.2. See https://github.com/rethinkdb/rethinkdb/issues/4678 for my painful path to ensuring RethinkDB actually works for me (the RethinkDB developers are incredibly helpful!).

React.js

I first learned about React.js from some "random podcast", then got more interested in it when Chris Swenson gave a demo at a Sage Days workshop in San Diego in May 2015. React (+Flux) is a web development framework that actually has solid ideas behind it, backed by an implementation that has been optimized and tested by a highly nontrivial real world application: namely, the Facebook website. Even if I had had the idea of React, implementing it in a way that is actually usable would have been difficult. The key idea of React.js is that -- surprisingly -- it is possible to write efficient client-side code that describes how to render the application purely as a function of its state. React is different from jQuery. With jQuery, you write lots of code explaining how to transform the user interface of your application from one complicated state (that you might never have anticipated happening) to another complicated state. With React.js you don't write code about how your application's visible state changes -- instead you write code to answer the question: "given this state, what should the application look like?" For me, it's a game changer.
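As an aside, the render-as-a-function-of-state idea can be sketched in a few lines of plain Python (purely illustrative -- SMC's actual client is Javascript/React, and all names here are made up):

```python
# Toy illustration of "UI as a pure function of state".
# The whole visible "UI" (here, just a string) is recomputed from the
# state dict; we never patch the rendered output in place.

def render(state):
    """Return the complete 'UI' for a given application state."""
    lines = ["Files in %s:" % state["project"]]
    for name in sorted(state["files"]):
        marker = "*" if name in state["open_files"] else " "
        lines.append("%s %s" % (marker, name))
    return "\n".join(lines)

state = {"project": "demo", "files": {"a.txt", "b.py"}, "open_files": set()}
before = render(state)

# After any change to the state, we simply re-render from scratch,
# instead of writing code that transforms the old UI into the new one.
state["open_files"].add("b.py")
after = render(state)
```

React's contribution is making this style efficient in a real browser, by diffing the old and new renders and applying only the minimal DOM updates.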
This is like what one does when writing video games; the innovation is that some people at Facebook figured out how to practically program this way in a client-side web browser application, then tuned their implementation based on huge amounts of real world data (Facebook has users). Oh, and they open sourced the result and ran several conferences explaining React.

React.js reminds me of when Andrew Wiles proved Fermat's Last Theorem in the mid 1990s. Wiles (and Ken Ribet) had genuine new ideas, which dramatically reshaped the landscape of number theory. The best number theorists quickly realized this and adapted to the new world, pushing the envelope of Wiles' work far beyond what I expected could happen. Other people pretended Wiles didn't exist and continued studying Fibonacci numbers. I browsed the web development section of Barnes and Noble last night, and there were dozens of books on jQuery and zero on React.js. I feel for anybody who tries to learn client-side web development by reading books at Barnes and Noble.

IPython/Jupyter and PhosphorJS

I recently met with Fernando Perez, who founded IPython/Jupyter. He seemed to tell me that currently 9 people are working fulltime on rewriting the Jupyter web notebook using the PhosphorJS framework. I tried to understand PhosphorJS from its github page, but couldn't, except to deduce that it is mostly the work of one person from Bloomberg/Continuum Analytics. Fernando told me that they chose PhosphorJS because it is very fast, and that their main motivations are to (1) make Jupyter better use the huge high-resolution monitors at their new institute at Berkeley, and (2) make it easier for developers like me to integrate/extend Jupyter into their applications. I don't understand (2), because PhosphorJS is perhaps the least popular web framework I've ever heard of (is it even a web framework -- I can't tell?).
I pushed Fernando to explain why they made that design choice, but didn't really understand the answer, except that they had spent a lot of time investigating alternatives (including React, at first). I'm intimidated by their resources and concerned that I'm making the wrong choice; however, I just can't understand why they have made what seems to me to be the wrong choice. I hope to understand more at the joint Sage/Jupyter Days 70 that we are organizing together in Berkeley, CA in November. (Edit: see https://github.com/ipython/ipython/issues/8239 for a discussion of why IPython/Jupyter uses PhosphorJS.)

Tables and RethinkDB

Our rewrite of SMC is built on Tables, Flux, and React. Tables are a client-side technology I wrote, inspired by Facebook's GraphQL/Relay technology (and Meteor, Firebase, etc.); they synchronize data between clients and the backend database in realtime. Tables are defined by a JSON schema file, which specifies the fields in the table and declares which get and set queries are allowed. A Table is a subset of a much larger table in the database, with the subset defined by conditions relative to the user making the query. For example, the projects table has one entry for each project on which the user is a collaborator.

Tables are automatically synchronized between the user and the database whenever the database changes, using RethinkDB changefeeds. RethinkDB's innovation is to build realtime updates -- triggered when the result of a query to the database changes -- directly into the database at the lowest level. Of course it is possible to build something that looks the same from the outside using either a message queue (say RabbitMQ or ZeroMQ), or by watching the replication stream from the database and triggering actions based on that (as Meteor does with MongoDB). RethinkDB's approach seems better to me, putting the abstraction at the right level.
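A toy model of the changefeed idea might look like this in Python (illustrative only -- RethinkDB implements this inside the database itself, and these names are invented for the sketch):

```python
# Toy model of a changefeed: subscribers register a query predicate and
# get pushed every row change that matches it, instead of polling.

class ToyTable:
    def __init__(self):
        self._rows = {}
        self._feeds = []  # (predicate, callback) pairs

    def changes(self, predicate, callback):
        """Subscribe: call `callback(row)` whenever a matching row changes."""
        self._feeds.append((predicate, callback))

    def upsert(self, key, row):
        """Write a row, then push it to every subscriber it matches."""
        self._rows[key] = row
        for predicate, callback in self._feeds:
            if predicate(row):
                callback(row)

projects = ToyTable()
seen = []
# Like an SMC client's Table: watch only the rows this user collaborates on.
projects.changes(lambda r: "alice" in r["collaborators"], seen.append)

projects.upsert(1, {"name": "thesis", "collaborators": ["alice", "bob"]})
projects.upsert(2, {"name": "private", "collaborators": ["bob"]})
# Only the first upsert reaches alice's feed.
```

The point of putting this inside the database, rather than bolting it on with a message queue, is that the database already knows exactly which queries a write affects.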
That said, based on mailing list traffic, searches, etc., it seems that very, very few people get RethinkDB yet. Also, despite years of development, RethinkDB only became "production ready" a few months ago, and only got automatic failover a few weeks ago. Still, after ironing out some kinks, I'm now using it with heavy traffic in production and it works very well.

Flux

Once data is automatically synchronized between the database and web browsers in realtime, we can build everything else on top of this. Facebook also introduced an architecture pattern they call Flux, which works well with React. It's very different from MVC-style two-way binding frameworks, where objects are directly linked to UI elements, an object changing causes the UI element to change, and vice versa. In SMC each major part of the system has two objects associated to it: Actions and Stores. We think of them in terms of the classical CQRS pattern -- command query responsibility segregation. Actions are commands -- they are Javascript "functions" that get stuff done, but they do not return values; instead, they impact the state of the store. The store has functions that allow one to query its state, but those functions do not change the state of the store; they must be functions only of the internal state of the store and nothing else. They might cache their results and format their output to be convenient for rendering, but that's it. Actions usually cause the corresponding store (or stores) to change. When a store changes, it emits a change event, which causes any React components that depend on the store to be updated, which in many cases means they are re-rendered. (There are optimizations one can introduce to reduce the amount of re-rendering, which if one isn't careful lead to subtle bugs; pretty much the only subtle React UI bugs one hits are caused by such optimizations.) When the UI re-renders, the user sees their view of the world change.
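The Actions/Stores split described above can be sketched in a few lines of Python (a toy under assumed names -- SMC's real code is Javascript, and `add_file`, `file_count`, etc. are invented here):

```python
# Minimal CQRS sketch: Actions are commands (do things, return nothing);
# the Store answers queries and emits change events, and is only ever
# mutated via actions.

class Store:
    def __init__(self):
        self._state = {"files": []}
        self._listeners = []

    def on_change(self, fn):
        """Register a listener (in React, this would trigger a re-render)."""
        self._listeners.append(fn)

    # Query: a pure function of the store's internal state.
    def file_count(self):
        return len(self._state["files"])

    # Called by actions only, never by UI code directly.
    def _update(self, files):
        self._state["files"] = files
        for fn in self._listeners:
            fn()

class Actions:
    def __init__(self, store):
        self._store = store

    # Command: gets stuff done, returns no value.
    def add_file(self, name):
        self._store._update(self._store._state["files"] + [name])

store = Store()
actions = Actions(store)
renders = []
store.on_change(lambda: renders.append(store.file_count()))

actions.add_file("notes.md")
actions.add_file("hw1.sagews")
# Each action triggered one change event, hence one "re-render".
```

The discipline -- commands never return data, queries never mutate -- is what makes it possible to reason about what the UI will show at any moment.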
The user then clicks buttons, types, etc., which triggers actions, which in turn update stores (and Tables, hence propagating changes to the ultimate source of truth, which is the RethinkDB database). As stores update, the UI updates again, and so on.

Status

So far, we have completely (re-)written the project listing, file manager, help/status page, new file page, project log, file finder, project settings, course management system, account settings, billing, project upgrade system, and file use notifications using React, Flux, and Tables, and the result works well. Bugs are much easier to fix, and it is easy -- or at least possible -- to understand the state of the system, since it is defined by the state of the database and the corresponding client-side stores. We've completely rethought everything about the UI in rewriting the above components, and it has taken several months. Also, as mentioned above, I completely rewrote most of the backend to use RethinkDB instead of Cassandra. Then came the weeks of misery after we made the switch over: even after weeks of thinking/testing/wondering "what could go wrong?", we found all kinds of surprising little things within hours of pushing everything into production, which took more than a week of sleep-deprived days to sort out.

What's left?

We have to rewrite the file editor tabs system, the project tabs system, and all the applications (except course management): editing text files using CodeMirror, task lists (which are surprisingly complicated!), color xterm terminals, Jupyter notebooks (which will still use an iframe for the notebook itself), Sage worksheets (with complicated html output embedded in CodeMirror), the compressed file de-archiver, the LaTeX editor, the wiki and markdown editors, and file chat. We hope to find a clean way to abstract away the various SMC applications as plugins, so that other people can easily write their own applications/plugins that will run inside of SMC.
There will be a rich collection of example plugins to build on, namely the ones listed above, all driven by critical-to-us real world applications.

Discussion about this blog post on Hacker News.

Matthieu Brucher

Sorting source files and projects in folders with CMake and Visual Studio/Xcode

Sometimes Visual Studio and Xcode projects just get out of hand. The private project I’m working on has 130 subprojects, all in a single solution; that’s just too much to display in one window. Then I learnt that projects can actually be moved to folders, just like what is possible for files in a project (so you don’t have Source Files and Header Files, but something custom, for instance following the file hierarchy). They are activated differently, and it’s sometimes not as straightforward, but it works great once it is set up. And as this works for Xcode projects and Visual Studio projects, I was really eager to sort out my Audio Toolkit main project, so it will be the basis of the tests here.

Sort files inside a project

So let’s start with sorting files inside a project. My personal preference is to mirror the file system hierarchy inside the project: files that live in the same folder on disk go in the same folder in the project. I’ve modified a macro I found online for this purpose:

MACRO(SOURCE_GROUP_BY_FOLDER target)
  SET(SOURCE_GROUP_DELIMITER "/")
  SET(last_dir "")
  SET(files "")
  FOREACH(file ${${target}_SRC} ${${target}_HEADERS})
    file(RELATIVE_PATH relative_file "${PROJECT_SOURCE_DIR}/${target}" ${file})
    GET_FILENAME_COMPONENT(dir "${relative_file}" PATH)
    IF (NOT "${dir}" STREQUAL "${last_dir}")
      IF (files)
        SOURCE_GROUP("${last_dir}" FILES ${files})
      ENDIF (files)
      SET(files "")
    ENDIF (NOT "${dir}" STREQUAL "${last_dir}")
    SET(files ${files} ${file})
    SET(last_dir "${dir}")
  ENDFOREACH(file)
  IF (files)
    SOURCE_GROUP("${last_dir}" FILES ${files})
  ENDIF (files)
ENDMACRO(SOURCE_GROUP_BY_FOLDER)


C++ files are stored in a variable named ${target}_SRC and headers in another called ${target}_HEADERS. All the files here will be sorted according to "${PROJECT_SOURCE_DIR}/${target}". If target doesn’t exist, or if you want to create a different hierarchy, change this value.

The trick is managed through the SOURCE_GROUP command. It tells CMake to put one or more files in a specific folder inside a project. So here I used the file system hierarchy, but it is possible to sort the files by generation process, by type, by date…

Here is what it looks like on ATKCore, compared to ATKDelay, which doesn’t use this macro:
Files hierarchy in Visual Studio

Sort projects inside the solution

This is the part that I had the most trouble finding information on. There seem to be pieces everywhere on the Internet, but the documentation itself was not clear enough for me, so here we go.

To activate the folders in a solution, first this property has to be set:

set_property(GLOBAL PROPERTY USE_FOLDERS ON)

Please note that this must be called before any target definition. Now each target can have some of its properties set; the interesting one here is FOLDER.

set_target_properties (ATKCore_static PROPERTIES
  FOLDER "C++/static"
)

For instance, the result in one of my solutions will be like this:

Project hierarchy in Visual Studio

Of course, it makes sense when you have several hierarchies. A complete Audio Toolkit has Python, C++/static, C++/shared and tests as well.

Conclusion

On Linux, CMake directly generates Makefiles, which have no GUI behind them. But for OS X and Windows developers, sorting and ordering files and projects makes a project much easier to handle.

Continuum Analytics news

Modern Open Source Python-based analytics company continues to attract top talent across the organization

AUSTIN, TX. — September 1, 2015 — Continuum Analytics, which develops Anaconda, the leading modern open source analytics platform powered by Python, has announced significant year-over-year company growth, including 50 new hires, among them Michele Chambers as CMO and Randy Russell as CFO. With more than 200 of the Fortune 500 as customers and 100% year-over-year revenue growth, Continuum Analytics is positioned for record momentum in 2015.

August 31, 2015

numfocus

NumFOCUS Announces New Fiscally Sponsored Project: PyTables

NumFOCUS is pleased to announce PyTables as our newest fiscally sponsored project. PyTables is a package for managing hierarchical datasets designed to efficiently cope with extremely large amounts of data.

PyTables has been in the Scientific Python ecosystem for well over a decade and has been the front end for data storage for many other projects over the years. PyTables is built on top of the HDF5 library, using the Python language and the NumPy package. It features an object-oriented interface that, combined with C extensions for the performance-critical parts of the code (generated using Cython), makes it a fast, yet extremely easy to use, tool to interactively browse, process, and search very large amounts of data. One important feature of PyTables is that it optimizes memory and disk resources so that data takes up much less space (especially if on-the-fly compression is used) than other solutions such as relational or object oriented databases. PyTables adoption has grown significantly in recent years as a result of being adopted to implement pandas’ HDF storage capabilities.

PyTables can be applied in any scenario where one needs to deal with large datasets, including:

• Industrial applications
• Data acquisition in real time
• Quality control
• Fast data processing
• Scientific applications
• Meteorology, oceanography
• Numerical simulations
• Medicine (biological sensors, general data gathering & processing)
• Information systems
• System log monitoring & consolidation
• Tracing of routing data

The PyTables development team members—Anthony Scopatz, Andrea Bedini, Francesc Alted, and Antonio Valentino—are thought leaders in the data science community and have pushed forward new and successful ideas about data parallelism, querying and indexing data, and compression.

With the addition of PyTables, NumFOCUS now fiscally sponsors nine different open source data science projects. To make a donation to support PyTables or any of NumFOCUS’ other projects, click here.

August 30, 2015

Titus Brown

Small batch, artisanal, hand-crafted bioinformatics training

On June 11th, 2010, I remember dropping the last workshop attendee off at the Kalamazoo train station, turning the car towards home, and nearly sobbing in relief that the workshop was over and done and I could finally get some sleep. That workshop was the first in a series of now six summer workshops that I've coordinated over the last six years, and, as much as anything else, they've defined my academic career.

The third and final week of our sixth go, the 2015 workshop on Analyzing Next-Generation Sequencing Data, just finished, and (as far as I can tell) it went great. This year we essentially ran two consecutive workshops - the first two weeks were our "standard" zero-entry from-scratch bioinformatics workshop, and the third week was an almost entirely different workshop. During the third week, we had a bunch of instructors come and give advanced tutorials to alumni and other more advanced bioinformaticians.

Why did we do the third week?

The third week

The third week was an attempt to answer the question, "what next?" for people who already had some basic computing background and wanted to fill out their knowledge and experience base. As of last year, we have about 125 alumni - now up to 150 (!) - of the first two weeks, and I thought that some of them might be interested in getting together for ...more. I'd tried to do a third week last year but it fell apart in the face of bad organization (my fault) and too few attendees.

This year, I tried recruiting a bunch of external instructors. I wanted to find a bunch of enthusiastic and experienced trainers who could deliver bioinformatics-relevant material at a high level. So, in late March, I put out an open call on the Software Carpentry instructors mailing list:

Hi all,

see http://ivory.idyll.org/blog/2015-summer-course-NGS.html.

"""
For this third week, we are also looking for a few
instructors/lecturers; travel and lodging would be covered. Drop me
an e-mail if you are a trained Software Carpentry or Data Carpentry
instructor and are interested in coming and hanging out with us for
a week to develop materials and approaches!
"""

If there are SWC instructors interested in trialing their own
genomics material on a captive audience, investigating
reproducibility awesomeness, or otherwise coming and hanging out to
teach and train and develop, please respond privately to this
e-mail.  The location is quite nice, and the people are great!


and... this worked! Astonishingly well! After applying stringent selection criteria (I accepted everyone who responded), and after sorting out all the travel, we ended up with eight instructors and myself. They were joined by approximately 10 students, along with our crack TA/organizing team, Amanda Charbonneau and Jessica Mizzi.

You can see the detailed schedule here, but essentially I just divvied the week up into three hour chunks and gave the instructors the following guidelines:

• please plan your presentations to be interactive, copy-paste, and cloud-based. They should be in Markdown or reST, and under a permissive CC license (I suggest CC0).
• ideally they would run on Amazon EC2. Happy to help you get that working. If they don’t work on EC2 and work on individual laptops, make sure to allow LOTS of time for installation!
• I would like to have them on the ANGUS web site, on github here: https://github.com/ngs-docs/angus. Please submit PRs! I suggest putting them under week3/. I am happy to organize/link into the schedule.
• I’m reserving about 3 hours on the schedule for each one, which (by typical software carpentry etc experience) means an even mix of talking, running things, and troubleshooting/debugging.
• finishing early and allowing time for during- and after- discussions is fine and frankly recommended :)
• you can see what we taught this year, here: http://angus.readthedocs.org/en/2015/ - gives you some sense of a style that we have found works well in this kind of workshop.

At this point let me digress a bit and say this: Software Carpentry is effin' magical. I don't know of another group of people to whom I could have sent an open invitation to come present, gotten back a bunch of replies from people I didn't know and invited them without screening them in any way, sent a bunch of vague instructions like the above, and then had them all show up and give great presentations.

So, yeah, the week3 stuff went really well. Completely apart from the student learning, several different instructors independently told me that they'd learned something from every presentation.

• Ryan Williams (Iowa State) gave a tutorial on multivariate stats that started out a bit slow and then all of a sudden we were like HOLY COW HOW DID WE WAIT I DIDN'T KNOW OK THAT'S COOL. (Lisa's notes)
• Lex Nederbragt (University of Oslo) did an interactive tutorial on assembly where he demoed two teaching techniques using Google Docs - ask questions, poll for answers, discuss, re-poll; and collaborative graphing, where we all added points to a Google Spreadsheet based on computing we did individually. Super neat. (See Lex's blog post for more on active learning, and Lisa's notes on assembly)
• Marian Schmidt (University of Michigan) powered through into the evening with a thorough introduction to RMarkdown, Git, and RStudio. While I missed most of this due to a phone call, I got to experience the "power pose" -- a way to pump up everyone's energy level before sitting back down at RStudio. Great quote from Lex: "Wow, this R stuff is really cool!" (Lisa's notes)
• Meeta Mistry (Harvard) gave an excellent three-way comparison of DEseq 2, Limma, and edgeR for differential expression analysis on an example RNAseq data set. (I will be using this tutorial in three weeks!) (Lisa's notes)
• Asela Wijeratne (Ohio State University) gave a very well received tutorial on pathway analysis in RNAseq. Sadly I missed most of this due to a migraine (bad weather + too much caffeine + too little sleep), but I got many positive reviews. I'm going to have to go through this on my own. (Lisa's notes)
• Tiffany Timbers (Simon Fraser University) showed us all how to do Genome Wide Association Studies. Her tutorial was a masterclass on data munging - she had us pipe data through about 6 different programs, and I think we ended up transforming the data using R, sed, grep and cut, multiple times. There was an entertaining moment when Lex figured out that she was presenting technical questions from her own research, in effect using us as pre-reviewers for her paper ;). (Lisa's notes)
• Leigh Sheneman (Michigan State University) gave an excellent presentation on using AMIs and snapshots on Amazon Web Services, for reproducibility purposes. People were incredibly thankful to have all this explained and I got several very positive reviews afterwards. (Lisa's notes)
• Chris Hamm (University of Kentucky) talked about detecting sex-linked differential expression via dosage compensation. Two highlights of his talk: (1) we all realized how insanely into butterflies he is (see: @butterflyology); (2) he managed to produce some figures so beautiful that we spontaneously applauded. (Lisa's notes)
• I gave two tutorials, one on Docker and one on GitHub pull requests & collaborative documentation editing. People seemed to find them both interesting, although Docker confused people the most of all the topics in the workshop. (See Lisa's notes on Docker and Lisa's notes on GitHub/PRs.)

Throughout all of this, the instructors and students were very engaged. It was kind of hilarious to have 1:1 ratio of instructors and students, when we were also using the sticky system -- no sooner would a pink sticky go up (indicating trouble) then would three different instructors converge on the pink sticky and work to solve the problem. Amazing to watch.

For me (and many instructors), the third week was also awesome in a different way. I had seen most of the subject material before, so while the details were interesting, I don't know that they would have held my attention in all cases. But, not only were the materials interesting, the instructors were awesome and each had their own bag of tricks. Most of them weren't something that I could write down, apart from the technical stuff mentioned above, but everyone had their own style and energy and approach for holding the attention of the audience, and it was a privilege to experience so many teaching styles.

Here's the feedback:

and here are Lisa's notes on the whole week.

Why did we all do the third week?

A week is ... a lot of time. Why did everyone show up and what did people get out of it? I have a few thoughts.

• We selected students who already had a reasonably strong exposure (alumni from previous workshops, or people with significant practical experience). This meant that we had 10x less in the way of problems with software installs and copy/paste/typing issues (which is what dominates the first week of the two-week course). This led in turn to a much faster pace, which I think was fun for everyone involved.
• Researchers are hungry for advanced materials. I had a lot of trouble figuring out how to pitch this, which is one reason why the 3rd week in 2014 failed, and why I worked extra hard this year to bring in students; people weren't willing to put in a week on the vague hope that it would be interesting: they wanted specifics. If and when we do this again, though, the pitch is easy: "Come learn the cutting edge of bioinformatics practice."
• Everyone was a great teacher - energetic, engaged, passionate. That's actually kind of rare in workshops :).
• Software Carpentry instructors rarely get a chance to learn en masse from other Software Carpentry instructors.
• Socializing and networking. The NGS workshop has always had a significant component of hanging out, because, well, that's fun. It's also productive for careers. This socializing is aided by things like trips to breweries, a lot of volleyball (with no expectations of expertise), a beautiful environment, and lots of downtime for relaxation and interaction (similar to Gordon Conferences).
• Everyone likes to know what they know, especially if they learned it in isolation. Comments from the students, in particular, tended to mention that they had seen lots of this stuff, but hadn't necessarily put it all together or filled in the gaps in their knowledge. Finding out that you actually do know a lot is great; rounding it out with experience and more information is even better.
• Material development was an explicit goal of mine. We got a lot of good (open) material out of this, and I'm already planning on reusing a bunch of it!

Having run this once, I honestly don't anticipate a problem in "selling" it going forward.

Are you going to run it again?

tl;dr? Probably, but at some other time and place.

This was an awesome experience for everyone I talked to.

But it was also three weeks, and those of us who really stuck it out the entire time had our brains turned to mush by it. So I think we probably won't run three weeks again.

But there's really no reason to tie the third week to the first two-week workshop. So maybe we can do that elsewhere and elsewhen.

It cost (I estimate) $2500 for me to run. If I ran it "cold" (not tied to the two-week workshop) it would probably be about $5000. I have enough money to do that again, perhaps even a few times. (Most of the costs are in instructor travel/room/board.)

We could probably run a bunch on more specific topics like "RNAseq", "environmental metagenomics", etc, although I'd want to keep many of the technical things (Amazon, Docker, workflows, reproducibility, etc.) as those were well received.

This sounded great! How are you going to scale it so I can come?

I don't want to scale it up much. I think it would actually be a huge mistake to scale this beyond ~30-40 people, total. Good learning at this level (and maybe at any level) simply doesn't happen with mass teaching, or with low instructor-student ratios.

I'd love to see other people run things like this, though. I think the answer to scaling is "run more" not "run bigger", and it seems to be easy to sucker Software Carpentry instructors into advanced teaching in nice places.

With that in mind, I have an offer: if you want to run something like this in the area of data-intensive biology, let's chat. I have money and organizational capacity, and if you can supply a remote location with decent lodging and good weather, maybe we can work something out. There are a few strong requirements on my side (keep it cheap for students; all materials posted CC0 or CC-BY; you host, we run; we need technically advanced biology students; and probably a few more things to ensure a good experience for all concerned) but I'd love to see how far we can take this.

--titus

August 28, 2015

NeuralEnsemble

Docker images for neuronal network simulation

I've created some Docker images for biological neuronal network simulations with Python.

The images contain NEST 2.6, NEURON 7.3, Brian 1.4 and PyNN 0.8.0rc1, together with IPython, numpy, scipy and matplotlib.

The images are intended as a quick way to get simulation projects up-and-running on Linux, OS X and Windows (the latter two via the Docker Toolbox, which runs Docker in a VM). They can be used for teaching or as the basis for reproducible research projects that can easily be shared with others.

The images are available on Docker Hub.

To quickly get started, once you have Docker installed, run

docker pull neuralensemble/simulation
docker run -i -t neuralensemble/simulation /bin/bash

then inside the container

source ~/env/simulation/bin/activate

For ssh/X11 support, use the "simulationx" image instead of "simulation". Full instructions are available here.

I plan to add further images for neuroscience data analysis, providing Neo, Elephant, OpenElectrophy, SpykeViewer, the G-Node Python tools, KlustaSuite, etc. If anyone would like to help out, or suggest other tools that should be installed, please contact me, or open a ticket on Github.

Matthew Rocklin

Efficient Tabular Storage

tl;dr: We discuss efficient techniques for on-disk storage of tabular data, notably the following:

• Binary stores
• Column stores
• Categorical support
• Compression
• Indexed/Partitioned stores

We use the NYC Taxi dataset for examples, and introduce a small project, Castra.

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

Larger than Memory Data and Disk I/O

We analyze large datasets (10-100GB) on our laptop by extending memory with disk. Tools like dask.array and dask.dataframe make this easier for array and tabular data.

Interaction times can improve significantly (from minutes to seconds) if we choose to store our data on disk efficiently. This is particularly important for large data because we can no longer separately “load in our data” while we get a coffee and then iterate rapidly on our dataset once it’s comfortably in memory.

Larger-than-memory datasets force interactive workflows to include the hard drive.

CSV is convenient but slow

CSV is great. It’s human readable, accessible by every tool (even Excel!), and pretty simple.

CSV is also slow. The pandas.read_csv parser maxes out at 100MB/s on simple data. This doesn’t include any keyword arguments like datetime parsing that might slow it down further. Consider the time to parse a 24GB dataset:

24GB / (100MB/s) == 4 minutes
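That back-of-envelope arithmetic is easy to check in plain Python:

```python
# time to parse 24 GB of CSV at a parser throughput of 100 MB/s
seconds = 24e9 / 100e6
minutes = seconds / 60
print(minutes)  # 4.0
```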


A four minute delay is too long for interactivity. We need to operate in seconds rather than minutes otherwise people leave to work on something else. This improvement from a few minutes to a few seconds is entirely possible if we choose better formats.

Example with CSVs

As an example, let's play with the NYC Taxi dataset using dask.dataframe, a library that copies the Pandas API but operates in chunks off of disk.

>>> import dask.dataframe as dd
>>> df = dd.read_csv('trip_data_*.csv',  # path to the NYC Taxi CSVs (illustrative)
...                  skipinitialspace=True,
...                  parse_dates=['pickup_datetime', 'dropoff_datetime'])

>>> df.head()
medallion hack_license vendor_id rate_code store_and_fwd_flag pickup_datetime dropoff_datetime passenger_count trip_time_in_secs trip_distance pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude
0 89D227B655E5C82AECF13C3F540D4CF4 BA96DE419E711691B9445D6A6307C170 CMT 1 N 2013-01-01 15:11:48 2013-01-01 15:18:10 4 382 1.0 -73.978165 40.757977 -73.989838 40.751171
1 0BD7C8F5BA12B88E0B67BED28BEA73D8 9FD8F69F0804BDB5549F40E9DA1BE472 CMT 1 N 2013-01-06 00:18:35 2013-01-06 00:22:54 1 259 1.5 -74.006683 40.731781 -73.994499 40.750660
2 0BD7C8F5BA12B88E0B67BED28BEA73D8 9FD8F69F0804BDB5549F40E9DA1BE472 CMT 1 N 2013-01-05 18:49:41 2013-01-05 18:54:23 1 282 1.1 -74.004707 40.737770 -74.009834 40.726002
3 DFD2202EE08F7A8DC9A57B02ACB81FE2 51EE87E3205C985EF8431D850C786310 CMT 1 N 2013-01-07 23:54:15 2013-01-07 23:58:20 2 244 0.7 -73.974602 40.759945 -73.984734 40.759388
4 DFD2202EE08F7A8DC9A57B02ACB81FE2 51EE87E3205C985EF8431D850C786310 CMT 1 N 2013-01-07 23:25:03 2013-01-07 23:34:24 1 560 2.1 -73.976250 40.748528 -74.002586 40.747868

Time Costs

It takes a second to load the first few lines but 11 to 12 minutes to roll through the entire dataset. We make a zoomable picture below of a random sample of the taxi pickup locations in New York City. This example is taken from a full example notebook here.

df2 = df[(df.pickup_latitude > 40) &
         (df.pickup_latitude < 42) &
         (df.pickup_longitude > -75) &
         (df.pickup_longitude < -72)]

sample = df2.sample(frac=0.0001)
pickup = sample[['pickup_latitude', 'pickup_longitude']]

result = pickup.compute()

from bokeh.plotting import figure, show, output_notebook
output_notebook()
p = figure(title="Pickup Locations")
p.scatter(result.pickup_longitude, result.pickup_latitude, size=3, alpha=0.2)
show(p)

Eleven minutes is a long time

This result takes eleven minutes to compute, almost all of which is parsing CSV files. While this may be acceptable for a single computation we invariably make mistakes and start over or find new avenues in our data to explore. Each step in our thought process now takes eleven minutes, ouch.

Interactive exploration of larger-than-memory datasets requires us to evolve beyond CSV files.

Principles to store tabular data

What efficient techniques exist for tabular data?

A good solution may have the following attributes:

1. Binary
2. Columnar
3. Categorical support
4. Compressed
5. Indexed/Partitioned

We discuss each of these below.

Binary

Consider the text ‘1.23’ as it is stored in a CSV file and how it is stored as a Python/C float in memory:

• CSV: 1.23
• C/Python float: 0x3f9d70a4

These look very different. When we load 1.23 from a CSV textfile we need to translate it to 0x3f9d70a4; this takes time.

A binary format stores our data on disk exactly how it will look in memory; we store the bytes 0x3f9d70a4 directly on disk so that when we load data from disk to memory no extra translation is necessary. Our file is no longer human readable but it’s much faster.
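We can see those exact bytes from Python with the stdlib struct module (a quick sketch, nothing beyond the standard library):

```python
import struct

# pack 1.23 as a big-endian 32-bit float: these four bytes are exactly
# what a binary format would write to disk, no text parsing required
raw = struct.pack('>f', 1.23)
print(raw.hex())  # 3f9d70a4
```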

This gets more intense when we consider datetimes:

• CSV: 2015-08-25 12:13:14
• NumPy datetime representation: 1440529994000000 (as an integer)

Every time we parse a datetime we need to compute how many microseconds it has been since the epoch. This calculation needs to take into account things like how many days in each month, and all of the intervening leap years. This is slow. A binary representation would record the integer directly on disk (as 0x51e278694a680) so that we can load our datetimes directly into memory without calculation.
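A sketch of that integer encoding with NumPy (note: NumPy parses the bare string as UTC, so the exact integer may differ from the value quoted above by a timezone offset):

```python
import numpy as np

# encode the datetime as microseconds since the epoch: the same integer
# a binary format would store directly on disk
t = np.datetime64('2015-08-25T12:13:14')
micros = t.astype('datetime64[us]').astype('int64')
print(micros)
```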

Columnar

Many analytic computations only require a few columns at a time, often only one, e.g.

>>> df.passenger_count.value_counts().compute().sort_index()
0           3755
1      119605039
2       23097153
3        7187354
4        3519779
5        9852539
6        6628287
7             30
8             23
9             24
129            1
255            1
Name: passenger_count, dtype: int64

Of our 24 GB we may only need 2GB. Columnar storage means storing each column separately from the others so that we can read relevant columns without passing through irrelevant columns.

Our CSV example fails at this. While we only want two columns, pickup_datetime and pickup_longitude, we pass through all of our data to collect the relevant fields. The pickup location data is mixed with all the rest.
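This limitation is visible even in Pandas itself: asking for a single column with `usecols` still forces the parser to scan every byte of the file (a tiny sketch; the inline CSV is illustrative):

```python
import io
import pandas as pd

csv = "a,b,c\n1,2,3\n4,5,6\n"

# usecols keeps only column 'b' in memory, but the parser still has to
# read and tokenize the whole file; a column store would read only b's bytes
df = pd.read_csv(io.StringIO(csv), usecols=['b'])
print(df)
```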

Categoricals

Categoricals encode repetitive text columns (normally very expensive) as integers (very very cheap) in a way that is invisible to the user.

Consider the following (mostly text) columns of our NYC taxi dataset:

>>> df[['medallion', 'vendor_id', 'rate_code', 'store_and_fwd_flag']].head()
medallion vendor_id rate_code store_and_fwd_flag
0 89D227B655E5C82AECF13C3F540D4CF4 CMT 1 N
1 0BD7C8F5BA12B88E0B67BED28BEA73D8 CMT 1 N
2 0BD7C8F5BA12B88E0B67BED28BEA73D8 CMT 1 N
3 DFD2202EE08F7A8DC9A57B02ACB81FE2 CMT 1 N
4 DFD2202EE08F7A8DC9A57B02ACB81FE2 CMT 1 N

Each of these columns represents elements of a small set:

• There are two vendor ids
• There are twenty one rate codes
• There are three store-and-forward flags (Y, N, missing)
• There are about 13000 taxi medallions. (still a small number)

And yet we store these elements in large and cumbersome dtypes:

In [4]: df[['medallion', 'vendor_id', 'rate_code', 'store_and_fwd_flag']].dtypes
Out[4]:
medallion             object
vendor_id             object
rate_code              int64
store_and_fwd_flag    object
dtype: object

We use int64 for rate_code, which could easily have fit into an int8, an opportunity for an 8x improvement in memory use. The object dtype used for strings in Pandas and Python takes up a lot of memory and is quite slow:

In [1]: import sys
In [2]: sys.getsizeof('CMT')  # bytes
Out[2]: 40

Categoricals replace the original column with a column of integers (of the appropriate size, often int8) along with a small index mapping those integers to the original values. I’ve written about categoricals before so I won’t go into too much depth here. Categoricals increase both storage and computational efficiency by about 10x if you have text data that describes elements in a category.
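The memory savings are easy to demonstrate with Pandas directly (a sketch with synthetic vendor ids, not the taxi data itself):

```python
import pandas as pd

# 200,000 rows drawn from a two-element set, like vendor_id
s = pd.Series(['CMT', 'VTS'] * 100000)        # object dtype: one Python string per row
cat = s.astype('category')                     # int8 codes plus a tiny two-entry index

obj_bytes = s.memory_usage(deep=True)
cat_bytes = cat.memory_usage(deep=True)
print(obj_bytes, cat_bytes)
```
On a typical build the categorical column is well over an order of magnitude smaller, which is where the ~10x figure comes from.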

Compression

After we’ve encoded everything well and separated our columns we find ourselves limited by disk I/O read speeds. Disk read bandwidths range from 100MB/s (laptop spinning disk hard drive) to 2GB/s (RAID of SSDs). This read speed strongly depends on how large our reads are. The bandwidths given above reflect large sequential reads such as you might find when reading all of a 100MB file in one go. Performance degrades for smaller reads. Fortunately, for analytic queries we’re often in the large sequential read case (hooray!)

We reduce disk read times through compression. Consider the datetimes of the NYC taxi dataset. These values are repetitive and slowly changing; a perfect match for modern compression techniques.

>>> ind = df.index.compute()  # this is on presorted index data (see castra section below)
>>> ind
DatetimeIndex(['2013-01-01 00:00:00', '2013-01-01 00:00:00',
'2013-01-01 00:00:00', '2013-01-01 00:00:00',
'2013-01-01 00:00:00', '2013-01-01 00:00:00',
'2013-01-01 00:00:00', '2013-01-01 00:00:00',
'2013-01-01 00:00:00', '2013-01-01 00:00:00',
...
'2013-12-31 23:59:42', '2013-12-31 23:59:47',
'2013-12-31 23:59:48', '2013-12-31 23:59:49',
'2013-12-31 23:59:50', '2013-12-31 23:59:51',
'2013-12-31 23:59:54', '2013-12-31 23:59:55',
'2013-12-31 23:59:57', '2013-12-31 23:59:57'],
dtype='datetime64[ns]', name=u'pickup_datetime', length=169893985, freq=None, tz=None)

Benchmark datetime compression

We can use a modern compression library, like fastlz or blosc to compress this data at high speeds.

In [36]: import blosc

In [37]: %time compressed = blosc.compress_ptr(address=ind.values.ctypes.data,
...:                                       items=len(ind),
...:                                       typesize=ind.values.dtype.alignment,
...:                                       clevel=5)
CPU times: user 3.22 s, sys: 332 ms, total: 3.55 s
Wall time: 512 ms

In [40]: len(compressed) / ind.nbytes  # compression ratio
Out[40]: 0.14296813539337488

In [41]: ind.nbytes / 0.512 / 1e9      # Compression bandwidth (GB/s)
Out[41]: 2.654593515625

In [42]: %time _ = blosc.decompress(compressed)
CPU times: user 1.3 s, sys: 438 ms, total: 1.74 s
Wall time: 406 ms

In [43]: ind.nbytes / 0.406 / 1e9      # Decompression bandwidth (GB/s)
Out[43]: 3.3476647290640393

We store 7x fewer bytes on disk (thus septupling our effective disk I/O) at the cost of decompressing at about 3 GB/s. If we're on a really nice Macbook pro hard drive (~600MB/s) then this is a clear and substantial win. The worse the hard drive, the better this is.

But sometimes compression isn’t as nice

Some data is more or less compressible than other data. The following column of floating point data does not compress as nicely.

In [44]: x = df.pickup_latitude.compute().values
In [45]: %time compressed = blosc.compress_ptr(x.ctypes.data, len(x), x.dtype.alignment, clevel=5)
CPU times: user 5.87 s, sys: 0 ns, total: 5.87 s
Wall time: 925 ms

In [46]: len(compressed) / x.nbytes
Out[46]: 0.7518617315969132

This compresses more slowly and only provides marginal benefit. Compression may still be worth it on slow disk but this isn’t a huge win.

The pickup_latitude column isn’t compressible because most of the information isn’t repetitive. The numbers to the far right of the decimal point are more or less random.

40.747868


Other floating point columns may compress well, particularly when they are rounded to small and meaningful decimal values.
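That contrast is easy to reproduce even with the stdlib's zlib (much slower than blosc, but enough to show the pattern; the synthetic data below is illustrative):

```python
import random
import zlib

random.seed(0)

# slowly varying integers (like a sorted datetime column) are highly compressible
timestamps = b''.join(i.to_bytes(8, 'little') for i in range(0, 800000, 8))

# high-entropy bytes (like the far-right digits of latitudes) barely compress
noise = bytes(random.randrange(256) for _ in range(800000))

ratio_sorted = len(zlib.compress(timestamps)) / len(timestamps)
ratio_noise = len(zlib.compress(noise)) / len(noise)
print(ratio_sorted, ratio_noise)
```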

Compression rules of thumb

Optimal compression requires thought. General rules of thumb include the following:

• Compress integer dtypes
• Compress datetimes
• If your data is slowly varying (e.g. sorted time series) then use a shuffle filter (default in blosc)
• Don’t bother much with floating point dtypes
• Compress categoricals (which are just integer dtypes)

Avoid gzip and bz2

Finally, avoid gzip and bz2. These are both very common and very slow. If dealing with text data, consider snappy (also available via blosc.)

Indexing/Partitioning

One column usually dominates our queries. In time-series data this is time. For personal data this is the user ID.

Just as column stores let us avoid irrelevant columns, partitioning our data along a preferred index column lets us avoid irrelevant rows. We may need the data for the last month and don’t need several years’ worth. We may need the information for Alice and don’t need the information for Bob.

Traditional relational databases provide indexes on any number of columns or sets of columns. This is excellent if you are using a traditional relational database. Unfortunately the data structures to provide arbitrary indexes don’t mix well with some of the attributes discussed above and we’re limited to a single index that partitions our data into sorted blocks.

Some projects that implement these principles

Many modern distributed database storage systems designed for analytic queries implement these principles well. Notable players include Redshift and Parquet.

Additionally newer single-machine data stores like Dato’s SFrame and BColz follow many of these principles. Finally many people have been doing this for a long time with custom use of libraries like HDF5.

It turns out that these principles are actually quite easy to implement with the right tools (thank you #PyData). The rest of this post will talk about a tiny 500 line project, Castra, that implements these principles and gets good speedups on biggish Pandas data.

Castra

With these goals in mind we built Castra, a binary partitioned compressed columnstore with builtin support for categoricals and integration with both Pandas and dask.dataframe.

Load data from CSV files, sort on index, save to Castra

Here we load in our data from CSV files, sort on the pickup datetime column, and store to a castra file. This takes about an hour (as compared to eleven minutes for a single read.) Again, you can view the full notebook here

>>> import dask.dataframe as dd
>>> df = dd.read_csv('trip_data_*.csv',  # path to the NYC Taxi CSVs (illustrative)
...                  skipinitialspace=True,
...                  parse_dates=['pickup_datetime', 'dropoff_datetime'])

>>> (df.set_index('pickup_datetime', compute=False)
...    .to_castra('trip.castra', categories=True))

Profit

Now we can take advantage of columnstores, compression, and binary representation to perform analytic queries quickly. Here is code to create a histogram of trip distance. The plot of the results follows below.
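A minimal sketch of such a histogram computation, shown with a toy stand-in series since the real `trip_distance` column comes from the Castra file built above:

```python
import pandas as pd

# stand-in for df.trip_distance
trip_distance = pd.Series([1.0, 1.5, 1.1, 0.7, 2.1, 19.8, 20.2])

# histogram of distances via round + value_counts
counts = trip_distance.round().value_counts().sort_index()
print(counts)
```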

Note that this is especially fast because Pandas now releases the GIL on value_counts operations (all groupby operations really). This takes around 20 seconds on my machine on the last release of Pandas vs 5 seconds on the development branch. Moving from CSV files to Castra moved the bottleneck of our computation from disk I/O to processing speed, allowing improvements like multi-core processing to really shine.

We plot the result of the above computation with Bokeh below. Note the spike around 20km. This is around the distance from Midtown Manhattan to LaGuardia airport.

I’ve shown Castra used above with dask.dataframe but it works fine with straight Pandas too.

Credit

Castra was started by myself and Valentin Haenel (current maintainer of bloscpack and bcolz) during an evening sprint following PyData Berlin. Several bugfixes and refactors followed from Phil Cloud and Jim Crist.

Castra is roughly 500 lines long. It’s a tiny project which is both good and bad. It’s being used experimentally and there are some heavy disclaimers in the README. This post is not intended as a sales pitch for Castra, but rather to provide a vocabulary to talk about efficient tabular storage.

Response to twitter traffic: again, this blogpost is not saying “use Castra!” Rather it says “don’t use CSVs!” and consider more efficient storage for interactive use. Other, more mature solutions exist or could be built. Castra was an experiment of “how fast can we get without doing too much work.”

August 25, 2015

Matthieu Brucher

Fun books: Heroic fantasy or space operas?

Between audio development, songwriting, and a computer science day job, sometimes there is a need to go to another dimension, to escape to another reality. I think that is what books are for. And I have found several series that are more than fun: they are addictive, and I can't wait to read the next installment.

Heroic fantasy

I like heroic fantasy. I like the fact that you are in a place completely different from our own world, with different rules of physics (because there is usually magic in them), and if it is written right, it's just impossible to put a book down.

I started reading the Farseer Trilogy (by Robin Hobb) some time ago, after my brother's wife told me about it. It's a sad story, seriously: a bastard gets recognized by his actual father, only to see him die in the first few pages of the first book of the trilogy. He's then trained to be an assassin for his grandfather, living in the shadows, finding love but having to let it go... The magic is not really important here (although it does shape some important parts of Fitz's life); it's mainly the emotion behind the characters, how everything can unravel so fast, how you can lose everything in the blink of an eye even with the best support around you. And yet Fitz doesn't change; he doesn't become a bad man. The first trilogy was fascinating, and the second one, with the same main character, adds focus to a secondary character of the first trilogy (who also got the main part in yet another trilogy, but I never could finish that one, as I missed Fitz's imprint on the story).
The quality of the writing, together with the fantastic story, makes it impossible to let go of the books. It's just not possible to start reading the first one and not finish the two trilogies.

There is another heroic fantasy series that I'm addicted to. This one is not over, and there is a new book every year at the moment. The story may be paralleled with Harry Potter. It's the story of a simple man in a land without magic who one day decides to protect a woman held prisoner. After that, everything goes to hell for him, and he discovers things about himself that he wouldn't have dreamt of.
The first book is self-contained, and the next ones are built on top of it. It sometimes seems that the author, Terry Goodkind, added elements after witnessing the success of his books. The new additions sometimes feel kind of artificial at first, before the books manage to pick up the same thread again after a couple of installments.
Still, the story of Richard and Kahlan is just amazing, even if it starts, as I've said, with the usual "a random average guy is actually not your random average guy, but the most powerful human being, meeting the most powerful woman on Earth". Well, it's not Earth, and there are some pretty gruesome pages, with main characters dying at war as you would expect from an actual war. I'm looking forward to reading the next book, with the new arc of their story. It's not as captivating (emotionally speaking) as the Farseer Trilogy, but it's damn close.

The new arc:

Space opera

There are not that many space stories that I really enjoy. I was a Star Wars fan when I was young, but that's over: the stories are not consistent, the universe is too wide, with too many styles. But there is one author whose (quite rare) books I remember from my childhood. Edmond Hamilton wrote his novels several decades ago, but I was amazed by the setup of his stories. The Star Kings and Return to the Stars are also books I could not put down; they are so timeless, it seems they were written yesterday.

The current space opera I'm following is the story of Black Jack Geary (by Jack Campbell), a survivor of the beginning of a war, found in survival sleep after a century during which he was pictured as a legend, who ends up being the admiral of a fleet deep behind enemy lines, bound to be destroyed by its enemies. Contrary to the other series here, the focus is on space and less on the romantic involvement of the characters. There is something in that regard in all the books, but the main parts are about action, strategy, and the big picture. Also, from a scientific point of view, it's almost sound, so why not?
In the first series, we follow Geary as he tries to get back to his side, discovering new allies and enemies he didn't suspect (of course). In the second series, he is sent to pursue new relations with these allies and enemies, with quite interesting twists. There is also a series about some of these allies which is really different (less spacey, more earthy) and yet just as interesting.

Wrapping up

I read other series (Seventh Son, The Hunger Games...), but these are the ones that are really well written with solid stories. And if you have your own series that you like, I would be really interested to hear about them and read them!

August 24, 2015

Wei Xue

GSoC Final Project Report

GSoC is approaching its end. I am very glad to have had such a great experience this summer. I explored the classical machine learning models: Gaussian mixture models (GM), Bayesian Gaussian mixture models with variational inference (BGM), and Dirichlet process Gaussian mixtures (DPGM). The code and docs are in PR4802.

Besides these, I made some animations and IPython notebooks for these three models.

In conclusion, I finished the tasks in the proposal, but I didn't have time to do the optional tasks, i.e., the incremental EM algorithm and different covariance estimators. Anyway, after GSoC, I will continue to contribute to the scikit-learn project.

Abraham Escalante

Goodbye GSoC.

Hello all,

This is my final entry for the GSoC. It's been one hell of a ride, nothing short of life changing.

When I started looking for a project, the idea was that even if I didn't get selected to participate, I was going to be able to make my first contribution to OpenSource, which in itself was enough motivation to try it.

I found several interesting projects and decided to apply to one of them, the one that I considered to better suit my situation at that moment. Before I got selected I had already interacted with a few members of the community and made a couple of contributions. I was hooked on OpenSource so there was no looking back. By the time I got selected, the GSoC had already met my expectations.

I found a healthy community in SciPy and I could not have asked for a better mentor than Ralf (Gommers). The community members were always involved and supportive while Ralf provided me with enough guidance to understand new concepts in a simple way (I'm no statistician) but not so much that I would be overwhelmed by the information and I still had room to learn by myself, which is an essential part of my learning process (that's where I find the motivation to do the little things).

After starting the GSoC I received the news that I was denied the scholarship to attend Sheffield and my plans for a master's degree were almost derailed. I then got an offer to study at the University of Toronto and this is where it got interesting (spoiler alert: I am writing this blog entry from downtown Toronto).

I went through the scholarship process again and got selected. I also went through the process of selecting my courses at the UofT. With Ralf's guidance and after some research I decided to take courses on Machine Learning, Natural Language Processing and other related topics.

I can now say with pride that I am the newest member of the SciPy community which will help me in my journey towards becoming a Machine Learning expert or maybe a Data Scientist, that remains to be seen, but we already have some plans on how I can keep contributing to SciPy and getting acquainted with the pandas and Numpy communities. I'd like to see what comes from there.

As you can see, I got a lot more than I had expected from this experience, which I attribute to having approached it with the idea of searching for a passion to turn into a career. Naturally I found it, so now it's time to switch gears.

I would like to use the last paragraph of this rant to give out some thanks. Thanks to Ralf for walking me along to find my own path within the beautiful world of OpenSource and Scientific Computing. Thanks to the SciPy community, especially to Josef Perktold and Evgeni Burovski for providing so much valuable feedback to my PRs. Thanks to Google for organising an event like this, helping people like me with the excuse they need to finally join OpenSource and stop leaving it for later. And of course, thanks to the people in my life that provide me with a reason to wake up and try to be a little better than the day before: My girlfriend, Hélène, who keeps my head above the water when I feel like I forgot how to swim by myself and my parents, whose love and support seem to have no end. You make me feel like I owe it to the world to be the best I can be (or try, at the very least).

August 22, 2015

Titus Brown

Unsub grant text: A Coordination Center for Training in Biomedical Data Science

Just as I was moving to UC Davis, a funding call for a training coordination center came out. I got partway down the path of applying for it before realizing that I was overwhelmed with the move, but I did generate some text that I thought was OK. Here it is!

The increasing velocity, variety, and volume of data generated in biomedical research is challenging the existing data management and analysis skills of most researchers. Access to and application of these forms of data is hindered not only by a lack of training and training opportunities in data-intensive biomedical research, but also by the heterogeneity of biomedical data as well as limited discoverability of training materials relevant to different biomedical fields. Many biomedical fields of study - genomics, informatics, biostatistics, and epidemiology, among others - have invested in data analysis methodologies and training materials, but no comprehensive index of biomedical data analysis methodologies, trainers, or training materials exists to close this training gap. Significant investments in biomedical data science training by the Helmsley Foundation and the NIH BD2K program, as well as by foundations not devoted to human health such as the Moore and Sloan Foundations, highlight the need for and opportunities in peer knowledge coordination and training.

We propose to bridge the gap between the availability of biomedical big data and the needs of biomedical researchers to make use of this data by building a coordination center around proven principles of open online collaboration. This coordination center will nucleate a national and international community of expert trainers, together with a catalog of openly available supporting materials developed by this community, to enable the discoverability of resources and the training of data-intensive biomedical researchers using modern, evidence-based teaching practices.

Aim 1: Build an index to enable categorization, discovery, and review of open educational resources.

Subaim 1A: Create and maintain an index of open educational resources.

Subaim 1B: Create and maintain software tooling behind index, including categorization, discovery, and review of resources.

Subaim 1C: Support categorization, discovery, and personalization of educational resources through a controlled vocabulary, personalized search, and lesson tracks.

Aim 2: Coordinate with existing biomedical/data science research and training community.

Subaim 2A: Build a "matchmaking service" to help scientists identify potential collaborative partners for lab rotations.

Subaim 2B: Connect and coordinate training components for national and international biomedical data science initiatives, including BD2K Center awardees, BD2K R25 awardees, and Foundation funders.

Subaim 2C: Facilitate connections and communication with the larger Data Science training community.

Aim 3: Build a community of trainers and contributors to reuse, review, remix, and create training materials.

Subaim 3A: Initiate regional training centers (Davis, Harvard, and St. Louis (or Chicago/Florida)) for coordinating trainers and conducting material discovery and curation, needs analysis, and assessment.

Subaim 3B: Build and maintain a catalog of trained instructors that enables discoverability, coordination and collaboration for training purposes.

Subaim 3C: Encourage and develop a diverse community of contributors through partnerships at regional training centers and Data Carpentry initiatives.

Introduction

As the volume, variety, and velocity of biomedical data increase, so too does the variety of training needs in analyzing this data. Data science training specific to biomedical research is still relatively rare, and where it exists it is siloed, reflecting the bottom-up emergence of training and education in response to research-area-specific needs. However, as the research community and funders respond to the increasing need with increased effort and funding, there is an opportunity to coordinate efforts to serve the broader purpose of training so-called "pi-shaped" researchers - researchers with deep backgrounds in both biomedicine and data science.

We propose to create a virtual training coordination center to organize and coordinate online training materials, facilitate interactions and connections between the many biomedical and data science communities, and nucleate the formation of a more cohesive and more diverse biomedical data science training community.

This TCC will build and maintain a catalogue of open educational resources that can be personalized to researcher-specific career goals, and coordinate software development and data management with the ELIXIR-UK TeSS project (tess.oerc.ox.ac.uk), which serves the European community. Our main effort will be to provide automated and semi-automated gathering and classification of training materials, integrated into a sustainable open system that can be used by others, and served via a personalized curriculum system that can recommend materials based on research interests and prior training.

We will also interact with the national and international biomedical data science communities, including the newly funded BD2K centers, R25 workshop and material grants, EU's ELIXIR, etc., and facilitate connections between these communities and broader data science training initiatives that include the Moore/Sloan Data Science Environments at UW, NYU, and Berkeley, as well as Software Carpentry, Data Carpentry, and the Mozilla Science Lab. A key component of this will be a "matchmaking" service that seeks to identify and support potential collaboration and "lab rotation" opportunities for biomedical scientists looking for data science collaborators.

Finally, we will work to nucleate a diverse and expert community of trainers who can use, reuse, remix, and build new materials. This community will be built upon regional training coordination centers and a recurrent Train-the-Trainers (T3) program to introduce trainers to materials, training and assessment approaches, and technology useful for training. We will also emphasize the inclusion of underrepresented minorities in the T3 program.

Background

The growth of data in the natural sciences has been explosive, with a simultaneous and dramatic increase in all "three Vs" of data - volume, velocity, and variety - over the last two decades. This growth in data has in turn led to an increasing interest in quantitative and computational aspects of data analysis by academic and industry researchers. Computational infrastructure, analysis software, statistical methods for data analysis and integration, and research into the fundamental methods underpinning data-driven discovery have all grown apace.

The growth in data has led to a training gap, in which many researchers suffer from the lack of a solid foundation in quantitative and computational methods. This gap is especially large in basic biology and biomedical research, where traditionally very few researchers have received any training in data analysis beyond basic mathematics and statistics. Moreover, as the volume and importance of data grow and the pace of research into data-driven discovery accelerates, the training gap is widening. Meanwhile, the opportunities for careers in biology-specific data analysis and data-driven discovery are increasing rapidly in both industry and academia, further increasing this gap between the need for trained researchers and the supply.

A number of training programs have stepped up to address this gap. One of the largest and broadest is Software Carpentry, a global non-profit which runs two-day intensive workshops on basic computational practice for academic scientists; as of 2015, Software Carpentry has trained over 10,000 students in 14 years. While not limited to biology, in 2014 half of the Software Carpentry workshops were biology focused, and approximately 1300 of the 2600 trainees were from biology backgrounds. iPlant Collaborative, an NSF-funded center focused on biological data analysis, has also run many training workshops to address the training gap. Internationally, the EU's ELIXIR program and the Australian Bioinformatics Network are focused on biological information, and have significant training programs.

More biomedically focused workshops and training programs in data science have also begun to be developed. Of particular note, NHGRI has funded a variety of computational training over the last decade, including T32, R25, and K and F mechanisms. In the last year, the BD2K Initiative - formed specifically in recognition of cross-Institute opportunities and challenges in data science - has funded a number of "Big Data" centers and R25 workshop and resource development grants, with more to come in 2015. Most recently, the Helmsley Trust has invested $1.7m in the Mozilla Science Lab to help increase the capacity of biomedical scientists to integrate computation into their research.

The landscape of data science training is much larger than biology and biomedical science, of course. In the past few years, a tremendous variety of online resources, including written tutorials, videos, Massive Open Online Courses (MOOCs), and webinars, have emerged. Many universities and institutes have started data science training programs, with a notable investment by the Moore and Sloan foundations in Data Science Environments at NYU, UW, and UC Berkeley focused on data-driven discovery. Furthermore, an NSF investment in BIO Centers led to the initiation of Data Carpentry, a sister non-profit to Software Carpentry that is focused on linking domain-specific data analysis methodology to the broader contexts of efficiency and reproducibility; Data Carpentry is now funded by the Moore Foundation.

Thus, the training landscape in data science generally, and biomedical data science specifically, is large, complex, and international, and the number of training programs and initiatives is growing fast. Over the last few years, several themes in biomedical data science training have emerged:

Many styles of training are needed, across many career levels.
The training breakout at the BD2K "ADDSup" meeting in September 2014 summarized biomedical training opportunities along 10 dimensions, including formal vs informal, beginner to advanced, in-person vs online, short course vs long, centralized vs physical, andragogy vs pedagogy, project-based vs structured, individual vs group learning, "just in time" vs background training, and basic to clinically focused. Different types of materials, teaching approaches, and assessment approaches are appropriate for each of these.

Training opportunities are increasingly oversubscribed. Both surveys and anecdotes suggest that the perceived need for training in biomedical data science is great. For example, the Australian Bioinformatics Network survey on bioinformatics needs concluded that access to training is by far the most dominant concern for biologists. The summer workshop on sequence analysis run by Dr. Brown routinely has 5-10x more applicants (~200) than can be accommodated (25). Software Carpentry and Data Carpentry workshops with biology focuses typically fill up within two days of their announcement and always have a waiting list. And online courses on data analysis and statistics typically have tens of thousands of participants.

Common concerns of reproducibility, efficiency, automation, and statistical correctness underpin every data science domain. At the most domain-specific level, data science training inevitably must focus on specific data types, data analysis problems, and data analysis software. However, as trainees grow in expertise, the same concerns consistently emerge no matter the domain: how do we make this analysis reproducible? How can we most efficiently use available compute resources? How do we run the current analysis on a new data set? How do we assess the significance and correctness of our results?
This convergence suggests a role for inter-domain coordination of training, especially because some biomedical domains (such as bioinformatics) have explored this area in more depth than others.

There is a great need for more instructors versed in both advanced biomedical data science and evidence-based educational practice. The gap in biomedical data science is nowhere more evident than in the lack of instructors capable of teaching data science to biomedical researchers! A consistent theme from universities is that faculty working in this area are overwhelmed by existing teaching and research opportunities, which limits the available instructor pool. A major limiting factor in offering more biology-focused Software Carpentry workshops has been a lack of instructors, although this is somewhat ameliorated by the use of graduate students and postdocs.

A more diverse trainee and instructor pool is needed. By definition, underrepresented minorities are underrepresented in faculty lines, but in biomedical data science this further intersects with the significant underrepresentation of women and minorities in quantitative and computational disciplines. However, we are still at an early enough stage in biomedical data science that this underrepresentation could be addressed by targeted initiatives.

There is a strong need (and attendant opportunity) for centralized coordination in biomedical data science training. Cataloging of training opportunities and materials could increase the efficiency and reuse of existing training, as well as identify where materials and training are lacking. Instructor training can increase the available pool of trained educators as well as provide opportunities for underrepresented populations to get involved in training. And coordination across domains on more advanced training topics could broaden the scope of these advanced materials.
August 19, 2015

Fabian Pedregosa

Holdout cross-validation generator

Cross-validation iterators in scikit-learn are simply generator objects, that is, Python objects that implement the __iter__ method and that for each call to this method return (or, more precisely, yield) the indices or a boolean mask for the train and test sets. Hence, implementing new cross-validation iterators that behave like the ones in scikit-learn is easy with this in mind. Here is a small code snippet that implements a holdout cross-validation generator following the scikit-learn API.

    import numpy as np
    from sklearn.utils import check_random_state

    class HoldOut:
        """
        Hold-out cross-validator generator. In the hold-out, the
        data is split only once into a train set and a test set.
        Unlike in other cross-validation schemes, the hold-out
        consists of only one iteration.

        Parameters
        ----------
        n : total number of samples
        test_size : 0 < float < 1
            Fraction of samples to use as test set. Must be a
            number between 0 and 1.
        random_state : int
            Seed for the random number generator.
        """

        def __init__(self, n, test_size=0.2, random_state=0):
            self.n = n
            self.test_size = test_size
            self.random_state = random_state

        def __iter__(self):
            n_test = int(np.ceil(self.test_size * self.n))
            n_train = self.n - n_test
            rng = check_random_state(self.random_state)
            permutation = rng.permutation(self.n)
            ind_test = permutation[:n_test]
            ind_train = permutation[n_test:n_test + n_train]
            yield ind_train, ind_test

Contrary to other cross-validation schemes, holdout relies on a single split of the data. It is well known that in practice holdout performs much worse than KFold or LeaveOneOut schemes. However, holdout has the advantage that its theoretical properties are easier to derive. For examples of this see e.g. Section 8.7 of "Theory of classification: a survey of some recent advances" and the very recent "The reusable holdout".
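To check how the split behaves, here is a quick standalone sketch that re-derives the same single train/test split with plain NumPy (the `holdout_split` helper is my own illustration, mirroring the HoldOut generator above rather than importing it):

```python
import numpy as np

def holdout_split(n, test_size=0.2, random_state=0):
    """Single train/test index split over range(n), mirroring the
    HoldOut generator above (hypothetical standalone helper)."""
    rng = np.random.RandomState(random_state)
    n_test = int(np.ceil(test_size * n))
    permutation = rng.permutation(n)
    # train indices come after the first n_test positions of the permutation
    return permutation[n_test:], permutation[:n_test]

train_idx, test_idx = holdout_split(10)
print(len(train_idx), len(test_idx))  # 8 2
```

Because the split is derived from a single seeded permutation, the train and test sets are disjoint and together cover all samples exactly once.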
August 18, 2015

Matthieu Brucher

Announcement: ATKChorus 1.0.0 and ATKPhaser 1.0.0

Yet another plugin announcement: I’m happy to announce the release of a chorus plugin as well as a stereo phaser plugin based on the Audio Toolkit. They are available on Windows and OS X (min. 10.8) in different formats. The chorus plugin is based on a variable delay filter driven by a random delay signal, and the stereo phaser morphs a mono signal into a stereo phased pair.

ATKChorus
ATKStereoPhaser

The supported formats are:

• VST2 (32bits/64bits on Windows, 64bits on OS X)
• VST3 (32bits/64bits on Windows, 64bits on OS X)
• Audio Unit (64bits, OS X)

Direct link for ATKChorus. Direct link for ATKUniversalDelay.

The files as well as the previous plugins can be downloaded on SourceForge, as well as the source code.

August 14, 2015

Jake Vanderplas

Out-of-Core Dataframes in Python: Dask and OpenStreetMap

In recent months, a host of new tools and packages have been announced for working with data at scale in Python. For an excellent and entertaining summary of these, I'd suggest watching Rob Story's Python Data Bikeshed talk from the 2015 PyData Seattle conference. Many of these new scalable data tools are relatively heavy-weight, involving brand new data structures or interfaces to other computing environments, but Dask stands out for its simplicity. Dask is a light-weight framework for working with chunked arrays or dataframes across a variety of computational backends. Under the hood, Dask simply uses standard Python, NumPy, and Pandas commands on each chunk, and transparently executes operations and aggregates results so that you can work with datasets that are larger than your machine's memory.

In this post, I'll take a look at how Dask can be useful when looking at a large dataset: the full extracted points of interest from OpenStreetMap. We will use Dask to manipulate and explore the data, and also see the use of matplotlib's Basemap toolkit to visualize the results on a map.
The Data: OpenStreetMap

The data we will be looking at here is the extracted database of marked locations in OpenStreetMap. OpenStreetMap is a free and open, crowd-sourced online mapping tool, meaning that everything on it has been added by an individual user. This has resulted in a compiled collection of nearly ten million distinct points of interest, comprising everything from post boxes to restaurants to historical sites to park benches.

An extracted table of data on these points of interest can be found on the OSM-x-tractor site, a free project which compiles these points of interest and makes them available in a variety of common formats. For this post, I downloaded the CSV file for World Points of Interest, and extracted the compressed file into "POIWorld.csv", a standard comma-separated value file. This file has just over nine million lines, each of which corresponds to a specific point of interest:

    In [1]:
        nrows = sum(1 for _ in open('POIWorld.csv'))
        nrows

    Out[1]: 9140114

Using command-line tools, we can also see that the size of the file is about 760MB.

    In [2]:
        !ls -lh POIWorld.csv

    -rw-r--r--  1 jakevdp  staff   761M Jul 14 14:10 POIWorld.csv

While this could fit into memory on most modern machines, here we'll take a more scalable approach, utilizing Dask to do our data ingestion and manipulation out-of-core. The benefit is that the approach used here will straightforwardly scale to even larger datasets analyzed across multiple machines.

Before we begin to look at this data, let's take a look at Dask and how it can help us with this problem.

Dask Basics

Dask, fundamentally, is a lightweight generator of task graphs for Python. A task graph is a way of describing a sequence of operations so that they can be executed at a later point.
By building these task graphs, Dask describes the exact sequence of inputs, operations, and outputs that your algorithm requires, and can send this description to a variety of backends for efficient parallel and/or out-of-core computation.

Though the low-level details are interesting, it is the higher-level interfaces that most people will use. These interfaces are:

• dask.bag: create task graphs using a functional programming style
• dask.array: create task graphs using a NumPy-like array interface
• dask.dataframe: create task graphs using a Pandas-like DataFrame interface

Each of these provides a familiar Python interface for operating on data, with the difference that individual operations build graphs rather than computing results; the results must be explicitly extracted with a call to the compute() method. As an example of Dask in action, let's take a look at an example of using dask.array.

Dask Arrays

First, let's do a sequence of simple array operations using standard NumPy tools:

    In [3]:
        import numpy as np

        # create an array of normally-distributed random numbers
        a = np.random.randn(1000)

        # multiply this array by a factor
        b = a * 4

        # find the minimum value
        b_min = b.min()
        print(b_min)

    -11.4051061336

Dask allows us to use very similar code to build a task graph describing these operations:

    In [4]:
        import dask.array as da

        # create a dask array from the above array
        a2 = da.from_array(a, chunks=200)

        # multiply this array by a factor
        b2 = a2 * 4

        # find the minimum value
        b2_min = b2.min()
        print(b2_min)

    dask.array<x_3, shape=(), chunks=(), dtype=float64>

In this case, the result of the computation is not a value, but is instead a Dask array object which contains the sequence of steps needed to compute that value.
If you have the graphviz package installed (pip install graphviz), you can visualize this sequence of steps on this chunked data:

    In [5]:
        from dask.dot import dot_graph
        dot_graph(b2_min.dask)

Reading from the bottom to the top, this graph shows exactly what will happen with our computation: first the array x is ingested by dd.from_array and split into five chunks of 200 elements. Each of these chunks is multiplied by 4, and the minimum value is found. Finally, the global minimum is found among these chunk-wise minima, and the result is returned.

We can tell Dask to compute the result encoded in this graph using the compute() method of the dask object:

    In [6]:
        b2_min.compute()

    Out[6]: -11.405106133564583

As expected, we get the same result as the normal, non-lazy method.

I'm not going to go into any more technical details of Dask here, but you should be aware that it contains many more options for flexible creation and evaluation of such task graphs. For more information, I'd suggest reading through the dask documentation or watching the excellent Dask tutorial given by Matthew Rocklin at the recent PyData Seattle conference.

Dask DataFrames and OpenStreetMap

With this framework of task graphs plus lazy evaluation, let's take a look at using Dask in a more interesting context: exploring the OpenStreetMap data. We can start by taking a look at the first few rows of the data using Pandas, just to see what we're working with:

    In [7]:
        import pandas as pd
        data = pd.read_csv('POIWorld.csv', nrows=5)
        data.columns

    Out[7]: Index(['osmid', 'name', 'amenity', 'emergency', 'historic', 'leisure',
                   'man_made', 'office', 'shop', 'sport', 'tourism', 'craft',
                   'Longitude', 'Latitude'],
                  dtype='object')

Each point of interest has latitude and longitude, along with a variety of other data labels.
We'll extract these locations, and focus on the "name" and "amenity" columns associated with them. This can be done with Dask's version of Pandas' read_csv command:

    In [8]:
        from dask import dataframe as dd
        columns = ["name", "amenity", "Longitude", "Latitude"]
        data = dd.read_csv('POIWorld.csv', usecols=columns)
        data

    Out[8]: dd.DataFrame<read-csv-POIWorld.csv-ea29fef3df6f73002fe27223a3749930,
                         divisions=(None, None, None, ..., None, None)>

Notice here that the read_csv command did not actually open the file and access the data, but simply created a task graph describing the operations needed to access the data.

Next let's use Pandas-style filtering to select two subsets of this data: those rows with "amenity" specified, and those rows with "name" specified:

    In [9]:
        with_name = data[data.name.notnull()]
        with_amenity = data[data.amenity.notnull()]

Once again, we have not yet done any actual computation, but simply specified how to find the part of the data we're interested in.

Diving Into the Data: Geography of Coffee

One thing we can do with this data is pull out certain points of interest and compare their distribution on a map. Here we'll try to reproduce a recent Reddit post which maps the distribution of Dunkin Donuts and Starbucks locations in the US.

First, we must further filter the data by the name column. Dask lets us use Pandas-style masking and vectorized string operations, so we can do this as follows:

    In [10]:
        is_starbucks = with_name.name.str.contains('[Ss]tarbucks')
        is_dunkin = with_name.name.str.contains('[Dd]unkin')

        starbucks = with_name[is_starbucks]
        dunkin = with_name[is_dunkin]

To see how many results we have, we can use a count() call and pass it to dd.compute() to see the results.
This is the point when we are finally actually loading the data and computing quantities from the values:

    In [11]:
        dd.compute(starbucks.name.count(), dunkin.name.count())

    Out[11]: (5301, 1276)

We find about 5300 Starbucks and 1300 Dunkin Donuts locations in the global dataset; this is far fewer than the true numbers for these chains, which are around 12000 Starbucks and 8000 Dunkin Donuts in the United States alone! Evidently, the OpenStreetMap data is not all that complete. From my own anecdotal experience with the dataset, I've found that the data tends to be fairly complete in dense urban areas, while missing many details in more sparsely-populated locations.

Despite this incompleteness, let's push on with the data we have and see what we can discover. We can start by computing and extracting the latitude and longitude from the graphs we have generated. We will do this in a single dd.compute() call, so that the data will be ingested only once:

    In [12]:
        locs = dd.compute(starbucks.Longitude, starbucks.Latitude,
                          dunkin.Longitude, dunkin.Latitude)

        # extract arrays of values from the series:
        lon_s, lat_s, lon_d, lat_d = [loc.values for loc in locs]

Now that we have this data, we can use matplotlib's Basemap toolkit to visualize the results on a map:

    In [13]:
        %matplotlib inline
        import matplotlib.pyplot as plt
        from mpl_toolkits.basemap import Basemap

        def draw_USA():
            """initialize a basemap centered on the continental USA"""
            plt.figure(figsize=(14, 10))
            return Basemap(projection='lcc', resolution='l',
                           llcrnrlon=-119, urcrnrlon=-64,
                           llcrnrlat=22, urcrnrlat=49,
                           lat_1=33, lat_2=45, lon_0=-95,
                           area_thresh=10000)

    In [14]:
        m = draw_USA()

        # Draw map background
        m.fillcontinents(color='white', lake_color='#eeeeee')
        m.drawstates(color='lightgray')
        m.drawcoastlines(color='lightgray')
        m.drawcountries(color='lightgray')
        m.drawmapboundary(fill_color='#eeeeee')

        # Plot the values in Starbucks Green and Dunkin Donuts Orange
        style = dict(s=5, marker='o', alpha=0.5, zorder=2)
        m.scatter(lon_s, lat_s, latlon=True,
                  label="Starbucks", color='#00592D', **style)
        m.scatter(lon_d, lat_d, latlon=True,
                  label="Dunkin' Donuts", color='#FC772A', **style)
        plt.legend(loc='lower left', frameon=False);

Again, these data are far from complete, but we can nevertheless see the general trend that many have noted before: Starbucks favors the west coast, while Dunkin Donuts favors the east coast. For anybody who has spent much time in, say, Seattle and Boston, this overall trend should not be all that surprising!

Fast Food Nation

Let's look at a different set of this data, via the "amenity" column. We can call the count() method to check how many rows of the data have amenity specified:

    In [15]:
        dd.compute(with_amenity.amenity.count())

    Out[15]: (5075909,)

We see just over five million rows with an amenity label. With Pandas' value_counts() function, we can examine the most common of these labels in the dataset. Here the head() call triggers a computation:

    In [16]:
        with_amenity.amenity.value_counts().head(20)

    Out[16]: bench               492190
             restaurant          400620
             place_of_worship    389266
             school              335937
             parking             282460
             fuel                198865
             post_box            181858
             cafe                156946
             bank                152218
             fast_food           146962
             recycling           135912
             pharmacy            127259
             waste_basket        119881
             grave_yard          118324
             bicycle_parking     110657
             post_office         102431
             drinking_water       94594
             pub                  94416
             toilets              93708
             telephone            90894
             dtype: int64

Somewhat surprisingly, there are far more benches labeled than any other single labeled object. Down the list a bit, we see the fast_food category, which has around 150,000 global entries.
Using a filter plus another value count, we can check which fast food restaurants are most common:

    In [17]:
        is_fastfood = with_amenity.amenity.str.contains('fast_food')
        fastfood = with_amenity[is_fastfood]
        fastfood.name.value_counts().head(12)

    Out[17]: McDonald's        8608
             Subway            6784
             Burger King       3180
             KFC               2812
             Wendy's           1301
             Taco Bell         1272
             Pizza Hut          965
             マクドナルド           918
             Dairy Queen        744
             Domino's Pizza     680
             McDonalds          634
             Arby's             606
             dtype: int64

As an aside, one interesting thing we see is that there are three versions of McDonald's in this list: there are "McDonald's" and "McDonalds", of course, but also マクドナルド (roughly pronounced "Makudonarudo"), which is the Japanese adaptation of the well-known restaurant name. If you were attempting to use this dataset to count locations by chain, it would be important to take these multiple similar labels into account!

Let's next take a look at the full collection of fast food restaurant locations, extract their latitude and longitude coordinates, and plot their locations on a map of the USA:

    In [18]:
        lat, lon = dd.compute(fastfood.Latitude, fastfood.Longitude)

    In [19]:
        m = draw_USA()

        m.drawmapboundary(fill_color='#ffffff', linewidth=0)
        m.fillcontinents(color="#fcfcfc", lake_color='#ffffff', zorder=1)

        m.scatter(lon.values, lat.values, latlon=True,
                  s=1, marker='o', alpha=0.1, zorder=2);

Here I've purposely left out the geographical borders; we see that with fast food locations alone, we can see the major cities, and even trace out some of the major interstate routes! I suspect that, like above, this data is far from complete, especially in more rural areas. I would love to see how a full fast-food-nation map would look, but after poking around it seems that most available data on that is proprietary (though FlowingData has an interesting visualization in the same spirit).
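As a small aside on the label-variant problem, one way to fold such variants together before counting is a plain Pandas replacement table. This sketch uses my own toy data and alias mapping, not values computed from the OpenStreetMap dataset:

```python
import pandas as pd

# Toy sample of restaurant labels, including two variants of McDonald's
names = pd.Series(["McDonald's", "McDonalds", "マクドナルド", "Subway", "KFC"])

# Hypothetical normalization table folding label variants into one chain name
aliases = {"McDonalds": "McDonald's", "マクドナルド": "McDonald's"}

normalized = names.replace(aliases)
counts = normalized.value_counts().to_dict()
print(counts)  # {"McDonald's": 3, 'Subway': 1, 'KFC': 1}
```

In practice one would build the alias table by inspecting the value_counts() output for near-duplicate spellings.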
Pubs of the British Isles

Let's take a look at one last example, reproducing a post by Ramiro Gomez, a developer in Berlin whose website is definitely worth clicking around for a bit. Here we will extract all the pub locations from the dataset, and use them to visualize a small island nation with an astounding density of these establishments.

We'll start by filtering the amenities for the word "pub" (being careful to use regular expressions which mark word boundaries, so that we don't match things like "public toilet"):

    In [20]:
        is_pub = with_amenity.amenity.str.contains(r'\bpub\b')
        pubs = with_amenity[is_pub]
        pubs.amenity.count().compute()

    Out[20]: 94478

We have about 95,000 worldwide points of interest with "pub" in the label. Next, as above, we can extract the longitude and latitude arrays from our data:

    In [21]:
        lon, lat = dd.compute(pubs.Longitude, pubs.Latitude)

Finally, with a few lines of Basemap code, we can visualize the results:

    In [22]:
        fig, ax = plt.subplots(figsize=(10, 15))
        m = Basemap(projection='mill',
                    lon_0=-5.23636, lat_0=53.866772,
                    llcrnrlon=-10.65073, llcrnrlat=49.16209,
                    urcrnrlon=1.76334, urcrnrlat=60.860699)
        m.drawmapboundary(fill_color='#ffffff', linewidth=.0)
        x, y = m(lon.values, lat.values)
        m.scatter(x, y, s=1, marker=',', color="steelblue", alpha=0.8);

The pub locations alone are enough to make out most of the borders and contours of the islands!

The above visualizations are fun, but have merely scraped the surface of what can be done with this data – what interesting geographical visualizations can you come up with using these data and tools?

Thanks for reading! This post was written entirely in the IPython notebook. You can download this notebook, or see a static view on nbviewer.

August 13, 2015

Continuum Analytics

Memex - Mining the Dark Web

Continuum Analytics has always been a leader in open source development and state-of-the-art technology.
A prime example of our dedication to addressing complex problems and testing the boundaries of data science can be seen in our work on the Defense Advanced Research Projects Agency (DARPA) Memex program, which has been featured on 60 Minutes.

NumFOCUS

NumFOCUS Supported Project Data Carpentry Receives Grant

NumFOCUS is pleased to announce that one of our sponsored projects, Data Carpentry, has received $750,000 in funding from the Gordon and Betty Moore Foundation.

NumFOCUS is committed to supporting and promoting world-class, innovative, open source scientific software. A fundamental component of this mission is to develop and provide training that instills good practice, enables scientific progress, and develops a diverse and strong community around open source and open science. We are excited to continue to work with Data Carpentry to grow this community! Also look for Data Carpentry contributors at upcoming PyData conferences.

Additional information about the activities to be funded by this grant is available at the Data Carpentry blog.

August 11, 2015

Matthieu Brucher

Announcement: ATKUniversalVariableDelay 1.0.2 and ATKUniversalDelay 1.2.0

I’m happy to announce the release of a variable delay plugin and an update to a fixed delay plugin based on the Audio Toolkit. They are available on Windows and OS X (min. 10.8) in different formats. The variable delay plugin lets you define a delay driven by a sinusoid (mean delay, delay amplitude, and frequency) and mix it with feedback and forward signals. The fixed delay plugin provides a fixed delay (up to a second). Both plugins can run at 192kHz.

ATKUniversalVariableDelay
ATKUniversalDelay

The supported formats are:

• VST2 (32bits/64bits on Windows, 64bits on OS X)
• VST3 (32bits/64bits on Windows, 64bits on OS X)
• Audio Unit (64bits, OS X)

The files as well as the previous plugins can be downloaded on SourceForge, as well as the source code.

August 10, 2015

Matthew Rocklin

A Weekend with Asyncio

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr: I learned asyncio and rewrote part of dask.distributed with it; this details my experience

asyncio

The asyncio library provides concurrent programming in the style of Go, Clojure’s core.async library, or more traditional libraries like Twisted. Asyncio offers a programming paradigm that lets many moving parts interact without involving separate threads. These separate parts explicitly yield control to each other and to a central authority and then regain control as others yield control to them. This lets one escape traps like race conditions to shared state, a web of callbacks, lost error reporting, and general confusion.
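As a minimal sketch of that control-passing style (written with the modern async/await syntax; at the time of this post the equivalent used @asyncio.coroutine and yield from), two coroutines hand values over a queue while a single event loop arbitrates:

```python
import asyncio

async def producer(queue):
    # hand values to the consumer, yielding control at each await
    for i in range(3):
        await queue.put(i)
    await queue.put(None)  # sentinel: no more work

async def consumer(queue, results):
    while True:
        item = await queue.get()
        if item is None:
            break
        results.append(item * 10)

async def main():
    queue = asyncio.Queue()
    results = []
    # both coroutines run in one thread, interleaved by the event loop
    await asyncio.gather(producer(queue), consumer(queue, results))
    return results

results = asyncio.run(main())
print(results)  # [0, 10, 20]
```

No locks are needed around `results`: only one coroutine runs at a time, and control changes hands only at the explicit `await` points.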

I’m not going to write too much about asyncio. Instead I’m going to briefly describe my problem, link to a solution, and then dive into good-and-bad points about using asyncio while they’re fresh in my mind.

Exercise

I won’t actually discuss the application much after this section; you can safely skip this.

I decided to rewrite the dask.distributed Worker using asyncio. This worker has to do the following:

1. Store local data in a dictionary (easy)
2. Perform computations on that data as requested by a remote connection (act as a server in a client-server relationship)
3. Collect data from other workers when we don’t have all of the necessary data for a computation locally (peer-to-peer)
4. Serve data to other workers who need our data for their own computations (peer-to-peer)

It’s a sort of distributed RPC mechanism with peer-to-peer value sharing. Metadata for who-has-what data is stored in a central metadata store; this could be something like Redis.
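The who-has-what bookkeeping can be pictured as a mapping from data keys to the set of workers holding each value. A toy in-memory stand-in (class and method names here are hypothetical; the real store could be a service like Redis):

```python
class MetadataStore:
    """Toy central metadata store: which workers hold which keys."""

    def __init__(self):
        self.who_has = {}   # key -> set of worker addresses

    def add_key(self, key, worker):
        self.who_has.setdefault(key, set()).add(worker)

    def workers_for(self, key):
        return self.who_has.get(key, set())

store = MetadataStore()
store.add_key('x', 'tcp://worker-1')
store.add_key('x', 'tcp://worker-2')   # value replicated on two workers
store.add_key('y', 'tcp://worker-2')

# a worker missing 'x' would pick a peer from this set and fetch it
print(sorted(store.workers_for('x')))
```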

The current implementation is a nest of threads, queues, and callbacks. It's not bad and performs well, but it tends to be hard for others to contribute to.

Additionally, I want to separate out the worker code because it's useful outside of dask.distributed. I have other distributed-computation ideas in mind that rely on this technology.

For the moment the code lives here: https://github.com/mrocklin/dist. I like the design. The module-level docstring of worker.py is short and informative. But again, I’m not going to discuss the application yet; instead, here are some thoughts on learning/developing with asyncio.

General Thoughts

Disclaimer I am a novice concurrent programmer. I write lots of parallel code but little concurrent code. I have never used existing frameworks like Twisted.

I liked the experience of using asyncio and recommend the paradigm to anyone building concurrent applications.

The Good:

• I can write complex code that involves multiple asynchronous calls, complex logic, and exception handling all in a single place. Complex application logic is no longer spread in many places.
• Debugging is much easier now that I can throw import pdb; pdb.set_trace() lines into my code and expect them to work (this fails when using threads).
• My code fails more gracefully, further improving the debug experience. Ctrl-C works.
• The paradigm shared by Go, Clojure’s core.async, and Python’s asyncio felt viscerally good. I was able to reason well about my program as I was building it and made nice diagrams about explicitly which sequential processes interacted with which others over which channels. I am much more confident of the correctness of the implementation and the design of my program. However, after having gone through this exercise I suspect that I could now implement just about the same design without asyncio. The design paradigm was perhaps as important as the library itself.
• I have to support Python 2. Fortunately I found the trollius port of asyncio to be very usable. It looks like it was a direct fork-then-modify of tulip.

The Bad:

• There wasn’t a ZeroMQ connectivity layer for Trollius (though aiozmq exists for Python 3), so I ended up having to use threads anyway for inter-node I/O. This, combined with ZeroMQ’s finicky behavior, meant that my program sometimes crashed hard. I’m considering switching to plain sockets (which are supported natively by Trollius and asyncio) because of this.
• While exceptions are raised cleanly, I can’t determine where they originate: there are no line numbers or tracebacks. Debugging in a concurrent environment is hard; my experience was definitely better than with threads, but it could still be improved. I hope that asyncio in Python 3.4 has better debugging support.
• The API documentation is thorough, but Stack Overflow coverage, general best practices, and examples are very sparse. The project is new, so there isn’t much to go on. I found that reading documentation for Go and presentations on Clojure’s core.async were far more helpful in preparing me to use asyncio than any of the asyncio docs/presentations.

Future

I intend to pursue this in the future and, if the debugging experience is better in Python 3, I am considering rewriting the dask.distributed Scheduler in Python 3 with asyncio proper. This is possible because the Scheduler doesn’t have to be compatible with user code.


August 07, 2015

Wei Xue

Progress Report 3

My mentor gave me some useful advice after I finished the code for BayesianGaussianMixture and DirichletProcessGaussianMixture. So in these two weeks, I fixed the style problems and wrote all the necessary test cases for BayesianGaussianMixture. I also visualized Gaussian mixtures with variational inference for the four types of precision using matplotlib.animation (link).

Next, I will explore some optional tasks, namely incremental learning and other covariance estimators, in addition to the test cases for DirichletProcessGaussianMixture.

Jake Vanderplas

Frequentism and Bayesianism V: Model Selection

This post is part of a 5-part series: Part I Part II Part III Part IV Part V

See also Frequentism and Bayesianism: A Python-driven Primer, a peer-reviewed article partially based on this content.

Last year I wrote a series of posts comparing frequentist and Bayesian approaches to various problems (see the links above).

Here I am going to dive into an important topic that I've not yet covered: model selection. We will take a look at this from both a frequentist and Bayesian standpoint, and along the way gain some more insight into the fundamental philosophical divide between frequentist and Bayesian methods, and the practical consequences of this divide.

My quick, TL;DR summary is this: for model selection, frequentist methods tend to be conceptually difficult but computationally straightforward, while Bayesian methods tend to be conceptually straightforward but computationally difficult.

Model Fitting vs Model Selection

The difference between model fitting and model selection is often a cause of confusion. Model fitting proceeds by assuming a particular model is true, and tuning the model so it provides the best possible fit to the data. Model selection, on the other hand, asks the larger question of whether the assumptions of the model are compatible with the data.

Let's make this more concrete. By model here I essentially mean a formula, usually with tunable parameters, which quantifies the likelihood of observing your data. For example, your model might consist of the statement, "the $$(x, y)$$ observations come from a straight line, with known normal measurement errors $$\sigma_y$$". Labeling this model $$M_1$$, we can write:

$y_{M_1}(x;\theta) = \theta_0 + \theta_1 x\\ y \sim \mathcal{N}(y_{M_1}, \sigma_y^2)$

where the second line indicates that the observed $$y$$ is normally distributed about the model value, with variance $$\sigma_y^2$$. There are two tunable parameters to this model, represented by the vector $$\theta = [\theta_0, \theta_1]$$ (i.e. the slope and intercept).

Another model might consist of the statement "the observations $$(x, y)$$ come from a quadratic curve, with known normal measurement errors $$\sigma_y$$". Labeling this model $$M_2$$, we can write:

$y_{M_2}(x;\theta) = \theta_0 + \theta_1 x + \theta_2 x^2\\ y \sim \mathcal{N}(y_{M_2}, \sigma_y^2)$

There are three tunable parameters here, again represented by the vector $$\theta$$.

Model fitting, in this case, is the process of finding constraints on the values of the parameters $$\theta$$ within each model. That is, it allows you to make statements such as, "assuming $$M_1$$ is true, this particular $$\theta$$ gives the best-fit line" or "assuming $$M_2$$ is true, this particular vector $$\theta$$ gives the best-fit curve." Model fitting proceeds without respect to whether the model is capable of describing the data well; it just arrives at the best-fit model under the assumption that the model is accurate.

Model selection, on the other hand, is not concerned with the parameters themselves, but with the question of whether the model is capable of describing the data well. That is, it allows you to say, "for my data, a line ($$M_1$$) provides a better fit than a quadratic curve ($$M_2$$)".

Let's make this more concrete by introducing some data.

Consider the following data:

In [1]:
import numpy as np
data = np.array([[ 0.42,  0.72,  0.  ,  0.3 ,  0.15,
                   0.09,  0.19,  0.35,  0.4 ,  0.54,
                   0.42,  0.69,  0.2 ,  0.88,  0.03,
                   0.67,  0.42,  0.56,  0.14,  0.2 ],
                 [ 0.33,  0.41, -0.22,  0.01, -0.05,
                  -0.05, -0.12,  0.26,  0.29,  0.39,
                   0.31,  0.42, -0.01,  0.58, -0.2 ,
                   0.52,  0.15,  0.32, -0.13, -0.09],
                 [ 0.1 ,  0.1 ,  0.1 ,  0.1 ,  0.1 ,
                   0.1 ,  0.1 ,  0.1 ,  0.1 ,  0.1 ,
                   0.1 ,  0.1 ,  0.1 ,  0.1 ,  0.1 ,
                   0.1 ,  0.1 ,  0.1 ,  0.1 ,  0.1 ]])
x, y, sigma_y = data


To get an idea of what we're looking at, let's quickly visualize these points:

In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set() # set default plot styles

x, y, sigma_y = data
fig, ax = plt.subplots()
ax.errorbar(x, y, sigma_y, fmt='ok', ecolor='gray')
ax.set(xlabel='x', ylabel='y', title='input data');


Our central model selection question will be this: which model better fits this data, the linear ($$M_1$$) or the quadratic ($$M_2$$)?

Let's create a function to compute these models given some data and parameters; for convenience, we'll make a very general polynomial model function:

In [3]:
def polynomial_fit(theta, x):
    """Polynomial model of degree (len(theta) - 1)"""
    return sum(t * x ** n for (n, t) in enumerate(theta))


If the theta variable is of length 2, this corresponds to the linear model ($$M_1$$). If the theta variable is length 3, this corresponds to the quadratic model ($$M_2$$).

As detailed in my previous posts, both the frequentist and Bayesian approaches to model fitting often revolve around the likelihood, which, for independent errors, is the product of the probabilities for each individual point. Here is a function which computes the log-likelihood for the two models:

In [4]:
from scipy import stats

def logL(theta, model=polynomial_fit, data=data):
    """Gaussian log-likelihood of the model at theta"""
    x, y, sigma_y = data
    y_fit = model(theta, x)
    return sum(stats.norm.logpdf(*args)
               for args in zip(y, y_fit, sigma_y))


Both the Bayesian and frequentist approaches are based on the likelihood, and the standard frequentist approach is to find the model which maximizes this expression. Though there are efficient closed-form ways of maximizing this, we'll use a direct optimization approach here for clarity:

In [5]:
from scipy import optimize

def best_theta(degree, model=polynomial_fit, data=data):
    theta_0 = (degree + 1) * [0]
    neg_logL = lambda theta: -logL(theta, model, data)
    return optimize.fmin_bfgs(neg_logL, theta_0, disp=False)

theta1 = best_theta(1)
theta2 = best_theta(2)


Now we can visually compare the maximum-likelihood degree-1 and degree-2 models:

In [6]:
xfit = np.linspace(0, 1, 1000)
fig, ax = plt.subplots()
ax.errorbar(x, y, sigma_y, fmt='ok', ecolor='gray')
ax.plot(xfit, polynomial_fit(theta1, xfit), label='best linear model')
ax.plot(xfit, polynomial_fit(theta2, xfit), label='best quadratic model')
ax.legend(loc='best', fontsize=14)
ax.set(xlabel='x', ylabel='y', title='data');


The crux of the model selection question is this: how can we quantify the difference between these models, and decide which model better describes our data?

Naive Approach: Comparing Maximum Likelihoods

One common mistake is to assume that we can select between models via the value of the maximum likelihood. While this works in some special cases, it is not generally applicable. Let's take a look at the maximum log-likelihood value for each of our fits:

In [7]:
print("linear model:    logL =", logL(best_theta(1)))
print("quadratic model: logL =", logL(best_theta(2)))

linear model:    logL = 22.0108670066



The quadratic model yields a higher log-likelihood, but this does not necessarily mean it is the better model!

The problem is that the quadratic model has more degrees of freedom than the linear model, and thus will always give an equal or larger maximum likelihood, regardless of the data! This trend holds generally: as you increase model complexity, the maximum likelihood value will (almost) always increase!

Let's take a look at the best maximum likelihood for a series of polynomial fits (linear, quadratic, cubic, quartic, etc.):

In [8]:
degrees = np.arange(1, 10)
thetas = [best_theta(d) for d in degrees]
logL_max = [logL(theta) for theta in thetas]

fig, ax = plt.subplots(1, 2, figsize=(14, 5))
ax[0].plot(degrees, logL_max)
ax[0].set(xlabel='degree', ylabel='log(Lmax)')
ax[1].errorbar(x, y, sigma_y, fmt='ok', ecolor='gray')
ylim = ax[1].get_ylim()
for (degree, theta) in zip(degrees, thetas):
    if degree not in [1, 2, 9]: continue
    ax[1].plot(xfit, polynomial_fit(theta, xfit),
               label='degree={0}'.format(degree))
ax[1].set(ylim=ylim, xlabel='x', ylabel='y')
ax[1].legend(fontsize=14, loc='best');


We see in the left panel that the maximum likelihood value always increases as we increase the degree of the polynomial. Looking at the right panel, we see how this metric has led us astray: while the ninth order polynomial certainly leads to a larger likelihood, it achieves this by over-fitting the data.

Thus, in some ways, you can view the model selection question as fundamentally about comparing models while correcting for over-fitting of more complicated models. Let's see how this is done within the frequentist and Bayesian approaches.

Fundamentals of Frequentist & Bayesian Model Selection

Recall that the fundamental difference between the frequentist and Bayesian approaches is the definition of probability.

Frequentists consider probabilities as frequencies: that is, a probability is only meaningful in the context of repeated experiments (even if those repetitions are merely hypothetical). This means, for example, that in the frequentist approach:

• observed data (and any quantities derived from them) are considered to be random variables: if you make the observations again under similar circumstances, the data may be different, and the details depend on the generating distribution.
• model parameters (those things that help define the generating distribution) are considered fixed: they aren't subject to a probability distribution; they just are.

On the other hand, Bayesians consider probabilities as degrees-of-belief: that is, a probability is a way of quantifying our certainty about a particular statement. This means, for example, that in the Bayesian approach:

• observed data are not directly considered as random variables; they just are.
• model parameters are uncertain quantities and thus subject to probabilistic description.

This difference in philosophy has real, practical implications, as we will see below.

Some Notation

Before we continue, a quick note on notation. For the below discussion, it is important to be able to denote probabilities concisely. We'll generally be writing conditional probabilities of the form $$P(A~|~B)$$, which can be read "the probability of A given B". Additionally, I'll be using the following shorthands:

• $$D$$ represents observed data
• $$M$$, $$M_1$$, $$M_2$$, etc. represent a model
• $$\theta$$ represents a set of model parameters

With this in mind, we'll be writing statements like $$P(D~|~\theta,M)$$, which should be read "the probability of seeing the data, given the parameters $$\theta$$ with model $$M$$". I'm playing a bit fast-and-loose with discrete vs continuous variables, but the meaning should be clear from the context.

Model Fitting

In the model fitting context, the difference in philosophy of probability leads to frequentists and Bayesians dealing with different quantities:

• frequentists look at the likelihood: $$P(D~|~\theta, M)$$
• Bayesians look at the posterior: $$P(\theta~|~D, M)$$

Note the main distinction: frequentists operate on a probability of the data, while Bayesians operate on a probability of the model parameters, in line with their respective considerations about the applicability of probability.

Frequentists, here, have a clear advantage: the likelihood is something we can compute directly from the model – after all, a model is nothing more than a specification of the likelihood given model parameters. By optimizing this likelihood expression directly, as we saw above, you can arrive at an estimated best-fit model.

Bayesians, on the other hand, have a slightly more difficult task. To find an expression for the posterior, we can use Bayes' theorem:

$P(\theta~|~D,M) = P(D~|~\theta,M) \frac{P(\theta~|~M)}{P(D~|~M)}$

We see that the posterior is proportional to the likelihood used by frequentists, and the constant of proportionality involves the ratio of $$P(\theta~|~M)$$ and $$P(D~|~M)$$. $$P(\theta~|~M)$$ here is the prior and quantifies our belief/uncertainty about the parameters $$\theta$$ without reference to the data. $$P(D~|~M)$$ is the model evidence, and in this context amounts to no more than a normalization term.

For a more detailed discussion of model fitting in the frequentist and Bayesian contexts, see the previous posts linked above.

Model Selection

Similarly, when comparing two models $$M_1$$ and $$M_2$$, the frequentist and Bayesian approaches look at different quantities:

• frequentists compare the model likelihood, $$P(D~|~M_1)$$ and $$P(D~|~M_2)$$
• Bayesians compare the model posterior, $$P(M_1~|~D)$$ and $$P(M_2~|~D)$$

Notice here that the parameter values $$\theta$$ no longer appear: we're not out to compare how well particular fits of the two models describe data; we're out to compare how well the models themselves describe the data. Unlike the parameter likelihood $$P(D~|~\theta, M)$$ above, neither quantity here is directly related to the likelihood expression, and so we must figure out how to re-express the desired quantity in terms we can compute.

Model Selection: Bayesian Approach

For model selection, in many ways Bayesians have the advantage, at least in theory. Through a combination of Bayes Theorem and probability axioms, we can re-express the model posterior $$P(M~|~D)$$ in terms of computable quantities and priors:

First, using Bayes' Theorem:

$P(M~|~D) = P(D~|~M)\frac{P(M)}{P(D)}$

Using the definition of conditional probability, the first term can be expressed as an integral over the parameter space of the likelihood:

$P(D~|~M) = \int_\Omega P(D~|~\theta, M) P(\theta~|~M) d\theta$

Notice that this integral is over the exact quantity optimized in the case of Bayesian model fitting.

The remaining terms are priors, the most problematic of which is $$P(D)$$ – the prior probability of seeing your data without reference to any model. I'm not sure that $$P(D)$$ could ever be actually computed in the real world, but fortunately it can be canceled by computing the odds ratio between two alternative models:

$O_{21} \equiv \frac{P(M_2~|~D)}{P(M_1~|~D)} = \frac{P(D~|~M_2)}{P(D~|~M_1)}\frac{P(M_2)}{P(M_1)}$

We now have a means of comparing two models via computable quantities: an integral over the likelihood for each model, and the prior odds ratio. The prior odds ratio is often assumed to be near unity, leaving only the well-defined (but often computationally intensive) integral over the likelihood for each model. More on this below.

Model Selection: Frequentist Approach

For model selection, frequentists are working with the quantity $$P(D~|~M)$$. Notice that unlike Bayesians, frequentists cannot express this as an integral over parameter space, because the notion of a probability distribution over model parameters does not make sense in the frequentist context. But recall that frequentists can make probabilistic statements about data or quantities derived from them: with this in mind, we can make progress by computing some statistic from the data for which the distribution is known. The difficulty is that which statistic is most useful depends highly on the precise model and data being used, and so practicing frequentist statistics requires a breadth of background knowledge about the assumptions made by various approaches.

For example, one commonly-seen distribution for data-derived statistics is the $$\chi^2$$ (chi-squared) distribution. The $$\chi^2$$ distribution with $$K$$ degrees of freedom describes the distribution of a sum of squares of $$K$$ independent normally-distributed variables. We can use Python tools to quickly visualize this:

In [9]:
from scipy import stats
v = np.linspace(0, 8, 1000)
for k in range(1, 7):
    plt.plot(v, stats.chi2.pdf(v, k),
             label="k = {0}".format(k))
plt.legend(ncol=2)
plt.gca().set(title='$\chi^2$ distribution',
              xlabel='x', ylabel='p(x)', ylim=(0, 0.5));


The key observation is this: if we can somehow transform our data into a single statistic $$S(D;\theta,M)$$ which we expect behaves like the sum of squares of normally-distributed values, then we can analyze the likelihood of this statistic in terms of this $$\chi^2$$ distribution, and use this as a proxy for the model likelihood $$P(D~|~M)$$.

In the case of our linear and quadratic model likelihoods $$M_1$$ and $$M_2$$ above, the models are built on the explicit expectation that for the correct model, the observed data $$y$$ will be normally distributed about the model value $$y_M$$ with a standard deviation $$\sigma_y$$. Thus the sum of squares of normalized residuals about the best-fit model should follow the $$\chi^2$$ distribution, and so we construct our $$\chi^2$$ statistic:

$\chi^2(D;\theta,M) = \sum_n\left[\frac{y_n - y_M(x_n;\theta)}{\sigma_{y,n}}\right]^2$

Coincidentally (or is it?), this statistic is equal to twice the negative log-likelihood, up to a constant offset that does not depend on $$\theta$$, and so for model fitting, minimizing the $$\chi^2$$ is equivalent to maximizing the likelihood.
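A quick numerical check of this relationship, on a tiny made-up dataset (unrelated to the data above, so the snippet is self-contained):

```python
import numpy as np
from scipy import stats

x = np.array([0.0, 0.5, 1.0, 1.5])
y = np.array([0.1, 0.4, 0.3, 0.7])
sigma_y = 0.1

def logL(theta):
    # Gaussian log-likelihood of a linear model y = theta0 + theta1 * x
    y_fit = theta[0] + theta[1] * x
    return np.sum(stats.norm.logpdf(y, y_fit, sigma_y))

def chi2(theta):
    # sum of squared normalized residuals for the same model
    y_fit = theta[0] + theta[1] * x
    return np.sum(((y - y_fit) / sigma_y) ** 2)

# chi2 = -2 * logL + constant: the combination chi2 + 2 * logL
# should come out the same for any theta
c1 = chi2([0.0, 0.3]) + 2 * logL([0.0, 0.3])
c2 = chi2([0.2, -1.0]) + 2 * logL([0.2, -1.0])
print(np.allclose(c1, c2))  # True
```

The constant is just the normalization term $$\sum_n \log(2\pi\sigma_{y,n}^2)$$, which drops out of any optimization over $$\theta$$.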

With this statistic, we've replaced an uncomputable quantity $$P(D~|~M)$$ with a computable quantity $$P(S~|~M_S)$$, where $$S = \chi^2$$ is a particular transformation of the data for which (under our model assumptions) we can compute the probability distribution. So rather than asking "how likely are we to see the data $$D$$ under the model $$M$$", we can ask "how likely are we to see the statistic $$S$$ under the model $$M_S$$", and the two likelihoods will be equivalent as long as our assumptions hold.

Frequentist and Bayesian Model Selection by Example

Let's use the ideas developed above to address the above model selection problem in both a frequentist and Bayesian context.

Frequentist Model Selection: $$\chi^2$$

We introduced the $$\chi^2$$ statistic above, and now let's take a look at how this is actually computed. From our proposed models, we explicitly expect that the residuals about the model fit will be independent and normally distributed, with variance $$\sigma_y^2$$. Consequently, we expect that for the correct model, the sum of squared normalized residuals will be drawn from a $$\chi^2$$ distribution with a number of degrees of freedom related to the number of data points. For $$N$$ data points fit by a $$K$$-parameter linear model, the degrees of freedom are usually given by $$dof = N - K$$, though there are some caveats to this (see The Dos and Don'ts of Reduced Chi-Squared for an enlightening discussion).

Let's define some functions to compute the $$\chi^2$$ and the number of degrees of freedom, and evaluate the likelihood of each model with the $$\chi^2$$ distribution:

In [10]:
def compute_chi2(degree, data=data):
    x, y, sigma_y = data
    theta = best_theta(degree, data=data)
    resid = (y - polynomial_fit(theta, x)) / sigma_y
    return np.sum(resid ** 2)

def compute_dof(degree, data=data):
    return data.shape[1] - (degree + 1)

def chi2_likelihood(degree, data=data):
    chi2 = compute_chi2(degree, data)
    dof = compute_dof(degree, data)
    return stats.chi2(dof).pdf(chi2)

print("chi2 likelihood")
print("- linear model:    ", chi2_likelihood(1))
print("- quadratic model: ", chi2_likelihood(2))

chi2 likelihood
- linear model:     0.0455244340637



We have found that the likelihood of the observed residuals under the linear model ($$M_1$$) is slightly larger than the likelihood of the observed residuals under the quadratic model ($$M_2$$).

To understand what this is saying, it is helpful to visualize these distributions:

In [11]:
fig, ax = plt.subplots()
for degree, color in zip([1, 2], ['blue', 'green']):
    v = np.linspace(0, 40, 1000)
    chi2_dist = stats.chi2(compute_dof(degree)).pdf(v)
    chi2_val = compute_chi2(degree)
    chi2_like = chi2_likelihood(degree)
    ax.fill(v, chi2_dist, alpha=0.3, color=color,
            label='Model {0} (degree = {0})'.format(degree))
    ax.vlines(chi2_val, 0, chi2_like, color=color, alpha=0.6)
    ax.hlines(chi2_like, 0, chi2_val, color=color, alpha=0.6)
ax.set(ylabel='L(chi-square)')
ax.set_xlabel('chi-square')
ax.legend(fontsize=14);


We can see visually here how this procedure corrects for model complexity: even though the $$\chi^2$$ value for the quadratic model is lower (shown by the vertical lines), the characteristics of the $$\chi^2$$ distribution mean the likelihood of seeing this value is lower (shown by the horizontal lines), meaning that the degree=1 linear model is favored.

Significance of the Comparison

But how much should we trust this conclusion in favor of the linear model? In other words, how do we quantify the significance of this difference in $$\chi^2$$ likelihoods?

We can make progress by realizing that in the frequentist context all data-derived quantities can be considered probabilistically, and this includes the difference in $$\chi^2$$ values from two models! For this particular case, the difference of $$\chi^2$$ statistics here also follows a $$\chi^2$$ distribution, with 1 degree of freedom. This is due to the fact that the models are nested – that is, the linear model is a specialization of the quadratic model (for some background, look up the Likelihood Ratio Test).
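This claim is easy to sanity-check by simulation: generate many synthetic datasets for which the linear model is true, fit both models (here with np.polyfit as a stand-in for the fitting machinery above), and look at the distribution of the χ² difference, which should have the mean of a χ² distribution with one degree of freedom:

```python
import numpy as np

rng = np.random.RandomState(42)
n_pts, sigma, trials = 20, 0.1, 500
x = rng.rand(n_pts)

def chi2(y, degree):
    # sum of squared normalized residuals about the best-fit polynomial
    resid = (y - np.polyval(np.polyfit(x, y, degree), x)) / sigma
    return np.sum(resid ** 2)

deltas = []
for _ in range(trials):
    # simulate a dataset for which the linear (null) model is true
    y = -0.1 + 0.5 * x + sigma * rng.randn(n_pts)
    deltas.append(chi2(y, 1) - chi2(y, 2))

print(np.mean(deltas))  # should be close to 1, the mean of chi2(df=1)
```

Note that the difference is always non-negative: because the models are nested, the quadratic fit can never have a larger χ² than the linear fit on the same data.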

We might proceed by treating the linear model as the null hypothesis, and asking if there is sufficient evidence to justify the more complicated quadratic model. Just as above, we can plot the $$\chi^2$$ difference along with its expected distribution:

In [12]:
chi2_diff = compute_chi2(1) - compute_chi2(2)

v = np.linspace(0, 5, 1000)
chi2_dist = stats.chi2(1).pdf(v)
p_value = 1 - stats.chi2(1).cdf(chi2_diff)

fig, ax = plt.subplots()
ax.fill_between(v, 0, chi2_dist, alpha=0.3)
ax.fill_between(v, 0, chi2_dist * (v > chi2_diff), alpha=0.5)
ax.axvline(chi2_diff)
ax.set(ylim=(0, 1), xlabel="$\chi^2$ difference", ylabel="probability");
ax.text(4.9, 0.95, "p = {0:.2f}".format(p_value),
        ha='right', va='top', size=14);


Here we see where this observed $$\chi^2$$ difference lies on its expected distribution, under the null hypothesis that the linear model is the true model. The area of the distribution to the right of the observed value is known as the p value: for our data, the $$p$$-value is 0.17, meaning that, assuming the linear model is true, there is a 17% probability that simply by chance we would see data that favors the quadratic model more strongly than ours. The standard interpretation of this is to say that our data are not inconsistent with the linear model: that is, our data does not support the quadratic model enough to conclusively reject the simpler linear model.

Other Frequentist Approaches

Recall that a primary purpose of using the $$\chi^2$$ statistic, above, was to prevent mistaken selection of very flexible models which over-fit the data. Other qualitatively different approaches exist to limit this overfitting. Some important ones are:

• The Akaike Information Criterion is an information-theoretic approach which penalizes the maximum likelihood to account for added model complexity. It makes rather stringent assumptions about the form of the likelihood, and so cannot be universally applied.
• Cross-Validation is a sampling method in which the model fit and evaluation take place on disjoint randomized subsets of the data, which acts to penalize over-fitting. Cross-validation is more computationally intensive than other frequentist approaches, but has the advantage that it relies on very few assumptions about the data and model and so is applicable to a broad range of models.
• Other sampling-based methods: the classical frequentist approach relies on computing specially-crafted statistics that fit the assumptions of your model. Some of this specialization can be side-stepped through randomized methods like bootstrap and jackknife resampling. For an interesting introduction to this subject, I'd recommend the short and free book, Statistics is Easy by Shasha and Wilson, or the extremely approachable talk, Statistics Without the Agonizing Pain by John Rauser.

We will not demonstrate these approaches here, but they are relatively straightforward and interesting to learn!
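That said, the cross-validation idea is short enough to sketch. A rough two-fold version for this kind of polynomial comparison (synthetic data standing in for the dataset above, np.polyfit standing in for best_theta):

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(40)
y = -0.1 + 0.5 * x + 0.1 * rng.randn(40)   # synthetic, truly linear data

def cv_mse(degree):
    """Two-fold cross-validation: fit on one half, score on the other."""
    halves = [(slice(0, 20), slice(20, 40)), (slice(20, 40), slice(0, 20))]
    mse = []
    for train, test in halves:
        theta = np.polyfit(x[train], y[train], degree)
        mse.append(np.mean((y[test] - np.polyval(theta, x[test])) ** 2))
    return np.mean(mse)

scores = {d: cv_mse(d) for d in (1, 2, 9)}
print(scores)
```

Because the fit never sees the held-out half, a model that chases noise in the training half typically pays for it in held-out error, which is how cross-validation penalizes the over-flexible degree-9 model without any explicit complexity term.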

Bayesian Model Selection: the Odds Ratio

The Bayesian approach proceeds very differently. Recall that the Bayesian model involves computing the odds ratio between two models:

$O_{21} = \frac{P(M_2~|~D)}{P(M_1~|~D)} = \frac{P(D~|~M_2)}{P(D~|~M_1)}\frac{P(M_2)}{P(M_1)}$

Here the ratio $$P(M_2) / P(M_1)$$ is the prior odds ratio, and is often assumed to be equal to 1 if no compelling prior evidence favors one model over another. The ratio $$P(D~|~M_2) / P(D~|~M_1)$$ is the Bayes factor, and is the key to Bayesian model selection.

The Bayes factor can be computed by evaluating the integral over the parameter likelihood:

$P(D~|~M) = \int_\Omega P(D~|~\theta, M) P(\theta~|~M) d\theta$

This integral is over the entire parameter space of the model, and thus can be extremely computationally intensive, especially as the dimension of the model grows beyond a few. For the 2-dimensional and 3-dimensional models we are considering here, however, this integral can be computed directly via numerical integration.
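To make the integral concrete on a problem small enough for scipy.integrate.quad, here is a sketch with a toy one-parameter setup (not the polynomial models above): data drawn from $$\mathcal{N}(\theta, 1)$$, comparing $$M_1$$ ($$\theta$$ fixed at 0, no free parameters) against $$M_2$$ ($$\theta$$ free, with a flat prior on $$[-5, 5]$$):

```python
import numpy as np
from scipy import stats, integrate

rng = np.random.RandomState(1)
data_toy = rng.normal(0.3, 1.0, size=10)   # true mean slightly off zero

def likelihood(theta):
    return np.prod(stats.norm.pdf(data_toy, theta, 1.0))

# M1: theta fixed at 0 -> the evidence is just the likelihood there
evidence_1 = likelihood(0.0)

# M2: theta free, flat prior density 1/10 on [-5, 5]
#     -> integrate likelihood * prior over the parameter space
evidence_2, _ = integrate.quad(lambda t: likelihood(t) / 10.0, -5, 5)

bayes_factor = evidence_2 / evidence_1   # O_21 with equal prior odds
print(bayes_factor)
```

The 1/10 prior density inside the integral acts as an Occam factor: a diffuse prior on the extra parameter dilutes the evidence for the more flexible model, so the data must pull the likelihood well away from $$\theta = 0$$ before $$M_2$$ is favored.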

We'll start, though, by using an MCMC to draw samples from the posterior in order to solve the model fitting problem. We will use the emcee package, which requires us to first define functions which compute the prior, likelihood, and posterior under each model:

In [13]:
def log_prior(theta):
    # size of theta determines the model;
    # flat prior over a large range
    if np.any(abs(theta) > 100):
        return -np.inf  # log(0)
    else:
        # log of the flat prior density: (1/200) per dimension
        return -len(theta) * np.log(200)

def log_likelihood(theta, data=data):
    x, y, sigma_y = data
    yM = polynomial_fit(theta, x)
    return -0.5 * np.sum(np.log(2 * np.pi * sigma_y ** 2)
                         + (y - yM) ** 2 / sigma_y ** 2)

def log_posterior(theta, data=data):
    theta = np.asarray(theta)
    return log_prior(theta) + log_likelihood(theta, data)


Next we draw samples from the posterior using MCMC:

In [14]:
import emcee

def compute_mcmc(degree, data=data,
                 log_posterior=log_posterior,
                 nwalkers=50, nburn=1000, nsteps=2000):
    ndim = degree + 1  # this determines the model
    rng = np.random.RandomState(0)
    starting_guesses = rng.randn(nwalkers, ndim)
    sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior, args=[data])
    sampler.run_mcmc(starting_guesses, nsteps)
    trace = sampler.chain[:, nburn:, :].reshape(-1, ndim)
    return trace

trace_2D = compute_mcmc(1)
trace_3D = compute_mcmc(2)


The output is a trace, or a series of samples which by design should reflect the posterior distribution. To visualize the posterior samples, I like to use seaborn's jointplot (for 2D samples) or PairGrid (for N-D samples):

In [15]:
import pandas as pd
columns = [r'$\theta_{0}$'.format(i) for i in range(3)]
df_2D = pd.DataFrame(trace_2D, columns=columns[:2])

with sns.axes_style('ticks'):
    jointplot = sns.jointplot(r'$\theta_0$', r'$\theta_1$',
                              data=df_2D, kind="hex");

In [16]:
df_3D = pd.DataFrame(trace_3D, columns=columns[:3])

# get the colormap from the joint plot above
cmap = jointplot.ax_joint.collections[0].get_cmap()

with sns.axes_style('ticks'):
    grid = sns.PairGrid(df_3D)
    grid.map_diag(plt.hist, bins=30, alpha=0.5)
    grid.map_offdiag(plt.hexbin, gridsize=50, linewidths=0, cmap=cmap)