Category: Data

We are very excited to announce an early release of PyAutoDiff, a library that allows automatic differentiation in NumPy, among other useful features. A quickstart guide is available here.

Autodiff can compute gradients (or derivatives) with a simple decorator:
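To illustrate the decorator pattern itself, here is a minimal pure-Python sketch. It is not autodiff's implementation: autodiff differentiates symbolically through Theano, whereas this stand-in approximates the derivative with central finite differences. The name `numeric_gradient` and the step size `eps` are assumptions made for the example.

```python
import functools


def numeric_gradient(fn, eps=1e-6):
    """Decorator: calling the wrapped function returns d(fn)/dx at x.

    Illustration only. This uses central finite differences, not the
    symbolic differentiation that autodiff performs via Theano.
    """
    @functools.wraps(fn)
    def wrapper(x):
        # Central difference: (f(x + eps) - f(x - eps)) / (2 * eps)
        return (fn(x + eps) - fn(x - eps)) / (2 * eps)
    return wrapper


@numeric_gradient
def square(x):
    return x ** 2


# The decorated function now returns the slope rather than the value.
# For x**2 the derivative is 2x, so this is approximately 6.0.
grad_at_3 = square(3.0)
```

The point of the pattern is that the function body stays ordinary Python; only the decorator changes what a call returns.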

More broadly, autodiff leverages Theano's powerful symbolic engine to compile NumPy functions, allowing features like mathematical optimization, GPU acceleration, and of course automatic differentiation. Autodiff is compatible with any NumPy operation that has a Theano equivalent and fully supports multidimensional arrays. It also gracefully handles many Python constructs (though users should be very careful with control flow tools like if/else and loops!).

In addition to the @gradient decorator, users can apply @function to compile functions without altering their return values. Compiled functions can automatically take advantage of Theano's optimizations and available GPUs, though users should note that GPU computations are only supported for float32 dtypes. Other decorators, classes, and high-level functions are available; see the docs for more information.

Autodiff can also trace NumPy objects through multiple functions. It can then compile symbolic representations of all of the traced operations (or their gradients) -- even with respect to objects that were purely local to the scope of those functions.

One of the original motivations for autodiff was working with SVMs that were defined purely in NumPy. The following example (also available at autodiff/examples/) fits an SVM to random data, using autodiff to compute parameter gradients for SciPy's L-BFGS-B solver:

Some members of the scientific community will recall that James Bergstra began the PyAutoDiff project a year ago in an attempt to unify NumPy's imperative style with Theano's functional syntax. James successfully demonstrated the project's utility, and this version builds out and on top of that foundation. Standing on the shoulders of giants, indeed!

Please note that autodiff remains under active development and features may change. The library has been performing well in internal testing, but we're sure that users will find new and interesting ways to break it. Please file any bugs you may find!


“We make a lot of guesses,” says Sriram Sankar, an engineer on Facebook’s search team. “We apply a lot of intuition… and we use a lot of data to verify that intuition.”

(via Wired)

Getting scientific Python running on a Mac is one of the biggest hurdles for data scientists who are just getting started (and, trust us, professionals too!). We'd usually steer readers toward one of the more popular articles on the subject, but it's gotten a bit stale. Therefore, here are updated step-by-step instructions for getting a basic environment set up.

This guide covers the following steps and packages:

  1. Quickstart
  2. Homebrew
  3. Python
  4. virtualenv and virtualenvwrapper
  5. NumPy
  6. SciPy
  7. matplotlib
  8. IPython and the Qt console
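Once an environment like this is set up, a quick check along the following lines can confirm which of the Python packages are importable. This is only a sketch: the import names in `PACKAGES` are assumptions based on the list above (Homebrew is not a Python package, and virtualenvwrapper is left out because it is used mainly through its shell functions), and `check_packages` is a hypothetical helper, not part of the guide.

```python
import importlib.util

# Top-level import names assumed for the packages listed in the guide.
PACKAGES = ["virtualenv", "numpy", "scipy", "matplotlib", "IPython"]


def check_packages(names):
    """Return a dict mapping each module name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = check_packages(PACKAGES)
for name, found in status.items():
    print(f"{name:12s} {'ok' if found else 'MISSING'}")
```

Running this inside the virtualenv, rather than with the system Python, shows whether the packages landed in the environment you intended.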

Continue reading “Installing scientific Python on Mac OS X” »

In today's NYT, David Brooks writes on the growing trend of "data-ism" -- the relationship people have with the data that surrounds or defines them:

[T]he data revolution is giving us wonderful ways to understand the present and the past. Will it transform our ability to predict and make decisions about the future? We’ll see.

One small grievance with Brooks' view is that he falls into the common trap of viewing the data world in black and white -- hard, objective mathematics vs subjective intuition. He does not seem to acknowledge even the possibility of a blended approach, expressing some reservation about the faith people put in numbers.

This usually comes about from a conflation of "statistics" with an inferential process like machine intelligence. Statistics is the study and description of quantitative information. It's usually not a normative science. Using statistics, we build tools to help us study various phenomena. These tools lead to decisions. It's not entirely correct to view the quantification process (what Brooks calls "data-ism") in isolation. In fact, in the absence of context, we believe it to be meaningless from an analytical perspective.

According to the NYT, Disney is implementing a new data-driven system in its parks:

...Disney has decided that MyMagic+ is essential. The company must aggressively weave new technology into its parks — without damaging the sense of nostalgia on which the experience depends — or risk becoming irrelevant to future generations, Mr. Staggs said. From a business perspective, he added, MyMagic+ could be “transformational.”

At first, the system will be used to track customer information (name, birthday, preferred rides, etc.). Future versions should allow the company to know where in the park each customer is and whether or not a nearby attraction would interest them. Beyond that, the company could predict a customer's likely trajectory around the park -- or even subtly adjust that path to minimize time on line and maximize time spent on the attractions.

In one of the best technology articles of 2012, Steve Lohr writes about the need for thought and intuition when dealing with data. He observes:

In so many Big Data applications, a math model attaches a crisp number to human behavior, interests and preferences. The peril of that approach, as in finance, was the subject of a recent book by Emanuel Derman, a former quant at Goldman Sachs and now a professor at Columbia University. Its title is “Models. Behaving. Badly.”

Claudia Perlich, chief scientist at Media6Degrees, an online ad-targeting start-up in New York, puts the problem this way: “You can fool yourself with data like you can’t with anything else. I fear a Big Data bubble.”

And concludes:

Listening to the data is important, [data scientists] say, but so is experience and intuition. After all, what is intuition at its best but large amounts of data of all kinds filtered through a human brain rather than a math model?

At the M.I.T. conference, Ms. Schutt was asked what makes a good data scientist. Obviously, she replied, the requirements include computer science and math skills, but you also want someone who has a deep, wide-ranging curiosity, is innovative and is guided by experience as well as data.