traims' blog

AI, machine learning, data analysis; complex networks, natural language processing. #DataMining #MachineLearning #recsys #python #ruby #rstats

I attended the “Python for Data Analysis” meetup organised by Data Science London. There were two main talks: one by Didrik Pinte from Enthought and one by Wes McKinney, the creator of pandas.

NumPy, the Python foundation for number crunching

by Didrik Pinte, Python contributor to QuantLib (a library for quant finance) and MD of Enthought, developer of EPD, the scientific computing Python platform.

(c) by Data Science London @ds_ldn

About half of the audience had already used NumPy, though I think only a couple of people had gone deep into C integration and memory optimizations. So the talk mixed introductory material with Cython code and profiling tools.

Interestingly, when someone decided to port NumPy to .NET, it didn’t work efficiently because of unpredictable garbage collection in .NET.

Didrik also showed how the memory monitor from Pikos works.

Python for Data Analysis

by Wes McKinney (@wesmckinn), former quant, author of pandas (the powerful Python library for data analysis) and of the book “Python for Data Analysis”.

(c) by Data Science London

Most of the talk was done in the IPython notebook. Using a MovieLens dataset as an example, Wes showed various pandas functions: data slicing, merge, map, etc. The library is also good for data munging, cleaning and preparation.
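To give a flavour of the operations mentioned above, here is a minimal sketch. It is not the code from the talk: the two tiny tables are made up and only imitate the shape of MovieLens data, and the column names (`movie_id`, `user_id`, `rating`, `title`, `liked`) are my own assumptions.

```python
import pandas as pd

# Tiny made-up tables that only imitate MovieLens (not real data).
movies = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Toy Story (1995)", "Heat (1995)", "Casino (1995)"],
})
ratings = pd.DataFrame({
    "user_id": [10, 10, 11, 12],
    "movie_id": [1, 2, 1, 3],
    "rating": [5, 3, 4, 2],
})

# Slicing: a boolean mask keeps only the ratings of 4 or higher.
high = ratings[ratings["rating"] >= 4]

# Merge: attach the movie title to every rating via the shared key.
merged = ratings.merge(movies, on="movie_id", how="left")

# Map: derive a new column by applying a function to each rating.
merged["liked"] = merged["rating"].map(lambda r: r >= 4)

print(high)
print(merged)
```

With the real dataset, the same three steps would apply after loading the ratings and movies files into DataFrames.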

He said they are working on further improvements to the library, prompted by cases where people try to open a 5 GB Kaggle dataset and the system ends up using 20 GB of memory.
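As a side note, one way to see where the memory goes, and to shrink it, is to measure a DataFrame's footprint and then use narrower dtypes. This is only a sketch on synthetic data using current pandas, not something shown in the talk; the column names and the chosen dtypes are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data standing in for a large CSV (hypothetical columns).
n = 10_000
df = pd.DataFrame({
    "user_id": np.arange(n, dtype=np.int64),
    "country": ["UK", "US", "DE", "FR"] * (n // 4),
    "score": np.linspace(0.0, 1.0, n),
})

# deep=True also counts the Python string objects, not just pointers.
before = df.memory_usage(deep=True).sum()

# Narrower numeric types and a categorical for the repeated strings.
slim = df.assign(
    user_id=df["user_id"].astype(np.int32),
    country=df["country"].astype("category"),
    score=df["score"].astype(np.float32),
)
after = slim.memory_usage(deep=True).sum()

print(f"before: {before} bytes, after: {after} bytes")
```

Note that `float32` loses precision, so whether such a downcast is acceptable depends on the data.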

He also mentioned the Rmagic library for running R code from Python. It is useful, for example, for the ggplot2 library, which has no equivalent in the Python world.

The "Python for Data Analysis" book is an introduction to pandas with working code examples, and it is better learning material than the plain documentation. Print copies are not available yet; they will probably appear in time for Strata New York. (I have just checked my O’Reilly account, and my copy is no longer listed as an “early release”.)

The first speaker, Didrik, added two reasons to use pandas:

  1. its excellent documentation; 
  2. vbench, performance benchmarking for pandas.

Random notes:
  • Carlos, the event host, offered three O’Reilly books as quiz prizes. Didrik came up with a question about ndarray strides. Wes asked who created the J language, which inspired NumPy.
  • Didrik was using Canopy on his Mac. Wes was using the IPython notebook.
  • Meetup page of the event: http://www.meetup.com/Data-Science-London/events/85448442/
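
For readers wondering about the ndarray strides from the quiz: strides are the number of bytes NumPy steps over in memory to reach the next element along each axis. The snippet below is my own illustration, not Didrik's actual question, and the array is arbitrary.

```python
import numpy as np

# A 3x4 array of 8-byte integers in the default C (row-major) order.
a = np.arange(12, dtype=np.int64).reshape(3, 4)

# Next row is 4 elements * 8 bytes away; next column is 8 bytes away.
print(a.strides)            # (32, 8)

# Transposing copies nothing: it is a view with the strides swapped.
t = a.T
print(t.strides)            # (8, 32)
print(np.shares_memory(a, t))  # True

# Taking every second column doubles the step along that axis.
every_other = a[:, ::2]
print(every_other.strides)  # (32, 16)
```

This is why slicing and transposing are cheap in NumPy: they change only the shape and strides, not the underlying buffer.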