My friends who are doing research in economics and finance sometimes ask me what Python libraries they should look into. The most obvious choices are NumPy and SciPy. In this post, I would like to describe two other libraries that are less known in the finance community: pandas and QSTK.
Pandas is a powerful tool for data analysis in Python. Some of its features are specifically tailored for finance applications.
Let’s look at a simple example. Imagine that we would like to download stock price data for Apple (AAPL) from Yahoo Finance. This can be done in a single line of Python code, and we can then print what we have downloaded.
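A minimal sketch of what that might look like. The original post pulled live data from Yahoo Finance; the commented-out `DataReader` call below is an assumption (that API has changed several times since), so a tiny hand-made frame stands in for the downloaded data and its values are invented for illustration:

```python
import pandas as pd

# The one-liner in the post would have looked something like this
# (assumed, and long since deprecated):
#   from pandas.io.data import DataReader
#   aapl = DataReader("AAPL", "yahoo", start="2011-01-01")

# Offline stand-in for the downloaded data (values invented):
aapl = pd.DataFrame(
    {"Open": [329.0, 332.4, 330.7],
     "Close": [331.3, 330.0, 334.1],
     "Volume": [1114500, 998300, 1205900]},
    index=pd.to_datetime(["2011-01-03", "2011-01-04", "2011-01-05"]),
)

print(aapl.head())        # inspect the data
# aapl["Close"].plot()    # the price-over-time plot (needs matplotlib)
```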
Let’s take a look at the data that we have downloaded:
And here is a plot showing how the price changes over time:
For a detailed review of the features, you can take a look at the library documentation, or at the following materials:
QSTK is an open-source library for portfolio construction and management. It seems to be mostly an educational tool, though it is also useful for rapid prototyping.
We are building the QSToolKit primarily for finance students, computing students, and quantitative analysts with programming experience. You should not expect to use it as a desktop app trading platform. Instead, think of it as a software infrastructure to support a workflow of modeling, testing and trading. (from QSTK Wiki)
The only reason I mention this library is that it was required for the programming assignments in the Computational Investing MOOC on Coursera. The course was prepared by Dr. Tucker Balch of Georgia Tech, who is a lead developer of QSTK.
Copula-based clustering is a kind of model-based clustering in which each cluster is modeled as a set of realizations of one random variable. To model k clusters, a k-dimensional copula is used, coupling the k cluster variables. Different kinds of copulas can be used to model the data; it is up to the researcher to decide which to choose.
The method is particularly useful for datasets with high dependence between different clusters; its performance degrades when that dependence is weaker.
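None of the code below is from the post; as a hedged illustration of the basic building block involved, here is how one might sample from a 2-dimensional Gaussian copula with SciPy, where the correlation parameter `rho` plays the role of the dependence between two cluster variables:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def gaussian_copula_sample(rho, n):
    """Draw n points on [0, 1]^2 whose dependence follows a Gaussian copula."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    z = rng.multivariate_normal(np.zeros(2), cov, size=n)  # correlated normals
    # Probability integral transform: uniform marginals, dependence preserved.
    return stats.norm.cdf(z)

u = gaussian_copula_sample(rho=0.8, n=5000)
print(u.mean(axis=0))  # both marginals are uniform, so the means are near 0.5
```

The point of the copula is visible here: all the dependence lives in the joint structure, while each marginal stays uniform, so arbitrary marginal distributions can be attached afterwards.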
I attended the “Python for Data Analysis” meeting organised by Data Science London. There were two main talks: one by Didrik Pinte from Enthought and one by Wes McKinney, the creator of pandas.
About half of the audience had already used NumPy, though I think only a couple of people had gone deep into C integration and memory optimizations. So the talk was a mix of introductory material and demonstrations of Cython code and profiling tools.
Interestingly, when someone decided to port NumPy to .NET, it did not perform efficiently because of unpredictable garbage collection in .NET.
Didrik also showed how the memory monitor from Pikos works.
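To make the NumPy point concrete, here is a hedged micro-benchmark (my own, not from the talk) showing the gap between a pure-Python loop and the equivalent NumPy call, which runs in compiled C:

```python
import timeit
import numpy as np

x = np.random.default_rng(1).random(100_000)

def python_sum(values):
    # The same reduction NumPy performs, but executed by the interpreter.
    total = 0.0
    for v in values:
        total += v
    return total

t_loop = timeit.timeit(lambda: python_sum(x), number=5)
t_numpy = timeit.timeit(lambda: x.sum(), number=5)
print(f"Python loop: {t_loop:.4f}s, NumPy: {t_numpy:.4f}s")
```

The C integration and Cython work mentioned above is about closing exactly this kind of gap for code that cannot be expressed as a single NumPy call.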
Python for Data Analysis
by Wes McKinney (@wesmckinn), a former quant, the author of pandas (the powerful Python library for data analysis), and the author of the book “Python for Data Analysis”
Most of the talk was done in an IPython notebook. Using the MovieLens dataset as an example, Wes showed various pandas functions: data slicing, merge, map, etc. The library is also good for data munging, cleaning, and preparation.
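A hedged reconstruction of the kind of operations demonstrated (the real demo used the MovieLens dataset; the tiny frames, columns, and values below are invented for illustration):

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2, 3],
                      "gender": ["F", "M", "M"]})
ratings = pd.DataFrame({"user_id": [1, 1, 2, 3],
                        "movie": ["Toy Story", "Heat", "Heat", "Alien"],
                        "rating": [5, 4, 3, 4]})

merged = ratings.merge(users, on="user_id")            # merge two tables
heat = merged[merged["movie"] == "Heat"]               # slicing by condition
labels = merged["rating"].map({5: "great", 4: "good", 3: "ok"})  # map values
by_gender = merged.groupby("gender")["rating"].mean()  # quick aggregation
print(by_gender)
```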
He said that further improvements to the library are driven by use cases in which people try to open a 5 GB Kaggle dataset and the system ends up using 20 GB of memory.
Rmagic: running R code from Python. Useful, for example, for the ggplot2 library, which has no match in the Python world.
"Python for Data Analysis" book is an introduction to pandas with working code examples, a better learning material than plain documentation. Print copies are not available yet; books will probably appear for Strata New York. (I have just checked my O’Reilly account, my copy is not listed as an “early release” anymore).
The first speaker, Didrik, added two reasons to use pandas:
The Open Interests Europe Hackathon will take place in London on November 24 and 25.
Open Interests Europe brings together developers, designers, activists, journalists and other geeks for two days of learning, fun, intense hacking and app building.
How EU money is spent is an issue that concerns everyone who pays taxes to the EU. As the influence of Brussels lobbyists grows, it is increasingly important to draw the connections between lobbying, policy-making and funding. Journalists and activists need browsable databases, tools and platforms to investigate lobbyists’ influence and where the money goes in the EU. Join us and help build these tools!
The Hackathon challenges include Lobbying Transparency and Fish Subsidies.
Organised by the Open Knowledge Foundation and the European Journalism Centre, sponsored by Knight-Mozilla OpenNews.
For more details, please see the event website, and register at http://openinterests.eventbrite.com