traims' blog

Jan 16

Copula-based clustering of datasets with multivariate dependence structure: a brief overview and links

I have just attended a talk on copula-based clustering algorithms, which presented the work of Francesca Marta Lilja Di Lascio and Simone Giannerini.

Copula-based clustering is a kind of model-based clustering where each cluster is modeled as a set of realizations of one random variable. To model k clusters, k-dimensional copulas are used. Different kinds of copulas can be used for modelling the data; it’s up to the researcher to decide which ones to choose.
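To make the idea of a copula a bit more concrete, here is a minimal sketch of drawing from a 2-dimensional Gaussian copula (the k = 2 case) using only the Python standard library. This is not the CoClust algorithm, and the function names `gaussian_copula_sample` and `pearson` are my own; it only illustrates how a copula ties uniform margins together with a chosen dependence strength.

```python
import math
import random
from statistics import NormalDist


def gaussian_copula_sample(n, rho, seed=0):
    """Draw n pairs (u1, u2) from a bivariate Gaussian copula with parameter rho."""
    rng = random.Random(seed)
    phi = NormalDist().cdf  # standard normal CDF
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # Build z2 so that the correlation between z1 and z2 is rho.
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        # Passing each coordinate through the CDF gives uniform margins
        # while keeping the dependence between the two coordinates.
        pairs.append((phi(z1), phi(z2)))
    return pairs


def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid extra dependencies."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


strong = gaussian_copula_sample(2000, rho=0.9)
weak = gaussian_copula_sample(2000, rho=0.0, seed=1)
print("dependence with rho=0.9:", round(pearson(*zip(*strong)), 2))
print("dependence with rho=0.0:", round(pearson(*zip(*weak)), 2))
```

The first number should come out high and the second close to zero, even though both samples have the same uniform margins.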

The method is particularly useful for datasets with a high dependence between different clusters. Its performance degrades when dependence is lower. 

The R package CoClust, which implements a new clustering method based on copulas, is available on CRAN: http://cran.r-project.org/web/packages/CoClust/ (R Help)

References:

  1. Di Lascio, F. M. L., Giannerini, S. (2012). A Copula-Based Algorithm for Discovering Patterns of Dependent Observations. Journal of Classification, Volume 29, Issue 1, pp. 50-75.
  2. Di Lascio, F. M. L. (2008). Analyzing the dependence structure of microarray data: A copula-based approach. Ph.D. Dissertation, University of Bologna, Bologna, Italy. Available at http://amsdottorato.cib.unibo.it/670/

Oct 19

Python for Data Analysis, 18 Oct 2012, London

I attended the “Python for Data Analysis” meeting organised by Data Science London. There were two main talks: one by Didrik Pinte from Enthought and one by Wes McKinney, creator of pandas.

NumPy, the Python foundation for number crunching

by Didrik Pinte, Python contributor to QuantLib (a library for quant finance) and MD of Enthought, developer of EPD, the scientific computing Python platform.

(c) by Data Science London @ds_ldn

About half of the audience had already used NumPy, though I think only a couple of people had gone deep into C integration and memory optimization. So the talk was a mix of an introduction with demonstrations of Cython code and profiling tools.

Interestingly, when someone decided to port NumPy to .NET, it didn’t work efficiently because of unpredictable garbage collection in .NET.

Didrik also showed how the memory monitor from Pikos works.

Python for Data Analysis

by Wes McKinney (@wesmckinn), former quant, author of pandas (the powerful Python library for data analysis) and of the book “Python for Data Analysis”

(c) by Data Science London

Most of the talk was done in the IPython notebook. Using a MovieLens dataset as an example, Wes showed different pandas functions: data slicing, merge, map, etc. The library is also good for data munging, cleaning and preparation.
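For readers who have not seen pandas, here is a small sketch of the three operations mentioned above (slicing, merge and map). The two tables are tiny made-up stand-ins shaped loosely like MovieLens data, not the real dataset and not the code from the talk.

```python
import pandas as pd

# Made-up tables shaped loosely like MovieLens (not the real data).
movies = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Toy Story", "Heat", "Casino"],
})
ratings = pd.DataFrame({
    "user_id": [10, 10, 11, 12],
    "movie_id": [1, 2, 1, 3],
    "rating": [5, 3, 4, 2],
})

# Slicing: keep only the rows where the rating is 3 or higher.
good = ratings[ratings["rating"] >= 3]

# Merge: attach movie titles to ratings via the shared movie_id column.
merged = pd.merge(ratings, movies, on="movie_id")

# Map: apply a function to every value of a column.
merged["liked"] = merged["rating"].map(lambda r: r >= 4)

print(good)
print(merged)
```

The merge is a database-style join, so the same call works unchanged when the tables have thousands of rows instead of a handful.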

He said they are working on further improvements to the library, prompted by use cases where people try to open a 5 GB Kaggle dataset and the system uses 20 GB of memory.

Rmagic library: running R code in Python. Useful e.g. for the ggplot2 library, which has no equivalent in the Python world.

The “Python for Data Analysis” book is an introduction to pandas with working code examples, and better learning material than plain documentation. Print copies are not available yet; they will probably appear in time for Strata New York. (I have just checked my O’Reilly account, and my copy is no longer listed as an “early release”.)

The first speaker, Didrik, added two reasons to use pandas:

  1. its excellent documentation; 
  2. vbench, performance benchmarking for pandas.

Oct 16

OKFN and the European Journalism Centre

The Open Interests Europe Hackathon will take place in London on November 24 and 25.

Open Interests Europe brings together developers, designers, activists, journalists and other geeks for two days of learning, fun, intense hacking and app building.

How EU money is spent is an issue that concerns everyone who pays taxes to the EU. As the influence of Brussels lobbyists grows, it is increasingly important to draw the connections between lobbying, policy-making and funding. Journalists and activists need browsable databases, tools and platforms to investigate lobbyists’ influence and where the money goes in the EU. Join us and help build these tools!

The Hackathon challenges include Lobbying Transparency and Fish Subsidies.

Organised by the Open Knowledge Foundation and the European Journalism Centre, sponsored by Knight-Mozilla OpenNews.

For more details, please see the event website:
http://okfnlabs.org/events/hackdays/lobbying.html

and register at http://openinterests.eventbrite.com

Sep 14

Yuri Suzuki: London Underground Circuit Map 

images © hitomi kai yoda

Japanese designer Yuri Suzuki has sent DesignBoom images of his ‘London underground circuit maps’ project, developed as part of the Designers in Residence program at the London Design Museum, on show until January 13th, 2013.

Aug 05

Palo Alto looks to use open data to embrace ‘city as a platform’ — O’Reilly Radar

“The city of Palo Alto in California joined over a dozen cities around the United States and globe when it launched its own open data platform.

The city initially published open datasets that include the 2010 census data, pavement condition, city tree locations, park locations, bicycle paths and hiking trails, creek water level, rainfall and utility data. Open data about Palo Alto budgets, campaign finance, government salaries, regulations, licensing, or performance — which would all offer more insight into traditional metrics for government accountability — were not part of this first release.

The platform includes an application programming interface (API) which enables direct access through a RESTful interface to open government data published in a JSON format. Datasets can also be embedded like YouTube videos.”

Palo Alto looks to use open data to embrace ‘city as a platform’, O’Reilly Radar
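
As a rough idea of what working with such an API could look like, the sketch below builds a request URL and parses a JSON payload. Everything specific in it is hypothetical: the endpoint `data.example.org`, the query parameters, the field names and the sample rows are made up for illustration and are not taken from Palo Alto's platform. No network request is made.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint; the article does not give the platform's real URLs.
BASE_URL = "https://data.example.org/api/datasets/tree-locations"


def build_url(base, **params):
    """Append query parameters to a base URL in a stable (sorted) order."""
    return base + "?" + urlencode(sorted(params.items()))


# Made-up JSON standing in for what a dataset endpoint might return.
sample_response = """
{"title": "City tree locations",
 "rows": [{"species": "oak", "lat": 37.44, "lon": -122.14},
          {"species": "elm", "lat": 37.43, "lon": -122.16}]}
"""


def species_counts(payload):
    """Parse the JSON text and count rows per species."""
    data = json.loads(payload)
    counts = {}
    for row in data["rows"]:
        counts[row["species"]] = counts.get(row["species"], 0) + 1
    return counts


print(build_url(BASE_URL, limit=2, format="json"))
print(species_counts(sample_response))
```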
