## Java wrapper for the Fortran L-BFGS-B algorithm

I have moved the source of my L-BFGS-B wrapper library from my university page to a more convenient hosting place: https://github.com/mkobos/lbfgsb_wrapper .

The wrapper is a Java library proxy for the L-BFGS-B quasi-Newton optimization library. The original implementation is written in the Fortran 77 language and is available on the authors’ original distribution website. The wrapper library uses JNI with SWIG to create the proxy code. A by-product of the Java wrapper is a quite low-level C wrapper for the Fortran code. In short, the structure of the wrapping layers looks as follows: `Fortran -> C -> JNI by SWIG -> Java`.

Posted in Java | Tagged , , | 8 Comments

## Java PCA transformation library

I wrote a Java library implementing Principal Component Analysis data transformation. About two years ago I needed such an algorithm implemented in Java and I was not able to find one, hence this library.

The library along with a more detailed description as well as binary and source packages is placed on github: https://github.com/mkobos/pca_transform

Posted in Java | Tagged | 4 Comments

## Creating data frame row-by-row in R

When manipulating data in R, I often find myself in a situation where I have to create a new data frame in an iterative row-by-row way. There are approaches to do it this way, but a natural question is which one of them is the best, or more specifically, which one is the fastest?

To answer this question, I checked experimentally how different approaches fare on data frames of different size. Below, I present couple of methods along with a sample code that creates a data frame with `n` rows for each method.

## Methods

The first method, which I called “one by one”, uses `rbind` to add a new row to the data frame.

```create.one.by.one <- function(n){
data <- data.frame()
for(i in 1:n) data <- rbind(data, row)
return(data)
}```

The second one, which I called “from list”, inserts all of the rows into a list and after that it creates a single data frame by calling `do.call(rbind, list.of.rows)`

```create.from.list <- function(n){
data.l <- list()
for(i in 1:n) data.l <- c(data.l, list(row))
return(do.call(rbind, data.l))
}```

In the third method called “preallocated”, we create a new data frame with an appropriate number of rows and then fill each row in consecutive steps. The initial values of each column are the default values of data types used.

```create.preallocated <- function(n){
data <- data.frame(name=character(n), age=numeric(n), height=numeric(n),
id=numeric(n), friends.no=numeric(n), stringsAsFactors=FALSE)
for(i in 1:n) data[i, ] <- row
return(data)
}```

The fourth method called “preallocated with NAs” is analogous to the third one, but instead of using the default values in data frame, we use `NA`s.

```create.preallocated.with.NAs <- function(n){
data <- data.frame(name=rep(NA, n), age=rep(NA, n), height=rep(NA, n),
id=rep(NA, n), friends.no=rep(NA, n), stringsAsFactors=FALSE)
for(i in 1:n) data[i, ] <- row
return(data)
}```

## Results and conclusions

The results of a simple performance test are presented in the figure below. The source code that executes the test and produces the plot is here: test.R Creating data frame on a row-by-row basis using different methods.

As can be seen, both “preallocated” and “preallocated with NAs” methods are the fastest. Their drawback is that you have to know upfront how many rows the constructed data frame will have. Next, as the speed is concerned, is the “from list” method which seems to be a sensible choice when you don’t know the number of rows of the data frame. The slowest one is the “one by one” method.

Posted in R | Tagged | 7 Comments

## Concurrent tree crawler

I needed an application to crawl a web site containing archive issues of certain magazines and download selected articles.

To access the site, you had to log in using a web form. The web site was constructed in a hierarchical tree-like way: a magazine page consisted of many links to single issues, each issue consisted of links to single articles. On each level, a single page with links contained only a fraction of all of the links to pages of the lower level (e.g. 50 links). You were able to get more links by clicking the “next” or “previous” links (see the figure below). Each node of the web site tree-like structure can correspond to many linked pages

I was not able to find a crawler that could be easily applied to this problem, thus I decided to write one myself. After using it and downloading the data I needed, I decided to make it more general so other people could use it.

The algorithmic heart of the program is a generic concurrent tree-crawling algorithm described here: https://github.com/mkobos/tree_crawler/wiki/algorithm/index.pdf. On this foundation, a web site crawler is built. It can be extended in 3 main ways: 1) to crawl some general tree structure, 2) to crawl a web site with tree-like structure, 3) to crawl a web site with a structure similar to my original magazine archive. It has some nice useful features like a schedule of program activity, logging, dealing with different kinds of errors etc.

• The source of the library containing the implementation of the program along with some more detailed description is placed here: http://github.com/mkobos/tree_crawler. The python package can be also downloaded and installed from the PyPi repository (e.g. by executing `pip install --user concurrent_tree_crawler` or `easy_install --user concurrent_tree_crawler`).
• A more detailed usage of the library and description of the generic tree crawling library is placed here: http://github.com/mkobos/tree_crawler/wiki.
Posted in Python | Tagged | 1 Comment

## Reader-Writer lock with priority for writers in Python

I needed a reader-writer lock to use in the second readers-writers problem with python threads. In this problem, many readers can simultaneously access a share, and a writer has an exclusive access to this share. Additionally, the following constraints should be met: 1) no reader should be kept waiting if the share is currently opened for reading unless a writer is also waiting for the share, 2) no writer should be kept waiting for the share longer than absolutely necessary.

Surprisingly, I wasn’t able to find an implementation of such a synchronization object that met some basic common-sense requirements: 1) simple, clear implementation, 2) tested with some basic unit tests. So, I wrote my own version — it is placed here: http://code.activestate.com/recipes/577803-reader-writer-lock-with-priority-for-writers .

Posted in Python | 4 Comments

Post contents

## Introduction

I implemented a simple algorithm that computes distances between all pairs of given set of n-dimensional points. The algorithm is implemented in C++ and Java. To communicate with C++ code, I use R’s Rcpp package. To communicate with Java, I use two methods. The first one is a simple approach where the communication is made through temporary disk files containing the input and output data, and the Java program is called through the `system()` function. In the second approach, the communication is made with help of the rJava package.

## Code excerpts

The Java code used through the rJava package looks as follows:

```public class CalculateDistances {
public static double[] run(double[][] pts){
double[] distances = new double[((pts.length-1)*pts.length)/2];
int k = 0;
for(int i = 0; i < pts.length-1; i++){
for(int j = i+1; j < pts.length; j++){
distances[k] = calculateDistance(pts[i], pts[j]);
k++;
}
}
return distances;
}

private static double calculateDistance(double[] pt0, double[] pt1){
double sum = 0;
for(int d = 0; d < pt0.length; d++){
sum += Math.pow(pt0[d]-pt1[d], 2);
}
return Math.sqrt(sum);
}
}```

The C++ code used through the Rcpp package is the following:

``` #include "calculate_distances.h"

#include

double calculateDistance(const Rcpp::NumericVector& pt0,
const Rcpp::NumericVector& pt1){
double sum = 0;
for(int d = 0; d < pt0.size(); d++){
sum += pow(pt0[d]-pt1[d], 2);
}
return sqrt(sum);
}

SEXP calculateDistances(SEXP matrix){
using namespace Rcpp ;

NumericMatrix ptsMatrix(matrix); // create Rcpp wrapper for matrix in SEXP

// crete numeric vector of appropriate length
NumericVector distances( (ptsMatrix.nrow()-1)*ptsMatrix.nrow()/2 );
int k = 0;
for(int i = 0; i < ptsMatrix.nrow()-1; i++){
for(int j = i+1; j < ptsMatrix.nrow(); j++){
distances[k] = calculateDistance(
ptsMatrix.row(i), ptsMatrix.row(j));
k++;
}
}
return distances;
}```

## Test

To test these methods, I ran the program on a 10-dimensional set of points drawn randomly from a normal distribution. The number of points in a single set varied from 1 to 3000. Each run was repeated 5 times and then averaged to obtain more reliable results. The results (number of points vs. time taken by the method) are shown in the following plot. The bars correspond to standard deviation of a given run. Comparison of different methods computing the distances between points: java – implementation in Java with the help of rJava, cpp – implementation in C++ with the help of Rcpp, java_file – implementation in Java with the use of temporary disk files.

## Analysis of the test results

It can be seen than the method that uses C++ is the fastest one. Of course, it was expected since C++ programs are generally more efficient than Java, but that’s not the whole story. Apart from that, the Rcpp package allows to write a more efficient code responsible for communication between the native language and R because in Rcpp we do not copy the whole objects while passing them to the native code (as we do in rJava). Rcpp allows and encourages using its thin and effective wrappers for R objects without unnecessary copying of their contents.

On the other hand, the Java code is more readable for a layman, since it doesn’t use any nonstandard packages to implement the functionality. This is not the case with the C++ version where we use the `Rcpp.h` header to access R-specific constructs.

What was unexpected in this comparison is that the method based on temporary files is faster for smaller data sets. This is probably due to expensive but hidden bookkeeping connected with the rJava package.

## Source code

The source code of these methods along with a makefile that builds appropriate packages and installs them in the system and the R code that generates the plot is placed here:

The program was tested with R 2.12.1 on Ubuntu 11.04 system. The code is written in such way that it should be pretty easy to use it as a template for R connection code when implementing other algorithms in Java or C++.

Posted in R | Tagged , , | 6 Comments

## First entry

Welcome to my blog. I plan to place here some programming-related things that were useful to me and hopefully will also be of use to a wider audience.