Creating data frame row-by-row in R

When manipulating data in R, I often find myself having to build a new data frame iteratively, one row at a time. There are several ways to do this, so a natural question is which of them is the best or, more specifically, which is the fastest.

To answer this question, I checked experimentally how different approaches fare on data frames of different sizes. Below, I present four methods, each with sample code that creates a data frame with n rows.
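The functions below rely on a variable named row that holds the single row to be appended; its definition is not shown here. A minimal sketch, assuming row is a one-row data frame whose columns match those used in the preallocated methods (the values are made up purely for illustration):

```r
# Hypothetical example row: the column names mirror those used in the
# preallocated methods, but the values are invented for illustration.
row <- data.frame(name = "John", age = 30, height = 180,
                  id = 1, friends.no = 5, stringsAsFactors = FALSE)

# Inspect the structure: one observation of five variables.
str(row)
```

Note that base R also has a function called row, so if the variable is not defined first, the snippets will pick up that function instead and fail.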

Methods

The first method, which I called “one by one”, uses rbind to append a new row to the data frame in each iteration.

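create.one.by.one <- function(n){
  # `row` holds the single row to append and must be defined beforehand
  data <- data.frame()
  # rbind builds a new data frame on every iteration;
  # seq_len(n), unlike 1:n, also behaves correctly for n = 0
  for(i in seq_len(n)) data <- rbind(data, row)
  return(data)
}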

The second one, which I called “from list”, first collects all of the rows in a list and then creates a single data frame by calling do.call(rbind, data.l).

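create.from.list <- function(n){
  data.l <- list()
  # wrap each row in list() so that it is stored as one list element
  for(i in seq_len(n)) data.l <- c(data.l, list(row))
  # bind all collected rows in a single rbind call
  return(do.call(rbind, data.l))
}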
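To make the last step more concrete: do.call passes the elements of a list as separate arguments to a function, so do.call(rbind, data.l) behaves like one rbind call with every row written out by hand. A small sketch with three hypothetical one-row data frames (the names rows, combined and by.hand are only for illustration):

```r
# Three hypothetical one-row data frames standing in for collected rows.
rows <- list(
  data.frame(id = 1, name = "a", stringsAsFactors = FALSE),
  data.frame(id = 2, name = "b", stringsAsFactors = FALSE),
  data.frame(id = 3, name = "c", stringsAsFactors = FALSE)
)

# do.call passes the list elements as separate arguments to rbind ...
combined <- do.call(rbind, rows)

# ... so it matches the call written out explicitly.
by.hand <- rbind(rows[[1]], rows[[2]], rows[[3]])

print(identical(combined, by.hand))
print(combined)
```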

In the third method, called “preallocated”, we create a new data frame with the required number of rows and then fill in the rows one after another. Each column is initialized with the default value of its type: an empty string for character columns and 0 for numeric ones.

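create.preallocated <- function(n){
  # character(n) yields n empty strings, numeric(n) yields n zeros
  data <- data.frame(name=character(n), age=numeric(n), height=numeric(n),
    id=numeric(n), friends.no=numeric(n), stringsAsFactors=FALSE)
  # overwrite the placeholder rows in place
  for(i in seq_len(n)) data[i, ] <- row
  return(data)
}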

The fourth method, called “preallocated with NAs”, is analogous to the third one, but instead of the type defaults, every column is initially filled with NAs.

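create.preallocated.with.NAs <- function(n){
  # rep(NA, n) creates logical columns; each one is coerced to its
  # final type when the first real value is assigned
  data <- data.frame(name=rep(NA, n), age=rep(NA, n), height=rep(NA, n),
    id=rep(NA, n), friends.no=rep(NA, n), stringsAsFactors=FALSE)
  for(i in seq_len(n)) data[i, ] <- row
  return(data)
}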
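Because a plain NA is logical, the columns above only get their final types once real values are assigned. A possible variant, which is not one of the methods I measured, declares the types upfront with R's typed NA constants. The function name, the extra row argument and the example values below are assumptions made for this sketch:

```r
# Sketch of a variant using typed NA constants (NA_character_, NA_real_),
# so that every column has its final type from the start.
create.preallocated.with.typed.NAs <- function(n, row){
  data <- data.frame(name=rep(NA_character_, n), age=rep(NA_real_, n),
    height=rep(NA_real_, n), id=rep(NA_real_, n),
    friends.no=rep(NA_real_, n), stringsAsFactors=FALSE)
  for(i in seq_len(n)) data[i, ] <- row
  return(data)
}

# Hypothetical example row (values invented for illustration).
example.row <- data.frame(name="John", age=30, height=180, id=1,
  friends.no=5, stringsAsFactors=FALSE)

result <- create.preallocated.with.typed.NAs(3, example.row)

# Show the class of each column after filling.
print(sapply(result, class))
```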

Results and conclusions

The results of a simple performance test are presented in the figure below. The source code that executes the test and produces the plot is here: test.R

Figure: Creating a data frame on a row-by-row basis using different methods.

As can be seen, the “preallocated” and “preallocated with NAs” methods are the fastest. Their drawback is that you have to know upfront how many rows the constructed data frame will have. Next in terms of speed is the “from list” method, which seems to be a sensible choice when you don’t know the number of rows in advance. The slowest is the “one by one” method.
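The measurements themselves come from test.R, which is not reproduced here. As a rough sketch of how a similar comparison could be run with base R's system.time, the block below times three of the methods for a few values of n. The example row, the helper time.methods and the chosen sizes are assumptions made for this sketch, and the timings will vary between machines and R versions:

```r
# Hypothetical example row (values invented for illustration).
row <- data.frame(name="John", age=30, height=180, id=1, friends.no=5,
  stringsAsFactors=FALSE)

create.one.by.one <- function(n){
  data <- data.frame()
  for(i in seq_len(n)) data <- rbind(data, row)
  data
}

create.from.list <- function(n){
  data.l <- list()
  for(i in seq_len(n)) data.l <- c(data.l, list(row))
  do.call(rbind, data.l)
}

create.preallocated <- function(n){
  data <- data.frame(name=character(n), age=numeric(n), height=numeric(n),
    id=numeric(n), friends.no=numeric(n), stringsAsFactors=FALSE)
  for(i in seq_len(n)) data[i, ] <- row
  data
}

# Time one call of each method for a given n;
# "elapsed" is the wall-clock time in seconds.
time.methods <- function(n){
  c(one.by.one   = system.time(create.one.by.one(n))[["elapsed"]],
    from.list    = system.time(create.from.list(n))[["elapsed"]],
    preallocated = system.time(create.preallocated(n))[["elapsed"]])
}

# One column per data frame size, one row per method.
sizes <- c(100, 200, 400)
timings <- sapply(sizes, time.methods)
colnames(timings) <- paste0("n=", sizes)
print(timings)
```

A single run per size is noisy; repeating each measurement several times and averaging would give steadier numbers.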


Responses to Creating data frame row-by-row in R

  1. newfuntek says:

    Have you tried plyr package? http://www.jstatsoft.org/v40/i01/paper

    • Mateusz says:

      I haven’t tried it yet. Here, I generally focused my attention on standard R functions, but it might be interesting to compare them with plyr’s.

  2. MusX says:

    I can recommend data.table package

  3. keremw says:

    Hi but what type of data are you using for “row”
    is it a list?
    I’m trying to use your way but it fails to append my list

