Splitting Intervals in a Data Frame: A Step-by-Step R Solution

In this article, we show how to split the intervals in a data frame into pieces of equal length while keeping the information attached to each interval. The solution uses only base R.

Introduction

Suppose you have a data frame of intervals, each given by a start and an end coordinate together with some associated values. The intervals cover 1, 2, 4, 6, 8, or more positions. You want to split every interval that covers more than two positions into consecutive pieces of two positions each, and have each piece keep the values of the interval it came from.

For example, let’s consider the following data frame:

chr  start end   meth   cov
chr1 16136 16136 100.00  1.0
chr1 16137 16138 100.00  4.0
chr2 16139 16142 100.00  4.5
chr2 16243 16246 100.00 10.0
chr2 16247 16250  83.33  6.0
chr3 16251 16256  50.00  2.0

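In the result, each piece should carry the chr, meth, and cov values of the interval it came from. For example, the last row (16251 to 16256) should become three rows: 16251 to 16252, 16253 to 16254, and 16255 to 16256.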

A Base R Solution

The following solution uses only base R. First, we define a helper function, seqr(), that takes a range given as two values (start and end) and returns the full sequence between them.

seqr <- function(x) {
  seq(x[[1]], x[[2]])
}
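
As a quick check of what seqr() returns, the sketch below calls it on a plain two-element vector and on a one-row data frame slice. The input values and the name rng are chosen for illustration only.

```r
# Same helper as above, repeated so the snippet runs on its own
seqr <- function(x) {
  seq(x[[1]], x[[2]])
}

# A two-element vector expands to every integer in the range
full <- seqr(c(16251L, 16256L))
full

# A one-row data frame slice works too, because [[ extracts each column
rng <- data.frame(start = 16137L, end = 16138L)  # illustrative values
pair <- seqr(rng)
pair
```

Because the helper indexes with [[, it accepts both atomic vectors and the one-row data frame slices that the solution passes in.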

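Then, assuming the column order of the example (chr, start, end, meth, cov), we loop over the rows with lapply(). For each row, seqr() expands start and end into a sequence, which is filled row-wise into a two-column matrix so that each matrix row is one sub-interval. cbind() then combines this matrix with the remaining columns, which are recycled to the number of sub-intervals. Finally, do.call(rbind, ...) stacks the per-row results into a single data frame.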

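res <- do.call(rbind,
  lapply(seq_len(nrow(dat)), function(i) {
    cbind(chr = dat[i, 1],
          # expand start..end and fill pairs row-wise: one row per sub-interval
          matrix(seqr(dat[i, 2:3]), ncol = 2, byrow = TRUE,
                 dimnames = list(NULL, names(dat)[2:3])),
          # meth and cov are recycled across the new rows
          dat[i, 4:5], row.names = NULL)
  }))
res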
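
To make the reshaping step concrete, here is a minimal sketch for a single interval, using the values from the last row of the example. The names pos, m, and piece are illustrative and not part of the original solution.

```r
# Positions covered by one interval (last row of the example data)
pos <- seq(16251L, 16256L)

# Fill a two-column matrix row by row: each row is one sub-interval
m <- matrix(pos, ncol = 2, byrow = TRUE,
            dimnames = list(NULL, c("start", "end")))
m

# Binding with the remaining columns recycles the single chr/meth/cov values
piece <- cbind(chr = "chr3", m, data.frame(meth = 50, cov = 2),
               row.names = NULL)
piece
```

Note that this relies on each interval covering either one position or an even number of positions. For an odd count greater than one, the sequence would not fill the two-column matrix evenly, and matrix() would warn and recycle values.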

Data

The solution above expects the example data in a data frame named dat. Define it as follows before running the code:

dat <- structure(list(chr = c("chr1", "chr1", "chr2", "chr2", "chr2",
"chr3"), start = c(16136L, 16137L, 16139L, 16243L, 16247L,
16251L), end = c(16136L, 16138L, 16142L, 16246L, 16250L,
16256L), meth = c(100, 100, 100, 100, 83.33, 50),
cov = c(1, 4, 4.5, 10, 6, 2)), row.names = c(NA,
-6L), class = "data.frame")
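
Before running the solution, it can help to check how many positions each interval covers and how many rows to expect afterwards. The sketch below rebuilds the same data with data.frame() for readability (intended as an equivalent rewrite of the structure() call). The names len and n_expected are illustrative.

```r
dat <- data.frame(
  chr   = c("chr1", "chr1", "chr2", "chr2", "chr2", "chr3"),
  start = c(16136L, 16137L, 16139L, 16243L, 16247L, 16251L),
  end   = c(16136L, 16138L, 16142L, 16246L, 16250L, 16256L),
  meth  = c(100, 100, 100, 100, 83.33, 50),
  cov   = c(1, 4, 4.5, 10, 6, 2)
)

# Number of positions each interval covers (both ends inclusive)
len <- dat$end - dat$start + 1L
len

# Splitting into pairs yields ceiling(len / 2) rows per interval
n_expected <- sum(ceiling(len / 2))
n_expected
```

If any value in len is odd and greater than one, the matrix-based split would not divide that interval cleanly, so this check is a cheap safeguard.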

Conclusion

In this article, we split the intervals of a data frame into pieces of equal length while keeping the information attached to each interval, using only base R. A helper function, seqr(), expands each start/end pair into a sequence. Filling that sequence row-wise into a two-column matrix yields the sub-intervals, and cbind() recycles the remaining columns across them. do.call(rbind, ...) then stacks the per-row results into a single data frame.

The same steps apply to any data frame with this column layout whose intervals cover one position or an even number of positions.


Last modified on 2023-06-05