How to Load MediaLab Data Files into R

Did you know MediaLab has many operations that produce CSV data files? This post reviews how to load a folder of CSV files into R data frames.

Disclaimer

Please note: we are not experts in using R, and we pulled this example together from Internet sources. However, it is running a production data processing pipeline in R. We processed 100s of hours of video to produce the data, using the Contour Data operation.

File System

Let’s start with how to handle the file system, and get a list of file paths to process. The source videos had the starting time-and-date in the file name, so we used some regex magic to leverage that. We needed to adjust the TimeIndex value contained in each CSV file to an absolute timestamp.

# Root Folder is set here (NO Backslashes please)
rootFolder <- "C:/Users/xxxxx/Pictures/Saved Pictures/MediaLab/Data"
# Adjust the PATTERN (regex) to match your file name pattern
fileList <- list.files(path=rootFolder, pattern="Contour.*AXIS.*csv$", full.names=TRUE)
# Extract the formatted date portion of file names (2 steps)
fmatches <- regexpr("\\d{4}-\\d{2}-\\d{2}_\\d{2}_\\d{2}_\\d{2}", fileList)
ftimestamps <- substr(fileList,fmatches,fmatches+attr(fmatches,"match.length")-1)

That gets us 2 parallel vectors of the file names and the timestamps for each one. Now let’s just zip it up and loop over the list. Please note: there may be better ways to do this, remember the disclaimer above. Feel free to post improvements in the comments. This loop simply loads and plots the data frames using ggplot.

fileInfo <- cbind(fileList, ftimestamps)
for(fi in 1:nrow(fileInfo)) {
	fname <- fileInfo[fi,1]
	ts <- fileInfo[fi,2]
	chart <- process_file(fname, ts)
	# each plot in a new window
	dev.new()
	# must print or it doesn't display
	print(chart)
	# save to a file
	output <- sprintf("%s.png", fname);
	ggsave(output, plot=chart)
}

Process Item

Just for more context, let’s briefly look at the process_file function, because this is the one that actually loads those CSV files. This is an example; do whatever you need for each CSV at this point.

process_file <- function(filename, starting) {
	dataset <- load_csv(filename, starting)
	summary <- create_summary(dataset)
	chart_labels <- create_labels(summary)
	chart <- create_chart(summary, starting, chart_labels)
	return(chart)
}

Load Data Frame

Let’s examine the load_csv function now.

load_csv <- function(filename, starting) {
	data <- read.csv(filename, stringsAsFactors=FALSE)
	#if you need to filter on area
	#data <- subset(data, Area >= 4)
	today <- strptime(starting, "%Y-%m-%d_%H_%M_%S")
	# NOTE assumes only one-digit day
	parsed <- ifelse(nchar(data$TimeIndex) > 8, substring(data$TimeIndex,first=3), data$TimeIndex)
	data$TimeOffset <- as.difftime(parsed, units="secs")
	# cant get this to work; need to extract dayoffset[[*]][1]
	#dayoffset <- ifelse(nchar(data$TimeIndex) > 8, strsplit(data$TimeIndex,"\\."),  strsplit(as.character("0.99:99:99"),"\\."))
	#data$DayOffset <- dayoffset
	# NOTE assumes only one-digit day
	data$DayOffset <- as.integer(ifelse(nchar(data$TimeIndex) > 8, substring(data$TimeIndex,first=1,last=1), "0"))
	data$OffsetTimeIndex <- data$TimeOffset + today
	data$HourOfDay <- as.integer(format(data$OffsetTimeIndex, "%H"))
	return(data)
}

The biggest issue in this function is the hoops to jump through to add the absolute starting date to the TimeIndex of each row of data, and for the purposes of this task, extract a few more “interesting” timestamp-based values for subsequent processing.

The main issue that must be worked around is the format of the TimeIndex column, when the source is longer than 24 hours. In our case the videos were 7 days long! That required some gymnastics. We also used the lubridate package for some of these tasks.

We could have done some of this “date massaging” in a spreadsheet application. The size of the data files was extreme, and Excel did not handle it.

In the meantime, there are updates to the TimeIndex output that make this easier within R!

Follow-up

As a result of this, we are improving the options for formatting the TimeIndex. Watch for this in the next package release!