Improving CSV processing code with laziness

By Yuji Yokoo, 10 Feb 2014

Enumerator::Lazy

Recently, I have had to analyse a very large CSV file and look for some lines containing specific values. In this post, I will explain how the lazy enumerator let me write simpler code.

Enumerator::Lazy is still relatively new to many Rubyists, including myself. In short, a lazy enumerator evaluates its elements only as they are needed. This is quite different from regular Enumerable methods such as map and select, which process the whole collection upfront (eagerly) and return an array. Laziness lets us write CSV processing code that is simple and still efficient, using the familiar enumerator methods.
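
To make the difference concrete, here is a minimal sketch that is not tied to CSV at all. It runs the same select over a small range, once eagerly and once lazily, and records which elements each version actually inspects. The variable names (eager_seen, lazy_seen) are only for illustration.

```ruby
# Record which numbers each pipeline inspects.
eager_seen = []
lazy_seen = []

# Eager: select visits all ten elements and builds an array,
# and only then does first(2) pick the first two results.
eager = (1..10).select { |n| eager_seen << n; n.even? }.first(2)

# Lazy: elements are pulled through the chain one at a time,
# and iteration stops as soon as two results have been found.
lazy = (1..10).lazy.select { |n| lazy_seen << n; n.even? }.first(2)

p eager            # [2, 4]
p eager_seen.size  # 10
p lazy             # [2, 4]
p lazy_seen        # [1, 2, 3, 4]
```

Both versions return the same result, but the lazy one stops after the fourth element.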

Example Problem

Suppose we have a large CSV like this:

ID,SKU ID,NAME,AVAILABILITY,MANUFACTURER,LINK,IMAGE LINK
"1","123ABC","Product 1","2","Foo Products","http://localhost/123abc","http://localhost/img/123abc.jpg"
"2","23456","Shoes 1","5","Foobar Shoes","http://localhost/23456","http://localhost/img/23456.jpg"
"3","123ABC-2","Product 2","0","Foo Products","http://localhost/123abc-2","http://localhost/img/123abc-2.jpg"

Let's pretend we had a lot more lines like these. Let's also say we want to find the rows with "Foobar Shoes" as the "MANUFACTURER", print the "SKU ID" and "NAME" of each, and stop after the first 5 matches. Although we could use other tools like grep and awk, we will focus on doing it with Ruby.

In Ruby, we might do this:

require 'csv'

rows = []
CSV.new(File.open('input.csv','r'), :headers => true).each do |row|
  rows << "#{row['SKU ID']} - #{row['NAME']}" if row['MANUFACTURER'] == "Foobar Shoes"
  break if rows.size >= 5
end

p rows

This works fine, but the filtering, the formatting, and the limit of 5 are all tangled together in one block. It would be easier to read if we expressed it in terms of select, map, and take.

We could do it this way:

require 'csv'

rows = CSV.new(File.open('input.csv','r'), :headers => true).select do |row|
  row['MANUFACTURER'] == "Foobar Shoes"
end.map do |row|
  "#{row['SKU ID']} - #{row['NAME']}"
end.take(5)

p rows

Although this is not any shorter, each step is now in a separate block, which makes it easier to read and maintain. However, there is a problem: it is eager. select reads every row of the file and builds an array of all the matching rows, and map then builds a second array, all before take(5) keeps only the first five. So it takes a lot more time to run, and uses more memory, than the previous example. This is a serious problem if your CSV file contains many rows, say 300,000 lines.
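
One way to see the cost is to count how many rows the eager chain touches. The sketch below does not use the real file. It assumes a made-up 1,000-row CSV built in memory with CSV.generate, in which every second row has "Foobar Shoes" as the manufacturer. The rows_read counter and the sample data are illustrative only.

```ruby
require 'csv'

# Hypothetical sample data: 1,000 rows, every second one by "Foobar Shoes".
data = CSV.generate do |csv|
  csv << ['ID', 'SKU ID', 'NAME', 'MANUFACTURER']
  1.upto(1000) do |i|
    csv << [i, "SKU#{i}", "Item #{i}", i.even? ? 'Foobar Shoes' : 'Foo Products']
  end
end

rows_read = 0

# Eager select: every row is parsed and tested before anything else happens.
matches = CSV.new(data, :headers => true).select do |row|
  rows_read += 1
  row['MANUFACTURER'] == "Foobar Shoes"
end

# map and take only run after select has built its full array.
result = matches.map { |row| "#{row['SKU ID']} - #{row['NAME']}" }.take(5)

p rows_read     # 1000
p matches.size  # 500
p result.size   # 5
```

All 1,000 rows are read and 500 matching rows are kept in an intermediate array, even though only five results are wanted.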

Introducing Laziness

This is exactly the type of problem the lazy enumerator is meant for. To be lazy, all we have to do is call lazy on the CSV object. This gives us a lazy enumerator, so the select, map, and take that follow are the lazy versions, which pass rows along one at a time instead of building arrays. We also have to call force at the end, because a lazy chain stays unevaluated until something asks for its results.

require 'csv'

rows = CSV.new(File.open('input.csv','r'), :headers => true).lazy.select do |row|
  row['MANUFACTURER'] == "Foobar Shoes"
end.map do |row|
  "#{row['SKU ID']} - #{row['NAME']}"
end.take(5).force

p rows

The only differences are the call to lazy on the CSV object and the call to force at the end. This version does not hold every matching row in memory at the same time, and it stops reading the file as soon as it has found five matches, so it runs much more efficiently than the last example.
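
The early exit can be checked with a row counter. The following is a sketch under a few assumptions: the input is a made-up 1,000-row file written to a temporary directory, every second row is a "Foobar Shoes" row, and the csv library is recent enough that CSV.foreach returns an enumerator when called without a block. CSV.foreach is used here instead of CSV.new(File.open(...)) because it also closes the file when iteration ends.

```ruby
require 'csv'
require 'tmpdir'

rows_read = 0
result = nil

Dir.mktmpdir do |dir|
  # Hypothetical sample file: 1,000 rows, every second one by "Foobar Shoes".
  path = File.join(dir, 'input.csv')
  CSV.open(path, 'w') do |csv|
    csv << ['ID', 'SKU ID', 'NAME', 'MANUFACTURER']
    1.upto(1000) do |i|
      csv << [i, "SKU#{i}", "Item #{i}", i.even? ? 'Foobar Shoes' : 'Foo Products']
    end
  end

  # Without a block, CSV.foreach returns an enumerator that we can make lazy.
  result = CSV.foreach(path, :headers => true).lazy.select do |row|
    rows_read += 1
    row['MANUFACTURER'] == "Foobar Shoes"
  end.map do |row|
    "#{row['SKU ID']} - #{row['NAME']}"
  end.take(5).force
end

p rows_read  # 10
p result     # ["SKU2 - Item 2", "SKU4 - Item 4", "SKU6 - Item 6", "SKU8 - Item 8", "SKU10 - Item 10"]
```

Only the first ten rows are tested before the chain stops. Replacing take(5).force with first(5) should give the same array, since first is an eager method that returns an array directly.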

Final Thoughts

The concept of laziness is more commonly seen in functional programming and may be unfamiliar to some Rubyists. But if we are using Ruby 2.0 or later, it is worth remembering that Enumerator::Lazy exists and reaching for it when we only need part of a large collection.
