Saturday, 31 October 2009

Reading a file with LINQ

LINQ, the new language extension in the Microsoft .NET Framework 3.5, is often very powerful. Today I tried to use it to read a big text file (104 MB, 5 million rows).


First, I created a new C# console application targeting .NET 3.5. In the Main method I wrote this code:

// start time
long start = Environment.TickCount;
var query = from a in File.ReadAllLines(@"C:\reading_test.txt")
            select a;
// enumerate every line; the body is intentionally empty
foreach (string s in query) { }

Console.WriteLine("With ReadAllLines: " + (Environment.TickCount - start) + " ms");


When I ran it on my machine, the elapsed time was 3313 ms.

The File.ReadAllLines() method is convenient, because a single line of code reads all the lines of a file, but it does not perform well. The problem is that it returns the whole file as a string array, so a very big input file wastes a huge amount of memory.
In my example, the application used about 17 MB of memory when it started, but about 360 MB after the query!
This means that all the lines read are held in memory at the same time.

I solved the problem this way:

// start time
long start = Environment.TickCount;
using (StreamReader sr = new StreamReader(@"C:\reading_test.txt")) {
    var query = from x in sr.GetLines()
                select x;
    // enumerate every line; the body is intentionally empty
    foreach (string s in query) { }
}
Console.WriteLine("With Extension Method: " + (Environment.TickCount - start) + " ms");

This is my code for the GetLines() extension method:

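using System;
using System.Collections.Generic;
using System.IO;

static class ExtensionsClass {
    // Yields the lines of the reader one at a time instead of loading them all.
    public static IEnumerable<string> GetLines(this StreamReader sr) {
        string line;

        // Note: in an iterator this check runs when enumeration starts,
        // not when GetLines() is called.
        if (sr == null)
            throw new ArgumentNullException("sr");

        while ((line = sr.ReadLine()) != null)
            yield return line;
    }
}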


GetLines() is implemented as an extension method for the StreamReader class, so it can be used as the source of a LINQ query. It enumerates the lines provided by the StreamReader one by one, and does not load a line into memory before it is actually needed.
The main point is that this technique lets you work with huge files while keeping memory usage small.
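
To make the idea concrete, here is a minimal, self-contained sketch of a filtering query over the same kind of lazy line source. The class names StreamReaderExtensions and LazyLinesDemo, the CountMatches helper, and the sample log lines are assumptions made up for illustration; only one line at a time should need to be held while the query runs.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class StreamReaderExtensions
{
    // Same idea as GetLines() in the post: yield one line at a time.
    public static IEnumerable<string> GetLines(this StreamReader sr)
    {
        if (sr == null)
            throw new ArgumentNullException("sr");

        string line;
        while ((line = sr.ReadLine()) != null)
            yield return line;
    }
}

static class LazyLinesDemo
{
    // Hypothetical helper: counts the lines that contain the given text.
    // The query is evaluated while the reader is still open.
    public static int CountMatches(string path, string text)
    {
        using (StreamReader sr = new StreamReader(path))
        {
            var query = from line in sr.GetLines()
                        where line.Contains(text)
                        select line;
            return query.Count();
        }
    }

    static void Main()
    {
        // Small temporary file with made-up content, just for the demo.
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllLines(path, new[] {
                "INFO start", "ERROR disk full", "INFO retry", "ERROR timeout"
            });
            Console.WriteLine(CountMatches(path, "ERROR")); // prints 2
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

Note that the query must be consumed inside the using block: because execution is deferred, enumerating it after the StreamReader has been disposed would fail.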

With the first method (File.ReadAllLines), the elapsed time was about 3600 ms. With the second (the extension method), it was about 1200 ms, less than half of the first.
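
If you want to repeat the comparison, the sketch below times both approaches with System.Diagnostics.Stopwatch, since Environment.TickCount is limited to the resolution of the system timer (typically 10 to 16 ms). The TimingDemo class and the TimeMs helper are hypothetical names, and the file path is the one from this post; results will depend on the machine, the file, and disk caching, so treat it as a rough measurement rather than a benchmark.

```csharp
using System;
using System.Diagnostics;
using System.IO;

static class TimingDemo
{
    // Hypothetical helper: runs the action once and returns the elapsed milliseconds.
    public static long TimeMs(Action action)
    {
        Stopwatch sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    static void Main(string[] args)
    {
        // Path taken from the post; pass a different file as the first argument.
        string path = args.Length > 0 ? args[0] : @"C:\reading_test.txt";

        // Eager: the whole file is loaded into a string[] before the loop starts.
        long eager = TimeMs(() =>
        {
            foreach (string s in File.ReadAllLines(path)) { }
        });
        Console.WriteLine("With ReadAllLines: " + eager + " ms");

        // Streaming: one line at a time through a StreamReader.
        long streaming = TimeMs(() =>
        {
            using (StreamReader sr = new StreamReader(path))
            {
                while (sr.ReadLine() != null) { }
            }
        });
        Console.WriteLine("With StreamReader: " + streaming + " ms");
    }
}
```

Running each measurement a few times and discarding the first run helps reduce the effect of the file being cached by the operating system.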