What Are the Best and Fastest Ways to Parse Extremely Large Data Files?

Some time back, I was asked to develop a high-performance parser. The question placed no restriction on the technology, logic, or flow to be used; it gave only the input, the expected output format, and the maximum time the parser may take.

This post may answer the following questions:

How to query a .CSV and save the result in another CSV?
Custom .CSV Parser?
How to ETL .CSV into a .CSV
High performance .CSV parser

I found the performance requirement interesting, which is the reason for this post. Another reason is to discuss only some of the possible solutions: a problem like this has several, so a question should state clearly what exactly is required, to save both the test taker and the tester from any aftershocks. Let's get into the details. The following are the important aspects of the problem.

Business requirement:

Calculate the usage of different country and dial codes for a particular customer, and write the result to a separate file.

Functional requirement:

Read from CSV data source
Query for the data (select, group by, sum, count, etc)
Write into a separate .CSV file

Non-functional requirement (performance):

The data source contains ~1.2 million records, and the module is required to complete the whole procedure in less than 5 seconds.

Given that:

A CustomerData.CSV file already exists; this is the file we will query.

CustomerData.CSV has the following schema:

Columns	Description
Field 1	This field contains the customer id (sorted in ascending order)
Field 2	Country code
Field 3	Dial Code
Field 4	Start time
Field 5	Call duration

Result.CSV has the following schema:

Columns	Description
Field 1	Country code
Field 2	Dial Code
Field 3	Total duration (in minutes and seconds)
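
Field 3 of Result.CSV asks for minutes and seconds. Here is a minimal sketch of that conversion, assuming the call duration is stored as a whole number of seconds (the schema above does not say). The names DurationFormat and ToMinutesSeconds are hypothetical and exist only for illustration.

```csharp
using System;

static class DurationFormat
{
    // Converts a total number of seconds into "m:ss", e.g. 125 -> "2:05".
    // Assumption: durations are non-negative whole seconds.
    public static string ToMinutesSeconds(long totalSeconds)
    {
        if (totalSeconds < 0)
            throw new ArgumentOutOfRangeException("totalSeconds");

        long minutes = totalSeconds / 60;   // whole minutes
        long seconds = totalSeconds % 60;   // remainder, 0..59
        return string.Format("{0}:{1:00}", minutes, seconds);
    }
}
```

If the durations turn out to be stored in another unit or as text such as "mm:ss", the parsing step has to change, but the division and remainder stay the same.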

SOLUTION 1: (Use Jet OLE DB Text Driver)

The easiest and quickest to write, and it performs very well!

Step 1: Define a schema.ini file in the same folder as CustomerData.csv; the text driver looks for schema.ini in the data source folder.

For details on the format, see the Schema.ini File (Text File Driver) documentation. The following content goes in schema.ini:

[CustomerData.csv]
Format=CSVDelimited
CharacterSet=ANSI
ColNameHeader=False
Col1=customerId Text Width 20
Col2=countryCode Short Width 3
Col3=dialCode Short Width 3
Col4=startTime DateTime Width 15
Col5=callDuration Text Width 5
[result.csv]
ColNameHeader=False
CharacterSet=1252
Format=CSVDelimited
Col1=countryCode Short
Col2=dialCode Short
Col3=Expr1002 Float

Step 2: Add the backend code to a .cs file:

// Requires: using System; using System.Data.OleDb; using System.Diagnostics;
private static void Query(string customerId)
{
    // Logic: use the Jet OLE DB text driver to run SELECT ... INTO,
    // which reads from one .csv and writes the result into another .csv.

    string writeTo = @"result.csv";
    string readFrom = @"CustomerData.csv";

    // Don't just read; SELECT ... INTO writes as well.
    // The customer id is passed as a positional parameter (?) instead of
    // being concatenated into the SQL text.
    string query = @"  SELECT 
                    countryCode, dialCode, sum(callDuration) INTO " + writeTo + @"
                FROM 
                    [" + readFrom + @"] 
                WHERE 
                    customerId = ? 
                GROUP BY countryCode, dialCode";

    Stopwatch timer = new Stopwatch();

    try
    {
        Console.WriteLine("Looking for customer: {0}, to export into {1}.", customerId, writeTo);

        string connectionString = @"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=E:\MyDocs\Software Test\;Extended Properties='text;HDR=No;FMT=CSVDelimited'";
        using (OleDbConnection conn = new OleDbConnection(connectionString))
        using (OleDbCommand cmd = new OleDbCommand(query, conn))
        {
            cmd.Parameters.AddWithValue("?", customerId);
            conn.Open();
            timer.Start();
            int nRecordsAffected = cmd.ExecuteNonQuery();
            timer.Stop();
            Console.WriteLine("Records written: {0}", nRecordsAffected);
        } // Dispose closes the connection.
    }
    catch (Exception ex) { Console.Write(ex.ToString()); }
    // TotalSeconds is the full elapsed time; Seconds is only the 0-59 component.
    Console.WriteLine("Time taken to read/write (ms):[{0}] ({1} secs)", timer.ElapsedMilliseconds, timer.Elapsed.TotalSeconds);
}

SOLUTION 2: (Write a custom class, create indexes, and apply bisection search)

For instance, the following code performs these operations to achieve the same result:

Perform indexing on the selected column
Select specific records from the .CSV file
Apply the SUM() aggregate function, using LINQ

Usage:

Stopwatch timer = new Stopwatch(); // requires using System.Diagnostics;
using (CsvParser parser = new CsvParser(@"E:\CustomerData.csv"))
{
    parser.PerformIndexing((int)CsvParser.CsvColumns.Col1_CustomerID); // One time only.

    timer.Start();
    parser.Select("AMANTEL").Sum((int)CsvParser.CsvColumns.CallDuration);
    timer.Stop();

    foreach (var o in parser.Result)
    {
        Console.WriteLine(string.Format("CountryCode:{0}, DialCode:{1}, TotalDurationOfCall:{2}",
                   o.CountryCode, o.DialCode, o.TotalDurationOfCall));
    }

}
// TotalSeconds is the full elapsed time; Seconds is only the 0-59 component.
Console.WriteLine("Time taken to read/write (ms):[{0}] ({1} secs)", timer.ElapsedMilliseconds, timer.Elapsed.TotalSeconds);

Backend code:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class CsvParser : IDisposable
{
  
    public dynamic Result { get; set; }
    private string _fileName;
    private char _separator = ',';
    private Dictionary<string, Bounds> _lstIndex = new Dictionary<string, Bounds>();
    private List<string> _Rows = new List<string>();
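    //Zero-based column positions; the file has five columns, so call duration is index 4.
    public enum CsvColumns { Col1_CustomerID = 0, Col2_CountryCode = 1, Col3_DialCode = 2, Col4_StartTime = 3, CallDuration = 4 }

    //Simple bounds structure holding the first and last row number (zero-based, inclusive) of a group in the file.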
    class Bounds
    {
        public int Start { get; set; }
        public int End { get; set; }
  
        public Bounds(int start, int stop) { Start = start; End = stop; }
    }
  
  
    //Constructor; the separator defaults to a comma.
    public CsvParser(string file, char separator = ',')
    {
        if (string.IsNullOrEmpty(file)) throw new ArgumentException("Invalid file", "file");

        this._fileName = file; this._separator = separator;
    }
  
    /// <summary>
    /// Should be called once, before using the object.
    /// Assumes the file is sorted on the indexed column.
    /// </summary>
    /// <param name="nColumn">Column to be indexed</param>
    /// <returns>Chained object</returns>
    public CsvParser PerformIndexing(int nColumn)
    {
        using (StreamReader reader = new StreamReader(_fileName))
        {
            string previousVal = null;
            int nStart = 0;
            int nRowCounter = 0;
            string line;

            //ReadLine returns null at the end of the file, so an empty file is handled too.
            while ((line = reader.ReadLine()) != null)
            {
                string currentVal = line.Split(_separator)[nColumn];

                if (previousVal != null && previousVal != currentVal)
                {
                    //The previous group ended on the row before this one.
                    _lstIndex.Add(previousVal, new Bounds(nStart, nRowCounter - 1));
                    nStart = nRowCounter; //this row starts the next group
                }

                previousVal = currentVal;
                nRowCounter++; //next line
            }

            //Add the last group; no value change follows it.
            if (previousVal != null)
                _lstIndex.Add(previousVal, new Bounds(nStart, nRowCounter - 1));
        }

        System.Diagnostics.Trace.WriteLine(string.Format("Done. Total indexed {0}.", _lstIndex.Count));
        return this;
    }
  
    /// <summary>
    /// Select the rows that belong to the given customer id
    /// </summary>
    /// <param name="CustomerID">Customer id predicate</param>
    /// <returns>Chained object</returns>
    internal CsvParser Select(string CustomerID)
    {
        _Rows.Clear(); //drop rows left over from a previous Select

        Bounds bounds;
        if (!_lstIndex.TryGetValue(CustomerID, out bounds)) return this; //unknown customer: no rows

        using (StreamReader reader = new StreamReader(_fileName))
        {
            //1. Get the row range from the index.
            //2. Read and skip lines until the start row is reached.
            //3. Collect lines up to and including the end row, then stop.
            string line;
            int nRowCounter = 0;
            while (nRowCounter <= bounds.End && (line = reader.ReadLine()) != null)
            {
                if (nRowCounter >= bounds.Start)
                    _Rows.Add(line);

                nRowCounter++;
            }
        }

        return this;
    }
  
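    /// <summary>
    /// Binary search over an array sorted in ordinal string order
    /// </summary>
    /// <param name="data">Sorted array to search</param>
    /// <param name="key">Value to look for</param>
    /// <param name="left">First index of the search range</param>
    /// <param name="right">Last index of the search range</param>
    /// <returns>Index of the key, or -1 if it is not found</returns>
    [Obsolete("Unused", true)]
    internal static int Search(string[] data, string key, int left, int right)
    {
        if (left <= right)
        {
            int middle = left + (right - left) / 2; //avoids overflow of (left + right)
            int comparison = string.CompareOrdinal(key, data[middle]);
            if (comparison == 0)
                return middle;
            else if (comparison < 0)
                return Search(data, key, left, middle - 1); //key sorts before the middle
            else
                return Search(data, key, middle + 1, right); //key sorts after the middle
        }
        return -1;
    }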
  
    /// <summary>
    /// Provide SUM aggregate function
    /// </summary>
    /// <param name="nColumnID">Column to total; assumed to hold whole numbers</param>
    /// <returns>Chained object</returns>
    internal CsvParser Sum(int nColumnID)
    {
        var result = from theRow in _Rows
                        let rowItems = theRow.Split(_separator)

                        //Group the split fields, so the sum can read a column instead of a character.
                        group rowItems by new
                        {
                            countryCode = rowItems[(int)CsvColumns.Col2_CountryCode],
                            dialCode = rowItems[(int)CsvColumns.Col3_DialCode]
                        } into g

                        select new
                        {
                            CountryCode = g.Key.countryCode,
                            DialCode = g.Key.dialCode,
                            TotalDurationOfCall = g.Sum(p => long.Parse(p[nColumnID], System.Globalization.CultureInfo.InvariantCulture)),
                            selectedRows = g
                        };

        Result = result.ToList();

        return this;
    }
  
    #region IDisposable Members
    public void Dispose() { }
    #endregion
  
  
}
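
Because Field 1 is sorted in ascending order, the rows of one customer can also be located by bisection instead of a linear scan. The following is a minimal sketch, assuming the customer ids of all rows have been loaded into an array and that the file's sort order is ordinal; SortedRange and its methods are hypothetical names, not part of the class above.

```csharp
using System;

static class SortedRange
{
    // Index of the first element >= key, or keys.Length if there is none.
    public static int LowerBound(string[] keys, string key)
    {
        int lo = 0, hi = keys.Length;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (string.CompareOrdinal(keys[mid], key) < 0) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }

    // Index of the first element > key, or keys.Length if there is none.
    public static int UpperBound(string[] keys, string key)
    {
        int lo = 0, hi = keys.Length;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (string.CompareOrdinal(keys[mid], key) <= 0) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }

    // Finds the half-open row range [start, end) holding the key.
    // Returns false when the key does not occur.
    public static bool TryFindRange(string[] keys, string key, out int start, out int end)
    {
        start = LowerBound(keys, key);
        end = UpperBound(keys, key);
        return start < end;
    }
}
```

Each lookup takes a logarithmic number of comparisons, so for about 1.2 million rows that is roughly twenty comparisons per bound. The trade-off is that the key array has to be kept in memory.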

By the way, a far more interesting API would have been the following:

parser.Select(Col1, Col2, Col3)
      .Where(Col1, "AMANTEL")
      .Sum(Col3);

Let me know if you can help me with that :) Note that I have not applied bisection search yet, but the method is there.
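
One possible shape for such a chained API is sketched below. It is only an illustration under stated assumptions: CsvQuery is a hypothetical class, columns are addressed by zero-based position, Sum is the terminal step that groups by the other selected columns, and the summed column holds whole numbers.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical fluent wrapper; not part of the CsvParser class.
class CsvQuery
{
    private readonly IEnumerable<string[]> _rows;
    private int[] _columns = new int[0];
    private readonly List<Func<string[], bool>> _filters = new List<Func<string[], bool>>();

    public CsvQuery(IEnumerable<string> lines, char separator = ',')
    {
        // Deferred: lines are split only when the query is executed.
        _rows = lines.Select(l => l.Split(separator));
    }

    // Columns that take part in the result.
    public CsvQuery Select(params int[] columns) { _columns = columns; return this; }

    // Keeps rows whose column equals the value; several filters are ANDed.
    public CsvQuery Where(int column, string value)
    {
        _filters.Add(r => r[column] == value);
        return this;
    }

    // Terminal step: groups by the selected columns other than sumColumn and
    // totals sumColumn per group. The key is the group columns joined by ','.
    public Dictionary<string, long> Sum(int sumColumn)
    {
        int[] groupBy = _columns.Where(c => c != sumColumn).ToArray();
        return _rows
            .Where(r => _filters.All(f => f(r)))
            .GroupBy(r => string.Join(",", groupBy.Select(c => r[c])))
            .ToDictionary(
                g => g.Key,
                g => g.Sum(r => long.Parse(r[sumColumn], CultureInfo.InvariantCulture)));
    }
}
```

With the schema above, a call such as new CsvQuery(File.ReadLines(path)).Select(1, 2, 4).Where(0, "AMANTEL").Sum(4) would total the call duration per country and dial code. Unlike the index-based class, this sketch scans every line on each query.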

SOLUTION 3: (Use TextFieldParser class)

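Check out this solution; it is a VB class used from C#. TextFieldParser lives in the Microsoft.VisualBasic.FileIO namespace, so the project needs a reference to the Microsoft.VisualBasic assembly. Let me know if you enjoy it.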

// Requires: using Microsoft.VisualBasic.FileIO;
using (var parser =
    new TextFieldParser(@"c:\CustomerData.CSV")
        {
            TextFieldType = FieldType.Delimited,
            Delimiters = new[] { "," }
        })
{
    while (!parser.EndOfData)
    {
        // One string per column of the current line.
        string[] fields = parser.ReadFields();
        //Do something with it!
    }
}
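
As one way to fill in the "do something with it" step, the sketch below totals the call duration per country code and dial code for a single customer. It assumes the five-column layout given earlier and a whole-number duration; TextFieldParserUsage and Totals are hypothetical names. The parser is built over a TextReader so the method can be tried without a file on disk.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using Microsoft.VisualBasic.FileIO; // needs a reference to Microsoft.VisualBasic

static class TextFieldParserUsage
{
    // Totals field 5 (call duration) per (country code, dial code) for one customer.
    public static Dictionary<Tuple<string, string>, long> Totals(TextReader input, string customerId)
    {
        var totals = new Dictionary<Tuple<string, string>, long>();
        using (var parser = new TextFieldParser(input))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");

            while (!parser.EndOfData)
            {
                string[] fields = parser.ReadFields();
                // Skip short lines and rows of other customers.
                if (fields == null || fields.Length < 5 || fields[0] != customerId) continue;

                var key = Tuple.Create(fields[1], fields[2]);
                long duration = long.Parse(fields[4], CultureInfo.InvariantCulture);

                long sum;
                totals.TryGetValue(key, out sum); // sum stays 0 for a new key
                totals[key] = sum + duration;
            }
        }
        return totals;
    }
}
```

For a real file, pass new StreamReader(path) as the input. Since the file is sorted by customer id, the loop could also stop as soon as it has moved past the requested customer.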

SOLUTION 4: (Using LINQ to CSV)

Here is how.

SOLUTION 5: (Use XmlCSVReader, convert CSV to XML and use XPath to query the data)

How so? Here is the method.
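
To give a feel for the idea, here is a minimal sketch that does not use XmlCSVReader itself: it builds the XML with LINQ to XML and queries it with XPath. The element names and the CsvXml class are assumptions made for illustration, and the customer id is assumed to contain no single quote.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

static class CsvXml
{
    // Wraps each CSV line in a <row> element with one child per column.
    public static XElement ToXml(IEnumerable<string> lines)
    {
        return new XElement("rows",
            lines.Select(l => l.Split(','))
                 .Select(f => new XElement("row",
                     new XElement("customerId", f[0]),
                     new XElement("countryCode", f[1]),
                     new XElement("dialCode", f[2]),
                     new XElement("startTime", f[3]),
                     new XElement("callDuration", f[4]))));
    }

    // Sums callDuration over all rows of one customer using the XPath sum() function.
    public static double TotalDuration(XElement rows, string customerId)
    {
        string xpath = "sum(row[customerId='" + customerId + "']/callDuration)";
        return (double)rows.XPathEvaluate(xpath); // numeric XPath results come back as double
    }
}
```

This keeps the whole document in memory, so for about 1.2 million records it is likely to be much heavier than the text-driver or index-based solutions.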

SOLUTION 6: (Load .CSV in a database and use SELECT query to get result)

See Importing CSV Data and saving it in database; I hope you get the idea, and ping me if you do not. Another, equally interesting option is A Fast CSV Reader; it is interesting because it comes with benchmarks.

SOLUTION 7 (Bonus): (Use Text Driver with DSN)

Just to retrieve data, quick and easy.

// Requires: using System; using System.Data.Odbc;
// Assumes an ODBC DSN named "CustomerData.csv" is configured for the text driver.
using (OdbcConnection conn = new OdbcConnection("DSN=CustomerData.csv"))
using (OdbcCommand cmd = new OdbcCommand(@"SELECT * FROM [CustomerData.csv]", conn))
{
    conn.Open();
    using (OdbcDataReader dr = cmd.ExecuteReader())
    {
        int cols = dr.FieldCount; // the column count is the same for every row
        while (dr.Read())
        {
            for (int i = 0; i < cols; i++)
            {
                Console.WriteLine("Col:{0}", dr[i]);
            }
        }
    }
}

Happy parsing!

withoutbugs.com
