Deduping a large file

by kiteason, 7 March 2013 09:08

I recently came across a requirement to dedupe a large text file.  The basic requirements were:

  • The file had > 100 million lines, each consisting of some text data, followed by a separator (a tab), followed by some metadata.  The text data was 30 to 100 characters long, the metadata 30 characters.
  • Two lines were duplicates of one another if they had the same text data (the first part); the metadata was ignored for the purpose of detecting a duplicate.
  • The deduped output needed to preserve the last line in any set of duplicates, and had to preserve the original order of lines (there's a small worked example of this below).
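
For instance (an invented illustration of mine, with the tab separator written as <TAB>), an input like this:

    apple<TAB>meta1
    banana<TAB>meta2
    apple<TAB>meta3
    cherry<TAB>meta4

should dedupe to the following, keeping the later of the two 'apple' lines and leaving everything in its original order:

    banana<TAB>meta2
    apple<TAB>meta3
    cherry<TAB>meta4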

I started as always by writing a set of tests.  (When working solo on algorithmic-style work, I tend to write all the tests I can think of before cutting any real code.)  Here are the first few lines of the tests module:

    open NUnit.Framework
    open FsUnit
    open Xunit
    open Xunit.Extensions
    open Dedupe

    [<AutoOpen>]
    module TestUtils =

        // Split a '|'-separated test string into a sequence of lines:
        let LinesToSeq (lines : string) =
            lines.Split([|'|'|]) |> Seq.ofArray

        // Pick out the i'th comma-separated field of a test line:
        let Segment (i : int) (str : string) =
            str.Split([|','|]).[i]

    [<TestFixture>]
    type ``Given a data set with no duplicates`` ()=

        [<Theory>]
        [<InlineData("")>]
        [<InlineData("one")>]
        [<InlineData("one|two")>]
        [<InlineData("one|two|three")>]
        [<Test>]
        member test.``deduping the data set produces an identical data set`` (lines) =
            let expected = lines |> LinesToSeq |> Array.ofSeq
            let actual = lines |> LinesToSeq |> DedupeBy (Segment 0) |> Array.ofSeq
            actual |> should equal expected

    [<TestFixture>]
    type ``Given a data set with one duplicate on the first segment`` ()=

        [<Theory>]
        [<InlineData("unique1,data1|unique1,data2", "unique1,data2")>]
        [<Test>]
        member test.``deduping the data set produces a dataset consisting of the last line`` (lines, dedupedLines) =
            let expected = dedupedLines |> LinesToSeq |> Array.ofSeq
            let actual = lines |> LinesToSeq |> DedupeBy (Segment 0) |> Array.ofSeq
            actual |> should equal expected

Then I knocked up a fairly naive solution, which passed all my unit tests:

    open System
    open System.Collections.Generic

    let DedupeBy (f : string -> string) (seq : seq<string>) =
        // Make a dictionary of key values and the highest line that key appears on:
        let dict = new Dictionary<string,int>()
        seq
        |> Seq.iteri (fun i line -> let key = f line
                                    dict.[key] <- i)

        // Filter the input sequence to show only the highest-numbered lines:
        seq
        |> Seq.mapi (fun i line -> i, line)
        |> Seq.filter (fun (i, line) -> let maxLineNo = dict.[(f line)]
                                        i = maxLineNo)
        |> Seq.map (fun (_, line) -> line)

As you can see, the approach was to:

1) Build a dictionary with the file line text (or the specified substring of the file line text) as the dictionary key, and the highest line number that text appears in as the dictionary value.  That's achieved by repeatedly overwriting the line number in the dictionary - which works because we are iterating through the lines in line number order.

2) Filter the file lines to pick only those whose line numbers are the highest for that text - which is identified by looking back at the dictionary.
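
Traced against the invented example above, keying on the first field and counting lines from zero, the dictionary ends up as apple -> 2, banana -> 1, cherry -> 3; the filter therefore keeps lines 1, 2 and 3 - the later 'apple' plus 'banana' and 'cherry' - in their original order.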

Note that the parameter 'f' in the code is used to inject a function which is responsible for picking out whatever substring of each file line constitutes the portion which needs to be unique.  Given that the separator in the input file was a tab, this function could simply be:

    // Pick out the i'th substring, splitting on tab:
    let Segment (i : int) (str : string) =
        str.Split([|'\t'|]).[i]

(The input file was very consistent in format, so I didn’t bother with any checking to see if element i existed in the line.)
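
Had the format been less dependable, a slightly more defensive variant might look something like the sketch below (TrySegment is my name for it, not something from the original code); callers could then decide whether to skip or report malformed lines:

    // Sketch only: return None rather than throwing when field i is missing.
    let TrySegment (i : int) (str : string) =
        let parts = str.Split([|'\t'|])
        if i < parts.Length then Some parts.[i] else None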

Actually calling the deduping function was simply:

    DedupeBy (Segment 0)

This worked well on my laptop – provided volumes were modest.  However when I moved the code onto a desktop machine for volume testing I had a setback.  The same code, with the same data, threw an exception suggesting that I couldn’t access an object which had already been garbage collected.  I was a little surprised that this hadn’t happened on my laptop, but after a bit of googling I tracked it down to the fact that my code was double-iterating the sequence – see the two lines beginning seq |> above.
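
To illustrate the kind of trap involved (this is my reconstruction of the hazard, not the original calling code): if the input sequence is built lazily over a single reader, the first complete pass can leave the reader disposed, and the second pass then trips over the dead object:

    // Hypothetical caller: a once-only sequence over a single StreamReader.
    let linesOnce (path : string) =
        let reader = new System.IO.StreamReader(path)
        seq {
            try
                while not reader.EndOfStream do
                    yield reader.ReadLine()
            finally
                reader.Dispose()   // runs when the first enumeration finishes
        }

Feed the result of linesOnce to a function that enumerates it twice - as the first version of DedupeBy does - and the first pass succeeds while the second throws.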

It was also pretty obvious that the code was holding more in memory than it really needed to.  Clearly the Dictionary-based approach was going to be heavy on memory anyway, but it was at least possible to give the garbage collector a chance by not passing the data in from outside as a sequence.  Instead I sacrificed a little elegance by passing in a filename and making the dedupe function do its own I/O:

    let DedupeFileBy (f : string -> string) (fileName : string) =
        // Make a dictionary of key values and the highest line that key appears on:
        let dict = new Dictionary<string,int>()
        IO.File.ReadLines(fileName)
        |> Seq.iteri (fun i line -> dict.[f line] <- i)

        // Filter the input sequence to show only the highest-numbered lines:
        let results =
            IO.File.ReadLines(fileName)
            |> Seq.mapi (fun i line -> i, line)
            |> Seq.choose (fun (i, line) -> if i = dict.[f line] then Some(line) else None)

        IO.File.WriteAllLines(fileName + ".dedupe", results)

For completeness here’s the console application’s main function:

    open Dedupe

    [<EntryPoint>]
    let main args =
        let sw = new System.Diagnostics.Stopwatch()
        sw.Restart()
        Dedupe.DedupeFileBy (Segment 0) @"d:\temp\29mx4.txt"
        System.Console.WriteLine("Time: " + sw.Elapsed.ToString())
        System.Console.ReadKey(true) |> ignore
        0

This performed pretty well.  I was able to process a 116,000,000 line file, with 4 duplicates for each line, in 4 minutes 34 seconds.  Not bad for a 10 GB file!
