Tapping on the Box: A Few Words about Autism

by kiteason2. April 2014 13:30

Today is World Autism Awareness Day. I don’t want to trivialize this sometimes very serious condition – but I do have some thoughts on my own recent experiences. And, yes, somehow I will be tying this back to F#!

This mind-thread started a few months ago when I spoke at a software conference. Also on the bill was a fellow F# developer. Although we didn’t previously know one another, we got on famously, mainly bonding over our mutual amusement at the quirks of, er, some other languages. At one point he happened to mention that his wife sometimes teases him by saying that he is ‘on the spectrum’. My wife does the same to me, and I did wonder whether this might be one of those times when true words are spoken in jest.

So when I came across a simple survey which purports to find out whether people are indeed ‘on the spectrum’, I gave it a go. The lower bound for being on the spectrum is 26 – I scored 27. It turned out that one wife, at least, was bang on. My friend took the test too, and guess what: he scored 27. No wonder we got on so well!

So far, this is all about mild amusement at our personal foibles. After all, it was just some test on the internet, right? Then last night I happened to watch a BBC Horizon programme which covers the work of Professor Uta Frith. With various colleagues, she invented or refined a number of brilliant tests which are designed to distinguish between people who have some form of autism/Asperger’s, and the rest of the population. The most compelling is a series of little videos which show animated shapes and lines appearing to interact. Although the animations are fairly abstract, almost everybody interprets them as depicting some kind of social transaction. This includes people with autism, except perhaps of a severe kind. So, to be clear, most autistic people also get that a social interaction is being depicted. The point is, autistic people interpret that social interaction differently! As I watched with my family I couldn’t help exclaiming: “Oh my god, I’m getting all of these wrong.” One of the scenarios, according to the ‘normal’ interpretation, involves a big triangle trying to encourage a small triangle to go out for a trip. I interpreted it as some kind of home-invasion!

As the programme progressed, it turned out that I was interpreting every single animation exactly as an autistic person would. My daughter found another similar test on Youtube: same result.

Now, I’m guessing you probably don’t care how very-mildly-autistic I think I am. But this is what really got me thinking: there is another test, commonly administered to quite small children. The psychologist has a box and she says to the child: “I have a toy boat in this box, would you like to see it?” Obviously the child totally wants to see the toy boat. So the investigator taps twice on the box, gets the boat out and shows it to the child. Then she says “Now, would you like to show me the toy?” She passes the box to the child.

Here’s the kicker: ‘normal’ children tap the box twice and open it to show the toy. Children who might be on the spectrum simply open the box. I am absolutely certain that if someone had performed this test on me as a child (or now!), I would have fallen into the direct-opening camp.

I think this sheds light on the relationship between the ‘F# converts community’ and the ‘C#’ community. Here’s how the story often seems to go in an organisation: one or two people start to use F#, get all excited, and try to encourage all the other developers to go the same way. At least some of the C# developers simply do not see the attraction, and pretty soon there are two opposing camps. Members of each camp think that other camp is completely barmy. Leaving aside the rational arguments about the pros and cons of each language, this is what I think is going on: the kind of people who are rapid adopters of F# are, like me, a little bit on the spectrum. They ‘just want to get on’ with things, and the social realities – the fact that the C#/OO community has its own, partly-social conventions (certain TDD practices, design patterns and so forth) simply don’t occur to them as an impediment. (Adhering to these conventions is the equivalent of ‘tapping on the box’.) The C#/OO folks, in contrast, are somewhat more social, and care much more about what the rest of their established community thinks. Hence – guaranteed conflict.

You could easily refute this theory by doing some spectrum-testing on a good sample of F# and C#-only developers, to see if their scores differ significantly. I leave this thought as a gift to someone who is looking for a PhD thesis.

I seem to be saying that the C# hold-outs are doing so for solely psychological reasons, not practical ones. This could easily be interpreted as insulting to that community. Sorry! But you have to make some allowances for me – I’m a tiny bit autistic, remember?

Tags:

F# Syntax You Didn’t Know (Maybe)

by kiteason26. February 2014 11:09

Everyone plateaus a bit when learning a new language.  It’s important to guard against that and keep learning!  To that end here are six F# tricks which - maybe - you haven’t come across yet.  Enjoy.

1) Strings are sequences

Because they implement IEnumable, strings are sequences.  Which means that you don’t have to ‘manually’ pull out individual elements and process them ‘char by agonizing char’.

val ToMorse : s:string -> string

> ToMorse "ABACAB";;

val it : string = ".- -... .- -.-. .- -... "

>

2) You don’t always need to say (fun x -> …)

If the arguments of the function you want to call in something like a Seq.map or Seq.iter operation are the same types and order as those required by the higher-order-function in question, you don’t need to map the arguments across with the (fun x -> …) syntax.

It’s probably clearer with some examples:

3) You can pattern match in for-loop arguments

This is a naughty one!  If you use the syntax

for <Discriminated Union Pattern> in <Sequence of Discriminated Unions> do

...then the body of the loop will iterate only for those sequence items which match the DU Pattern.  This is so naughty that you’ll get a compiler warning and Jon Harrop will tell you off, but still…

4) You can pattern match on many bizarre combinations of list items

5) You can do comprehensions of floating point numbers

This is another one that is not without its risks but… you can increment values in a list or array comprehension by things other than integers.

val xValues : float [] = [|0.0; 0.1; 0.2; 0.3...

6) Active patterns can use active patterns

val PrimaryColorWays : (RainbowColor * RainbowColor * RainbowColor) [] =

[|(Red, Orange, Yellow); (Violet, Green, Orange); (Violet, Orange, Red)|]

OK, so who can shoehorn every single one of these into one application?

Tags:

The Beautiful Camel

by kiteason19. February 2014 13:15

Recently, I’ve been thinking a lot about a couple of things.  The first is a tweet I saw expressing amazement that anyone should think F# syntax is ‘ugly’.  The second is the utterly wonderful documentary series ‘Wild China’ which has been airing here in the UK.  Under the influence of slightly too many Valentine’s day coffee-chocolates, the two steams of thought got slightly mixed up in my head.  Here is the result.

The Beautiful Camel

There was once a young merchant whose family travelled the Silk Road from China to the Mediterranean.  One year, the young merchant, being young and still somewhat foolish, tarried a little too long in the perfumed houses of Persia, and his caravan, not wanting to be caught in the mountains in winter, left without him.

The young merchant was not worried: he had with him a fine horse, Warda, that he had ridden since childhood.  Warda knew his every whim, and he the horse’s.  It would present little difficulty to catch up with the caravan.  So he set off confidently enough, and, pausing only to ask the occasional oncoming traveller how far ahead the caravan was, he and Warda ate up the eastward miles.

But a strange thing happened.  Although Warda’s hoofbeats echoed briskly across the landscape, each oncomer would give him the same news: “You are three days behind.”  Before long it was “You are four days behind”.  Eventually the young merchant found himself on the fringes of the Gobi Desert, no closer to his caravan than when he had set out a month before.

Presently he came to an oasis, where the Silk Road angled northward to skirt the desert.  Frustrated by his lack of progress, the young merchant decided to spend the day at the oasis, before heading directly across the Gobi.  By travelling at night, and keeping the north star (he knew it as al-Jadi) to his left, he could navigate well enough.  Thus he would cut a great corner off the Silk Road’s route, and be sure to catch up with the caravan.

As he rested among the palms, an old man approached him.  The young merchant kept his hand on his dagger, but the old man seemed friendly enough.  When the merchant confided his plight - and his plan for a short cut - the man nodded.  The young merchant was a little surprised, expecting to hear the usual blood-curdling stories of singing sand dunes and deadly mirages.  Instead, the old man seemed almost encouraging.

“There is just one thing, my friend”, said the old man, chewing a fig.  “Your horse, magnificent though she is, is not a suitable mount for this journey”.

The young merchant grew angry.  “I know how this goes”, he thought.  “He will offer me some miserable pack animal in exchange for my beloved Warda.”  But he held his tongue, and the old man went on.

“I see what you are thinking,” said the old man.  “That I seek to take your steed.  But no, you may keep the horse, though she is beautiful.  All I ask is that you take with you also one of my camels.  It has been a good breeding year and I have more than I know what to do with - you may take your pick of them as my gift.  Think of me as a camel contributor.”

But the foolish young merchant would have none of it.  The camels were ugly and unfamiliar, and he had no time to learn how to control one.  He dismissed the old man, and settled down to rest in the shade.  In the evening, as he was filling his water-skins, the old man returned.

“I beseech you: take a camel!  These shifting sands are no place for a horse.  Besides, the camels know the routes across the desert.  They will lead you.”  But the camels, even with the benefit of a brilliant moon, looked no better than they had in the day, and the merchant had heard - was it from a horse dealer? or a farrier? - that they would spit in your eye if they did not like you.  He dismissed the old man with a less than subtle gesture towards his dagger.

And so the young merchant set off across the dunes, keeping al-Jadi to his left.  In the first hour he made great progress.  He and Warda were as one, the horse anticipating perfectly his chosen route through the scattered rocks and shallow dunes.  A camel indeed, for a fine merchant such as he!  Such foolishness in that oasis!

But steadily, the landscape changed.  The dunes became steeper, higher; the rocky places scattered with stones that seemed exactly the right size to break Warda’s pace.  The moon was past its zenith and the dark shadows it cast were lengthening.  He urged Warda on, despite her increasingly stumbling gait.  Far too soon, the sun began to rise.  Al-Jadi faded from view and it would soon be too hot to travel.  He urged Warda into the shade of a large rock and prepared to sit out the baking day.

He was woken by Warda’s laboured breathing.  She seemed to having some sort of fit, foaming at the mouth and thrashing about.  He braved the scything limbs and tried to get her to drink from one of his water skins.  It was futile.  Warda gave out one last despairing gasp and lay still.  Infuriated, the merchant took his belt and started to beat the animal.  But it was in vain - his entire caravan could have returned and flogged the poor horse until the Spring and it would have done nothing to revive the poor creature.

Appalled at what the desert had done to him - and so quickly! - the merchant fell to his knees and wept.  How could it have come to this?  All his ancestors, for as long as time could remember, had been horseman; horses had served them well in every conceivable circumstance.  It wasn’t fair!

Presently the sound of sliding rocks and the snort of a large animal roused him from his self pity.  He looked up to see the silhouettes of two large camels, the leader ridden by the old man from the oasis, striding confidently down towards him.  For the first time he appreciated how perfectly suited the creatures were to this place.  The wide splayed feet stood on the sand rather than sinking into it.  The thick hair shrugged off the dazzling sunlight as it would shrug off the freezing night.  The old man, he saw, was not as old as he had seemed in the oasis.  He called a greeting to the merchant.  He seemed to have a slight Australian accent, which was odd, as Australia wouldn’t be rediscovered for another half-millenium.  The man pulled back his tagelmust and smiled sardonically.

“Do you still think my camels are ugly?”

Tags:

What to do if you “just don’t get” F#

by kiteason22. December 2013 08:11

You’ve heard a lot about F#, maybe watched a video, seen a presentation. Maybe even given it a spin.

And you just don’t get it.

But you’re a persistent kind of person and you don’t want to dismiss it out of hand. After all, the rapidly-growing F# community can’t all be wrong. (Can we?)

Here are a few actions you can take which might ‘unblock’ your appreciation of the language. If you give these things a try and you still think ‘meh’, well – good luck to you. Nothing is for everybody!

Unlearn OO

Many developers have grown up in an object-oriented environment. Sometimes they make the mistake of trying immediately to map simple F# syntax into an OO view of the world. (Where are my classes? Where are my encapsulation and data hiding? How do I do inheritance? Tell me now!) Don’t get me wrong: F# has a pretty complete OO story to tell, with interfaces, inheritance and all that good stuff. The trouble is that if you try and learn F# in an OO context from the get-go, you may be trying to do too much at once. Staying close to an OO (C#) view of the world will mean you’ll have the nagging feeling that ‘I could do this quicker in C# - with a single Resharper command’.

It’s better to make a clean break with OO initially. You can work back to a more class-based view of the world later (if you need to).

Action: Forget everything you know about OO to start with. Learn some basic syntax (let, |>, map). Only worry about OO features once you have that basic stuff down.

Forget the functional jargon

The F# world, and functional languages in general, are littered with terms like ‘monads’ and ‘algebraic data types’ – and of course ‘functional’ itself. These can be offputting. So forget them: you don’t need to understand them. How can I be sure? I’m a professional F# developer and I am proud to say I have only the dimmest understanding of both monads and algebraic data types – and many other terms. When you need to start understanding such things, they will come easily. But you can bash out a lot of great code before you get to that stage.

Action: Forget any terminology you might come across that isn’t directly linked to the task at hand.

Expect a Higher Code Density

F# developers tend to do a lot more per line of code than – for example – C# developers. Time after time I have sat in (or given) demos where people coming from other language traditions have clearly been thinking ‘Well that’s nice, but when is (s)he going to show me the real code’. Subconsciously at least, the audience skims over what they think is header information, when in fact they are looking at actual code. I sometimes think that people imagine we are actually ‘cheating’ in some way but nope – we are just doing a hell of a lot per line of code. Less filler, more thriller!

Action: Get used to really looking at every line of code. It’s probably doing more than you think.

Tools Support – Get Over It

Sometimes people get really scared when they learn that they aren’t going to be able to use Resharper in F# code. Relax! You are just moving one tick along the lengthy scale between ‘hipster who edits everything in Notepad’ and ‘corporate type who needs a 4GL to do anything’. Developers can succeed at many points along the scale. Sure we’d like certain basic refactorings (like ‘Rename’ – which is coming, slowly). But this is actually much less of an issue in practice than you might think. This is largely because F# codebases tend to be smaller and simpler, so need less ‘wrangling’. Maybe, just maybe, you’ll actually feel liberated without Resharper questioning your every keystroke.

Incidentally, this comment really only applies to those tools which examine and manipulate the syntax of your code (such as Resharper’s ‘convert to linq expression’). Other types of tool, notably NUnit and the magnificent NCrunch, tend to work just fine with F#.

And if there is a cherished tool which you just can’t let go of, consider requesting the authors to add F# support.  (Or pitch in yourself!)

Action: Be prepared to let go of a few cherished tools.

I hope some of these mental tricks help you get over that initial hump. If they do – welcome aboard!

Tags: ,

Calculating cumulative binomial distribution probability

by kiteason13. October 2013 08:15

On my current project, we recently came across the problem of calculating cumulative binomial distribution probability. The formula for this is relatively straightforward:

 binomial

Any decent programmer could code this up in short order.  In fact we used the function provided by Math.Net.

But there's a problem. Although the final results are modest numbers, some of the intermediate results are... well the term 'astronomical' is too small a word. Our failing case involved calculating 12798! (12798 factorial) - which in case you didn't know is 12798 * 12797 * 12796 ... * 2.  Yeah, that's a big number. According to Wolfram Alpha it works out at

factorial

Compare that to the estimated number of protons in the universe, which is a mere 10^80.

I had a think about this, and I wondered: how is Excel doing this?  Excel has a BINOM.DIST function which calculates for large inputs fine – and fairly quickly.  As luck would have it, Microsoft have actually published the logic they use, here: http://support.microsoft.com/kb/827459

Kudos to them for both developing this algorithm, and publishing it.

I was able to re-implement this in F# relatively easily, and our binomial statistical tests now work just fine.  I didn’t bother to switch to the special algorithm only for large inputs (as Excel does) as for our purposes the results don’t appear to be affected by the different approaches.

Since then I’ve refined the code a bit to make it less repetitive (and not proprietary!).  Here is the entire function.

(Full project with units tests here:  https://github.com/misterspeedy/Binomial)

If anyone fancies adding a more comprehensive set of tests, and incorporating the code into Math.Net, please feel free.

Tags: , ,

You're looking a little flushed

by kiteason8. October 2013 12:52

Recently someone raised the issue of how you might use active patterns to represent the various named Poker hands.  You can’t do it directly, as there are - depending on exactly how you count them - ten named hands.  You can only have 7 tags in a ‘complete’ active pattern.

I decided to experiment with partial active patterns to represent the hand types.  It didn’t end up as the snappy, fssnip-able example I was hoping for.  (Full code here.)  But it was very instructive.

The named poker hands are, in descending order of winning-power:

  • Royal Flush
  • Straight Flush
  • Four of a Kind
  • Full House
  • Flush
  • Straight
  • Three of a Kind
  • Two Pair
  • One Pair
  • High Card

The definitions I’ve used – and many of the test cases – are here: http://en.wikipedia.org/wiki/List_of_poker_hands

So – let the cards speak!  First, discriminated unions (DUs) to represent a suit:

…and a ‘rank’ (e.g. Ace, 10 of…):

(By the way I’ve been told off by @lu_a_jalla for using unicode characters.  Sorry, but they’re just so cute - I couldn’t resist it.)

Types to represent a card and a hand of cards:

And we’ll also need some utility functions which summarise the contents of a set of cards in various ways.  I won’t list them all but you’ll get the idea from these:

Now we can write some partial active patterns to work out the type of a hand.  Each partial active pattern either returns some specific tag (e.g. RoyalFlush), or 'None’. 

Here’s the one for Royal Flush:

…and for Straight Flush:

(Actually I suspect you could get away with not distinguishing between those two as ‘Royal’ is a special case of ‘Straight’.  But we’ll ignore that.)

Here’s the one for Four of a Kind:

You get the idea.  It’s worth saying that these partial active patterns need to be matched in order, because some of the higher hands would also be matched by lower hands’ active patterns.

Now we can get to the real business of Poker: deciding which of two hands wins.  Let’s have yet another DU to represent that:

You’ll see that this DU has a static member which can determine the outcome of comparing two card ranks.  (I initially started off by implementing IComparable, but that took me down the rabbit-hole of having to implement IEquatable and various operators.  So I backed out of that one!)  This member is used when we can determine a winner based on one card from each hands.  Hand-level comparisons are handled like this:

It’s a long gist, but worth it to see how this approach plays out.  You’ll see that I work down the named hands, in descending order of ‘power’.  For each hand I deal with the cases where a hand is pitted against a hand of same type (e.g. Flush with Flush), because for this case we need to go into detail about the cards within the hand.  Then I deal with the cases where we know a hand wins because it is pitted against an inferior hand.

One other subtlety: if you’re eagle eyed you might question why CompareHands takes arguments of Hand rather than arrays of cards.  The reason is because in lines like this:

…if we pass in Hand (whose constructor mandates there there are five cards in the hand) we know we are not going to hit an out of range index looking at the 0th card.

If you look at the source in GitHub you’ll see that I’ve also implemented a ‘showdown’ – in other words comparing a number of hands to determine an overall winner.  But that’s something for another post.

I was impressed by the ability of DUs and partial active patterns to model cards, hands, and the interactions between them.  But I’d be interested to hear if anyone has a more concise solution.

Tags: , , ,

F# Talk in Hereford

by kiteason11. September 2013 00:30

The Smart Devs group in Hereford kindly asked me to talk to them about F#.  We had a great evening, with loads of pizza and swag.  Thank you Smart Devs for such a friendly reception.

Here are links to the materials:

The planetarium demo – demonstrates use (and some of the pitfalls) of the SQL Server type provider.  (It’s worth reading the readme for this one before plunging in.)

https://github.com/misterspeedy/Stars

PowerPoint presentation – this includes some material on the Freebase type provider that we didn’t do because of slow internet connection.

http://www.slideshare.net/KitEason/f-presentation-for-smartdevs-herford

The code for the Freebase type provider demo (that we didn’t do):

https://github.com/misterspeedy/DemoFreebase

A version of the great circle distance calculator that uses units-of-measure:

http://fssnip.net/jA

Recommended reading:

Chris Smith’s ‘Programming F#’:

http://www.amazon.co.uk/Programming-F-3-0-Chris-Smith/dp/1449320295/ref=sr_1_1?ie=UTF8&qid=1378884411&sr=8-1&keywords=chris+smith+f%23

Dom Syme et. al.’s ‘Expert F# 3.0’

http://www.amazon.co.uk/Expert-3-0-3rd-Edition-Apress/dp/1430246502/ref=sr_1_1?ie=UTF8&qid=1378884430&sr=8-1&keywords=expert+f%23

Please tweet me (@kitlovesfsharp) if I’ve missed anything I promised to link to.

Tags:

Deduping a large file

by kiteason7. March 2013 09:08

I recently came across a requirement to dedupe a large text file.  The basic requirements were:

  • The file had > 100 million lines, each consisting of some text data, followed by a separator, followed by some more text data.  The text data was 30 to 100 characters long, the metadata 30 characters.
  • Two lines were duplicates of one another if they had the same text data (the first part); the metadata was ignored for the purpose of detecting a duplicate.
  • The deduped output needed to preserve the last line in any set of duplicates, and had to preserve the original order of lines.

I started as always by writing a set of tests.  (When working solo on algorithmic-style work, I tend to write all the tests I can think of before cutting any real code.)  Here are the first few lines of the tests module:

  1. open NUnit.Framework
  2. open FsUnit
  3. open Xunit
  4. open Xunit.Extensions
  5. open Dedupe
  6.  
  7. [<AutoOpen>]
  8. module TestUtils =
  9.  
  10. let LinesToSeq (lines : string) =
  11.         lines.Split([|'|'|]) |> Seq.ofArray
  12.  
  13. let Segment (i : int) (str : string) =
  14.         (str.Split([|','|]).[i]).GetHashCode()
  15.  
  16. [<TestFixture>]
  17. type ``Given a data set with no duplicates`` ()=
  18.  
  19.     [<Theory>]
  20.     [<InlineData("")>]
  21.     [<InlineData("one")>]
  22.     [<InlineData("one|\\cf0
  23.  two")>]
  24.     [<InlineData("one|\\cf0
  25.  two|\\cf0
  26.  three")>]
  27.     [<Test>]
  28. member test.``deduping the data set produces an identical data set`` (lines)=
  29. let expected = lines |> LinesToSeq |> Array.ofSeq
  30. let actual = lines |> LinesToSeq |> DedupeBy (Segment 0) |> Array.ofSeq
  31.         actual |> should equal expected
  32.  
  33. [<TestFixture>]
  34. type ``Given a data set with one duplicate on the first segment`` ()=
  35.  
  36.     [<Theory>]
  37.     [<InlineData("unique1,data1|\\cf0
  38.  unique1,data2", "unique1,data2")>]
  39.     [<Test>]
  40. member test.``deduping the data set produces a dataset consisting of the last line`` (lines, dedupedLines)=
  41. let expected = dedupedLines |> LinesToSeq |> Array.ofSeq
  42. let actual = lines |> LinesToSeq |> DedupeBy (Segment 0) |> Array.ofSeq
  43.         actual |> should equal expected

(Pasting has corrupted the InlineData attributes slightly.  The \\cf0 parts should just read \.)

Then I knocked up a fairly naive solution, which passed all my unit tests:

  1. let DedupeBy (f : string -> string) (seq : seq<string>) =
  2. // Make a dictionary of key values and the highest line that key appears on:
  3. let dict = new Dictionary<string,int>()
  4.     seq
  5.     |> Seq.iteri (fun i line -> let key = f line
  6.                                 dict.[key] <- i)
  7.  
  8. // Filter the input sequence to show only the highest-numbered lines:
  9.     seq
  10.     |> Seq.mapi (fun i line -> i, line)
  11.     |> Seq.filter (fun (i, line) -> let maxLineNo = dict.[(f line)]
  12.                                     i = maxLineNo)
  13.     |> Seq.map (fun (_, line) -> line)

As you can see, the approach was to:

1) Build a dictionary with the file line text (or the specified substring of the file line text) as the dictionary key, and the highest line number that text appears in as the dictionary value.  That's achieved by repeatedly overwriting the line number in the dictionary - which works because we are iterating through the lines in line number order.

2) Filter the file lines to pick only those whose line numbers are the highest for that text - which is identified by looking back at the dictionary.

Note that the parameter 'f' in the code is used to inject a function which is responsible for picking out whatever substring of each file line constitutes the portion which needs to be unique.  Given that the separator in the input file was a tab, this function could be this:

  1. // Pick out the i'th substring, splitting on tab:
  2. let Segment (i : int) (str : string) =
  3.     str.Split([|'\t'|]).[i]

(The input file was very consistent in format, so I didn’t bother with any checking to see if element i existed in the line.)

Actually calling the deduping function was simply:

  1. DedupeBy (Segment 0)

This worked well on my laptop – provided volumes were modest.  However when I moved the code onto a desktop machine for volume testing I had a setback.  The same code, with the same data, threw an exception suggesting that I couldn’t access an object which had already been garbage collected.  I was a little surprised that this hadn’t happened on my laptop, but after a bit of googling I tracked it down to the fact that my code was double-iterating the sequence – see the two lines beginning seq |> above.

It was also pretty obvious that the code was holding more in memory than it really needed to.  Clearly the Dictionary-based approach was going to be heavy on memory anyway, but it was at least possible to give the garbage collector a chance by not sending in the data from outside in a sequence.  Instead I sacrificed a little elegance by passing in a filename and making the dedupe function do its own I/O:

  1. let DedupeFileBy (f : string -> string) (fileName : string) =
  2. // Make a dictionary of key values and the highest line that key appears on:
  3. let dict = new Dictionary<string,int>()
  4.     IO.File.ReadLines(fileName)
  5.     |> Seq.iteri (fun i line -> dict.[f line] <- i)
  6.  
  7. // Filter the input sequence to show only the highest-numbered lines:
  8. let results =
  9.         IO.File.ReadLines(fileName)
  10.         |> Seq.mapi (fun i line -> i, line)
  11.         |> Seq.choose (fun (i, line) -> if i = dict.[f line] then Some(line) else None)
  12.  
  13.     IO.File.WriteAllLines(fileName + ".dedupe", results)

For completeness here’s the console application’s main function:

  1. [<EntryPoint>]
  2. let main args =
  3. let sw = new System.Diagnostics.Stopwatch()
  4.     sw.Restart()
  5.     Dedupe.DedupeFileBy (Segment 0) @"d:\temp\29mx4.txt"
  6.     System.Console.WriteLine("Time: " + sw.Elapsed.ToString())
  7.     System.Console.ReadKey(true) |> ignore
  8.     0

This performed pretty well.  I was able to process a 116,000,000 line file, with 4 duplicates for each line, in 4 minutes 34 seconds.  Not bad for a 10Gb file!

Tags:

About the author

F# bore

Month List

Page List