Sunday, October 21, 2012

F# on Algorithms - Reservior sampling

The reservior sampling is something related to processing large amount of data. This algorithm is also related to the shuffle algorithm. Its definition on wiki is here. In this sample, I also want to demonstrate the power of active pattern. The algorithm is trying to get N element from a large data set, the data set is so big and it is not possible to hold it in the memory. Let us treat the big data as a Seq.

the algorithm first try to fill the result set whose size is resultSize. Once successfully get resultSize elements, the rest element is selected randomly based on a uniform distribution from 0 to current. If the generated random number is in the range of [0, resultSize), replace the result set's element with the new element. So the new element has the probability resultSize/currentIndex.

 let rand = System.Random()  
 let (|InReserviorRange|_|) i resultSize =  
   if i < resultSize then  
     Some()  
   else  
     None  
 type ResultList<'a> = System.Collections.Generic.List<'a>  
 let result = ResultList<_>()  
 let reserviorSampling data resultSize =   
   let (|InReserviorRange|_|) i = (|InReserviorRange|_|) i resultSize    
   data  
   |> Seq.iteri (fun index n ->  
           match index with  
           | InReserviorRange->  
             result.Add(n)  
           | _ ->    
             let newIndex = rand.Next(index)             
             match newIndex with  
             | InReserviorRange ->   
               result.[newIndex] <- n  
             | _ -> ())  
 let seq = seq { for i=0 to 100 do yield i }  
 reserviorSampling seq 5  
 result |> Seq.iter (printfn "%A")  

Note: For a real world application, you have to rewrite the random number generator for number bigger than int32's max value.

In this code, I also want to show the power of active pattern. The InReserviorRange is an active pattern. It make sure the i
I had had made some bugs because ignore the ELSE scenario or think ELSE scenario is not necessary. The match + active pattern can prevent this kind of error. The active pattern just another way to define a function, no much extra cost than define a function. And it gives extra check with match and improves the code readability.

The active pattern can also use the higher order function idea, look at the second InReserviorRange pattern, that is the syntax how to define second pattern based on current pattern.

No comments: