February 25, 2012

Restringing a Racket with Haskell (pt 3)

Now we tackle the tricky part. extract-microdata uses the dreaded hobgoblin of functional programming: mutable state! Annoyingly convenient little boxes with updatable values, and mutable struct fields. One problem with all this mutable stuff is that it allows the control flow to follow a different path from the state changes. To turn this into functional Haskell code, we'll need to do a bit of untangling so that state can travel along with the control flow, and be replace with new state whenever a change is needed.

Here's the Racket function again, for reference:

(define (extract-microdata root [scope (void)] [result (box '())])
  (let ([new-scope (void)])
    (when (h:html-full? root)
      (for ([element (h:html-full-content root)])
           (match element
             [(h:html-element (list (x:attribute start stop key value) ...))
              (let ([attrs (for/list ([k key] [v value])
                                     (cons k v))])
                (when (assoc 'itemscope attrs)
                  (set! new-scope
                        (itemscope (cdr (or (assoc 'itemtype attrs)
                                            (cons "" #f))) '()))
                  (when (void? scope)
                    (set-box! result (cons new-scope (unbox result)))))

                (for ([prop attrs]
                      #:when (eq? (car prop) 'itemprop))
                     (set-itemscope-properties!
                      scope (cons (itemprop (cdr prop)
                                            (extract-html5-value element
                                              new-scope))
                                  (itemscope-properties scope)))))
              (extract-microdata element new-scope result)]
             [else #f]))))
  (unbox result))

This function iterates over all the content nodes of the document, and when they are elements it potentially creates a new itemscope structs and pushes them onto/into the result box. It also iterates over all the attributes of the elements looking for itemprop attributes. When it finds them it extracts their HTML5 value and pushes it into the properties field of the current itemscope object. Finally it recurses over each of the element's children. Perhaps we can break this down into several more manageable pieces.

First, let's focus on recognizing itemscope attributes and creating new Itemscope values. Let's call this piece extractNewItemscope. Since this function can potentially push a new Itemscope on a stack, it needs to take a list of Itemscope as input, and produce another one as output (linked lists in Haskell make perfect stacks):

extractNewItemscope :: Node -> [Itemscope] -> [Itemscope]
extractNewItemscope elt scopes = undefined

Again, we need to make function cases based on whether the Node is an Element or something else, and we need to test for when the attributes of the element contain an "itemscope" key:

extractNewItemscope :: Node -> [Itemscope] -> [Itemscope]
extractNewItemscope elt@(Element _ attrs _) scopes =
  if isJust $ lookup "itemscope" attrs
  then undefined
  else scopes
extractNewItemscope _ scopes = scopes

Since the first thing we're doing in the first function case is another test, we can use a guard clause to further refine it, letting the second case pick up the work the else clause was doing:

extractNewItemscope :: Node -> [Itemscope] -> [Itemscope]
extractNewItemscope elt@(Element _ attrs _) scopes
  | isJust $ lookup "itemscope" attrs =
    undefined
extractNewItemscope _ scopes = scopes

Now the body of this function only needs to build a new Itemscope value and push it onto the scopes stack. The type of this Itemscope is the value of the "itemtype" attribute (if any), and the properties start out empty. The colon (:) is the "cons" operator that pushes a new value onto the left-hand side of an existing list (scopes here):

extractNewItemscope :: Node -> [Itemscope] -> [Itemscope]
extractNewItemscope elt@(Element _ attrs _) scopes
  | isJust $ lookup "itemscope" attrs =
    Itemscope (lookup "itemtype" attrs) [] : scopes
extractNewItemscope _ scopes = scopes

Next, let's define a function to update the properties of an Itemscope when we see an "itemprop" attribute. To make this clearer, we'll first define a helper function to add a new property to an existing Itemscope: (since it's functional, it returns a new Itemscope):

addProp :: Text -> MicrodataValue -> Itemscope -> Itemscope
addProp name value (Itemscope t props) =
  Itemscope t (Itemprop name value : props)

Next we can use this to update the the top-most Itemscope on the stack when we see an "itemprop" attribute:

extractItemprop :: Node -> (Text, Text) -> [Itemscope] -> [Itemscope]
extractItemprop elt ("itemprop", name) (scope:scopes) =
  addProp name (extractHtml5Value elt scope) scope : scopes
extractItemprop _ _ scopes = scopes

The first line of the let creates a new Itemprop value with the name and HTML5 value. The second line pushes this new Itemprop onto the list of existing properties. The third line then creates a new Itemscope value that will replace the one we're starting with. Notice how we pattern matched the entire scope stack with (scope:scopes) so that we can throw the old scope away and return the tail of the list with newScope pushed on the front.

Oops -- when we compile this we see that another tweak is needed. We previously had defined the value field of the Itemprop data type as Text, but in part 2 we discovered that values needed to be of type MicrodataValue. Let's fix that:

data Itemprop = Itemprop {
  itempropName :: Text,
  itempropValue :: MicrodataValue
  } deriving (Show, Eq)

One nice thing about Haskell modules (the stuff within a single file) is that all the definitions are mutually recursive. That means that you can list functions in top-down dependency order, bottom-up, or however you want to organize things. The same goes for types, which is why we can change this field to MicrodataValue even though it's defined below in the file. (I spent several years programming in O'Caml, and it requires things to be defined in bottom-up dependency order. Believe me, this is better.)

Next, let's define a function that's similar to extractItemprop, but that works with all attributes of an element, not just one. We might be able to use map, but since we want each attribute to be able to update the top-most Itemscope on the stack we need to thread the stack (scopes) through all these map operations. That's exactly what a "fold" does -- it passes an accumulator parameter along and expects you to accumulate your result into the accumulator instead of just returning it like map does. map is a fold where the accumulation is done for you -- the function results are accumulated into a list.

We'll use foldr to do this. foldr folds from right-to-left. This might seem backwards, but its usually the right thing in Haskell. The reason it's often the right thing is because lists are built by pushing newer elements on the left, so if you want to preserve the order of your results with respect to the original list, you need to start from the right. Also, if you're relying on Haskell's laziness to avoid computing values from the tail of the list that never used, then foldr is exactly what you want. (On the other hand, if you don't care about preserving the left-to-right order, then foldl' is way more efficient. See Foldr Foldl Foldl' if you want to understand this better.)

extractItemprops :: Node -> [Itemscope] -> [Itemscope]
extractItemprops elt@(Element _ attrs _) scopes =
  foldr (extractItemprop elt) scopes attrs
extractItemprops _ scopes = scopes

That's it for extractItemprops. The scopes stack is passed to each extractItemprop call, and a new scope stack is returned that will be passed to the next call, or returned as the final result. Note that rather than just calling extractItemprop directly we doing a bit of currying magic here, calling (extractItemprop elt) instead. This gives each call the same elt parameter, and the partial application returns a new function that fits the requirement for foldr.

Now we're at a place where we can put these pieces together to write a functional version of extract-microdata. Since it's functional we'll want to pass our state along at every step, and return new state as a result. And since it will want to operate on every child of an element, we'll want to use our frind foldr again. The only tricky part if figuring out what each step performed by foldr actually needs to do:

extractMicrodata :: Node -> [Itemscope] -> [Itemscope]
extractMicrodata elt@(Element _ _ children) scopes =
  foldr go scopes children
  where go child scopes = undefined
extractMicrodata _ scopes = scopes

Hopefully at this point, there are no surprises here. The extractMicrodata function looks a lot like extractItemprops that we've seen already. But why name the local function "go"? That's just a common Haskell idiom (in Zetalisp it was always "frob", so this is just another step forward).

The body of go just needs to thread our state (scopes) through each of the steps:

extractMicrodata :: Node -> [Itemscope] -> [Itemscope]
extractMicrodata elt@(Element _ _ children) scopes =
  foldr go scopes children
  where go child scopes =
          let scopes' = extractNewItemscope child scopes
              scopes'' = extractItemprops child scopes'
          in extractMicrodata child scopes''
extractMicrodata _ scopes = scopes

First extractNewItemscope is called which potentially pushes a new Itemscope on the stack. We'll call its result scopes' (using a "prime" is just another idiom -- the single quote is a perfectly valid identifier character in Haskell). Then extractItemprops is called with scopes' and returns scopes''. Finally we recursively call extractMicrodata for the child so we can process its children, passing scopes''. This ultimately returns a new scope stack as our result.

Of course we're never happy with this much verbosity in Haskell, so we can write this a little more concisely once again:

extractMicrodata :: Node -> [Itemscope] -> [Itemscope]
extractMicrodata elt@(Element _ _ children) scopes =
  foldr go scopes children
  where go child scopes =
          extractMicrodata child $
            extractItemprops child $
              extractNewItemscope child scopes
extractMicrodata _ scopes = scopes

How's that? Well, let me counter your question with how's this?:

extractMicrodata :: Node -> [Itemscope] -> [Itemscope]
extractMicrodata elt@(Element _ _ children) scopes =
  foldr go scopes children
  where go child scopes =
          (extractMicrodata child .
           extractItemprops child .
           extractNewItemscope child) scopes
extractMicrodata _ scopes = scopes

We're using the function composition operator, dot (.), to compose the 3 functions that all get passed scopes. Note that the composition operator in Haskell works by calling the right-most function first, and passing the result to the left most -- the way mathematicians do it, but perhaps backwards from what we expect from most programming languages. It is defined as:

(.) :: (b -> c) -> (a -> b) -> a -> c
(f . g) x = f (g x)

One more tweak:

extractMicrodata :: Node -> [Itemscope] -> [Itemscope]
extractMicrodata elt@(Element _ _ children) scopes =
  foldr go scopes children
  where go child =
          extractMicrodata child .
          extractItemprops child .
          extractNewItemscope child
extractMicrodata _ scopes = scopes

Now we're dropping the application of the composed functions to scopes, and dropping scopes from the parameter list. Instead, we're just returning the composed function itself since it has the correct type. We're moving into what they call the "point-free style" here (sometimes referred to as "pointless style"). Maybe this is getting too carried away.

Wrap it up...

The only thing we need to do now is put a bow on it, and test it out.

extractMicrodataFromDoc :: Document -> [Itemscope]
extractMicrodataFromDoc doc = foldr extractMicrodata [] $ docContent doc

test1 = "
\ \
\ \

quux

\ \
\ \ Foobar\ \
" result1 = case parseHTML "" test1 of Left msg -> Left msg Right tree -> Right $ extractMicrodataFromDoc tree

extractMicrodataFromDoc simply folds extractMicrodata over all the document's content. The variables test1 and result1 are things I've thrown in for testing purposes. You shouldn't leave them in production code, but if you evaluate result1 in the *haskell* buffer, you'll see our handiwork:

*Main> result1
Right [Itemscope {itemscopeType = Nothing, itemscopeProperties = [Itemprop {itempropName = "name", itempropValue = TextValue "Foobar"},Itemprop {itempropName = "name", itempropValue = TextValue "quux"},Itemprop {itempropName = "item", itempropValue = ItemscopeValue (Itemscope {itemscopeType = Nothing, itemscopeProperties = []})}]},Itemscope {itemscopeType = Just "type", itemscopeProperties = []}]

Of course, there are always improvements to be made. Note how in the definition of result1 we're reaching into the Either data structure and in the Right case calling another function on what's there. Effectively, we're doing an operation that might be generalized like this:

mapEither :: (a -> b) -> Either c a -> Either c b
mapEither f (Either l r) = Either l (f r)

Well, there's a very common generalization of this pattern called fmap. Instead of operating on an Either data type, it operates on anything that supports the Functor type class.

fmap :: Functor f => (a -> b) -> f a -> f b

The Functor type class is basically one that allows functions applied to data structures to be applied to that data structure's elements. We can use that to simplify the result1 definition:

extractMicrodataFromDoc :: Document -> [Itemscope]
extractMicrodataFromDoc doc = foldr extractMicrodata [] $ docContent doc

test1 = "
\ \
\ \

quux

\ \
\ \ Foobar\ \
" result1 = fmap extractMicrodataFromDoc $ parseHTML "" test1

fmap takes a bit to get your head around, but it's really powerful and useful. If you look at the definition of mapEither that I gave you can get some idea of how it works. In fact, given that definition, we could have defined the Functor instance for Either like this (if we didn't have a definition already):

instance Functor (Either a) where
  fmap = mapEither

Why (Either a) in the above? Great Good explains it better than I can.

Review

Where are we now?

{-# LANGUAGE OverloadedStrings #-}

import Data.Foldable (foldMap)
import Data.List (find)
import Data.Maybe (isJust)
import Data.Text (Text, concat)
import Prelude hiding (concat)
import Text.XmlHtml

data Itemprop = Itemprop {
  itempropName :: Text,
  itempropValue :: MicrodataValue
  } deriving (Show, Eq)

data Itemscope = Itemscope {
  itemscopeType :: Maybe Text,
  itemscopeProperties :: [Itemprop]
  } deriving (Show, Eq)

addProp :: Text -> MicrodataValue -> Itemscope -> Itemscope
addProp name value (Itemscope t props) =
  Itemscope t (Itemprop name value : props)

hasItemscope :: Node -> Bool
hasItemscope elt@(Element _ _ _) =
  isJust $ lookup "itemscope" $ elementAttrs elt
hasItemscope _ = False

data MicrodataValue = ItemscopeValue Itemscope
                    | TextValue Text
                    deriving (Show, Eq)

extractHtml5Value :: Node -> Itemscope -> MicrodataValue
extractHtml5Value elt scope | hasItemscope elt = ItemscopeValue scope
extractHtml5Value elt scope =
  TextValue $ foldMap getText $ elementChildren elt
  where getText elt@(TextNode text) = text
        getText _ = ""

extractNewItemscope :: Node -> [Itemscope] -> [Itemscope]
extractNewItemscope elt@(Element _ attrs _) scopes
  | isJust $ lookup "itemscope" attrs =
    Itemscope (lookup "itemtype" attrs) [] : scopes
extractNewItemscope _ scopes = scopes

extractItemprop :: Node -> (Text, Text) -> [Itemscope] -> [Itemscope]
extractItemprop elt ("itemprop", name) (scope:scopes) =
  addProp name (extractHtml5Value elt scope) scope : scopes
extractItemprop _ _ scopes = scopes

extractItemprops :: Node -> [Itemscope] -> [Itemscope]
extractItemprops elt@(Element _ attrs _) scopes =
  foldr (extractItemprop elt) scopes attrs
extractItemprops _ scopes = scopes

extractMicrodata :: Node -> [Itemscope] -> [Itemscope]
extractMicrodata elt@(Element _ _ children) scopes =
  foldr go scopes children
  where go child =
          extractMicrodata child .
          extractItemprops child .
          extractNewItemscope child
extractMicrodata _ scopes = scopes

extractMicrodataFromDoc :: Document -> [Itemscope]
extractMicrodataFromDoc doc = foldr extractMicrodata [] $ docContent doc

test1 = "
\ \
\ \

quux

\ \
\ \ Foobar\ \
" result1 = fmap extractMicrodataFromDoc $ parseHTML "" test1

So we're done? Well, if you're an astute reader, you've already realized that this code doesn't actually work as intended. Stay tuned for part 4 where we'll do a little debugging.

No comments:

Post a Comment