March 2, 2012

Restringing a Racket with Haskell (pt 5)

In the last section we finished up an initial version of our microdata extractor -- one that actually works this time. Admittedly, the code was starting to get a little ugly with all the state propagation going around. But don't worry -- with a simple monad we can simplify things and make them pretty once again.

Oh! Did I say monad? We aren't going to write yet another monad tutorial here, I hope! There are far too many of them in the world already. Instead I'm just going to port the code from part 3 to use the State monad to show that it's not that scary, and actually a little bit helpful.

First, we must import Control.Monad.State. If you go look at the package description with the link above, you'll see that it really uses Control.Monad.State.Lazy by default (there's a strict variant, but we won't go into it here).

Next, we need to figure out where to use this crazy thing. In the last part, we talked about propagating the scope stack -- the state -- through various operations. This was always apparent because their signatures ended with either Itemscope -> Itemscope or State -> State. Seeing something like this is a tell-tale sign that the State monad might be useful.

The two main operations that State provides are get and put. In many ways you can think of them as having a little mutable memory location where you can squirrel away things you need to update -- rather like box in Racket. However, unlike box, State follows the control flow of your program rather than just floating around in memory somewhere. (Haskell provide actual mutable Ref cells that do "float around in memory", but let's not go into that here.) State is cooler than some of the alternatives because it really is functional even though it feels imperative. Awesomeness abounds.

To use State with our microdata extractor program, we simply need to define a state monad that our program's operations will be "in". By "in," I mean that the final type in these operation's signature will be this state monad, or in other words, they'll return a monad value as their result. What's a monad value? It's just a function. In our case, it's just a function that takes a scope stack and returns a scope stack (possibly a different one than the one you gave).

Monads are a truly clever trick, and one that's well worth getting your head around in general. At a fundamental level, they're all about sequencing operations together (and Haskell provides build-in syntax for making this sequencing easier -- the do statement). But here, our interest is just a very specific kind of monad that deals with state.

So let's define a state monad for microdata extraction called ExtractState:

type ExtractState a = State (Itemscope, [Itemscope]) a

The type keyword defines a type synonym (like typedef in C++), and this type is parametrized by the variable a, which will be the result type of a state operation. ExtractState a is equivalent to the State type where the type of the actual state is fixed, and is the type of our state from the last section, (Itemscope, [Itemscope]). That's it.

Now the magic of defining this state type is that when we're in the ExtractState monad, calling get will get us our state -- seemingly out of the ether. And calling put will allow us to store a new scope stack into the background radiation. Really what's happening is that there's one more level of functional argument passing just below the monad surface that is shuttling our state around. Best not to think about it too deeply. Let's just use it.

Let's begin by defining a few helper functions to access the various parts of our state. We can define these in terms of get and put that State defines:

getScope :: ExtractState Itemscope
getScope = do (scope, _) <- get; return scope

putScope :: Itemscope -> ExtractState ()
putScope scope = do (_, orphans) <- get; put (scope, orphans)

You can see in put that the old scope is dropped on the floor, and the new scope is stored in its place. If we want to get fancier, we can rewrite putScope in terms of modify which passes the old state to a function which should return a new state to put:

putScope :: Itemscope -> ExtractState ()
putScope scope = modify $ \(_, orphans) -> (scope, orphans)

We can also write a state accessor to save an otherwise orphaned Itemscope onto the list of orphans:

saveOrphan :: Itemscope -> ExtractState ()
saveOrphan r = modify $ \(scope, orphans) -> (scope, r:orphans)

With these, we're now ready to rewrite our code to use the ExtractState monad. Let's begin with extractMicrodata which used to look like this:

extractMicrodata :: Node -> State -> State
extractMicrodata elt@(Element _ _ children) state =
  foldr extractItemprops state children
extractMicrodata _ state = state

We can shift it into ExtractState gear by rewriting it like this (deferring the body of the first function cases for a moment):

extractMicrodata :: Node -> ExtractState ()
extractMicrodata elt@(Element _ _ children) = undefined
extractMicrodata _ = return ()

From this, you can see that the signature changed from Node -> State -> State to Node -> ExtractState (). ExtractState hides the details of the state transitions into the monad operations that are used to implement this type -- functions that take a state and return a state. At the same time, the scopes parameters went away because we can access it with our monad accessor functions when it's needed. The () (called "unit") type parameter is like void in C++. It simply means that there's no interesting value returned from this function (although the real "value" here is the monad itself). We've used the standard monad return function to return a unit value in the second function case. That's because we don't modify the state, and don't really have a return value otherwise.

Now filling in the details:

extractMicrodata :: Node -> ExtractState ()
extractMicrodata elt@(Element _ _ children) =
  foldr extractItemprops children
extractMicrodata _ = return ()

we'll get into a bit of trouble:

/Users/warrenharris/projects/racket-hs/microdata5.hs:64:3:
    Couldn't match expected type `StateT
                                    (Itemscope, [Itemscope]) Data.Functor.Identity.Identity ()'
                with actual type `[a0] -> b0'
    Expected type: ExtractState ()
      Actual type: [a0] -> b0
    In the return type of a call of `foldr'
    In the expression: foldr extractItemprops children
Failed, modules loaded: none.

We can see that the expected type was ExtractState (), whereas the actual type was a function from lists to something: [a0] -> b0. We didn't even get the arity right here since foldr is expecting another argument (the initial accumulator which was our scope stack). Using map instead isn't much better:

/Users/warrenharris/projects/racket-hs/microdata5.hs:64:3:
    Couldn't match expected type `StateT
                                    (Itemscope, [Itemscope]) Data.Functor.Identity.Identity ()'
                with actual type `[b0]'
    Expected type: ExtractState ()
      Actual type: [b0]
    In the return type of a call of `map'
    In the expression: map extractItemprops children
Failed, modules loaded: none.

Now we're returning a list instead of a monad. What we need a monadic version of map instead. Fortunately, there is just such a thing called mapM:

extractMicrodata :: Node -> ExtractState ()
extractMicrodata elt@(Element _ _ children) =
  do _ <- mapM extractItemprops children
     return ()
extractMicrodata _ = return ()

Since mapM returns a list of results, we're discarding it with the wildcard bind, and returning unit instead. We could also write this using Haskell's lower-level monad operation >> which throws away the result of the left-hand side:

extractMicrodata :: Node -> ExtractState ()
extractMicrodata elt@(Element _ _ children) =
  mapM extractItemprops children >> return ()
extractMicrodata _ = return ()

But this is still unsatisfying because we're constructing a list that we're just going to throw away. What we need is a version of mapM that doesn't build lists at all, but simply iterates over the input list, sequencing the monadic operations accordingly and throwing away whatever they return. Fortunately there is just such an operation, called mapM_ (the underscore designates "returns nothing"). And since what we're doing looks more like iteration than anything, there's a variant of mapM_ called forM_ that reverses its argument order so that it looks more like a foreach construct:

*Main> :t mapM
mapM :: Monad m => (a -> m b) -> [a] -> m [b]
*Main> :t mapM_
mapM_ :: Monad m => (a -> m b) -> [a] -> m ()
*Main> :t forM_
forM_ :: Monad m => [a] -> (a -> m b) -> m ()

Let's use forM_ to simplify once again:

extractMicrodata :: Node -> ExtractState ()
extractMicrodata elt@(Element _ _ children) =
  forM_ children $ (\child ->
    extractItemprops child)
extractMicrodata _ = return ()

Hey, that looks a lot like imperative code! For each child in children, run the operation extractItemprops passing it the child. (The backslash here is a lambda, and in fact, since the precedence of lambda expressions is higher than $ we don't need the parens around the lambda expression either.) But since child is the last parameter passed to extractItemprops, we can simplify this just a little further (because \x -> f x is just f):

extractMicrodata :: Node -> ExtractState ()
extractMicrodata elt@(Element _ _ children) =
  forM_ children extractItemprops
extractMicrodata _ = return ()

Let's move on to extractNewItemscope. Here was our previous definition:

extractNewItemscope :: Node -> State -> (Itemscope, State)
extractNewItemscope elt@(Element _ attrs _) (scope, orphans)
  | isJust $ lookup "itemscope" attrs =
    let newScope = Itemscope (lookup "itemtype" attrs) []
        (scopeVal, orphans') = extractMicrodata elt (newScope, orphans)
    in (scopeVal, (scope, orphans'))
extractNewItemscope node state =
  let state'@(scopeVal, _) = extractMicrodata node state
  in (scopeVal, state')  -- return current scope as scopeVal

This time we don't have a simple State -> State at the end of the function signature. Instead the final return type is returning a value and a new state, State -> (Itemscope, State). Fortunately, we can describe this in the State monad as ExtractState Itemscope:

extractNewItemscope :: Node -> ExtractState Itemscope
extractNewItemscope elt@(Element _ attrs _)
  | isJust $ lookup "itemscope" attrs = undefined
extractNewItemscope node = undefined

Filling in the details now:

extractNewItemscope :: Node -> ExtractState Itemscope
extractNewItemscope elt@(Element _ attrs _)
  | isJust $ lookup "itemscope" attrs = do
    prevScope <- getScope
    let newScope = Itemscope (lookup "itemtype" attrs) []
    putScope newScope
    extractMicrodata elt
    scopeVal <- getScope
    putScope prevScope
    return scopeVal
extractNewItemscope node = do
  extractMicrodata node
  scopeVal <- getScope
  return scopeVal  -- return current scope as scopeVal

From this you can see that we use the getScope operation to get the current scope. The back-arrow operator (<-) lets us bind a pattern to the result of a monad operation, and here we're binding the variable prevScope. We create a newScope using a let statement (note that a let in a do sequence isn't followed by the in keyword). Next we use putScope to store the new scope, then call extractMicrodata, then use getScope again to get the modified verion of newScope (now called scopeVal). Then we use putScope again to put the previous scope back, and finally return scopeVal.

Now don't confuse let foo = ... and do foo <- .... They both more or less do something similar -- assign a variable -- but the latter works with monad operations. If we had used a let with a monad operation instead, we'd be getting our hands on the monad itself, not the return value that's in the monad. Using let with monads can be quite useful when manipulating control flow, or plugging pieces of programs together -- just not what we're interested in doing here. Monads are really powerful because they decouple the work of screwing the plumbing together from running the crap through the pipes (if you catch my drift).

Of course the code above is kind of ugly. Most of it is just manipulating previous and current scope. We can wrap this up in a nice little helper function that lets us lexically set the current scope:

withScope :: Itemscope -> ExtractState a -> ExtractState a
withScope scope m = do
  prevScope <- getScope
  putScope scope
  rv <- m
  putScope prevScope
  return rv

Here, withScope takes a scope and another monad operation, m. It sets up the new scope, invokes m, reverts the scope, and returns m's value. (If you were worried about m raising exceptions, you'd want to put an exception handler in here to revert to the previous scope value, but we don't raise any exceptions in our code.)

We can now simplify extractNewItemscope:

extractNewItemscope :: Node -> ExtractState Itemscope
extractNewItemscope elt@(Element _ attrs _)
  | isJust $ lookup "itemscope" attrs =
    withScope (Itemscope (lookup "itemtype" attrs) []) $ do
      extractMicrodata elt
      scopeVal <- getScope
      return scopeVal
extractNewItemscope node = do
  extractMicrodata node
  scopeVal <- getScope  -- return current scope as scopeVal
  return scopeVal

or using the >> operation to further simplify:

extractNewItemscope :: Node -> ExtractState Itemscope
extractNewItemscope elt@(Element _ attrs _)
  | isJust $ lookup "itemscope" attrs =
    withScope (Itemscope (lookup "itemtype" attrs) []) $
      extractMicrodata elt >> getScope
extractNewItemscope node =
  extractMicrodata node >> getScope

Moving on to the extractItemprops function:

extractItemprops :: Node -> ExtractState ()
extractItemprops elt@(Element _ attrs _) = do
  scopeVal <- extractNewItemscope elt
  case lookup "itemprop" attrs of
    Just name -> do
      scope <- getScope
      putScope $ addProp name (extractHtml5Value elt scopeVal) scope
    Nothing -> saveOrphan scopeVal
extractItemprops _ = return ()

There shouldn't be any surprises at this point. We're just using getScope to get the current scope, modifying it with addProp, and using putScope to store the new value back. This could be simpler too if we had a version of modify that only operated on the scope part of the state:

modifyScope :: (Itemscope -> Itemscope) -> ExtractState ()
modifyScope f = modify $ \(scope, orphans) -> (f scope, orphans)

extractItemprops :: Node -> ExtractState ()
extractItemprops elt@(Element _ attrs _) = do
  scopeVal <- extractNewItemscope elt
  case lookup "itemprop" attrs of
    Just name ->
      modifyScope $ addProp name (extractHtml5Value elt scopeVal)
    Nothing -> saveOrphan scopeVal
extractItemprops _ = return ()

Monads running free

So I've shown you how to rewrite some simple state propagating code to look cleaner and neater using the State monad, but how to you set it in motion -- supplying the initial state so that the monad can actually run? Well, every monad has some kind of run function that does this, and for State it's called (appropriately), runState. We can use this to rewrite extractMicrodataFromDoc:

extractMicrodataFromDoc :: Document -> [Itemscope]
extractMicrodataFromDoc doc =
  let f = mapM_ extractItemprops $ docContent doc
      initialState = (Itemscope (Just "top") [], [])
      ((), (scope, orphans)) = runState f initialState
  in scope:orphans

What runState does is it takes an initial state (the empty scope stack in this case, []) and returns a pair of the final answer (unit here, ()) and the final state (scopes). Since we want extractMicrodataFromDoc to return the scope stack as its answer we just discard the unit result and return scopes instead.

Note that extractMicrodataFromDoc is not itself a monadic operation (since it's return type isn't ExtractState). That means we can call it just like before in our test case:

test1 = "
\ \ Foobar\ \
\ \

quux

\ \
\ \
" result1 = fmap extractMicrodataFromDoc $ parseHTML "" test1

We're done

{-# LANGUAGE OverloadedStrings #-}

import Control.Monad (forM)
import Control.Monad.State
import Data.Foldable (foldMap)
import Data.List (find)
import Data.Maybe (isJust)
import Data.Text (Text, concat)
import Prelude hiding (concat)
import Text.XmlHtml

data Itemprop = Itemprop {
  itempropName :: Text,
  itempropValue :: MicrodataValue
  } deriving (Show, Eq)

data Itemscope = Itemscope {
  itemscopeType :: Maybe Text,
  itemscopeProperties :: [Itemprop]
  } deriving (Show, Eq)

addProp :: Text -> MicrodataValue -> Itemscope -> Itemscope
addProp name value (Itemscope t props) =
  Itemscope t (Itemprop name value : props)

hasItemscope :: Node -> Bool
hasItemscope elt@(Element _ _ _) =
  isJust $ lookup "itemscope" $ elementAttrs elt
hasItemscope _ = False

data MicrodataValue = ItemscopeValue Itemscope
                    | TextValue Text
                    deriving (Show, Eq)

extractHtml5Value :: Node -> Itemscope -> MicrodataValue
extractHtml5Value elt scope | hasItemscope elt = ItemscopeValue scope
extractHtml5Value elt scope =
  TextValue $ foldMap getText $ elementChildren elt
  where getText elt@(TextNode text) = text
        getText _ = ""

type ExtractState a = State (Itemscope, [Itemscope]) a

getScope :: ExtractState Itemscope
getScope = do (scope, _) <- get; return scope

putScope :: Itemscope -> ExtractState ()
putScope scope = modify $ \(_, orphans) -> (scope, orphans)

saveOrphan :: Itemscope -> ExtractState ()
saveOrphan r = modify $ \(scope, orphans) -> (scope, r:orphans)

withScope :: Itemscope -> ExtractState a -> ExtractState a
withScope scope m = do
  prevScope <- getScope
  putScope scope
  rv <- m
  putScope prevScope
  return rv

modifyScope :: (Itemscope -> Itemscope) -> ExtractState ()
modifyScope f = modify $ \(scope, orphans) -> (f scope, orphans)

extractMicrodata :: Node -> ExtractState ()
extractMicrodata elt@(Element _ _ children) =
  forM_ children extractItemprops
extractMicrodata _ = return ()

extractNewItemscope :: Node -> ExtractState Itemscope
extractNewItemscope elt@(Element _ attrs _)
  | isJust $ lookup "itemscope" attrs =
    withScope (Itemscope (lookup "itemtype" attrs) []) $
      extractMicrodata elt >> getScope
extractNewItemscope node =
  extractMicrodata node >> getScope

extractItemprops :: Node -> ExtractState ()
extractItemprops elt@(Element _ attrs _) = do
  scopeVal <- extractNewItemscope elt
  case lookup "itemprop" attrs of
    Just name ->
      modifyScope $ addProp name (extractHtml5Value elt scopeVal)
    Nothing -> saveOrphan scopeVal
extractItemprops _ = return ()

extractMicrodataFromDoc :: Document -> [Itemscope]
extractMicrodataFromDoc doc =
  let f = mapM_ extractItemprops $ docContent doc
      initialState = (Itemscope (Just "top") [], [])
      ((), (scope, orphans)) = runState f initialState
  in scope:orphans

-- Debugging stuff:

html str = case parseHTML "" str of
  Left msg -> error msg
  Right rslt -> head $ docContent rslt

topscope = Itemscope (Just "top") []

test1 = "
\ \
\ \

quux

\ \
\ \ Foobar\ \
" result1 = fmap extractMicrodataFromDoc $ parseHTML "" test1
I hope you got something out of this long-winded tour. Programming in Haskell can be pretty nice once you get over a few hurdles.

No comments:

Post a Comment