In the last section we finished up an initial version of our microdata extractor -- one that actually works this time. Admittedly, the code was starting to get a little ugly with all the state propagation going around. But don't worry -- with a simple monad we can simplify things and make them pretty once again.
Oh! Did I say monad? We aren't going to write yet another monad tutorial here, I hope! There are far too many of them in the world already. Instead I'm just going to port the code from part 3 to use the State monad to show that it's not that scary, and actually a little bit helpful.
First, we must import Control.Monad.State. If you go look at the package description with the link above, you'll see that it really uses Control.Monad.State.Lazy by default (there's a strict variant, but we won't go into it here).
Next, we need to figure out where to use this crazy thing. In the last part, we talked about propagating the scope stack -- the state -- through various operations. This was always apparent because their signatures ended with either Itemscope -> Itemscope
or State -> State
. Seeing something like this is a tell-tale sign that the State
monad might be useful.
The two main operations that State
provides are get
and put
. In many ways you can think of them as having a little mutable memory location where you can squirrel away things you need to update -- rather like box
in Racket. However, unlike box
, State
follows the control flow of your program rather than just floating around in memory somewhere. (Haskell provide actual mutable Ref cells that do "float around in memory", but let's not go into that here.) State is cooler than some of the alternatives because it really is functional even though it feels imperative. Awesomeness abounds.
To use State
with our microdata extractor program, we simply need to define a state monad that our program's operations will be "in". By "in," I mean that the final type in these operation's signature will be this state monad, or in other words, they'll return a monad value as their result. What's a monad value? It's just a function. In our case, it's just a function that takes a scope stack and returns a scope stack (possibly a different one than the one you gave).
Monads are a truly clever trick, and one that's well worth getting your head around in general. At a fundamental level, they're all about sequencing operations together (and Haskell provides build-in syntax for making this sequencing easier -- the do
statement). But here, our interest is just a very specific kind of monad that deals with state.
So let's define a state monad for microdata extraction called ExtractState
:
type ExtractState a = State (Itemscope, [Itemscope]) a
The type
keyword defines a type synonym (like typedef
in C++), and this type is parametrized by the variable a
, which will be the result type of a state operation. ExtractState a
is equivalent to the State
type where the type of the actual state is fixed, and is the type of our state from the last section, (Itemscope, [Itemscope])
. That's it.
Now the magic of defining this state type is that when we're in the ExtractState
monad, calling get
will get us our state -- seemingly out of the ether. And calling put
will allow us to store a new scope stack into the background radiation. Really what's happening is that there's one more level of functional argument passing just below the monad surface that is shuttling our state around. Best not to think about it too deeply. Let's just use it.
Let's begin by defining a few helper functions to access the various parts of our state. We can define these in terms of get
and put
that State
defines:
getScope :: ExtractState Itemscope getScope = do (scope, _) <- get; return scope putScope :: Itemscope -> ExtractState () putScope scope = do (_, orphans) <- get; put (scope, orphans)
You can see in put
that the old scope is dropped on the floor, and the new scope is stored in its place. If we want to get fancier, we can rewrite putScope
in terms of modify
which passes the old state to a function which should return a new state to put:
putScope :: Itemscope -> ExtractState () putScope scope = modify $ \(_, orphans) -> (scope, orphans)
We can also write a state accessor to save an otherwise orphaned Itemscope
onto the list of orphans:
saveOrphan :: Itemscope -> ExtractState () saveOrphan r = modify $ \(scope, orphans) -> (scope, r:orphans)
With these, we're now ready to rewrite our code to use the ExtractState
monad. Let's begin with extractMicrodata
which used to look like this:
extractMicrodata :: Node -> State -> State extractMicrodata elt@(Element _ _ children) state = foldr extractItemprops state children extractMicrodata _ state = state
We can shift it into ExtractState
gear by rewriting it like this (deferring the body of the first function cases for a moment):
extractMicrodata :: Node -> ExtractState () extractMicrodata elt@(Element _ _ children) = undefined extractMicrodata _ = return ()
From this, you can see that the signature changed from Node -> State -> State
to Node -> ExtractState ()
. ExtractState
hides the details of the state transitions into the monad operations that are used to implement this type -- functions that take a state and return a state. At the same time, the scopes
parameters went away because we can access it with our monad accessor functions when it's needed. The ()
(called "unit") type parameter is like void
in C++. It simply means that there's no interesting value returned from this function (although the real "value" here is the monad itself). We've used the standard monad return
function to return a unit value in the second function case. That's because we don't modify the state, and don't really have a return value otherwise.
Now filling in the details:
extractMicrodata :: Node -> ExtractState () extractMicrodata elt@(Element _ _ children) = foldr extractItemprops children extractMicrodata _ = return ()
we'll get into a bit of trouble:
/Users/warrenharris/projects/racket-hs/microdata5.hs:64:3: Couldn't match expected type `StateT (Itemscope, [Itemscope]) Data.Functor.Identity.Identity ()' with actual type `[a0] -> b0' Expected type: ExtractState () Actual type: [a0] -> b0 In the return type of a call of `foldr' In the expression: foldr extractItemprops children Failed, modules loaded: none.
We can see that the expected type was ExtractState ()
, whereas the actual type was a function from lists to something: [a0] -> b0
. We didn't even get the arity right here since foldr
is expecting another argument (the initial accumulator which was our scope stack). Using map
instead isn't much better:
/Users/warrenharris/projects/racket-hs/microdata5.hs:64:3: Couldn't match expected type `StateT (Itemscope, [Itemscope]) Data.Functor.Identity.Identity ()' with actual type `[b0]' Expected type: ExtractState () Actual type: [b0] In the return type of a call of `map' In the expression: map extractItemprops children Failed, modules loaded: none.
Now we're returning a list instead of a monad. What we need a monadic version of map
instead. Fortunately, there is just such a thing called mapM
:
extractMicrodata :: Node -> ExtractState () extractMicrodata elt@(Element _ _ children) = do _ <- mapM extractItemprops children return () extractMicrodata _ = return ()
Since mapM
returns a list of results, we're discarding it with the wildcard bind, and returning unit instead. We could also write this using Haskell's lower-level monad operation >>
which throws away the result of the left-hand side:
extractMicrodata :: Node -> ExtractState () extractMicrodata elt@(Element _ _ children) = mapM extractItemprops children >> return () extractMicrodata _ = return ()
But this is still unsatisfying because we're constructing a list that we're just going to throw away. What we need is a version of mapM
that doesn't build lists at all, but simply iterates over the input list, sequencing the monadic operations accordingly and throwing away whatever they return. Fortunately there is just such an operation, called mapM_
(the underscore designates "returns nothing"). And since what we're doing looks more like iteration than anything, there's a variant of mapM_
called forM_
that reverses its argument order so that it looks more like a foreach
construct:
*Main> :t mapM mapM :: Monad m => (a -> m b) -> [a] -> m [b] *Main> :t mapM_ mapM_ :: Monad m => (a -> m b) -> [a] -> m () *Main> :t forM_ forM_ :: Monad m => [a] -> (a -> m b) -> m ()
Let's use forM_
to simplify once again:
extractMicrodata :: Node -> ExtractState () extractMicrodata elt@(Element _ _ children) = forM_ children $ (\child -> extractItemprops child) extractMicrodata _ = return ()
Hey, that looks a lot like imperative code! For each child
in children
, run the operation extractItemprops
passing it the child. (The backslash here is a lambda, and in fact, since the precedence of lambda expressions is higher than $
we don't need the parens around the lambda expression either.) But since child
is the last parameter passed to extractItemprops
, we can simplify this just a little further (because \x -> f x
is just f
):
extractMicrodata :: Node -> ExtractState () extractMicrodata elt@(Element _ _ children) = forM_ children extractItemprops extractMicrodata _ = return ()
Let's move on to extractNewItemscope
. Here was our previous definition:
extractNewItemscope :: Node -> State -> (Itemscope, State) extractNewItemscope elt@(Element _ attrs _) (scope, orphans) | isJust $ lookup "itemscope" attrs = let newScope = Itemscope (lookup "itemtype" attrs) [] (scopeVal, orphans') = extractMicrodata elt (newScope, orphans) in (scopeVal, (scope, orphans')) extractNewItemscope node state = let state'@(scopeVal, _) = extractMicrodata node state in (scopeVal, state') -- return current scope as scopeVal
This time we don't have a simple State -> State
at the end of the function signature. Instead the final return type is returning a value and a new state, State -> (Itemscope, State)
. Fortunately, we can describe this in the State
monad as ExtractState Itemscope
:
extractNewItemscope :: Node -> ExtractState Itemscope extractNewItemscope elt@(Element _ attrs _) | isJust $ lookup "itemscope" attrs = undefined extractNewItemscope node = undefined
Filling in the details now:
extractNewItemscope :: Node -> ExtractState Itemscope extractNewItemscope elt@(Element _ attrs _) | isJust $ lookup "itemscope" attrs = do prevScope <- getScope let newScope = Itemscope (lookup "itemtype" attrs) [] putScope newScope extractMicrodata elt scopeVal <- getScope putScope prevScope return scopeVal extractNewItemscope node = do extractMicrodata node scopeVal <- getScope return scopeVal -- return current scope as scopeVal
From this you can see that we use the getScope
operation to get the current scope. The back-arrow operator (<-
) lets us bind a pattern to the result of a monad operation, and here we're binding the variable prevScope
. We create a newScope
using a let
statement (note that a let
in a do
sequence isn't followed by the in
keyword). Next we use putScope
to store the new scope, then call extractMicrodata
, then use getScope
again to get the modified verion of newScope (now called scopeVal
). Then we use putScope
again to put the previous scope back, and finally return scopeVal
.
Now don't confuse let foo = ...
and do foo <- ...
. They both more or less do something similar -- assign a variable -- but the latter works with monad operations. If we had used a let
with a monad operation instead, we'd be getting our hands on the monad itself, not the return value that's in the monad. Using let
with monads can be quite useful when manipulating control flow, or plugging pieces of programs together -- just not what we're interested in doing here. Monads are really powerful because they decouple the work of screwing the plumbing together from running the crap through the pipes (if you catch my drift).
Of course the code above is kind of ugly. Most of it is just manipulating previous and current scope. We can wrap this up in a nice little helper function that lets us lexically set the current scope:
withScope :: Itemscope -> ExtractState a -> ExtractState a withScope scope m = do prevScope <- getScope putScope scope rv <- m putScope prevScope return rv
Here, withScope
takes a scope and another monad operation, m
. It sets up the new scope, invokes m
, reverts the scope, and returns m
's value. (If you were worried about m
raising exceptions, you'd want to put an exception handler in here to revert to the previous scope value, but we don't raise any exceptions in our code.)
We can now simplify extractNewItemscope
:
extractNewItemscope :: Node -> ExtractState Itemscope extractNewItemscope elt@(Element _ attrs _) | isJust $ lookup "itemscope" attrs = withScope (Itemscope (lookup "itemtype" attrs) []) $ do extractMicrodata elt scopeVal <- getScope return scopeVal extractNewItemscope node = do extractMicrodata node scopeVal <- getScope -- return current scope as scopeVal return scopeVal
or using the >>
operation to further simplify:
extractNewItemscope :: Node -> ExtractState Itemscope extractNewItemscope elt@(Element _ attrs _) | isJust $ lookup "itemscope" attrs = withScope (Itemscope (lookup "itemtype" attrs) []) $ extractMicrodata elt >> getScope extractNewItemscope node = extractMicrodata node >> getScope
Moving on to the extractItemprops
function:
extractItemprops :: Node -> ExtractState () extractItemprops elt@(Element _ attrs _) = do scopeVal <- extractNewItemscope elt case lookup "itemprop" attrs of Just name -> do scope <- getScope putScope $ addProp name (extractHtml5Value elt scopeVal) scope Nothing -> saveOrphan scopeVal extractItemprops _ = return ()
There shouldn't be any surprises at this point. We're just using getScope
to get the current scope, modifying it with addProp
, and using putScope
to store the new value back. This could be simpler too if we had a version of modify
that only operated on the scope part of the state:
modifyScope :: (Itemscope -> Itemscope) -> ExtractState () modifyScope f = modify $ \(scope, orphans) -> (f scope, orphans) extractItemprops :: Node -> ExtractState () extractItemprops elt@(Element _ attrs _) = do scopeVal <- extractNewItemscope elt case lookup "itemprop" attrs of Just name -> modifyScope $ addProp name (extractHtml5Value elt scopeVal) Nothing -> saveOrphan scopeVal extractItemprops _ = return ()
Monads running free
So I've shown you how to rewrite some simple state propagating code to look cleaner and neater using the State
monad, but how to you set it in motion -- supplying the initial state so that the monad can actually run? Well, every monad has some kind of run function that does this, and for State
it's called (appropriately), runState
. We can use this to rewrite extractMicrodataFromDoc
:
extractMicrodataFromDoc :: Document -> [Itemscope] extractMicrodataFromDoc doc = let f = mapM_ extractItemprops $ docContent doc initialState = (Itemscope (Just "top") [], []) ((), (scope, orphans)) = runState f initialState in scope:orphans
What runState
does is it takes an initial state (the empty scope stack in this case, []
) and returns a pair of the final answer (unit here, ()
) and the final state (scopes
). Since we want extractMicrodataFromDoc
to return the scope stack as its answer we just discard the unit result and return scopes
instead.
Note that extractMicrodataFromDoc
is not itself a monadic operation (since it's return type isn't ExtractState
). That means we can call it just like before in our test case:
test1 = "\ \ Foobar\ \" result1 = fmap extractMicrodataFromDoc $ parseHTML "" test1\ \\ \quux
\ \
We're done
{-# LANGUAGE OverloadedStrings #-} import Control.Monad (forM) import Control.Monad.State import Data.Foldable (foldMap) import Data.List (find) import Data.Maybe (isJust) import Data.Text (Text, concat) import Prelude hiding (concat) import Text.XmlHtml data Itemprop = Itemprop { itempropName :: Text, itempropValue :: MicrodataValue } deriving (Show, Eq) data Itemscope = Itemscope { itemscopeType :: Maybe Text, itemscopeProperties :: [Itemprop] } deriving (Show, Eq) addProp :: Text -> MicrodataValue -> Itemscope -> Itemscope addProp name value (Itemscope t props) = Itemscope t (Itemprop name value : props) hasItemscope :: Node -> Bool hasItemscope elt@(Element _ _ _) = isJust $ lookup "itemscope" $ elementAttrs elt hasItemscope _ = False data MicrodataValue = ItemscopeValue Itemscope | TextValue Text deriving (Show, Eq) extractHtml5Value :: Node -> Itemscope -> MicrodataValue extractHtml5Value elt scope | hasItemscope elt = ItemscopeValue scope extractHtml5Value elt scope = TextValue $ foldMap getText $ elementChildren elt where getText elt@(TextNode text) = text getText _ = "" type ExtractState a = State (Itemscope, [Itemscope]) a getScope :: ExtractState Itemscope getScope = do (scope, _) <- get; return scope putScope :: Itemscope -> ExtractState () putScope scope = modify $ \(_, orphans) -> (scope, orphans) saveOrphan :: Itemscope -> ExtractState () saveOrphan r = modify $ \(scope, orphans) -> (scope, r:orphans) withScope :: Itemscope -> ExtractState a -> ExtractState a withScope scope m = do prevScope <- getScope putScope scope rv <- m putScope prevScope return rv modifyScope :: (Itemscope -> Itemscope) -> ExtractState () modifyScope f = modify $ \(scope, orphans) -> (f scope, orphans) extractMicrodata :: Node -> ExtractState () extractMicrodata elt@(Element _ _ children) = forM_ children extractItemprops extractMicrodata _ = return () extractNewItemscope :: Node -> ExtractState Itemscope extractNewItemscope elt@(Element _ attrs _) | isJust $ lookup "itemscope" attrs = withScope (Itemscope (lookup "itemtype" attrs) []) $ extractMicrodata elt >> getScope extractNewItemscope node = extractMicrodata node >> getScope extractItemprops :: Node -> ExtractState () extractItemprops elt@(Element _ attrs _) = do scopeVal <- extractNewItemscope elt case lookup "itemprop" attrs of Just name -> modifyScope $ addProp name (extractHtml5Value elt scopeVal) Nothing -> saveOrphan scopeVal extractItemprops _ = return () extractMicrodataFromDoc :: Document -> [Itemscope] extractMicrodataFromDoc doc = let f = mapM_ extractItemprops $ docContent doc initialState = (Itemscope (Just "top") [], []) ((), (scope, orphans)) = runState f initialState in scope:orphans -- Debugging stuff: html str = case parseHTML "" str of Left msg -> error msg Right rslt -> head $ docContent rslt topscope = Itemscope (Just "top") [] test1 = "I hope you got something out of this long-winded tour. Programming in Haskell can be pretty nice once you get over a few hurdles.\ \" result1 = fmap extractMicrodataFromDoc $ parseHTML "" test1\ \\ \ Foobar\ \quux
\ \
No comments:
Post a Comment