Now we tackle the tricky part. extract-microdata
uses the dreaded hobgoblin of functional programming: mutable state! Annoyingly convenient little boxes with updatable values, and mutable struct fields. One problem with all this mutable stuff is that it allows the control flow to follow a different path from the state changes. To turn this into functional Haskell code, we'll need to do a bit of untangling so that state can travel along with the control flow, and be replace with new state whenever a change is needed.
Here's the Racket function again, for reference:
(define (extract-microdata root [scope (void)] [result (box '())]) (let ([new-scope (void)]) (when (h:html-full? root) (for ([element (h:html-full-content root)]) (match element [(h:html-element (list (x:attribute start stop key value) ...)) (let ([attrs (for/list ([k key] [v value]) (cons k v))]) (when (assoc 'itemscope attrs) (set! new-scope (itemscope (cdr (or (assoc 'itemtype attrs) (cons "" #f))) '())) (when (void? scope) (set-box! result (cons new-scope (unbox result))))) (for ([prop attrs] #:when (eq? (car prop) 'itemprop)) (set-itemscope-properties! scope (cons (itemprop (cdr prop) (extract-html5-value element new-scope)) (itemscope-properties scope))))) (extract-microdata element new-scope result)] [else #f])))) (unbox result))
This function iterates over all the content nodes of the document, and when they are elements it potentially creates a new itemscope
structs and pushes them onto/into the result
box. It also iterates over all the attributes of the elements looking for itemprop
attributes. When it finds them it extracts their HTML5 value and pushes it into the properties
field of the current itemscope object. Finally it recurses over each of the element's children. Perhaps we can break this down into several more manageable pieces.
First, let's focus on recognizing itemscope attributes and creating new Itemscope
values. Let's call this piece extractNewItemscope
. Since this function can potentially push a new Itemscope
on a stack, it needs to take a list of Itemscope
as input, and produce another one as output (linked lists in Haskell make perfect stacks):
extractNewItemscope :: Node -> [Itemscope] -> [Itemscope] extractNewItemscope elt scopes = undefined
Again, we need to make function cases based on whether the Node
is an Element
or something else, and we need to test for when the attributes of the element contain an "itemscope"
key:
extractNewItemscope :: Node -> [Itemscope] -> [Itemscope] extractNewItemscope elt@(Element _ attrs _) scopes = if isJust $ lookup "itemscope" attrs then undefined else scopes extractNewItemscope _ scopes = scopes
Since the first thing we're doing in the first function case is another test, we can use a guard clause to further refine it, letting the second case pick up the work the else clause was doing:
extractNewItemscope :: Node -> [Itemscope] -> [Itemscope] extractNewItemscope elt@(Element _ attrs _) scopes | isJust $ lookup "itemscope" attrs = undefined extractNewItemscope _ scopes = scopes
Now the body of this function only needs to build a new Itemscope
value and push it onto the scopes stack. The type of this Itemscope
is the value of the "itemtype"
attribute (if any), and the properties start out empty. The colon (:
) is the "cons" operator that pushes a new value onto the left-hand side of an existing list (scopes
here):
extractNewItemscope :: Node -> [Itemscope] -> [Itemscope] extractNewItemscope elt@(Element _ attrs _) scopes | isJust $ lookup "itemscope" attrs = Itemscope (lookup "itemtype" attrs) [] : scopes extractNewItemscope _ scopes = scopes
Next, let's define a function to update the properties of an Itemscope
when we see an "itemprop"
attribute. To make this clearer, we'll first define a helper function to add a new property to an existing Itemscope:
(since it's functional, it returns a new Itemscope
):
addProp :: Text -> MicrodataValue -> Itemscope -> Itemscope addProp name value (Itemscope t props) = Itemscope t (Itemprop name value : props)
Next we can use this to update the the top-most Itemscope
on the stack when we see an "itemprop"
attribute:
extractItemprop :: Node -> (Text, Text) -> [Itemscope] -> [Itemscope] extractItemprop elt ("itemprop", name) (scope:scopes) = addProp name (extractHtml5Value elt scope) scope : scopes extractItemprop _ _ scopes = scopes
The first line of the let
creates a new Itemprop
value with the name and HTML5 value. The second line pushes this new Itemprop
onto the list of existing properties. The third line then creates a new Itemscope
value that will replace the one we're starting with. Notice how we pattern matched the entire scope stack with (scope:scopes)
so that we can throw the old scope away and return the tail of the list with newScope
pushed on the front.
Oops -- when we compile this we see that another tweak is needed. We previously had defined the value field of the Itemprop
data type as Text
, but in part 2 we discovered that values needed to be of type MicrodataValue
. Let's fix that:
data Itemprop = Itemprop { itempropName :: Text, itempropValue :: MicrodataValue } deriving (Show, Eq)
One nice thing about Haskell modules (the stuff within a single file) is that all the definitions are mutually recursive. That means that you can list functions in top-down dependency order, bottom-up, or however you want to organize things. The same goes for types, which is why we can change this field to MicrodataValue
even though it's defined below in the file. (I spent several years programming in O'Caml, and it requires things to be defined in bottom-up dependency order. Believe me, this is better.)
Next, let's define a function that's similar to extractItemprop
, but that works with all attributes of an element, not just one. We might be able to use map
, but since we want each attribute to be able to update the top-most Itemscope
on the stack we need to thread the stack (scopes
) through all these map operations. That's exactly what a "fold" does -- it passes an accumulator parameter along and expects you to accumulate your result into the accumulator instead of just returning it like map
does. map
is a fold where the accumulation is done for you -- the function results are accumulated into a list.
We'll use foldr
to do this. foldr
folds from right-to-left. This might seem backwards, but its usually the right thing in Haskell. The reason it's often the right thing is because lists are built by pushing newer elements on the left, so if you want to preserve the order of your results with respect to the original list, you need to start from the right. Also, if you're relying on Haskell's laziness to avoid computing values from the tail of the list that never used, then foldr
is exactly what you want. (On the other hand, if you don't care about preserving the left-to-right order, then foldl'
is way more efficient. See Foldr Foldl Foldl' if you want to understand this better.)
extractItemprops :: Node -> [Itemscope] -> [Itemscope] extractItemprops elt@(Element _ attrs _) scopes = foldr (extractItemprop elt) scopes attrs extractItemprops _ scopes = scopes
That's it for extractItemprops
. The scopes
stack is passed to each extractItemprop
call, and a new scope stack is returned that will be passed to the next call, or returned as the final result. Note that rather than just calling extractItemprop
directly we doing a bit of currying magic here, calling (extractItemprop elt)
instead. This gives each call the same elt
parameter, and the partial application returns a new function that fits the requirement for foldr
.
Now we're at a place where we can put these pieces together to write a functional version of extract-microdata
. Since it's functional we'll want to pass our state along at every step, and return new state as a result. And since it will want to operate on every child of an element, we'll want to use our frind foldr
again. The only tricky part if figuring out what each step performed by foldr
actually needs to do:
extractMicrodata :: Node -> [Itemscope] -> [Itemscope] extractMicrodata elt@(Element _ _ children) scopes = foldr go scopes children where go child scopes = undefined extractMicrodata _ scopes = scopes
Hopefully at this point, there are no surprises here. The extractMicrodata
function looks a lot like extractItemprops
that we've seen already. But why name the local function "go
"? That's just a common Haskell idiom (in Zetalisp it was always "frob", so this is just another step forward).
The body of go
just needs to thread our state (scopes
) through each of the steps:
extractMicrodata :: Node -> [Itemscope] -> [Itemscope] extractMicrodata elt@(Element _ _ children) scopes = foldr go scopes children where go child scopes = let scopes' = extractNewItemscope child scopes scopes'' = extractItemprops child scopes' in extractMicrodata child scopes'' extractMicrodata _ scopes = scopes
First extractNewItemscope
is called which potentially pushes a new Itemscope on the stack. We'll call its result scopes'
(using a "prime" is just another idiom -- the single quote is a perfectly valid identifier character in Haskell). Then extractItemprops
is called with scopes'
and returns scopes''
. Finally we recursively call extractMicrodata
for the child so we can process its children, passing scopes''
. This ultimately returns a new scope stack as our result.
Of course we're never happy with this much verbosity in Haskell, so we can write this a little more concisely once again:
extractMicrodata :: Node -> [Itemscope] -> [Itemscope] extractMicrodata elt@(Element _ _ children) scopes = foldr go scopes children where go child scopes = extractMicrodata child $ extractItemprops child $ extractNewItemscope child scopes extractMicrodata _ scopes = scopes
How's that? Well, let me counter your question with how's this?:
extractMicrodata :: Node -> [Itemscope] -> [Itemscope] extractMicrodata elt@(Element _ _ children) scopes = foldr go scopes children where go child scopes = (extractMicrodata child . extractItemprops child . extractNewItemscope child) scopes extractMicrodata _ scopes = scopes
We're using the function composition operator, dot (.
), to compose the 3 functions that all get passed scopes
. Note that the composition operator in Haskell works by calling the right-most function first, and passing the result to the left most -- the way mathematicians do it, but perhaps backwards from what we expect from most programming languages. It is defined as:
(.) :: (b -> c) -> (a -> b) -> a -> c (f . g) x = f (g x)
One more tweak:
extractMicrodata :: Node -> [Itemscope] -> [Itemscope] extractMicrodata elt@(Element _ _ children) scopes = foldr go scopes children where go child = extractMicrodata child . extractItemprops child . extractNewItemscope child extractMicrodata _ scopes = scopes
Now we're dropping the application of the composed functions to scopes
, and dropping scopes
from the parameter list. Instead, we're just returning the composed function itself since it has the correct type. We're moving into what they call the "point-free style" here (sometimes referred to as "pointless style"). Maybe this is getting too carried away.
Wrap it up...
The only thing we need to do now is put a bow on it, and test it out.
extractMicrodataFromDoc :: Document -> [Itemscope] extractMicrodataFromDoc doc = foldr extractMicrodata [] $ docContent doc test1 = "\ \" result1 = case parseHTML "" test1 of Left msg -> Left msg Right tree -> Right $ extractMicrodataFromDoc tree\ \\ \ Foobar\ \quux
\ \
extractMicrodataFromDoc
simply folds extractMicrodata
over all the document's content. The variables test1
and result1
are things I've thrown in for testing purposes. You shouldn't leave them in production code, but if you evaluate result1
in the *haskell* buffer, you'll see our handiwork:
*Main> result1 Right [Itemscope {itemscopeType = Nothing, itemscopeProperties = [Itemprop {itempropName = "name", itempropValue = TextValue "Foobar"},Itemprop {itempropName = "name", itempropValue = TextValue "quux"},Itemprop {itempropName = "item", itempropValue = ItemscopeValue (Itemscope {itemscopeType = Nothing, itemscopeProperties = []})}]},Itemscope {itemscopeType = Just "type", itemscopeProperties = []}]
Of course, there are always improvements to be made. Note how in the definition of result1
we're reaching into the Either
data structure and in the Right
case calling another function on what's there. Effectively, we're doing an operation that might be generalized like this:
mapEither :: (a -> b) -> Either c a -> Either c b mapEither f (Either l r) = Either l (f r)
Well, there's a very common generalization of this pattern called fmap
. Instead of operating on an Either
data type, it operates on anything that supports the Functor
type class.
fmap :: Functor f => (a -> b) -> f a -> f b
The Functor
type class is basically one that allows functions applied to data structures to be applied to that data structure's elements. We can use that to simplify the result1
definition:
extractMicrodataFromDoc :: Document -> [Itemscope] extractMicrodataFromDoc doc = foldr extractMicrodata [] $ docContent doc test1 = "\ \" result1 = fmap extractMicrodataFromDoc $ parseHTML "" test1\ \\ \ Foobar\ \quux
\ \
fmap
takes a bit to get your head around, but it's really powerful and useful. If you look at the definition of mapEither
that I gave you can get some idea of how it works. In fact, given that definition, we could have defined the Functor
instance for Either
like this (if we didn't have a definition already):
instance Functor (Either a) where fmap = mapEither
Why (Either a)
in the above? Great Good explains it better than I can.
Review
Where are we now?
{-# LANGUAGE OverloadedStrings #-} import Data.Foldable (foldMap) import Data.List (find) import Data.Maybe (isJust) import Data.Text (Text, concat) import Prelude hiding (concat) import Text.XmlHtml data Itemprop = Itemprop { itempropName :: Text, itempropValue :: MicrodataValue } deriving (Show, Eq) data Itemscope = Itemscope { itemscopeType :: Maybe Text, itemscopeProperties :: [Itemprop] } deriving (Show, Eq) addProp :: Text -> MicrodataValue -> Itemscope -> Itemscope addProp name value (Itemscope t props) = Itemscope t (Itemprop name value : props) hasItemscope :: Node -> Bool hasItemscope elt@(Element _ _ _) = isJust $ lookup "itemscope" $ elementAttrs elt hasItemscope _ = False data MicrodataValue = ItemscopeValue Itemscope | TextValue Text deriving (Show, Eq) extractHtml5Value :: Node -> Itemscope -> MicrodataValue extractHtml5Value elt scope | hasItemscope elt = ItemscopeValue scope extractHtml5Value elt scope = TextValue $ foldMap getText $ elementChildren elt where getText elt@(TextNode text) = text getText _ = "" extractNewItemscope :: Node -> [Itemscope] -> [Itemscope] extractNewItemscope elt@(Element _ attrs _) scopes | isJust $ lookup "itemscope" attrs = Itemscope (lookup "itemtype" attrs) [] : scopes extractNewItemscope _ scopes = scopes extractItemprop :: Node -> (Text, Text) -> [Itemscope] -> [Itemscope] extractItemprop elt ("itemprop", name) (scope:scopes) = addProp name (extractHtml5Value elt scope) scope : scopes extractItemprop _ _ scopes = scopes extractItemprops :: Node -> [Itemscope] -> [Itemscope] extractItemprops elt@(Element _ attrs _) scopes = foldr (extractItemprop elt) scopes attrs extractItemprops _ scopes = scopes extractMicrodata :: Node -> [Itemscope] -> [Itemscope] extractMicrodata elt@(Element _ _ children) scopes = foldr go scopes children where go child = extractMicrodata child . extractItemprops child . extractNewItemscope child extractMicrodata _ scopes = scopes extractMicrodataFromDoc :: Document -> [Itemscope] extractMicrodataFromDoc doc = foldr extractMicrodata [] $ docContent doc test1 = "\ \" result1 = fmap extractMicrodataFromDoc $ parseHTML "" test1\ \\ \ Foobar\ \quux
\ \
So we're done? Well, if you're an astute reader, you've already realized that this code doesn't actually work as intended. Stay tuned for part 4 where we'll do a little debugging.
No comments:
Post a Comment