Houston, we have a problem. If we evaluate result
from part 3 and look at it closely, we'll see it:
type ExtractState a = State [Itemscope] a *Main> result1 Right [Itemscope {itemscopeType = Nothing, itemscopeProperties = [Itemprop {itempropName = "name", itempropValue = TextValue "quux"},Itemprop {itempropName = "item", itempropValue = ItemscopeValue (Itemscope {itemscopeType = Nothing, itemscopeProperties = []})}]},Itemscope {itemscopeType = Just "type", itemscopeProperties = [Itemprop {itempropName = "name", itempropValue = TextValue "Foobar"}]}]
Uh, let me format that for you:
Right [ Itemscope {itemscopeType = Nothing, itemscopeProperties = [ Itemprop {itempropName = "name", itempropValue = TextValue "quux"}, Itemprop {itempropName = "item", itempropValue = ItemscopeValue ( Itemscope {itemscopeType = Nothing, itemscopeProperties = []})}]}, Itemscope {itemscopeType = Just "type", itemscopeProperties = [ Itemprop {itempropName = "name", itempropValue = TextValue "Foobar"}]}]
and pull up our original HTML for reference:
test1 = "\ \"\ \\ \ Foobar\ \quux
\ \
Here we can see that the Itemprop
s aren't in the right places at all. The "name"="quux"
one should be inside the empty Itemscope
that's the value of "item"
, and that Itemprop
should be a sibling of the "name"="Foobar"
one that's below. What's going wrong?
Well, after thinking a bit more about what the algorithm here needs to do, it's clear that we need to traverse one tree (the HTML), and build a new one (the Itemscope
s). But although we used foldr
to push items onto a stack as we walked down the HTML tree, we didn't pop any items from that stack as we unwound the HTML tree. So we were always left with last pushed item on the top of the stack when we updated the Itemprop
values. This is completely broken.
Just to be a little more clear, there are 3 uses of foldr
in our code: in extractItemprops
, extractMicrodata
and in extractMicrodataFromDoc
. The one in extractItemprops
is fine because it is simply processing a list of attributes that all belong to the same HTML element. Similarly, the one in extractMicrodataFromDoc
is fine because it is processing the content of the document which is a list of HTML nodes all at the top-level of the document. The problem is in extractMicrodata
which not only processes the immediate children of an element, but calls itself to process each of their children.
It isn't obvious how we can modify extractMicrodata
to do this pushing and popping since it isn't until we get deep into its implementation do we know that we're looking at an element with an "itemscope"
attribute. A better approach here would be to recurse down the xmlhtml tree, calling our extraction operations in a mutually recursive way, and returning a correctly constructed Itemscope
from each call.
By mutually recursively, we mean calling extractMicrodata
, extractItemprops
, extractNewItemscope
from each other as necessary -- so the 3 of these form a recursive cycle. But which one should be called first? To determine that, we need to consider the html structure. Some nodes like the div
node in the example above have both itemprop
and itemscope
attributes. When this occurs we want the itemprop
("item"
in this case) to have a value that contains the itemscope
and all its sub-properties ("name"="quux"
here). This means that extractItemprops
should be called first (from extractMicrodataFromDoc
), and it should in turn call extractNewItemscope
to build a new Itemscope
value when the itemscope
attribute is present. Finally extractNewItemscope
should call extractMicrodata
to recursively process the html children, and this can recurse back to extractItemprops
once again.
Unlike before when each of these functions would take a stack (list) of Itemscopes
, now it's only necessary for them to take the current Itemscope
object, and return a (possibly new) Itemscope
on the way back out. This is because our call stack will replace the explicit stack (that we never popped) from before.
Let's start with extractMicrodata
, since it's at the bottom of this call stack. Rather than calling all 3 extraction functions on each child, it now only calls extractItemprops
:
extractMicrodata :: Node -> Itemscope -> Itemscope extractMicrodata elt@(Element _ _ children) scope = foldr extractItemprops scope children extractMicrodata _ scope = scope
Pretty straightforward. Next, extractMicrodata
is called from extractNewItemscope
. In the case where we have an element with an itemscope
attribute, we'll construct a new Itemscope
and pass this to extractMicrodata
. Otherwise, we'll continue working with the one we were given:
extractNewItemscope :: Node -> Itemscope -> Itemscope extractNewItemscope elt@(Element _ attrs _) scope | isJust $ lookup "itemscope" attrs = let newScope = Itemscope (lookup "itemtype" attrs) [] in extractMicrodata elt newScope extractNewItemscope node scope = extractMicrodata node scope
Next we can rework extractItemprops
(and it's associated routine, extractItemprop
) to call extractNewItemscope
:
extractItemprop :: Node -> (Text, Text) -> Itemscope -> Itemscope extractItemprop elt ("itemprop", name) scope = addProp name (extractHtml5Value elt scope) scope extractItemprop _ _ scope = scope extractItemprops :: Node -> Itemscope -> Itemscope extractItemprops elt@(Element _ attrs _) scope = let scope' = extractNewItemscope elt scope in foldr (extractItemprop elt) scope' attrs extractItemprops _ scope = scope
Now let's test them out on some simple html. To make doing this easier in ghci, let's first define a helper function to parse the html and get rid of the Either
node (by throwing an exception in the Left
(wrong) case), and a dummy Itemscope
to use at the top-level:
html str = case parseHTML "" str of Left msg -> error msg Right rslt -> head $ docContent rslt topscope = Itemscope (Just "top") []
Let's try our new functions out. Be sure to first enable OverloadedStrings in ghci so that we can have the literal html string converted to the correct type for our helper function:
*Main> :set -XOverloadedStrings *Main> extractNewItemscope (html "") topscope Itemscope {itemscopeType = Nothing, itemscopeProperties = [Itemprop {itempropName = "name", itempropValue = TextValue "quux"}]} *Main> extractNewItemscope (html "quux
") topscope Itemscope {itemscopeType = Just "top", itemscopeProperties = [Itemprop {itempropName = "name", itempropValue = TextValue "quux"}]}quux
This seem to be working correctly. When we call extractNewItemscope
with a node with itemscope
, we built a new Itemscope
object and store the "name"="quux"
property there. And when the node doesn't have itemscope
, we store "name"="quux"
in the current Itemscope
that's passed in (topscope
in this case).
But wait a minute... In the first case above, we ended up dropping the passed-in scope on the floor. You can see this because the scope
parameter is never used in extractNewItemscope
's first function case. This isn't right, because that scope might already contain someone's "good stuff" that they're expecting to see in the final result. What we really need to do in extractNewItemscope
is return both the previous scope (our state) and the newly constructed Itemscope
object (our result value) if we do indeed construct one. Likewise, in the case where we don't construct one, then the updated scope will be both our state, and our Itemscope result value:
extractNewItemscope :: Node -> Itemscope -> (Itemscope, Itemscope) extractNewItemscope elt@(Element _ attrs _) scope | isJust $ lookup "itemscope" attrs = let newScope = Itemscope (lookup "itemtype" attrs) [] in (extractMicrodata elt newScope, scope) extractNewItemscope node scope = let scope' = extractMicrodata node scope in (scope', scope')
Since we've changed the type of extractNewItemscope
, this requires a similar change in extractItemprops
. Now we need to receive both the result value and the updated state from extractNewItemscope
and use the new Itemscope
value whenever the itemprop
keyword is seen:
extractItemprop :: Node -> Itemscope -> (Text, Text) -> Itemscope -> Itemscope extractItemprop elt is ("itemprop", name) scope = addProp name (extractHtml5Value elt is) scope extractItemprop _ _ _ scope = scope extractItemprops :: Node -> Itemscope -> Itemscope extractItemprops elt@(Element _ attrs _) scope = let (is, scope') = extractNewItemscope elt scope in foldr (extractItemprop elt is) scope' attrs extractItemprops _ scope = scope
Let's see if this works:
*Main> extractItemprops (html "") topscope Itemscope {itemscopeType = Just "top", itemscopeProperties = [Itemprop {itempropName = "name", itempropValue = TextValue "quux"}]} *Main> extractItemprops (html "quux
") topscope Itemscope {itemscopeType = Just "top", itemscopeProperties = []}quux
Hmmm... another problem. In the first case where there is no itemscope
attribute things work properly. The topscope
is updated to contain the "name"="quux"
property as we expect. But in the second case when itemscope
is present, there's another problem. We're only getting an empty topscope
here. The reason for this is because there is no itemprop
that connects the topscope
to the inner Itemscope
object. Let's add one and verify that this is indeed the problem:
*Main> extractItemprops (html "") topscope Itemscope {itemscopeType = Just "top", itemscopeProperties = [Itemprop {itempropName = "connect", itempropValue = ItemscopeValue (Itemscope {itemscopeType = Nothing, itemscopeProperties = [Itemprop {itempropName = "name", itempropValue = TextValue "quux"}]})}]}quux
This is what we're expecting to see -- the topscope
now contains the "connect"
property that connects it to the previously orphaned Itemscope
containing "name"="quux"
. So this means that although extractItemprops
appears to be working ok for the case where everything is connected, we have an oversight for the case when we have isolated Itemscopes
that aren't connected. We need to return these as results too somehow.
The easiest way to do this might be to find the point in the code were we're about to drop an orphaned Itemscope
, and store it in a list somewhere so that we can get it when the algorithm completes. This new list becomes another piece of state that we need to pass around.
In order to keep things tidy, let's introduce a State
type and use this in all of our stateful functions. That will be a little clearer than making the result be a tuple. Here's our state which includes the current Itemscope
that the algorithm is working on, which we had before, and the new list of orphaned Itemscopes:
type State = (Itemscope, [Itemscope])
Now we can rewrite our stateful code as follows:
extractMicrodata :: Node -> State -> State extractMicrodata elt@(Element _ _ children) state = foldr extractItemprops state children extractMicrodata _ state = state extractNewItemscope :: Node -> State -> (Itemscope, State) extractNewItemscope elt@(Element _ attrs _) (scope, orphans) | isJust $ lookup "itemscope" attrs = let newScope = Itemscope (lookup "itemtype" attrs) [] (scopeVal, orphans') = extractMicrodata elt (newScope, orphans) in (scopeVal, (scope, orphans')) extractNewItemscope node state = let state'@(scopeVal, _) = extractMicrodata node state in (scopeVal, state') -- return current scope as scopeVal extractItemprop :: Node -> Itemscope -> (Text, Text) -> State -> State extractItemprop elt is ("itemprop", name) (scope, orphans) = (addProp name (extractHtml5Value elt is) scope, orphans) extractItemprop _ _ _ state = state extractItemprops :: Node -> State -> State extractItemprops elt@(Element _ attrs _) state = let (is, state') = extractNewItemscope elt state in foldr (extractItemprop elt is) state' attrs extractItemprops _ scope = scope
The problem now is that although we're passing our new result state around, we are never storing anything into it. As the algorithm is implemented, it isn't clear where this should happen... so I'll give you a hint. It happens in the case where the foldr
in extractItemprops
doesn't encounter an itemprop
attribute. I guess foldr
isn't what we want anymore then.
Now we could do another pass over the attributes to determine whether itemprop
doesn't occur, or we could pass a boolean through the foldr
accumulator and update it if anything is found. But let's make it easy on ourselves and say that the itemprop
attribute can only occur once in an html node. That allows us to rewrite extractItemprops
to use find
instead of foldr
to look for the itemprop
attribute:
extractItemprops :: Node -> State -> State extractItemprops elt@(Element _ attrs _) state = let (is, (scope, orphans)) = extractNewItemscope elt state in case lookup "itemprop" attrs of Just name -> (addProp name (extractHtml5Value elt is) scope, orphans) Nothing -> (scope, is:orphans) extractItemprops _ state = state
Now in the case where we find something, we add a new property as before. But in the case were we don't, we throw the Itemscope
object on the result list instead.
Let's wrap this up by reworking the last function we need, extractMicrodataFromDoc
:
extractMicrodataFromDoc :: Document -> [Itemscope] extractMicrodataFromDoc doc = let initialState = (Itemscope (Just "top") [], []) (scope, orphans) = foldr extractItemprops initialState $ docContent doc in scope:orphans
And finally, our debugged, working code:
{-# LANGUAGE OverloadedStrings #-} import Data.Foldable (foldMap) import Data.List (find) import Data.Maybe (isJust) import Data.Text (Text, concat) import Prelude hiding (concat) import Text.XmlHtml data Itemprop = Itemprop { itempropName :: Text, itempropValue :: MicrodataValue } deriving (Show, Eq) data Itemscope = Itemscope { itemscopeType :: Maybe Text, itemscopeProperties :: [Itemprop] } deriving (Show, Eq) addProp :: Text -> MicrodataValue -> Itemscope -> Itemscope addProp name value (Itemscope t props) = Itemscope t (Itemprop name value : props) hasItemscope :: Node -> Bool hasItemscope elt@(Element _ _ _) = isJust $ lookup "itemscope" $ elementAttrs elt hasItemscope _ = False data MicrodataValue = ItemscopeValue Itemscope | TextValue Text deriving (Show, Eq) extractHtml5Value :: Node -> Itemscope -> MicrodataValue extractHtml5Value elt scope | hasItemscope elt = ItemscopeValue scope extractHtml5Value elt scope = TextValue $ foldMap getText $ elementChildren elt where getText elt@(TextNode text) = text getText _ = "" type State = (Itemscope, [Itemscope]) extractMicrodata :: Node -> State -> State extractMicrodata elt@(Element _ _ children) state = foldr extractItemprops state children extractMicrodata _ state = state extractNewItemscope :: Node -> State -> (Itemscope, State) extractNewItemscope elt@(Element _ attrs _) (scope, orphans) | isJust $ lookup "itemscope" attrs = let newScope = Itemscope (lookup "itemtype" attrs) [] (scopeVal, orphans') = extractMicrodata elt (newScope, orphans) in (scopeVal, (scope, orphans')) extractNewItemscope node state = let state'@(scopeVal, _) = extractMicrodata node state in (scopeVal, state') -- return current scope as scopeVal extractItemprops :: Node -> State -> State extractItemprops elt@(Element _ attrs _) state = let (is, (scope, orphans)) = extractNewItemscope elt state in case lookup "itemprop" attrs of Just name -> (addProp name (extractHtml5Value elt is) scope, orphans) Nothing -> (scope, is:orphans) extractItemprops _ state = state extractMicrodataFromDoc :: Document -> [Itemscope] extractMicrodataFromDoc doc = let initialState = (Itemscope (Just "top") [], []) (scope, orphans) = foldr extractItemprops initialState $ docContent doc in scope:orphans -- Debugging stuff: html str = case parseHTML "" str of Left msg -> error msg Right rslt -> head $ docContent rslt topscope = Itemscope (Just "top") [] test1 = "\ \" result1 = fmap extractMicrodataFromDoc $ parseHTML "" test1\ \\ \ Foobar\ \quux
\ \
Now are we done? There's always one more thing! In part 5 we'll see how to use the "m" word to simplify this further.