March 25, 2012

A comic look at Haskell (pt 2)

In the last part, we defined some Haskell foreign functions for the libarchive library. Now we'll get down to the actual business of translating a few Racket routines into Haskell, redesigning them a bit, and ultimately showing how lazy IO can be a very nice way to interact with this library.

Let's begin by translating the archive-read-current-file function that we saw previously:

(define (archive-read-current-file archive entry-size)
  (let ([buffer (malloc 'atomic entry-size)])
    (memset buffer 0 entry-size)
    (archive_read_data archive buffer entry-size)
    (make-sized-byte-string buffer entry-size)))

Here, the low-level archive_read_data native method wants to be passed a buffer to write to. In the last section, we did this by making archiveReadData allow the C side to write into a ByteString value (which can be mutated, as you can see, if you call the right "unsafe" operations, e.g. unsafeUseAsCStringLen). So now our problem is simply to provide an appropriately sized ByteString for the Haskell version of this routine to write to. We can do this with Data.ByteString.replicate (which behaves like malloc and memset rolled into one). Then, based on the actual number of bytes read, we can trim the ByteString to the exact size with take:

archiveReadCurrentFile :: Archive -> Int -> IO ByteString
archiveReadCurrentFile ar sz = do
  let bs = replicate sz 0
  cnt <- archiveReadData ar bs
  return $ take cnt bs

This probably isn't the most efficient way to do this. A better API might use Data.ByteString.Internal.createAndTrim, which skips the up-front zero-fill and doesn't keep the oversized buffer alive the way take does (take is just a cheap slice that shares the underlying storage). But this would require redesigning archiveReadData to take a Ptr Word8 to write to instead, or simply calling archive_read_data directly. Something we can revisit later.
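Just to make that sketch concrete, here's roughly what a createAndTrim-based version might look like, assuming we call the raw archive_read_data import directly on an already-unwrapped Ptr ARCHIVE (the name archiveReadCurrentFile' is made up for illustration and isn't part of the real module):

-- assumes: import qualified Data.ByteString.Internal as BI
archiveReadCurrentFile' :: Ptr ARCHIVE -> Int -> IO ByteString
archiveReadCurrentFile' a sz =
  BI.createAndTrim sz $ \buf -> do
    -- createAndTrim allocates sz bytes, lets the action fill them,
    -- and trims the result to however many bytes the action reports
    cnt <- archive_read_data a (castPtr buf) (fromIntegral sz)
    return (fromIntegral cnt)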

Next, let's translate the archive-read-next-file-entry function:

(define (archive-read-next-file-entry archive)
  (let-values ([(entry ret) (archive_read_next_header archive)])
    (if (= ret ARCHIVE_OK)
        (archive-entry (archive_entry_pathname entry)
                       (archive_entry_size entry)
                       (rewrite-file-mode (archive_entry_filetype entry)))
        (if (= ret ARCHIVE_EOF)
            #f
            (raise (list "error" ret))))))

This is pretty straightforward. It simply calls the archive_read_next_header function, checks the result, and builds a Racket archive-entry structure (note the dash instead of underscore in the name) from 3 parts of the C archive_entry struct that's returned. Now, I don't know why it's actually necessary to build another structure containing the archive_entry components (perhaps because Alex wanted to pattern match against this structure?), but in Haskell, I'll just rely on the native methods we designed earlier to access the components of the archive_entry:

archiveReadNextFileEntry :: Archive -> IO ArchiveEntry
archiveReadNextFileEntry ar = do
  rv <- archiveReadNextHeader ar
  case rv of
    Left err -> error (show err)
    Right entry -> return entry

Earlier we had made archiveReadNextHeader return a result of type Either Result ArchiveEntry. And if you recall, an Either value has Left and Right constructors that are used to distinguish either of 2 cases. Usually the Left case is used to specify an error result, and the Right case is for the common, non-error result. We could have returned a pair, (Result, Maybe ArchiveEntry), rather like the Racket code has done, but Either is a slightly cleaner way to do the same thing.

Now rather than returning our Either result, we want this routine to throw an exception for the error case, and simply return the ArchiveEntry for the common case. The case statement accomplishes this. Also, if you recall from earlier, we would like to be able to catch these exceptions, so we're going to swap out the error function with ioError and userError. And one more thing: The whole routine can be simplified by using the either function to say what to do in either of the 2 cases:

archiveReadNextFileEntry :: Archive -> IO ArchiveEntry
archiveReadNextFileEntry ar =
  archiveReadNextHeader ar >>= either (ioError . userError . show) return

This says: Call the archiveReadNextHeader function (that's in the IO monad), and pass its result to the either function (rather like a unix pipe). either is then given 2 functions: the first says if you get a Left value, call the composed function (ioError . userError . show) on it (remember, this is equivalent to (\left -> ioError (userError (show left))) -- yes, the dots seem to work backwards from most languages we're familiar with). The second says in the Right case, simply call return on the value.
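For reference, here's either's type as GHCi reports it:

Prelude> :t either
either :: (a -> c) -> (b -> c) -> Either a b -> c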

Folding archives

Now we get to the fun part. Alex thought that a good API for visiting the elements of an archive file would be to fold over them. A fold is like a map that passes a piece of data along for the ride -- an accumulator value that it can change or replace as each file is visited. (In fact, when the accumulator is a list and the function conses each transformed element onto it, you recover map exactly.)
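In plain list terms (nothing archive-specific here, and the name mapViaFoldr is made up for illustration), that relationship looks like this:

-- map recovered from a right fold whose accumulator is a list
mapViaFoldr :: (a -> b) -> [a] -> [b]
mapViaFoldr f = foldr (\x acc -> f x : acc) []

Here's the Racket fold routine: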

(define (archive-folder archive proc init-val)
  (match (archive-read-next-file-entry archive)
    [(archive-entry path size filetype)
     (archive-folder
      archive proc
      (apply proc
             (list (archive-entry path size filetype)
                   (lambda () (archive-read-current-file archive size))
                   init-val)))]
    [#f init-val]))

This calls archive-read-next-file-entry, and when it returns an archive-entry the user's proc is called to handle its data. The proc is passed the archive-entry (actually a copy of it -- tsk, tsk), a function that can be called to read the current file (a thunk -- a nullary function that's used when we want to delay evaluation -- rather like Haskell's laziness, don't you think?), and the accumulator value, init-val. The result of the user's proc becomes the new accumulator that's passed to a recursive call to archive-folder, which will go on to read the next entry.

In Haskell, we can write this almost identically:

archiveFolder :: Archive -> (ArchiveEntry -> IO ByteString -> a -> IO a) ->
                 a -> IO a
archiveFolder ar f acc =
  catch
    (do entry <- archiveReadNextFileEntry ar
        acc' <- f entry (readCurrentEntry entry) acc
        archiveFolder ar f acc')
    (\_ -> return acc)
  where readCurrentEntry entry = do
          sz <- archiveEntrySize entry
          bs <- archiveReadCurrentFile ar sz
          return bs

There are a few subtle differences here. Instead of testing the result of archiveReadNextFileEntry, we're going to rely on catching an exception to know when to terminate the fold. And instead of using a thunk (which is rarely needed in a lazy language like Haskell), we're just going to pass an IO operation which is the result of calling the local function readCurrentEntry.

Let's define a few additional routines in a separate file so that we can test this out:

module Main where

import Codec.Archive.Libarchive
import Data.ByteString hiding (map)
import Data.Int
import qualified Data.ByteString.Lazy as L
import Prelude hiding (read, take)

bufSize :: Int
bufSize = 10240

archiveReadOpen :: String -> IO Archive
archiveReadOpen path = do
  ar <- archiveReadNew
  expectOK $ archiveReadSupportCompressionAll ar
  expectOK $ archiveReadSupportFormatAll ar
  expectOK $ archiveReadOpenFile ar path bufSize
  return ar

archiveFoldFile :: String -> (ArchiveEntry -> IO ByteString -> a -> IO a) ->
                   a -> IO a
archiveFoldFile path f acc = do
  ar <- archiveReadOpen path
  archiveFolder ar f acc

readEntry :: ArchiveEntry -> IO ByteString -> [(Int, ByteString)] ->
             IO [(Int, ByteString)]
readEntry entry reader acc = do
  t <- archiveEntryFiletype entry
  if t == ae_ifreg
    then do sz <- archiveEntrySize entry
            bs <- reader
            let sub = take (min sz 10) bs -- just keep some of the data for test purposes
            return $ (sz, sub):acc
    else return acc

path = "/Users/warrenharris/projects/blog/comic/v-vendetta-0.cbz"

test1 = archiveFoldFile path readEntry []

Let's call it:

*Main> test1
[(2535,"-=Book=-\r\r"),(956,"V For Vend"),(590199,"\255\216\255\224\NUL\DLEJFIF"),(374813,"\255\216\255\224\NUL\DLEJFIF"),(508622,"\255\216\255\224\NUL\DLEJFIF"),(518735,"\255\216\255\224\NUL\DLEJFIF"),(527499,"\255\216\255\224\NUL\DLEJFIF"),(451816,"\255\216\255\224\NUL\DLEJFIF"),(541239,"\255\216\255\224\NUL\DLEJFIF"),(481156,"\255\216\255\224\NUL\DLEJFIF"),(516998,"\255\216\255\224\NUL\DLEJFIF"),(510813,"\255\216\255\224\NUL\DLEJFIF"),(478150,"\255\216\255\224\NUL\DLEJFIF"),(437432,"\255\216\255\224\NUL\DLEJFIF"),(310424,"\255\216\255\224\NUL\DLEJFIF"),(265642,"\255\216\255\224\NUL\DLEJFIF"),(301822,"\255\216\255\224\NUL\DLEJFIF"),(676613,"\255\216\255\224\NUL\DLEJFIF"),(53109,"\255\216\255\224\NUL\DLEJFIF"),(452430,"\255\216\255\224\NUL\DLEJFIF"),(305523,"\255\216\255\224\NUL\DLEJFIF"),(319714,"\255\216\255\224\NUL\DLEJFIF"),(298048,"\255\216\255\224\NUL\DLEJFIF")]

One thing you'll notice here is that the data got reversed. The last entry (beginning with "-=Book=-\r\r") got put at the front of the list. This is because the fold routine that we wrote is essentially a left-fold, which means it starts from the left (the beginning of the archive contents) and pushes its results onto the front of a list as it goes. So the list ends up with the first element on the right. No big deal. We can simply reverse it if we want.
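For example (the name test1Reversed is made up for illustration):

test1Reversed :: IO [(Int, ByteString)]
test1Reversed = fmap reverse test1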

Folding archives -- take 2

Now the first thing the astute reader will notice about archiveFolder is that it really isn't necessary to provide readCurrentEntry at all. This is something the user's function can implement as long as it has the Archive and ArchiveEntry values in hand. We might rewrite this fold routine as follows:

archiveFolder2 :: Archive -> (Archive -> ArchiveEntry -> a -> IO a) -> a -> IO a
archiveFolder2 ar f acc =
  catch
    (do entry <- archiveReadNextFileEntry ar
        acc' <- f ar entry acc
        archiveFolder2 ar f acc')
    (\_ -> return acc)

Now let's provide a few additional routines so that we can test this out:

archiveFoldFile2 :: String -> (Archive -> ArchiveEntry -> a -> IO a) -> a -> IO a
archiveFoldFile2 path f acc = do
  ar <- archiveReadOpen path
  archiveFolder2 ar f acc

readEntry2 :: Archive -> ArchiveEntry -> [(Int, ByteString)] ->
              IO [(Int, ByteString)]
readEntry2 ar entry acc = do
  t <- archiveEntryFiletype entry
  if t == ae_ifreg
    then do sz <- archiveEntrySize entry
            bs <- archiveReadCurrentFile ar sz
            let sub = take (min sz 10) bs
            return $ (sz, sub):acc
    else return acc

test2 = archiveFoldFile2 path readEntry2 []

We get the exact same result if we call this one. All we've done is move a bit of the work into the readEntry2 function.

Step away from the fold

Now the problem with using a fold operation to traverse an archive file is that it doesn't give you a way to control which parts of the archive get visited. Granted, the user's proc can decide not to slurp in the contents of an archive_entry, but it can't decide that it doesn't want to continue visiting entries. The fold operation is going to continue marching to the end of the archive, whether you've got enough data or not.

Now it might be possible to use exceptions, continuations or delimited control to provide an escape hatch for fold, but a nicer abstraction would be one that didn't have to resort to escape hatches at all -- one that just builds the data structure you're about to use as you decide you need it. Just like a file read operation that streams in chunks of the file each successive time read is called, wouldn't it be nice if we could provide a structured stream of data that only created the data values from the raw archive data as necessary? That's exactly what lazy IO can do for you.

If you remember what I said earlier, the IO monad isn't lazy at all. It's all about sequencing "real world" operations in an order that's predictable and timely. So before we get into the lazy part of lazy IO, let's just design a plain old IO operation that reads the entire archive.

To read an archive, we need to alternately read headers and archive data -- so there are essentially 2 states to the reading process. And as we read, we need to accumulate this data somewhere. Perhaps the most natural place to accumulate it is as a list of ByteStrings -- one for each archive_entry.

However, instead of regular (strict) ByteStrings, we're going to use lazy ByteStrings. There are 2 reasons for this: First, a lazy ByteString provides the same interface abstraction as a regular ByteString, but underneath it uses a sequence of buffer segments to allow data to be brought in as needed. Second, it has laziness built into it from the get-go. Although we're not going to use the lazy aspect of L.ByteString just yet, we'll need it eventually -- so just hold on.

Let's start from the bottom up, by first providing a function to return a lazy ByteString for each archive_entry. It's easy to make a L.ByteString from a list of strict ones simply by calling L.fromChunks:

archiveReadEntry :: Ptr ARCHIVE -> IO L.ByteString
archiveReadEntry a = do
  chunks <- archiveReadEntryChunks a
  return $ L.fromChunks chunks

archiveReadEntryChunks :: Ptr ARCHIVE -> IO [ByteString]
archiveReadEntryChunks a = do
  bs <- archiveReadEntryChunk a
  if null bs
    then return []
    else do rest <- archiveReadEntryChunks a
            return (bs:rest)

archiveReadEntryChunk :: Ptr ARCHIVE -> IO ByteString
archiveReadEntryChunk a = do
  buf <- mallocArray bufSize
  rv <- archive_read_data a buf (fromIntegral bufSize)
  unsafePackCStringFinalizer (castPtr buf) (fromIntegral rv) (free buf)

Here I've cheated a bit and made all these functions take a Ptr ARCHIVE instead of an Archive. That's so that I can unwrap the Archive once at the beginning, and also because I don't plan to export these functions from the module. Note that we're mallocing fixed-size chunks and calling archive_read_data to fill them. Then we're using unsafePackCStringFinalizer to associate each chunk with its free finalizer. That's a little more efficient than what we had before.

Next we can define a function that alternately reads headers and entry contents, and does this recursively until the end of the archive is reached:

archiveRead :: Archive -> IO [L.ByteString]
archiveRead ar = do
  withForeignPtr (unArchive ar) $ \a -> do
    rv <- archiveReadNextHeader ar
    case rv of
      Right entry -> do
        t <- archiveEntryFiletype entry
        if t == ae_ifreg
          then do bs <- archiveReadEntry a
                  rest <- archiveRead ar
                  return (bs:rest)
          else archiveRead ar
      Left _ -> return []

This says to read an archive, first read a header. If there is no error, check the archive entry's file type, and if it's a regular file, read the archive's contents followed by a recursive call to read the rest of the archive. If it's not a regular file, we'll skip that entry and continue on to the next one. We'll continue doing this until archiveReadNextHeader returns an error.

Now we're almost done. The only thing that remains is this little detail I mentioned earlier -- the fact that the IO monad is strict, and as written this will pull the entire contents of the archive into memory, building the list of (fully read in) L.ByteStrings as it goes. What we need to do is introduce a little delay in a couple of critical places, so that some of these IO operations aren't performed until they're needed. This is a job for unsafeInterleaveIO.
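Its type is refreshingly boring -- it takes an IO action and hands back one that doesn't actually run until its result is demanded:

unsafeInterleaveIO :: IO a -> IO a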

There are 2 places where unsafeInterleaveIO is needed. First, we need to delay reading in the tail of the list of L.ByteStrings until more entries are needed. Second, we need to delay reading in the entire contents of any individual entry, so that its L.ByteString isn't fully materialized until the data is actually consumed. The first one of these goes in archiveRead itself, at the point where the recursive call is about to be made:

archiveRead :: Archive -> IO [L.ByteString]
archiveRead ar = do
  withForeignPtr (unArchive ar) $ \a -> do
    rv <- archiveReadNextHeader ar
    case rv of
      Right entry -> do
        t <- archiveEntryFiletype entry
        if t == ae_ifreg
          then do bs <- archiveReadEntry a
                  rest <- unsafeInterleaveIO $ archiveRead ar
                  return (bs:rest)
          else archiveRead ar
      Left _ -> return []

The second one goes in archiveReadEntryChunks, again at the point where it is about to be called recursively:

archiveReadEntryChunks :: Ptr ARCHIVE -> IO [ByteString]
archiveReadEntryChunks a = do
  bs <- archiveReadEntryChunk a
  if null bs
    then return []
    else do rest <- unsafeInterleaveIO $ archiveReadEntryChunks a
            return (bs:rest)

Now the cool thing about lazy IO is that nothing gets read in until it is needed, but due to the sequencing savviness of monads, everything happens in exactly the same order as if it had been read in eagerly. So we get the best of both worlds -- an elegant abstraction (a list of ByteStrings), and the benefits of streaming.
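As a quick (hypothetical) check back in our Main module -- reusing archiveReadOpen and path from earlier -- something like this reads no more of the archive than it needs to produce the first entry, no matter how big the file is:

test3 :: IO ()
test3 = do
  ar <- archiveReadOpen path
  entries <- archiveRead ar
  case entries of
    (first:_) -> print (L.take 10 first)  -- forces only the first chunk of the first entry
    []        -> putStrLn "empty archive"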

Conclusion

At this point I'm sure you're disappointed to find that this article has had very little to do with comics, and in general hasn't been very funny at all. But stay tuned for part 3, where I'll attempt to put this library to use and actually build that web app for your comic viewing pleasure.

A comic look at Haskell

Well, if you're one of the 2 people that read my last set of blog articles, Restringing a Racket with Haskell, you'll know that Alex is attempting to branch out from the ancient esoteric art of Emacs Lisp hacking to learn the well-mannered sport of Racket. I, on the other hand, am trying to steer him towards the abyss. As Oliver Wendell Holmes once said, "The mind, once expanded to the dimensions of larger ideas, never returns to its original size." Here's to Alex with a big head!

Now last time I made a wild and unsubstantiated claim that Haskell makes me a better programmer, and then proceeded to show you a bunch of boring bits about trying to get a silly HTML microdata extractor to work. Well, this time, you'll be happy to know that I'm going to stand by my claim, and likewise go on to bore you with the likes of foreign functions and lazy IO. Why am I wasting your time, you ask? First, you're the one choosing to read this! Second, we've got to learn a little bit of the Haskell landscape before the rockinghorse people can get down to gnashing those marshmallow pies. So if I can get Alex, or anyone else out there in the studio audience, over the initial hurdle, I've done my bit.

So this time, I was tossed a comic book archive file (.cbz) and told that our goal was to create a little web app that displays the comic book pages. Along with this, I was given the following bit of Racket foreign function interface to libarchive, a nifty little package developed by another friend of ours, Tim, for reading all sorts of archive formats.

#lang racket/base
(require ffi/unsafe
         ffi/unsafe/define
         ffi/unsafe/alloc
         racket/match)

(define-ffi-definer define-archive (ffi-lib "libarchive" (list "12")))

(define ARCHIVE_OK 0)
(define ARCHIVE_EOF 1)

(define _archive-pointer (_cpointer 'archive))
(define _archive_entry-pointer (_cpointer 'archive-entry))

(define-archive archive_read_close
  (_fun _archive-pointer -> _int) #:wrap (deallocator))
(define-archive archive_read_free
  (_fun _archive-pointer -> _int) #:wrap (deallocator))
(define-archive archive_read_new (_fun -> _archive-pointer)
  #:wrap (allocator archive_read_free))
(define-archive archive_read_support_compression_all
  (_fun _archive-pointer -> _int))
(define-archive archive_read_support_format_all
  (_fun _archive-pointer -> _int))
(define-archive archive_read_open_file
  (_fun _archive-pointer _string _int -> _int)
  #:wrap (allocator archive_read_close))
(define-archive archive_read_next_header
  (_fun _archive-pointer (o : (_ptr o _archive_entry-pointer))
        -> (r : _int)
        -> (values o r)))
(define-archive archive_entry_pathname
  (_fun _archive_entry-pointer -> _string))
(define-archive archive_entry_size
  (_fun _archive_entry-pointer -> _int))
(define-archive archive_entry_filetype
  (_fun _archive_entry-pointer -> _int))
(define-archive archive_read_data
  (_fun _archive-pointer _pointer _int -> _int))

(define (archive-read-current-file archive entry-size)
  (let ([buffer (malloc 'atomic entry-size)])
    (memset buffer 0 entry-size)
    (archive_read_data archive buffer entry-size)
    (make-sized-byte-string buffer entry-size)))

(struct archive-entry (path size filetype) #:transparent)

(define (rewrite-file-mode mode)
  (match mode
    [#o0170000 'fmt]
    [#o0100000 'file]
    [#o0120000 'link]
    [#o0140000 'sock]
    [#o0020000 'chr]
    [#o0060000 'blk]
    [#o0040000 'dir]
    [#o0010000 'fifo]))

(define (archive-read-next-file-entry archive)
  (let-values ([(entry ret) (archive_read_next_header archive)])
    (if (= ret ARCHIVE_OK)
        (archive-entry (archive_entry_pathname entry)
                       (archive_entry_size entry)
                       (rewrite-file-mode (archive_entry_filetype entry)))
        (if (= ret ARCHIVE_EOF)
            #f
            (raise (list "error" ret))))))

(define (archive-folder archive proc init-val)
  (match (archive-read-next-file-entry archive)
    [(archive-entry path size filetype)
     (archive-folder
      archive proc
      (apply proc
             (list (archive-entry path size filetype)
                   (lambda () (archive-read-current-file archive size))
                   init-val)))]
    [#f init-val]))

(define (archive-fold-files archive-path proc init-val)
  (let ([archive (archive_read_new)]) ; check that archive != NULL
    (archive_read_support_compression_all archive) ; check return
    (archive_read_support_format_all archive) ; check return
    (archive_read_open_file archive archive-path 10240) ; check return
    (archive-folder archive proc init-val)))

(archive-fold-files
"/home/alexbl/Downloads/Judge Dredd The Complete Case Files/Judge Dredd The Complete Case Files 001 (Digital Copy Colour Edit).cbr"
 (lambda (entry reader val)
   (if (equal? (archive-entry-filetype entry) 'file)
       (cons (cons (archive-entry-path entry)
                   (let ([data (reader)])
                    (subbytes data 0 (min (bytes-length data) 10)))) val)
       val))
 '())

;; (archive-fold-files
;;  "/home/alexbl/Downloads/Judge Dredd The Complete Case Files/Judge Dredd The Complete Case Files 001 (Digital Copy Colour Edit).cbr"
;;  (lambda (entry reader val)
;;    (cons (cons (archive-entry-path entry) (archive-entry-size entry)) val))
;;  '())

(provide archive-fold-files
         archive-entry)

As you can see from this file, there are a bunch of actual foreign function definitions (like archive_read_close), and then a number of higher-level routines to make it more convenient to get at an archive's contents.

Before we can begin translating this to Haskell foreign functions, there's a lot to learn about how Haskell's FFI works. There are a bunch of articles on that topic on the Haskell Wiki, as well as a very nice series of articles at Inside 233. But like always, I'm just going to cut to the chase and give you the answer, because we're both in a hurry.

To do this, we're going to use Haskell's lowest-level tool, a preprocessor called "hsc2hs" to convert a foreign function definition into working Haskell code. There are several other ways to generate FFI interfaces for Haskell, but hsc2hs isn't too painful, and it's always good to know the basics. Maybe later we'll revisit this with c2hs or something else.

FFI Boilerplate

We start by creating a .hsc file in the right place for Haskell's module hierarchy. If we look on Hackage, we see that many archive handling packages are in the Codec namespace. So we'll do the same and create a file called "Codec/Archive/Libarchive.hsc" and give it some initial boilerplate:

{-# LANGUAGE ForeignFunctionInterface, EmptyDataDecls, DeriveDataTypeable,
             ScopedTypeVariables #-}

#include <archive.h>
#include <archive_entry.h>

module Codec.Archive.Libarchive where

import Data.ByteString (ByteString, null, replicate, take)
import qualified Data.ByteString.Lazy as L
import Data.ByteString.Unsafe (unsafePackCString, unsafeUseAsCStringLen,
                               unsafePackCStringFinalizer)
import Foreign.C (CInt, CSize, CString, withCAString)
import Foreign.Marshal.Alloc (alloca, free)
import Foreign.Marshal.Array (mallocArray)
import Foreign.Storable (peek)
import Foreign.Ptr (Ptr, FunPtr, castFunPtr, castPtr)
import Foreign.ForeignPtr (ForeignPtr, newForeignPtr, newForeignPtr_,
                           withForeignPtr)
import Prelude hiding (null, replicate, take)
import System.IO.Unsafe (unsafeInterleaveIO)

I've laid these out so it's easy for you to see which functions come from which modules, in case you want to go look them up. You really don't need to declare the individual functions you use in this way, provided that there aren't any conflicts between modules. (You'll notice that there are conflicts in the above between ByteString functions like null, replicate and take and their Prelude namesakes, which is why I'm hiding those names from the Prelude. Also, the Data.ByteString.Lazy module defines a whole parallel set of lazy ByteString functions that we'll get at by using the "L" prefix.)

Now one cool thing about hsc2hs is that you can just include C headers directly, like I've done with the 2 libarchive headers. We'll see how these get used in just a few minutes. But first a bit of bad news. Unlike normal Haskell .hs files, you can't sit in haskell-mode in Emacs and evaluate .hsc files in the same way. I know that's really disappointing, but maybe some Emacs Lisp hacker will come along and fix that sad situation eventually. In the meantime, we'll resort to Haskell's big fancy make system, Cabal, to build this file. Here's the cabal definition, a file we'll name "libarchive.cabal" that lives in our top-level directory:

name: libarchive
version: 0.1
license: BSD3
cabal-version: >= 1.8
copyright: (c) 2012 Warren Harris
author: Warren Harris <warrensomebody@gmail.com>
maintainer: Warren Harris <warrensomebody@gmail.com>
stability: alpha
synopsis: libarchive bindings
description: libarchive bindings
category: Codec
build-type: Simple

Library
    exposed-modules:    Codec.Archive.Libarchive
    build-depends:      base >= 3 && < 5,
                        bytestring >= 0.9
    extra-libraries:    archive
    ghc-options:        -Wall

First a little aside: If you're on a Mac and you don't yet have libarchive installed, you can get it via macports. I tried this the other day, and decided to run port upgrade outdated at the same time. Eleven (yes, 11) hours later, I had all sorts of shiny new toys like gcc45 and Erlang (Erlang?). This was the last straw, so I decided to switch to Homebrew, which provides a much saner package manager for the Mac. Unfortunately, you need a bit of magic to install the development headers for libarchive with Homebrew:

$ brew tap homebrew/dupes
$ brew install `brew --prefix`/Library/Formula/libarchive.rb

Now we're ready to configure and build our (empty) libarchive module:

$ cabal configure
$ cabal install

We'll need to re-run the cabal install command whenever we make a change to our Libarchive.hsc file. At least until we get that Emacs integration.

Our first foreign functions

Now, let's create our first foreign function definitions, starting with the interface to archive_entry. We'll put these at the end of the Libarchive.hsc file:

data ARCHIVE_ENTRY      -- native type
newtype ArchiveEntry = ArchiveEntry { unArchiveEntry :: ForeignPtr ARCHIVE_ENTRY }
                     deriving (Show, Eq)

foreign import ccall unsafe "archive.h archive_entry_pathname"
  archive_entry_pathname :: Ptr ARCHIVE_ENTRY -> IO CString

archiveEntryPathname :: ArchiveEntry -> IO ByteString
archiveEntryPathname ae =
  withForeignPtr (unArchiveEntry ae) $ \a -> do
    cstr <- archive_entry_pathname a
    unsafePackCString cstr

foreign import ccall unsafe "archive.h archive_entry_size"
  archive_entry_size :: Ptr ARCHIVE_ENTRY -> IO CInt

archiveEntrySize :: ArchiveEntry -> IO Int
archiveEntrySize ae =
  withForeignPtr (unArchiveEntry ae) $ \a -> do
    sz <- archive_entry_size a
    return $ fromIntegral sz

foreign import ccall unsafe "archive.h archive_entry_filetype"
  archive_entry_filetype :: Ptr ARCHIVE_ENTRY -> IO CInt

archiveEntryFiletype :: ArchiveEntry -> IO Int
archiveEntryFiletype ae =
  withForeignPtr (unArchiveEntry ae) $ \a -> do
    mode <- archive_entry_filetype a
    return $ fromIntegral mode

The first line defines a new data type, ARCHIVE_ENTRY, with no constructors. The only reason for this is so that we can distinguish native pointers to C archive_entry structures from other kinds of pointers. In Haskell, you'll see Ptr ARCHIVE_ENTRY as the type of the first argument of the archive_entry accessors.

The second line defines a higher-level wrapper type for archive_entry structures. Here we point to our archive_entry with a Haskell ForeignPtr instead of a regular Ptr. ForeignPtrs allow us to associate a finalizer callback with a C pointer, so that when the garbage collector decides the object is no longer needed, the C structure can be automatically freed. You aren't required to design foreign function interfaces with ForeignPtrs. One can always provide APIs that require the user to explicitly free structures, or manage them in a block scope -- but here using finalizers seems like a good way to go. Now, unfortunately in the above routines we don't actually see the finalizer get associated with the instances. That's because all these structs are managed by the archive structure, and its finalizer will automatically destroy the archive_entry structs. So we'll come back to this finalizer thing later.

Another thing: newtype is a cool way to define a type that incurs no runtime overhead. Unlike typedef in C that defines type synonyms, newtype defines a type that's logically distinct from the thing it wraps, so you can't accidentally use a ForeignPtr ARCHIVE_ENTRY when an ArchiveEntry is needed. The ArchiveEntry type constructor that wraps these values is only there at compile time, so there's no runtime overhead for using a newtype.

Finally, you'll see that each of these foreign function definitions has two parts: the low-level C function declaration -- the foreign import statement with its associated type information -- and the higher-level user entry point that deals with unwrapping the ForeignPtrs and converting from C types like CInt to more convenient Haskell types like Int. Let's look at the first foreign import statement, archive_entry_pathname. It uses the unsafe declaration to say that this function is "unsafe" in the sense that it can't call back into the Haskell runtime. If it did, the Haskell runtime would become seriously confused. However, be aware that even if your native function doesn't call back into Haskell, you might want to declare it safe if you want to allow Haskell's lightweight threads to preempt it. All unsafe functions run to completion, and hog the entire process while they're running.
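For comparison, a safe variant of one of these imports would differ only in that one keyword -- the hypothetical safe_archive_read_data below isn't used anywhere else, it's just to show the shape:

foreign import ccall safe "archive.h archive_read_data"
  safe_archive_read_data :: Ptr ARCHIVE -> Ptr CString -> CSize -> IO CSize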

All of these foreign import declarations also state that the native functions are in the IO monad (the final result such as CString is wrapped with IO). This might seem odd for a function like archive_entry_size which is purely functional in nature. The reason I'm doing this is due to the way libarchive works. Since it streams through the archive file's contents, if you call archive_entry_size after you move on to the next archive header, it returns 0. So it's really critical that when you use it in the IO monad it really runs right away, and isn't run lazily when the value is actually needed. If you allow it to run lazily, it runs too late and comes up empty handed. I found this out the hard way.

This brings us to an interesting point about the IO monad. Like all monads, the IO monad is about sequencing operations together, so that they happen in the right order. However, unlike most monads, the IO monad is strict (i.e. not lazy), which means that the operations happen as soon as you call the functions, rather than when the monad's result value is really needed. We'll come back to this when we talk about lazy IO later.

Moving on to the higher-level functions. First, I've named these following the standard Haskell camelCase convention to distinguish them from the low-level routines. You'll notice that my module declaration doesn't declare which functions are exported (so everything will be exported by default), but the intention is to simply export the high-level functions and types, and hide the lower-level (and possibly dangerous) native methods. Each high-level function like archiveEntryPathname first unwraps the ArchiveEntry type by calling the unArchiveEntry record accessor to get at the ForeignPtr, and then uses withForeignPtr to get at the low-level Ptr ARCHIVE_ENTRY. This is then passed to the native method along with other arguments (translated into their C types), and the results are similarly translated back into Haskell types. For instance, unsafePackCString creates a ByteString from a null-terminated CString, and fromIntegral is the numeric workhorse that converts all sorts of numbers to whatever's needed.
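If we did eventually want to lock the exports down, a partial sketch of what the module header might grow into (just a few of the high-level names, by no means a complete list) would be:

module Codec.Archive.Libarchive
  ( Archive,
    ArchiveEntry,
    Result(..),
    archiveReadNew,
    archiveReadOpenFile,
    archiveReadNextHeader,
    archiveEntryPathname,
    archiveEntrySize,
    archiveEntryFiletype,
    archiveReadData
  ) where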

Now one more thing we'll need when we call archiveEntryFiletype is a set of constants for the various modes it returns (things like AE_IFREG in the "archive.h" header). We could have defined these with data constructors, but since I'm lazy I'll just define a set of octal constants here. Note that they have to begin with a lower-case letter since they're values not types:

ae_ifmt :: Int
ae_ifmt = 0o170000

ae_ifreg :: Int
ae_ifreg = 0o100000

ae_iflnk :: Int
ae_iflnk = 0o120000

ae_ifsock :: Int
ae_ifsock = 0o140000

ae_ifchr :: Int
ae_ifchr = 0o020000

ae_ifblk :: Int
ae_ifblk = 0o060000

ae_ifdir :: Int
ae_ifdir = 0o040000

ae_ififo :: Int
ae_ififo = 0o010000
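Incidentally, since hsc2hs is already processing this file and archive_entry.h is #included at the top, another option (not used here) would be to pull the values straight from the header with the #const directive, for example:

ae_ifreg :: Int
ae_ifreg = #{const AE_IFREG}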

Skipping right along...

Before we define foreign functions for archive operations, let's first define some routines for dealing with these functions' return types. Instead of testing an integer result, let's use a data type to define the possible result codes, and some helper functions for calling these functions and converting their int results to this type:

data Result = OK | EOF | RETRY | WARN | FAILED | FATAL
            deriving (Eq, Show)

result :: CInt -> Result
result 0 = OK
result 1 = EOF
result (-10) = RETRY
result (-20) = WARN
result (-25) = FAILED
result (-30) = FATAL
result _ = FATAL  -- unexpected result

withResult :: (Ptr ARCHIVE -> IO CInt) -> Archive -> IO Result
withResult f ar =
  withForeignPtr (unArchive ar) $ \a -> do
    rv <- f a
    return $ result rv

expectOK :: IO Result -> IO ()
expectOK m = m >>= go
  where go OK = return ()
        go err = ioError $ userError (show err)

The function expectOK is something that wraps an IO operation returning a Result into one that throws an exception instead. This is just for the cases where we'd like to streamline the common case, and not deal with error handling every step of the way. I'm not sure that this function makes any sense in a production-quality version of this library, but we'll use it here when it's convenient. Also note that I originally wrote the last line as go err = error (show err) (err is the Result value, and error is a function that aborts the computation by raising an error). However, the exceptions raised by error can't be caught like those raised by ioError, so I switched.

Finally, here are the rest of the FFI definitions we'll need. They aren't all that interesting after you've grokked the above, so I'm not going to say any more about them:

data ARCHIVE     -- native type
newtype Archive = Archive { unArchive :: ForeignPtr ARCHIVE }
                deriving (Show, Eq)

foreign import ccall unsafe "archive.h archive_read_new"
  archive_read_new :: IO (Ptr ARCHIVE)

foreign import ccall unsafe "archive.h &archive_read_finish"
  p_archive_read_finish :: FunPtr (Ptr ARCHIVE -> IO CInt)

archiveReadNew :: IO Archive
archiveReadNew = do
  ar <- archive_read_new
  fp <- newForeignPtr (castFunPtr p_archive_read_finish) ar
  return $ Archive fp

foreign import ccall unsafe "archive.h archive_read_support_compression_all"
  archive_read_support_compression_all :: Ptr ARCHIVE -> IO CInt

archiveReadSupportCompressionAll :: Archive -> IO Result
archiveReadSupportCompressionAll ar =
  withResult archive_read_support_compression_all ar

foreign import ccall unsafe "archive.h archive_read_support_format_all"
  archive_read_support_format_all :: Ptr ARCHIVE -> IO CInt

archiveReadSupportFormatAll :: Archive -> IO Result
archiveReadSupportFormatAll ar =
  withResult archive_read_support_format_all ar

foreign import ccall unsafe "archive.h archive_read_open_file"
  archive_read_open_file :: Ptr ARCHIVE -> CString -> CSize -> IO CInt

archiveReadOpenFile :: Archive -> String -> Int -> IO Result
archiveReadOpenFile ar filename blocksize =
  withCAString filename $ \c_filename -> do
  let sz = fromIntegral blocksize
  withResult (\a -> archive_read_open_file a c_filename sz) ar

foreign import ccall unsafe "archive.h archive_read_next_header"
  archive_read_next_header :: Ptr ARCHIVE -> Ptr (Ptr ARCHIVE_ENTRY) -> IO CInt

archiveReadNextHeader :: Archive -> IO (Either Result ArchiveEntry)
archiveReadNextHeader ar =
  alloca $ \ptr -> do
    rv <- withResult (\a -> archive_read_next_header a ptr) ar
    case rv of
      OK -> do entry <- peek ptr
               fp <- newForeignPtr_ entry
               return $ Right $ ArchiveEntry fp
      err -> return $ Left err

foreign import ccall unsafe "archive.h archive_read_close"
  archive_read_close :: Ptr ARCHIVE -> IO CInt

archiveReadClose :: Archive -> IO Result
archiveReadClose ar = withResult archive_read_close ar

foreign import ccall unsafe "archive.h archive_read_finish"
  archive_read_finish :: Ptr ARCHIVE -> IO CInt

archiveReadFinish :: Archive -> IO Result
archiveReadFinish ar = withResult archive_read_finish ar

foreign import ccall unsafe "archive.h archive_read_data"
  archive_read_data :: Ptr ARCHIVE -> Ptr CString -> CSize -> IO CSize

archiveReadData :: Archive -> ByteString -> IO Int
archiveReadData ar bs =
  withForeignPtr (unArchive ar) $ \a ->
    unsafeUseAsCStringLen bs $ \(buf, len) -> do
    rv <- archive_read_data a (castPtr buf) (fromIntegral len)
    return $ fromIntegral rv

Ok, I will say one thing. In archiveReadNextHeader, you can see that the native archive_entry is wrapped as a ForeignPtr using newForeignPtr_. Unlike newForeignPtr, newForeignPtr_ doesn't take a finalizer. This is because the archive_entry struct will be destroyed when we destroy the archive structure, not when the Haskell ArchiveEntry object goes out of scope. However, in archiveReadNew you can see where we do associate the archive_read_finish native function as a destructor for the Haskell Archive object. (Note how we have to use FunPtr when we want a native pointer to a function. Also, the castFunPtr is used because we're ignoring the result of archive_read_finish. Haskell expects finalizers to return IO ().)

Next, in part 2 we'll get down to translating higher-level Racket routines like archive-fold-files. Don't doze off yet...

March 2, 2012

Restringing a Racket with Haskell (pt 5)

In the last section we finished up an initial version of our microdata extractor -- one that actually works this time. Admittedly, the code was starting to get a little ugly with all the state propagation going around. But don't worry -- with a simple monad we can simplify things and make them pretty once again.

Oh! Did I say monad? We aren't going to write yet another monad tutorial here, I hope! There are far too many of them in the world already. Instead I'm just going to port the code from part 3 to use the State monad to show that it's not that scary, and actually a little bit helpful.

First, we must import Control.Monad.State. If you go look at the package description with the link above, you'll see that it really uses Control.Monad.State.Lazy by default (there's a strict variant, but we won't go into it here).

Next, we need to figure out where to use this crazy thing. In the last part, we talked about propagating the scope stack -- the state -- through various operations. This was always apparent because their signatures ended with either Itemscope -> Itemscope or State -> State. Seeing something like this is a tell-tale sign that the State monad might be useful.

The two main operations that State provides are get and put. In many ways you can think of them as having a little mutable memory location where you can squirrel away things you need to update -- rather like box in Racket. However, unlike box, State follows the control flow of your program rather than just floating around in memory somewhere. (Haskell provides actual mutable Ref cells that do "float around in memory", but let's not go into that here.) State is cooler than some of the alternatives because it really is functional even though it feels imperative. Awesomeness abounds.

To use State with our microdata extractor program, we simply need to define a state monad that our program's operations will be "in". By "in," I mean that the final type in these operations' signatures will be this state monad, or in other words, they'll return a monad value as their result. What's a monad value? It's just a function. In our case, it's just a function that takes a scope stack and returns a result paired with a scope stack (possibly a different one than the one you gave it).
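Conceptually, that function looks like this (the real Control.Monad.State wraps it in a newtype -- and in mtl 2 it's actually StateT layered over Identity -- but the mental model is the same):

-- conceptual model only, not the library's actual definition
newtype State s a = State { runState :: s -> (a, s) }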

Monads are a truly clever trick, and one that's well worth getting your head around in general. At a fundamental level, they're all about sequencing operations together (and Haskell provides built-in syntax for making this sequencing easier -- the do statement). But here, our interest is just a very specific kind of monad that deals with state.

So let's define a state monad for microdata extraction called ExtractState:

type ExtractState a = State (Itemscope, [Itemscope]) a

The type keyword defines a type synonym (like typedef in C++), and this type is parametrized by the variable a, which will be the result type of a state operation. ExtractState a is equivalent to the State type where the type of the actual state is fixed, and is the type of our state from the last section, (Itemscope, [Itemscope]). That's it.

Now the magic of defining this state type is that when we're in the ExtractState monad, calling get will get us our state -- seemingly out of the ether. And calling put will allow us to store a new scope stack into the background radiation. Really what's happening is that there's one more level of functional argument passing just below the monad surface that is shuttling our state around. Best not to think about it too deeply. Let's just use it.
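Specialised to the plain State monad (the library versions are overloaded through the MonadState class), the types of these operations read:

get    :: State s s
put    :: s -> State s ()
modify :: (s -> s) -> State s ()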

Let's begin by defining a few helper functions to access the various parts of our state. We can define these in terms of get and put that State defines:

getScope :: ExtractState Itemscope
getScope = do (scope, _) <- get; return scope

putScope :: Itemscope -> ExtractState ()
putScope scope = do (_, orphans) <- get; put (scope, orphans)

You can see in putScope that the old scope is dropped on the floor, and the new scope is stored in its place. If we want to get fancier, we can rewrite putScope in terms of modify, which passes the old state to a function which should return a new state to put:

putScope :: Itemscope -> ExtractState ()
putScope scope = modify $ \(_, orphans) -> (scope, orphans)

We can also write a state accessor to save an otherwise orphaned Itemscope onto the list of orphans:

saveOrphan :: Itemscope -> ExtractState ()
saveOrphan r = modify $ \(scope, orphans) -> (scope, r:orphans)

With these, we're now ready to rewrite our code to use the ExtractState monad. Let's begin with extractMicrodata which used to look like this:

extractMicrodata :: Node -> State -> State
extractMicrodata elt@(Element _ _ children) state =
  foldr extractItemprops state children
extractMicrodata _ state = state

We can shift it into ExtractState gear by rewriting it like this (deferring the body of the first function cases for a moment):

extractMicrodata :: Node -> ExtractState ()
extractMicrodata elt@(Element _ _ children) = undefined
extractMicrodata _ = return ()

From this, you can see that the signature changed from Node -> State -> State to Node -> ExtractState (). ExtractState hides the details of the state transitions in the monad operations that are used to implement this type -- functions that take a state and return a state. At the same time, the scope-stack parameter went away because we can access it with our monad accessor functions when it's needed. The () (called "unit") type parameter is like void in C++. It simply means that there's no interesting value returned from this function (although the real "value" here is the monad itself). We've used the standard monad return function to return a unit value in the second function case. That's because we don't modify the state, and don't really have a return value otherwise.

Now filling in the details:

extractMicrodata :: Node -> ExtractState ()
extractMicrodata elt@(Element _ _ children) =
  foldr extractItemprops children
extractMicrodata _ = return ()

we'll get into a bit of trouble:

/Users/warrenharris/projects/racket-hs/microdata5.hs:64:3:
    Couldn't match expected type `StateT
                                    (Itemscope, [Itemscope]) Data.Functor.Identity.Identity ()'
                with actual type `[a0] -> b0'
    Expected type: ExtractState ()
      Actual type: [a0] -> b0
    In the return type of a call of `foldr'
    In the expression: foldr extractItemprops children
Failed, modules loaded: none.

We can see that the expected type was ExtractState (), whereas the actual type was a function from lists to something: [a0] -> b0. We didn't even get the arity right here since foldr is expecting another argument (the initial accumulator which was our scope stack). Using map instead isn't much better:

/Users/warrenharris/projects/racket-hs/microdata5.hs:64:3:
    Couldn't match expected type `StateT
                                    (Itemscope, [Itemscope]) Data.Functor.Identity.Identity ()'
                with actual type `[b0]'
    Expected type: ExtractState ()
      Actual type: [b0]
    In the return type of a call of `map'
    In the expression: map extractItemprops children
Failed, modules loaded: none.

Now we're returning a list instead of a monad. What we need is a monadic version of map instead. Fortunately, there is just such a thing, called mapM:

extractMicrodata :: Node -> ExtractState ()
extractMicrodata elt@(Element _ _ children) =
  do _ <- mapM extractItemprops children
     return ()
extractMicrodata _ = return ()

Since mapM returns a list of results, we're discarding it with the wildcard bind, and returning unit instead. We could also write this using Haskell's lower-level monad operation >> which throws away the result of the left-hand side:

extractMicrodata :: Node -> ExtractState ()
extractMicrodata elt@(Element _ _ children) =
  mapM extractItemprops children >> return ()
extractMicrodata _ = return ()

But this is still unsatisfying because we're constructing a list that we're just going to throw away. What we need is a version of mapM that doesn't build lists at all, but simply iterates over the input list, sequencing the monadic operations accordingly and throwing away whatever they return. Fortunately there is just such an operation, called mapM_ (the underscore designates "returns nothing"). And since what we're doing looks more like iteration than anything, there's a variant of mapM_ called forM_ that reverses its argument order so that it looks more like a foreach construct:

*Main> :t mapM
mapM :: Monad m => (a -> m b) -> [a] -> m [b]
*Main> :t mapM_
mapM_ :: Monad m => (a -> m b) -> [a] -> m ()
*Main> :t forM_
forM_ :: Monad m => [a] -> (a -> m b) -> m ()

Let's use forM_ to simplify once again:

extractMicrodata :: Node -> ExtractState ()
extractMicrodata elt@(Element _ _ children) =
  forM_ children $ (\child ->
    extractItemprops child)
extractMicrodata _ = return ()

Hey, that looks a lot like imperative code! For each child in children, run the operation extractItemprops, passing it the child. (The backslash here is a lambda, and in fact, since a lambda's body extends as far to the right as it can, we don't need the parens around the lambda expression either.) But since child is the last parameter passed to extractItemprops, we can simplify this just a little further (because \x -> f x is just f):

extractMicrodata :: Node -> ExtractState ()
extractMicrodata elt@(Element _ _ children) =
  forM_ children extractItemprops
extractMicrodata _ = return ()

Let's move on to extractNewItemscope. Here was our previous definition:

extractNewItemscope :: Node -> State -> (Itemscope, State)
extractNewItemscope elt@(Element _ attrs _) (scope, orphans)
  | isJust $ lookup "itemscope" attrs =
    let newScope = Itemscope (lookup "itemtype" attrs) []
        (scopeVal, orphans') = extractMicrodata elt (newScope, orphans)
    in (scopeVal, (scope, orphans'))
extractNewItemscope node state =
  let state'@(scopeVal, _) = extractMicrodata node state
  in (scopeVal, state')  -- return current scope as scopeVal

This time we don't have a simple State -> State at the end of the function signature. Instead the final return type is returning a value and a new state, State -> (Itemscope, State). Fortunately, we can describe this in the State monad as ExtractState Itemscope:

extractNewItemscope :: Node -> ExtractState Itemscope
extractNewItemscope elt@(Element _ attrs _)
  | isJust $ lookup "itemscope" attrs = undefined
extractNewItemscope node = undefined

Filling in the details now:

extractNewItemscope :: Node -> ExtractState Itemscope
extractNewItemscope elt@(Element _ attrs _)
  | isJust $ lookup "itemscope" attrs = do
    prevScope <- getScope
    let newScope = Itemscope (lookup "itemtype" attrs) []
    putScope newScope
    extractMicrodata elt
    scopeVal <- getScope
    putScope prevScope
    return scopeVal
extractNewItemscope node = do
  extractMicrodata node
  scopeVal <- getScope
  return scopeVal  -- return current scope as scopeVal

From this you can see that we use the getScope operation to get the current scope. The back-arrow operator (<-) lets us bind a pattern to the result of a monad operation, and here we're binding the variable prevScope. We create a newScope using a let statement (note that a let in a do sequence isn't followed by the in keyword). Next we use putScope to store the new scope, then call extractMicrodata, then use getScope again to get the modified version of newScope (now called scopeVal). Then we use putScope again to put the previous scope back, and finally return scopeVal.

Now don't confuse let foo = ... and do foo <- .... They both more or less do something similar -- assign a variable -- but the latter works with monad operations. If we had used a let with a monad operation instead, we'd be getting our hands on the monad itself, not the return value that's in the monad. Using let with monads can be quite useful when manipulating control flow, or plugging pieces of programs together -- just not what we're interested in doing here. Monads are really powerful because they decouple the work of screwing the plumbing together from running the crap through the pipes (if you catch my drift).
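To make the distinction concrete, here's a throwaway snippet (the name example is made up and isn't part of the extractor):

example :: ExtractState Itemscope
example = do
  let action = getScope  -- 'action' just names the getScope computation; nothing has run yet
  scope <- action        -- the back-arrow actually runs it and binds the resulting Itemscope
  return scope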

Of course the extractNewItemscope code above is kind of ugly. Most of it is just manipulating the previous and current scope. We can wrap this up in a nice little helper function that lets us lexically set the current scope:

withScope :: Itemscope -> ExtractState a -> ExtractState a
withScope scope m = do
  prevScope <- getScope
  putScope scope
  rv <- m
  putScope prevScope
  return rv

Here, withScope takes a scope and another monad operation, m. It sets up the new scope, invokes m, reverts the scope, and returns m's value. (If you were worried about m raising exceptions, you'd want to put an exception handler in here to revert to the previous scope value, but we don't raise any exceptions in our code.)

We can now simplify extractNewItemscope:

extractNewItemscope :: Node -> ExtractState Itemscope
extractNewItemscope elt@(Element _ attrs _)
  | isJust $ lookup "itemscope" attrs =
    withScope (Itemscope (lookup "itemtype" attrs) []) $ do
      extractMicrodata elt
      scopeVal <- getScope
      return scopeVal
extractNewItemscope node = do
  extractMicrodata node
  scopeVal <- getScope  -- return current scope as scopeVal
  return scopeVal

or using the >> operation to further simplify:

extractNewItemscope :: Node -> ExtractState Itemscope
extractNewItemscope elt@(Element _ attrs _)
  | isJust $ lookup "itemscope" attrs =
    withScope (Itemscope (lookup "itemtype" attrs) []) $
      extractMicrodata elt >> getScope
extractNewItemscope node =
  extractMicrodata node >> getScope

Moving on to the extractItemprops function:

extractItemprops :: Node -> ExtractState ()
extractItemprops elt@(Element _ attrs _) = do
  scopeVal <- extractNewItemscope elt
  case lookup "itemprop" attrs of
    Just name -> do
      scope <- getScope
      putScope $ addProp name (extractHtml5Value elt scopeVal) scope
    Nothing -> saveOrphan scopeVal
extractItemprops _ = return ()

There shouldn't be any surprises at this point. We're just using getScope to get the current scope, modifying it with addProp, and using putScope to store the new value back. This could be simpler too if we had a version of modify that only operated on the scope part of the state:

modifyScope :: (Itemscope -> Itemscope) -> ExtractState ()
modifyScope f = modify $ \(scope, orphans) -> (f scope, orphans)

extractItemprops :: Node -> ExtractState ()
extractItemprops elt@(Element _ attrs _) = do
  scopeVal <- extractNewItemscope elt
  case lookup "itemprop" attrs of
    Just name ->
      modifyScope $ addProp name (extractHtml5Value elt scopeVal)
    Nothing -> saveOrphan scopeVal
extractItemprops _ = return ()

Monads running free

So I've shown you how to rewrite some simple state propagating code to look cleaner and neater using the State monad, but how do you set it in motion -- supplying the initial state so that the monad can actually run? Well, monads like State come with a run function that does this, and for State it's called (appropriately) runState. We can use this to rewrite extractMicrodataFromDoc:

extractMicrodataFromDoc :: Document -> [Itemscope]
extractMicrodataFromDoc doc =
  let f = mapM_ extractItemprops $ docContent doc
      initialState = (Itemscope (Just "top") [], [])
      ((), (scope, orphans)) = runState f initialState
  in scope:orphans

What runState does is take an initial state (here, a top-level Itemscope paired with an empty orphan list) and return a pair of the final answer (unit here, ()) and the final state, (scope, orphans). Since we want extractMicrodataFromDoc to return the collected Itemscopes as its answer, we just discard the unit result and return scope:orphans instead.
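For reference, runState's type is:

runState :: State s a -> s -> (a, s)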

Note that extractMicrodataFromDoc is not itself a monadic operation (since its return type isn't ExtractState). That means we can call it just like before in our test case:

test1 = "..."  -- a multi-line HTML test string; the markup was lost when the post was
               -- published, and only the text "Foobar" and "quux" survives

result1 = fmap extractMicrodataFromDoc $ parseHTML "" test1

We're done

Here's the complete program, all in one place:

{-# LANGUAGE OverloadedStrings #-}

import Control.Monad (forM_)
import Control.Monad.State
import Data.Foldable (foldMap)
import Data.List (find)
import Data.Maybe (isJust)
import Data.Text (Text, concat)
import Prelude hiding (concat)
import Text.XmlHtml

data Itemprop = Itemprop {
  itempropName :: Text,
  itempropValue :: MicrodataValue
  } deriving (Show, Eq)

data Itemscope = Itemscope {
  itemscopeType :: Maybe Text,
  itemscopeProperties :: [Itemprop]
  } deriving (Show, Eq)

addProp :: Text -> MicrodataValue -> Itemscope -> Itemscope
addProp name value (Itemscope t props) =
  Itemscope t (Itemprop name value : props)

hasItemscope :: Node -> Bool
hasItemscope elt@(Element _ _ _) =
  isJust $ lookup "itemscope" $ elementAttrs elt
hasItemscope _ = False

data MicrodataValue = ItemscopeValue Itemscope
                    | TextValue Text
                    deriving (Show, Eq)

extractHtml5Value :: Node -> Itemscope -> MicrodataValue
extractHtml5Value elt scope | hasItemscope elt = ItemscopeValue scope
extractHtml5Value elt scope =
  TextValue $ foldMap getText $ elementChildren elt
  where getText elt@(TextNode text) = text
        getText _ = ""

type ExtractState a = State (Itemscope, [Itemscope]) a

getScope :: ExtractState Itemscope
getScope = do (scope, _) <- get; return scope

putScope :: Itemscope -> ExtractState ()
putScope scope = modify $ \(_, orphans) -> (scope, orphans)

saveOrphan :: Itemscope -> ExtractState ()
saveOrphan r = modify $ \(scope, orphans) -> (scope, r:orphans)

withScope :: Itemscope -> ExtractState a -> ExtractState a
withScope scope m = do
  prevScope <- getScope
  putScope scope
  rv <- m
  putScope prevScope
  return rv

modifyScope :: (Itemscope -> Itemscope) -> ExtractState ()
modifyScope f = modify $ \(scope, orphans) -> (f scope, orphans)

extractMicrodata :: Node -> ExtractState ()
extractMicrodata elt@(Element _ _ children) =
  forM_ children extractItemprops
extractMicrodata _ = return ()

extractNewItemscope :: Node -> ExtractState Itemscope
extractNewItemscope elt@(Element _ attrs _)
  | isJust $ lookup "itemscope" attrs =
    withScope (Itemscope (lookup "itemtype" attrs) []) $
      extractMicrodata elt >> getScope
extractNewItemscope node =
  extractMicrodata node >> getScope

extractItemprops :: Node -> ExtractState ()
extractItemprops elt@(Element _ attrs _) = do
  scopeVal <- extractNewItemscope elt
  case lookup "itemprop" attrs of
    Just name ->
      modifyScope $ addProp name (extractHtml5Value elt scopeVal)
    Nothing -> saveOrphan scopeVal
extractItemprops _ = return ()

extractMicrodataFromDoc :: Document -> [Itemscope]
extractMicrodataFromDoc doc =
  let f = mapM_ extractItemprops $ docContent doc
      initialState = (Itemscope (Just "top") [], [])
      ((), (scope, orphans)) = runState f initialState
  in scope:orphans

-- Debugging stuff:

html str = case parseHTML "" str of
  Left msg -> error msg
  Right rslt -> head $ docContent rslt

topscope = Itemscope (Just "top") []

test1 = "..."  -- a multi-line HTML test string; the markup was lost when the post was
               -- published, and only the text "quux" and "Foobar" survives

result1 = fmap extractMicrodataFromDoc $ parseHTML "" test1

I hope you got something out of this long-winded tour. Programming in Haskell can be pretty nice once you get over a few hurdles.