March 25, 2012

A comic look at Haskell (pt 2)

In the last part, we defined some Haskell foreign functions for the libarchive library. Now we'll get down to the actual business of translating a few Racket routines into Haskell, redesigning them a bit, and ultimately showing how lazy IO can be a very nice way to interact with this library.

Let's begin by translating the archive-read-current-file function that we saw previously:

(define (archive-read-current-file archive entry-size)
  (let ([buffer (malloc 'atomic entry-size)])
    (memset buffer 0 entry-size)
    (archive_read_data archive buffer entry-size)
    (make-sized-byte-string buffer entry-size)))

Here, the low-level archive_read_data native method wants to be passed a buffer to write to. In the last section, we handled this by making archiveReadData let it write into a ByteString value (which, as you saw, can be mutated if you call the right "unsafe" operations, e.g. unsafeUseAsCStringLen). So now our problem is simply to provide an appropriately sized ByteString for the Haskell version of this routine to write to. We can do this with Data.ByteString.replicate (which acts rather like malloc followed by memset). Then, based on the actual number of bytes read, we can trim the ByteString to the exact size with take:

archiveReadCurrentFile :: Archive -> Int -> IO ByteString
archiveReadCurrentFile ar sz = do
  let bs = replicate sz 0
  cnt <- archiveReadData ar bs
  return $ take cnt bs

This probably isn't the most efficient way to do this. A better API might use Data.ByteString.Internal.createAndTrim, which skips the up-front zero-fill and trims the result down to just the bytes actually read (take, by contrast, is an O(1) slice that keeps the whole oversized buffer alive). But this would require redesigning archiveReadData to take a Ptr Word8 to write to instead, or simply calling archive_read_data directly. That's something we can revisit later.
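Just to sketch the idea (this isn't part of the library as it stands), here's roughly what calling archive_read_data directly inside createAndTrim could look like. It assumes the raw binding takes the unwrapped Ptr ARCHIVE, a buffer pointer, and a byte count, and returns the number of bytes read -- which is how it's used later in this post -- and that unArchive unwraps an Archive to its ForeignPtr, as it does below:

import Data.ByteString (ByteString)
import Data.ByteString.Internal (createAndTrim)
import Foreign.ForeignPtr (withForeignPtr)
import Foreign.Ptr (castPtr)

-- Sketch only: createAndTrim lets archive_read_data fill the ByteString's own
-- buffer, so no zero-fill or oversized buffer is needed. (A real version
-- would also check rv for negative error codes.)
archiveReadCurrentFile' :: Archive -> Int -> IO ByteString
archiveReadCurrentFile' ar sz =
  withForeignPtr (unArchive ar) $ \a ->
    createAndTrim sz $ \buf -> do
      rv <- archive_read_data a (castPtr buf) (fromIntegral sz)
      return (fromIntegral rv)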

Next, let's translate the archive-read-next-file-entry function:

(define (archive-read-next-file-entry archive)
  (let-values ([(entry ret) (archive_read_next_header archive)])
    (if (= ret ARCHIVE_OK)
        (archive-entry (archive_entry_pathname entry)
                       (archive_entry_size entry)
                       (rewrite-file-mode (archive_entry_filetype entry)))
        (if (= ret ARCHIVE_EOF)
            #f
            (raise (list "error" ret))))))

This is pretty straightforward. It simply calls the archive_read_next_header function, checks the result, and builds a Racket archive-entry structure (note the dash instead of underscore in the name) from 3 parts of the C archive_entry struct that's returned. Now, I don't know why it's actually necessary to build another structure containing the archive_entry components (perhaps because Alex wanted to pattern match against this structure?), but in Haskell, I'll just rely on the native methods we designed earlier to access the components of the archive_entry:

archiveReadNextFileEntry :: Archive -> IO ArchiveEntry
archiveReadNextFileEntry ar = do
  rv <- archiveReadNextHeader ar
  case rv of
    Left err -> error (show err)
    Right entry -> return entry

Earlier we had made archiveReadNextHeader return a result of type Either Result ArchiveEntry. And if you recall, an Either value has Left and Right constructors that are used to distinguish either of 2 cases. Usually the Left case is used to specify an error result, and the Right case is for the common, non-error result. We could have returned a pair, (Result, Maybe ArchiveEntry), rather like the Racket code has done, but Either is a slightly cleaner way to do the same thing.

Now rather than returning our Either result, we want this routine to throw an exception for the error case, and simply return the ArchiveEntry for the common case. The case statement accomplishes this. Also, if you recall from earlier, we would like to be able to catch these exceptions, so we're going to swap out the error function with ioError and userError. And one more thing: The whole routine can be simplified by using the either function to say what to do in either of the 2 cases:

archiveReadNextFileEntry :: Archive -> IO ArchiveEntry
archiveReadNextFileEntry ar =
  archiveReadNextHeader ar >>= either (ioError . userError . show) return

This says: call the archiveReadNextHeader function (that's in the IO monad), and pass its result to the either function (rather like a unix pipe). Here either is given 2 functions: the first says what to do with a Left value -- call the composed function (ioError . userError . show) on it (remember, this is equivalent to (\left -> ioError (userError (show left))) -- yes, the dots compose right-to-left, backwards from most languages we're familiar with). The second says what to do with a Right value: simply return it.
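If the point-free pipeline looks cryptic, here are the same error handlers spelled out both ways (Result already has a Show instance, since we call show on it above); they behave identically:

-- The point-free version just leaves the lambda implicit.
handleErr1, handleErr2 :: Result -> IO ArchiveEntry
handleErr1 = ioError . userError . show
handleErr2 = \err -> ioError (userError (show err))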

Folding archives

Now we get to the fun part. Alex thought that a good API for visiting the elements of an archive file would be to fold over them. A fold is like a map that passes a piece of data along for the ride -- an accumulator value that it can change or replace as each file is visited. (When the accumulator is a list that you cons each result onto, you get essentially a map, albeit with the results reversed.) Here's the Racket fold routine:

(define (archive-folder archive proc init-val)
  (match (archive-read-next-file-entry archive)
    [(archive-entry path size filetype)
     (archive-folder
      archive proc
      (apply proc
             (list (archive-entry path size filetype)
                   (lambda () (archive-read-current-file archive size))
                   init-val)))]
    [#f init-val]))

This calls archive-read-next-file-entry, and when it returns an archive-entry, the user's proc is called to handle its data. The proc is passed the archive-entry (actually a copy of it -- tsk, tsk), a function that can be called to read the current file (a thunk -- a nullary function that's used when we want to delay evaluation -- rather like Haskell's laziness, don't you think?), and the accumulator value, init-val. The result of the user's proc becomes the new accumulator that's passed to a recursive call to archive-folder, which goes on to read the next entry.
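Before translating it, here's the fold-as-map idea in miniature Haskell (a toy illustration only, folding over a plain list of sizes rather than an archive):

-- Summing sizes with a left fold, versus collecting them by consing each one
-- onto a list accumulator (which is why the results come out reversed).
totalSize :: [Int] -> Int
totalSize = foldl (+) 0

collectSizes :: [Int] -> [Int]
collectSizes = foldl (\acc sz -> sz : acc) []

We'll run into that reversal again when we try out the archive fold below.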

In Haskell, we can write the Racket fold almost identically:

archiveFolder :: Archive -> (ArchiveEntry -> IO ByteString -> a -> IO a) ->
                 a -> IO a
archiveFolder ar f acc =
  catch
    (do entry <- archiveReadNextFileEntry ar
        acc' <- f entry (readCurrentEntry entry) acc
        archiveFolder ar f acc')
    (\_ -> return acc)
  where readCurrentEntry entry = do
          sz <- archiveEntrySize entry
          bs <- archiveReadCurrentFile ar sz
          return bs

There are a few subtle differences here. Instead of testing the result of archiveReadNextFileEntry, we're going to rely on catching an exception to know when to terminate the fold. And instead of using a thunk (which is rarely needed in a lazy language like Haskell), we're just going to pass an IO operation which is the result of calling the local function readCurrentEntry.
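One nice consequence: because readCurrentEntry entry is just an unperformed IO action, the user's function is free to ignore it, and that entry's data simply never gets read by our code. For instance, a throwaway helper that only counts entries:

-- Counts entries without ever reading their data: the ByteString action is
-- simply dropped on the floor.
countEntries :: ArchiveEntry -> IO ByteString -> Int -> IO Int
countEntries _entry _reader n = return (n + 1)

Calling archiveFolder ar countEntries 0 would then just count the entries, never pulling any of their contents into Haskell.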

Let's define a few additional routines in a separate file so that we can test this out:

module Main where

import Codec.Archive.Libarchive
import Data.ByteString hiding (map)
import Data.Int
import qualified Data.ByteString.Lazy as L
import Prelude hiding (read, take)

bufSize :: Int
bufSize = 10240

archiveReadOpen :: String -> IO Archive
archiveReadOpen path = do
  ar <- archiveReadNew
  expectOK $ archiveReadSupportCompressionAll ar
  expectOK $ archiveReadSupportFormatAll ar
  expectOK $ archiveReadOpenFile ar path bufSize
  return ar

archiveFoldFile :: String -> (ArchiveEntry -> IO ByteString -> a -> IO a) ->
                   a -> IO a
archiveFoldFile path f acc = do
  ar <- archiveReadOpen path
  archiveFolder ar f acc

readEntry :: ArchiveEntry -> IO ByteString -> [(Int, ByteString)] ->
             IO [(Int, ByteString)]
readEntry entry reader acc = do
  t <- archiveEntryFiletype entry
  if t == ae_ifreg
    then do sz <- archiveEntrySize entry
            bs <- reader
            let sub = take (min sz 10) bs -- just keep some of the data for test purposes
            return $ (sz, sub):acc
    else return acc

path = "/Users/warrenharris/projects/blog/comic/v-vendetta-0.cbz"

test1 = archiveFoldFile path readEntry []

Let's call it:

*Main> test1
[(2535,"-=Book=-\r\r"),(956,"V For Vend"),(590199,"\255\216\255\224\NUL\DLEJFIF"),(374813,"\255\216\255\224\NUL\DLEJFIF"),(508622,"\255\216\255\224\NUL\DLEJFIF"),(518735,"\255\216\255\224\NUL\DLEJFIF"),(527499,"\255\216\255\224\NUL\DLEJFIF"),(451816,"\255\216\255\224\NUL\DLEJFIF"),(541239,"\255\216\255\224\NUL\DLEJFIF"),(481156,"\255\216\255\224\NUL\DLEJFIF"),(516998,"\255\216\255\224\NUL\DLEJFIF"),(510813,"\255\216\255\224\NUL\DLEJFIF"),(478150,"\255\216\255\224\NUL\DLEJFIF"),(437432,"\255\216\255\224\NUL\DLEJFIF"),(310424,"\255\216\255\224\NUL\DLEJFIF"),(265642,"\255\216\255\224\NUL\DLEJFIF"),(301822,"\255\216\255\224\NUL\DLEJFIF"),(676613,"\255\216\255\224\NUL\DLEJFIF"),(53109,"\255\216\255\224\NUL\DLEJFIF"),(452430,"\255\216\255\224\NUL\DLEJFIF"),(305523,"\255\216\255\224\NUL\DLEJFIF"),(319714,"\255\216\255\224\NUL\DLEJFIF"),(298048,"\255\216\255\224\NUL\DLEJFIF")]

One thing you'll notice here is that the data got reversed. The last entry (beginning with "-=Book=-\r\r") got put at the front of the list. This is because the fold routine that we wrote is essentially a left fold, which means it starts from the left (the beginning of the archive contents) and pushes its results onto a list as it goes. So the list has the first element on the right. No big deal. We can simply reverse it if we want.
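For instance (using Prelude's reverse, which has to be qualified here because Data.ByteString also exports a reverse):

test1r = fmap Prelude.reverse test1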

Folding archives -- take 2

Now the first thing the astute reader will notice about archiveFolder is that it really isn't necessary to provide readCurrentEntry at all. This is something the user's function can implement as long as it has the Archive and ArchiveEntry values in hand. We might rewrite this fold routine as follows:

archiveFolder2 :: Archive -> (Archive -> ArchiveEntry -> a -> IO a) -> a -> IO a
archiveFolder2 ar f acc =
  catch
    (do entry <- archiveReadNextFileEntry ar
        acc' <- f ar entry acc
        archiveFolder2 ar f acc')
    (\_ -> return acc)

Now let's provide a few additional routines so that we can test this out:

archiveFoldFile2 :: String -> (Archive -> ArchiveEntry -> a -> IO a) -> a -> IO a
archiveFoldFile2 path f acc = do
  ar <- archiveReadOpen path
  archiveFolder2 ar f acc

readEntry2 :: Archive -> ArchiveEntry -> [(Int, ByteString)] ->
              IO [(Int, ByteString)]
readEntry2 ar entry acc = do
  t <- archiveEntryFiletype entry
  if t == ae_ifreg
    then do sz <- archiveEntrySize entry
            bs <- archiveReadCurrentFile ar sz
            let sub = take (min sz 10) bs
            return $ (sz, sub):acc
    else return acc

test2 = archiveFoldFile2 path readEntry2 []

We get exactly the same result if we call this one. All we've done is move a bit of the work into the readEntry2 function.

Step away from the fold

Now the problem with using a fold operation to traverse an archive file is that it doesn't give you a way to control which parts of the archive get visited. Granted, the user's proc can decide not to slurp in the contents of an archive_entry, but it can't decide that it doesn't want to continue visiting entries. The fold operation is going to keep marching to the end of the archive, whether you've got enough data or not.

Now it might be possible to use exceptions, continuations, or delimited control to provide an escape hatch for the fold, but a nicer abstraction would be one that didn't have to resort to escape hatches at all -- one that just builds the data structure you want to consume, as you decide you need it. Just like a file read operation that streams in chunks of the file each successive time read is called, wouldn't it be nice if we could provide a structured stream of data that only creates data values from the raw archive as necessary? That's exactly what lazy IO can do for you.

If you remember what I said earlier, the IO monad isn't lazy at all. It's all about sequencing "real world" operations in an order that's predictable and timely. So before we get into the lazy part of lazy IO, let's just design a plain old IO operation that reads the entire archive.
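To make that distinction concrete, here's a toy example (nothing to do with archives): a pure let binding is lazy and may never be evaluated at all, while the IO actions run exactly where they appear in the sequence:

demo :: IO ()
demo = do
  let expensive = sum [1 .. 10 ^ 9 :: Integer]  -- never computed: nothing uses it
  line <- getLine                               -- performed right here, in order
  putStrLn line                                 -- and this one strictly after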

To read an archive, we need to alternately read headers and archive data -- so there are essentially 2 states to the reading process. And as we read, we need to accumulate this data somewhere. Perhaps the most natural place to accumulate it is a list of ByteStrings -- one for each archive_entry.

However, instead of regular (strict) ByteStrings, we're going to use lazy ByteStrings. There are 2 reasons for this: First, a lazy ByteString provides the same interface abstraction as a regular ByteString, but underneath it uses a sequence of buffer segments that allow data to be brought in as needed. Second, it has laziness built in from the get-go. Although we're not going to use the lazy aspect of L.ByteString just yet, we'll need it eventually -- so just hold on.

Let's start from the bottom up, by first providing a function to return a lazy ByteString for each archive_entry. It's easy to make a L.ByteString from a list of strict ones simply by calling L.fromChunks:

archiveReadEntry :: Ptr ARCHIVE -> IO L.ByteString
archiveReadEntry a = do
  chunks <- archiveReadEntryChunks a
  return $ L.fromChunks chunks

archiveReadEntryChunks :: Ptr ARCHIVE -> IO [ByteString]
archiveReadEntryChunks a = do
  bs <- archiveReadEntryChunk a
  if length bs == 0
    then return []
    else do rest <- archiveReadEntryChunks a
            return (bs:rest)

archiveReadEntryChunk :: Ptr ARCHIVE -> IO ByteString
archiveReadEntryChunk a = do
  buf <- mallocArray bufSize
  rv <- archive_read_data a buf (fromIntegral bufSize)
  unsafePackCStringFinalizer (castPtr buf) (fromIntegral rv) (free buf)

Here I've cheated a bit and made all these functions take a Ptr ARCHIVE instead of an Archive. That's so that I can unwrap the Archive once at the beginning, and also because I don't plan to export these functions from the module. Note that we're mallocing fixed-size chunks and calling archive_read_data to fill them. Then we're using unsafePackCStringFinalizer to associate each chunk with its free finalizer. That's a little more efficient than what we had before.

Next we can define a function that alternately reads headers and entry contents, and does this recursively until the end of the archive is reached:

archiveRead :: Archive -> IO [L.ByteString]
archiveRead ar = do
  withForeignPtr (unArchive ar) $ \a -> do
    rv <- archiveReadNextHeader ar
    case rv of
      Right entry -> do
        t <- archiveEntryFiletype entry
        if t == ae_ifreg
          then do bs <- archiveReadEntry a
                  rest <- archiveRead ar
                  return (bs:rest)
          else archiveRead ar
      Left _ -> return []

This says that to read an archive, we first read a header. If there is no error, check the archive entry's file type, and if it's a regular file, read that entry's contents, followed by a recursive call to read the rest of the archive. If it's not a regular file, we'll skip that entry and continue on to the next one. We'll continue doing this until archiveReadNextHeader returns an error.

Now we're almost done. The only thing that remains is this little detail I mentioned earlier -- the fact that the IO monad is strict, and as written this will pull the entire contents of the archive into memory, building the list of (fully read in) L.ByteStrings as it goes. What we need to do is introduce a little delay in a couple of critical places, so that some of these IO operations aren't performed until they're needed. This is a job for unsafeInterleaveIO.

There are 2 places where unsafeInterleaveIO is needed. First, we need to delay reading in the tail of the list of L.ByteStrings until more entries are needed. Second, we need to delay reading in the entire contents of any individual entry, so that its L.ByteString isn't fully materialized until the data is actually consumed. The first one of these goes in archiveRead itself, at the point where the recursive call is about to be made:

archiveRead :: Archive -> IO [L.ByteString]
archiveRead ar = do
  withForeignPtr (unArchive ar) $ \a -> do
    rv <- archiveReadNextHeader ar
    case rv of
      Right entry -> do
        t <- archiveEntryFiletype entry
        if t == ae_ifreg
          then do bs <- archiveReadEntry a
                  rest <- unsafeInterleaveIO $ archiveRead ar
                  return (bs:rest)
          else archiveRead ar
      Left _ -> return []

The second one goes in archiveReadEntryChunks, again at the point where it is about to be called recursively:

archiveReadEntryChunks :: Ptr ARCHIVE -> IO [ByteString]
archiveReadEntryChunks a = do
  bs <- archiveReadEntryChunk a
  if length bs == 0
    then return []
    else do rest <- unsafeInterleaveIO $ archiveReadEntryChunks a
            return (bs:rest)

Now the cool thing about lazy IO is that nothing gets read in until it is needed, but due to the sequencing savviness of monads, everything happens in exactly the same order as if it had been read in eagerly. So we get the best of both worlds -- an elegant abstraction (a list of ByteStrings), and the benefits of streaming.
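As a quick usage sketch (assuming archiveRead, the archiveReadOpen helper from the test module, and the qualified L import are all in scope), computing the size of just the first entry only ever reads that entry's chunks -- the delayed tail of the entry list is never demanded, so the rest of the archive is never touched:

import Data.Int (Int64)

-- Forcing L.length pulls in the first entry's chunks; nothing else is read.
firstEntrySize :: String -> IO Int64
firstEntrySize p = do
  ar <- archiveReadOpen p
  entries <- archiveRead ar
  case entries of
    []      -> return 0
    (e : _) -> return $! L.length e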

Conclusion

At this point I'm sure you're disappointed to find that this article has had very little to do with comics, and in general hasn't been very funny at all. But stay tuned for part 3, where I'll attempt to put this library to use and actually build that web app for your comic viewing pleasure.
