Composewell builds Streamly and other open-source Haskell libraries. We also offer consulting and training in Haskell and high-performance functional programming. Learn more at composewell.com.
In Part 1, we introduced stream fusion, why it matters for performance in Haskell, and how it enables Streamly to compete with C in speed.
In this post, we’ll go deeper: we’ll look at how Streamly’s
Stream type works, how GHC’s optimizer eliminates overhead, and why
careful design plus inlining are the keys to reliable fusion. In Part
3, we’ll cover tricky fusion cases and how we
solved them using a GHC plugin.
Streamly builds on two powerful optimizations already present in GHC: the case-of-case transformation and constructor specialization (SpecConstr). These are the workhorses of stream fusion. Let’s see how they apply in Streamly.
The Stream type in Streamly is adapted from the vector package:
data Step s a = Yield a s | Skip s | Stop

data Stream m a = forall s. Stream (s -> m (Step s a)) s

Here’s what these types mean:
- Stream wraps a state machine, consisting of a step function and an initial state of type s.
- The step function (s -> m (Step s a)) is called repeatedly with the current state to produce the next stream element and state, until it returns Stop.
- Step is the outcome of a call to step:
  - Yield a s – produce a value and advance to a new state
  - Skip s – advance to a new state without producing a value
  - Stop – terminate the stream

A Step is equivalent to one iteration in an imperative loop. The
key difference: in an imperative loop the loop logic and state are
monolithic, while in functional streams a loop is built with modular
combinators where each combinator has its own independent state and
logic. These modular pieces are chained together, and under the hood
the compiler fuses them into one efficient monolithic imperative loop.
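To make the state machine concrete, here is a minimal sketch (not Streamly’s actual implementation) of a generator written directly against the Stream and Step types above. It counts from lo to hi, using the current Int as the state s:

enumFromToInt :: Monad m => Int -> Int -> Stream m Int
enumFromToInt lo hi = Stream step lo
  where
    step i
        | i > hi    = return Stop                -- state exhausted, end the loop
        | otherwise = return (Yield i (i + 1))   -- emit i, move to the next state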
A stream typically flows through three stages:
- Generation (e.g. enumerateFromTo) creates a sequence of values i.e. a Stream.
- Transformation (e.g. mapM) consumes the input stream, applies its logic, and produces another Stream.
- Elimination (e.g. fold) drives the stream to completion and returns a result.

Each transformation combinator works like this:

- It wraps the upstream step function and the upstream state of type s.
- Its composite state is the upstream state s plus the combinator’s own state.
- Each call to the composed step function corresponds to one iteration of a for or while loop.

Without fusion, chaining combinators means repeatedly allocating and
passing boxed Step constructors wrapping the state and the result
across stages of the stream pipeline. This adds a per-element, per-stage
overhead that quickly destroys performance.
Stream fusion eliminates this overhead. The compiler stitches together the step functions of all combinators into a single fused loop, where states are passed as unboxed values. This removes intermediate heap allocations and yields performance on par with hand-written imperative code like in C.
mapM Fused into a Loop

Consider Streamly’s definition of mapM:
mapM :: Monad m => (a -> m b) -> Stream m a -> Stream m b
mapM f (Stream step state) = Stream step1 state
  where
    step1 st = do
        r <- step st
        case r of
            Yield x s -> f x >>= \a -> return $ Yield a s
            Skip s -> return $ Skip s
            Stop -> return Stop

Now use it in a simple program:
main :: IO ()
main = do
    Stream.fold Fold.drain
        $ Stream.mapM return
        $ Stream.enumerateFromTo (1 :: Int) 1000

At the source level, this is just a modular pipeline. But after GHC’s optimizer runs, something remarkable happens.
Compile with:
$ ghc -O2 -ddump-simpl -ddump-to-file -dsuppress-all mapM.hs

The resulting Core shows a tight loop with no intermediate constructors, boxed or unboxed:
main_$s$wgo
  = \ ww_s2Xl eta_s2Xn ->
      case <=# ww_s2Xl 1000# of {
        __DEFAULT -> (# eta_s2Xn, () #);
        1# -> main_$s$wgo (+# ww_s2Xl 1#) eta_s2Xn
      }

Key takeaway: All the Yield, Skip, and Stop constructors
are gone. The pipeline collapses into a single recursive loop — as
efficient as hand-written C.
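For comparison, here is a sketch of the hand-written Haskell loop that the fused pipeline is morally equivalent to; the Core above is essentially this function after unboxing:

loop :: Int -> IO ()
loop i
    | i <= 1000 = loop (i + 1)   -- corresponds to the Yield branch of the fused loop
    | otherwise = return ()      -- corresponds to Stop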
GHC relies on two key optimizations to enable stream fusion:
- Case-of-case transformation – removes intermediate constructors by inlining and simplifying nested case expressions. In our example, the Step constructors produced by enumerateFromTo and immediately consumed by mapM are eliminated.
- SpecConstr (constructor specialization) – specializes recursive functions by generating a dedicated version of the recursive loop for each constructor case. In our example, since fold is a recursive loop, the SpecConstr optimization is applied to it and the value ww_s2Xl is carried as an unboxed integer in the loop.

Together, these transformations turn a chain of high-level stream combinators into a tight, allocation-free recursive loop, indistinguishable from hand-written imperative code.
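As a standalone illustration of the case-of-case transformation (a sketch, not Streamly code), consider a small producer/consumer pair built around Maybe:

{-# INLINE step #-}
step :: Int -> Maybe Int
step n = if n > 100 then Nothing else Just (n + 1)

consume :: Int -> Int
consume n =
    case step n of        -- after inlining: case (case n > 100 of ...) of ...
        Nothing -> 0
        Just x  -> x

-- After case-of-case and case-of-known-constructor, GHC effectively produces:
--
--   consume n = case n > 100 of
--       True  -> 0
--       False -> n + 1
--
-- The intermediate Just/Nothing values are never allocated, just as the
-- Yield/Skip/Stop constructors disappear from the fused stream loop.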
Fusion only happens if GHC can see through the definitions of the functions involved. In practice, this means:
- Every function in the pipeline that produces or consumes a Stream, Fold, or Parser must be inlined.
- The final IO action — the fully composed loop including an elimination function like fold — does not need to be inlined (and usually shouldn’t be).

For example:
f :: IO ()
f =
    Stream.fold Fold.drain
        $ Stream.mapM return
        $ Stream.enumerateFromTo (1 :: Int) 1000

Here:
- the generation stage (enumerateFromTo),
- the transformation stage (mapM), and
- the elimination stage (drain)

are all fragments of the loop. These must be inlined so that GHC can stitch them together into a single fused loop.
The IO action f itself, however, already represents the fully
composed loop. Inlining it would be counterproductive, since no further
fusion can occur beyond that point.
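For instance, a user-defined helper that returns a Stream is only a fragment of the eventual loop, so it should carry an INLINE pragma. The helper name and imports below are illustrative assumptions, not part of Streamly’s API:

import qualified Streamly.Data.Fold as Fold
import qualified Streamly.Data.Stream as Stream
import Streamly.Data.Stream (Stream)

-- A fragment of the loop: must be inlined so it can fuse with its
-- producer and consumer.
{-# INLINE doubleAll #-}
doubleAll :: Monad m => Stream m Int -> Stream m Int
doubleAll = Stream.mapM (return . (* 2))

-- The fully composed loop: no INLINE needed (and usually not wanted).
g :: IO ()
g = Stream.fold Fold.drain $ doubleAll $ Stream.enumerateFromTo (1 :: Int) 1000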
Streamly is built from the ground up with fusion at its core.
The key types designed for fusion are:
Stream, Unfold, Fold, Scanl, Parser

These types can be freely composed with the provided combinators, and in practice a chain of combinators fuses end-to-end. The result is expressive code that compiles down to efficient loops, without sacrificing modularity or performance.
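A small sketch of such a pipeline, composing a Stream with a Fold (module names follow streamly-core and are assumptions about your imports):

import qualified Streamly.Data.Fold as Fold
import qualified Streamly.Data.Stream as Stream

sumOfSquares :: Monad m => Int -> m Int
sumOfSquares n =
      Stream.fold Fold.sum            -- Fold: eliminates the stream
    $ fmap (\x -> x * x)              -- Stream: transformation
    $ Stream.enumerateFromTo 1 n      -- Stream: generation

The whole chain fuses into a single loop, just like the mapM example earlier.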
Fusion is a compile-time optimization: the compiler stitches combinators into a single loop by inlining their definitions. This makes fusion inherently static — it cannot handle cases where functions cannot be inlined, such as recursive calls or streams passed to or returned from functions at runtime. In these situations, the compiler has no visibility into the code that needs to be fused.
- Building streams recursively or dynamically with cons and append does not scale well.
- While cons, append and other such stream combinators generate very efficient code in non-recursive pipelines (e.g. adding a few elements or a few streams), they become impractical when used to compose a large number of elements or streams dynamically.
- In the fused cons implementation, state-based branching to retrieve stream elements does not scale well, because there are too many decisions to be made for each element.

As an example, a recursive prime sieve implemented with fused cons
would be inefficient or even unusable. The same algorithm, however,
works well with CPS-based cons (StreamK).
primes :: StreamK Identity Int
primes =
    StreamK.cons 3
        $ StreamK.fromStream
        $ Stream.filter filterPrime (Stream.enumerateFromThen 5 7)
  where
    filterPrime n =
        all (\p -> n `mod` p /= 0)
            $ Stream.takeWhile (\p -> p * p <= n)
            $ StreamK.toStream primes

To address these cases, Streamly provides CPS streams (StreamK):
StreamK is especially useful in scenarios like parsers, where a stream
must be reconstructed at runtime during backtracking. While fused
streams excel when the structure is known statically, StreamK handles
dynamic reconstruction in a scalable manner. The trade-off is a small overhead;
in practice this cost is negligible when working with chunked
streams (e.g. arrays of data).
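As a sketch of this dynamic use (the helper name is hypothetical; module and function names follow streamly-core and should be treated as assumptions), here is a combinator that appends a list of streams whose length is known only at runtime, something the fused cons/append would not handle well:

import Streamly.Data.Stream (Stream)
import qualified Streamly.Data.StreamK as StreamK

-- Convert each fused stream to a CPS stream, append them all, and convert
-- back. The CPS representation keeps append cheap per stream instead of
-- growing one ever larger fused state machine.
concatAll :: Monad m => [Stream m a] -> Stream m a
concatAll =
      StreamK.toStream
    . foldr (\s acc -> StreamK.fromStream s `StreamK.append` acc) StreamK.nil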
Together, fused streams and CPS streams give Streamly the best of both worlds: fused streams (Stream) for raw performance, and CPS streams (StreamK) for scalable, dynamic composition.
This dual approach ensures that Streamly delivers both performance and scalability, covering all use cases without leaving corner cases unsolved.
Fusion is not free — for the compiler. Larger fused loops mean more work for the simplifier: longer compile times and larger generated code.
For very large pipelines, it’s sometimes better to break them into
smaller parts and mark some with NOINLINE. This prevents over-fusion
and reduces compile-time costs, but it can significantly hurt performance,
especially if the loop is broken at the wrong place.
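A sketch of this splitting (the stage names are illustrative): each stage fuses internally, while the NOINLINE boundary stops fusion between them and keeps each generated loop small:

import qualified Streamly.Data.Fold as Fold
import qualified Streamly.Data.Stream as Stream
import Streamly.Data.Stream (Stream)

{-# NOINLINE stage1 #-}
stage1 :: Monad m => Stream m Int -> Stream m Int
stage1 = fmap (* 2) . Stream.filter even   -- fuses internally

{-# NOINLINE stage2 #-}
stage2 :: Monad m => Stream m Int -> m Int
stage2 = Stream.fold Fold.sum              -- fuses internally

-- The two stages exchange boxed Step values only at the boundary, which
-- caps how large any single fused loop can grow.
process :: Monad m => m Int
process = stage2 $ stage1 $ Stream.enumerateFromTo 1 1000000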
The reliability of Streamly’s stream fusion comes from two strategies:
- Unfolds were introduced as a lighter-weight alternative to concatMap, solving nested loop fusion reliably.
- Where GHC’s inlining heuristics alone are not enough, the fusion-plugin steps in. This ensures intermediate allocations are eliminated even in tricky cases.

In short: fused streams (Stream) give you performance; CPS streams (StreamK) give you scalability. Careful design, inlining, and the fusion-plugin make fusion reliable across practical use cases.

For background on stream fusion, case-of-case and SpecConstr optimizations in Haskell, see the GHC documentation on -fspec-constr and related flags.

That’s Part 2. In Part 3, Reliable Stream Fusion with Fusion Plugin, we’ll look at the hard cases where fusion doesn’t “just work” — and how we built a compiler plugin to solve them.