COMPOSEWELL TECHNOLOGIES
BLOG
Posts and announcements for all things Composewell, including new streamly releases, performance notes, and general Haskell topics.

Composewell builds Streamly and other open-source Haskell libraries. We also offer consulting and training in Haskell and high-performance functional programming. Learn more at composewell.com.


14 Sep 2025
Harendra Kumar
How Stream Fusion Works in Streamly

In Part 1, we introduced stream fusion, why it matters for performance in Haskell, and how it enables Streamly to compete with C in speed.

In this post, we’ll go deeper: we’ll look at how Streamly’s Stream type works, how GHC’s optimizer eliminates overhead, and why careful design plus inlining are the keys to reliable fusion. In Part 3, we’ll cover tricky fusion cases and how we solved them using a GHC plugin.

Fusion in GHC: The Foundations

Streamly builds on two powerful optimizations already present in GHC:

  • Case-of-case transformation – stitches code together by eliminating intermediate constructors.
  • SpecConstr (constructor specialization) – specializes recursive loops so constructor payloads can be carried as unboxed values.

These are the workhorses of stream fusion. Let’s see how they apply in Streamly.

Stream Representation in Streamly

The Stream type in Streamly is adapted from the vector package:

data Step s a = Yield a s | Skip s | Stop

data Stream m a = forall s. Stream (s -> m (Step s a)) s

Here’s what these types mean:

  • Stream wraps a state machine, consisting of a step function and an initial state of type s.

  • The step function (s -> m (Step s a)) is called repeatedly with the current state to produce the next stream element and state, until it returns Stop.

  • Step is the outcome of a call to step:

    • Yield a s – produce a value and advance to a new state
    • Skip s – advance to a new state without producing a value
    • Stop – terminate the stream

A Step is equivalent to one iteration in an imperative loop. The key difference: in an imperative loop the loop logic and state are monolithic, while in functional streams a loop is built with modular combinators where each combinator has its own independent state and logic. These modular pieces are chained together, and under the hood the compiler fuses them into one efficient monolithic imperative loop.
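
The loop analogy can be made concrete with a minimal, self-contained sketch using the Step and Stream types above. This is not Streamly's actual code — just an illustration of a generator whose state is the next value, and a consumer that drives the step function until it returns Stop:

```haskell
{-# LANGUAGE ExistentialQuantification #-}

data Step s a = Yield a s | Skip s | Stop

data Stream m a = forall s. Stream (s -> m (Step s a)) s

-- Generate the integers from lo to hi; the state is the next value.
enumerate :: Monad m => Int -> Int -> Stream m Int
enumerate lo hi = Stream step lo
  where
    step i
        | i > hi    = return Stop
        | otherwise = return (Yield i (i + 1))

-- Drive the stream to completion, summing the elements. Each call to
-- the step function is one iteration of the loop.
sumStream :: Monad m => Stream m Int -> m Int
sumStream (Stream step s0) = go 0 s0
  where
    go acc st = do
        r <- step st
        case r of
            Yield x s -> go (acc + x) s
            Skip s    -> go acc s
            Stop      -> return acc
```

Here `sumStream` plays the role of an eliminator: it owns the recursion, while `enumerate` only supplies one step at a time.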

How Stream Combinators Work

A stream typically flows through three stages:

  1. Generation – A stream generator function (e.g. enumerateFromTo) creates a sequence of values i.e. a Stream.
  2. Transformation – One or more stream transformation combinators (e.g. mapM) consume the input stream, apply their logic, and produce another Stream.
  3. Elimination – A stream consumer function (e.g. fold) drives the stream to completion and returns a result.

Each transformation combinator works like this:

  • It drives the input stream step by step using its state s.
  • At the same time, it maintains its own local state for the transformation.
  • The effective state is therefore the original state s plus the combinator’s own state.
  • Each step updates both states, much like loop variables are updated inside an imperative for or while loop.
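
A sketch of such a combinator, using the toy Stream type above (the name `takeStr` is hypothetical, not Streamly's API): the effective state is the pair `(i, st)`, combining the input stream's state `st` with the combinator's own counter `i`, and each step updates both.

```haskell
{-# LANGUAGE ExistentialQuantification #-}

data Step s a = Yield a s | Skip s | Stop

data Stream m a = forall s. Stream (s -> m (Step s a)) s

-- A take-like combinator: pairs the input stream's state with a local
-- counter, much like two loop variables in an imperative loop.
takeStr :: Monad m => Int -> Stream m a -> Stream m a
takeStr n (Stream step s0) = Stream step1 (n, s0)
  where
    step1 (i, st)
        | i <= 0 = return Stop
        | otherwise = do
            r <- step st
            return $ case r of
                Yield x s -> Yield x (i - 1, s)
                Skip s    -> Skip (i, s)
                Stop      -> Stop

-- Helpers to exercise the combinator.
enumerate :: Monad m => Int -> Int -> Stream m Int
enumerate lo hi = Stream step lo
  where
    step i
        | i > hi    = return Stop
        | otherwise = return (Yield i (i + 1))

toList :: Monad m => Stream m a -> m [a]
toList (Stream step s0) = go s0
  where
    go st = do
        r <- step st
        case r of
            Yield x s -> fmap (x :) (go s)
            Skip s    -> go s
            Stop      -> return []
```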

What Stream Fusion Does

Without fusion, chaining combinators means repeatedly allocating boxed Step constructors, wrapping the state and the result, and passing them across the stages of the pipeline. This adds per-element, per-stage overhead that quickly destroys performance.

Stream fusion eliminates this overhead. The compiler stitches together the step functions of all combinators into a single fused loop, where states are passed as unboxed values. This removes intermediate heap allocations and yields performance on par with hand-written imperative code like in C.

Example: mapM Fused into a Loop

Consider Streamly’s definition of mapM:

mapM :: Monad m => (a -> m b) -> Stream m a -> Stream m b
mapM f (Stream step state) = Stream step1 state

    where

    step1 st = do
        r <- step st
        case r of
            Yield x s -> f x >>= \a -> return $ Yield a s
            Skip s    -> return $ Skip s
            Stop      -> return Stop

Now use it in a simple program:

main :: IO ()
main = do
    Stream.fold Fold.drain
        $ Stream.mapM return
        $ Stream.enumerateFromTo (1 :: Int) 1000

At the source level, this is just a modular pipeline. But after GHC’s optimizer runs, something remarkable happens.

What GHC Generates

Compile with:

$ ghc -O2 -ddump-simpl -ddump-to-file -dsuppress-all mapM.hs

The resulting Core shows a tight loop with no intermediate constructors and the state carried as an unboxed value:

main_$s$wgo
  = \ ww_s2Xl eta_s2Xn ->
      case <=# ww_s2Xl 1000# of {
        __DEFAULT -> (# eta_s2Xn, () #);
        1# -> main_$s$wgo (+# ww_s2Xl 1#) eta_s2Xn
      }

Key takeaway: All the Yield, Skip, and Stop constructors are gone. The pipeline collapses into a single recursive loop — as efficient as hand-written C.
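
As a source-level reading of that Core (my interpretation, not GHC's literal output), the whole pipeline has collapsed into a tail-recursive counter with no Step constructors and no per-element heap allocation:

```haskell
-- Roughly what main_$s$wgo does: count from 1 to 1000 and stop.
loop :: Int -> IO ()
loop i
    | i <= 1000 = loop (i + 1)
    | otherwise = return ()
```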

How GHC Achieves Fusion

GHC relies on two key optimizations to enable stream fusion:

  1. Case-of-case transformation – removes intermediate constructors by inlining and simplifying nested case expressions.

    • For example, constructors produced by enumerateFromTo and immediately consumed by mapM are eliminated.
    • This fuses the two state machines, eliminating boxing overhead between stages in a transformation pipeline.
  2. SpecConstr optimization – specializes recursive functions by generating a dedicated version of the recursive loop for each constructor case.

    • This allows constructor payloads to be passed as unboxed values inside the recursive loop.
    • It is especially effective in stream elimination combinators (e.g. folds), which are inherently recursive.
    • In the Core snippet above, the fold is the recursive loop; SpecConstr is applied to it, so the value ww_s2Xl is carried through the loop as an unboxed integer.

Together, these transformations turn a chain of high-level stream combinators into a tight, allocation-free recursive loop, indistinguishable from hand-written imperative code.
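
As a standalone illustration of what SpecConstr buys (a toy example, unrelated to Streamly's code): the loop below scrutinizes a `Just` constructor on every iteration, and with -O2 GHC generates a specialized variant of `go` for the `Just` case, so the payload is carried as a bare unboxed Int instead of being re-boxed on every recursive call.

```haskell
-- Sum the integers from k down to 1, threading the counter through a
-- Maybe constructor that SpecConstr can specialize away.
sumTo :: Int -> Int
sumTo k = go (Just k) 0
  where
    go Nothing acc = acc
    go (Just n) acc
        | n <= 0    = acc
        | otherwise = go (Just (n - 1)) (acc + n)
```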

Inlining: The Key to Fusion

Fusion only happens if GHC can see through the definitions of the functions involved. In practice, this means:

  • Any function that produces or consumes a Stream, Fold, or Parser must be inlined.
  • The final IO action — the fully composed loop including an elimination function like fold — does not need to be inlined (and usually shouldn’t be).

For example:

f :: IO ()
f =
    Stream.fold Fold.drain
        $ Stream.mapM return
        $ Stream.enumerateFromTo (1 :: Int) 1000

Here:

  • The generator (enumerateFromTo),
  • The transformer (mapM), and
  • The eliminator (drain)

are all fragments of the loop. These must be inlined so that GHC can stitch them together into a single fused loop.

The IO action f itself, however, already represents the fully composed loop. Inlining it would be counterproductive, since no further fusion can occur beyond that point.
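
The inlining discipline can be sketched with the toy Stream type from the start of the post (again, a sketch, not Streamly's actual code): every loop fragment carries an INLINE pragma, while the fully composed action does not.

```haskell
{-# LANGUAGE ExistentialQuantification #-}

data Step s a = Yield a s | Skip s | Stop

data Stream m a = forall s. Stream (s -> m (Step s a)) s

-- Loop fragments: each must be inlined at the use site so GHC can
-- stitch them into one loop.
{-# INLINE source #-}
source :: Monad m => Stream m Int
source = Stream step 1
  where
    step i
        | i > 1000  = return Stop
        | otherwise = return (Yield i (i + 1))

{-# INLINE mapStr #-}
mapStr :: Monad m => (a -> b) -> Stream m a -> Stream m b
mapStr g (Stream step st) = Stream step1 st
  where
    step1 s = do
        r <- step s
        return $ case r of
            Yield x s' -> Yield (g x) s'
            Skip s'    -> Skip s'
            Stop       -> Stop

{-# INLINE drain #-}
drain :: Monad m => Stream m a -> m ()
drain (Stream step s0) = go s0
  where
    go s = do
        r <- step s
        case r of
            Yield _ s' -> go s'
            Skip s'    -> go s'
            Stop       -> return ()

-- The fully composed loop: no INLINE needed (and none wanted) here.
f :: IO ()
f = drain (mapStr (+ 1) source)
```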

Fusion in Streamly

Streamly is built from the ground up with fusion at its core:

  • You write high-level, declarative Haskell using modular combinators.
  • The compiler fuses those combinators into tight, efficient low-level loops.

The key types designed for fusion are:

  • Producers: Stream, Unfold
  • Consumers: Fold, Scanl, Parser

These types can be freely composed with the provided combinators, and in practice a chain of combinators fuses end-to-end. The result is expressive code that compiles down to efficient loops, without sacrificing modularity or performance.

Limitations of Fusion

Fusion is a compile-time optimization: the compiler stitches combinators into a single loop by inlining their definitions. This makes fusion inherently static — it cannot handle cases where functions cannot be inlined, such as recursive calls or streams passed to or returned from functions at runtime. In these situations, the compiler has no visibility into the code that needs to be fused.

When Fusion Falls Short

  • Recursive use of fused operations like cons and append does not scale.
  • While cons, append, and other such stream combinators generate very efficient code in non-recursive pipelines (e.g. adding a few elements or appending a few streams), they become impractical for composing a large number of elements or streams dynamically.
  • For example, in the fused cons implementation, state-based branching to retrieve stream elements does not scale, because each element requires too many branch decisions.
  • Similarly, streams composed dynamically at runtime cannot be fused, because their structure is not statically known.

As an example, a recursive prime sieve implemented with fused cons would be inefficient or even unusable. The same algorithm, however, works well with CPS-based cons (StreamK).

primes :: StreamK Identity Int
primes =
    StreamK.cons 3
        $ StreamK.fromStream
        $ Stream.filter filterPrime (Stream.enumerateFromThen 5 7)

    where

    filterPrime n =
          all (\p -> n `mod` p /= 0)
        $ Stream.takeWhile (\p -> p * p <= n)
        $ StreamK.toStream primes

CPS Streams: A Complement

To address these cases, Streamly provides CPS streams (StreamK):

  • Fused streams are like static arrays — fixed and extremely fast.
  • CPS streams are like linked lists — dynamic and composable at runtime.

StreamK is especially useful in scenarios like parsers, where a stream must be reconstructed at runtime during backtracking. While fused streams excel when the structure is known statically, StreamK handles dynamic reconstruction in a scalable manner. The trade-off is a small per-element overhead; in practice this cost is negligible when working with chunked streams (e.g. arrays of data).

A Complete Model

Together, fused streams and CPS streams give Streamly the best of both worlds:

  • Fusion delivers maximum performance when pipelines are known at compile time.
  • CPS streams provide flexibility when recursive or runtime composition is needed.

This dual approach ensures that Streamly delivers both performance and scalability, covering all use cases without leaving corner cases unsolved.

The Costs of Fusion

Fusion is not free — for the compiler. Larger loops mean:

  • Longer compilation times
  • More memory usage
  • Bigger generated code

For very large pipelines, it’s sometimes better to break them into smaller parts and mark some with NOINLINE. This prevents over-fusion and reduces compile-time costs, but it can significantly hurt performance, especially if the loop is broken at the wrong place.
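
One way to place such a break, sketched with the toy Stream type from earlier (the name `stage1` is hypothetical): NOINLINE an intermediate stage so the combinators inside it still fuse with each other, while the stage itself stays opaque to its callers, allocating Step constructors only at that one boundary.

```haskell
{-# LANGUAGE ExistentialQuantification #-}

data Step s a = Yield a s | Skip s | Stop

data Stream m a = forall s. Stream (s -> m (Step s a)) s

{-# INLINE mapStr #-}
mapStr :: Monad m => (a -> b) -> Stream m a -> Stream m b
mapStr g (Stream step st) = Stream step1 st
  where
    step1 s = do
        r <- step s
        return $ case r of
            Yield x s' -> Yield (g x) s'
            Skip s'    -> Skip s'
            Stop       -> Stop

-- Deliberately break fusion at this stage boundary: the two mapStr
-- stages fuse with each other, but not with code outside stage1.
{-# NOINLINE stage1 #-}
stage1 :: Monad m => Stream m Int -> Stream m Int
stage1 = mapStr (* 2) . mapStr (+ 1)

-- Helpers to exercise the pipeline.
enumerate :: Monad m => Int -> Int -> Stream m Int
enumerate lo hi = Stream step lo
  where
    step i
        | i > hi    = return Stop
        | otherwise = return (Yield i (i + 1))

sumStream :: Monad m => Stream m Int -> m Int
sumStream (Stream step s0) = go 0 s0
  where
    go acc st = do
        r <- step st
        case r of
            Yield x s -> go (acc + x) s
            Skip s    -> go acc s
            Stop      -> return acc
```

Where to place the break matters: a boundary that sees few, large items (e.g. chunks) costs far less than one crossed once per element.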

Designing for Reliable Fusion

The reliability of Streamly’s stream fusion comes from two strategies:

  1. Designing abstractions for fusion – For example, Unfolds were introduced as a lighter-weight alternative to concatMap, solving nested loop fusion reliably.
  2. Fixing GHC’s gaps – Where GHC’s simplifier can’t fuse, Streamly provides the fusion-plugin. This ensures intermediate allocations are eliminated even in tricky cases.

Key Takeaways

  • Fusion eliminates constructors and allocations, producing loops as efficient as C.
  • Inlining is critical: generators, transformers, and folds must all be visible to GHC.
  • Fused streams (Stream) give you performance; CPS streams (StreamK) give you scalability.
  • Streamly’s design + fusion-plugin make fusion reliable across practical use cases.

Further Reading

For background on stream fusion, case-of-case and SpecConstr optimizations in Haskell, see:

Up Next

That’s Part 2. In Part 3, Reliable Stream Fusion with Fusion Plugin, we’ll look at the hard cases where fusion doesn’t “just work” — and how we built a compiler plugin to solve them.

All rights reserved, Composewell Technologies