COMPOSEWELL TECHNOLOGIES
BLOG
Posts and announcements for all things Composewell, including new streamly releases, performance notes, and notes on general haskell topics.

Composewell builds Streamly and other open-source Haskell libraries. We also offer consulting and training in Haskell and high-performance functional programming. Learn more at composewell.com.


15 Sep 2025
Harendra Kumar
Reliable Stream Fusion with Fusion Plugin

This is the third post in a three-part series on stream fusion in Streamly.

  • Part 1 explained how stream fusion lets you write expressive yet high performance code in Haskell.
  • Part 2 explained how stream fusion works in streamly and how GHC turns high-level combinators into low-level imperative loops.

In this post, we’ll tackle a thorny question:

Why does fusion sometimes fail unpredictably — and how can we make it reliable?

The answer: a GHC compiler plugin, fusion-plugin.

Why Fusion Sometimes Fails

Fusion in Haskell relies on aggressive inlining. Library authors can mark functions with INLINE pragmas, and most of the time that’s enough.

But during simplification, GHC introduces internal bindings (like join points). These bindings are not visible at the source level, and whether they get inlined depends on GHC’s size heuristics.

  • If they are small they get inlined and fusion succeeds.
  • If they are too large they are not inlined and fusion breaks.

The result: unpredictable fusion.

See GHC ticket #17075 for details.

Demonstrating the Problem

Let’s look at the Stream.uniq combinator, which drops consecutive duplicate elements in a stream:

uniq :: (Eq a, Monad m) => Stream m a -> Stream m a
uniq (Stream step state) = Stream step' (Nothing, state)

    where
    {-# INLINE_LATE step' #-}
    step' gst (Nothing, st) = do
        r <- step gst st
        return $ case r of
            Yield x s -> Yield x (Just x, s)
            Skip s -> Skip  (Nothing, s)
            Stop -> Stop

    step' gst (Just x, st)  = do
         r <- step gst st
         return $ case r of
             Yield y s
                | x == y -> Skip (Just x, s)
                | otherwise -> Yield y (Just y, s)
             Skip s -> Skip (Just x, s)
             Stop -> Stop

Now, use it once in a program:

main :: IO ()
main = do
    Stream.fold Fold.drain
        $ Stream.uniq
        $ Stream.enumerateFromTo (1::Int) 1000

Compiling with -O2 yields perfectly fused Core:

main_$s$wgo
  = \ sc_s386 sc1_s385 eta_s36L ->
      case <=# sc_s386 1000# of {
        __DEFAULT -> (# eta_s36L, () #);
        1# ->
          case ==# sc1_s385 sc_s386 of {
            __DEFAULT -> main_$s$wgo (+# sc_s386 1#) sc_s386 eta_s36L;
            1# -> main_$s$wgo (+# sc_s386 1#) sc1_s385 eta_s36L
          }
      }

No Yield, Skip, or Stop constructors remain.

But now chain three uniqs:

main = do
    Stream.fold Fold.drain
        $ Stream.uniq
        $ Stream.uniq
        $ Stream.uniq
        $ Stream.enumerateFromTo (1::Int) 1000

This time the Core is littered with constructors:

main_$s$wgo1
  = \ sc_s3gW sc1_s3gV ww_s3ds ww1_s3dt eta_s3dx ->
      join {
        $j_s2GC ipv_a2EK ipv1_a2EL
          = case ipv1_a2EL of {
              Yield y_a2y5 s1_a2y6 ->
                case y_a2y5 of { I# y1_a2Fb ->
                case ==# sc_s3gW y1_a2Fb of {
                  __DEFAULT ->
                    case ==# sc1_s3gV y1_a2Fb of {
                      __DEFAULT ->
                        case s1_a2y6 of { (ww2_X1W, ww3_X1X) ->
                        main_$s$wgo1 y1_a2Fb y1_a2Fb ww2_X1W ww3_X1X ipv_a2EK
                        };
                      1# ->
                        case s1_a2y6 of { (ww2_X1W, ww3_X1X) ->
                        main_$s$wgo1 y1_a2Fb sc1_s3gV ww2_X1W ww3_X1X ipv_a2EK
                        }
                    };
                  1# ->
                    case s1_a2y6 of { (ww2_X1P, ww3_X1Q) ->
                    main_$s$wgo1 sc_s3gW sc1_s3gV ww2_X1P ww3_X1Q ipv_a2EK
                    }
                }
                };
              Skip s1_a2yb ->
                case s1_a2yb of { (ww2_X1P, ww3_X1Q) ->
                main_$s$wgo1 sc_s3gW sc1_s3gV ww2_X1P ww3_X1Q ipv_a2EK
                };
              Stop -> (# ipv_a2EK, () #)
            } } in
      case ww_s3ds of wild2_X1R {
        Nothing ->
          case ww1_s3dt of x1_a2qD { I# ipv_s2FJ ->
          case <=# ipv_s2FJ 1000# of {
            __DEFAULT -> (# eta_s3dx, () #);
            1# ->
              jump $j_s2GC
                eta_s3dx (Yield x1_a2qD (Just x1_a2qD, I# (+# ipv_s2FJ 1#)))
          }
          };
        Just x_X1S ->
          case ww1_s3dt of x1_a2qD { I# ipv_s2FJ ->
          case <=# ipv_s2FJ 1000# of {
            __DEFAULT -> (# eta_s3dx, () #);
            1# ->
              case x_X1S of { I# x2_a2F8 ->
              case ==# x2_a2F8 ipv_s2FJ of {
                __DEFAULT ->
                  jump $j_s2GC
                    eta_s3dx (Yield x1_a2qD (Just x1_a2qD, I# (+# ipv_s2FJ 1#)));
                1# -> jump $j_s2GC eta_s3dx (Skip (wild2_X1R, I# (+# ipv_s2FJ 1#)))
              }
              }
          }
          }
      }

Instead of a tight loop, we get constructor allocations (Yield, Skip, Stop, Just, Nothing) on every step. Fusion failed.

Root Cause: Join Points and Heuristics

The failure comes from a join point that GHC introduces internally:

join {
  $j_s2GC ipv_a2EK ipv1_a2EL
    = case ipv1_a2EL of {
        Yield y_a2y5 s1_a2y6 -> ...

This join point wasn’t inlined because it exceeded GHC’s size threshold. That blocked the case-of-case transformation, leaving constructors in the loop.

Why not just raise the threshold?

Because that would globally inline too much, slowing down compilation and bloating code.

We need selective inlining: inline only the bindings that block fusion.

The Fusion Plugin Solution

fusion-plugin gives programmers control through fusion annotations.

{-# ANN type Step Fuse #-}
data Step s a = Yield a s | Skip s | Stop

This tells the compiler:

Any binding that scrutinizes or constructs Step must be inlined, no matter its size.

The plugin scans bindings during the simplifier pass:

  • If a fusible type (Step) is involved then the binding is marked INLINE.
  • Run another simplifier pass which then eliminates the intermediate constructors.

Let’s try our triple-uniq example again, now with the plugin:

$ ghc -O2 -fplugin Fusion.Plugin uniq.hs

And the resulting Core is fully fused again:

main_$s$wgo
  = \ sc_s3nB sc1_s3nA sc2_s3nz sc3_s3ny eta_s3jY ->
      case <=# sc_s3nB 1000# of {
        __DEFAULT -> (# eta_s3jY, () #);
        1# ->
          case ==# sc1_s3nA sc_s3nB of {
            __DEFAULT ->
              case ==# sc2_s3nz sc_s3nB of {
                __DEFAULT ->
                  case ==# sc3_s3ny sc_s3nB of {
                    __DEFAULT ->
                      main_$s$wgo (+# sc_s3nB 1#) sc_s3nB sc_s3nB sc_s3nB eta_s3jY;
                    1# -> main_$s$wgo (+# sc_s3nB 1#) sc_s3nB sc_s3nB sc3_s3ny eta_s3jY
                  };
                1# ->
                  main_$s$wgo (+# sc_s3nB 1#) sc_s3nB sc2_s3nz sc3_s3ny eta_s3jY
              };
            1# ->
              main_$s$wgo (+# sc_s3nB 1#) sc1_s3nA sc2_s3nz sc3_s3ny eta_s3jY
          }
      }

Clean, tight, allocation-free code.

How It Works Under the Hood

Here’s the workflow at a high level:

  1. The plugin inspects the ModGuts of a module (i.e. all its bindings).
  2. It searches for scrutinizing or constructing patterns involving types marked as fusible.
  3. Any such bindings are automatically marked INLINE.
  4. A second simplifier pass then removes them, allowing fusion to go through.

A pleasant side-effect of this design is that the programmer only needs to annotate the state type used in a combinator. Previously, you had to carefully inline every function involved — both inside the combinator’s implementation and across the entire pipeline. With the plugin, a single annotation is enough: it automatically inlines every binding that touches that type, including join points and let bindings generated internally by GHC.

In other words, you no longer need to scatter INLINE pragmas across step functions and helper bindings. Annotate the type once, and the plugin handles the rest.

Inlining Scrutinizing Binders

The plugin looks for binders whose right-hand side scrutinizes a fusible type (i.e. a type annotated with Fuse).

Case expressions: If a binder contains a Case scrutinizing a fusible constructor, the binder is marked INLINE:

binder = case x of { Yield a s -> expr }

Since Yield belongs to a fusible type, the whole binder is marked INLINE.

We then recurse into the case alternatives (expr above). If they contain Let bindings that themselves scrutinize fusible constructors, those are also marked INLINE:

b1 = let b2 = expr1 in expr2

If expr2 scrutinizes a fusible constructor, b1 is marked INLINE. Both expr1 and expr2 are traversed recursively for further let bindings to inline.

If the expression we recurse into is an App, Lam, or Cast:

App expr1 expr2
Lam _ expr1
Cast expr1 _

The plugin simply traverses their subexpressions, checking for any Let bindings to be inlined.

Inlining Constructing Binders

The plugin also looks for constructing binders — bindings that build values of a type annotated with Fuse. These too must be inlined for fusion to succeed.

To detect them, the plugin traverses Let, Case, Lam, and Cast nodes in the GHC Core, examining their subexpressions. If it encounters a Var expression of a fusible type, the corresponding binder is marked INLINE.

For example, in the Core snippet below the binding step_s9QR constructs and returns Yield and Stop values (both part of a fusible Step type). Without inlining this binder, these constructors would remain in the final Core and block fusion.

let {
  step_s9QR
    = \ _ [Occ=Dead]
        (eta_B0 :: Int)
        (eta_B1 [OS=OneShot] :: GHC.State# GHC.RealWorld) ->
        case eta_B0 of wild_i8YU { GHC.I# x_i8YV ->
        case value_a67r of { GHC.I# y_i8YY ->
        case GHC.># x_i8YV y_i8YY of {
          __DEFAULT ->
            (# eta_B1,
               Yield @Int @Int wild_i8YU (GHC.Types.I# (GHC.+# x_i8YV 1#)) #);
          1# ->
            (# eta_B1, Stop @Int @Int #)
        }
        }
        } } in
  ...

Developer Tools

The plugin includes debugging and reporting options:

  • Dump Core after each pass

    -fplugin-opt=Fusion.Plugin:dump-core

    Output of each core-2-core pass is dumped in a separate file. One can compare outputs from successive phases to see what blocks fusion.

  • Report unfused bindings

    -fplugin-opt=Fusion.Plugin:verbose=1..4

    Track problematic internal bindings (including join points).

These features make it much easier to see inside fusion.

Costs and Trade-offs

  • Compilation overhead: a few extra simplifier passes.
  • Selective impact: only bindings involving fusible types are affected.
  • Configurable: annotations can be turned off, restoring vanilla behavior.

In practice, the runtime performance gains from reliable fusion far outweigh the extra compile time.

Looking Ahead

The ultimate goal: support fusion annotations natively in GHC. That way, programmers and library authors can control fusion reliably, without relying on a plugin.

This would benefit the entire Haskell ecosystem — especially libraries like vector and streamly that rely on predictable fusion.

Future extensions could include:

  • Adding thresholds to control inlining size.
  • Fine-tuning trade-offs between fusion reliability and compile-time cost.

Conclusion

Fusion in GHC is powerful but fragile. A single missed inlining can derail the whole pipeline.

The fusion-plugin, makes fusion reliable:

  • Annotate fusible types.
  • Let the plugin enforce inlining.
  • Enjoy clean, fused Core every time.

Though the plugin was motivated by streamly, it is generally usable by any Haskell library, not just streamly.

This closes the story: Haskell can give you expressive, modular code and performance on par with C, without unpredictability.

All rights reserved, Composewell Technologies