Composewell builds Streamly and other open-source Haskell libraries. We also offer consulting and training in Haskell and high-performance functional programming. Learn more at composewell.com.
This is the third post in a three-part series on stream fusion in Streamly.
In this post, we’ll tackle a thorny question:
Why does fusion sometimes fail unpredictably — and how can we make it reliable?
The answer: a GHC compiler plugin,
fusion-plugin.
Fusion in Haskell relies on aggressive inlining. Library authors
can mark functions with INLINE pragmas, and most of the time that’s
enough.
But during simplification, GHC introduces internal bindings (like join points). These bindings are not visible at the source level, and whether they get inlined depends on GHC’s size heuristics.
The result: unpredictable fusion.
See GHC ticket #17075 for details.
Let’s look at the Stream.uniq combinator, which drops consecutive
duplicate elements in a stream:
uniq :: (Eq a, Monad m) => Stream m a -> Stream m a
uniq (Stream step state) = Stream step' (Nothing, state)
where
{-# INLINE_LATE step' #-}
step' gst (Nothing, st) = do
r <- step gst st
return $ case r of
Yield x s -> Yield x (Just x, s)
Skip s -> Skip (Nothing, s)
Stop -> Stop
step' gst (Just x, st) = do
r <- step gst st
return $ case r of
Yield y s
| x == y -> Skip (Just x, s)
| otherwise -> Yield y (Just y, s)
Skip s -> Skip (Just x, s)
Stop -> StopNow, use it once in a program:
main :: IO ()
main = do
Stream.fold Fold.drain
$ Stream.uniq
$ Stream.enumerateFromTo (1::Int) 1000Compiling with -O2 yields perfectly fused Core:
main_$s$wgo
= \ sc_s386 sc1_s385 eta_s36L ->
case <=# sc_s386 1000# of {
__DEFAULT -> (# eta_s36L, () #);
1# ->
case ==# sc1_s385 sc_s386 of {
__DEFAULT -> main_$s$wgo (+# sc_s386 1#) sc_s386 eta_s36L;
1# -> main_$s$wgo (+# sc_s386 1#) sc1_s385 eta_s36L
}
}No Yield, Skip, or Stop constructors remain.
But now chain three uniqs:
main = do
Stream.fold Fold.drain
$ Stream.uniq
$ Stream.uniq
$ Stream.uniq
$ Stream.enumerateFromTo (1::Int) 1000This time the Core is littered with constructors:
main_$s$wgo1
= \ sc_s3gW sc1_s3gV ww_s3ds ww1_s3dt eta_s3dx ->
join {
$j_s2GC ipv_a2EK ipv1_a2EL
= case ipv1_a2EL of {
Yield y_a2y5 s1_a2y6 ->
case y_a2y5 of { I# y1_a2Fb ->
case ==# sc_s3gW y1_a2Fb of {
__DEFAULT ->
case ==# sc1_s3gV y1_a2Fb of {
__DEFAULT ->
case s1_a2y6 of { (ww2_X1W, ww3_X1X) ->
main_$s$wgo1 y1_a2Fb y1_a2Fb ww2_X1W ww3_X1X ipv_a2EK
};
1# ->
case s1_a2y6 of { (ww2_X1W, ww3_X1X) ->
main_$s$wgo1 y1_a2Fb sc1_s3gV ww2_X1W ww3_X1X ipv_a2EK
}
};
1# ->
case s1_a2y6 of { (ww2_X1P, ww3_X1Q) ->
main_$s$wgo1 sc_s3gW sc1_s3gV ww2_X1P ww3_X1Q ipv_a2EK
}
}
};
Skip s1_a2yb ->
case s1_a2yb of { (ww2_X1P, ww3_X1Q) ->
main_$s$wgo1 sc_s3gW sc1_s3gV ww2_X1P ww3_X1Q ipv_a2EK
};
Stop -> (# ipv_a2EK, () #)
} } in
case ww_s3ds of wild2_X1R {
Nothing ->
case ww1_s3dt of x1_a2qD { I# ipv_s2FJ ->
case <=# ipv_s2FJ 1000# of {
__DEFAULT -> (# eta_s3dx, () #);
1# ->
jump $j_s2GC
eta_s3dx (Yield x1_a2qD (Just x1_a2qD, I# (+# ipv_s2FJ 1#)))
}
};
Just x_X1S ->
case ww1_s3dt of x1_a2qD { I# ipv_s2FJ ->
case <=# ipv_s2FJ 1000# of {
__DEFAULT -> (# eta_s3dx, () #);
1# ->
case x_X1S of { I# x2_a2F8 ->
case ==# x2_a2F8 ipv_s2FJ of {
__DEFAULT ->
jump $j_s2GC
eta_s3dx (Yield x1_a2qD (Just x1_a2qD, I# (+# ipv_s2FJ 1#)));
1# -> jump $j_s2GC eta_s3dx (Skip (wild2_X1R, I# (+# ipv_s2FJ 1#)))
}
}
}
}
}Instead of a tight loop, we get constructor allocations (Yield,
Skip, Stop, Just, Nothing) on every step. Fusion failed.
The failure comes from a join point that GHC introduces internally:
join {
$j_s2GC ipv_a2EK ipv1_a2EL
= case ipv1_a2EL of {
Yield y_a2y5 s1_a2y6 -> ...This join point wasn’t inlined because it exceeded GHC’s size threshold. That blocked the case-of-case transformation, leaving constructors in the loop.
Because that would globally inline too much, slowing down compilation and bloating code.
We need selective inlining: inline only the bindings that block fusion.
fusion-plugin gives programmers control through fusion
annotations.
{-# ANN type Step Fuse #-}
data Step s a = Yield a s | Skip s | StopThis tells the compiler:
Any binding that scrutinizes or constructs
Stepmust be inlined, no matter its size.
The plugin scans bindings during the simplifier pass:
Step) is involved then the binding is marked INLINE.Let’s try our triple-uniq example again, now with the plugin:
$ ghc -O2 -fplugin Fusion.Plugin uniq.hsAnd the resulting Core is fully fused again:
main_$s$wgo
= \ sc_s3nB sc1_s3nA sc2_s3nz sc3_s3ny eta_s3jY ->
case <=# sc_s3nB 1000# of {
__DEFAULT -> (# eta_s3jY, () #);
1# ->
case ==# sc1_s3nA sc_s3nB of {
__DEFAULT ->
case ==# sc2_s3nz sc_s3nB of {
__DEFAULT ->
case ==# sc3_s3ny sc_s3nB of {
__DEFAULT ->
main_$s$wgo (+# sc_s3nB 1#) sc_s3nB sc_s3nB sc_s3nB eta_s3jY;
1# -> main_$s$wgo (+# sc_s3nB 1#) sc_s3nB sc_s3nB sc3_s3ny eta_s3jY
};
1# ->
main_$s$wgo (+# sc_s3nB 1#) sc_s3nB sc2_s3nz sc3_s3ny eta_s3jY
};
1# ->
main_$s$wgo (+# sc_s3nB 1#) sc1_s3nA sc2_s3nz sc3_s3ny eta_s3jY
}
}Clean, tight, allocation-free code.
Here’s the workflow at a high level:
ModGuts of a module (i.e. all its bindings).INLINE.A pleasant side-effect of this design is that the programmer only needs
to annotate the state type used in a combinator. Previously, you
had to carefully inline every function involved — both inside the
combinator’s implementation and across the entire pipeline. With the
plugin, a single annotation is enough: it automatically inlines every
binding that touches that type, including join points and let bindings
generated internally by GHC.
In other words, you no longer need to scatter INLINE pragmas across
step functions and helper bindings. Annotate the type once, and the
plugin handles the rest.
The plugin looks for binders whose right-hand side scrutinizes a
fusible type (i.e. a type annotated with Fuse).
Case expressions:
If a binder contains a Case scrutinizing a fusible constructor, the
binder is marked INLINE:
binder = case x of { Yield a s -> expr }Since Yield belongs to a fusible type, the whole binder is marked INLINE.
We then recurse into the case alternatives (expr above). If they contain
Let bindings that themselves scrutinize fusible constructors, those
are also marked INLINE:
b1 = let b2 = expr1 in expr2If expr2 scrutinizes a fusible constructor, b1 is marked
INLINE. Both expr1 and expr2 are traversed recursively for further
let bindings to inline.
If the expression we recurse into is an App, Lam, or Cast:
App expr1 expr2
Lam _ expr1
Cast expr1 _The plugin simply traverses their subexpressions, checking for any
Let bindings to be inlined.
The plugin also looks for constructing binders — bindings that
build values of a type annotated with Fuse. These too must be inlined
for fusion to succeed.
To detect them, the plugin traverses Let, Case, Lam, and Cast
nodes in the GHC Core, examining their subexpressions. If it encounters
a Var expression of a fusible type, the corresponding binder is marked
INLINE.
For example, in the Core snippet below the binding step_s9QR
constructs and returns Yield and Stop values (both part of a fusible
Step type). Without inlining this binder, these constructors would
remain in the final Core and block fusion.
let {
step_s9QR
= \ _ [Occ=Dead]
(eta_B0 :: Int)
(eta_B1 [OS=OneShot] :: GHC.State# GHC.RealWorld) ->
case eta_B0 of wild_i8YU { GHC.I# x_i8YV ->
case value_a67r of { GHC.I# y_i8YY ->
case GHC.># x_i8YV y_i8YY of {
__DEFAULT ->
(# eta_B1,
Yield @Int @Int wild_i8YU (GHC.Types.I# (GHC.+# x_i8YV 1#)) #);
1# ->
(# eta_B1, Stop @Int @Int #)
}
}
} } in
...The plugin includes debugging and reporting options:
Dump Core after each pass
-fplugin-opt=Fusion.Plugin:dump-coreOutput of each core-2-core pass is dumped in a separate file. One can compare outputs from successive phases to see what blocks fusion.
Report unfused bindings
-fplugin-opt=Fusion.Plugin:verbose=1..4Track problematic internal bindings (including join points).
These features make it much easier to see inside fusion.
In practice, the runtime performance gains from reliable fusion far outweigh the extra compile time.
The ultimate goal: support fusion annotations natively in GHC. That way, programmers and library authors can control fusion reliably, without relying on a plugin.
This would benefit the entire Haskell ecosystem — especially libraries
like vector and streamly that rely on predictable fusion.
Future extensions could include:
Fusion in GHC is powerful but fragile. A single missed inlining can derail the whole pipeline.
The fusion-plugin, makes fusion reliable:
Though the plugin was motivated by streamly, it is generally usable by any Haskell library, not just streamly.
This closes the story: Haskell can give you expressive, modular code and performance on par with C, without unpredictability.