Internal architecture
Rule Graph Construction
Overview
Build logic in Pants is declared using collections of @rules
with recursively memoized and invalidated results. This framework (known as Pants' "Engine") has similar goals to Bazel's Skyframe and the Salsa framework: users define logic using a particular API, and the framework manages tracking the dependencies between nodes in a runtime graph.
In order to maximize the amount of work that can be reused at runtime, Pants statically computes the memoization keys for the nodes of the runtime graph from the user specified @rules
during startup: this process is known as "rule graph construction". See the Goals
section for more information on the strategy and reasoning for this.
Concepts used in compilers, including live variable analysis and monomorphization, can also be useful in rule graph construction to minimize rule identities and pre-decide which versions of their dependencies they will use.
Concepts
A successfully constructed RuleGraph
contains a graph where nodes have one of three types, Rule
s, Query
s, and Param
s, which map fairly closely to what a Pants @rule
author consumes. The edges between nodes represent dependencies: Query
s are always roots of the graph, Param
s are always leaves, and Rule
s represent the end user logic making up all of the internal nodes of the graph.
Rules
A Rule
is a function or coroutine with all of its inputs declared as part of its type signature. The end user type signature is made up of:
- the return type of the
Rule
- the positional arguments to the
Rule
- a set of
Get
s which declare the runtime requirements of a coroutine, of the formGet(output_type, input_type)
In the RuleGraph
, these are encoded in a Rule trait, with a DependencyKey trait representing both the positional arguments (which have no provided Param
) and the Get
s (which provide their input type as a Param
).
Rule
s never refer to one another by name (i.e., they do not call one another by name): instead, their signature declares their requirements in terms of input/output types, and rule graph construction decides which potential dependencies will provide those requirements.
Queries
The roots/entrypoints of a RuleGraph
are Query
s, which should correspond one-to-one to external callsites that use the engine to request that values are computed. A Query
has an output type, and a series of input types: Query(output_type, (*input_types))
.
If a user makes a request to the engine that does not have a corresponding Query
declared, the engine fails rather than attempting to dynamically determine which Rules
to use to answer the Query
: how a RuleGraph
is constructed should show why that is the case.
Params
Params
are typed, comparable (eq
/hash
) values that represent both the inputs to Rules
, and the building block of the runtime memoization key for a Rule
. The set of Params
(unique by type) that are consumed to create a Rule
's inputs (plus the Rule
's own identity) make up the memoization key for a runtime instance of the Rule
.
Param
s are eventually used as positional args to Rule
s, but it's important to note that the Param
s in a Rule
instance's identity/memoization-key will not always become the positional arguments to that Rule
: in many cases, a Param
will be used by a Rule
's transitive dependencies in order to produce an output value that becomes either a positional argument to the Rule
as it starts, or the result of a Get
while a coroutine Rule
runs.
The Param
s that are available to a Rule
are made available by the Rule
's dependees (its "callers"), but similar to how Rule
s are not called by name, neither are all of their Param
s passed explicitly at each use site. A Rule
will be used to compute the output value for a DependencyKey
: i.e., a positional argument, Get
result, or Query
result. Of these usage sites, only Query
specifies the complete set of Params
that will be available: the other two usages (positional arguments and Get
s) are able to use any Param that will be "in scope" at the use site.
Params
flow down the graph from Query
s and the provided Param
s of Get
s: their presence does not need to be re-declared at each intermediate callsite. When a Rule
consumes a Param
as a positional argument, that Param
will no longer be available to that Rule
's dependencies (but it might still be present in other subgraphs adjacent to that Rule
).
Goals
The goals of RuleGraph
construction are:
- decide which
Rule
s to use to answerQuery
s (transitively, sinceRule
s do not call one another by name); and - determine the minimum set of
Param
inputs needed to satisfy theRule
s below thoseQuery
s
If either of the goals were removed, RuleGraph
construction might be more straightforward:
- If rather than being type-driven,
Rule
s called one another by name, you could statically determine their inputParams
by walking the call graph ofRule
s by name, and collecting their transitive inputParams
. - If rather than needing to compute a minimum set of
Param
inputs for the memoization key, we instead required that all usage sites explicitly declared allParam
s that their dependencies might need, we could relatively easily eliminate candidates based on the combination ofParam
types at a use site. And if we were willing to have very large memoization keys, we could continue to have simple callsites, but skip pruning theParams
that pass from a dependee to a dependency at runtime, and include anyParams
declared in any of aRule
s transitive dependees to be part of its identity.
But both of the goals are important because together they allow for an API that is easy to write Rule
s for, with minimal boilerplate required to get the inputs needed for a Rule
to compute a value, and minimal invalidation. Because the identity of a Rule
is computed from its transitive input Param
s rather than from its positional arguments, Rule
s can accept arbitrarily-many large input values (which don't need to implement hash) with no impact on its memoization hit rate.
Constraints
There are a few constraints that decide which Rule
s are able to provide dependencies for one another:
param_consumption
- When aRule
directly uses aParam
as a positional argument, thatParam
is removed from scope for any of thatRule
's dependencies.- For example, for a
Rule
y
with a positional argumentA
and aGet(B, C)
: if there is aParam
A
in scope aty
and it is used to satisfy the positional argument, it cannot also be used to (transitively) to satisfy theGet(B, C)
(i.e., a hyptothetical rule that consumes bothA
andC
would not be eligible in that position). - On the other hand, for a
Rule
w
withGet(B, C)
andGet(D, E)
, if there is aParam
A
in scope atw
, two dependencyRule
s that consumeA
(transitively) can be used to satisfy thoseGet
s. Only consuming aParam
as a positional argument removes it from scope.
- For example, for a
provided_params
- When deciding whether oneRule
can use anotherRule
to provide the output type of aGet
, a constraint is applied that the candidate depedency must (transitively) consume theParam
that is provided by theGet
.- For example: if a
Rule
z
has aGet(A, B)
, onlyRule
s that compute anA
and (transitively) consume aB
are eligible to be used. This also means that aParam
A
which is already in scope forRule
z
is not eligible to be used, because it would trivially not consumeB
.
- For example: if a
Implementation
As of 3a188a1e06, we construct a RuleGraph
using a combination of data flow analysis and some homegrown (and likely problematic: see the "Issue Overview") node splitting on the call graph of Rule
s.
The construction algorithm is broken up into phases:
- initial_polymorphic - Builds a polymorphic graph while computing an "out-set" for each node in the graph by accounting for which
Param
s are available at each use site. During this phase, nodes may have multiple dependency edges perDependencyKey
, which is what makes them "polymorphic". Each of the possible ways to compute a dependency will likely have different inputParam
requirements, and each node in this phase represents all of those possibilities. - live_param_labeled - Run live variable analysis on the polymorphic graph to compute the initial "in-set" of
Params
used by each node in the graph. Because nodes in the polymorphic graph have references to all possible sources of a particular dependency type, the computed set is conservative (i.e., overly large).- For example: if a
Rule
x
has exactly oneDependencyKey
, but there are two potential dependencies to provide thatDependencyKey
with inputParam
s{A,B}
and{B,C}
(respectively), then at this phase the inputParam
s forx
must be the union of all possibilities:{A,B,C}
. - If we were to stop
RuleGraph
construction at this phase, it would be necessary to do a form of dynamic dispatch at runtime to decide which source of a dependency to use based on theParam
s that were currently in scope. And the sets ofParam
s used in the memoization key for eachRule
would still be overly large, causing excess invalidation.
- For example: if a
- monomorphize - "Monomorphize" the polymorphic graph by using the out-set of available
Param
s (initialized duringinitial_polymorphic
) and the in-set of consumedParam
s (computed duringlive_param_labeled
) to partition nodes (and their dependees) for each valid combination of their dependencies. Combinations of dependencies that would be invalid (see the Constraints section) are not generated, which causes some pruning of the graph to happen during this phase.- Continuing the example from above: the goal of monomorphize is to create one copy of
Rule
x
per legal combination of itsDependencyKey
. Assuming that both ofx
's dependencies remain legal (i.e. that all of{A,B,C}
are still in scope in the dependees ofx
, etc), then two copies ofx
will be created: one that uses the first dependency and has an in-set of{A,B}
, and another that uses the second dependency and has an in-set of{B,C}
.
- Continuing the example from above: the goal of monomorphize is to create one copy of
- prune_edges - Once the monomorphic graph has converged, each node in the graph will ideally have exactly one source of each
DependencyKey
(with the exception ofQuery
s, which are not monomorphized). This phase validates that, and chooses the smallest inputParam
set to use for eachQuery
. In cases where a node has more that one dependency perDependencyKey
, it is because given a particular set of inputParams
there was more than one valid way to compute a dependency. This can happen either because there were too manyParam
s in scope, or because there were multipleRule
s with the sameParam
requirements.- This phase is the only phase that renders errors: all of the other phases mark nodes and edges "deleted" for particular reasons, and this phase consumes that record. A node that has been deleted indicates that that node is unsatisfiable for some reason, while an edge that has been deleted indicates that the source node was not able to consume the target node for some reason.
- If a node has too many sources of a
DependencyKey
, this phase will recurse to attempt to locate the node in theRule
graph where the ambiguity was introduced. Likewise, if a node has no source of aDependencyKey
, this phase will recurse on deleted nodes (which are preserved by the other phases) to attempt to locate the bottom-mostRule
that was missing aDependencyKey
.
- finalize - After
prune_edges
the graph is known to be valid, and this phase generates the final staticRuleGraph
for allRule
s reachable fromQuery
s.