File system
How to safely interact with the file system in your plugin.
It is not safe to use functions like open() or the non-pure operations of pathlib.Path like you normally might: this will break caching because they do not hook up to Pants's file watcher.
Instead, Pants has several mechanisms to work with the file system in a safe and concurrent way.
If it would help you to have a certain file operation, please let us know by either opening a new GitHub issue or by messaging us on Slack in the #plugins room.
Core abstractions: Digest and Snapshot
The core building block is a Digest, which is a lightweight reference to a set of files known about by the engine.
- The Digest is only a reference; the files are stored in the engine's persistent content-addressable storage (CAS).
- The files do not need to actually exist on disk.
- Every file uses a relative path. This allows the Digest to be passed around in different environments safely, such as running in a temporary directory locally or running through remote execution.
- The files may be binary files and/or text files.
- The Digest may refer to 0 to n files. If it's empty, the digest will be equal to pants.engine.fs.EMPTY_DIGEST.
- You will never create a Digest directly in rules, only in tests. Instead, you get a Digest by using CreateDigest or PathGlobs, or by using the output_digest from a Process that you've run.
Most of Pants's operations with the file system either accept a Digest as input or return a Digest. For example, when running a Process, you may provide a Digest as input.
A Snapshot composes a Digest and adds the useful properties files: tuple[str, ...] and dirs: tuple[str, ...], which store the sorted file names and directory names, respectively. For example:
Snapshot(
    digest=Digest(
        fingerprint="21bcd9fcf01cc67e9547b7d931050c1c44d668e7c0eda3b5856aa74ad640098b",
        serialized_bytes_length=162,
    ),
    files=("f.txt", "grandparent/parent/c.txt"),
    dirs=("grandparent", "grandparent/parent"),
)
A Snapshot is useful when you want to know which files a Digest refers to. For example, when running a tool, you might set argv=snapshot.files, and then pass snapshot.digest to the Process so that it has access to those files.
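To make the relationship between files and dirs concrete, here is a minimal sketch in plain Python. It is not Pants's implementation: files_and_dirs is a hypothetical helper that derives both sorted tuples from a list of relative file paths, and it assumes there are no empty directories (a real Snapshot may also list those).

```python
from pathlib import PurePosixPath


def files_and_dirs(paths):
    """Return (files, dirs) as sorted tuples, given relative file paths.

    Hypothetical helper for illustration only; not part of Pants.
    Assumes no empty directories are present.
    """
    files = tuple(sorted(paths))
    dirs = set()
    for path in files:
        # Every ancestor of a file is a directory, except "." itself.
        for parent in PurePosixPath(path).parents:
            if str(parent) != ".":
                dirs.add(str(parent))
    return files, tuple(sorted(dirs))


files, dirs = files_and_dirs(["grandparent/parent/c.txt", "f.txt"])
print(files)  # ('f.txt', 'grandparent/parent/c.txt')
print(dirs)   # ('grandparent', 'grandparent/parent')
```

The output matches the files and dirs values in the Snapshot shown earlier.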
Given a Digest, you may use the engine to enrich it into a Snapshot:
from pants.engine.fs import Digest, Snapshot
from pants.engine.rules import Get, rule

@rule
async def demo(...) -> Foo:
    ...
    snapshot = await Get(Snapshot, Digest, my_digest)
CreateDigest: create new files
CreateDigest allows you to create a new digest with whichever files you would like, even if they do not exist on disk.
from pants.engine.fs import CreateDigest, Digest, FileContent
from pants.engine.rules import Get, rule

@rule
async def demo(...) -> Foo:
    ...
    digest = await Get(Digest, CreateDigest([FileContent("f1.txt", b"hello world")]))
The CreateDigest constructor expects an iterable including any of these types:
- FileContent objects, which represent a file to create. It takes a path: str parameter, a content: bytes parameter, and an optional is_executable: bool parameter with a default of False.
- Directory objects, which can be used to create empty directories. It takes a single parameter: path: str. You do not need to use this when creating a file inside a certain directory; this is only to create empty directories.
- FileEntry objects, which are handles to existing files from DigestEntries. Do not manually create these.
This does not write the Digest to the build root. Use Workspace.write_digest() for that.
PathGlobs: read from filesystem
PathGlobs allows you to read from the local file system using globbing, that is, patterns with wildcard characters that match sets of filenames.
from pants.engine.fs import Digest, PathGlobs
from pants.engine.rules import Get, rule

@rule
async def demo(...) -> Foo:
    ...
    digest = await Get(Digest, PathGlobs(["**/*.txt", "!ignore_me.txt"]))
- All globs must be relative paths, relative to the build root.
- PathGlobs uses the same syntax as the sources field, which is roughly Git's syntax. Use * for globs over just the current working directory, ** for recursive globs over everything below the current working directory (at any level), and prefix with ! for ignores.
- PathGlobs will ignore all values from the global option pants_ignore.
By default, the engine will no-op for any globs that are unmatched. If you want to instead warn or error, set glob_match_error_behavior=GlobMatchErrorBehavior.warn or GlobMatchErrorBehavior.error. This will require that you also set description_of_origin, which is a human-friendly description of where the PathGlobs is coming from so that the error message is helpful. For example:
from pants.engine.fs import GlobMatchErrorBehavior, PathGlobs

PathGlobs(
    globs=[shellcheck.options.config],
    glob_match_error_behavior=GlobMatchErrorBehavior.error,
    description_of_origin="the option `--shellcheck-config`",
)
If you set glob_match_error_behavior, you may also want to set conjunction. By default, it is enough for one glob to match. If you set conjunction=GlobExpansionConjunction.all_match, then all globs must match or the engine will warn or error. For example, this would fail, even if the config file existed:
from pants.engine.fs import GlobExpansionConjunction, GlobMatchErrorBehavior, PathGlobs

PathGlobs(
    globs=[shellcheck.options.config, "does_not_exist.txt"],
    glob_match_error_behavior=GlobMatchErrorBehavior.error,
    conjunction=GlobExpansionConjunction.all_match,
    description_of_origin="the option `--shellcheck-config`",
)
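The difference between the two conjunction modes can be sketched as a toy model in plain Python. This is not how the engine implements it: unmatched_globs_fail and the ".shellcheckrc" path are hypothetical, and the strings "any_match" and "all_match" merely stand in for the GlobExpansionConjunction values.

```python
def unmatched_globs_fail(matches_per_glob, conjunction="any_match"):
    """Return True if this toy model would warn or error.

    matches_per_glob maps each glob to the list of files it matched.
    Hypothetical helper for illustration only; not part of Pants.
    """
    matched = [bool(files) for files in matches_per_glob.values()]
    if not matched:
        # No globs at all: nothing can be unmatched.
        return False
    if conjunction == "all_match":
        # Every glob must match at least one file.
        return not all(matched)
    # Default ("any_match"): one matching glob is enough.
    return not any(matched)


# The config file exists, but the second glob matches nothing.
results = {".shellcheckrc": [".shellcheckrc"], "does_not_exist.txt": []}
print(unmatched_globs_fail(results))               # False
print(unmatched_globs_fail(results, "all_match"))  # True
```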
If you only need to resolve the file names, and don't actually need to use the file content, you can use await Get(Paths, PathGlobs) instead of await Get(Digest, PathGlobs) or await Get(Snapshot, PathGlobs). This will avoid "digesting" the files to the LMDB Store cache as a performance optimization. Paths has two properties: files: tuple[str, ...] and dirs: tuple[str, ...].
import logging

from pants.engine.fs import Paths, PathGlobs
from pants.engine.rules import Get, rule

logger = logging.getLogger(__name__)

@rule
async def demo(...) -> Foo:
    ...
    paths = await Get(Paths, PathGlobs(["**/*.txt", "!ignore_me.txt"]))
    logger.info(paths.files)
DigestContents: read contents of files
DigestContents allows you to get the file contents from a Digest.
import logging

from pants.engine.fs import Digest, DigestContents
from pants.engine.rules import Get, rule

logger = logging.getLogger(__name__)

@rule
async def demo(...) -> Foo:
    ...
    digest_contents = await Get(DigestContents, Digest, my_digest)
    for file_content in digest_contents:
        logger.info(file_content.path)
        logger.info(file_content.content)  # This will be `bytes`.
The result will be a sequence of FileContent objects, which each have a property path: str and a property content: bytes. You may want to call content.decode() to convert to str.
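As a small illustration of the decoding step, the sketch below uses a hypothetical stand-in class, FakeFileContent, so that it runs without Pants installed. Only bytes.decode() and its errors parameter are real Python; the UTF-8 encoding is an assumption about the files being read.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeFileContent:
    """Hypothetical stand-in with the same path/content fields as FileContent."""

    path: str
    content: bytes


file_content = FakeFileContent("f1.txt", "héllo wörld\n".encode("utf-8"))

# Decode the raw bytes to str; this assumes the file is UTF-8 encoded.
text = file_content.content.decode("utf-8")
lines = text.splitlines()

# For bytes that may not be valid UTF-8, errors="replace" substitutes
# U+FFFD instead of raising UnicodeDecodeError.
lenient = b"caf\xff".decode("utf-8", errors="replace")
```

Decoding is only appropriate for text files; binary content should stay as bytes.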
Only use DigestContents if you need to read and operate on the content of files directly in your rule.
- If you are running a Process, you only need to pass the Digest as input and that process will be able to read all the files in its environment. If you only need a list of files included in the digest, use Get(Snapshot, Digest).
- If you just need to manipulate the directory structure of a Digest, such as renaming files, use DigestEntries with CreateDigest or use AddPrefix and RemovePrefix. These avoid reading the file content into memory.
DigestContents does not have a way to represent empty directories in a Digest since it is only a sequence of FileContent objects. That is, passing the FileContent objects to CreateDigest will not result in the original Digest if there were empty directories in that original Digest. Use DigestEntries instead if your rule needs to handle empty directories in a Digest.
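The loss of empty directories can be sketched with a toy model in plain Python. It does not use Pants's types: the digest is modeled as a hypothetical dict from relative path to bytes, with None marking an empty directory, and files_only is an illustrative helper mimicking a view that carries files only.

```python
def files_only(entries):
    """Keep only the file entries, dropping empty directories.

    Hypothetical helper for illustration only; not part of Pants.
    """
    return {path: content for path, content in entries.items() if content is not None}


# Toy "digest": one file plus one empty directory (marked with None).
original = {"src/f.txt": b"hello", "empty_dir": None}

# Round trip through the files-only view, then rebuild from it.
rebuilt = dict(files_only(original))

print(rebuilt == original)     # False
print("empty_dir" in rebuilt)  # False
```

A representation that keeps directory entries alongside file entries, as the text says DigestEntries does, avoids this loss.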