Configuration
Compared to other tools that help you replicate Ethereum state locally, Attestate’s Crawler comes with a wide range of configuration options. This page serves as a reference for all of them.
Overview
There are two files used to configure the crawler:

- a .env file that defines all environment variables. We’ve built the crawler such that it only contains variables that aren’t supposed to be checked into version control, e.g. your Infura API key.
- a config.mjs file that contains a shareable set of crawler instructions.

First, let’s walk you through the .env file.
Environment Variables
@attestate/crawler internally uses dotenv to automatically load environment variables from a .env file in the project root into Node.js’s process.env object. However, if necessary, environment variables that have already been set in, e.g., the .env file can be overridden by passing them before the invoking command.
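To illustrate, here is a minimal sketch of that loading behavior; the crawler performs this step internally, so you normally never call dotenv yourself. The file name check-env.mjs is hypothetical:

// check-env.mjs (hypothetical): loads .env from the project root into
// process.env. Variables already present in the shell environment take
// precedence, which is why passing them before the invoking command
// overrides the values from .env.
import "dotenv/config";

console.log(process.env.RPC_HTTP_HOST);

Running, e.g., RPC_HTTP_HOST=https://example.com node check-env.mjs would hence print the shell-provided value instead of the one defined in .env.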
The following environment variables are required for @attestate/crawler to run:

- RPC_HTTP_HOST describes the host that Ethereum JSON-RPC extraction requests are made against. It must be set to the URL of an Ethereum full node’s JSON-RPC endpoint and must start with https://; the ws:// and wss:// prefixes are currently not supported. We support URLs that include the API’s bearer token, as is the case with, e.g., Infura or Alchemy.
- RPC_API_KEY is the API key for the host that extraction requests are made against. It must be set if the Ethereum full node is provisioned behind an HTTP proxy that requires bearer token authorization via the HTTP Authorization header. In this case, the header is structurally set as follows: Authorization: Bearer ${RPC_API_KEY}.
- DATA_DIR is the directory that stores all results from the crawler’s extraction and transformation phases. It must be set to a file system path (relative or absolute).
- IPFS_HTTPS_GATEWAY describes the host that IPFS extraction requests are made against. A list of publicly accessible IPFS gateways can be found here.
- IPFS_HTTPS_GATEWAY_KEY is the API key for the IPFS host that extraction requests are made against. It must be set if the IPFS node is provisioned behind an HTTP proxy that requires bearer token authorization via the HTTP Authorization header. In this case, the header is structurally set as follows: Authorization: Bearer ${IPFS_HTTPS_GATEWAY_KEY}.
- ARWEAVE_HTTPS_GATEWAY describes the host that Arweave extraction requests are made against. A commonly-used Arweave gateway is https://arweave.net.
Note
In some cases, you may only work with Ethereum; the crawler will, however, complain if, e.g., IPFS_HTTPS_GATEWAY isn’t defined. Also, some Ethereum full node providers append the API key to the RPC_HTTP_HOST URI. In those cases it is sufficient to define those variables as an empty string: RPC_API_KEY="".
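Putting the above together, a .env file could look like the following sketch; all values are placeholders, not working credentials:

# .env (placeholder values only)
RPC_HTTP_HOST=https://<your-node-provider>/<your-api-key>
RPC_API_KEY=""
DATA_DIR=data
IPFS_HTTPS_GATEWAY=https://ipfs.io
IPFS_HTTPS_GATEWAY_KEY=""
ARWEAVE_HTTPS_GATEWAY=https://arweave.net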
Configuration File
The configuration file is a .mjs file that, using ESM’s export default, exports a list of tasks to be run. It is normatively specified as a JSON schema in the code base. The @attestate/crawler repository always contains a pre-defined config.mjs file that you can copy.
Structurally, it is defined as follows:
export default {
  // First, there is the crawl path property (an array of tasks), which we'll
  // describe later in this section.
  path: [
    "..."
  ],
queue: {
options: {
// The queue's concurrency controls how many requests are sent to
// external resources concurrently.
concurrent: 100
},
},
  // In case an external resource implements rate limiting, the crawler can be
  // rate limited here so as not to exceed any of those limits.
endpoints: {
["https://ipfs.io"]: {
timeout: 10_000,
requestsPerUnit: 25,
unit: "second",
},
}
};
Structurally, the path property looks as follows:
const path = [
{
name: "Task #1",
extractor: { /* ... */ },
transformer: { /* ... */ },
loader: { /* ... */ },
},
{
name: "Task #2",
"...": "..."
}
]
The crawler implements an Extract, Transform and Load (ETL) stage separation, which is reflected in the names of a task’s phases. Attestate Crawler executes them sequentially in order: (1) extraction, (2) transformation, (3) loading.
Below is a fully configured crawl path that fetches all Ethereum block logs within a range from start=0 to end=1 (extractor.args). The output of the requests is stored in extractor.output.path, using the pre-configured DATA_DIR environment variable.
import { resolve } from "path";
import { env } from "process";

const path = [
{
name: "call-block-logs",
extractor: {
module: {
        // NOTE: By convention, an extractor module must always implement an
        // init and an update function.
        init: (arg1, arg2 /*, ... */) => { /* ... */ },
update: (message) => { /* ... */ },
},
// NOTE: The arguments are passed into the module's init function
args: [0, 1],
output: {
// NOTE: An output path is defined to persist the extractor's
// requests.
path: resolve(env.DATA_DIR, "call-block-logs-extraction"),
},
},
"...": "..."
},
];
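To make the init and update convention more concrete, below is a purely hypothetical extractor module. The message shape (type, method, params) and the { messages, write } return value are assumptions for illustration only; the actual worker message format is defined by @attestate/crawler and may differ.

// extractor.mjs: hypothetical sketch, not the module shipped with the crawler.
export function init(start, end) {
  // Emit one JSON-RPC request per block in the configured [start, end] range.
  const messages = [];
  for (let block = start; block <= end; block++) {
    const tag = "0x" + block.toString(16);
    messages.push({
      type: "json-rpc",
      method: "eth_getLogs",
      params: [{ fromBlock: tag, toBlock: tag }],
    });
  }
  return { messages, write: null };
}

export function update(message) {
  // Persist each response as one line in extractor.output.path and schedule
  // no follow-up requests.
  return { messages: [], write: JSON.stringify(message.results) };
}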
Upon completing extraction, a transformation is scheduled to filter events by the EIP-20/EIP-721 transfer signature. A transformer’s module consists of a single function onLine(line) that is invoked for each line of the transformer.input.path. The input’s path is set to the data we extracted previously in the extraction phase.
/*
* NOTE: After the extraction phase, we're filtering all events by topics.
* We're generating the transfer event's signature using the keccak256 hash
* function.
*
* keccak256("Transfer(address,address,uint256)") == "0xddf...";
*/
const topic0 =
"0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef";
const path = [
{
name: "call-block-logs",
"...": "...",
transformer: {
module: {
// NOTE: onLine gets invoked for each line in `input.path`.
onLine: line => { /* ... */ },
},
args: [topic0],
// NOTE: A transformer always requires an `input.path` and `output.path`
// property to be present.
input: {
path: resolve(env.DATA_DIR, "call-block-logs-extraction"),
},
output: {
path: resolve(env.DATA_DIR, "call-block-logs-transformation"),
},
},
"...": "...",
}
];
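As a rough sketch, an onLine implementation for this filter could look as follows. It assumes each input line is a JSON-encoded array of log objects and that returning an empty value drops the line; the actual contract depends on what the extractor wrote and on the crawler’s transformer interface.

// Hypothetical onLine sketch, not the module shipped with the crawler.
function onLine(line) {
  const logs = JSON.parse(line);
  // Keep only logs whose first topic matches the Transfer event signature.
  const transfers = logs.filter((log) => log.topics[0] === topic0);
  if (transfers.length === 0) return "";
  return JSON.stringify(transfers);
}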
Upon completion of the transformation step, the loading phase is initiated. In it, the transformation’s output is loaded into LMDB. For that, a strategy must implement a loader.module.direct and a loader.module.order generator function. These functions must allocate a key-value relationship for each of the data points:

- direct()’s yielded key must be globally unique (like a primary key).
- order()’s yielded key must be unique and totally, lexicographically orderable.
const path = [
{
name: "call-block-logs",
loader: {
module: {
direct: function* (line) {
const log = JSON.parse(line);
// NOTE: To access a transaction directly by its identifier, in
// `direct` we select ``transactionHash`` as the key for the entire
// `log`.
yield {
key: log.transactionHash,
value: log
}
},
order: function* (line) {
const log = JSON.parse(line);
// NOTE: Since we want to get an ordered list of events, we
// construct a total order of transactions of their block number
// and their "height" within a block (`transactionIndex`).
// LMDB will then allow us to do a range call on these keys to
// instantly retrieve them orderly.
yield {
key: [log.blockNumber, log.transactionIndex],
value: log.transactionHash
}
},
},
input: {
        path: resolve(env.DATA_DIR, "call-block-logs-transformation"),
},
output: {
path: resolve(env.DATA_DIR, "call-block-logs-loader"),
}
}
},
];
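Once loading has finished, the ordered keys make range queries cheap. The sketch below uses the lmdb npm package directly to illustrate the idea; how @attestate/crawler lays out the direct and order entries inside its database is an implementation detail, so the path and key layout here are assumptions.

import { open } from "lmdb";
import { resolve } from "path";
import { env } from "process";

// Assumption: the loader's output directory can be opened as an LMDB
// environment holding the keys yielded by order().
const db = open({ path: resolve(env.DATA_DIR, "call-block-logs-loader") });

// Range query over [blockNumber, transactionIndex] keys: all transaction
// hashes from block 0 up to (but excluding) block 16, already sorted.
for (const { key, value } of db.getRange({ start: [0], end: [16] })) {
  console.log(key, value);
}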
And that’s all! A full configuration of the Attestate crawler can be found on GitHub.