Strategy Specification
@attestate/crawler defines a strategy interface that integrators can use to custom-build strategies.
Note
If you’re interested in implementing a custom strategy, consider checking out the source of github.com/attestate/crawler-call-block-logs, a reference implementation of a strategy we try to keep continuously updated.
Extractor
An extractor strategy’s purpose is to communicate with the extraction worker by passing canonical messages back and forth. The worker implements support for a variety of data sources including IPFS, Arweave, GraphQL and JSON-RPC.
All worker messages are defined using JSON schema. To look up a message’s structure, visit this special folder in the crawler’s code base.
An extractor is an ESM module that exports a function init() and a function update(). Both must return an object that complies with the “Lifecycle Return Message” outlined below.
For every invocation of an extractor strategy, function init() is called once upon initiation.
Subsequently, for every resolved message within messages: [...], function update(message) is called.
A strategy completes when either no new messages are in the worker’s queue or when an explicit type: "exit" message is sent to a worker.
config.path[].extractor.args is an array of values that the crawler passes into an extractor’s function init(args[0], args[1], ...).
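The lifecycle described above can be illustrated with a toy driver loop. This is only a sketch under stated assumptions, not the crawler’s actual scheduler: runStrategy, fakeWorker and demoStrategy are hypothetical names, and the worker is simulated synchronously by a function that attaches a made-up results field to each message.

```javascript
// Hypothetical stand-in for the extraction worker: it "resolves" a message
// by attaching a results field. The real worker performs network requests.
function fakeWorker(message) {
  return { ...message, results: `resolved:${message.method}` };
}

// Toy driver: calls init(...args) once, then update(message) for every
// resolved message until the queue is empty or an "exit" message appears.
function runStrategy(strategy, args = []) {
  const written = [];
  const queue = [];

  // Collects the `write` value (if any) and enqueues new messages.
  const handle = (ret) => {
    if (ret.write !== null) written.push(ret.write);
    queue.push(...ret.messages);
  };

  handle(strategy.init(...args));
  while (queue.length > 0) {
    const message = queue.shift();
    if (message.type === "exit") break;
    handle(strategy.update(fakeWorker(message)));
  }
  return written;
}

// A made-up strategy that requests one message and then stops.
const demoStrategy = {
  init(start) {
    return {
      write: null,
      messages: [{ type: "json-rpc", method: `block-${start}` }],
    };
  },
  update(message) {
    return { write: message.results, messages: [] };
  },
};

const lines = runStrategy(demoStrategy, [7]);
console.log(lines);
```

Here the single element passed as args plays the role of config.path[].extractor.args, and the returned array stands in for the lines that would end up in the output file.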
Lifecycle Return Message:
Return values of function init() and function update() must be an object that contains a messages: Array property and a write: String || null property. write’s value is written directly into a flat file at config.path[].extractor.output.path (line by line). All elements in messages are forwarded to the extraction worker. They are returned to function update(message) upon completion.
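While developing a strategy, it can help to check return values against this shape. The helper below is a hypothetical sketch (isLifecycleMessage is not part of the crawler) and only verifies the two properties named above; it does not validate the individual messages against the worker’s JSON schemas.

```javascript
// Returns true if `value` has the shape described above:
// { messages: Array, write: String || null }.
function isLifecycleMessage(value) {
  if (value === null || typeof value !== "object") return false;
  const writeOk = typeof value.write === "string" || value.write === null;
  return Array.isArray(value.messages) && writeOk;
}

console.log(isLifecycleMessage({ write: null, messages: [] }));
console.log(isLifecycleMessage({ write: 42, messages: [] }));
```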
An example:
function init() {
  return {
    // NOTE: We write hello world to a new line at `output.path`.
    write: "hello world",
    // NOTE: We also fire a request to our Ethereum full node asking to
    // download the event logs from block 0 to block 1 for the DAI stablecoin
    // contract.
    messages: [
      {
        type: "json-rpc",
        method: "eth_getLogs",
        params: [
          {
            fromBlock: 0,
            toBlock: 1,
            address: "0x6B175474E89094C44Da98b954EedeAC495271d0F",
            // NOTE: This topic is the signature hash of the ERC-20
            // Transfer event.
            topics: [
              "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef",
            ],
          },
        ],
        version: "0.0.1",
        options: {
          url: "https://myfullnode.com",
        },
      },
    ],
  };
}
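To complement the init() example, here is a minimal sketch of a matching update(). It assumes that a resolved message carries the worker’s response in a results property and a failure in an error property; these field names are assumptions for illustration, so check the message JSON schemas in the crawler’s code base for the actual structure. In a real strategy module the function would be exported.

```javascript
// Sketch of an update() that persists the response of one resolved message.
// ASSUMPTION: `message.results` and `message.error` are illustrative field
// names; consult the worker's JSON schemas for the real structure.
function update(message) {
  if (message.error) {
    // Nothing to persist and no follow-up requests.
    return { write: null, messages: [] };
  }
  return {
    // One JSON document per line in the file at `output.path`.
    write: JSON.stringify(message.results),
    // An empty array schedules no further work for the worker.
    messages: [],
  };
}
```

Because update() returns an empty messages array, the strategy completes once the worker’s queue has drained.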
Transformer
A transformer’s purpose is to sanitize, rearrange and filter data on the user’s file system after an extraction. The @attestate/crawler reads the file at config.path[].input.path line by line and invokes function onLine(line) for each line.
A transformer ESM module exports a function onLine(line, ...args) that must return an object containing a property write: String. It doesn’t trigger new extraction worker messages.
config.path[].transformer.args is an array of values that the crawler passes into a transformer’s function onLine(line, args[0], args[1], ...). Note, however, that the first argument passed into function onLine(...) is always line: String, an argument passed in from the crawler itself.
Note
Consider that running a transformer on extraction results is much cheaper than re-extracting data from external sources. So when building a new strategy, instead of making the extractor fail when an API has changed, ensure that the transformer fails as it’s cheap to re-run.
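A minimal sketch of a transformer, assuming each input line is a JSON array of Ethereum log objects and that a contract address is supplied through config.path[].transformer.args. Both the input format and the address parameter are assumptions made for illustration. In a real strategy module the function would be exported.

```javascript
// Sketch of a transformer's onLine(). `address` stands in for a value
// configured via config.path[].transformer.args (hypothetical usage).
function onLine(line, address) {
  // ASSUMPTION: each input line is a JSON array of Ethereum log objects.
  // JSON.parse throws on malformed input, so a broken extraction surfaces
  // here, in the step that is cheap to re-run.
  const logs = JSON.parse(line);
  const wanted = logs.filter(
    (log) => log.address.toLowerCase() === address.toLowerCase()
  );
  return { write: JSON.stringify(wanted) };
}
```

Comparing addresses in lowercase avoids mismatches between checksummed and non-checksummed hex strings.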
Loader
A loader’s purpose is to store and arrange crawler results in an efficiently queryable data storage format.
Attestate Crawler uses LMDB to enable direct storage access and creation of indices.
A loader ESM module exports two generator functions, function* direct(line) and function* order(line).
function* order(line) must yield an object { key: any, value: any } where key is chosen and arranged such that it can be lexicographically ordered (e.g., an Ethereum transaction’s blockNumber and transactionIndex).
function* direct(line) must yield an object { key: any, value: any } where key must be unique in the entire set (e.g., an Ethereum transaction’s transactionHash).
key and value must comply with the guidelines of the LMDB documentation.
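The two generators can be sketched as follows. This assumes a made-up record format in which each line is a JSON object describing one transaction, with hex-string fields blockNumber and transactionIndex plus a transactionHash; the pad helper is hypothetical. Fixed-width padding is one way to make string comparison agree with numeric order; it is not necessarily how the reference implementation builds its keys.

```javascript
// Left-pads a non-negative integer given as a "0x..." hex string so that
// string comparison matches numeric comparison.
function pad(hex, width = 16) {
  return BigInt(hex).toString(16).padStart(width, "0");
}

// ASSUMPTION: each line is a JSON object describing one transaction.
// The key sorts by block number first, then by position within the block.
function* order(line) {
  const tx = JSON.parse(line);
  yield {
    key: `${pad(tx.blockNumber)}:${pad(tx.transactionIndex)}`,
    value: tx,
  };
}

// The transaction hash is unique across the entire set.
function* direct(line) {
  const tx = JSON.parse(line);
  yield { key: tx.transactionHash, value: tx };
}
```

Without padding, the key for block 0x10 would sort before the key for block 0x9 when compared as strings, which would break range queries over the order table.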
Internally, the Attestate Crawler will create a new LMDB instance at config.path[].loader.output.path. For each strategy, it’ll create “order” and “direct” tables using the following naming scheme: {config.path[].strategy.name}:order for order and {config.path[].strategy.name}:direct for direct.
The yielded values for function* order() and function* direct() (key and value) will be stored in these database sub-tables accordingly.