Getting Started
Running an Ethereum full node
To run the crawler effectively, you’ll have to either:
- make an account with an Ethereum full node provider in the cloud, or
- run a full node yourself.
When making this decision, consider that crawling Ethereum through a cloud provider is considerably slower than co-locating the crawler and the Ethereum full node on one machine. Co-location is faster because the processes communicate locally instead of over a network, which adds latency and can introduce failures.
Internally, we’ve successfully run the crawler over data sets larger than 1 TB by running it on the same machine as a fully-synced Erigon Ethereum full node. For testing and experimenting, however, Infura or Alchemy are more than sufficient!
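The node endpoint is handed to the crawler through the RPC_HTTP_HOST environment variable (see the configuration section below). The values here are only illustrative: the local address assumes Erigon’s default JSON-RPC port 8545, and the hosted URL is a placeholder for whatever endpoint your provider gives you.
# Co-located setup: a fully-synced node on the same machine
# (assumes the default JSON-RPC port 8545; adjust to your node's settings)
RPC_HTTP_HOST=http://127.0.0.1:8545
# Cloud setup: a hosted endpoint from a provider such as Infura or Alchemy
# (placeholder; copy the URL from your provider's dashboard)
RPC_HTTP_HOST=https://<your-provider-endpoint>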
Installation
First, download the source code:
# EITHER: Clone the repository if you want to use the CLI
git clone git@github.com:attestate/crawler.git
# OR: install the dependency via npm
npm install @attestate/crawler
# Copy the example .env file
# ⚠️ Be sure to update the variables in `.env` with the appropriate values!
cp .env-copy .env
# Install the dependencies
npm i
Before we can run the crawler, however, we’ll have to make sure all mandatory environment variables are set in the .env file.
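If you consume the crawler as an npm dependency instead of running crawler.mjs directly, you may need to load the .env file in your own process. The following is a minimal sketch using the dotenv package, under the assumption that your application is responsible for populating process.env; when you run crawler.mjs directly, setting the variables in .env as described here is sufficient.
// load-env.mjs — hypothetical helper, run before any crawler code
import { config } from "dotenv";

// Reads .env from the current working directory and populates process.env
config();

// The crawler's settings (e.g. RPC_HTTP_HOST, DATA_DIR) are now available
console.log(process.env.RPC_HTTP_HOST);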
Configuring environment variables
The following environment variables are required:
RPC_HTTP_HOST=https://
DATA_DIR=data
EXTRACTION_WORKER_CONCURRENCY=12
IPFS_HTTPS_GATEWAY=https://
ARWEAVE_HTTPS_GATEWAY=https://
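As a rough guide, a commented version of the same file is shown below. The one-line descriptions are inferred from the variable names, and the endpoints are placeholders; the environment variable reference linked below remains authoritative.
# JSON-RPC endpoint of your Ethereum full node or cloud provider
RPC_HTTP_HOST=https://<your-node-endpoint>
# Directory the crawler writes its data to
DATA_DIR=data
# Number of concurrent extraction workers
EXTRACTION_WORKER_CONCURRENCY=12
# HTTPS gateways used to resolve IPFS and Arweave content
IPFS_HTTPS_GATEWAY=https://<your-ipfs-gateway>
ARWEAVE_HTTPS_GATEWAY=https://<your-arweave-gateway>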
For more details on the crawler’s environment variable configuration, head over to the environment variable reference docs.
Using the command line interface
Alternatively, the crawler can be used from a UNIX-compatible command line interface. You can find the crawler.mjs file in the root of the source code directory.
Usage: crawler.mjs <options>
Options:
--help Show help [boolean]
--version Show version number [boolean]
--path Sequence of strategies that the crawler will follow. [required]
--config Configuration for CLI
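An invocation could then look as follows. The file names are hypothetical; the expected contents of the path and configuration files are described in the reference documentation.
# Print the help text shown above
node crawler.mjs --help
# Run the crawler with a path definition and a CLI configuration
# (file names are placeholders; see the reference docs for their format)
node crawler.mjs --path <path-file> --config <config-file>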