Getting Started
Running an Ethereum full node
To run the crawler effectively, you'll have to either:
make an account with an Ethereum full node provider in the cloud, or
run a full node yourself.
When making this decision, consider that crawling Ethereum through a cloud provider is considerably slower than co-locating the crawler and the Ethereum full node on one machine. Co-location is faster simply because the two processes communicate locally, avoiding the network round trips that introduce latency and failures.
Internally, we’ve successfully run the crawler over data sets larger than 1 TB by hosting it on the same machine as a fully-synced Erigon Ethereum full node. For testing and experimenting, however, Infura or Alchemy are more than sufficient!
Installation
First, download the source code:
# EITHER: Clone the repository if you want to use the CLI
git clone git@github.com:attestate/crawler.git
# OR: install the dependency via npm
npm install @attestate/crawler
# Copy the example .env file
# ⚠️ Be sure to update the variables in `.env` with the appropriate values!
cp .env-copy .env
# Install the dependencies
npm i
Before we can run the crawler, however, we’ll have to make sure all mandatory environment variables are set in the .env file.
Configuring environment variables
The following environment variables are required:
RPC_HTTP_HOST=https://
DATA_DIR=data
IPFS_HTTPS_GATEWAY=https://
ARWEAVE_HTTPS_GATEWAY=https://
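For illustration, a filled-in .env for a hosted setup could look like the following. The endpoints are placeholders, not recommendations: replace them with your own RPC provider URL and whatever IPFS and Arweave HTTPS gateways you trust.
# Placeholder values only; substitute your own RPC endpoint and gateways
RPC_HTTP_HOST=https://mainnet.infura.io/v3/<your-project-id>
DATA_DIR=data
IPFS_HTTPS_GATEWAY=https://ipfs.io/ipfs/
ARWEAVE_HTTPS_GATEWAY=https://arweave.net/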
For more details on the crawler’s environment variable configuration, head over to the environment variable reference docs.
Using the command line interface
Alternatively, the crawler can be used from a UNIX-compatible command line interface. You can find the crawler.mjs file in the root of the source code directory.
crawler.mjs [command]
Commands:
crawler.mjs run [path] Run a crawl given a path
crawler.mjs range [path] [table] [key] Query an LMDB key range in a table
Options:
--help Show help [boolean]
--version Show version number [boolean]
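For example, a crawl is started by pointing the run command at a crawl path module. The ./config.mjs path below is only a placeholder for wherever your crawl path file lives:
# Show the available commands and options
node crawler.mjs --help
# Run a crawl with a crawl path module (placeholder path)
node crawler.mjs run ./config.mjs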
Configuring Your First Crawl
Next, to run a crawl, configure the crawl path.
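The exact structure of a crawl path is covered in the reference documentation. As a rough, hypothetical sketch of the idea (every property name below is a placeholder, not the actual API), a crawl path module describes the steps the crawler should execute:
// ⚠️ Illustrative sketch only: consult the reference docs for the real
// schema; all names below are placeholders, not the crawler's API.
export default [
  {
    name: "my-first-phase",
    extractor: { /* which on-chain data to download */ },
    transformer: { /* how to normalize the extracted data */ },
    loader: { /* how to write results into the LMDB tables */ },
  },
];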