Graph Analytics

ArangoGraph offers Graph Analytics Engines to run graph algorithms on your data separately from your ArangoDB deployments.

Graph analytics is a branch of data science that deals with analyzing information networks known as graphs, and with extracting information from the data relationships. It ranges from basic measures that characterize graphs, through well-known algorithms such as PageRank, to more complex algorithms. Common use cases include fraud detection, recommender systems, and network flow analysis.

ArangoDB offers a feature for running algorithms on your graph data, called Graph Analytics Engines (GAEs). It is available on request for the ArangoGraph Insights Platform and included in the ArangoDB Platform.

Key features:

  • Separation of storage and compute: GAEs let you run graph analytics independently of your ArangoDB Core, including on dedicated machines optimized for compute tasks. This separation of OLAP and OLTP workloads avoids affecting the performance of the transaction-oriented database system.

  • Fast data loading: You can easily and efficiently import graph data from ArangoDB and export results back to ArangoDB.

  • In-memory processing: All imported data is held and processed in the main memory of the compute machines for very fast execution of graph algorithms such as connected components, label propagation, and PageRank.

Workflow

The following lists outline how you can use Graph Analytics Engines (GAEs) in the ArangoDB Platform and in ArangoGraph, respectively. How to perform the steps is detailed in the subsequent sections.

  1. Determine the approximate size of the data that you will load into the GAE and ensure the machine to run the engine on has sufficient memory. The data as well as the temporarily needed space for computations and results needs to fit in memory.
  2. Start a graphanalytics service via the GenAI service that manages various Platform components for graph intelligence and machine learning. It only takes a few seconds until the engine service can be used. The engine runs adjacent to the pods of the ArangoDB Core.
  3. Load graph data from the ArangoDB Core into the engine. You can load named graphs or sets of node and edge collections. This loads the edge information and a configurable subset of the node attributes.
  4. Run graph algorithms on the data. You only need to load the data once per engine and can then run various algorithms with different settings.
  5. Write the computation results back to the ArangoDB Core.
  6. Stop the engine service once you are done.

In ArangoGraph, before you can use Graph Analytics Engines, you need to request the feature via Request help in the ArangoGraph dashboard for a deployment.

The deployment needs to use AWS as the cloud provider.

Single server deployments using ArangoDB version 3.11 are not supported.

  1. Determine the approximate size of the data that you will load into the GAE to select an engine size with sufficient memory. The data as well as the temporarily needed space for computations and results needs to fit in memory.
  2. Deploy an engine of the desired size and of type gral. It only takes a few seconds until the engine can be used. The engine runs adjacent to a particular ArangoGraph deployment.
  3. Load graph data from ArangoDB into the engine. You can load named graphs or sets of node and edge collections. This loads the edge information and a configurable subset of the node attributes.
  4. Run graph algorithms on the data. You only need to load the data once per engine and can then run various algorithms with different settings.
  5. Write the computation results back to ArangoDB.
  6. Delete the engine once you are done.
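
The following Bash sketch chains these steps for ArangoGraph using the APIs described in the sections below. All IDs, names, and the endpoint URL are placeholders, error handling is omitted, and it assumes that Auto login to database UI is enabled so that the same access token works for both the management API and the engine API (see Authentication):

# Hypothetical end-to-end sketch; see the following sections for the individual calls
ARANGO_GRAPH_TOKEN="$(oasisctl login --key-id "<AG_KEY_ID>" --key-secret "<AG_KEY_SECRET>")"
BASE_URL='https://123456abcdef.arangodb.cloud:8829/graph-analytics/api/graphanalytics/v1'

# Deploy an engine of type gral and size e4, capturing its ID
ENGINE_ID=$(curl -s -H "Authorization: bearer $ARANGO_GRAPH_TOKEN" -X POST -d '{"type_id":"gral","size_id":"e4"}' "$BASE_URL/engines" | jq -r '.id')
ENGINE_URL="https://123456abcdef.arangodb.cloud:8829/graph-analytics/engines/$ENGINE_ID"

# Load a named graph, run PageRank, and store the results back to ArangoDB.
# The graph and job IDs (234 and 123 here) come from the respective responses;
# in practice, wait for each job to finish before issuing the next call.
curl -s -H "Authorization: bearer $ARANGO_GRAPH_TOKEN" -X POST -d '{"database":"_system","graph_name":"connectedComponentsGraph"}' "$ENGINE_URL/v1/loaddata"
curl -s -H "Authorization: bearer $ARANGO_GRAPH_TOKEN" -X POST -d '{"graph_id":234,"damping_factor":0.85,"maximum_supersteps":500}' "$ENGINE_URL/v1/pagerank"
curl -s -H "Authorization: bearer $ARANGO_GRAPH_TOKEN" -X POST -d '{"database":"_system","target_collection":"coll","job_ids":[123],"attribute_names":["rank"]}' "$ENGINE_URL/v1/storeresults"

# Delete the engine once you are done
curl -s -H "Authorization: bearer $ARANGO_GRAPH_TOKEN" -X DELETE "$BASE_URL/engines/$ENGINE_ID"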

Authentication

In the ArangoDB Platform, you can use any of the supported authentication methods to start and stop graphanalytics services via the GenAI service, as well as to authenticate requests to the Engine API:

  • HTTP Basic Authentication
  • Access tokens
  • JWT session tokens

In ArangoGraph, the Management API for deploying and deleting engines requires an ArangoGraph API key. See Generating an API Key on how to create one.

You then need to generate an access token using the API key. See Authenticating with Oasisctl on how to do so using oasisctl login.

The Engine API uses one of two authentication methods, depending on the auto login to database UI setting in ArangoGraph:

  • Enabled: You can use an ArangoGraph access token created with an API key (see above), allowing you to use one token for both the Management API and the Engine API.
  • Disabled: You need to use a JWT user token created from ArangoDB credentials. These session tokens need to be renewed every hour by default. See HTTP API Authentication for details.

Start and stop Graph Analytics Engines

The interface for managing the engines depends on the environment you use:

GenAI service (ArangoDB Platform)

GAEs are deployed and deleted via the GenAI service in the ArangoDB Platform.

If you use cURL, you need to use the -k / --insecure option for requests if the Platform deployment uses a self-signed certificate (default).

Start a graphanalytics service

POST <PLATFORM_BASEURL>/gen-ai/v1/graphanalytics

Start a GAE via the GenAI service with an empty request body:

# Example with a JWT session token
ADB_TOKEN=$(curl -sSk -d '{"username":"root", "password": ""}' -X POST https://127.0.0.1:8529/_open/auth | jq -r .jwt)

Service=$(curl -sSk -H "Authorization: bearer $ADB_TOKEN" -X POST https://127.0.0.1:8529/gen-ai/v1/graphanalytics)
ServiceID=$(echo "$Service" | jq -r ".serviceInfo.serviceId")
if [[ "$ServiceID" == "null" ]]; then 
  echo "Error starting gral engine"
else
  echo "Engine started successfully"
fi
echo "$Service" | jq

List the services

POST <PLATFORM_BASEURL>/gen-ai/v1/list_services

You can list all running services managed by the GenAI service, including the graphanalytics services:

curl -sSk -H "Authorization: bearer $ADB_TOKEN" -X POST https://127.0.0.1:8529/gen-ai/v1/list_services | jq

Stop a graphanalytics service

DELETE <PLATFORM_BASEURL>/gen-ai/v1/service/<SERVICE_ID>

Delete the desired engine via the GenAI service using the service ID:

curl -sSk -H "Authorization: bearer $ADB_TOKEN" -X DELETE https://127.0.0.1:8529/gen-ai/v1/service/$ServiceID | jq

Management API (ArangoGraph)

GAEs are deployed and deleted with the Management API for graph analytics on the ArangoGraph Insights Platform. You can also list the available engine sizes and get information about deployed engines.

To determine the base URL of the management API, use the ArangoGraph dashboard and copy the APPLICATION ENDPOINT of the deployment that holds the graph data you want to analyze. Replace the port with 8829 and append /graph-analytics/api/graphanalytics/v1, e.g. https://123456abcdef.arangodb.cloud:8829/graph-analytics/api/graphanalytics/v1.

Store the base URL in a variable called BASE_URL:

BASE_URL='https://...'
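
If you script this, you can derive the URL from the copied endpoint with Bash parameter substitution, for example (the endpoint is a placeholder):

# Deployment endpoint copied from the ArangoGraph dashboard (placeholder)
APPLICATION_ENDPOINT="https://123456abcdef.arangodb.cloud:8529"
# Replace the port with 8829 and append the management API path
BASE_URL="${APPLICATION_ENDPOINT%:*}:8829/graph-analytics/api/graphanalytics/v1"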

To authenticate requests, you need to use the following HTTP header:

Authorization: bearer <ARANGO_GRAPH_TOKEN>

You can create an ArangoGraph access token with oasisctl login. Save it in a variable to ease scripting. Note that this should be the token string only and not include quote marks. The following examples assume Bash as the shell and that the curl and jq commands are available.

ARANGO_GRAPH_TOKEN="$(oasisctl login --key-id "<AG_KEY_ID>" --key-secret "<AG_KEY_SECRET>")"

Example with cURL that uses the token variable:

curl -H "Authorization: bearer $ARANGO_GRAPH_TOKEN" "$BASE_URL/api-version"

Request and response payloads are JSON-encoded in the management API.

Get the API version

GET <BASE_URL>/api-version

Retrieve the version information of the management API.

curl -H "Authorization: bearer $ARANGO_GRAPH_TOKEN" "$BASE_URL/api-version"

List engine sizes

GET <BASE_URL>/enginesizes

List the available engine sizes. Each size is a combination of a number of cores and an amount of memory, starting at 1 CPU and 4 GiB of memory (e4).

curl -H "Authorization: bearer $ARANGO_GRAPH_TOKEN" "$BASE_URL/enginesizes"

List engine types

GET <BASE_URL>/enginetypes

List the available engine types. The only type supported for GAE workloads is called gral.

curl -H "Authorization: bearer $ARANGO_GRAPH_TOKEN" "$BASE_URL/enginetypes"

Deploy an engine

POST <BASE_URL>/engines

Set up a GAE adjacent to the ArangoGraph deployment, for example, using an engine size of e4.

The engine ID is returned in the id attribute.

curl -H "Authorization: bearer $ARANGO_GRAPH_TOKEN" -X POST -d '{"type_id":"gral","size_id":"e4"}' "$BASE_URL/engines"

List all engines

GET <BASE_URL>/engines

List all deployed GAEs of an ArangoGraph deployment.

The engine IDs are in the id attributes.

curl -H "Authorization: bearer $ARANGO_GRAPH_TOKEN" "$BASE_URL/engines"

Get an engine

GET <BASE_URL>/engines/<ENGINE_ID>

List the detailed information about a specific GAE.

ENGINE_ID="zYxWvU9876"
curl -H "Authorization: bearer $ARANGO_GRAPH_TOKEN" "$BASE_URL/engines/$ENGINE_ID"

Delete an engine

DELETE <BASE_URL>/engines/<ENGINE_ID>

Delete a GAE that is no longer needed, freeing any data it holds in memory.

ENGINE_ID="zYxWvU9876"
curl -H "Authorization: bearer $ARANGO_GRAPH_TOKEN" -X DELETE "$BASE_URL/engines/$ENGINE_ID"

Engine API

Determine the engine URL

In the ArangoDB Platform, to determine the base URL of the engine API, use the base URL of the Platform deployment and append /gral/<SERVICE_ID>, e.g. https://127.0.0.1:8529/gral/arangodb-gral-tqcge.

The service ID is returned by the call to the GenAI service for starting the graphanalytics service. You can also list the service IDs like so:

kubectl -n arangodb get svc arangodb-gral -o jsonpath="{.spec.selector.release}"

Store the base URL in a variable called ENGINE_URL:

ENGINE_URL='https://...'
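
For example, if you captured the service ID in the ServiceID variable when starting the engine (see above), a minimal sketch:

PLATFORM_BASEURL="https://127.0.0.1:8529"
ENGINE_URL="$PLATFORM_BASEURL/gral/$ServiceID"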

To authenticate requests, you need to use a bearer token in the HTTP Authorization header:

Authorization: bearer <TOKEN>

You can save the token in a variable to ease scripting. Note that this should be the token string only and not include quote marks. The following examples assume Bash as the shell and that the curl and jq commands are available.

An example of authenticating a request using cURL and a session token:

PLATFORM_BASEURL="https://127.0.0.1:8529"

ADB_TOKEN=$(curl -X POST -d "{\"username\":\"<ADB_USER>\",\"password\":\"<ADB_PASS>\"}" "$PLATFORM_BASEURL/_open/auth" | jq -r '.jwt')

curl -H "Authorization: bearer $ADB_TOKEN" "$ENGINE_URL/v1/jobs"

In ArangoGraph, to determine the base URL of the engine API, use the ArangoGraph dashboard and copy the APPLICATION ENDPOINT of the deployment that holds the graph data you want to analyze. Replace the port with 8829 and append /graph-analytics/engines/<ENGINE_ID>, e.g. https://123456abcdef.arangodb.cloud:8829/graph-analytics/engines/zYxWvU9876. If you can’t remember the engine ID, you can list all engines as described above.

Store the base URL in a variable called ENGINE_URL:

ENGINE_URL='https://...'
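
For example, assuming the engine ID from the deployment step is stored in a variable, you can derive the URL like this (endpoint and ID are placeholders):

APPLICATION_ENDPOINT="https://123456abcdef.arangodb.cloud:8529"
ENGINE_ID="zYxWvU9876"
ENGINE_URL="${APPLICATION_ENDPOINT%:*}:8829/graph-analytics/engines/$ENGINE_ID"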

To authenticate requests, you need to use a bearer token in the HTTP Authorization header:

Authorization: bearer <TOKEN>

  • If Auto login to database UI is enabled for the ArangoGraph deployment, this can be the same access token as used for the management API.
  • If it is disabled, use an ArangoDB session token (JWT user token) instead.

You can save the token in a variable to ease scripting. Note that this should be the token string only and not include quote marks. The following examples assume Bash as the shell and that the curl and jq commands are available.

An example of authenticating a request using cURL and a session token:

APPLICATION_ENDPOINT="https://123456abcdef.arangodb.cloud:8529"

ADB_TOKEN=$(curl -X POST -d "{\"username\":\"<ADB_USER>\",\"password\":\"<ADB_PASS>\"}" "$APPLICATION_ENDPOINT/_open/auth" | jq -r '.jwt')

curl -H "Authorization: bearer $ADB_TOKEN" "$ENGINE_URL/v1/jobs"

All requests to the engine API start jobs, each representing an operation. You can check the progress of a job and whether errors occurred. You can submit jobs concurrently, and they also run concurrently.
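
For example, a minimal polling sketch that checks on a job every few seconds and prints the raw response (the job ID is a placeholder; the exact response fields are described in the API reference linked below):

JOB_ID="123"
for i in {1..10}; do
  curl -sS -H "Authorization: bearer $ADB_TOKEN" "$ENGINE_URL/v1/jobs/$JOB_ID" | jq
  sleep 5
done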

You can find the API reference documentation with detailed descriptions of the request and response data structures at https://arangodb.github.io/graph-analytics.

Request and response payloads are JSON-encoded in the engine API.

Load data

POST <ENGINE_URL>/v1/loaddata

Import graph data from a database of the ArangoDB deployment. You can import named graphs as well as sets of node and edge collections (see Managed and unmanaged graphs).

curl -H "Authorization: bearer $ADB_TOKEN" -XPOST -d '{"database":"_system","graph_name":"connectedComponentsGraph"}' "$ENGINE_URL/v1/loaddata"

Run algorithms

PageRank

POST <ENGINE_URL>/v1/pagerank

PageRank is a well-known algorithm for ranking nodes in a graph: the more important a node, the higher its rank. It goes back to L. Page and S. Brin’s paper and is used to rank pages in search engines (hence the name). The algorithm runs until the execution converges. To run for a fixed number of iterations, use the maximum_supersteps parameter.

The rank of a node is a positive real number. The algorithm starts with every node having the same rank (one divided by the number of nodes) and sends its rank to its out-neighbors. The computation proceeds in iterations. In each iteration, the new rank is computed according to the formula ( (1 - damping_factor) / total number of nodes) + (damping_factor * the sum of all incoming ranks). The value sent to each of the out-neighbors is the new rank divided by the number of those neighbors, thus every out-neighbor gets the same part of the new rank.
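
Expressed as a formula (a restatement of the prose above), with damping factor d, N nodes in total, r_k(v) the rank of node v in iteration k, in(v) the in-neighbors of v, and outdeg(u) the out-degree of u:

r_{k+1}(v) = \frac{1 - d}{N} + d \sum_{u \in \mathrm{in}(v)} \frac{r_k(u)}{\mathrm{outdeg}(u)}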

The algorithm stops when at least one of the two conditions is satisfied:

  • The maximum number of iterations is reached. This is the same maximum_supersteps parameter as for the other algorithms.
  • Every node changes its rank in the last iteration by less than a certain threshold. The threshold is hardcoded to 0.0000001.

It is possible to specify an initial distribution for the node documents in your graph. To define these seed ranks / centralities, you can specify a seeding_attribute in the properties for this algorithm. If the specified field is set on a document and the value is numeric, then it is used instead of the default initial rank of 1 / numNodes.

Parameters:

  • graph_id
  • damping_factor
  • maximum_supersteps
  • seeding_attribute (optional, for seeded PageRank)

Result: the rank of each node

GRAPH_ID="234"
curl -H "Authorization: bearer $ADB_TOKEN" -XPOST -d "{\"graph_id\":$GRAPH_ID,\"damping_factor\":0.85,\"maximum_supersteps\":500,\"seeding_attribute\":\"seed_attr\"}" "$ENGINE_URL/v1/pagerank"

Weakly Connected Components (WCC)

POST <ENGINE_URL>/v1/wcc

The weakly connected component algorithm partitions a graph into maximal groups of nodes, so that within a group, all nodes are reachable from each node by following the edges, ignoring their direction.

In other words, each weakly connected component is a maximal subgraph such that there is a path between each pair of nodes, where edges may also be followed against their direction in a directed graph.

Parameters:

  • graph_id

Result: a component ID for each node. All nodes from the same component obtain the same component ID, and any two nodes from different components obtain different IDs.

GRAPH_ID="234"
curl -H "Authorization: bearer $ADB_TOKEN" -XPOST -d "{\"graph_id\":$GRAPH_ID}" "$ENGINE_URL/v1/wcc"

Strongly Connected Components (SCC)

POST <ENGINE_URL>/v1/scc

The strongly connected components algorithm partitions a graph into maximal groups of nodes, so that within a group, all nodes are reachable from each node by following the edges in their direction.

In other words, a strongly connected component is a maximal subgraph where, for every two nodes, there is a path from each of them to the other, following the edge directions. In contrast to a weakly connected component, one cannot follow edges against their direction.

Parameters:

  • graph_id

Result: a component ID for each node. All nodes from the same component obtain the same component ID, and any two nodes from different components obtain different IDs.

GRAPH_ID="234"
curl -H "Authorization: bearer $ADB_TOKEN" -XPOST -d "{\"graph_id\":$GRAPH_ID}" "$ENGINE_URL/v1/scc"

Vertex Centrality

Centrality measures help identify the most important nodes in a graph. They can be used in a wide range of applications: to identify influencers in social networks, or middlemen in terrorist networks.

There are various definitions for centrality, the simplest one being the node degree. These definitions were not designed with scalability in mind. It is probably impossible to discover an efficient algorithm which computes them in a distributed way. Fortunately there are scalable substitutions available, which should be equally usable for most use cases.

Illustration of an execution of different centrality measures (Freeman 1977)

Betweenness Centrality

POST <ENGINE_URL>/v1/betweennesscentrality

Betweenness centrality is a relatively expensive algorithm with a complexity of O(V*E), where V is the number of nodes and E is the number of edges in the graph.

It can be approximated by cheaper algorithms like LineRank, but this algorithm strives to compute accurate centrality measures.

Parameters:

  • graph_id
  • k (number of start nodes, 0 = all)
  • undirected
  • normalized
  • parallelism

Result: a centrality measure for each node

GRAPH_ID="234"
curl -H "Authorization: bearer $ADB_TOKEN" -XPOST -d "{\"graph_id\":$GRAPH_ID,\"k\":0,\"undirected\":false,\"normalized\":true}" "$ENGINE_URL/v1/betweennesscentrality"

LineRank

POST <ENGINE_URL>/v1/linerank

Another common measure is betweenness centrality: it measures the number of times a node is part of shortest paths between any pairs of nodes. For a node v, betweenness is defined as:

b(v) = \sum_{x \neq v \neq y} \frac{\sigma_{xy}(v)}{\sigma_{xy}}

where σ_{xy} represents the number of shortest paths between x and y, and σ_{xy}(v) represents the number of those paths that also pass through the node v. Intuitively, a node with a higher betweenness centrality has more information passing through it.

LineRank approximates the random walk betweenness of every node in a graph. This is the probability that someone, starting on an arbitrary node, visits this node when they randomly choose edges to visit.

The algorithm essentially builds a line graph out of your graph (it switches the nodes and edges), and then computes a score similar to PageRank. This can be considered a scalable equivalent to vertex betweenness that can be executed in a distributed way in ArangoDB. The algorithm is from the paper Centralities in Large Networks: Algorithms and Observations (U. Kang et al., 2011).

Parameters:

  • graph_id
  • damping_factor
  • maximum_supersteps

Result: the line rank of each node

GRAPH_ID="234"
curl -H "Authorization: bearer $ADB_TOKEN" -XPOST -d "{\"graph_id\":$GRAPH_ID,\"damping_factor\":0.0000001,\"maximum_supersteps\":500}" "$ENGINE_URL/v1/linerank"

Community Detection

Graphs based on real-world networks often have a community structure: it is possible to find groups of nodes such that each group is internally more densely connected than to the rest of the graph. This has many applications for analyzing networks. For example, social networks include community groups (the origin of the term, in fact) based on common location, interests, occupation, and so on.

Label Propagation

POST <ENGINE_URL>/v1/labelpropagation

Label Propagation can be used to implement community detection on large graphs.

The algorithm assigns an initial community identifier to every node in the graph using a user-defined attribute. The idea is that each node should be in the community that most of its neighbors are in at the end of the computation.

In each iteration of the computation, a node sends its current community ID to all its neighbor nodes, inbound and outbound (ignoring edge directions). After that, each node adopts the community ID it received most frequently in the last step.

It can happen that a node receives multiple most frequent community IDs. In this case, one of them is chosen either randomly or deterministically, depending on the random_tiebreak setting of the algorithm. The rules for the deterministic tiebreak are as follows:

  • If a node obtains only one community ID and its old ID (the ID of the node from the previous step) is less than the obtained ID, the old ID is kept.
  • If a node obtains more than one ID, its new ID is the lowest ID among the most frequently obtained IDs. For example, if the initial IDs are numbers and the obtained IDs are 1, 2, 2, 3, 3, then 2 is the new ID.
  • If, however, no ID arrives more than once, the new ID is the minimum of the lowest obtained IDs and the old ID. For example, if the old ID is 5 and the obtained IDs are 3, 4, 6, then the new ID is 3. If the old ID is 2, it is kept.

The algorithm runs until it converges or reaches the maximum iteration bound. It may not converge on large graphs if the synchronous variant is used.

  • Synchronous: The new community ID of a node is based on the community IDs of its neighbors from the previous iteration. With (nearly) bipartite subgraphs, this may lead to the community IDs changing back and forth in each iteration within the two halves of the subgraph.
  • Asynchronous: A node determines the new community ID using the most up-to-date community IDs of its neighbors, whether those updates occurred in the current iteration or the previous one. The order in which nodes are updated in each iteration is chosen randomly. This leads to more stable community IDs.

Parameters:

  • graph_id
  • start_label_attribute
  • synchronous
  • random_tiebreak
  • maximum_supersteps

Result: a community ID for each node

GRAPH_ID="234"
curl -H "Authorization: bearer $ADB_TOKEN" -XPOST -d "{\"graph_id\":$GRAPH_ID,\"start_label_attribute\":\"start_attr\",\"synchronous\":false,\"random_tiebreak\":false,\"maximum_supersteps\":500}" "$ENGINE_URL/v1/labelpropagation"

Attribute Propagation

POST <ENGINE_URL>/v1/attributepropagation

The attribute propagation algorithm can be used to implement community detection. It works similarly to the label propagation algorithm, but every node additionally accumulates a memory of observed labels instead of forgetting all but one label.

The algorithm assigns an initial value to every node in the graph using a user-defined attribute. The attribute value can be a list of strings to initialize the set of labels with multiple labels.

In each iteration of the computation, the following steps are executed:

  1. Each node propagates its set of labels along the edges to all direct neighbor nodes. Whether inbound or outbound edges are followed depends on an algorithm setting.
  2. Each node adds the labels it receives to its own set of labels.

The algorithm stops after a specified maximum number of iterations, or earlier as soon as no label set changes anymore.

If there are many labels and the graph is well-connected, the result set can be very large.

Parameters:

  • graph_id
  • start_label_attribute: The attribute to initialize labels with. Use "@id" to use the document IDs of the nodes.
  • synchronous: Whether synchronous or asynchronous label propagation is used.
  • backwards: Whether labels are propagated in edge direction (false) or the opposite direction (true).
  • maximum_supersteps: Maximum number of iterations.

Result: The set of accumulated labels of each node.

GRAPH_ID="234"
curl -H "Authorization: bearer $ADB_TOKEN" -XPOST -d "{\"graph_id\":$GRAPH_ID,\"start_label_attribute\":\"start_attr\",\"synchronous\":false,\"backwards\":false,\"maximum_supersteps\":500}" "$ENGINE_URL/v1/attributepropagation"

Store job results

POST <ENGINE_URL>/v1/storeresults

You need to specify the ArangoDB database and the target_collection to save the results to. Both need to exist already.
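
If the target collection does not exist yet, you can create it beforehand, for example via ArangoDB's collection API (a sketch assuming the _system database, a collection named coll, and the deployment endpoint and token variables from the previous examples):

curl -H "Authorization: bearer $ADB_TOKEN" -X POST -d '{"name":"coll"}' "$APPLICATION_ENDPOINT/_db/_system/_api/collection"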

You also need to specify a list of job_ids with one or more jobs that have run graph algorithms.

Each algorithm outputs one value per node, and you can define the target attributes to store these values in using attribute_names. It has to be a list with one attribute name for every job in the job_ids list.

You can optionally set the degree of parallelism and the batch_size for saving the data.

Parameters:

  • database
  • target_collection
  • job_ids
  • attribute_names
  • parallelism
  • batch_size

JOB_ID="123"
curl -H "Authorization: bearer $ADB_TOKEN" -X POST -d "{\"database\":\"_system\",\"target_collection\":\"coll\",\"job_ids\":[$JOB_ID],\"attribute_names\":[\"attr\"]}" "$ENGINE_URL/v1/storeresults"

List all jobs

GET <ENGINE_URL>/v1/jobs

List all active and finished jobs.

curl -H "Authorization: bearer $ADB_TOKEN" "$ENGINE_URL/v1/jobs"

Get a job

GET <ENGINE_URL>/v1/jobs/<JOB_ID>

Get detailed information about a specific job.

JOB_ID="123"
curl -H "Authorization: bearer $ADB_TOKEN" "$ENGINE_URL/v1/jobs/$JOB_ID"

Delete a job

DELETE <ENGINE_URL>/v1/jobs/<JOB_ID>

Delete a specific job.

JOB_ID="123"
curl -H "Authorization: bearer $ADB_TOKEN" -X DELETE "$ENGINE_URL/v1/jobs/$JOB_ID"

List all graphs

GET <ENGINE_URL>/v1/graphs

List all loaded sets of graph data that reside in the memory of the engine node.

curl -H "Authorization: bearer $ADB_TOKEN" "$ENGINE_URL/v1/graphs"

Get a graph

GET <ENGINE_URL>/v1/graphs/<GRAPH_ID>

Get detailed information about a specific set of graph data.

GRAPH_ID="234"
curl -H "Authorization: bearer $ADB_TOKEN" "$ENGINE_URL/v1/graphs/$GRAPH_ID"

Delete a graph

DELETE <ENGINE_URL>/v1/graphs/<GRAPH_ID>

Delete a specific set of graph data, removing it from the memory of the engine node.

GRAPH_ID="234"
curl -H "Authorization: bearer $ADB_TOKEN" -X DELETE "$ENGINE_URL/v1/graphs/$GRAPH_ID"