This describes neo4j-embedded, a Python library that lets you use the embedded Neo4j database in Python.
Tutorials
This describes how to get started with Neo4j embedded in Python. See Reference Documentation for the full reference.
You need the neo4j-embedded Python library installed to try these examples; see Installation.
Hello, world!
Here is a simple example to get you started.
from neo4j import GraphDatabase
# Create a database
db = GraphDatabase(folder_to_put_db_in)
# All write operations happen in a transaction
with db.transaction:
    firstNode = db.node(name='Hello')
    secondNode = db.node(name='world!')
    # Create a relationship with type 'knows'
    relationship = firstNode.knows(secondNode, name='graphy')
# Read operations can happen anywhere
message = ' '.join([firstNode['name'], relationship['name'], secondNode['name']])
print(message)
# Delete the data
with db.transaction:
    firstNode.knows.single.delete()
    firstNode.delete()
    secondNode.delete()
# Always shut down your database when your application exits
db.shutdown()
A sample app using cypher and indexes
This example shows you how to get started building something like a simple invoice tracking application with Neo4j.
We start out by importing Neo4j and creating some metadata that we will use to organize our actual data.
from neo4j import GraphDatabase, INCOMING, Evaluation
# Create a database
db = GraphDatabase(folder_to_put_db_in)
# All write operations happen in a transaction
with db.transaction:
    # A node to connect customers to
    customers = db.node()
    # A node to connect invoices to
    invoices = db.node()
    # Connected to the reference node, so
    # that we can always find them.
    db.reference_node.CUSTOMERS(customers)
    db.reference_node.INVOICES(invoices)
    # An index, helps us rapidly look up customers
    customer_idx = db.node.indexes.create('customers')
Domain logic
Then we define some domain logic that we want our application to be able to perform. Our application has two domain objects, Customers and Invoices. Let’s create methods to add new customers and invoices.
def create_customer(name):
    with db.transaction:
        customer = db.node(name=name)
        customer.INSTANCE_OF(customers)
        # Index the customer by name
        customer_idx['name'][name] = customer
    return customer
def create_invoice(customer, amount):
    with db.transaction:
        invoice = db.node(amount=amount)
        invoice.INSTANCE_OF(invoices)
        invoice.SENT_TO(customer)
    # Return the newly created invoice node
    return invoice
In the customer case, we create a new node to represent the customer and connect it to the customers node. This helps us find customers later on, as well as determine if a given node is a customer.
We also index the name of the customer, to allow for quickly finding customers by name.
In the invoice case, we do the same, except no indexing. We also connect each new invoice to the customer it was sent to, using a relationship of type SENT_TO.
Next, we want to be able to retrieve customers and invoices that we have added. Because we are indexing customer names, finding them is quite simple.
def get_customer(name):
    return customer_idx['name'][name].single
Let's say we would also like to find all invoices for a given customer that are above some given amount. This could be done by writing a Cypher query, like this:
def get_invoices_with_amount_over(customer, min_sum):
    # Find all invoices over a given sum for a given customer.
    # Note that we return an iterator over the "invoice" column
    # in the result (['invoice']).
    return db.query('''START customer=node({customer_id})
                       MATCH invoice-[:SENT_TO]->customer
                       WHERE has(invoice.amount) and invoice.amount >= {min_sum}
                       RETURN invoice''',
                    customer_id = customer.id, min_sum = min_sum)['invoice']
Creating data and getting it back
Putting it all together, we can create customers and invoices, and use the search methods we wrote to find them.
for name in ['Acme Inc.', 'Example Ltd.']:
    create_customer(name)
# Loop through customers
for relationship in customers.INSTANCE_OF:
    customer = relationship.start
    for i in range(1,12):
        create_invoice(customer, 100 * i)
# Finding large invoices
large_invoices = get_invoices_with_amount_over(get_customer('Acme Inc.'), 500)
# Getting all invoices per customer:
for relationship in get_customer('Acme Inc.').SENT_TO.incoming:
    invoice = relationship.start
Reference Documentation
The source code for this project lives on GitHub: https://github.com/neo4j-contrib/python-embedded
Installation
Note: The Neo4j database itself (from the Community Edition) is included in the neo4j-embedded distribution.
Installation on OSX/Linux
Prerequisites
Caution: Make sure that the entire stack used is either 64bit or 32bit (no mixing, that is). That means the JVM, Python and JPype.
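One way to check the Python side of this requirement is to ask the interpreter for its pointer size. This is a minimal sketch using only the standard library; `python_bitness` is a hypothetical helper name, not part of neo4j-embedded. The JVM and JPype still have to be checked separately (for example, the output of `java -version` typically mentions "64-Bit" for a 64-bit JVM).

```python
import platform
import struct


def python_bitness():
    """Return 32 or 64, based on the pointer size of the running interpreter."""
    # struct.calcsize("P") is the size in bytes of a C pointer
    return struct.calcsize("P") * 8


print("Python: %d-bit on %s" % (python_bitness(), platform.machine()))
```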
First, install JPype:
- Download the latest version of JPype from http://sourceforge.net/projects/jpype/files/JPype/.
- Unzip the file.
- Open a console and navigate into the unzipped folder.
- Run sudo python setup.py install
JPype is also available in the Debian repos:
sudo apt-get install python-jpype
Then, make sure the JAVA_HOME environment variable is set to your jre or jdk folder, so that JPype can find the JVM.
Note: Installation can be problematic on OSX. See the following Stack Overflow discussion for help: http://stackoverflow.com/questions/8525193/cannot-install-jpype-on-os-x-lion-to-use-with-neo4j and this blog post may be of help as well: http://blog.y3xz.com/blog/2011/04/29/installing-jpype-on-mac-os-x/
Installing neo4j-embedded
You can install neo4j-embedded with your python package manager of choice:
sudo pip install neo4j-embedded
sudo easy_install neo4j-embedded
Or install manually:
- Download the latest version from http://pypi.python.org/pypi/neo4j-embedded/.
- Unzip the file.
- Open a console and navigate into the unzipped folder.
- Run sudo python setup.py install
Installation on Windows
Prerequisites
Warning: It is imperative that the entire stack used is either 64bit or 32bit (no mixing, that is). That means the JVM, Python, JPype and all extra DLLs (see below).
First, install JPype:
Note: JPype only works with Python 2.6 and 2.7, and there are different downloads depending on which version you use.
- Download the latest appropriate version of JPype from http://sourceforge.net/projects/jpype/files/JPype/ for 32bit or from http://www.lfd.uci.edu/~gohlke/pythonlibs/ for 64bit.
- Run the installer.
Then, make sure the JAVA_HOME environment variable is set to your jre or jdk folder. There is a description of how to set environment variables in the section "Solving problems with missing DLL files".
Note: There may be DLL files missing from your system that are required by JPype. See "Solving problems with missing DLL files" for instructions on how to fix this.
Installing neo4j-embedded
- Download the latest version from http://pypi.python.org/pypi/neo4j-embedded/.
- Run the installer.
Solving problems with missing DLL files
Certain versions of Windows ship without DLL files needed to programmatically launch a JVM. You will need to make IEShims.dll and certain debugging DLLs available on Windows.
IEShims.dll is normally included with Internet Explorer installs. To make Windows find this file globally, you need to add the IE install folder to your PATH.
- Right click on "My Computer" or "Computer".
- Select "Properties".
- Click on "Advanced" or "Advanced system settings".
- Click the "Environment variables" button.
- Find the path variable, and add C:\Program Files\Internet Explorer to it (or the install location of IE, if you have installed it somewhere else).
The required debugging DLLs are bundled with the Microsoft Visual C++ Redistributable libraries.
If you are still getting errors about missing DLL files, you can use http://www.dependencywalker.com/ to open your jvm.dll (located in JAVA_HOME/bin/client/ or JAVA_HOME/bin/server/), and it will tell you if there are other missing DLLs.
Core API
This section describes how to get up and running, and how to do basic operations.
Getting started
Creating a database
from neo4j import GraphDatabase
# Create db
db = GraphDatabase(folder_to_put_db_in)
# Always shut down your database
db.shutdown()
Creating a database, with configuration
Please see Neo4j Configuration for what options you can use here.
from neo4j import GraphDatabase
# Example configuration parameters
db = GraphDatabase(folder_to_put_db_in, string_block_size=200, array_block_size=240)
db.shutdown()
JPype JVM configuration
You can set extra arguments to be passed to the JVM using the NEO4J_PYTHON_JVMARGS environment variable. This can be used to, for instance, increase the max memory for the database.
Note that you must set this before you import the neo4j package, either by setting it before you start Python, or by setting it programmatically in your app.
import os
os.environ['NEO4J_PYTHON_JVMARGS'] = '-Xms128M -Xmx512M'
import neo4j
You can also override the classpath used by neo4j-embedded, by setting the NEO4J_PYTHON_CLASSPATH environment variable.
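Because NEO4J_PYTHON_JVMARGS has to be set before the neo4j package is imported, it can help to guard against setting it too late. The following is a minimal sketch using only the standard library; `set_jvm_args` is a hypothetical helper, not part of neo4j-embedded, and it assumes the package is imported under the module name `neo4j`.

```python
import os
import sys


def set_jvm_args(*args):
    """Store JVM arguments in NEO4J_PYTHON_JVMARGS (hypothetical helper)."""
    # The variable must be set before the neo4j package is imported,
    # so refuse to continue if the import has already happened.
    if 'neo4j' in sys.modules:
        raise RuntimeError('neo4j is already imported; set JVM arguments earlier')
    os.environ['NEO4J_PYTHON_JVMARGS'] = ' '.join(args)
    return os.environ['NEO4J_PYTHON_JVMARGS']


set_jvm_args('-Xms128M', '-Xmx512M')
# import neo4j  # import only after the variable has been set
```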
Transactions
All write operations to the database need to be performed from within transactions. This ensures that your database never ends up in an inconsistent state.
See Neo4j Transactions for details on how Neo4j handles transactions.
We use the python with statement to define a transaction context. If you are using an older version of Python, you may have to import the with statement:
from __future__ import with_statement
Either way, this is how you get into a transaction:
# Start a transaction
with db.transaction:
    # This is inside the transactional
    # context. All work done here
    # will either entirely succeed,
    # or no changes will be applied at all.
    # Create a node
    node = db.node()
    # Give it a name
    node['name'] = 'Cat Stevens'
# The transaction is automatically
# committed when you exit the with
# block.
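To illustrate the all-or-nothing behaviour of a with block, here is a small, self-contained sketch of a context manager that applies changes on a clean exit and discards them when an exception escapes. `ToyTransaction` is a hypothetical stand-in written for this example, not the neo4j-embedded transaction class, and it assumes that an exception leaving the block means the changes are rolled back.

```python
class ToyTransaction(object):
    """Illustrative stand-in for a transaction context (not the library's class)."""

    def __init__(self, store):
        self.store = store
        self.pending = {}

    def __enter__(self):
        self.pending = {}
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        if exc_type is None:
            # Clean exit: apply all pending changes at once
            self.store.update(self.pending)
        # On an exception the pending changes are simply dropped
        self.pending = {}
        return False  # never swallow the exception


store = {}
tx = ToyTransaction(store)

with tx:
    tx.pending['name'] = 'Cat Stevens'

try:
    with tx:
        tx.pending['age'] = 64
        raise ValueError('something went wrong')
except ValueError:
    pass

print(store)
```

Only the first block's change ends up in `store`; the second block's change is discarded together with the failed block.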
Nodes
This describes operations that are specific to node objects. For documentation on how to handle properties on both relationships and nodes, see properties.
Creating a node
with db.transaction:
    # Create a node
    thomas = db.node(name='Thomas Anderson', age=42)
Fetching a node by id
# You don't have to be in a transaction
# to do read operations.
a_node = db.node[some_node_id]
# Ids on nodes and relationships are available via the "id"
# property, eg.:
node_id = a_node.id
Fetching the reference node
reference = db.reference_node
Removing a node
with db.transaction:
    node = db.node()
    node.delete()
Tip: See also Neo4j Delete Semantics.
Removing a node by id
with db.transaction:
    del db.node[some_node_id]
Accessing relationships from a node
For details on what you can do with the relationship objects, see relationships.
# All relationships on a node
for rel in a_node.relationships:
    pass
# Incoming relationships
for rel in a_node.relationships.incoming:
    pass
# Outgoing relationships
for rel in a_node.relationships.outgoing:
    pass
# Relationships of a specific type
for rel in a_node.mayor_of:
    pass
# Incoming relationships of a specific type
for rel in a_node.mayor_of.incoming:
    pass
# Outgoing relationships of a specific type
for rel in a_node.mayor_of.outgoing:
    pass
Getting and/or counting all nodes
Use this with care; it becomes extremely slow on large datasets.
for node in db.nodes:
    pass
# Shorthand for iterating through
# and counting all nodes
number_of_nodes = len(db.nodes)
Relationships
This describes operations that are specific to relationship objects. For documentation on how to handle properties on both relationships and nodes, see properties.
Creating a relationship
with db.transaction:
    # Nodes to create a relationship between
    steven = db.node(name='Steve Brook')
    poplar_bluff = db.node(name='Poplar Bluff')
    # Create a relationship of type "mayor_of"
    relationship = steven.mayor_of(poplar_bluff, since="12th of July 2012")
    # Or, to create relationship types with names
    # that would not be possible with the above
    # method.
    steven.relationships.create('mayor_of', poplar_bluff, since="12th of July 2012")
Fetching a relationship by id
the_relationship = db.relationship[a_relationship_id]
Removing a relationship
with db.transaction:
    # Create a relationship
    source = db.node()
    target = db.node()
    rel = source.Knows(target)
    # Delete it
    rel.delete()
Tip: See also Neo4j Delete Semantics.
Removing a relationship by id
with db.transaction:
    del db.relationship[some_relationship_id]
Relationship start node, end node and type
relationship_type = relationship.type
start_node = relationship.start
end_node = relationship.end
Getting and/or counting all relationships
Use this with care; it becomes extremely slow on large datasets.
for rel in db.relationships:
    pass
# Shorthand for iterating through
# and counting all relationships
number_of_rels = len(db.relationships)
Properties
Both nodes and relationships can have properties, so this section applies equally to both node and relationship objects. Allowed property values include strings, numbers, booleans, as well as arrays of those primitives. Within each array, all values must be of the same type.
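The rule about allowed property values can be expressed as a small check. This is a sketch for illustration only: `is_valid_property_value` is a hypothetical helper, not something neo4j-embedded provides, and it is written against Python 3 built-in types (on Python 2 you would also want to consider `unicode` and `long`). How the library treats empty lists is not covered above; the sketch accepts them, which is an assumption.

```python
PRIMITIVES = (str, int, float, bool)


def is_valid_property_value(value):
    """Check a value against the documented property rules (hypothetical helper)."""
    if isinstance(value, PRIMITIVES):
        return True
    if isinstance(value, (list, tuple)):
        # Every element must be a primitive...
        if not all(isinstance(item, PRIMITIVES) for item in value):
            return False
        # ...and all elements must share one exact type
        return len(set(type(item) for item in value)) <= 1
    return False


print(is_valid_property_value([1, 2, 3]))
print(is_valid_property_value([1, 'banana']))
```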
Setting properties
with db.transaction:
    node_or_rel['name'] = 'Thomas Anderson'
    node_or_rel['age'] = 42
    node_or_rel['favourite_numbers'] = [1,2,3]
    node_or_rel['favourite_words'] = ['banana','blue']
Getting properties
numbers = node_or_rel['favourite_numbers']
Removing properties
with db.transaction:
    del node_or_rel['favourite_numbers']
Looping through properties
# Loop key and value at the same time
for key, value in node_or_rel.items():
    pass
# Loop property keys
for key in node_or_rel.keys():
    pass
# Loop property values
for value in node_or_rel.values():
    pass
Paths
A path object represents a path between two nodes in the graph. Paths thus contain at least two nodes and one relationship, but can reach arbitrary length. It is used in various parts of the API, most notably in traversals.
Accessing the start and end nodes
start_node = path.start
end_node = path.end
Accessing the last relationship
last_relationship = path.last_relationship
Looping through the entire path
You can loop through all elements of a path directly, or you can choose to only loop through nodes or relationships. When you loop through all elements, the first item will be the start node, the second will be the first relationship, the third the node that the relationship led to and so on.
for item in path:
    # Item is either a Relationship,
    # or a Node
    pass
for node in path.nodes:
    # All nodes in a path
    pass
for relationship in path.relationships:
    # All relationships in a path
    pass
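The alternating node/relationship order described above can be shown with a small stand-in. `ToyPath` is a hypothetical class written for this example (it uses plain strings in place of node and relationship objects) and is not the path type returned by neo4j-embedded.

```python
class ToyPath(object):
    """Illustrative path: n nodes joined by n - 1 relationships."""

    def __init__(self, nodes, relationships):
        if len(nodes) != len(relationships) + 1:
            raise ValueError('a path needs exactly one more node than relationships')
        self.nodes = list(nodes)
        self.relationships = list(relationships)

    def __iter__(self):
        # Start node, first relationship, next node, and so on
        for node, relationship in zip(self.nodes, self.relationships):
            yield node
            yield relationship
        yield self.nodes[-1]


path = ToyPath(['a', 'b', 'c'], ['a->b', 'b->c'])
items = list(path)
print(items)
```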
Indexes
In order to rapidly find nodes or relationships based on properties, Neo4j supports indexing. This is commonly used to find start nodes for traversals.
By default, the underlying index is powered by Apache Lucene, but it is also possible to use Neo4j with other index implementations.
You can create an arbitrary number of named indexes. Each index handles either nodes or relationships, and each index works by indexing key/value/object triplets, object being either a node or a relationship, depending on the index type.
Index management
Just like the rest of the API, all write operations to the index must be performed from within a transaction.
Creating an index
Create a new index, with optional configuration.
with db.transaction:
    # Create a relationship index
    rel_idx = db.relationship.indexes.create('my_rels')
    # Create a node index, passing optional
    # arguments to the index provider.
    # In this case, enable full-text indexing.
    node_idx = db.node.indexes.create('my_nodes', type='fulltext')
Retrieving a pre-existing index
with db.transaction:
    node_idx = db.node.indexes.get('my_nodes')
    rel_idx = db.relationship.indexes.get('my_rels')
Deleting indexes
with db.transaction:
    node_idx = db.node.indexes.get('my_nodes')
    node_idx.delete()
    rel_idx = db.relationship.indexes.get('my_rels')
    rel_idx.delete()
Checking if an index exists
exists = db.node.indexes.exists('my_nodes')
Indexing things
Adding nodes or relationships to an index
with db.transaction:
    # Indexing nodes
    a_node = db.node()
    node_idx = db.node.indexes.create('my_nodes')
    # Add the node to the index
    node_idx['akey']['avalue'] = a_node
    # Indexing relationships
    a_relationship = a_node.knows(db.node())
    rel_idx = db.relationship.indexes.create('my_rels')
    # Add the relationship to the index
    rel_idx['akey']['avalue'] = a_relationship
Removing indexed items
Removing items from an index can be done at several levels of granularity. See the example below.
# Remove specific key/value/item triplet
del idx['akey']['avalue'][item]
# Remove all instances under a certain
# key
del idx['akey'][item]
# Remove all instances altogether
del idx[item]
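The three levels of removal granularity are easier to see on a toy model that stores key/value/item triplets in a set. `ToyIndex` is a hypothetical, in-memory stand-in written for this example; it does not use the `del idx[...]` syntax of the real index objects and says nothing about how Lucene stores its entries.

```python
class ToyIndex(object):
    """In-memory set of (key, value, item) triplets, for illustration only."""

    def __init__(self):
        self.entries = set()

    def add(self, key, value, item):
        self.entries.add((key, value, item))

    def remove(self, item, key=None, value=None):
        # Omitting value removes the item under every value of the key;
        # omitting key as well removes the item from the whole index.
        self.entries = set(
            (k, v, i) for (k, v, i) in self.entries
            if not (i == item
                    and (key is None or k == key)
                    and (value is None or v == value)))

    def lookup(self, key, value):
        return sorted(i for (k, v, i) in self.entries if k == key and v == value)


idx = ToyIndex()
idx.add('name', 'Acme', 'n1')
idx.add('name', 'ACME Inc', 'n1')
idx.add('city', 'Oslo', 'n1')
idx.add('name', 'Acme', 'n2')

idx.remove('n1', key='name', value='Acme')  # one specific triplet
after_triplet = len(idx.entries)
idx.remove('n1', key='name')                # everything under one key
after_key = len(idx.entries)
idx.remove('n1')                            # the item altogether
after_item = len(idx.entries)
```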
Searching the index
You can retrieve indexed items in two ways. Either you do a direct lookup, or you perform a query. The direct lookup is the same across different index providers while the query syntax depends on what index provider you use. As mentioned previously, Lucene is the default and by far most common index provider.
There is a Python library for programmatically generating Lucene queries, available at GitHub.
Important: Unless you loop through the entire index result, you have to close the result when you are done with it. If you do not, the database does not know when it can release the resources the result is taking up.
Direct lookups
hits = idx['akey']['avalue']
for item in hits:
    pass
# Always close index results when you are
# done, to free up resources.
hits.close()
Querying
hits = idx.query('akey:avalue')
for item in hits:
    pass
# Always close index results when you are
# done, to free up resources.
hits.close()
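If you stop iterating early, for instance by returning the first hit, it is easy to forget the close() call. One possible pattern is to wrap the result in `contextlib.closing` from the Python standard library, which calls close() when the block exits, even on an early return or an exception. In this sketch `FakeHits` is a hypothetical stand-in for an index result that models only iteration and close(), and `first_hit` is an illustrative helper name.

```python
from contextlib import closing


class FakeHits(object):
    """Stand-in for an index result; only iteration and close() are modelled."""

    def __init__(self, items):
        self._items = list(items)
        self.closed = False

    def __iter__(self):
        return iter(self._items)

    def close(self):
        self.closed = True


def first_hit(hits):
    """Return the first item of a result, closing it even on early exit."""
    with closing(hits):
        for item in hits:
            return item
    return None


hits = FakeHits(['node-1', 'node-2'])
first = first_hit(hits)
print(first, hits.closed)
```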
Cypher Queries
You can use the Cypher query language from neo4j-embedded. Read more about Cypher syntax and what you can do with it here: Cypher Reference.
Querying and reading the result
Basic query
To execute a plain text cypher query, do this:
result = db.query("START n=node(0) RETURN n")
Retrieve query result
Cypher returns a tabular result. You can either loop through the table row-by-row, or you can loop through the values in a given column. Here is how to loop row-by-row:
root_node = "START n=node(0) RETURN n"
# Iterate through all result rows
for row in db.query(root_node):
    node = row['n']
# We know it's a single result,
# so we could have done this as well
node = db.query(root_node).single['n']
Here is how to loop through the values of a given column:
root_node = "START n=node(0) RETURN n"
# Fetch an iterator for the "n" column
column = db.query(root_node)['n']
for cell in column:
    node = cell
# Columns support "single":
column = db.query(root_node)['n']
node = column.single
List the result columns
You can get a list of the column names in the result like this:
result = db.query("START n=node(0) RETURN n,count(n)")
# Get a list of the column names
columns = result.keys()
Parameterized and prepared queries
Parameterized queries
Cypher supports parameterized queries, see Cypher Parameters. This is how you use them in neo4j-embedded.
result = db.query("START n=node({id}) RETURN n",id=0)
node = result.single['n']
Prepared queries
Prepared queries, where you could retrieve a pre-parsed version of a Cypher query to be used later, are deprecated. Cypher will recognize if it has previously parsed a given query, and won’t parse the same string twice.
So, in effect, all cypher queries are prepared queries, if you use them more than once. Use parameterized queries to gain the full power of this - then a generic query can be pre-parsed, and modified with parameters each time it is executed.
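The benefit of keeping the query string constant can be illustrated with a toy cache keyed on the query text. `ToyQueryCache` is a hypothetical model written for this example; it only counts how often a new string would need parsing and does not reflect how Cypher caches queries internally.

```python
class ToyQueryCache(object):
    """Counts how many distinct query strings had to be 'parsed'."""

    def __init__(self):
        self.parsed = {}
        self.parse_count = 0

    def run(self, query_text, **params):
        if query_text not in self.parsed:
            # A new string: pretend to parse it and remember the result
            self.parse_count += 1
            self.parsed[query_text] = query_text.strip()
        return self.parsed[query_text], params


# Values formatted into the string: every query text is different
inline = ToyQueryCache()
for node_id in range(3):
    inline.run("START n=node(%d) RETURN n" % node_id)

# Values passed as parameters: the query text never changes
parameterized = ToyQueryCache()
for node_id in range(3):
    parameterized.run("START n=node({id}) RETURN n", id=node_id)

print(inline.parse_count, parameterized.parse_count)
```

In this model the formatted queries are parsed three times and the parameterized query once.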
Traversals
Warning: Traversal support in neo4j-embedded for Python is deprecated as of Neo4j 1.7 GA. Please see Cypher or the core API instead. This is done because the traversal framework requires a very tight coupling between the JVM and Python. To keep improving performance, we need to break that coupling.
The below documentation will be removed in neo4j-embedded 1.8, and support for traversals will be dropped in neo4j-embedded 1.9.
The traversal API used here is essentially the same as the one used in the Java API, with a few modifications.
Traversals start at a given node and use a set of rules to move through the graph and to decide what parts of the graph to return.
Basic traversals
Following a relationship
The most basic traversals simply follow certain relationship types, and return everything they encounter. By default, each node is visited only once, so there is no risk of infinite loops.
traverser = db.traversal()\
    .relationships('related_to')\
    .traverse(start_node)
# The graph is traversed as
# you loop through the result.
for node in traverser.nodes:
    pass
Following a relationship in a specific direction
You can tell the traverser to only follow relationships in some specific direction.
from neo4j import OUTGOING, INCOMING, ANY
traverser = db.traversal()\
    .relationships('related_to', OUTGOING)\
    .traverse(start_node)
Following multiple relationship types
You can specify an arbitrary number of relationship types and directions to follow.
from neo4j import OUTGOING, INCOMING, ANY
traverser = db.traversal()\
    .relationships('related_to', INCOMING)\
    .relationships('likes')\
    .traverse(start_node)
Traversal results
A traversal can give you one of three different result types: nodes, relationships or paths.
Traversals are performed lazily, which means that the graph is traversed as you loop through the result.
traverser = db.traversal()\
    .relationships('related_to')\
    .traverse(start_node)
# Get each possible path
for path in traverser:
    pass
# Get each node
for node in traverser.nodes:
    pass
# Get each relationship
for relationship in traverser.relationships:
    pass
Uniqueness
To avoid infinite loops, it’s important to define what parts of the graph can be re-visited during a traversal. By default, uniqueness is set to NODE_GLOBAL, which means that each node is only visited once.
Here are the other options that are available.
from neo4j import Uniqueness
# Available options are:
Uniqueness.NONE
# Any position in the graph may be revisited.
Uniqueness.NODE_GLOBAL
# Default option
# No node in the entire graph may be visited
# more than once. This could potentially
# consume a lot of memory since it requires
# keeping an in-memory data structure
# remembering all the visited nodes.
Uniqueness.RELATIONSHIP_GLOBAL
# No relationship in the entire graph may be
# visited more than once. For the same
# reasons as NODE_GLOBAL uniqueness, this
# could use up a lot of memory. But since
# graphs typically have a larger number of
# relationships than nodes, the memory
# overhead of this uniqueness level could
# grow even quicker.
Uniqueness.NODE_PATH
# A node may not occur previously in the
# path reaching up to it.
Uniqueness.RELATIONSHIP_PATH
# A relationship may not occur previously in
# the path reaching up to it.
Uniqueness.NODE_RECENT
# Similar to NODE_GLOBAL uniqueness in that
# there is a global collection of visited
# nodes each position is checked against.
# This uniqueness level does however have a
# cap on how much memory it may consume in
# the form of a collection that only
# contains the most recently visited nodes.
# The size of this collection can be
# specified by providing a number as the
# second argument to the
# uniqueness()-method along with the
# uniqueness level.
Uniqueness.RELATIONSHIP_RECENT
# works like NODE_RECENT uniqueness, but
# with relationships instead of nodes.
traverser = db.traversal()\
    .uniqueness(Uniqueness.NODE_PATH)\
    .traverse(start_node)
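The difference between global and per-path node uniqueness shows up as soon as two routes lead to the same node. The following self-contained sketch walks a small dictionary-based graph; `toy_traverse` and the string constants 'NODE_GLOBAL' and 'NODE_PATH' are stand-ins written for this example and are not the neo4j-embedded traversal API.

```python
def toy_traverse(graph, start, uniqueness):
    """Return the paths found by a depth-first walk under a uniqueness rule."""
    visited = set()
    results = []

    def visit(path):
        node = path[-1]
        if uniqueness == 'NODE_GLOBAL':
            # A node may be visited only once in the whole traversal
            if node in visited:
                return
            visited.add(node)
        elif uniqueness == 'NODE_PATH':
            # A node may not occur earlier in the path leading up to it
            if node in path[:-1]:
                return
        results.append(path)
        for neighbour in graph.get(node, []):
            visit(path + [neighbour])

    visit([start])
    return results


# Two routes lead from 'a' to 'd': via 'b' and via 'c'
graph = {'a': ['b', 'c'], 'b': ['d'], 'c': ['d']}

global_paths = toy_traverse(graph, 'a', 'NODE_GLOBAL')
per_path = toy_traverse(graph, 'a', 'NODE_PATH')
print(len(global_paths), len(per_path))
```

With global uniqueness the second route to 'd' is dropped, so four paths are returned; with per-path uniqueness both routes are kept, giving five.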
Ordering
You can traverse either depth first, or breadth first. Depth first is the default, because it has lower memory overhead.
# Depth first traversal, this
# is the default.
traverser = db.traversal()\
    .depthFirst()\
    .traverse(start_node)
# Breadth first traversal
traverser = db.traversal()\
    .breadthFirst()\
    .traverse(start_node)
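To see how the two orderings differ, here is a small sketch that visits a tree-shaped dictionary graph in either order. `toy_order` is a hypothetical function written for this example, not part of the traversal API, and the visiting order it produces is only claimed for tree-shaped input such as the one shown.

```python
from collections import deque


def toy_order(graph, start, breadth_first=False):
    """Return the order in which the nodes of a tree-shaped graph are visited."""
    frontier = deque([start])
    seen = set([start])
    order = []
    while frontier:
        # Breadth first takes the oldest entry, depth first the newest
        node = frontier.popleft() if breadth_first else frontier.pop()
        order.append(node)
        neighbours = list(graph.get(node, []))
        if not breadth_first:
            # Reverse so the first listed neighbour is popped first
            neighbours.reverse()
        for neighbour in neighbours:
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append(neighbour)
    return order


tree = {'a': ['b', 'c'], 'b': ['d', 'e'], 'c': ['f']}
depth_first_order = toy_order(tree, 'a')
breadth_first_order = toy_order(tree, 'a', breadth_first=True)
print(depth_first_order)
print(breadth_first_order)
```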
Evaluators - advanced filtering
In order to traverse based on other criteria, such as node properties, or more complex things like neighboring nodes or patterns, we use evaluators. An evaluator is a normal Python function that takes a path as an argument, and returns a description of what to do next.
The path argument is the current position the traverser is at, and the description of what to do can be one of four things, as seen in the example below.
from neo4j import Evaluation
# Evaluation contains the four
# options that an evaluator can
# return. They are:
Evaluation.INCLUDE_AND_CONTINUE
# Include this node in the result and
# continue the traversal
Evaluation.INCLUDE_AND_PRUNE
# Include this node in the result, but don't
# continue the traversal
Evaluation.EXCLUDE_AND_CONTINUE
# Exclude this node from the result, but
# continue the traversal
Evaluation.EXCLUDE_AND_PRUNE
# Exclude this node from the result and
# don't continue the traversal
# An evaluator
def my_evaluator(path):
    # Filter on end node property
    if path.end['message'] == 'world':
        return Evaluation.INCLUDE_AND_CONTINUE
    # Filter on last relationship type
    if path.last_relationship.type.name() == 'related_to':
        return Evaluation.INCLUDE_AND_PRUNE
    # You can do even more complex things here, like subtraversals.
    return Evaluation.EXCLUDE_AND_CONTINUE
# Use the evaluator
traverser = db.traversal()\
    .evaluator(my_evaluator)\
    .traverse(start_node)
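The effect of the four return values can be tried out without a database. In this self-contained sketch each option is modelled as an (include, continue) pair and paths are plain lists of node names; `toy_traverse`, `short_paths_only` and the tuple constants are hypothetical stand-ins for this example and do not use the `Evaluation` object from neo4j.

```python
# Each option is modelled as (include in result, continue past this node)
INCLUDE_AND_CONTINUE = (True, True)
INCLUDE_AND_PRUNE = (True, False)
EXCLUDE_AND_CONTINUE = (False, True)
EXCLUDE_AND_PRUNE = (False, False)


def toy_traverse(graph, start, evaluator):
    """Walk the graph depth first, asking the evaluator what to do at each path."""
    results = []

    def visit(path):
        include, keep_going = evaluator(path)
        if include:
            results.append(path)
        if keep_going:
            for neighbour in graph.get(path[-1], []):
                if neighbour not in path:
                    visit(path + [neighbour])

    visit([start])
    return results


def short_paths_only(path):
    if len(path) == 1:
        # Skip the start node itself, but keep going
        return EXCLUDE_AND_CONTINUE
    if len(path) == 3:
        # Keep this path, but do not look any further
        return INCLUDE_AND_PRUNE
    return INCLUDE_AND_CONTINUE


chain = {'a': ['b'], 'b': ['c'], 'c': ['d']}
found = toy_traverse(chain, 'a', short_paths_only)
print(found)
```

Node 'd' is never reached, because the evaluator prunes the traversal at the path ending in 'c'.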