18.1. Batch Insertion

18.1.1. Batch Inserter Examples
18.1.2. Batch Graph Database
18.1.3. Index Batch Insertion

Neo4j has a batch insertion facility intended for initial imports, which bypasses transactions and other checks in favor of performance. This is useful when you have a big dataset that needs to be loaded once.

Batch insertion is included in the neo4j-kernel component, which is part of all Neo4j distributions and editions.

Be aware of the following when using batch insertion:

[Warning]Warning

Always perform batch insertion in a single thread (or use synchronization to make only one thread at a time access the batch inserter) and invoke shutdown when finished.

18.1.1. Batch Inserter Examples

Creating a batch inserter is similar to how you normally create data in the database, but in this case the low-level BatchInserter interface is used. As we have already pointed out, you can’t have multiple threads using the batch inserter concurrently without external synchronization.

[Tip]Tip

The source code of the examples is found here: BatchInsertDocTest.java

To get hold of a BatchInserter, use BatchInserters and then go from there:

BatchInserter inserter = BatchInserters.inserter( "target/batchinserter-example", fileSystem );
Map<String, Object> properties = new HashMap<String, Object>();
properties.put( "name", "Mattias" );
long mattiasNode = inserter.createNode( properties );
properties.put( "name", "Chris" );
long chrisNode = inserter.createNode( properties );
RelationshipType knows = DynamicRelationshipType.withName( "KNOWS" );
// To set properties on the relationship, use a properties map
// instead of null as the last parameter.
inserter.createRelationship( mattiasNode, chrisNode, knows, null );
inserter.shutdown();
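
As pointed out above, the batch inserter must not be used from multiple threads without external synchronization. If several threads produce data, one simple option is to funnel every call through a shared lock. The wrapper below is only a sketch of that pattern; the class and its method are illustrative assumptions, not part of the batch inserter API:

// A minimal sketch: serialize all access to a single BatchInserter.
// The wrapper class and method names are hypothetical.
class SynchronizedInsertion
{
    private final BatchInserter inserter;

    SynchronizedInsertion( BatchInserter inserter )
    {
        this.inserter = inserter;
    }

    synchronized long insertNode( Map<String, Object> properties )
    {
        // Only one thread at a time reaches the underlying batch inserter.
        return inserter.createNode( properties );
    }
}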

To get good performance you will probably want to configure the batch inserter. Read Section 25.9.2, “Batch insert example” for information on configuring a batch inserter. This is how to start a batch inserter with configuration options:

Map<String, String> config = new HashMap<String, String>();
config.put( "neostore.nodestore.db.mapped_memory", "90M" );
BatchInserter inserter = BatchInserters.inserter(
        "target/batchinserter-example-config", fileSystem, config );
// Insert data here ... and then shut down:
inserter.shutdown();

If you have stored the configuration in a file, you can load it like this:

InputStream input = fileSystem.openAsInputStream(
        new File( "target/batchinsert-config" ) );
Map<String, String> config = MapUtil.load( input );
BatchInserter inserter = BatchInserters.inserter(
        "target/batchinserter-example-config", fileSystem, config );
// Insert data here ... and then shut down:
inserter.shutdown();
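
The file is expected to be in ordinary Java properties syntax, so the configuration file for the example above could look something like the sketch below. The keys and values shown are only an illustration; see Section 25.9.2, “Batch insert example” for the available settings.

# target/batchinsert-config
neostore.nodestore.db.mapped_memory=90M
neostore.relationshipstore.db.mapped_memory=3G
neostore.propertystore.db.mapped_memory=50M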

18.1.2. Batch Graph Database

If you already have data import code written against the normal Neo4j API, you can use a batch inserter that exposes that same API.

[Note]Note

This will not perform as well as using the BatchInserter API directly.

Also be aware of the following:

  • Starting a transaction or invoking Transaction.finish() or Transaction.success() will do nothing.
  • Invoking the Transaction.failure() method will generate a NotInTransaction exception.
  • Node.delete() and Node.traverse() are not supported.
  • Relationship.delete() is not supported.
  • Event handlers and indexes are not supported.
  • GraphDatabaseService.getRelationshipTypes(), getAllNodes() and getAllRelationships() are not supported.

With these precautions in mind, this is how to do it:

GraphDatabaseService batchDb =
        BatchInserters.batchDatabase( "target/batchdb-example", fileSystem );
Node mattiasNode = batchDb.createNode();
mattiasNode.setProperty( "name", "Mattias" );
Node chrisNode = batchDb.createNode();
chrisNode.setProperty( "name", "Chris" );
RelationshipType knows = DynamicRelationshipType.withName( "KNOWS" );
mattiasNode.createRelationshipTo( chrisNode, knows );
batchDb.shutdown();

[Tip]Tip

The source code of the example is found here: BatchInsertDocTest.java

18.1.3. Index Batch Insertion

For general notes on batch insertion, see Section 18.1, “Batch Insertion”.

Indexing during batch insertion is done using BatchInserterIndex instances, which are provided via a BatchInserterIndexProvider. An example:

BatchInserter inserter = BatchInserters.inserter( "target/neo4jdb-batchinsert" );
BatchInserterIndexProvider indexProvider =
        new LuceneBatchInserterIndexProvider( inserter );
BatchInserterIndex actors =
        indexProvider.nodeIndex( "actors", MapUtil.stringMap( "type", "exact" ) );
actors.setCacheCapacity( "name", 100000 );

Map<String, Object> properties = MapUtil.map( "name", "Keanu Reeves" );
long node = inserter.createNode( properties );
actors.add( node, properties );

// Make the changes visible for reading; use this sparingly, since it requires IO!
actors.flush();

// Make sure to shut down the index provider as well
indexProvider.shutdown();
inserter.shutdown();

The configuration parameters are the same as mentioned in Section 19.10, “Configuration and fulltext indexes”.
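
For instance, assuming the indexProvider and node from the example above, a fulltext index can be requested simply by passing a different configuration map. The snippet below is a sketch of that and is not part of the example source:

// A fulltext index is obtained from the same provider; only the
// configuration map differs.
BatchInserterIndex fulltextActors = indexProvider.nodeIndex(
        "actors-fulltext", MapUtil.stringMap( "type", "fulltext" ) );
fulltextActors.add( node, MapUtil.map( "name", "Keanu Reeves" ) );
fulltextActors.flush();
// Fulltext indexes accept Lucene query syntax for lookups.
IndexHits<Long> hits = fulltextActors.query( "name", "keanu*" );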

Best practices

Here are some pointers to get the most performance out of BatchInserterIndex:

  • Try to avoid flushing too often, because each flush makes all additions since the previous flush visible to the querying methods, and publishing those changes can be a performance penalty.
  • Use phases that are as large as possible, where each phase consists of either only writes or only reads, and remember to flush after a write phase so that the changes become visible to the querying methods (see the sketch after the note below).
  • Enable caching for keys you know you will look up later on; this can increase performance significantly (though insertion performance may degrade slightly).

[Note]Note

Changes to the index only become available for reading after they have been flushed to disk. For optimal performance, read and lookup operations should therefore be kept to a minimum during batch insertion, since they involve IO and impact speed negatively.
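
As a sketch of how these recommendations fit together, the fragment below builds on the inserter and actors index from the example above, with hypothetical input data: a pure write phase first, a single flush, and then a read-only phase that looks nodes up to create relationships.

// Write phase: only additions, no lookups. 'rows' is hypothetical input data.
for ( Map<String, Object> row : rows )
{
    long node = inserter.createNode( row );
    actors.add( node, row );
}

// A single flush makes everything written above visible to the querying methods.
actors.flush();

// Read phase: only lookups, no additions.
RelationshipType knows = DynamicRelationshipType.withName( "KNOWS" );
long keanu = actors.get( "name", "Keanu Reeves" ).getSingle();
long laurence = actors.get( "name", "Laurence Fishburne" ).getSingle();
inserter.createRelationship( keanu, laurence, knows, null );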