Neo4j bulk data import
01 January 2018
Neo4j has a csv import tool that makes importing large datasets pretty easy,
especially if you happen to be exporting data out of postgres with the COPY
command. We’re going to take a quick look at importing with the imdb movie
dataset (docs here: http://www.imdb.com/interfaces).
There is more than one way to import data into Neo4j. The cypher command
LOAD CSV is recommended for medium sized datasets of up to 10M records, which
is a little smaller than our movie dataset, so we will look at the command
line tool neo4j-admin import instead, which Neo4j recommends for huge
datasets.
This is the example command we will be using:
neo4j-admin import \
  --nodes name.basics.csv \
  --nodes title.basics.csv \
  --relationships title.principals.csv \
  --ignore-missing-nodes
The csv files need to be formatted in a particular way to correctly create nodes
and relationships, and the --ignore-missing-nodes flag needs to be added so the
importer skips relationships that point at missing nodes instead of throwing
errors (the imdb dataset we are using references a few missing nodes).
Nodes: <primary_key_name>:ID,attr1,attr2,:LABEL
- <primary_key_name>:ID is the id of our node, and can use any id name like
  movieId:ID or personId:ID
- :LABEL is the label attached to the node
- attributes can be added in between the :ID and :LABEL columns
Relationships: :START_ID,attr1,attr2,:END_ID,:TYPE
- :START_ID and :END_ID refer to the ids of the nodes in the
  relationship
- :TYPE is the relationship type
- attributes can be added in between :START_ID and :END_ID
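To make the two layouts concrete, here is a minimal Ruby sketch that builds a tiny node file and relationship file in memory. The ids, names and rows are made-up sample values (not real imdb data), and Ruby's csv library is my choice here rather than anything the importer requires. It just takes care of quoting values that contain commas.

```ruby
require 'csv'

# Made-up sample rows, only here to show the column layout.
actors = [['p1', 'Some Actor'], ['p2', 'Other, Actor']]
movies = [['m1', 'Some Movie']]
cast   = [['p1', 'm1'], ['p2', 'm1']]

# Node files: id column first, attributes in between, :LABEL last.
actors_csv = CSV.generate do |csv|
  csv << ['personId:ID', 'name', ':LABEL']
  actors.each { |id, name| csv << [id, name, 'Actor'] }
end

movies_csv = CSV.generate do |csv|
  csv << ['movieId:ID', 'name', ':LABEL']
  movies.each { |id, name| csv << [id, name, 'Movie'] }
end

# Relationship file: :START_ID and :END_ID point at node ids,
# :TYPE names the relationship.
relationships_csv = CSV.generate do |csv|
  csv << [':START_ID', ':END_ID', ':TYPE']
  cast.each { |actor_id, movie_id| csv << [actor_id, movie_id, 'ACTED_IN'] }
end

puts actors_csv
puts movies_csv
puts relationships_csv
```

Note how the name with a comma in it comes out wrapped in double quotes, so it still counts as a single column.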
For our imdb data we are going to pull 3 files:
- https://datasets.imdbws.com/name.basics.tsv.gz, actor name to id
- https://datasets.imdbws.com/title.basics.tsv.gz, movie name to id
- https://datasets.imdbws.com/title.principals.tsv.gz, movie id to actor ids (the cast)
The files are tab separated and gzipped, so we’ll go through the motions of quickly changing them over to a csv format.
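The conversion boils down to three steps: gunzip the stream, split each line on tabs, and write the columns we care about back out as csv. Here is a minimal sketch of those steps on an in-memory sample. The helper name tsv_gz_to_csv and the sample rows are made up for illustration, and it assumes the first line of the tsv is a header row that should be skipped.

```ruby
require 'zlib'
require 'stringio'
require 'csv'

# Hypothetical helper: reads gzipped tab separated data from io and returns
# csv text with the given header, keeping only the selected column indexes
# and appending the label as the last column.
def tsv_gz_to_csv(io, header, columns, label)
  reader = Zlib::GzipReader.new(io)
  reader.gets # skip the tsv header row
  CSV.generate do |csv|
    csv << header
    reader.each_line do |line|
      values = line.chomp.split("\t")
      csv << (values.values_at(*columns) + [label])
    end
  end
end

# Made-up sample in the shape of a tsv dump: header row, then data rows.
sample_tsv = "id\tname\tborn\np1\tSome Actor\t1900\np2\tOther, Actor\t1950\n"
gzipped = StringIO.new(Zlib.gzip(sample_tsv))

csv_text = tsv_gz_to_csv(gzipped, ['personId:ID', 'name', ':LABEL'], [0, 1], 'Actor')
puts csv_text
```

Going through the csv library instead of joining strings by hand means a name with a comma in it gets quoted rather than spilling into an extra column.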
For the actors, it will roughly look like this:
require 'open-uri'
require 'zlib'
require 'csv'
url = 'https://datasets.imdbws.com/name.basics.tsv.gz'
csv_filepath = File.join(Dir.pwd, 'name.basics.csv')
csv_headers = "personId:ID,name,:LABEL\n"
# URI.open comes from open-uri (plain open no longer accepts urls in Ruby 3)
zipped = URI.open(url)
unzipped = Zlib::GzipReader.new(zipped)
File.open(csv_filepath, 'w') do |csv_file|
  csv_file.write(csv_headers)
  unzipped.gets # skip the tsv header row
  unzipped.each_line do |line|
    values = line.strip.split("\t")
    person_id = values[0]
    name = values[1]
    # to_csv quotes names that contain commas or double quotes
    csv_file.write([person_id, name, 'Actor'].to_csv)
  end
end
Movies will mostly look the same, except we are only interested in rows with a titleType of movie (skipping episodes, shorts, …):
require 'open-uri'
require 'zlib'
require 'csv'
url = 'https://datasets.imdbws.com/title.basics.tsv.gz'
csv_filepath = File.join(Dir.pwd, 'title.basics.csv')
csv_headers = "movieId:ID,name,:LABEL\n"
zipped = URI.open(url)
unzipped = Zlib::GzipReader.new(zipped)
File.open(csv_filepath, 'w') do |csv_file|
  csv_file.write(csv_headers)
  unzipped.each_line do |line|
    values = line.strip.split("\t")
    # keep only rows whose titleType is movie (this also drops the tsv header row)
    next unless values[1] == 'movie'
    movie_id = values[0]
    name = values[2]
    csv_file.write([movie_id, name, 'Movie'].to_csv)
  end
end
The relationships:
require 'open-uri'
require 'zlib'
url = 'https://datasets.imdbws.com/title.principals.tsv.gz'
csv_filepath = File.join(Dir.pwd, 'title.principals.csv')
csv_headers = ":START_ID,:END_ID,:TYPE\n"
zipped = URI.open(url)
unzipped = Zlib::GzipReader.new(zipped)
File.open(csv_filepath, 'w') do |csv_file|
  csv_file.write(csv_headers)
  unzipped.gets # skip the tsv header row
  unzipped.each_line do |line|
    values = line.strip.split("\t")
    movie_id = values[0]
    # assumes the second column holds a comma separated list of person ids
    cast = values[1].split(',')
    cast.each do |actor_id|
      # one (actor)-[:ACTED_IN]->(movie) row per cast member
      csv_file.write("#{actor_id},#{movie_id},ACTED_IN\n")
    end
  end
end
Now that we have the csv data for all our movies and actors, we just pass the files to the command from the beginning and we are ready to play with some graph data for movies!
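Before running the import, it can be handy to see how many relationships --ignore-missing-nodes will end up skipping, for example because the movie file only keeps rows with a titleType of movie. This is a rough sketch, not how the importer itself works. It assumes the three csv files sit in the current directory, that the id is the first column of each node file, that ids contain no commas, and that all node ids fit in memory (for the full dataset that is a lot of ids). The helper name is made up.

```ruby
require 'set'

# Hypothetical helper: counts relationship rows whose start or end id
# does not appear in any of the node files.
def count_dangling_relationships(node_files, relationship_file)
  ids = Set.new
  node_files.each do |path|
    File.foreach(path).with_index do |line, index|
      next if index.zero? # header row
      # the id is the first column, so everything before the first comma
      ids << line.split(',', 2).first
    end
  end

  dangling = 0
  File.foreach(relationship_file).with_index do |line, index|
    next if index.zero? # header row
    start_id, end_id = line.chomp.split(',')
    dangling += 1 unless ids.include?(start_id) && ids.include?(end_id)
  end
  dangling
end

# Only runs when the generated files are actually present.
if File.exist?('title.principals.csv')
  node_files = %w[name.basics.csv title.basics.csv]
  puts count_dangling_relationships(node_files, 'title.principals.csv')
end
```

If the number is much larger than expected, that usually points at a mismatch between the ids written to the node files and the ones written to the relationship file.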