Neo4j bulk data import
01 January 2018
Neo4j has a csv import tool that makes importing large datasets pretty easy,
especially if you happen to be exporting data out of postgres with the COPY
command. We’re going to take a quick look at importing with the imdb movie
dataset (docs here: http://www.imdb.com/interfaces).
There is more than one way to import data into Neo4j, like the cypher
command LOAD CSV
which is recommended for medium sized datasets up to 10M
records. This is a little smaller than our movie dataset, so the one we will
look at uses the command line neo4j-admin import
, recommended by Neo4j for
huge datasets.
This is the example command we will be using:
neo4j-admin import \
--nodes name.basics.csv \
--nodes title.basics.csv \
--relationships title.principals.csv \
--ignore-missing-nodes
The csv files need to be formatted in a particular way to correctly create nodes
and relationships, and the ignore-missing-nodes
needs to be added so the
importer ignores relationships with no nodes instead of throwing errors (them
imdb dataset we are using references a few missing nodes)
Nodes: <primary_key_name>:ID,attr1,attr2,:LABEL
<primary_key_name>:Id
is the id of our node, can be any id name like
movieId:ID
or personId:ID
:LABEL
is any label attached- attributes can be added in between
:START_ID
and:END_ID
Relationships: :START_ID,attr1,attr2,:END_ID,:TYPE
:START_ID
and:END_ID
refer to the ids of the nodes in the
relationship
:TYPE
is the relationship type- attributes can be added in between
:START_ID
and:END_ID
For our imdb data we are going to pull 3 files:
https://datasets.imdbws.com/name.basics.tsv.gz
, actor name to idhttps://datasets.imdbws.com/title.basics.tsv.gz
, movie name to idhttps://datasets.imdbws.com/title.principals.tsv.gz
, movie id to actor ids
(the cast)
The files are tab seperated and gzipped, so we’ll go through the motions of quickly changing them over to a csv format.
For the actors, it will roughly look like this:
require 'open-uri'
url = 'https://datasets.imdbws.com/name.basics.tsv.gz'
csv_filepath = File.join(Dir.pwd, 'name.basics.csv')
csv_headers = "personId:ID,name,:LABEL \n"
zipped = open(url)
unzipped = Zlib::GzipReader.new(zipped)
File.open(csv_filepath, 'w') do |csv_file|
csv_file.write(csv_headers)
unzipped.each_line do |line|
values = line.strip.split("\t")
personID = values[0]
name = values[1]
csv_file.write("#{personID},#{name},Actor \n")
end
end
Movies will mostly look the same, except we are only interested in titleType of movies (skipping episodes, shorts, …):
require 'open-uri'
url = 'https://datasets.imdbws.com/title.basics.tsv.gz'
csv_filepath = File.join(Dir.pwd, 'title.basics.csv')
csv_headers = "movieId:ID,name,:LABEL \n"
zipped = open(url)
unzipped = Zlib::GzipReader.new(zipped)
File.open(csv_filepath, 'w') do |csv_file|
csv_file.write(csv_headers)
unzipped.each_line do |line|
values = line.strip.split("\t")
personID = values[0]
name = values[2]
csv_file.write("#{movieID},#{name},Movie \n") unless values[1] == 'movie'
end
end
The relationships:
url = 'https://datasets.imdbws.com/title.principals.tsv.gz'
csv_filepath = File.join(Dir.pwd, 'title.principals.csv')
csv_headers = ":START_ID,:END_ID,:TYPE \n"
zipped = open(url)
unzipped = Zlib::GzipReader.new(zipped)
File.open(csv_filepath, 'w') do |csv_file|
csv_file.write(csv_headers)
unzipped.each_line do |line|
values = line.strip.split("\t")
movieID = values[0]
cast = values[1].split(",")
cast.each_with_object('') do |actorID, str|
csv_file.write("#{actorID},#{imdbID},ACTED_IN \n")
end
end
end
Now that we have the csv data for all our movie and actors, we just pass them to the command from the beginning and we are ready to play with some graph data for movies!