November 16, 2016

Documentation

In this page a brief documentation of the ARG Database will be provided. More details can be found in this file.

The graph files

The graphs directory that you can download here contains 168 .zip files in which the actual graph files are stored in a compressed format. You will need a decompression utility able to deal with the .zip format in order to extract the graph files, if you have not yet installed one.

The name of the .zip files reflect the kind of graphs that are stored in the corresponding archives; in particular, the name is made of a prefix, which indicates the type of matching that holds between the graphs, and an identifier of the type of graphs (see section References for more details on the graph types). In TABLE 1 there is a list of prefixes with the corresponding meaning; TABLE 2 contains the description of the identifiers used for the types.

The prefix and the type identifier are separated by an underscore (“_”). Examples:

  • The file si2_m2D.zip has the prefix si2, meaning that it contains pairs of graphs for which there is a graph-subgraph isomorphism relation, and the subgraph contain 20% of the nodes of the full graph; the type identifier is m2D, corresponding to bidimensional meshes with no additional random edges.
  • The file mcs10_m2D.zip has the prefix mcs10, meaning that it contains pairs of graphs holding a common subgraph whose size is 10% of the size of the graphs; the type identifier is m2D, corresponding to bidimensional meshes with no additional random edges.

TABLE 1: Filename prefixes
Prefix Description
iso Pairs of isomorphic graphs
si2 Pairs of graph-subgraph isomorphic unlabeled graphs; the subgraph size is 20% of the full graph
si4 Pairs of graph-subgraph isomorphic unlabeled graphs; the subgraph size is 40% of the full graph
si6 Pairs of graph-subgraph isomorphic unlabeled graphs; the subgraph size is 60% of the full graph
mcs10 Pairs of labeled graphs, whose maximum common subgraph size is 10% of the graph size
mcs30 Pairs of labeled graphs, whose maximum common subgraph size is 30% of the graph size
mcs50 Pairs of labeled graphs, whose maximum common subgraph size is 50% of the graph size
mcs70 Pairs of labeled graphs, whose maximum common subgraph size is 70% of the graph size
mcs90 Pairs of labeled graphs, whose maximum common subgraph size is 90% of the graph size

TABLE 2: Graph type identifiers
Identifier Description
r001 randomly generated graphs with eta = 0.01
r005 randomly generated graphs with eta = 0.05
r01 randomly generated graphs with eta = 0.1
r02 randomly generated graphs with eta = 0.2
m2D bi-dimensional meshes with rho = 0
m2Dr2 bi-dimensional meshes with rho = 0.2
m2Dr4 bi-dimensional meshes with rho = 0.4
m2Dr6 bi-dimensional meshes with rho = 0.6
m3D tri-dimensional meshes with rho = 0
m3Dr2 tri-dimensional meshes with rho = 0.2
m3Dr4 tri-dimensional meshes with rho = 0.4
m3Dr6 tri-dimensional meshes with rho = 0.6
m4D quadri-dimensional meshes with rho = 0
m4Dr2 quadri-dimensional meshes with rho = 0.2
m4Dr4 quadri-dimensional meshes with rho = 0.4
m4Dr6 quadri-dimensional meshes with rho = 0.6
b03 bounded-valence graphs with valence = 3
b03m modified bounded-valence graphs with valence = 3
b06 bounded-valence graphs with valence = 6
b06m modified bounded-valence graphs with valence = 6
b09 bounded-valence graphs with valence = 9
b09m modified bounded-valence graphs with valence = 9

The files inside each .zip archive are divided into subdirectories, using three levels of grouping. The first level correspond to the type of matching, and the directories are named using the prefixes of TABLE 1. The second level groups together similar graph types; TABLE 2 shows the names of the directories, and the graph types they contain. Finally, the third level corresponds to graph types, and the directories are named after TABLE 2. For example, the graphs in the file si4_r01.zip get expanded into the directory si4/rand/r01/.


Group Corresponding types
rand r001, r005, r01 ,r02
m2D m2D, m2Dr2, m2Dr4, m2Dr6
m3D m3D, m3Dr2, m3Dr4, m3Dr6
m4D m4D, m4Dr2, m4Dr4, m4Dr6
bvg b03, b03m, b06, b06m, b09, b09m

Type groups

The name of the graph files in each .zip archive are obtained from the name of the .zip file by adding a suffix and an extension. The suffix, separated by an underscore (“_&”) from the graph type identifier, indicates the number of nodes in the graphs. The first letter of the suffix is “s“; for small graphs (i.e. up to 100 nodes) and “m” for medium-sized graphs (i.e. above 100 nodes but below 2000); after this letter, there is the decimal representation of the number of nodes. The file extension (separated by a period “.” from the rest of the file name) is made of a letter, which can be “A” or “B“, and two digits, forming a number from 00 to 99, that identifies one of the pairs of matching graphs. Within the pair, the graph with the “A” in the filename extension is the subgraph, for the subgraph-isomorphism matching types. For example, a graph file name could be mcs50_r01_s15.A27, meaning that the graph is the first of the pair #27 of randomly-generated graphs having a maximum common subgraph whose dimension is 50% of size of the graphs, with eta=0.01 and having 15 nodes.


The graph file format

The graphs are stored in a compact binary format, one graph per file. The file is composed of 16 bit words, which are represented using the so-called little-endian convention, i.e. the least significant byte of the word is stored first. Two different formats are used for labeled and unlabeled graphs. The functions to read these two different formats are provided below.


How to read unlabeled graphs

How to read labeled graphs


The ground truth files

For each .zip file in the graphs directory there is a corresponding .gtr file containing the ground truth regarding the pairs of graphs in the .zip archive. For example, the ground truth of the file iso_b03.zip is in iso_b03.gtr.

A ground truth file is a text file containing a line for each pair of graphs in the .zip file;

  • Unlabeled graphs. The line has two fields, separated by a TAB character (ASCII 9), and is terminated by a LF character (ASCII 10). The first field is the name of the first graph of the pair, while the second field is the number of matchings found.
  • Labeled graphs. The line has four fields, separated by a TAB character (ASCII 9), and is terminated by a LF character (ASCII 10). The first field is the name of the first graph of the pair, while the other fields are the number of matchings found. More in details, the second field is the number of maximum common sungraphs when the number of attributes of the graphs is fixed to 33% of graph size. Analogously, the third and the fourth fields are respectively the number of maximum common subgraphs when the number of attributes is fixed to 50% and 75% of the graph size.

For example, a line of the iso_b03.gtr file could be iso_b03_m1000.A00 1 to signify that between the graphs iso_b03_m1000.A00 and iso_b03_m1000.B00 there is only one isomorphism, while a line of the mcs10_b03.gtr file could be mcs10_b03_s20.A00 1 2 1 meaning that the graphs mcs10_b03_s20.A00 and mcs10_b03_s20.B00 holds respectively one, two and one maximum common subgraph when the number of attributes of the graphs is fixed to 33%, 50% and 75% of graph size.
All the graph pairs in the iso_* files are present in the corresponding ground truth files. However, for graph-subgraph isomorphisms, and for maxmum common subgraphs only a subset of the pairs have been accounted for in the ground truth files, due to the considerably larger computation time needed to generate this information.


Graphs Generator

The source files of the programs used to generate the graphs composing the database can be found here. These programs have been developed and used under Linux, and are not completely portable to other operating systems, since they use the /dev/urandom pseudo-device for initializing the random number generator. Each program generates a pair of graphs; we have used shell scripts to call repeatedly the programs with the proper arguments for generating the whole database.
In the following table there is an outline of the programs and of their command line arguments.


Name Arguments Types generated
gen nodes edges file1 file2 iso*r001, iso*r005, iso*r01
gensub subgraph-nodes nodes edges file1 file2 si*r001, si*r005, si*r01
genm2D nodes extra-edges file1 file2 iso*m2D, iso*m2Dr2, iso*m2Dr4, iso*m2Dr6
gensubm2D subgraph-nodes nodes extra-edges file1 file2 si*m2D, si*m2Dr2, si*m2Dr4, si*m2Dr6
genm3D nodes extra-edges file1 file2 iso*m3D, iso*m3Dr2, iso*m3Dr4, iso*m3Dr6
gensubm3D subgraph-nodes nodes extra-edges file1 file2 si*m3D, si*m3Dr2, si*m3Dr4, si*m3Dr6
genm4D nodes extra-edges file1 file2 iso*m4D, iso*m4Dr2, iso*m4Dr4, iso*m4Dr6
gensubm4D subgraph-nodes nodes extra-edges file1 file2 si*m4D, si*m4Dr2, si*m4Dr4, si*m4Dr6
genbvg nodes valence file1 file2 iso*b03, iso*b06, iso*b09
gensubbvg subgraph-nodes nodes valence file1 file2 si*b03, si*b06, si*b09
genbvgm nodes valence file1 file2 iso*b03m, iso*b06m, iso*b09m
gensubbvgm subgraph-nodes nodes valence file1 file2 si*b03m, si*b06m, si*b09m
genrndat nodes nodesub edges edgesub file1 file2 mcs*r005, mcs*r01, mcs*r02
genm2Dat nodes nodesub edges extra-edgesub file1 file2 mcs*m2D, mcs*m2Dr2, mcs*m2Dr4, mcs*m2Dr6
genm3Dat nodes nodesub edges extra-edgesub file1 file2 mcs*m3D, mcs*m3Dr2, mcs*m3Dr4, mcs*m3Dr6
genm4Dat nodes nodesub edges extra-edgesub file1 file2 mcs*m4D, mcs*m4Dr2, mcs*m4Dr4, mcs*m4Dr6
genbvgat nodes nodesub edges valence file1 file2 mcs*b03, mcs*b06, mcs*b09
genbvgmat nodes nodesub edges valence file1 file2 mcs*b03m, mcs*b06m, mcs*b09m

For each program there is one source file with extension .cc. The programs gen, gensub, genm2D and gensubm2D need to be linked with the VFLIB graph matching library. All other files are self-contained.