Wanting to know the memory footprint and performance of different java options to keep simple string lookups in memory, I profiled different java.util collection classes that are filled with the same list of strings to see how much the memory usage differs. I then loaded the same data into 2 lucene in memory indices using lucenes RAMDirectory. I finally also evaluated an embedded file based & in memory H2 database. The data that has been loaded are 1.573.345 scientific name strings, the longest being about 150 characters. The original uncompressed text file is 31.6MB (zipped 8.1MB). To also test ID lookup in case of java.util.map or the KVP lucene index, the row number of each name has been used. The machine I used for testing was a MacPro 8-core 3GHz, 5GB RAM using Java6 with 2GB of memory (-Xmx2g) on Mac OSX 64bit. Here are the shortened results using System.currentTimeMillis() and JProfiler inspecting deep object copies in heap dumps (a seriously memory intensive thing too in some cases like lucene and H2 which even crashed):
Text file: 31.6 MB Zipped: 8.1 MB java.util.HashSet
# 264MB # contains test of 10.000 x 12 terms took 28 msecs # uses HashMap internally... java.util.TreeSet # 256MB # contains test of 10.000 x 12 terms took 98 msecs # uses TreeMap internally... java.util.HashMap # 300MB # contains (key) test of 10.000 x 12 terms took 21 msecs # includes Integer values as opposed to above Set java.util.TreeMap # 292MB # contains (key) test of 10.000 x 12 terms took 93 msecs # includes Integer values as opposed to above Set java.util.ArrayList # 172MB # contains test of 100 x 12 terms took 43257 msecs java.util.LinkedList # 220MB # contains test of 100 x 12 terms took 50564 msecs String array # 172MB org.apache.commons.collections.map.HashedMap # 384 MB # contains (key) test of 10000 x 12 terms took 23 msecs # no generics support ! javolution.util.FastList # 220 MB # contains test of 100 x 12 terms took 58886 msecs javolution.util.FastMap # 396 MB # contains (key) test of 10000 x 12 terms took 18 msecs javolution.util.FastSet # 276MB # contains test of 10000 x 12 terms took 10 msecs javolution.util.FastTable # 172MB # contains test of 100 x 12 terms took 50961 msecs gnu.trove.THashMap # 331MB # contains (key) test of 10000 x 12 terms took 65 msecs gnu.trove.THashSet # 185 MB # contains test of 10000 x 12 terms took 47 msecs com.google.common.collect.ImmutableSet # 331MB # contains test of 10000 x 12 terms took 19 msecs com.google.common.collect.ImmutableMap # 404MB # contains (key) test of 10000 x 12 terms took 32 msecs Lucene KVP index # 94MB # lucene 100 x 12 TermQueries took 58 msecs # lucene 10000 x 12 TermQueries took 590 msecs # in memory Index building a key value index with each term being a document and storing the value as a document field # overhead of using an IndexSearcher to just the RAMDirectory is minimal # 1573345 records loaded into Lucene KVP index in 13453 msecs Lucene term index # 44MB # lucene 100 x 12 TermQueries took 32 msecs # lucene 10000 x 12 TermQueries took 369 msecs # in memory Index storing only the pure term index # 1573345 records loaded into simple Lucene term index in 10222 msecs H2 file based # 33MB (connection object, not during querying) # sql equals test of 10000 x 12 terms took 634 msecs # file based db with 1 table, 1 indexed varchar(255) column # 1573345 records loaded into H2 file database in 245025 msecs H2 in memory # ???MB (JProfiler requires >3gig memory) # sql equals test of 10000 x 12 terms took 1024 msecs # in memory db with 1 table, 1 indexed varchar(255) column # 1573345 records loaded into H2 in memory database in 18760 msecs
Lucene came out with the least memory footprint of only 44MB, gnu.trove does best in terms of memory for classic sets, while javolution outperformes anyone else on speed. But their footprint is even slightly larger than the java.utilHashSet. Compared to KVP lucene the performance of H2 is pretty similar. Building the H2 file db took more than 10 times longer than the in memory one, so for regular updated and writes in memory offers lot more performance, but reads are a surprise. The file based version was nearly twice as fast as the in memory one! gnu.trove contains a lot of specialised classes to hold primitives as keys in sets or maps for example. In case one is using int or long this should be a much better footprint, but I haven't tested it as I am interested in Strings currently. For those interested, here is the source code that created the objects and did the time measurments (most tests are commented out in this particular revision).