What about really big dictionaries

Nov 17, 2009 at 7:42 AM

Hi,

I tried to add to a the dictionary 15 M strings and failed before half way. an Out Of Memory Exception was thrown. I thought that the idea of this dictionary is that it can be really big because everything is held on the disk.

Maybe a flush should be called once in 100000 inserts or something similar ? are there any guide lines for really big dictionaries ?

Liron.

Coordinator
Nov 17, 2009 at 12:02 PM

First, sorry for your inconvenience on this, and the explanation is straight forward.

Internally I use a Dictionary<int,{128bit struct}> to hold the key hashes and file positions for the actual key and value. This dictionary does not hava a default capacity, and will thus perform badly when adding millions of entries. I tried setting the initial capacity to 100mill, but that gave me out of memory on init. Setting it to 20 mill worked fine.

            string path = @"c:\temp\";
            var dict = new mAdcOW.DataStructures.Dictionary<string, string>(path);

            for (int i = 0; i < 60*1024*1024; i++)
            {
                if( i % 500000 == 0)
                {
                    Console.WriteLine("Count: {0}\tMemusage:{1}", i, Process.GetCurrentProcess().WorkingSet64);
                }
                var text = Guid.NewGuid().ToString();
                dict[text] = text;
            }

(If you use the same string for key/value in the Dictionary from .Net you will save space due to them pointing to the same reference.)

With the test program I got this output:

Count: 17500000 Memusage:2196598784
Count: 18000000 Memusage:2297872384
Count: 18500000 Memusage:2392596480
Count: 19000000 Memusage:2482249728
Count: 19500000 Memusage:2563072000
Count: 20000000 Memusage:2614714368
Count: 20500000 Memusage:2979471360
Count: 21000000 Memusage:3050565632
Count: 21500000 Memusage:2915487744

Over 20mill entries, using 3gb ram.

So, as long as your key and value types are longer than what the internal Dictionary uses it's useful. Otherwise you could just as well use the regular one. I have some thoughts as to replace internal dictionary as well. I'll post an update on it once I've done some tests.

Memory benchmarks for strings twice a guid:

System.Collections.Generic.Dictionary<>
Count: 0        Memusage:572211200
Count: 500000   Memusage:760725504
Count: 1000000  Memusage:944480256
Count: 1500000  Memusage:1129787392
Count: 2000000  Memusage:1313652736
Count: 2500000  Memusage:1498492928
Count: 3000000  Memusage:1666023424

mAdcOW.DataStructures.Dictionary<>
Count: 0        Memusage:590802944
Count: 500000   Memusage:641163264
Count: 1000000  Memusage:685953024
Count: 1500000  Memusage:741453824
Count: 2000000  Memusage:768737280
Count: 2500000  Memusage:829206528
Count: 3000000  Memusage:879022080

Summed up, the larger the key/value sizes, the more benefit the disk versions has.

Nov 17, 2009 at 12:32 PM

Thanks for the explenations.

I passed 20 M to the ctor of the dictionary, but now after 5 and something million rows i get "I/O error occurred. (-2147024888)".

this are the lines that throw the exception (MemoryMappedFile.cs, line320)

if (baseAddress == IntPtr.Zero)
           throw new FileMapIOException(Marshal.GetHRForLastWin32Error());

I tried to see where i can read the documentation of the winterdom dll but couldn't find.

What do u say ?

Liron.

Coordinator
Nov 17, 2009 at 2:59 PM

I suspect you are running on a 32bit system, or running VS in debug mode on a 64bit system (since VS runs in 32bit).

The error code 0x8007008 says that you don't have more address space available in the OS. The project is specifically tailored for 64bit systems in order to optimize the usage of memory mapped files. If you revert to the original code from WinterDom you can set the viewsize to only a portion of the backing file, preserving your address space.

When I change from WinterDom to .Net 4.0 I might make it work for 32bit systems as well.

 

Nov 17, 2009 at 3:06 PM

I am running on a 32-bit machine, I didn't notice that it meant to run on 64-bit.

I assumed that since 32-bit software usually suffers from lack of memory compared to 64-bit this will be my solution for a big dictionary.

My dictionary can be inserted to a Dictionary<string,int> in 64-bit with no problems, but fails on 32-bit, that's the reason i was looking for an on disk dictionary.

I think that this component is more important for 32-bit machines. don't you ?

Liron.

 

Coordinator
Nov 17, 2009 at 3:16 PM

There are many 64bit systems with only 4gb of ram, and more and more desktop/servers are running 64bit, so it should be useful on both.

I'll edit the front page text to say something more about 64bit.

 

Dec 28, 2009 at 8:58 AM

 

HI Liron!

I tryed to run on my 32bit machine(on vs2008) the code example you gave above.

However got exeception with Win32 Error Code -2147024888( Not enough storage is available to process this command.)

Do I have to make some configuration to dictionary to enable to store large amount of data (up to 16gb)

thanks,

Roman

Dec 28, 2009 at 9:54 AM

I didn't give an example, but you are probably talking about wobba's code.

I don't think you can do something to run this example on your machine, you should get a more powerful machine :) or start looking for other solutions which uses the disk and not memory mapped files which are limited the size of the process.

Liron.

Dec 28, 2009 at 10:46 AM

thanks for the reply

I am  pretty sure that MMF could give me the solution I need, since I don't have to keep all the 16gb in memory, I can use much smaller views.

The question is does disk based dictionary has an ability to store large amount of data , but keep in memory only partial view of it.

I tried to run wobba example on 64 bit, but my test program memory usage grew periodically ,  is there an option to limit size  of current view of disk based dictionary, i.e. to limit the process memory size 

Dec 28, 2009 at 12:03 PM

In General keeping an MRU or some other view offcourse is possible and even makes sense. Wobba should answer this question related to his own implementation.

Liron.

Dec 28, 2009 at 12:39 PM

Wobba, could you please answer to that?

Coordinator
Dec 29, 2009 at 8:45 AM

I have reworked the code for the Dictionary so that it don't hold the hashes in memory. That will make it work much better. It's not quite done yet, but I hope to put out a new release in not too long.

To address @roman300189 question: my current implementation address the whole file(and one time per thread using it as well). For large files this will make you eat up your 4GB of address space pretty quickly. And that's why you get the error.

I more or less ripped out the view moving code from the Winterdom library for this, since allocation a view is an "expensive" operation. When I migrate the project to .Net 4.0 I'll put that back in, so 32bit systems can get a better use of this project. Or if I find the time I'll put it into the existing solution as well.

Jan 8, 2010 at 6:34 AM

Hi Wobba ,

 I am a new bie in using MMF but i  found you example very useful as i have to manage very large dictionaries.

I am facing some problem in using the MMF  Dictionary

when i try to initailize the dictionary object with key and value both strings <font size="2">

 

</font>

string path = AppDomain.CurrentDomain.BaseDirectory;

var Dict = new mAdcOW.DataStructures.Dictionary<string, string>(path);

i receive an exception 

Missing Method Exception

Constructor on type 'System.String' not found.

it works fine when i use key and value both as Int

Can you please help me out .

Am i missing something

My OS is 64 bit with 4 GB ram

 

With regards

Abhinav gupta

 

Coordinator
Jan 8, 2010 at 8:08 AM

This has been reported before, and is fixed. The reason is that a string don't have a default empty constructor, so the instantiation of a string fails.

This is fixed in the dev branch and will come in the next release which I plan to do over the weekend.

Your quickest solution is to download the latest sources from http://mmf.codeplex.com/SourceControl/list/changesets, which also has the new dictionary implementation, where also the keyhashes are stored on disk, so it will scale much better. The code there will most likely become the release. I'm adding some more tests and benchmarking before the release.

Jan 12, 2010 at 5:46 AM

Hi Wobba,

String key values are now adding in the dictionary of the latest release and the speed is also very good

thanks for ur quick reply and help

 :)) great job .

Cheers

Abhinav Gupta

 

 

 

Coordinator
Jan 12, 2010 at 8:20 AM
abhinav_007 wrote:

Hi Wobba,

String key values are now adding in the dictionary of the latest release and the speed is also very good

thanks for ur quick reply and help

 :)) great job .

Cheers

Abhinav Gupta

 

 

 

Being curious, how much data are you putting into the Dictionary?

One point to think about is that currently it will not reallocate the buckets in the hash when the number of entries exceed the buckets. It will instead chain the keys, which will make it slower over time as you have more and more data with the same hash. So if you know you will have 10 million items, instantiate the Dictionary with that number.

Jan 12, 2010 at 10:43 AM

 

I have around 11 Dictionaries in the same program filling at the same time with approx 5 million items in one 3.5 million items in other and so on.

I tried initializing the dictionary size and the performance was drastically improved .. again thanks for your suggestion

 

Regards,

Abhinav Gupta

Coordinator
Jul 13, 2010 at 8:12 PM

A working 32bit version is now located in the source code under the Experimental32bit branch. It runs all the tests, but I haven't tested it with huge files.

Jun 27, 2012 at 3:28 PM
wobba wrote:

This has been reported before, and is fixed. The reason is that a string don't have a default empty constructor, so the instantiation of a string fails.

This is fixed in the dev branch and will come in the next release which I plan to do over the weekend.

Your quickest solution is to download the latest sources from http://mmf.codeplex.com/SourceControl/list/changesets, which also has the new dictionary implementation, where also the keyhashes are stored on disk, so it will scale much better. The code there will most likely become the release. I'm adding some more tests and benchmarking before the release.


Has the dev code branch been incorporated into the latest download?

Thanks...

Coordinator
Jun 28, 2012 at 8:28 AM

Hi,

If you pull down the latest sources (build 67418) everything should load and compile and string is included also included in the v3 binary release.

I should probably make a new 3.1 release, but haven't focused on the project for a while. There is work ongoing to use the .Net4 mmf API as well.

Thanks,
Mikael