Psychochild's Blog

A developer's musings on game development and writing.

11 January, 2014

Python: PartialMatchDict
Filed under: — Psychochild @ 4:15 PM
(This post has been viewed 1838 times.)

I wrote a class for some Python code I'm working on, and I figured there might be others out there who would find it interesting. The class lets you set up a typical dictionary data structure (where hashable items are used as keys to look up values stored in the structure), but will allow you to do partial and case-insensitive matches.

A bit more details, and the code, after the break

I included a bit of testing code at the bottom of the code. If you run this from the Python interpreter directly from the command line, it should execute the test code and spit out the results to give you an idea of how it works.

Yes, I know this kinda goes against the purpose of a Python dictionary. It only works for strings instead of all hashable objects; it will try to turn your keys into strings, actually. And, the search is currently linear if you don't include the original key, so performance will be an issue in bigger structures. I'm sure Python purists will line up around the block to tell me this is not really Pythonic. That's fine, it worked for what I needed it and thought it was unique enough to be worth sharing.

Where would you use this? In a text MUD, you could use this for lookups of names of other players, or for abbreviations of commands. So the command "te psy I think this is okay." might match "te" to the command "tell" and "psy" to the name "Psychochild" in the game, assuming there are no other commands/names that begin with the same letters.

One way to extend this might be to have an most recently used (MRU) cache. This would decrease performance in smaller dictionaries, but might help if you have a large dictionary with a few more frequently used keys. Maybe store and re-order the keys in the additional lookup code to improve access if you have some keys that would be used more often in a short period of time, such as abbreviations for names between two people chatting.

Anyway, I'm releasing this under the MIT license, which basically translates to "do whatever you want with it, just keep the copyright statement attached and don't blame me if anything goes wrong." A note to me if you do use it would also be appreciated. :)

Feel free to ask questions in the comments below, or if you post up an improvement.


# PartialMatchDict - a dictionary that will allow partial key matches.
#
# Code copyright: Brian 'Psychochild' Green - http://psychochild.org/
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#
#   1. Redistributions of source code must retain the above copyright
#      notice, this list of conditions and the following disclaimer.
#
#   2. Redistributions in binary form must reproduce the above copyright
#      notice, this list of conditions and the following disclaimer in the
#      documentation and/or other materials provided with the distribution.
#
#   3. The names of its contributors may not be used to endorse or promote
#      products derived from this software without specific prior written
#      permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
# A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER
# OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
# LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
# NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

# TODO: Cache limited MRU entries for increased performance in large structures.
# - Include a dict that maps partial keys to full keys?  Or store keys in own data structure and re-order them based on use.
# - Limit size to number of elements?
# - Need to invalidate cache if we add more items.
# TODO: Create custom exception types to differentiate between keys not found and keys not unique.

class PartialMatchDict(dict):
    """
    A dictionary structure that will return a value if its key at least partially uniquely matches a passed-in key.
    Exact matches returned first, then checks for partial matches to the beginnings of keys while ignoring case.
    Example:
    >>> pd = PartialMatchDict()
    >>> pd["thelongstring"] = 1
    >>> pd["the"] = 2
    >>> pd["TheString"] = 3
    >>> pd["thelongstr"]
    1
    >>> pd["the"]
    2
    >>> pd["thestring"]
    3
    >>> pd["th"]
    KeyError: "Partial key 'th' not unique"
    >>> pd["anotherstring"]
    KeyError: "Missing key 'anotherstring'"
    """

    def __init__(self, items = None):
        super(PartialMatchDict, self).__init__()

        if items:
            self.update(items)

    # PartialSearch - Do the partial search.
    # param: partialKey = key to try to match.
    # returns: a tuple, 0th element is True if found a value, None if no possible match, False if no unique match was found;
    #                   1st element indicates value if successful, or first possible value if not unique.
    def PartialSearch(self, partialKey):
        # if the item is in the dictionary then just return it
        if dict.__contains__(self, partialKey):
            return (True, dict.__getitem__(self, partialKey))

        matchFound = (None, None)
        # Don't partial match an empty string, because it will match anything.
        if len(partialKey) == 0:
            # "" is not a unique key.
            return (False, "(empty string)")
        # See if any key starts with our partialKey.  Use lower() to ignore case.
        lowerPartialKey = str(partialKey).lower()
        for key in self:
            if str(key).lower().startswith(lowerPartialKey):
                if not matchFound[0]:
                    matchFound = (True, dict.__getitem__(self, key))
                else:
                    # More than one potential match, so set our success indicator to False.
                    matchFound = (False, matchFound[1])
                    break
        return matchFound

    # Find out if we have an exact key in the set.
    def has_exact_key(self, item):
        return dict.__contains__(self, item)

    # Override dict method, used for "in" operator.
    def __contains__(self, item):
        # Compare to True to have None (not found at all) returns False.
        return (self.PartialSearch(item)[0] == True)

    # Override dict method, used when looking up key using [].
    def __getitem__(self, item):
        wasFound, theValue = self.PartialSearch(item)
        if not wasFound:
            errorMsg = "%s"
            if wasFound == None:
                # Couldn't find anything at all.
                errorMsg = "Missing key '%s'"
            else:
                # Key not unique
                errorMsg = "Partial key '%s' not unique"
            raise KeyError(errorMsg % (str(item)))
        else:
            return theValue

# If we run the file from the command prompt, do some tests.
if __name__ == '__main__':
    testdict = PartialMatchDict()
    print "** Assigning testdict['Primary'] = 1"
    testdict['Primary'] = 1
    print "** Assigning testdict['Secondary'] = 2"
    testdict['Secondary'] = 2
    print "** Assigning testdict['Tertiary'] = 3"
    testdict['Tertiary'] = 3
    print "** Assigning testdict['thirtieth'] = 30"
    testdict['thirtieth'] = 30

    # Make sure we find what we expect.
    print "testdict['Primary'] == ", testdict['Primary']
    print "testdict['primary'] == ", testdict['primary']
    print "testdict['secondary'] == ", testdict['secondary']
    print "testdict['tertiary'] == ", testdict['tertiary']
    print "testdict['tert'] == ", testdict['tert']
    print "testdict['thirtieth'] == ", testdict['thirtieth']
    print "testdict['thirt'] == ", testdict['thirt']
    print "testdict['th'] == ", testdict['th']
    print "** Assigning testdict['th'] = None"
    testdict['th'] = None
    print "testdict['th'] == ", testdict['th']
    print "testdict['thirt'] == ", testdict['thirt']

    # Expect some errors.
    try:
        print testdict['quart']
    except KeyError:
        print "There is no possible match, so could not find testdict['quart']"
    try:
        print testdict['t']
    except KeyError:
        print "Possible matches are not unique, so could not find testdict['t']"

--


« Previous Post:





7 Comments »

  1. And, the search is currently linear if you don't include the original key

    Specifically, the only way to get a less than N-element search for a dictionary with N elements is if you do have duplicates. It won't stop the search until the end is reached or a second partial match is made.

    I think a radix tree (or "trie") whose key is the exact match in the dictionary is a much better way to implement this sort of search - at least if there are enough keys that a linear search every time worries you. You do need to know what the (lowercase) key space is, but you can make a fairly efficient data structure that can identify lack of match, ambiguous match, and exact match very quickly.

    That may seem like it's going overboard, but my experience is that a library or module with an O(N) interface is going to get abused as if it were an O(log N) interface.

    Comment by Matthew Weigel — 11 January, 2014 @ 5:49 PM

  2. Heh, beaten to the punch. I was going to suggest a trie as well. It's used in predictive auto-complete, like on a mobile phone. Which is pretty much the same problem as you are solving.

    Wikipedia has a good article on it: http://en.wikipedia.org/wiki/Trie

    Comment by RohanV — 11 January, 2014 @ 6:03 PM

  3. Which data structure works better depends on your usage. For a list of names of people currently logged in a MUD, you're going to have the data changing frequently, and adding and removing elements from a radix tree is going to have some cost associated with it. You also need more memory to store the structure in place; you're essentially trading a bit of speed for a significant increase in memory usage with smaller amounts of data.

    So, at smaller scales the linear search will work fine and the radix tree would probably be overkill. If you need a more scalable solution, the radix tree would likely be better.

    Comment by Psychochild — 12 January, 2014 @ 12:31 PM

  4. I'm not sure you are right about space. Hash maps require space for all the empty keys, which balances out the overhead in some of the forms of tries.

    Check out the Ternary Search Tree from http://algs4.cs.princeton.edu/lectures/52Tries.pdf

    Also, your algorithm does depend on how python's startsWith is implemented. It might be O(nm) performance, where m is the length of the strings involved.

    As well, I would move the lower() calls outside the loop, and lowercase the strings on insertion. Would save a little time when searching. Actually, now that I'm re-reading the code, the exact match is case-sensitive, while the partial match is not case-sensitive. I'm not sure if this behavior is desired. It seems like the kind of thing that would trip someone up.

    Comment by RohanV — 12 January, 2014 @ 3:18 PM

  5. RohanV wrote:
    Actually, now that I'm re-reading the code, the exact match is case-sensitive, while the partial match is not case-sensitive. I'm not sure if this behavior is desired.

    Well, the goal was that I wanted the structure to still act like a Python dictionary except that it would also make partial, case-insensitive matches. Since case matters in dictionaries, I left that as a desired feature.

    As I said, the usage case here is where internally I'll use the fully qualified names and want Python's dictionary efficiency, but user input is less predictable. I wanted to keep this within one data structure, rather than trying to update two different data structures; a dictionary for internal use and speed, and a second to do partial matches based on user input. Trying to synchronize two data structures would lead to the possibility of introducing more bugs.

    You are right about doing the lower() conversion outside the loop and I edited the code above. However, worrying about the speed of implementation of a core function like startswith() outside of finding bottlenecks in a running systems smells a bit like premature optimization, which we all know is the root of all programming evils. ;)

    Anyway, thanks for the links and info. It had been a while since I looked into radix trees, so reading your links was a good refresher course. :)

    Comment by Psychochild — 12 January, 2014 @ 5:58 PM

  6. One reason the case-sensitive exact match vs. case-insensitive partial match might matter is the usage example given in the documentation: the key "thestring" is not going to be found via exact match.

    If you don't want to care about case, I would store lower (or upper) case keys only, then force searches for lower case keys in PartialSearch right from the start. It does change the behaviour a little in that providing the exact key and case is no longer the only way to use the exact match.

    With regards to using a Trie: if you care about space efficiency, basically don't use Python. That'd be the perfect use-case for dropping down to a C module.

    Comment by unwesen — 12 January, 2014 @ 11:58 PM

  7. unwesen wrote:
    the key "thestring" is not going to be found via exact match.

    It would be found if you had an exact key. The first check in PartialMatch is for exact keys in the dictionary. So, here's how it would work:

    >>> pd["TheString"] = 1
    >>> pd["thestring"]
    1
    >>> pd["thestring"] = 2
    >>> pd["thestring"]
    2

    There's only an attempt at a partial match if there is no exact match. Changing this behavior away from how Python dictionaries work would not fit my original desire of retaining the core behavior of the Python dictionary structure. If I were to change the core behavior to force lower-case matches, I'm not sure there would be much benefit in deriving the class from the dictionary base class.

    Comment by Psychochild — 14 January, 2014 @ 12:25 AM

Leave a comment

I value your comment and think the discussions are the best part of this blog. However, there's this scourge called comment spam, so I choose to moderate comments rather than giving filthy spammers any advantage.

If this is your first comment, it will be held for moderation and therefore will not show up immediately. I will approve your comment when I can, usually within a day. Comments should eventually be approved if not spam. If your comment doesn't show up and it wasn't spam, send me an email as the spam catchers might have caught it by accident.

Line and paragraph breaks automatic, HTML allowed: <a href="" title="" rel=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <div align=""> <em> <font color="" size="" face=""> <i> <li> <ol> <strike> <strong> <sub> <sup> <ul>

Email Subscription

Get posts by email:


Recent Comments

Categories

Search the Blog

Calendar

April 2014
S M T W T F S
« Feb    
 12345
6789101112
13141516171819
20212223242526
27282930  

Meta

Archives

Standard Disclaimer

I speak only for myself, not for any company.
(More information here)

My Book





Information

Around the Internet

Game and Online Developers

Game News Sites

Game Ranters and Discussion

Help for Businesses

Other Fun Stuff

Quiet (aka Dead) Sites

Posts Copyright Brian Green, aka Psychochild. Comments belong to their authors.

Google