Wednesday, September 19, 2012

How to find whether a string is an anagram

First things first, what is an anagram? Anagram is a word, phrase, name or a sentence formed by rearranging the letters of an another word, phrase, name of sentence.The key rule here is that all letters of the name/phrase/word/sentence should be used ONCE and only ONCE. For example
  • cinema and iceman.
  • Longest Anagram ever created is here.
  • Stone age and Stage one.
  • A chevrolet and Love the car
  • A carton of cigarettes and I got a taste for cancer.
Some more examples are here.

Anyways back to Computer Science, how to write an algorithm, which can check whether a given word/phrase/sentence is an anagram or not.
The rule for an anagram is that both the strings should be formed using the exactly the same letters as shown above in the example.

Lets take cinema and iceman
In the first word cinema,
c occurs once, i occurs once and so do the rest of the letters i.e., n, e, m and a. Now observe that in the anagram word iceman:
i occurs once, i occurs once and so do the rest of the letters i.e, e, m, a and n. The key takeaway is that all these letters occur only once. If the second letter uses a letter, which is not present in the first, then it is not an anagram.

Two approaches:
  1. A simple approach would be to sort both the words/phrase/sentence and compare for equality. If they are equal then they are anagrams otherwise not.
  2. Another approach is to  check if the two strings have identical counts for each unique char.
The approach I am proposing is similar to the second approach above. It is:
  1. Take the first string and create a binary sorted tree out of it. Note the data that the node will contain would be the letter and a count, which basically is used to keep track of occurrence of that particular character.
  2. Once the tree is populated with the first string, now add the second string also in the BST. The only difference is that if while populating the data, you end up in a situation where a new node has to be created, then you know it is not an anagram.
  3. There is one more case and which is that the unique letters are same, however their counts are different. So while adding the second string to the BST, if you run into a existing letter, decrement that nodes letter count.
  4. If we exhaust all the elements of the second tree as well, do a tree-traversal using any mechanism (pre/post/in/level), the requirement is that the letter count should be zero for it to an anagram.
class Node {
    public:
        Node( wchar_t wchElement ) : m_wchElement( wchElement ),
                                     m_nCount( 1 ),
                                     m_pLeft( NULL ),
                                     m_pRight( NULL ) {};
        ~Node() {
            delete m_pLeft;
            m_pLeft = NULL;
            delete m_pRight;
            m_pRight = NULL;
        };

        int  GetCount() const       { return m_nCount;};
        void IncrementCount()       { ++m_nCount;};
        void DecrementCount()       { --m_nCount;};

        wchar_t GetData() const     { return m_wchElement;};
        Node* GetLeft() const       { return m_pLeft;};
        Node* GetRight() const      { return m_pRight;};

        void  SetLeft( Node* pNode) { m_pLeft = pNode;};
        void  SetRight( Node* pNode){ m_pRight = pRight;};
    private:
        //hidden constructor
        Node();
        wchar_t m_wchElement;
        int     m_nCount;
        Node*   m_pLeft;
        Node*   m_pRight;
};

class BST {
    public:
        BST(): m_pRoot( NULL ) {}
        ~BST() {
            //function to empty the tree, left as an exercise
            //for the reader
        }

        Node* Add( wchar_t wchElement, bool bIncrementCount = true)
        bool  IsAnagram( wchar_t* pszString1, wchar_t* pszString2 );
        bool  PreOrder( Node* pNode );
    private:
        Node* m_pRoot;
};

bool BST::Add( wchar_t wchElement, bool bIncrementCount=/*true*/) {
    //Note: When adding the first string
    //the return values are of no use
    //they will come into use while adding the second string
    //i.e, when bIncrementCount is false.
    bool bAnagram = true;
    if( !m_pRoot ) {
        m_pRoot = new Node( wchElement );
        return bAnagram;
    }

    Node* pRoot = m_pRoot;
    while( pRoot ) {
        if( pRoot->GetData() == wchElement ) {
            bIncrementCount ? pRoot->IncrementCount():
                              pRoot->DecrementCount();
        }
        elseif( pRoot->GetData() < wchElement ) {
            if( pRoot->GetLeft() ) {
                pRoot = pRoot->GetLeft();
            }
            else {
                if( nIncrementCount ) {
                    pRoot->SetLeft( new Node( wchElement ) );
                }
                else {
                    //we are adding the second string
                    //and we have found a unique element
                    //which is not in the first string
                    //so this is not a anagram.
                    bAnagram = false;
                    break;
                }
            }
        }
        else {
            //greater so operate on the right sub-tree
            if( pRoot->GetRight() ) {
                pRoot = pRoot->GetRight();
            }
            else {
                if( bIncrementCount ) {
                    pRoot->SetRight( new Node( wchElement ) );
                }
                else {
                    //we are adding the second string
                    //and we have found a unique element
                    //which is not in the first string
                    //so this is not a anagram.
                    bAnagram = false;
                    break;
                }
            }
        }
    }

    return bAnagram;
}

bool BST::IsAnagram( wchar_t *pszString1, wchar_t *pszString2 ) {
    wchar_t *pStart = pszString1;
    while( *pStart ) {
        Add( *pStart++, true );
    }

    //add the pszString2, while adding the next string
    //we will get to know any character which is unique
    //in the 2nd string.
    //however there may be strings like abcd and abcc, or
    //like xxxx and xxx. For such cases while populating the
    //BST we will not come to know, so for such cases we will later
    //use one of the traversal techniques to see how many elements
    //still have a non-zero count. If they have then it is not an 
    //anagram
    pStart = pszString2;
    while( *pStart ) {
        if( !Add( *pStart++, false ) )
            return false;
    }
    //Here we will use one of the traversals techniques to traverse
    //all the nodes and check the count at each node
    //if we find any node having non-zero count, then we know for 
    //sure that it is not an anagram.

    return PreOrder( m_pRoot );
}

//rLR
bool BST::PreOrder( Node* pNode ) {
    if( !pNode )
        return true;
    if( pNode->GetCount() != 0 )
        return false;
    if( PreOrder( pNode->GetLeft() ) )
        return PreOrder( pNode->GetRight() );
    else
        return false;
}

int wmain( int argc, wchar_t** argv ) {
     wchar_t *pszString1 = L"cinema";
     wchar_t *pszString2 = L"iceman";

     BST oBST;
     if( oBST.IsAnagram( pszString1, pszString2 ) ) {
         std::wcout << L"Is Anagram";
     }
     else {
         std::wcout << L"NOT is Anagram";
     }

     std::wcout << std::endl;

     return 0;
}

I agree there is scope of improvement in the above code, however it basically demonstrates a way to find whether strings/words/phrases are anagrams. The above code can be modified to read the two strings from files, or from user input, or from any other input mechanism.

No comments: