Friday, September 14, 2012

Algorithm to find whether a string contains unique characters

There are multiple ways to do this. The brute force method is to compare each character with the next character and the moment there is a duplicate, return with a false. The brute force method is O(n2) for time complexity but does not require any overhead for space. Here is the brute force algorithm:

bool HasUnique( wchar_t* pszString ) {

    //In this we take the first character
    //and compare it with the next character
    //and so on till we reach the end
    //then we start with the next character
    //and compare it again with each character
    if( !pszString )
        return true;

    size_t nLen = wcslen( pszString );
    for( size_t i = 0 ; i < nLen ; ++i ) {
        for( size_t j = i + 1; j < nLen ; ++j ) {
            if( pszString[ i ] == pszString[ j ] )
                return false;
        }
    }

    return true;
}


However the above is not optimal, given that in today's computers space is not expensive, however
computing time is still important. So a better approach could be to use some sort of mechanism, where each unique element encountered is remembered, and the moment we encounter a scenario, where we try to re-remember an already remembered element, we would know that the string (or dataset) is not unique. There are various approaches for this, the simplest being using a boolean array of 256 elements. 256 because this is the range of ASCII. However this approach will fail for unicode elements. Anyways here is this algorithm:

bool HasUnique( char* pszString ) {
    bool bStatus[ 256 ] = { false };
    
    if( !pszString )
        return true;

    size_t size   = strlen( pszString );
    int nASCIIVal = 0;
    for( size_t i = 0 ; i < size ; ++i ) {
        nASCIIVal = pszString[ i ];
        if( bStatus[ nASCIIVal ] )
            return false;
        
        bStatus[ nASCIIVal ] = true;
    }

    return true;
}


The above algo can be improved and made generic such that it works for all kind of elements by simply putting this in a map. Basically put the element encountered in the map, if it is not there, if it is already there, then you know that the string does not contain unique elements.
This can be extended to solve many similar problems like, find all duplicates, count the occurrence of each character/element in an array

I am going to write my own code to do the same. The code to 

//class to wrap the Node for a tree.
class Node {
    public:
        Node( int nData ) : 

            m_pLeft( NULL ), m_pRight( NULL ), m_nData( nData ) {};
        ~Node() {};
        Node*    GetLeft()    const    { return m_pLeft;    };
        Node*    GetRight()   const    { return m_pRight;   };
        int      GetData()    const    { return m_nData;    };

        void    SetLeft( Node* pNode )  { m_pLeft  = pNode; };
        void    SetRight( Node* pNode ) { m_pRight = pNode; };
        void    SetData( int nData )    { m_nData  = nData; };
    protected:
        Node() : m_pLeft( NULL ), m_pRight( NULL ), m_nData( 0 ) {};
        Node* m_pLeft;
        Node* m_pRight;
        int   m_nData;   
};//class Node


The idea is to add elements in a binary sorted tree and if while inserting elements we run across duplicates, we know the entries are not unique.

class Tree {
public:
    Tree(): m_pRoot( NULL ) {};
    ~Tree();
    Node* Add( int nData, bool& bUnique );
protected:
    Node* m_pRoot;
};

Node* Tree::Add( int nData, bool& bUnique ) {
    Node* pNode = new Node( nData );
    bUnique = true;
    if( !m_pRoot ) {
        m_pRoot = pNode;

        return m_pRoot;
    }

    Node* pRoot = m_pRoot;
    while( pRoot ) {
        if( pRoot->GetData() == nData ) {
            //found duplicate
            bUnique = false;
        }
        else if( pRoot->GetData() < nData ) {
            //now check if there is a left child
            if( pRoot->GetLeft() ) {
                pRoot = pRoot->GetLeft();
            }
            else {
                pRoot->SetLeft( pNode );
            }
        }
        else {
            //now check if there is a right child
            if( pRoot->GetRight() ) {
                pRoot = pRoot->GetRight();
            }
            else {
                pRoot->SetRight( pNode );
            }
        }

    }
}

So you see this function can be used to create a binary sorted tree and while inserting data itself we will know whether the element being inserted is unique or not. The complexity of this is: O( log n).

No comments: