Charset Detector Interface

This is the charset detector’s interface that is exposed to the outside world, in our case, the browser. At the very beginning, the caller calls the detector’s "Init()" method and tells the detector how it would like to be notified of the detection result; the Observer pattern is used here. Then the caller only needs to feed the charset detector text data through "DoIt()". This can be done through a series of "DoIt()" calls, each containing only part of the data, which is very useful when the data is only partially available at any one time. In our case, since the data comes from the network, we can start detecting long before the network finishes transferring all the data. When the detector is confident enough about one encoding, it notifies its caller and stops detecting. If all data has been fed to the detector but the detector is still not confident enough about any encoding, the "Done()" method tells the detector to make a best guess.

class nsICharsetDetector : public nsISupports {
  public:
  NS_DEFINE_STATIC_IID_ACCESSOR(NS_ICHARSETDETECTOR_IID)

  //Set up the observer so the detector knows how to report the answer
  NS_IMETHOD Init(nsICharsetDetectionObserver* observer) = 0;

  //Feed a block of bytes to the detector.
  //It will call the Notify() method of the nsICharsetDetectionObserver
  //once it finds out the answer.
  // aBytesArray - array of bytes
  // aLen - length of aBytesArray
  // oDontFeedMe - returns PR_TRUE if the detector does not need any
  // following blocks, PR_FALSE if it needs more bytes.
  // This is used to enhance performance.
  NS_IMETHOD DoIt(const char* aBytesArray, PRUint32 aLen, PRBool* oDontFeedMe) = 0;

  //It also tells the detector this is its last chance to make a decision
  NS_IMETHOD Done() = 0;
};
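
For illustration, a typical calling sequence might look like the sketch below. The ReadNextBlock() helper and the overall framing are assumptions made for this example; only Init(), DoIt(), and Done() come from the interface above, and aObserver is whatever implementation of nsICharsetDetectionObserver the caller provides.

// Sketch only, not actual Mozilla code.
// ReadNextBlock() is a hypothetical helper that fills aBuf with the next
// block of network data and returns PR_FALSE when the stream is exhausted.
PRBool ReadNextBlock(char* aBuf, PRUint32* aLen);

void FeedDetector(nsICharsetDetector* aDetector,
                  nsICharsetDetectionObserver* aObserver)
{
  aDetector->Init(aObserver);            // register the Notify() callback

  char buf[4096];
  PRUint32 len;
  PRBool dontFeedMe = PR_FALSE;
  while (!dontFeedMe && ReadNextBlock(buf, &len)) {
    // Feed each block as it arrives; the detector sets dontFeedMe to
    // PR_TRUE once it is confident enough and no longer needs data.
    aDetector->DoIt(buf, len, &dontFeedMe);
  }
  aDetector->Done();                     // force a best guess if undecided
}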
 
 

Inside Charset Detector

Inside the Charset Detector, the major work is done by the "HandleData()" function; in fact, "DoIt()" has very little to do other than call "HandleData()". The following is the algorithm logic in C-like pseudocode. Some details are dropped in order to make the main point clearer.

HandleData(batch_of_text)
{
  if (batch_of_text contains BOM)
    report UCS2;

  if ((inputState is PureAscii) || (inputState is EscAscii))
  {
    if (batch_of_text contains 8-bit byte)
      inputState = HighByte;
    else if ((inputState is PureAscii) && (batch_of_text contains Esc_Sequence))
      inputState = EscAscii;
  }

  if (inputState is HighByte)
  {
    Remove ASCII characters that are not adjacent to an 8-bit byte;
    For each prober in multibyte_probers
      prober.HandleData(batch_of_text);
    For each prober in singlebyte_probers
      prober.HandleData(batch_of_text);
  }
  else if (inputState is EscAscii)
  {
    For each prober in escaped_charset_probers (ISO2022_XX or HZ)
      prober.HandleData(batch_of_text);
  }
}

nsUniversalDetector.h
nsUniversalDetector.cpp

These files implement the high-level control logic.
 
 

Charset Prober

A charset prober verifies whether the input data belongs to a certain encoding or group of encodings. It maintains its state in the member "mState", which has 3 possible values. The state "eDetecting" means it has not found any sure answer yet; "eFoundIt" and "eNotMe" carry the same meanings as their names. The method "GetCharSetName" tells the caller the prober's sure answer or best guess.

Generally, we implemented one charset prober for each encoding. Several probers can be wrapped together by a wrapper prober, and it is also possible for one prober to "probe" several encodings. Each charset prober is designed, implemented, and works independently. This enables the caller to eliminate certain probers when it has pre-knowledge. For example, if the user knows that an HTML page is in some kind of Japanese encoding, non-Japanese charset probers need not be fired. If the user has no interest in certain languages, those charset probers can also be eliminated. These measures lead to a smaller footprint and faster performance.

typedef enum {
  eDetecting = 0,
  eFoundIt = 1,
  eNotMe = 2
} nsProbingState;

class nsCharSetProber {
  public:
    nsCharSetProber(){};
    virtual const char* GetCharSetName() {return "";};
    virtual nsProbingState HandleData(const char* aBuf, PRUint32 aLen) = 0;
    nsProbingState GetState(void) {return mState;};
    virtual void Reset(void) {mState = eDetecting;};
    virtual float GetConfidence(void) = 0;
    virtual void SetOpion() {};
  protected:
    nsProbingState mState;
};
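
As an illustration of this protocol (not an actual Mozilla class), a trivial prober could look like the following; the class name and its rejection rule are made up for the example.

// Hypothetical prober, for illustration only.
class nsToyProber : public nsCharSetProber {
  public:
    nsToyProber() {mState = eDetecting;};
    const char* GetCharSetName() {return "ISO-8859-1";};
    nsProbingState HandleData(const char* aBuf, PRUint32 aLen)
    {
      if (mState != eDetecting)
        return mState;                 // an earlier batch already decided
      for (PRUint32 i = 0; i < aLen; i++) {
        // Toy rule: an ESC byte means the data cannot be plain ISO-8859-1.
        if (aBuf[i] == 0x1B) {
          mState = eNotMe;
          break;
        }
      }
      return mState;
    }
    float GetConfidence(void) {return (mState == eNotMe) ? 0.0f : 0.5f;};
};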
 
 

How multi-byte encoding charset prober works

For the charset probers verifying the SJIS, EUC-JP, EUC-KR, EUC-CN (or GB2312), EUC-TW, and Big5 encodings, each prober embeds a state machine (mCodingSM), which identifies legal byte sequences based on its encoding scheme. If an illegal byte sequence is met, this state machine reaches the "eError" state. That signifies a failure for this prober, and the prober reports a negative answer to its caller. Once the state machine reaches the "eStart" state, a sequence of bytes has been identified as a character. This character is sent to the character distribution analyzer (mDistributionAnalyser) and the 2-char sequence analyzer (mContextAnalyser) for statistical sampling. A "GetConfidence" call lets the caller know how likely it is that the input is in this encoding.

Inside "HandleData" method each time after a batch of text has been processed, shortcut judgement is performed. If the prober receives enough data and reaches certain confidence level, it will set its state to be "eFoundIt" and notify its caller an immediate sure answer.

For encodings like ISO-2022-XX and HZ, since the embedded state machine can do an almost perfect job alone, no other statistical sampling is done.
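
Putting the pieces together, the core of a multi-byte prober's "HandleData" could be sketched as below. The helper names (NextState, GetCurrentCharLen, GotEnoughData) and the confidence threshold are assumptions made for this sketch, not necessarily the exact Mozilla signatures; the class declaration and the handling of characters split across batches are omitted.

// Sketch of a multi-byte prober's main loop; details are illustrative.
// (class declaration omitted; members mirror those described above)
nsProbingState SketchMBCSProber::HandleData(const char* aBuf, PRUint32 aLen)
{
  for (PRUint32 i = 0; i < aLen; i++) {
    nsSMState state = mCodingSM->NextState(aBuf[i]);
    if (state == eError) {
      mState = eNotMe;                 // illegal byte sequence: give up
      break;
    }
    if (state == eStart) {
      // A complete character has been identified; feed it to the
      // statistical analyzers for sampling.
      PRUint32 charLen = mCodingSM->GetCurrentCharLen();
      mDistributionAnalyser.HandleOneChar(aBuf + i + 1 - charLen, charLen);
      mContextAnalyser.HandleOneChar(aBuf + i + 1 - charLen, charLen);
    }
  }
  // Shortcut judgement: with enough data and high confidence, answer now.
  if (mState == eDetecting &&
      mDistributionAnalyser.GotEnoughData() && GetConfidence() > 0.95f)
    mState = eFoundIt;
  return mState;
}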

Big5Freq.tab

EUCKRFreq.tab

EUCTWFreq.tab

GB2312Freq.tab

JISFreq.tab

These files define the frequency table (character to frequency order mapping) for each language. Since Big5 and EUC-TW are not based on a common character set standard the way EUC-JP and SJIS are, two separate tables are defined for them.

CharDistribution.h

CharDistribution.cpp

Implementation of the character distribution analyzer.
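
A confidence computation based on such a frequency table could look roughly like the sketch below. The member names and the clamping value are assumptions made for illustration, not the exact implementation.

// Sketch: mFreqChars counts characters whose frequency order falls in the
// "frequent" range of the table, mTotalChars counts all sampled characters.
// (class declaration omitted)
float SketchCharDistributionAnalysis::GetConfidence(void)
{
  if (mTotalChars <= 0)
    return 0.0f;                       // nothing sampled yet
  // Scale by a language-typical ratio so that ordinary text in the right
  // encoding approaches 1.0; clamp the result just below certainty.
  float r = mFreqChars / (mTotalChars * mTypicalDistributionRatio);
  return (r >= 1.0f) ? 0.99f : r;
}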

nsPkgInt.h

nsCodingStateMachine.h

These files are the basis of the state machine implementation.

nsEscSM.cpp

State machines for ISO-2022-XX and HZ.

nsMBCSSM.cpp

State machines for Big5, EUC-JP, EUC-KR, EUC-TW, GB2312, SJIS, and UTF8.

JpCntx.h

JpCntx.cpp

Japanese hiragana sequence analyzer.

nsBig5Prober.h

nsBig5Prober.cpp

nsEUCKRProber.h

nsEUCKRProber.cpp

nsEUCJPProber.h

nsEUCJPProber.cpp

nsEUCTWProber.h

nsEUCTWProber.cpp

nsSJISProber.h

nsSJISProber.cpp

nsGB2312Prober.h

nsGB2312Prober.cpp

nsUTF8Prober.h

nsUTF8Prober.cpp

Charset prober class definitions and implementations for each encoding. Each prober has an embedded state machine and a character distribution analyzer, except the UTF8 prober, for which the state machine alone is good enough.

nsMBCSProber.h

nsMBCSProber.cpp

This is a wrapper for all the MBCS probers. In the very beginning I expected some high-level logic based on knowledge of multiple encodings to appear here. That might still be needed in the future.
 
 

How single-byte encoding charset prober works

For each encoding, a table is used to map a character to an encoding-independent identification number. These identification numbers in fact come from the characters’ frequency order, with some adjustments. For each language, a 2-D matrix is defined as the language model. If cell <x, y> is 0, it means the sequence <character(x), character(y)> is rarely used in this language, where character(x) represents the character whose identification number is x. The 2-D matrix only defines sequences for a subset of all the characters; characters whose identification numbers fall outside this range are ignored. Some sequences, such as ASCII-to-ASCII sequences, have no relation to the language we are trying to verify, so those sequences should not be counted. In the current implementation, a sequence is counted only if both characters are 8-bit ones, although in some situations sequences containing a single 8-bit character would also be expected to be counted.
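
A sketch of this sampling step is shown below. The table names (charToOrderMap, precedenceMatrix), SAMPLE_SIZE, and the counter layout are assumptions made for illustration rather than the exact data structures in the source; the class declaration is omitted.

// Sketch of 2-char sequence sampling in a single-byte prober.
nsProbingState SketchSBCharSetProber::HandleData(const char* aBuf, PRUint32 aLen)
{
  for (PRUint32 i = 0; i < aLen; i++) {
    // Map the raw byte to its encoding-independent identification number.
    unsigned char order = mModel->charToOrderMap[(unsigned char)aBuf[i]];
    if (order < SAMPLE_SIZE) {         // character is inside the modeled subset
      mTotalChars++;
      if (mLastOrder < SAMPLE_SIZE) {
        mTotalSeqs++;
        // Count this 2-char sequence by its frequency class;
        // class 0 marks a rarely used (suspicious) sequence.
        mSeqCounters[mModel->precedenceMatrix[mLastOrder * SAMPLE_SIZE + order]]++;
      }
    }
    mLastOrder = order;
  }
  // Confidence is later derived from the share of frequent sequences.
  return mState;
}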

LangCyrillicModel.cpp

This file defines a mapping table for each Cyrillic encoding and a 2-D matrix shared by all Cyrillic languages. A "SequenceModel" structure is also defined for each encoding; it is used to initialize a single-byte charset prober class, and a sketch of what such a structure could look like follows the file list below. All Cyrillic encodings share the same prober class implementation.

nsSBCharSetProber.h

nsSBCharSetProber.cpp

These two files define and implement the single-byte charset prober.
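
The "SequenceModel" structure mentioned for LangCyrillicModel.cpp might, in spirit, contain the pieces below. The field names and the example model instance are illustrative assumptions; the actual definition lives in the source files.

// Illustrative shape of a per-encoding sequence model.
typedef struct
{
  const unsigned char* charToOrderMap;      // 256 entries: byte -> identification number
  const char*          precedenceMatrix;    // 2-D matrix of sequence frequency classes
  float                typicalPositiveRatio; // typical share of frequent sequences
  const char*          charsetName;         // e.g. "KOI8-R", "windows-1251"
} SequenceModel;

// One prober instance per encoding, all sharing the same implementation:
//   nsSBCharSetProber koi8rProber(&Koi8rModel);   // hypothetical model instance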