This is the charset detector’s interface exposed to the outside world, in our case the browser. At the very beginning, the caller calls the detector’s "Init()" method to tell the detector how it would like to be notified of the detection result; the Observer pattern is used here. The caller then only needs to feed text data to the charset detector through "DoIt()". This can be done through a series of "DoIt()" calls, each carrying only part of the data, which is very useful when only part of the data is available at a time. In our case, since the data comes from the network, detection can start long before the network finishes transferring all the data. When the detector is confident enough about one encoding, it notifies its caller and stops detecting. If all the data has been fed to the detector but it is still not confident enough about any encoding, calling "Done()" tells the detector to make a best guess.
class nsICharsetDetector : public nsISupports {
public:
  NS_DEFINE_STATIC_IID_ACCESSOR(NS_ICHARSETDETECTOR_IID)

  // Set up the observer so the detector knows how to report the answer.
  NS_IMETHOD Init(nsICharsetDetectionObserver* observer) = 0;

  // Feed a block of bytes to the detector.
  // It will call the Notify function of the nsICharsetDetectionObserver
  // if it finds the answer.
  //   aBytesArray - array of bytes
  //   aLen        - length of aBytesArray
  //   oDontFeedMe - returns PR_TRUE if the detector does not need the
  //                 following blocks, PR_FALSE if it needs more bytes.
  //                 This is used to enhance performance.
  NS_IMETHOD DoIt(const char* aBytesArray, PRUint32 aLen, PRBool* oDontFeedMe) = 0;

  // Tell the detector that this is its last chance to make a decision.
  NS_IMETHOD Done() = 0;
};
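The calling sequence described above (Init, then repeated DoIt calls, then Done) can be sketched with a small stand-in class. This is only an illustration under stated assumptions: "ToyDetector", its callback type, and its one-rule "detection" are hypothetical and are not part of the Mozilla code; only the Init/DoIt/Done call order is taken from the interface above.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for nsICharsetDetector; not the Mozilla class.
class ToyDetector {
public:
    using Observer = std::function<void(const std::string&)>;

    // Mirrors Init(): register how the caller wants to be notified.
    void Init(Observer observer) {
        mObserver = std::move(observer);
        mDone = false;
    }

    // Mirrors DoIt(): consume one block of bytes.
    // Returns true once the detector needs no more data (like oDontFeedMe).
    bool DoIt(const char* bytes, std::size_t len) {
        assert(bytes != nullptr || len == 0);
        if (mDone) return true;
        for (std::size_t i = 0; i < len; ++i) {
            // Toy rule (an assumption): a single 8-bit byte decides the answer.
            if (static_cast<unsigned char>(bytes[i]) & 0x80) {
                Report("not-ascii");
                break;
            }
        }
        return mDone;
    }

    // Mirrors Done(): no more data will arrive, so make a best guess.
    void Done() {
        if (!mDone) Report("ascii");
    }

private:
    void Report(const std::string& charset) {
        mDone = true;
        if (mObserver) mObserver(charset);
    }

    Observer mObserver;
    bool mDone = false;
};
```

A caller would invoke Init once with a callback, pass each network chunk to DoIt until it returns true, and call Done when the stream ends without an answer.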
Inside Charset Detector
Inside the charset detector, most of the work is done by the function "HandleData()". In fact, "DoIt()" does very little beyond calling "HandleData()". The following shows the algorithm's logic in a C-like pseudo-language. Some details are dropped to make the main point clearer.
HandleData(batch_of_text)
{
  if (batch_of_text contains BOM)
    report UCS2;

  if ((inputState is PureAscii) || (inputState is EscAscii))
  {
    if (batch_of_text contains 8-bits-byte)
      inputState = HighByte;
    else if ((inputState is PureAscii) && (batch_of_text contains Esc_Sequence))
      inputState = EscAscii;
  }

  if (inputState is HighByte)
  {
    Remove Ascii characters that are not neighboring an 8-bits byte
    For each prober in multibyte_probers
      Prober.HandleData(batch_of_text);
    For each prober in singlebyte_probers
      Prober.HandleData(batch_of_text);
  }
  else if (inputState is EscAscii)
  {
    For each prober in (ISO2022_XX or HZ)
      Prober.HandleData(batch_of_text);
  }
}
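The input-state part of the pseudocode above can be written out as a small function. This is a minimal sketch, not the code in nsUniversalDetector.cpp: the names "InputState" and "NextInputState" are hypothetical, and as a simplifying assumption only the ESC byte (0x1B) is treated as the start of an escape sequence.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names for the three input states used in the pseudocode.
enum InputState { ePureAscii, eEscAscii, eHighByte };

// Returns the input state after scanning one more batch of text.
// Assumption: only ESC (0x1B) counts as an escape sequence here.
InputState NextInputState(InputState state, const char* buf, std::size_t len) {
    assert(buf != nullptr || len == 0);
    for (std::size_t i = 0; i < len; ++i) {
        unsigned char c = static_cast<unsigned char>(buf[i]);
        if (c & 0x80) {
            // Any 8-bit byte moves the detector to HighByte for good.
            return eHighByte;
        }
        if (state == ePureAscii && c == 0x1B) {
            // Pure ASCII so far, and an escape sequence has now been seen.
            state = eEscAscii;
        }
    }
    return state;
}
```

Note that the state only moves forward: once HighByte is reached, later all-ASCII batches do not change it, which matches the pseudocode, where the transitions are attempted only from PureAscii or EscAscii.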
nsUniversalDetector.h
nsUniversalDetector.cpp
Implements the high-level control logic.
Charset Prober
A charset prober verifies whether the input data belongs to a certain encoding or group of encodings. It maintains its state in the member "mState", which has 3 possible values. "eDetecting" means it has not found a sure answer yet; "eFoundIt" and "eNotMe" mean what their names say. The method "GetCharSetName" tells the caller the prober's sure answer or best guess.
Generally, we implemented one charset prober for each encoding. Several probers can be wrapped together by a wrapper prober, and a single prober can also "probe" several encodings. Each charset prober is designed and implemented independently, and works independently. This lets the caller eliminate certain probers when it has prior knowledge. For example, if the user knows that an HTML page is in some kind of Japanese encoding, the non-Japanese charset probers will not be fired. Users who have no interest in certain languages can likewise eliminate the probers for those languages. These measures lead to a smaller footprint and faster performance.
typedef enum {
  eDetecting = 0,
  eFoundIt = 1,
  eNotMe = 2
} nsProbingState;

class nsCharSetProber {
public:
  nsCharSetProber() {};
  virtual const char* GetCharSetName() {return "";};
  virtual nsProbingState HandleData(const char* aBuf, PRUint32 aLen) = 0;
  nsProbingState GetState(void) {return mState;};
  virtual void Reset(void) {mState = eDetecting;};
  virtual float GetConfidence(void) = 0;
  virtual void SetOpion() {};
protected:
  nsProbingState mState;
};
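To illustrate how several probers can be wrapped together by a wrapper prober, here is a simplified sketch. Everything in it is an assumption made for illustration: "Prober", "FixedProber", and "GroupProber" are hypothetical classes modeled loosely on the interface above, and the policy (drop probers that answer eNotMe, otherwise report the most confident one) is one plausible choice, not necessarily what the Mozilla wrapper does.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

enum ProbingState { eDetecting = 0, eFoundIt = 1, eNotMe = 2 };

// Simplified prober interface modeled on nsCharSetProber (hypothetical).
class Prober {
public:
    virtual ~Prober() = default;
    virtual const char* GetCharSetName() const = 0;
    virtual ProbingState HandleData(const char* buf, std::size_t len) = 0;
    virtual float GetConfidence() const = 0;
};

// Toy prober that always answers with a fixed state and confidence.
class FixedProber : public Prober {
public:
    FixedProber(const char* name, ProbingState state, float confidence)
        : mName(name), mState(state), mConfidence(confidence) {}
    const char* GetCharSetName() const override { return mName; }
    ProbingState HandleData(const char*, std::size_t) override { return mState; }
    float GetConfidence() const override { return mConfidence; }
private:
    const char* mName;
    ProbingState mState;
    float mConfidence;
};

// Wrapper prober: forwards data to its children and reports the best one.
class GroupProber : public Prober {
public:
    void Add(std::unique_ptr<Prober> prober) {
        assert(prober != nullptr);
        mEntries.push_back({std::move(prober), true});
    }
    ProbingState HandleData(const char* buf, std::size_t len) override {
        bool anyActive = false;
        for (Entry& e : mEntries) {
            if (!e.active) continue;
            ProbingState s = e.prober->HandleData(buf, len);
            if (s == eFoundIt) { mFound = e.prober.get(); return eFoundIt; }
            if (s == eNotMe) { e.active = false; continue; }
            anyActive = true;
        }
        return anyActive ? eDetecting : eNotMe;
    }
    float GetConfidence() const override {
        const Prober* best = Best();
        return best ? best->GetConfidence() : 0.0f;
    }
    const char* GetCharSetName() const override {
        const Prober* best = Best();
        return best ? best->GetCharSetName() : "";
    }
private:
    struct Entry { std::unique_ptr<Prober> prober; bool active; };
    // A child that said eFoundIt wins; otherwise pick the most confident
    // child that has not ruled itself out.
    const Prober* Best() const {
        if (mFound) return mFound;
        const Prober* best = nullptr;
        for (const Entry& e : mEntries) {
            if (e.active &&
                (!best || e.prober->GetConfidence() > best->GetConfidence())) {
                best = e.prober.get();
            }
        }
        return best;
    }
    std::vector<Entry> mEntries;
    const Prober* mFound = nullptr;
};
```

Because each child is independent, a caller with prior knowledge can simply skip the Add call for probers it does not want.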
How multi-byte encoding charset prober works
For the charset probers that verify the SJIS, EUC-JP, EUC-KR, EUC-CN (or GB2312), EUC-TW, and Big5 encodings, each prober embeds a state machine (mCodingSM) that identifies legal byte sequences based on its encoding scheme. If an illegal byte sequence is met, the state machine reaches the "eError" state. That signifies failure for this prober, and the prober reports a negative answer to its caller. Each time the state machine reaches the "eStart" state, a sequence of bytes has been identified as a character. This character is sent to the character distribution analyzer (mDistributionAnalyser) and the 2-char sequence analyzer (mContextAnalyser) for statistical sampling. A "GetConfidence" call tells the caller how likely it is that the input is in this encoding.
Inside the "HandleData" method, a shortcut judgment is performed each time a batch of text has been processed. If the prober has received enough data and reached a certain confidence level, it sets its state to "eFoundIt" and gives its caller an immediate sure answer.
For encodings like ISO-2022 and HZ, the embedded state machine can do an almost perfect job alone, so no other statistical sampling is done.
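The role of the embedded state machine can be illustrated with a deliberately small example. The sketch below is an assumption-laden toy, not the table-driven machines in nsMBCSSM.cpp: "ToyEucStateMachine" and "CountCharacters" are hypothetical names, and the only rule modeled is an EUC-KR-like scheme in which a character is either one ASCII byte or a pair of bytes that both lie in 0xA1..0xFE.

```cpp
#include <cassert>
#include <cstddef>

// eStart: a character boundary; eTrail: waiting for the second byte;
// eError: an illegal byte sequence was met.
enum SMState { eStart, eTrail, eError };

// Hypothetical, simplified state machine for an EUC-KR-like scheme.
class ToyEucStateMachine {
public:
    SMState NextState(unsigned char c) {
        if (mState == eError) return eError;      // errors are sticky
        const bool high = (c >= 0xA1 && c <= 0xFE);
        if (mState == eStart) {
            if (c < 0x80) mState = eStart;        // one-byte character done
            else if (high) mState = eTrail;       // lead byte of a pair
            else mState = eError;                 // illegal lead byte
        } else {                                  // mState == eTrail
            mState = high ? eStart : eError;      // pair done, or illegal
        }
        return mState;
    }
    void Reset() { mState = eStart; }
private:
    SMState mState = eStart;
};

// Counts complete characters in buf, or returns -1 on an illegal sequence.
// In a real prober, each completed character would be handed to the
// distribution and sequence analyzers instead of just being counted.
long CountCharacters(const char* buf, std::size_t len) {
    assert(buf != nullptr || len == 0);
    ToyEucStateMachine sm;
    long count = 0;
    for (std::size_t i = 0; i < len; ++i) {
        SMState s = sm.NextState(static_cast<unsigned char>(buf[i]));
        if (s == eError) return -1;
        if (s == eStart) ++count;
    }
    return count;
}
```

Returning to eStart is the signal that one full character has been recognized, which is the point at which the text above says the character is sent on for statistical sampling.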
Big5Freq.tab
EUCKRFreq.tab
EUCTWFreq.tab
GB2312Freq.tab
JISFreq.tab
These files define the frequency table (character to frequency order mapping) for each language. Unlike EUC-JP and SJIS, Big5 and EUC-TW are not based on the same charset standard, so 2 tables are defined for them.
CharDistribution.h
CharDistribution.cpp
Implementation of the character distribution analyzer.
nsPkgInt.h
nsCodingStateMachine.h
These are the basis of the state machine implementation.
nsEscSM.cpp
State machine for ISO-2022-XX and HZ.
nsMBCSSM.cpp
State machines for Big5, EUC-JP, EUC-KR, EUC-TW, GB2312, SJIS, and UTF8.
JpCntx.h
JpCntx.cpp
Japanese hiragana sequence analyzer.
nsBig5Prober.h
nsBig5Prober.cpp
nsEUCKRProber.h
nsEUCKRProber.cpp
nsEUCJPProber.h
nsEUCJPProber.cpp
nsEUCTWProber.h
nsEUCTWProber.cpp
nsSJISProber.h
nsSJISProber.cpp
nsGB2312Prober.h
nsGB2312Prober.cpp
nsUTF8Prober.h
nsUTF8Prober.cpp
Charset prober class definitions and implementations for each encoding. Each prober has an embedded state machine and a character distribution analyzer, except UTF8, whose state machine is good enough on its own.
nsMBCSProber.h
nsMBCSProber.cpp
This is a wrapper for all the MBCS probers. In the very beginning, I expected to put some high-level logic here that is based on knowledge of multiple encodings. That might still be needed in the future.
How single-byte encoding charset prober works
For each encoding, a table maps each character to an encoding-independent identification number. These identification numbers in fact come from the characters’ frequency order, with some adjustment. For each language, a 2-D matrix is defined as the language model. If cell <x, y> is 0, the sequence <character(x), character(y)> is rarely used in this language, where character(x) is the character whose identification number is x. The 2-D matrix only defines sequences over a subset of all the characters; characters whose identification number falls outside this range are ignored. Some sequences, such as ASCII-to-ASCII sequences, have no relation to the language we are trying to verify, so they should not be counted. In the current implementation, a sequence is counted only if both characters are 8-bit ones. In some situations, a sequence containing just one 8-bit character should arguably be counted as well.
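The counting rule described above can be sketched as follows. All concrete values here are made up for illustration: the mapping in "OrderOf", the 4x4 matrix "kModel", and the ratio returned by "Confidence" are hypothetical and far smaller and simpler than the tables in the real language models. The sketch only shows the mechanism: map each 8-bit byte to an identification number, ignore characters outside the model's range, and look up each adjacent pair in the 2-D matrix.

```cpp
#include <cassert>
#include <cstddef>

const int kSampleSize = 4;            // hypothetical: only ids 0..3 are modeled
const unsigned char kIgnore = 255;    // marks a character outside the model

// Hypothetical mapping table: bytes 0xE0..0xE3 get ids 0..3.
unsigned char OrderOf(unsigned char c) {
    if (c >= 0xE0 && c < 0xE0 + kSampleSize) {
        return static_cast<unsigned char>(c - 0xE0);
    }
    return kIgnore;
}

// Toy language model: cell [x][y] == 0 means the sequence
// <character(x), character(y)> is rarely used; 1 means it is common.
const unsigned char kModel[kSampleSize][kSampleSize] = {
    {1, 1, 0, 0},
    {1, 0, 1, 0},
    {0, 1, 1, 0},
    {0, 0, 0, 0},
};

struct SequenceStats {
    unsigned total = 0;      // counted sequences
    unsigned frequent = 0;   // counted sequences with a non-zero cell
    // Assumed scoring: the share of counted sequences that are common.
    float Confidence() const {
        return total == 0 ? 0.0f : static_cast<float>(frequent) / total;
    }
};

// Counts a sequence only when both characters are 8-bit and in the model.
SequenceStats CountSequences(const char* buf, std::size_t len) {
    assert(buf != nullptr || len == 0);
    SequenceStats stats;
    unsigned char prev = kIgnore;
    for (std::size_t i = 0; i < len; ++i) {
        unsigned char c = static_cast<unsigned char>(buf[i]);
        unsigned char order = (c & 0x80) ? OrderOf(c) : kIgnore;
        if (order != kIgnore && prev != kIgnore) {
            ++stats.total;
            if (kModel[prev][order] != 0) ++stats.frequent;
        }
        prev = order;
    }
    return stats;
}
```

An ASCII byte between two 8-bit characters resets the pairing, so ASCII-to-ASCII and mixed sequences never reach the matrix lookup.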
LangCyrillicModel.cpp : Defines a mapping table for each encoding and a 2-D matrix for all Cyrillic languages. A "SequenceModel" structure is also defined for each encoding. This structure is used to initialize a single-byte charset prober class. All Cyrillic encodings share the same prober class implementation.
nsSBCharSetProber.h
nsSBCharSetProber.cpp : These 2 files define and implement the single-byte charset prober.