From 0fa88bba458dffb0248b53dabf3c1a9d65487903 Mon Sep 17 00:00:00 2001
From: "shanjian%netscape.com"
Date: Tue, 17 Jul 2001 23:29:53 +0000
Subject: [PATCH] Universal charset detector in birdview.

---
 .../universalchardet/doc/ChardetInterface.htm | 231 ++++++++++++++++++
 1 file changed, 231 insertions(+)
 create mode 100644 extensions/universalchardet/doc/ChardetInterface.htm

diff --git a/extensions/universalchardet/doc/ChardetInterface.htm b/extensions/universalchardet/doc/ChardetInterface.htm
new file mode 100644
index 000000000000..8ea2fd35589a
--- /dev/null
+++ b/extensions/universalchardet/doc/ChardetInterface.htm
@@ -0,0 +1,231 @@

Charset Detector Interface

This is the charset detector's interface as exposed to the outside world, in our case the browser. First, the caller calls the detector's "Init()" method to tell the detector how it would like to be notified of the detection result; the observer pattern is used here. The caller then feeds text data to the charset detector through "DoIt()". This can be done through a series of "DoIt()" calls, each containing only part of the data, which is very useful when the data is only partially available at any one time. In our case, since the data comes from the network, detection can start long before the network finishes transferring all of it. When the detector is confident enough about one encoding, it notifies its caller and stops detecting. If all data has been fed to the detector and it is still not confident enough about any encoding, calling "Done()" tells the detector to make a best guess.

class nsICharsetDetector : public nsISupports {
  public:
  NS_DEFINE_STATIC_IID_ACCESSOR(NS_ICHARSETDETECTOR_IID)

  // Set up the observer so the detector knows how to report the answer.
  NS_IMETHOD Init(nsICharsetDetectionObserver* observer) = 0;

  // Feed a block of bytes to the detector.
  // It will call the Notify function of the nsICharsetDetectionObserver
  // once it finds the answer.
  // aBytesArray - array of bytes
  // aLen - length of aBytesArray
  // oDontFeedMe - set to PR_TRUE if the detector does not need the
  //               following blocks, PR_FALSE if it needs more bytes.
  //               This is used to enhance performance.
  NS_IMETHOD DoIt(const char* aBytesArray, PRUint32 aLen, PRBool* oDontFeedMe) = 0;

  // Tell the detector that this is its last chance to make a decision.
  NS_IMETHOD Done() = 0;
};
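
The calling protocol described above (Init, repeated DoIt, then Done) can be sketched as follows. This is only an illustration under stated assumptions: `Observer`, `ToyDetector`, and `DetectFromChunks` are hypothetical stand-ins for the XPCOM types, and the toy detection rule (any 8-bit byte means "x-toy-8bit", otherwise "US-ASCII") is invented to keep the example short. It is not the detector's real logic.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for nsICharsetDetectionObserver.
struct Observer {
  std::string result;  // last charset reported by the detector
  void Notify(const char* charset) { result = charset; }
};

// Hypothetical stand-in for a detector; the detection rule is a toy.
class ToyDetector {
 public:
  void Init(Observer* observer) { mObserver = observer; mDone = false; }

  // Sets *dontFeedMe to true once no further blocks are needed.
  void DoIt(const char* bytes, std::size_t len, bool* dontFeedMe) {
    for (std::size_t i = 0; i < len && !mDone; ++i) {
      if (static_cast<unsigned char>(bytes[i]) & 0x80) Report("x-toy-8bit");
    }
    *dontFeedMe = mDone;
  }

  // No more data will come: make a best guess if nothing was reported yet.
  void Done() { if (!mDone) Report("US-ASCII"); }

 private:
  void Report(const char* charset) { mDone = true; mObserver->Notify(charset); }
  Observer* mObserver = nullptr;
  bool mDone = false;
};

// Feed chunks as they "arrive" and stop early when the detector says so.
std::string DetectFromChunks(const std::vector<std::string>& chunks) {
  Observer observer;
  ToyDetector detector;
  detector.Init(&observer);
  for (const std::string& chunk : chunks) {
    bool dontFeedMe = false;
    detector.DoIt(chunk.data(), chunk.size(), &dontFeedMe);
    if (dontFeedMe) break;  // detector is confident; skip the remaining data
  }
  detector.Done();
  return observer.result;
}
```

The early `break` is the point of the `oDontFeedMe` out-parameter: the caller can stop handing over network data as soon as the detector has reported an answer.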

Inside Charset Detector

Inside the charset detector, most of the work is done by the function "HandleData()". In fact, "DoIt()" does little more than call "HandleData()". The following is the algorithm's logic in C-like pseudocode. Some details are dropped to make the main point clearer.

HandleData(batch_of_text)
{
  if (batch_of_text contains BOM)
    report UCS2;

  if ((inputState is PureAscii) || (inputState is EscAscii))
  {
    if (batch_of_text contains 8-bit byte)
      inputState = HighByte;
    else if ((inputState is PureAscii) && (batch_of_text contains Esc_Sequence))
      inputState = EscAscii;
  }

  if (inputState is HighByte)
  {
    Remove ASCII characters that do not neighbor an 8-bit byte;
    for each prober in multibyte_probers
      prober.HandleData(batch_of_text);
    for each prober in singlebyte_probers
      prober.HandleData(batch_of_text);
  }
  else if (inputState is EscAscii)
  {
    for each prober in (ISO2022_XX or HZ)
      prober.HandleData(batch_of_text);
  }
}
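
The input-state part of the pseudocode above could look roughly like this in C++. This is a minimal sketch: the function names are hypothetical, and treating ESC (0x1B) or the two-character sequence "~{" as the start of an escape sequence is an assumption, since the pseudocode does not say what "Esc_Sequence" covers. A "~{" split across two batches is not handled here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum InputState { ePureAscii, eEscAscii, eHighByte };

// True if the batch starts with a UCS-2 byte order mark (FE FF or FF FE).
bool StartsWithUcs2Bom(const std::string& batch) {
  if (batch.size() < 2) return false;
  unsigned char b0 = static_cast<unsigned char>(batch[0]);
  unsigned char b1 = static_cast<unsigned char>(batch[1]);
  return (b0 == 0xFE && b1 == 0xFF) || (b0 == 0xFF && b1 == 0xFE);
}

// Advance the input state for one batch. Any 8-bit byte wins over an
// escape sequence, and once the state is HighByte it never changes.
InputState UpdateInputState(InputState state, const std::string& batch) {
  if (state == eHighByte) return state;
  bool sawEscape = false;
  for (std::size_t i = 0; i < batch.size(); ++i) {
    unsigned char c = static_cast<unsigned char>(batch[i]);
    if (c & 0x80) return eHighByte;
    // Assumption: ESC or "~{" marks the start of an escape sequence.
    if (c == 0x1B || (c == '{' && i > 0 && batch[i - 1] == '~')) {
      sawEscape = true;
    }
  }
  return (state == ePureAscii && sawEscape) ? eEscAscii : state;
}
```

Once the state is known, the caller only has to run the matching group of probers, which is why pure ASCII input never reaches any prober at all.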

nsUniversalDetector.h
nsUniversalDetector.cpp

These files implement the high-level control logic.

Charset Prober

A charset prober verifies whether the input data belongs to a certain encoding or group of encodings. It maintains its state in the member "mState", which has 3 possible values. The state "eDetecting" means it has not found a sure answer yet; "eFoundIt" and "eNotMe" mean what their names say. The method "GetCharSetName" tells the caller the prober's sure answer or best guess.

Generally, we implement one charset prober for each encoding. Several probers can be wrapped together by a wrapper prober. It is also possible for one prober to "probe" several encodings. Each charset prober is designed, implemented, and works independently. This lets the prober's caller eliminate certain probers when it has prior knowledge. For example, if the user knows that an HTML page is in some kind of Japanese encoding, the non-Japanese charset probers will not be run. Users who have no interest in certain languages can also eliminate those charset probers. These measures lead to a smaller footprint and faster performance.

typedef enum {
  eDetecting = 0,
  eFoundIt = 1,
  eNotMe = 2
} nsProbingState;

class nsCharSetProber {
  public:
    nsCharSetProber(){};
    virtual const char* GetCharSetName() {return "";};
    virtual nsProbingState HandleData(const char* aBuf, PRUint32 aLen) = 0;
    nsProbingState GetState(void) {return mState;};
    virtual void Reset(void) {mState = eDetecting;};
    virtual float GetConfidence(void) = 0;
    virtual void SetOpion() {};
  protected:
    nsProbingState mState;
};
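
To illustrate how a wrapper prober might combine independent probers and let the caller eliminate some of them, here is a minimal sketch. `Prober`, `FixedProber`, and `GroupProber` are hypothetical, simplified types written for this example. They are modeled loosely on the class above and are not the actual wrapper implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

enum ProbingState { eDetecting = 0, eFoundIt = 1, eNotMe = 2 };

// Simplified prober interface, loosely modeled on nsCharSetProber.
class Prober {
 public:
  virtual ~Prober() {}
  virtual const char* GetCharSetName() = 0;
  virtual ProbingState HandleData(const char* buf, std::size_t len) = 0;
  virtual float GetConfidence() = 0;
};

// Stand-in child prober that always reports a fixed state and confidence.
class FixedProber : public Prober {
 public:
  FixedProber(const char* name, ProbingState state, float confidence)
      : mName(name), mState(state), mConfidence(confidence) {}
  const char* GetCharSetName() override { return mName; }
  ProbingState HandleData(const char*, std::size_t) override { return mState; }
  float GetConfidence() override { return mConfidence; }
 private:
  const char* mName;
  ProbingState mState;
  float mConfidence;
};

// Hypothetical wrapper prober: forwards data to its active children only.
class GroupProber : public Prober {
 public:
  void Add(Prober* prober) { mChildren.push_back(Child{prober, true}); }

  // Eliminate a child up front, e.g. when the language is already known.
  void Disable(const char* name) {
    for (Child& c : mChildren)
      if (std::strcmp(c.prober->GetCharSetName(), name) == 0) c.active = false;
  }

  ProbingState HandleData(const char* buf, std::size_t len) override {
    if (mState != eDetecting) return mState;
    std::size_t stillDetecting = 0;
    for (Child& c : mChildren) {
      if (!c.active) continue;
      ProbingState st = c.prober->HandleData(buf, len);
      if (st == eFoundIt) { mBest = c.prober; mState = eFoundIt; return mState; }
      if (st == eNotMe) c.active = false; else ++stillDetecting;
    }
    if (stillDetecting == 0) mState = eNotMe;  // every child gave up
    return mState;
  }

  // Best confidence among active children; remembers which child gave it.
  float GetConfidence() override {
    if (mState == eFoundIt) return mBest->GetConfidence();
    float best = 0.0f;
    if (mState == eNotMe) return best;
    for (Child& c : mChildren) {
      if (!c.active) continue;
      float conf = c.prober->GetConfidence();
      if (conf > best) { best = conf; mBest = c.prober; }
    }
    return best;
  }

  const char* GetCharSetName() override {
    if (mBest == nullptr) GetConfidence();
    return mBest ? mBest->GetCharSetName() : "";
  }

 private:
  struct Child { Prober* prober; bool active; };
  std::vector<Child> mChildren;
  Prober* mBest = nullptr;
  ProbingState mState = eDetecting;
};
```

Because the wrapper exposes the same interface as its children, the caller does not need to know whether it is talking to a single prober or to a whole group.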

How the multi-byte encoding charset probers work

For the charset probers that verify the SJIS, EUC-JP, EUC-KR, EUC-CN (or GB2312), EUC-TW, and Big5 encodings, each prober embeds a state machine (mCodingSM) that identifies legal byte sequences based on its encoding scheme. If an illegal byte sequence is met, the state machine reaches the "eError" state. That signifies failure for this prober, and the prober reports a negative answer to its caller. Each time the state machine returns to the "eStart" state, a sequence of bytes has been identified as one character. This character is sent to the character distribution analyzer (mDistributionAnalyser) and the 2-char sequence analyzer (mContextAnalyser) for statistical sampling. A "GetConfidence" call tells the caller how likely it is that the input is in this encoding.
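
The role of the embedded state machine can be sketched with a toy validator. The byte ranges below describe a deliberately simplified EUC-KR-like scheme chosen for illustration (ASCII is one byte; a lead byte 0xA1-0xFE must be followed by a trail byte 0xA1-0xFE). `ToyCodingSM`, the `eTrail` state, and `CountDoubleByteChars` are hypothetical names and are not taken from the real state machine tables.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum SMState { eStart, eError, eTrail };  // eTrail: waiting for the 2nd byte

// Toy coding state machine for the simplified scheme described above.
class ToyCodingSM {
 public:
  SMState NextState(unsigned char c) {
    if (mState == eError) return mState;  // errors are final
    if (mState == eStart) {
      if (c < 0x80) mState = eStart;                     // one-byte character
      else if (c >= 0xA1 && c <= 0xFE) mState = eTrail;  // lead byte
      else mState = eError;                              // illegal lead byte
    } else {
      mState = (c >= 0xA1 && c <= 0xFE) ? eStart : eError;
    }
    return mState;
  }
 private:
  SMState mState = eStart;
};

// Returns the number of complete two-byte characters, or -1 if an illegal
// byte sequence is met. A real prober would pass each completed character
// to its analyzers instead of just counting it.
int CountDoubleByteChars(const std::string& text) {
  ToyCodingSM sm;
  SMState prev = eStart;
  int count = 0;
  for (char ch : text) {
    SMState st = sm.NextState(static_cast<unsigned char>(ch));
    if (st == eError) return -1;
    if (st == eStart && prev == eTrail) ++count;  // back at eStart: one char
    prev = st;
  }
  return count;
}
```

Returning to `eStart` is the signal that a full character has been recognized, which is the moment the text above describes as handing the character over for statistical sampling.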

Inside the "HandleData" method, a shortcut judgement is made each time a batch of text has been processed. If the prober has received enough data and reached a certain confidence level, it sets its state to "eFoundIt" and gives its caller an immediate sure answer.

For encodings like ISO_2022 and HZ, the embedded state machine can do an almost perfect job alone, so no statistical sampling is done.

Big5Freq.tab

EUCKRFreq.tab

EUCTWFreq.tab

GB2312Freq.tab

JISFreq.tab

These files define the frequency table (character to frequency-order mapping) for each language. Since Big5 and EUC-TW are not based on the same charset standard the way EUC-JP and SJIS are, two separate tables are defined for them.
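
One way a frequency-order table could feed a confidence value is sketched below. This is an assumption-laden illustration: `ToyDistributionAnalyser`, the made-up table, the cutoff, and the plain ratio used as confidence are all hypothetical and are not the formula used by the real analyzer.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Toy distribution analyser. The table maps a character code to its
// frequency order (0 = most frequent); table and cutoff are made up.
class ToyDistributionAnalyser {
 public:
  ToyDistributionAnalyser(const std::unordered_map<std::uint16_t, int>& orderTable,
                          int frequentCutoff)
      : mTable(orderTable), mCutoff(frequentCutoff) {}

  // Sample one decoded character; characters missing from the table are skipped.
  void HandleOneChar(std::uint16_t code) {
    auto it = mTable.find(code);
    if (it == mTable.end()) return;
    ++mTotal;
    if (it->second < mCutoff) ++mFrequent;
  }

  // Share of sampled characters that fall in the "frequent" set.
  float GetConfidence() const {
    return mTotal == 0 ? 0.0f : static_cast<float>(mFrequent) / mTotal;
  }

 private:
  std::unordered_map<std::uint16_t, int> mTable;  // private copy of the table
  int mCutoff;
  int mTotal = 0;
  int mFrequent = 0;
};
```

The idea is that text really written in the language draws most of its characters from the top of the frequency order, while bytes misread in the wrong encoding land on rare or unmapped characters.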

CharDistribution.h

CharDistribution.cpp

Implementation of the character distribution analyzer.

nsPkgInt.h

nsCodingStateMachine.h

These are the basis of the state machine implementation.

nsEscSM.cpp

State machines for ISO-2022XX and HZ.

nsMBCSSM.cpp

State machines for Big5, EUC-JP, EUC-KR, EUC-TW, GB2312, SJIS, and UTF8.

JpCntx.h

JpCntx.cpp

Japanese hiragana sequence analyzer.

nsBig5Prober.h

nsBig5Prober.cpp

nsEUCKRProber.h

nsEUCKRProber.cpp

nsEUCJPProber.h

nsEUCJPProber.cpp

nsEUCTWProber.h

nsEUCTWProber.cpp

nsSJISProber.h

nsSJISProber.cpp

nsGB2312Prober.h

nsGB2312Prober.cpp

nsUTF8Prober.h

nsUTF8Prober.cpp

Charset prober class definitions and implementations for each encoding. Each prober has an embedded state machine and a character distribution analyzer, except UTF8, whose state machine is good enough on its own.

nsMBCSProber.h

nsMBCSProber.cpp

This is a wrapper for all the MBCS probers. In the very beginning I expected to put some high-level logic here that is based on knowledge of multiple encodings. That might still be needed in the future.

How the single-byte encoding charset probers work

For each encoding, a table maps each character to an encoding-independent identification number. These identification numbers in fact come from the characters' frequency order, with some adjustment. For each language, a 2-D matrix is defined as the language model. If cell <x, y> is 0, the sequence <character(x), character(y)> is rarely used in this language, where character(x) is the character whose identification number is x. The 2-D matrix only defines sequences over a subset of all characters; characters whose identification number is outside this range are ignored. Some sequences, like ASCII-to-ASCII sequences, have no relation to the language being verified and should not be counted. In the current implementation, a sequence is counted only if both characters are 8-bit ones. In some situations, a sequence that contains only one 8-bit character is expected to be counted as well.
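
The table-plus-matrix lookup described above can be sketched as follows. Everything here is a toy under stated assumptions: the four hypothetical 8-bit letters 0xE0-0xE3, their orders, the 4x4 model, and the names `OrderOf`, `kModel`, and `CountSequences` are invented for illustration and do not come from any real language model.

```cpp
#include <cassert>
#include <string>

const int kSampleSize = 4;          // toy: only orders 0..3 are modeled
const unsigned char kIgnore = 255;  // order for characters outside the model

// Toy char-to-order map: hypothetical letters 0xE0..0xE3 get orders 0..3.
// Everything else, including all ASCII, is outside the model.
unsigned char OrderOf(unsigned char c) {
  return (c >= 0xE0 && c <= 0xE3) ? static_cast<unsigned char>(c - 0xE0)
                                  : kIgnore;
}

// Toy language model: cell [x][y] == 0 means "x then y" is rarely used.
const unsigned char kModel[kSampleSize][kSampleSize] = {
    {0, 1, 1, 0},
    {1, 0, 1, 0},
    {1, 1, 0, 0},
    {0, 0, 0, 0},
};

struct SequenceStats {
  int total;   // sequences where both characters are inside the model
  int likely;  // of those, sequences the model does not mark as rare
};

// Count 2-character sequences; a pair is counted only if both characters
// are modeled 8-bit characters, so ASCII neighbors are skipped.
SequenceStats CountSequences(const std::string& text) {
  SequenceStats stats = {0, 0};
  unsigned char prev = kIgnore;
  for (char ch : text) {
    unsigned char order = OrderOf(static_cast<unsigned char>(ch));
    if (order < kSampleSize && prev < kSampleSize) {
      ++stats.total;
      if (kModel[prev][order] != 0) ++stats.likely;
    }
    prev = order;
  }
  return stats;
}
```

A prober built on such counts could treat a high share of likely sequences as evidence for the language and a high share of rare ones as evidence against it. How the two counts are turned into a confidence value is left open here.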

LangCyrillicModel.cpp: this file defines a mapping table for each encoding and a 2-D matrix for all Cyrillic languages. A "SequenceModel" structure is also defined for each encoding; it is used to initialize a single-byte charset prober class. All Cyrillic encodings share the same prober class implementation.

nsSBCharSetProber.h

nsSBCharSetProber.cpp: these 2 files define and implement the single-byte charset prober.