Universal charset detector in birdview.

2024-12-11 16:32:59 +00:00 · 2001-07-17 23:29:53 +00:00 · 2001-07-17 23:29:53 +00:00 · 0fa88bba45
commit 0fa88bba45
parent 3570e5b622
1 changed files with 231 additions and 0 deletions
--- a/extensions/universalchardet/doc/ChardetInterface.htm
+++ b/extensions/universalchardet/doc/ChardetInterface.htm
@@ -0,0 +1,231 @@
+<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
+<html>
+<head>
+   <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
+   <meta name="GENERATOR" content="Mozilla/4.76 [en] (WinNT; U) [Netscape]">
+   <title>Charset Detector Interface</title>
+</head>
+<body>
+<font face="Arial">Charset Detector Interface</font>
+<p><font size=-1>This is the interface that the charset detector exposes
+to the outside world, in our case the browser. The caller first calls
+the detector’s "Init()" method to tell the detector how it would like
+to be notified of the detection result; the Observer pattern is used
+for this. The caller then feeds text data to the charset detector
+through "DoIt()". This can be done as a series of "DoIt()" calls,
+each carrying only part of the data, which is very useful
+when the data is only partially available at any one time. In our case, since
+the data comes from the network, detection can start long before the network
+finishes transferring all the data. When the detector is confident enough about
+one encoding, it notifies its caller and stops detecting. If all the data
+has been fed to the detector but it is still not confident enough about
+any encoding, calling "Done()" tells the detector to make its best guess.</font>
+<p><font face="Courier New"><font size=-1>class nsICharsetDetector : public
+nsISupports {</font></font>
+<br><font face="Courier New"><font size=-1>&nbsp; public:</font></font>
+<br><font face="Courier New"><font size=-1>&nbsp; NS_DEFINE_STATIC_IID_ACCESSOR(NS_ICHARSETDETECTOR_IID)</font></font>
+<p><font face="Courier New"><font size=-1>&nbsp; //Set up the observer so
+the detector knows how to report the answer</font></font>
+<br><font face="Courier New"><font size=-1>&nbsp; NS_IMETHOD Init(nsICharsetDetectionObserver*
+observer) = 0;</font></font>
+<p><font face="Courier New"><font size=-1>&nbsp; //Feed a block of bytes
+to the detector.</font></font>
+<br><font face="Courier New"><font size=-1>&nbsp; //It will call the Notify
+function of the nsICharsetDetectionObserver if it</font></font>
+<br><font face="Courier New"><font size=-1>&nbsp; //finds the answer.</font></font>
+<br><font face="Courier New"><font size=-1>&nbsp; // aBytesArray - array
+of bytes</font></font>
+<br><font face="Courier New"><font size=-1>&nbsp; // aLen - length of aBytesArray</font></font>
+<br><font face="Courier New"><font size=-1>&nbsp; // oDontFeedMe - set to
+PR_TRUE if the detector does not need the</font></font>
+<br>&nbsp; <font face="Courier New"><font size=-1>// following blocks,</font></font>
+<br>&nbsp; <font face="Courier New"><font size=-1>// PR_FALSE if it needs more
+bytes.</font></font>
+<br>&nbsp; <font face="Courier New"><font size=-1>// This is used to improve
+performance</font></font>
+<br>&nbsp; <font face="Courier New"><font size=-1>NS_IMETHOD DoIt(const
+char* aBytesArray, PRUint32 aLen, PRBool* oDontFeedMe) = 0;</font></font>
+<p>&nbsp; <font face="Courier New"><font size=-1>//Tell the detector that this
+is its last chance to make a decision</font></font>
+<br>&nbsp; <font face="Courier New"><font size=-1>NS_IMETHOD Done() = 0;</font></font>
+<br><font size=-1>}<font face="Courier New">;</font></font>
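+<p><font size=-1>The calling sequence described above (Init, a series of DoIt
+calls, then Done) can be sketched in plain C++ without XPCOM. Everything here is
+a hypothetical stand-in: "DetectionObserver" and "ToyDetector" are not the real
+classes, and the toy detection rule (a UTF-8 byte order mark is a sure answer,
+anything else is a crude guess at Done) is only there to make the protocol visible.</font>

```cpp
#include <cstddef>
#include <string>

// Hypothetical observer: stores the answer the detector reports.
struct DetectionObserver {
  std::string charset;
  void Notify(const std::string& aCharset) { charset = aCharset; }
};

// Hypothetical stand-in with the Init/DoIt/Done shape of the interface.
// Its detection logic is deliberately trivial and is NOT the real algorithm.
class ToyDetector {
 public:
  void Init(DetectionObserver* aObserver) { mObserver = aObserver; }

  // Feed one block. *oDontFeedMe becomes true once an answer was reported.
  void DoIt(const char* aBytes, std::size_t aLen, bool* oDontFeedMe) {
    static const unsigned char kBom[3] = {0xEF, 0xBB, 0xBF};
    for (std::size_t i = 0; i < aLen && !mDone; ++i) {
      unsigned char b = static_cast<unsigned char>(aBytes[i]);
      if (b >= 0x80) mSawHighByte = true;
      if (mBomCandidate) {
        if (b != kBom[mBomMatched]) {
          mBomCandidate = false;      // stream does not start with the BOM
        } else if (++mBomMatched == 3) {
          Report("UTF-8");            // sure answer: notify and stop
        }
      }
    }
    *oDontFeedMe = mDone;
  }

  // Last chance: report a best guess if nothing was reported yet.
  void Done() {
    if (!mDone) Report(mSawHighByte ? "windows-1252" : "US-ASCII");
  }

 private:
  void Report(const char* aCharset) {
    mDone = true;
    if (mObserver) mObserver->Notify(aCharset);
  }

  DetectionObserver* mObserver = nullptr;
  bool mDone = false;
  bool mSawHighByte = false;
  bool mBomCandidate = true;
  int mBomMatched = 0;
};
```

+<p><font size=-1>A caller would keep passing network buffers to DoIt until
+oDontFeedMe comes back true, and call Done only if the data runs out first. Note
+that the byte order mark may be split across two DoIt calls, which is why the
+toy detector keeps its match position between calls.</font>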
+<br>&nbsp;
+<br>&nbsp;
+<p><font face="Arial">Inside Charset Detector</font>
+<p><font size=-1>Inside the charset detector, most of the work is done by the function
+"HandleData()". In fact, "DoIt()" does little more
+than call "HandleData()". The algorithm is outlined below in C-like
+pseudocode. Some details are omitted to keep the main points clear.</font>
+<p><font face="Courier New"><font size=-1>HandleData(batch_of_text)</font></font>
+<br><font face="Courier New"><font size=-1>{</font></font>
+<br><font face="Courier New"><font size=-1>&nbsp; if (batch_of_text contains
+BOM)</font></font>
+<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; report UCS2;</font></font>
+<br><font face="Courier New"><font size=-1>&nbsp; if ((inputState is PureAscii)
+|| (inputState is EscAscii))</font></font>
+<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; if (batch_of_text
+contains an 8-bit byte)</font></font>
+<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
+inputState = HighByte;</font></font>
+<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; else if ((inputState
+is PureAscii ) &amp;&amp; (batch_of_text contains Esc_Sequence) )</font></font>
+<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
+inputState = EscAscii;</font></font>
+<p><font face="Courier New"><font size=-1>&nbsp; if (inputState is HighByte)</font></font>
+<br><font face="Courier New"><font size=-1>&nbsp; {</font></font>
+<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; Remove ASCII
+characters that are not adjacent to an 8-bit byte</font></font>
+<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; For each
+prober in multibyte_probers</font></font>
+<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; prober.HandleData(batch_of_text);</font></font>
+<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; For each
+prober in singlebyte_probers</font></font>
+<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; prober.HandleData(batch_of_text);</font></font>
+<br><font face="Courier New"><font size=-1>&nbsp; }</font></font>
+<br><font face="Courier New"><font size=-1>&nbsp; else if (inputState is
+EscAscii)</font></font>
+<br><font face="Courier New"><font size=-1>&nbsp; {</font></font>
+<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; For each
+prober in (ISO2022_XX or HZ)</font></font>
+<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; prober.HandleData(batch_of_text);</font></font>
+<br><font face="Courier New"><font size=-1>&nbsp; }</font></font>
+<br><font face="Courier New"><font size=-1>}</font></font>
+<p><i><font face="Courier New"><font size=-1>nsUniversalDetector.h</font></font></i>
+<br><i><font face="Courier New"><font size=-1>nsUniversalDetector.cpp</font></font></i>
+<p><i><font face="Courier New"><font size=-1>These files implement the high-level
+control logic.</font></font></i>
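+<p><font size=-1>The input-state part of the pseudocode above can be written as
+a small runnable C++ sketch. The class name "InputTracker" is hypothetical, and
+treating either an ESC byte (0x1B) or the two-byte sequence "~{" as the start of
+an escape sequence is an assumption made for this illustration.</font>

```cpp
#include <cstddef>

enum InputState { ePureAscii, eEscAscii, eHighByte };

// Hypothetical helper that tracks the input state across blocks of text.
class InputTracker {
 public:
  InputState state = ePureAscii;

  void Feed(const char* aBuf, std::size_t aLen) {
    for (std::size_t i = 0; i < aLen; ++i) {
      unsigned char b = static_cast<unsigned char>(aBuf[i]);
      if (b & 0x80) {
        // Any 8-bit byte moves the input to HighByte, and it stays there.
        state = eHighByte;
      } else if (state == ePureAscii &&
                 (b == 0x1B || (b == '{' && mLastByte == '~'))) {
        // Assumed escape markers: ESC, or "~{" (possibly split across blocks).
        state = eEscAscii;
      }
      mLastByte = b;
    }
  }

 private:
  unsigned char mLastByte = 0;  // last byte of the previous block
};
```

+<p><font size=-1>Once the state is known, only the matching family of probers
+needs to see the data: the multi-byte and single-byte probers for HighByte, the
+escape-sequence probers for EscAscii, and none at all for PureAscii.</font>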
+<br>&nbsp;
+<br>&nbsp;
+<p><font face="Arial">Charset Prober</font>
+<p><font size=-1>A charset prober verifies whether the input data belongs
+to a certain encoding or group of encodings. It keeps its state in the member
+"mState", which has three possible values. "eDetecting" means it has not yet
+found a sure answer; "eFoundIt" and "eNotMe" mean
+what their names suggest. The method "GetCharSetName" tells the caller the
+prober’s sure answer or best guess.</font>
+<p><font size=-1>Generally, we implemented one charset prober for each
+encoding. Several probers can be wrapped together by a wrapper prober,
+and a single prober can also probe several encodings. Each charset
+prober is designed and implemented independently and works independently. This lets
+the caller leave out certain probers when it has prior knowledge.
+For example, if the user knows that an HTML page is in some Japanese encoding,
+the non-Japanese charset probers are not run. Users who have no interest
+in certain languages can also leave out those charset probers. These
+measures lead to a smaller footprint and better performance.</font>
+<p><font face="Courier New"><font size=-1>typedef enum {</font></font>
+<br><font face="Courier New"><font size=-1>&nbsp; eDetecting = 0,</font></font>
+<br><font face="Courier New"><font size=-1>&nbsp; eFoundIt = 1,</font></font>
+<br><font face="Courier New"><font size=-1>&nbsp; eNotMe = 2</font></font>
+<br><font size=-1>}<font face="Courier New"> nsProbingState;</font></font>
+<p><font face="Courier New"><font size=-1>class nsCharSetProber {</font></font>
+<br><font face="Courier New"><font size=-1>&nbsp; public:</font></font>
+<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; nsCharSetProber(){};</font></font>
+<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; virtual const
+char* GetCharSetName() {return "";};</font></font>
+<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; virtual nsProbingState
+HandleData(const char* aBuf, PRUint32 aLen) = 0;</font></font>
+<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; nsProbingState
+GetState(void) {return mState;};</font></font>
+<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; virtual void
+Reset(void) {mState = eDetecting;};</font></font>
+<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; virtual float
+GetConfidence(void) = 0;</font></font>
+<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; virtual void
+SetOpion() {};</font></font>
+<br><font face="Courier New"><font size=-1>&nbsp; protected:</font></font>
+<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; nsProbingState
+mState;</font></font>
+<br><font face="Courier New"><font size=-1>};</font></font>
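+<p><font size=-1>As an illustration of how a caller or wrapper prober might use
+the three states together with GetConfidence, here is a minimal sketch. The
+"Prober" base class is a simplified stand-in for nsCharSetProber, the two toy
+probers and their confidence values are invented for the example, and "PickBest"
+is a hypothetical helper rather than a function from the source.</font>

```cpp
#include <cstddef>
#include <vector>

enum ProbingState { eDetecting = 0, eFoundIt = 1, eNotMe = 2 };

// Simplified stand-in for the prober base class.
class Prober {
 public:
  virtual ~Prober() {}
  virtual const char* GetCharSetName() const = 0;
  virtual ProbingState HandleData(const char* aBuf, std::size_t aLen) = 0;
  virtual float GetConfidence() const = 0;
  ProbingState GetState() const { return mState; }

 protected:
  ProbingState mState = eDetecting;
};

// Toy prober: rules itself out as soon as it sees an 8-bit byte.
class AsciiOnlyProber : public Prober {
 public:
  const char* GetCharSetName() const override { return "US-ASCII"; }
  ProbingState HandleData(const char* aBuf, std::size_t aLen) override {
    for (std::size_t i = 0; i < aLen; ++i)
      if (static_cast<unsigned char>(aBuf[i]) >= 0x80) mState = eNotMe;
    return mState;
  }
  float GetConfidence() const override {
    return mState == eNotMe ? 0.0f : 0.5f;  // arbitrary toy value
  }
};

// Toy prober: confidence is simply the share of 8-bit bytes seen so far.
class HighByteProber : public Prober {
 public:
  const char* GetCharSetName() const override { return "x-toy-8bit"; }
  ProbingState HandleData(const char* aBuf, std::size_t aLen) override {
    for (std::size_t i = 0; i < aLen; ++i) {
      ++mTotal;
      if (static_cast<unsigned char>(aBuf[i]) >= 0x80) ++mHigh;
    }
    return mState;
  }
  float GetConfidence() const override {
    return mTotal ? static_cast<float>(mHigh) / mTotal : 0.0f;
  }

 private:
  std::size_t mTotal = 0;
  std::size_t mHigh = 0;
};

// Hypothetical helper: a prober in eFoundIt wins outright, probers in
// eNotMe are skipped, otherwise the highest confidence is the best guess.
const Prober* PickBest(const std::vector<Prober*>& aProbers) {
  const Prober* best = nullptr;
  for (const Prober* p : aProbers) {
    if (p->GetState() == eFoundIt) return p;
    if (p->GetState() == eNotMe) continue;
    if (!best || p->GetConfidence() > best->GetConfidence()) best = p;
  }
  return best;
}
```

+<p><font size=-1>Because each prober only depends on the shared base class,
+leaving a prober out of the list is all it takes to disable it.</font>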
+<br>&nbsp;
+<br>&nbsp;
+<p><font face="Arial">How multi-byte encoding charset prober works</font>
+<p><font size=-1>For the probers that verify the SJIS, EUC-JP, EUC-KR, EUC-CN
+(or GB2312), EUC-TW and Big5 encodings, each prober embeds a state machine (mCodingSM)
+that identifies legal byte sequences according to its encoding scheme. If an illegal
+byte sequence is met, the state machine reaches the "eError" state. That
+signifies a failure for this prober, and the prober reports a negative answer
+to its caller. Each time the state machine returns to the "eStart" state, a sequence
+of bytes has been identified as a character. This character is sent
+to the character distribution analyzer (mDistributionAnalyser) and the 2-char sequence
+analyzer (mContextAnalyser) for statistical sampling. A "GetConfidence" call
+tells the caller how likely it is that the input is in this
+encoding.</font>
+<p><font size=-1>Inside the "HandleData" method, a shortcut judgement is made
+each time a batch of text has been processed. If the prober
+has received enough data and has reached a certain confidence level, it sets
+its state to "eFoundIt" and gives its caller an immediate sure answer.</font>
+<p><font size=-1>For encodings like ISO_2022 and HZ, the embedded
+state machine does an almost perfect job alone, so no statistical sampling
+is done.</font>
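+<p><font size=-1>The role of the coding state machine can be shown with a toy
+example. The sketch below is not one of the tables from nsMBCSSM.cpp; it assumes
+a simplified EUC-style layout in which a character is either one ASCII byte or a
+lead byte and a trail byte that both lie in 0xA1..0xFE. The intermediate state
+"eTrail" and the names "ToyTwoByteSM" and "CountTwoByteChars" are hypothetical.</font>

```cpp
#include <cstddef>

enum SMState { eStart, eTrail, eError };

// Toy state machine for an assumed EUC-style two-byte layout.
class ToyTwoByteSM {
 public:
  SMState NextState(unsigned char aByte) {
    switch (mState) {
      case eStart:
        if (aByte < 0x80) mState = eStart;            // ASCII: character done
        else if (IsHigh(aByte)) mState = eTrail;      // lead byte: need a trail
        else mState = eError;                         // illegal lead byte
        break;
      case eTrail:
        mState = IsHigh(aByte) ? eStart : eError;     // trail byte or illegal
        break;
      case eError:
        break;                                        // error is final
    }
    return mState;
  }

 private:
  static bool IsHigh(unsigned char aByte) {
    return aByte >= 0xA1 && aByte <= 0xFE;
  }
  SMState mState = eStart;
};

// Runs the machine over a buffer. Returns false on an illegal sequence;
// otherwise *oCount holds the number of completed two-byte characters,
// which is the point where a real prober would sample the character.
// A buffer that ends in the middle of a character is not treated as an error.
bool CountTwoByteChars(const char* aBuf, std::size_t aLen,
                       std::size_t* oCount) {
  ToyTwoByteSM sm;
  SMState prev = eStart;
  *oCount = 0;
  for (std::size_t i = 0; i < aLen; ++i) {
    SMState next = sm.NextState(static_cast<unsigned char>(aBuf[i]));
    if (next == eError) return false;
    if (prev == eTrail && next == eStart) ++*oCount;
    prev = next;
  }
  return true;
}
```

+<p><font size=-1>A false return corresponds to the prober reporting a negative
+answer, while each counted character is what would be handed to the
+distribution and sequence analyzers.</font>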
+<p><i><font size=-1>Big5Freq.tab</font></i>
+<p><i><font size=-1>EUCKRFreq.tab</font></i>
+<p><i><font size=-1>EUCTWFreq.tab</font></i>
+<p><i><font size=-1>GB2312Freq.tab</font></i>
+<p><i><font size=-1>JISFreq.tab</font></i>
+<p><i><font size=-1>These files define the frequency table (a mapping from character
+to frequency order) for each language. Because Big5 and EUC-TW are
+not based on the same charset standard, unlike EUC-JP and SJIS, a separate table
+is defined for each of the two.</font></i>
+<p><i><font size=-1>CharDistribution.h</font></i>
+<p><i><font size=-1>CharDistribution.cpp</font></i>
+<p><i><font size=-1>Implementation for Character distribution analyzer.</font></i>
+<p><i><font size=-1>nsPkgInt.h</font></i>
+<p><i><font size=-1>nsCodingStateMachine.h</font></i>
+<p><i><font size=-1>These form the basis of the state machine implementation.</font></i>
+<p><i><font size=-1>nsEscSM.cpp</font></i>
+<p><i><font size=-1>State machines for ISO-2022XX and HZ.</font></i>
+<p><i><font size=-1>nsMBCSSM.cpp</font></i>
+<p><i><font size=-1>State machines for Big5, EUC-JP, EUC-KR, EUC-TW, GB2312,
+SJIS, and UTF8.</font></i>
+<p><i><font size=-1>JpCntx.h</font></i>
+<p><i><font size=-1>JpCntx.cpp</font></i>
+<p><i><font size=-1>Japanese hiragana sequence analyzer.</font></i>
+<p><i><font size=-1>nsBig5Prober.h</font></i>
+<p><i><font size=-1>nsBig5Prober.cpp</font></i>
+<p><i><font size=-1>nsEUCKRProber.h</font></i>
+<p><i><font size=-1>nsEUCKRProber.cpp</font></i>
+<p><i><font size=-1>nsEUCJPProber.h</font></i>
+<p><i><font size=-1>nsEUCJPProber.cpp</font></i>
+<p><i><font size=-1>nsEUCTWProber.h</font></i>
+<p><i><font size=-1>nsEUCTWProber.cpp</font></i>
+<p><i><font size=-1>nsSJISProber.h</font></i>
+<p><i><font size=-1>nsSJISProber.cpp</font></i>
+<p><i><font size=-1>nsGB2312Prober.h</font></i>
+<p><i><font size=-1>nsGB2312Prober.cpp</font></i>
+<p><i><font size=-1>nsUTF8Prober.h</font></i>
+<p><i><font size=-1>nsUTF8Prober.cpp</font></i>
+<p><i><font size=-1>Charset prober class definitions and implementations
+for each encoding. Each prober has an embedded state machine and a character
+distribution analyzer, except UTF8, whose state machine is good enough on its own.</font></i>
+<p><i><font size=-1>nsMBCSProber.h</font></i>
+<p><i><font size=-1>nsMBCSProber.cpp</font></i>
+<p><i><font size=-1>This is a wrapper for all the MBCS probers. At the very beginning
+I expected to put some high-level logic here that draws on knowledge of multiple
+encodings. That might still be needed in the future.</font></i>
+<br>&nbsp;
+<br>&nbsp;
+<p><font face="Arial">How single-byte encoding charset prober works</font>
+<p><font size=-1>For each encoding, a table maps each character
+to an encoding-independent identification number. These identification
+numbers are derived from the characters’ frequency order, with some adjustments.
+For each language, a 2-D matrix is defined as the language model. If cell &lt;x,
+y&gt; is 0, the sequence &lt;character(x), character(y)&gt; is rarely
+used in this language, where character(x) denotes the character
+whose identification number is x. The 2-D matrix only covers sequences
+over a subset of all the characters; characters whose identification
+number falls outside this range are ignored. Some
+sequences, such as ASCII-to-ASCII sequences, have no relation to the
+language being verified and should not be counted. In
+the current implementation, a sequence is counted if both characters are
+8-bit ones. In some situations, a sequence with only one 8-bit character is also expected
+to be counted.</font>
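+<p><font size=-1>The sequence counting described above can be sketched with a
+tiny made-up model. All data here is hypothetical: the mapping in "OrderOf" (bytes
+0xE0..0xE3 get identification numbers 0..3, everything else is ignored) and the
+4x4 matrix "kToyModel" are invented for illustration and do not come from any real
+language model.</font>

```cpp
#include <cstddef>

const int kSampleSize = 4;            // matrix covers id numbers 0..3 only
const unsigned char kIgnore = 255;    // marks characters outside the matrix

// Hypothetical character-to-identification-number table.
// ASCII and all other bytes are ignored, so only sequences of two
// covered 8-bit characters are ever counted.
unsigned char OrderOf(unsigned char aByte) {
  if (aByte >= 0xE0 && aByte < 0xE0 + kSampleSize)
    return static_cast<unsigned char>(aByte - 0xE0);
  return kIgnore;
}

// Hypothetical language model: 0 = rarely used sequence, 1 = likely sequence.
const unsigned char kToyModel[kSampleSize][kSampleSize] = {
    {0, 1, 1, 0},
    {1, 0, 1, 0},
    {1, 1, 0, 0},
    {0, 0, 0, 0},
};

struct SequenceStats {
  std::size_t total = 0;   // counted sequences
  std::size_t likely = 0;  // counted sequences with a non-zero matrix cell
  float Ratio() const {
    return total ? static_cast<float>(likely) / total : 0.0f;
  }
};

// Counts every pair of adjacent characters that are both in the matrix.
SequenceStats CountSequences(const char* aBuf, std::size_t aLen) {
  SequenceStats stats;
  unsigned char prevOrder = kIgnore;
  for (std::size_t i = 0; i < aLen; ++i) {
    unsigned char order = OrderOf(static_cast<unsigned char>(aBuf[i]));
    if (order != kIgnore && prevOrder != kIgnore) {
      ++stats.total;
      if (kToyModel[prevOrder][order] != 0) ++stats.likely;
    }
    prevOrder = order;
  }
  return stats;
}
```

+<p><font size=-1>A high ratio of likely sequences suggests that the text fits
+the model; how that ratio is turned into a confidence value is left out of this
+sketch.</font>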
+<p><i><font size=-1>LangCyrillicModel.cpp : this file defines a mapping
+table for each encoding and a 2-D matrix for all Cyrillic languages. A
+"SequenceModel" structure is also defined for each encoding. This structure
+is used to initialize a single-byte charset prober class. All Cyrillic
+encodings share the same prober class implementation.</font></i>
+<p><i><font size=-1>nsSBCharSetProber.h</font></i>
+<p><i><font size=-1>nsSBCharSetProber.cpp : These 2 files define and implement
+the single-byte charset prober.</font></i>
+</body>
+</html>