mirror of
https://github.com/mozilla/gecko-dev.git
synced 2024-12-11 16:32:59 +00:00
Universal charset detector in birdview.
parent
3570e5b622
commit
0fa88bba45
231
extensions/universalchardet/doc/ChardetInterface.htm
Normal file
@@ -0,0 +1,231 @@
<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="GENERATOR" content="Mozilla/4.76 [en] (WinNT; U) [Netscape]">
<title>Charset Detector Interface</title>
</head>
<body>
<font face="Arial">Charset Detector Interface</font>
<p><font size=-1>This is the interface that the charset detector exposes
to the outside world, which in our case is the browser. The caller first
calls the detector's "Init()" method to tell the detector how it would like
to be notified of the detection result; the observer pattern is used for
this. The caller then feeds text data to the detector through "DoIt()".
The data can be split across a series of "DoIt()" calls, each carrying
only part of the data, which is useful when the data is only partially
available at any one time. In our case the data comes from the network,
so detection can start long before the transfer completes. When the
detector is confident enough about one encoding, it notifies its caller
and stops detecting. If all the data has been fed to the detector and it
is still not confident enough about any encoding, calling "Done()" tells
the detector to make its best guess.</font>
<p><font face="Courier New"><font size=-1>class nsICharsetDetector : public nsISupports {</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;public:</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;NS_DEFINE_STATIC_IID_ACCESSOR(NS_ICHARSETDETECTOR_IID)</font></font>
<p><font face="Courier New"><font size=-1>&nbsp;&nbsp;// Set up the observer so the detector knows how to report its answer.</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;NS_IMETHOD Init(nsICharsetDetectionObserver* observer) = 0;</font></font>
<p><font face="Courier New"><font size=-1>&nbsp;&nbsp;// Feed a block of bytes to the detector.</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;// It calls the Notify function of the nsICharsetDetectionObserver</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;// once it has found the answer.</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;// aBytesArray - array of bytes</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;// aLen - length of aBytesArray</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;// oDontFeedMe - set to PR_TRUE if the detector does not need the</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;// following blocks,</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;// PR_FALSE if it needs more bytes.</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;// This is used to enhance performance.</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;NS_IMETHOD DoIt(const char* aBytesArray, PRUint32 aLen, PRBool* oDontFeedMe) = 0;</font></font>
<p><font face="Courier New"><font size=-1>&nbsp;&nbsp;// Tell the detector that no more data is coming; this is its last chance to make a decision.</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;NS_IMETHOD Done() = 0;</font></font>
<br><font face="Courier New"><font size=-1>};</font></font>
<br>
<br>
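<p><font size=-1>The calling sequence described above (Init, a series of
DoIt calls, then Done) can be sketched without XPCOM as follows. This is only
an illustration: "ToyObserver", "ToyDetector" and "DetectInChunks" are
hypothetical stand-ins, and the toy detection rule (a UTF-8 BOM means UTF-8,
pure 7-bit data means US-ASCII) is an assumption, not the detector's real
logic. The chunk size is assumed to be greater than zero.</font>

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical stand-in for nsICharsetDetectionObserver.
struct ToyObserver {
  std::string charset;  // stays empty until the detector reports
  void Notify(const char* aCharset) { charset = aCharset; }
};

// Hypothetical stand-in for an nsICharsetDetector implementation.
class ToyDetector {
 public:
  void Init(ToyObserver* aObserver) { mObserver = aObserver; }

  // Toy rule (an assumption): a UTF-8 BOM at the start of the stream is a
  // sure answer; otherwise the detector keeps asking for more data.
  void DoIt(const char* aBytes, std::size_t aLen, bool* oDontFeedMe) {
    mSeen.append(aBytes, aLen);
    if (!mDone && mSeen.size() >= 3 &&
        mSeen.compare(0, 3, "\xEF\xBB\xBF") == 0) {
      Report("UTF-8");
    }
    *oDontFeedMe = mDone;
  }

  // Last chance: no more data is coming, so make a best guess.
  void Done() {
    if (mDone) return;
    bool sawHighByte = false;
    for (unsigned char c : mSeen) {
      if (c & 0x80) sawHighByte = true;
    }
    Report(sawHighByte ? "unknown" : "US-ASCII");
  }

 private:
  void Report(const char* aCharset) {
    mDone = true;
    mObserver->Notify(aCharset);
  }

  ToyObserver* mObserver = nullptr;
  std::string mSeen;
  bool mDone = false;
};

// Caller side: feed the data in pieces, stop early once the detector says
// it needs no more, then call Done() so it can fall back to a best guess.
std::string DetectInChunks(const std::string& aData, std::size_t aChunkSize) {
  ToyObserver observer;
  ToyDetector detector;
  detector.Init(&observer);
  bool dontFeedMe = false;
  for (std::size_t pos = 0; pos < aData.size() && !dontFeedMe;
       pos += aChunkSize) {
    std::size_t len = std::min(aChunkSize, aData.size() - pos);
    detector.DoIt(aData.data() + pos, len, &dontFeedMe);
  }
  detector.Done();
  return observer.charset;
}
```

<p><font size=-1>Because the toy detector sets "oDontFeedMe" as soon as it
is sure, the caller stops feeding early, which is the performance benefit
the interface comment refers to.</font>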
<p><font face="Arial">Inside Charset Detector</font>
<p><font size=-1>Inside the charset detector, most of the work is done by the
function "HandleData()". In fact, "DoIt()" does little more than call
"HandleData()". The algorithm is outlined below in a C-like pseudo-language;
some details are omitted to keep the main points clear.</font>
<p><font face="Courier New"><font size=-1>HandleData(batch_of_text)</font></font>
<br><font face="Courier New"><font size=-1>{</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;if (batch_of_text contains BOM)</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp;&nbsp;report UCS2;</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;if ((inputState is PureAscii) || (inputState is EscAscii))</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp;&nbsp;if (batch_of_text contains 8-bit byte)</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;inputState = HighByte;</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp;&nbsp;else if ((inputState is PureAscii) &amp;&amp; (batch_of_text contains Esc_Sequence))</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;inputState = EscAscii;</font></font>
<p><font face="Courier New"><font size=-1>&nbsp;&nbsp;if (inputState is HighByte)</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;{</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp;&nbsp;Remove ASCII characters that are not adjacent to an 8-bit byte</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp;&nbsp;For each prober in multibyte_probers</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Prober.HandleData(batch_of_text);</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp;&nbsp;For each prober in singlebyte_probers</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Prober.HandleData(batch_of_text);</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;}</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;else if (inputState is EscAscii)</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;{</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp;&nbsp;For each prober in (ISO2022_XX or HZ)</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Prober.HandleData(batch_of_text);</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;}</font></font>
<br><font face="Courier New"><font size=-1>}</font></font>
<p><i><font face="Courier New"><font size=-1>nsUniversalDetector.h</font></font></i>
<br><i><font face="Courier New"><font size=-1>nsUniversalDetector.cpp</font></font></i>
<p><i><font face="Courier New"><font size=-1>These files implement the
high-level control logic.</font></font></i>
<br>
<br>
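<p><font size=-1>The input-state transitions in the HandleData() pseudocode
above can be sketched in C++ roughly as follows. This is a minimal sketch, not
the code in nsUniversalDetector.cpp: the function name "NextInputState" is
hypothetical, and treating an ESC byte or the pair "~{" as the escape marker
is an assumption about what "Esc_Sequence" means. A "~{" pair split across
two batches is not handled here.</font>

```cpp
#include <string>

enum InputState { ePureAscii, eEscAscii, eHighByte };

// Returns the input state after looking at one more batch of text.
// Once the state is eHighByte it never changes again.
InputState NextInputState(InputState aState, const std::string& aBatch) {
  if (aState == eHighByte) return eHighByte;

  bool sawHighByte = false;
  bool sawEscape = false;
  char prev = '\0';
  for (char ch : aBatch) {
    if (static_cast<unsigned char>(ch) & 0x80) sawHighByte = true;
    // Assumption: ESC starts an ISO-2022 sequence, "~{" starts HZ.
    if (ch == '\x1B' || (ch == '{' && prev == '~')) sawEscape = true;
    prev = ch;
  }

  // An 8-bit byte wins over an escape sequence, as in the pseudocode.
  if (sawHighByte) return eHighByte;
  if (aState == ePureAscii && sawEscape) return eEscAscii;
  return aState;
}
```

<p><font size=-1>The state decides which group of probers sees the data:
eHighByte selects the multi-byte and single-byte probers, eEscAscii selects
the escape-sequence probers, and ePureAscii needs no prober at all.</font>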
<p><font face="Arial">Charset Prober</font>
<p><font size=-1>A charset prober verifies whether the input data belongs
to a certain encoding or group of encodings. It keeps its state in the member
"mState", which has three possible values: "eDetecting" means the prober has
not reached a sure answer yet, "eFoundIt" means it is sure the data is in its
encoding, and "eNotMe" means it is sure the data is not. The method
"GetCharSetName()" returns the prober's sure answer or best guess.</font>
<p><font size=-1>Generally, we implemented one charset prober for each
encoding. Several probers can be grouped together by a wrapper prober, and a
single prober can also "probe" several encodings. Each charset prober is
designed and implemented independently and works independently. This lets
the caller skip certain probers when it has prior knowledge. For example,
if the user knows that an HTML page is in some Japanese encoding, the
non-Japanese charset probers need not be run. Users who have no interest in
certain languages can likewise leave out the corresponding probers. These
measures lead to a smaller footprint and better performance.</font>
<p><font face="Courier New"><font size=-1>typedef enum {</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;eDetecting = 0,</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;eFoundIt = 1,</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;eNotMe = 2</font></font>
<br><font face="Courier New"><font size=-1>} nsProbingState;</font></font>
<p><font face="Courier New"><font size=-1>class nsCharSetProber {</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;public:</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;nsCharSetProber(){};</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;virtual const char* GetCharSetName() {return "";};</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;virtual nsProbingState HandleData(const char* aBuf, PRUint32 aLen) = 0;</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;nsProbingState GetState(void) {return mState;};</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;virtual void Reset(void) {mState = eDetecting;};</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;virtual float GetConfidence(void) = 0;</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;virtual void SetOpion() {};</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;protected:</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;nsProbingState mState;</font></font>
<br><font face="Courier New"><font size=-1>};</font></font>
<br>
<br>
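<p><font size=-1>To show how a caller might use the three prober states
together with "GetConfidence()", here is a small self-contained sketch. All
names in it ("ToyProber", "ToyAsciiProber", "ToyFallbackProber", "BestGuess",
the charset name "x-toy-fallback" and the confidence values) are hypothetical;
the classes only mirror the shape of nsCharSetProber and are not the real
probers.</font>

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum ProbingState { eDetecting = 0, eFoundIt = 1, eNotMe = 2 };

// Simplified mirror of the prober interface described above.
class ToyProber {
 public:
  virtual ~ToyProber() {}
  virtual const char* GetCharSetName() = 0;
  virtual ProbingState HandleData(const char* aBuf, std::size_t aLen) = 0;
  virtual float GetConfidence() = 0;
  ProbingState GetState() const { return mState; }

 protected:
  ProbingState mState = eDetecting;
};

// Hypothetical prober: gives up as soon as it sees a byte >= 0x80.
class ToyAsciiProber : public ToyProber {
 public:
  const char* GetCharSetName() override { return "US-ASCII"; }
  ProbingState HandleData(const char* aBuf, std::size_t aLen) override {
    for (std::size_t i = 0; i < aLen; ++i) {
      if (static_cast<unsigned char>(aBuf[i]) & 0x80) mState = eNotMe;
    }
    return mState;
  }
  float GetConfidence() override { return mState == eNotMe ? 0.0f : 0.9f; }
};

// Hypothetical prober: accepts anything, but with low confidence.
class ToyFallbackProber : public ToyProber {
 public:
  const char* GetCharSetName() override { return "x-toy-fallback"; }
  ProbingState HandleData(const char*, std::size_t) override { return mState; }
  float GetConfidence() override { return 0.1f; }
};

// Feed one buffer to every prober. A prober in eFoundIt wins at once,
// probers in eNotMe are skipped, and otherwise the highest confidence wins.
std::string BestGuess(const std::vector<ToyProber*>& aProbers,
                      const char* aBuf, std::size_t aLen) {
  std::string best;
  float bestConfidence = 0.0f;
  for (ToyProber* prober : aProbers) {
    ProbingState state = prober->HandleData(aBuf, aLen);
    if (state == eFoundIt) return prober->GetCharSetName();
    if (state == eNotMe) continue;
    float confidence = prober->GetConfidence();
    if (confidence > bestConfidence) {
      bestConfidence = confidence;
      best = prober->GetCharSetName();
    }
  }
  return best;
}
```

<p><font size=-1>Because the caller only sees the common interface, it can
drop any prober from the list (for example all non-Japanese ones) without
touching the others.</font>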
<p><font face="Arial">How the multi-byte encoding charset probers work</font>
<p><font size=-1>The probers for the SJIS, EUC-JP, EUC-KR, EUC-CN (GB2312),
EUC-TW and Big5 encodings each embed a state machine (mCodingSM) that
identifies legal byte sequences according to the encoding scheme. If an
illegal byte sequence is encountered, the state machine reaches the "eError"
state. That signifies failure for this prober, which reports a negative
answer to its caller. Whenever the state machine returns to the "eStart"
state, a sequence of bytes has been identified as one character. That
character is passed to the character distribution analyzer
(mDistributionAnalyser) and the 2-char sequence analyzer (mContextAnalyser)
for statistical sampling. A call to "GetConfidence()" tells the caller how
likely the input is to be in this encoding.</font>
<p><font size=-1>Each time "HandleData()" has processed a batch of text, it
applies a shortcut test: if the prober has received enough data and has
reached a certain confidence level, it sets its state to "eFoundIt" and
gives its caller an immediate, sure answer.</font>
<p><font size=-1>For encodings such as ISO_2022 and HZ, the embedded state
machine does an almost perfect job on its own, so no additional statistical
sampling is done.</font>
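<p><font size=-1>The role of the coding state machine can be illustrated
with a deliberately simplified two-byte scheme. The byte ranges below (one
byte 0x00-0x7F, or a lead byte 0xA1-0xFE followed by a trail byte 0xA1-0xFE)
are an assumption chosen for illustration and do not reproduce any table in
nsMBCSSM.cpp; "eNeedTrail", "NextState" and "CountCharacters" are hypothetical
names.</font>

```cpp
#include <string>

enum SMState { eStart, eError, eNeedTrail };

// One transition of the toy state machine.
SMState NextState(SMState aState, unsigned char aByte) {
  bool inHighRange = aByte >= 0xA1 && aByte <= 0xFE;
  switch (aState) {
    case eStart:
      if (aByte < 0x80) return eStart;           // one-byte character
      return inHighRange ? eNeedTrail : eError;  // lead byte, or illegal
    case eNeedTrail:
      return inHighRange ? eStart : eError;      // trail byte, or illegal
    default:
      return eError;                             // errors are final
  }
}

// Returns the number of complete characters recognized, or -1 as soon as
// an illegal byte sequence is met (the prober would then report eNotMe).
int CountCharacters(const std::string& aBytes) {
  SMState state = eStart;
  int count = 0;
  for (unsigned char byte : aBytes) {
    state = NextState(state, byte);
    if (state == eError) return -1;
    // Back at eStart: a whole character has just been identified. This is
    // the point where a real prober would hand it to its analyzers.
    if (state == eStart) ++count;
  }
  return count;
}
```

<p><font size=-1>A lead byte left pending at the end of a batch is not an
error in this sketch, since its trail byte may arrive with the next
batch.</font>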
<p><i><font size=-1>Big5Freq.tab</font></i>
<p><i><font size=-1>EUCKRFreq.tab</font></i>
<p><i><font size=-1>EUCTWFreq.tab</font></i>
<p><i><font size=-1>GB2312Freq.tab</font></i>
<p><i><font size=-1>JISFreq.tab</font></i>
<p><i><font size=-1>These files define the frequency table (a mapping from
character to frequency order) for each language. Because Big5 and EUC-TW are
not based on the same charset standard, as EUC-JP and SJIS are, two separate
tables are defined for them.</font></i>
<p><i><font size=-1>CharDistribution.h</font></i>
<p><i><font size=-1>CharDistribution.cpp</font></i>
<p><i><font size=-1>Implementation of the character distribution analyzer.</font></i>
<p><i><font size=-1>nsPkgInt.h</font></i>
<p><i><font size=-1>nsCodingStateMachine.h</font></i>
<p><i><font size=-1>These are the basis of the state machine implementation.</font></i>
<p><i><font size=-1>nsEscSM.cpp</font></i>
<p><i><font size=-1>State machines for the ISO-2022-XX encodings and HZ.</font></i>
<p><i><font size=-1>nsMBCSSM.cpp</font></i>
<p><i><font size=-1>State machines for Big5, EUC-JP, EUC-KR, EUC-TW, GB2312,
SJIS, and UTF8.</font></i>
<p><i><font size=-1>JpCntx.h</font></i>
<p><i><font size=-1>JpCntx.cpp</font></i>
<p><i><font size=-1>Japanese hiragana sequence analyzer.</font></i>
<p><i><font size=-1>nsBig5Prober.h</font></i>
<p><i><font size=-1>nsBig5Prober.cpp</font></i>
<p><i><font size=-1>nsEUCKRProber.h</font></i>
<p><i><font size=-1>nsEUCKRProber.cpp</font></i>
<p><i><font size=-1>nsEUCJPProber.h</font></i>
<p><i><font size=-1>nsEUCJPProber.cpp</font></i>
<p><i><font size=-1>nsEUCTWProber.h</font></i>
<p><i><font size=-1>nsEUCTWProber.cpp</font></i>
<p><i><font size=-1>nsSJISProber.h</font></i>
<p><i><font size=-1>nsSJISProber.cpp</font></i>
<p><i><font size=-1>nsGB2312Prober.h</font></i>
<p><i><font size=-1>nsGB2312Prober.cpp</font></i>
<p><i><font size=-1>nsUTF8Prober.h</font></i>
<p><i><font size=-1>nsUTF8Prober.cpp</font></i>
<p><i><font size=-1>Charset prober class definitions and implementations
for each encoding. Each prober has an embedded state machine and a character
distribution analyzer, except the UTF8 prober, whose state machine is
sufficient on its own.</font></i>
<p><i><font size=-1>nsMBCSProber.h</font></i>
<p><i><font size=-1>nsMBCSProber.cpp</font></i>
<p><i><font size=-1>This is a wrapper around all the MBCS probers. I
originally expected to place here some high-level logic that draws on
knowledge of multiple encodings. That might still be needed in the future.</font></i>
<br>
<br>
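<p><font size=-1>The idea behind the frequency tables and the character
distribution analyzer can be sketched as follows. This is not the code in
CharDistribution.cpp: the class name "ToyDistributionAnalyser", the cutoff
parameter and in particular the confidence formula (the plain share of
frequently used characters among all sampled characters) are assumptions made
for illustration only.</font>

```cpp
#include <cstddef>
#include <vector>

class ToyDistributionAnalyser {
 public:
  // aOrderTable[i] is the frequency order of character index i, where 0 is
  // the most frequent character. Characters whose order is below
  // aFrequentCutoff are treated as "frequently used".
  ToyDistributionAnalyser(const std::vector<int>& aOrderTable,
                          int aFrequentCutoff)
      : mOrderTable(aOrderTable), mFrequentCutoff(aFrequentCutoff) {}

  // Called once for every character the state machine has identified.
  void HandleOneChar(std::size_t aCharIndex) {
    if (aCharIndex >= mOrderTable.size()) return;  // not in the table
    ++mTotalChars;
    if (mOrderTable[aCharIndex] < mFrequentCutoff) ++mFreqChars;
  }

  // Assumed formula: share of frequent characters among sampled ones.
  float GetConfidence() const {
    if (mTotalChars == 0) return 0.0f;
    return static_cast<float>(mFreqChars) / static_cast<float>(mTotalChars);
  }

 private:
  std::vector<int> mOrderTable;
  int mFrequentCutoff;
  int mTotalChars = 0;
  int mFreqChars = 0;
};
```

<p><font size=-1>Text in the right encoding should decode mostly to
frequently used characters, while text in a wrong encoding tends to land on
rare ones, which pulls the ratio down.</font>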
<p><font face="Arial">How the single-byte encoding charset prober works</font>
<p><font size=-1>For each encoding, a table maps each character to an
encoding-independent identification number. These identification numbers are
derived from the characters' frequency order, with some adjustment. For each
language, a 2-D matrix is defined as the language model. If cell &lt;x, y&gt;
is 0, the sequence &lt;character(x), character(y)&gt; is rarely used in this
language, where character(x) is the character whose identification number is
x. The 2-D matrix covers sequences over only a subset of all characters;
characters whose identification number falls outside this range are ignored.
Some sequences, such as ASCII-to-ASCII sequences, have no bearing on the
language being verified and should not be counted. In the current
implementation, a sequence is counted only if both characters are 8-bit
ones. In some situations, a sequence containing just one 8-bit character
would ideally be counted as well.</font>
<p><i><font size=-1>LangCyrillicModel.cpp : defines a mapping table for each
encoding and a 2-D matrix for all Cyrillic languages. A "SequenceModel"
structure is also defined for each encoding. This structure is used to
initialize a single-byte charset prober class. All Cyrillic encodings share
the same prober class implementation.</font></i>
<p><i><font size=-1>nsSBCharSetProber.h</font></i>
<p><i><font size=-1>nsSBCharSetProber.cpp : these two files define and
implement the single-byte charset prober.</font></i>
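<p><font size=-1>The sequence counting described above can be sketched as
follows. The structure "ToySequenceModel", the 3x3 matrix size and the three
mapped bytes are hypothetical and far smaller than a real language model;
they only show how the mapping table and the 2-D matrix work together, and
that a sequence is counted only when both characters are 8-bit ones inside
the matrix range.</font>

```cpp
#include <array>
#include <string>

const int kSampleSize = 3;  // hypothetical matrix dimension

struct ToySequenceModel {
  std::array<unsigned char, 256> charToOrder;  // byte -> identification number
  int matrix[kSampleSize][kSampleSize];        // 0 = rarely used sequence
};

struct SequenceCounts {
  int total = 0;   // sequences that were counted at all
  int likely = 0;  // counted sequences whose matrix cell is non-zero
};

SequenceCounts CountSequences(const ToySequenceModel& aModel,
                              const std::string& aText) {
  SequenceCounts counts;
  int prevOrder = -1;  // -1 means "previous character is not usable"
  for (unsigned char byte : aText) {
    int order = -1;
    if (byte & 0x80) {  // only 8-bit characters take part
      int mapped = aModel.charToOrder[byte];
      if (mapped < kSampleSize) order = mapped;  // out of range: ignored
    }
    if (order >= 0 && prevOrder >= 0) {
      ++counts.total;
      if (aModel.matrix[prevOrder][order] != 0) ++counts.likely;
    }
    prevOrder = order;
  }
  return counts;
}

// Builds a toy model: bytes 0xE0, 0xE1, 0xE2 get numbers 0, 1, 2 and only
// the sequences <0, 1> and <1, 0> are marked as commonly used.
ToySequenceModel MakeToyModel() {
  ToySequenceModel model = {};
  model.charToOrder.fill(255);
  model.charToOrder[0xE0] = 0;
  model.charToOrder[0xE1] = 1;
  model.charToOrder[0xE2] = 2;
  model.matrix[0][1] = 1;
  model.matrix[1][0] = 1;
  return model;
}
```

<p><font size=-1>A prober could turn the two counters into a confidence
value, for example from the share of likely sequences; the exact formula is
left open here.</font>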
</body>
</html>