mirror of
https://github.com/mozilla/gecko-dev.git
synced 2024-11-07 12:15:51 +00:00
232 lines
14 KiB
HTML
232 lines
14 KiB
HTML
<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
|
||
<html>
|
||
<head>
|
||
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
|
||
<meta name="GENERATOR" content="Mozilla/4.76 [en] (WinNT; U) [Netscape]">
|
||
<title>Charset Detector Interface</title>
|
||
</head>
|
||
<body>
|
||
<font face="Arial">Charset Detector Interface</font>
|
||
<p><font size=-1>This is the charset detector’s interface that is exposed
|
||
to outside world, in our case, the browser. In the very beginning, caller
|
||
calls detector’s "Init()" method and let detector know how it would like
|
||
to be notified about the detecting result. Observer pattern is used in
|
||
this case. Then the caller just need to feed charset detector with text
|
||
data through "DoIt()". This can be done through a series "DoIt()" calls,
|
||
with each call only contains part of the data. This can be very useful
|
||
if the data is only partially available at one time. In our case, since
|
||
the data comes from network, we can start detecting long before network
|
||
finishes transferring all data. When detector is confident enough about
|
||
one encoding, it will notify its caller and stop detecting. If all data
|
||
has been feed to detector but detector still is not confident enough about
|
||
any encoding, method "Done" will tell detector to make a best guess.</font>
|
||
<p><font face="Courier New"><font size=-1>class nsICharsetDetector : public
|
||
nsISupports {</font></font>
|
||
<br><font face="Courier New"><font size=-1> public:</font></font>
|
||
<br><font face="Courier New"><font size=-1> NS_DEFINE_STATIC_IID_ACCESSOR(NS_ICHARSETDETECTOR_IID)</font></font>
|
||
<p><font face="Courier New"><font size=-1> //Setup the observer so
|
||
it know how to notify the answer</font></font>
|
||
<br><font face="Courier New"><font size=-1> NS_IMETHOD Init(nsICharsetDetectionObserver*
|
||
observer) = 0;</font></font>
|
||
<p><font face="Courier New"><font size=-1> //Feed a block of bytes
|
||
to the detector.</font></font>
|
||
<br><font face="Courier New"><font size=-1> //It will call the Notify
|
||
function of the nsICharsetObserver if it</font></font>
|
||
<br><font face="Courier New"><font size=-1> //find out the answer.</font></font>
|
||
<br><font face="Courier New"><font size=-1> // aBytesArray - array
|
||
of bytes</font></font>
|
||
<br><font face="Courier New"><font size=-1> // aLen - length of aBytesArray</font></font>
|
||
<br><font face="Courier New"><font size=-1> // oDontFeedMe - return
|
||
PR_TRUE if the detector do not need the</font></font>
|
||
<br> <font face="Courier New"><font size=-1>// following block</font></font>
|
||
<br> <font face="Courier New"><font size=-1>// PR_FALSE it need more
|
||
bytes.</font></font>
|
||
<br> <font face="Courier New"><font size=-1>// This is used to enhance
|
||
performance</font></font>
|
||
<br> <font face="Courier New"><font size=-1>NS_IMETHOD DoIt(const
|
||
char* aBytesArray, PRUint32 aLen, PRBool* oDontFeedMe) = 0;</font></font>
|
||
<p> /<font face="Courier New"><font size=-1>/It also tell the detector
|
||
the last chance the make a decision</font></font>
|
||
<br> <font face="Courier New"><font size=-1>NS_IMETHOD Done() = 0;</font></font>
|
||
<br><font size=-1>}<font face="Courier New">;</font></font>
|
||
<br>
|
||
<br>
|
||
<p><font face="Arial">Inside Charset Detector</font>
|
||
<p><font size=-1>Inside Charset Detector, major work is done by function
|
||
"HandleData()". In fact, "DoIt" has very little extra thing to do other
|
||
than call "HandleData". The following is the algorithm logic using C-Like
|
||
Pseudo-Language. Some detail is drop in order to make main point more clear.</font>
|
||
<p><font face="Courier New"><font size=-1>HandleData(batch_of_text)</font></font>
|
||
<br><font face="Courier New"><font size=-1>{</font></font>
|
||
<br><font face="Courier New"><font size=-1> if (batch_of_text contains
|
||
BOM)</font></font>
|
||
<br><font face="Courier New"><font size=-1> report UCS2;</font></font>
|
||
<br><font face="Courier New"><font size=-1> if ((inputState is PureAscii)
|
||
|| (inputState is EscAscii))</font></font>
|
||
<br><font face="Courier New"><font size=-1> if (batch_of_text
|
||
contains 8-bits-byte)</font></font>
|
||
<br><font face="Courier New"><font size=-1>
|
||
inputState = HighByte;</font></font>
|
||
<br><font face="Courier New"><font size=-1> else if ((inputState
|
||
is PureAscii ) && (batch_of_text contains Esc_Sequence) )</font></font>
|
||
<br><font face="Courier New"><font size=-1>
|
||
inputState = EscAscii;</font></font>
|
||
<p><font face="Courier New"><font size=-1> if (inputState is HighByte)</font></font>
|
||
<br><font face="Courier New"><font size=-1> {</font></font>
|
||
<br><font face="Courier New"><font size=-1> Remove Ascii
|
||
character that is not neighboring to 8-bits byte</font></font>
|
||
<br><font face="Courier New"><font size=-1> For each
|
||
prober in multibyte_probers</font></font>
|
||
<br><font face="Courier New"><font size=-1> Prober.HandleData(batch_of_text);</font></font>
|
||
<br><font face="Courier New"><font size=-1> For each
|
||
prober in singlebyte_probers</font></font>
|
||
<br><font face="Courier New"><font size=-1> Prober.HandleData(batch_of_text);</font></font>
|
||
<br><font face="Courier New"><font size=-1> }</font></font>
|
||
<br><font face="Courier New"><font size=-1> else if (inputState is
|
||
EscAscii)</font></font>
|
||
<br><font face="Courier New"><font size=-1> {</font></font>
|
||
<br><font face="Courier New"><font size=-1> For each
|
||
prober in (ISO2022_XX or HZ)</font></font>
|
||
<br><font face="Courier New"><font size=-1> Prober.HandleData(batch_of_text);</font></font>
|
||
<br><font face="Courier New"><font size=-1> }</font></font>
|
||
<br><font face="Courier New"><font size=-1>}</font></font>
|
||
<p><i><font face="Courier New"><font size=-1>nsUniversalDetector.h</font></font></i>
|
||
<br><i><font face="Courier New"><font size=-1>nsUniversalDetector.cpp</font></font></i>
|
||
<p><i><font face="Courier New"><font size=-1>Implemented the high level
|
||
control logic.</font></font></i>
|
||
<br>
|
||
<br>
|
||
<p>Charset Prober
|
||
<p><font size=-1>A charset prober verifies if the input data is belong
|
||
to certain encoding or group of encoding. It maintains its state in member
|
||
"mState", which has 3 possible value. State "eDetecting" means it hasn’t
|
||
found any sure answer yet, "eFoundIt" and "eNotMe" carries the same meaning
|
||
as their names. Method "GetCharSetName" tell its caller its sure answer
|
||
or best guess.</font>
|
||
<p><font size=-1>Generally, for each encoding we implemented a charset
|
||
prober. Several probers can be wrapped together with a wrapper prober.
|
||
It is also possible for a prober to "probe" several encodings. Each charset
|
||
prober is designed, implemented and working independently. This enables
|
||
prober caller to eliminate certain probers when it has any pre-knowledge.
|
||
For example, if user know that an html page is some kind of Japanese encoding,
|
||
non-Japanese charset probers will not be fired. If user have not interest
|
||
in certain languages, they can also eliminate those charset probers. Those
|
||
measures will lead to a small footprint and faster performance.</font>
|
||
<p><font face="Courier New"><font size=-1>typedef enum {</font></font>
|
||
<br><font face="Courier New"><font size=-1> eDetecting = 0,</font></font>
|
||
<br><font face="Courier New"><font size=-1> eFoundIt = 1,</font></font>
|
||
<br><font face="Courier New"><font size=-1> eNotMe = 2</font></font>
|
||
<br><font size=-1>}<font face="Courier New"> nsProbingState;</font></font>
|
||
<p><font face="Courier New"><font size=-1>class nsCharSetProber {</font></font>
|
||
<br><font face="Courier New"><font size=-1> public:</font></font>
|
||
<br><font face="Courier New"><font size=-1> nsCharSetProber(){};</font></font>
|
||
<br><font face="Courier New"><font size=-1> virtual const
|
||
char* GetCharSetName() {return "";};</font></font>
|
||
<br><font face="Courier New"><font size=-1> virtual nsProbingState
|
||
HandleData(const char* aBuf, PRUint32 aLen) = 0;</font></font>
|
||
<br><font face="Courier New"><font size=-1> nsProbingState
|
||
GetState(void) {return mState;};</font></font>
|
||
<br><font face="Courier New"><font size=-1> virtual void
|
||
Reset(void) {mState = eDetecting;};</font></font>
|
||
<br><font face="Courier New"><font size=-1> virtual float
|
||
GetConfidence(void) = 0;</font></font>
|
||
<br><font face="Courier New"><font size=-1> virtual void
|
||
SetOpion() {};</font></font>
|
||
<br><font face="Courier New"><font size=-1> protected:</font></font>
|
||
<br><font face="Courier New"><font size=-1> nsProbingState
|
||
mState;</font></font>
|
||
<br><font face="Courier New"><font size=-1>};</font></font>
|
||
<br>
|
||
<br>
|
||
<p><font face="Arial">How multi-byte encoding charset prober works</font>
|
||
<p><font size=-1>For charset prober verifying SJIS, EUC-JP, EUC-KR, EUC-CN
|
||
(or GB2312), EUC-TW, Big5 encodings, each prober embeds state machine (mCodingSM),
|
||
which identify legal byte sequence base on its encoding scheme. If an illegal
|
||
byte sequence is met, this state machine will reach "eError" state. That
|
||
signifies a failure for this prober, and prober will report negative answer
|
||
to its caller. Once state machine reach "eStart" state, it means sequence
|
||
of bytes has been identified as a character. This character will be sent
|
||
to Character distribution analyzer (mDistributionAnalyser) and 2-Char sequence
|
||
analyzer (mContextAnalyser) for statistic sampling. "GetConfidence" call
|
||
will let its caller know the likelihood of input charset being of this
|
||
encoding.</font>
|
||
<p><font size=-1>Inside "HandleData" method each time after a batch of
|
||
text has been processed, shortcut judgement is performed. If the prober
|
||
receives enough data and reaches certain confidence level, it will set
|
||
its state to be "eFoundIt" and notify its caller an immediate sure answer.</font>
|
||
<p><font size=-1>For encoding like ISO_2022 and HZ, since the embedded
|
||
state machine can do almost a perfect job along, no other statistic sampling
|
||
is done.</font>
|
||
<p><i><font size=-1>Big5Freq.tab</font></i>
|
||
<p><i><font size=-1>EUCKRFreq.tab</font></i>
|
||
<p><i><font size=-1>EUCTWFreq.tab</font></i>
|
||
<p><i><font size=-1>GB2312Freq.tab</font></i>
|
||
<p><i><font size=-1>JISFreq.tab</font></i>
|
||
<p><i><font size=-1>Those files defined the frequency table (Character
|
||
to frequency order mapping) for each language. Since Big5 and EUC-TW are
|
||
not basing on the same charset standard like EUC-JP and SJIS do, 2 tables
|
||
is defined.</font></i>
|
||
<p><i><font size=-1>CharDistribution.h</font></i>
|
||
<p><i><font size=-1>CharDistribution.cpp</font></i>
|
||
<p><i><font size=-1>Implementation for Character distribution analyzer.</font></i>
|
||
<p><i><font size=-1>nsPkgInt.h</font></i>
|
||
<p><i><font size=-1>nsCodingStateMachine.h</font></i>
|
||
<p><i><font size=-1>Those are bases of state machine implementation.</font></i>
|
||
<p><i><font size=-1>nsEscSM.cpp</font></i>
|
||
<p><i><font size=-1>State machine for ISO-2022XX and HZ.</font></i>
|
||
<p><i><font size=-1>nsMBCSSM.cpp</font></i>
|
||
<p><i><font size=-1>State machines for Big5, EUC-JP, EUC-KR, EUC-TW, GB2312,
|
||
SJIS, and UTF8.</font></i>
|
||
<p><i><font size=-1>JpCntx.h</font></i>
|
||
<p><i><font size=-1>JpCntx.cpp</font></i>
|
||
<p><i><font size=-1>Japanese hiragana sequence analyzer.</font></i>
|
||
<p><i><font size=-1>nsBig5Prober.h</font></i>
|
||
<p><i><font size=-1>nsBig5Prober.cpp</font></i>
|
||
<p><i><font size=-1>nsEUCKRProber.h</font></i>
|
||
<p><i><font size=-1>nsEUCKRProber.cpp</font></i>
|
||
<p><i><font size=-1>nsEUCJPProber.h</font></i>
|
||
<p><i><font size=-1>nsEUCJPProber.cpp</font></i>
|
||
<p><i><font size=-1>nsEUCTWProber.h</font></i>
|
||
<p><i><font size=-1>nsEUCTWProber.cpp</font></i>
|
||
<p><i><font size=-1>nsSJISProber.h</font></i>
|
||
<p><i><font size=-1>nsSJISProber.cpp</font></i>
|
||
<p><i><font size=-1>nsGB2312Prober.h</font></i>
|
||
<p><i><font size=-1>nsGB2312Prober.cpp</font></i>
|
||
<p><i><font size=-1>nsUTF8Prober.h</font></i>
|
||
<p><i><font size=-1>nsUTF8Prober.cpp</font></i>
|
||
<p><i><font size=-1>Charset Prober classes definition and implementation
|
||
for each encoding. Each prober has an embedded state machine and a character
|
||
distribution analyzer except UTF8, which state machine is good enough.</font></i>
|
||
<p><i><font size=-1>nsMBCSProber.h</font></i>
|
||
<p><i><font size=-1>nsMBCSProber.cpp</font></i>
|
||
<p><i><font size=-1>This is a wrapper of all the MBCS probers. I was expecting
|
||
to put some high level logic which base on multiple encoding knowledge
|
||
to appears here in the very beginning. That might still be needed in future.</font></i>
|
||
<br>
|
||
<br>
|
||
<p><font face="Arial">How single-byte encoding charset prober works</font>
|
||
<p><font size=-1>For each encoding, a table is used to map a character
|
||
to an encoding independent identification number. Those identification
|
||
numbers in fact come from characters’ frequency order but with some adjustment.
|
||
For each language, a 2-D matrix is defined as language model. If cell <x,
|
||
y> is 0, it means sequence <character(x), character(y)> is a rarely
|
||
used sequence in this language, with character(x) representing the character
|
||
whose identification number is x. The 2-D matrix only defines sequence
|
||
of a subset of all the characters. For characters whose identification
|
||
number is out of this range, those characters are ignored. Since some of
|
||
the sequences, like ascii-to-ascii sequences, have no relation with the
|
||
language we try to verify, and those sequences should not be counted. In
|
||
current implementation, a sequence will be counted if both characters are
|
||
8-bits ones. In some situations, one 8-bits character sequence is expected
|
||
to be counted.</font>
|
||
<p><i><font size=-1>LangCyrillicModel.cpp : these files defined a mapping
|
||
table for each encoding and a 2-D matrix for all Cyrillic languages. A
|
||
"SequenceModel" structure is also defined for each encoding. This structure
|
||
will be used to initialize a single-byte character prober class. All Cyrillic
|
||
encodings are sharing the same prober class implementation.</font></i>
|
||
<p><i><font size=-1>nsSBCharSetProber.h</font></i>
|
||
<p><i><font size=-1>nsSBCharSetProber.cpp : These 2 files defined and implemented
|
||
single-byte charset prober.</font></i>
|
||
</body>
|
||
</html>
|