Lewis Perin
 
Japanese Chinese tea web sites

"Space Cowboy" writes:

> Lewis Perin wrote:
> > [...Unicode is Unicode...]

>
> Agreed UTF-8 is Unicode. Anytime you use the codepoint 8336 it means
> tea only if you find websites with charset=UTF-8.


I just Googled for the Chinese character for tea, limiting it to *.jp
sites. The first one that came up (http://www.chanoaji.jp/) has
nothing about UTF-8 in its source, but does have

<META HTTP-EQUIV="Content-Type" CONTENT="text/html;charset=Shift_JIS">

> As I said before the tea codepoint for GB2312 is 1872, BIG5 AFF9,
> JIS 3567. So if the webpage said charset=JIS you would use 3567 to
> find the glyph meaning tea which is the reason you would only see
> Japanese websites. You won't see the Japanese websites using
> charset=UTF-8 or if you did there is a Unicode glyph for 3567 but
> not for tea.


Then why is it that, when I copy the tea character from that Japanese
page into Google and search again, the search term is identical? (By
the way, the search term is URL-encoded UTF-8: %E8%8C%B6.)
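
A minimal sketch of that URL encoding, assuming Python 3 and its standard urllib.parse module (neither is mentioned in the thread). It shows that the tea character, Unicode codepoint U+8336, becomes the three UTF-8 bytes E8 8C B6, which is what appears percent-encoded in the search URL:

```python
from urllib.parse import quote, unquote

# The tea character, written by its Unicode codepoint U+8336.
tea = "\u8336"

# quote() percent-encodes the UTF-8 bytes of the string by default.
encoded = quote(tea)

# unquote() reverses the process, decoding the bytes as UTF-8.
decoded = unquote("%E8%8C%B6")

print(encoded)
print(hex(ord(decoded)))
```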

> If I come across a webpage that says charset=UTF-8 and want to see
> the glyphs in my browser I load the MS Unicode CJK codeset. GB2312,
> BIG5, JIS have their codepoints and glyphs. Any specific codepoint
> only has meaning if you know what charset it uses to look up the
> glyph. As an aside I've been checking the Chinese webpages
> mentioned in this thread by you and the html says charset=GB2312.
> You indicate you derived a Unicode codepoint which I assume came
> from the webpage contents. I don't see how. That is only valid if
> charset=UTF-8.


I don't have access to Google's source code, but it seems clear to me
that they're not confused by Big5 vs. GB vs. JIS. They're
probably converting everything to Unicode codepoints before indexing.
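
What such a conversion could look like, as a rough sketch only: this is not Google's code, and the helper name to_codepoints is hypothetical. It assumes Python 3 and its built-in codecs, takes the raw bytes of a page plus the charset the page declares, and reduces the text to Unicode codepoints:

```python
def to_codepoints(raw, declared_charset):
    """Decode raw page bytes with the declared charset; return Unicode codepoints."""
    return [ord(ch) for ch in raw.decode(declared_charset)]

# Simulate the same tea character as it would be stored on pages
# declaring four different charsets.
tea = "\u8336"
charsets = ("shift_jis", "gb2312", "big5", "utf-8")
pages = {cs: tea.encode(cs) for cs in charsets}

# The stored bytes differ per charset, but decoding normalizes them.
normalized = {cs: to_codepoints(raw, cs) for cs, raw in pages.items()}

for cs in charsets:
    print(cs, pages[cs].hex(), [hex(cp) for cp in normalized[cs]])
```

Once every page is reduced to codepoints this way, a single index entry for U+8336 would match pages in any of the four charsets.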
>
> > > I'm not sure of the particulars but you can also mix language sets
> > > on a webpage. I use Unicode strings for Google searches. I could
> > > get additional hits if I used JIS or GB strings but I only track
> > > Unicode. On TaoBao I have to use GB strings. Ebay China uses
> > > Unicode.

> >
> > JIS, GB, and Big5 are all parts of Unicode.

>
> In what sense? They use different codepoints for language glyphs.


They number the characters differently, but nearly every character
they share maps to a single Unicode codepoint. The fact that e.g. GB
enumerates the Cha character differently than JIS doesn't affect the
fact that they both map to the same Unicode codepoint.
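
A small check of this, assuming Python 3 and its built-in codecs. It also assumes that the numbers quoted earlier in the thread for GB2312 (1872) and JIS (3567) are decimal row-cell pairs, and that AFF9 is the raw Big5 byte pair in hex; the thread does not say so explicitly. The helper name row_cell_to_euc is hypothetical:

```python
def row_cell_to_euc(code):
    """Turn a 4-digit decimal row-cell code into its two EUC bytes.

    EUC-CN (GB2312) and EUC-JP (JIS X 0208) both add 0xA0 to the
    row number and to the cell number.
    """
    row, cell = int(code[:2]), int(code[2:])
    return bytes([0xA0 + row, 0xA0 + cell])

# Assumption: 1872 and 3567 are row-cell pairs, AFF9 is Big5 hex.
gb = row_cell_to_euc("1872").decode("gb2312")
jis = row_cell_to_euc("3567").decode("euc_jp")
big5 = bytes.fromhex("AFF9").decode("big5")

for name, ch in (("GB2312", gb), ("JIS", jis), ("Big5", big5)):
    print(name, "U+%04X" % ord(ch))
```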

/Lew
---
Lew Perin /
http://www.panix.com/~perin/babelcarp.html