Space Cowboy
Japanese Chinese tea web sites

At this point I think we are talking past each other. For example, I
want to take the GB2312 codepoint 1872 and translate it into the Unicode
codepoint U+8336. Agreed, the different codepoints for tea in the CJK
language packs all map to U+8336, which is also its UTF-16 code unit,
consistent with how the other non-Roman language packs map to Unicode. I
know Google will take my Unicode strings and return matches it finds in
websites coded in charsets other than UTF-8. At that point I can't cut
and paste any characters from those webpages into Babelfish,
MandarinTools, or Zhongwen because they're not Unicode. From what I
understand, NJStar Communicator, for example, will flip between
charset=GB2312 and charset=UTF-8. If I find anything pertinent on
converting language pack codepoints to Unicode codepoints I'll report
back. I can already go from Unicode codepoints to language pack
codepoints.
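
For what it's worth, here is a rough sketch of the conversion in both directions using Python's built-in codecs. It assumes that 1872 and 3567 are row/cell numbers (row 18, cell 72 and so on) and that the page bytes are the EUC form of the charset, where each of the two bytes is the row or cell plus 0xA0. The function names are made up for illustration.

```python
def kuten_to_unicode(kuten: int, codec: str) -> str:
    """Turn a 4-digit row/cell number such as 1872 into a Unicode character."""
    row, cell = divmod(kuten, 100)
    # Assumption: EUC-style encoding, each byte is row or cell plus 0xA0.
    raw = bytes([0xA0 + row, 0xA0 + cell])
    return raw.decode(codec)


def unicode_to_kuten(ch: str, codec: str) -> int:
    """Go the other way: Unicode character back to the 4-digit row/cell number."""
    hi, lo = ch.encode(codec)
    return (hi - 0xA0) * 100 + (lo - 0xA0)


tea = kuten_to_unicode(1872, "gb2312")
print(tea, f"U+{ord(tea):04X}")
print(unicode_to_kuten(tea, "gb2312"))
```

The same helper should work for the JIS number if the codec is "euc_jp". Big5 is not laid out in rows and cells this way, so AFF9 would be decoded directly as the two bytes 0xAF 0xF9.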

Jim

Lewis Perin wrote:
> > As I said before the tea codepoint for GB2312 is 1872, BIG5 AFF9,
> > JIS 3567. So if the webpage said charset=JIS you would use 3567 to
> > find the glyph meaning tea which is the reason you would only see
> > Japanese websites. You won't see the Japanese websites using
> > charset=UTF-8 or if you did there is a Unicode glyph for 3567 but
> > not for tea.

>
> Then why is it that, when I copy the tea character from that Japanese
> page into Google and search again, the search term is identical? (By
> the way, the search term is URL-encoded UTF-8: %E8%8C%B6)
>
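
The URL-encoded search term mentioned above can be checked with a few lines of Python. This is only a sketch using the standard library's urllib.parse, which percent-encodes as UTF-8 by default. It says nothing about what Google does internally.

```python
from urllib.parse import quote, unquote

tea = "\u8336"  # the tea character, U+8336

# quote() encodes the string as UTF-8 and percent-escapes each byte.
encoded = quote(tea)
print(encoded)

# unquote() reverses the process and returns the original character.
decoded = unquote(encoded)
print(decoded, f"U+{ord(decoded):04X}")
```

The three escaped bytes are simply the UTF-8 form of U+8336, so the search term is the same no matter which charset the page it was copied from declares.
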
> > If I come across a webpage that says charset=UTF-8 and want to see
> > the glyphs in my browser I load the MS Unicode CJK codeset. GB2312,
> > BIG5, JIS have their codepoints and glyphs. Any specific codepoint
> > only has meaning if you know what charset it uses to look up the
> > glyph. As an aside I've been checking the Chinese webpages
> > mentioned in this thread by you and the html says charset=GB2312.
> > You indicate you derived a Unicode codepoint which I assume came
> > from the webpage contents. I don't see how. That is only valid if
> > charset=UTF-8.

>
> I don't have access to Google's source code, but it seems clear to me
> that they're not confused by the Big5 vs. GB vs. JIS. They're
> probably converting everything to Unicode codepoints before indexing.
> >
> > > > I'm not sure of the particulars but you can also mix language sets
> > > > on a webpage. I use Unicode strings for Google searches. I could
> > > > get additional hits if I used JIS or GB strings but I only track
> > > > Unicode. On TaoBao I have to use GB strings. Ebay China uses
> > > > Unicode.
> > >
> > > JIS, GB, and Big5 are all parts of Unicode.

> >
> > In what sense? They use different codepoints for language glyphs.

>
> They use different codepoints for *some* glyphs - but mostly they use
> the same codepoints for glyphs they share. The fact that e.g. GB
> enumerates the Cha character differently than JIS doesn't affect the
> fact that they both use the same Unicode codepoint.
>
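
The point about shared codepoints can be illustrated with a short Python sketch. It assumes the byte values below are the on-the-wire forms of the tea character: B2E8 and C3E3 are the EUC forms of GB 1872 and JIS 3567, and AFF9 is the Big5 value quoted earlier in the thread.

```python
# Tea character as raw bytes in three legacy charsets (assumed EUC forms
# for GB2312 and JIS, plain two-byte value for Big5).
samples = {
    "gb2312": bytes.fromhex("B2E8"),
    "big5": bytes.fromhex("AFF9"),
    "euc_jp": bytes.fromhex("C3E3"),
}

# Decode each byte pair with the codec named by its charset.
decoded = {charset: raw.decode(charset) for charset, raw in samples.items()}

for charset, ch in decoded.items():
    print(f"{charset:8} {samples[charset].hex().upper()} -> U+{ord(ch):04X}")

# All three charsets should land on one and the same Unicode character.
print(len(set(decoded.values())) == 1)
```

Decoding by declared charset first and comparing Unicode codepoints afterwards is one plausible way a search engine could match the same character across pages in different charsets.
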
> /Lew
> ---
> Lew Perin /
>
> http://www.panix.com/~perin/babelcarp.html