Space Cowboy
 
Japanese Chinese tea web sites

Lewis Perin wrote:
> "Space Cowboy" writes:
>
> > Lewis Perin wrote:
> > > Warning: nerdy details abound here!
> > >
> > > "Space Cowboy" writes:
> > > >
> > > > Lewis Perin wrote:
> > > >> [...why are there Chinese tea names that appear only in Japanese sites...]
> > > >
> > > > The charset=shift_jis of the webpage indicates Japanese. All 2
> > > > character pairs are used for Japanese font sets. The characters you
> > > > see are from the Japanese fonts and not Chinese. That character may
> > > > very well exist in the Chinese font set and vice versa but the charset
> > > > setting on the HTML page tells where to look. Basically non Roman
> > > > languages take two characters for representation and a corresponding
> > > > font set. For example the Cha character in Japanese JIS is 3567 and
> > > > simplified Chinese GB 1872.
> > >
> > > Yes, but it's still the same Unicode code point (33590, or 8336 in
> > > hex), which is why you get both .cn and .jp web sites if you Google
> > > for it.

> >
> > Only if the Chinese or Japanese websites use Unicode codepoints such
> > as 8336. There are plenty of Chinese and Japanese sites that use
> > charset=UTF-8.

>
> But UTF-8 *is* Unicode. More pedantically, it's an encoding of
> Unicode. The codepoints exist at the abstract level of Unicode; the
> encodings, like UTF-8, mediate between that level and what you see in
> your browser. See
>
> http://www.unicode.org/standard/principles.html
>
> for an explanation.


Agreed, UTF-8 is Unicode. But the codepoint 8336 means tea only on
websites with charset=UTF-8. As I said before, the tea codepoint is
1872 in GB2312, AFF9 in BIG5, and 3567 in JIS. So if the webpage says
charset=JIS, you would use 3567 to find the glyph meaning tea, which is
the reason you would only see Japanese websites. You won't see the
Japanese websites using charset=UTF-8, or if you did, 3567 there is
some other Unicode glyph, not tea. If I come across a webpage that says
charset=UTF-8 and want to see the glyphs in my browser, I load the MS
Unicode CJK codeset. GB2312, BIG5, and JIS each have their own
codepoints and glyphs. Any specific codepoint only has meaning if you
know which charset to use to look up the glyph. As an aside, I've been
checking the Chinese webpages you mentioned in this thread, and the
HTML says charset=GB2312. You indicate you derived a Unicode codepoint,
which I assume came from the webpage contents. I don't see how. That is
only valid if charset=UTF-8.
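
For anyone who wants to check the numbers above, here is a minimal
Python sketch. It assumes Python's built-in codecs (euc_jp, gb2312,
big5, utf-8) are a fair stand-in for the charsets being discussed, and
it treats 3567 and 1872 as four-digit row-cell numbers. The helper
names row_cell and hex_bytes are made up for illustration.

```python
TEA = "\u8336"  # the tea character, Unicode code point 8336 hex (33590 decimal)

def row_cell(ch, codec):
    """Return the 4-digit row-cell number of ch in a 94x94 charset.

    Assumes codec is an EUC-style two-byte encoding (euc_jp, gb2312),
    where each byte is the row or cell number plus 0xA0.
    """
    hi, lo = ch.encode(codec)  # raises ValueError unless exactly two bytes
    return f"{hi - 0xA0:02d}{lo - 0xA0:02d}"

def hex_bytes(ch, codec):
    """Return the encoded bytes of ch as an uppercase hex string."""
    return ch.encode(codec).hex().upper()

if __name__ == "__main__":
    print("Unicode", f"{ord(TEA):04X}")
    print("JIS    ", row_cell(TEA, "euc_jp"))
    print("GB2312 ", row_cell(TEA, "gb2312"))
    print("BIG5   ", hex_bytes(TEA, "big5"))
    print("UTF-8  ", hex_bytes(TEA, "utf-8"))
```

The row-cell form is only one way of writing these codepoints; the
bytes actually sent by a web server depend on the encoding in use.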

> > I'm not sure of the particulars but you can also mix language sets
> > on a webpage. I use Unicode strings for Google searches. I could
> > get additional hits if I used JIS or GB strings but I only track
> > Unicode. On TaoBao I have to use GB strings. Ebay China uses
> > Unicode.

>
> JIS, GB, and Big5 are all parts of Unicode.


In what sense? They use different codepoints for language glyphs. You
couldn't tell which codepoint produced the tea glyph, if it exists in
any of the language packs. Every scriptable language on Earth is part
of Unicode, or that is the intent. There are language sets that exist
only in Unicode because the computer linguists know of some isolated
language group that hasn't seen a computer, but its members could
communicate with each other in Unicode when the Internet arrives.
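
On the point about different codepoints, a rough Python sketch shows
why the charset label matters. It takes the two bytes that stand for
tea on a charset=GB2312 page and decodes them under three charsets.
The function name decode_as is hypothetical, and Python's built-in
codecs are assumed to match what a browser would do.

```python
TEA_GB2312 = b"\xb2\xe8"  # the tea character as sent by a charset=GB2312 page

def decode_as(raw, charsets):
    """Decode the same bytes under each charset; return {charset: text}."""
    return {cs: raw.decode(cs, errors="replace") for cs in charsets}

if __name__ == "__main__":
    results = decode_as(TEA_GB2312, ["gb2312", "big5", "euc_jp"])
    for cs, text in results.items():
        codepoints = [format(ord(c), "04X") for c in text]
        print(cs, repr(text), codepoints)
```

Only the gb2312 decoding comes out as tea (8336); the other two
charsets turn the same bytes into something else.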

> > Babelfish doesn't accept Unicode strings.

>
> Do you mean Babelfish or Babelcarp? If it's the latter, and you want
> to try the alpha version that searches on Chinese characters, email me.
>
> /Lew
> ---
> Lew Perin /
> http://www.panix.com/~perin/babelcarp.html


It's AltaVista Babelfish. I would expect, at a minimum, to be able to
use Unicode strings to search your site. I'm not talking about the
derived hex codepoints. As I said before, there is a mapping from the
normally used codepoints in the CJK language packs to Unicode. If you
could find the routine, if it exists, then internally you would store
Unicode while accepting external characters in any CJK language pack or
in the default Unicode. It would be just as easy to display results
back in the language packs' codepoints.
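
A routine along those lines might look like the Python sketch below.
It is only an illustration, assuming Python's built-in codecs for
gb2312, big5, and shift_jis; the names to_unicode, from_unicode, and
incoming are invented here and have nothing to do with how any real
site is built.

```python
def to_unicode(raw, charset):
    """Decode bytes in the sender's declared charset into a Unicode string."""
    return raw.decode(charset)

def from_unicode(text, charset):
    """Encode a stored Unicode string back into the requested charset."""
    return text.encode(charset)

# Hypothetical search input: the tea character arriving in three charsets.
incoming = [
    (b"\xb2\xe8", "gb2312"),
    (b"\xaf\xf9", "big5"),
    (b"\x92\x83", "shift_jis"),
]

# Internally everything is kept as Unicode, so the three inputs collapse
# into a single stored key.
stored = {to_unicode(raw, cs) for raw, cs in incoming}

if __name__ == "__main__":
    for key in stored:
        print(format(ord(key), "04X"))
        # Displaying back in each charset is the reverse step.
        for _, cs in incoming:
            print(cs, from_unicode(key, cs).hex().upper())
```

Storing one Unicode key means a search typed in any of the three
charsets would find the same entry.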

Jim

PS: One doesn't care about different codepoints in language packs if
one sees the expected glyph. They do matter, though, because some
Japanese website might be talking about Chinese teas using charset=JIS
codepoints.