Space Cowboy
 
Japanese Chinese tea web sites

Lewis Perin wrote:
> Warning: nerdy details abound here!
>
> "Space Cowboy" writes:
> >
> > Lewis Perin wrote:
> >> [...why are there Chinese tea names that appear only in Japanese sites...]

> >
> > The charset=shift_jis of the webpage indicates Japanese. All 2
> > character pairs are used for Japanese font sets. The characters you
> > see are from the Japanese fonts and not Chinese. That character may
> > very well exist in the Chinese font set and vice versa but the charset
> > setting on the HTML page tells where to look. Basically non Roman
> > languages take two characters for representation and a corresponding
> > font set. For example the Cha character in Japanese JIS is 3567 and
> > simplified Chinese GB 1872.

>
> Yes, but it's still the same Unicode code point (33590, or 8336 in
> hex), which is why you get both .cn and .jp web sites if you Google
> for it.


Only if the Chinese or Japanese website uses Unicode codepoints such
as 8336. There are plenty of Chinese and Japanese sites that use
charset=UTF-8. I'm not sure of the particulars, but you can also mix
language sets on a webpage. I use Unicode strings for Google searches.
I could get additional hits if I used JIS or GB strings, but I only
track Unicode. On TaoBao I have to use GB strings. eBay China uses
Unicode. Babelfish doesn't accept Unicode strings.
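
For anyone who wants to check the numbers in this thread, here is a minimal Python sketch. It assumes Python's built-in shift_jis, euc_jp, gb2312 and utf-8 codecs; the helper names are made up for illustration. It shows the Cha character as one Unicode codepoint but different byte pairs in each charset, and recovers the 3567 and 1872 row/cell numbers quoted above.

```python
# Sketch: one character, one Unicode codepoint, different bytes per charset.
CHA = "\u8336"  # the Cha (tea) character discussed in this thread

def encoded_hex(ch, charset):
    """Return the bytes of ch in the given charset as uppercase hex."""
    return ch.encode(charset).hex().upper()

def row_cell(ch, euc_charset):
    """Derive the 4-digit row/cell number from an EUC-style encoding,
    where each of the two bytes is the row or cell number plus 0xA0."""
    hi, lo = ch.encode(euc_charset)
    return f"{hi - 0xA0:02d}{lo - 0xA0:02d}"

if __name__ == "__main__":
    print("Unicode     :", f"U+{ord(CHA):04X}", "=", ord(CHA))
    for charset in ("shift_jis", "euc_jp", "gb2312", "utf-8"):
        print(f"{charset:12}:", encoded_hex(CHA, charset))
    print("JIS row/cell:", row_cell(CHA, "euc_jp"))
    print("GB row/cell :", row_cell(CHA, "gb2312"))
```

A byte-for-byte search for the Shift_JIS pair would not match a GB or UTF-8 page, which is why the search string has to be in the same encoding as whatever is being matched.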

> > The Glyph representation from both will look the same and the same
> > argument for "zhou da tie cha" in Japanese JIS and Chinese GB where
> > the Glyphs look the same but not the pairs.

>
> But Google, smart though it is, can't see the glyph; it can only see
> the codepoint in whatever encoding is there. I've run these through
> the Unihan database, and they're the Chinese codepoints that
> correspond to the Pinyin on the same line of the page.


The codepoints are Japanese, not Unicode, and Unihan only accepts
Unicode codepoints. You didn't run any Japanese codepoints from "zhou
da tie cha" and get a valid hit on Unihan. At a minimum you would need
to convert the Japanese JIS codepoints to Unicode codepoints. If anyone
knows of a routine or website to do this, let me know. You also don't
plug strings into Unihan, just the four hex digits (0-9, A-F) of a
codepoint. Each digit is 4 bits, so the four together give the 16 bits
that a character pair takes up.
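
A routine for the JIS-to-Unicode step could be sketched like this in Python. It assumes the page bytes really are in the charset the page declares, and that Python's built-in codecs are acceptable for the conversion. The function names are hypothetical, not from any existing tool.

```python
# Sketch: convert legacy-charset bytes to Unicode codepoints, or to
# another legacy charset, by decoding to Unicode in the middle.

def to_codepoints(raw, charset):
    """Decode raw bytes from charset and list the Unicode codepoints
    as U+XXXX strings (hex digits, at least four per codepoint)."""
    text = raw.decode(charset)
    return [f"U+{ord(ch):04X}" for ch in text]

def transcode(raw, src_charset, dst_charset):
    """Re-encode bytes from one charset into another via Unicode."""
    return raw.decode(src_charset).encode(dst_charset)

if __name__ == "__main__":
    sjis_cha = bytes.fromhex("9283")  # Shift_JIS byte pair for Cha
    print(to_codepoints(sjis_cha, "shift_jis"))
    print(transcode(sjis_cha, "shift_jis", "gb2312").hex().upper())
```

The hex digits after the "U+" are what would go into a Unihan lookup. The same two functions work in the other direction for sites that only accept GB strings.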

> > Google will find computer strings anywhere which in your case just
> > happens to be on web pages with charset indicating JIS. It looks
> > like to me you did a post with Linux which comes with default
> > international language support.

>
> BSD, actually, but I didn't post anything that wasn't ASCII.


I don't have JIS or GB or BIG5 loaded on this computer. The webpage
you mentioned looks like gibberish. I also don't have Unicode loaded
on this computer. Fortunately I can tell Unicode characters because
MSIE indicates an "empty square". If I want to see the Glyph I insert
the Unicode string into a routine which gives the character pair
codepoints which I then use in Unihan. This is the main reason I use
Unicode. I previously posted a Zhongwen backdoor procedure using
Unicode codepoints. I don't know of any Japanese or Chinese sites that
let me do the same thing with their corresponding character pairs to
see a Glyph representation.
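
The kind of routine described above, one that takes a Unicode string and gives back the codepoints to look up, might look roughly like this in Python. This is only a sketch with an illustrative function name, not the routine actually used; it relies on the standard unicodedata module for the character name.

```python
import unicodedata

def describe(text):
    """For each character, return (hex codepoint, decimal codepoint,
    Unicode character name), ready for a codepoint lookup."""
    rows = []
    for ch in text:
        cp = ord(ch)
        rows.append((f"{cp:04X}", cp, unicodedata.name(ch, "<unnamed>")))
    return rows

if __name__ == "__main__":
    for hex_cp, dec_cp, name in describe("\u8336"):
        print(hex_cp, dec_cp, name)
```

This works even when no CJK font is installed, since it never has to draw the glyph, only report its number.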

> > In Windows you optionally load the Unicode font set called
> > CJK for Chinese, Japanese, Korean which is the international
> > standard to replace national language sets like JIS and GB.

>
> Right, I use that a lot.
>
> Thanks, Jim, for trying, but I don't see how this explains the phenomenon.


It's simple. The codepoints differ from one charset to another. I
think you understand the character pairs that make up each non-Roman
language, or the Unicode standard for all languages. Maybe there are
some overlapping codepoints between JIS, GB and BIG5 meaning the same
Glyph character, but I haven't found that true for Unicode, at least
for tea terms. When you use cut and paste in Windows you keep intact
the ASCII character pairs for whatever language.

> /Lew
> ---
> Lew Perin / http://www.panix.com/~perin/babelcarp.html


Jim