Tea (rec.drink.tea) Discussion relating to tea, the world's second most consumed beverage (after water), made by infusing or boiling the leaves of the tea plant (C. sinensis or close relatives) in water. |
Japanese Chinese tea web sites
In researching information for Babelcarp's database, I often run Web
searches using Chinese characters. Typically you find vastly more
hits (mainly mainland Chinese sites) this way than if you use the
Pinyin name for a tea.

I've noticed often that a lot of hits will come from Japanese web
sites. This isn't too surprising when you think about it: Japanese is
written using (among other things) Chinese characters; why shouldn't
Japanese people be interested in Chinese tea; and for those Japanese
people who are interested in Chinese tea, why shouldn't they use
Chinese characters to refer to them?[1]

One thing, though, puzzles me about these Japanese sites for Chinese
teas: some of the teas they list can only be found on Japanese sites.
If a tea really is Chinese, why wouldn't it be retrievable on some
Chinese site? Here's an example. (This won't work, of course, if
your Web browser has no access to Chinese characters.) On the site

http://chinese-tea.info/03g/shurui.html

scroll down to the Jiangxi teas, where you'll find a tea whose Pinyin
name (in the right-hand column) is zhou da tie cha. Search for it
using the Chinese characters in the left-hand column. The results
will be exclusively Japanese sites.

Anyone know what's going on here? Kuri?

/Lew
---
Lew Perin / http://www.panix.com/~perin/babelcarp.html
[1]Actually, I just thought of a reason why Japanese people wouldn't
want to use Chinese characters: because, when using them in a Japanese
context, the phonemes they correspond to wouldn't be the same as in
Chinese.
|
|||
|
|||
Japanese Chinese tea web sites
The charset=shift_jis of the webpage indicates Japanese. All 2
character pairs are used for Japanese font sets. The characters you
see are from the Japanese fonts and not Chinese. That character may
very well exist in the Chinese font set and vice versa but the charset
setting on the HTML page tells where to look. Basically non Roman
languages take two characters for representation and a corresponding
font set. For example the Cha character in Japanese JIS is 3567 and
simplified Chinese GB 1872.

The Glyph representation from both will look the same and the same
argument for "zhou da tie cha" in Japanese JIS and Chinese GB where
the Glyphs look the same but not the pairs.

Google will find computer strings anywhere which in your case just
happens to be on web pages with charset indicating JIS. It looks
like to me you did a post with Linux which comes with default
international language support.

In Windows you optionally load the Unicode font set called
CJK for Chinese, Japanese, Korean which is the international
standard to replace national language sets like JIS and GB.

Jim

Lewis Perin wrote:
> In researching information for Babelcarp's database, I often run Web
> searches using Chinese characters. Typically you find vastly more
> hits (mainly mainland Chinese sites) this way than if you use the
> Pinyin name for a tea.
>
> I've noticed often that a lot of hits will come from Japanese web
> sites. This isn't too surprising when you think about it: Japanese is
> written using (among other things) Chinese characters; why shouldn't
> Japanese people be interested in Chinese tea; and for those Japanese
> people who are interested in Chinese tea, why shouldn't they use
> Chinese characters to refer to them?[1]
>
> One thing, though, puzzles me about these Japanese sites for Chinese
> teas: some of the teas they list can only be found on Japanese sites.
> If a tea really is Chinese, why wouldn't it be retrievable on some
> Chinese site? Here's an example. (This won't work, of course, if
> your Web browser has no access to Chinese characters.) On the site
>
> http://chinese-tea.info/03g/shurui.html
>
> scroll down to the Jiangxi teas, where you'll find a tea whose Pinyin
> name (in the right-hand column) is zhou da tie cha. Search for it
> using the Chinese characters in the left-hand column. The results
> will be exclusively Japanese sites.
>
> Anyone know what's going on here? Kuri?
>
> /Lew
> ---
> Lew Perin / http://www.panix.com/~perin/babelcarp.html
> [1]Actually, I just thought of a reason why Japanese people wouldn't
> want to use Chinese characters: because, when using them in a Japanese
> context, the phonemes they correspond to wouldn't be the same as in
> Chinese.
|
|||
|
|||
Japanese Chinese tea web sites
Warning: nerdy details abound here!

"Space Cowboy" writes:
>
> Lewis Perin wrote:
>> [...why are there Chinese tea names that appear only in Japanese sites...]
>
> The charset=shift_jis of the webpage indicates Japanese. All 2
> character pairs are used for Japanese font sets. The characters you
> see are from the Japanese fonts and not Chinese. That character may
> very well exist in the Chinese font set and vice versa but the charset
> setting on the HTML page tells where to look. Basically non Roman
> languages take two characters for representation and a corresponding
> font set. For example the Cha character in Japanese JIS is 3567 and
> simplified Chinese GB 1872.

Yes, but it's still the same Unicode code point (33590, or 8336 in
hex), which is why you get both .cn and .jp web sites if you Google
for it.

> The Glyph representation from both will look the same and the same
> argument for "zhou da tie cha" in Japanese JIS and Chinese GB where
> the Glyphs look the same but not the pairs.

But Google, smart though it is, can't see the glyph; it can only see
the codepoint in whatever encoding is there. I've run these through
the Unihan database, and they're the Chinese codepoints that
correspond to the Pinyin on the same line of the page.

> Google will find computer strings anywhere which in your case just
> happens to be on web pages with charset indicating JIS. It looks
> like to me you did a post with Linux which comes with default
> international language support.

BSD, actually, but I didn't post anything that wasn't ASCII.

> In Windows you optionally load the Unicode font set called
> CJK for Chinese, Japanese, Korean which is the international
> standard to replace national language sets like JIS and GB.

Right, I use that a lot.

Thanks, Jim, for trying, but I don't see how this explains the phenomenon.

/Lew
---
Lew Perin / http://www.panix.com/~perin/babelcarp.html
|
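The code point versus encoding distinction in the exchange above can be checked with a minimal Python sketch. It is not from the thread; it assumes Python 3.9+ and its standard `shift_jis`, `euc_jp`, `gb2312` and `big5` codecs, and `kuten` is a hypothetical helper name. It reads the four-digit numbers quoted for JIS and GB as row-cell indices into each national character set, which is an interpretation, while the Unicode code point stays the same whichever encoding the bytes came from.

```python
# The character for tea: one Unicode code point, many byte encodings.
TEA = "茶"

def kuten(ch: str, codec: str) -> tuple[int, int]:
    """Row and cell of ch in a 94x94 character set, read from EUC-style bytes."""
    hi, lo = ch.encode(codec)      # two bytes, each offset by 0xA0
    return hi - 0xA0, lo - 0xA0

print(f"code point: U+{ord(TEA):04X} = {ord(TEA)}")

for codec in ("shift_jis", "gb2312", "big5", "utf-8"):
    raw = TEA.encode(codec)
    # Decoding with the matching codec always gives back the same character.
    assert raw.decode(codec) == TEA
    print(f"{codec:>9}: {raw.hex(' ').upper()}")

print("JIS row-cell:", kuten(TEA, "euc_jp"))   # compare with the quoted 3567
print("GB  row-cell:", kuten(TEA, "gb2312"))   # compare with the quoted 1872
```

A search engine that decodes each page according to its declared charset ends up indexing the same code point for this character, whichever byte pair the page used.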
|||
|
|||
Japanese Chinese tea web sites
"Lewis Perin" wrote in message

> and for those Japanese
> people who are interested in Chinese tea, why shouldn't they use
> Chinese characters to refer to them?[1]

They have no other characters anyway. That'd be too bad to get rid of the
meaning and keep only a phonetic reading.

> [1]Actually, I just thought of a reason why Japanese people wouldn't
> want to use Chinese characters: because, when using them in a Japanese
> context, the phonemes they correspond to wouldn't be the same as in
> Chinese.

Isn't that the same for the different Chinese dialects?

On the first column, they give the kanji name (for Japan) of that tea. In the
second, they give the Japanese reading that you are supposed to use (real tea
fans tend to know the pin yin actually used in China better than the Japanese
reading).

There is the possibility that some of the characters used in the first
column are only for the Japanese naming of that tea. One possibility is that
the Chinese original uses a more simplified or more complicated, and that
character is not of the list of kanji (characters used in Japan), so they
replace. Another is that they translated the Chinese meaning (here that
would be the *rolled* thing) into Japanese, with different characters.

> One thing, though, puzzles me about these Japanese sites for Chinese
> teas: some of the teas they list can only be found on Japanese sites.
> If a tea really is Chinese, why wouldn't it be retrievable on some
> Chinese site?

I have seen that many times. They obviously change certain names. And they
don't tell which...at the end the Japanese themselves believe that was the
original Chinese.
I suspect the pin yin of that list has been added later, using the automatic
character change of the computer.

Also, Chinese sites about tea tend to be more basic, give very little
information. Most of them are only made to sell tea. Probably fewer idle
amateurs have access to internet, compared with Japan.
In Japanese, there are many more pages that aim at sharing some knowledge,
then the on-line shops copy from them....

And you know what it's like on internet. The first guy may have misspelled
the name of a tea, 1000 others copy the mistake and a new tea is invented. In
this case, nobody says he/she has had the tea you've picked. All these
Japanese sites are not selling that tea. They are listing all the green teas
they have ever heard about.

Kuri
|
|||
|
|||
Japanese Chinese tea web sites
kuri wrote:
> There is the possibility that some of the characters used in the first
> column are only for the Japanese naming of that tea. One possibility is that
> the Chinese original uses a more simplified or more complicated, and that
> character is not of the list of kanji (characters used in Japan), so they
> replace. Another is that they translated the Chinese meaning (here that
> would be the *rolled* thing) into Japanese, with different characters.

These lists of teas are given in Kanji (Japanese character set); but
Kanji uses a lot of simplified characters, which is the same simplified
form of Chinese used in the PRC. It is perfectly fine Chinese, and
perfectly understandable. No switching of Chinese characters to muddle
the meaning. But the encoding is: charset=shift_jis. So it's in
Japanese encoding. Still, if you input these characters using
simplified Chinese IME, you can find them.

> > One thing, though, puzzles me about these Japanese sites for Chinese
> > teas: some of the teas they list can only be found on Japanese sites.
> > If a tea really is Chinese, why wouldn't it be retrievable on some
> > Chinese site?

I just did a search in Chinese for that Zhou Da Tie Cha, and I found
199 matches. All are Chinese sources. Here is one example:

http://www.china-tea.org/Html/200511593644-1.html

> I suspect the pin yin of that list has been added later, using the automatic
> character change of the computer.

The pinyin transliteration for these teas is mostly correct. But some
of the teas do have mistakes in Pinyin.

> Also, Chinese sites about tea tend to be more basic, give very little
> information. Most of them are only made to sell tea.

That's not really true. There are many many Chinese websites devoted to
tea, to information about tea. Not all are tea vendor websites. In
fact, most of the Chinese tea websites I visit don't sell tea at all. A
lot of Chinese tea websites have very detailed information - and many
many subjects. However, a lot of other tea websites just have garbage
information too. And many tea websites do copy the texts from the other
websites. So what you get is many sites that contain the exact same
information.

I don't really know why you are getting only Japanese websites. Maybe
your CJK IME is not set to Chinese.
|
|||
|
|||
Japanese Chinese tea web sites
"niisonge" wrote in message

> These lists of teas are given in Kanji (Japanese character set); but
> Kanji uses a lot of simplified characters, which is the same simplified
> form of Chinese used in the PRC.

The *tie* (tetsu) used in Japanese is not the same *tie* used in the link
you give. It's a different simplification. That could be enough to restrict
the search to Japanese language pages.

> I just did a search in Chinese for that Zhou Da Tie Cha, and I found
> 199 matches. All are Chinese sources.
> http://www.china-tea.org/Html/200511593644-1.html

I've pasted the Chinese writing from this page and I got zero hit with
google. I have not restricted the search to any language (in theory, my
preferences on the browser are French +English+Japanese+Chinese+Chinese).
I need to do a search restricted to simplified Chinese to get something (I
got 1000 matches).

> That's not really true. There are many many Chinese websites devoted to
> tea, to information about tea.

It seems I don't get all the hits. Whatever I search about tea, I get
something like 10 000 hits in Japanese and only 1000 in Chinese. Maybe my
browser is not impartial. Well surely it isn't...

> I don't really know why you are getting only Japanese websites. Maybe
> your CJK IME is not set to Chinese.

We didn't use IME in this story. Just copy and paste.

Kuri
|
|||
|
|||
Japanese Chinese tea web sites
> The *tie* (tetsu) used in Japanese is not the same *tie* used in the link
> you give. It's a different simplification. That could be enough to restrict
> the search to Japanese language pages.

In my browser, both characters show as exactly the same character. Must
be just a configuration problem on your computer. There is a
traditional Chinese form of the "tie" character. But Japanese don't use
that one.

> I've pasted the Chinese writing from this page and I got zero hit with
> google. I have not restricted the search to any language (in theory, my
> preferences on the browser are French +English+Japanese+Chinese+Chinese).
> I need to do a search restricted to simplified Chinese to get something (I
> got 1000 matches).

Google sucks. Forget about google. Why not use a Chinese search engine?
Try Baidu:

http://www.baidu.com

> We didn't use IME in this story. Just copy and paste.

Try downloading NJ Star Communicator:

http://www.njstar.com

You can then input Chinese into your browser. Copy paste just doesn't
work very well, in my experience.
|
|||
|
|||
Japanese Chinese tea web sites
By the way, if you use Baidu, you should get about 525 search results.
|
|
|||
|
|||
Japanese Chinese tea web sites ( UTF8)
"niisonge" wrote in message

> In my browser, both characters show as exactly the same character.
> Must be just a configuration problem on your computer.

There would be a problem if I saw both the same. That I confuse Chinese and
Japanese is one thing, but I hope that my computer is more clever than I.

> There is a
> traditional Chinese form of the "tie" character. But Japanese don't use
> that one.

I don't know what *you* see. (my post is in Unicode UTF 8)
I get :
周打鉄茶 in Japanese
周打铁茶 in Chinese (simplified)

Baidu also finds pages in Japanese if you enter the first line.

> Why not use a Chinese search engine?

Because of my level in Chinese... Maybe next year.

> You can then input Chinese into your browser.

Most times, I can't, but that's not a question of IME. I don't know when the
Chinese character should be different from the Japanese one + I don't know
the pin yin. Well, there are dictionaries for that. I should get one. Also, I
have a textbook that lists all the different characters between Japanese and
Chinese, I should study...

Kuri
|
|||
|
|||
Japanese Chinese tea web sites
Lewis Perin wrote:
> Warning: nerdy details abound here!
>
> "Space Cowboy" writes:
> >
> > Lewis Perin wrote:
> >> [...why are there Chinese tea names that appear only in Japanese sites...]
> >
> > The charset=shift_jis of the webpage indicates Japanese. All 2
> > character pairs are used for Japanese font sets. The characters you
> > see are from the Japanese fonts and not Chinese. That character may
> > very well exist in the Chinese font set and vice versa but the charset
> > setting on the HTML page tells where to look. Basically non Roman
> > languages take two characters for representation and a corresponding
> > font set. For example the Cha character in Japanese JIS is 3567 and
> > simplified Chinese GB 1872.
>
> Yes, but it's still the same Unicode code point (33590, or 8336 in
> hex), which is why you get both .cn and .jp web sites if you Google
> for it.

Only if the Chinese or Japanese websites uses Unicode codepoints such
as 8336. There are plenty of Chinese and Japanese sites that use
charset=UTF-8.

I'm not sure of the particulars but you can also mix language sets
on a webpage. I use Unicode strings for Google searches. I could
get additional hits if I used JIS or GB strings but I only track
Unicode. On TaoBao I have to use GB strings. Ebay China uses
Unicode.

Babelfish doesn't accept Unicode strings.

> > The Glyph representation from both will look the same and the same
> > argument for "zhou da tie cha" in Japanese JIS and Chinese GB where
> > the Glyphs look the same but not the pairs.
>
> But Google, smart though it is, can't see the glyph; it can only see
> the codepoint in whatever encoding is there. I've run these through
> the Unihan database, and they're the Chinese codepoints that
> correspond to the Pinyin on the same line of the page.

The codepoints are Japanese and not Unihan which only accepts Unicode
codepoints. You didn't run any Japanese codepoints from "zhou da tie
cha" and get a valid hit on Unihan. At the minimum you would need
Japanese JIS to Unicode codepoints. If anyone knows of a routine or
website to do this let me know. You also don't plug in strings to
Unihan just the 4 bit hex characters (0-9A-F) which represent each
pair of ascii characters for a total of 16 bits.

> > Google will find computer strings anywhere which in your case just
> > happens to be on web pages with charset indicating JIS. It looks
> > like to me you did a post with Linux which comes with default
> > international language support.
>
> BSD, actually, but I didn't post anything that wasn't ASCII.

I don't have JIS or GB or BIG5 loaded on this computer. The webpage
you mentioned looks like gibberish. I also don't have Unicode loaded
on this computer. Fortunately I can tell Unicode characters because
MSIE indicates a "empty square". If I want to see the Glyph I insert
the Unicode string into a routine which gives the character pair
codepoints which I then use in Unihan. This is the main reason I use
Unicode. I previously posted a Zhongwen backdoor procedure using
Unicode codepoints. I don't know of any Japanese or Chinese sites
that let me do the same thing with their corresponding character pairs
to see a Glyph representation.

> > In Windows you optionally load the Unicode font set called
> > CJK for Chinese, Japanese, Korean which is the international
> > standard to replace national language sets like JIS and GB.
>
> Right, I use that a lot.
>
> Thanks, Jim, for trying, but I don't see how this explains the phenomenon.

It's simple. The codepoints from any charset are different. I think
you understand the character pairs that make up each non Roman
language or the Unicode standard for all languages. Maybe there are
some overlapping codepoints between JIS or GB or BIG5 meaning the same
Glyph character but I haven't found that true for Unicode at least for
tea terms. When you use cut and paste in Windows you keep intact the
ascii character pairs for whatever language.

> /Lew
> ---
> Lew Perin /
> http://www.panix.com/~perin/babelcarp.html

Jim
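On the request above for a routine that goes from JIS to Unicode codepoints: one possible sketch in Python, using only the standard codecs. It assumes Python 3.9+ and that the bytes really are in the declared charset. `to_unihan_keys` is a hypothetical helper name, and the sample bytes are produced here by encoding the tea name as it appears on the Japanese page, not fetched from the site.

```python
def to_unihan_keys(raw: bytes, charset: str) -> list[str]:
    """Decode legacy-encoded bytes and return one U+XXXX key per character."""
    text = raw.decode(charset)     # raises UnicodeDecodeError on invalid input
    return [f"U+{ord(ch):04X}" for ch in text]

# Stand-in for four characters copied from a charset=shift_jis page.
sample = "周打鉄茶".encode("shift_jis")

print(sample.hex(" ").upper())               # the raw Shift_JIS byte pairs
print(to_unihan_keys(sample, "shift_jis"))   # keys usable in a Unihan lookup
```

The last two keys come out as U+9244 and U+8336. The same function accepts `"gb2312"` or `"big5"` as the charset for Chinese pages.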
|
|||
|
|||
Japanese Chinese tea web sites
> "Lewis Perin" wrote in message
>
> > and for those Japanese
> > people who are interested in Chinese tea, why shouldn't they use
> > Chinese characters to refer to them?[1]
>
> They have no other characters anyway. That'd be too bad to get rid of the
> meaning and keep only a phonetic reading.
>
> > [1]Actually, I just thought of a reason why Japanese people wouldn't
> > want to use Chinese characters: because, when using them in a Japanese
> > context, the phonemes they correspond to wouldn't be the same as in
> > Chinese.
>
> Isn't that the same for the different Chinese dialects ?

That's a good point. I suppose it applies for teas that aren't well
known enough to have recognized names in one's own dialect.

/Lew
---
Lew Perin / http://www.panix.com/~perin/babelcarp.html
|
|||
|
|||
Japanese Chinese tea web sites
"niisonge" writes:

> [...]
>
> > > One thing, though, puzzles me about these Japanese sites for Chinese
> > > teas: some of the teas they list can only be found on Japanese sites.
> > > If a tea really is Chinese, why wouldn't it be retrievable on some
> > > Chinese site?
>
> I just did a search in Chinese for that Zhou Da Tie Cha, and I found
> 199 matches. All are Chinese sources. Here is one example:
>
> http://www.china-tea.org/Html/200511593644-1.html

Thanks for finding this. That site's Tie character is different from
the Japanese site's character, despite their being rendered with the
same glyph. The Chinese site's character's Unicode codepoint is 38081,
while the Japanese site's is 37444. When I search using the four
characters I get 776 sites from Google.

/Lew
---
Lew Perin / http://www.panix.com/~perin/babelcarp.html
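The two codepoints quoted above can be reproduced by comparing the two spellings character by character. This is a minimal Python sketch, not from the thread. The two strings are the ones given in the UTF-8 post earlier, and `can_encode` is a hypothetical helper that asks whether a character exists in the set behind a given codec.

```python
japanese = "周打鉄茶"   # spelling on the Shift_JIS page
chinese = "周打铁茶"    # spelling on the GB2312 page

# Walk both names in parallel and show each character's decimal code point.
for jp, cn in zip(japanese, chinese):
    mark = "same" if jp == cn else "DIFFERENT"
    print(f"{jp} {ord(jp):5d}   {cn} {ord(cn):5d}   {mark}")

def can_encode(ch: str, codec: str) -> bool:
    """True if ch exists in the character set behind codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# Each form of "tie" is missing from the other country's legacy set.
print("铁 in shift_jis:", can_encode("铁", "shift_jis"))
print("鉄 in gb2312:  ", can_encode("鉄", "gb2312"))
```

Only the third character differs, so an exact-match search for one spelling cannot hit pages written with the other.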
|
|||
|
|||
Japanese Chinese tea web sites
"niisonge" writes:

> [...]
>
> > I've pasted the Chinese writing from this page and I got zero hit with
> > google. I have not restricted the search to any language (in theory, my
> > preferences on the browser are French +English+Japanese+Chinese+Chinese).
> > I need to do a search restricted to simplified Chinese to get something (I
> > got 1000 matches).
>
> Google sucks. Forget about google. Why not use a Chinese search engine?
> Try Baidu:
>
> http://www.baidu.com

Thanks for the pointer. From where I browse, though, Google wins: 776
hits versus 497.

> > We didn't use IME in this story. Just copy and paste.
> Try downloading NJ Star Communicator:
>
> http://www.njstar.com
>
> You can then input Chinese into your browser. Copy paste just doesn't
> work very well, in my experience.

Are you sure you don't mean what that site calls Asian Explorer? The
other products on the site all seem to have only *trial* versions
available for free.

/Lew
---
Lew Perin / http://www.panix.com/~perin/babelcarp.html
|
|||
|
|||
Japanese Chinese tea web sites
"Space Cowboy" writes:

> Lewis Perin wrote:
> > Warning: nerdy details abound here!
> >
> > "Space Cowboy" writes:
> > >
> > > Lewis Perin wrote:
> > >> [...why are there Chinese tea names that appear only in Japanese sites...]
> > >
> > > The charset=shift_jis of the webpage indicates Japanese. All 2
> > > character pairs are used for Japanese font sets. The characters you
> > > see are from the Japanese fonts and not Chinese. That character may
> > > very well exist in the Chinese font set and vice versa but the charset
> > > setting on the HTML page tells where to look. Basically non Roman
> > > languages take two characters for representation and a corresponding
> > > font set. For example the Cha character in Japanese JIS is 3567 and
> > > simplified Chinese GB 1872.
> >
> > Yes, but it's still the same Unicode code point (33590, or 8336 in
> > hex), which is why you get both .cn and .jp web sites if you Google
> > for it.
>
> Only if the Chinese or Japanese websites uses Unicode codepoints such
> as 8336. There are plenty of Chinese and Japanese sites that use
> charset=UTF-8.

But UTF-8 *is* Unicode. More pedantically, it's an encoding of
Unicode. The codepoints exist at the abstract level of Unicode; the
encodings, like UTF-8, mediate between that level and what you see in
your browser. See

http://www.unicode.org/standard/principles.html

for an explanation.

> I'm not sure of the particulars but you can also mix language sets
> on a webpage. I use Unicode strings for Google searches. I could
> get additional hits if I used JIS or GB strings but I only track
> Unicode. On TaoBao I have to use GB strings. Ebay China uses
> Unicode.

JIS, GB, and Big5 are all parts of Unicode.

> Babelfish doesn't accept Unicode strings.

Do you mean Babelfish or Babelcarp? If it's the latter, and you want
to try the alpha version that searches on Chinese characters, email me.

/Lew
---
Lew Perin / http://www.panix.com/~perin/babelcarp.html
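The statement that UTF-8 is an encoding of Unicode code points can be made concrete by doing the encoding by hand for the tea character. This is a minimal Python sketch, not from the thread. `utf8_bytes` is a hypothetical name, and it deliberately covers only the three-byte range, which is where most CJK characters fall.

```python
def utf8_bytes(code_point: int) -> bytes:
    """Encode one code point in U+0800..U+FFFF as three UTF-8 bytes by hand."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("this sketch only handles the three-byte range")
    return bytes([
        0xE0 | (code_point >> 12),           # 1110xxxx: top 4 bits
        0x80 | ((code_point >> 6) & 0x3F),   # 10xxxxxx: middle 6 bits
        0x80 | (code_point & 0x3F),          # 10xxxxxx: low 6 bits
    ])

tea = 0x8336                                  # the abstract code point
print(utf8_bytes(tea).hex(" ").upper())       # bytes a UTF-8 page would carry

# The built-in encoder agrees with the hand-rolled one.
assert utf8_bytes(tea) == chr(tea).encode("utf-8")
```

The number 8336 never appears in the page bytes as such. A UTF-8 page carries E8 8C B6, and the code point is recovered by decoding, just as it is from a Shift_JIS or GB2312 page.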
|
|||
|
|||
Japanese Chinese tea web sites
You can download Asian Explorer if you want, but basically, it's just a
cheaper version of Internet Explorer, just enhanced for Asian character
sets. The copy/paste function is the most useful part of it.

But what I'm referring to is NJ Star Communicator - it's a CJK IME. The
trial version is fully functional. It's supposedly only a 30 day trial.
But it's still fully functional way beyond the trial date.

If you use NJ Star Communicator, it will automatically display the
characters on the web page in whatever character format you set the
software to load - GB, Big5, EUC, etc. So for me, traditional Chinese
web pages (doesn't matter if encoded in Big5 or unicode or utf-8) all
get loaded into simplified Chinese. If I want to change to traditional
Chinese, then I change language settings.

And inputting characters into a web search using say, GB will also
yield results in Big5, EUC, Chinese UTF simplified, Chinese UTF
traditional, Japanese Shift-JIS, Japanese UTF8, etc. So all of this
encoding stuff is really a moot point if you use a CJK IME.

If you know the Pinyin, you can find the character easily. If you are
unsure of the Pinyin, you can also switch to English to Chinese input.
|
|||
|
|||
Japanese Chinese tea web sites ( UTF8)
> I don't know what *you* see. (my post is in Unicode UTF 8)
> I get :
> 周打鉄茶 in Japanese
> 周打铁茶 in Chinese (simplified)

I see what you mean there. Changing my settings to Japanese UTF 8
shows 2 different characters. The Japanese "tie" is the simplified
character. And the Chinese "tie" is the traditional character.
Switching to Chinese UTF Traditional also shows the last character as
Chinese traditional. Switching to Chinese UTF Simplified shows both
"tie" characters as Chinese simplified text.

I think if you want to search Chinese PRC websites, you better switch
to Chinese Simplified. I learned that years ago. So I never have any
problems. Of course, it meant I had to learn Chinese simplified
characters along the way.
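The advice to search PRC sites with the simplified form amounts to respelling the name before searching. Below is a rough Python sketch of that idea, not from the thread and assuming Python 3.9+. `TIE_FORMS` and `search_spellings` are hypothetical names, and the table is hand-written for this one character. It is an illustration only, not a general Japanese-to-Chinese converter, which would need a full variant table.

```python
# Hand-written, deliberately tiny table: three written forms of "tie" (iron).
TIE_FORMS = {
    "simplified": "铁",    # PRC form
    "japanese": "鉄",      # form used on the Japanese pages
    "traditional": "鐵",   # traditional Chinese form
}

def search_spellings(name: str) -> dict[str, str]:
    """Return the name respelled with each form of 'tie'."""
    spellings = {}
    for label, form in TIE_FORMS.items():
        # Map every known form of the character onto the one wanted.
        table = {ord(other): form for other in TIE_FORMS.values()}
        spellings[label] = name.translate(table)
    return spellings

for label, spelling in search_spellings("周打鉄茶").items():
    print(f"{label:>11}: {spelling}")
```

Running a search once per spelling would cover Japanese, mainland and traditional-script pages without relying on the search engine to treat the variants as equivalent.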
|
|||
|
|||
Japanese Chinese tea web sites
Lewis Perin wrote:
> "Space Cowboy" writes:
>
> > Lewis Perin wrote:
> > > Warning: nerdy details abound here!
> > >
> > > "Space Cowboy" writes:
> > > >
> > > > Lewis Perin wrote:
> > > >> [...why are there Chinese tea names that appear only in Japanese sites...]
> > > >
> > > > The charset=shift_jis of the webpage indicates Japanese. All 2
> > > > character pairs are used for Japanese font sets. The characters you
> > > > see are from the Japanese fonts and not Chinese. That character may
> > > > very well exist in the Chinese font set and vice versa but the charset
> > > > setting on the HTML page tells where to look. Basically non Roman
> > > > languages take two characters for representation and a corresponding
> > > > font set. For example the Cha character in Japanese JIS is 3567 and
> > > > simplified Chinese GB 1872.
> > >
> > > Yes, but it's still the same Unicode code point (33590, or 8336 in
> > > hex), which is why you get both .cn and .jp web sites if you Google
> > > for it.
> >
> > Only if the Chinese or Japanese websites uses Unicode codepoints such
> > as 8336. There are plenty of Chinese and Japanese sites that use
> > charset=UTF-8.
>
> But UTF-8 *is* Unicode. More pedantically, it's an encoding of
> Unicode. The codepoints exist at the abstract level of Unicode; the
> encodings, like UTF-8, mediate between that level and what you see in
> your browser. See
>
> http://www.unicode.org/standard/principles.html
>
> for an explanation.

Agreed UTF-8 is Unicode. Anytime you use the codepoint 8336 it means
tea only if you find websites with charset=UTF-8. As I said before the
tea codepoint for GB2312 is 1872, BIG5 AFF9, JIS 3567. So if the
webpage said charset=JIS you would use 3567 to find the glyph meaning
tea which is the reason you would only see Japanese websites. You
won't see the Japanese websites using charset=UTF-8 or if you did
there is a Unicode glyph for 3567 but not for tea. If I come across a
webpage that says charset=UTF-8 and want to see the glyphs in my
browser I load the MS Unicode CJK codeset. GB2312, BIG5, JIS have
their codepoints and glyphs. Any specific codepoint only has meaning
if you know what charset it uses to look up the glyph. As an aside
I've been checking the Chinese webpages mentioned in this thread by
you and the html says charset=GB2312. You indicate you derived a
Unicode codepoint which I assume came from the webpage contents. I
don't see how. That is only valid if charset=UTF-8.

> > I'm not sure of the particulars but you can also mix language sets
> > on a webpage. I use Unicode strings for Google searches. I could
> > get additional hits if I used JIS or GB strings but I only track
> > Unicode. On TaoBao I have to use GB strings. Ebay China uses
> > Unicode.
>
> JIS, GB, and Big5 are all parts of Unicode.

In what sense? They use different codepoints for language glyphs.
You couldn't tell what codepoint produced the tea glyph if it exist in
any of the language packs. Every scriptable language on Earth is part
of Unicode or that is the intent. There are language sets that only
exist in Unicode because the computer linguists know of some isolated
language group that hasn't seen a computer but they could communicate
with each other in Unicode when the Internet arrives.

> > Babelfish doesn't accept Unicode strings.
>
> Do you mean Babelfish or Babelcarp? If it's the latter, and you want
> to try the alpha version that searches on Chinese characters, email me.
>
> /Lew
> ---
> Lew Perin /
> http://www.panix.com/~perin/babelcarp.html

It's AltaVista Babelfish. I would expect at the minimum to use
Unicode strings to search your site. I'm not talking about the
derived hex codepoints. As I said before there is a mapping of the
normally used codepoints used in the CJK language packs to Unicode.
If you could find the routine, if it exists, then internally you store
Unicode while accepting any external language pack characters in CJK
or the default Unicode. It would be just as easy to display back in
the language packs codepoints.

Jim

PS: One doesn't care about different codepoints in language packs if
you see the expected glyph. It is important because some Japanese
website might be talking about Chinese teas using charset=JIS
codepoints.
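The scheme described above (accept any charset on input, store Unicode internally, write back out in whatever charset is wanted) can be sketched with Python's standard codecs. This is an illustration, not the Babelcarp implementation. `transcode` is a hypothetical name, and the sample bytes are generated here, not taken from the pages discussed. It also shows how a Unicode codepoint can be derived from a charset=GB2312 page: by decoding its bytes.

```python
def transcode(raw: bytes, source: str, target: str) -> bytes:
    """Decode from one charset into a Unicode str, then encode into another."""
    text = raw.decode(source)
    # errors="replace" substitutes "?" for characters the target set lacks.
    return text.encode(target, errors="replace")

page_bytes = "茶".encode("gb2312")       # stand-in for GB2312 page content
as_text = page_bytes.decode("gb2312")    # internal Unicode form
print(f"U+{ord(as_text):04X}")           # code point recovered from GB bytes

# Same character written back out for a Shift_JIS page.
print(transcode(page_bytes, "gb2312", "shift_jis").hex(" ").upper())

# The simplified "tie" has no Shift_JIS form, so it degrades to "?".
print(transcode("铁".encode("gb2312"), "gb2312", "shift_jis"))
```

The last line is the practical catch: conversion between the national sets is lossy wherever one set lacks a character, which is one reason to keep Unicode as the internal form.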
|
|||
|
|||
Japanese Chinese tea web sites
I can't believe NJSC will never expire. I have an old computer I load
trial dated software. I just reset the Date if I want to use the
software. Some of the products know about this so put in a semaphore
entry in the Registry. You simply keep track of before and after
software changes to the Registry on the date of load versus the date
of expiration. Or there is some mysterious hidden file entry you need
to find. I'd love to find any routine that allows me to go from CJK
languages packs to Unicode.

Jim

niisonge wrote:
> You can download Asian Explorer if you want, but basically, it's just a
> cheaper version of Internet Explorer, just enhanced for Asian character
> sets. The copy/paste function is the most useful part of it.
>
> But what I'm referring to is NJ Star Communicator - it's a CJK IME. The
> trial version is fully functional. It's supposedly only a 30 day trial.
> But it's still fully functional way beyond the trial date.
>
> If you use NJ Star Communicator, it will automatically display the
> characters on the web page in whatever character format you set the
> software to load - GB, Big5, EUC, etc. So for me, traditional Chinese
> web pages (doesn't matter if encoded in Big5 or unicode or utf-8) all
> get loaded into simplified Chinese. If I want to change to traditional
> Chinese, then I change language settings.
>
> And inputting characters into a web search using say, GB will also
> yield results in Big5, EUC, Chinese UTF simplified, Chinese UTF
> traditional, Japanese Shift-JIS, Japanese UTF8, etc. So all of this
> encoding stuff is really a moot point if you use a CJK IME.
>
> If you know the Pinyin, you can find the character easily. If you are
> unsure of the Pinyin, you can also switch to English to Chinese input.
Japanese Chinese tea web sites
>Also, Chinese sites about tea tend to be more basic, give very little
>information. Most of them are only made to sell tea. Probably fewer idle >amateurs have access to internet, compared with Japan. Na, everyone and their grandmother spends time on the net in China now. Some people are failing out of school because of QQ, a Chinese chat program (they stole the code from ICQ). Anyway, it's the Chinese business style to give as little information about their products as possible to confuse the consumer. You can not imagine how many "ten fu" tea shop copies there are around here... |
Japanese Chinese tea web sites
"Space Cowboy" > writes:
> I can't believe NJSC will never expire. I have an old computer I load > trial dated software. I just reset the Date if I want to use the > software. Some of the products know about this so put in a semaphore > entry in the Registry. You simply keep track of before and after > software changes to the Registry on the date of load versus the date of > expiration. Or there is some mysterious hidden file entry you need to > find. I'd love to find any routine than allows me to go from CJK > languages packs to Unicode. I'm not sure exactly what you mean here. Do you mean pasting a CJK character into something that would pull up the appropriate Unihan page? /Lew --- Lew Perin / http://www.panix.com/~perin/babelcarp.html |
Japanese Chinese tea web sites
"Space Cowboy" > writes:
> Lewis Perin wrote: > > [...Unicode is Unicode...] > > Agreed UTF-8 is Unicode. Anytime you use the codepoint 8336 it means > tea only if you find websites with charset=UTF-8. I just Googled for the Chinese character for tea, limiting it to *.jp sites. The first one that came up (http://www.chanoaji.jp/) has nothing about UTF-8 in its source, but does have <META HTTP-EQUIV="Content-Type" CONTENT="text/html;charset=Shift_JIS"> > As I said before the tea codepoint for GB2312 is 1872, BIG5 AFF9, > JIS 3567. So if the webpage said charset=JIS you would use 3567 to > find the glyph meaning tea which is the reason you would only see > Japanese websites. You won't see the Japanese websites using > charset=UTF-8 or if you did there is a Unicode glyph for 3567 but > not for tea. Then why is it that, when I copy the tea character from that Japanese page into Google and search again, the search term is identical? (By the way, the search term is URL-encoded UTF-8: %E8%8C%B6 > If I come across a webpage that says charset=UTF-8 and want to see > the glyphs in my browser I load the MS Unicode CJK codeset. GB2312, > BIG5, JIS have their codepoints and glyphs. Any specific codepoint > only has meaning if you know what charset it uses to look up the > glyph. As an aside I've been checking the Chinese webpages > mentioned in this thread by you and the html says charset=GB2312. > You indicate you derived a Unicode codepoint which I assume came > from the webpage contents. I don't see how. That is only valid if > charset=UTF-8. I don't have access to Google's source code, but it seems clear to me that they're not confused by the Big5 vs. GB vs. JIS. They're probably converting everything to Unicode codepoints before indexing. > > > > I'm not sure of the particulars but you can also mix language sets > > > on a webpage. I use Unicode strings for Google searches. I could > > > get additional hits if I used JIS or GB strings but I only track > > > Unicode. 
On TaoBao I have to use GB strings. Ebay China uses > > > Unicode. > > > > JIS, GB, and Big5 are all parts of Unicode. > > In what sense? They use different codepoints for language glyphs. They use different codepoints for *some* glyphs - but mostly they use the same codepoints for glyphs they share. The fact that e.g. GB enumerates the Cha character differently than JIS doesn't affect the fact that they both use the same Unicode codepoint. /Lew --- Lew Perin / http://www.panix.com/~perin/babelcarp.html |
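The guess that a search engine converts everything to Unicode before indexing can be illustrated with a toy index. Nothing here is Google's actual code: the page names, the `PAGES` table, and `build_index` are hypothetical, and each sample page holds only the tea character in the charset it declares.

```python
from collections import defaultdict
from urllib.parse import quote

# Hypothetical pages: name -> (declared charset, raw bytes of the page text).
PAGES = {
    "jp.example": ("shift_jis", b"\x92\x83"),
    "cn.example": ("gb2312", b"\xb2\xe8"),
    "tw.example": ("big5", b"\xaf\xf9"),
}


def build_index(pages):
    """Decode every page with its own charset, then index by Unicode character."""
    index = defaultdict(set)
    for name, (charset, raw) in pages.items():
        for ch in raw.decode(charset):
            index[ch].add(name)
    return index


INDEX = build_index(PAGES)
TEA = "\u8336"

# One Unicode query matches all three pages, whatever bytes they used.
print(sorted(INDEX[TEA]))   # ['cn.example', 'jp.example', 'tw.example']

# quote() percent-encodes the UTF-8 bytes, giving the search term seen in the URL.
print(quote(TEA))           # %E8%8C%B6
```

Under this model the charset of a page matters only at decoding time; once decoded, a Shift_JIS page and a GB2312 page are indistinguishable to the query.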
Japanese Chinese tea web sites
Something like that: CJK GB, BIG5, JIS, and KS national characters to Unicode. I can go from Unicode to CJK national characters. Zhongwen, Mandarintools, and Babelfish require Unicode. If I know the Unicode value, I can use Unihan to look at a graphical representation of the character without loading charsets, including Unicode for MS. I still don't know how you are getting GB2312 webpages to show you Unicode codepoints. NJ Star Communicator apparently can do that, but it would be overkill for my limited use.

Jim

Lewis Perin wrote:
> "Space Cowboy" writes:
>
> > I'd love to find any routine than allows me to go from CJK
> > languages packs to Unicode.
>
> I'm not sure exactly what you mean here.  Do you mean pasting a CJK
> character into something that would pull up the appropriate Unihan page?
>
> /Lew
> ---
> Lew Perin /
> http://www.panix.com/~perin/babelcarp.html
Japanese Chinese tea web sites
"Space Cowboy" > writes:
> > Lewis Perin wrote: > > "Space Cowboy" > writes: > > > > > I'd love to find any routine than allows me to go from CJK > > > languages packs to Unicode. > > > > I'm not sure exactly what you mean here. Do you mean pasting a CJK > > character into something that would pull up the appropriate Unihan page? > > > Something like that. Try this: www.panix.com/~perin/getunihan.html You need Javascript, but I promise it won't do anything evil. /Lew --- Lew Perin / http://www.panix.com/~perin/babelcarp.html |
Japanese Chinese tea web sites
At this point I think we are talking past each other. For example, I want to take the GB codepoint 1872 and translate it into the Unicode codepoint 8336. Agreed, the different codepoints for tea in the CJK language packs will point to 8336, which is the UTF-16 representation, consistent with the character pairs used by non-Roman language packs. I know Google will take my Unicode strings and return matches it finds in websites coded in a charset other than UTF. At that point I can't cut and paste any characters from those webpages into Babelfish, MandarinTools, or Zhongwen because they're not Unicode. From what I understand, NJ Star Communicator, for example, will flip between charset=GB2312 and charset=UTF-8. If I find anything pertinent on converting language pack codepoints to Unicode codepoints, I'll report back. I can go from Unicode codepoints to language pack codepoints.

Jim

Lewis Perin wrote:
> > As I said before the tea codepoint for GB2312 is 1872, BIG5 AFF9,
> > JIS 3567.  So if the webpage said charset=JIS you would use 3567 to
> > find the glyph meaning tea which is the reason you would only see
> > Japanese websites.  You won't see the Japanese websites using
> > charset=UTF-8 or if you did there is a Unicode glyph for 3567 but
> > not for tea.
>
> Then why is it that, when I copy the tea character from that Japanese
> page into Google and search again, the search term is identical?  (By
> the way, the search term is URL-encoded UTF-8: %E8%8C%B6)
>
> > If I come across a webpage that says charset=UTF-8 and want to see
> > the glyphs in my browser I load the MS Unicode CJK codeset.  GB2312,
> > BIG5, JIS have their codepoints and glyphs.  Any specific codepoint
> > only has meaning if you know what charset it uses to look up the
> > glyph.  As an aside I've been checking the Chinese webpages
> > mentioned in this thread by you and the html says charset=GB2312.
> > You indicate you derived a Unicode codepoint which I assume came
> > from the webpage contents.  I don't see how.  That is only valid if
> > charset=UTF-8.
>
> I don't have access to Google's source code, but it seems clear to me
> that they're not confused by the Big5 vs. GB vs. JIS.  They're
> probably converting everything to Unicode codepoints before indexing.
>
> > >
> > > > I'm not sure of the particulars but you can also mix language sets
> > > > on a webpage.  I use Unicode strings for Google searches.  I could
> > > > get additional hits if I used JIS or GB strings but I only track
> > > > Unicode.  On TaoBao I have to use GB strings.  Ebay China uses
> > > > Unicode.
> > >
> > > JIS, GB, and Big5 are all parts of Unicode.
> >
> > In what sense?  They use different codepoints for language glyphs.
>
> They use different codepoints for *some* glyphs - but mostly they use
> the same codepoints for glyphs they share.  The fact that e.g. GB
> enumerates the Cha character differently than JIS doesn't affect the
> fact that they both use the same Unicode codepoint.
>
> /Lew
> ---
> Lew Perin /
> http://www.panix.com/~perin/babelcarp.html
|
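Going from the GB number 1872 to Unicode 8336 can be sketched as follows, assuming 1872 and 3567 are row/cell numbers (row 18, cell 72 and row 35, cell 67), which is how Unihan lists its GB and JIS values. In the EUC form of these charsets each byte is the row or cell number plus 0xA0, so the number can be turned into bytes and handed to a standard codec. The function name `kuten_to_unicode` is hypothetical.

```python
def kuten_to_unicode(kuten: int, codec: str) -> str:
    """Convert a four-digit row/cell number such as 1872 to a Unicode character.

    `codec` must be an EUC-style codec ("gb2312" for GB 2312, "euc_jp" for
    JIS X 0208), where each byte is the row or cell number plus 0xA0.
    """
    row, cell = divmod(kuten, 100)
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError(f"not a valid row/cell number: {kuten}")
    raw = bytes([row + 0xA0, cell + 0xA0])
    # The codec's internal table does the actual mapping to Unicode.
    return raw.decode(codec)


gb_tea = kuten_to_unicode(1872, "gb2312")
jis_tea = kuten_to_unicode(3567, "euc_jp")

print(f"GB 1872  -> U+{ord(gb_tea):04X}")    # GB 1872  -> U+8336
print(f"JIS 3567 -> U+{ord(jis_tea):04X}")   # JIS 3567 -> U+8336
```

Only the first step (number to bytes) is arithmetic; the step from bytes to Unicode still relies on the lookup table built into each codec.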
Japanese Chinese tea web sites
I have a routine that does the same thing offline. It takes Unicode strings, determines their hex value, and calls Unihan. I was hoping it would take CJK language pack strings, for example by pasting in the GB or JIS codepoint character for tea. There has to be an easy way of going from language pack codepoints to Unicode codepoints.

Jim

Lewis Perin wrote:
> "Space Cowboy" writes:
>
> >
> > Lewis Perin wrote:
> > > "Space Cowboy" writes:
> > >
> > > > I'd love to find any routine than allows me to go from CJK
> > > > languages packs to Unicode.
> > >
> > > I'm not sure exactly what you mean here.  Do you mean pasting a CJK
> > > character into something that would pull up the appropriate Unihan page?
> > >
> > Something like that.
>
> Try this:
>
>   www.panix.com/~perin/getunihan.html
>
> You need Javascript, but I promise it won't do anything evil.
>
> /Lew
> ---
> Lew Perin /
> http://www.panix.com/~perin/babelcarp.html
Japanese Chinese tea web sites
"Space Cowboy" > writes:
> > Lewis Perin wrote: > > "Space Cowboy" > writes: > > > > > > > > Lewis Perin wrote: > > > > "Space Cowboy" > writes: > > > > > > > > > I'd love to find any routine than allows me to go from CJK > > > > > languages packs to Unicode. > > > > > > > > I'm not sure exactly what you mean here. Do you mean pasting > > > > a CJK character into something that would pull up the > > > > appropriate Unihan page? > > > > > > > Something like that. > > > > Try this: > > > > www.panix.com/~perin/getunihan.html > > > > You need Javascript, but I promise it won't do anything evil. > > > I have a routine that does the same thing offline. It takes Unicode > strings, determines their hex value, and calls Unihan. I was hoping it > would take CJK language pack strings for example paste in the GB or JIS > codepoint character for tea. There has to be an easy way of going from > language packs codepoints to Unicode codepoints. Sorry, I really don't know what you mean by a "CJK language pack string". The page I cited lets you paste a CJK character from a Chinese website and get back the corresponding Unihan page. /Lew --- Lew Perin / http://www.panix.com/~perin/babelcarp.html |
Japanese Chinese tea web sites
Just download NJ Star Communicator, and you can convert into any of 21 options. It's simple and easy to use. But beware: some characters don't convert properly. It's a machine conversion, and it doesn't replace human conversion. For example, this software in GB mode only supports about 7,000 characters, or something like that. But in Big5 mode, it supports 15,000 characters. So there are going to be many characters that don't get converted, or that are converted into another character, rendering the meaning of the text useless.

And 15,000 is not a lot of characters. For common, everyday Chinese, it's fine. But for some scholarly or artistic work, I often can't find the character I am looking for in my software, because it's not in there. When it comes to Chinese, computers are still way behind and woefully inadequate. But somehow, we still get by. Amazing, isn't it? Chinese fonts are another big beef of mine. But anyway, save that for later.
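The warning about characters that fail to convert can be checked mechanically. The sketch below uses Python's strict `gb2312` and `big5` codecs as stand-ins for the software's GB and Big5 modes, which is an assumption: NJ Star's own tables may differ. The helper names `encodable` and `missing` are hypothetical.

```python
def encodable(ch: str, codec: str) -> bool:
    """Return True if the character exists in the given legacy charset."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def missing(text: str, codec: str) -> list:
    """List the characters in `text` that the charset cannot represent."""
    return [ch for ch in text if not encodable(ch, codec)]


TEA = "\u8336"           # tea: present in both GB2312 and Big5
DRAGON_TRAD = "\u9f8d"   # traditional "dragon": absent from GB2312

print(encodable(TEA, "gb2312"), encodable(TEA, "big5"))   # True True
print(encodable(DRAGON_TRAD, "gb2312"))                   # False
print(missing(TEA + DRAGON_TRAD, "gb2312"))               # the one missing character
```

Running `missing` over a text before converting it shows in advance which characters would be dropped or substituted, instead of discovering the damage afterwards.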
Japanese Chinese tea web sites
On 23 Oct 2005 11:39:48 -0400, Lewis Perin wrote:
>In researching information for Babelcarp's database, I often run Web
>searches using Chinese characters.  Typically you find vastly more
>hits (mainly mainland Chinese sites) this way than if you use the
>Pinyin name for a tea.
>
>I've noticed often that a lot of hits will come from Japanese web
>sites.  This isn't too surprising when you think about it: Japanese is
>written using (among other things) Chinese characters; why shouldn't
>Japanese people be interested in Chinese tea; and for those Japanese
>people who are interested in Chinese tea, why shouldn't they use
>Chinese characters to refer to them?[1]
>
>One thing, though, puzzles me about these Japanese sites for Chinese
>teas: some of the teas they list can only be found on Japanese sites.
>If a tea really is Chinese, why wouldn't it be retrievable on some
>Chinese site?  Here's an example.  (This won't work, of course, if
>your Web browser has no access to Chinese characters.)  On the site
>
>  http://chinese-tea.info/03g/shurui.html
>
>scroll down to the Jiangxi teas, where you'll find a tea whose Pinyin
>name (in the right-hand column) is zhou da tie cha.  Search for it
>using the Chinese characters in the left-hand column.  The results
>will be exclusively Japanese sites.
>
>Anyone know what's going on here?  Kuri?
>
>/Lew
>---
>Lew Perin / http://www.panix.com/~perin/babelcarp.html
>[1]Actually, I just thought of a reason why Japanese people wouldn't
>want to use Chinese characters: because, when using them in a Japanese
>context, the phonemes they correspond to wouldn't be the same as in
>Chinese.

Sort of like searching for references to French fries on a French web site?
Posted to rec.food.drink.tea
Japanese Chinese tea web sites
Here is an interesting site for GB2312 to Unicode conversion that I found yesterday:
http://www.herongyang.com/gb2312/
As I previously suspected, it is a mapping and not a mathematical routine, even though the table was generated by a Java program with a bunch of but-ifs. I didn't see anything right off the bat that would prevent Javascript from doing the same thing mathematically as Java. The table says B2E8 is the GB value for the Unicode value 8336, not 1872 as mentioned in Unihan. I can tell I'm going to have some fun. Also, if I had DOTNET loaded, there is a simple routine where you indicate the language pack, such as GB2312, and it gives the corresponding Unicode value.

The charCodeAt routine in Javascript just maps a Unicode character to its Unicode code value. The two-byte hex view of a Unicode character is not the same as the result of the charCodeAt conversion. Some things are flipped around in the way the Unicode char is stored on disk. In a file the Unicode tea character is stored as 36383. Notepad will store the Unicode tea character as four bytes, with FF FE as the first two bytes.

Jim

Lewis Perin wrote:
> "Space Cowboy" writes:
>
> > Lewis Perin wrote:
> > > Try this:
> > >
> > >   www.panix.com/~perin/getunihan.html
> > >
> > > You need Javascript, but I promise it won't do anything evil.
> > >
> > I have a routine that does the same thing offline.  It takes Unicode
> > strings, determines their hex value, and calls Unihan.  I was hoping it
> > would take CJK language pack strings for example paste in the GB or JIS
> > codepoint character for tea.  There has to be an easy way of going from
> > language packs codepoints to Unicode codepoints.
>
> Sorry, I really don't know what you mean by a "CJK language pack
> string".  The page I cited lets you paste a CJK character from a
> Chinese website and get back the corresponding Unihan page.
>
> /Lew
> ---
> Lew Perin /
> http://www.panix.com/~perin/babelcarp.html
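The two puzzles in this post (B2E8 versus 1872, and the "flipped" bytes on disk) can both be reproduced in a few lines. This sketch assumes that Notepad's "Unicode" format is little-endian UTF-16 with a byte order mark and that 1872 is a row/cell number; `euc_to_kuten` is a hypothetical helper name.

```python
import codecs

TEA = "\u8336"

# The code value, which is what charCodeAt reports for this character.
code_point = ord(TEA)
print(f"{code_point:04X}")                        # 8336

# Little-endian UTF-16 with a byte order mark: FF FE, then the low byte first.
on_disk = codecs.BOM_UTF16_LE + TEA.encode("utf-16-le")
print(" ".join(f"{b:02X}" for b in on_disk))      # FF FE 36 83


def euc_to_kuten(raw: bytes) -> int:
    """Turn two EUC bytes into a four-digit row/cell number (hypothetical helper)."""
    row, cell = raw[0] - 0xA0, raw[1] - 0xA0
    return row * 100 + cell


# B2E8 and 1872 are two spellings of the same GB2312 position.
gb_bytes = TEA.encode("gb2312")
print(gb_bytes.hex().upper(), euc_to_kuten(gb_bytes))   # B2E8 1872
```

On this reading the table and Unihan agree: B2E8 is the byte form used in files and web pages, while 1872 is the row and cell of the same character in the GB2312 chart.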
Japanese Chinese tea web sites
You said before it really doesn't expire. What do you mean by that? Most of the time you'll lose some functions, such as printing, or be limited in file size. If I stay with Unicode I am fine for tea terms, but occasionally I would like to use native language packs.

Thanks,
Jim

niisonge wrote:
> Just download NJ Star Communicator, and you can convert into any of 21
> options. It's simple. And easy to use. But beware, some characters
> don't convert properly. It's a machine conversion. And it doesn't
> replace human conversion. For example, this software in GB mode only
> supports about 7 000 characters - or something like that. But in Big5
> mode, it supports 15 000 characters. So there are going to be many
> characters, that don't get converted, or are converted into another
> character, rendering the meaning of the text useless.
>
> And 15 000 is not a lot of characters. For common, every day Chinese
> language, it's fine. But for some scholarly or artistic work, I often
> can't find the character I am looking for in my software - because it's
> not in there. When it comes to Chinese, computers are still way behind,
> and woefully inadequate. But somehow, we still get by. Amazing isn't
> it? Chinese fonts are another big beef of mine. But anyway, save that
> for later.
Japanese Chinese tea web sites
Space Cowboy wrote:
> You said before it really doesn't expire.  What do you mean by that?
> Most of the time you'll lose some functions such as printing or limited
> file size.  If I stay with Unicode I am fine for tea terms but
> occasionally I would like using native language packs.

What I mean by "doesn't expire" is that for the first 30 days, you can use the software fine. After 30 days, you get a splash screen that reminds you to buy the software. It counts down 1 second for however many days you use it beyond the 30 days. Then, after 50 days, the screen kind of stays there permanently. But it's movable, so you can move it right off the desktop, out of your way. Then you can still use the software without being bothered by that screen. Just don't click "I agree" after the 50 day period. Some other weird things happen too, but the software is still fully functional.

The only thing the download version doesn't include is Chinese fonts. But that doesn't matter if you download the Asian Languages pack for MS Office. You can use the MS Office fonts instead, like Simsun, Mingliu, etc. But they're not very good fonts, just basic ones. I have used this software for over a year without problems. It has a lot of features that Asiansuite doesn't have.