Language resource management — Word segmentation of written texts — Part 2: Word segmentation for Chinese, Japanese and Korean
Publication date: 2011-08-25
The basic concepts and general principles of word segmentation as defined in ISO 24614-1 apply to Chinese, Japanese and Korean. Text needs to be segmented into tokens, words, phrases or some other types of smaller textual units in order to perform certain computational applications on language resources, such as natural language processing, information retrieval and machine translation. ISO 24614-2:2011 is restricted to the segmentation of a text into words or other word segmentation units (WSUs). This task is distinct from morphological or syntactic analysis per se, although it greatly depends on morphosyntactic analysis. It is also different from the task of laying out a framework for constructing a lexicon and identifying its lexical entries, namely lemmas and lexemes. The frameworks for the latter tasks are provided by ISO 24611, ISO 24613 and ISO 24615.
ISO 24614-2:2011 specifies rules for delineating WSUs for Chinese, Japanese and Korean. Some rules are common to all three languages, though each language also has its own distinct rules for identifying WSUs. The common features are discussed, then the distinct rules are laid out for Chinese, for Japanese and for Korean.