What is TF-IDF (Term Frequency - Inverse Document Frequency)?

  • TF (term frequency) is a value that indicates how often a particular word appears within a document. The higher this value, the more important the word can be considered in that document.
  • However, a word that is not especially frequent in one document but appears often across other documents is less distinctive, so its importance decreases.
  • How many documents a word appears in is called DF (document frequency), and the inverse of this value is called IDF (inverse document frequency).
  • TF-IDF is the product of TF and IDF. A word with a high score is one that appears often in the given document but rarely in the other documents.
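
Written as formulas, the definitions above look roughly like this. The symbols N (total number of documents) and df(t) (number of documents containing the term t) are notation introduced here for illustration, and the base-10 logarithm is chosen to match the Math.log10 call used in the IDF code later in this post; other bases and scalings are equally common.

```latex
\mathrm{tf}(t, d) = \text{number of times } t \text{ occurs in } d
\qquad
\mathrm{idf}(t) = \log_{10} \frac{N}{\mathrm{df}(t)}
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```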

Example

Twitter natural language processor

  • Needed for natural language processing (here, normalizing and tokenizing the text).

  • https://github.com/open-korean-text/open-korean-text

  • Maven

    <dependency>
      <groupId>org.openkoreantext</groupId>
      <artifactId>open-korean-text</artifactId>
      <version>2.1.0</version>
    </dependency>

Document information

  • Assume we have the following three example documents.
private static String[] dataList = {
            "I love dogs", // I, love, dogs
            "I hate dogs and knitting", // I, hate, dogs, and, knitting
            "Knitting is my hobby and my passion", // Knitting, is, my, hobby, and, my, passion
    };

TF-IDF model definition

class Tf_idf {
    private int tf = 0;
    private double idf = 0;
 
    public void addTf() {
        this.tf = tf + 1;
    }
 
    public void setIdf(double idf) {
        this.idf = idf;
    }
 
    public double getTf_idf() {
        return tf * idf;
    }
 
    @Override
    public String toString() {
        return String.format("tf = %d | idf = %.2f | tf-idf = %.2f", tf, idf, getTf_idf());
    }
}

Computing the TF value

  • There are several ways to compute the tf value.

    • The word frequency can be adjusted according to the length of the document.
    • With boolean frequency, the value is set to 1 even if the word appears only once.
    • The value can also be scaled with a logarithm.
  • The method below simply counts the number of occurrences.

// Imports needed at the top of the file
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import org.openkoreantext.processor.OpenKoreanTextProcessorJava;
import org.openkoreantext.processor.tokenizer.KoreanTokenizer;
import scala.collection.Seq;

// One (token -> Tf_idf) map per document
List<HashMap<String, Tf_idf>> documentList = new ArrayList<>();

// Compute TF
for (String data : dataList) {
    // Normalize the text
    CharSequence normalized = OpenKoreanTextProcessorJava.normalize(data);

    // Tokenize the normalized text
    Seq<KoreanTokenizer.KoreanToken> tokens = OpenKoreanTextProcessorJava.tokenize(normalized);

    // Create the map for this document
    HashMap<String, Tf_idf> documentMap = new HashMap<String, Tf_idf>();

    // Record the TF value of each token in documentMap
    for (String token : OpenKoreanTextProcessorJava.tokensToJavaStringList(tokens)) {
        if (!documentMap.containsKey(token)) {  // token not yet seen in this document
            Tf_idf tokenValue = new Tf_idf();
            tokenValue.addTf();

            documentMap.put(token, tokenValue);
        } else {                                // token already seen in this document
            Tf_idf tokenValue = documentMap.get(token);
            tokenValue.addTf();
        }
    }

    // Add documentMap to documentList
    documentList.add(documentMap);
}
  • Result: each documentMap now holds the raw occurrence count (tf) of every token in its document.
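
The alternative TF weightings listed at the start of this section (length-normalized, boolean, log-scaled) can be sketched as follows. This is a minimal illustration, not part of the original code: the class name TfVariants, its method names, and the exact formulas (count divided by document length, log10(1 + count)) are assumptions picked for the example, and the input is an already tokenized list of strings.

```java
import java.util.Arrays;
import java.util.List;

class TfVariants {
    // Raw count: how many times the term occurs in the document.
    static int rawCount(List<String> tokens, String term) {
        int count = 0;
        for (String token : tokens) {
            if (token.equals(term)) {
                count++;
            }
        }
        return count;
    }

    // Length-normalized: raw count divided by the number of tokens in the document.
    static double lengthNormalized(List<String> tokens, String term) {
        return tokens.isEmpty() ? 0.0 : (double) rawCount(tokens, term) / tokens.size();
    }

    // Boolean: 1 if the term occurs at least once, otherwise 0.
    static int booleanTf(List<String> tokens, String term) {
        return rawCount(tokens, term) > 0 ? 1 : 0;
    }

    // Log-scaled: log10(1 + raw count), which dampens very frequent terms.
    static double logScaled(List<String> tokens, String term) {
        return Math.log10(1 + rawCount(tokens, term));
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("knitting", "is", "my", "hobby", "and", "my", "passion");
        System.out.println("raw        = " + rawCount(tokens, "my"));
        System.out.println("normalized = " + lengthNormalized(tokens, "my"));
        System.out.println("boolean    = " + booleanTf(tokens, "my"));
        System.out.println("log-scaled = " + logScaled(tokens, "my"));
    }
}
```

Which variant fits best depends on the data: length normalization helps when documents differ a lot in size, while boolean and log scaling reduce the influence of heavily repeated words.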

Computing the IDF value

  • log(total number of documents / number of documents containing the token)
// Compute IDF
for (HashMap<String, Tf_idf> documentMap : documentList) {
    Iterator<String> tokenList = documentMap.keySet().iterator(); // tokens of this document

    while (tokenList.hasNext()) {
        String token = tokenList.next();
        Tf_idf tokenValue = documentMap.get(token);

        int hit_document = 0; // number of documents that contain the token

        for (int index = 0; index < documentList.size(); index++) {
            if (documentList.get(index).containsKey(token)) { // this document contains the token
                hit_document++;
            }
        }
        tokenValue.setIdf(Math.log10((double) documentList.size() / hit_document)); // log10(total documents / documents containing the token)
    }
}
  • Result
    • Words with higher tf-idf values (my, love, hate, hobby, is, passion) are the ones rarely mentioned in the other documents.
    • Words with lower tf-idf values (I, dogs, and, knitting) are the ones frequently mentioned in the other documents.
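
To check this grouping without the open-korean-text dependency, the whole calculation can be approximated in a self-contained sketch. It makes two assumptions that are not in the original code: tokens are produced by lowercasing and splitting on whitespace rather than by the real tokenizer, and the class name SimpleTfIdf and its score method are hypothetical helpers. The scoring itself follows the same tf * log10(N / df) rule as above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SimpleTfIdf {
    // Returns one map per document: token -> tf-idf score.
    static List<Map<String, Double>> score(String[] documents) {
        // Term frequency: raw count of each token per document.
        List<Map<String, Integer>> counts = new ArrayList<>();
        for (String document : documents) {
            Map<String, Integer> tf = new HashMap<>();
            // Assumption: lowercase and split on whitespace instead of using a real tokenizer.
            for (String token : document.toLowerCase().trim().split("\\s+")) {
                tf.merge(token, 1, Integer::sum);
            }
            counts.add(tf);
        }

        // Document frequency: number of documents that contain each token.
        Map<String, Integer> df = new HashMap<>();
        for (Map<String, Integer> tf : counts) {
            for (String token : tf.keySet()) {
                df.merge(token, 1, Integer::sum);
            }
        }

        // tf-idf = tf * log10(total documents / documents containing the token)
        List<Map<String, Double>> scores = new ArrayList<>();
        for (Map<String, Integer> tf : counts) {
            Map<String, Double> tfIdf = new HashMap<>();
            for (Map.Entry<String, Integer> entry : tf.entrySet()) {
                double idf = Math.log10((double) documents.length / df.get(entry.getKey()));
                tfIdf.put(entry.getKey(), entry.getValue() * idf);
            }
            scores.add(tfIdf);
        }
        return scores;
    }

    public static void main(String[] args) {
        String[] dataList = {
            "I love dogs",
            "I hate dogs and knitting",
            "Knitting is my hobby and my passion",
        };
        List<Map<String, Double>> scores = score(dataList);
        for (int i = 0; i < scores.size(); i++) {
            System.out.println("document " + i + ": " + scores.get(i));
        }
    }
}
```

Under these assumptions, tokens found in a single document get an idf of log10(3/1), tokens found in two documents get log10(3/2), and "my" scores highest because it occurs twice in the only document that contains it.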

Closing

  • The purpose of TF-IDF is to give a high score to tokens that are mentioned often in the given document but rarely in the other documents.
  • There is no single fixed formula. Adapt TF-IDF to the data you are analyzing and experiment with different variations.
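
As one example of such a variation, the IDF term can be smoothed so that a token appearing in every document does not drop to exactly zero. The sketch below is an assumption for illustration only: the class name IdfVariants and its methods are hypothetical, and the smoothed formula ln((1 + N) / (1 + df)) + 1 is one common choice (it resembles the smoothing documented for scikit-learn's TfidfTransformer), not something used in this post.

```java
class IdfVariants {
    // Plain IDF as used in this post: log10(N / df).
    static double plain(int totalDocuments, int documentsWithToken) {
        return Math.log10((double) totalDocuments / documentsWithToken);
    }

    // Smoothed IDF (assumed variant): ln((1 + N) / (1 + df)) + 1.
    // Adding 1 to both counts avoids division by zero for unseen tokens,
    // and the trailing + 1 keeps tokens found in every document above zero.
    static double smoothed(int totalDocuments, int documentsWithToken) {
        return Math.log((1.0 + totalDocuments) / (1.0 + documentsWithToken)) + 1.0;
    }

    public static void main(String[] args) {
        // A token that appears in all 3 documents.
        System.out.println("plain    = " + plain(3, 3));
        System.out.println("smoothed = " + smoothed(3, 3));
    }
}
```

With the plain formula, a token present in all documents contributes nothing to the score regardless of its tf; with the smoothed one it still contributes its raw count.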
