NLP at Meedan

by Clarissa Castellã Xavier

This is the first in a series of posts about NLP (Natural Language Processing) work at Meedan. Meedan’s NLP research and development focuses on adding value to our key products, Check and Bridge, which are collaborative systems for news verification and social media translation, respectively.

The NLP tools we have developed at Meedan are gathered in an API called Alegre. Alegre currently offers the following features: Language Identifier, Translation Memory, Glossary and Dictionary. We are particularly interested in developing solutions that work across a wide range of languages (including long-tail ones), scale well and run efficiently. To achieve these goals, the Translation Memory, Glossary and Dictionary features are implemented using Elasticsearch, an open source, distributed search engine. It gives us the performance and scalability we need, and it works with a large number of languages thanks to its multilingual text analysis plugins.

Alegre works with an Elasticsearch schema that helps us achieve our goals:

  • Each item (glossary term, translation entry, etc.) is stored in a field that’s named according to the ISO 639 language code
  • Each item is associated with a language analyzer corresponding to the item’s language code (or the default analyzer if this language does not have its own, e.g. for small-community dialects)
  • Each entry also includes a context field where we can add any valuable non-linguistic information such as project, user, source or geolocation.

The important point is that the contextual information can be changed according to the needs of each project, and Elasticsearch lets us search within the context with high performance even though it does not have a fixed structure. Below we present an excerpt of the Alegre schema as an example:

{
  "alegre": {
    "mappings": {
      "glossary": {
        "properties": {
          "ar": {
            "type": "string",
            "analyzer": "arabic"
          },
          "es": {
            "type": "string",
            "analyzer": "spanish"
          },
          ...
          "context": {
            "properties": {
                "project": {"type": "string"},
                "user": {"type": "string"}
            }
          }
        }
      }
    }
  }
}
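
Because the schema follows a regular pattern (one analyzed field per language code, plus a context object), a mapping like the excerpt above could be generated rather than written by hand. The following is a minimal Python sketch under stated assumptions: the function name `build_glossary_mapping`, the `ANALYZERS` table and the use of `"standard"` as the fallback analyzer are illustrative choices, not Alegre's actual code.

```python
# Illustrative subset: ISO 639-1 codes mapped to Elasticsearch's built-in language analyzers.
ANALYZERS = {"ar": "arabic", "en": "english", "es": "spanish", "pt": "portuguese"}

# Assumption: languages without a dedicated analyzer fall back to the "standard" analyzer.
DEFAULT_ANALYZER = "standard"


def build_glossary_mapping(language_codes, context_fields=("project", "user")):
    """Return a mapping body with one analyzed field per language plus a context object."""
    properties = {}
    for code in language_codes:
        properties[code] = {
            "type": "string",
            "analyzer": ANALYZERS.get(code, DEFAULT_ANALYZER),
        }
    # The context object holds non-linguistic information; its fields can vary per project.
    properties["context"] = {
        "properties": {name: {"type": "string"} for name in context_fields}
    }
    return {"glossary": {"properties": properties}}


# "gn" (Guarani) has no entry in the table, so it gets the default analyzer.
mapping = build_glossary_mapping(["ar", "es", "gn"])
```

Keeping the language-to-analyzer table separate from the mapping logic means that supporting a new language is a one-line change, and unknown languages still get a usable field.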

Next is an example Translation Memory document, which follows the same type design.

{
  "_index": "alegre",
  "_type": "translationMemory",
  "_id": "AVc0HS8lNHOnL_ufSvRr",
  "_source": {
    "en": "one car",
    "pt": "um carro",
    "context": {
      "provider": "translation",
      "project": "bridge",
      "user": "ccx"
    }
  }
}
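
To make the shape of such an entry concrete, here is a small Python sketch that assembles the `_source` part of the document above. The helper name `build_translation_entry` is hypothetical; the only structure it assumes is the one shown in the example: one field per language code, plus a free-form `context` object.

```python
def build_translation_entry(source_lang, source_text, target_lang, target_text, **context):
    """Return a document body: one field per language code plus a free-form context."""
    return {
        source_lang: source_text,
        target_lang: target_text,
        # Any keyword arguments become context fields, so each project can add its own.
        "context": dict(context),
    }


entry = build_translation_entry(
    "en", "one car", "pt", "um carro",
    provider="translation", project="bridge", user="ccx",
)
```

Since the context is passed as arbitrary keyword arguments, a project that needs, say, a source or geolocation field can add it without changing the helper.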

And here is a query that searches for the phrase “one car” (1), keeps only entries that have a Portuguese translation (2), and restricts results to the translation provider, which is stored as context information (3).

POST alegre/translationMemory/_search?pretty=true
{
  "query": {
    "bool": {
      "must": [
        {
          "match_phrase": {
            "en": "one car" (1)
          }
        },
        {
          "match": {
            "context.provider": "translation" (3)
          }
        }
      ],
      "filter": {
        "exists": { "field": "pt" } (2)
      }
    }
  }
}
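
The same query shape works for any language pair and any set of context fields. As a rough illustration, the Python sketch below builds the request body as a plain dictionary; the function name `build_translation_query` is hypothetical, and sending the body to Elasticsearch (for example, as JSON in the POST request shown above) is left out.

```python
def build_translation_query(source_lang, phrase, target_lang, **context):
    """Build a bool query: phrase match on the source field, context matches, target field exists."""
    # (1) The phrase must appear in the source-language field.
    must = [{"match_phrase": {source_lang: phrase}}]
    # (3) Each context constraint becomes a match clause on "context.<name>".
    for key, value in context.items():
        must.append({"match": {"context." + key: value}})
    return {
        "query": {
            "bool": {
                "must": must,
                # (2) Keep only entries that have a translation in the target language.
                "filter": {"exists": {"field": target_lang}},
            }
        }
    }


query = build_translation_query("en", "one car", "pt", provider="translation")
```

The `exists` clause sits in `filter` rather than `must` because it only includes or excludes entries and does not need to contribute to the relevance score.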

In the next post we will present Alegre’s Language Identification feature. Stay tuned!


Follow Clarissa on GitHub.