Tools
The following tools were used, enhanced or created for this project.
- NATools
NATools is a set of tools for processing parallel corpora. It includes a sentence aligner, an extractor of probabilistic translation dictionaries, a word aligner and a host of other tools to study the alignment of parallel corpora.
During the Per-Fide project, NATools was improved to extract probabilistic translation dictionaries (PTDs) efficiently from large corpora, and other tools were added:
- Lingua::PTD: tools for handling translation dictionaries;
- Lingua::PTD::More: tools for extracting resources such as UMTS from PTDs;
- Math::KullbackLeibler::Discrete: a module for comparing distributions in probabilistic translation dictionaries, based on the Kullback-Leibler divergence.
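To illustrate the kind of comparison involved, here is a minimal Python sketch of the Kullback-Leibler divergence between two discrete distributions. It is not the API of Math::KullbackLeibler::Discrete. The function names, the epsilon smoothing used to avoid zero probabilities, and the sample translation probabilities are all assumptions made for this example.

```python
import math


def smooth(dist, support, epsilon):
    """Add epsilon to every outcome in the support and renormalize to sum to 1."""
    raw = {x: dist.get(x, 0.0) + epsilon for x in support}
    total = sum(raw.values())
    return {x: v / total for x, v in raw.items()}


def kl_divergence(p, q, epsilon=1e-9):
    """Return D_KL(P || Q) in bits for two discrete distributions given as dicts."""
    support = set(p) | set(q)
    # Smoothing is an assumption of this sketch: it keeps log(p/q) finite
    # when an outcome is missing from one of the two distributions.
    ps = smooth(p, support, epsilon)
    qs = smooth(q, support, epsilon)
    return sum(ps[x] * math.log2(ps[x] / qs[x]) for x in support)


# Hypothetical translation probabilities for one source word in two dictionaries.
ptd_a = {"casa": 0.7, "lar": 0.2, "moradia": 0.1}
ptd_b = {"casa": 0.5, "lar": 0.4, "edifício": 0.1}

print(kl_divergence(ptd_a, ptd_a))  # identical distributions
print(kl_divergence(ptd_a, ptd_b))  # P relative to Q
print(kl_divergence(ptd_b, ptd_a))  # the measure is not symmetric
```

The divergence is zero only when the two distributions agree, and it grows sharply when one dictionary assigns probability to a translation that the other lacks, so the choice of smoothing constant strongly affects the result.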
- Open Corpus Workbench
The IMS Open Corpus Workbench (CWB) is a collection of free tools for managing and querying large corpora (on the order of 10 million to 2 billion words) with linguistic annotations.
For Open-CWB, the following tools were developed:
- XML::TMX::CWB: a tool for the direct incorporation of translation memories in the Open-CWB system;
- CWB::CQP::More: a high-level interface for the Open-CWB Perl modules;
- POSIX::Open3: a module that allows Open-CWB to be used from web pages, particularly within the Dancer framework.
- JSpell
Jspell is a morphological analyzer derived from the ispell spell checker (Jspell = ++ispell). It was adapted for Portuguese, although dictionaries for other languages are also available.
- FreeLing3
FreeLing is a template library developed by Lluís Padró for the lexical and syntactic processing of several languages, including all the languages of the Per-Fide project except German.
During the life span of Per-Fide, the following tools were created:
- Lingua::FreeLing2: an interface to version 2 of FreeLing. It was discontinued when FreeLing3 became available.
- Lingua::FreeLing3: a Perl interface to version 3 of FreeLing.
- Lingua::FreeLing3::Utils: a set of features and utilities built on top of Lingua::FreeLing3.
- XML::TMX
A Perl library for handling translation memories. It includes tools for tokenizing and tagging corpora using the Lingua::FreeLing3 library.
- XML::DT::Sequence: a system for processing large XML files based on the repetition of items.
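The idea of processing a large XML file one repeated item at a time can be sketched in Python with the standard library's incremental parser. This is only an illustration of the general technique, not the interface of XML::DT::Sequence or XML::TMX. The sample document is a simplified translation memory (its header is not a complete TMX header), and the function name `iter_translation_units` is hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Simplified sample translation memory, kept in memory for this example.
SAMPLE_TMX = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Good morning</seg></tuv>
      <tuv xml:lang="pt"><seg>Bom dia</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Thank you</seg></tuv>
      <tuv xml:lang="pt"><seg>Obrigado</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def iter_translation_units(source):
    """Yield one {language: segment} dict per <tu> element, one unit at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        yield {
            tuv.get(XML_LANG): tuv.findtext("seg", default="")
            for tuv in elem.iter("tuv")
        }
        # Drop the children of the processed <tu> so memory use stays small.
        elem.clear()


units = list(iter_translation_units(io.BytesIO(SAMPLE_TMX)))
for unit in units:
    print(unit)
```

Because each repeated element is handled and released as soon as its closing tag is read, the same loop works on a file opened from disk without loading the whole document into memory.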
- TreeTagger
TreeTagger is a well-known morphosyntactic tagger. It was used because FreeLing3 does not support German. Within the scope of Per-Fide, the following modules were developed:
- Lingua::TreeTagger::Installer: a tool that automates the installation of TreeTagger and its language models.
- Lingua::TreeTagger: although it was not developed as part of Per-Fide, project members have been involved in improving the tool.
- Lingua::Identify::CLD
A Perl interface and build system were developed for the Chrome Language Detection Library, which was created by Google.