Simpy
 
era, member since Jun 19, 2006
Links tagged "language.identification"
1 - 14 of 14
 
On-line demo of Xerox's language identifier (commercial). 47 languages, not very actively maintained. I believe this was originally created by one of their Finnish researchers at XRCE Grenoble once upon a time ... I also got the impression that this one was the first to make a conscious effort at supporting different character set encodings. Fun Observation: the Danish sample Sentence uses ancient German-Style Capitalization Rules (-: ... and the Norwegian is (predictably) unlabelled, although I believe it's Bokmål. And it's incorrectly punctuated.
by era 2006-06-19 01:25 history · language · language.identification · server · tool · 20060619-0123
http://www.xrce.xerox.com/competencies/content-analysis/tools/guesser-ISO-8859-1.en.html
C reimplementation of TextCat, open source. More solid language models than TextCat, and they use a similar format, so you can use the mguesser models with TextCat and vice versa. It's written in C, so it's faster, too. The web page is hideous, but the tool is good. This is available as a Debian package as well. See also
by era 2006-06-19 01:24 02a · download · language · language.identification · opensource · tool · 20060619-0123
http://www.mnogosearch.org/guesser/
Is the language "Persian" or "Farsi"? (Apparently, "Persian," really.)
by era 2006-06-19 01:24 article · language · language.identification · persian · 20060619-0123
http://www.iranian.com/Features/Dec97/Persian/
Vector-space-based language identification (commercial). There's a link there to a paper which was also published at the 32nd Hawaii International Conference on System Sciences (1999) -- I'll try to find that and submit it to CiteSeer too. The vector-space cosine distance measure makes more theoretical sense to me than the others I've seen, but I haven't had the time to compare their performance head-to-head.
by era 2006-06-19 01:24 language · language.analysis · language.identification · tool · 20060619-0123
http://www-306.ibm.com/software/globalization/topics/linguini/welcome.jsp
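For the record, here is roughly what a vector-space cosine measure looks like, as a minimal sketch and not the commercial tool's actual method. It assumes character trigram counts as the vector components; the names `trigram_counts`, `cosine_similarity`, `guess_language` and the toy `PROFILES` are all made up for this illustration.

```python
import math
from collections import Counter


def trigram_counts(text):
    """Count overlapping character trigrams in the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def guess_language(text, profiles):
    """Return the label whose profile vector is closest in angle to the text."""
    vector = trigram_counts(text)
    return max(profiles, key=lambda label: cosine_similarity(vector, profiles[label]))


# Toy training snippets, invented for this example only (assumption).
PROFILES = {
    "english": trigram_counts("the quick brown fox jumps over the lazy dog and the cat"),
    "swedish": trigram_counts("den snabba bruna räven hoppar över den lata hunden och katten"),
}

print(guess_language("the dog and the fox", PROFILES))
```

Real language models would be trained on far more text than two sentences; the point here is only that the score depends on the angle between vectors, not on document length.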
The software which runs the languid site, apparently
by era 2006-06-19 01:23 02a · download · language · language.identification · module · opensource · perl · tool · 20060619-0123
http://search.cpan.org/~mceglows/Language-Guess-0.01/
Gertjan van Noord's language identification tool in Perl, with a demo. See also the "competitors" page for links to more similar tools.
by era 2006-06-19 01:23 02a · download · language · language.identification · module · opensource · perl · server · tool · 20060619-0123
http://odur.let.rug.nl/~vannoord/TextCat/
UTF-8 language guesser, sort of TextCat-based (?) ... or so it sez on the TextCat site. It also says the code is GPL but I haven't figured out where to download it, and/or the language models. See also
by era 2006-06-19 01:23 02a · language · language.identification · server · tool · 20060619-0123
http://languid.cantbedone.org/
As the site grows, it will be increasingly useful to be able to focus on languages you understand. Ideally, the site would be able to supply a meaningful default guess for every field, and a user preference for which languages to display and/or suggest. See also the Accept-Language HTTP header. Gertjan van Noord's TextCat is a fairly popular Perl-based language identification module. (It's not actually a proper module, but you can get a modularized version e.g. from the SpamAssassin sources.) Samma på svenska. Ja suomeksikin. ("Same in Swedish. And in Finnish too.")
by era 2006-06-19 01:23 blog · bugs · deliriousbugs · deliriouswishlist · erablog · language · language.identification · rubric_0.09 · rubric_0.10 · 20060619-0123
http://de.lirio.us/rubric/entry/5407
Wyard, Rose (1997)
by era 2006-06-19 01:23 article · citeseer · corpus · language · language.analysis · language.identification · science · similarity · theory · 20060619-0123
http://citeseer.ist.psu.edu/wyard97internet.html
Penelope Sibun, Jeffrey C. Reynar (1996)
by era 2006-06-19 01:23 article · citeseer · language · language.analysis · language.identification · science · similarity · theory · 20060619-0123
http://citeseer.ist.psu.edu/sibun96language.html
Kenneth Beesley (1988). Very crude, but hey, it's very old, too
by era 2006-06-19 01:23 article · citeseer · history · language · language.analysis · language.identification · science · similarity · theory · 20060619-0123
http://citeseer.ist.psu.edu/beesley88language.html
Cavnar, Trenkle (1994) - the popular paper behind TextCat et al. The ranking algorithm is kind of screwy, until you think of it as edit distance in an alphabet where each n-gram is a distinct symbol. Maybe it's still screwy.
by era 2006-06-19 01:23 article · citeseer · language · language.analysis · language.identification · science · similarity · theory · 20060619-0123
http://citeseer.ist.psu.edu/68861.html
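To make the ranking idea concrete, here is a minimal sketch of an "out-of-place" rank comparison in the spirit of the paper, not a faithful copy of it or of TextCat. The underscore padding, the cutoff of 300 n-grams, the alphabetical tie-breaking, and all names (`ngram_profile`, `out_of_place`, `classify`, the toy texts) are assumptions made for this illustration.

```python
from collections import Counter


def ngram_profile(text, max_n=5, top=300):
    """Rank the most frequent character n-grams (n = 1..max_n), most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries (simplified padding, an assumption)
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Descending count; ties broken alphabetically so the ranking is deterministic.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return ranked[:top]


def out_of_place(document_profile, language_profile):
    """Sum of rank displacements; n-grams unknown to the language get a fixed maximum penalty."""
    rank_in_language = {gram: rank for rank, gram in enumerate(language_profile)}
    max_penalty = len(language_profile)
    return sum(
        abs(rank - rank_in_language[gram]) if gram in rank_in_language else max_penalty
        for rank, gram in enumerate(document_profile)
    )


def classify(text, language_profiles):
    """Pick the language whose profile has the smallest out-of-place distance."""
    doc = ngram_profile(text)
    return min(language_profiles, key=lambda label: out_of_place(doc, language_profiles[label]))


# Toy training texts, invented for this example only (assumption).
EN_TEXT = "the quick brown fox jumps over the lazy dog and the cat"
SV_TEXT = "den snabba bruna räven hoppar över den lata hunden och katten"
LANGUAGE_PROFILES = {"english": ngram_profile(EN_TEXT), "swedish": ngram_profile(SV_TEXT)}

print(classify(EN_TEXT, LANGUAGE_PROFILES))
```

Only the rank of each n-gram matters here, never its raw frequency, which is what makes the measure feel like a distance between two orderings.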
Carter (1994). Spoken language models, but still
by era 2006-06-19 01:23 article · citeseer · language · language.analysis · language.identification · science · similarity · theory · 20060619-0123
http://citeseer.ist.psu.edu/23437.html