keybr.com
Update Greek dictionary for linguistic correctness and completeness
#416
Open

agoatboi wants to merge 1 commit into aradzie:master from agoatboi:master
agoatboi
agoatboi 27 days ago ❤ 1

The goal of this PR is to manually check the Greek dictionary and replace all occurrences of:

  • Incorrect spellings
  • Incorrect stress signs
  • English names transliterated in Greek
  • Words with a double stress symbol (valid, but only appearing in particular sentence structures).

With alternatives featuring:

  • Common words (and a few scientific or vernacular/idiomatic ones)
  • Correctly stressed word variations
  • Greek names (historical or current)
  • A few words marked with (¨) "διαλυτικά" (dialytika) and (΅) "διαλυτικά με τόνο" (dialytika with tonos), which occasionally come up in writing.

In all replacements an effort was made to keep the letter distribution largely the same, but I am fairly sure that the average word length has increased. The file was checked for duplicates throughout, and none were introduced.
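
For reference, the duplicate check was nothing fancier than the following kind of script (just an illustrative sketch with a placeholder file name, not part of this PR; it assumes the list is a flat JSON array of strings):

```ts
// Standalone sketch: report duplicate entries in a word list.
// Assumes the list is a flat JSON array of strings; the file name is illustrative.
import { readFileSync } from "node:fs";

const words = JSON.parse(readFileSync("words-el.json", "utf8")) as string[];

const seen = new Set<string>();
const duplicates = new Set<string>();
for (const word of words) {
  if (seen.has(word)) {
    duplicates.add(word);
  }
  seen.add(word);
}

console.log(`total: ${words.length}, unique: ${seen.size}`);
console.log("duplicates:", [...duplicates]);
```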

semanticdiff-com
semanticdiff-com 27 days ago (edited 25 days ago)

Review changes with SemanticDiff

Changed Files
  • packages/keybr-keyboard/lib/language.ts (88% smaller)
  • packages/keybr-keyboard-io/lib/parser/diacritics.ts (0% smaller)
  • packages/keybr-keyboard/lib/layout/el_gr.ts (0% smaller)
  • packages/keybr-phonetic-model/assets/model-el.data (unsupported file format)
agoatboi marked this pull request as draft 27 days ago
aradzie force-pushed the master branch from dc93f80b to 23a157ca 25 days ago
aradzie
aradzie 25 days ago

George, thank you for your contribution!
I know from experience that creating a quality word list is a tedious and time-consuming task.
(I can't count how many hours I spent on the English list.)
That being said, the file you are modifying is automatically generated, so your changes will be lost; we use a different approach.
It's a bit complicated, so let me explain.

First, we have a separate repository where we develop the word frequency dictionaries.
I compared your changes with the existing file; the list sizes are the same, so I assume you only fixed existing words and did not insert or delete any.
Based on this assumption I replaced the words in the original corpus repository.
Next, I passed the words through an automatic spell checker to filter out any invalid words.
Then I created a spreadsheet in Google Docs.
This is the usual approach: I create a spreadsheet for a native speaker to review. It seems easier this way because it does not require any coding from the reviewer.
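
Roughly, the spell-check step boils down to something like this (a throwaway sketch, not the exact script; it assumes the hunspell CLI with a Greek dictionary such as el_GR is installed, and the file names are placeholders):

```ts
// Throwaway sketch: split a word list into accepted/rejected using hunspell.
// Assumes `hunspell` is installed with a Greek dictionary named "el_GR".
import { execFileSync } from "node:child_process";
import { readFileSync, writeFileSync } from "node:fs";

const words = readFileSync("el-words.txt", "utf8").split("\n").filter(Boolean);

// `hunspell -l` prints only the misspelled words from stdin, one per line.
const output = execFileSync("hunspell", ["-d", "el_GR", "-l"], {
  input: words.join("\n"),
  encoding: "utf8",
});
const rejected = new Set(output.split("\n").filter(Boolean));

writeFileSync("accepted.txt", words.filter((w) => !rejected.has(w)).join("\n"));
writeFileSync("rejected.txt", [...rejected].join("\n"));
```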

The spreadsheet has three columns. If the third column is not empty, then it contains the original word before you fixed it, and such a row is highlighted in a different color.
There is also a separate sheet, "rejected", with the words rejected by the spell checker. You may want to take a look at it.

So, let me ask you to review the spreadsheet.
If you find a spelling error, just fix the word in the first column.
If you find an invalid word, then delete the row or simply blank out the cell. (Put the cursor on the invalid word then press Backspace.)

If you feel enthusiastic, then please also remove any potentially triggering words. As an example, take a look at the blacklist files for the English language: profanity.txt and sensitive.txt.

Thanks again!

aradzie
aradzie 25 days ago

Hey, I had an idea. I took the original English blacklist of sensitive words and asked an AI bot to translate it into Greek. The translation may not be perfect, but it's a start. Here's the result.

The sensitive words are any words about race, ideology, religion, or sex. They can end up in random combinations that can trigger some people.

Again, if you don't feel too enthusiastic, just skip this part.

So, please take a look at the updated spreadsheet. Let me know when you are done, and I'll update the website right away.

aradzie
aradzie 25 days ago

I just found out that you can use the built-in Google Docs spell checker to review words quickly. It's a good first pass for checking the words automatically before doing any manual work. The spell checker is available in the menu "Tools" -> "Spelling" -> "Spell check".

agoatboi
agoatboi 25 days ago (edited 25 days ago)

Hi aradzie,

Thanks a lot for your enthusiastic response and wealth of information!

I actually figured out I had modified the wrong file a little after finishing my (20-hour-plus 💀) run, when I tried to run the development server locally and, obviously, the changes weren't integrated. It was at this point that I marked the PR as a draft, hoping to work on it over the weekend. I now realize I should have also communicated this in the comments of the PR, but honestly it seemed like it was almost ready.

Unfortunately, what you have in the spreadsheet with the 1-to-1 mapping won't quite work, because I replaced some seemingly frequent (but wrong) words with rather rarer ones. When I realized that word frequencies had to be involved, I postponed this until I could find a suitable dataset to estimate them, or else sort them by hand based on a heuristic/intuitive sense. This will still take me some time, but if you have any good ideas I'm happy to hear them.
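
For reference, the kind of estimation I have in mind is nothing more than counting words in a large plain-text corpus, roughly like this (just a sketch; the file names are placeholders):

```ts
// Sketch: estimate word frequencies from a plain-text Greek corpus.
// File names are placeholders; any large UTF-8 text collection would do.
import { readFileSync, writeFileSync } from "node:fs";

const text = readFileSync("corpus-el.txt", "utf8").toLowerCase();

// \p{L} matches any Unicode letter, so accented Greek characters are kept intact.
const counts = new Map<string, number>();
for (const word of text.match(/\p{L}+/gu) ?? []) {
  counts.set(word, (counts.get(word) ?? 0) + 1);
}

// Sort by descending count; the exact output format of el_50k.csv may differ.
const sorted = [...counts.entries()].sort((a, b) => b[1] - a[1]);
writeFileSync(
  "el-frequencies.csv",
  sorted.map(([word, count]) => `${word} ${count}`).join("\n"),
);
```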

I am now at the point where I've read the documentation further and decided to update both the layout and the language to add the διαλυτικα με τονο (dialytika with tonos) symbol that appears in a few words, but I can't quite get it to work. It seems as though the alphabet variable is cached somewhere, because not even by deleting symbols from it can I get the profile statistics page to update. I got lost among the various nested components trying to understand how it works, and regretfully, I'm not proficient in JS.
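
For context, these are the characters I'm trying to support; a tiny standalone snippet (nothing to do with keybr internals) shows the code points involved:

```ts
// Standalone snippet: the precomposed vowels with dialytika / dialytika and tonos,
// and their canonical (NFD) decompositions. Plain Unicode facts, not keybr code.
const hex = (c: string) =>
  "U+" + c.codePointAt(0)!.toString(16).toUpperCase().padStart(4, "0");

for (const ch of ["ϊ", "ΐ", "ϋ", "ΰ"]) {
  const decomposed = [...ch.normalize("NFD")].map(hex).join(" ");
  console.log(`${ch} ${hex(ch)} -> ${decomposed}`);
}
// e.g. ΐ (U+0390) -> U+03B9 U+0308 U+0301 (iota + combining diaeresis + combining acute)
```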

Could I perhaps push the commits I have here so that you can guide me on how to do it? I think I'm very close.

As for the sensitive words, I found a few while going through the list, and in fact added a few of them myself, since they (rather unsurprisingly) show up often in Greek texts and occasionally have unique trigrams the algorithm could help you get used to. I am willing to go through the list again and mark those for you. Perhaps their inclusion could be an optional setting to be toggled as desired? I am not sure if that is already implemented.

agoatboi force-pushed from 2eca5288 to d2b16051 25 days ago
agoatboi
agoatboi 25 days ago

Force-pushed the commit since I'm removing the original one about changing the JSON of words. I'll instead make a PR to the corpus repository with updated words and frequencies when I can muster the courage to estimate the latter.

agoatboi add dialytika tonos symbol in Greek
06525a64
agoatboi force-pushed from d2b16051 to 06525a64 25 days ago
aradzie
aradzie 25 days ago

There must be a way to salvage your work. 20 hours is a lot of time. You changed around 1076 words in the original list. I think we can assume that the remaining ~9000 words are valid.

I added a new sheet with only changed words. Maybe you can review this smaller list? If a word is misspelled, fix it. If it is invalid, make it blank. I will find a way to reconcile your changes in the spreadsheet to the corpus repository.
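
Extracting the changed rows is simple because the two lists have the same length, so they can be paired by position; here is a sketch with placeholder file names (not the actual script):

```ts
// Rough sketch: extract only the changed words by pairing the two lists positionally
// (same length, so line N in one list corresponds to line N in the other).
// File names are placeholders.
import { readFileSync, writeFileSync } from "node:fs";

const before = readFileSync("words-original.txt", "utf8").split("\n").filter(Boolean);
const after = readFileSync("words-fixed.txt", "utf8").split("\n").filter(Boolean);

const changed: string[] = [];
for (let i = 0; i < before.length; i++) {
  if (before[i] !== after[i]) {
    // Column 1: the fixed word, column 3: the original word (column 2 left empty).
    changed.push(`${after[i]},,${before[i]}`);
  }
}

writeFileSync("changed-words.csv", changed.join("\n"));
console.log(`${changed.length} changed words`);
```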

Speaking of the corpus repository: the original word frequency dictionary is el_50k.csv. (I think it comes from parsing a movie subtitles database, but I don't remember exactly.) It has around 50,000 words, so it's not a problem to cull it aggressively; there is plenty of room.

If you want to open a pull request, it can be as simple as a text file with invalid words named lang-el/blacklist-xyz.txt. The corpus repository is a bunch of one-time throwaway scripts. For this reason I do not pay too much attention to the quality of the scripts, and you probably shouldn't either ;)
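
Culling against such a file would then be a few throwaway lines, along these lines (sketch only; the blacklist name follows the placeholder above, and I'm assuming one word-plus-count entry per line in el_50k.csv):

```ts
// Sketch: drop blacklisted words from the frequency dictionary.
// File names are placeholders; assumes one word (plus count) per line in el_50k.csv.
import { readFileSync, writeFileSync } from "node:fs";

const blacklist = new Set(
  readFileSync("lang-el/blacklist-xyz.txt", "utf8").split("\n").filter(Boolean),
);

const lines = readFileSync("lang-el/el_50k.csv", "utf8").split("\n").filter(Boolean);
const kept = lines.filter((line) => !blacklist.has(line.split(/[\s,]+/)[0]));

writeFileSync("lang-el/el_50k.csv", kept.join("\n"));
console.log(`kept ${kept.length} of ${lines.length} entries`);
```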
