Question Does size matter - and other requirements.....
- mikkitobi
- Topic Author
- Visitor
In a book search for "Russell Soundex" I find that the term is used mainly as a courtesy to Russell, not to distinguish it from other algorithms. Moreover, if I search for just Soundex, I find that in the first several dozen hits, none of them prefixes Soundex with Russell or NARA, though all of them seem to be using the term to mean that particular algorithm. No matter. The main thing is that I now understand that when you said Soundex produces several encodings, you meant that there are several types of Soundex.
Technically, as far as I know, each type in itself is deterministic.
And I agree that the original is poor. And that ALL of them are worse than poor for French.
Let me put it this way.... people who deal with and use soundex frequently will refer to that particular flavour of soundex as Russell or NARA soundex.
No, I did NOT mean there are several types of soundex. I meant that DECENT soundex and phonetic matching encoding systems do not produce a 1-to-1 mapping of names to codes. They try to allow for variations in pronunciation and different possible languages, and they branch, so a single name can produce a varying number of possible codes. In terms of database design this means you can no longer use a single field containing a single code, index it and search on it efficiently. Instead you must use a separate, related code table linked to the original data table by record IDs or name. I gave examples in a previous post of names and methods that produce more than one soundex or phonetic code.
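To make the database point concrete, here is a minimal sketch of that one-to-many layout, using Python's sqlite3 module purely for illustration. The table and column names (person, surname_code) and the find_by_codes helper are hypothetical and are not webtrees' schema; the sample codes are the EXACT codes for forman and michelson quoted in this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (
        id      INTEGER PRIMARY KEY,
        surname TEXT NOT NULL
    );
    -- One row per (person, code): a name may own any number of codes.
    CREATE TABLE surname_code (
        person_id INTEGER NOT NULL REFERENCES person(id),
        code      TEXT NOT NULL,
        PRIMARY KEY (person_id, code)
    );
    -- The index that a single code column on person can no longer provide.
    CREATE INDEX ix_surname_code ON surname_code (code);
""")

# Sample data: surname -> list of phonetic codes (taken from this thread).
SAMPLE = {
    "forman": ["forman"],
    "michelson": ["mixelzon", "mixelson"],
}
for surname, codes in SAMPLE.items():
    cur = conn.execute("INSERT INTO person (surname) VALUES (?)", (surname,))
    conn.executemany(
        "INSERT INTO surname_code (person_id, code) VALUES (?, ?)",
        [(cur.lastrowid, code) for code in codes],
    )


def find_by_codes(conn, codes):
    """Return surnames that share at least one code with `codes`."""
    if not codes:
        return []
    marks = ",".join("?" * len(codes))
    rows = conn.execute(
        "SELECT DISTINCT p.surname FROM person p "
        "JOIN surname_code c ON c.person_id = p.id "
        f"WHERE c.code IN ({marks}) ORDER BY p.surname",
        list(codes),
    )
    return [row[0] for row in rows]
```

A search encodes the input name into its own list of codes and passes that list to find_by_codes; a record matches if any one of its stored codes is in the list.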
Michael
- fisharebest
- Offline
- Administrator
Surname: forman EXACT: forman APPROX: forman, formon, fYrman, fYrmon
Surname: michelson EXACT: mixelzon, mixelson APPROX: mQxYlzon, mQxilzon, mixYilzon, mixilzon
In terms of database design, we need to know the characteristics of these strings. i.e. things like
1) maximum length?
2) are they case-sensitive - i.e. is Q different to q?
Greg Roach - greg@subaqua.co.uk - @fisharebest@phpc.social - fisharebest.webtrees.net
- mikkitobi
- Topic Author
- Visitor
In terms of database design, we need to know the characteristics of these strings. i.e. things like
1) maximum length?
2) are they case-sensitive - i.e. is Q different to q?
1) The maximum length would be the same as the maximum length of the name fields themselves. The field type(s) and lengths of the phonetic code fields should be identical to those of the fields/names they are encoding.
2) Yes, they are case-sensitive.
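As a hedged illustration of why the case-sensitivity answer matters for the schema, the sketch below stores the same two codes under a case-sensitive and a case-insensitive comparison rule. SQLite (via Python's sqlite3) is used only because it is self-contained; the table names are hypothetical, and "mqxylzon" is a made-up lower-cased variant of the code mQxYlzon quoted above, not real algorithm output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE code_binary (code TEXT COLLATE BINARY);  -- case-sensitive
    CREATE TABLE code_nocase (code TEXT COLLATE NOCASE);  -- case-insensitive
""")

# First code is quoted in this thread; the second is hypothetical.
codes = ["mQxYlzon", "mqxylzon"]
for table in ("code_binary", "code_nocase"):
    conn.executemany(
        f"INSERT INTO {table} (code) VALUES (?)", [(c,) for c in codes]
    )


def count_matches(table, code):
    """Count stored codes that compare equal to `code` in the given table."""
    sql = f"SELECT COUNT(*) FROM {table} WHERE code = ?"
    return conn.execute(sql, (code,)).fetchone()[0]


binary_hits = count_matches("code_binary", "mQxYlzon")
nocase_hits = count_matches("code_nocase", "mQxYlzon")
```

Under the case-insensitive rule the two distinct codes collapse into one match, which would add false positives. In MySQL the same concern would presumably be handled by choosing a binary or case-sensitive collation for the code column, since many of its default collations ignore case.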
I am not sure how you calculate the existing Russell soundex and Daitch-Mokotoff soundex codes for inclusion in the SQL databases. Do you use php functions? I can try and talk nicely to Steve Morse and see if he is happy to allow us the use of his phonetic php source code and tables. I have copies I use but I would prefer to get his permission.
Michael
- WGroleau
- Offline
- Platinum Member
- Posts: 2165
I meant that DECENT soundex and phonetic matching encoding systems do not produce 1-to-1 names to codes. They try to allow for variations in pronunciation and different possible languages and branch, thus a single name can produce a varying number of possible codes. In terms of database design this means you can no longer use a single field containing a single code and index it and search on it efficiently. Instead you must use a separate, related code table linked to the original data table by record IDs or name.
Agreed (though “decent” is subjective).
I mentioned using IN (<list of equivalent names>) for my idea of a user- or admin-supplied list. But it could also be turned into a two-column table of name/equivalent name and then you select distinct rows from a join.
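A rough sketch of both variants, using Python's sqlite3 for illustration only. The table names (individual, name_equivalent) and helper functions are hypothetical; the sample spellings are some of the Groleau variants listed elsewhere in this thread, and forman is included as an unrelated name.

```python
import sqlite3
from itertools import permutations

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE individual (id INTEGER PRIMARY KEY, surname TEXT NOT NULL);
    -- Two-column table: each row says `name` should also find `equivalent`.
    CREATE TABLE name_equivalent (
        name       TEXT NOT NULL,
        equivalent TEXT NOT NULL,
        PRIMARY KEY (name, equivalent)
    );
""")

conn.executemany(
    "INSERT INTO individual (id, surname) VALUES (?, ?)",
    [(1, "Groleau"), (2, "Grosleau"), (3, "Groslot"), (4, "forman")],
)

# User- or admin-supplied group of spellings treated as equivalent.
variants = ["Groleau", "Grolleau", "Grosleau", "Groslot"]
conn.executemany(
    "INSERT INTO name_equivalent (name, equivalent) VALUES (?, ?)",
    list(permutations(variants, 2)),
)


def search_in(names):
    """Variant 1: IN (<list of equivalent names>) built by the caller."""
    marks = ",".join("?" * len(names))
    sql = f"SELECT id, surname FROM individual WHERE surname IN ({marks}) ORDER BY id"
    return conn.execute(sql, list(names)).fetchall()


def search_join(name):
    """Variant 2: distinct rows from a join on the equivalence table.

    The exact spelling always matches, even with no equivalents recorded.
    """
    sql = """
        SELECT DISTINCT i.id, i.surname
        FROM individual i
        LEFT JOIN name_equivalent e ON e.equivalent = i.surname
        WHERE e.name = :q OR i.surname = :q
        ORDER BY i.id
    """
    return conn.execute(sql, {"q": name}).fetchall()
```

With this data, search_join("Grolleau") finds the three Groleau-family records even though nobody is stored under that exact spelling, and search_in(variants) returns the same rows.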
Terminology argument tossed into its own thread where you may have the last word if you wish.
--
Wes Groleau
UniGen.us/
- fisharebest
- Offline
- Administrator
I am not sure how you calculate the existing Russell soundex and Daitch-Mokotoff soundex codes for inclusion in the SQL databases. Do you use php functions?
There is a native PHP function for soundex ( php.net/soundex ). We have our own (php) function for DM.
Look in includes/functions/functions_name.php if you are interested. The DM implementation was written by (IIRC) Gerry Kroll (PGV developer).
I can try and talk nicely to Steve Morse and see if he is happy to allow us the use of his phonetic php source code and tables. I have copies I use but I would prefer to get his permission.
Unless he publishes under GPL, we will not be able to include it in webtrees.
Ask him nicely
- mikkitobi
- Topic Author
- Visitor
And, a Canadian fellow named Denis Beauregard has developed a system that he says works well for French names. I'd love to see that as an option. All of the systems I and Michael have mentioned suck for French names. For example, all of the following are pronounced the same and appear in my paternal line (and siblings), but do not match in Soundex, DM, or Morse:
Groleau
Grolleau
Grosleau
Groleaux
Grosleaux
Groslot
Grolot
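The Soundex part of that claim can be checked with a short sketch. The function below is a plain Python rendering of the classic (Russell/American) Soundex rules, written for illustration; it is an assumption that it agrees with PHP's soundex() on these inputs, and it is not webtrees' code. It says nothing about the DM or Morse results.

```python
# Letter groups of classic Soundex; vowels, H, W and Y carry no digit.
CODES = {}
for letters, digit in (
    ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
    ("L", "4"), ("MN", "5"), ("R", "6"),
):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the 4-character classic Soundex code for an ASCII name."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    last = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same digit
        digit = CODES.get(ch, "")
        if digit and digit != last:
            result += digit
        last = digit  # a vowel resets `last`, so a repeated digit is kept
        if len(result) == 4:
            break
    return result.ljust(4, "0")


names = ["Groleau", "Grolleau", "Grosleau", "Groleaux",
         "Grosleaux", "Groslot", "Grolot"]
codes = {name: soundex(name) for name in names}
# codes -> G640, G640, G624, G642, G624, G624, G643
```

Seven spellings that sound alike end up under four different codes, so a Soundex search on any one of them misses most of the others.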
I have discussed this briefly with Steve and Sasha and they agree that their generic French phonetic rules need some extensions. They have made SOME already, and if you check out Steve's webform you will see that MOST of your names now match for at least ONE phonetic code variant. The surnames with 's' do not yet match and I am pushing them further on that. Watch this space.
I am sure they would welcome more input on their phonetic matching for French names, so please have a play with the form and if you have any other suggestions please let me or Steve know.
Regards
Michael
- WGroleau
- Offline
- Platinum Member
- Posts: 2165
Better, in my opinion, would be for the person searching, who knows it is a French name, to be able to select an algorithm that actually works for French. Or to have a name-based list instead of an AI attempt to figure out the pronunciation in a language-independent way.
As usual, “one size fits all” turns out in real life to be “you need to find another tailor.”
- mikkitobi
- Topic Author
- Visitor
Well, the biggest problem with Soundex is that it matches too much. The Beider-Morse tokens almost eliminate the Soundex problem of too many false negatives, but they aren't helping the false positives much when they take a French name and generate one correct code, one code for a Spanish mispronunciation, and codes for two English mispronunciations.
Better, in my opinion, would be for the person searching, who knows it is a French name, to be able to select an algorithm that actually works for French. Or to have a name-based list instead of an AI attempt to figure out the pronunciation in a language-independent way.
As usual, “one size fits all” turns out in real life to be “you need to find another tailor.”
The BMPM system actually has different rules for different languages as well as an 'Any' Rule.
You COULD specify a French rule and get exactly what you want.... BUT that is all very well for inputting a name into a search form and saying 'I know this is a French name', but what about the database you are searching? When the database is indexed, it doesn't know what language origins the names have. If you were indexing a batch of French births then fine, you could index them and tell the system they are all French, but in a GEDCOM you are likely to have a mixture of names from different countries and you will not always know what the countries of origin are. And just because somebody was born, married and died in France does not mean their name origin is French....
I think you are spending too much time thinking about coding your INPUT name and not realising the difficulties of indexing the database itself...
There will never be a perfect answer. When indexing databases of mixed origin you have to use an 'any' country/language approach and you will get some more false positives than you would prefer.... but still better than traditional soundex.
Steve and Sasha are now amending the French rules to correctly allow for the 's' in your name(s)...
Michael
- fisharebest
- Offline
- Administrator
Greg Roach - greg@subaqua.co.uk - @fisharebest@phpc.social - fisharebest.webtrees.net
- mikkitobi
- Topic Author
- Visitor
Does this BMPM system only work for latin alphabets, or does it work for greek/hebrew/cyrillic/arabic/etc. ?
I am pretty sure it handles Cyrillic and Hebrew.... I have asked Steve about the others.
Meantime you might want to read the following paper that appeared in the Association of Professional Genealogists Quarterly (March 2010):
stevemorse.org/phonetics/bmpm2.htm
I was slightly inaccurate in my last posting. BMPM will attempt to derive the correct language on a name-by-name basis if it has to....
Michael
- mikkitobi
- Topic Author
- Visitor
Does this BMPM system only work for latin alphabets, or does it work for greek/hebrew/cyrillic/arabic/etc. ?
Further to my last post. Greek IS supported. See the article I mentioned:
The languages we currently support are Catalan, Czech, Dutch, English, French, German, Greek (in Greek characters as well as transliterated into Latin characters), Hebrew, Hungarian, Italian, Polish, Portuguese, Romanian, Russian (in Cyrillic characters as well as various transliterations), Spanish (including its Latin American dialects), and Turkish. New language tables will probably be added in the future.
And for Jewish names we have gone one step further. We have a special variant of Phonetic Matching that is customized for Ashkenazic Jewish names (languages are English, French, German, Hebrew, Hungarian, Polish, Romanian, Russian, and Spanish), and one that is customized for Sephardic Jewish names (languages are Catalan, French, Hebrew, Italian, Portuguese, and Spanish).
Michael
- fisharebest
- Offline
- Administrator
Is the code/algorithm available under GPL, or some other open-source licence?
- mikkitobi
- Topic Author
- Visitor
Sounds great.
Is the code/algorithm available under GPL, or some other open-source licence?
I just asked Steve that question 5 minutes ago....
M
- mikkitobi
- Topic Author
- Visitor
Sounds great.
Is the code/algorithm available under GPL, or some other open-source licence?
OK the good news is that Sasha and Steve have agreed that open source is the way ahead.... but the bad news is Steve is not sure how to go about that.
I know you guys are busy developing and this might be just an interesting side issue but if you can contact Steve and give him some pointers it would be appreciated. BTW Steve thinks the PhpGedView php D-M implementation was his....
You can contact Steve at steve@stevemorse.org and please copy me in at michael@tobias.org.uk
And I guess I should now explain who I am/we are.... I am Michael Tobias, one of the Vice-Presidents of JewishGen ( www.jewishgen.org ) and I am looking to use webtrees to replace our aging Family Tree of the Jewish People database.
I am hoping to demonstrate a prototype of the new system in Los Angeles in July and will take that opportunity (if not before) to try to get volunteer translators for some of the languages you need.
Regards
Michael