Question Does size matter - and other requirements.....
- mikkitobi
- Topic Author
- Visitor
I am a newbie - please bear with me.
Earlier this year I downloaded and installed PGV to play/test/prototype some ideas I had. Now I am returning to that development and learned of the PGV / webtrees split and am thinking that probably webtrees is the more likely way forward.
Our system is far from typical - and I am not sure how much customisation we will want to do ourselves and how much of our requirements will be of general enough interest to other webtrees users to incorporate into the core product. We are a not-for-profit organisation based in the USA (though I am in the UK).
Here is a summary of our requirements....
1. We will be uploading approximately 4000 (and continually growing) gedcom files into the system - totalling around 5 million individuals currently.
PGV did not have a 'batch import' facility. I assume webtrees does not either - but is this something that could easily be done? Since you are still in beta test mode and often restructure your MySQL databases and wipe/reimport your data I was wondering if anybody had put together a system for automating the re-importing of batches of gedcom files...
2. Performance. Initially - and perhaps even in the longer term - our system will be read-only. It will be used purely for DISPLAY of trees not for any updates. What will performance be like for simply displaying individuals and moving around a single tree when we have 4000 trees and 5 million records in total in the SQL database?
3. Customisation. We want a lot of control over what menus appear to the users and what options appear in each menu. We want to limit the number and type of reports, charts etc and probably remove menus, add menus, add our own menu options with links to other external systems.... This was not going to be easy in PGV though there were promises that the next version released would allow easier customisation. Especially while webtrees is in beta test I would not like to be changing too much code ourselves and having to control those changes as new versions are released. Would this kind of development be of more general interest?
4. Data structures and indexes and searching. There were limitations in PGV - not sure about webtrees when it came to soundex and other searching. Although soundex codes and Daitch-Mokotoff soundex codes were stored in the database no indexes were used. SQL queries must have been using 'IN' comparisons which are kinda slow when searching 5 million records. Ideally we would like separate tables to hold soundex values (there is NOT a 1-to-1 relationship of names to soundex codes, some names can generate 2,4,8 and more soundex variants). In addition we use phonetic matching techniques as another search option and would want new tables and indexes for those phonetic codes. This has huge implications for data structures, data import and search SQL query code. Would this kind of development be of more general interest?
I am not an experienced php programmer, but I have at least one volunteer with php experience and might be able to get others involved who could devote considerable time and effort towards this development. In addition we are very interested in the Internationalisation of the product as we have users accessing our systems from over 100 countries worldwide. So perhaps we can help with the language translations.
Just wanted to say 'hi', let you guys know where I am coming from and where I am wanting to get to, and see what the possibilities are.
All comments welcome!
Regards
Michael
- WGroleau
- Offline
- Platinum Member
there is NOT a 1-to-1 relationship of names to soundex codes, some names can generate 2,4,8 and more soundex variants).
Huh? Soundex has a very specific deterministic algorithm that is based on the spelling of the name. It is acknowledged not to be the optimum predictor of equivalence, especially for certain languages, but it does generate one code for any particular spelling.
As for the rest, webtrees may not do everything you want, and it may do things you don't want, but it will be faster at whatever it does.
It sounds to me like you need to take a snapshot of webtrees (or some other project), give it a new name, and run your own development team from that point. If you noticed the intro text, one of the differences between webtrees and PGV is webtrees is not trying to be everything for everybody. If your project would be fifty percent different from webtrees, then a combined project would be a coding mess with all the conditionals.
Or take a read-only feed from PGV SVN, and manage your own differences. That's what I do with PGV, although I have a much smaller set of differences.
--
Wes Groleau
UniGen.us/
- WGroleau
- Offline
- Platinum Member
- Posts: 2165
We are a not-for-profit organisation
Would you mind identifying the organization or describing its "mission"?
If it's a secret, well, that's OK, but I'm probably not the only one who'd like to know.
Although, if it's a secret, that might make it illegal and/or unethical to maintain your own variation of open-source code.
--
Wes Groleau
UniGen.us/
- mikkitobi
- Topic Author
- Visitor
Huh? Soundex has a very specific deterministic algorithm that is based on the spelling of the name. It is acknowledged not to be the optimum predictor of equivalence, especially for certain languages, but it does generate one code for any particular spelling.
As for the rest, webtrees may not do everything you want, and it may do things you don't want, but it will be faster at whatever it does.
It sounds to me like you need to take a snapshot of webtrees (or some other project), give it a new name, and run your own development team from that point. If you noticed the intro text, one of the differences between webtrees and PGV is webtrees is not trying to be everything for everybody. If your project would be fifty percent different from webtrees, then a combined project would be a coding mess with all the conditionals.
Or take a read-only feed from PGV SVN, and manage your own differences. That's what I do with PGV, although I have a much smaller set of differences.
You are thinking of NARA or Russell soundex. It is totally inadequate and full of plain errors. Daitch-Mokotoff soundex generates variant codes based on possible different pronunciations and languages. Phonetic matching techniques also have branching in the generation of codes. There is no 1-to-1 relationship between a name and soundex/phonetic code. If you doubt this have a look at the DM soundex codes stored in the SQL databases by PGV (not sure about webtrees).
I realise we will have to do much of the coding ourselves - but before we started I wanted to check if ANY of the features we are interested in would be of more general interest or value.
Michael
- mikkitobi
- Topic Author
- Visitor
mikkitobi wrote:
We are a not-for-profit organisation
Would you mind identifying the organization or describing its "mission"?
If it's a secret, well, that's OK, but I'm probably not the only one who'd like to know.
Although, if it's a secret, that might make it illegal and/or unethical to maintain your own variation of open-source code.
It is not a secret.... but all will be revealed a little later....
Michael
- fisharebest
- Offline
- Administrator
thinking that probably webtrees is the more likely way forward.
Good thinking....
We will be uploading approximately 4000 (and continually growing) gedcom files into the system - totalling around 5 million individuals currently.
This is large - but not out of the question. You'd probably want to invest in some MySQL administration skills. As a mostly read-only system, master-slave replication would work well, and the nature of the queries (against a single gedcom) makes table-partitioning a useful tool too.
Is this all public data? The way you choose to implement privacy can have a large bearing on performance. Are these all public records? Do you expect 4000 user accounts for these 4000 gedcoms? etc...
Of course, size matters far less than your transaction rate. How many concurrent users do you expect? You may need to set up a small server farm as your site grows, but the application should scale well enough.
I'm in the process of rewriting much of the database structure. I've almost finished the "application" tables (hope to finish this weekend). I've started work on the "genealogy" tables, although this requires a lot of associated code changes, so there will be many more months before I can submit these changes to SVN. It might have to wait for webtrees 2.0.0. We can't hold up the 1.0.0 release forever.....
PGV did not have a 'batch import' facility. I assume webtrees does not either - but is this something that could easily be done?
Yup - pretty easy. Probably a day or two of effort. If it is your own server, you'd be able to take advantage of MySQL's direct file access to load files, instead of posting it via queries.
The current import code, while much faster than PGV, is only temporary. A final solution will import the data using stored procedures, and will run an order of magnitude faster. This is mostly written, but is queued behind some other DB changes.
More important to you will be the gedcom management system. The GUI was designed for 1 -> 20 gedcoms, and you'll find some pages unmanageably large with 4000 gedcoms.
3. Customisation. We want a lot of control over what menus appear to the users and what options appear in each menu. We want to limit the number and type of reports, charts etc and probably remove menus, add menus, add our own menu options with links to other external systems....
This is trivial. The new module management system and a custom theme will do all this for you.
4. Data structures and indexes and searching. There were limitations in PGV - not sure about webtrees when it came to soundex and other searching. Although soundex codes and Daitch-Mokotoff soundex codes were stored in the database no indexes were used. SQL queries must have been using 'IN' comparisons which are kinda slow when searching 5 million records. Ideally we would like separate tables to hold soundex values (there is NOT a 1-to-1 relationship of names to soundex codes, some names can generate 2,4,8 and more soundex variants). In addition we use phonetic matching techniques as another search option and would want new tables and indexes for those phonetic codes.
You are right. The PGV database structure is pretty awful in places. As I said above, I'm working through the database, rewriting it table by table. It's a lot of work, as there is a lot of code associated with the existing structures. The soundex is on my list.....
If you want to tell me about your "other phonetic codes", I'll try to make sure that the new design will support them.
In addition we are very interested in the Internationalisation of the product as we have users accessing our systems from over 100 countries worldwide. So perhaps we can help with the language translations.
You can see from the translation page that we have translators working in about 10 languages. Getting the "major" languages completed first is more of a priority than adding obscure/minor languages, but any help with this would be most welcome.
Greg
Greg Roach - greg@subaqua.co.uk - @fisharebest@phpc.social - fisharebest.webtrees.net
- mikkitobi
- Topic Author
- Visitor
This is large - but not out of the question. You'd probably want to invest in some MySQL administration skills. As a mostly read-only system, master-slave replication would work well, and the nature of the queries (against a single gedcom) makes table-partitioning a useful tool too.
I assume the gedcom name is stored in a field in SQL in any case...
Is this all public data? The way you choose to implement privacy can have a large bearing on performance. Are these all public records? Do you expect 4000 user accounts for these 4000 gedcoms? etc...
Ah good question.... the answer is yes and no lol.
Access to the entire system will be behind a login process external to webtrees. Continued access within webtrees will be subject to correct cookie values or session variables being present.
There will only be admin users. No other user accounts are needed. Access to the system will be read-only guest access - unless at a later date we allow online editing of users trees.
Of course, size matters far less than your transaction rate. How many concurrent users do you expect? You may need to set up a small server farm as your site grows, but the application should scale well enough.
The number of concurrent users will not be large. Perhaps a dozen?
I'm in the process of rewriting much of the database structure. I've almost finished the "application" tables (hope to finish this weekend). I've started work on the "genealogy" tables, although this requires a lot of associated code changes, so there will be many more months before I can submit these changes to SVN. It might have to wait for webtrees 2.0.0. We can't hold up the 1.0.0 release forever.....
When do you expect the initial stable release of 1.0.0 ?
The current import code, while much faster than PGV, is only temporary. A final solution will import the data using stored procedures, and will run an order of magnitude faster. This is mostly written, but is queued behind some other DB changes.
Will that be in version 1.0.0 ?
More important to you will be the gedcom management system. The GUI was designed for 1 -> 20 gedcoms, and you'll find some pages unmanageably large with 4000 gedcoms.
Yes that is important, but since the system is a read-only system we 'only' need a way of 'searching' for a gedcom file and a method of deleting it or replacing with a new version of the file. The names of the gedcom files follow a pattern.
This is trivial. The new module management system and a custom theme will do all this for you.
Excellent news!
You are right. The PGV database structure is pretty awful in places. As I said above, I'm working through the database, rewriting it table by table. It's a lot of work, as there is a lot of code associated with the existing structures. The soundex is on my list.....
If you want to tell me about your "other phonetic codes", I'll try to make sure that the new design will support them.
Are you now using indexes on the tables? Are you or are you happy to, move the soundex codes to separate related tables to speed up searches?
Currently the phonetic codes and Daitch-Mokotoff codes we use are generated by a php system. The phonetic codes are Beider-Morse Phonetic Matching codes developed by Sasha Beider and Steve Morse (of one-step webpage fame). You can read more about the codes at: stevemorse.org/phonetics/bmpm.htm
You can see from the translation page that we have translators working in about 10 languages. Getting the "major" languages completed first is more of a priority than adding obscure/minor languages, but any help with this would be most welcome.
Excellent I will check out the list. Where will I find instructions on how to participate in the translation process?
Regards
Michael
- mikkitobi
- Topic Author
- Visitor
You can see from the translation page that we have translators working in about 10 languages. Getting the "major" languages completed first is more of a priority than adding obscure/minor languages, but any help with this would be most welcome.
There are several languages on that list that we would love to see completed and could possibly help with - in particular:
Spanish
Hebrew
Russian
French
Portuguese
Hungarian
Romanian
Exactly what is required of the translators? Any systems knowledge? - or purely language skills and a web browser?
If you let me know what is needed I will try to find volunteers for you. Are the translations normally validated by others? We could probably get 2 volunteers for each language to do/check the translations.
Regards
Michael
- fisharebest
- Offline
- Administrator
I assume the gedcom name is stored in a field in SQL in any case...
Yes. Everything either is, or soon will be, stored in SQL.
The number of concurrent users will not be large. Perhaps a dozen?
That's fine. I had visions of 1,000s of users...
When do you expect the initial stable release of 1.0.0 ?
Good question. One I need to discuss with my fellow developers. It has been "a few months' time" since we started, three months ago.
A final solution will import the data using stored procedures, and will run an order of magnitude faster.
Will that be in version 1.0.0 ?
Depends how much free time I get
Since a 1000 record gedcom will load in < 10 seconds, you are looking at a ballpark figure of 12 hours to import your 4000 gedcoms, so I wouldn't get too hung up on this. Just make sure your database is set up nicely.
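The ballpark above can be checked with a little arithmetic. This is only a rough sketch: it assumes import time scales linearly with record count, uses the 5 million individuals mentioned earlier in the thread as the total, and treats "1000 records in 10 seconds" as an upper bound. The function name and parameters are illustrative, not part of webtrees.

```python
def estimate_import_hours(total_records, records_per_batch=1000, seconds_per_batch=10.0):
    """Estimate total import time in hours, assuming linear scaling."""
    batches = total_records / records_per_batch
    return batches * seconds_per_batch / 3600.0


# 5 million individuals at (at most) 10 seconds per 1000 records
hours = estimate_import_hours(5_000_000)
print(f"{hours:.1f} hours")
```

Under those assumptions the upper bound comes out a little under 14 hours, which is in line with the quoted 12-hour ballpark given that each 1000 records loads in less than 10 seconds.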
The names of the gedcom files follow a pattern.
Are you now using indexes on the tables? Are you or are you happy to, move the soundex codes to separate related tables to speed up searches?
Sorry if I wasn't clear. Yes, I fully intend to normalise this table. I asked about your preferred phonetic system simply to find out the required column type. Soundex are 4 chars long while DM are 6. A quick scan through the URL you provided doesn't answer the question, but I'm assuming a similarly short alpha-numeric code.
Greg Roach - greg@subaqua.co.uk - @fisharebest@phpc.social - fisharebest.webtrees.net
- fisharebest
- Offline
- Administrator
There are several languages on that list that we would love to see completed and could possibly help with - in particular:
Spanish Hebrew Russian French Portuguese Hungarian Romanian
Exactly what is required of the translators? Any systems knowledge? - or purely language skills and a web browser?
Language skills, a web-browser, and enough common sense to stop and ask questions when the context isn't clear. The translators forum here exists for this exact purpose.
Are the translations normally validated by others? We could probably get 2 volunteers for each language to do/check the translations.
The system is open - meaning that anyone can contribute translations and corrections. All they need is to register an account on launchpad.
However, each language needs a (small) team of reviewers. As well as confirming the submitted translations, they are also responsible for writing the translation guidelines for the language. These should cover things like use of 2nd/3rd person, consistent translation of technical words, etc.
Greg
Greg Roach - greg@subaqua.co.uk - @fisharebest@phpc.social - fisharebest.webtrees.net
- mikkitobi
- Topic Author
- Visitor
Sorry if I wasn't clear. Yes, I fully intend to normalise this table. I asked about your preferred phonetic system simply to find out the required column type. Soundex are 4 chars long while DM are 6. A quick scan through the URL you provided doesn't answer the question, but I'm assuming a similarly short alpha-numeric code.
Hi Greg
The phonetic codes are not fixed length but vary just like the length of surnames and givennames vary. In many cases the phonetic codes are the very same as the name spellings.....
There are also 2 sets of phonetic code. EXACT and APPROX. Both sets need stored.
Example codes:
Surname: forman EXACT: forman APPROX: forman, formon, fYrman, fYrmon
Surname: michelson EXACT: mixelzon, mixelson APPROX: mQxYlzon, mQxilzon, mixYilzon, mixilzon
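The one-to-many relationship between a name and its phonetic codes can be sketched with a small normalised schema. This is a minimal illustration using SQLite from Python, not the webtrees or PGV schema: the table names, column names and the find_surnames helper are all hypothetical, and the only data used is the two example surnames above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE surname (
    id      INTEGER PRIMARY KEY,
    surname TEXT NOT NULL UNIQUE
);
-- One row per (surname, kind, code): a surname may have many codes.
CREATE TABLE phonetic_code (
    surname_id INTEGER NOT NULL REFERENCES surname(id),
    kind       TEXT NOT NULL CHECK (kind IN ('EXACT', 'APPROX')),
    code       TEXT NOT NULL,          -- variable length, like the names
    PRIMARY KEY (surname_id, kind, code)
);
-- Index so a search by code does not scan the whole table.
CREATE INDEX idx_phonetic_code ON phonetic_code (kind, code);
""")

# Example codes taken from the post above.
EXAMPLES = {
    "forman": {
        "EXACT": ["forman"],
        "APPROX": ["forman", "formon", "fYrman", "fYrmon"],
    },
    "michelson": {
        "EXACT": ["mixelzon", "mixelson"],
        "APPROX": ["mQxYlzon", "mQxilzon", "mixYilzon", "mixilzon"],
    },
}

for name, kinds in EXAMPLES.items():
    surname_id = conn.execute(
        "INSERT INTO surname (surname) VALUES (?)", (name,)
    ).lastrowid
    for kind, codes in kinds.items():
        conn.executemany(
            "INSERT INTO phonetic_code (surname_id, kind, code) VALUES (?, ?, ?)",
            [(surname_id, kind, code) for code in codes],
        )


def find_surnames(kind, code):
    """Return surnames that have the given phonetic code of the given kind."""
    rows = conn.execute(
        "SELECT DISTINCT s.surname FROM surname s "
        "JOIN phonetic_code p ON p.surname_id = s.id "
        "WHERE p.kind = ? AND p.code = ? ORDER BY s.surname",
        (kind, code),
    )
    return [row[0] for row in rows]


print(find_surnames("APPROX", "formon"))
print(find_surnames("EXACT", "mixelson"))
```

SQLite compares text case-sensitively by default. Since the example codes mix upper and lower case, a MySQL version of this sketch would probably need a case-sensitive or binary collation on the code column.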
Michael
- mikkitobi
- Topic Author
- Visitor
Sorry if I wasn't clear. Yes, I fully intend to normalise this table. I asked about your preferred phonetic system simply to find out the required column type. Soundex are 4 chars long while DM are 6. A quick scan through the URL you provided doesn't answer the question, but I'm assuming a similarly short alpha-numeric code.
See: stevemorse.org/census/soundex.html which is an online form that creates codes for you.
Michael
- WGroleau
- Offline
- Platinum Member
- Posts: 2165
You are thinking of NARA or Russell soundex. It is totally inadequate and full of plain errors.
No, I was thinking of Soundex - an algorithm patented in the early 1900s. The fact that it is not a very good algorithm doesn't, IMHO, justify other algorithms grabbing its name. (And I was not aware till now that they had done so.)
As for Greg's question about allowing room for other algorithms, there is one called Metaphone that might be considered. I don't know anything about it other than it is alleged to be better than Soundex (which would not be hard to do).
There's also the New York State Identification and Intelligence System, but it may not be worth using--only a slight improvement over Soundex. And Caverphone--another relatively unknown that I know little about.
And, a Canadian fellow named Denis Beauregard has developed a system that he says works well for French names. I'd love to see that as an option. All of the systems I and Michael have mentioned suck for French names. For example, all of the following are pronounced the same and appear in my paternal line (and siblings), but do not match in Soundex, DM, or Morse:
Groleau
Grolleau
Grosleau
Groleaux
Grosleaux
Groslot
Grolot
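For the classic Soundex at least, the mismatch is easy to reproduce. The following is a minimal sketch of the Russell/NARA-style algorithm as it is commonly described (keep the first letter, code the remaining consonants, collapse adjacent duplicates, ignore H and W, pad to four characters). It is not the PGV or webtrees implementation, and it does not cover Daitch-Mokotoff or Beider-Morse.

```python
# Letter groups of the classic Soundex; vowels, H, W and Y have no code.
CODES = {}
for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"}.items():
    for letter in letters:
        CODES[letter] = digit


def soundex(name):
    """Return the 4-character classic Soundex code for a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not break a run of same-coded letters
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]


names = ["Groleau", "Grolleau", "Grosleau", "Groleaux",
         "Grosleaux", "Groslot", "Grolot"]
for name in names:
    print(name, soundex(name))
```

With this sketch the seven spellings spread across four different codes rather than collapsing to one, so a Soundex search on one spelling would miss several of the others.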
I made a feature request once for PGV for something that would be great if it were feasible. That would be for an admin to be able to maintain a site-specific or GEDCOM-specific set of equivalences. At one time, I suspected it would have unacceptable DB load, but now I think otherwise. If the list says
For Groleau, include Groslot, Grolleau, Groleaux, Grolot
Then the user could check "use equivalence list" and the SQL would be something like
WHERE Surname = @Surname OR Surname IN ('Groslot', 'Grolleau', 'Groleaux', 'Grolot')
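A minimal sketch of how such an equivalence list might drive the query, assuming an in-memory SQLite table and a plain dictionary for the admin-maintained list. The table, column and function names are hypothetical and not taken from PGV or webtrees.

```python
import sqlite3

# Hypothetical admin-maintained equivalence list (site- or GEDCOM-specific).
EQUIVALENCES = {
    "Groleau": ["Groslot", "Grolleau", "Groleaux", "Grolot"],
}


def surname_search(conn, surname, use_equivalences=False):
    """Search by surname, optionally widening to the listed equivalents."""
    names = [surname]
    if use_equivalences:
        names += EQUIVALENCES.get(surname, [])
    # One placeholder per name keeps the query parameterised.
    placeholders = ", ".join("?" for _ in names)
    sql = (f"SELECT surname FROM individuals "
           f"WHERE surname IN ({placeholders}) ORDER BY surname")
    return [row[0] for row in conn.execute(sql, names)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE individuals (id INTEGER PRIMARY KEY, surname TEXT)")
conn.execute("CREATE INDEX idx_surname ON individuals (surname)")
conn.executemany(
    "INSERT INTO individuals (surname) VALUES (?)",
    [("Groleau",), ("Groslot",), ("Grolot",), ("Smith",)],
)

print(surname_search(conn, "Groleau"))
print(surname_search(conn, "Groleau", use_equivalences=True))
```

Because the list is short and the surname column is indexed, the widened search is still a handful of index lookups rather than a table scan.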
--
Wes Groleau
UniGen.us/
- mikkitobi
- Topic Author
- Visitor
No, I was thinking of Soundex—an algorithm patented in the early 1900s. The fact that it is not a very good algorithm doesn't, IMHO, justify other algorithms grabbing its name. (And I was not aware till now that they had done so.)
The name was grabbed a long time ago.
Nowadays there are many types of soundex and soundex represents more the TECHNIQUE than the absolute algorithm.
To be precise the soundex you were referring to is known as Russell or NARA soundex to differentiate it from others.
Michael
- mikkitobi
- Topic Author
- Visitor
No, I was thinking of Soundex—an algorithm patented in the early 1900s. The fact that it is not a very good algorithm doesn't, IMHO, justify other algorithms grabbing its name. (And I was not aware till now that they had done so.)
By the way, a search of the web shows that the original 'soundex' created and patented by Russell was NOT called soundex. He called it INDEX.
The original Soundex algorithm began in a patent by Robert C. Russell in 1918. The name "Soundex" came along later (it was probably coined by the telephone company but I'm not sure). The original patent was simply titled "Index" (application Number 1,261,167, filed Oct. 25, 1917).
- kiwi
- Offline
- Platinum Member
3. Customisation. We want a lot of control over what menus appear to the users and what options appear in each menu. We want to limit the number and type of reports, charts etc and probably remove menus, add menus, add our own menu options with links to other external systems....
This is trivial. The new module management system and a custom theme will do all this for you.
To clarify ... you can look at the Module Administration feature on the DEMO site here. It allows all menu items, blocks, reports to be controlled. It will also include charts and themes later. You will find it by logging in as demo_admin, then go to Admin -->Module Administration.
Nigel
www.our-families.info
- mikkitobi
- Topic Author
- Visitor
To clarify ... you can look at the Module Administration feature on the DEMO site here. It allows all menu items, blocks, reports to be controlled. It will also include charts and themes later. You will find it by logging in as demo_admin, then go to Admin -->Module Administration.
Thanks Nigel.... getting there....
BUT it is not exactly what I am wanting.
Various comments - please don't take these as complaints:
1. There does not appear to be a way to hide an entire menu - eg reports or charts
2. There are many menu items (eg on the charts menu) that cannot be customised/hidden
3. Even if you remove all items from a menu - eg reports - the icon remains despite the list of options being empty.
Ideally I would like to see :
A. The option to hide entire menus - eg reports, charts, calendar, search etc
B. The option to hide every individual item on every menu.
Is there any reason why the module admin settings do not list every menu and sub-menu item individually to be customised? Is that the plan? I see your comment about charts and themes coming later....
And then.....
Can we ADD our own menus and submenu items which link to external URLs / systems?
Thanks
Michael
PS should some of this discussion now move to the 'Customising' or 'Request for New Feature' forums?
- kiwi
- Offline
- Platinum Member
There does not appear to be a way to hide an entire menu - eg reports or charts
In my view that feature is too trivial to warrant a GUI option, and there just isn't enough demand for it. Just go to the themes/xxx/toplinks.php (or in some themes header.php) and either delete the menu item or wrap it in an "if (WT_USER_IS_ADMIN) {" statement. The latter will hide it from everyone except Admin. Presumably on the site you describe there is no need to have theme switching, so you will only have one 'customised' theme to manage. It might of course get implemented as part of something larger, such as mentioned below.
There are many menu items (eg on the charts menu) that cannot be customised/hidden
That is an option that we should consider - but it won't be an urgent one I'm afraid, and definitely not in the first release. It may well require a re-think of the way menus are created. I recommend you raise a separate topic here for that one so others can comment and we can gauge the level of interest, and so we don't forget it, which might happen in a large and complex thread like this one.
Even if you remove all items from a menu - eg reports - the icon remains despite the list of options being empty.
If you do as described above to remove the entire menu option there will be no icons left.
Can we ADD our own menus and submenu items which link to external URLs / systems?
Absolutely. You can either add it manually to the same file mentioned above (toplinks.php) or as a module. We haven't written any details about creating a module from scratch yet, but I've done a couple and they are quite easy. We will get instructions written sometime.
Nigel
www.our-families.info
- mikkitobi
- Topic Author
- Visitor
M
- WGroleau
- Offline
- Platinum Member
- Posts: 2165
… the soundex you were referring to is known as Russell or NARA soundex to differentiate it from others.
In a book search for "Russell Soundex" I find that the term is used mainly as a courtesy to Russell, not to distinguish it from other algorithms. Moreover, if I search for just Soundex, I find that in the first several dozen hits, none of them prefixes Soundex with Russell or NARA, though all of them seem to be using the term to mean that particular algorithm. No matter. The main thing is that I now understand that when you said Soundex produces several encodings, you meant that there are several types of Soundex.
Technically, as far as I know, each type in itself is deterministic.
And I agree that the original is poor. And that ALL of them are worse than poor for French.
--
Wes Groleau
UniGen.us/