- Posts: 31
Question INDI-specific repository citations
- tashtari
- Topic Author
- Offline
- New Member
I'm slowly chugging through my tree, adding source citations for facts that I added willy-nilly back when I was new to genealogy, and I've run up against what's either a shortcoming of GEDCOM or a lack of understanding of it on my part (probably the latter!). It's difficult to summarize in a sentence, so let me give a concrete example:
Suppose I have found on Ancestry.com a page out of the US census that contains some of my ancestors. From this I get a number of URLs - one to the image of the page, as well as one link to a record for each line on the page. I upload the page image as an OBJE node and add it to a SOUR node which I cite on various facts obtained from the census.
To indicate where the source came from, I add a REPO citation to the SOUR node with a CALN that points to the image link on Ancestry.com. Here's the issue - what do I do with the links to the individual line records? I could add REPO citations for them as well, but how do I indicate which one is associated with which INDI?
Up to now, I've been doing this by adding a custom "_ID" tag to the REPO citation, i.e.:
This is not a great solution as webtrees doesn't recognize "_ID" and so it has to be added/altered using edit-raw.
Why do this at all? Because eventually what I want to do is build a tool that will take a GEDCOM file exported from webtrees and alter the source citations on each fact on each INDI to add "_APID" tags. This will allow the GEDCOM file to be imported into Ancestry.com and connect each fact to its record on Ancestry.com, thus ensuring that it won't suggest records that I already have. This way, I can do my work in webtrees and have my own copy of each record cited, and still use Ancestry.com's tools to expand and fill in the tree.
So my question is - is there a better way to accomplish what I'm trying to do?
- norwegian_sardines
- Offline
- Platinum Member
- Posts: 3127
As I see it we have three points of data entry for source citation.
1) Repository record
2) Source Record
3) Citation Structure
As you have identified
1) The Repository is the site where you found the fact you are citing. This can be a library, a website, or some other place.
2) The Source is the actual item you found. Normally this is the book, website, magazine, or census. It is usually a high-level item!
3) The Citation Structure is the detail about where the fact was found within the Source: page number, page URL, film image identifier.
For a Census, I do the following.
Two concepts exist for the Repository, and you can take either one. Most people think the Repository would be Ancestry.com; some people think the Repository is the National Archives and Records Administration.
The Source Record would be the actual Census (1950 U.S. Census)
I connect the Source-Record to the Repository_Record with a URL to the search page for the 1950 US Census on Ancestry ( www.ancestry.com/search/collections/62308/ )
When I find a listing in the 1950 Census Source, I create the appropriate fact (birth year, occupation, residence), then create a Citation Structure with a PAGE reference URL to the page where I found it on the census website.
The concept could look like this:
0 @R6@ REPO
1 NAME Ancestry.com
1 ADDR Provo, UT, USA
1 WWW www.ancestry.com
0 @X498@ SOUR
1 TITL Census: 1950 US
1 AUTH National Archives and Records Administration
1 REPO @R6@
2 CALN www.ancestry.com/search/collections/62308/
1 CENS
2 DATE <Census 1950 US Date>
2 PLAC City, County, State, USA
2 SOUR @X498@
3 OBJE @X622@
3 PAGE www.ancestry.com/discoveryui-content/view/ .......
3 DATA
4 TEXT Additional Data (Line, Page, Enum District, etc)
The OBJE (image) is an actual copy of the page from the Census. This normally has all of the detail about where in the Census (line, page, enum district, etc) I found the information.
PAGE points to the actual page on the census website. I don't rely on this address staying valid, so I always capture the page image for the future.
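As a rough illustration of how a tool might read the structure above, here is a minimal Python sketch (not part of webtrees) that parses GEDCOM lines into a tree and lists each fact-level citation with its PAGE value. The `@I1@` record line, the `www.example.com` URL, and the names `Node`, `parse_gedcom` and `list_citations` are assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    level: int
    tag: str
    value: str = ""
    xref: str = ""
    children: list = field(default_factory=list)


def parse_gedcom(text):
    """Parse GEDCOM text into a list of level-0 records with nested children."""
    roots, stack = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        level, _, rest = line.partition(" ")
        first, _, remainder = rest.partition(" ")
        if first.startswith("@"):
            # Record line such as "0 @X498@ SOUR"
            tag, _, value = remainder.partition(" ")
            node = Node(int(level), tag, value, xref=first)
        else:
            # Ordinary line such as "3 PAGE www.example.com/..."
            node = Node(int(level), first, remainder)
        del stack[node.level:]  # drop siblings and deeper levels
        (stack[-1].children if stack else roots).append(node)
        stack.append(node)
    return roots


def list_citations(records):
    """Yield (individual, fact tag, source xref, PAGE value) per citation."""
    for record in records:
        if record.tag != "INDI":
            continue
        for fact in record.children:
            for cite in fact.children:
                if cite.tag != "SOUR":
                    continue
                page = next(
                    (c.value for c in cite.children if c.tag == "PAGE"), ""
                )
                yield record.xref, fact.tag, cite.value, page


# Sample data modelled on the example above; @I1@ and the PAGE URL are made up.
SAMPLE = """\
0 @R6@ REPO
1 NAME Ancestry.com
0 @X498@ SOUR
1 TITL Census: 1950 US
1 REPO @R6@
2 CALN www.ancestry.com/search/collections/62308/
0 @I1@ INDI
1 CENS
2 PLAC City, County, State, USA
2 SOUR @X498@
3 PAGE www.example.com/census-record
"""

for row in list_citations(parse_gedcom(SAMPLE)):
    print(row)
```

Because the per-person detail lives in the citation under the fact, one pass over the INDI records is enough to recover it.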
Hope this helps!
Ken
- tashtari
- Topic Author
- Offline
- New Member
- Posts: 31
For a census household, that SOUR node contains:
- Links to OBJEs for each page of the census that contains the family in question, one image per OBJE
- A link to a shared note that contains the census transcription, if I've done one (this, I think, could equally validly be a TEXT node, but the census wizard creates shared notes, so I just use those)
- A REPO citation with an _ID of "Page" and a CALN that links to the image on Ancestry
- A REPO citation with an _ID of "Page 2" and a CALN that links to the image on Ancestry if the family spans two pages
- REPO citations with an _ID pointing to an individual represented by a line on the census and a CALN that links to that person's record within the census on Ancestry (these are the citations that can later be transformed into _APIDs)
- REPO citations as above for FamilySearch, too, if I'm really on my toes
Then, if I want to indicate that a fact on an INDI or FAM came from that family on the census, I add a SOUR citation under the fact, sometimes with a DATA/TEXT that explains it, if it isn't obvious.
The transform step for a GEDCOM to be imported into Ancestry will look something like this:
- See a fact under an INDI that has a SOUR citation on it
- Check whether the SOUR node that the citation points to has a REPO citation with an _ID that matches the INDI where the SOUR citation was found
- If so, transform the Ancestry link into an _APID citation and attach it to the fact, detaching the SOUR citation on which it was based
This is why it's important for me to use _ID to keep track of the INDIs associated with each Ancestry link. I believe this is compliant with the GEDCOM spec, since it denotes an unofficial tag using an underbar prefix, and when it contains a link, it's in the standard @...@ format so any software that changes XREFs will know to change the _ID as well - I just wonder if there's a better way to integrate this system into Webtrees.
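The matching step described above can be sketched in Python roughly as follows. This is only an illustration under several assumptions: that `_ID` sits next to `CALN` under the source's `REPO` citation, that an Ancestry record URL has the shape `.../discoveryui-content/view/<record id>:<collection id>`, and that the corresponding `_APID` value is `1,<collection id>::<record id>`. For simplicity it adds the `_APID` line beneath the existing citation instead of detaching that citation. All function names and the sample data are made up.

```python
import re

# Assumed URL shape: .../discoveryui-content/view/<record id>:<collection id>
VIEW_URL = re.compile(r"/discoveryui-content/view/(\d+):(\d+)")


def url_to_apid(url):
    """Assumed mapping from a record URL to an _APID value, or None."""
    match = VIEW_URL.search(url)
    if not match:
        return None
    record_id, collection_id = match.groups()
    return f"1,{collection_id}::{record_id}"


def parse(text):
    """Parse GEDCOM text into nested dicts (level-0 records at the top)."""
    roots, stack = [], []
    for raw in text.splitlines():
        if not raw.strip():
            continue
        level, _, rest = raw.strip().partition(" ")
        first, _, remainder = rest.partition(" ")
        if first.startswith("@"):
            xref = first
            tag, _, value = remainder.partition(" ")
        else:
            xref, tag, value = "", first, remainder
        node = {"level": int(level), "xref": xref, "tag": tag,
                "value": value, "children": []}
        del stack[node["level"]:]
        (stack[-1]["children"] if stack else roots).append(node)
        stack.append(node)
    return roots


def kids(node, tag):
    return [c for c in node["children"] if c["tag"] == tag]


def add_apids(records):
    """Add _APID under each citation whose source links to the citing INDI."""
    sources = {r["xref"]: r for r in records if r["tag"] == "SOUR"}
    for indi in (r for r in records if r["tag"] == "INDI"):
        for fact in indi["children"]:
            for cite in kids(fact, "SOUR"):
                source = sources.get(cite["value"])
                if source is None:
                    continue
                for repo in kids(source, "REPO"):
                    ids = [n["value"] for n in kids(repo, "_ID")]
                    calns = [n["value"] for n in kids(repo, "CALN")]
                    if indi["xref"] not in ids or not calns:
                        continue
                    apid = url_to_apid(calns[0])
                    if apid:
                        cite["children"].append(
                            {"level": cite["level"] + 1, "xref": "",
                             "tag": "_APID", "value": apid, "children": []})
    return records


def serialize(records):
    lines = []

    def walk(node):
        parts = (str(node["level"]), node["xref"], node["tag"], node["value"])
        lines.append(" ".join(p for p in parts if p))
        for child in node["children"]:
            walk(child)

    for record in records:
        walk(record)
    return "\n".join(lines)


SAMPLE = """\
0 @I1@ INDI
1 NAME John /Doe/
1 RESI
2 DATE 1950
2 SOUR @S1@
0 @I2@ INDI
1 NAME Jane /Doe/
1 RESI
2 SOUR @S1@
0 @S1@ SOUR
1 TITL 1950 US Census, Doe household
1 REPO @R1@
2 CALN https://www.ancestry.com/discoveryui-content/view/111:62308
2 _ID @I1@
1 REPO @R1@
2 CALN https://www.ancestry.com/discoveryui-content/view/222:62308
2 _ID @I2@
"""

print(serialize(add_apids(parse(SAMPLE))))
```

REPO citations whose `_ID` is free text such as "Page" never match an INDI XREF, so they are skipped without special handling.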
- norwegian_sardines
- Offline
- Platinum Member
- Posts: 3127
"It seems like you're doing something not entirely dissimilar to what I'm doing... but for purposes of a US census, I consider a source (a SOUR node, that is) to be a single household within the census rather than the entire census."
You are doing it totally differently than I am! You are making the Source_Record represent a very different definition of the term "Source", which has forced you to create a new tag called _ID. If you want to do it your own way that's fine, but it does not follow the v5.5.1 GEDCOM Standard, which my way does. You are doing what is called "Splitting". GEDCOM, in general, does not support splitting! Its design is for Lumping, i.e. lump all fact citations under a general Source Artifact, such as the 1950 Census, a book title, a church register, etc. To use "Splitting" you must add new tags to the GEDCOM. Some software does this.
You asked:
I just wonder if there's a better way to integrate this system into Webtrees.
I answered in the way that would best use GEDCOM to support Ancestry, which is known not to create good GEDCOM. So I don't use Ancestry to create my GEDCOM, but I do use it to support my webtrees assertions! If you want to change the way GEDCOM works, you would need to write a code module to support your new tag!
"This is why it's important for me to use _ID to keep track of the INDIs associated with each Ancestry link. I believe this is compliant with the GEDCOM spec, since it denotes an unofficial tag using an underbar prefix, and when it contains a link, it's in the standard @...@ format so any software that changes XREFs will know to change the _ID as well - I just wonder if there's a better way to integrate this system into Webtrees."
I'm at a loss to understand what you are really trying to accomplish! Why do you need for the Source_record to have a link to the Individual-Record? A Source_Record will most likely be cited by multiple facts from multiple Individuals and Family instances.
Yes, GEDCOM allows you to create any new tag so long as you put an "_" at the beginning of the tag, but no other software program will understand what the new tag in a v5.5.1 GEDCOM does or how to support it. In v7 of GEDCOM you are still able to add new tags, but you are expected to provide a URI that defines the tag. This, however, will still not make the new tag valid in any other software program if it is not programmed for the tag.
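For illustration, GEDCOM 7 lets a file document an extension tag by pairing it with a URI in the header's `SCHMA` structure (`HEAD.SCHMA.TAG`). The Python sketch below only builds those header lines. The function name and the `example.com` URI are placeholders, not a real definition of `_ID`.

```python
def header_with_extensions(extensions):
    """Build GEDCOM 7 header lines that document extension tags.

    `extensions` maps an extension tag (starting with "_") to the URI
    that defines it.
    """
    lines = ["0 HEAD", "1 GEDC", "2 VERS 7.0"]
    if extensions:
        lines.append("1 SCHMA")
        for tag, uri in sorted(extensions.items()):
            if not tag.startswith("_"):
                raise ValueError(f"extension tags must start with '_': {tag}")
            lines.append(f"2 TAG {tag} {uri}")
    return lines


# The URI below is a placeholder; a real one should point at a definition.
for line in header_with_extensions({"_ID": "https://example.com/gedcom/ext/_ID"}):
    print(line)
```

As the post says, declaring the tag this way documents it but does not make other programs act on it.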
EDIT. Sorry, I reread some of what I wrote and the words I used (spell check messed with some of them), so the wording has changed a little from what you may have gotten in an email!
Ken
- norwegian_sardines
- Offline
- Platinum Member
- Posts: 3127
One of the biggest issues with using multiple applications is that most don't use GEDCOM as intended or designed. They add tags, misuse standard tags, or don't use a tag that is in the Standard but decide to create their own.
webtrees has the ability to read and understand multiple dialects of GEDCOM, but can't always understand the reason that a particular extension exists or how to use it.
I personally suggest that users of multiple applications do not do "round tripping" between the different applications, because of the way each of them supports GEDCOM and the various dialects each one understands.
Ken
- tashtari
- Topic Author
- Offline
- New Member
- Posts: 31
Also, looking at the spec for PAGE, it looks like it's meant to contain what you're using DATA/TEXT for: gedcom.io/specifications/FamilySearchGEDCOMv7.html#PAGE
You're certainly correct that there's a lot of twisted uses of GEDCOM around - Ancestry.com is certainly guilty of this. However, their suggestions are very helpful when adding documentation to individuals and families and building out trees, and I want to be able to leverage them while still keeping the master copy of everything in my webtrees instance. To that end, I'd like to build out some tools that allow a user to export their tree from webtrees, import it into Ancestry.com, work on it there, then export it from Ancestry.com and import the new information back into webtrees, with as little manual labor as possible - it seems like translation between dialects of GEDCOM is the only realistic way to do this, as Ancestry.com does not have a public API.
- norwegian_sardines
- Offline
- Platinum Member
- Posts: 3127
Hmm, so you're using PAGE to contain the INDI-specific URL?
Yes, this is correct. webtrees supports this as an HTML link to the actual page on the web, which is very handy for an online program.
The PAGE tag in v5.5.1 is defined as:
Specific location with in the information referenced.
In v7.0 GEDCOM we have acknowledged that the PAGE and CALN tags should also have a URL option. Right now the discussion is that it should be used as a Data pair. The documentation FAQ also says:
Instead, the URL can be placed in the `PAGE` structure with the "URL:" label, along with any other label: value pairs, as follows:
1 DEAT
2 DATE 14 DEC 1799
2 SOUR @FindAGraveSourceRecord@
3 PAGE Memorial: 1075, URL: www.findagrave.com/memorial/1075
The use of data pairs better supports the need to list the Additional Data (Line, Page, Enum District, etc) rather than using the TEXT tag.
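A minimal Python sketch of reading and writing such "label: value" pairs in a PAGE value might look like the following. The function names are made up, and it assumes that pairs are separated by ", " and that neither labels nor values contain that separator.

```python
def parse_page(page):
    """Split a PAGE value like "Memorial: 1075, URL: ..." into a dict."""
    pairs = {}
    for chunk in page.split(", "):
        # Split on the first ": " only, so "URL: https://..." stays intact.
        label, sep, value = chunk.partition(": ")
        if sep:
            pairs[label.strip()] = value.strip()
    return pairs


def build_page(pairs):
    """Join label/value pairs back into a single PAGE value."""
    return ", ".join(f"{label}: {value}" for label, value in pairs.items())


page = "Memorial: 1075, URL: www.findagrave.com/memorial/1075"
print(parse_page(page))
print(build_page({"Line": "12", "Page": "4", "Enum District": "31-7"}))
```

A value that itself contains ", " (a place name, for instance) would need a different separator or some escaping, which this sketch does not attempt.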
You also said:
To that end, I'd like to build out some tools that allow a user to export their tree from webtrees, import it into Ancestry.com, work on it there, then export it from Ancestry.com and import the new information back into webtrees, with as little manual labor as possible - it seems like translation between dialects of GEDCOM is the only realistic way to do this, as Ancestry.com does not have a public API.
This is called "round tripping", something that needs a lot of work to accomplish. I don't suggest doing this unless you are very familiar with the GEDCOM rules as used by both systems, have a programming background, and are willing to work on your own. Personally, as I've said before, I use Ancestry for its data, not its tree building. webtrees is far superior for the display of data, does not require an Ancestry login, and will support GEDCOM v5.5.1 and any future release of GEDCOM.
But you can do anything you want; it is your database!
Ken
- kiwi
- Offline
- Platinum Member
"However, their suggestions are very helpful when adding documentation to individuals and families and building out trees,"
Sorry, but I absolutely disagree with this statement. I believe Ancestry “hints” are almost completely useless. They often repeat things I already have, suggest things that are blatantly wrong, and generally serve only to distribute other un-sourced claims from one of the thousands of poor quality trees they have.
The hints ‘might’ suggest a path to further research, but (IMHO) never provide reliable data that I would import into my tree. Just clicking on a “wobbly leaf” might seem attractive, but I suggest you will regret using it eventually.
I totally support Ken’s advice, and follow a very similar approach. It works.
Nigel
www.our-families.info
- tashtari
- Topic Author
- Offline
- New Member
- Posts: 31
With the method I'm using, the SOUR node contains the OBJE references and the URLs (in addition to notes, transcriptions, and whatever else might be useful to someone interested in the source), and in order for a fact to cite it, it only needs to reference the SOUR node itself, nothing needs to be under it.
Obviously this is just my opinion, but I believe it has merit worthy of consideration, and (the custom tag aside) I don't believe it contravenes the standard as defined.
With regard to round-tripping, what I'd like to do is make it an accessible technique so that its benefits aren't outweighed by the toil involved, at least with Ancestry.com. I have the requisite programming background and experience with the mechanics of GEDCOM, but I'd like for people who don't have these things to be able to do it, too.
@kiwi The reason Ancestry.com hints suggest sources you already have is because it doesn't know you already have them. Ancestry.com is built around the idea that you build and work on your tree exclusively on their website, which is self-evidently an unpopular notion among webtrees users. This is a problem I'd like to address with a tool that attaches the _APID citations to facts when round-tripping - that way, once you appropriately add a source to your webtrees tree, Ancestry.com will no longer suggest it. There's inevitably going to be some chaff among its suggestions, but I think this will go a long way towards reducing it.
- norwegian_sardines
- Offline
- Platinum Member
- Posts: 3127
"@norwegian_sardines I feel like what you're doing with SOUR nodes under-utilizes them and causes needless repetition in the INDI node. As I'm sure you well know, a lot of facts can be deduced from a single source. A death certificate, for example, could be cited as a source for a birth, a death, and a burial. With the method you describe, each of those facts needs a SOUR citation that has under it, at minimum, the same URL and the same OBJE reference, while the SOUR node they cite contains not much more than the link to the larger collection."
I don't disagree with this assertion, but the design of the v5.5.1 GEDCOM is the design. Some software programs have added additional extensions to change the design to
I've been using GEDCOM since the early 1980's and as a database designer I can attest that this is a poor design for a database, but GEDCOM was not designed to be a database, it is a data transfer protocol. A lot of software uses it as a database because it is easier to read and write the GEDCOM.
If you want to only enter a Source_Citation once but continue to use the Standard V5.5.1 GEDCOM you can write a module similar to the Census Module that has a single data entry screen but creates all the appropriate GEDCOM code.
AGAIN... What you want to accomplish is a "splitter" design, the v5.5.1 GEDCOM is a "lumper" design. We are working on a new design for a future release of GEDCOM that could fix this long standing need by some users of GEDCOM to support splitting sources.
Ken
- norwegian_sardines
- Offline
- Platinum Member
- Posts: 3127
When it comes to webtrees display, webtrees shows images of a "source extract" (a census page, birth/death/marriage certificate, page from a church/history/jurisdictional book, etc) on the same screen as the actual fact. By contrast, any "Source Artifact" information (the entire book, the website listing the certificate, the entire census) found on the Source_Record requires you to follow the link to the Source_Record to view it. Here again, using the Source_Record as a de facto citation does not make sense! I would want to see the image of the "source extract" right next to the fact that it asserts rather than going to another record type to view the image.
It is important to understand why things are done and how they work before going your own way when using an application!
GEDCOM defines a Source_Record as:
Source records are used to provide a bibliographic description of the source cited. (See the <<SOURCE_CITATION>> structure, page 39, which contains the pointer to this source record.)
GEDCOM identifies the values stored in the Source_Citation as:
The data provided in the <<SOURCE_CITATION>> structure is source-related information specific to the data being cited. Information, such as a page number, to help the user find the cited data within the referenced source. This is stored in the “.SOUR.PAGE” tag context. Actual text from the source that was used in making assertions.
I'm pointing all this out so you understand that it is not "me" who is telling you how the v5.5.1 GEDCOM was defined and used, but that it is in the actual standards document! The Standard also outlines a use-case, where the Source record is titled for the Madison County Birth, Death, and Marriage Records Source Artifact and the Source_Citation identifies the Page number where in the Source Artifact the data was found!
Hopefully this is enough for you to understand the major points!
Ken
- tashtari
- Topic Author
- Offline
- New Member
- Posts: 31
For what it's worth, my rules on OBJEs are that OBJEs linked to an INDI are only allowed to be photos that include that person, while OBJEs linked to a fact are for display purposes only (newspaper articles that the viewer might find interesting to read, for example). I figure that the actual images of things like census pages and death certificates are less interesting to viewers than the fact that they exist and substantiate facts on an INDI or FAM, so I don't mind the extra click.
On the basis that GEDCOM is treated as a database by webtrees, even though this wasn't the intent of the format's creators, I think I'm going to keep doing what I'm doing, with the _ID tag and all. If at some future point I decide to change course or (more likely) the GEDCOM spec evolves to provide a better way of doing what you call the "splitter" design, it should be relatively easy to make a programmatic change to my tree.
How much of a task would it be for me to make a module for webtrees that makes SOUR/REPO/_ID appear as a link to an INDI and be editable as such, rather than the bold red error text it displays as now? Also, my definition of the _ID tag is that it can be an XREF or free-form text, is there any precedent for such a thing in the GEDCOM spec or should I split its functionality into two different tags, one for XREF and one for free-form text?
- norwegian_sardines
- Offline
- Platinum Member
- Posts: 3127
How much of a task would it be for me to make a module for webtrees that makes SOUR/REPO/_ID appear as a link to an INDI and be editable as such, rather than the bold red error text it displays as now?
Examples of modules exist on this website. I’m not a PHP programmer, so I cannot advise.
Also, my definition of the _ID tag is that it can be an XREF or free-form text, is there any precedent for such a thing in the GEDCOM spec or should I split its functionality into two different tags, one for XREF and one for free-form text?
In the v7.x GEDCOM specification the correct way to handle this would be as follows:
Ken
- Jefferson49
- Offline
- Junior Member
- Posts: 241
Although I do not understand all the background from Ancestry, the approach of adding the individual information to the source first, and afterwards writing a converter to get the information back to the individual facts, seems a little bit complicated.
What I understand is that you ultimately want to have _APID tags under your individuals' facts. Wouldn't it be easier to add the _APID information directly while working on the sources and source citations? For example, you could include the _APID information each time you add a source citation to a fact.
- Jefferson49
- Offline
- Junior Member
- Posts: 241
Multiple usage of - more or less - identical source citations is indeed a weakness of GEDCOM 5.5.1 and has been discussed several times in the webtrees forum. One of the forum members issued a proposal to the GEDCOM 7 working group to offer a new record structure for reusable source citations. I would appreciate this concept.
However, how can we handle the current situation? First of all, I do not care too much about the amount of data (i.e. lines of text) for repeating source citations in the GEDCOM file. For today's software this is not relevant any more.
What is worse is the change management. If the source citation changes, it needs to be changed at several places. However, my own experience is that this does not happen frequently. Usually, I work on a source and try to retrieve all information into source citations. Afterwards, it is usually done and not changing any more.
The last issue is the effort for copying/multiplying a source citation. For this use case, I introduced a copy/paste feature for source citations in the Repository Hierarchy custom module. This has proven quite helpful in my own work:
- tashtari
- Topic Author
- Offline
- New Member
- Posts: 31
Potentially, but _APID is only needed when importing a GEDCOM file into Ancestry.com, on top of which it's one more thing to keep consistent - as a software engineer by trade, I'm really trying to keep to the DRY (Don't Repeat Yourself) principle. I don't consider writing a tool to add the _APID tags to be very complicated, so that doesn't worry me.
Your points are well taken. Memory concerns are basically nonexistent (excluding media files, my own GEDCOM file could still fit on a single floppy disk!), and it's true that in practice, source citations usually don't change once added, and a tool to copypaste them makes it easier to handle the current standard's state of affairs.
I suppose I just have too much difficulty with the 5.5.1 approach to citation because it's bad database design. While acknowledging the truth of what norwegian_sardines said before - that GEDCOM wasn't designed as a database schema - it's frequently used as one, including by webtrees, and this is the hand we're dealt.
I appreciate everyone in this thread for indulging me - I came in search of a better/standards-compliant way to do what I wanted to do, but at the present time it feels to me like the cure is worse than the disease. I look forward to GEDCOM 7 including a structure for reusable sources - if/when that's in the standard and supported by webtrees, I will probably start using it. Until that time, though, it seems like I'd best look into creating some extensions to facilitate the way I'm currently working.
- Jefferson49
- Offline
- Junior Member
- Posts: 241
Yes, I agree with the DRY principle. However, when thinking about round trip issues, I would consider it a good idea to have the same data structure on both sides of the tools, even if _APID is not needed in webtrees.
Another question: Why is the _ID with individual information needed below the sources? Isn't the information already available from the relationships between the individual, the fact, and the source? If the source is linked to an individual fact, you could retrieve the INDI reference from the related individual. Maybe you would need a flag on the source that indicates it uses Ancestry data and is relevant for _APID. However, a simple flag might be much easier than the _ID data.
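As a rough sketch of that idea in Python, the individuals citing a flagged source can be recovered from the citations alone. The flag tag `_ANCESTRY` and the function name are hypothetical, and the sample records are made up.

```python
from collections import defaultdict


def citing_individuals(gedcom_text, flag_tag="_ANCESTRY"):
    """Map each flagged source to the individuals whose facts cite it."""
    flagged = set()
    cited_by = defaultdict(set)
    record_xref = record_tag = None
    for raw in gedcom_text.splitlines():
        parts = raw.split()
        if not parts:
            continue
        level = int(parts[0])
        if level == 0:
            # Remember which record we are inside, e.g. "0 @I1@ INDI".
            if len(parts) >= 3 and parts[1].startswith("@"):
                record_xref, record_tag = parts[1], parts[2]
            else:
                record_xref = record_tag = None
        elif record_tag == "SOUR" and level == 1 and parts[1] == flag_tag:
            flagged.add(record_xref)
        elif (record_tag == "INDI" and level == 2
              and parts[1] == "SOUR" and len(parts) >= 3):
            # A citation under a fact: "2 SOUR @S1@"
            cited_by[parts[2]].add(record_xref)
    return {s: sorted(i) for s, i in cited_by.items() if s in flagged}


SAMPLE = """\
0 @I1@ INDI
1 RESI
2 SOUR @S1@
0 @I2@ INDI
1 RESI
2 SOUR @S1@
1 BIRT
2 SOUR @S2@
0 @S1@ SOUR
1 TITL 1950 US Census, Doe household
1 _ANCESTRY Y
0 @S2@ SOUR
1 TITL Family Bible
"""

print(citing_individuals(SAMPLE))
```

This shows who cites a source, but when one source carries several per-person record links it does not by itself say which link belongs to which individual.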
- norwegian_sardines
- Offline
- Platinum Member
- Posts: 3127
I look forward to GEDCOM 7 including a structure for reusable sources -
Sorry if I’m being pedantic but GEDCOM already supports reusable “sources”, what you want is reusable “citations”!
Ken
- Jefferson49
- Offline
- Junior Member
- Posts: 241
You might want to have a look at the MyCustomTags module, which creates several new tags and types, and adds them to certain GEDCOM structures.
Specifically, look at the following parts of the code:
- Franz Frese
- Offline
- Elite Member
Answer:
It is impossible.
A source (repo) never contains only one INDI.