census data format and FTA checks

Oct 5, 2014 at 7:03 PM
Edited Oct 5, 2014 at 8:02 PM
Is there a complete description of how FTA checks for census record validity? Some information is available in the discussions, but incomplete. There appears to be a location/place check, but what happens if the street address is included as well? Is the PRO reference checked and if so how should the complete reference be structured?
Oct 5, 2014 at 10:49 PM
Can you clarify what you mean by "census record validity", i.e. which part of the program is doing the validity checking you are asking about? I'm not sure whether you are referring to census references, to a GEDCOM census record, or to something else.
Oct 6, 2014 at 10:26 AM

My apologies for lack of clarity – I should have taken a deep breath :-)

I was referring in part to the Unrecognised Census Refs function/report. An example is below, followed by the GEDCOM snippet.

Unknown Census Ref: RG 13/ 146 page 34
Unknown Census Ref: RG12/128 page 66
Unknown Census Ref: RG14/ PN7284 SN463
Unknown Census Ref: RG14/ PN895 SN67

2 TYPE C06 Civic, Census
2 DATE 2 APR 1911
2 PLAC 361, West Green Rd., Tottenham, Middlesex, England
2 CAUS Taken
2 RIN 1004
2 NOTE Coal Porter
2 _PPT @I172@
3 TYPE 0 event owner
3 _NSQ 1
2 _PPT @I174@
3 TYPE 2 observer
3 _NSQ 4
3 NOTE Wife
2 _PPT @I182@
3 TYPE 2 observer
3 _NSQ 0
2 _PPT @I606@
3 TYPE 2 observer
3 _NSQ 0
2 _PPT @I607@
3 TYPE 2 observer
3 _NSQ 0
2 SOUR @S12@
3 PAGE RG14/ PN7284 SN463
3 QUAY 3


0 @S12@ SOUR
1 TITL 1911 Census of Engand and Wales
1 QUAY 3
1 TYPE I69 Published Books, Census Records
1 REPO @R328@
1 RIN 1899


So I guess my questions were: What checks does FTA carry out, for example, on the above gedcom structure to validate the census ref?
What assumptions does FTA make concerning the structure of a census event recorded in a gedcom record?
I also noted from one discussion that some programs output the census place data as:


But when I changed the above file structure to

2 PLAC Tottenham, Middlesex, England
2 ADDR 361, West Green Rd.

I still got the Unknown census ref. message.

An additional question is: If a census record has an unknown census reference, does that impact on what is displayed in the colour coded census report, i.e. is it flagged in red or is the error ignored and the record flagged as entered?

Does the above help in understanding my question(s)?


Oct 6, 2014 at 10:09 PM
Yes that helps enormously.

The address is in no way validated for the census references. The ONLY thing that is validated for census references is the actual reference. There are numerous regular expression patterns that are used to match possible references. These are a work in progress and only by seeing what users have entered can I understand any tweaks that are required to the regular expression patterns.

The patterns and the code that matches them are available in CensusReference.cs. You can see the latest version at https://ftanalyzer.codeplex.com/SourceControl/latest#FTAnalyser/Core/CensusReference.cs. Near the top of that file is a list of all the patterns used.

Naturally I have no idea about your level of skill with regular expressions. I'm on a course at present so I'm not at my PC to be able to explain in more detail. I'm just typing this on my iPad, so it's not as easy to cross-reference things. It may be that you would need a non-technical explanation rather than being able to decipher the REGEX.

The basic info I can give is that it's ONLY the reference that is looked at to determine if it's a valid format. So in this case it's the

3 PAGE RG14/ PN7284 SN463

that is the only line being looked at for the census reference. In this case my guess, without having my PC to hand to check, is that it is the spurious space between the / and the PN that is not being recognised.
Oct 6, 2014 at 10:18 PM
Edited Oct 6, 2014 at 10:20 PM
You could test this theory by removing the space from the GEDCOM and seeing if that reference still shows as unknown, i.e.:

3 PAGE RG14/PN7284 SN463

If it does work then that's a good example of the pattern being too rigid and needing to be tweaked to tolerate that sort of spurious space. Sadly users will be very inconsistent in the manner of recording references and the patterns need to strike a careful balance. Too "loose" and a pattern will match incorrect sequences of characters and thus get the reference wrong. Too tight, as here, and reasonably valid formats get treated as unknown.

Note in your other examples you have three different patterns with the extra space in a different position each time. It's this sort of inconsistency that makes the pattern matching hard. So initially I went for quite tight patterns and as people let me know the errors I can incorporate tweaks to match patterns where there are consistency issues in the GEDCOM.
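
To make the tight versus loose trade-off concrete, here is a minimal sketch in Python. Both patterns are hypothetical and written only for illustration; they are not the patterns in CensusReference.cs, and the names STRICT_1911, TOLERANT_1911 and parse_1911 are made up for this example.

```python
import re

# Hypothetical patterns for illustration only; NOT the ones used by FTA.
# The strict pattern accepts exactly one layout: "RG14 PN<digits> SN<digits>".
STRICT_1911 = re.compile(r"^RG14 PN(\d+) SN(\d+)$")

# The tolerant pattern allows optional spaces and an optional "/" after the class.
TOLERANT_1911 = re.compile(
    r"^RG\s*14\s*/?\s*PN\s*(\d+)\s+SN\s*(\d+)$", re.IGNORECASE
)


def parse_1911(reference, pattern):
    """Return (piece, schedule) as strings, or None if the reference does not match."""
    match = pattern.match(reference.strip())
    return match.groups() if match else None


if __name__ == "__main__":
    for ref in ["RG14/ PN7284 SN463", "RG14/ PN895 SN67", "RG14 PN7284 SN463"]:
        print(ref, "| strict:", parse_1911(ref, STRICT_1911),
              "| tolerant:", parse_1911(ref, TOLERANT_1911))
```

The strict pattern rejects the first two references because of the "/ " after RG14, while the tolerant one accepts all three. The cost of the tolerant version is that every extra optional element widens what it will accept, which is the balance described above.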
Oct 7, 2014 at 3:04 PM
Edited Oct 7, 2014 at 3:05 PM
I just noticed I forgot to answer your follow up question. The census ref has no impact at all on the colour coding of the census report. The reference detection is intended simply to assist users finding the census details for entry to the Lost Cousins website.
Oct 7, 2014 at 4:20 PM
Thanks for getting back so quickly.

I tried removing the space, but that didn't work either. However, when I removed the '/' instead everything worked OK.

For the RG12 & RG13 references I needed to figure out the REGEX stuff. When I entered the reference longhand and used a ',' separator all went OK.

I didn't see it mentioned in the Wiki, but it might be useful to give examples of accepted reference formats for census data.

Once again thanks for your help and I hope I haven't interrupted your course too much.

Jan 19, 2015 at 12:34 AM
Hi Levva,
How is this progressing? I just hit the same problem after I'd merged several bits and different versions of my tree, and I have to plead guilty to being the program designer's nightmare: my data entry is accurate but not consistent. In many cases I will cut, paste, and edit the references from whichever site I happened to have used to find the census info.
FTA rejects all of the following for 1881:-

Piece 0282 Folio 33 Page 32
RG11 1401/8/9
RG11 Piece 1399 folio 77 page 10

[I'll tell you now that I'm a retired Business Analyst / System Designer, so I want the Moon on a Stick!!]

I could almost certainly read and understand your 'code' and figure out what's wrong with them, but I am probably quite a rare bird in that respect - a list of 'FTA valid formats' would be really useful.

For what it's worth (and not wanting to teach Grandma to suck eggs) the way I would tackle this would be:-

1) Search for the expected PRO class number (e.g. "RG11" for 1881). If you find another census's PRO class number (e.g. RG14 = 1911), or indeed any other AAnnn (esp. AA = RG or HO), report an error.
2) Take out the PRO class number, and then replace all non-numeric characters with spaces, then remove repeated and trailing spaces
3) Certainly for most censuses you should now have three numeric strings which should be straightforward to validate
(I've not done the analysis on all other censuses!)
4) Now, you can put the PRO class number, Piece, Folio, and Page in separate columns, and even build a consistent Reference number
5) Moon on a stick - update my Gedcom with nice consistent references :)
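
The steps above can be sketched roughly as follows in Python. This is only an illustration of the proposal, not how FTA works: the function name normalise_reference, the returned dictionary keys and the rebuilt "Piece/Folio/Page" layout are all assumptions made for the example, and the rebuilt reference is not claimed to be a format FTA accepts. Step 5 (writing back to the GEDCOM) is left out.

```python
import re

# Matches a PRO class such as "RG11", "RG 13" or "HO107" (assumed shape: AA + up to 3 digits).
CLASS_PATTERN = re.compile(r"\b(RG|HO)\s*(\d{1,3})\b", re.IGNORECASE)


def normalise_reference(raw, expected_class="RG11"):
    """Split a free-form reference into class, piece, folio and page.

    Raises ValueError if a different class is found or if the remainder
    does not reduce to exactly three numbers.
    """
    # Step 1: look for a class number and check it against the expected one.
    found = CLASS_PATTERN.search(raw)
    if found:
        census_class = (found.group(1) + found.group(2)).upper()
        if census_class != expected_class:
            raise ValueError(f"expected {expected_class}, found {census_class}")
        remainder = raw[:found.start()] + " " + raw[found.end():]
    else:
        # Assumption: a reference with no class belongs to the expected census.
        census_class = expected_class
        remainder = raw

    # Step 2: replace every run of non-digits with a space, then split on spaces.
    numbers = re.sub(r"\D+", " ", remainder).split()

    # Step 3: most censuses should now yield piece, folio and page.
    if len(numbers) != 3:
        raise ValueError(f"expected 3 numbers, found {len(numbers)}: {numbers}")
    piece, folio, page = (int(n) for n in numbers)

    # Step 4: separate columns plus one consistent (hypothetical) reference string.
    return {
        "class": census_class,
        "piece": piece,
        "folio": folio,
        "page": page,
        "reference": f"{census_class} Piece {piece} Folio {folio} Page {page}",
    }


if __name__ == "__main__":
    for ref in ["Piece 0282 Folio 33 Page 32",
                "RG11 1401/8/9",
                "RG11 Piece 1399 folio 77 page 10"]:
        print(normalise_reference(ref)["reference"])
```

Note that this only covers the piece/folio/page shape. A 1911-style reference such as "RG14/ PN7284 SN463" has two numbers rather than three, so it would need its own rule.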
Jan 19, 2015 at 11:27 PM
Sadly, until a major project at work is complete, probably April/May, I just don't have the energy/enthusiasm to update the program. Basically I get home at night and rarely want to switch on the PC, let alone code. So there won't be any substantial revision until then, sorry.

Feel free to download the code from the Codeplex repository to investigate options for yourself. Indeed if you find a fix, Codeplex has an option to submit a patch!!!

Basically the code uses regular expressions to extract the strings as you suggest. The fix is likely to be a tweak to an existing regex or the addition of a new pattern.
Jan 19, 2015 at 11:31 PM
All the "valid formats" are in the CensusReference class. You can download the source from the source code tab.