
TOPIC:

google indexing 3 weeks 4 days ago #1

  • comet48
  • Topic Author
  • New Member
  • Posts: 67
About 8 weeks ago I successfully moved my webtrees site, genealogy.haleycentral.com, from a 1.x Windows Apache installation to a 2.x Unraid Docker install. Thanks nathanvaughn for the docker :) The only issue is that it doesn't seem to be getting crawled by Google. Pretty much all the references from Google have disappeared. I haven't been able to find the robots.txt file, but genealogy.haleycentral.com/robots.txt responds with

User-agent: admantx
User-agent: Adsbot
User-agent: AhrefsBot
User-agent: Amazonbot
User-agent: AspiegelBot
User-agent: Barkrowler
User-agent: BLEXBot
User-agent: DataForSEO
User-agent: DataForSeoBot
User-agent: DotBot
User-agent: Grapeshot
User-agent: Honolulu-bot
User-agent: ia_archiver
User-agent: linabot
User-agent: Linguee
User-agent: MegaIndex.ru
User-agent: MJ12bot
User-agent: netEstate NE
User-agent: panscient
User-agent: PetalBot
User-agent: proximic
User-agent: SeekportBot
User-agent: SemrushBot
User-agent: serpstatbot
User-agent: SEOkicks
User-agent: SiteKiosk
User-agent: Turnitin
User-agent: wp_is_mobile
User-agent: XoviBot
User-agent: ZoominfoBot
Disallow: /

User-agent: *
Disallow: /admin
Disallow: /manager
Disallow: /editor
Disallow: /account
Crawl-delay: 10

Am I missing something? Traffic comes in via Cloudflare and an nginx reverse proxy.

Thanks,

Ron H
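
One way to rule robots.txt in or out is to feed the rules to a parser and ask whether Googlebot may fetch a page. A minimal sketch, assuming Python's standard `urllib.robotparser`; the rule text is abbreviated from the file quoted above (only three of the blocked bots are kept), and the parser's matching only approximates how Google itself reads the file.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the robots.txt quoted above (assumption: the
# omitted User-agent lines behave like the three kept here).
ROBOTS_TXT = """\
User-agent: AhrefsBot
User-agent: PetalBot
User-agent: SemrushBot
Disallow: /

User-agent: *
Disallow: /admin
Disallow: /manager
Disallow: /editor
Disallow: /account
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Return True if the rules above let `agent` fetch `path`."""
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for agent, path in [
        ("Googlebot", "/tree/tree1/my-page"),
        ("Googlebot", "/admin"),
        ("bingbot", "/tree/tree1/my-page"),
        ("SemrushBot", "/tree/tree1/my-page"),
    ]:
        print(f"{agent:12} {path:22} allowed={allowed(agent, path)}")
```

With these rules the sketch reports Googlebot and bingbot as allowed on ordinary pages and SemrushBot as blocked, so robots.txt on its own is unlikely to explain the missing Google results.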


Last edit: by comet48. Reason: additional info

google indexing 3 weeks 4 days ago #2

  • comet48
  • Topic Author
  • New Member
  • Posts: 67
I see that SemrushBot/7~bl is happily working along :(


google indexing 3 weeks 4 days ago #3

  • fisharebest
  • Administrator
  • Posts: 17010
> I haven't been able to find the robots.txt file

With pretty URLs, the robots.txt is generated automatically.

I see from your robots.txt that you aren't using sitemaps.

Perhaps you should. Google apparently likes them...
Greg Roach - @fisharebest@phpc.social - fisharebest.webtrees.net


google indexing 3 weeks 3 days ago #4

  • comet48
  • Topic Author
  • New Member
  • Posts: 67
Thanks - I've selected that, and will see what it does.
The other issue is that although Bing seems to be doing some crawling of the site, it requests URLs that are invalid,
e.g. genealogy.haleycentral.com/individual.php?pid=I39268
This provides no ged reference, and generates the message "The parameter “ged” is missing."
Legacy and pretty URLs are both selected.


Last edit: by comet48.

google indexing 3 weeks 3 days ago #5

  • comet48
  • Topic Author
  • New Member
  • Posts: 67
Notifying Google of the sitemap worked, but the Bing request responded with
"This www.bing.com page can’t be found"


Last edit: by comet48.

google indexing 3 weeks 3 days ago #6

  • fisharebest
  • Administrator
  • Posts: 17010
Bing has discontinued its public sitemap submission page.
See github.com/fisharebest/webtrees/issues/4772
You will need to submit the sitemap using the Bing Webmaster Tools site.

google indexing 3 weeks 3 days ago #7

  • comet48
  • Topic Author
  • New Member
  • Posts: 67
Thanks Greg. Did you see my post above about the Bing crawler dropping the tree info from the URL? Perhaps Google is too, and the references I see on Google are from my earlier 1.x version on the Windows machine.

"The other issue is that although bing seems to be doing some crawling of the site, it comes up with a reference that's illegal.
e.g. genealogy.haleycentral.com/individual.php?pid=I39268
provides no ged reference, and generates a message "The parameter “ged” is missing."
Legacy and pretty url's are both selected."


Last edit: by comet48.

google indexing 3 weeks 3 days ago #8

  • fisharebest
  • Administrator
  • Posts: 17010
I know that you can configure Google to ignore certain URL parameters.
Maybe Bing is the same.
Maybe you did this?

Did this URL come from the sitemap or from crawling the site?
Maybe it came from a user-typed URL...

google indexing 3 weeks 3 days ago #9

  • comet48
  • Topic Author
  • New Member
  • Posts: 67
Pretty sure it came from a crawler.
From the log:
webtrees:80 172.18.0.1 - - [28/Feb/2023:11:15:48 -0800] "GET /individual.php?ged=tree1&pid=I15839 HTTP/1.1" 406 487 "-" "Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)"
webtrees:80 172.18.0.1 - - [28/Feb/2023:11:15:50 -0800] "GET /module.php?mod=descendancy&mod_action=descendants&xref=I320073 HTTP/1.1" 406 487 "-" "Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)"
webtrees:80 172.18.0.1 - - [28/Feb/2023:11:16:04 -0800] "GET /individual.php?pid=i322261 HTTP/1.1" 406 487 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/103.0.5060.134 Safari/537.36"

Bingbot is not picking up the tree, whereas SemrushBot does.
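
The difference between these requests can be checked mechanically. A small sketch, assuming Python's standard `urllib.parse`; the helper name `missing_tree` is made up for the example, and the request paths are copied from the log lines above.

```python
from urllib.parse import parse_qs, urlsplit

# Request paths copied from the log lines above.
REQUESTS = [
    "/individual.php?ged=tree1&pid=I15839",
    "/module.php?mod=descendancy&mod_action=descendants&xref=I320073",
    "/individual.php?pid=i322261",
]


def missing_tree(path: str) -> bool:
    """True if a legacy-style URL has no `ged` (tree name) parameter."""
    query = parse_qs(urlsplit(path).query)
    return "ged" not in query


if __name__ == "__main__":
    for path in REQUESTS:
        print(missing_tree(path), path)
```

Only the first request carries `ged=tree1`; the SemrushBot request for module.php and the bingbot request both lack it. All three received a 406 response regardless.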


Last edit: by comet48.

google indexing 3 weeks 3 days ago #10

  • fisharebest
  • Administrator
  • Posts: 17010
Maybe bing is fetching the link because it found it somewhere else. Maybe on another site?

AFAICT, webtrees has never generated links without the ged parameter.

google indexing 3 weeks 1 day ago #11

  • comet48
  • Topic Author
  • New Member
  • Posts: 67
I'm getting the message from google "URL is not available to Google". Google for some reason is not seeing the website. Any ideas?

Current robots.txt (pretty urls)
# robots.txt for genealogy.haleycentral.com

User-agent: admantx
User-agent: Adsbot
User-agent: AhrefsBot
User-agent: Amazonbot
User-agent: AspiegelBot
User-agent: Barkrowler
User-agent: BLEXBot
User-agent: DataForSEO
User-agent: DataForSeoBot
User-agent: DotBot
User-agent: Grapeshot
User-agent: Honolulu-bot
User-agent: ia_archiver
User-agent: linabot
User-agent: Linguee
User-agent: MegaIndex.ru
User-agent: MJ12bot
User-agent: netEstate NE
User-agent: panscient
User-agent: PetalBot
User-agent: proximic
User-agent: SeekportBot
User-agent: SemrushBot
User-agent: serpstatbot
User-agent: SEOkicks
User-agent: SiteKiosk
User-agent: Turnitin
User-agent: wp_is_mobile
User-agent: XoviBot
User-agent: ZoominfoBot
Disallow: /

User-agent: *
Disallow: /admin
Disallow: /manager
Disallow: /editor
Disallow: /account
Crawl-delay: 10

Sitemap: genealogy.haleycentral.com/sitemap.xml


It says it couldn't fetch the sitemap. I have about 100,000 individuals. Is there any possibility it exceeded the 50MB limit? I can't find it btw:)

I emptied /data/cache/@


Last edit: by comet48.

google indexing 3 weeks 1 day ago #12

  • Lars1963
  • Junior Member
  • Posts: 164
As said in this thread www.webtrees.net/index.php/en/forum/help...temap-and-robots-txt I have the same problem with Google. For some reason Google seems not to like the webtrees generated sitemap (anymore?).
Lars van Ravenzwaaij - see my family tree at www.ravenzwaaij.info


Last edit: by Lars1963.

google indexing 3 weeks 1 day ago #13

  • Franz Frese
  • Premium Member
  • Posts: 659

... genealogy.haleycentral.com...
By the way: the start person of your tree is not publicly accessible!

And I can see the sitemap without error.


Last edit: by Franz Frese.

google indexing 3 weeks 1 day ago #14

  • comet48
  • Topic Author
  • New Member
  • Posts: 67
What do you mean by Start person?


google indexing 3 weeks 1 day ago #15

  • fisharebest
  • Administrator
  • Posts: 17010
> I'm getting the message from google "URL is not available to Google". Google for some reason is not seeing the website. Any ideas?

Could be a robots.txt issue. Have you recently changed your robots.txt? Could the previous version have blocked Google?

Can you find an entry in your webserver logs for when Google tried to fetch the sitemap? Perhaps it received an error response from the server?

> Is there any possibility it exceeded the 50MB limit? I can't find it btw:)

webtrees splits the sitemap into smaller files, each with 500 links.
This keeps every file below 50MB.
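
As a rough check of these numbers, here is a small sketch that assumes the roughly 100,000 individuals mentioned earlier in the thread and ignores other record types, whose counts are not given. The 50,000-URL figure is the per-file limit from the sitemaps.org protocol; the function name is made up for the example.

```python
import math

LINKS_PER_FILE = 500            # figure given in the post above
MAX_URLS_PER_SITEMAP = 50_000   # per-file limit in the sitemaps.org protocol


def sitemap_files(records: int, per_file: int = LINKS_PER_FILE) -> int:
    """Number of sitemap files needed to list `records` links."""
    return math.ceil(records / per_file)


if __name__ == "__main__":
    individuals = 100_000  # approximate figure from the thread
    print(sitemap_files(individuals))  # 200
```

At 500 links per file, a single file would only approach 50MB if each entry averaged about 100KB, far more than a URL entry needs.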

google indexing 3 weeks 1 day ago #16

  • Franz Frese
  • Premium Member
  • Posts: 659

> What do you mean by Start person?
Attachments:


google indexing 3 weeks 1 day ago #17

  • comet48
  • Topic Author
  • New Member
  • Posts: 67
snips attached
Attachments:


google indexing 3 weeks 1 day ago #18

  • fisharebest
  • Administrator
  • Posts: 17010
Do you have your apache logs, which show google fetching (or trying to fetch) the sitemaps?

google indexing 3 weeks 1 day ago #19

  • comet48
  • Topic Author
  • New Member
  • Posts: 67
I'm still looking for the main log file in the Docker container. Here is some output, however:
webtrees:80 172.18.0.1 - - [03/Mar/2023:01:55:49 -0800] "GET /tree/tree1/my-page-block?block_id=73 HTTP/1.1" 200 3845 "genealogy.haleycentral.com/tree/tree1/my-page" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
webtrees:80 172.18.0.1 - - [03/Mar/2023:01:55:49 -0800] "GET /tree/tree1/my-page-block?block_id=91 HTTP/1.1" 200 4138 "genealogy.haleycentral.com/tree/tree1/my-page" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
webtrees:80 172.18.0.1 - - [03/Mar/2023:01:55:49 -0800] "GET /tree/tree1/media-thumbnail?xref=M409&fact_id=cf51bb423540dbe611134970a4e87a7d&w=400&h=400&fit=contain&mark=0&s=4c7b029820f223be28d1be1541d4a910 HTTP/1.1" 200 29601 "genealogy.haleycentral.com/tree/tree1/my-page" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
webtrees:80 127.0.0.1 - - [03/Mar/2023:01:56:09 -0800] "GET / HTTP/1.1" 302 669 "-" "curl/7.74.0"
webtrees:80 172.18.0.1 - - [03/Mar/2023:01:56:14 -0800] "GET /individual.php?pid=i322098 HTTP/1.1" 406 487 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/103.0.5060.134 Safari/537.36"
webtrees:80 172.18.0.1 - - [03/Mar/2023:01:56:21 -0800] "GET /tree/tree1/my-page-edit HTTP/1.1" 200 105229 "genealogy.haleycentral.com/tree/tree1/my-page" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
webtrees:80 172.18.0.1 - - [03/Mar/2023:01:56:19 -0800] "GET /individual.php?pid=i322091 HTTP/1.1" 406 487 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/103.0.5060.134 Safari/537.36"
webtrees:80 172.18.0.1 - - [03/Mar/2023:01:56:24 -0800] "GET /individual.php?pid=i322257 HTTP/1.1" 406 487 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/103.0.5060.134 Safari/537.36"
webtrees:80 127.0.0.1 - - [03/Mar/2023:01:56:41 -0800] "GET / HTTP/1.1" 302 669 "-" "curl/7.74.0"
webtrees:80 172.18.0.1 - - [03/Mar/2023:01:57:08 -0800] "GET /note.php?nid=BI38627&ged=tree1 HTTP/1.1" 406 487 "genealogy.haleycentral.com/individual.ph...on=ajax&module=notes" "Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"
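
Output like this is easier to read once it is tallied by client address and status code. A minimal sketch, assuming the log layout shown above (virtual host, client IP, timestamp, request, status, size, referer, user agent); the pattern, the `tally` helper, and the crude "bot" substring test are assumptions made for the example.

```python
import re
from collections import Counter

# Pattern for the log layout shown above.
LINE = re.compile(
    r'^(?P<vhost>\S+) (?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# Three lines copied from the output above.
SAMPLE = [
    'webtrees:80 127.0.0.1 - - [03/Mar/2023:01:56:09 -0800] "GET / HTTP/1.1" 302 669 "-" "curl/7.74.0"',
    'webtrees:80 172.18.0.1 - - [03/Mar/2023:01:56:14 -0800] "GET /individual.php?pid=i322098 HTTP/1.1" 406 487 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/103.0.5060.134 Safari/537.36"',
    'webtrees:80 172.18.0.1 - - [03/Mar/2023:01:56:21 -0800] "GET /tree/tree1/my-page-edit HTTP/1.1" 200 105229 "genealogy.haleycentral.com/tree/tree1/my-page" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"',
]


def tally(lines):
    """Count (client IP, status, looks-like-bot) combinations."""
    counts = Counter()
    for line in lines:
        m = LINE.match(line)
        if m is None:
            continue  # skip lines that do not fit the assumed layout
        is_bot = "bot" in m["agent"].lower()
        counts[(m["ip"], int(m["status"]), is_bot)] += 1
    return counts


if __name__ == "__main__":
    for key, n in sorted(tally(SAMPLE).items()):
        print(key, n)
```

In the sample, the desktop browser and bingbot both arrive from 172.18.0.1, so the address column alone cannot tell them apart; only the bot request is answered with 406.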


google indexing 3 weeks 1 day ago #20

  • fisharebest
  • Administrator
  • Posts: 17010
You are running webtrees behind a proxy server.

Therefore webtrees sees the same IP address for all external requests, i.e. 172.18.0.1

webtrees validates search engines against IP addresses.
The IP address 172.18.0.1 is not valid for bingbot.
Therefore it is being rejected with a 406 response.
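
For readers unfamiliar with this kind of check: a common way to validate a crawler is a forward-confirmed reverse DNS lookup, which is the method Google and Bing document for verifying their own bots. The sketch below is a generic illustration and not the webtrees implementation; the function name and the domain table are assumptions made for the example.

```python
import ipaddress
import socket

# Hostname suffixes the search engines publish for their crawlers.
CRAWLER_DOMAINS = {
    "googlebot": (".googlebot.com", ".google.com"),
    "bingbot": (".search.msn.com",),
}


def is_genuine_crawler(ip, bot, reverse=socket.gethostbyaddr,
                       forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for a claimed crawler IP."""
    if ipaddress.ip_address(ip).is_private:
        return False  # e.g. 172.18.0.1 is a private address
    try:
        host = reverse(ip)[0]             # IP -> hostname
        if not host.endswith(CRAWLER_DOMAINS[bot]):
            return False
        return ip in forward(host)[2]     # hostname -> IPs must include ip
    except OSError:
        return False                      # lookup failed: not verified


if __name__ == "__main__":
    # No DNS lookup is needed here: the private-address test fails first.
    print(is_genuine_crawler("172.18.0.1", "bingbot"))  # False
```

Whatever the exact method, a request that appears to come from 172.18.0.1 cannot be matched to a search engine, because that address belongs to the proxy side of the connection rather than to the crawler.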

You need to

(a) configure your proxy to provide the real IP address in a header, for example HTTP_X_FORWARDED_FOR

(b) tell webtrees to look in this header to find the real IP address. See webtrees.net/install/cloudflare/ for how to tell webtrees about the header.
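
To illustrate the idea behind these two steps in general terms: once the proxy forwards the original address (with nginx this is commonly done with `proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;`), the application can walk the header from right to left and take the first address that is not one of its own proxies. A minimal sketch of that idea, not the webtrees code; the function name, the trusted network list, and the sample addresses are assumptions for the example.

```python
import ipaddress

# Assumption: the proxies we trust sit on this private network (the
# 172.18.0.1 address seen in the logs falls inside it).
TRUSTED_PROXIES = [ipaddress.ip_network("172.16.0.0/12")]


def client_ip(remote_addr: str, forwarded_for: str = "") -> str:
    """Return the most likely real client address for a request."""
    # The header lists addresses oldest first; the direct peer comes last.
    chain = [p.strip() for p in forwarded_for.split(",") if p.strip()]
    chain.append(remote_addr)
    for candidate in reversed(chain):
        addr = ipaddress.ip_address(candidate)
        if not any(addr in net for net in TRUSTED_PROXIES):
            return candidate  # first hop that is not one of our proxies
    return remote_addr  # every hop was a trusted proxy


if __name__ == "__main__":
    # 203.0.113.7 is a documentation-range address standing in for a crawler.
    print(client_ip("172.18.0.1", "203.0.113.7"))  # 203.0.113.7
    print(client_ip("172.18.0.1"))                 # 172.18.0.1
```

Without the header, the only address available is the proxy's own, which matches the 172.18.0.1 entries in the log. With Cloudflare in front as well, its addresses would presumably also need to be treated as trusted; the page linked above covers the webtrees side of that setup.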
