SeoGuru & Linux
SeoGuru
Hey Linux, I’ve been digging into how open‑source tools can turbo‑charge SEO for projects. Have you tried any open‑source SEO solutions that keep the community involved while still hitting performance goals?
Linux
Absolutely! I’ve been playing around with a few tools that stay true to the open‑source spirit while really boosting SEO. First, there’s **Screaming Frog SEO Spider** – it’s open‑source, community‑driven, and great for crawling sites and spotting issues fast. Then there’s **Piwik PRO** (the open‑source version of Matomo) for analytics; it’s privacy‑friendly and lets you dig into search performance without giving up control. For automated content optimization, **RankMath** and **Yoast SEO** plugins are open‑source, and the community keeps adding new features for schema, sitemaps, and keyword tracking. All of these let you tweak performance, run tests, and keep the codebase under community scrutiny, so you get the best of both worlds. Let me know if you want to dig deeper into any of them!
SeoGuru
Nice lineup, but a few corrections: Screaming Frog is actually a proprietary tool with a free tier (capped at 500 URLs), not open‑source at all. On analytics you’ve got it backwards: Matomo (formerly Piwik) is the open‑source project, while Piwik PRO is a separate commercial fork, so stick with Matomo itself. RankMath and Yoast do publish their core plugins as open source (GPL, like every WordPress.org plugin), but development is vendor‑driven and the advanced features sit behind premium tiers. If you want to keep everything truly community‑driven, I’d pair Matomo with a crawler you control: Xenu’s Link Sleuth is handy but freeware rather than open source, so writing a small crawler with Python’s Scrapy is the cleaner fit. Let me know which angle you’d like to explore (crawler, analytics, or content optimization) and I can dive into specifics.
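Just to give a sense of scale for the Scrapy route, a broken‑link checker can be a single file you run with `scrapy runspider`. Here’s a rough sketch (the domain is a placeholder, and the settings are just sensible defaults):

```
# linkcheck.py -- rough single-file broken-link checker (placeholder domain)
import scrapy


class LinkCheckSpider(scrapy.Spider):
    name = 'linkcheck'
    allowed_domains = ['example.com']
    start_urls = ['https://www.example.com']
    # Let 4xx/5xx responses reach parse() so they can be reported.
    custom_settings = {'HTTPERROR_ALLOW_ALL': True, 'DOWNLOAD_DELAY': 1}

    def parse(self, response):
        if response.status >= 400:
            yield {'url': response.url, 'status': response.status}
        # Only HTML pages have links worth following.
        if b'text/html' in response.headers.get('Content-Type', b''):
            for link in response.css('a::attr(href)').getall():
                if link.startswith(('mailto:', 'javascript:')):
                    continue
                yield response.follow(link, callback=self.parse)
```

Run it with `scrapy runspider linkcheck.py -o broken.json` and you get a list of dead URLs with no project scaffolding at all.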
Linux
Let’s dive into the crawler side first—I’ve got a feeling Scrapy will let us build a lightweight, community‑friendly engine that can still keep up with performance. We can tweak it to grab structured data, detect broken links, and feed the results straight into Matomo for analytics. Sound good?
SeoGuru
Sounds solid. Scrapy’s async requests keep the crawl fast, and downloader middlewares can pull JSON‑LD, RDFa, and Microdata in a single pass. For broken‑link detection you don’t even need to re‑fetch anything: each response already carries its status code, so a simple pipeline that logs anything 4xx/5xx does the job (just let those responses through to the spider). From there you can emit events to Matomo via its HTTP Tracking API so the dashboard updates in near real time. Throttle per domain to stay polite, and consider a scheduler like Celery if you need daily runs. Let me know if you need a starter project skeleton or sample pipeline code.
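Oh, and on the politeness front, Scrapy’s built‑in knobs usually cover it; something like this in settings.py is a sane starting point (the numbers are only illustrative, tune them per site):

```
# settings.py -- polite, per-domain throttling (values are illustrative)
ROBOTSTXT_OBEY = True                   # honour robots.txt
DOWNLOAD_DELAY = 1                      # base delay between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 2      # hard cap per domain
AUTOTHROTTLE_ENABLED = True             # adapt the delay to observed latency
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # aim for ~1 request in flight per domain
```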
Linux
Here’s a minimal skeleton you can copy straight into a new repo.

**Folder layout**

```
scrapy-seo/
├── scrapy_seo/
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders/
│       ├── __init__.py
│       └── seo_spider.py
├── run_spider.py
└── requirements.txt
```

**requirements.txt**

```
Scrapy
beautifulsoup4
celery
requests
```

**scrapy_seo/items.py**

```
import scrapy


class SeoItem(scrapy.Item):
    url = scrapy.Field()
    status = scrapy.Field()
    json_ld = scrapy.Field()
    rdfa = scrapy.Field()
    microdata = scrapy.Field()
```

**scrapy_seo/middlewares.py**

```
from bs4 import BeautifulSoup
from scrapy.http import HtmlResponse


class JsonLdMiddleware:
    """Extract JSON-LD blocks while the response passes through the downloader."""

    def process_response(self, request, response, spider):
        if not isinstance(response, HtmlResponse):
            return response  # skip images, PDFs, etc.
        soup = BeautifulSoup(response.text, 'html.parser')
        ld_json = soup.find_all('script', type='application/ld+json')
        # Stash results on request.meta; the spider reads them via response.meta.
        request.meta['json_ld'] = [script.string for script in ld_json]
        return response


class RdfaMiddleware:
    """Simple RDFa extraction with BeautifulSoup; swap in rdflib if you need full parsing."""

    def process_response(self, request, response, spider):
        if not isinstance(response, HtmlResponse):
            return response
        soup = BeautifulSoup(response.text, 'html.parser')
        rdfa = soup.find_all(attrs={'typeof': True})
        request.meta['rdfa'] = [tag.get('typeof') for tag in rdfa]
        return response


class MicrodataMiddleware:
    """Collect Microdata itemtype attributes."""

    def process_response(self, request, response, spider):
        if not isinstance(response, HtmlResponse):
            return response
        soup = BeautifulSoup(response.text, 'html.parser')
        micro = soup.find_all(attrs={'itemtype': True})
        request.meta['microdata'] = [tag.get('itemtype') for tag in micro]
        return response
```

**scrapy_seo/pipelines.py**

```
import logging

import requests

logger = logging.getLogger(__name__)


class StatusPipeline:
    """Flag broken links as items come through."""

    def process_item(self, item, spider):
        if item.get('status', 200) >= 400:
            logger.warning('Broken link: %s returned %s', item['url'], item['status'])
        return item


class MatomoPipeline:
    """Report each crawled URL to Matomo's HTTP Tracking API."""

    def __init__(self, url, token):
        self.url = url
        self.token = token

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            url=crawler.settings.get('MATOMO_URL'),
            token=crawler.settings.get('MATOMO_TOKEN'),
        )

    def process_item(self, item, spider):
        payload = {
            'idsite': 1,
            'rec': 1,
            'url': item['url'],
            'ua': 'Scrapy',
            'action_name': 'Crawl',
            'token_auth': self.token,
        }
        requests.post(self.url, data=payload, timeout=10)
        return item
```

**scrapy_seo/settings.py**

```
BOT_NAME = 'scrapy_seo'

SPIDER_MODULES = ['scrapy_seo.spiders']
NEWSPIDER_MODULE = 'scrapy_seo.spiders'

ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 1

# Let 4xx/5xx responses reach the spider so broken links get recorded.
HTTPERROR_ALLOW_ALL = True

ITEM_PIPELINES = {
    'scrapy_seo.pipelines.StatusPipeline': 300,
    'scrapy_seo.pipelines.MatomoPipeline': 400,
}

DOWNLOADER_MIDDLEWARES = {
    'scrapy_seo.middlewares.JsonLdMiddleware': 543,
    'scrapy_seo.middlewares.RdfaMiddleware': 544,
    'scrapy_seo.middlewares.MicrodataMiddleware': 545,
}

MATOMO_URL = 'https://your-matomo-instance.com/matomo.php'
MATOMO_TOKEN = 'your_api_token'
```

**scrapy_seo/spiders/seo_spider.py**

```
import scrapy
from scrapy.http import HtmlResponse

from ..items import SeoItem


class SeoSpider(scrapy.Spider):
    name = 'seo'
    allowed_domains = ['example.com']
    start_urls = ['https://www.example.com']

    def parse(self, response):
        item = SeoItem()
        item['url'] = response.url
        item['status'] = response.status
        item['json_ld'] = response.meta.get('json_ld', [])
        item['rdfa'] = response.meta.get('rdfa', [])
        item['microdata'] = response.meta.get('microdata', [])
        yield item

        # Only HTML pages have links to follow; allowed_domains keeps the crawl on-site.
        if not isinstance(response, HtmlResponse):
            return
        for link in response.css('a::attr(href)').getall():
            if link.startswith(('mailto:', 'javascript:', 'tel:')):
                continue
            yield response.follow(link, callback=self.parse)
```

**run_spider.py**

```
import os

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Point Scrapy at the project settings so no scrapy.cfg is needed.
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'scrapy_seo.settings')

process = CrawlerProcess(get_project_settings())
process.crawl('seo')
process.start()
```

That gives you a functional, lightweight crawler that pulls structured data, logs status codes, and
reports each page back to Matomo in real time. You can hook Celery into the run_spider.py to schedule daily runs, and tweak DOWNLOAD_DELAY or CONCURRENT_REQUESTS_PER_DOMAIN to stay polite. Happy hacking!
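P.S. If you do wire in Celery, the scheduling side is just a beat entry that shells out to run_spider.py. A minimal sketch (the Redis broker URL, module name, and schedule are assumptions, and it assumes the worker runs from the repo root):

```
# tasks.py -- hypothetical Celery module; assumes a Redis broker on localhost
import subprocess
import sys

from celery import Celery
from celery.schedules import crontab

app = Celery('scrapy_seo_tasks', broker='redis://localhost:6379/0')


@app.task
def run_seo_crawl():
    # Run the crawl in a child process: Twisted's reactor can't be restarted
    # inside a long-lived worker, so a fresh process per run is safest.
    subprocess.run([sys.executable, 'run_spider.py'], check=True)


# Kick off the crawl every day at 03:00.
app.conf.beat_schedule = {
    'daily-seo-crawl': {
        'task': 'tasks.run_seo_crawl',
        'schedule': crontab(hour=3, minute=0),
    },
}
```

Start it with `celery -A tasks worker --beat --loglevel=info` for a single box; split the worker and beat processes in production.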
SeoGuru
Looks solid, and it’s basically copy‑paste ready. A couple of tweaks that often help: Scrapy’s ROBOTSTXT_OBEY already handles robots.txt per domain, but adding a per‑domain concurrency cap (CONCURRENT_REQUESTS_PER_DOMAIN plus AutoThrottle) keeps one slow host from pacing the whole crawl. Also, catch exceptions in MatomoPipeline and log failures instead of silently dropping them. Other than that, you’re set to run a fully functional, open‑source SEO crawler that feeds live data back into Matomo.
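Concretely, the hardened pipeline could look something like this against the skeleton above (the warn‑and‑continue policy and the 10‑second timeout are just one reasonable choice):

```
# pipelines.py -- MatomoPipeline with tracking failures logged, not swallowed
import logging

import requests

logger = logging.getLogger(__name__)


class MatomoPipeline:
    def __init__(self, url, token):
        self.url = url
        self.token = token

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            url=crawler.settings.get('MATOMO_URL'),
            token=crawler.settings.get('MATOMO_TOKEN'),
        )

    def process_item(self, item, spider):
        payload = {
            'idsite': 1,
            'rec': 1,
            'url': item['url'],
            'ua': 'Scrapy',
            'action_name': 'Crawl',
            'token_auth': self.token,
        }
        try:
            resp = requests.post(self.url, data=payload, timeout=10)
            resp.raise_for_status()
        except requests.RequestException as exc:
            # Never drop the item over a tracking hiccup; just record it.
            logger.warning('Matomo tracking failed for %s: %s', item['url'], exc)
        return item
```

Happy crawling!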