SeoGuru & Linux
Hey Linux, I’ve been digging into how open‑source tools can turbo‑charge SEO for projects. Have you tried any open‑source SEO solutions that keep the community involved while still hitting performance goals?
Absolutely! I’ve been playing around with a few tools that stay true to the open‑source spirit while really boosting SEO. First, there’s **Screaming Frog SEO Spider** – it’s open‑source, community‑driven, and great for crawling sites and spotting issues fast. Then there’s **Piwik PRO** (the open‑source version of Matomo) for analytics; it’s privacy‑friendly and lets you dig into search performance without giving up control. For automated content optimization, **RankMath** and **Yoast SEO** plugins are open‑source, and the community keeps adding new features for schema, sitemaps, and keyword tracking. All of these let you tweak performance, run tests, and keep the codebase under community scrutiny, so you get the best of both worlds. Let me know if you want to dig deeper into any of them!
Nice lineup, but a few corrections: Screaming Frog is actually closed‑source, a paid desktop tool whose free tier caps crawls at 500 URLs, so it doesn't really fit the open‑source brief. On analytics the naming is the other way around: Matomo (formerly Piwik) is the open‑source platform, while Piwik PRO is a separate commercial fork, so stick with plain Matomo and you're good. Rank Math and Yoast both ship open‑source cores on WordPress.org, but each now pushes premium tiers. If you want to keep everything truly community‑driven, I'd pair Matomo with an open‑source crawler; Xenu's Link Sleuth is free but closed‑source, so a small script built on Python's Scrapy is the cleaner route. Let me know which angle you'd like to explore (crawler, analytics, or content optimization) and I can dive into specifics.
Let’s dive into the crawler side first—I’ve got a feeling Scrapy will let us build a lightweight, community‑friendly engine that can still keep up with performance. We can tweak it to grab structured data, detect broken links, and feed the results straight into Matomo for analytics. Sound good?
Sounds solid. Scrapy’s async requests keep the crawl fast, and you can use middlewares to pull JSON‑LD, RDFa, or Microdata in one pass. For broken‑link detection, a simple pipeline that hits the URL and logs status codes works, and you can emit events to Matomo via the HTTP API so the dashboard updates in real time. Just make sure you throttle per domain to stay polite, and consider a scheduler like Celery if you need to run it daily. Let me know if you need a starter project skeleton or sample pipeline code.
Here’s a minimal skeleton you can copy straight into a new repo.
**Folder layout**
```
scrapy-seo/
├── scrapy_seo/
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders/
│       ├── __init__.py
│       └── seo_spider.py
├── run_spider.py
└── requirements.txt
```
**requirements.txt**
```
Scrapy
beautifulsoup4
celery
requests
```
**scrapy_seo/items.py**
```
import scrapy


class SeoItem(scrapy.Item):
    url = scrapy.Field()
    status = scrapy.Field()     # HTTP status code returned for the page
    json_ld = scrapy.Field()    # raw JSON-LD blocks
    rdfa = scrapy.Field()       # RDFa typeof values
    microdata = scrapy.Field()  # Microdata itemtype values
```
**scrapy_seo/middlewares.py**
```
from bs4 import BeautifulSoup
from scrapy.http import HtmlResponse


class JsonLdMiddleware:
    """Stash raw JSON-LD blocks on request.meta so the spider can read them via response.meta."""

    def process_response(self, request, response, spider):
        if isinstance(response, HtmlResponse):  # skip images, PDFs, etc.
            soup = BeautifulSoup(response.text, 'html.parser')
            scripts = soup.find_all('script', type='application/ld+json')
            request.meta['json_ld'] = [script.string for script in scripts]
        return response


class RdfaMiddleware:
    """Simple RDFa extraction with BeautifulSoup; replace with rdflib if you need full parsing."""

    def process_response(self, request, response, spider):
        if isinstance(response, HtmlResponse):
            soup = BeautifulSoup(response.text, 'html.parser')
            tags = soup.find_all(attrs={'typeof': True})
            request.meta['rdfa'] = [tag.get('typeof') for tag in tags]
        return response


class MicrodataMiddleware:
    """Collect Microdata itemtype URIs."""

    def process_response(self, request, response, spider):
        if isinstance(response, HtmlResponse):
            soup = BeautifulSoup(response.text, 'html.parser')
            tags = soup.find_all(attrs={'itemtype': True})
            request.meta['microdata'] = [tag.get('itemtype') for tag in tags]
        return response
```
**scrapy_seo/pipelines.py**
```
import requests


class StatusPipeline:
    """Safety net: make sure every item carries an HTTP status code."""

    def process_item(self, item, spider):
        item['status'] = item.get('status') or 200
        return item


class MatomoPipeline:
    """Report each crawled URL to the Matomo Tracking HTTP API."""

    def __init__(self, url, token):
        self.url = url
        self.token = token

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            url=crawler.settings.get('MATOMO_URL'),
            token=crawler.settings.get('MATOMO_TOKEN'),
        )

    def process_item(self, item, spider):
        payload = {
            'idsite': 1,               # Matomo site ID
            'rec': 1,                  # required for tracking requests
            'url': item['url'],
            'ua': 'Scrapy',
            'action_name': 'Crawl',
            'token_auth': self.token,
            'send_image': 0,           # ask Matomo for a 204 instead of a tracking pixel
        }
        requests.post(self.url, data=payload, timeout=10)
        return item
```
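The StatusPipeline above is only a safety net; if you also want explicit broken-link logging, a small extra pipeline like the sketch below works. It assumes error responses actually reach the spider, which means setting HTTPERROR_ALLOW_ALL = True (or an HTTPERROR_ALLOWED_CODES list) in settings.py, and you'd register it in ITEM_PIPELINES like the others.
```
# Hypothetical addition to scrapy_seo/pipelines.py: flag pages that came back broken
class BrokenLinkPipeline:
    def process_item(self, item, spider):
        if item.get('status', 200) >= 400:
            # Log broken pages; swap in a CSV or database write if you prefer
            spider.logger.warning('Broken link: %s returned %s', item['url'], item['status'])
        return item
```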
**scrapy_seo/settings.py**
```
BOT_NAME = 'scrapy_seo'

SPIDER_MODULES = ['scrapy_seo.spiders']
NEWSPIDER_MODULE = 'scrapy_seo.spiders'

ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 1  # seconds between requests to the same domain

ITEM_PIPELINES = {
    'scrapy_seo.pipelines.StatusPipeline': 300,
    'scrapy_seo.pipelines.MatomoPipeline': 400,
}

DOWNLOADER_MIDDLEWARES = {
    'scrapy_seo.middlewares.JsonLdMiddleware': 543,
    'scrapy_seo.middlewares.RdfaMiddleware': 544,
    'scrapy_seo.middlewares.MicrodataMiddleware': 545,
}

MATOMO_URL = 'https://your-matomo-instance.com/matomo.php'
MATOMO_TOKEN = 'your_api_token'
```
**scrapy_seo/spiders/seo_spider.py**
```
import scrapy

from ..items import SeoItem


class SeoSpider(scrapy.Spider):
    name = 'seo'
    allowed_domains = ['example.com']
    start_urls = ['https://www.example.com']

    def parse(self, response):
        item = SeoItem()
        item['url'] = response.url
        item['status'] = response.status
        # Structured data collected by the downloader middlewares
        item['json_ld'] = response.meta.get('json_ld', [])
        item['rdfa'] = response.meta.get('rdfa', [])
        item['microdata'] = response.meta.get('microdata', [])
        yield item

        # response.follow resolves relative URLs; the offsite middleware
        # keeps the crawl inside allowed_domains.
        for link in response.css('a::attr(href)').getall():
            yield response.follow(link, callback=self.parse)
```
**run_spider.py**
```
import os

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# No scrapy.cfg in this layout, so point Scrapy at the settings module explicitly
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'scrapy_seo.settings')
process = CrawlerProcess(get_project_settings())
process.crawl('seo')
process.start()
```
That gives you a functional, lightweight crawler that pulls structured data, logs status codes, and reports each page back to Matomo in real time. You can hook Celery up to the crawl for scheduled daily runs, and tweak DOWNLOAD_DELAY or CONCURRENT_REQUESTS_PER_DOMAIN to stay polite.
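If you do go the Celery route, here's a rough sketch of the scheduling piece; the broker URL and project path are placeholders, and it shells out to the scrapy CLI because Twisted's reactor can't be restarted inside a long-lived worker process:
```
# tasks.py (sketch) -- run the crawl daily via Celery beat
import subprocess

from celery import Celery
from celery.schedules import crontab

# Broker URL is a placeholder; any broker Celery supports will do
app = Celery('scrapy_seo_tasks', broker='redis://localhost:6379/0')


@app.task
def run_seo_crawl():
    # Launch the spider in a fresh process so the Twisted reactor starts clean each run
    subprocess.run(['scrapy', 'crawl', 'seo'], check=True, cwd='/path/to/scrapy-seo')


app.conf.beat_schedule = {
    'daily-seo-crawl': {
        'task': 'tasks.run_seo_crawl',
        'schedule': crontab(hour=3, minute=0),  # every day at 03:00
    },
}
```
Start it with `celery -A tasks worker --beat --loglevel=info`, and add `celery[redis]` to requirements.txt if you stick with the Redis broker. Happy hacking!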
Looks solid and genuinely copy‑paste ready. Two tweaks that often help: honour each domain's robots.txt Crawl-delay (ROBOTSTXT_OBEY only enforces the allow/disallow rules, so that takes a small custom downloader middleware or Scrapy's AutoThrottle) and set a per-domain concurrency limit with CONCURRENT_REQUESTS_PER_DOMAIN. Also, wrap the Matomo call in MatomoPipeline in a try/except and log failures, so a flaky analytics endpoint can't quietly swallow data or kill the crawl. Other than that, you're set to run a fully functional, open‑source SEO crawler that feeds live data back into Matomo.
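Something like this would cover the Matomo part; `SafeMatomoPipeline` is just a name I'm making up here, and you'd point ITEM_PIPELINES at it instead of the original class:
```
# Sketch only: subclass the skeleton's pipeline and log tracking failures
import requests

from scrapy_seo.pipelines import MatomoPipeline


class SafeMatomoPipeline(MatomoPipeline):  # hypothetical name
    def process_item(self, item, spider):
        payload = {
            'idsite': 1,
            'rec': 1,
            'url': item['url'],
            'ua': 'Scrapy',
            'action_name': 'Crawl',
            'token_auth': self.token,
            'send_image': 0,
        }
        try:
            # raise_for_status() turns HTTP errors into exceptions we can log
            requests.post(self.url, data=payload, timeout=10).raise_for_status()
        except requests.RequestException as exc:
            spider.logger.warning('Matomo tracking failed for %s: %s', item['url'], exc)
        return item
```
Subclassing keeps the skeleton untouched, so the tweak is easy to drop or revisit later. Happy crawling!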