Network Monitoring for ISPs | Zabbix, Grafana

What network monitoring is

Network monitoring is the continuous collection of metrics (CPU, memory, link utilization, BGP session state, latency), logs and flow records (Syslog, NetFlow), and state (UP/DOWN, threshold breaches) from all infrastructure equipment, with automatic alerting when something deviates from its normal pattern.

Without structured monitoring, the ISP is stuck with a purely reactive NOC and loses customers because they noticed the problem before the NOC did. Ad hoc SNMP polling works for small ISPs; to grow, you need Zabbix with device-specific templates, Grafana dashboards, and alerts delivered to the channel the team actually uses (Telegram, Discord).

What RASYS does with monitoring

Zabbix with proprietary templates, not the generic factory ones: templates for Huawei NE, Juniper MX, MikroTik, and your specific GPON OLT vendor, with the items that matter (BGP sessions, OSPF neighbors, ONU optical power).
LibreNMS for inventory and redundant metrics: SNMP auto-discovery, automatic topology, traffic map. We keep a centralized instance at RASYS collecting in parallel, so if the ISP's infrastructure goes down we can still see the history and help diagnose the outage.
Grafana dashboards: an operational view (NOC) and an executive view (management). Data sources: Zabbix, InfluxDB, Prometheus.
NetFlow / sFlow: traffic analysis per application, top talkers, DDoS detection via flow-rate spikes. Tools: nfdump, ELK, ntopng.
Structured alerting: Telegram/Slack/email with severity, deduplication, and escalation if no one acknowledges. No useless alert flooding.
Centralized Syslog: rsyslog/syslog-ng for retention and fast search. Incident forensics in minutes, not hours.
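
To make the structured alerting idea concrete, here is a minimal Python sketch of deduplication and escalation. It is an illustration under stated assumptions, not how Zabbix or any RASYS template implements it: the class name `AlertPipeline`, the 5-minute deduplication window, and the 15-minute escalation delay are all hypothetical values chosen for the example.

```python
class AlertPipeline:
    """Toy pipeline: suppress duplicate alerts and escalate unacknowledged ones."""

    def __init__(self, dedup_window=300.0, escalate_after=900.0):
        self.dedup_window = dedup_window      # seconds; assumed value
        self.escalate_after = escalate_after  # seconds; assumed value
        self.open_alerts = {}                 # (host, trigger) -> state dict

    def ingest(self, host, trigger, severity, now):
        """Return True if a notification should go out, False if it is a duplicate."""
        key = (host, trigger)
        state = self.open_alerts.get(key)
        if state is None:
            # First time we see this problem: record it and notify.
            self.open_alerts[key] = {
                "severity": severity,
                "first_seen": now,
                "last_sent": now,
                "acked": False,
                "escalated": False,
            }
            return True
        if now - state["last_sent"] < self.dedup_window:
            return False  # same problem reported again inside the window
        state["last_sent"] = now
        return True

    def acknowledge(self, host, trigger):
        """Mark an open alert as seen by a human so it is not escalated."""
        self.open_alerts[(host, trigger)]["acked"] = True

    def due_for_escalation(self, now):
        """Return alerts nobody acknowledged within escalate_after, once each."""
        due = []
        for key, state in self.open_alerts.items():
            overdue = now - state["first_seen"] >= self.escalate_after
            if overdue and not state["acked"] and not state["escalated"]:
                state["escalated"] = True
                due.append(key)
        return due
```

With the default values, a second "OLT unreachable" event for the same host 60 seconds after the first is suppressed, and the alert is returned by `due_for_escalation` once 900 seconds pass without an acknowledgement. In production this logic would normally live in the monitoring platform's own actions and escalation steps rather than in custom code.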

Tools we work with

Zabbix 6/7, LibreNMS, Grafana, InfluxDB, Prometheus, Elastic/Kibana, ntopng, nfdump, rsyslog, syslog-ng.

When it makes sense to talk to us

You rely on basic SNMP alone and miss critical events; you run Zabbix but only with generic templates; a customer reported an outage before you saw it; or you need baseline data to justify link upgrades.

Frequently asked questions

Zabbix or LibreNMS — which to choose for an ISP?

LibreNMS is faster to get running thanks to SNMP auto-discovery, and it is well suited to network device inventory. Zabbix is more flexible for custom alerts, complex logic, and integration with internal systems. Many ISPs run both: LibreNMS for inventory and graphs, Zabbix for critical alerting.

How much does it cost to monitor 1,000 devices with Zabbix?

Zabbix itself is free (open source). The cost is infrastructure: one VM with 4 vCPU / 8 GB RAM / 100 GB SSD handles 1,000 devices at 5-minute polling without issue. Running the MySQL/PostgreSQL database on a separate server makes later scaling easier. The real cost is ongoing operation (templates, alerts, tuning), which is also where most of the value comes from.
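
The sizing above can be sanity-checked with simple arithmetic. The Python sketch below estimates the average rate of new values per second and the disk taken by raw history. The inputs are assumptions, not figures from this page: 50 monitored items per device, 30 days of history retention, and roughly 90 bytes per stored value (a rough per-value figure that varies by database engine). Trend data and events are ignored.

```python
SECONDS_PER_DAY = 86_400


def new_values_per_second(devices, items_per_device, interval_s):
    """Average rate of values the server has to process."""
    return devices * items_per_device / interval_s


def history_bytes(devices, items_per_device, interval_s,
                  retention_days, bytes_per_value):
    """Approximate disk used by raw history over the retention period."""
    values_per_day = devices * items_per_device * (SECONDS_PER_DAY / interval_s)
    return values_per_day * retention_days * bytes_per_value


if __name__ == "__main__":
    # Assumed inputs: 50 items per device, 30 days of history, ~90 bytes per value.
    nvps = new_values_per_second(1000, 50, 300)
    disk = history_bytes(1000, 50, 300, retention_days=30, bytes_per_value=90)
    print(f"{nvps:.0f} values/s, {disk / 1e9:.1f} GB of raw history")
```

Under these assumptions, 1,000 devices at 5-minute polling come to roughly 167 values per second and about 39 GB of raw history for 30 days. Doubling the items per device or the retention period doubles the disk figure, so retention is usually the parameter to tune first.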

Is NetFlow worth it with MikroTik border routers?

Yes. MikroTik exports NetFlow v9 natively. A collector (nfsen, nfdump, Akvorado) runs on a small VM. NetFlow shows who is consuming bandwidth, helps spot DDoS traffic before it causes impact, and informs peering decisions. Implementation cost is low; operational value is high.
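
As a rough illustration of what a collector does with flow data, the Python sketch below computes top talkers and flags a flow-rate spike. It assumes flow records have already been exported and parsed into dictionaries; the field names `src`, `dst`, and `bytes`, the function names, and the 5x-median threshold are hypothetical choices for the example, not the output format or detection logic of nfdump, ntopng, or Akvorado.

```python
from collections import Counter
from statistics import median


def top_talkers(flows, n=3):
    """Sum bytes per source address and return the n largest as (src, bytes)."""
    totals = Counter()
    for flow in flows:
        totals[flow["src"]] += flow["bytes"]
    return totals.most_common(n)


def is_rate_spike(recent_rates, current_rate, factor=5.0):
    """True if current_rate exceeds factor times the median of recent_rates."""
    if not recent_rates:
        return False  # no baseline yet, so nothing to compare against
    return current_rate > factor * median(recent_rates)


if __name__ == "__main__":
    # Example records using documentation address ranges.
    flows = [
        {"src": "192.0.2.10", "dst": "198.51.100.1", "bytes": 500},
        {"src": "192.0.2.20", "dst": "198.51.100.1", "bytes": 1500},
        {"src": "192.0.2.10", "dst": "198.51.100.2", "bytes": 2000},
    ]
    print(top_talkers(flows, n=2))
    print(is_rate_spike([100, 120, 110], current_rate=1000))
```

The median is used as the baseline because a single earlier burst does not shift it much. Real DDoS detection usually looks at flows per second toward a single destination and at protocol mix as well, so treat this as a starting point only.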

Which alerts should wake up the on-call engineer at night?

Only those indicating customer impact: B-RAS down, OLT down, transit link down with no backup, BGP session loss with the only upstream, RADIUS down, own authoritative DNS offline. High CPU, a disk filling up, or a link with CRC errors can wait for business hours. A night alert that requires no immediate action destroys the on-call rotation.
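
The policy above can be written down as an explicit routing table so that it is reviewed rather than rediscovered at 3 a.m. The Python sketch below is one possible encoding. The event identifiers and the `route_alert` function are hypothetical names invented for the example, and sending unknown events to a "review" queue instead of paging is an assumption, not a rule stated here.

```python
# Events that indicate customer impact and justify waking someone up.
PAGE_IMMEDIATELY = frozenset({
    "bras_down",
    "olt_down",
    "transit_down_no_backup",
    "bgp_down_only_upstream",
    "radius_down",
    "authoritative_dns_down",
})

# Events that matter but can wait for business hours.
WAIT_FOR_BUSINESS_HOURS = frozenset({
    "high_cpu",
    "disk_filling_up",
    "link_crc_errors",
})


def route_alert(event):
    """Return 'page', 'ticket', or 'review' for a given event identifier."""
    if event in PAGE_IMMEDIATELY:
        return "page"
    if event in WAIT_FOR_BUSINESS_HOURS:
        return "ticket"
    # Assumed default: unclassified events are reviewed, not paged.
    return "review"
```

Keeping the two sets in one place makes it easy to audit the paging list after each incident and to move an event from one set to the other when the team's experience says so.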