Creating a New Infra Platform
This documents how to deploy the platform services for an infra. These include:
- MQTT Broker
- Opsgenie Forwarder
- SMTP Mailer
- Auto Alerter
- NoData Executor
- Incident Tracker
- Meld Continuous Query Generator
Grafana, the Dashboard Generator, and the Infra Api have been moved to the UI Deployment and are no longer part of an infra deployment.
Infra deployments follow the 'infra' convention.
Infra Naming
You can, in principle, name your infras whatever you like.
If they are dev or test infras, though, you should consider appending `dev` or `test` to the name.
See Ansible Deployments for an explanation.
Influx Database
The influx database is deployed separately. Because you can add a database to an existing influx instance, you will probably not need to create a new instance at all. Start by creating the database in the influx instance.
Hosts File
This is the basic template for an infra hosts file:
; Physical Devices
[<Hostname>]
<hostname>.dgcsdev.com
[MQTT_BROKER]
mqtt.<infra_id>.<...>.smartermicrogrid.com
; Characterisations
[ec2:children]
MQTT_BROKER
<Hostname>
; Installation
[infra:children]
ec2
; Services
[mqtt_server:children]
MQTT_BROKER
[influxdb_server:children]
<Hostname>
[auto_alerter_server:children]
[incident_tracker_server:children]
[nodata_executor_server:children]
<Hostname>
[opsgenie_forwarder_server:children]
<Hostname>
[smtp_mailer_server:children]
<Hostname>
[meld_continuous_query_generator_server:children]
<Hostname>
Section | Intent |
---|---|
Physical Devices | Declares the servers |
Characterisations | Creates the conventional |
Installation | Creates the required |
Services | Lists the services to deploy |
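To double-check the Services section of a hosts file before deploying, the groups can be listed with a few lines of Python. This is only a sketch: it assumes the hosts file is plain INI as in the template above, and the function name `service_groups` is made up for illustration. Ansible's own `ansible-inventory -i <hosts file> --list` remains the authoritative way to see how an inventory is parsed.

```python
import configparser


def service_groups(inventory_text):
    """Map each '<service>_server' group to the child groups listed under it."""
    # Host and group lines have no value, so allow bare keys; only '=' is
    # treated as a delimiter so that names containing ':' are not split.
    parser = configparser.ConfigParser(allow_no_value=True, delimiters=("=",))
    parser.optionxform = str  # keep names such as MQTT_BROKER case-sensitive
    parser.read_string(inventory_text)

    groups = {}
    for section in parser.sections():
        name, _, kind = section.partition(":")
        if kind == "children" and name.endswith("_server"):
            groups[name] = list(parser[section].keys())
    return groups
```

A group that maps to an empty list has no server assigned to it in the hosts file, which makes it easy to see at a glance where each service is declared.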
MQTT Broker
The MQTT Broker is not deployed by the standard infra `services.yml` playbook.
It is typically built separately.
See Deploy an MQTT Broker for details.
When deciding on the broker's DNS, you should use a grouping level below the TLD, as described in the linked docs above. For infras, although I wouldn't go so far as to call this a convention, you could do worse than use `mqtt.<infra_id>.infras.smartermicrogrid.com`.
InfluxDB Server
This reference exists to declare where the `influx_writer` service should deploy.
Conventionally, the `influx_writer` service deploys to the same server that the influx instance lives on.
The reasons for this are mostly historical, stemming from efforts to stabilise this service in the face of previous bugs and fragility.
For example, Influx itself does not run in a Docker container, so the `influx_writer` container uses 'host' networking.
Also, we wanted to take network issues off the table as a possible culprit.
Group Vars
Here is an example of a `group_vars/infra` file:
deploy_level: prod
config_src: <infra_id>
infra: <infra_id>
infra_public: <infra_id>
hosts_dir: <infra_id>
influx_database: prod_<infra_id>
influx_backup_minute: 18
influx_writer_report_interval: 5000
auto_alerter_nodeduper: true
incident_tracker_bus_schedule: '0 6,18 * * *'
opsgenie_summary_schedule: '0 8,12,16,20 * * *'
opsgenie_summary_recipients: '...'
meld_continuous_query_generator:
  default_bucket_size: 15m
  request_timeout: 2000
incident_tracker_name: incident_tracker
incident_tracker_agent_id: incident_tracker
default_inspect_port: 12340
influx_writer_inspect_port: 12340
auto_alerter_inspect_port: 12343
nodata_executor_inspect_port: 12344
incident_tracker_inspect_port: 12345
opsgenie_forwarder_inspect_port: 12346
smtp_mailer_inspect_port: 12347
meld_continuous_query_generator_inspect_port: 12349
influx_writer_healthcheck_port: 11230
auto_alerter_healthcheck_port: 11233
nodata_executor_healthcheck_port: 11234
incident_tracker_healthcheck_port: 11235
opsgenie_forwarder_healthcheck_port: 11236
smtp_mailer_healthcheck_port: 11237
meld_continuous_query_generator_healthcheck_port: 11239
opsgenie_forwarder_version: 7.1.4
smtp_mailer_version: 4.0.10
influx_writer_version: 15.1.7
auto_alerter_version: 6.0.9
incident_tracker_version: 11.1.5
nodata_executor_version: 6.0.9
meld_continuous_query_generator_version: 0.7.2
docker_memory: 150M
incident_tracker_docker_memory: 200M
opsgenie_forwarder_docker_memory: 200M
mqtt_broker_host: mqtt.<infra_id>.<...>.smartermicrogrid.com
mqtt_broker_port: 202##
mqtt_username: xyz
mqtt_password: ***********
mqtt_config: |
  connection my_other_infra
  ...
Most of these variables are driven by the needs of the specific service roles or playbooks, and are better documented there. However, the following vars are generically required:
Var | Purpose | Value/Example |
---|---|---|
`deploy_level` | Mostly used as a grouping level in names, such as directories or DNS. | |
`config_src` | Very historical. This used to be used to publish core config messages directly to mqtt. This is rarely used now, if at all. | Conventionally, |
`infra` | This infra id itself | |
`infra_public` | Replaces the infra id for public dns entries (i.e. on smartermicrogrid.com) | |
`hosts_dir` | Manual reference to the hosts dir, used by some configuration scripts | |
`influx_database` | Explicitly names the influx database | Conventionally |
`mqtt_broker_host`, `mqtt_broker_port` | Required details for the mqtt broker. The host should not specify the protocol. | Conventionally, the host should look like |
`docker_memory` | The default memory limit for Docker containers. This can be overridden on a per-service basis | Most (nodejs) platform services which cache signal properties will want at leas |
`mqtt_config` | Optional text block to add to the mqtt broker configuration. This can be used to create bridges to other brokers for example | See the official Mosquitto documentation |
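A quick sanity check of the group vars before running a deploy might look like the following sketch. The list of required names is an assumption taken from the example file above, not an authoritative list, and `check_group_vars` is a hypothetical helper rather than part of the deployment scripts. The only rule it encodes beyond presence is that the mqtt broker host should not specify the protocol.

```python
# Assumed list of generically required vars, based on the example file above.
REQUIRED_VARS = (
    "deploy_level",
    "config_src",
    "infra",
    "infra_public",
    "hosts_dir",
    "influx_database",
    "mqtt_broker_host",
    "mqtt_broker_port",
    "docker_memory",
)


def check_group_vars(group_vars):
    """Return a list of human-readable problems found in a group_vars mapping."""
    problems = [
        f"missing required var: {name}"
        for name in REQUIRED_VARS
        if name not in group_vars
    ]
    # The broker host must be a bare hostname, e.g. no leading 'mqtt://'.
    host = str(group_vars.get("mqtt_broker_host", ""))
    if "://" in host:
        problems.append(f"mqtt_broker_host should not include a protocol: {host}")
    return problems
```

The function takes a plain dictionary, so the file can be loaded with whatever YAML parser is to hand (for example `yaml.safe_load` from PyYAML) and passed in directly. An empty result means no problems were found by these particular checks.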
Running the Deploy
To run the deploy, use the scripts as outlined in Ansible Deployments.
Validating the Deploy
Unfortunately, at the moment there isn’t really a handy way of validating a deploy in a single step. One just needs to check all the containers are up and running based on their logs:
$ docker logs <infra_id>.<deploy_level>.<service_id>
A standard nodejs service has started if its log contains lines like the following:
2020-11-24T17:28:37.155299980Z 2020-11-24T17:28:37.155Z INFO MQTT Connecting to client at mqtt://mqtt.customers.infras.smartermicrogrid.com:20288
2020-11-24T17:28:37.160251196Z 2020-11-24T17:28:37.160Z INFO rss: 69.04 MB heapTotal: 59.4 MB heapUsed: 30.09 MB external: 0.56 MB. Up 396ms
2020-11-24T17:28:37.160351114Z 2020-11-24T17:28:37.160Z INFO Runner started
2020-11-24T17:28:37.160482898Z 2020-11-24T17:28:37.160Z INFO HEALTHCHECK: SERVICE_LIVE
Remember to ssh into all the servers. The influx_writer, for example, is typically not on the same server as any of the other services.
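Checking each container by hand gets tedious, so the log check can be scripted per server. The sketch below assumes the container naming shown above (`<infra_id>.<deploy_level>.<service_id>`) and treats the `HEALTHCHECK: SERVICE_LIVE` line as the sign of a successful start. The function names are illustrative only, and the script has to be run on each server in turn.

```python
import subprocess

LIVE_MARKER = "HEALTHCHECK: SERVICE_LIVE"


def container_name(infra_id, deploy_level, service_id):
    """Build the container name in the <infra_id>.<deploy_level>.<service_id> form."""
    return f"{infra_id}.{deploy_level}.{service_id}"


def is_live(log_text):
    """True if any log line reports that the service reached its live state."""
    return any(LIVE_MARKER in line for line in log_text.splitlines())


def fetch_logs(name):
    """Return the combined stdout and stderr of `docker logs <name>`."""
    result = subprocess.run(
        ["docker", "logs", name], capture_output=True, text=True, check=False
    )
    return result.stdout + result.stderr


def report(infra_id, deploy_level, service_ids):
    """Map each service id to whether its container log shows the live marker."""
    return {
        sid: is_live(fetch_logs(container_name(infra_id, deploy_level, sid)))
        for sid in service_ids
    }
```

Note that the marker only shows that a service reached its live state at some point in the log. It does not prove the container is still running, so it is worth checking `docker ps` as well.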