Total Pageviews

Wednesday, December 28, 2016

Deployment of Telegraf Agent on multiple Windows Servers – deploy Telegraf Agent with PowerShell

During the development of performance metrics visualization projects in real-time, we have made a few changes to deploy the performance metric collector agent (Telegraf.exe) on multiple Windows Servers. In the development and testing phase, we often approach new ideas and metrics requirements, so the configuration of Telegraf agent changes often. To reflect those changes, the configuration file needs to be re-deployed again and Agent service needs a restart.

Another challenge is that we need to deploy the agent onto multiple servers without performing RDP to the remote server. Every time the Telegraf configuration is updated for any reason, the Telegraf configuration file needs to be replaced with changes on the remote server too. The Telegraf service needs to be restarted with the new configuration file for those changes to take effect.

So, we developed a PowerShell based deployment Script to perform the entire task on a remote Windows Server regardless of the number of Servers.

Suitability of this deployment Script:
There are many applications of this automated deployment script on a remote server.
1.      New deployment of the Telegraf agent.
2.      Upgrading the Telegraf agent.
3.      Restarting/Stopping/removing the Telegraf agent.
4.      Updating “telegraf.conf” file.
5.      Deploying an updated version of the Telegraf agent.
6.      Changing the Telegraf Agent’s data posting URL, and so on.

The provided script can be changed and enhanced according to your specific needs.

The dashboard search requirements:
Following is the dashboard we have developed to monitor more than 300+ production SQL Servers in real-time. To identify a server, we have a filter mechanism on the Grafana UI so that a user/viewer can bring a specific server into viewing. In Grafana, it is called “Grafana Templating”. To do this, we created three templating variables which are “Server”, “AppEnv” and “App”.

Figure #1: A SQL Server Dashboard:



Figure # 2: Another SQL Server Dashboard:

The section “global_tags” of the configuration file “telegraf.conf” of Telegraf Agent “Telegraf.exe” needs to be changed for each server and needs to be placed in the remote server so the Telegraf Agent can read the specific configuration associated with this server and send the data to InfluxDB.

The telegraf “telegraf.conf” file:
The Telegraf “telegraf.conf” file has a specific section known as [global_tags] which contains Key = “Value” format data. In our case, the “telegraf.conf” file has been renamed as “telegraf.baseconf” with the following dummy tag name and from this template configuration file we will derive the required “telegraf.conf” for deployment.

[global_tags]
Appenv = "AppEnvName"
App = "AppName"

The PowerShell Script will read each key and will then change according to the passing server parameters which comes from a predefined CSV file. The components of the CSV file looks as follows:

"Server","AppEnv","App"
"FIN001", "Finance", "Budget Reconciliation Application"
"FIN002", "Finance", "Payment Disbursement"
"HR0101", "HR", "Employee Performance"
"HR0102\SQL2014", "HR", "Vendor Management"
"HR0103", "HR", "CRM portal"
"HR0201\SHARE2013","HR", "HR SharePoint"
"HR0201","HR", "Insight the Organization"

PowerShell Script:
The powerShell scripts “TelegrafDeployV3.ps1” has two functions:

Correct-SQLPerf
Deploy-Telegraf

Script download Link:

“Correct-SQLPerf”:
The “Correct-SQLPerf” is to correct any missing performance counters on a SQL Server instance. It has only one parameter which is “$SQLServer”. The value can be default or named instance of SQL Server. This function is called automatically by the main function internally. You can also execute it separately.


Function Correct-SQLPerf
{
       [CmdletBinding()]
       param (
              [Parameter(Mandatory = $True)][string]$SQLServer
       )


“Deploy-Telegraf”:
The main function of the PowerShell script is “Deploy-Telegraf” which requires three parameters.

Function Deploy-Telegraf
{
       [CmdletBinding()]
       param (
              [Parameter(Mandatory = $True)][string]$SourceFolder,
              [Parameter(Mandatory = $True)][string]$DeployType,
# N = New deployment, U = update existing, R= remove everything
              [Parameter(Mandatory = $True)][string[]]$SeverList
       )
        

$SourceFolder is the local folder from where the telegraf will be deployed.
$DeployType is the option for deployment type.
$SeverList Is the list of Server. The format of this column is “WindowsServerName\SQLInstanceName”

How to deploy?
Make sure that the WInRM Service is running on the Remote server. To deploy Telegraf onto one or multiple remote servers, do the following:

1.      The Script will create a folder called “C:\PerfMonSQL” on a remote computer.
2.      Create a folder on your local desktop, such as “D:\TelefragDeploy”.
3.      Copy the “telegraf.exe” into this folder.
4.      Create a text file “serverList.txt” (CSV) with a list of servers, application environment and application name. An environment can have multiple applications and an application may have one or more servers.
5.      Copy the telegraf “telegraf.conf” as “telegraf.baseconf” into this folder.
6.      Edit the “telegraf.baseconf” and add the tags as appenv=”AppEnvName” and app=”AppName”.

The “D:\TelegrafDeploy” folder should contain the following items:







Execute the Deployment Script:
To execute the deployment script, open PowerShell ISE with administrator privileges. Open the PowerShell Script “DeployTelegraf.ps1”. Change the three parameters according to your needs and settings.

Example of Running the Deployment Scripts:

To deploy Telegraf Agent on a Remote Windows Server:

# Deploy telegraf for the first time on a (or multiple) remote server
Deploy-Telegraf -DeployType "N" `
                           -SourceFolder "D:\TelegrafDeploy\" `
                           -SeverList "D:\TelegrafDeploy\ServerList.txt"

# Remove telegraf from a (or multiple) remote server
Deploy-Telegraf -DeployType "R" `
                           -SourceFolder "D:\TelegrafDeploy\" `
                           -SeverList "D:\TelegrafDeploy\ServerList.txt"

# Upgrade telegraf on a (or multiple) remote server
Deploy-Telegraf -DeployType "U" `
                           -SourceFolder "D:\TelegrafDeploy\" `
                           -SeverList "D:\TelegrafDeploy\ServerList.txt"


To correct missing SQL PerfMon counters:

#Correct SQL Server missing Perfmon counter
Correct-SQLPerf -SQLServer "SHB_MAIN\SQL2016" # this for named instance of SQL Server
Correct-SQLPerf -SQLServer "SHB_MAIN" # this for default instance of SQL Server


Read and Learn More:
Telegraf and InfluxDB: https://www.influxdata.com/

Golden Gate Latency monitoring with PowerShell, influxDB and Grafana – PowerShell to InfluxDB





Sunday, December 25, 2016

Golden Gate Latency monitoring with PowerShell, influxDB and Grafana – PowerShell to InfluxDB

The SQL Server which is installed on the Windows Server is receiving data from “Tandem Non-Stop System” by using Oracle Golden Gate replication. This downstream data feed is critical to smooth business functionality. So, it needs to be monitored proactively.

Although monitoring this environment is critical, however, by executing various Golden Gate commands (STATUS ALL, INFO ALL, LAG) in the GGSCI shell manually to understand Latency is very annoying and tedious. Therefore, the automated process comes into the picture to monitor performance in real-time for our die-hard operational team.

Two REPLICATs we are interested in:
We are interested in monitoring two of the REPLICATs’ performance constantly as they are critical to our healthcare business activities along with some SQL Server and Windows performance counters.

The following dashboard has been developed with the open source technology to visualize performance data in real-time so one can glue his/her eyes 24/7 on the dashboard. The dashboard process (the Grafana) can also trigger alerts/notifications if a particular monitoring object exceeds the threshold for a substantial period of time.

Performance metrics dashboard in Real-time:


Technology Stack Used:
The PowerShell script streams data in every 10 seconds interval and sends it to influxDB using the REST API method. The script runs by the windows Task scheduler continuously.

InfluxDB is a time-series based database technology which is high performance and can be massively scaled up. It is written in GO language and is open source.

Grafana is an open source metric analytics & visualization suite. It is most commonly used for visualizing time series data for infrastructure and application analytics.

PowerShell Script and Dashboard download:

PowerShell Script:
Following is the complete PowerShell Script which is feeding data into influxDB. This script collects two Golden Gate REPLICATs’ latency, as well as couples SQL Server and Windows Performance metrics every 10 seconds.  Where, in the script, the server represents the following:

SRV0009 – the Windows server where the Golden Gate process is running. The PowerShell Script is also hosted here.
STV1075 – where the InfluxDB and the Grafana have been installed and are running.

<#####################################################################################
       .NOTES
       ===========================================================================
        Created with:      PowerShell ISE
        Created on:        12/25/2016 9:39 AM
        Created by:        Sarjen Haque
       ===========================================================================
       .DESCRIPTION
              function to collect and parse Golden Gate status data from
              a local or remote server installed on Windows
######################################################################################>

function Get-RealTimeStatus()
{
       $result = winrs -r:SRV0009 "CMD /c echo status all | H:\OracleGoldenGate\ggsci"
       $raw = $result -match 'REPLICAT'
      
       [StringSplitOptions]$Options = "RemoveEmptyEntries"
      
       # loop through each line and break
       foreach ($line in $raw)
       {
              $wrd = $line.Split(" ", $Options)
              $lg = $wrd[3].Split(":")
              $tm = $wrd[4].Split(":")
              $result = @{
                     "Program" = $wrd[0];
                     "Status" = $wrd[1];
                     "Name" = $wrd[2];
                     "Lag" = $wrd[3];
                     "LagSec" = [int]$lg[0] *(60 * 60) + [int]$lg[1] *60 + [int]$lg[2];
                     "ChkPt" = $wrd[4];
                     "ChkPtSec" = [int]$tm[0]*(60 * 60) + [int]$tm[1] *60 + [int]$tm[2];
              }
             
              $obj = New-Object -TypeName PSObject -Property $result
              Write-Output $obj
       }
}

#######################################################################################
# FUnction to collect and format Golden gate and Windows/SQL Server PerfMon data
#######################################################################################

Function Get-ggPerf
{
       # Windows and SQL Server Instance
       $WinServer = 'SRV0009'
       $SQLServer = 'SQLServer' # use for default instance
       # $SQLServer = 'MSSQL$SQL2014' # use for a named instance. Replace SQL2014 with the isntance name
      
       # Pull Windows and SQL stats
       [System.Collections.ArrayList]$counters = @()
      
       # Windows reelated metrics
       $counters.Add("\\$WinServer\logicaldisk(d:)\avg. disk sec/transfer") | out-null #0
       $counters.Add("\\$WinServer\logicaldisk(e:)\avg. disk sec/transfer") | out-null #1
       $counters.Add("\\$WinServer\logicaldisk(h:)\avg. disk sec/transfer") | out-null #2
       $counters.Add("\\$WinServer\logicaldisk(l:)\avg. disk sec/transfer") | out-null #3
       $counters.Add("\\$WinServer\logicaldisk(_total)\avg. disk sec/transfer") | out-null #4
       $counters.Add("\\$WinServer\Memory\Available MBytes") | out-null #5
       $counters.Add("\\$WinServer\Processor(_total)\% Privileged Time") | out-null #6
       $counters.Add("\\$WinServer\Processor(_total)\% User Time") | out-null #7
       $counters.Add("\\$WinServer\Processor(_total)\% Processor Time") | out-null #8
      
       #SQL Server related metrics      
       $counters.Add("\\$WinServer\" + $SQLServer + ":Resource Pool Stats(default)\CPU usage %") | out-null #9
       $counters.Add("\\$WinServer\" + $SQLServer + ":General Statistics\Processes blocked") | out-null #10
       $counters.Add("\\$WinServer\" + $SQLServer + ":Buffer Manager\page life expectancy") | out-null #11
       $counters.Add("\\$WinServer\" + $SQLServer + ":General Statistics\User Connections") | out-null #12
       $counters.Add("\\$WinServer\" + $SQLServer + ":SQL Statistics\Batch Requests/Sec") | out-null #13
       $counters.Add("\\$WinServer\" + $SQLServer + ":Memory Manager\total server memory (kb)") | out-null #14
       $countersall = (Get-Counter -Counter $Counters -SampleInterval 1).countersamples.CookedValue
      
       # Format SQL data into Influxdb Line Protocol
       $postParams = "sql_perf,host=" + $WinServer + " avgtransfer_d=" + $countersall[0] + ",avgtransfer_e=" + $countersall[1] `
                     + ",avgtransfer_h=" + $countersall[2] + ",avgtransfer_l=" + $countersall[3] `
                     + ",avgtransfer_total=" + $countersall[4] + ",available_mbytes=" + $countersall[5] `
                     + ",pct_priviledge_time=" + $countersall[6] + ",pct_user_time=" + $countersall[7] `
                     + ",pct_processor_time=" + $countersall[8] + ",sql_cpu_time=" + $countersall[9] `
                     + ",processess_blocked=" + $countersall[10] + ",page_life_expectancy=" + $countersall[11] `
                     + ",user_connections=" + $countersall[12] + ",batch_resc_per_sec=" + $countersall[13] `
                     + ",total_server_memory_kb=" + $countersall[14]
      
       # get the output from the Golden Gate function
       $result = Get-RealTimeStatus | select Program, Status, Name, Lag, LagSec, ChkPt, ChkptSec
       $r1 = $result | ?{ @("RP9CRIT2") -contains $_.Name } | select -First 1
       $r2 = $result | ?{ @("RP9MAIN2") -contains $_.Name } | select -First 1
      
       if ($r1.Status -eq 'RUNNING') { $r1status = 1 }
              else { $r1status = 0 }
      
       if ($r2.Status -eq 'RUNNING') { $r2status = 1 }
              else { $r2status = 0 }
      
       # Format GG data into influxdb Line Protocol
       $postParams_r1 = "gg_RP9CRIT2,host=" + $WinServer + " LagSec=" + $r1.LagSec + ",ChkptSec=" + $r1.ChkptSec + ",Status=" + $r1status
       $postParams_r2 = "gg_RP9MAIN2,host=" + $WinServer + " LagSec=" + $r2.LagSec + ",ChkptSec=" + $r2.ChkptSec + ",Status=" + $r2Status
      
       # Post to influxdb API
       $uri = 'http://SRV0175:8086/write?db=ggperfdb'
      
       # Create web connection with influxdb URL
       $ServicePoint = [System.Net.ServicePointManager]::FindServicePoint($uri)
       $ServicePoint.ConnectionLimit = 3
      
       # Use influxdb authentication to logon
       $authheader = "Basic " + ([Convert]::ToBase64String([System.Text.encoding]::ASCII.GetBytes("admin:admin")))
      
       # Post data to influxDB
       try
       {
              Invoke-RestMethod -Headers @{ Authorization = $authheader } -Uri $uri -Method POST -Body $postParams -DisableKeepAlive
              Invoke-RestMethod -Headers @{ Authorization = $authheader } -Uri $uri -Method POST -Body $postParams_r1 -DisableKeepAlive
              Invoke-RestMethod -Headers @{ Authorization = $authheader } -Uri $uri -Method POST -Body $postParams_r2 -DisableKeepAlive
             
              #Close web connection
              $ServicePoint.CloseConnectionGroup("") | Out-Null          
       }
      
       catch
       {     
              throw 'Could not POST to InfluxDB API endpoint'            
       }
}

#######################################################################################
# Run the collector function in a loop, continuously 
#######################################################################################

while ($true)
{
       # execute the collector
       Get-ggPerf
       # pause for X seconds
       start-sleep -Seconds 8
}

# end of script 

References:
PowerShell Way: Automating Golden Gate Replication Monitoring on Windows
http://sqltouch.blogspot.ca/2014/08/powershell-way-automating-golden-gate.html


Grafana: http://grafana.org/