Ep 029: Problem Unknown: Log Lines
► Play EpisodeNate is dropped in the middle of a huge log file and hunts for the source of the errors.
- We dig into the world of DonutGram.
- "We are the devops."
- Problem: we start seeing errors in the log.
- "The kind of error that makes total sense to whoever wrote the log line."
- (04:10) We want to use Clojure to characterize the errors.
- Why not grep?
- Well...we grep out the lines, count them, and we get 10,351 times. Is that a lot? Hard to say. We need more info.
- (05:50) Developers think it's safe to ignore, but it's bugging us.
- We want to come up with an incident count by user.
- "Important thing we do a lot in the devops area: we see data and then we have to figure out, what does it mean?"
- "What story is the data telling us?"
- The "forensic data" versus "operating data" (Ep 022, 027)
- "With forensic data, you're trying to save the state of the imprint of the known universe over time."
- (07:40) The log file is the most common forensic data for applications.
- To characterize, we need to manipulate the data at a higher level than grep.
- First goal: make the data useful in Clojure
- Get a nice sample from the production log, put it in a file, load up the REPL, and start writing code to parse it.
- Need to parse out the data from the log file and turn it into something more structured.
- (12:05) Let's make a function,
parse-line
that separates out all the common elements for each line:- timestamp
- log level
- thread name
- package name
- freeform "message"
- "A message has arrived from the developers, lovingly bundled in a long log line."
- (13:10) We can parse every line generically, but we need to parse the message and make sense of that.
- "We want to lift up the unstructured data into structured data that we can map, filter, and reduce on."
- We will have different kinds of log lines, so we need to detect them and parse appropriately.
- We want to amplify the map to include the details for the specific kind of log line.
- We'll use a
kind
field to identify which kind of log line it is. - There are two steps:
- recognize which kind of log line the "message" is for
- parse the data out of that message
- "Maybe we'll have a podcast that's constantly about
constantly
. It's not justjuxt
." - (18:45) How do we do this concisely?
- Let's use
cond
:- flatten out all the cases
- code to detect kind (the condition)
- code to parse the details (the expression)
- Can use
includes?
in the condition to detect the kind andre-matches
to parse. - (20:00) Why not use the regex to detect the kind too?
- Can avoid writing the regex twice by using a
def
. - It's less code, but now we're running the regex twice: once in the condition and once to parse.
- We can't capture the result of the test in
cond
. No "cond-let" macro. - We could write our own
cond-let
macro, but should we? - "When you feel like you should write a macro, you should step back and assess your current state of being. Am I tired? Did I have a bad day? Do I really need this?"
- (24:05) New goals for our problem:
- one regex literal
- only run the regex once
- keep the code that uses the matches close to the regex
- Similar problem to "routes" for web endpoints: want the route definition next to the code that executes using the data for that route.
- Feels like an index or table of options.
- (25:20) Let's make a "table". A vector of regex and handler-code pairs.
- We need a coffee mug: "Clojure. It's just a list."
- The code can be a function literal that takes the matches and returns a map of parsed data.
- Write a function
parse-details
which goes through each regex until one matches and then invokes the handler for that one. (See below.) - (30:15) Once we have higher-level data, it's straight forward to
filter
andgroup-by
to get our user count. - Once again, the goal is to take unstructured data and turn it into structured data.
- "You have just up-leveled unstructured information into a sequence of structured information."
- Can
slurp
,split
,map
,filter
, and then aggregate. - (32:10) What happens when we try to open a 10 GB log file?
- Sounds like a problem for next week.
- "Fixing production is always a problem for right now."
- "If a server falls over and no one outside of the devops team knows about it, did the server really fall over?"
- "Who watches the watchers?"
Related episodes:
Clojure in this episode:
#""
re-matches
case
cond
if-let
->>
slurp
filter
group-by
def
constantly
juxt
clojure.string/
includes?
split
Related links:
Code sample from this episode:
(ns devops.week-01
(:require
[clojure.java.io :as io]
[clojure.string :as string]
))
;; General parsing
(def general-re #"(\d\d\d\d-\d\d-\d\d)\s+(\d\d:\d\d:\d\d)\s+\|\s+(\S+)\s+\|\s+(\S+)\s+\|\s+(\S+)\s+\|\s(.*)")
(defn parse-line
[line]
(when-let [[whole dt tm thread-name level ns message] (re-matches general-re line)]
{:raw/line whole
:log/date dt
:log/time tm
:log/thread thread-name
:log/level level
:log/namespace ns
:log/message message
}))
(defn general-parse
[lines]
(->> lines
(map parse-line)
(filter some?)))
;; Detailed parsing
(def detail-specs
[[#"transaction failed while updating user ([^:]+): code 357"
(fn [[_whole user]] {:kind :code-357 :code-357/user user})]
])
(defn try-detail-spec
[message [re fn]]
(when-some [matches (re-matches re message)]
(fn matches)))
(defn parse-details
[entry]
(let [{:keys [log/message]} entry]
(if-some [extra (->> detail-specs
(map (partial try-detail-spec message))
(filter some?)
(first))]
(merge entry extra)
entry)))
;; Log analysis
(defn lines
[filename]
(->> (slurp filename)
(string/split-lines)))
(defn summarize
[filename calc]
(->> (lines filename)
(general-parse)
(map parse-details)
(calc)))
;; data summarizing
(defn code-357-by-user
[entries]
(->> entries
(filter #(= :code-357 (:kind %)))
(map :code-357/user)
(frequencies)))
(comment
(summarize "sample.log" code-357-by-user)
)
Log file sample:
2019-05-14 16:48:55 | process-Poster | INFO | com.donutgram.poster | transaction failed while updating user joe: code 357
2019-05-14 16:48:56 | process-Poster | INFO | com.donutgram.poster | transaction failed while updating user sally: code 357
2019-05-14 16:48:57 | process-Poster | INFO | com.donutgram.poster | transaction failed while updating user joe: code 357