nexmon – Blame information for rev 1

Subversion Repositories:
Rev:
Rev Author Line No. Line
1 office 1 (This is a consolidation of documentation written by stig, sahlberg, and gram)
2  
3 What is the display filter system?
4 ==================================
5 The display filter system allows the user to select packets by testing
6 for values in the proto_tree that Wireshark constructs for that packet.
7 Every proto_item in the proto_tree has an 'abbrev' field
8 and a 'type' field, which tells the display filter engine the name
9 of the field and its type (what values it can hold).
10  
11 For example, this is the definition of the ip.proto field from packet-ip.c:
12  
13 { &hf_ip_proto,
14 { "Protocol", "ip.proto", FT_UINT8, BASE_DEC | BASE_EXT_STRING,
15 &ipproto_val_ext, 0x0, NULL, HFILL }},
16  
17 This definition says that "ip.proto" is the display-filter name for
18 this field, and that its field-type is FT_UINT8.
19  
20 The display filter system has 3 major parts to it:
21  
22 1. A type system (field types, or "ftypes")
23 2. A parser, to convert a user's query to an internal representation
24 3. An engine that uses the internal representation to select packets.
25  
26  
27 code:
28 epan/dfilter/* - the display filter engine, including
29 scanner, parser, syntax-tree semantics checker, DFVM bytecode
30 generator, and DFVM engine.
31 epan/ftypes/* - the definitions of the various FT_* field types.
32 epan/proto.c - proto_tree-related routines
33  
34  
35 The field type system
36 =====================
37 The field type system is stored in epan/ftypes.
38  
39 The proto_tree system #includes ftypes.h, which gives it the ftenum
40 definition, which is the enum of all possible ftypes:
41  
42 /* field types */
43 enum ftenum {
44 FT_NONE, /* used for text labels with no value */
45 FT_PROTOCOL,
46 FT_BOOLEAN, /* TRUE and FALSE come from <glib.h> */
47 FT_UINT8,
48 FT_UINT16,
49 FT_UINT24, /* really a UINT32, but displayed as3 hex-digits if FD_HEX*/
50 FT_UINT32,
51 FT_UINT64,
52 etc., etc.
53 }
54  
55 It also provides the definition of fvalue_t, the struct that holds the *value*
56 that corresponds to the type. Each proto_item (proto_node) holds an fvalue_t
57 due to having a field_info struct (defined in proto.h).
58  
59 The fvalue_t is mostly just a gigantic union of possible C-language types
60 (as opposed to FT_* types):
61  
62 typedef struct _fvalue_t {
63 ftype_t *ftype;
64 union {
65 /* Put a few basic types in here */
66 guint32 uinteger;
67 gint32 sinteger;
68 guint64 integer64;
69 gdouble floating;
70 gchar *string;
71 guchar *ustring;
72 GByteArray *bytes;
73 ipv4_addr ipv4;
74 ipv6_addr ipv6;
75 e_guid_t guid;
76 nstime_t time;
77 tvbuff_t *tvb;
78 GRegex *re;
79 } value;
80  
81 /* The following is provided for private use
82 * by the fvalue. */
83 gboolean fvalue_gboolean1;
84  
85 } fvalue_t;
86  
87  
88 Defining a field type
89 ---------------------
90 The ftype system itself is designed to be modular, so that new field types
91 can be added when necessary.
92  
93 Each field type must implement an ftype_t structure, also defined in
94 ftypes.h. This is the way a field type is registered with the ftype engine.
95  
96 If you take a look at ftype-integer.c, you will see that it provides
97 an ftype_register_integers() function, that fills in many such ftype_t
98 structs. It creates one for each integer type: FT_UINT8, FT_UINT16,
99 FT_UINT32, etc.
100  
101 The ftype_t struct defines the things needed for the ftype:
102  
103 * its ftenum value
104 * a string representation of the FT name ("FT_UINT8")
105 * how much data it consumes in the packet
106 * how to store that value in an fvalue_t: new(), free(),
107 various value-related functions
108 * how to compare that value against another
109 * how to slice that value (strings and byte ranges can be sliced)
110  
111 Using an fvalue_t
112 -----------------
113 Once the value of a field is stored in an fvalue_t (stored in
114 each proto_item via field_info), it's easy to use those values,
115 thanks to the various fvalue_*() functions defined in ftypes.h.
116  
117 Functions like fvalue_get(), fvalue_eq(), etc., are all generic
118 interfaces to get information about the field's value. They work
119 on any field type because of the ftype_t struct, which is the lookup
120 table that the field-type engine uses to work with any field type.
121  
122 The display filter parser
123 =========================
124 The display filter parser (along with the comparison engine)
125 is stored in epan/dfilter.
126  
127 The scanner/parser pair read the string representing the display filter
128 and convert it into a very simple syntax tree. The syntax tree is very
129 simple in that it is possible that many of the nodes contain unparsed
130 chunks of text from the display filter.
131  
132 There are four phases to parsing a user's request:
133  
134 1. Scanning the string for dfilter syntax
135 2. Parsing the keywords according to the dfilter grammar, into a
136 syntax tree
137 3. Doing a semantic check of the nodes in that syntax tree
138 4. Converting the syntax tree into a series of DFVM byte codes
139  
140 The dfilter_compile() function, in epan/dfilter/dfilter.c,
141 runs these 4 phases. The end result is a dfwork_t object (dfw), that
142 can be passed to dfilter_apply() to actually run the display filter
143 against a set of proto_trees.
144  
145  
146 Scanning the display filter string
147 ----------------------------------
148 epan/dfilter/scanner.l is the lex scanner for finding keywords
149 in the user's display filter string.
150  
151 Its operation is simple. It finds the special function and comparison
152 operators ("==", "!=", "eq", "ne", etc.), it finds slice operations
153 ( "[0:1]" ), quoted strings, IP addresses, numbers, and any other "special"
154 keywords or string types.
155  
156 Anything it doesn't know how to handle is passed to to grammar parser
157 as an unparsed string (TOKEN_UNPARSED). This includes field names. The
158 scanner does not interpret any protocol field names at all.
159  
160 The scanner has to return a token type (TOKEN_*, and in many cases,
161 a value. The value will be an stnode_t struct, which is a syntax
162 tree node object. Since the final storage of the parse will
163 be in a syntax tree, it is convenient for the scanner to fill in
164 syntax tree nodes with values when it can.
165  
166 The stnode_t definition is in epan/dfilter/syntax-tree.h
167  
168  
169 Parsing the keywords according to the dfilter grammar
170 -----------------------------------------------------
171 The grammar parser is implemented with the 'lemon' tool,
172 rather than the traditional yacc or bison grammar parser,
173 as lemon grammars were found to be easier to work with. The
174 lemon parser specification (epan/dfilter/grammar.lemon) is
175 much easier to read than its bison counterpart would be,
176 thanks to lemon's feature of being able to name fields, rather
177 then using numbers ($1, $2, etc.)
178  
179 The lemon tool is located in tools/lemon in the Wireshark
180 distribution.
181  
182 An on-line introduction to lemon is available at:
183  
184 http://www.sqlite.org/src/doc/trunk/doc/lemon.html
185  
186 The grammar specifies which type of constructs are possible
187 within the dfilter language ("dfilter-lang")
188  
189 An "expression" in dfilter-lang can be a relational test or a logical test.
190  
191 A relational test compares a value against another, which is usually
192 a field (or a slice of a field) against some static value, like:
193  
194 ip.proto == 1
195 eth.dst != ff:ff:ff:ff:ff:ff
196  
197 A logical test combines other expressions with "and", "or", and "not".
198  
199 At the end of the grammatical parsing, the dfw object will
200 have a valid syntax tree, pointed at by dfw->st_root.
201  
202 If there is an error in the syntax, the parser will call dfilter_fail()
203 with an appropriate error message, which the UI will need to report
204 to the user.
205  
206 The syntax tree system
207 ----------------------
208 The syntax tree is created as a result of running the lemon-based
209 grammar parser on the scanned tokens. The syntax tree code
210 is in epan/dfilter/syntax-tree* and epan/dfilter/sttree-*. It too
211 uses a set of code modules that implement different syntax node types,
212 similar to how the field-type system registers a set of ftypes
213 with a central engine.
214  
215 Each node (stnode_t) in the syntax tree has a type (sttype).
216 These sttypes are very much related to ftypes (field types), but there
217 is not a one-to-one correspondence. The syntax tree nodes are slightly
218 high-level. For example, there is only a single INTEGER sttype, unlike
219 the ftype system that has a type for UINT64, UINT32, UINT16, UINT8, etc.
220  
221 typedef enum {
222 STTYPE_UNINITIALIZED,
223 STTYPE_TEST,
224 STTYPE_UNPARSED,
225 STTYPE_STRING,
226 STTYPE_FIELD,
227 STTYPE_FVALUE,
228 STTYPE_INTEGER,
229 STTYPE_RANGE,
230 STTYPE_FUNCTION,
231 STTYPE_SET,
232 STTYPE_NUM_TYPES
233 } sttype_id_t;
234  
235  
236 The root node of the syntax tree is the main test or comparison
237 being done.
238  
239 Semantic Check
240 --------------
241 After the parsing is done and a syntax tree is available, the
242 code in semcheck.c does a semantic check of what is in the syntax
243 tree.
244  
245 The semantics of the simple syntax tree are checked to make sure that
246 the fields that are being compared are being compared to appropriate
247 values. For example, if a field is an integer, it can't be compared to
248 a string, unless a value_string has been defined for that field.
249  
250 During the process of checking the semantics, the simple syntax tree is
251 fleshed out and no longer contains nodes with unparsed information. The
252 syntax tree is no longer in its simple form, but in its complete form.
253  
254 For example, if the dfilter is slicing a field and comparing
255 against a set of bytes, semcheck.c has to check that the field
256 in question can indeed be sliced.
257  
258 Or, can a field be compared against a certain type of value (string,
259 integer, float, IPv4 address, etc.)
260  
261 The semcheck code also makes adjustments to the syntax tree
262 when it needs to. The parser sometimes stores raw, unparsed strings
263 in the syntax tree, and semcheck has to convert them to
264 certain types. For example, the display filter may contain
265 a value_string string (the "enum" type that protocols can use
266 to define the possible textual descriptions of numeric fields), and
267 semcheck will convert that value_string string into the correct
268 integer value.
269  
270 Truth be told, the semcheck.c code is a bit disorganized, and could
271 be re-designed & re-written.
272  
273 DFVM Byte Codes
274 ---------------
275 The syntax tree is analyzed to create a sequence of bytecodes in the
276 "DFVM" language. "DFVM" stands for Display Filter Virtual Machine. The
277 DFVM is similar in spirit, but not in definition, to the BPF VM that
278 libpcap uses to analyze packets.
279  
280 A virtual bytecode is created and used so that the actual process of
281 filtering packets will be fast. That is, it should be faster to process
282 a list of VM bytecodes than to attempt to filter packets directly from
283 the syntax tree. (heh... no measurement has been made to support this
284 supposition)
285  
286 The DFVM opcodes are defined in epan/dfilter/dfvm.h (dfvm_opcode_t).
287 Similar to how the BPF opcode system works in libpcap, there is a
288 limited set of opcodes. They operate by loading values from the
289 proto_tree into registers, loading pre-defined values into
290 registers, and comparing them. The opcodes are checked in sequence, and
291 there are only 2 branching opcodes: IF_TRUE_GOTO and IF_FALSE_GOTO.
292 Both of these can only branch forwards, and never backwards. In this way
293 sets of DFVM instructions will never get into an infinite loop.
294  
295 The epan/dfilter/gencode.c code converts the syntax tree
296 into a set of dvfm instructions.
297  
298 The constants that are in the DFVM instructions (the constant
299 values that the user is checking against) are pre-loaded
300 into registers via the dvfm_init_const() call, and stored
301 in the dfilter_t structure for when the display filter is
302 actually applied.
303  
304  
305 DFVM Engine
306 ===========
307 Once the DFVM bytecode has been produced, it's a simple matter of
308 running the DFVM engine against the proto_tree from the packet
309 dissection, using the DFVM bytecodes as instructions. If the DFVM
310 bytecode is known before packet dissection occurs, the
311 proto_tree-related code can be "primed" to store away pointers to
312 field_info structures that are interesting to the display filter. This
313 makes lookup of those field_info structures during the filtering process
314 faster.
315  
316 The dfilter_apply() function runs a single pre-compiled
317 display filter against a single proto_tree function, and returns
318 TRUE or FALSE, meaning that the filter matched or not.
319  
320 That function calls dfvm_apply(), which runs across the DFVM
321 instructions, loading protocol field values into DFVM registers
322 and doing the comparisons.
323  
324 There is a top-level Makefile target called 'dftest' which
325 builds a 'dftest' executable that will print out the DFVM
326 bytecode for any display filter given on the command-line.
327 To build it, run:
328  
329 $ make dftest
330  
331 To use it, give it the display filter on the command-line:
332  
333 $ ./dftest 'ip.addr == 127.0.0.1'
334 Filter: "ip.addr == 127.0.0.1"
335  
336 Constants:
337 00000 PUT_FVALUE 127.0.0.1 <FT_IPv4> -> reg#1
338  
339 Instructions:
340 00000 READ_TREE ip.addr -> reg#0
341 00001 IF-FALSE-GOTO 3
342 00002 ANY_EQ reg#0 == reg#1
343 00003 RETURN
344  
345  
346 The output shows the original display filter, then the opcodes
347 that put constant values into registers. The registers are
348 numbered, and are shown in the output as "reg#n", where 'n' is the
349 identifying number.
350  
351 Then the instructions are shown. These are the instructions
352 which are run for each proto_tree.
353  
354 This is what happens in this example:
355  
356 00000 READ_TREE ip.addr -> reg#0
357  
358 Any ip.addr fields in the proto_tree are loaded into register 0. Yes,
359 multiple values can be loaded into a single register. As a result
360 of this READ_TREE, the accumulator will hold TRUE or FALSE, indicating
361 if any field's value was loaded, or not.
362  
363 00001 IF-FALSE-GOTO 3
364  
365 If the load failed because there were no ip.addr fields
366 in the proto_tree, then we jump to instruction 3.
367  
368 00002 ANY_EQ reg#0 == reg#1
369  
370 This checks to see if any of the fields in register 0
371 (which has the pre-loaded constant value of 127.0.0.1) are equal
372 to any of the fields in register 1 (which are all of the ip.addr
373 fields in the proto tree). The resulting value in the
374 accumulator will be TRUE if any of the fields match, or FALSE
375 if none match.
376  
377 00003 RETURN
378  
379 This returns the accumulator's value, either TRUE or FALSE.
380  
381 In addition to dftest, there is also a tools/dfilter-test script
382 which is a unit-test script for the display filter engine.
383 It makes use of text2pcap and tshark to run specific display
384 filters against specific captures (embedded within dfilter-test).
385  
386  
387  
388 Display Filter Functions
389 ========================
390 You define a display filter function by adding an entry to
391 the df_functions table in epan/dfilter/dfunctions.c. The record struct
392 is defined in dfunctions.h, and shown here:
393  
394 typedef struct {
395 char *name;
396 DFFuncType function;
397 ftenum_t retval_ftype;
398 guint min_nargs;
399 guint max_nargs;
400 DFSemCheckType semcheck_param_function;
401 } df_func_def_t;
402  
403 name - the name of the function; this is how the user will call your
404 function in the display filter language
405  
406 function - this is the run-time processing of your function.
407  
408 retval_ftype - what type of FT_* type does your function return?
409  
410 min_nargs - minimum number of arguments your function accepts
411 max_nargs - maximum number of arguments your function accepts
412  
413 semcheck_param_function - called during the semantic check of the
414 display filter string.
415  
416 DFFuncType function
417 -------------------
418 typedef gboolean (*DFFuncType)(GList *arg1list, GList *arg2list, GList **retval);
419  
420 The return value of your function is a gboolean; TRUE if processing went fine,
421 or FALSE if there was some sort of exception.
422  
423 For now, display filter functions can accept a maximum of 2 arguments.
424 The "arg1list" parameter is the GList for the first argument. The
425 'arg2list" parameter is the GList for the second argument. All arguments
426 to display filter functions are lists. This is because in the display
427 filter language a protocol field may have multiple instances. For example,
428 a field like "ip.addr" will exist more than once in a single frame. So
429 when the user invokes this display filter:
430  
431 somefunc(ip.addr) == TRUE
432  
433 even though "ip.addr" is a single argument, the "somefunc" function will
434 receive a GList of *all* the values of "ip.addr" in the frame.
435  
436 Similarly, the return value of the function needs to be a GList, since all
437 values in the display filter language are lists. The GList** retval argument
438 is passed to your function so you can set the pointer to your return value.
439  
440 DFSemCheckType
441 --------------
442 typedef void (*DFSemCheckType)(int param_num, stnode_t *st_node);
443  
444 For each parameter in the syntax tree, this function will be called.
445 "param_num" will indicate the number of the parameter, starting with 0.
446 The "stnode_t" is the syntax-tree node representing that parameter.
447 If everything is okay with the value of that stnode_t, your function
448 does nothing --- it merely returns. If something is wrong, however,
449 it should THROW a TypeError exception.
450  
451  
452 Example: add an 'in' display filter operation
453 =============================================
454  
455 This example has been discussed on wireshark-dev in April 2004. It illustrates
456 how a more complex operation can be added to the display filter language.
457  
458 Question:
459  
460 If I want to add an 'in' display filter operation, I need to define
461 several things. This can happen in different ways. For instance,
462 every value from the "in" value collection will result in a test.
463 There are 2 options here, either a test for a single value:
464  
465 (x in {a b c})
466  
467 or a test for a value in a given range:
468  
469 (x in {a ... z})
470  
471 or even a combination of both. The former example can be reduced to:
472  
473 ((x == a) or (x == b) or (x == c))
474  
475 while the latter can be reduced to
476  
477 ((x >= MIN(a, z)) and (x <= MAX(a, z)))
478  
479 I understand that I can replace "x in {" with the following steps:
480 first store x in the "in" test buffer, then add "(" to the display
481 filter expression internally.
482  
483 Similarly I can replace the closing brace "}" with the following
484 steps: release x from the "in" test buffer and then add ")"
485 to the display filter expression internally.
486  
487 How could I do this?
488  
489 Answer:
490  
491 This could be done in grammar.lemon. The grammar would produce
492 syntax tree nodes, combining them with "or", when it is given
493 tokens that represent the "in" syntax.
494  
495 It could also be done later in the process, maybe in
496 semcheck.c. But if you can do it earlier, in grammar.lemon,
497 then you shouldn't have to worry about modifying anything in
498 semcheck.c, as the syntax tree that is passed to semcheck.c
499 won't contain any new type of operators... just lots of nodes
500 combined with "or".
501  
502 How to add an operator FOO to the display filter language?
503 ==========================================================
504  
505 Go to wireshark/epan/dfilter/
506  
507 Edit grammar.lemon and add the operator. Add the operator FOO and the
508 test logic (defining TEST_OP_FOO).
509  
510 Edit scanner.l and add the operator name(s) hence defining
511 TOKEN_TEST_FOO. Also update the simple() or add the new operand's code.
512  
513 Edit sttype-test.h and add the TEST_OP_FOO to the list of test operations.
514  
515 Edit sttype-test.c and add TEST_OP_FOO to the num_operands() method.
516  
517 Edit gencode.c, add TEST_OP_FOO in the gen_test() method by defining
518 ANY_FOO.
519  
520 Edit dfvm.h and add ANY_FOO to the enum dfvm_opcode_t structure.
521  
522 Edit dfvm.c and add ANY_FOO to dfvm_dump() (for the dftest display filter
523 test binary), to dfvm_apply() hence defining the methods fvalue_foo().
524  
525 Edit semcheck.c and look at the check_relation_XXX() methods if they
526 still apply to the foo operator; if not, amend the code. Start from the
527 check_test() method to discover the logic.
528  
529 Go to wireshark/epan/ftypes/
530  
531 Edit ftypes.h and declare the fvalue_foo(), ftype_can_foo() and
532 fvalue_foo() methods. Add the cmp_foo() method to the struct _ftype_t.
533  
534 This is the first time that a make in wireshark/epan/dfilter/ can
535 succeed. If it fails, then some code in the previously edited files must
536 be corrected.
537  
538 Edit ftypes.c and define the fvalue_foo() method with its associated
539 logic. Define also the ftype_can_foo() and fvalue_foo() methods.
540  
541 Edit all ftype-*.c files and add the required fvalue_foo() methods.
542  
543 This is the point where you should be able to compile without errors in
544 wireshark/epan/ftypes/. If not, first fix the errors.
545  
546 Go to wireshark/epan/ and run make. If this one succeeds, then we're
547 almost done as no errors should occur here.
548  
549 Go to wireshark/ and run make. One thing to do is make dftest and see
550 if you can construct valid display filters with your new operator. Or
551 you may want to move directly to the generation of Wireshark.
552  
553 Look also at wireshark/gtk/dfilter_expr_dlg.c and edit the display filter
554 expression generator.
555  
556 How to add a new test to dfilter-test.py
557 ========================================
558 Note: dfilter-test.py requires Python 2.7 or newer (including Python 3)
559  
560 "tools/dfilter-test.py" is the main test script. It includes
561 the test from files in tools/dftestlib. You can add a test
562 to a file in tools/dftestlib, or you can create a new file
563 in tools/dftestlib. If you do add a new file, you must
564 import it (and the class it defines) in "tools/dfilter-test.py"
565  
566 Each new test class must define "trace_file", which names
567 a capture file in "tools/dftestfiles". All the tests
568 run in that class will use that one capture file.
569  
570 There are 2 methods you can use for testing:
571  
572 assertDfilter(dfilter_text, expected_count)
573  
574 This will run the display filter through tshark, on the
575 file named by "trace_file", and assert that the
576 number of resulting packets equals "expected_count". This
577 also asserts that tshark does not fail; success with zero
578 matches is not the same as failure to compile the display
579 filter string.
580  
581 assertDFilterFail(dfilter_text)
582  
583 This will run tshark with the display filter, and
584 asser that tshark fails. This is useful when expecting
585 display filter syntax errors to be caught.
586  
587 Then, simply run "dfilter-test.py". You can run the tests
588 in a single Test class by naming that -class on the
589 dfilter-test.py command-line, or even run a single
590 test by naming it. E.g., the following are all valid ways
591 of running dfilter-test.py:
592  
593 # Run all tests
594 $ ./tools/dfilter-test.py
595  
596 # Run all tests in "testTVB"
597 $ ./tools/dfilter-test.py testTVB
598  
599 # Run the the "test_contains_1" test from testTVB
600 $ ./tools/dfilter-test.py testTVB.test_contains_1
601  
602 Note that dfilter-test.py should be run from the top of the
603 Wireshark distribution, so it knows where to find the default
604 tshark executable.
605