Rocker Network Switch Register Programming Guide
Copyright (c) Scott Feldman <sfeldma@gmail.com>
Copyright (c) Neil Horman <nhorman@tuxdriver.com>
Version 0.11, 12/29/2014

LICENSE
=======

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

SECTION 1: Introduction
=======================

Overview
--------

This document describes the hardware/software interface for the Rocker switch
device. The intended audience is authors of OS drivers and device emulation
software.

Notations and Conventions
-------------------------

o In register descriptions, [n:m] indicates a range from bit n to bit m,
  inclusive.
o Use of leading 0x indicates a hexadecimal number.
o Use of leading 0b indicates a binary number.
o The use of RSVD or Reserved indicates that a bit or field is reserved for
  future use.
o Field width is in bytes, unless otherwise noted.
o Registers are (R) read-only, (R/W) read/write, (W) write-only, or (COR) clear
  on read.
o TLV values in network-byte-order are designated with (N).


SECTION 2: PCI Configuration Registers
======================================

PCI Configuration Space
-----------------------

Each switch instance registers as a PCI device with PCI configuration space:

        offset  width   description             value
        ---------------------------------------------
        0x0     2       Vendor ID               0x1b36
        0x2     2       Device ID               0x0006
        0x4     4       Command/Status
        0x8     1       Revision ID             0x01
        0x9     3       Class code              0x2800
        0xC     1       Cache line size
        0xD     1       Latency timer
        0xE     1       Header type
        0xF     1       Built-in self test
        0x10    4       Base address low
        0x14    4       Base address high
        0x18-28         Reserved
        0x2C    2       Subsystem vendor ID     *
        0x2E    2       Subsystem ID            *
        0x30-38         Reserved
        0x3C    1       Interrupt line
        0x3D    1       Interrupt pin           0x00
        0x3E    1       Min grant               0x00
        0x3F    1       Max latency             0x00
        0x40    1       TRDY timeout
        0x41    1       Retry count
        0x42    2       Reserved


        * Assigned by sub-system implementation

SECTION 3: Memory-Mapped Register Space
=======================================

There are two memory-mapped BARs. BAR0 maps device register space and is
0x2000 in size. BAR1 maps MSI-X vector and PBA tables and is also 0x2000 in
size, allowing for 256 MSI-X vectors.

All registers are 4 or 8 bytes long. It is assumed host software will access 4
byte registers with one 4-byte access, and 8 byte registers with either two
4-byte accesses or a single 8-byte access. In the case of two 4-byte accesses,
access must be lower and then upper 4-bytes, in that order.
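
As an illustration of the two-access rule above, the following minimal sketch
splits a 64-bit register write into the two 4-byte accesses a driver would
issue, lower half first. The helper name accesses_for_64bit_write is
hypothetical and not part of the device interface; the little-endian byte
layout follows the Endianness section of this document.

```python
import struct

def accesses_for_64bit_write(offset, value):
    """Return [(offset, 4 LE bytes), ...] in the order they must be issued."""
    if not 0 <= value < (1 << 64):
        raise ValueError("value must fit in 64 bits")
    lower = value & 0xFFFFFFFF
    upper = (value >> 32) & 0xFFFFFFFF
    # Lower 4 bytes first, then upper 4 bytes at offset + 4.
    return [
        (offset, struct.pack("<I", lower)),
        (offset + 4, struct.pack("<I", upper)),
    ]

# Example: a 64-bit write to the register at offset 0x0018 (TEST_REG64).
accesses = accesses_for_64bit_write(0x0018, 0x1122334455667788)
```

A real driver would hand each tuple to its MMIO write primitive in list order.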

BAR0 device register space is organized as follows:

        offset          description
        ------------------------------------------------------
        0x0000-0x000f   Bogus registers to catch misbehaving
                        drivers. Writes do nothing. Reads
                        back as 0xDEADBABE.
        0x0010-0x00ff   Test registers
        0x0300-0x03ff   General purpose registers
        0x1000-0x1fff   Descriptor control

Holes in register space are reserved. Writes to reserved registers do nothing.
Reads of reserved registers return 0.

Write-combining is not enabled on any of the registers.

BAR1 MSI-X register space is organized as follows:

        offset          description
        ------------------------------------------------------
        0x0000-0x0fff   MSI-X vector table (256 vectors total)
        0x1000-0x1fff   MSI-X PBA table


SECTION 4: Interrupts, DMA, and Endianness
==========================================

PCI Interrupts
--------------

The device supports only MSI-X interrupts. BAR1 memory-mapped region contains
the MSI-X vector and PBA tables, with support for up to 256 MSI-X vectors.

The vector assignment is:

        vector          description
        -----------------------------------------------------
        0               Command descriptor ring completion
        1               Event descriptor ring completion
        2               Test operation completion
        3               RSVD
        4-255           Tx and Rx descriptor ring completion
                          Tx vector is even
                          Rx vector is odd

An MSI-X vector table entry is 16 bytes:

        field           offset  width   description
        -------------------------------------------------------------
        lower_addr      0x0     4       [31:2]  message address[31:2]
                                        [1:0]   Rsvd (4 byte alignment
                                                required)
        upper_addr      0x4     4       [31:15] Rsvd
                                        [14:0]  message address[46:32]
        data            0x8     4       message data[31:0]
        control         0xc     4       [31:1]  Rsvd
                                        [0]     mask (0 = enable,
                                                1 = masked)

Software should install the Interrupt Service Routine (ISR) before any ports
are enabled or any commands are issued on the command ring.

DMA Operations
--------------

DMA operations are used for packet DMA to/from the CPU, command processing,
and event processing. Command processing includes statistical counters and
table dumps, table insertion/deletion, and more. Event processing provides an
async notification method for device-originating events. Each DMA operation
has a set of control registers to manage a descriptor ring. The descriptor
rings are allocated from contiguous host DMA-able memory and registers specify
the ring's base address, size and current head and tail indices. Software
always writes the head, and hardware always writes the tail.

The high-order bit of DMA_DESC_COMP_ERR is used to mark hardware completion
of a descriptor. Software will clear this bit when posting a descriptor to the
ring, and hardware will set this bit when the descriptor is complete.

Descriptor ring sizes must be a power of 2 and range from 2 to 64K entries.
Descriptor rings' base address must be 8-byte aligned. Descriptors must be
packed within the ring. Each descriptor in each ring must also be aligned on
an 8 byte boundary. Each descriptor ring will have these registers:

        DMA_DESC_xxx_BASE_ADDR, offset 0x1000 + (x * 32), 64-bit, (R/W)
        DMA_DESC_xxx_SIZE, offset 0x1008 + (x * 32), 32-bit, (R/W)
        DMA_DESC_xxx_HEAD, offset 0x100c + (x * 32), 32-bit, (R/W)
        DMA_DESC_xxx_TAIL, offset 0x1010 + (x * 32), 32-bit, (R)
        DMA_DESC_xxx_CTRL, offset 0x1014 + (x * 32), 32-bit, (W)
        DMA_DESC_xxx_CREDITS, offset 0x1018 + (x * 32), 32-bit, (R/W)
        DMA_DESC_xxx_RSVD1, offset 0x101c + (x * 32), 32-bit, (R/W)

Where x is descriptor ring index:

        index           ring
        --------------------
        0               CMD
        1               EVENT
        2               TX (port 0)
        3               RX (port 0)
        4               TX (port 1)
        5               RX (port 1)
        .
        .
        .
        124             TX (port 61)
        125             RX (port 61)
        126             Resv
        127             Resv
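
The register offsets and ring index table above reduce to simple arithmetic.
The sketch below computes them; the function names ring_index and
ring_reg_offset are hypothetical, and the port argument is the zero-based port
number exactly as it appears in the ring table (0 through 61).

```python
DMA_DESC_BASE = 0x1000      # start of descriptor control space in BAR0
DMA_DESC_STRIDE = 32        # bytes of register space per ring
REG_OFFSETS = {
    "BASE_ADDR": 0x00, "SIZE": 0x08, "HEAD": 0x0c, "TAIL": 0x10,
    "CTRL": 0x14, "CREDITS": 0x18, "RSVD1": 0x1c,
}

def ring_index(kind, port=None):
    """Map a ring kind (and port, for TX/RX) to its descriptor ring index."""
    if kind == "CMD":
        return 0
    if kind == "EVENT":
        return 1
    if kind in ("TX", "RX"):
        if port is None or not 0 <= port <= 61:
            raise ValueError("TX/RX rings need a port in 0..61")
        return 2 + 2 * port + (1 if kind == "RX" else 0)
    raise ValueError("unknown ring kind: %r" % (kind,))

def ring_reg_offset(index, reg):
    """BAR0 offset of register `reg` for descriptor ring `index`."""
    if not 0 <= index <= 127:
        raise ValueError("ring index must be in 0..127")
    return DMA_DESC_BASE + REG_OFFSETS[reg] + index * DMA_DESC_STRIDE
```

For example, ring_reg_offset(ring_index("RX", 0), "TAIL") gives the offset of
the TAIL register for the first Rx ring.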

Writing BASE_ADDR or SIZE will reset HEAD and TAIL to zero. HEAD cannot be
written past TAIL. To do so would wrap the ring. An empty ring is when HEAD
== TAIL. A full ring is when HEAD is one position behind TAIL. Both HEAD and
TAIL increment and modulo wrap at the ring size.
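
The empty and full conditions can be sketched with modular arithmetic. This is
a minimal model, not driver code; the helper names are hypothetical, and the
size check restates the power-of-2, 2 to 64K entries rule for ring sizes.

```python
def check_ring_size(size):
    """Ring sizes must be a power of 2 between 2 and 64K entries."""
    if not (2 <= size <= 65536 and size & (size - 1) == 0):
        raise ValueError("invalid ring size: %d" % size)
    return size

def ring_advance(index, size):
    """Increment HEAD or TAIL, wrapping modulo the ring size."""
    return (index + 1) % size

def ring_is_empty(head, tail):
    return head == tail

def ring_is_full(head, tail, size):
    # Full when HEAD is one position behind TAIL.
    return ring_advance(head, size) == tail

def ring_free_slots(head, tail, size):
    """Number of descriptors software may still post before the ring is full."""
    return (tail - head - 1) % size
```

One slot is always left unused, so a ring of N entries holds at most N - 1
outstanding descriptors; this is what keeps "empty" and "full" distinguishable.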

CTRL register bits:

        bit     name            description
        ------------------------------------------------------------------------
        [0]     CTRL_RESET      Reset the descriptor ring
        [31:1]  Reserved

All descriptor types share some common fields:

        field                   width   description
        -------------------------------------------------------------------
        DMA_DESC_BUF_ADDR       8       Phys addr of desc payload, 8-byte
                                        aligned
        DMA_DESC_COOKIE         8       Desc cookie for completion matching,
                                        upper-most bit is reserved
        DMA_DESC_BUF_SIZE       2       Desc payload size in bytes
        DMA_DESC_TLV_SIZE       2       Desc payload total size in bytes
                                        used for TLVs. Must be <=
                                        DMA_DESC_BUF_SIZE.
        DMA_DESC_COMP_ERR       2       Completion status of associated
                                        desc payload. High order bit is
                                        clear on new descs, set by
                                        hw for completed items.
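
The common fields can be sketched as a packed little-endian structure. This is
an illustration under stated assumptions: the table above gives field widths
but not byte offsets, so the sketch assumes the fields are laid out back to
back in table order, and it does not model any trailing padding a real
descriptor may carry to stay 8-byte aligned. The names COMMON_FIELDS,
COMP_ERR_DONE, pack_new_desc and is_complete are hypothetical.

```python
import struct

# Assumed layout: BUF_ADDR (8), COOKIE (8), BUF_SIZE (2), TLV_SIZE (2),
# COMP_ERR (2), little-endian, contiguous in table order.
COMMON_FIELDS = struct.Struct("<QQHHH")
COMP_ERR_DONE = 0x8000      # high-order bit of the 16-bit COMP_ERR field

def pack_new_desc(buf_addr, cookie, buf_size, tlv_size=0):
    """Build the common fields for a descriptor software is about to post."""
    if buf_addr % 8:
        raise ValueError("payload buffer must be 8-byte aligned")
    if cookie >> 63:
        raise ValueError("upper-most cookie bit is reserved")
    if tlv_size > buf_size:
        raise ValueError("TLV_SIZE must be <= BUF_SIZE")
    # COMP_ERR is written as 0 so the completion bit starts out clear.
    return COMMON_FIELDS.pack(buf_addr, cookie, buf_size, tlv_size, 0)

def is_complete(raw):
    """True once hardware has set the high-order bit of COMP_ERR."""
    comp_err = COMMON_FIELDS.unpack_from(raw)[4]
    return bool(comp_err & COMP_ERR_DONE)
```

Software polls (or is interrupted) and then checks is_complete() on each
descriptor between its last processed position and TAIL.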

To support forward- and backward-compatibility, descriptor and completion
payloads are specified in TLV format. Fields are packed with Type=field name,
Length=field length, and Value=field value. Software will ignore unknown fields
filled in by the switch. Likewise, the switch will ignore unknown fields
filled in by software.

Descriptor payload buffer is 8-byte aligned and TLVs are 8-byte aligned. The
value within a TLV is also 8-byte aligned. The (packed, 8 byte) TLV header is:

        field   width   description
        -----------------------------
        type    4       TLV type
        len     2       TLV value length
        pad     2       Reserved

The alignment requirements for descriptors and TLVs are to avoid unaligned
access exceptions in software. Note that the payload for each TLV is also
8 byte aligned.

Figure 1 shows an example descriptor buffer with two TLVs.

                   <------- 8 bytes ------->

    8-byte  +----+ +-----------+-----+-----+                        +-+
    align          |   type    | len | pad |   TLV#1 hdr              |
                   +-----------+-----+-----+     (len=22)             |
                   |                       |                          |
                   |  value                |   TLV#1 value            |
                   |                       |     (padded to 8-byte    |
                   |                 +-----+      alignment)          |
                   |                 |/////|                          |
    8-byte  +----+ +-----------+-----------+                          |
    align          |   type    | len | pad |   TLV#2 hdr        DESC_BUF_SIZE
                   +-----+-----+-----+-----+     (len=2)              |
                   |value|/////////////////|   TLV#2 value            |
                   +-----+/////////////////|                          |
                   |///////////////////////|                          |
                   |///////////////////////|                          |
                   |///////////////////////|                          |
                   |////////unused/////////|                          |
                   |////////space//////////|                          |
                   |///////////////////////|                          |
                   |///////////////////////|                          |
                   |///////////////////////|                          |
                   +-----------------------+                        +-+

                                fig. 1

TLVs can be nested within the NEST TLV type.
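
The TLV layout above can be sketched with a small pack/unpack pair. The header
is the 8-byte (type, len, pad) structure from the table, little-endian per the
Endianness section; each value is padded to the next 8-byte boundary. The
function names and the TLV type numbers used in the example (1 and 2) are
hypothetical; real type values are not defined in this part of the document.

```python
import struct

TLV_HDR = struct.Struct("<IH2x")    # type (4), len (2), pad (2) = 8 bytes

def align8(n):
    """Round n up to the next multiple of 8."""
    return (n + 7) & ~7

def pack_tlv(tlv_type, value):
    """Header, then value, then zero padding up to 8-byte alignment."""
    padding = bytes(align8(len(value)) - len(value))
    return TLV_HDR.pack(tlv_type, len(value)) + value + padding

def unpack_tlvs(buf):
    """Walk a TLV buffer and return a list of (type, value) pairs."""
    tlvs = []
    off = 0
    while off < len(buf):
        tlv_type, length = TLV_HDR.unpack_from(buf, off)
        off += TLV_HDR.size
        tlvs.append((tlv_type, bytes(buf[off:off + length])))
        off += align8(length)
    return tlvs

# The two TLVs of fig. 1: a 22-byte value and a 2-byte value.
buf = pack_tlv(1, bytes(22)) + pack_tlv(2, b"\x01\x02")
```

Under these assumptions the buffer is 48 bytes long (8 + 24 for the first TLV,
8 + 8 for the second), which would be the DMA_DESC_TLV_SIZE for fig. 1. A
nested TLV can be built the same way, by passing the concatenation of packed
inner TLVs as the value of the outer one.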

Interrupt credits
^^^^^^^^^^^^^^^^^

MSI-X vectors used for descriptor ring completions use a credit mechanism for
efficient device, PCIe bus, OS and driver operations. Each descriptor ring has
a credit count which represents the number of outstanding descriptors to be
processed by the driver. As the device marks descriptors complete, the credit
count is incremented. As the driver processes those outstanding descriptors,
it returns credits back to the device. This way, the device knows the driver's
progress and can decide whether to fire the next interrupt.
When the credit count is zero, and the first descriptors are posted for the
driver, a single interrupt is fired. Once the interrupt is fired, the
interrupt is disabled (auto-masked*). In response to the interrupt, the driver
will process descriptors and PIO write a returned credit value for that
descriptor ring. If the driver returns all credits (the driver caught up with
the device and there is no outstanding work), then the interrupt is unmasked,
but not fired. If only partial credits are returned, the interrupt remains
masked but the device generates an interrupt, signaling the driver that more
outstanding work is available.

(* this masking is unrelated to the MSI-X interrupt mask register)
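
The credit rules can be modeled with a small state machine. This is a
simplified sketch of the behavior described above, not a description of any
particular device implementation; the class RingCredits and its attribute
names are hypothetical.

```python
class RingCredits:
    """Toy model of one descriptor ring's interrupt credit state."""

    def __init__(self):
        self.credits = 0        # completed descriptors not yet returned
        self.masked = False     # auto-mask state (not the MSI-X mask bit)
        self.interrupts = 0     # number of interrupts fired so far

    def device_complete(self, n=1):
        """Device marks n descriptors complete."""
        if n < 1:
            raise ValueError("n must be at least 1")
        fire = self.credits == 0 and not self.masked
        self.credits += n
        if fire:
            self.interrupts += 1
            self.masked = True      # auto-mask after firing

    def driver_return(self, n):
        """Driver returns n credits after processing n descriptors."""
        if not 1 <= n <= self.credits:
            raise ValueError("cannot return %d credits" % n)
        self.credits -= n
        if self.credits == 0:
            self.masked = False     # caught up: unmask, do not fire
        else:
            self.interrupts += 1    # partial return: stay masked, signal again
```

In this model a burst of completions produces one interrupt, and further
interrupts arrive only when the driver reports progress without catching up.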

Endianness
----------

Device registers are hard-coded to little-endian (LE). The driver should
convert to/from host endianness to LE for device register accesses.

Descriptors are LE. Descriptor buffer TLVs will have LE type and length
fields, but the value field can either be LE or network-byte-order, depending
on context. TLV values containing network packet data will be in
network-byte-order. A TLV value containing a field or mask used to compare
against network packet data is network-byte-order. For example, flow match
fields (and masks) are network-byte-order since they're matched directly,
byte-by-byte, against network packet data. All non-network-packet TLV
multi-byte values will be LE.

TLV values in network-byte-order are designated with (N).
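
A short sketch of the two encodings, using Python's struct module. The helper
names are hypothetical; the example values (a port number, VLAN ID 100, and
ethertype 0x0800) are chosen only to show the byte order.

```python
import struct

def le32(value):
    """Encode a 32-bit non-packet value (for example a port number) as LE."""
    return struct.pack("<I", value)

def net16(value):
    """Encode a 16-bit (N) value (for example a VLAN ID) in network order."""
    return struct.pack("!H", value)

pport_bytes = le32(3)           # least significant byte first
vlan_bytes = net16(100)         # most significant byte first
ethertype_bytes = net16(0x0800)
```

The (N) values end up byte-for-byte identical to how the field appears in a
packet on the wire, which is what allows direct comparison against packet data.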


SECTION 5: Test Registers
=========================

Rocker has several test registers to support troubleshooting register access,
interrupt generation, and DMA operations:

        TEST_REG, offset 0x0010, 32-bit (R/W)
        TEST_REG64, offset 0x0018, 64-bit (R/W)
        TEST_IRQ, offset 0x0020, 32-bit (R/W)
        TEST_DMA_ADDR, offset 0x0028, 64-bit (R/W)
        TEST_DMA_SIZE, offset 0x0030, 32-bit (R/W)
        TEST_DMA_CTRL, offset 0x0034, 32-bit (R/W)

Reads of TEST_REG and TEST_REG64 will return a value equal to twice the last
value written to the register. The 32-bit and 64-bit versions are for testing
32-bit and 64-bit host accesses.

A vector can be written to TEST_IRQ and the device will generate an interrupt
for that vector.

To test basic DMA operations, allocate a DMA-able host buffer and put the
buffer address into TEST_DMA_ADDR and size into TEST_DMA_SIZE. Then, write to
TEST_DMA_CTRL to manipulate the buffer contents. TEST_DMA_CTRL operations are:

        operation               value   description
        -----------------------------------------------------------
        TEST_DMA_CTRL_CLEAR     1       clear buffer
        TEST_DMA_CTRL_FILL      2       fill buffer bytes with 0x96
        TEST_DMA_CTRL_INVERT    4       invert bytes in buffer

Various buffer addresses and sizes should be tested to verify that no address
boundary issues exist. In particular, buffers that start on odd-8-byte boundary
and/or span multiple PAGE sizes should be tested.
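
A driver self-test needs to know what the buffer should contain after each
TEST_DMA_CTRL operation. The sketch below computes that expectation on the
host side. It assumes "clear" means zero-fill and "invert" means a bitwise
complement of every byte; the function name expected_after is hypothetical.

```python
TEST_DMA_CTRL_CLEAR = 1
TEST_DMA_CTRL_FILL = 2
TEST_DMA_CTRL_INVERT = 4

def expected_after(op, before):
    """Expected buffer contents after the device applies `op` to `before`."""
    if op == TEST_DMA_CTRL_CLEAR:
        return bytes(len(before))               # assumed: all zero bytes
    if op == TEST_DMA_CTRL_FILL:
        return b"\x96" * len(before)
    if op == TEST_DMA_CTRL_INVERT:
        return bytes(b ^ 0xFF for b in before)  # assumed: bitwise complement
    raise ValueError("unknown TEST_DMA_CTRL operation: %r" % (op,))
```

A test would fill the buffer, write the operation to TEST_DMA_CTRL, wait for
the test completion interrupt (vector 2), and compare the buffer against
expected_after().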


SECTION 6: Ports
================

Physical and Logical Ports
--------------------------

The switch supports up to 62 physical (front-panel) ports. Register
PORT_PHYS_COUNT returns the actual number of physical ports available:

        PORT_PHYS_COUNT, offset 0x0304, 32-bit, (R)

In addition to front-panel ports, the switch supports logical ports for
tunnels.

Front-panel ports and logical tunnel ports are mapped into a single 32-bit port
space. A special CPU port is assigned port 0. The front-panel ports are
mapped to ports 1-62. A special loopback port is assigned port 63. Logical
tunnel ports are assigned ports 0x00010000-0x0001ffff.
To summarize the port assignments:

        port                    mapping
        -------------------------------------------------------
        0                       CPU port (for packets to/from host CPU)
        1-62                    front-panel physical ports
        63                      loopback port
        64-0x0000ffff           RSVD
        0x00010000-0x0001ffff   logical tunnel ports
        0x00020000-0xffffffff   RSVD
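
The port space table maps directly onto a range check. A minimal sketch; the
function name classify_port and the returned labels are hypothetical.

```python
def classify_port(port):
    """Classify a 32-bit port number according to the port space table."""
    if not 0 <= port <= 0xFFFFFFFF:
        raise ValueError("port must fit in 32 bits")
    if port == 0:
        return "cpu"
    if 1 <= port <= 62:
        return "front-panel"
    if port == 63:
        return "loopback"
    if 0x00010000 <= port <= 0x0001FFFF:
        return "tunnel"
    return "reserved"
```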

Physical Port Mode
------------------

Switch front-panel ports operate in a mode. Currently, the only mode is
OF-DPA. OF-DPA[1] mode is based on OpenFlow Data Plane Abstraction (OF-DPA)
Abstract Switch Specification, Version 1.0, from Broadcom Corporation. To
set/get the mode for front-panel ports, see port settings, below.

Port Settings
-------------

Link status for all front-panel ports is available via PORT_PHYS_LINK_STATUS:

        PORT_PHYS_LINK_STATUS, offset 0x0310, 64-bit, (R)

        Value is port bitmap. Bits 0 and 63 always read 0. Bits 1-62
        read 1 for link UP and 0 for link DOWN for respective front-panel ports.
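
Decoding the link status bitmap is a matter of testing bits 1 through 62. A
minimal sketch; the function name ports_with_link_up is hypothetical.

```python
def ports_with_link_up(status):
    """Return the front-panel port numbers whose link bit is set."""
    return [port for port in range(1, 63) if (status >> port) & 1]

# Example: ports 1 and 62 up, everything else down.
up = ports_with_link_up((1 << 1) | (1 << 62))
```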

Other properties for front-panel ports are available via DMA CMD descriptors:

        Get PORT_SETTINGS descriptor:

                field           width   description
                ----------------------------------------------
                PORT_SETTINGS   2       CMD_GET
                PPORT           4       Physical port #

        Get PORT_SETTINGS completion:

                field           width   description
                ----------------------------------------------
                PPORT           4       Physical port #
                SPEED           4       Current port interface speed, in Mbps
                DUPLEX          1       1 = Full, 0 = Half
                AUTONEG         1       1 = enabled, 0 = disabled
                MACADDR         6       Port MAC address
                MODE            1       0 = OF-DPA
                LEARNING        1       MAC address learning on port
                                          1 = enabled
                                          0 = disabled
                PHYS_NAME       <var>   Physical port name (string)

        Set PORT_SETTINGS descriptor:

                field           width   description
                ----------------------------------------------
                PORT_SETTINGS   2       CMD_SET
                PPORT           4       Physical port #
                SPEED           4       Port interface speed, in Mbps
                DUPLEX          1       1 = Full, 0 = Half
                AUTONEG         1       1 = enabled, 0 = disabled
                MACADDR         6       Port MAC address
                MODE            1       0 = OF-DPA

Port Enable
-----------

Front-panel ports are initially disabled, which means port ingress and egress
packets will be dropped. To enable or disable a port, use PORT_PHYS_ENABLE:

        PORT_PHYS_ENABLE: offset 0x0318, 64-bit, (R/W)

        Value is bitmap of first 64 ports. Bits 0 and 63 are ignored
        and always read as 0. Write 1 to enable port; write 0 to disable it.
        Default is 0.
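
Because PORT_PHYS_ENABLE is a bitmap covering all ports, changing one port is
a read-modify-write. The sketch below computes the new register value from the
current one; the names PORT_BITS and set_port_enabled are hypothetical.

```python
PORT_BITS = (1 << 63) - 2       # bits 1-62; bits 0 and 63 are ignored

def set_port_enabled(current, port, enabled):
    """Return the PORT_PHYS_ENABLE value with one port's bit updated."""
    if not 1 <= port <= 62:
        raise ValueError("front-panel ports are numbered 1..62")
    bit = 1 << port
    new = (current | bit) if enabled else (current & ~bit)
    return new & PORT_BITS
```

A driver would read the register, pass the value through set_port_enabled(),
and write the result back.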


SECTION 7: Switch Control
=========================

This section covers switch-wide register settings.

Control
-------

This register is used for low level control of the switch.

        CONTROL: offset 0x0300, 32-bit, (W)

        bit     name            description
        ------------------------------------------------------------------------
        [0]     CONTROL_RESET   If set, device will perform reset
        [31:1]  Reserved

Switch ID
---------

The switch has a SWITCH_ID to be used by software to uniquely identify the
switch:

        SWITCH_ID: offset 0x0320, 64-bit, (R)

        Value is opaque to switch software and no special encoding is implied.


SECTION 8: Events
=================

Non-I/O asynchronous events from the device are reported to the host using the
event ring. The TLV structure for events is:

        field           width   description
        ---------------------------------------------------
        TYPE            4       Event type, one of:
                                  1: LINK_CHANGED
                                  2: MAC_VLAN_SEEN
        INFO            <nest>  Event info (details below)

Link Changed Event
------------------

When link status changes on a physical port, this event is generated.

        field           width   description
        ---------------------------------------------------
        INFO            <nest>
          PPORT         4       Physical port
          LINKUP        1       Link status:
                                  0: down
                                  1: up

MAC VLAN Seen Event
-------------------

When a packet ingresses on a port and the source MAC/VLAN isn't known to the
device, the device will generate this event. In response to the event, the
driver should install the MAC/VLAN for the port into the device's bridge
table. Once installed, the MAC/VLAN is known on the port and this event will
no longer be generated.

        field           width   description
        ---------------------------------------------------
        INFO            <nest>
          PPORT         4       Physical port
          MAC           6       MAC address
          VLAN          2       VLAN ID


SECTION 9: CPU Packet Processing
================================

Ingress packets directed to the host CPU for further processing are delivered
in the DMA RX ring. Likewise, host CPU originating packets destined to egress
on switch ports are scheduled by software using the DMA TX ring.

Tx Packet Processing
--------------------

Software schedules packets for egress on switch ports using the DMA TX ring. A
TX descriptor buffer describes the packet location and size in host DMA-able
memory, the destination port, and any hardware-offload functions (such as L3
payload checksum offload). Software then bumps the descriptor head to signal
hardware of new Tx work. In response, hardware will DMA read Tx descriptors up
to head, DMA read descriptor buffer and packet data, perform offloading
functions, and finally frame packet on wire (network). Once packet processing
is complete, hardware will write back status to descriptor(s) to signal to
software that Tx is complete and software resources (e.g. skb) backing packet
can be released.

Figure 2 shows an example 3-fragment packet queued with one Tx descriptor. A
TLV is used for each packet fragment.

                                                pkt frag 1
                                                +-------+  +-+
                                            +---+       |    |
                           desc buf         |   |       |    |
                          +--------+        |   |       |    |
           Tx ring    +---+        +--------+   |       |    |
        +---------+   |   |  TLVs  |            +-------+    |
        |         +---+   +--------+            pkt frag 2   |
        | desc 0  |       |        +--------+   +-------+    |
        +---------+       |  TLVs  |        +---+       |    |
  head+-+         |       +--------+            |       |    |
        | desc 1  |       |        +--------+   +-------+    |pkt
        +---------+       |  TLVs  |        |                |
        |         |       +--------+        |   pkt frag 3   |
        |         |                         |   +-------+    |
        +---------+                         +---+       |    |
        |         |                             |       |    |
        |         |                             |       |    |
        +---------+                             |       |    |
        |         |                             |       |    |
        |         |                             |       |    |
        +---------+                             |       |    |
        |         |                             +-------+  +-+
        |         |
        +---------+

                                fig. 2

The TLVs for Tx descriptor buffer are:

        field                   width   description
        ---------------------------------------------------------------------
        PPORT                   4       Destination physical port #
        TX_OFFLOAD              1       Hardware offload modes:
                                          0: no offload
                                          1: insert IP csum (ipv4 only)
                                          2: insert TCP/UDP csum
                                          3: L3 csum calc and insert
                                             into csum offset (TX_L3_CSUM_OFF)
                                             16-bit 1's complement csum value.
                                             IPv4 pseudo-header and IP
                                             already calculated by OS
                                             and inserted.
                                          4: TSO (TCP Segmentation Offload)
        TX_L3_CSUM_OFF          2       For L3 csum offload mode, the offset,
                                        from the beginning of the packet,
                                        of the csum field in the L3 header
        TX_TSO_MSS              2       For TSO offload mode, the
                                        Maximum Segment Size in bytes
        TX_TSO_HDR_LEN          2       For TSO offload mode, the
                                        length of ethernet, IP, and
                                        TCP/UDP headers, including IP
                                        and TCP options.
        TX_FRAGS                <array> Packet fragments
          TX_FRAG               <nest>  Packet fragment
            TX_FRAG_ADDR        8       DMA address of packet fragment
            TX_FRAG_LEN         2       Packet fragment length

Possible status return codes in descriptor on completion are:

        DESC_COMP_ERR           reason
        --------------------------------------------------------------------
        0                       OK
        -ROCKER_ENXIO           address or data read err on desc buf or packet
                                fragment
        -ROCKER_EINVAL          bad pport or TSO or csum offloading error
        -ROCKER_ENOMEM          no memory for internal staging tx fragment

Rx Packet Processing
--------------------

Packets ingressing on switch ports that are not forwarded by the switch, but
rather directed to the host CPU for further processing, are delivered in the
DMA RX ring. Rx descriptor buffers are allocated by software and placed on the
ring. Hardware will fill Rx descriptor buffers with packet data, write the
completion, and signal to software that a new packet is ready. Since Rx packet
size is not known a priori, the Rx descriptor buffer must be allocated for
worst-case packet size. A single Rx descriptor will contain the entire Rx
packet data in one RX_FRAG. Other Rx TLVs describe any hardware offloads
performed on the packet, such as checksum validation.

The TLVs for Rx descriptor buffer are:

        field                   width   description
        ---------------------------------------------------
        PPORT                   4       Source physical port #
        RX_FLAGS                2       Packet parsing flags:
                                          (1 << 0): IPv4 packet
                                          (1 << 1): IPv6 packet
                                          (1 << 2): csum calculated
                                          (1 << 3): IPv4 csum good
                                          (1 << 4): IP fragment
                                          (1 << 5): TCP packet
                                          (1 << 6): UDP packet
                                          (1 << 7): TCP/UDP csum good
                                          (1 << 8): Offload forward
        RX_CSUM                 2       IP calculated checksum:
                                          IPv4: IP payload csum
                                          IPv6: header and payload csum
                                        (Only valid if RX_FLAGS:csum calculated
                                        is set)
        RX_FRAG_ADDR            8       DMA address of packet fragment
        RX_FRAG_MAX_LEN         2       Packet maximum fragment length
        RX_FRAG_LEN             2       Actual packet fragment length after receive

Offload forward RX_FLAG indicates the device has already forwarded the packet
so the host CPU should not also forward the packet.
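
Decoding RX_FLAGS is a bit test per flag. The sketch below uses the bit
positions from the table above; the flag labels and function names are
hypothetical, chosen for readability.

```python
RX_FLAG_NAMES = {
    0: "ipv4",
    1: "ipv6",
    2: "csum_calculated",
    3: "ipv4_csum_good",
    4: "ip_fragment",
    5: "tcp",
    6: "udp",
    7: "tcp_udp_csum_good",
    8: "offload_forward",
}

def decode_rx_flags(flags):
    """Return the set of flag names whose bits are set in RX_FLAGS."""
    return {name for bit, name in RX_FLAG_NAMES.items() if flags & (1 << bit)}

def host_should_forward(flags):
    """False when the device has already forwarded the packet."""
    return not (flags & (1 << 8))

def rx_csum_is_valid(flags):
    """RX_CSUM is only meaningful when the csum calculated flag is set."""
    return bool(flags & (1 << 2))
```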

Possible status return codes in descriptor on completion are:

        DESC_COMP_ERR           reason
        --------------------------------------------------------------------
        0                       OK
        -ROCKER_ENXIO           address or data read err on desc buf
        -ROCKER_ENOMEM          no memory for internal staging desc buf
        -ROCKER_EMSGSIZE        Rx descriptor buffer wasn't big enough to contain
                                packet data TLV and other TLVs.


SECTION 10: OF-DPA Mode
=======================

OF-DPA mode allows the switch to offload flow packet processing functions to
hardware. An OpenFlow controller would communicate with an OpenFlow agent
installed on the switch. The OpenFlow agent would (directly or indirectly)
communicate with the Rocker switch driver, which in turn would program switch
hardware with flow functionality, as defined in OF-DPA. The block diagram is:

                +----------------------+
                |         OF           |
                |  Remote Controller   |
                +--------+-------------+
                         |
                         |
                +--------+---------+
                |       OF         |
                |   Local Agent    |
                +------------------+
                |                  |
                |   Rocker Driver  |
                +------------------+
                    <this spec>
                +------------------+
                |                  |
                |   Rocker Switch  |
                +------------------+

To participate in flow functions, ports must be configured for OF-DPA mode
during switch initialization.

OF-DPA Flow Table Interface
---------------------------

There are commands to add, modify, delete, and get stats of flow table entries.
The commands are issued using the DMA CMD descriptor ring. The following
commands are defined:

        CMD_ADD:        add an entry to flow table
        CMD_MOD:        modify an entry in flow table
        CMD_DEL:        delete an entry from flow table
        CMD_GET_STATS:  get stats for flow entry

TLVs for add and modify commands are:

        field                   width   description
        ----------------------------------------------------
        OF_DPA_CMD              2       CMD_[ADD|MOD]
        OF_DPA_TBL              2       Flow table ID
                                          0: ingress port
                                          10: vlan
                                          20: termination mac
                                          30: unicast routing
                                          40: multicast routing
                                          50: bridging
                                          60: ACL policy
        OF_DPA_PRIORITY         4       Flow priority
        OF_DPA_HARDTIME         4       Hard timeout for flow
        OF_DPA_IDLETIME         4       Idle timeout for flow
        OF_DPA_COOKIE           8       Cookie

Additional TLVs based on flow table ID:

        Table ID 0: ingress port

            field                   width   description
            ----------------------------------------------------
            OF_DPA_IN_PPORT         4       ingress physical port number
            OF_DPA_GOTO_TBL         2       goto table ID; zero to drop

        Table ID 10: vlan

            field                   width   description
            ----------------------------------------------------
            OF_DPA_IN_PPORT         4       ingress physical port number
            OF_DPA_VLAN_ID          2 (N)   vlan ID
            OF_DPA_VLAN_ID_MASK     2 (N)   vlan ID mask
            OF_DPA_GOTO_TBL         2       goto table ID; zero to drop
            OF_DPA_NEW_VLAN_ID      2 (N)   new vlan ID

        Table ID 20: termination mac

            field                   width   description
            ----------------------------------------------------
            OF_DPA_IN_PPORT         4       ingress physical port number
            OF_DPA_IN_PPORT_MASK    4       ingress physical port number mask
            OF_DPA_ETHERTYPE        2 (N)   must be either 0x0800 or 0x86dd
            OF_DPA_DST_MAC          6 (N)   destination MAC
            OF_DPA_DST_MAC_MASK     6 (N)   destination MAC mask
            OF_DPA_VLAN_ID          2 (N)   vlan ID
            OF_DPA_VLAN_ID_MASK     2 (N)   vlan ID mask
            OF_DPA_GOTO_TBL         2       only acceptable values are
                                            unicast or multicast routing
                                            table IDs
            OF_DPA_OUT_PPORT        2       if specified, must be
                                            controller, set zero otherwise

        Table ID 30: unicast routing

            field                   width   description
            ----------------------------------------------------
            OF_DPA_ETHERTYPE        2 (N)   must be either 0x0800 or 0x86dd
            OF_DPA_DST_IP           4 (N)   destination IPv4 address.
                                            Must be unicast address
            OF_DPA_DST_IP_MASK      4 (N)   IP mask. Must be prefix mask
            OF_DPA_DST_IPV6         16 (N)  destination IPv6 address.
                                            Must be unicast address
            OF_DPA_DST_IPV6_MASK    16 (N)  IPv6 mask. Must be prefix mask
            OF_DPA_GOTO_TBL         2       goto table ID; zero to drop
            OF_DPA_GROUP_ID         4       data for GROUP action must
                                            be an L3 Unicast group entry
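
The unicast routing table requires its IP masks to be prefix masks, that is, a
run of 1 bits followed only by 0 bits when read in network-byte-order. A
minimal sketch of building and checking such masks; the function names are
hypothetical, and width is 4 for IPv4 or 16 for IPv6.

```python
def prefix_to_mask(prefix_len, width=4):
    """Build a network-byte-order prefix mask of `width` bytes."""
    bits = width * 8
    if not 0 <= prefix_len <= bits:
        raise ValueError("prefix length out of range")
    value = ((1 << prefix_len) - 1) << (bits - prefix_len)
    return value.to_bytes(width, "big")

def is_prefix_mask(mask):
    """True if `mask` (network-byte-order bytes) is contiguous 1s then 0s."""
    bits = len(mask) * 8
    inverted = ~int.from_bytes(mask, "big") & ((1 << bits) - 1)
    # A prefix mask inverts to 0...01...1, so adding 1 clears every set bit.
    return inverted & (inverted + 1) == 0
```

For example, prefix_to_mask(24) yields the bytes of 255.255.255.0, which could
be used directly as the OF_DPA_DST_IP_MASK value.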
|
||
|
||
Table ID 40: multicast routing
|
||
|
||
field width description
|
||
----------------------------------------------------
|
||
OF_DPA_ETHERTYPE 2 (N) must be either 0x0800 or 0x86dd
|
||
OF_DPA_VLAN_ID 2 (N) vlan ID
|
||
OF_DPA_SRC_IP 4 (N) source IPv4. Optional,
|
||
can contain IPv4 address,
|
||
must be completely masked
|
||
if not used
|
||
OF_DPA_SRC_IP_MASK 4 (N) IP Mask
|
||
OF_DPA_DST_IP 4 (N) destination IPv4 address.
|
||
Must be multicast address
|
||
OF_DPA_SRC_IPV6 16 (N) source IPv6 Address. Optional.
|
||
Can contain IPv6 address,
|
||
must be completely masked
|
||
if not used
|
||
OF_DPA_SRC_IPV6_MASK 16 (N) IPv6 mask.
|
||
OF_DPA_DST_IPV6 16 (N) destination IPv6 Address. Must
|
||
be multicast address
|
||
Must be multicast address
|
||
OF_DPA_GOTO_TBL 2 goto table ID; zero to drop
|
||
OF_DPA_GROUP_ID 4 data for GROUP action must
|
||
be an L3 multicast group entry
|
||
|
||
Table ID 50: bridging
|
||
|
||
field width description
|
||
----------------------------------------------------
|
||
OF_DPA_VLAN_ID 2 (N) vlan ID
|
||
OF_DPA_TUNNEL_ID 4 tunnel ID
|
||
OF_DPA_DST_MAC 6 (N) destination MAC
|
||
OF_DPA_DST_MAC_MASK 6 (N) destination MAC mask
|
||
OF_DPA_GOTO_TBL 2 goto table ID; zero to drop
|
||
OF_DPA_GROUP_ID 4 data for GROUP action must
|
||
be a L2 Interface, L2
|
||
Multicast, L2 Flood,
|
||
or L2 Overlay group entry
|
||
as appropriate
|
||
OF_DPA_TUNNEL_LPORT 4 unicast Tenant Bridging
|
||
flows specify a tunnel
|
||
logical port ID
|
||
OF_DPA_OUT_PPORT 2 data for OUTPUT action,
|
||
restricted to CONTROLLER,
|
||
set to 0 otherwise
|

Table ID 60: acl policy

    field                     width   description
    --------------------------------------------------------------------
    OF_DPA_IN_PPORT           4       ingress physical port number
    OF_DPA_IN_PPORT_MASK      4       ingress physical port number mask
    OF_DPA_ETHERTYPE          2 (N)   ethertype
    OF_DPA_VLAN_ID            2 (N)   vlan ID
    OF_DPA_VLAN_ID_MASK       2 (N)   vlan ID mask
    OF_DPA_VLAN_PCP           2 (N)   vlan Priority Code Point
    OF_DPA_VLAN_PCP_MASK      2 (N)   vlan Priority Code Point mask
    OF_DPA_SRC_MAC            6 (N)   source MAC
    OF_DPA_SRC_MAC_MASK       6 (N)   source MAC mask
    OF_DPA_DST_MAC            6 (N)   destination MAC
    OF_DPA_DST_MAC_MASK       6 (N)   destination MAC mask
    OF_DPA_TUNNEL_ID          4       tunnel ID
    OF_DPA_SRC_IP             4 (N)   source IPv4. Optional,
                                      can contain IPv4 address,
                                      must be completely masked
                                      if not used
    OF_DPA_SRC_IP_MASK        4 (N)   IP Mask
    OF_DPA_DST_IP             4 (N)   destination IPv4 address.
                                      Must be multicast address
    OF_DPA_DST_IP_MASK        4 (N)   IP Mask
    OF_DPA_SRC_IPV6           16 (N)  source IPv6 Address. Optional.
                                      Can contain IPv6 address,
                                      must be completely masked
                                      if not used
    OF_DPA_SRC_IPV6_MASK      16 (N)  IPv6 mask
    OF_DPA_DST_IPV6           16 (N)  destination IPv6 Address. Must
                                      be multicast address.
    OF_DPA_DST_IPV6_MASK      16 (N)  IPv6 mask
    OF_DPA_SRC_ARP_IP         4 (N)   source IPv4 address in the ARP
                                      payload. Only used if ethertype
                                      == 0x0806.
    OF_DPA_SRC_ARP_IP_MASK    4 (N)   IP Mask
    OF_DPA_IP_PROTO           1       IP protocol
    OF_DPA_IP_PROTO_MASK      1       IP protocol mask
    OF_DPA_IP_DSCP            1       DSCP
    OF_DPA_IP_DSCP_MASK       1       DSCP mask
    OF_DPA_IP_ECN             1       ECN
    OF_DPA_IP_ECN_MASK        1       ECN mask
    OF_DPA_L4_SRC_PORT        2 (N)   L4 source port, only for
                                      TCP, UDP, or SCTP
    OF_DPA_L4_SRC_PORT_MASK   2 (N)   L4 source port mask
    OF_DPA_L4_DST_PORT        2 (N)   L4 destination port, only for
                                      TCP, UDP, or SCTP
    OF_DPA_L4_DST_PORT_MASK   2 (N)   L4 destination port mask
    OF_DPA_ICMP_TYPE          1       ICMP type, only if IP
                                      protocol is 1
    OF_DPA_ICMP_TYPE_MASK     1       ICMP type mask
    OF_DPA_ICMP_CODE          1       ICMP code
    OF_DPA_ICMP_CODE_MASK     1       ICMP code mask
    OF_DPA_IPV6_LABEL         4 (N)   IPv6 flow label
    OF_DPA_IPV6_LABEL_MASK    4 (N)   IPv6 flow label mask
    OF_DPA_GROUP_ID           4       data for GROUP action
    OF_DPA_QUEUE_ID_ACTION    1       write the queue ID
    OF_DPA_NEW_QUEUE_ID       1       queue ID
    OF_DPA_VLAN_PCP_ACTION    1       write the VLAN priority
    OF_DPA_NEW_VLAN_PCP       1       VLAN priority
    OF_DPA_IP_DSCP_ACTION     1       write the DSCP
    OF_DPA_NEW_IP_DSCP        1       new DSCP
    OF_DPA_TUNNEL_LPORT       4       restrict to valid tunnel
                                      logical port, set to 0
                                      otherwise.
    OF_DPA_OUT_PPORT          2       data for OUTPUT action,
                                      restricted to CONTROLLER,
                                      set to 0 otherwise
    OF_DPA_CLEAR_ACTIONS      4       if 1, packets matching flow are
                                      dropped (all other instructions
                                      ignored)
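
Several ACL policy fields are only meaningful when another field has a
particular value (ARP source IP with ethertype 0x0806, ICMP type with IP
protocol 1, L4 ports with TCP, UDP, or SCTP). The sketch below shows one way
a driver could check those notes before building the TLVs. It is an
illustration only: check_acl_fields() and the dict-of-fields representation
are hypothetical, the protocol numbers 6, 17, and 132 are the IANA values for
TCP, UDP, and SCTP, and whether the device itself rejects such flows is not
stated in this document.

```python
# Hypothetical helper: checks the conditional-field notes from the
# "Table ID 60: acl policy" table. Keys are the TLV names from the table;
# representing a flow as a plain dict is an assumption for illustration.

ETHERTYPE_ARP = 0x0806      # from the OF_DPA_SRC_ARP_IP row
IP_PROTO_ICMP = 1           # from the OF_DPA_ICMP_TYPE row
L4_PROTOS = {6, 17, 132}    # IANA protocol numbers for TCP, UDP, SCTP


def check_acl_fields(fields):
    """Return a list of problems found; an empty list means none."""
    problems = []
    proto = fields.get("OF_DPA_IP_PROTO")

    # L4 ports are documented as valid only for TCP, UDP, or SCTP.
    for name in ("OF_DPA_L4_SRC_PORT", "OF_DPA_L4_DST_PORT"):
        if name in fields and proto not in L4_PROTOS:
            problems.append(name + " requires TCP, UDP, or SCTP")

    # ICMP type is documented as valid only if IP protocol is 1.
    if "OF_DPA_ICMP_TYPE" in fields and proto != IP_PROTO_ICMP:
        problems.append("OF_DPA_ICMP_TYPE requires IP protocol 1")

    # ARP source IP is documented as used only if ethertype == 0x0806.
    if ("OF_DPA_SRC_ARP_IP" in fields
            and fields.get("OF_DPA_ETHERTYPE") != ETHERTYPE_ARP):
        problems.append("OF_DPA_SRC_ARP_IP requires ethertype 0x0806")

    return problems
```

For example, a flow matching on OF_DPA_L4_DST_PORT with OF_DPA_IP_PROTO set
to 6 yields an empty list, while the same port match with protocol 1 is
reported.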

TLVs for flow delete and get stats commands are:

    field                     width   description
    --------------------------------------------------------------------
    OF_DPA_CMD                2       CMD_[DEL|GET_STATS]
    OF_DPA_COOKIE             8       Cookie

On completion of get stats command, the descriptor buffer is written back with
the following TLVs:

    field                     width   description
    --------------------------------------------------------------------
    OF_DPA_STAT_DURATION      4       Flow duration
    OF_DPA_STAT_RX_PKTS       8       Received packets
    OF_DPA_STAT_TX_PKTS       8       Transmit packets
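
As a rough illustration of consuming the write-back, the sketch below decodes
the three stats payloads once the TLVs have already been split out of the
descriptor buffer. The widths are taken from the table above. Everything else
is an assumption: decode_flow_stats() is a hypothetical name, the input is a
dict of TLV name to payload bytes rather than a raw buffer, and little-endian
byte order is assumed because these fields carry no (N) marker.

```python
import struct

# Payload widths come from the get stats write-back table. Little-endian
# order is an assumption (no (N) marker on these fields).
STAT_FORMATS = {
    "OF_DPA_STAT_DURATION": "<I",   # 4 bytes
    "OF_DPA_STAT_RX_PKTS": "<Q",    # 8 bytes
    "OF_DPA_STAT_TX_PKTS": "<Q",    # 8 bytes
}


def decode_flow_stats(payloads):
    """Decode a {TLV name: payload bytes} dict into unsigned integers."""
    stats = {}
    for name, fmt in STAT_FORMATS.items():
        raw = payloads[name]
        # Reject payloads whose length does not match the documented width.
        if len(raw) != struct.calcsize(fmt):
            raise ValueError(name + ": unexpected payload length")
        stats[name] = struct.unpack(fmt, raw)[0]
    return stats
```

Splitting the descriptor buffer into individual TLVs is deliberately left out
here, since it depends on the TLV header layout defined elsewhere in this
guide.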

Possible status return codes in descriptor on completion are:

    DESC_COMP_ERR       command             reason
    --------------------------------------------------------------------
    0                   all                 OK
    -ROCKER_EFAULT      all                 head or tail index outside
                                            of ring
    -ROCKER_ENXIO       all                 address or data read err on
                                            desc buf
    -ROCKER_EMSGSIZE    GET_STATS           cmd descriptor buffer wasn't
                                            big enough to contain write-back
                                            TLVs
    -ROCKER_EINVAL      all                 invalid parameters passed in
    -ROCKER_EEXIST      ADD                 entry already exists
    -ROCKER_ENOSPC      ADD                 no space left in flow table
    -ROCKER_ENOENT      MOD|DEL|GET_STATS   cookie invalid
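
The status table can be read as a mapping from each error code to the
commands that may return it. The sketch below transcribes it that way so a
driver can ask which errors to expect for a given flow table command. The
names FLOW_CMD_ERRORS and possible_errors() are hypothetical, and the codes
are kept as strings because their numeric values are not given in this
section.

```python
ALL_CMDS = frozenset({"ADD", "MOD", "DEL", "GET_STATS"})

# Transcribed from the flow table status table; "all" expands to every
# command. Numeric values of the codes are deliberately not assumed.
FLOW_CMD_ERRORS = {
    "-ROCKER_EFAULT": ALL_CMDS,
    "-ROCKER_ENXIO": ALL_CMDS,
    "-ROCKER_EMSGSIZE": frozenset({"GET_STATS"}),
    "-ROCKER_EINVAL": ALL_CMDS,
    "-ROCKER_EEXIST": frozenset({"ADD"}),
    "-ROCKER_ENOSPC": frozenset({"ADD"}),
    "-ROCKER_ENOENT": frozenset({"MOD", "DEL", "GET_STATS"}),
}


def possible_errors(cmd):
    """Return the sorted error code names a flow table command may report."""
    if cmd not in ALL_CMDS:
        raise ValueError("unknown command: " + cmd)
    return sorted(code for code, cmds in FLOW_CMD_ERRORS.items()
                  if cmd in cmds)
```

A table like this is mainly useful in tests for a driver or an emulator, to
flag a completion status that the guide does not list for the command that
was issued.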

Group Table Interface
---------------------

There are commands to add, modify, delete, and get stats of group table
entries. The commands are issued using the DMA CMD descriptor ring. The
following commands are defined:

    CMD_ADD:        add an entry to group table
    CMD_MOD:        modify an entry in group table
    CMD_DEL:        delete an entry from group table
    CMD_GET_STATS:  get stats for group entry

TLVs for add and modify commands are:

    field                     width   description
    --------------------------------------------------------------------
    FLOW_GROUP_CMD            2       CMD_[ADD|MOD]
    FLOW_GROUP_ID             2       Flow group ID
    FLOW_GROUP_TYPE           1       Group type:
                                        0: L2 interface
                                        1: L2 rewrite
                                        2: L3 unicast
                                        3: L2 multicast
                                        4: L2 flood
                                        5: L3 interface
                                        6: L3 multicast
                                        7: L3 ECMP
                                        8: L2 overlay
    FLOW_VLAN_ID              2       Vlan ID (types 0, 3, 4, 6)
    FLOW_L2_PORT              2       Port (type 0)
    FLOW_INDEX                4       Index (all types but 0)
    FLOW_OVERLAY_TYPE         1       Overlay sub-type (type 8):
                                        0: Flood unicast tunnel
                                        1: Flood multicast tunnel
                                        2: Multicast unicast tunnel
                                        3: Multicast multicast tunnel
    FLOW_GROUP_ACTION         nest
      FLOW_GROUP_ID           2       next group ID in chain (all
                                      types except 0)
      FLOW_OUT_PORT           4       egress port (types 0, 8)
      FLOW_POP_VLAN_TAG       1       strip outer VLAN tag (type 1
                                      only)
      FLOW_VLAN_ID            2       (types 1, 5)
      FLOW_SRC_MAC            6       (types 1, 2, 5)
      FLOW_DST_MAC            6       (types 1, 2)
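
Which TLVs apply depends on the group type. The sketch below transcribes the
"(types ...)" annotations from the table so the set of applicable fields can
be looked up per type. It assumes that the rows listed after
FLOW_GROUP_ACTION are nested inside it; applicable_fields() and the two
dictionaries are hypothetical names. FLOW_GROUP_CMD, FLOW_GROUP_ID, and
FLOW_GROUP_TYPE carry no type restriction and are left out.

```python
GROUP_TYPES = {
    0: "L2 interface", 1: "L2 rewrite", 2: "L3 unicast",
    3: "L2 multicast", 4: "L2 flood", 5: "L3 interface",
    6: "L3 multicast", 7: "L3 ECMP", 8: "L2 overlay",
}
ALL_TYPES = set(GROUP_TYPES)

# Top-level TLVs with a type restriction in the table.
TOP_LEVEL = {
    "FLOW_VLAN_ID": {0, 3, 4, 6},
    "FLOW_L2_PORT": {0},
    "FLOW_INDEX": ALL_TYPES - {0},
    "FLOW_OVERLAY_TYPE": {8},
}

# TLVs assumed to be nested inside FLOW_GROUP_ACTION.
ACTION = {
    "FLOW_GROUP_ID": ALL_TYPES - {0},
    "FLOW_OUT_PORT": {0, 8},
    "FLOW_POP_VLAN_TAG": {1},
    "FLOW_VLAN_ID": {1, 5},
    "FLOW_SRC_MAC": {1, 2, 5},
    "FLOW_DST_MAC": {1, 2},
}


def applicable_fields(group_type):
    """Return (top-level names, nested action names) for a group type."""
    if group_type not in GROUP_TYPES:
        raise ValueError("unknown group type: %r" % (group_type,))
    top = sorted(n for n, types in TOP_LEVEL.items() if group_type in types)
    action = sorted(n for n, types in ACTION.items() if group_type in types)
    return top, action
```

For an L2 interface group (type 0) this gives FLOW_L2_PORT and FLOW_VLAN_ID
at the top level and FLOW_OUT_PORT inside the action nest.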

TLVs for delete and get stats commands are:

    field                     width   description
    --------------------------------------------------------------------
    FLOW_GROUP_CMD            2       CMD_[DEL|GET_STATS]
    FLOW_GROUP_ID             2       Flow group ID

On completion of get stats command, the descriptor buffer is written back with
the following TLVs:

    field                     width   description
    --------------------------------------------------------------------
    FLOW_GROUP_ID             2       Flow group ID
    FLOW_STAT_DURATION        4       Flow duration
    FLOW_STAT_REF_COUNT       4       Flow reference count
    FLOW_STAT_BUCKET_COUNT    4       Flow bucket count

Possible status return codes in descriptor on completion are:

    DESC_COMP_ERR       command             reason
    --------------------------------------------------------------------
    0                   all                 OK
    -ROCKER_EFAULT      all                 head or tail index outside
                                            of ring
    -ROCKER_ENXIO       all                 address or data read err on
                                            desc buf
    -ROCKER_ENOSPC      GET_STATS           cmd descriptor buffer wasn't
                                            big enough to contain write-back
                                            TLVs
    -ROCKER_EINVAL      ADD|MOD             invalid parameters passed in
    -ROCKER_EEXIST      ADD                 entry already exists
    -ROCKER_ENOSPC      ADD                 no space left in group table
    -ROCKER_ENOENT      MOD|DEL|GET_STATS   group ID invalid
    -ROCKER_EBUSY       DEL                 group reference count non-zero
    -ROCKER_ENODEV      ADD                 next group ID doesn't exist
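
The ADD and DEL rows of the status table imply some bookkeeping: a group
cannot be added twice, cannot chain to a group that does not exist, and
cannot be deleted while something still refers to it. The toy model below
illustrates that reading. It is not the device implementation: the
GroupTable class, its capacity parameter, the order in which conditions are
checked, and the choice to count only chained groups as references are all
assumptions made for the example.

```python
class GroupTable:
    """Toy in-memory model of group ADD/DEL status codes (illustrative)."""

    def __init__(self, capacity=4):
        self.capacity = capacity    # arbitrary size, chosen for the example
        self.groups = {}            # group ID -> next group ID, or None
        self.refs = {}              # group ID -> reference count

    def add(self, gid, next_gid=None):
        if gid in self.groups:
            return "-ROCKER_EEXIST"     # entry already exists
        if len(self.groups) >= self.capacity:
            return "-ROCKER_ENOSPC"     # no space left in table
        if next_gid is not None and next_gid not in self.groups:
            return "-ROCKER_ENODEV"     # next group ID doesn't exist
        self.groups[gid] = next_gid
        self.refs[gid] = 0
        if next_gid is not None:
            self.refs[next_gid] += 1    # chained group is now referenced
        return 0

    def delete(self, gid):
        if gid not in self.groups:
            return "-ROCKER_ENOENT"     # group ID invalid
        if self.refs[gid] != 0:
            return "-ROCKER_EBUSY"      # reference count non-zero
        next_gid = self.groups.pop(gid)
        del self.refs[gid]
        if next_gid is not None:
            self.refs[next_gid] -= 1    # drop our reference to the chain
        return 0
```

In this model, adding group 2 with group 1 as its next group makes a later
delete of group 1 return -ROCKER_EBUSY until group 2 has been deleted.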



References
==========

[1] OpenFlow Data Plane Abstraction (OF-DPA) Abstract Switch Specification,
    Version 1.0, from Broadcom Corporation, February 21, 2014.