Many applications in industry can benefit from Ethernet and TCP/IP, as these are well-known and widely supported networking standards. More and more, these industrial applications require higher bandwidth and lower latency, which makes it increasingly challenging not to overload the CPU with a TCP/IP stack running at maximum bandwidth. These growing requirements force the processor to spend more time handling network data than running your application.

Easics' TCP Offload Engine (TOE) can be used to offload the TCP/IP stack from the CPU and handle it in FPGA or ASIC hardware. This core is an all-hardware, configurable IP block. It acts as a TCP server for sending and receiving TCP/IP data. Because everything is handled in hardware, very high throughput and low latency are possible. The IP block is completely self-sufficient and can be used as a black-box module that takes care of all networking tasks, leaving the rest of the system free to spend its processing power purely on application logic. In some use cases, integrating our full-hardware TCP/IP stack eliminates the need for any built-in embedded processor at all.

The easics TCP Offload Engine is available as a 1 Gbit/s or a 10 Gbit/s version. Both versions support Ethernet, IP, TCP, and ARP packets, as well as ICMP packets for ping. The 10 Gbit/s version additionally supports pause frames.
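
Since the core behaves as a standard TCP server, any ordinary socket client on the host can exchange data with it. A minimal sketch of such a test client; the IP address and port are placeholders for whatever is configured in the core:

```python
import socket

# Minimal host-side test client. The TOE acts as a plain TCP server, so
# a standard socket connection suffices. The address and port below are
# placeholders for the values configured in the core, not product defaults.
TOE_ADDR = ("192.168.1.10", 5001)   # hypothetical board IP and TCP port

with socket.create_connection(TOE_ADDR, timeout=5.0) as sock:
    sock.sendall(b"hello fpga")   # reaches the FPGA application via the core's RX FIFO
    reply = sock.recv(4096)       # data the FPGA application pushed into the TX FIFO
    print(f"received {len(reply)} bytes")
```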


Application Areas

The core serves a broad range of industrial application areas. Don't hesitate to contact us for feasibility advice on your use case.


    Block Diagram

    Block diagram of the easics TCP Offload Engine

    The figure above shows the core’s building blocks and its four most important interfaces:

    • (X)GMII
    • user application FIFOs
    • memory interface
    • configuration interface

    The first of these is an industry-standard (X)GMII interface which communicates with a 1 Gbit or 10 Gbit PHY. The second is situated on the application side: two FIFOs with a simple push/pop interface, one for RX and one for TX. These FIFO interfaces, as well as an internal TCP block, communicate with a memory system which is to be provided outside the core (the third interface). The size and type of memory can be selected by the user; ARM's AMBA AXI4 is the protocol used for this communication. FPGA vendors such as Xilinx and Intel provide building blocks to interface internal block RAM, SRAM, or DRAM with an AXI bus. The fourth and final interface is used to configure various networking parameters and to read status information.
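
    As a conceptual illustration of the push/pop handshake on the application side, consider the small software model below. The names and exact semantics are illustrative only; the real signal-level interface is defined in the core's documentation.

```python
from collections import deque

# Conceptual software model of the user-side FIFO handshake. A push is
# refused when the FIFO is full (backpressure toward the application);
# a pop returns nothing when the FIFO is empty.
class FifoPort:
    def __init__(self, depth: int):
        self._q = deque()
        self._depth = depth

    def push(self, word: int) -> bool:
        """TX side: push a data word; returns False when full."""
        if len(self._q) >= self._depth:
            return False
        self._q.append(word)
        return True

    def pop(self):
        """RX side: pop a data word; returns None when empty."""
        return self._q.popleft() if self._q else None
```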


    Key Features
    • Ethernet jumbo frames up to 9000 bytes supported
    • Transmit and receive buffers can be dimensioned to optimize FPGA resource usage: 4 kB up to 4 GB per connection, in internal SRAM or external memory (see the sizing example below)
    • Guaranteed in-order reception of all data at the application side (FIFO interfaces)
    • Fast response times to network traffic
    • Configurable MAC / IP address / TCP port
    • ARP server for mapping IP addresses onto MAC addresses; no need to manually set the ARP table on the PC
    • ICMP echo protocol (a.k.a. ping), usable for connectivity tests
    • 1 active server connection per TCP port
    • Listens on fixed TCP port number, selectable at startup
    • TCP ACK piggybacking for reduced network load
    • Packet retransmit, both fast retransmit and timeout retransmit, with exponential back-off
    • Flow control, allowing backpressure from both server and client without data loss.
    • TCP keep-alive support (RFC 1122)
    • TCP zero window probes
    • TCP timestamps
    • Nagle's algorithm (to prevent silly window syndrome)
    • Round-trip time measurement (RFC 6298; see the sketch after this list)
    • Congestion Avoidance: slow start and congestion window
    • Reordering for out-of-order packets
    • PAWS (protection against wrapped sequence numbers)
    • Very high throughput: > 99% of the theoretical 10 Gbit/s limit
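
    The retransmit and round-trip time bullets above refer to the standard RFC 6298 mechanism. As an illustration only, this is the textbook computation, not the core's actual implementation:

```python
# Textbook RFC 6298 retransmission-timeout (RTO) computation. The
# constants K, ALPHA, BETA and the 1-second floor are the RFC's
# recommended values; the clock granularity G is an assumption.
K, ALPHA, BETA = 4, 1 / 8, 1 / 4
G = 0.001  # clock granularity in seconds (assumed)

class RtoEstimator:
    def __init__(self):
        self.srtt = None     # smoothed round-trip time
        self.rttvar = None   # round-trip time variation
        self.rto = 1.0       # RFC 6298 initial RTO: 1 second

    def sample(self, r: float) -> None:
        """Feed one RTT measurement r (in seconds) into the estimator."""
        if self.srtt is None:                 # first measurement
            self.srtt, self.rttvar = r, r / 2
        else:
            self.rttvar = (1 - BETA) * self.rttvar + BETA * abs(self.srtt - r)
            self.srtt = (1 - ALPHA) * self.srtt + ALPHA * r
        self.rto = max(1.0, self.srtt + max(G, K * self.rttvar))

    def timeout(self) -> None:
        """Exponential back-off: double the RTO on a retransmission timeout."""
        self.rto *= 2
```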

    Refer to the easics TCP Offload Engine product brief for an extensive list of all features.
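
    For the buffer sizing mentioned in the feature list (4 kB up to 4 GB per connection), a useful rule of thumb is the bandwidth-delay product: the transmit buffer must at least cover the unacknowledged data in flight. A rough sketch with illustrative RTT values; only the 1.3 µs figure comes from the Performance section below:

```python
def min_buffer_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes in flight at full throughput."""
    return bandwidth_bps * rtt_s / 8

# Back-to-back 10 Gbit/s link with the 1.3 us simulated RTT quoted below:
print(min_buffer_bytes(10e9, 1.3e-6))   # ~1.6 kB; the 4 kB minimum suffices
# Illustrative LAN with a 1 ms RTT:
print(min_buffer_bytes(10e9, 1e-3))     # 1.25 MB; provision a larger buffer
```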


    Performance

    The following data throughput numbers have been measured on the Xilinx ZC706:

    1G TCP throughput:

    MTU    TX (Mbps)   TX CPU (%)   RX (Mbps)   RX CPU (%)
    1500   905         0            949         0

    10G TCP throughput:

    MTU    TX (Gbps)   TX CPU (%)   RX (Gbps)   RX CPU (%)
    1500   9.18        0            9.35        0
    9000   9.69        0            9.73        0


    The data throughput is thus higher for MTU 9000 (jumbo frames), since the fixed per-packet overhead is spread over more payload bytes. The CPU load is 0% because the full TCP/IP connectivity is handled in the FPGA (full hardware acceleration).
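
    These numbers can be checked against a back-of-the-envelope model of the theoretical goodput. The sketch below assumes standard Ethernet framing overhead and the TCP timestamp option enabled; both are assumptions for illustration, not product specifications:

```python
# Rough theoretical TCP goodput on 10 Gbit/s Ethernet. Assumes standard
# framing overhead (preamble 8 + Ethernet header 14 + FCS 4 + inter-frame
# gap 12) and 52 bytes of IPv4 + TCP headers including the timestamp option.
LINE_RATE_GBPS = 10.0
ETH_OVERHEAD = 8 + 14 + 4 + 12
IP_TCP_HEADERS = 20 + 20 + 12

def goodput_gbps(mtu: int) -> float:
    payload = mtu - IP_TCP_HEADERS   # TCP payload bytes per packet
    wire = mtu + ETH_OVERHEAD        # bytes each packet occupies on the wire
    return LINE_RATE_GBPS * payload / wire

for mtu in (1500, 9000):
    print(f"MTU {mtu}: ~{goodput_gbps(mtu):.2f} Gbit/s")
# MTU 1500: ~9.41 Gbit/s; MTU 9000: ~9.90 Gbit/s. The measured 9.35 and
# 9.73 Gbit/s RX figures sit close to these ceilings.
```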

    The following latency numbers have been measured for the 10G TOE in simulation, using the Xilinx transceiver models (which account for 160 ns of the latency):

    • TX latency = 656 ns
    • RX latency = 640 ns
    • Round Trip Time (RTT) = 1.3 µs

    Resource Usage & Scalability
    • 1G (1 TCP socket) ~ 12K LUTs
    • 10G (1 TCP socket) ~ 34K Zynq LUTs / 24K Arria 10 ALMs
    • 10G (4 TCP sockets) ~ 39K Zynq LUTs

    Our design is such that each additional TCP socket adds a negligible number of LUTs / ALMs; only extra memory must be provided.


    Examples of Typical Use Cases

    Standalone easics TCP/IP core with 1 port

    Standalone easics' TCP/IP core with 1 port

    Standalone easics TCP/IP core with 2 ports

    One TCP port is used to send and receive streaming data, while a second TCP port is used to control your hardware application.

    Standalone easics' TCP/IP core with 2 ports
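
    From the host's point of view, both ports are ordinary TCP servers, so this setup can be driven with two plain sockets. A hedged sketch; the addresses and port numbers are placeholders:

```python
import socket

# Hypothetical addressing: one TCP port for bulk streaming data and a
# second one for control commands, both served by the same core.
BOARD_IP = "192.168.1.10"
DATA_PORT, CTRL_PORT = 5001, 5002

with socket.create_connection((BOARD_IP, CTRL_PORT)) as ctrl, \
     socket.create_connection((BOARD_IP, DATA_PORT)) as data:
    ctrl.sendall(b"START\n")     # small control command
    chunk = data.recv(65536)     # bulk streaming data from the FPGA
    ctrl.sendall(b"STOP\n")
```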

    Easics' TCP/IP core as TOE

    Although an embedded microprocessor is not required, the easics TCP/IP core can be used to accelerate a microprocessor by offloading TCP/IP into hardware.

    TCP/IP Offload Engine (TOE)

    In the next example, one FIFO interface is connected to the hardware application and a second FIFO interface is connected to an embedded microprocessor. In this configuration, the easics TCP/IP core behaves as a complete TCP/IP Offload Engine (TOE), albeit for only one connection.

    Easics' TCP/IP core as TOE

    Easics' TCP/IP core & microprocessor sharing an Ethernet connection

    Easics' TCP/IP core & microprocessor sharing an Ethernet connection

    The use case illustrated above shows another way of integrating both the easics TCP/IP core and a microprocessor. The TCP mux routes Ethernet packets based on the TCP port number: traffic to one specific TCP port is routed to the easics TCP/IP core, while all other traffic is routed to the microprocessor. This approach offers a limited number of high-speed connections, processed by the easics TCP/IP core, and an unlimited number of lower-speed connections, processed by the microprocessor.
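
    A minimal software model of that routing rule, assuming untagged IPv4 Ethernet frames; the port number is a placeholder:

```python
import struct

# Illustrative model of the TCP mux: frames addressed to the TOE's TCP
# port go to the core; everything else (other ports, non-TCP, non-IPv4)
# goes to the microprocessor, as described in the text above.
TOE_PORT = 5001  # placeholder for the core's configured port

def route(frame: bytes) -> str:
    ethertype = struct.unpack_from("!H", frame, 12)[0]
    if ethertype != 0x0800:              # not IPv4 (e.g. ARP)
        return "cpu"
    ihl = (frame[14] & 0x0F) * 4         # IPv4 header length in bytes
    if frame[14 + 9] != 6:               # IP protocol field: 6 = TCP
        return "cpu"
    dst_port = struct.unpack_from("!H", frame, 14 + ihl + 2)[0]
    return "toe" if dst_port == TOE_PORT else "cpu"
```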



    PHY Interfacing

    The core does not include the means to transfer data to the physical layer. Hence, it needs to cooperate with an external chip or block. To achieve this, the 1G core implements a GMII interface and the 10G core an XGMII interface (IEEE 802.3).

    XGMII uses DDR: a 32-bit wide data bus and a 4-bit wide control bus, both synchronized to the rising and falling edges of a 156.25 MHz clock. The core, however, uses SDR: a 64-bit wide data bus and an 8-bit wide control bus, both synchronized to the rising edge of a 156.25 MHz clock. The reason for this is the interface with the Ethernet PHY IP blocks provided by FPGA vendors (Option 1 in the figure below). These blocks include the 10 Gbps transceivers and implement the XGMII interface. But as FPGA fabric does not internally support DDR, the interface to these blocks is SDR. The translation from DDR to SDR is achieved by doubling the bus widths.

    The user may also choose not to use an FPGA PHY IP block but to interface directly with an external PHY chip that supports XGMII and includes the 10 Gbps transceivers (Option 2 in the figure below). Such a chip requires DDR on the XGMII interface, so a simple SDR-to-DDR conversion on the XGMII bus pins is needed.

    The figure below illustrates the difference between both options:

    Difference between an Ethernet PHY implemented as FPGA IP (Option 1) and an Ethernet PHY in an external device (Option 2)

    The figure below clarifies the transition from SDR to DDR:

    Conversion from SDR to DDR
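
    A small software model of the width conversion shown above. The mapping of the lower SDR half to the rising-edge transfer is an assumption for illustration, not taken from the datasheet:

```python
# Model of the SDR <-> DDR width conversion on the XGMII interface.
# Assumed mapping: the lower 32 data bits / 4 control bits of the 64-bit
# SDR beat travel on the rising edge, the upper half on the falling edge.

def sdr_to_ddr(data64: int, ctrl8: int):
    """Split one 64-bit SDR beat into two 32-bit DDR transfers."""
    rising = (data64 & 0xFFFFFFFF, ctrl8 & 0xF)    # first (rising-edge) half
    falling = (data64 >> 32, ctrl8 >> 4)           # second (falling-edge) half
    return rising, falling

def ddr_to_sdr(rising, falling):
    """Merge two 32-bit DDR transfers back into one 64-bit SDR beat."""
    return rising[0] | (falling[0] << 32), rising[1] | (falling[1] << 4)
```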

    Xilinx Zynq-7045 Evaluation System

    Targeted to Xilinx Zynq-7045 on ZC706

    The dual ARM Cortex-A9 cores are not used for handling TCP/IP; they are free to be used by your application.

    SFP+ for 10GigE

    Supports Vivado design flow

    10 GigE resource count (64-bit @ 156.25 MHz)

    • Vivado 2017.1 targeting XC7Z045
    • Full Stack (1 TCP socket) : 34K LUTs
    • Multisocket Full Stack (4 TCP sockets) : 39K LUTs

    Runs on Xilinx ZC706 Development Kit

    Easics' TCP Offload Engine targeted to Xilinx Zynq-7045 on ZC706



    Intel Arria 10 Evaluation System

    ReflexCES Intel Arria 10 SoC SoM for easics TCP Offload Engine

    ReflexCES carrier board for SoM with Easics' TCP Offload Engine on Arria 10

    Targeted to the Intel Arria 10 SoC SoM by ReflexCES, plugged into ReflexCES' PCIe Carrier Board for the Arria 10 SoC SoM

    SFP+ for 10GigE

    Supports Quartus design flow

    10 GigE resource count (64-bit @ 156.25 MHz)

    • Quartus Prime Pro 17.0 targeting 10AS027H3F34E2SG
    • Full Stack (1 TCP socket) : 24K ALM
    • Multisocket Full Stack (4 TCP sockets) : 27K ALM

    Runs on ReflexCES PCIe Carrier Board Arria 10 SoC SoM Development Kit



    Xilinx Virtex-6 FPGA ML605 Evaluation Kit

    Targeted to Xilinx Virtex-6 on ML605

    RJ-45 for 1GigE

    Supports ISE design flow

    1GigE resource count (32-bit @ 31.25 MHz)

    • Synplify H-2013.03 targeting XC6VLX240T
    • Full Stack (1 TCP engine) : 12K LUT

    Runs on Xilinx Virtex-6 FPGA ML605 Evaluation Kit

    Easics' TCP Offload Engine on Xilinx Virtex-6 FPGA on ML605



    Easics' Artix-7 FPGA Demo Board

    Easics' demo board for its TCP Offload Engine

    Targeted to Xilinx Artix-7 on the easics demo board

    RJ-45 for 1GigE

    Supports Vivado design flow

    1GigE resource count (32-bit @ 31.25 MHz)

    • Vivado 2015.1 targeting XC7A100TFTG256-2
    • Full Stack (1 TCP socket) : 5K LUT

    Runs on the easics Artix-7 FPGA demo board