Monday, February 14, 2011

TCP

Description:
Protocol suite: TCP/IP.
Protocol type:Transport layer connection oriented byte stream protocol.
IP Protocol:6.
Ports:
MIME subtype:
SNMP MIBs: iso.org.dod.internet.experimental.ipv6TcpMIB (1.3.6.1.3.86).
iso.org.dod.internet.mgmt.mib-2.tcp (1.3.6.1.2.1.6).
iso.org.dod.internet.mgmt.mib-2.tcpMIB (1.3.6.1.2.1.49).
Working groups: pilc, Performance Implications of Link Characteristics.
tcpimpl, TCP Implementation.
tcpm, TCP Maintenance and Minor Extensions.
tsvwg, Transport Area Working Group.
Links: IANA: TCP option numbers.
TCP is a transport layer protocol used by applications that require guaranteed delivery. It is a sliding window protocol that provides handling for both timeouts and retransmissions.
TCP establishes a full duplex virtual connection between two endpoints. Each endpoint is defined by an IP address and a TCP port number. The operation of TCP is implemented as a finite state machine.
The byte stream is transfered in segments. The window size determines the number of bytes of data that can be sent before an acknowledgement from the receiver is necessary.

MAC header IP header TCP header Data :::
TCP header:
0001020304050607 0809101112131415 1617181920212223 2425262728293031
Source Port Destination Port
Sequence Number
Acknowledgment Number
Data Offset reserved ECN Control Bits Window
Checksum Urgent Pointer
Options and padding :::
Data :::
Source Port. 16 bits.
Destination Port. 16 bits.
Sequence Number. 32 bits.
The sequence number of the first data byte in this segment. If the SYN bit is set, the sequence number is the initial sequence number and the first data byte is initial sequence number + 1.
Acknowledgment Number. 32 bits.
If the ACK bit is set, this field contains the value of the next sequence number the sender of the segment is expecting to receive. Once a connection is established this is always sent.
Data Offset. 4 bits.
The number of 32-bit words in the TCP header. This indicates where the data begins. The length of the TCP header is always a multiple of 32 bits.
reserved. 3 bits.
Must be cleared to zero.
ECN, Explicit Congestion Notification. 3 bits.
Added in RFC 3168.
000102
N C E
N, NS, Nonce Sum. 1 bit.
Added in RFC 3540. This is an optional field added to ECN intended to protect against accidental or malicious concealment of marked packets from the TCP sender.
C, CWR. 1 bit.
E, ECE, ECN-Echo. 1 bit.
Control Bits. 6 bits.
000102030405
U A P R S F
U, URG. 1 bit.
Urgent pointer valid flag.
A, ACK. 1 bit.
Acknowledgment number valid flag.
P, PSH. 1 bit.
Push flag.
R, RST. 1 bit.
Reset connection flag.
S, SYN. 1 bit.
Synchronize sequence numbers flag.
F, FIN. 1 bit.
End of data flag.
Window. 16 bits, unsigned.
The number of data bytes beginning with the one indicated in the acknowledgment field which the sender of this segment is willing to accept.
Checksum. 16 bits.
This is computed as the 16-bit one's complement of the one's complement sum of a pseudo header of information from the IP header, the TCP header, and the data, padded as needed with zero bytes at the end to make a multiple of two bytes. The pseudo header contains the following fields:
0001020304050607 0809101112131415 1617181920212223 2425262728293031
Source IP address
Destination IP address
0 IP Protocol Total length
Urgent Pointer. 16 bits, unsigned.
If the URG bit is set, this field points to the sequence number of the last byte in a sequence of urgent data.
Options. 0 to 40 bytes.
Options occupy space at the end of the TCP header. All options are included in the checksum. An option may begin on any byte boundary. The TCP header must be padded with zeros to make the header length a multiple of 32 bits.
KindLengthDescriptionReferences
01 End of option list. RFC 793
11 No operation. RFC 793
24 MSS, Maximum Segment Size. RFC 793
33 WSOPT, Window scale factor. RFC 1323
42 SACK permitted. RFC 2018
5Variable. SACK. RFC 2018, RFC 2883
66 Echo. (obsolete). RFC 1072
76 Echo reply. (obsolete). RFC 1072
810 TSOPT, Timestamp. RFC 1323
92 Partial Order Connection permitted. RFC 1693
103 Partial Order service profile. RFC 1693
116 CC, Connection Count. RFC 1644
126 CC.NEW RFC 1644
136 CC.ECHO RFC 1644
143 Alternate checksum request. RFC 1146
15Variable. Alternate checksum data. RFC 1146
16 Skeeter. 
17 Bubba. 
183Trailer Checksum Option. 
1918 MD5 signature. RFC 2385
20 SCPS Capabilities. 
21 Selective Negative Acknowledgements. 
22 Record Boundaries. 
23 Corruption experienced. 
24 SNAP. 
25   
26 TCP Compression Filter. 
278Quick-Start Response.RFC 4782
284User Timeout.RFC 5482
29 TCP-AO, TCP Authentication Option. 
30
-
252
   
253 RFC3692-style Experiment 1.RFC 4727
254 RFC3692-style Experiment 2.RFC 4727
255   
Data. Variable length.

TCP State machine:
StateDescription
CLOSE-WAIT Waits for a connection termination request from the remote host.
CLOSED Represents no connection state at all.
CLOSING Waits for a connection termination request acknowledgment from the remote host.
ESTABLISHED Represents an open connection, data received can be delivered to the user. The normal state for the data transfer phase of the connection.
FIN-WAIT-1 Waits for a connection termination request from the remote host or an acknowledgment of the connection termination request previously sent.
FIN-WAIT-2 Waits for a connection termination request from the remote host.
LAST-ACK Waits for an acknowledgment of the connection termination request previously sent to the remote host (which includes an acknowledgment of its connection termination request).
LISTEN Waits for a connection request from any remote TCP and port.
SYN-RECEIVED Waits for a confirming connection request acknowledgment after having both received and sent a connection request.
SYN-SENT Waits for a matching connection request after having sent a connection request.
TIME-WAIT Waits for enough time to pass to be sure the remote host received the acknowledgment of its connection termination request.
The CLOSED state is the entry point to the TCP state machine.

Glossary:
ABC, Appropriate Byte Counting.
Congestion control algorithm. A modification to the algorithm for increasing TCP's congestion window (cwnd) that improves both performance and security. Rather than increasing a TCP's congestion window based on the number of acknowledgments (ACKs) that arrive at the data sender, the congestion window is increased based on the number of bytes acknowledged by the arriving ACKs. The algorithm improves performance by mitigating the impact of delayed ACKs on the growth of cwnd. At the same time, the algorithm provides cwnd growth in direct relation to the probed capacity of a network path, therefore providing a more measured response to ACKs that cover only small amounts of data (less than a full segment size) than ACK counting. This more appropriate cwnd growth can improve both performance and can prevent inappropriate cwnd growth in response to a misbehaving receiver. On the other hand, in some cases the modified cwnd growth algorithm causes larger bursts of segments to be sent into the network. In some cases this can lead to a non-negligible increase in the drop rate and reduced performance.
active open.
AIMD, Additive Increase, Multiplicative Decrease.
Congestion control algorithm. (RFC 2914) In the absence of congestion, the TCP sender increases its congestion window by at most one packet per roundtrip time. In response to a congestion indication, the TCP sender decreases its congestion window by half. More precisely, the new congestion window is half of the minimum of the congestion window and the receiver's advertised window.
Congestion Avoidance.
Congestion control algorithm.
Connection.
A logical communication path identified by a pair of endpoints.
cwnd, congestion window.
TCP state variable. This variable limits the amount of data a TCP can send. At any given time, a TCP MUST NOT send data with a sequence number higher than the sum of the highest acknowledged sequence number and the minimum of cwnd and rwnd.
TCP uses two algorithms for increasing the congestion window. During steady-state, TCP uses the Congestion Avoidance algorithm to linearly increase the value of cwnd. At the beginning of a transfer, after a retransmission timeout or after a long idle period (in some implementations), TCP uses the Slow Start algorithm to increase cwnd exponentially. Slow Start bases the cwnd increase on the number of incoming acknowledgments. During congestion avoidance RFC 2581 allows more latitude in increasing cwnd, but traditionally implementations have based the increase on the number of arriving ACKs.
CWV, Congestion Window Validation. Algorithm.
This algorithm limits the amount of unused cwnd a TCP connection can accumulate. ABC can be used in conjunction with CWV to obtain an accurate measure of the network path.
Eifel. Algorithm.
(RFC 3522) This algorithm allows a TCP sender to detect a posteriori whether it has entered loss recovery unnecessarily. It requires that the TCP Timestamp option is enabled for a connection. Eifel makes use of the fact that the TCP Timestamp option eliminates the retransmission ambiguity in TCP. Based on the timestamp of the first acceptable ACK that arrives during loss recovery, it decides whether loss recovery was entered unnecessarily. The Eifel detection algorithm provides a basis for future TCP enhancements. This includes response algorithms to back out of loss recovery by restoring a TCP sender's congestion control state.
Fast Recovery. Congestion control algorithm.
A sender invokes the Fast Recovery after Fast Retransmit. This algorithm allows the sender to transmit at half its previous rate (regulating the growth of its window based on congestion avoidance), rather than having to begin a Slow Start. This also saves time.
Fast Retransmit. Congestion control algorithm.
(RFC 2757) When a TCP sender receives several duplicate ACKs, fast retransmit allows it to infer that a segment was lost. The sender retransmits what it considers to be this lost segment without waiting for the full timeout, thus saving time.
flight size.
The amount of data that has been sent but not yet acknowledged.
full sized segment.
A segment that contains the maximum number of data bytes permitted.
IW, Initial Window.
The size of the sender's congestion window after the three-way handshake is completed.
LFN, Long Fat Network.
A communications path with a large bandwidth * delay product.
LW, Loss Window.
The size of the congestion window after a TCP sender detects loss using its retransmission timer.
MSL, Maximum Segment Lifetime.
The maximum time in seconds that a segment may be held before being discarded.
MSS, Maximum Segment Size.
When IPv4 is used as the network protocol, the MSS is calculated as the maximum size of an IPv4 datagram minus 40 bytes.
When IPv6 is used as the network protcol, the MSS is calculated as the maximum packet size minus 60 bytes. An MSS of 65535 should be interpreted as infinity.
passive open.
PAWS, Protect Against Wrapped Sequences.
A mechanism to reject old duplicate segments that might corrupt an open TCP connection. PAWS uses the same TCP timestamp option as the RTTM mechanism and assumes that every received TCP segment (including data and ACK segments) contains a timestamp whose values are monotone non-decreasing in time. The basic idea is that a segment can be discarded as an old duplicate if it is received with a timestamp less than some timestamp recently received on this connection.
RMSS, Receiver Maximum Segment Size.
The size of the largest segment the receiver is willing to accept. This is the value specified in the MSS option sent by the receiver during connection startup. Or, if the MSS option is not used, 536 bytes. The size does not include the TCP headers and options.
RTT, Round trip time.
RTTM, Round-Trip Time Measurement.
A technique for measuring the RTT by use of timestamps. The data segments are timestamped using the TSOPT option. The resulting ACK packets contain timestamps from the receiver. The resulting RTT can then be determined by the difference in the timestamps.
RW, Restart Window.
The size of the congestion window after a TCP restarts transmission after an idle period.
rwmd, Receiver Window. TCP state variable.
The most recently advertised receiver window.
SACK, Selective Acknowledgement. Algorithm.
This technique allows the data receiver to inform the sender about all segments that have arrived successfully, so the sender need retransmit only the segments that have actually been lost. This extension uses two TCP options. The first is an enabling option, SACK permitted, which may be sent in a SYN segment to indicate that the SACK option can be used once the connection is established. The other is the SACK option itself, which may be sent over an established connection once permission has been given.
segment.
A TCP data or acknowledgment packet.
Slow Start. Congestion control algorithm.
This algorithm is used to gradually increase the size of the TCP congestion window. It operates by observing that the rate at which new packets should be injected into the network is the rate at which the acknowledgments are returned by the other end.
SMSS, Sender Maximum Segment Size.
The size of the largest segment that the sender can transmit. This value can be based on the maximum transmission unit of the network, the path MTU discovery algorithm, RMSS, or other factors. The size does not include the TCP headers and options.
SWS, Silly Window Syndrome.
TFRC, TCP Friendly Rate Control. Algorithm.
A congestion control mechanism for unicast flows operating in a best effort Internet environment. It is reasonably fair when competing for bandwidth with TCP flows, but has a much lower variation of throughput over time compared with TCP, making it more suitable for applications such as telephony or streaming media where a relatively smooth sending rate is of importance. TFRC is designed for applications that use a fixed packet size and vary their sending rate in packets per second in response to congestion.
Van Jacobson's algorithm.

RFCs:
[IEN 2] Comments on Internet Protocol and TCP.
[IEN 12] Issues in Reliable Host-to-Host Protocols.
[IEN 45] TCP Checksum Function Design.
[IEN 74] Sequence Number Arithmetic.
[IEN 92] Protocol Options.
[IEN 98] TCP Implementation Status.
[IEN 114] PROTOCOL OPTIONS.
[IEN 150] TCP JSYS CALLING SEQUENCES.
[IEN 167] HP3000 TCP DESIGN DOCUMENT.
[RFC 721] Out-of-Band Control Signals in a Host-to-Host Protocol.
[RFC 761] DOD STANDARD TRANSMISSION CONTROL PROTOCOL.
  • Obsoletes:
    IEN 5, IEN 21, IEN 27, IEN 40, IEN 44, IEN 55, IEN 81, IEN 112, IEN 124.
[RFC 793] Transmission Control Protocol.
  • STD: 7.
  • Defines the TCP protocol.
[RFC 801] NCP/TCP TRANSITION PLAN.
[RFC 813] WINDOW AND ACKNOWLEDGEMENT STRATEGY IN TCP.
[RFC 816] FAULT ISOLATION AND RECOVERY.
[RFC 832] Who Talks TCP?
[RFC 833] Who Talks TCP?
[RFC 834] Who Talks TCP?
[RFC 835] Who Talks TCP?
[RFC 836] Who Talks TCP?
[RFC 837] Who Talks TCP?
[RFC 838] Who Talks TCP?
[RFC 839] Who Talks TCP?
[RFC 842] Who Talks TCP? - Survey of 1 February 83.
[RFC 843] Who Talks TCP? -- Survey of 8 February 1983.
[RFC 845] Who Talks TCP? -- Survey of 15 February 1983.
[RFC 846] Who Talks TCP? -- Survey of 22 February 1983.
[RFC 872] TCP-ON-A-LAN.
[RFC 879] The TCP Maximum Segment Size and Related Topics.
[RFC 889] Internet Delay Experiments.
[RFC 896] Congestion Control in IP/TCP Internetworks.
[RFC 939] Executive Summary of the NRC Report on Transport Protocols for Department of Defense Data Networks.
[RFC 942] TRANSPORT PROTOCOLS FOR DEPARTMENT OF DEFENSE DATA NETWORKS.
[RFC 962] TCP-4 Prime.
[RFC 964] SOME PROBLEMS WITH THE SPECIFICATION OF THE MILITARY STANDARD TRANSMISSION CONTROL PROTOCOL.
[RFC 1025] TCP AND IP BAKE OFF.
[RFC 1106] TCP Big Window and Nak Options.
[RFC 1110] A Problem with the TCP Big Window Option.
[RFC 1122] Requirements for Internet Hosts -- Communication Layers.
[RFC 1144] Compressing TCP/IP Headers for Low-Speed Serial Links.
[RFC 1146] TCP Alternate Checksum Options.
  • Defines TCP option 14 (Alternate checksum request), 15 (Alternate checksum data).
  • Obsoletes:
    RFC 1145.
[RFC 1156] Management Information Base for Network Management of TCP/IP-based internets.
[RFC 1180] A TCP/IP Tutorial.
[RFC 1191] Path MTU Discovery.
[RFC 1263] TCP EXTENSIONS CONSIDERED HARMFUL.
[RFC 1323] TCP Extensions for High Performance.
  • Defines TCP options 3 ( Window Scale) and 8 (Timestamp).
  • Obsoletes:
    RFC 1072, RFC 1185.
[RFC 1337] TIME-WAIT Assassination Hazards in TCP.
[RFC 1347] TCP and UDP with Bigger Addresses (TUBA), A Simple Proposal for Internet Addressing and Routing.
[RFC 1379] Extending TCP for Transactions -- Concepts.
[RFC 1475] TP/IX: The Next Internet.
  • Defines ICMP message 31 (Conversion error).
  • Defines IP version 7.
[RFC 1644] T/TCP -- TCP Extensions for Transactions Functional Specification.
  • Category: Experimental.
  • Defines TCP options 11 (CC), 12 (CC.NEW), 13 (CC.ECHO).
[RFC 1693] An Extension to TCP : Partial Order Service.
  • Category: Experimental.
  • Defines TCP options 9 (POC permitted) and 10 (POC service profile).
[RFC 1705] Six Virtual Inches to the Left: The Problem with IPng.
  • Category: Informational.
[RFC 1791] TCP And UDP Over IPX Networks With Fixed Path MTU.
  • Category: Experimental.
[RFC 1812] Requirements for IP Version 4 Routers.
[RFC 1858] Security Considerations for IP Fragment Filtering.
  • Category: Informational.
[RFC 1859] ISO Transport Class 2 Non-use of Explicit Flow Control over TCP RFC1006 extension.
  • Category: Informational.
[RFC 1948] Defending Against Sequence Number Attacks.
  • Category: Informational.
[RFC 1981] Path MTU Discovery for IP version 6.
  • Category: Standards Track.
[RFC 2018] TCP Selective Acknowledgment Options.
  • Category: Standards Track
  • Defines TCP options 4 (SACK permitted) and 5 (SACK).
[RFC 2126] ISO Transport Service on top of TCP (ITOT).
  • Category: Standards Track.
[RFC 2140] TCP Control Block Interdependence.
  • Category: Informational.
[RFC 2385] Protection of BGP Sessions via the TCP MD5 Signature Option.
  • Category: Standards Track.
  • Defines TCP option 19 (MD5 Signature).
[RFC 2415] Simulation Studies of Increased Initial TCP Window Size.
  • Catetgory: Informational.
[RFC 2416] When TCP Starts Up With Four Packets Into Only Three Buffers.
  • Category: Informational.
[RFC 2460] Internet Protocol, Version 6 (IPv6) Specification.
  • Category: Standards Track.
  • Obsoletes:
    RFC 1883.
[RFC 2488] Enhancing TCP Over Satellite Channels using Standard Mechanisms.
  • BCP: 28.
[RFC 2507] IP Header Compression.
  • Category: Standards Track.
[RFC 2525] Known TCP Implementation Problems.
  • Category: Informational.
[RFC 2581] TCP Congestion Control.
  • Category: Standards Track.
  • Obsoletes:
    RFC 2001.
[RFC 2675] IPv6 Jumbograms.
  • Category: Standards Track.
  • Obsoletes:
    RFC 2147.
[RFC 2757] Long Thin Networks.
  • Category: Informational.
[RFC 2760] Ongoing TCP Research Related to Satellites.
  • Category: Informational.
[RFC 2780] IANA Allocation Guidelines For Values In the Internet Protocol and Related Headers.
[RFC 2861] TCP Congestion Window Validation.
  • Category: Experimental.
  • Defines the TCP congestion control algorithm CWV.
[RFC 2873] TCP Processing of the IPv4 Precedence Field.
  • Category: Standards Track.
[RFC 2883] An Extension to the Selective Acknowledgement (SACK) Option for TCP.
  • Category: Standards Track.
[RFC 2884] Performance Evaluation of Explicit Congestion Notification (ECN) in IP Networks.
  • Category: Informational.
[RFC 2914] Congestion Control Principles.
  • BCP: 41.
[RFC 2923] TCP Problems with Path MTU Discovery.
  • Category: Informational.
[RFC 2988] Computing TCP's Retransmission Timer.
  • Category: Standards Track.
[RFC 2990] Next Steps for the IP QoS Architecture.
  • Category: Informational.
[RFC 3042] Enhancing TCP's Loss Recovery Using Limited Transmit.
  • Category: Standards Track.
[RFC 3081] Mapping the BEEP Core onto TCP.
  • Category: Standards Track.
[RFC 3128] Protection Against a Variant of the Tiny Fragment Attack.
  • Category: Informational.
  • Updates:
    RFC 1858.
[RFC 3135] Performance Enhancing Proxies Intended to Mitigate Link-Related Degradations.
  • BCP: 50.
[RFC 3148] A Framework for Defining Empirical Bulk Transfer Capacity Metrics.
  • Category: Informational.
[RFC 3150] End-to-end Performance Implications of Slow Links.
  • BCP: 48.
[RFC 3155] End-to-end Performance Implications of Links with Errors.
  • BCP: 50.
[RFC 3168] The Addition of Explicit Congestion Notification (ECN) to IP.
[RFC 3360] Inappropriate TCP Resets Considered Harmful.
  • BCP: 60.
[RFC 3390] Increasing TCP's Initial Window.
[RFC 3430] Simple Network Management Protocol (SNMP) over Transmission Control Protocol (TCP) Transport Mapping.
  • Category: Experimental.
  • Defines the use of TCP as a transport protocol for SNMP.
[RFC 3449] TCP Performance Implications of Network Path Asymmetry.
  • BCP: 69.
[RFC 3465] TCP Congestion Control with Appropriate Byte Counting (ABC).
  • Category: Experimental.
  • Defines the TCP congestion control algorithm ABC.
[RFC 3481] TCP over Second (2.5G) and Third (3G) Generation Wireless Networks.
  • BCP: 71.
[RFC 3517] A Conservative Selective Acknowledgment (SACK)-based Loss Recovery Algorithm for TCP.
  • Category: Standards Track.
[RFC 3522] The Eifel Detection Algorithm for TCP.
  • Category: Experimental.
[RFC 3540] Robust Explicit Congestion Notification (ECN) Signaling with Nonces.
  • Category: Experimental.
  • Defines ECN bit NS (Nonce Sum).
[RFC 3562] Key Management Considerations for the TCP MD5 Signature Option.
  • Category: Informational.
[RFC 3649] HighSpeed TCP for Large Congestion Windows.
  • Category: Experimental.
[RFC 3708] Using TCP Duplicate Selective Acknowledgement (DSACKs) and Stream Control Transmission Protocol (SCTP) Duplicate Transmission Sequence Numbers (TSNs) to Detect Spurious Retransmissions.
  • Category: Experimental.
[RFC 3742] Limited Slow-Start for TCP with Large Congestion Windows.
  • Category: Experimental.
[RFC 3782] The NewReno Modification to TCP's Fast Recovery Algorithm.
  • Category: Standards Track.
  • Obsoletes:
    RFC 2582.
[RFC 4015] The Eifel Response Algorithm for TCP.
  • Category: Standards Track.
[RFC 4022] Management Information Base for the Transmission Control Protocol (TCP).
  • Category: Standards Track.
  • Defines SNMP MIB iso.org.dod.internet.mgmt.mib-2.tcp (1.3.6.1.2.1.6).
  • Defines SNMP MIB iso.org.dod.internet.mgmt.mib-2.tcpMIB (1.3.6.1.2.1.49).
  • Obsoletes:
    RFC 2012, RFC 2452.
[RFC 4138] Forward RTO-Recovery (F-RTO): An Algorithm for Detecting Spurious Retransmission Timeouts with TCP and the Stream Control Transmission Protocol (SCTP).
  • Category: Experimental.
[RFC 4278] Standards Maturity Variance Regarding the TCP MD5 Signature Option (RFC 2385) and the BGP-4 Specification.
  • Category: Informational.
[RFC 4413] TCP/IP Field Behavior.
  • Category: Informational.
[RFC 5348] TCP Friendly Rate Control (TFRC): Protocol Specification.
[RFC 5382] NAT Behavioral Requirements for TCP.
  • BCP: 142.
[RFC 5461] TCP's Reaction to Soft Errors.
  • Category: Informational.

Publications:




Obsolete RFCs:
[IEN 112] TRANSMISSION CONTROL PROTOCOL.
  • Obsoleted by:
    IEN 124, RFC 761.
  • Obsoletes:
    IEN 5, IEN 21, IEN 27, IEN 40, IEN 44, IEN 55, IEN 81.
[IEN 124] DOD STANDARD TRANSMISSION CONTROL PROTOCOL.
  • Obsoleted by:
    RFC 761.
  • Obsoletes:
    IEN 5, IEN 21, IEN 27, IEN 40, IEN 44, IEN 55, IEN 81, IEN 112.
[RFC 1063] IP MTU Discovery Options.
[RFC 1066] Management Information Base for Network Management of TCP/IP-based internets.
[RFC 1072] TCP Extensions for Long-Delay Paths.
  • Defines TCP options 3 (Window Scale), 4 (SACK permitted), 5 (SACK), 6 (Echo request), 7 (Echo reply).
  • Obsoleted by:
    RFC 1323, RFC 2018.
[RFC 1145] TCP Alternate Checksum Options.
  • Defines TCP option 14 (Alternate checksum request), 15 (Alternate checksum data).
  • Obsoleted by:
    RFC 1146.
[RFC 1158] Management Information Base for Network Management of TCP/IP-based internets: MIB-II.
[RFC 1185] TCP Extension for High-Speed Paths.
[RFC 2001] TCP Slow Start, Congestion Avoidance, Fast Retransmit, and Fast Recovery Algorithms.
  • Category: Standards Track.
  • Obsoleted by:
    RFC 2581.
[RFC 2012] SNMPv2 Management Information Base for the Transmission Control Protocol using SMIv2.
  • Category: Standards Track.
  • Defines SNMP MIB iso.org.dod.internet.mgmt.mib-2.tcpMIB (1.3.6.1.2.1.49).
  • Obsoleted by:
    RFC 4022.
  • Updates:
    RFC 1213.
[RFC 2147] TCP and UDP over IPv6 Jumbograms.
[RFC 2414] Increasing TCP's Initial Window.
  • Category: Experimental.
  • Obsoleted by:
    RFC 3390.
[RFC 2452] IP Version 6 Management Information Base for the Transmission Control Protocol.
  • Category: Standards Track.
  • Defines SNMP MIB iso.org.dod.internet.experimental.ipv6TcpMIB (1.3.6.1.3.86).
  • Obsoleted by:
    RFC 4022.
[RFC 2481] A Proposal to add Explicit Congestion Notification (ECN) to IP.
  • Category: Experimental.
  • Obsoleted by:
    RFC 3168.
[RFC 2582] The NewReno Modification to TCP's Fast Recovery Algorithm.
  • Category: Experimental.
  • Obsoleted by:
    RFC 3782.
[RFC 3448] TCP Friendly Rate Control (TFRC): Protocol Specification.
  • Category: Standards Track.
  • Defines the TFRC algorithm.
  • Obsoleted by:
    RFC 5348.
http://www.networksorcery.com/enp/protocol/tcp.htm

---------------------------------------------------------

Transmission Control Protocol

From Wikipedia, the free encyclopedia
The Transmission Control Protocol (TCP) is one of the core protocols of the Internet Protocol Suite. TCP is one of the two original components of the suite, complementing the Internet Protocol (IP), and therefore the entire suite is commonly referred to as TCP/IP. TCP provides reliable, ordered delivery of a stream of bytes from a program on one computer to another program on another computer. TCP is the protocol that major Internet applications rely on, applications such as the World Wide Web, e-mail, and file transfer. Other applications, which do not require reliable data stream service, may use the User Datagram Protocol (UDP) which provides a datagram service that emphasizes reduced latency over reliability. TCP provides a point-to-point channel for applications that require reliable communications. The Hypertext Transfer Protocol (HTTP), File Transfer Protocol (FTP) and Telnet are all examples of applications that require a reliable communication channel.
Internet Protocol Suite
Application Layer
BGP · DHCP · DNS · FTP · HTTP · IMAP · IRC · LDAP · MGCP · NNTP · NTP · POP · RIP · RPC · RTP · SIP · SMTP · SNMP · SSH · Telnet · TLS/SSL · XMPP ·
(more)
Transport Layer
TCP · UDP · DCCP · SCTP · RSVP · ECN ·
(more)
Internet Layer
IP (IPv4, IPv6· ICMP · ICMPv6 · IGMP · IPsec ·
(more)
Link Layer
ARP/InARP · NDP · OSPF · Tunnels (L2TP· PPP · Media Access Control (Ethernet, DSL, ISDN, FDDI· (more)
v · d · e

Contents

[hide]

[edit] Historical origin

In May, 1974, the Institute of Electrical and Electronic Engineers (IEEE) published a paper entitled "A Protocol for Packet Network Interconnection."[1] The paper's authors, Vinton G. Cerf and Bob Kahn, described an internetworking protocol for sharing resources using packet-switching among the nodes. A central control component of this model was the Transmission Control Program that incorporated both connection-oriented links and datagram services between hosts. The monolithic Transmission Control Program was later divided into a modular architecture consisting of the Transmission Control Protocol at the connection-oriented layer and the Internet Protocol at the internetworking (datagram) layer. The model became known informally as TCP/IP, although formally it was henceforth called the Internet Protocol Suite.

[edit] Network function

TCP provides a communication service at an intermediate level between an application program and the Internet Protocol (IP). That is, when an application program desires to send a large chunk of data across the Internet using IP, instead of breaking the data into IP-sized pieces and issuing a series of IP requests, the software can issue a single request to TCP and let TCP handle the IP details.
IP works by exchanging pieces of information called packets. A packet is a sequence of octets and consists of a header followed by a body. The header describes the packet's destination and, optionally, the routers to use for forwarding until it arrives at its destination. The body contains the data IP is transmitting.
Due to network congestion, traffic load balancing, or other unpredictable network behavior, IP packets can be lost, duplicated, or delivered out of order. TCP detects these problems, requests retransmission of lost data, rearranges out-of-order data, and even helps minimize network congestion to reduce the occurrence of the other problems. Once the TCP receiver has reassembled the sequence of octets originally transmitted, it passes them to the application program. Thus, TCP abstracts the application's communication from the underlying networking details.
TCP is utilized extensively by many of the Internet's most popular applications, including the World Wide Web (WWW), E-mail, File Transfer Protocol, Secure Shell, peer-to-peer file sharing, and some streaming media applications.
TCP is optimized for accurate delivery rather than timely delivery, and therefore, TCP sometimes incurs relatively long delays (in the order of seconds) while waiting for out-of-order messages or retransmissions of lost messages. It is not particularly suitable for real-time applications such as Voice over IP. For such applications, protocols like the Real-time Transport Protocol (RTP) running over the User Datagram Protocol (UDP) are usually recommended instead.[2]
TCP is a reliable stream delivery service that guarantees delivery of a data stream sent from one host to another without duplication or losing data. Since packet transfer is not reliable, a technique known as positive acknowledgment with retransmission is used to guarantee reliability of packet transfers. This fundamental technique requires the receiver to respond with an acknowledgment message as it receives the data. The sender keeps a record of each packet it sends, and waits for acknowledgment before sending the next packet. The sender also keeps a timer from when the packet was sent, and retransmits a packet if the timer expires. The timer is needed in case a packet gets lost or corrupted.[2]
TCP consists of a set of rules: for the protocol, that are used with the Internet Protocol, and for the IP, to send data "in a form of message units" between computers over the Internet. At the same time that IP takes care of handling the actual delivery of the data, TCP takes care of keeping track of the individual units of data transmission, called segments, that a message is divided into for efficient routing through the network. For example, when an HTML file is sent from a Web server, the TCP software layer of that server divides the sequence of octets of the file into segments and forwards them individually to the IP software layer (Internet Layer). The Internet Layer encapsulates each TCP segment into an IP packet by adding a header that includes (among other data) the destination IP address. Even though every packet has the same destination address, they can be routed on different paths through the network. When the client program on the destination computer receives them, the TCP layer (Transport Layer) reassembles the individual segments and ensures they are correctly ordered and error free as it streams them to an application.

[edit] TCP segment structure

Transmission Control Protocol accepts data from a data stream, 'segments' it into chunks, and adds a TCP header creating a TCP segment. The TCP segment is then encapsulated into an IP packet. A TCP segment is "the packet of information that TCP uses to exchange data with its peers." [3]
Note that the term TCP packet, though sometimes informally used, is not in line with current terminology, where segment refers to the TCP unit of data, datagram[4] to the IP unit and packet to the data communications network unit:
Processes transmit data by calling on the TCP and passing buffers of data as arguments. The TCP packages the data from these buffers into segments and calls on the internet module [e.g. IP] to transmit each segment to the destination TCP.[5]
A TCP segment consists of a segment header and a data section. The TCP header contains 10 mandatory fields, and an optional extension field (Options, pink background in table).
The data section follows the header. Its contents are the payload data carried for the application. The length of the data section is not specified in the TCP segment header. It can be calculated by subtracting the combined length of the TCP header and the encapsulating IP segment header from the total IP segment length (specified in the IP segment header).
TCP Header
Bit offset  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
0 Source port Destination port
32 Sequence number
64 Acknowledgment number
96 Data offset Reserved C
W
R
E
C
E
U
R
G
A
C
K
P
S
H
R
S
T
S
Y
N
F
I
N
Window Size
128 Checksum Urgent pointer
160
...
Options (if Data Offset > 5)
...
  • Source port (16 bits) – identifies the sending port
  • Destination port (16 bits) – identifies the receiving port
  • Sequence number (32 bits) – has a dual role:
  • If the SYN flag is set, then this is the initial sequence number. The sequence number of the actual first data byte (and the acknowledged number in the corresponding ACK) are then this sequence number plus 1.
  • If the SYN flag is clear, then this is the accumulated sequence number of the first data byte of this packet for the current session.
  • Acknowledgment number (32 bits) – if the ACK flag is set then the value of this field is the next sequence number that the receiver is expecting. This acknowledges receipt of all prior bytes (if any). The first ACK sent by each end acknowledges the other end's initial sequence number itself, but no data.
  • Data offset (4 bits) – specifies the size of the TCP header in 32-bit words. The minimum size header is 5 words and the maximum is 15 words thus giving the minimum size of 20 bytes and maximum of 60 bytes, allowing for up to 40 bytes of options in the header. This field gets its name from the fact that it is also the offset from the start of the TCP segment to the actual data.
  • Reserved (4 bits) – for future use and should be set to zero
  • Flags (8 bits) (aka Control bits) – contains 8 1-bit flags
  • CWR (1 bit) – Congestion Window Reduced (CWR) flag is set by the sending host to indicate that it received a TCP segment with the ECE flag set and had responded in congestion control mechanism (added to header by RFC 3168).
  • ECE (1 bit) – ECN-Echo indicates
  • If the SYN flag is set, that the TCP peer is ECN capable.
  • If the SYN flag is clear, that a packet with Congestion Experienced flag in IP header set is received during normal transmission (added to header by RFC 3168).
  • URG (1 bit) – indicates that the Urgent pointer field is significant
  • ACK (1 bit) – indicates that the Acknowledgment field is significant. All packets after the initial SYN packet sent by the client should have this flag set.
  • PSH (1 bit) – Push function. Asks to push the buffered data to the receiving application.
  • RST (1 bit) – Reset the connection
  • SYN (1 bit) – Synchronize sequence numbers. Only the first packet sent from each end should have this flag set. Some other flags change meaning based on this flag, and some are only valid for when it is set, and others when it is clear.
  • FIN (1 bit) – No more data from sender
  • Window size (16 bits) – the size of the receive window, which specifies the number of bytes (beyond the sequence number in the acknowledgment field) that the receiver is currently willing to receive (see Flow control and Window Scaling)
  • Checksum (16 bits) – The 16-bit checksum field is used for error-checking of the header and data
  • Urgent pointer (16 bits) – if the URG flag is set, then this 16-bit field is an offset from the sequence number indicating the last urgent data byte
  • Options (Variable 0-320 bits, divisible by 32) – The length of this field is determined by the data offset field. Options 0 and 1 are a single byte (8 bits) in length. The remaining options indicate the total length of the option (expressed in bytes) in the second byte.[6] Some options may only be sent when SYN is set; they are indicated below as [SYN].
  • 0 (8 bits) - End of options list
  • 1 (8 bits) - No operation (NOP, Padding) This may be used to align option fields on 32-bit boundaries for better performance.
  • 2,4,SS (32 bits) - Maximum segment size (see maximum segment size) [SYN]
  • 3,3,S (24 bits) - Window scale (see window scaling for details) [SYN][7]
  • 4,2 (16 bits) - Selective Acknowledgement permitted. [SYN] (See selective acknowledgments for details)[8]
  • 5,N,BBBB,EEEE,... (variable bits, N is either 10, 18, 26, or 34)- Selective ACKnowlegement (SACK)[9] These first two bytes are followed by a list of 1-4 blocks being selectively acknowledged, specified as 32-bit begin/end pointers.
  • 8,10,TTTT,EEEE (80 bits)- Timestamp and echo of previous timestamp (see TCP timestamps for details)[10]
  • 14,3,S (24 bits) - TCP Alternate Checksum Request. [SYN][11]
  • 15,N,... (variable bits) - TCP Alternate Checksum Data.
(The remaining options are obsolete, experimental, not yet standardized, or unassigned)

[edit] Protocol operation

A Simplified TCP State Diagram. See TCP EFSM diagram for a more detailed state diagram including the states inside the ESTABLISHED state.
TCP protocol operations may be divided into three phases. Connections must be properly established in a multi-step handshake process (connection establishment) before entering the data transfer phase. After data transmission is completed, the connection termination closes established virtual circuits and releases all allocated resources.
A TCP connection is managed by an operating system through a programming interface that represents the local end-point for communications, the Internet socket. During the lifetime of a TCP connection it undergoes a series of state changes:
  1. LISTEN : In case of a server, waiting for a connection request from any remote client.
  2. SYN-SENT : waiting for the remote peer to send back a TCP segment with the SYN and ACK flags set. (usually set by TCP clients)
  3. SYN-RECEIVED : waiting for the remote peer to send back an acknowledgment after having sent back a connection acknowledgment to the remote peer. (usually set by TCP servers)
  4. ESTABLISHED : the port is ready to receive/send data from/to the remote peer.
  5. FIN-WAIT-1
  6. FIN-WAIT-2
  7. CLOSE-WAIT
  8. CLOSING
  9. LAST-ACK
  10. TIME-WAIT : represents waiting for enough time to pass to be sure the remote peer received the acknowledgment of its connection termination request. According to RFC 793 a connection can stay in TIME-WAIT for a maximum of four minutes.
  11. CLOSED

[edit] Connection establishment

To establish a connection, TCP uses a three-way handshake. Before a client attempts to connect with a server, the server must first bind to a port to open it up for connections: this is called a passive open. Once the passive open is established, a client may initiate an active open. To establish a connection, the three-way (or 3-step) handshake occurs:
  1. SYN: The active open is performed by the client sending a SYN to the server. It sets the segment's sequence number to a random value A.
  2. SYN-ACK: In response, the server replies with a SYN-ACK. The acknowledgment number is set to one more than the received sequence number (A + 1), and the sequence number that the server chooses for the packet is another random number, B.
  3. ACK: Finally, the client sends an ACK back to the server. The sequence number is set to the received acknowledgement value i.e. A + 1, and the acknowledgement number is set to one more than the received sequence number i.e. B + 1.
At this point, both the client and server have received an acknowledgment of the connection.

[edit] Resource usage

Most implementations allocate an entry in a table that maps a session to a running operating system process. Because TCP packets do not include a session identifier, both endpoints identifies the session using the client's address and port. Whenever a packet is received, the TCP implementation must perform a lookup on this table to find the destination process.
The number of sessions in the server side is limited only by memory and can grow as new connections arrive, but the client must allocate a random port before sending the first SYN to the server. This port remains allocated during the whole conversation, and effectively limits the number of outgoing connections from each of the client's IP addresses. If an application fails to properly close unrequired connections, a client can run out of resources and become unable to establish new TCP connections, even from other applications.
Both endpoints must also allocate space for unacknowledged packets and received (but unread) data.

[edit] Data transfer

There are a few key features that set TCP apart from User Datagram Protocol:
  • Ordered data transfer - the destination host rearranges according to sequence number[2]
  • Retransmission of lost packets - any cumulative stream not acknowledged is retransmitted[2]
  • Error-free data transfer (The checksum in UDP is optional)
  • Flow control - limits the rate a sender transfers data to guarantee reliable delivery. The receiver continually hints the sender on how much data can be received (controlled by the sliding window). When the receiving host's buffer fills, the next acknowledgment contains a 0 in the window size, to stop transfer and allow the data in the buffer to be processed.[2]
  • Congestion control [2]

[edit] Reliable transmission

TCP uses a sequence number to identify each byte of data. The sequence number identifies the order of the bytes sent from each computer so that the data can be reconstructed in order, regardless of any fragmentation, disordering, or packet loss that may occur during transmission. For every payload byte transmitted the sequence number must be incremented. In the first two steps of the 3-way handshake, both computers exchange an initial sequence number (ISN). This number can be arbitrary, and should in fact be unpredictable to defend against TCP Sequence Prediction Attacks.
TCP primarily uses a cumulative acknowledgment scheme, where the receiver sends an acknowledgment signifying that the receiver has received all data preceding the acknowledged sequence number. Essentially, the first byte in a segment's data field is assigned a sequence number, which is inserted in the sequence number field, and the receiver sends an acknowledgment specifying the sequence number of the next byte they expect to receive. For example, if computer A sends 4 bytes with a sequence number of 100 (conceptually, the four bytes would have a sequence number of 100, 101, 102 and 103 assigned) then the receiver would send back an acknowledgment of 104 since that is the next byte it expects to receive in the next packet.
In addition to cumulative acknowledgments, TCP receivers can also send selective acknowledgments to provide further information (see selective acknowledgments).
If the sender infers that data has been lost in the network, it retransmits the data.

[edit] Error detection

Sequence numbers and acknowledgments cover discarding duplicate packets, retransmission of lost packets, and ordered-data transfer. To assure correctness a checksum field is included (see TCP segment structure for details on checksumming).
The TCP checksum is a weak check by modern standards. Data Link Layers with high bit error rates may require additional link error correction/detection capabilities. The weak checksum is partially compensated for by the common use of a CRC or better integrity check at layer 2, below both TCP and IP, such as is used in PPP or the Ethernet frame. However, this does not mean that the 16-bit TCP checksum is redundant: remarkably, introduction of errors in packets between CRC-protected hops is common, but the end-to-end 16-bit TCP checksum catches most of these simple errors.[12] This is the end-to-end principle at work.

[edit] Flow control

TCP uses an end-to-end flow control protocol to avoid having the sender send data too fast for the TCP receiver to receive and process it reliably. Having a mechanism for flow control is essential in an environment where machines of diverse network speeds communicate. For example, if a PC sends data to a hand-held PDA that is slowly processing received data, the PDA must regulate data flow so as not to be overwhelmed.[2]
TCP uses a sliding window flow control protocol. In each TCP segment, the receiver specifies in the receive window field the amount of additional received data (in bytes) that it is willing to buffer for the connection. The sending host can send only up to that amount of data before it must wait for an acknowledgment and window update from the receiving host.
TCP sequence numbers and receive windows behave very much like a clock. The receive window shifts each time the receiver receives and acknowledges a new segment of data. Once it runs out of sequence numbers, the sequence number loops back to 0.
When a receiver advertises a window size of 0, the sender stops sending data and starts the persist timer. The persist timer is used to protect TCP from a deadlock situation that could arise if a subsequent window size update from the receiver is lost, and the sender cannot send more data until receiving a new window size update from the receiver. When the persist timer expires, the TCP sender attempts recovery by sending a small packet so that the receiver responds by sending another acknowledgement containing the new window size.
If a receiver is processing incoming data in small increments, it may repeatedly advertise a small receive window. This is referred to as the silly window syndrome, since it is inefficient to send only a few bytes of data in a TCP segment, given the relatively large overhead of the TCP header. TCP senders and receivers typically employ flow control logic to specifically avoid repeatedly sending small segments. The sender-side silly window syndrome avoidance logic is referred to as Nagle's algorithm.

[edit] Congestion control

The final main aspect of TCP is congestion control. TCP uses a number of mechanisms to achieve high performance and avoid 'congestion collapse', where network performance can fall by several orders of magnitude. These mechanisms control the rate of data entering the network, keeping the data flow below a rate that would trigger collapse.
Acknowledgments for data sent, or lack of acknowledgments, are used by senders to infer network conditions between the TCP sender and receiver. Coupled with timers, TCP senders and receivers can alter the behavior of the flow of data. This is more generally referred to as congestion control and/or network congestion avoidance.
Modern implementations of TCP contain four intertwined algorithms: Slow-start, congestion avoidance, fast retransmit, and fast recovery (RFC 5681).
In addition, senders employ a retransmission timeout (RTO) that is based on the estimated round-trip time (or RTT) between the sender and receiver, as well as the variance in this round trip time. The behavior of this timer is specified in RFC 2988. There are subtleties in the estimation of RTT. For example, senders must be careful when calculating RTT samples for retransmitted packets; typically they use Karn's Algorithm or TCP timestamps (see RFC 1323). These individual RTT samples are then averaged over time to create a Smoothed Round Trip Time (SRTT) using Jacobson's algorithm. This SRTT value is what is finally used as the round-trip time estimate.
Enhancing TCP to reliably handle loss, minimize errors, manage congestion and go fast in very high-speed environments are ongoing areas of research and standards development. As a result, there are a number of TCP congestion avoidance algorithm variations.

[edit] Maximum segment size

The Maximum segment size (MSS) is the largest amount of data, specified in bytes, that TCP is willing to send in a single segment. For best performance, the MSS should be set small enough to avoid IP fragmentation, which can lead to excessive retransmissions if there is packet loss. To try to accomplish this, typically the MSS is negotiated using the MSS option when the TCP connection is established, in which case it is determined by the maximum transmission unit (MTU) size of the data link layer of the networks to which the sender and receiver are directly attached. Furthermore, TCP senders can use Path MTU discovery to infer the minimum MTU along the network path between the sender and receiver, and use this to dynamically adjust the MSS to avoid IP fragmentation within the network.

[edit] Selective acknowledgments

Relying purely on the cumulative acknowledgment scheme employed by the original TCP protocol can lead to inefficiencies when packets are lost. For example, suppose 10,000 bytes are sent in 10 different TCP packets, and the first packet is lost during transmission. In a pure cumulative acknowledgment protocol, the receiver cannot say that it received bytes 1,000 to 9,999 successfully, but failed to receive the first packet, containing bytes 0 to 999. Thus the sender may then have to resend all 10,000 bytes.
To solve this problem TCP employs the selective acknowledgment (SACK) option, defined in RFC 2018, which allows the receiver to acknowledge discontinuous blocks of packets that were received correctly, in addition to the sequence number of the last contiguous byte received successively, as in the basic TCP acknowledgment. The acknowledgement can specify a number of SACK blocks, where each SACK block is conveyed by the starting and ending sequence numbers of a contiguous range that the receiver correctly received. In the example above, the receiver would send SACK with sequence numbers 1,000 and 9,999. The sender thus retransmits only the first packet, bytes 0 to 999.
An extension to the SACK option is the "duplicate-SACK" option, defined in RFC 2883. An out-of-order packet delivery can often falsely indicate the TCP sender of lost packet and, in turn, the TCP sender retransmits the suspected-to-be-lost packet and slow down the data delivery to prevent network congestion. The TCP sender undoes the action of slow-down, that is a recovery of the original pace of data transmission, upon receiving a D-SACK that indicates the retransmitted packet is duplicate.
The SACK option is not mandatory and it is used only if both parties support it. This is negotiated when connection is established. SACK uses the optional part of the TCP header (see TCP segment structure for details). The use of SACK is widespread - all popular TCP stacks support it. Selective acknowledgment is also used in Stream Control Transmission Protocol (SCTP).

[edit] Window scaling

For more efficient use of high bandwidth networks, a larger TCP window size may be used. The TCP window size field controls the flow of data and its value is limited to between 2 and 65,535 bytes.
Since the size field cannot be expanded, a scaling factor is used. The TCP window scale option, as defined in RFC 1323, is an option used to increase the maximum window size from 65,535 bytes to 1 Gigabyte. Scaling up to larger window sizes is a part of what is necessary for TCP Tuning.
The window scale option is used only during the TCP 3-way handshake. The window scale value represents the number of bits to left-shift the 16-bit window size field. The window scale value can be set from 0 (no shift) to 14 for each direction independently. Both sides must send the option in their SYN segments to enable window scaling in either direction.
Some routers and packet firewalls rewrite the window scaling factor during a transmission. This causes sending and receiving sides to assume different TCP window sizes. The result is non-stable traffic that may be very slow. The problem is visible on some sending and receiving sites behind the path of defective routers.[13]

[edit] TCP timestamps

TCP timestamps, defined in RFC 1323, help TCP compute the round-trip time between the sender and receiver. Timestamp options include a 4-byte timestamp value, where the sender inserts its current value of its timestamp clock, and a 4-byte echo reply timestamp value, where the receiver generally inserts the most recent timestamp value that it has received. The sender uses the echo reply timestamp in an acknowledgement to compute the total elapsed time since the acknowledged segment was sent.[2]
TCP timestamps are also used to help in the case where TCP sequence numbers encounter their 232 bound and "wrap around" the sequence number space. This scheme is known as Protect Against Wrapped Sequence numbers, or PAWS (see RFC 1323 for details). Furthermore, the Eifel detection algorithm, defined in RFC 3522, which detects unnecessary loss recovery requires TCP timestamps.

[edit] Out of band data

One is able to interrupt or abort the queued stream instead of waiting for the stream to finish. This is done by specifying the data as urgent. This tells the receiving program to process it immediately, along with the rest of the urgent data. When finished, TCP informs the application and resumes back to the stream queue. An example is when TCP is used for a remote login session, the user can send a keyboard sequence that interrupts or aborts the program at the other end. These signals are most often needed when a program on the remote machine fails to operate correctly. The signals must be sent without waiting for the program to finish its current transfer.[2]
TCP OOB data was not designed for the modern Internet. The urgent pointer only alters the processing on the remote host and doesn't expedite any processing on the network itself. When it gets to the remote host there are two slightly different interpretations of the protocol, which means only single bytes of OOB data are reliable. This is assuming it's reliable at all as it's one of the least commonly used protocol elements and tends to be poorly implemented. [14][15]

[edit] Forcing data delivery

Normally, TCP waits for the buffer to exceed the maximum segment size before sending any data. This creates serious delays when the two sides of the connection are exchanging short messages and need to receive the response before continuing. For example, the login sequence at the beginning of a telnet session begins with the short message "Login", and the session cannot make any progress until these five characters have been transmitted and the response has been received. This process can be seriously delayed by TCP's normal behavior when the message is provided to TCP in several send calls.
However, an application can force delivery of segments to the output stream using a push operation provided by TCP to the application layer.[2] This operation also causes TCP to set the PSH flag or control bit to ensure that data is delivered immediately to the application layer by the receiving transport layer.
In the most extreme cases, for example when a user expects each keystroke to be echoed by the receiving application, the push operation can be used each time a keystroke occurs. More generally, application programs use this function to force output to be sent after writing a character or line of characters. By forcing the data to be sent immediately, delays and wait time are reduced.

[edit] Connection termination

The connection termination phase uses, at most, a four-way handshake, with each side of the connection terminating independently. When an endpoint wishes to stop its half of the connection, it transmits a FIN packet, which the other end acknowledges with an ACK. Therefore, a typical tear-down requires a pair of FIN and ACK segments from each TCP endpoint. After both FIN/ACK exchanges are concluded, the terminating side waits for a timeout before finally closing the connection, during which time the local port is unavailable for new connections; this prevents confusion due to delayed packets being delivered during subsequent connections.
A connection can be "half-open", in which case one side has terminated its end, but the other has not. The side that has terminated can no longer send any data into the connection, but the other side can. The terminating side should continue reading the data until the other side terminates as well.
It is also possible to terminate the connection by a 3-way handshake, when host A sends a FIN and host B replies with a FIN & ACK (merely combines 2 steps into one) and host A replies with an ACK.[16] This is perhaps the most common method.
It is possible for both hosts to send FINs simultaneously then both just have to ACK. This could possibly be considered a 2-way handshake since the FIN/ACK sequence is done in parallel for both directions.
Some host TCP stacks may implement a "half-duplex" close sequence, as Linux or HP-UX do. If such a host actively closes a connection but still has not read all the incoming data the stack already received from the link, this host sends a RST instead of a FIN (Section 4.2.2.13 in RFC 1122). This allows a TCP application to be sure the remote application has read all the data the former sent—waiting the FIN from the remote side, when it actively closes the connection. However, the remote TCP stack cannot distinguish between a Connection Aborting RST and this Data Loss RST. Both cause the remote stack to throw away all the data it received, but that the application still didn't read.[clarification needed]
Some application protocols may violate the OSI model layers, using the TCP open/close handshaking for the application protocol open/close handshaking - these may find the RST problem on active close. As an example:
s = connect(remote);
send(s, data);
close(s);
For a usual program flow like above, a TCP/IP stack like that described above does not guarantee that all the data arrives to the other application.

[edit] Vulnerabilities

TCP may be attacked in a variety of ways. The results of a thorough security assessment of the TCP, along with possible mitigations for the identified issues, was published in 2009,[17] and is currently being pursued within the IETF.[18]

[edit] Denial of service

By using a spoofed IP address and repeatedly sending purposely assembled SYN packets, attackers can cause the server to consume large amounts of resources keeping track of the bogus connections. This is known as a SYN flood attack. Proposed solutions to this problem include SYN cookies and Cryptographic puzzles. Sockstress is a similar attack, that might be mitigated with system resource management.[19] An advanced DoS attack involving the exploitation of the TCP Persist Timer was analyzed at Phrack #66.[20]

[edit] Connection hijacking

An attacker who is able to eavesdrop a TCP session and redirect packets can hijack a TCP connection. To do so, the attacker learns the sequence number from the ongoing communication and forges a false segment that looks like the next segment in the stream. Such a simple hijack can result in one packet being erroneously accepted at one end. When the receiving host acknowledges the extra segment to the other side of the connection, synchronization is lost. Hijacking might be combined with ARP or routing attacks that allow taking control of the packet flow, so as to get permanent control of the hijacked TCP connection.[21]
Impersonating a different IP address was possible prior to RFC 1948, when the initial sequence number was easily guessable. That allowed an attacker to blindly send a sequence of packets that the receiver would believe to come from a different IP address, without the need to deploy ARP or routing attacks: it is enough to ensure that the legitimate host of the impersonated IP address is down, or bring it to that condition using denial of service attacks. This is why the initial sequence number is chosen at random.

[edit] TCP ports

TCP uses the notion of port numbers to identify sending and receiving application end-points on a host, or Internet sockets. Each side of a TCP connection has an associated 16-bit unsigned port number (0-65535) reserved by the sending or receiving application. Arriving TCP data packets are identified as belonging to a specific TCP connection by its sockets, that is, the combination of source host address, source port, destination host address, and destination port. This means that a server computer can provide several clients with several services simultaneously, as long as a client takes care of initiating any simultaneous connections to one destination port from different source ports.
Port numbers are categorized into three basic categories: well-known, registered, and dynamic/private. The well-known ports are assigned by the Internet Assigned Numbers Authority (IANA) and are typically used by system-level or root processes. Well-known applications running as servers and passively listening for connections typically use these ports. Some examples include: FTP (21), SSH (22), TELNET (23), SMTP (25) and HTTP (80). Registered ports are typically used by end user applications as ephemeral source ports when contacting servers, but they can also identify named services that have been registered by a third party. Dynamic/private ports can also be used by end user applications, but are less commonly so. Dynamic/private ports do not contain any meaning outside of any particular TCP connection.

[edit] Development

TCP is a complex protocol. However, while significant enhancements have been made and proposed over the years, its most basic operation has not changed significantly since its first specification RFC 675 in 1974, and the v4 specification RFC 793, published in September 1981. RFC 1122, Host Requirements for Internet Hosts, clarified a number of TCP protocol implementation requirements. RFC 2581, TCP Congestion Control, one of the most important TCP-related RFCs in recent years, describes updated algorithms that avoid undue congestion. In 2001, RFC 3168 was written to describe explicit congestion notification (ECN), a congestion avoidance signaling mechanism.
The original TCP congestion avoidance algorithm was known as "TCP Tahoe", but many alternative algorithms have since been proposed (including TCP Reno, TCP Vegas, FAST TCP, TCP New Reno, and TCP Hybla).
TCP Interactive (iTCP) [22] is a research effort into TCP extensions that allows applications to subscribe to TCP events and register handler components that can launch applications for various purposes, including application-assisted congestion control.
Multipath TCP (MPTCP) [23] is another research effort attempting to utilize multiple path for one TCP connection, thus maximizing resource usage and increasing redundancy. The redundancy offered by Multipath TCP in the context of wireless networks [24] enables statistical multiplexing of resources, and thus increases TCP throughput dramatically.
TCP Cookie Transactions (TCPCT) is an extension proposed in December 2009 to secure servers against denial-of-service attacks. Unlike SYN cookies, TCPCT does not conflict with other TCP extensions such as window scaling. TCPCT was designed due to necessities of DNSSEC, where servers have to handle large numbers of short-lived TCP connections.
tcpcrypt is an extension proposed in July 2010 to provide transport-level encryption directly in TCP itself. It's designed to work transparently and not require any configuration. Unlike TLS (SSL), tcpcrypt itself does not provide authentication, but provides simple primitives down to the application to do that. As of 2010, the first tcpcrypt IETF draft has been published and implementations exist for several major platforms.

[edit] TCP over wireless networks

TCP has been optimized for wired networks. Any packet loss is considered to be the result of network congestion and the congestion window size is reduced dramatically as a precaution. However, wireless links are known to experience sporadic and usually temporary losses due to fading, shadowing, hand off, and other radio effects, that cannot be considered congestion. After the (erroneous) back-off of the congestion window size, due to wireless packet loss, there can be a congestion avoidance phase with a conservative decrease in window size. This causes the radio link to be underutilized. Extensive research has been done on the subject of how to combat these harmful effects. Suggested solutions can be categorized as end-to-end solutions (which require modifications at the client or server),[25] link layer solutions (such as RLP in CDMA2000), or proxy based solutions (which require some changes in the network without modifying end nodes.[25][26]

[edit] Hardware implementations

One way to overcome the processing power requirements of TCP is to build hardware implementations of it, widely known as TCP Offload Engines (TOE). The main problem of TOEs is that they are hard to integrate into computing systems, requiring extensive changes in the operating system of the computer or device. One company to develop such a device was Alacritech.

[edit] Debugging

A packet sniffer, which intercepts TCP traffic on a network link, can be useful in debugging networks, network stacks and applications that use TCP by showing the user what packets are passing through a link. Some networking stacks support the SO_DEBUG socket option, which can be enabled on the socket using setsockopt. That option dumps all the packets, TCP states, and events on that socket, which is helpful in debugging. Netstat is another utility that can be used for debugging.

[edit] Alternatives

For many applications TCP is not appropriate. One big problem (at least with normal implementations) is that the application cannot get at the packets coming after a lost packet until the retransmitted copy of the lost packet is received. This causes problems for real-time applications such as streaming multimedia (such as Internet radio), real-time multiplayer games and voice over IP (VoIP) where it is sometimes more useful to get most of the data in a timely fashion than it is to get all of the data in order.
For both historical and performance reasons, most storage area networks (SANs) prefer to use Fibre Channel protocol (FCP) instead of TCP/IP.
Also for embedded systems, network booting and servers that serve simple requests from huge numbers of clients (e.g. DNS servers) the complexity of TCP can be a problem. Finally, some tricks such as transmitting data between two hosts that are both behind NAT (using STUN or similar systems) are far simpler without a relatively complex protocol like TCP in the way.
Generally, where TCP is unsuitable, the User Datagram Protocol (UDP) is used. This provides the application multiplexing and checksums that TCP does, but does not handle building streams or retransmission, giving the application developer the ability to code them in a way suitable for the situation, or to replace them with other methods like forward error correction or interpolation.
SCTP is another IP protocol that provides reliable stream oriented services similar to TCP. It is newer and considerably more complex than TCP, and has not yet seen widespread deployment. However, it is especially designed to be used in situations where reliability and near-real-time considerations are important.
Venturi Transport Protocol (VTP) is a patented proprietary protocol that is designed to replace TCP transparently to overcome perceived inefficiencies related to wireless data transport.
TCP also has issues in high bandwidth environments. The TCP congestion avoidance algorithm works very well for ad-hoc environments where the data sender is not known in advance, but if the environment is predictable, a timing based protocol such as Asynchronous Transfer Mode (ATM) can avoid TCP's retransmits overhead.
Multipurpose Transaction Protocol (MTP/IP) is patented proprietary software that is designed to adaptively achieve high throughput and transaction performance in a wide variety of network conditions, particularly those where TCP is perceived to be inefficient.

[edit] Checksum computation

[edit] TCP checksum for IPv4

When TCP runs over IPv4, the method used to compute the checksum is defined in RFC 793:
The checksum field is the 16 bit one's complement of the one's complement sum of all 16-bit words in the header and text. If a segment contains an odd number of header and text octets to be checksummed, the last octet is padded on the right with zeros to form a 16-bit word for checksum purposes. The pad is not transmitted as part of the segment. While computing the checksum, the checksum field itself is replaced with zeros.
In other words, after appropriate padding, all 16-bit words are added using one's complement arithmetic. The sum is then bitwise complemented and inserted as the checksum field. A pseudo-header that mimics the IPv4 packet header used in the checksum computation is shown in the table below.
TCP pseudo-header (IPv4)
Bit offset 0–3 4–7 8–15 16–31
0 Source address
32 Destination address
64 Zeros Protocol TCP length
96 Source port Destination port
128 Sequence number
160 Acknowledgement number
192 Data offset Reserved Flags Window
224 Checksum Urgent pointer
256 Options (optional)
256/288+
Data
 
The source and destination addresses are those of the IPv4 header. The protocol value is 6 for TCP (cf. List of IP protocol numbers). The TCP length field is the length of the TCP header and data.

[edit] TCP checksum for IPv6

When TCP runs over IPv6, the method used to compute the checksum is changed, as per RFC 2460:
Any transport or other upper-layer protocol that includes the addresses from the IP header in its checksum computation must be modified for use over IPv6, to include the 128-bit IPv6 addresses instead of 32-bit IPv4 addresses.
A pseudo-header that mimics the IPv6 header for computation of the checksum is shown below.
TCP pseudo-header (IPv6)
Bit offset 0 - 7 8–15 16–23 24–31
0 Source address
32
64
96
128 Destination address
160
192
224
256 TCP length
288 Zeros Next header
320 Source port Destination port
352 Sequence number
384 Acknowledgement number
416 Data offset Reserved Flags Window
448 Checksum Urgent pointer
480 Options (optional)
480/512+
Data
 
  • Source address – the one in the IPv6 header
  • Destination address – the final destination; if the IPv6 packet doesn't contain a Routing header, TCP uses the destination address in the IPv6 header, otherwise, at the originating node, it uses the address in the last element of the Routing header, and, at the receiving node, it uses the destination address in the IPv6 header.
  • TCP length – the length of the TCP header and data
  • Next Header – the protocol value for TCP

[edit] Checksum offload

Many TCP/IP software stack implementations provide options to use hardware assistance to automatically compute the checksum in the network adapter prior to transmission onto the network or upon reception from the network for validation.

[edit] See also

[edit] References

  1. ^ Vinton G. Cerf, Robert E. Kahn, A Protocol for Packet Network Intercommunication, IEEE Transactions on Communications, Vol. 22, No. 5, May 1974 pp. 637-648
  2. ^ a b c d e f g h i j Comer, Douglas E. (2006). Internetworking with TCP/IP:Principles, Protocols, and Architecture. 1 (5th ed.). Prentice Hall. ISBN 0131876716.
  3. ^ TCP (Linktionary term)
  4. ^ RFC 791 - section 2.1
  5. ^ RFC 793
  6. ^ http://www.iana.org/assignments/tcp-parameters/
  7. ^ RFC 1323, TCP Extensions for High Performance, Section 2.2
  8. ^ RFC 2018, TCP Selective Acknowledgement Options, Section 2
  9. ^ RFC 2018, TCP Selective Acknowledgement Options, Section 3
  10. ^ RFC 1323, TCP Extensions for High Performance, Section 3.2
  11. ^ RFC 1146, TCP Alternate Checksum Options
  12. ^ Stone; Partridge (2000). "When The CRC and TCP Checksum Disagree". Sigcomm
  13. ^ http://lwn.net/Articles/92727/
  14. ^ Gont, Fernando (2008-11). "On the implementation of TCP urgent data". 73rd IETF meeting. Retrieved 2009-01-04.
  15. ^ Peterson, Larry (2003). Computer Networks. Morgan Kaufmann. pp. 401. ISBN 155860832X.
  16. ^ Tanenbaum, Andrew S. (2003-03-17). Computer Networks (Fourth ed.). Prentice Hall. ISBN 0-13-066102-3.
  17. ^ Security Assessment of the Transmission Control Protocol (TCP)
  18. ^ Security Assessment of the Transmission Control Protocol (TCP)
  19. ^ http://www.gont.com.ar/talks/hacklu2009/fgont-hacklu2009-tcp-security.pdf Some insights about the recent TCP DoS (Denial of Service) vulnerabilities
  20. ^ Exploiting TCP and the Persist Timer Infiniteness
  21. ^ Laurent Joncheray, Simple Active Attack Against TCP, 1995
  22. ^ TCP Interactive (iTCP)
  23. ^ draft-ietf-mptcp-architecture
  24. ^ http://portal.acm.org/citation.cfm?id=1794199
  25. ^ a b "TCP performance over CDMA2000 RLP". Retrieved 2010-08-30
  26. ^ Muhammad Adeel & Ahmad Ali Iqbal (2004). "TCP Congestion Window Optimization for CDMA2000 Packet Data Networks". International Conference on Information Technology (ITNG'07): 31–35. doi:10.1109/ITNG.2007.190. ISBN 978-0-7695-2776-5.

[edit] Further reading

[edit] External links

No comments:

Post a Comment