This blog post follows on from the one I wrote a week ago on TCP best practices for Office 365. It focuses on capturing and analysing a TCP connection to Office 365, and on comparing what the packet capture shows against the TCP best practices discussed in that post.
Summary of TCP Best Practices
Best Practice | Value |
---|---|
Target Bandwidth Per User | Min 10Mbps (ignoring concurrency etc which I’ll cover another time) |
Target RTT Per User | Max 50ms |
TCP Window Size | 64kB |
TCP Window Scaling Factor | 4 or 8 |
TCP MSS | 1460 unless there is a legitimate reason for it being lower (e.g. Cisco CAPWAP in use) |
Analysing a packet flow – MSS, Initial RTT, Initial Window Size & Scaling Factor
For my example I set a packet capture running in Wireshark and then copied a file from my desktop to O365 OneDrive. This caused the local OneDrive client to upload the file to OneDrive in the cloud and allowed me to see what happened.
During the TCP connection my IP address was 192.168.1.142, whilst Office 365 OneDrive was using 13.107.136.9. I filtered the capture to show only traffic between these endpoints using the Wireshark filter “ip.addr == 13.107.136.9”. This leaves me with 4557 packets in total.
As you would expect the first three packets in my connection are a SYN, a SYN-ACK, and an ACK:

Looking at the first of these packets we can identify the proposed MSS, window size, and scaling factor from 192.168.1.142:

Window Size: 64240
MSS: 1460 bytes
Window Scale: 8 (multiply by 256)
Looking at the second of these packets we can identify the proposed MSS, window size, and scaling factor from 13.107.136.9:

Window Size: 65535
MSS: 1440 bytes
Window Scale: 8 (multiply by 256)
We also should note that both ends have in the TCP options the “SACK permitted” option set – this means both ends support TCP selective acknowledgements for packet loss.
We also see here the initial Round Trip Time (iRTT) for the first SYN – SYN-ACK sequence: it’s 0.0388 seconds, or roughly 39ms.
So far we have identified that the Window Size, Window Scaling and MSS are acceptable in both directions. We have also identified that the initial Round Trip Time is acceptable.
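To put the handshake values in context, here is a minimal sketch of the usual window / RTT reasoning. It is my own illustration rather than something measured in the capture. It assumes that a single TCP connection cannot have more than one receive window of data in flight per round trip, and it ignores congestion control entirely. The function names are hypothetical. Note that the window field in the SYN and SYN-ACK is not itself scaled; the scale factor only applies to later segments, so the "scaled" figure is the most a 16-bit window field could ever advertise with a shift of 8.

```python
def scaled_window_bytes(window_field: int, shift_count: int) -> int:
    """Largest receive window a given window field can advertise once scaling applies."""
    return window_field << shift_count  # shift_count 8 means multiply by 256


def throughput_ceiling_mbps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput if at most one window is in flight per round trip."""
    return window_bytes * 8 / rtt_seconds / 1_000_000


# Values taken from the SYN-ACK sent by 13.107.136.9 (the receiver for an upload).
SERVER_WINDOW_FIELD = 65535
SHIFT_COUNT = 8
IRTT_SECONDS = 0.0388

unscaled_mbps = throughput_ceiling_mbps(SERVER_WINDOW_FIELD, IRTT_SECONDS)
scaled_mbps = throughput_ceiling_mbps(
    scaled_window_bytes(SERVER_WINDOW_FIELD, SHIFT_COUNT), IRTT_SECONDS
)

print(f"Ceiling with an unscaled 64kB window: {unscaled_mbps:.1f} Mbps")
print(f"Ceiling with the largest scaled window: {scaled_mbps:.1f} Mbps")
```

The gap between the two figures is the reason window scaling is worth checking in the handshake: without it, the RTT alone caps a single connection not far above the 10Mbps per-user target.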
What about Packet Loss?
You can easily see how many Duplicate ACK packets Wireshark believes there to be by using the “Analyze -> Expert Information” option, which provides this screen:

This shows there are 469 duplicate ACK packets. Taken at face value, that would suggest the following packet loss:
469 Duplicate ACK Packets * 100 / 4468 packets in total connection stream ≈ 10.50%
But do they really represent 469 lost packets? By selecting the “>” it’s possible to see the packet summary details:

This shows that the first 63 TCP Duplicate ACK packets are duplicates of the ACK in packet 449, which suggests that they might actually be selective ACK packets. You can see this in the packet detail in Wireshark, for example for packet 450:

The TCP SACK option can specify multiple received sections: the “left edge” and “right edge” of each show what has been received. So the next expected sequence number was 212160, but the data received started at offset 233760 and ended at offset 235200, so some packets have been lost.
Looking at the ACK packet (Packet 449) you can see it is an ACK to packet 264, which has sequence number 210720, meaning the next sequence number expected is 212160 (i.e. 210720 plus the MSS of 1440).
But there are 63 TCP Duplicate ACKs for packet 449 – so how many packets have actually been lost?
The MSS is 1440 bytes, so we can calculate how many packets are lost between the packet acknowledged and this acknowledgement, based on the sequence number and the SACK left edge.
233760 – 210720 = 23040 bytes
23040 / 1440 = 16 packets
These packets are resent as packets 461 through 477. But this doesn’t account for the packets reported lost by the selective ACKs. The same selective ACK information can appear in multiple duplicate ACK packets with different ACK sequence numbers, so counting duplicate ACKs doesn’t help: each one could refer to one or many lost packets, and several could refer to the same lost packets.
In fact counting the number of packets containing a Selective Acknowledgement doesn’t help either, for example consider the SACK information in packet 456 below:

This has the same duplicate ACK number, 212160, as our earlier packet.
Lost packets here then are:
(233760 – 212160) / 1440 + 1 = 16 packets lost
(251040 – 235200) / 1440 = 11 packets lost
(259680 – 252480) / 1440 = 5 packets lost
Now consider SACK information in packet 457 below:

Again, this has duplicate ACK number 212160 as before.
Lost packets here then are:
(233760 – 212160) / 1440 + 1 = 16 packets lost
(251040 – 235200) / 1440 = 11 packets lost
(259680 – 252480) / 1440 = 5 packets lost
All that has changed is that the right edge of the first SACK block has increased by 1440 bytes, which simply tells us an additional packet has been received.
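The gap arithmetic for the two packets can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name is hypothetical, every missing segment is assumed to be a full 1440-byte MSS, and the right edge of the last SACK block is not given in the text, so 261120 and 262560 are placeholders I have made up (they do not affect the gaps). The sketch measures each hole strictly from the ACK number 212160, so the first hole comes out as 15 segments; the figure of 16 above counts one segment more.

```python
import math
from typing import List, Tuple


def sack_gaps(ack: int, sack_blocks: List[Tuple[int, int]], mss: int) -> List[Tuple[int, int, int]]:
    """Return (gap_start, gap_end, segments) for every hole below the highest SACK block."""
    gaps = []
    cursor = ack  # next byte the receiver is still waiting for
    for left, right in sorted(sack_blocks):
        if left > cursor:
            # Bytes in [cursor, left) have not been received yet.
            gaps.append((cursor, left, math.ceil((left - cursor) / mss)))
        cursor = max(cursor, right)
    return gaps


MSS = 1440
DUPLICATE_ACK = 212160

# (left edge, right edge) pairs. The final right edges are ASSUMED placeholders.
blocks_packet_456 = [(233760, 235200), (251040, 252480), (259680, 261120)]
blocks_packet_457 = [(233760, 235200), (251040, 252480), (259680, 262560)]

for label, blocks in (("456", blocks_packet_456), ("457", blocks_packet_457)):
    for start, end, segments in sack_gaps(DUPLICATE_ACK, blocks, MSS):
        print(f"packet {label}: bytes {start}-{end} missing, about {segments} segments")
```

Both packets report exactly the same holes, which is the point being made: a second duplicate ACK carrying SACK information does not mean a second set of lost packets.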
There has to be a better way to determine packet loss then…
When a TCP connection has lost packets and duplicate ACKs / selective ACKs are in use, the response of the sender is to resend the packets. These are shown by Wireshark as “TCP Fast Retransmission” followed by “TCP Retransmission”, or “TCP Fast Retransmission” followed by a number of “TCP Out-Of-Order”, for example:

and

So, a better method would be to inspect how many retransmissions, fast retransmissions and out of order packets exist in the life of the connection – looking at the expert information there are:
Out of Order Packets: 98
Retransmissions: 37
Fast Retransmissions: 6
TOTAL: 141
Potential Packet Loss Rate: 141 packets * 100 / 4468 packets ≈ 3.16% packet loss
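The two estimates can be reproduced with a short sketch. The function names are hypothetical. The optional helper assumes `tshark` is installed and on the PATH, and it uses Wireshark’s `tcp.analysis.*` display filters; counts obtained that way may not match the Expert Information totals exactly, because some of those flags can overlap on the same packet.

```python
import subprocess


def loss_rate_percent(event_count: int, total_packets: int) -> float:
    """Express a count of loss-related packets as a percentage of the stream."""
    return event_count * 100 / total_packets


def count_matching_frames(pcap_path: str, display_filter: str) -> int:
    """Count frames matching a display filter, e.g. 'tcp.analysis.out_of_order'.

    Not called below, so the sketch runs without a capture file or tshark.
    """
    result = subprocess.run(
        ["tshark", "-r", pcap_path, "-Y", display_filter, "-T", "fields", "-e", "frame.number"],
        capture_output=True, text=True, check=True,
    )
    return sum(1 for line in result.stdout.splitlines() if line.strip())


TOTAL_PACKETS = 4468   # packets in the connection stream, from the text
DUPLICATE_ACKS = 469   # from Expert Information

# Counts read from Expert Information, as listed above.
resent = {"out_of_order": 98, "retransmission": 37, "fast_retransmission": 6}
resent_total = sum(resent.values())

print(f"Duplicate ACK estimate: {loss_rate_percent(DUPLICATE_ACKS, TOTAL_PACKETS):.2f}%")
print(f"Retransmission estimate: {loss_rate_percent(resent_total, TOTAL_PACKETS):.2f}%")
```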
TCP Round Trip Time
Thankfully Wireshark has a stream graph for TCP Round Trip Time that makes it much easier to see how this varies over time – this is found in the “Statistics -> TCP Stream Graphs -> Round Trip Time” and shows the statistics for both directions – since my capture is for an upload I have chosen from 192.168.1.142 to 13.107.136.9:

This shows that for most of the time during my connection the round trip time stayed well below 100ms but peaked at 700ms a few times during the connection.
That is over my proposed target of 50ms, but I’m working from home on ordinary broadband, so I would expect it to be somewhat slower.
This graph doesn’t allow me to see percentile results though which would most likely remove the few data points over 100ms. If I wanted to verify this I could extract the RTT values to Excel and do the calculation – I haven’t bothered to do this just now.
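As an alternative to Excel, the percentile check could be sketched in Python. The RTT samples below are made-up placeholders, not values from my capture, and the function name is hypothetical. One way to get real values, assuming `tshark` is available, is to export the `tcp.analysis.ack_rtt` field (which is in seconds) and convert it to milliseconds.

```python
import statistics
from typing import Sequence


def rtt_percentile(samples_ms: Sequence[float], percentile: int) -> float:
    """Return the given percentile (1-99) of a set of RTT samples in milliseconds."""
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be between 1 and 99")
    # quantiles() with n=100 returns the 99 cut points between percentiles.
    cut_points = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return cut_points[percentile - 1]


# Placeholder data for illustration only: mostly low RTTs with one large spike.
example_rtts_ms = [35, 38, 40, 42, 45, 48, 52, 60, 95, 700]

for p in (50, 90, 95):
    print(f"p{p}: {rtt_percentile(example_rtts_ms, p):.1f} ms")
print(f"max: {max(example_rtts_ms)} ms")
```

With a real capture there would be far more samples, so a handful of spikes would have much less influence on the higher percentiles than they do in this ten-value example.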
TCP Throughput
Again, Wireshark has a stream graph for TCP Throughput that makes it much easier to see how this varies over time – this is found in the “Statistics -> TCP Stream Graphs -> Throughput” and shows the statistics for both directions – since my capture is for an upload I have chosen from 192.168.1.142 to 13.107.136.9:

Final Thoughts
I did this once for a single file uploaded to OneDrive. To test this properly you would need to repeat the test several times for uploads, and several times for downloads, not just for OneDrive but also for SharePoint and Teams. You would also need to do this for each of a range of indicative sample sites on your WAN.
I also suspect that there may be other ways of achieving the same using tooling. For example, the Microsoft tool here is quite good. It’s only a PoC tool, but it does allow quite extensive testing of your O365 configuration and highlights things that need fixing. If you open it in Edge and perform the advanced tests (you have to run a download) then it performs a lot of relevant tests, some 470 in fact.