Building an Email Archiving System: The Challenges and of Course the Solution - Part 1

Feb 4, 2019

Published by Jeff Goldstein

Category: Email


About a year ago I wrote a blog post on how to retrieve copies of emails for archival and viewing, but I did not broach the actual storing of the email or related data. More recently I wrote a post on storing all of the event data for an email (when it was sent, opens, clicks, bounces, unsubscribes, etc.) for the purpose of auditing, but chose not to create any supporting code.

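With the increase of email usage in regulatory environments, I decided it was time to start a new project that pulls all of this together with code samples on how to store the email body and all of its associated data. Over the next year, I will continue to build on this project with the aim of creating a working storage and viewing application for archived emails and all of the log information produced by SparkPost. SparkPost does not have a system that archives the email body, but it does make building an archival platform fairly easy.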

In this blog series, I will describe the process I went through in order to store the email body on S3 (Amazon's Simple Storage Service) and all relevant log data in MySQL for easy cross-referencing. Ultimately, this is the starting point for building an application that will allow for easy searching of archived emails, then displaying those emails along with the event (log) data. The code for this project can be found in the following GitHub repository: https://github.com/jeff-goldstein/PHPArchivePlatform

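This first post of the series describes the challenge and lays out an architecture for the solution. The rest of the posts will detail portions of the solution along with code samples.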

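The first step in my process was to figure out how I was going to obtain a copy of the email sent to the original recipient. In order to obtain a copy of the email body, you need to do one of the following:

1. Capture the email body before sending the email
2. Get the email server to store a copy
3. Have the email server create a copy for you to store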

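If the email server is adding items such as link tracking or open tracking, you can't use #1, because the captured body won't reflect the open/click tracking changes.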

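That means that either the server has to store the email or somehow offer a copy of that email to you for storage. Since SparkPost does not have a storage mechanism for email bodies but does have a way to create a copy of the email, we will have SparkPost send us a duplicate of the email for us to store in S3.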

This is done by using SparkPost's Archive feature, which gives the sender the ability to tell SparkPost to send a duplicate of the email to one or more email addresses, using the same tracking and open links as the original. The SparkPost documentation defines the Archive feature in the following manner:

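"Recipients in the archive list will receive an exact replica of the message that was sent to the RCPT TO address. In particular, any encoded links intended for the RCPT TO recipient will be identical in the archive messages."

The only difference from the RCPT TO email is that some of the headers will be different, since the target address for the archive email is different, but the body of the email will be an exact replica!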

If you want a deeper explanation, here is a link to the SparkPost documentation on creating duplicate (or archive) copies of an email.

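As a side note, SparkPost actually allows you to send emails to cc, bcc, and archive email addresses. For this solution, we are focused on the archive addresses.

* Note * Archived emails can ONLY be created when injecting emails into SparkPost via SMTP!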
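As a rough illustration (not the project's actual code, which is PHP), here is a minimal Python sketch of preparing a message for SMTP injection with an archive recipient. It assumes the `X-MSYS-API` header accepts an `archive` array of `{"email": ...}` objects, as in SparkPost's cc/bcc/archive documentation. The function names and addresses are hypothetical, and the SMTP host, port, and credentials are left as parameters to fill in from your own account settings.

```python
import json
import smtplib
from email.message import EmailMessage


def archive_header(archive_addresses):
    """Return the JSON value for the X-MSYS-API header.

    Assumed layout: {"archive": [{"email": "<address>"}, ...]}.
    """
    return json.dumps({"archive": [{"email": a} for a in archive_addresses]})


def build_archived_message(sender, recipient, archive_addresses, subject, html):
    """Build an email that asks the server to send archive copies."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg["X-MSYS-API"] = archive_header(archive_addresses)
    msg.set_content("This message is best viewed in an HTML-capable client.")
    msg.add_alternative(html, subtype="html")
    return msg


def send_via_smtp(msg, host, port, username, password):
    """Inject the message over SMTP with STARTTLS (connection details are yours)."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.starttls()
        smtp.login(username, password)
        smtp.send_message(msg)
```

Only `build_archived_message` is needed to see the header that requests the archive copy; `send_via_smtp` is included to show where the SMTP-only restriction comes into play.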

Now that we know how to obtain a copy of the original email, we need to look at the log data that is produced and some of the subtle nuances within that data. SparkPost tracks everything that happens on its servers and offers that information up to you in the form of message-events. Those events are stored on SparkPost for 10 days and can be pulled from the server via a RESTful API called message-events, or you can have SparkPost push those events to any number of collecting applications that you wish. The push mechanism uses webhooks and works in real time.

Currently, there are 14 different events that may happen to an email. Here is a list of the current events:

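Bounce
Click
Delay
Delivery
Generation Failure
Generation Rejection
Initial Open
Injection
Link Unsubscribe
List Unsubscribe
Open
Out of Band
Policy Rejection
Spam Complaint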

* Follow this link for an up-to-date reference guide that describes each event along with the data that is shared for each event.

Each event has numerous fields that match the event type. Some fields like the transmission_id are found in every event, but other fields may be more event-specific; for example, only open and click events have geotag information.

One very important message event entry to this project is the transmission_id. All of the message event entries for the original email, archived email, and any cc and bcc addresses will share the same transmission_id.

There is also a common entry called the message_id that will have the same id for each entry of the original email and the archived email. Any cc or bcc addresses will have their own id for the message_id entry.
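To make the relationship between the two ids concrete, here is a small Python sketch that groups a batch of events first by transmission_id and then by message_id. It assumes the webhook batch arrives as a list of `{"msys": {"<event_class>": {...}}}` envelopes; the sample ids and addresses are made up for illustration.

```python
from collections import defaultdict


def flatten_batch(batch):
    """Yield the inner event dict from each webhook batch entry."""
    for entry in batch:
        # Assumed envelope: {"msys": {"message_event": {...}}} (or similar key).
        for event in entry.get("msys", {}).values():
            yield event


def group_by_transmission(events):
    """Map transmission_id -> message_id -> list of event types, in arrival order."""
    grouped = defaultdict(lambda: defaultdict(list))
    for event in events:
        grouped[event["transmission_id"]][event["message_id"]].append(event["type"])
    return grouped


# Hypothetical batch: the original and archive copies share message_id "M1",
# while the cc recipient gets its own message_id "M2".
batch = [
    {"msys": {"message_event": {"type": "injection", "transmission_id": "T1",
                                "message_id": "M1", "rcpt_to": "customer@example.com"}}},
    {"msys": {"message_event": {"type": "delivery", "transmission_id": "T1",
                                "message_id": "M1", "rcpt_to": "archive@archive.example.com"}}},
    {"msys": {"message_event": {"type": "delivery", "transmission_id": "T1",
                                "message_id": "M2", "rcpt_to": "cc@example.com"}}},
    {"msys": {"track_event": {"type": "open", "transmission_id": "T1",
                              "message_id": "M1", "rcpt_to": "customer@example.com"}}},
]

grouped = group_by_transmission(flatten_batch(batch))
```

Everything sits under the single transmission_id, and the message_id separates the original/archive pair from the cc copy.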

So far this sounds great and frankly fairly easy, but now comes the challenging part. Remember, in order to get the archive email, we have SparkPost send a duplicate of the original email to another email address which corresponds to some inbox that you have access to. But in order to automate this solution and store the email body, I'm going to use another SparkPost feature called Inbound Email Relaying. It takes all emails sent to a specific domain and processes them: it rips each email apart and creates a JSON structure which is then delivered to an application via a webhook. See Appendix A for a sample JSON.

If you look really carefully, you will notice that the JSON structure from the inbound relay is missing a very important field: the transmission_id. All of the outbound emails carry the same transmission_id, which binds together the data from the original email, archive, cc, and bcc addresses, but SparkPost has no way to know that the email captured by the inbound process is connected to any of the outbound emails. The inbound process simply knows that an email was sent to a specific domain and that it should parse the email. That's it. It will treat any email sent to that domain the same way, be it a reply from a customer or the archive email sent from SparkPost.

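So the trick is: how do you tie the outbound data to the inbound process that just grabbed the archived version of the email? What I decided to do was hide a unique ID in the body of the email. How this is done is up to you, but I simply created an input field with the hidden attribute turned on.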

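I also added that field to the metadata block of the X-MSYS-API header that is passed to SparkPost during injection. This hidden UID will end up being the glue for the whole process; it is a main component of the project and will be discussed in depth in the following blog posts.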
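A minimal Python sketch of the round trip might look like the following: embed the UID in a hidden input field, place the same UID in the header metadata, and later pull it back out of the inbound relay payload. The field name `archive_uid` is hypothetical, and the payload path `msys.relay_message.content.html` is an assumption about the inbound relay JSON, so check it against the sample in Appendix A.

```python
import json
from html.parser import HTMLParser

UID_FIELD = "archive_uid"  # hypothetical field name, used in both body and metadata


def add_hidden_uid(html_body, uid):
    """Insert a hidden input carrying the UID just before </body> (or append it)."""
    hidden = f'<input type="hidden" name="{UID_FIELD}" value="{uid}">'
    if "</body>" in html_body:
        return html_body.replace("</body>", hidden + "</body>", 1)
    return html_body + hidden


def metadata_header(uid):
    """Return an X-MSYS-API header value carrying the UID in its metadata block."""
    return json.dumps({"metadata": {UID_FIELD: uid}})


class UidFinder(HTMLParser):
    """Remember the value of the first hidden UID input seen in the HTML."""

    def __init__(self):
        super().__init__()
        self.uid = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "input" and attributes.get("name") == UID_FIELD and self.uid is None:
            self.uid = attributes.get("value")


def extract_uid(relay_entry):
    """Pull the UID out of one inbound relay entry (assumed payload layout)."""
    html = relay_entry["msys"]["relay_message"]["content"]["html"]
    finder = UidFinder()
    finder.feed(html)
    return finder.uid


# Outbound side: tag the body and the header with the same UID.
uid = "3f2c9a7e5b1d4c68"
body = add_hidden_uid("<html><body><p>Hello</p></body></html>", uid)
header_value = metadata_header(uid)

# Inbound side: a trimmed-down, made-up relay entry containing the archived body.
inbound = {"msys": {"relay_message": {"content": {"html": body}}}}
recovered = extract_uid(inbound)
```

Because the metadata travels with the outbound events and the hidden field travels with the archived body, matching on the UID links the two data sets even though the inbound JSON has no transmission_id.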

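Now that we have the UID that will glue this project together and understand why it is necessary, I can start to lay out the vision for the overall project and the corresponding blog posts.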

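1. Capturing and storing the archive email, along with a database entry for searching/indexing
2. Capturing all of the message event data
3. Creating an application to view the email and all corresponding data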

Here is a simple diagram of the project:

[Diagram: build an email archiving system]

The first drop of code will cover the archive process and storing the email onto S3, while the second code drop will cover storing all of the log data from message-events into MySQL. You can expect the first two code drops and blog entries sometime in early 2019. If you have any questions or suggestions, please feel free to pass them along.

Happy Sending.

- Jeff