Building an Email Archiving System: Storing the Email Body

Mar 4, 2019

Publisher: Bird


In this blog, I will describe the process I went through to store the body of the email in S3 (Amazon’s Simple Storage Service) and ancillary data in a MySQL table for easy cross-referencing. Ultimately, this is the starting point for a code base that will include an application for easily searching archived emails and then displaying those emails along with the event (log) data. The code for this project can be found in the following GitHub repository: https://github.com/jeff-goldstein/PHPArchivePlatform.

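Although I will be leveraging S3 and MySQL in this project, they are by no means the only technologies that can be used to build an archiving platform; given their ubiquity, I thought they were a good choice for this project. In a full-scale, high-volume system I would use a higher-performance database than MySQL, but for this sample project MySQL is a perfect fit.

Below are the steps I took in this first phase of the project: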

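Creating the duplicate email for archiving
Using SparkPost’s Archive and Inbound Relay features to send a copy of the original email back to SparkPost, which processes it into a JSON structure and then sends it to a webhook collector (application)
Dismantling the JSON structure to obtain the necessary components
Sending the body of the email to S3 for storage
Logging an entry in MySQL for each email for cross-referencing

Creating a Duplicate of the Email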

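In SparkPost, the best way to archive an email is to create an identical copy of the email specifically for archival purposes. This is done with SparkPost’s Archive feature, which gives the sender the ability to send a duplicate of the email to one or more email addresses. This duplicate uses the same tracking and open links as the original. The SparkPost documentation defines the Archive feature in the following way:

Recipients in the archive list will receive an exact replica of the message that was sent to the RCPT TO address. In particular, any encoded links intended for the RCPT TO recipient will be identical in the archive messages.

The only difference between this archive copy and the original RCPT TO email is that some of the headers will be different, since the target address for the archive email is different, but the body of the email will be an exact replica.

If you want a deeper explanation, see the SparkPost documentation on creating duplicate (or archive) copies of an email. Sample X-MSYS-API headers for this project are shown later in this blog.

There is one caveat to this approach. While all of the event information for the original email is tied together by both a transmission ID and a message ID, nothing in the inbound relay event (the mechanism for obtaining and disseminating the archive email) ties back to either of those two IDs, and therefore to the information for the original email. This means we need to place data in the email body and in the header of the original email as a way to join together all of the SparkPost data for the original and archive emails.

To create the code that is placed in the email body, I used the following process in the email creation application.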

Somewhere in the email body, I placed the following input entry:
<input name="ArchiveCode" type="hidden" value="<<UID>>">
Then I created a unique code and replaced the <<UID>> field:
$uid = md5(uniqid(rand(), true));
$emailBody = str_replace("<<UID>>", $uid, $emailBody);
Here is an example of the output:
<input name="ArchiveCode" type="hidden" value="00006365263145">
Next, I made sure to add the $uid value to the metadata block of the X-MSYS-API header. This step ensures that the UID is embedded in every event output for the original email.

X-MSYS-API:{ "campaign_id":"<my_campaign>", "metadata":{ "UID":"<UID>" }, "archive":[ { "email":"archive@geekwithapersonality.com" } ], "options":{ "open_tracking":false, "click_tracking":false, "transactional":false, "ip_pool":"<my_ip_pool>" } }
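
The two steps above (tagging the body and building the header) can be sketched together as follows. This is only a minimal illustration: the function names tagBodyWithUid and buildMsysApiHeader, the sample template, and the example.com address are assumptions, not code from the repository, and the ip_pool option is left out for brevity.

```php
<?php
// Hypothetical helpers; only the md5/uniqid call and the header fields come from the post.

/** Replace the <<UID>> placeholder in the body and return [uid, taggedBody]. */
function tagBodyWithUid(string $emailBody): array
{
    $uid = md5(uniqid((string) rand(), true));
    return [$uid, str_replace('<<UID>>', $uid, $emailBody)];
}

/** Build the JSON value of the X-MSYS-API header, carrying the same UID as metadata. */
function buildMsysApiHeader(string $uid, string $campaignId, string $archiveAddress): string
{
    return json_encode([
        'campaign_id' => $campaignId,
        'metadata'    => ['UID' => $uid],
        'archive'     => [['email' => $archiveAddress]],
        'options'     => [
            'open_tracking'  => false,
            'click_tracking' => false,
            'transactional'  => false,
        ],
    ]);
}

// Usage: the same UID ends up in the hidden input and in the header metadata.
$template = '<p>Hello</p><input name="ArchiveCode" type="hidden" value="<<UID>>">';
[$uid, $body] = tagBodyWithUid($template);
$header = 'X-MSYS-API: ' . buildMsysApiHeader($uid, 'my_campaign', 'archive@example.com');
```

Building the header with json_encode rather than by hand keeps the JSON valid even if the campaign name or address contains characters that need escaping.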

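Now we have a way to tie all of the data from the original email to the email body of the archive.

Obtaining the Archive Version

In order to obtain a copy of an email for archiving, you need to take the following steps:

Create a subdomain to which you will send all archive (duplicate) emails
Set the appropriate DNS records so that all emails sent to that subdomain are delivered to SparkPost
Create an inbound domain in SparkPost
Create an inbound webhook in SparkPost
Create an application (collector) to receive the SparkPost webhook data stream

The following two links can help guide you through this process: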

SparkPost technical doc: Enabling Inbound Email Relaying & Relay Webhooks
Also, the blog I wrote last year, Archiving Emails: A How-To Guide for Tracking Sent Mail, will walk you through the creation of the inbound relay within SparkPost

* Note: as of Oct 2018, the Archive feature only works when sending emails over an SMTP connection to SparkPost; the RESTful API does not support this feature. That probably isn’t an issue, because most emails that need this level of audit control tend to be personalized emails that are fully built out by a backend application before email delivery is needed.

Obtaining the Duplicate Email in a JSON Structure

In the first phase of this project, all I’m storing is the rfc822 email format in S3 and some high-level description fields in a SQL table for searching. Since SparkPost sends the email data in a JSON structure to my archiving platform via webhook data streams, I built an application (often referred to as a collector) that accepts the Relay_Webhook data stream.

Each package from the SparkPost Relay_Webhook will contain the information of one duplicate email at a time, so breaking the JSON structure down into the targeted components for this project is rather straightforward. In my PHP code, getting the rfc822 formatted email was as easy as the following few lines of code:

// $verb holds the HTTP request method
if ($verb == "POST") {
    // Read the raw webhook payload and decode it into an associative array
    $body = file_get_contents("php://input");
    $fields = json_decode($body, true);
    $rfc822body = $fields[0]['msys']['relay_message']['content']['email_rfc822'];
    $htmlbody = $fields[0]['msys']['relay_message']['content']['html'];
    $headers = $fields[0]['msys']['relay_message']['content']['headers'];
}

Some of the information that I want to store into my SQL table resides in an array of header fields. So I wrote a small function that accepted the header array and looped through the array in order to obtain the data I was interested in storing:
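
A function of that kind might look like the sketch below. It is not the code from the repository: the name extractHeaders is hypothetical, and it assumes each element of the header array is a one-entry map of header name to value (for example ['Subject' => 'Hi']), which should be checked against an actual relay webhook payload.

```php
<?php
/**
 * Pull selected headers out of the relay webhook header array.
 * Assumption: each element is a one-entry map such as ['Subject' => 'Hi'].
 * Header names are matched case-insensitively; the first occurrence wins.
 */
function extractHeaders(array $headers, array $wanted = ['To', 'From', 'Subject', 'Date']): array
{
    $found = array_fill_keys($wanted, null);
    foreach ($headers as $header) {
        foreach ($header as $name => $value) {
            foreach ($wanted as $key) {
                if (strcasecmp((string) $name, $key) === 0 && $found[$key] === null) {
                    $found[$key] = $value;
                }
            }
        }
    }
    return $found;
}

// Usage with a hand-written sample (not real webhook output):
$sampleHeaders = [
    ['Return-Path' => '<bounce@example.com>'],
    ['Date' => 'Mon, 04 Mar 2019 10:15:00 +0000'],
    ['From' => 'sender@example.com'],
    ['To' => 'archive@example.com'],
    ['subject' => 'Welcome'],
];
$picked = extractHeaders($sampleHeaders);
```

Headers that are not present come back as null, so the caller can decide whether a missing value should block the archive entry.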

Now that I have the data, I’m ready to store the body in S3.

Storing the Duplicate Email in S3

I’m sorry to disappoint you, but I’m not going to give a step-by-step tutorial on creating an S3 bucket for storing the email, nor am I going to describe how to create the access key your application will need for uploading content to your bucket; there are better tutorials on this subject than I could ever write. Here are a couple of articles that may help:

https://docs.aws.amazon.com/quickstarts/latest/s3backup/step-1-create-bucket.html
https://aws.amazon.com/blogs/security/wheres-my-secret-access-key/

What I will do is point out some of the settings I chose that pertain to a project like this one.

Access control. You need to set not only the security for the bucket but also the permissions for the items themselves. In my project, I use a very open policy of public-read because the sample data is not personal and I wanted easy access to the data. You will probably want a much stricter set of ACL policies. Here is a nice article on ACL settings: https://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html
Archiving the archive. S3 has a feature called Lifecycle Management, which lets you move data from one S3 storage class to another. The storage classes reflect how often you need to access the stored data, with lower costs for the storage you access the least. A good write-up of the different classes and transitioning between them can be found in the AWS guide Transitioning Objects. In my case, I chose to create a lifecycle that moves each object from Standard to Glacier after one year. Glacier storage is much cheaper than the Standard class and will save me money in storage costs.
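
A lifecycle rule like the one described can be set in the S3 console, but it can also be expressed in code. The sketch below only builds the parameter array that I believe the AWS SDK for PHP expects for putBucketLifecycleConfiguration; the function name, rule ID, and bucket name are assumptions, and the SDK call itself is shown only as a comment because it requires the SDK and credentials.

```php
<?php
/**
 * Build lifecycle parameters that move every object to Glacier after $days days.
 * Hypothetical helper; the rule ID is an arbitrary label.
 */
function buildGlacierLifecycle(string $bucket, int $days = 365): array
{
    return [
        'Bucket' => $bucket,
        'LifecycleConfiguration' => [
            'Rules' => [
                [
                    'ID'          => 'archive-to-glacier',
                    'Status'      => 'Enabled',
                    'Filter'      => ['Prefix' => ''], // empty prefix: apply to all objects
                    'Transitions' => [
                        ['Days' => $days, 'StorageClass' => 'GLACIER'],
                    ],
                ],
            ],
        ],
    ];
}

$lifecycle = buildGlacierLifecycle('my-archive-bucket');

// With the AWS SDK for PHP installed and configured (an assumption), the call would be:
//   $s3 = new Aws\S3\S3Client(['version' => 'latest', 'region' => 'us-east-1']);
//   $s3->putBucketLifecycleConfiguration($lifecycle);
```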

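Once the S3 bucket is created and configured, S3 is ready for me to upload the rfc822-compliant email that I obtained from the SparkPost Relay Webhook data stream. But before uploading the rfc822 email payload to S3, I need to create a unique filename under which to store that email.

For the unique filename, I search the email body for the hidden ID that the sending application placed in the email and use that ID as the name of the file. There are more elegant ways of pulling the connectorId from the HTML body, but for simplicity and clarity I use the following code: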

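// Find the hidden input, then the start of its value (skip the 7 characters of value=")
$start = strpos($htmlbody, $inputField);
$start = strpos($htmlbody, "value=", $start) + 7;
// The closing quote sits one character before the tag's closing ">"
$end = strpos($htmlbody, ">", $start) - 1;
$length = $end - $start;
$UID = substr($htmlbody, $start, $length);

* We are assuming that $inputField holds the value "ArchiveCode" and is found in my config.php file.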
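
One of the more elegant alternatives could be a regular expression, sketched below. The function name extractArchiveCode is hypothetical, and the pattern assumes the markup used in this post: a double-quoted name attribute that appears before a double-quoted value attribute in the same input tag. A full HTML parser would be more robust for arbitrary markup.

```php
<?php
/**
 * Return the value of the hidden archive input, or null when it is not found.
 * Assumes name="..." precedes value="..." inside the same <input> tag.
 */
function extractArchiveCode(string $htmlBody, string $inputField = 'ArchiveCode'): ?string
{
    $pattern = '/<input\b[^>]*\bname="' . preg_quote($inputField, '/')
             . '"[^>]*\bvalue="([^"]*)"/i';
    if (preg_match($pattern, $htmlBody, $matches) === 1) {
        return $matches[1];
    }
    return null;
}

// Usage with the sample markup from this post:
$sampleHtml = '<p>Hi</p><input name="ArchiveCode" type="hidden" value="00006365263145">';
$archiveCode = extractArchiveCode($sampleHtml);
```

Unlike the strpos version, this returns null instead of a garbage substring when the hidden input is missing, which makes it easier to skip or flag emails that were not tagged.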

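With the UID, we can build the filename that will be used in S3:

$fileName = $ArchiveDirectory . '/' . $UID . '.eml';

Now I’m able to open the connection to S3 and upload the file. If you look at the s3.php file in the GitHub repository, you will see that it takes very little code to upload the file.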
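
For readers who do not want to open the repository, the upload can be sketched roughly as follows. This is not the contents of s3.php: the helper name buildArchiveUpload, the bucket name, and the region are assumptions, and the actual SDK call is shown only as a comment because it needs the AWS SDK for PHP and valid credentials.

```php
<?php
/**
 * Build the putObject parameters for one archived email.
 * Hypothetical helper; the ACL mirrors the open sample policy used in this post.
 */
function buildArchiveUpload(string $bucket, string $archiveDirectory, string $uid, string $rfc822Body): array
{
    return [
        'Bucket'      => $bucket,
        'Key'         => $archiveDirectory . '/' . $uid . '.eml',
        'Body'        => $rfc822Body,
        'ContentType' => 'message/rfc822',
        'ACL'         => 'public-read', // tighten this for real, personal data
    ];
}

$upload = buildArchiveUpload('my-archive-bucket', 'archive', 'abc123', "Subject: Hi\r\n\r\nHello");

// With the AWS SDK for PHP installed and configured (an assumption), the call would be:
//   $s3 = new Aws\S3\S3Client(['version' => 'latest', 'region' => 'us-east-1']);
//   $s3->putObject($upload);
```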

My last step is to log this entry in the MySQL table.

Storing the Metadata in MySQL

We grabbed all of the data necessary in a previous step, so the step of storage is easy. In this first phase I chose to build a table with the following fields:

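An automatically populated date/time field
The target email address (RCPT_TO)
The timestamp from the email DATE header
The SUBJECT header
The FROM email address header
The directory used in the S3 bucket
The S3 filename of the archived email

The function named MySQLLog in the upload.php application file goes through the necessary steps to open the link to MySQL, inject the new row, test the results, and close the link. I did add one other step, which is to log this data to a text file. Should I do more logging for errors? Yes. But I want to keep this code lean so that it runs very fast. At times this code will be called hundreds of times per minute, so it needs to be as efficient as possible. In future updates, I will add ancillary code that handles failures and emails them to an admin for monitoring.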
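
As a rough illustration of the insert step (not the MySQLLog function itself), the sketch below uses PDO with a prepared statement. The table name email_archive, the column names, and both function names are assumptions; the automatic date/time field is expected to be filled by a column default in MySQL, so it does not appear in the statement.

```php
<?php
// Hypothetical table and column names; adjust to the real schema.
const ARCHIVE_INSERT_SQL =
    'INSERT INTO email_archive (rcpt_to, email_date, subject, email_from, s3_directory, s3_file)
     VALUES (:rcpt_to, :email_date, :subject, :email_from, :s3_directory, :s3_file)';

/** Map the extracted header values and S3 location to named statement parameters. */
function buildArchiveRow(array $headers, string $rcptTo, string $directory, string $fileName): array
{
    // Convert the Date header to a UTC DATETIME string; null if missing or unparsable
    $timestamp = isset($headers['Date']) ? strtotime($headers['Date']) : false;
    return [
        ':rcpt_to'      => $rcptTo,
        ':email_date'   => $timestamp !== false ? gmdate('Y-m-d H:i:s', $timestamp) : null,
        ':subject'      => $headers['Subject'] ?? null,
        ':email_from'   => $headers['From'] ?? null,
        ':s3_directory' => $directory,
        ':s3_file'      => $fileName,
    ];
}

/** Insert one row; returns true on success. Requires an open PDO connection. */
function logArchiveEntry(PDO $pdo, array $row): bool
{
    $statement = $pdo->prepare(ARCHIVE_INSERT_SQL);
    return $statement->execute($row);
}

// Usage with hand-written sample values:
$row = buildArchiveRow(
    ['Date' => 'Mon, 04 Mar 2019 10:15:00 +0000', 'Subject' => 'Welcome', 'From' => 'sender@example.com'],
    'archive@example.com',
    'archive',
    'abc123.eml'
);
```

Using bound parameters keeps header values such as the subject line from being interpreted as SQL, which matters here because the values come from email content.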

Wrapping Up

So in a few fairly easy steps, we were able to walk through the first phase of building a robust email archiving system that holds the email duplicate in S3 and cross-referencing data in a MySQL table. This will give us a foundation for the rest of the project that will be tackled in several future posts.

In future revisions of this project, I hope to: