Scrape Books from URL with Dumpling AI, Clean HTML, Save to Sheets, Email as CSV

工作流概述

这是一个包含11个节点的复杂工作流,主要用于自动化处理各种任务。

工作流源代码

下载
{
  "id": "DswhuYzoemjA6iNN",
  "meta": {
    "instanceId": "a1ae5c8dc6c65e674f9c3947d083abcc749ef2546dff9f4ff01de4d6a36ebfe6",
    "templateCredsSetupCompleted": true
  },
  "name": "Scrape Books from URL with Dumpling AI, Clean HTML, Save to Sheets, Email as CSV",
  "tags": [
    {
      "id": "TlcNkmb96fUfZ2eA",
      "name": "Tutorials",
      "createdAt": "2025-04-15T17:02:00.249Z",
      "updatedAt": "2025-04-15T17:02:00.249Z"
    }
  ],
  "nodes": [
    {
      "id": "2e4f64a5-353c-4dd3-9822-62df795d4940",
      "name": "Convert to CSV File",
      "type": "n8n-nodes-base.convertToFile",
      "position": [
        1640,
        340
      ],
      "parameters": {
        "options": {}
      },
      "typeVersion": 1.1
    },
    {
      "id": "472442d3-a691-4310-93f8-019579d0c473",
      "name": "Extract all books from the page",
      "type": "n8n-nodes-base.html",
      "position": [
        760,
        340
      ],
      "parameters": {
        "options": {},
        "operation": "extractHtmlContent",
        "dataPropertyName": "content",
        "extractionValues": {
          "values": [
            {
              "key": "books",
              "cssSelector": ".row > li",
              "returnArray": true,
              "returnValue": "html"
            }
          ]
        }
      },
      "typeVersion": 1.2
    },
    {
      "id": "92765257-d64d-47c9-bd57-50914342138b",
      "name": "Sort by price",
      "type": "n8n-nodes-base.sort",
      "position": [
        1420,
        340
      ],
      "parameters": {
        "options": {},
        "sortFieldsUi": {
          "sortField": [
            {
              "order": "descending",
              "fieldName": "price"
            }
          ]
        }
      },
      "typeVersion": 1
    },
    {
      "id": "efc2f33f-1bef-4906-b3b7-b02868080a54",
      "name": "Extract individual book price",
      "type": "n8n-nodes-base.html",
      "position": [
        1200,
        340
      ],
      "parameters": {
        "options": {},
        "operation": "extractHtmlContent",
        "dataPropertyName": "books",
        "extractionValues": {
          "values": [
            {
              "key": "title",
              "attribute": "title",
              "cssSelector": "h3 > a",
              "returnValue": "attribute"
            },
            {
              "key": "price",
              "cssSelector": ".price_color"
            }
          ]
        }
      },
      "typeVersion": 1.2
    },
    {
      "id": "74c7c3af-d63c-4b6c-95a0-15f45b19134b",
      "name": "Send CSV via e-mail",
      "type": "n8n-nodes-base.gmail",
      "position": [
        1860,
        340
      ],
      "webhookId": "40f2d609-52ed-40bf-b190-1f1cebbe3fb7",
      "parameters": {
        "sendTo": "",
        "message": "Hey, here's the scraped data from the online bookstore!",
        "options": {
          "attachmentsUi": {
            "attachmentsBinary": [
              {}
            ]
          }
        },
        "subject": "bookstore csv",
        "emailType": "text"
      },
      "credentials": {
        "gmailOAuth2": {
          "id": "j70r3RTMED1pgN3R",
          "name": "Gmail account 2"
        }
      },
      "typeVersion": 2.1
    },
    {
      "id": "95c7998b-ece0-4dea-b99e-97ac22fb8a59",
      "name": "Sticky Note3",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        140,
        -260
      ],
      "parameters": {
        "width": 619,
        "height": 297,
        "content": "### Scrape Books from URL with Dumpling AI, Clean HTML, Save to Sheets, Email as CSV

📌 This workflow scrapes book data from a website, turns it into a CSV, saves it, and sends it by email.

🔧 It starts from a Google Sheets trigger, fetches the page using DumplingAI, extracts books, sorts by price, and emails the CSV.

✅ Make sure APIs for Gmail, Sheets & Drive are enabled in Google Cloud. Update the URL in the \"Fetch website content\" node.
"
      },
      "typeVersion": 1
    },
    {
      "id": "f599028a-49a9-4b85-b484-5abf1229e373",
      "name": "Sticky Note",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        140,
        60
      ],
      "parameters": {
        "color": 4,
        "width": 900,
        "height": 300,
        "content": "### 🔁 Trigger to Raw Book HTML

1. **Google Sheets Trigger**  
   Watches a sheet for new row entries. Once a new URL is added, the workflow starts.

2. **Fetch Website Content (Dumpling AI)**  
   Makes an HTTP POST request to Dumpling AI to scrape and return the full HTML of the target URL.

3. **Extract All Books**  
   Uses CSS selectors to isolate the list items (`li.row > li`) containing book entries.

4. **Split Out Node**  
   Breaks the array of book HTML blocks into individual items, so each book can be processed separately in the next steps.
"
      },
      "typeVersion": 1
    },
    {
      "id": "bc6ab72c-de03-4e79-9da0-ca12ddf31811",
      "name": "Sticky Note1",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        1140,
        60
      ],
      "parameters": {
        "color": 6,
        "width": 840,
        "height": 300,
        "content": "### 📦 Parse, Sort, Export & Email

5. **Extract Individual Book Data**  
   From each book, extract the title (`<h3>a` title attribute) and price (`.price_color` content).

6. **Sort by Price**  
   Organizes the extracted data in descending order using the price field.

7. **Convert to CSV File**  
   Transforms the sorted JSON data into a downloadable CSV file format.

8. **Send CSV via Gmail**  
   Automatically sends an email with the CSV file attached to the predefined address.
"
      },
      "typeVersion": 1
    },
    {
      "id": "a1246b4e-212f-4bd3-970b-b0ff8db2f834",
      "name": "Trigger- Watches For new URL in Spreadsheet",
      "type": "n8n-nodes-base.googleSheetsTrigger",
      "position": [
        320,
        340
      ],
      "parameters": {
        "event": "rowAdded",
        "options": {},
        "pollTimes": {
          "item": [
            {
              "mode": "everyMinute"
            }
          ]
        },
        "sheetName": {
          "__rl": true,
          "mode": "list",
          "value": "",
          "cachedResultUrl": "https://docs.google.com/spreadsheets/d/1pb4WLqv2EruLM1z9-utehcINolSj0vlUqZionyLoRUs/edit#gid=0",
          "cachedResultName": "Sheet1"
        },
        "documentId": {
          "__rl": true,
          "mode": "list",
          "value": "",
          "cachedResultUrl": "https://docs.google.com/spreadsheets/d/1pb4WLqv2EruLM1z9-utehcINolSj0vlUqZionyLoRUs/edit?usp=drivesdk",
          "cachedResultName": "URLs"
        }
      },
      "credentials": {
        "googleSheetsTriggerOAuth2Api": {
          "id": "qDzHSzTkclwDHpSR",
          "name": "Google Sheets Trigger account"
        }
      },
      "typeVersion": 1
    },
    {
      "id": "b19aa287-3be4-4e16-908d-b0cb484519e3",
      "name": "Scrape Website Content with Dumpling AI",
      "type": "n8n-nodes-base.httpRequest",
      "position": [
        540,
        340
      ],
      "parameters": {
        "url": "https://app.dumplingai.com/api/v1/scrape",
        "method": "POST",
        "options": {
          "allowUnauthorizedCerts": true
        },
        "jsonBody": "={
  \"url\": \"{{ $('Trigger- Watches For new URL in Spreadsheet')}}\", 
  \"format\": \"html\",
  \"cleaned\": \"True\"
  }",
        "sendBody": true,
        "sendHeaders": true,
        "specifyBody": "json",
        "authentication": "genericCredentialType",
        "genericAuthType": "httpHeaderAuth",
        "headerParameters": {
          "parameters": [
            {
              "name": "Content-Type",
              "value": "application/json"
            }
          ]
        }
      },
      "credentials": {
        "httpBasicAuth": {
          "id": "mznexGH3YDtrUTAk",
          "name": "Unnamed credential"
        },
        "httpHeaderAuth": {
          "id": "xamyMqCpAech5BeT",
          "name": "Header Auth account"
        }
      },
      "typeVersion": 4.1
    },
    {
      "id": "02cbc6f9-bdcb-45fc-9973-ded42346ffbc",
      "name": "Split HTML Array into Individual Books",
      "type": "n8n-nodes-base.splitOut",
      "position": [
        980,
        340
      ],
      "parameters": {
        "options": {},
        "fieldToSplitOut": "books"
      },
      "typeVersion": 1
    }
  ],
  "active": false,
  "pinData": {},
  "settings": {
    "executionOrder": "v1"
  },
  "versionId": "264412ff-9d74-443c-a2ff-69be1e042a82",
  "connections": {
    "Sort by price": {
      "main": [
        [
          {
            "node": "Convert to CSV File",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Convert to CSV File": {
      "main": [
        [
          {
            "node": "Send CSV via e-mail",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Extract individual book price": {
      "main": [
        [
          {
            "node": "Sort by price",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Extract all books from the page": {
      "main": [
        [
          {
            "node": "Split HTML Array into Individual Books",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Split HTML Array into Individual Books": {
      "main": [
        [
          {
            "node": "Extract individual book price",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Scrape Website Content with Dumpling AI": {
      "main": [
        [
          {
            "node": "Extract all books from the page",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Trigger- Watches For new URL in Spreadsheet": {
      "main": [
        [
          {
            "node": "Scrape Website Content with Dumpling AI",
            "type": "main",
            "index": 0
          }
        ]
      ]
    }
  }
}

功能特点

  • 自动检测新邮件
  • AI智能内容分析
  • 自定义分类规则
  • 批量处理能力
  • 详细的处理日志

技术分析

节点类型及作用

  • Converttofile
  • Html
  • Sort
  • Gmail
  • Stickynote

复杂度评估

配置难度:
★★★★☆
维护难度:
★★☆☆☆
扩展性:
★★★★☆

实施指南

前置条件

  • 有效的Gmail账户
  • n8n平台访问权限
  • Google API凭证
  • AI分类服务订阅

配置步骤

  1. 在n8n中导入工作流JSON文件
  2. 配置Gmail节点的认证信息
  3. 设置AI分类器的API密钥
  4. 自定义分类规则和标签映射
  5. 测试工作流执行
  6. 配置定时触发器(可选)

关键参数

参数名称 默认值 说明
maxEmails 50 单次处理的最大邮件数量
confidenceThreshold 0.8 分类置信度阈值
autoLabel true 是否自动添加标签

最佳实践

优化建议

  • 定期更新AI分类模型以提高准确性
  • 根据邮件量调整处理批次大小
  • 设置合理的分类置信度阈值
  • 定期清理过期的分类规则

安全注意事项

  • 妥善保管API密钥和认证信息
  • 限制工作流的访问权限
  • 定期审查处理日志
  • 启用双因素认证保护Gmail账户

性能优化

  • 使用增量处理减少重复工作
  • 缓存频繁访问的数据
  • 并行处理多个邮件分类任务
  • 监控系统资源使用情况

故障排除

常见问题

邮件未被正确分类

检查AI分类器的置信度阈值设置,适当降低阈值或更新训练数据。

Gmail认证失败

确认Google API凭证有效且具有正确的权限范围,重新进行OAuth授权。

调试技巧

  • 启用详细日志记录查看每个步骤的执行情况
  • 使用测试邮件验证分类逻辑
  • 检查网络连接和API服务状态
  • 逐步执行工作流定位问题节点

错误处理

工作流包含以下错误处理机制:

  • 网络超时自动重试(最多3次)
  • API错误记录和告警
  • 处理失败邮件的隔离机制
  • 异常情况下的回滚操作