Skip to content

Lab 080: MarkItDown + MCP β€” Document Ingestion for AgentsΒΆ

Level: L200 Path: All paths Time: ~60 min πŸ’° Cost: Free

What You'll LearnΒΆ

  • What Microsoft MarkItDown is β€” a library that converts PDF, Word, Excel, PowerPoint, HTML, and images into clean Markdown for LLM consumption
  • How MarkItDown's MCP server exposes document conversion as a tool that any MCP-compatible agent can call
  • Analyze conversion quality across different file types to understand strengths and limitations
  • Measure conversion speed and identify which formats are fastest to process
  • Debug a broken MarkItDown analysis script by fixing 3 bugs

IntroductionΒΆ

Large Language Models work best with plain text, but enterprise documents come in dozens of formats β€” PDFs with tables, Word documents with embedded images, Excel spreadsheets, PowerPoint decks, and HTML pages. Manually converting these to text loses structure, and OCR-based approaches are slow and error-prone.

Microsoft MarkItDown solves this by converting rich documents into well-structured Markdown that preserves tables, headings, lists, and image references. It supports PDF, DOCX, XLSX, PPTX, HTML, CSV, JSON, and even images (via OCR/captioning). When combined with its MCP server, any agent can call convert_to_markdown as a tool β€” enabling seamless document ingestion workflows.

The ScenarioΒΆ

You are a Platform Engineer at OutdoorGear Inc. The company has a growing document corpus β€” quarterly reports, product catalogs, training manuals, and contracts β€” that agents need to search and reason over. You will evaluate MarkItDown's conversion quality across 12 file conversions covering 7 different file types.

No MarkItDown Installation Required

This lab analyzes a pre-recorded benchmark dataset of conversion results. You don't need to install MarkItDown β€” all analysis is done locally with pandas. If you want to run live conversions, install with pip install markitdown.

PrerequisitesΒΆ

Requirement Why
Python 3.10+ Run analysis scripts
pandas library DataFrame operations
(Optional) markitdown For live document conversions
pip install pandas

Quick Start with GitHub Codespaces

Open in GitHub Codespaces

All dependencies are pre-installed in the devcontainer.

πŸ“¦ Supporting FilesΒΆ

Download these files before starting the lab

Save all files to a lab-080/ folder in your working directory.

File Description Download
broken_markitdown.py Bug-fix exercise (3 bugs + self-tests) πŸ“₯ Download
conversion_results.csv Dataset β€” 12 file conversions across 7 formats πŸ“₯ Download

Step 1: Understanding MarkItDownΒΆ

MarkItDown follows a simple pipeline β€” detect the file type, apply the appropriate converter, and produce structured Markdown:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Input File  │────▢│  Converter   │────▢│  Markdown    β”‚
β”‚  (PDF/DOCX…) β”‚     β”‚  (per-type)  β”‚     β”‚  (structured)β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Supported converters:

Format Converter Preserves
PDF pdfminer Text, headings, tables (limited)
DOCX python-docx Headings, tables, lists, styles
XLSX openpyxl Sheet data as Markdown tables
PPTX python-pptx Slide text, speaker notes, images
HTML BeautifulSoup Structure, links, tables
CSV/JSON Built-in Tabular data
Images OCR / LLM captioning Extracted text or descriptions

MCP Server IntegrationΒΆ

MarkItDown ships with an MCP server that exposes conversion as a tool:

{
  "tools": [
    {
      "name": "convert_to_markdown",
      "description": "Convert a document file to Markdown",
      "inputSchema": {
        "type": "object",
        "properties": {
          "uri": { "type": "string", "description": "File path or URL" }
        }
      }
    }
  ]
}

Any MCP-compatible agent (GitHub Copilot, Claude Desktop, custom agents) can call this tool to ingest documents on the fly.


Step 2: Load the Conversion ResultsΒΆ

The dataset contains 12 file conversions across 7 different formats:

import pandas as pd

results = pd.read_csv("lab-080/conversion_results.csv")
print(f"Total conversions: {len(results)}")
print(f"File types: {sorted(results['file_type'].unique())}")
print(f"\nDataset preview:")
print(results[["test_id", "input_file", "file_type", "conversion_success", "quality_score"]].to_string(index=False))

Expected output:

Total conversions: 12
File types: ['csv', 'docx', 'html', 'image', 'json', 'pdf', 'pptx', 'xlsx']
test_id input_file file_type conversion_success quality_score
D01 quarterly_report.pdf pdf True 0.92
D02 product_catalog.docx docx True 0.95
... ... ... ... ...
D11 corrupted_file.pdf pdf False 0.00
D12 scanned_receipt.png image True 0.72

Step 3: Analyze Conversion SuccessΒΆ

Calculate overall success rate and identify failures:

successful = results[results["conversion_success"] == True]
failed = results[results["conversion_success"] == False]

print(f"Successful conversions: {len(successful)}/{len(results)}")
print(f"Success rate: {len(successful)/len(results)*100:.0f}%")

if len(failed) > 0:
    print(f"\nFailed conversions:")
    print(failed[["test_id", "input_file", "file_type"]].to_string(index=False))

Expected output:

Successful conversions: 11/12
Success rate: 92%

Failed conversions:
 test_id      input_file file_type
     D11 corrupted_file.pdf       pdf

Insight

The only failure is a corrupted PDF (D11, file_size_kb = 0). MarkItDown handles all 7 supported formats successfully when the input file is valid.


Step 4: Analyze Conversion QualityΒΆ

Compare quality scores across file types:

print("Quality scores by file type (successful only):")
quality = successful.groupby("file_type")["quality_score"].agg(["mean", "count"])
quality.columns = ["avg_quality", "count"]
print(quality.sort_values("avg_quality", ascending=False).to_string())

avg_quality = successful["quality_score"].mean()
print(f"\nOverall average quality: {avg_quality:.3f}")

Expected output:

Quality scores by file type (successful only):
           avg_quality  count
csv              0.990      1
json             0.980      1
xlsx             0.980      1
html             0.970      1
docx             0.955      2
pdf              0.893      3
pptx             0.850      1
image            0.720      1

Overall average quality: β‰ˆ 0.916

Insight

Structured formats (CSV, JSON, XLSX) achieve near-perfect quality (β‰₯0.98), while images have the lowest quality (0.72) β€” OCR/captioning is inherently lossy. PDFs vary based on complexity; the large training manual (D10, 12 MB) scored 0.82.


Step 5: Analyze Conversion SpeedΒΆ

Measure conversion times and identify bottlenecks:

print("Conversion time by file type (successful only):")
for _, row in successful.sort_values("conversion_time_ms", ascending=False).iterrows():
    print(f"  {row['test_id']} ({row['file_type']:>5}): {row['conversion_time_ms']:,}ms "
          f"({row['file_size_kb']:,} KB)")

Expected output:

  D10 (  pdf): 4,500ms (12,000 KB)
  D12 (image): 2,200ms (450 KB)
  D04 ( pptx): 1,800ms (5,200 KB)
  D01 (  pdf): 1,200ms (2,450 KB)
  ...
  D08 (  csv):    30ms (45 KB)
total_tables = successful["tables_found"].sum()
total_images = successful["images_found"].sum()
total_headings = successful["headings_found"].sum()

print(f"\nExtracted elements (successful conversions):")
print(f"  Tables found:   {total_tables}")
print(f"  Images found:   {total_images}")
print(f"  Headings found: {total_headings}")

Expected output:

Extracted elements (successful conversions):
  Tables found:   31
  Images found:   62
  Headings found: 103

Insight

Large PDFs and images are the slowest to convert. The training manual (D10, 12 MB) took 4.5 seconds but extracted 15 tables, 28 images, and 32 headings β€” a rich document that would be extremely tedious to process manually.


Step 6: MCP Server ArchitectureΒΆ

When MarkItDown runs as an MCP server, agents can convert documents on demand:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Agent      │────▢│  MCP Server  │────▢│  MarkItDown  β”‚
β”‚  (Copilot,   β”‚     β”‚  (stdio/SSE) β”‚     β”‚  (converter) β”‚
β”‚   Claude)    │◀────│              │◀────│              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
     request              route              convert
     markdown             return             to .md

To start the MCP server locally:

# Install MarkItDown with MCP support
pip install 'markitdown[mcp]'

# Start the MCP server (stdio transport)
markitdown --mcp

Then add it to your MCP client configuration:

{
  "mcpServers": {
    "markitdown": {
      "command": "markitdown",
      "args": ["--mcp"]
    }
  }
}

πŸ› Bug-Fix ExerciseΒΆ

The file lab-080/broken_markitdown.py has 3 bugs in the analysis functions. Can you find and fix them all?

Run the self-tests to see which ones fail:

python lab-080/broken_markitdown.py

You should see 3 failed tests. Each test corresponds to one bug:

Test What it checks Hint
Test 1 Success rate calculation Should count True, not False
Test 2 Average quality calculation Must filter to successful conversions first
Test 3 Total tables found Should sum tables_found, not images_found

Fix all 3 bugs, then re-run. When you see All passed!, you're done!


🧠 Knowledge Check¢

Q1 (Multiple Choice): What formats does MarkItDown support for conversion to Markdown?
  • A) Only PDF and Word documents
  • B) PDF, DOCX, XLSX, PPTX, HTML, CSV, JSON, and images
  • C) Only text-based formats like HTML and CSV
  • D) Any format, including video and audio files
βœ… Reveal Answer

Correct: B) PDF, DOCX, XLSX, PPTX, HTML, CSV, JSON, and images

MarkItDown supports a wide range of document formats including PDF (via pdfminer), Word documents (python-docx), Excel spreadsheets (openpyxl), PowerPoint presentations (python-pptx), HTML (BeautifulSoup), CSV, JSON, and images (via OCR or LLM captioning). It does not support audio or video files.

Q2 (Multiple Choice): How does MarkItDown's MCP server enable agent-based document ingestion?
  • A) It converts documents to embeddings directly
  • B) It exposes a convert_to_markdown tool that any MCP-compatible agent can call
  • C) It requires agents to download and parse files themselves
  • D) It stores converted documents in a vector database automatically
βœ… Reveal Answer

Correct: B) It exposes a convert_to_markdown tool that any MCP-compatible agent can call

The MarkItDown MCP server runs as a standard MCP tool server (via stdio or SSE transport). It exposes a convert_to_markdown tool that accepts a file URI and returns the converted Markdown. Any MCP-compatible client β€” GitHub Copilot, Claude Desktop, or custom agents β€” can call this tool to ingest documents on the fly without any custom integration code.

Q3 (Run the Lab): How many of the 12 file conversions were successful?

Load πŸ“₯ conversion_results.csv and count rows where conversion_success == True.

βœ… Reveal Answer

11 of 12

All conversions succeeded except D11 (corrupted_file.pdf), which was a corrupted PDF with 0 KB file size. MarkItDown reliably handles valid files across all 7 tested formats.

Q4 (Run the Lab): What is the total number of tables found across all successful conversions?

Filter to successful conversions and compute tables_found.sum().

βœ… Reveal Answer

31

Sum of tables_found across the 11 successful conversions: D01(6) + D02(2) + D03(5) + D04(1) + D05(0) + D06(0) + D07(1) + D08(1) + D09(0) + D10(15) + D12(0) = 31 tables.

Q5 (Run the Lab): What is the average quality score for successful conversions?

Filter to conversion_success == True, then compute quality_score.mean().

βœ… Reveal Answer

β‰ˆ 0.916

Quality scores for the 11 successful conversions: 0.92 + 0.95 + 0.98 + 0.85 + 0.97 + 0.94 + 0.96 + 0.99 + 0.98 + 0.82 + 0.72 = 10.08. Average = 10.08 Γ· 11 β‰ˆ 0.916.


SummaryΒΆ

Topic What You Learned
MarkItDown Converts PDF, DOCX, XLSX, PPTX, HTML, CSV, JSON, and images to structured Markdown
MCP Integration MCP server exposes convert_to_markdown tool for any compatible agent
Quality Analysis Structured formats (CSV, JSON, XLSX) achieve β‰₯0.98 quality; images lowest at 0.72
Speed Analysis Large PDFs and images are slowest; CSV/JSON convert in under 50ms
Success Rate 11/12 conversions succeeded β€” only corrupted files fail
Element Extraction 31 tables, 62 images, 103 headings extracted across successful conversions

Next StepsΒΆ